Hi Saj,
What is the size of the input data that you are putting on the stream?
Have you tried running the same application with a different set of data?
It's weird that the streaming stops after exactly two hours. Try running the
same application with data of a different size to see whether it
Hello,
I have a Spark Standalone cluster running in HA mode. I launched an
application using spark-submit with cluster deploy mode and supervise enabled,
and it launched successfully on one of the worker nodes.
How can I stop, restart, kill, or otherwise manage such a task running in a
standalone cluster?
(Forgot to cc user mail list)
On 11/16/14 4:59 PM, Cheng Lian wrote:
Hey Sadhan,
Thanks for the additional information, this is helpful. Seems that
some Parquet internal contract was broken, but I'm not sure whether
it's caused by Spark SQL or Parquet, or even maybe the Parquet file
itself
Hi,
Can anybody help me on this please, haven't been able to find the problem
:(
Thanks.
On Nov 15, 2014 4:48 PM, Bahubali Jain bahub...@gmail.com wrote:
Hi,
Trying to use spark streaming, but I am struggling with word count :(
I want to consolidate the output of the word count (not on a per-window
I guess maybe you don't need to invoke reduceByKey() after mapToPair(), because
updateStateByKey() already covers it. For your reference, here is a sample written
in Scala, using a text file stream instead of a socket, as below:
object LocalStatefulWordCount extends App {
val sparkConf = new
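The sample above is cut off in this preview; a minimal sketch of such a stateful
word count (assuming Spark Streaming 1.x, local mode, and placeholder paths)
could look like this:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LocalStatefulWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("LocalStatefulWordCount")
      .setMaster("local[2]")                      // placeholder master
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    ssc.checkpoint("/tmp/wordcount-checkpoint")   // updateStateByKey requires checkpointing

    // Watch a directory for new text files instead of reading from a socket.
    val lines = ssc.textFileStream("/tmp/wordcount-input")
    val pairs = lines.flatMap(_.split("\\s+")).map(word => (word, 1))

    // updateStateByKey keeps a running total per word across batches,
    // so no extra reduceByKey per batch is needed for the consolidated count.
    val updateFunc = (newCounts: Seq[Int], state: Option[Int]) => {
      Some(newCounts.sum + state.getOrElse(0))
    }
    val totals = pairs.updateStateByKey[Int](updateFunc)

    totals.print()
    ssc.start()
    ssc.awaitTermination()
  }
}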
Can you check in the worker logs what exactly is happening!??
Thanks
Best Regards
On Sun, Nov 16, 2014 at 2:54 AM, jschindler john.schind...@utexas.edu
wrote:
UPDATE
I have removed and added things systematically to the job and have figured
out that the inclusion of the construction of the
This usually happens when one of the workers is stuck on a GC pause and it
times out. Enable the following configurations while creating the SparkContext:
sc.set("spark.rdd.compress", "true")
sc.set("spark.storage.memoryFraction", "1")
sc.set("spark.core.connection.ack.wait.timeout", "6000")
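A minimal sketch of how these properties are typically applied, assuming "sc"
above refers to a SparkConf and with a placeholder application name:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyStreamingApp")                           // placeholder name
  .set("spark.rdd.compress", "true")
  .set("spark.storage.memoryFraction", "1")
  .set("spark.core.connection.ack.wait.timeout", "6000")

val sc = new SparkContext(conf)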
Have a look at this doc http://spark.apache.org/docs/latest/security.html
You can configure your network to accept connections only from a trusted
CIDR. If you are using cloud services like EC2/Azure/GCE, then it is
straightforward from their web portal. If you have a bunch of custom
Manish and others,
A follow-up question on my mind is whether there are protobuf (or other
binary format) frameworks in the vein of PMML. Perhaps scientific data
storage frameworks like NetCDF or ROOT are possible as well.
I like the comprehensiveness of PMML, but as you mention, the complexity of
Indeed it did. Thanks!
Ron
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, November 14, 2014 9:53 PM
To: Daniel, Ronald (ELS-SDG)
Cc: user@spark.apache.org
Subject: Re: filtering a SchemaRDD
If I use row[6] instead of row[text] I get what I am looking for. However,
Hi,
I have a method that returns DenseMatrix:
def func(str: String): DenseMatrix = {
...
...
}
But I keep getting this error:
*class DenseMatrix takes type parameters*
I tried this too:
def func(str: String): DenseMatrix(Int, Int, Array[Double]) = {
...
...
}
But this gives me
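Judging from the first error, this looks like breeze.linalg.DenseMatrix, which
is generic in its element type. A minimal sketch of the signature fix under
that assumption, with a toy body:

import breeze.linalg.DenseMatrix

def func(str: String): DenseMatrix[Double] = {
  // Toy body only: build a 2x2 matrix from column-major data;
  // the real implementation would derive the values from `str`.
  new DenseMatrix[Double](2, 2, Array(1.0, 2.0, 3.0, 4.0))
}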
Hi,
On Fri, Nov 14, 2014 at 3:20 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
I wonder if SparkConf is dynamically updated on all worker nodes or only
during initialization. It can be used to piggyback information.
Otherwise I guess you are stuck with Broadcast.
Primarily I have had
Hello,
Is it possible to reuse RDD implementations written in Scala/Java with PySpark?
Thanks,
Nam
Thank you Marcelo. I tried your suggestion (# mvn -pl :spark-examples_2.10
compile), but it required downloading many Spark components (as listed
below), which I have already compiled on my server.
Downloading:
Hi Cheng,
I tried reading the Parquet file (on which we were getting the exception)
through parquet-tools, and it is able to dump the file and I can read the
metadata, etc. I also loaded the file through a Hive table and can run a
table scan query on it as well. Let me know if I can do more to help
Hi all,
I'm iterating over an RDD (representing a distributed matrix...have to
roll my own in Python) and making changes to different submatrices at
each iteration. The loop structure looks something like:
for i in range(x):
    VAR = sc.broadcast(i)
    rdd.map(func1).reduceByKey(func2)
    M =
Hi again,
On Mon, Nov 17, 2014 at 8:16 AM, Tobias Pfeiffer t...@preferred.jp wrote:
I have been trying to mis-use broadcast as in
- create a class with a boolean var, set to true
- query this boolean on the executors as a prerequisite to process the
next item
- when I want to shutdown, I
Hi All.
I am trying to get my head around why using accumulators and accumulables seems
to be the most recommended method for accumulating running sums, averages,
variances and the like, whereas the aggregate method seems to me to be the
right one. I have no performance measurements as of yet,
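For comparison, a minimal sketch of a running sum and count (and hence a mean)
computed with aggregate rather than accumulators, assuming an RDD of doubles:

val data = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))   // toy input

// Zero value, then per-partition merge of an element, then merge of partials.
val (sum, count) = data.aggregate((0.0, 0L))(
  (acc, x)     => (acc._1 + x, acc._2 + 1),
  (accA, accB) => (accA._1 + accB._1, accA._2 + accB._2)
)
val mean = sum / count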
Hi,
I am new to Spark. I ran into a problem when trying to load a dataset.
I have a dataset where the data is in JSON format, and I'd like to load it
as an RDD.
Since one record may span multiple lines, SparkContext.textFile() is not
suitable. I also tried to use json4s to parse the JSON manually
Hi,
Is there any way to know which of my functions performs better in Spark? In
other words, say I have achieved the same thing using two different
implementations. How do I judge which implementation is better than the
other? Is processing time the only metric that we can use to claim the
Check this video out:
https://www.youtube.com/watch?v=dmL0N3qfSc8&list=UURzsq7k4-kT-h3TDUBQ82-w
On Mon, Nov 17, 2014 at 9:43 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
Is there any way to know which of my functions perform better in Spark? In
other words, say I have achieved same
SQLContext.jsonFile assumes one JSON record per line. Although I haven't
tried it yet, it seems that this JsonInputFormat [1] can be helpful. You may
read your original data set with SparkContext.hadoopFile and JsonInputFormat,
then transform the resulting RDD[String] into a JsonRDD
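A rough sketch of that pipeline, assuming the JsonInputFormat from [1] emits one
complete JSON document per record; the import path below is a placeholder for
its actual package, and sqlContext.jsonRDD is the Spark SQL 1.x call that
accepts an RDD[String]:

import org.apache.hadoop.io.{LongWritable, Text}
import com.example.hadoop.JsonInputFormat   // placeholder package for the class in [1]

// Use newAPIHadoopFile instead if the input format implements the new Hadoop API.
val records = sc.hadoopFile[LongWritable, Text, JsonInputFormat]("hdfs:///data/multiline-json")
val jsonStrings = records.map { case (_, text) => text.toString }
val jsonRdd = sqlContext.jsonRDD(jsonStrings)    // infers the schema and gives a SchemaRDD
jsonRdd.registerTempTable("records")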
Guys,
Recently we have been migrating our backend pipeline to Spark.
In our pipeline we have an MPI-based HAC implementation; to ensure result
consistency across the migration, we also want to port this MPI code to
Spark. However, during the migration process, I found that there are
Thanks, I did go through the video and it was very informative, but I think I was
looking for the Transformations section on the page
https://spark.apache.org/docs/0.9.1/scala-programming-guide.html.
On Mon, Nov 17, 2014 at 10:31 AM, Samarth Mailinglist
mailinglistsama...@gmail.com wrote:
Check this
You are changing these paths and filenames to match your own actual scripts
and file locations, right?
On Nov 17, 2014 4:59 AM, Samarth Mailinglist mailinglistsama...@gmail.com
wrote:
I am trying to run a job written in python with the following command:
bin/spark-submit --master