Hi,
I have a Spark Streaming application that reads messages from Kafka
(multiple topics) and aggregates the data via updateStateByKey, with
50 Spark workers, each with 1 core and 6 GB RAM. It works fine for
the first 10 minutes or so, but then it gets stuck in foreachRDD.
Here is the thread dump, where it seems every worker is waiting for Executor
#6's Thread 95: sparkExecutor-akka.actor.default-dispatcher-22 (RUNNABLE) to
complete:
Thread 41: BLOCK_MANAGER cleanup timer (WAITING)
Thread 42: BROADCAST_VARS cleanup timer (WAITING)
Thread 44: shuffle-client-0
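For reference, a minimal Scala sketch of the kind of job described above; the topic names, batch interval, and state function are placeholder assumptions, not the original code:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints")   // required by updateStateByKey

    // Receiver-based Kafka stream over multiple topics
    val messages = KafkaUtils.createStream(
      ssc, "zk-host:2181", "my-group", Map("topicA" -> 1, "topicB" -> 1))

    // Keep a running count per key across batches
    val state = messages
      .map { case (_, value) => (value, 1L) }
      .updateStateByKey[Long] { (batch: Seq[Long], prev: Option[Long]) =>
        Some(prev.getOrElse(0L) + batch.sum)
      }

    state.foreachRDD { rdd =>
      rdd.take(10).foreach(println)   // the output action that reportedly hangs
    }

    ssc.start()
    ssc.awaitTermination()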
Hi Lee, it's actually not related to threading at all - you would still have
the same problem even if you were using a single thread. See this section (
https://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark)
of the Spark docs.
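For illustration, a minimal Scala sketch of the pitfall that section describes, with hypothetical names; the usual fix is to copy what the closure needs into a local variable so the enclosing object is never serialized:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    class MyJob(val sc: SparkContext) {
      val factor = 2

      // Problematic: `factor` means `this.factor`, so Spark must serialize
      // the whole MyJob instance -- including `sc`, which is not serializable.
      def bad(rdd: RDD[Int]): RDD[Int] = rdd.map(_ * factor)

      // Fix: copy the field into a local val; only the Int is captured.
      def good(rdd: RDD[Int]): RDD[Int] = {
        val localFactor = factor
        rdd.map(_ * localFactor)
      }
    }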
On June 5, 2015, at 5:12 PM, Lee
Hello Sparkers,
I'm reading data from a CSV file, applying some transformations and ending
up with an RDD of pairs (String,Iterable).
I have already prepared Parquet files. I now want to take the previous
(key, value) RDD and populate the Parquet files as follows:
- key holds the name of the
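Since the message is cut off here, a hedged sketch of one way to do this with the DataFrame API (Spark 1.4+); the column names, the Iterable[String] element type, and partitioning by key are assumptions:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // pairs: RDD[(String, Iterable[String])] from the CSV transformations
    val flat = pairs
      .flatMap { case (key, values) => values.map(v => (key, v)) }
      .toDF("key", "value")

    // One Parquet directory per key value
    flat.write.partitionBy("key").parquet("hdfs:///data/parquet")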
Very interesting and relevant thread for production-level usage of Spark.
@Arun, can you kindly confirm whether Daniel's suggestion helped your use case?
Thanks,
Kapil Malik | kma...@adobe.com | 33430 / 8800836581
From: Daniel Mahler [mailto:dmah...@gmail.com]
Sent: 13 April
I want to use Spark to read compressed .bed files containing gene
sequencing alignment data.
I want to store the bed file data in a database and then use external gene
expression data to find overlaps, etc. Which database is best for this?
Thanks
-Roni
Hi,
Please take a look at:
[SPARK-3019] Pluggable block transfer interface (BlockTransferService)
- NioBlockTransferService implements BlockTransferService and replaces the
old BlockManagerWorker
Cheers
On Sat, Jun 6, 2015 at 2:23 AM, bit1...@163.com <bit1...@163.com> wrote:
Hi,
I remember
Hello,
I've created a Spark cluster on EC2 using the spark-ec2 script. I would
like to be able to modify the logging level of the spark-shell when it is
running on the master. I've copied the log4j.properties template file and
changed the root logger level to WARN, and that doesn't seem to have
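In case it helps, the line that usually needs changing in the copied template is the root category (a minimal sketch, assuming the stock conf/log4j.properties template):

    # conf/log4j.properties on the master
    log4j.rootCategory=WARN, console

On Spark 1.4 or later you can also call sc.setLogLevel("WARN") from inside the shell.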
Hi,
I remember that there was a class called BlockManagerWorker in previous Spark
releases. In the 1.3.1 code, I can see that some method comments still refer
to BlockManagerWorker, which no longer exists.
Which class takes the place of BlockManagerWorker in Spark 1.3.1?
Thanks.
Can you describe your use case in a bit more detail, since not all people on
this mailing list are familiar with gene sequencing alignment data?
Thanks
On Fri, Jun 5, 2015 at 11:42 PM, roni <roni.epi...@gmail.com> wrote:
I want to use Spark to read compressed .bed files containing gene
Hi,
How can I write to multiple outputs, one for each key? I tried creating a
custom partitioner and defining the number of partitions, but it does not work.
Only a few tasks/partitions (equal to the number of all key combinations)
get large datasets; the data is not being split across all
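One approach that is often suggested for per-key output files (a hedged sketch; the class name, path, and partition count are placeholders) is Hadoop's MultipleTextOutputFormat, which routes each record to a file named after its key instead of relying on the partitioner alone:

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.HashPartitioner

    // Name each output file after the record's key
    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString
    }

    val numPartitions = 8          // placeholder
    pairs                          // RDD[(String, String)]
      .partitionBy(new HashPartitioner(numPartitions))
      .saveAsHadoopFile("hdfs:///out", classOf[String], classOf[String],
        classOf[KeyBasedOutput])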
Hi Will,
That doesn't seem to be the case, which was part of the source of my
confusion. The code currently in the run method of the Runnable works
perfectly fine with the lambda expressions when it is invoked from the main
method. They also work when they are invoked from within a separate method
Hi,
I ran a word count application on a 600 MB text file, and set the RDD's
StorageLevel to StorageLevel.MEMORY_AND_DISK_2.
I have two questions that I can't explain:
1. The StorageLevel shown in the UI is Disk Serialized 2x Replicated, but I am
using StorageLevel.MEMORY_AND_DISK_2; where is the
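For reference, a minimal Scala sketch of the setup being described (the file path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    val counts = sc.textFile("hdfs:///input/text-600m.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .persist(StorageLevel.MEMORY_AND_DISK_2)  // memory first, spill to disk, 2 replicas

    counts.count()  // materialize so the storage level shows up in the UI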
Hi,
I am trying to insert data into a partitioned Hive table. The groupByKey is
meant to combine the dataset into a partition of the Hive table. After the
groupByKey, I convert the Iterable[X] to a DataFrame via X.toList.toDF(). But
hiveContext.sql throws a NullPointerException, see below. Any
suggestions? What
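For context, a hedged Scala sketch of the pattern as I read it; the table, column, and class names are hypothetical. Note that toDF() and hiveContext only work on the driver, so if those calls run inside a closure on the executors, a NullPointerException like this is a common symptom:

    case class Record(a: String, b: Int)     // hypothetical schema
    import hiveContext.implicits._

    // grouped: RDD[(String, Iterable[Record])], one entry per Hive partition
    grouped.collect().foreach { case (part, records) =>
      val df = records.toList.toDF()         // driver-side conversion
      df.registerTempTable("staging")
      hiveContext.sql(
        s"INSERT INTO TABLE target PARTITION (part = '$part') SELECT a, b FROM staging")
    }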
Hi Lee, I'm stuck with only mobile devices for correspondence right now, so
I can't get to a shell to play with this issue - this is all supposition; I
think that the lambdas are closing over the context because it's a
constructor parameter to your Runnable class, which is why inlining the
lambdas
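A hedged Scala rendering of that supposition (the original code is presumably Java; names here are invented): because the context is a constructor parameter, any lambda that touches a member of the class captures `this`, and with it the non-serializable context:

    import org.apache.spark.SparkContext

    class Worker(val sc: SparkContext, data: Seq[Int]) extends Runnable {
      val multiplier = 2
      def run(): Unit = {
        val rdd = sc.parallelize(data)
        // `multiplier` is a field, so this lambda captures `this` -- and `sc`
        rdd.map(x => x * multiplier).count()
      }
    }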
Hi Pierre,
One way is to recreate your credentials until AWS generates one without a
slash character in it. Another way I've been using is to pass these
credentials outside the S3 file path by setting the following (where sc is
the SparkContext).
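The message is cut off before the settings themselves; a hedged guess at what was meant, using the standard Hadoop s3n configuration keys of that era (the two value variables are placeholders):

    // Pass AWS credentials outside the S3 path via the Hadoop configuration
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKeyId)
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretAccessKey)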
I believe groupByKey currently requires that all items for a specific key fit
into a single executor's memory:
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
This previous discussion has some pointers if you must
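The usual illustration of that advice, as a minimal sketch over an RDD[(String, Int)] named pairs:

    // groupByKey ships every value for a key to a single executor and holds
    // them all in memory there before summing:
    val grouped = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines map-side first, so only partial sums are shuffled:
    val reduced = pairs.reduceByKey(_ + _)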