Spark Streaming Stuck After 10mins Issue...

2015-06-06 Thread EH
Hi, I have a Spark Streaming application that reads messages from Kafka (multiple topics) and does aggregation on the data via updateStateByKey, with 50 Spark workers where each has 1 core and 6 GB RAM. It works fine for the first 10 minutes or so, but then it gets stuck in the foreachRDD
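For readers unfamiliar with the pattern being described, here is a minimal hypothetical sketch of a stateful streaming aggregation followed by a foreachRDD output stage. The stream source, batch interval, checkpoint path, and aggregation logic are placeholders, not EH's actual code (the original reads from Kafka with multiple topics via KafkaUtils):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulAggregation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulAggregation")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/checkpoints")   // updateStateByKey requires a checkpoint directory

    // Placeholder source; the original uses KafkaUtils with multiple topics.
    val events = ssc.socketTextStream("localhost", 9999)
      .map(line => (line.split(",")(0), 1L))

    // Running count per key, carried across batches.
    val counts = events.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + newValues.sum)
    }

    // The foreachRDD stage where the poster reports the job gets stuck.
    counts.foreachRDD { rdd =>
      rdd.take(10).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```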

Re: Spark Streaming Stuck After 10mins Issue...

2015-06-06 Thread EH
And here is the thread dump, where it seems every worker is waiting for Executor #6 Thread 95: sparkExecutor-akka.actor.default-dispatcher-22 (RUNNABLE) to complete: Thread 41: BLOCK_MANAGER cleanup timer (WAITING) Thread 42: BROADCAST_VARS cleanup timer (WAITING) Thread 44: shuffle-client-0

Re: SparkContext Threading

2015-06-06 Thread Will Briggs
Hi Lee, it's actually not related to threading at all - you would still have the same problem even if you were using a single thread. See this section ( https://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark) of the Spark docs. On June 5, 2015, at 5:12 PM, Lee
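The section Will links to describes how referencing a method or field of an enclosing object pulls the whole object into the task closure. A small illustrative sketch of that pitfall and the usual fix (class and field names invented here for illustration):

```scala
import org.apache.spark.rdd.RDD

class Searcher(val query: String) {
  // `query` here is really `this.query`, so the whole Searcher instance
  // is shipped to the executors along with the closure.
  def getMatchesBad(rdd: RDD[String]): RDD[String] =
    rdd.filter(x => x.contains(query))

  // Copying the field into a local val keeps the closure small and avoids
  // serializing (or failing to serialize) the enclosing object.
  def getMatchesGood(rdd: RDD[String]): RDD[String] = {
    val localQuery = query
    rdd.filter(x => x.contains(localQuery))
  }
}
```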

Filling Parquet files by values in Value of a JavaPairRDD

2015-06-06 Thread Mohamed Nadjib Mami
Hello Sparkers, I'm reading data from a CSV file, applying some transformations, and ending up with an RDD of pairs (String, Iterable). I have already prepared Parquet files. I now want to take the previous (key, value) RDD and populate the Parquet files as follows: - key holds the name of the
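One hedged way to approach this, assuming (as the truncated message suggests) that each key names its own Parquet output, is to write one Parquet directory per distinct key. Schema, paths, and the Record case class below are placeholders, not Mohamed's setup:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Record(value: String)

object PerKeyParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PerKeyParquet"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Stand-in for the (key, values) RDD produced by the CSV transformations.
    val pairs = sc.parallelize(Seq(
      ("fileA", Iterable("a1", "a2")),
      ("fileB", Iterable("b1"))
    ))

    // Write one Parquet directory per distinct key; the key names the output.
    pairs.keys.distinct().collect().foreach { k =>
      pairs.filter(_._1 == k)
        .flatMap(_._2)
        .map(v => Record(v))
        .toDF()
        .saveAsParquetFile(s"/data/parquet/$k")
    }
  }
}
```

This loops over keys on the driver, so it is only reasonable when the number of distinct keys is small.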

RE: Problem getting program to run on 15TB input

2015-06-06 Thread Kapil Malik
Very interesting and relevant thread for production-level usage of Spark. @Arun, can you kindly confirm if Daniel's suggestion helped your use case? Thanks, Kapil Malik | kma...@adobe.com | 33430 / 8800836581 From: Daniel Mahler [mailto:dmah...@gmail.com] Sent: 13 April

which database for gene alignment data ?

2015-06-06 Thread roni
I want to use Spark to read compressed .bed files containing gene sequencing alignment data. I want to store the bed file data in a database and then use external gene expression data to find overlaps etc. Which database is best for it? Thanks -Roni

Re: Which class takes place of BlockManagerWorker in Spark 1.3.1

2015-06-06 Thread Ted Yu
Hi, Please take a look at: [SPARK-3019] Pluggable block transfer interface (BlockTransferService) - NioBlockTransferService implements BlockTransferService and replaces the old BlockManagerWorker. Cheers On Sat, Jun 6, 2015 at 2:23 AM, bit1...@163.com wrote: Hi, I remembered

Logging in spark-shell on master

2015-06-06 Thread Robert Pond
Hello, I've created a spark cluster on ec2 using the spark-ec2 script. I would like to be able to modify the logging level of the spark-shell when it is running on the master. I've copied the log4j.properties template file and changed the root logger level to WARN and that doesn't seem to have
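If editing log4j.properties on the master doesn't take effect, one runtime workaround (a sketch, assuming the spark-shell prompt and the log4j 1.x API that Spark 1.x bundles) is to set the level directly:

```scala
import org.apache.log4j.{Level, Logger}

// Raise the root logger threshold for the current shell session.
Logger.getRootLogger.setLevel(Level.WARN)

// Individual noisy packages can also be tuned, e.g.:
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)
```

For the properties-file route, the edited log4j.properties normally has to sit in the conf/ directory of the Spark installation that actually launches the shell (i.e., on the driver), not only on the workers.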

Which class takes place of BlockManagerWorker in Spark 1.3.1

2015-06-06 Thread bit1...@163.com
Hi, I remember that there was a class called BlockManagerWorker in previous Spark releases. In the 1.3.1 code, I can see that some method comments still refer to BlockManagerWorker, which doesn't exist at all. I would like to ask which class takes the place of BlockManagerWorker in Spark 1.3.1? Thanks.

Re: which database for gene alignment data ?

2015-06-06 Thread Ted Yu
Can you describe your use case in a bit more detail, since not all people on this mailing list are familiar with gene sequencing alignment data? Thanks On Fri, Jun 5, 2015 at 11:42 PM, roni roni.epi...@gmail.com wrote: I want to use Spark to read compressed .bed files containing gene

write multiple outputs by key

2015-06-06 Thread patcharee
Hi, How can I write to multiple outputs for each key? I tried creating a custom partitioner and defining the number of partitions, but it does not work. Only a few tasks/partitions (equal to the number of all key combinations) get large datasets; the data is not splitting across all
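A commonly cited approach for one output file per key (a sketch, not patcharee's code; paths and example data are made up) is to subclass Hadoop's MultipleTextOutputFormat so the output path is derived from the key, while still hash-partitioning so the work is spread over many tasks:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Route each record to a sub-directory named after its key.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name            // one directory per key, part files inside

  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()                    // don't repeat the key in the output lines
}

object WritePerKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WritePerKey"))
    val data = sc.parallelize(Seq(("a", "1"), ("b", "2"), ("a", "3")))

    data
      .partitionBy(new HashPartitioner(4))   // spread work across tasks regardless of key count
      .saveAsHadoopFile("/tmp/out", classOf[String], classOf[String], classOf[KeyBasedOutput])
  }
}
```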

Re: SparkContext Threading

2015-06-06 Thread Lee McFadden
Hi Will, That doesn't seem to be the case and was part of the source of my confusion. The code currently in the run method of the runnable works perfectly fine with the lambda expressions when it is invoked from the main method. They also work when they are invoked from within a separate method

Don't understand the numbers on the Storage UI(/storage/rdd/?id=4)

2015-06-06 Thread bit1...@163.com
Hi, I run a word count application on a 600 MB text file and set the RDD's StorageLevel to StorageLevel.MEMORY_AND_DISK_2. I have two questions that I can't explain: 1. The StorageLevel shown on the UI is "Disk Serialized 2x Replicated", but I am using StorageLevel.MEMORY_AND_DISK_2; where is the

hiveContext.sql NullPointerException

2015-06-06 Thread patcharee
Hi, I try to insert data into a partitioned Hive table. The groupByKey is used to combine the dataset into partitions of the Hive table. After the groupByKey, I converted the Iterable[X] to a DataFrame via X.toList.toDF(). But hiveContext.sql throws a NullPointerException, see below. Any suggestions? What
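Without the full stack trace this is only a guess, but a frequent cause of NullPointerException in code of this shape is calling toDF() or hiveContext.sql inside a closure that runs on executors (for example inside a foreach after groupByKey); SQLContext/HiveContext only exist on the driver. A hedged sketch of the driver-side pattern, letting Hive's dynamic partitioning route rows into table partitions (table, column, and case-class names are placeholders; assumes a spark-shell where sc is in scope and my_partitioned_table already exists, partitioned by partKey):

```scala
import org.apache.spark.sql.hive.HiveContext

case class X(partKey: String, value: String)

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// Build the DataFrame on the driver, not inside an RDD transformation.
val records = sc.parallelize(Seq(X("p1", "a"), X("p2", "b")))

hiveContext.sql("SET hive.exec.dynamic.partition=true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

records.toDF().registerTempTable("staging")
hiveContext.sql(
  "INSERT INTO TABLE my_partitioned_table PARTITION (partKey) SELECT value, partKey FROM staging")
```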

Re: SparkContext Threading

2015-06-06 Thread William Briggs
Hi Lee, I'm stuck with only mobile devices for correspondence right now, so I can't get to a shell to play with this issue - this is all supposition; I think that the lambdas are closing over the context because it's a constructor parameter to your Runnable class, which is why inlining the lambdas
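To make the supposition concrete, here is an invented sketch (not Lee's code) of how a lambda inside a class captures `this`, dragging a SparkContext field into the task closure, and how copying fields to locals avoids it:

```scala
import org.apache.spark.SparkContext

// `query` used inside the filter is really `this.query`, so `this`
// (including the non-serializable SparkContext field) is captured.
class BadJob(sc: SparkContext, query: String) extends Runnable {
  override def run(): Unit = {
    val rdd = sc.parallelize(1 to 100)
    rdd.filter(n => n.toString.contains(query)).count()
  }
}

// Copying fields into local vals before the transformation keeps the
// closure free of the enclosing object.
class GoodJob(sc: SparkContext, query: String) extends Runnable {
  override def run(): Unit = {
    val localQuery = query
    val rdd = sc.parallelize(1 to 100)
    rdd.filter(n => n.toString.contains(localQuery)).count()
  }
}
```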

Re: Access several s3 buckets, with credentials containing /

2015-06-06 Thread Sujit Pal
Hi Pierre, One way is to recreate your credentials until AWS generates one without a slash character in it. Another way I've been using is to pass these credentials outside the S3 file path by setting the following (where sc is the SparkContext).
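The setting Sujit refers to is cut off above; the usual way to pass S3 credentials outside the path is via the Hadoop configuration on the SparkContext, so a "/" in the secret key no longer breaks URL parsing. A sketch with placeholder keys and bucket, assuming the s3n:// filesystem and a spark-shell where sc is in scope:

```scala
// Credentials go into the Hadoop configuration instead of the URL.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// Buckets can then be referenced without inline credentials:
val lines = sc.textFile("s3n://my-bucket/path/to/data.csv")
lines.count()
```

Note that this configuration is per SparkContext, so on its own it does not cover the case of different buckets needing different credentials.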

Re: write multiple outputs by key

2015-06-06 Thread Will Briggs
I believe groupByKey currently requires that all items for a specific key fit into a single executor's memory: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html This previous discussion has some pointers if you must
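A sketch of the recommendation in the linked page (example data made up; assumes a spark-shell where sc is in scope): when the end goal is a per-key aggregate, reduceByKey combines values map-side before the shuffle and never needs all of a key's items in one executor's memory at once.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey: shuffles every value and materializes an Iterable per key.
val summedViaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: pre-aggregates within each partition before the shuffle.
val summedViaReduce = pairs.reduceByKey(_ + _)
```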