Re: Spark Streaming + Kafka failure recovery

2015-05-21 Thread Bill Jay
/StreamingContext.html. The information on consumed offsets can be recovered from the checkpoint. On Tue, May 19, 2015 at 2:38 PM, Bill Jay bill.jaypeter...@gmail.com wrote: If a Spark streaming job stops at 12:01 and I resume the job at 12:02, will it still start to consume the data that were
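
The recovery path referenced here goes through StreamingContext.getOrCreate. Below is a minimal sketch, not the job from this thread; the checkpoint directory, broker address, topic name, and output path are all assumptions, and it presumes the Spark 1.3 direct Kafka API:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-recovery-sketch")
      val ssc = new StreamingContext(conf, Seconds(60))
      ssc.checkpoint(checkpointDir)
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // assumed broker
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("logs"))  // "logs" is an example topic
        .map(_._2)
        .saveAsTextFiles("hdfs:///tmp/out/result")  // example output prefix
      ssc
    }

    // On restart, the context (including the offsets tracked by the direct
    // stream) is rebuilt from the checkpoint instead of starting from scratch.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()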

Spark Streaming + Kafka failure recovery

2015-05-19 Thread Bill Jay
Hi all, I am currently using Spark streaming to consume and save logs every hour in our production pipeline. The current setting is to run a crontab job to check every minute whether the job is still there and, if not, resubmit a Spark streaming job. I am currently using the direct approach for

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Bill Jay
19, 2015 at 12:42 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am currently using Spark streaming to consume and save logs every hour in our production pipeline. The current setting is to run a crontab job to check every minute whether the job is still there and, if not, resubmit

Partition number of Spark Streaming Kafka receiver-based approach

2015-05-18 Thread Bill Jay
Hi all, I am reading the docs of the receiver-based Kafka consumer. The last parameter of KafkaUtils.createStream is the per-topic number of Kafka partitions to consume. My question is: does the number of partitions for a topic in this parameter need to match the number of partitions in Kafka? For
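
For reference, a sketch of the call in question; the ZooKeeper address, group id, and topic are made up, and ssc is an existing StreamingContext. Per the Spark docs, the Int in the topic map is the number of consumer threads the single receiver runs for that topic, not a partition count, so it does not have to match the number of Kafka partitions:

    import org.apache.spark.streaming.kafka.KafkaUtils

    // topic -> number of receiver threads, not Kafka partitions
    val topicMap = Map("logs" -> 2)
    val stream = KafkaUtils.createStream(
      ssc,
      "zk1:2181",            // ZooKeeper quorum (assumption)
      "my-consumer-group",   // consumer group id (assumption)
      topicMap)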

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
to do with it. On Wed, Apr 29, 2015 at 6:50 PM, Bill Jay bill.jaypeter...@gmail.com wrote: This function is called in foreachRDD. I think it should be running in the executors. I added statement.close() to the code and it is running. I will let you know if this fixes the issue. On Wed
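
A minimal sketch of the fix being discussed, assuming dstream is a DStream[String] and with placeholder JDBC details: open the statement per partition inside foreachRDD and close it in a finally block, so a failing batch cannot leak file handles:

    import java.sql.DriverManager

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Runs on the executors; one connection per partition (placeholder URL).
        val conn = DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass")
        val statement = conn.createStatement()
        try {
          records.foreach(r => statement.executeUpdate(s"INSERT INTO logs VALUES ('$r')"))
        } finally {
          statement.close()  // the close() suggested in this thread
          conn.close()
        }
      }
    }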

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
each batch. On Thu, Apr 30, 2015 at 11:15 AM, Cody Koeninger c...@koeninger.org wrote: Did you use lsof to see what files were opened during the job? On Thu, Apr 30, 2015 at 1:05 PM, Bill Jay bill.jaypeter...@gmail.com wrote: The data ingestion is in the outermost portion of the foreachRDD block

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
:07 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am using the direct approach to receive real-time data from Kafka in the following link: https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html My code follows the word count direct example: https://github.com

Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
Hi all, I am using the direct approach, described in the following link, to receive real-time data from Kafka: https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html My code follows the word count direct example:

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
...@gmail.com wrote: Maybe add statement.close() in finally block ? Streaming / Kafka experts may have better insight. On Wed, Apr 29, 2015 at 2:25 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Thanks for the suggestion. I ran the command and the limit is 1024. Based on my understanding

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Bill Jay
is a must. (And soon we will have that on the metrics report as well!! :-) ) -kr, Gerard. On Thu, Nov 27, 2014 at 8:14 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi TD, I am using Spark Streaming to consume data from Kafka and do some aggregation and ingest the results into RDS. I

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
Just to add one more point: if Spark Streaming knows when an RDD will no longer be used, I believe Spark will not try to retrieve data it will not use any more. However, in practice, I often encounter the error of cannot compute split. Based on my understanding, this is because Spark cleared out

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
this, is by asking Spark Streaming to remember stuff for longer, by using streamingContext.remember(duration). This will ensure that Spark Streaming will keep around all the stuff for at least that duration. Hope this helps. TD On Wed, Nov 26, 2014 at 5:07 PM, Bill Jay bill.jaypeter...@gmail.com
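
The call TD refers to, as a one-line sketch; the five-minute duration is an arbitrary example:

    import org.apache.spark.streaming.Minutes

    // Keep generated RDDs around for at least 5 minutes so late processing
    // does not hit "Could not compute split, block not found".
    ssc.remember(Minutes(5))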

Re: Error when Spark streaming consumes from Kafka

2014-11-23 Thread Bill Jay
On Sun, Nov 23, 2014 at 2:13 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am using Spark to consume from Kafka. However, after the job has run for several hours, I saw the following failure of an executor: kafka.common.ConsumerRebalanceFailedException: group-1416624735998_ip-172-31

Error when Spark streaming consumes from Kafka

2014-11-22 Thread Bill Jay
Hi all, I am using Spark to consume from Kafka. However, after the job has run for several hours, I saw the following failure of an executor: kafka.common.ConsumerRebalanceFailedException: group-1416624735998_ip-172-31-5-242.ec2.internal-1416648124230-547d2c31 can't rebalance after 4 retries
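
The retry count in that exception comes from the Kafka high-level consumer, whose settings can be passed through the createStream overload that accepts kafkaParams. A hedged sketch; the addresses and values are illustrative, not recommendations:

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect"            -> "zk1:2181",  // assumption
      "group.id"                     -> "group-1",   // assumption
      "rebalance.max.retries"        -> "10",        // default is 4, as in the error
      "rebalance.backoff.ms"         -> "5000",
      "zookeeper.session.timeout.ms" -> "10000")

    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("logs" -> 1), StorageLevel.MEMORY_AND_DISK_SER)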

Re: Spark streaming cannot receive any message from Kafka

2014-11-18 Thread Bill Jay
there’s a parameter in KafkaUtils.createStream where you can specify the Spark parallelism; also, what are the exception stacks? Thanks Jerry *From:* Bill Jay [mailto:bill.jaypeter...@gmail.com] *Sent:* Tuesday, November 18, 2014 2:47 AM *To:* Helena Edelson *Cc:* Jay Vyas; u

Spark streaming: java.io.IOException: Version Mismatch (Expected: 28, Received: 18245 )

2014-11-18 Thread Bill Jay
Hi all, I am running a Spark Streaming job. It was able to produce the correct results up to some time. Later on, the job was still running but producing no result. I checked the Spark streaming UI and found that 4 tasks of a stage failed. The error messages showed that Job aborted due to stage

Re: Spark streaming cannot receive any message from Kafka

2014-11-17 Thread Bill Jay
for local mode. Besides, there’s a Kafka wordcount example among the Spark Streaming examples; you can try that. I’ve tested with latest master, it’s OK. Thanks Jerry *From:* Tobias Pfeiffer [mailto:t...@preferred.jp t...@preferred.jp] *Sent:* Thursday, November 13, 2014 8:45 AM *To:* Bill Jay *Cc:* u

Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
Hi all, I have a Spark streaming job which constantly receives messages from Kafka. I was using Spark 1.0.2 and the job had been running for a month. However, now that I am using Spark 1.1.0, the Spark streaming job cannot receive any messages from Kafka. I have not made any change to the
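
One frequent culprit for a receiver-based job that suddenly receives nothing, offered here only as a hedged guess rather than a confirmed diagnosis of this case: in local mode the receiver permanently occupies a core, so the master must allow at least two:

    import org.apache.spark.SparkConf

    // "local[2]" or more: one core for the receiver, at least one for processing
    val conf = new SparkConf().setAppName("kafka-test").setMaster("local[2]")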

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
in Spark Streaming example, you can try that. I’ve tested with latest master, it’s OK. Thanks Jerry *From:* Tobias Pfeiffer [mailto:t...@preferred.jp] *Sent:* Thursday, November 13, 2014 8:45 AM *To:* Bill Jay *Cc:* u...@spark.incubator.apache.org *Subject:* Re: Spark streaming cannot

Spark streaming job failed due to java.util.concurrent.TimeoutException

2014-11-03 Thread Bill Jay
Hi all, I have a spark streaming job that consumes data from Kafka and performs some simple operations on the data. This job is run in an EMR cluster with 10 nodes. The batch size I use is 1 minute and it takes around 10 seconds to generate the results that are inserted into a MySQL database.

Re: combineByKey at ShuffledDStream.scala

2014-07-23 Thread Bill Jay
, 2014 at 10:05 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Can you give an idea of the streaming program? The rest of the transformations you are doing on the input streams? On Tue, Jul 22, 2014 at 11:05 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am currently running

Re: Spark Streaming: no job has started yet

2014-07-23 Thread Bill Jay
ak...@sigmoidanalytics.com wrote: Can you paste the piece of code? Thanks Best Regards On Wed, Jul 23, 2014 at 1:22 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am running a spark streaming job. The job hangs on one stage, which shows as follows: Details for Stage 4

Get Spark Streaming timestamp

2014-07-23 Thread Bill Jay
Hi all, I have a question regarding Spark streaming. When we use the saveAsTextFiles function and my batch is 60 seconds, Spark will generate a series of files such as: result-140614896, result-140614802, result-140614808, etc. I think this is the timestamp for the beginning of each
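
Rather than parsing the timestamp back out of the generated file names, foreachRDD has an overload that hands over the batch Time directly. A sketch, assuming dstream is the DStream being saved and with a hypothetical output path:

    import org.apache.spark.streaming.Time

    dstream.foreachRDD { (rdd, time: Time) =>
      // time.milliseconds is the batch time (epoch ms) that saveAsTextFiles
      // would have appended to the file name
      rdd.saveAsTextFile(s"hdfs:///tmp/out/result-${time.milliseconds}")
    }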

Re: Get Spark Streaming timestamp

2014-07-23 Thread Bill Jay
:39 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a question regarding Spark streaming. When we use the saveAsTextFiles function and my batch is 60 seconds, Spark will generate a series of files such as: result-140614896, result-140614802, result-140614808, etc

Re: spark streaming rate limiting from kafka

2014-07-22 Thread Bill Jay
repartition(400), you have 400 partitions on one host and the other hosts receive nothing of the data? Tobias On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com wrote: I also have an issue consuming from Kafka. When I consume from Kafka, there is always a single executor

combineByKey at ShuffledDStream.scala

2014-07-22 Thread Bill Jay
Hi all, I am currently running a Spark Streaming program, which consumes data from Kafka and does the group by operation on the data. I am trying to optimize the running time of the program because it looks slow to me. It seems the stage named: * combineByKey at ShuffledDStream.scala:42 * always

Spark Streaming: no job has started yet

2014-07-22 Thread Bill Jay
Hi all, I am running a spark streaming job. The job hangs on one stage, which shows as follows: Details for Stage 4: Summary Metrics: No tasks have started yet. Tasks: No tasks have started yet. Does anyone have an idea on this? Thanks! Bill

Re: spark streaming rate limiting from kafka

2014-07-21 Thread Bill Jay
hosts receive nothing of the data? Tobias On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com wrote: I also have an issue consuming from Kafka. When I consume from Kafka, there is always a single executor working on this job. Even if I use repartition, it seems

Re: spark streaming rate limiting from kafka

2014-07-20 Thread Bill Jay
), you have 400 partitions on one host and the other hosts receive nothing of the data? Tobias On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com wrote: I also have an issue consuming from Kafka. When I consume from Kafka, there is always a single executor working on this job

Re: Spark Streaming timestamps

2014-07-18 Thread Bill Jay
On Thu, Jul 17, 2014 at 2:05 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tathagata, Thanks for your answer. Please see my further question below: On Wed, Jul 16, 2014 at 6:57 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Answers inline. On Wed, Jul 16, 2014 at 5:39 PM, Bill

Re: Error: No space left on device

2014-07-17 Thread Bill Jay
Hi, I also have some issues with repartition. In my program, I consume data from Kafka. After I consume data, I use repartition(N). However, although I set N to be 120, there are around 18 executors allocated for my reduce stage. I am not sure how the repartition command works to ensure the

Re: Spark Streaming timestamps

2014-07-17 Thread Bill Jay
Hi Tathagata, Thanks for your answer. Please see my further question below: On Wed, Jul 16, 2014 at 6:57 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Answers inline. On Wed, Jul 16, 2014 at 5:39 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am currently using Spark

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Bill Jay
I also have an issue consuming from Kafka. When I consume from Kafka, there is always a single executor working on this job. Even if I use repartition, it seems that there is still a single executor. Does anyone have an idea how to add parallelism to this job? On Thu, Jul 17, 2014 at 2:06 PM, Chen
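
A sketch of the workaround usually suggested for this, with illustrative names and counts: a single receiver pins all input to one executor, so create several receivers, union them, and repartition before the heavy work:

    import org.apache.spark.streaming.kafka.KafkaUtils

    // several receivers instead of one, so input is spread across executors
    val numReceivers = 4
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "group-1", Map("logs" -> 1))
    }
    val unified = ssc.union(streams).repartition(16)  // parallelism for the work after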

Re: Number of executors change during job running

2014-07-16 Thread Bill Jay
tathagata.das1...@gmail.com wrote: Can you give me a screen shot of the stages page in the web ui, the spark logs, and the code that is causing this behavior. This seems quite weird to me. TD On Mon, Jul 14, 2014 at 2:11 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tathagata, It seems

Spark Streaming timestamps

2014-07-16 Thread Bill Jay
Hi all, I am currently using Spark Streaming to conduct real-time data analytics. We receive data from Kafka. We want to generate output files that contain results based on the data we receive in a specific time interval. I have several questions on Spark Streaming's timestamps: 1)

Re: Number of executors change during job running

2014-07-14 Thread Bill Jay
the stage / time to process the whole batch. TD On Fri, Jul 11, 2014 at 8:32 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tathagata, Do you mean that the data is not shuffled until the reduce stage? That means groupBy still only uses 2 machines? I think I used repartition(300

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
ui. It will show the breakdown of time for each task, including GC times, etc. That might give some indication. TD On Thu, Jul 10, 2014 at 5:13 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tathagata, I set default parallelism as 300 in my configuration file. Sometimes there are more

Re: Join two Spark Streaming

2014-07-11 Thread Bill Jay
= { val newUniqueValuesRDD = newDataRDD.union(distinctValues).distinct uniqueValuesRDD = newUniqueValuesRDD // periodically call uniqueValuesRDD.checkpoint() val uniqueCount = uniqueValuesRDD.count() newDataRDD.map(x => x / count) }) On Tue, Jul 8, 2014 at 11:03 AM, Bill

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-11 Thread Bill Jay
I have met similar issues. The reason is probably that spark-streaming-kafka is not included in the Spark assembly. Currently, I am using Maven to generate a shaded package with all the dependencies. You may try to use sbt assembly to include the dependencies in your jar file. Bill On Thu, Jul
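
A hedged build.sbt fragment for the approach described; the version numbers are examples from that era. spark-streaming-kafka is not part of the Spark assembly, so it has to be bundled into the application jar, while spark-core and spark-streaming can stay provided:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.0.0")  // goes into the fat jar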

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
. It will show the breakdown of time for each task, including GC times, etc. That might give some indication. TD On Thu, Jul 10, 2014 at 5:13 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tathagata, I set default parallelism as 300 in my configuration file. Sometimes there are more executors

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-11 Thread Bill Jay
assembly is not working. Is there any other way to include particular jars with the assembly command? Regards, Dilip On Friday 11 July 2014 12:45 PM, Bill Jay wrote: I have met similar issues. The reason is probably that spark-streaming-kafka is not included in the Spark assembly. Currently

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
context. However, the data distribution may be skewed, in which case you can use a repartition operation to redistribute the data more evenly (both DStream and RDD have repartition). TD On Fri, Jul 11, 2014 at 12:22 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tathagata, I also tried

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
submission. As a result, the simple save-file action took more than 2 minutes. Do you have any idea how Spark determines the number of executors for different stages? Thanks! Bill On Fri, Jul 11, 2014 at 2:01 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tathagata, Below is my main function. I

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
, Bill Jay bill.jaypeter...@gmail.com wrote: Hi folks, I just ran another job that only received data from Kafka, did some filtering, and then saved the results as text files in HDFS. There was no reducing work involved. Surprisingly, the number of executors for the saveAsTextFiles stage was also 2

Re: Use Spark Streaming to update result whenever data come

2014-07-10 Thread Bill Jay
with embarrassingly parallel operations such as map or filter. I hope someone else might provide more insight here. Tobias On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, Now I did the re-partition and ran the program again. I find a bottleneck

Re: Number of executors change during job running

2014-07-10 Thread Bill Jay
? If the reduce by key is not set, then the number of reducers used in the stages can keep changing across batches. TD On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a Spark streaming job running on yarn. It consumes data from Kafka and groups the data

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
something like repartition(N) or repartition(N*2) (with N the number of your nodes) after you receive your data. Tobias On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, Thanks for the suggestion. I have tried to add more nodes from 300 to 400. It seems

Number of executors change during job running

2014-07-09 Thread Bill Jay
Hi all, I have a Spark streaming job running on yarn. It consumes data from Kafka and groups the data by a certain field. The data size is 480k lines per minute, with a batch size of 1 minute. For some batches, the program sometimes takes more than 3 minutes to finish the groupBy operation, which
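
As the replies in this thread note, when the shuffle parallelism is left unset it can drift between batches. A one-line sketch that pins it, assuming pairs is the keyed DStream being grouped; 300 matches the default parallelism mentioned later in the thread:

    // fixed reducer count for every batch instead of a drifting default
    val grouped = pairs.groupByKey(300)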

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
...@preferred.jp wrote: Bill, I haven't worked with Yarn, but I would try adding a repartition() call after you receive your data from Kafka. I would be surprised if that didn't help. On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, I was using Spark 0.9

Join two Spark Streaming

2014-07-08 Thread Bill Jay
Hi all, I am working on a pipeline that needs to join two Spark streams. The input is a stream of integers, and the output is each integer's number of appearances divided by the total number of unique integers. Suppose the input is: 1 2 3 1 2 2 There are 3 unique integers, and 1 appears twice.
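
The reply quoted in the 2014-07-11 "Re:" above keeps a running RDD of distinct values on the driver. A reconstruction of that sketch with illustrative names, assuming intStream is a DStream[Int] and a checkpoint directory has been set; it mirrors the quoted map(x => x / count) step:

    // uniqueValuesRDD lives on the driver; transform's body runs once per batch
    var uniqueValuesRDD = ssc.sparkContext.emptyRDD[Int]

    val ratios = intStream.transform { newDataRDD =>
      uniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct()
      uniqueValuesRDD.checkpoint()       // periodically, to cut the growing lineage
      val count = uniqueValuesRDD.count().toDouble
      newDataRDD.map(x => x / count)     // as in the quoted fragment
    }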

Re: Use Spark Streaming to update result whenever data come

2014-07-08 Thread Bill Jay
On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a problem using Spark Streaming to accept input data and update a result. The input data come from Kafka, and the output is a report map that is updated with historical data every minute

Spark-streaming-kafka error

2014-07-08 Thread Bill Jay
Hi all, I used sbt to package a code that uses spark-streaming-kafka. The packaging succeeded. However, when I submitted to yarn, the job ran for 10 seconds and there was an error in the log file as follows: Caused by: java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$

Re: Spark-streaming-kafka error

2014-07-08 Thread Bill Jay
application jar? If I remember correctly, it's not bundled with the downloadable compiled version of Spark. Tobias On Wed, Jul 9, 2014 at 8:18 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I used sbt to package a code that uses spark-streaming-kafka. The packaging succeeded. However, when

Use Spark Streaming to update result whenever data come

2014-07-02 Thread Bill Jay
Hi all, I have a problem using Spark Streaming to accept input data and update a result. The input data come from Kafka, and the output is a report map that is updated with historical data every minute. My current method is to set the batch size as 1 minute and use foreachRDD to update
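
A hedged alternative to rebuilding the map inside foreachRDD every minute: updateStateByKey maintains the running result incrementally. It requires a checkpoint directory; the path and the key/value types here are illustrative, with keyedStream assumed to be a DStream[(String, Long)]:

    ssc.checkpoint("hdfs:///tmp/state-checkpoint")  // hypothetical path

    val runningMap = keyedStream.updateStateByKey[Long] {
      (newValues: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + newValues.sum)  // fold each batch into the state
    }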

Re: Could not compute split, block not found

2014-07-01 Thread Bill Jay
employ some asynchronous solution where your tasks return immediately and deliver their result via a callback later? Tobias On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Tobias, Your suggestion is very helpful. I will definitely investigate it. Just curious

Re: Could not compute split, block not found

2014-07-01 Thread Bill Jay
, Bill Jay bill.jaypeter...@gmail.com wrote: Tobias, Your suggestion is very helpful. I will definitely investigate it. Just curious. Suppose the batch size is t seconds. In practice, does Spark always require the program to finish processing the data of t seconds within t seconds' processing

slf4j multiple bindings

2014-07-01 Thread Bill Jay
Hi all, I have an issue with multiple slf4j bindings. My program was running correctly; I just added the new dependency kryo, and when I submitted a job, the job was killed because of the following error messages: *SLF4J: Class path contains multiple SLF4J bindings.* The log said there were
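
The usual fix is to exclude the duplicate binding from whichever dependency drags it in. An sbt sketch; treating kryo as the carrier, and its coordinates and version, are assumptions for illustration only (mvn dependency:tree or sbt's dependency graph shows the real culprit):

    // exclude the extra slf4j binding from the newly added dependency
    libraryDependencies += ("com.esotericsoftware.kryo" % "kryo" % "2.21")
      .exclude("org.slf4j", "slf4j-log4j12")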

Re: Could not compute split, block not found

2014-06-30 Thread Bill Jay
think may happen. Can you maybe do something hacky like throwing away a part of the data so that processing time gets below one minute, then check whether you still get that error? Tobias On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Tobias, Thanks

Re: Could not compute split, block not found

2014-06-29 Thread Bill Jay
waiting for processing, so old data was deleted. When it was time to process that data, it didn't exist any more. Is that a possible reason in your case? Tobias On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi, I am running a spark streaming job with 1 minute

Could not compute split, block not found

2014-06-27 Thread Bill Jay
Hi, I am running a spark streaming job with 1 minute as the batch size. It ran for around 84 minutes and was killed because of an exception with the following information: *java.lang.Exception: Could not compute split, block input-0-1403893740400 not found* Before it was killed, it was able to

Re: Spark Streaming RDD transformation

2014-06-26 Thread Bill Jay
be used to make an RDD from a List or Array of objects. But that's not really related to streaming or updating a Map. On Thu, Jun 26, 2014 at 1:40 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am current working on a project that requires to transform each RDD in a DStream