/StreamingContext.html.
The information on consumed offset can be recovered from the checkpoint.
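For reference, a minimal sketch of that recovery pattern, assuming a checkpoint
directory and app name that are placeholders rather than anything from this
thread:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// hypothetical checkpoint directory; with the direct approach the consumed
// offsets are stored here along with the DStream lineage
val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  val ssc = new StreamingContext(conf, Seconds(60))
  ssc.checkpoint(checkpointDir)
  // define the Kafka DStream and all transformations here, before returning
  ssc
}

// on a restart at 12:02, this rebuilds the context from the checkpoint
// and resumes from the offsets recorded before the stop at 12:01
val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
ssc.start()
ssc.awaitTermination()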
On Tue, May 19, 2015 at 2:38 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
If a Spark streaming job stops at 12:01 and I resume the job at 12:02,
will it still start to consume the data that were
Hi all,
I am currently using Spark streaming to consume and save logs every hour in
our production pipeline. The current setting is to run a crontab job to
check every minute whether the job is still there and if not resubmit a
Spark streaming job. I am currently using the direct approach for
Hi all,
I am reading the docs of the receiver-based Kafka consumer. The last
parameter of KafkaUtils.createStream is the per-topic number of Kafka
partitions to consume. My question is: does the number of partitions per
topic in this parameter need to match the number of partitions in Kafka?
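For what it's worth, a minimal sketch of that call (the topic name, ZooKeeper
quorum, and group id below are made up, and ssc is assumed to be an existing
StreamingContext). As I read the docs, the number is the count of consumer
threads in the receiver, not the number of Spark partitions, and it does not
have to match the topic's Kafka partition count, though threads beyond the
partition count would sit idle:

import org.apache.spark.streaming.kafka.KafkaUtils

// topic -> number of consumer threads used by this one receiver
val topicMap = Map("logs" -> 2)
val stream = KafkaUtils.createStream(ssc, "zk1:2181", "log-consumer-group", topicMap)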
For
to do with it.
On Wed, Apr 29, 2015 at 6:50 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
This function is called in foreachRDD. I think it should be running in
the executors. I added statement.close() in the code and it is running. I
will let you know if this fixes the issue.
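For later readers, a sketch of the suggested pattern, with the close in a
finally block inside foreachPartition so statements are released on the
executors even when an insert throws (the JDBC URL, credentials, and table
are placeholders, and dstream is assumed to be the input DStream):

import java.sql.DriverManager

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // one connection per partition, created on the executor
    val conn = DriverManager.getConnection("jdbc:mysql://db-host/mydb", "user", "pass")
    val statement = conn.createStatement()
    try {
      records.foreach { r =>
        statement.executeUpdate(s"INSERT INTO logs VALUES ('$r')")
      }
    } finally {
      statement.close() // always release, or open file handles accumulate
      conn.close()
    }
  }
}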
On Wed
each batch.
On Thu, Apr 30, 2015 at 11:15 AM, Cody Koeninger c...@koeninger.org wrote:
Did you use lsof to see what files were opened during the job?
On Thu, Apr 30, 2015 at 1:05 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
The data ingestion is in the outermost portion of the foreachRDD block
:07 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi all,
I am using the direct approach to receive real-time data from Kafka in
the following link:
https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
My code follows the word count direct example:
https://github.com
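The shape of that example, for reference (the broker address and topic are
placeholders, and ssc is an existing StreamingContext):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("logs")
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
val counts = messages.map(_._2)        // drop the key, keep the message body
  .flatMap(_.split(" "))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)
counts.print()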
...@gmail.com wrote:
Maybe add statement.close() in a finally block?
Streaming / Kafka experts may have better insight.
On Wed, Apr 29, 2015 at 2:25 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Thanks for the suggestion. I ran the command and the limit is 1024.
Based on my understanding
is a must. (And soon we will have that on the metrics report
as well!! :-) )
-kr, Gerard.
On Thu, Nov 27, 2014 at 8:14 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi TD,
I am using Spark Streaming to consume data from Kafka and do some
aggregation and ingest the results into RDS. I
Just to add one more point: if Spark Streaming knows when an RDD will not be
used any more, I believe Spark will not try to retrieve data it will not
use any more. However, in practice, I often encounter the "cannot compute
split" error. Based on my understanding, this is because Spark cleared
out
this, is by
asking Spark Streaming to remember stuff for longer, using
streamingContext.remember(duration). This will ensure that Spark
Streaming keeps all the stuff around for at least that duration.
Hope this helps.
TD
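A one-line sketch of that call (the five-minute value is arbitrary):

import org.apache.spark.streaming.Minutes

// keep generated RDDs around for at least 5 minutes so later jobs
// do not hit "Could not compute split" on cleaned-up blocks
ssc.remember(Minutes(5))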
On Wed, Nov 26, 2014 at 5:07 PM, Bill Jay bill.jaypeter...@gmail.com
On Sun, Nov 23, 2014 at 2:13 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi all,
I am using Spark to consume from Kafka. However, after the job has run for
several hours, I saw the following failure of an executor:
kafka.common.ConsumerRebalanceFailedException:
group-1416624735998_ip-172-31-5-242.ec2.internal-1416648124230-547d2c31
can't rebalance after 4 retries
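One mitigation I have seen suggested elsewhere, offered here as an assumption
rather than something confirmed in this thread, is to raise the high-level
consumer's rebalance retries and backoff through the kafkaParams overload of
createStream (quorum, group, and topic below are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect" -> "zk1:2181",
  "group.id" -> "log-consumer-group",
  "rebalance.max.retries" -> "10",   // the old consumer's default is 4, matching the error
  "rebalance.backoff.ms" -> "5000"
)
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("logs" -> 1), StorageLevel.MEMORY_AND_DISK_SER)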
there’s a parameter in KafkaUtils.createStream where you can specify
the Spark parallelism; also, what is the exception stack?
Thanks
Jerry
*From:* Bill Jay [mailto:bill.jaypeter...@gmail.com]
*Sent:* Tuesday, November 18, 2014 2:47 AM
*To:* Helena Edelson
*Cc:* Jay Vyas; u
Hi all,
I am running a Spark Streaming job. It was able to produce the correct
results up to a certain time. Later on, the job was still running but
producing no results. I checked the Spark streaming UI and found that 4
tasks of a stage had failed.
The error messages showed that Job aborted due to stage
for local mode. Besides, there’s a Kafka wordcount example in the Spark
Streaming examples; you can try that. I’ve tested with the latest master,
and it’s OK.
Thanks
Jerry
*From:* Tobias Pfeiffer [mailto:t...@preferred.jp]
*Sent:* Thursday, November 13, 2014 8:45 AM
*To:* Bill Jay
*Cc:* u...@spark.incubator.apache.org
*Subject:* Re: Spark streaming cannot
Hi all,
I have a Spark streaming job which constantly receives messages from Kafka.
I was using Spark 1.0.2 and the job had been running for a month. However,
now that I am using Spark 1.1.0, the Spark streaming job cannot
receive any messages from Kafka. I have not made any change to the
Hi all,
I have a spark streaming job that consumes data from Kafka and performs
some simple operations on the data. This job is run in an EMR cluster with
10 nodes. The batch size I use is 1 minute and it takes around 10 seconds
to generate the results that are inserted to a MySQL database.
, 2014 at 10:05 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Can you give an idea of the streaming program? What are the rest of the
transformations you are doing on the input streams?
On Tue, Jul 22, 2014 at 11:05 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi all,
I am currently running
ak...@sigmoidanalytics.com
wrote:
Can you paste the piece of code?
Thanks
Best Regards
On Wed, Jul 23, 2014 at 1:22 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi all,
I am running a spark streaming job. The job hangs on one stage, which
shows as follows:
Details for Stage 4
Hi all,
I have a question regarding Spark streaming. When I use the
saveAsTextFiles function and my batch size is 60 seconds, Spark will
generate a series of files such as:
result-140614896, result-140614802, result-140614808, etc.
I think this is the timestamp for the beginning of each
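As far as I know, saveAsTextFiles takes a prefix (and optional suffix) and
names each batch's output prefix-<batchTimeMs>[.suffix], where the number is
the batch time in epoch milliseconds; a sketch with a placeholder path:

// produces one directory per batch, named like result-<batchTimeMs>
counts.saveAsTextFiles("hdfs:///output/result")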
repartition(400), you have 400 partitions on one
host and the other hosts receive none of the data?
Tobias
On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
I also have an issue consuming from Kafka. When I consume from Kafka,
there is always a single executor
Hi all,
I am currently running a Spark Streaming program, which consumes data from
Kafka and performs the groupBy operation on the data. I am trying to
optimize the running time of the program because it looks slow to me. It
seems the stage named:
* combineByKey at ShuffledDStream.scala:42 *
always
Hi all,
I am running a spark streaming job. The job hangs on one stage, which shows
as follows:
Details for Stage 4
Summary Metrics: No tasks have started yet
Tasks: No tasks have started yet
Does anyone have an idea on this?
Thanks!
Bill
On Thu, Jul 17, 2014 at 2:05 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi Tathagata,
Thanks for your answer. Please see my further question below:
On Wed, Jul 16, 2014 at 6:57 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Answers inline.
On Wed, Jul 16, 2014 at 5:39 PM, Bill
Hi,
I also have some issues with repartition. In my program, I consume data
from Kafka. After I consume data, I use repartition(N). However, although I
set N to be 120, there are around 18 executors allocated for my reduce
stage. I am not sure how the repartition command works to ensure the
I also have an issue consuming from Kafka. When I consume from Kafka,
there is always a single executor working on this job. Even if I use
repartition, it seems that there is still a single executor. Does anyone
have an idea how to add parallelism to this job?
On Thu, Jul 17, 2014 at 2:06 PM, Chen
tathagata.das1...@gmail.com
wrote:
Can you give me a screenshot of the stages page in the web UI, the Spark
logs, and the code that is causing this behavior? This seems quite weird to
me.
TD
On Mon, Jul 14, 2014 at 2:11 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi Tathagata,
It seems
Hi all,
I am currently using Spark Streaming to conduct real-time data analytics.
We receive data from Kafka. We want to generate output files that contain
results that are based on the data we receive from a specific time
interval.
I have several questions on Spark Streaming's timestamp:
1)
the stage / time to process the whole batch.
TD
On Fri, Jul 11, 2014 at 8:32 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi Tathagata,
Do you mean that the data is not shuffled until the reduce stage? That
means groupBy still only uses 2 machines?
I think I used repartition(300
ui. It will show the breakdown of time for each task,
including GC times, etc. That might give some indication.
TD
On Thu, Jul 10, 2014 at 5:13 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi Tathagata,
I set default parallelism as 300 in my configuration file. Sometimes
there are more
= {
val newUniqueValuesRDD = newDataRDD.union(distinctValues).distinct
uniqueValuesRDD = newUniqueValuesRDD
// periodically call uniqueValuesRDD.checkpoint()
val uniqueCount = uniqueValuesRDD.count()
// divide each element by the number of unique values seen so far
newDataRDD.map(x => x / uniqueCount)
})
On Tue, Jul 8, 2014 at 11:03 AM, Bill
I have met similar issues. The reason is probably that in the Spark
assembly, spark-streaming-kafka is not included. Currently, I am using
Maven to generate a shaded package with all the dependencies. You may try
to use sbt assembly to include the dependencies in your jar file.
Bill
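A sketch of that setup (version numbers are illustrative): the key point is
that spark-streaming-kafka is not marked provided, so sbt assembly bundles it
into the fat jar:

// build.sbt fragment, assuming the sbt-assembly plugin is installed
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"       % "1.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.0.0"  // bundled into the jar
)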
On Thu, Jul
assembly
is not working. Is there any other way to include particular jars with the
assembly command?
Regards,
Dilip
context. However, the data distribution may be skewed, in which case
you can use a repartition operation to redistribute the data more evenly
(both DStream and RDD have repartition).
TD
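A sketch of that suggestion, assuming stream is the DStream returned from
Kafka and 120 is an arbitrary target of a few partitions per core:

// redistribute the received blocks across executors before the shuffle
val spread = stream.map(_._2).repartition(120)
val counts = spread.map(line => (line, 1L)).reduceByKey(_ + _, 120)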
On Fri, Jul 11, 2014 at 12:22 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi Tathagata,
I also tried
submission.
As a result, the simple save file action took more than 2 minutes. Do you
have any idea how Spark determined the number of executors
for different stages?
Thanks!
Bill
On Fri, Jul 11, 2014 at 2:01 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi Tathagata,
Below is my main function. I
, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi folks,
I just ran another job that only received data from Kafka, did some
filtering, and then saved the data as text files in HDFS. There was no
reducing work involved. Surprisingly, the number of executors for the
saveAsTextFiles stage was also 2
with embarrassingly parallel
operations such as map or filter. I hope someone else might provide more
insight here.
Tobias
On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi Tobias,
Now I did the repartition and ran the program again. I found a bottleneck
? If the number of reducers for reduceByKey is not set, then the number of
reducers used in the stages can keep changing across batches.
TD
On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi all,
I have a Spark streaming job running on yarn. It consumes data from Kafka
and groups the data
something like repartition(N) or
repartition(N*2) (with N the number of your nodes) after you receive your
data.
Tobias
On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi Tobias,
Thanks for the suggestion. I have tried to add more nodes from 300 to
400. It seems
Hi all,
I have a Spark streaming job running on yarn. It consumes data from Kafka
and groups the data by a certain field. The data size is 480k lines per
minute, and the batch size is 1 minute.
For some batches, the program sometimes takes more than 3 minutes to finish
the groupBy operation, which
...@preferred.jp wrote:
Bill,
I haven't worked with Yarn, but I would try adding a repartition() call
after you receive your data from Kafka. I would be surprised if that didn't
help.
On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi Tobias,
I was using Spark 0.9
Hi all,
I am working on a pipeline that needs to join two Spark streams. The input
is a stream of integers, and the output is each integer's number of
appearances divided by the total number of unique integers. Suppose the
input is:
1
2
3
1
2
2
There are 3 unique integers and 1 appears twice.
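Pinning down the arithmetic on that example with plain collections (2 appears
three times, so its ratio is 3/3):

val input = Seq(1, 2, 3, 1, 2, 2)
val uniqueCount = input.distinct.size                 // 3
val ratios = input.groupBy(identity)
  .mapValues(_.size.toDouble / uniqueCount)           // 1 -> 0.667, 2 -> 1.0, 3 -> 0.333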
Hi all,
I used sbt to package code that uses spark-streaming-kafka. The packaging
succeeded. However, when I submitted the job to yarn, it ran for 10 seconds
and there was an error in the log file as follows:
Caused by: java.lang.NoClassDefFoundError:
org/apache/spark/streaming/kafka/KafkaUtils$
application jar? If I remember correctly, it's not
bundled with the downloadable compiled version of Spark.
Tobias
On Wed, Jul 9, 2014 at 8:18 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi all,
I have a problem using Spark Streaming to accept input data and update a
result.
The input of the data is from Kafka and the output is to report a map which
is updated with historical data every minute. My current method is to set
the batch size to 1 minute and use foreachRDD to update
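One possible shape for such an update, not necessarily what the poster used,
is updateStateByKey, assuming pairs is a DStream[(String, Long)] and
checkpointing is enabled:

// running total per key, carried from batch to batch
val updated = pairs.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
  Some(state.getOrElse(0L) + newValues.sum)
}
// report the updated map once per one-minute batch
updated.foreachRDD(rdd => rdd.take(100).foreach(println))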
employ some asynchronous
solution where your tasks return immediately and deliver their result via a
callback later?
Tobias
On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Tobias,
Your suggestion is very helpful. I will definitely investigate it.
Just curious
, Bill Jay bill.jaypeter...@gmail.com
wrote:
Tobias,
Your suggestion is very helpful. I will definitely investigate it.
Just curious. Suppose the batch size is t seconds. In practice, does
Spark always require the program to finish processing the data of t seconds
within t seconds' processing
Hi all,
I have an issue with multiple SLF4J bindings. My program was running
correctly until I added a new dependency, kryo. When I submitted a
job, the job was killed because of the following error messages:
*SLF4J: Class path contains multiple SLF4J bindings.*
The log said there were
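A common fix, offered as an assumption since the log excerpt does not name
the offending binding, is to exclude the duplicate binding from whichever
dependency drags it in (coordinates below are placeholders):

// build.sbt fragment: keep only one SLF4J binding on the classpath
libraryDependencies += "some.group" % "some-artifact" % "1.0" exclude("org.slf4j", "slf4j-log4j12")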
think
may happen.
Can you maybe do something hacky like throwing away a part of the data so
that processing time gets below one minute, then check whether you still
get that error?
Tobias
On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Tobias,
Thanks
waiting for processing, so old data was deleted. When it was
time to process that data, it didn't exist any more. Is that a
possible reason in your case?
Tobias
On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi,
I am running a spark streaming job with 1 minute
Hi,
I am running a spark streaming job with 1 minute as the batch size. It ran
for around 84 minutes and was killed because of an exception with the
following information:
*java.lang.Exception: Could not compute split, block input-0-1403893740400
not found*
Before it was killed, it was able to
be used to make an RDD from a List or Array of objects. But that's not
really related to streaming or updating a Map.
On Thu, Jun 26, 2014 at 1:40 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi all,
I am currently working on a project that requires transforming each RDD in
a DStream