Hi guys,
My Spark Streaming application hits a java.lang.OutOfMemoryError: GC
overhead limit exceeded error in the Spark Streaming driver program. I have
done the following to debug it:
1. Increased the driver memory from 1 GB to 2 GB; this error came after 22
hrs. When the memory was 1 GB, it
Hi guys,
curious how you deal with the logs. I find it difficult to debug with the
logs: we run Spark Streaming in our YARN cluster in client mode, so we have
two logs: the YARN log and the local log (for the client). Whenever I have a
problem, the log is too big to read with gedit and grep. (e.g. after
Hi guys,
We have a use case where we need to use consecutive data points to predict
a status (yes, like using time-series data to predict machine
failure). Is there a straightforward way to do this in Spark Streaming?
If all consecutive data points are in one batch, it's not complicated. But
this will be improved a lot when we have IndexedRDD
(https://github.com/apache/spark/pull/1297) in Spark, which allows
faster single-value updates to a key-value store (within each partition,
without processing the entire partition).
Soon.
TD
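For the consecutive-points case above, a minimal sketch using the built-in window operations (untested; the socket source, the "machineId,value" record format, and the trivial "rising" check are all placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair ops on older Spark versions

    object ConsecutivePoints {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("ConsecutivePoints").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(2))

        // Hypothetical source: lines of "machineId,value" from a socket.
        val points = ssc.socketTextStream("localhost", 9999).map { line =>
          val Array(id, v) = line.split(",")
          (id, v.toDouble)
        }

        // Every 2 seconds, look at the last 10 seconds of values per machine,
        // so consecutive data points from several batches land in one computation.
        val recent = points.groupByKeyAndWindow(Seconds(10), Seconds(2))

        recent.foreachRDD { rdd =>
          rdd.foreach { case (id, values) =>
            // Placeholder "prediction": flag a machine whose values are rising.
            if (values.size >= 2 && values.last > values.head)
              println(s"$id may be trending toward failure")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }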
On Thu, Jul 17, 2014 at 5:57 PM, Yan Fang yanfang
Hi guys,
sure you have a similar use case and want to know how you deal with it. In
our application, we want to check the previous state of some keys and
compare it with their current state.
AFAIK, Spark Streaming does not have key-value access. So currently what I am
doing is storing the previous
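A minimal sketch of the updateStateByKey pattern this thread is discussing (untested; the socket source, the Int state type, and the sum-based update are all placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair ops on older Spark versions

    object PreviousState {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("PreviousState").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(2))
        ssc.checkpoint("/tmp/state-checkpoint")  // required by updateStateByKey

        // Hypothetical source: "key value" lines from a socket.
        val pairs = ssc.socketTextStream("localhost", 9999)
          .map(_.split(" "))
          .map(a => (a(0), a(1).toInt))

        // The update function receives this batch's new values for a key
        // *and* its previous state, so old vs. new can be compared right here.
        val stateStream = pairs.updateStateByKey[Int] { (newValues, prev) =>
          Some(newValues.sum + prev.getOrElse(0))  // returning None drops the key
        }

        stateStream.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }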
Hi TD,
Thank you for the quick reply and for backing my approach. :)
1) The example is this:
1. In the first 2-second interval, after updateStateByKey, I get a few keys
and their states, say, (a - 1, b - 2, c - 3).
2. In the following 2-second interval, I only receive c and d and their
values. But
Hi guys,
need some help with this problem. In our use case, we need to continuously
insert values into the database. So our approach is to create the JDBC
object in the main method and then do the insert operation in the
DStream foreachRDD operation. Is this approach reasonable?
Then the
you're instantiating the JDBC connection in the driver,
and using it inside a closure that would be run in a remote executor.
That means that the connection object would need to be serializable.
If that sounds like what you're doing, it won't work.
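The usual fix is to create the connection inside foreachPartition, so it is
built on the executor and never serialized from the driver; one connection
per partition rather than per record. A minimal sketch (the JDBC URL, table,
and schema here are all hypothetical):

    import java.sql.DriverManager
    import org.apache.spark.streaming.dstream.DStream

    def saveToDb(stream: DStream[(String, Int)]): Unit =
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // Created on the executor, so nothing non-serializable is captured.
          val conn = DriverManager.getConnection(
            "jdbc:postgresql://dbhost:5432/mydb", "user", "pass")
          val stmt = conn.prepareStatement("INSERT INTO events (k, v) VALUES (?, ?)")
          try {
            partition.foreach { case (k, v) =>
              stmt.setString(1, k)
              stmt.setInt(2, v)
              stmt.executeUpdate()
            }
          } finally {
            stmt.close()
            conn.close()
          }
        }
      }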
On Thu, Jul 17, 2014 at 1:37 PM, Yan Fang
Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108
On Thu, Jul 17, 2014 at 2:53 PM, Sean Owen so...@cloudera.com wrote:
On Thu, Jul 17, 2014 at 10:39 PM, Yan Fang yanfang...@gmail.com wrote:
Thank you for the help. If I use TD's approach, it works and there is no
exception. The only drawback
RDD by doing stateStream.foreachRDD(...). There you can perform
whatever operations you need on all the key-state pairs.
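Something like this, as an untested sketch (the key and state types are placeholders):

    import org.apache.spark.streaming.dstream.DStream

    def inspect(stateStream: DStream[(String, Int)]): Unit =
      stateStream.foreachRDD { rdd =>
        // rdd holds every current key-state pair for this batch interval.
        rdd.foreach { case (key, state) => println(s"$key -> $state") }
      }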
TD
, Yan Fang yanfang...@gmail.com wrote:
Hi Tathagata,
Thank you. Is a task slot equivalent to a core? Or can one core actually
run multiple tasks at the same time?
Best,
Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108
On Fri, Jul 11, 2014 at 1:45 PM, Tathagata Das
think the same executor can
be used for the receiver and the processing tasks as well (I am not very
sure about this part)
On Fri, Jul 11, 2014 at 12:29 AM, Yan Fang yanfang...@gmail.com wrote:
Hi all,
I am working to improve the parallelism of the Spark Streaming
application. But I have problems
on, thanks for chipping in!
On Fri, Jul 11, 2014 at 11:16 AM, Yan Fang yanfang...@gmail.com wrote:
Hi Praveen,
Thank you for the answer. That's interesting, because if I only bring up
one executor for Spark Streaming, it seems only the receiver is
working and no other tasks are happening
Hi all,
But I have problems understanding how the executors are used and how the
application is distributed.
1. In YARN, is one executor equal to one container?
2. I saw the statement that a streaming receiver runs on one
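For reference, the standard pattern for receiver parallelism that the question above touches on, in spark-shell style (the host and ports are hypothetical): each input DStream gets its own receiver and occupies one task slot, and the streams are then unioned:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("MultiReceiver")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Three input streams means three receivers (three task slots); union
    // them, then repartition so the remaining cores do the processing.
    val streams = (0 until 3).map(i => ssc.socketTextStream("somehost", 9990 + i))
    val unioned = ssc.union(streams).repartition(8)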
I am using Spark Streaming and have the following two questions:
1. If more than one output operation is put in the same StreamingContext
(basically, I mean, I put all the output operations in the same class), are
they processed one by one in the order they appear in the class? Or are
they
at a
time. This *can* be parallelized using an undocumented config parameter,
spark.streaming.concurrentJobs, which is set to 1 by default.
2. Yes, the output operations (and the Spark jobs that are involved with
them) get queued up.
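For example (untested; the app name is a placeholder):

    import org.apache.spark.SparkConf

    // Lets up to two jobs (output operations) run at the same time.
    // Use with care; it changes the usual one-batch-at-a-time semantics.
    val conf = new SparkConf()
      .setAppName("MyStreamingApp")
      .set("spark.streaming.concurrentJobs", "2")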
TD
On Wed, Jul 9, 2014 at 11:22 AM, Yan Fang yanfang
Hi guys,
I am a little confused by the checkpointing in Spark Streaming. It
checkpoints the intermediate data for the stateful operations, for sure.
Does it also checkpoint the StreamingContext information? Because it
seems we can recreate the SC from the checkpoint after a driver node failure
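The recovery pattern this relies on looks roughly like the following sketch (untested; the checkpoint path and setup are placeholders). getOrCreate rebuilds the StreamingContext from the checkpoint if one exists, which is why context information is checkpointed as well:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("Checkpointed")
      val ssc = new StreamingContext(conf, Seconds(2))
      ssc.checkpoint("hdfs:///checkpoints/myapp")
      // ... define all DStreams inside this factory, not outside ...
      ssc
    }

    // Recovers the context from the checkpoint after a driver failure,
    // or builds a fresh one on the first run.
    val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/myapp", createContext _)
    ssc.start()
    ssc.awaitTermination()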
Hi guys,
Not sure if you have similar issues. I did not find relevant tickets in
JIRA. When I deploy Spark Streaming to YARN, I have the following two
issues:
1. The UI port is random. It is not the default 4040. I have to look at the
container's log to find the UI port. Is this supposed to be this
, not the executor containers. This is because SparkUI belongs to
the SparkContext, which only exists on the driver.
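If you need a fixed port, something like this should work (untested; pick any port known to be free on the driver node):

    import org.apache.spark.SparkConf

    // Pins the UI to a fixed port on the driver. With several apps on
    // one host, this will collide unless each app gets its own port.
    val conf = new SparkConf().set("spark.ui.port", "4040")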
Andrew
I know the order of processing DStreams is guaranteed. I am wondering if the
order of messages within one DStream is also guaranteed. My gut feeling is
yes, because RDDs are immutable. Some simple tests support this, but I want
to hear from an authority to convince myself. Thank you.
Best,
Fang, Yan