Hey Tim,

Let me make sure I have the key points right.
1. If you are not writing back to Kafka, the delay is stable? That is, if
you replace "foreachRDD { // write to kafka }" with "dstream.count", the
delay stays stable. Right? (See the sketch after this list.)
2. If so, then Kafka is the bottleneck. Is it the number of partitions,
which you mentioned in your second mail, that determines the parallelism
of the writes? Is it stable with 30 partitions?
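
For point 1, the swap I mean is roughly this (just a sketch; "dstream" is
your input DStream, and writeToKafka is a hypothetical stand-in for
whatever producer code you have):

// Current path: push each partition's records out to Kafka.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    writeToKafka(records)  // hypothetical stand-in for your producer code
  }
}

// Diagnostic path: same upstream computation, but no Kafka writes at all.
// If total delay stays flat with this, the write path is the bottleneck.
dstream.count().print()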

Regarding the block exception, could you send me the INFO-level log trace
that leads up to this error? Basically I want to trace the lifecycle of
the block.
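
In case it's useful, the simplest way to get INFO-level logs (assuming you
are using the log4j setup that ships with Spark) is to make sure the root
logger is at INFO on both the driver and the executors:

# conf/log4j.properties (copy conf/log4j.properties.template if needed)
log4j.rootCategory=INFO, console

Then grep the driver and executor logs for the block id (e.g.
input-0-1423761240800) and send me everything that mentions it.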

On Thu, Feb 12, 2015 at 6:29 PM, Tim Smith <secs...@gmail.com> wrote:

> Hi Gerard,
>
> Great write-up and really good guidance in there.
>
> I have to be honest, I don't know why, but setting the # of partitions for
> each dStream to a low number (5-10) just causes the app to choke/crash.
> Setting it to 20 gets the app going, but with not-so-great delays. Bumping
> it up to 30, I start winning the war: processing time stays consistently
> below the batch window (20 seconds), except for one batch every few batches
> where the compute time spikes to about 10x the usual.
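>
> For reference, this is roughly how I set the partition count (a sketch;
> the stream/variable names are illustrative, not my real ones):
>
> val partitions = 30  // 5-10 chokes the app, 20 limps along, 30 mostly keeps up
> val repartitioned = kafkaStream.repartition(partitions)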
>
> Following your guide, I took out some "logInfo" statements I had in the
> app, but it didn't seem to make much difference :(
>
> With a larger batch window (20 seconds), I got the app to run stably for a
> few hours, but then ran into the dreaded "java.lang.Exception: Could not
> compute split, block input-0-1423761240800 not found". I wonder if I need
> to add RDD persistence back?
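>
> If I do add it back, I'm thinking of something like this (just a sketch,
> and I'm not sure it actually addresses the missing-block error;
> "repartitioned" is the illustrative variable name from the snippet above):
>
> import org.apache.spark.storage.StorageLevel
>
> // Keep the repartitioned stream's RDDs on disk as well as in memory.
> repartitioned.persist(StorageLevel.MEMORY_AND_DISK_SER)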
>
> Also, I am reaching out to Virdata with some ProServ inquiries.
>
> Thanks
>
> On Thu, Feb 12, 2015 at 4:30 AM, Gerard Maas <gerard.m...@gmail.com>
> wrote:
>
>> Hi Tim,
>>
>> From this: "There are 5 kafka receivers and each incoming stream is split
>> into 40 partitions", I suspect that you're creating too many tasks for
>> Spark to process on time.
>> Could you try some of the 'knobs' I describe here to see if that would
>> help?
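>>
>> (Rough arithmetic behind that suspicion: 5 streams x 40 partitions means
>> roughly 200 tasks per stage of every 20-second batch, on top of the 5
>> cores permanently occupied by the receivers themselves.)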
>>
>> http://www.virdata.com/tuning-spark/
>>
>> -kr, Gerard.
>>
>> On Thu, Feb 12, 2015 at 8:44 AM, Tim Smith <secs...@gmail.com> wrote:
>>
>>> I just read the thread "Are these numbers abnormal for spark streaming?"
>>> and I think I am seeing similar results - that is, increasing the window
>>> seems to be the trick here. I will have to monitor for a few hours/days
>>> before I can draw any conclusions (there are so many knobs/dials).
>>>
>>>
>>>
>>> On Wed, Feb 11, 2015 at 11:16 PM, Tim Smith <secs...@gmail.com> wrote:
>>>
>>>> On Spark 1.2 (I have been seeing this behaviour since 1.0), I have a
>>>> streaming app that consumes data from Kafka and writes it back to Kafka
>>>> (a different topic). My big problem has been Total Delay. While execution
>>>> time is usually less than the window size (in seconds), the total delay
>>>> ranges from minutes to hours (and keeps going up).
>>>>
>>>> For a little while, I thought I had solved the issue by bumping up the
>>>> driver memory. Then I expanded my Kafka cluster to add more nodes and the
>>>> issue came up again. I tried a few things to smoke out the issue and
>>>> something tells me the driver is the bottleneck again:
>>>>
>>>> 1) From my app, I took out the entire write-out-to-kafka piece. Sure
>>>> enough, execution time, scheduling delay, and hence total delay fell to
>>>> sub-second levels. This assured me that whatever processing I do before
>>>> writing back to kafka isn't the bottleneck.
>>>>
>>>> 2) In my app, I had RDD persistence set at different points, but my code
>>>> wasn't really re-using any RDDs, so I took out all explicit persist()
>>>> statements and set "spar...unpersist" to "true" in the context (see the
>>>> config sketch after this list). After this, it doesn't seem to matter how
>>>> much memory I give my executors; the total delay stays in the same range.
>>>> I tried per-executor memory from 2G to 12G with no change in total delay,
>>>> so the executors aren't memory-starved. Also, in the SparkUI, under the
>>>> Executors tab, all executors show 0/1060MB used when per-executor memory
>>>> is set to 2GB, for example.
>>>>
>>>> 3) The input rate limit on the kafka consumers restricts spikes in
>>>> incoming data.
>>>>
>>>> 4) Tried both FIFO and FAIR scheduling, but it didn't make any difference.
>>>>
>>>> 5) Adding executors beyond a certain point seems useless (I guess the
>>>> excess ones just sit idle).
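>>>>
>>>> The config sketch referenced in point 2 above (the property name got cut
>>>> off in my mail; these keys are my best recollection, so treat them as a
>>>> guess rather than gospel):
>>>>
>>>> import org.apache.spark.SparkConf
>>>>
>>>> val conf = new SparkConf()
>>>>   // Let Spark Streaming unpersist generated RDDs once no longer needed.
>>>>   .set("spark.streaming.unpersist", "true")
>>>>   // Cap each receiver's ingest rate, per point 3 above.
>>>>   .set("spark.streaming.receiver.maxRate", "20000")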
>>>>
>>>> At any given point in time, the SparkUI shows only one batch pending
>>>> processing. With just one batch pending, why would the scheduling delay
>>>> run into minutes/hours when execution time is within the batch window
>>>> duration? There aren't any failed stages or jobs.
>>>>
>>>> Right now, I have 100 executors (I have tried anywhere from 50 to 150),
>>>> each with 2GB and 4 cores, and the driver running with 16GB. There are 5
>>>> kafka receivers, and each incoming stream is split into 40 partitions.
>>>> Per receiver, the input rate is restricted to 20000 messages per second.
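>>>>
>>>> Roughly, the receiver setup looks like this (a sketch building on the
>>>> conf snippet above; the ZooKeeper quorum, group and topic names are
>>>> placeholders, not my real values):
>>>>
>>>> import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>> import org.apache.spark.streaming.kafka.KafkaUtils
>>>>
>>>> val ssc = new StreamingContext(conf, Seconds(20))  // 20-second batches
>>>> val zkQuorum = "zk1:2181,zk2:2181"                 // placeholder
>>>>
>>>> // 5 receivers, each producing its own DStream, each split into 40 partitions.
>>>> val streams = (1 to 5).map { _ =>
>>>>   KafkaUtils.createStream(ssc, zkQuorum, "consumer-group", Map("in-topic" -> 1))
>>>>     .repartition(40)
>>>> }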
>>>>
>>>> Can anyone give me clues or areas to look into for troubleshooting this
>>>> issue?
>>>>
>>>> One nugget I found buried in the code says:
>>>> "The scheduler delay includes the network delay to send the task to the
>>>> worker machine and to send back the result (but not the time to fetch the
>>>> task result, if it needed to be fetched from the block manager on the
>>>> worker)."
>>>>
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala
>>>>
>>>> Could this be an issue with the driver being a bottleneck, with all the
>>>> executors posting their logs/stats to the driver?
>>>>
>>>> Thanks,
>>>>
>>>> Tim
>>>>
>>>
>>
>
