Re: spark streaming and the spark shell

2014-03-27 Thread Evgeny Shishkin
 
 2.  I notice that once I call ssc.start(), my stream starts processing and
 continues indefinitely, even if I close the socket on the server end (I'm
 using the Unix command nc to mimic a server, as explained in the streaming
 programming guide). Can I tell my stream to detect that it has lost the
 connection and therefore stop executing? (Or, even better, to attempt to
 re-establish the connection?)
 
 
 
 Currently, not yet. But I am aware of this and this behavior will be
 improved in the future.
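
For context, a minimal sketch of the socket-stream setup the question describes (following the pattern from the streaming programming guide; the app name, host and port are placeholders):

    // Start a toy server in a shell first, e.g.:  nc -lk 9999
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object SocketWordCount {
      def main(args: Array[String]): Unit = {
        // local[2]: one thread for the receiver, one for processing
        val conf = new SparkConf().setAppName("SocketWordCount").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(1))

        // Receives text lines from the nc server; note that nothing here
        // notices a closed socket, which is exactly the complaint below.
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }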

Now I understand why our Spark Streaming job starts to generate zero-sized RDDs
from the Kafka input
when one worker gets an OOM or crashes.

And we can't detect it! Great. So Spark Streaming just isn't suited yet for
24/7 operation =\

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Evgeny Shishkin

On 28 Mar 2014, at 00:34, Scott Clasen scott.cla...@gmail.com wrote:
Actually, looking closer, it is stranger than I thought:

in the Spark UI, one executor has executed 4 tasks, and the other has executed
1928.

Can anyone explain the workings of a KafkaInputDStream w.r.t. Kafka partitions
and the mapping to Spark executors and tasks?


Well, there are some issues with the Kafka input at the moment.
When you call KafkaUtils.createStream, it creates a Kafka high-level consumer on
one node!
I don't really know how many RDDs it will generate during a batch window,
but once those RDDs are created, Spark schedules the subsequent transformations on
that one node, because of locality.

You can try to repartition() those RDDs; sometimes it helps (see the sketch below).

To try to consume from Kafka on multiple machines, you can do
(1 to N).map(KafkaUtils.createStream).
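
A rough sketch of that workaround combined with the repartition() idea above (this assumes the 0.9-era KafkaUtils API and an already-created StreamingContext ssc; the ZooKeeper address, group, topic and the numbers are placeholders):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 4  // one Kafka receiver per worker you want to pull from
    val streams = (1 to numReceivers).map { _ =>
      // Each call creates its own high-level consumer in the same consumer group.
      KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("my-topic" -> 1))
    }
    // Union the per-receiver streams and spread the records across the cluster;
    // otherwise downstream stages tend to stay on the receiver nodes due to locality.
    val unified = ssc.union(streams).repartition(16)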
But then an issue arises with the Kafka high-level consumer!
Those consumers operate in one consumer group, and they try to decide among
themselves which consumer consumes which partitions.
This sync-partition rebalance may simply fail, and then you have only a few
consumers really consuming.
To mitigate this problem, you can set the rebalance retries very high, and pray it
helps.
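
If you go that route, the knobs are ordinary Kafka 0.8 high-level consumer properties passed through kafkaParams. A hedged sketch (the values are guesses to tune rather than recommendations, and an existing StreamingContext ssc is again assumed):

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect"     -> "zk1:2181",
      "group.id"              -> "my-group",
      // Allow many more rebalance attempts (the default is 4) with a longer pause
      // between them, so slow-starting consumers do not lose the rebalance.
      "rebalance.max.retries" -> "20",
      "rebalance.backoff.ms"  -> "5000"
    )

    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)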

Then there is yet another "feature": if your receiver dies (OOM, hardware failure),
you just stop receiving from Kafka!
Brilliant.
And another one: if you ask Spark's Kafka input to begin with
auto.offset.reset = smallest, it will reset your offsets every time you run the
application!
That does not comply with the documentation (reset to the earliest offsets only if
no offsets are found in ZooKeeper);
it just erases your offsets and starts again from zero!

And remember that you have to restart your streaming app whenever there is any
failure on the receiver!

So, bottom line: the Kafka input stream just does not work.






Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Evgeny Shishkin

On 28 Mar 2014, at 01:44, Tathagata Das tathagata.das1...@gmail.com wrote:

 The more I think about it, the problem is not about /tmp; it's more about the
 workers not having enough memory. Blocks of received data could be falling
 out of memory before they get processed.
 BTW, what is the storage level that you are using for your input stream? If
 you are using MEMORY_ONLY, then try MEMORY_AND_DISK. That is safer because it
 ensures that if received data falls out of memory, it will at least be saved to
 disk.
 
 TD
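
For what it's worth, a minimal sketch of passing an explicit storage level to the Kafka input stream as TD suggests (assuming the simple createStream overload and an existing StreamingContext ssc; the ZooKeeper address, group and topic are placeholders, and I believe this overload already defaults to MEMORY_AND_DISK_SER_2):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Spill received blocks to disk instead of dropping them when memory is tight.
    val stream = KafkaUtils.createStream(
      ssc, "zk1:2181", "my-group", Map("my-topic" -> 1),
      StorageLevel.MEMORY_AND_DISK_SER)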
 

And I saw such errors because of spark.cleaner.ttl,
which erases everything, even needed RDDs.
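
To name the setting: if spark.cleaner.ttl is set shorter than the time your data and metadata have to survive, blocks and RDDs that are still needed get cleaned from under the job. A minimal sketch of where it lives (the value is a placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("StreamingJob")
      // TTL in seconds after which Spark forgets old metadata and data.
      // If this is shorter than your batch/window processing time, RDDs that
      // are still needed can be dropped.
      .set("spark.cleaner.ttl", "3600")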



 
 On Thu, Mar 27, 2014 at 2:29 PM, Scott Clasen scott.cla...@gmail.com wrote:
 Heh, sorry, that wasn't a clear question. I know 'how' to set it but don't know
 what value to use in a Mesos cluster, since the processes are running in LXC
 containers and won't be sharing a filesystem (or a machine, for that matter).
 
 I can't use an s3n:// URL for the local dir, can I?
 
 
 
 



Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Evgeny Shishkin

On 28 Mar 2014, at 02:10, Scott Clasen scott.cla...@gmail.com wrote:

 Thanks everyone for the discussion.
 
 Just to note, I restarted the job yet again, and this time there are indeed
 tasks being executed by both worker nodes. So the behavior does seem
 inconsistent/broken atm.
 
 Then I added a third node to the cluster, and a third executor came up, and
 everything broke :|
 
 

This is Kafka's high-level consumer at work. Try raising the rebalance retries.

Also, since this consumer is threaded, it has some protection against this
failure: first it waits for some time, and then it rebalances.
But for a Spark cluster I think this wait is not long enough.
If there were a way to wait for every Spark executor to start, rebalance, and only
then begin consuming, this issue would be much less visible.



 