Re: spark streaming and the spark shell
2. I notice that once I call ssc.start(), my stream starts processing and continues indefinitely, even if I close the socket on the server end (I'm using the unix command nc to mimic a server, as explained in the streaming programming guide). Can I tell my stream to detect that it has lost its connection and therefore stop executing? (Or even better, to attempt to re-establish the connection?)

Currently, not yet. But I am aware of this, and this behavior will be improved in the future.

Now I understand why our Spark Streaming job starts generating zero-sized RDDs from the Kafka input when one worker hits OOM or crashes. And we can't detect it! Great. So Spark Streaming just isn't suited yet for 24/7 operation =\
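As an aside, the lost connection is visible at the JVM level even if the receiver ignores it: a read on a TCP socket whose peer has closed returns -1 (EOF), which is exactly the signal a receiver implementation could use to stop or reconnect. A minimal plain-Scala sketch of that signal (no Spark involved; object and method names are illustrative):

```scala
import java.net.{ServerSocket, Socket}

object EofDemo {
  // Returns what read() yields on a socket whose peer has closed:
  // -1 (EOF), the signal a receiver could use to detect a lost link.
  def readAfterPeerClose(): Int = {
    val server = new ServerSocket(0)           // ephemeral port
    val client = new Socket("localhost", server.getLocalPort)
    val accepted = server.accept()
    accepted.close()                           // peer goes away, like killing nc
    val result = client.getInputStream.read()  // blocks until FIN, then -1
    client.close(); server.close()
    result
  }

  def main(args: Array[String]): Unit =
    println(s"read() after peer close returned: ${readAfterPeerClose()}")
}
```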
Re: KafkaInputDStream mapping of partitions to tasks
On 28 Mar 2014, at 00:34, Scott Clasen scott.cla...@gmail.com wrote:

Actually, looking closer, it is stranger than I thought: in the Spark UI, one executor has executed 4 tasks, and one has executed 1928. Can anyone explain the workings of a KafkaInputDStream wrt Kafka partitions and their mapping to Spark executors and tasks?

Well, there are some issues with the Kafka input at the moment. When you call KafkaUtils.createStream, it creates a Kafka high-level consumer on one node! I don't really know how many RDDs it will generate during a batch window, but once those RDDs are created, Spark schedules the consecutive transformations on that same node, because of locality. You can try to repartition() those RDDs; sometimes it helps.

To try to consume from Kafka on multiple machines you can do (1 to N).map(KafkaUtils.createStream). But then an issue arises with the Kafka high-level consumer! Those consumers operate in one consumer group, and they try to decide among themselves which consumer consumes which partitions. That sync/partition rebalance may simply fail, and then you have only a few consumers actually consuming. To mitigate this problem, you can set the rebalance retries very high, and pray it helps.

Then there is yet another "feature": if your receiver dies (OOM, hardware failure), you just stop receiving from Kafka! Brilliant. And another one: if you ask Spark's Kafka input to begin with auto.offset.reset = smallest, it will reset your offsets every time you run the application! This does not comply with the documentation (reset to the earliest offsets only if no offsets are found in ZooKeeper); it just erases your offsets and starts again from zero. And remember that you have to restart your streaming app whenever there is any failure on a receiver!

So, bottom line: the Kafka input stream just does not work.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KafkaInputDStream-mapping-of-partitions-to-tasks-tp3360p3374.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
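The multiple-receiver workaround described above can be sketched as follows, assuming the Spark 0.9-era KafkaUtils API. Master URL, ZooKeeper address, topic, group, and partition counts are all illustrative, and this needs a running Spark cluster plus ZooKeeper, so it is a sketch rather than a tested program:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext("spark://master:7077", "kafka-demo", Seconds(10))

// One receiver per createStream call; N calls spread the high-level
// consumers (all in the same consumer group) across N executors.
val numReceivers = 3
val streams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk:2181", "my-group", Map("my-topic" -> 1))
}

// Union the per-receiver streams, then repartition to break the data
// locality that pins downstream tasks to the receiving nodes.
val unified = ssc.union(streams).repartition(6)

unified.count().print()
ssc.start()
```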
Re: Spark Streaming + Kafka + Mesos/Marathon strangeness
On 28 Mar 2014, at 01:44, Tathagata Das tathagata.das1...@gmail.com wrote:

The more I think about it, the problem is not about /tmp; it's more about the workers not having enough memory. Blocks of received data could be falling out of memory before they get processed. BTW, what is the storage level that you are using for your input stream? If you are using MEMORY_ONLY, then try MEMORY_AND_DISK. That is safer, because it ensures that if received data falls out of memory it will at least be saved to disk.

TD

And I saw such errors because of spark.cleaner.ttl, which erases everything, even needed RDDs.

On Thu, Mar 27, 2014 at 2:29 PM, Scott Clasen scott.cla...@gmail.com wrote:

Heh, sorry, that wasn't a clear question. I know how to set it, but I don't know what value to use in a Mesos cluster: since the processes are running in lxc containers, they won't be sharing a filesystem (or a machine, for that matter). I can't use an s3n:// url for the local dir, can I?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Kafka-Mesos-Marathon-strangeness-tp3285p3373.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
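TD's suggestion translates to passing an explicit storage level when creating the input stream; a sketch assuming the 0.9-era createStream overload, with `ssc`, the ZooKeeper address, group, and topic as stand-ins:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// MEMORY_AND_DISK spills received blocks to disk when executor memory
// is tight instead of dropping them; MEMORY_AND_DISK_SER_2 additionally
// serializes blocks and replicates them to a second node.
val stream = KafkaUtils.createStream(
  ssc, "zk:2181", "my-group", Map("my-topic" -> 1),
  StorageLevel.MEMORY_AND_DISK)
```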
Re: KafkaInputDStream mapping of partitions to tasks
On 28 Mar 2014, at 02:10, Scott Clasen scott.cla...@gmail.com wrote:

Thanks everyone for the discussion. Just to note, I restarted the job yet again, and this time there are indeed tasks being executed by both worker nodes. So the behavior does seem inconsistent/broken atm. Then I added a third node to the cluster, a third executor came up, and everything broke :|

This is Kafka's high-level consumer. Try to raise the rebalance retries. Also, as this consumer is threaded, it has some protection against this failure: first it waits some time, and then rebalances. But for a Spark cluster I think this wait is not long enough. If there were a way to wait for every Spark executor to start and rebalance, and only then start to consume, this issue would be less visible.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KafkaInputDStream-mapping-of-partitions-to-tasks-tp3360p3391.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
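For reference, the rebalance behavior of the Kafka 0.8 high-level consumer is tuned through consumer properties, which the Spark Kafka input accepts via the kafkaParams overload of createStream. A hedged config sketch; the retry/backoff values are illustrative, not recommendations, and `ssc` plus the connection strings are stand-ins:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// Kafka 0.8 high-level consumer properties: raise the retry count and
// backoff so slowly-starting executors still settle into a rebalance.
val kafkaParams = Map(
  "zookeeper.connect"     -> "zk:2181",
  "group.id"              -> "my-group",
  "rebalance.max.retries" -> "20",   // default is 4
  "rebalance.backoff.ms"  -> "5000"  // default is 2000
)

val stream = KafkaUtils.createStream[
    String, String,
    kafka.serializer.StringDecoder, kafka.serializer.StringDecoder](
  ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK)
```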