Re: handling of empty partitions

2017-01-11 Thread Georg Heiler
I see that there is room to improve the algorithm and make it more fault tolerant, as outlined by both of you. Could you explain a little bit more why

+----------+------+
|       foo|   bar|
+----------+------+
|2016-01-01| first|

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Kalvin Chau
I've filed an issue here https://issues.apache.org/jira/browse/SPARK-19185, let me know if I missed anything! --Kalvin On Wed, Jan 11, 2017 at 5:43 PM Shixiong(Ryan) Zhu wrote: > Thanks for reporting this. Finally I understood the root cause. Could you > file a JIRA on

Re: handling of empty partitions

2017-01-11 Thread Liang-Chi Hsieh
Hi Georg, It is not strange. As I said before, it depends on how the data is partitioned. When you try to get the available value from the next partition like this:

var lastNotNullRow: Option[FooBar] = toCarryBd.value.get(i).get
if (lastNotNullRow == None) {
  lastNotNullRow =
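
For context, a minimal sketch of the look-back logic being discussed, assuming a case class FooBar and a broadcast map toCarryBd from partition index to the last non-null row seen in that partition; fillForward and the exact types are illustrative assumptions, not the exact code from the thread:

    import org.apache.spark.broadcast.Broadcast

    case class FooBar(foo: Option[String], bar: Option[String])

    // Walk backwards from the previous partition until a carried value is
    // found; empty partitions contribute None and are simply skipped, which
    // is the fault-tolerance improvement being discussed here.
    def fillForward(index: Int,
                    toCarryBd: Broadcast[Map[Int, Option[FooBar]]]): Option[FooBar] = {
      var i = index - 1
      var lastNotNullRow: Option[FooBar] = None
      while (i >= 0 && lastNotNullRow.isEmpty) {
        lastNotNullRow = toCarryBd.value.getOrElse(i, None)
        i -= 1
      }
      lastNotNullRow
    }

Reading a single fixed partition, as in the snippet above, returns None whenever that partition happens to be empty; the loop keeps walking back instead.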

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Kalvin Chau
Here is the minimal code example where I was able to replicate it. The batch interval is set to 2 to get the exceptions to happen more often.

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> brokers,
  "key.deserializer" -> classOf[KafkaAvroDeserializer],
  "value.deserializer" ->
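
To make the shape of the repro clearer, here is a hedged sketch of such a setup, assuming an existing SparkContext sc; the topic name, window durations, group.id, and the use of plain StringDeserializers are placeholder assumptions, not the exact code from the thread:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "windowing-repro")

    val ssc = new StreamingContext(sc, Seconds(2)) // short batches, per the repro
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // The window is the key ingredient: in the reported scenario it is what
    // leads to two tasks on one executor reading the same TopicPartition.
    val windowed = stream.map(_.value).window(Seconds(30), Seconds(2))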

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Kalvin Chau
I have not modified that configuration setting, and it doesn't seem to be documented anywhere. Does the Kafka 0.10 connector require the number of cores on an executor to be set to 1? I didn't see that documented anywhere either. On Wed, Jan 11, 2017 at 3:27 PM Shixiong(Ryan) Zhu

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Kalvin Chau
"spark.speculation" is not set, so it would be whatever the default is. On Wed, Jan 11, 2017 at 3:43 PM Shixiong(Ryan) Zhu wrote: > Or do you enable "spark.speculation"? If not, Spark Streaming should not > launch two tasks using the same TopicPartition. > > On Wed,

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Shixiong(Ryan) Zhu
Could you post your code, please? On Wed, Jan 11, 2017 at 3:53 PM, Kalvin Chau wrote: > "spark.speculation" is not set, so it would be whatever the default is. > > On Wed, Jan 11, 2017 at 3:43 PM Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: >> Or do you

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Shixiong(Ryan) Zhu
Or do you enable "spark.speculation"? If not, Spark Streaming should not launch two tasks using the same TopicPartition. On Wed, Jan 11, 2017 at 3:33 PM, Kalvin Chau wrote: > I have not modified that configuration setting, and that doesn't seem to > be documented

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Kalvin Chau
I'm not re-using any InputDStreams actually; this is one InputDStream that has a window applied to it. Then, when Spark creates and assigns tasks to read from the topic, one executor gets assigned two tasks that read from the same TopicPartition, and uses the same CachedKafkaConsumer to read from

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Shixiong(Ryan) Zhu
Did you change "spark.streaming.concurrentJobs" to more than 1? The Kafka 0.10 connector requires that it be 1. On Wed, Jan 11, 2017 at 3:25 PM, Kalvin Chau wrote: > I'm not re-using any InputDStreams actually; this is one InputDStream that > has a window applied to it. >
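
For reference, a minimal sketch of where that setting lives; the default is already 1, so setting it explicitly here is only for illustration:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("windowed-kafka-repro")
      // Defaults to 1; the Kafka 0.10 direct stream expects it to stay at 1.
      .set("spark.streaming.concurrentJobs", "1")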

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Shixiong(Ryan) Zhu
I think you may be reusing the Kafka DStream (the DStream returned by createDirectStream). If you need to read from the same Kafka source, you need to create another DStream. On Wed, Jan 11, 2017 at 2:38 PM, Kalvin Chau wrote: > Hi, > > We've been running into
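
A minimal sketch of that suggestion, reusing the ssc and kafkaParams assumed in the sketch further up; giving the second stream its own group.id is an illustrative assumption:

    // Instead of consuming one direct stream twice, create two independent
    // streams over the same topic.
    val streamA = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))
    val streamB = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams + ("group.id" -> "second-consumer-group")))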

[Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Kalvin Chau
Hi, We've been running into ConcurrentModificationExceptions ("KafkaConsumer is not safe for multi-threaded access") with the CachedKafkaConsumer. I've been working through debugging this issue, and after looking through some of the Spark source code I think this is a bug. Our setup is: Spark

Re: [PYSPARK] Python tests organization

2017-01-11 Thread Saikat Kanjilal
Maciej/Reynold, If it's ok with you guys I can start working on a proposal and create a JIRA; let me know next steps. Thanks in advance. From: Maciej Szymkiewicz Sent: Wednesday, January 11, 2017 10:14 AM To: Saikat Kanjilal Subject:

Re: [PYSPARK] Python tests organization

2017-01-11 Thread Reynold Xin
Yes absolutely. On Wed, Jan 11, 2017 at 9:54 AM Saikat Kanjilal wrote: > Is it worth coming up with a proposal for this and floating it to dev? > *From:* Reynold Xin

Re: [PYSPARK] Python tests organization

2017-01-11 Thread Saikat Kanjilal
Is it worth coming up with a proposal for this and floating it to dev? From: Reynold Xin Sent: Wednesday, January 11, 2017 9:47 AM To: Maciej Szymkiewicz; Saikat Kanjilal; dev@spark.apache.org Subject: Re: [PYSPARK] Python tests organization It

Re: [PYSPARK] Python tests organization

2017-01-11 Thread Reynold Xin
It would be good to break them down a bit more, provided that we don't, for example, increase total runtime due to extra setup. On Wed, Jan 11, 2017 at 9:45 AM Saikat Kanjilal wrote: > Hello Maciej, > If there's a JIRA available for this I'd

Re: [PYSPARK] Python tests organization

2017-01-11 Thread Saikat Kanjilal
Hello Maciej, If there's a JIRA available for this I'd like to help get this moving; let me know next steps. Thanks in advance. From: Maciej Szymkiewicz Sent: Wednesday, January 11, 2017 4:18 AM To: dev@spark.apache.org Subject:

[PYSPARK] Python tests organization

2017-01-11 Thread Maciej Szymkiewicz
Hi, I can't help but wonder if there is any practical reason for keeping monolithic test modules. These things are already pretty large (1500-2200 LOC) and can only grow. Development aside, I assume that many users use the tests the same way I do, to check the intended behavior, and largish

Re: Spark Improvement Proposals

2017-01-11 Thread Reynold Xin
+1 on all counts (consensus, time bound, define roles). I can update the doc in the next few days and share it back. Then maybe we can just officially vote on this. As Tim suggested, we might not get it 100% right the first time and would need to iterate. But that's fine. On Thu, Jan 5, 2017 at