Re: Kafka segmentation

2016-11-19 Thread Cody Koeninger
I mean I don't understand exactly what the issue is. Can you fill in these blanks?

My settings are:
My code is:
I expected to see:
Instead, I saw:

Re: Kafka segmentation

2016-11-17 Thread Hoang Bao Thien
I am sorry, I don't understand your idea. What do you mean exactly?

Re: Kafka segmentation

2016-11-17 Thread Cody Koeninger
Ok, I don't think I'm clear on the issue then. Can you say what the expected behavior is, and what the observed behavior is?

Re: Kafka segmentation

2016-11-17 Thread Hoang Bao Thien
Hi, Thanks for your comments. But in fact, I don't want to limit the size of batches; they can be as large as they come. Thien

Re: Kafka segmentation

2016-11-17 Thread Cody Koeninger
If you want a consistent limit on the size of batches, use spark.streaming.kafka.maxRatePerPartition (assuming you're using createDirectStream): http://spark.apache.org/docs/latest/configuration.html#spark-streaming
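For reference, a minimal sketch of that setting with the 0.8 direct stream API that was current at the time; the app name, broker address, topic, rate, and batch interval below are made-up values, not from this thread:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf()
      .setAppName("KafkaDirectStreamSketch") // hypothetical app name
      // Cap ingestion at 1000 records/sec/partition, so each batch holds at most
      // 1000 * (number of partitions) * (batch interval in seconds) records.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")

    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("csv-topic"))

With a 10-second batch interval and, say, 3 partitions, every batch would then be capped at 30,000 records.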

Re: Kafka segmentation

2016-11-17 Thread Hoang Bao Thien
Hi, I feed CSV and other text files into Kafka just to test Kafka + Spark Streaming with the direct stream. That's why I don't want Spark Streaming to read the CSVs or text files directly. In addition, I don't want a giant batch of records like the one in the link you sent. The problem is that we should receive the

Re: Kafka segmentation

2016-11-16 Thread bo yang
I don't remember what exact configuration I was using. That link has some good information, thanks Cody!

Re: Kafka segmentation

2016-11-16 Thread Cody Koeninger
Yeah, if you're reporting issues, please be clear as to whether backpressure is enabled and whether maxRatePerPartition is set. I suspect there is something wrong with backpressure; see e.g. https://issues.apache.org/jira/browse/SPARK-18371
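As a concrete illustration of the two knobs being asked about (the values here are illustrative, not taken from this thread):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("KafkaBackpressureSketch") // hypothetical app name
      // Turn on the rate estimator; it adjusts per-batch ingestion dynamically.
      .set("spark.streaming.backpressure.enabled", "true")
      // Hard upper bound that still applies with backpressure on; for the direct
      // stream it also caps the first batch, which backpressure alone does not.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")

Reporting both values makes behavior like SPARK-18371 much easier to reproduce.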

Re: Kafka segmentation

2016-11-16 Thread bo yang
I hit a similar issue with Spark Streaming: the batch size seemed a little random. Sometimes it was large, with many Kafka messages in the same batch; sometimes it was very small, with just a few messages. Is it possible that was caused by the backpressure implementation in Spark Streaming?

Re: Kafka segmentation

2016-11-16 Thread Cody Koeninger
Moved to user list. I'm not really clear on what you're trying to accomplish (why put the CSV file through Kafka instead of reading it directly with Spark?). auto.offset.reset=largest just means that when the job starts without any defined offsets, it will begin at the highest (most recent) available offset.
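A sketch of where that setting lives in the 0.8 direct stream's consumer config (the broker address is a placeholder):

    // Consumer config passed to KafkaUtils.createDirectStream (Kafka 0.8 API)
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "localhost:9092",
      // "largest": with no stored offsets, start from the most recent offset,
      // i.e. only messages produced after the job starts are read.
      // "smallest" would instead start from the earliest offset still retained.
      "auto.offset.reset" -> "largest"
    )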