I hit a similar issue with Spark Streaming: the batch size seemed somewhat random. Sometimes a batch was large, containing many Kafka messages, and sometimes it was very small, with just a few. Could this have been caused by the backpressure implementation in Spark Streaming?
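If backpressure is the suspect, this is roughly the configuration that governs it. A minimal sketch, assuming a Spark 1.x/2.0-era job; the app name and the rate cap are illustrative placeholders, not tuned values:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-csv-ingest")  // placeholder app name
      // Let Spark adapt the ingestion rate to observed batch processing times.
      .set("spark.streaming.backpressure.enabled", "true")
      // Bound the first batch as well; on a direct Kafka stream, backpressure
      // has no rate estimate yet, so the initial pull can grab the whole backlog.
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")

For what it's worth, a steady trickle of tiny batches after a couple of huge ones is consistent with the PID rate estimator throttling down toward its floor (spark.streaming.backpressure.pid.minRate, default 100 records/sec) once the initial backlog is drained, though that is only a guess without seeing the job's settings.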
On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <c...@koeninger.org> wrote:
> Moved to user list.
>
> I'm not really clear on what you're trying to accomplish (why put the
> csv file through Kafka instead of reading it directly with Spark?)
>
> auto.offset.reset=largest just means that when starting the job
> without any defined offsets, it will start at the highest (most
> recent) available offsets. That's probably not what you want if
> you've already loaded csv lines into Kafka.
>
> On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien <hbthien0...@gmail.com> wrote:
> > Hi all,
> >
> > I would like to ask a question about the size of a Kafka stream. I want
> > to put data (e.g., a *.csv file) into Kafka, use Spark Streaming to read
> > it back from Kafka, and then save it to Hive with SparkSQL. The csv file
> > is about 100 MB with ~250K messages/rows (each row has about 10 integer
> > fields). Spark Streaming first received two partitions/batches, the first
> > with ~60K messages and the second with ~50K. But from the third batch on,
> > Spark received only 200 messages per batch (or partition). I think the
> > problem comes from Kafka or some Spark configuration. I already tried the
> > setting "auto.offset.reset=largest", but every batch still gets only 200
> > messages.
> >
> > Could you please tell me how to fix this problem?
> > Thank you so much.
> >
> > Best regards,
> > Alex
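Picking up Cody's point about offsets: auto.offset.reset=smallest starts the job from the earliest available offsets when none are stored, which is what you want in order to replay csv lines already loaded into the topic. A minimal sketch, assuming the Kafka 0.8 direct-stream API (spark-streaming-kafka); the broker address, topic name, and batch interval are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("csv-replay")   // placeholder app name
    val ssc  = new StreamingContext(conf, Seconds(10))    // placeholder interval

    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "localhost:9092",         // placeholder broker
      // Start from the earliest offsets when none are stored,
      // the opposite of the "largest" behavior described above.
      "auto.offset.reset" -> "smallest")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("csv-topic"))                 // placeholder topic

    // Records arrive as (key, value) pairs; the values are the csv lines.
    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()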