If you want a consistent limit on the size of batches, use
spark.streaming.kafka.maxRatePerPartition (assuming you're using
createDirectStream):
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
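For example, something along these lines (a minimal sketch against the
Kafka 0.8 direct stream API; broker, topic, batch interval, and the rate
value are all illustrative, tune them to your job):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("csv-from-kafka")
  // cap every Kafka partition at 10,000 records/sec: with a 2s batch
  // interval and e.g. 4 partitions, no batch can exceed 80,000 records
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")

val ssc = new StreamingContext(conf, Seconds(2))

val kafkaParams = Map(
  "metadata.broker.list" -> "localhost:9092",
  // start from the beginning of the topic (see the discussion below)
  "auto.offset.reset" -> "smallest")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder](ssc, kafkaParams, Set("csv-topic"))

stream.map(_._2).foreachRDD { rdd =>
  // each capped batch arrives here
  println(s"records in this batch: ${rdd.count()}")
}

ssc.start()
ssc.awaitTermination()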
On Thu, Nov 17, 2016 at 12:52 AM, Hoang Bao Thien <hbthien0...@gmail.com> wrote:
> Hi,
>
> I push CSV and other text files into Kafka just to test Kafka + Spark
> Streaming with a direct stream; that's why I don't want Spark Streaming
> to read the CSV or text files directly.
> In addition, I don't want one giant batch of records like in the link
> you sent. The problem is that all batches should receive a similar
> number of records; instead, the first two or three batches are huge
> (e.g., 100K records) while the last 1000 batches get only ~200 records
> each.
>
> I know the problem doesn't come from auto.offset.reset=largest, but I
> don't know what else to do in this case.
>
> Could you or anyone else suggest some solutions? This seems like a
> common situation with Kafka + Spark Streaming.
>
> Thanks.
> Alex
>
>
> On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Yeah, if you're reporting issues, please be clear as to whether
>> backpressure is enabled, and whether maxRatePerPartition is set.
>>
>> I expect that there is something wrong with backpressure, see e.g.
>> https://issues.apache.org/jira/browse/SPARK-18371
>>
>> On Wed, Nov 16, 2016 at 5:05 PM, bo yang <bobyan...@gmail.com> wrote:
>> > I hit a similar issue with Spark Streaming. The batch size seemed a
>> > little random: sometimes it was large, with many Kafka messages in
>> > the same batch, and sometimes it was very small, with just a few
>> > messages. Is it possible this was caused by the backpressure
>> > implementation in Spark Streaming?
>> >
>> > On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <c...@koeninger.org>
>> > wrote:
>> >>
>> >> Moved to user list.
>> >>
>> >> I'm not really clear on what you're trying to accomplish (why put
>> >> the csv file through Kafka instead of reading it directly with
>> >> Spark?)
>> >>
>> >> auto.offset.reset=largest just means that when starting the job
>> >> without any defined offsets, it will start at the highest (most
>> >> recent) available offsets. That's probably not what you want if
>> >> you've already loaded csv lines into Kafka.
>> >>
>> >> On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien
>> >> <hbthien0...@gmail.com> wrote:
>> >> > Hi all,
>> >> >
>> >> > I would like to ask a question about batch sizes when streaming
>> >> > from Kafka. I want to put data (e.g., a *.csv file) into Kafka,
>> >> > then use Spark Streaming to read it back from Kafka and save it
>> >> > to Hive with SparkSQL. The csv file is about 100MB, with ~250K
>> >> > messages/rows (each row has about 10 integer fields). I see that
>> >> > Spark Streaming first receives two large batches: the first with
>> >> > 60K messages and the second with 50K. But from the third batch
>> >> > on, Spark receives only 200 messages per batch (or partition).
>> >> > I think this problem comes from Kafka or from some configuration
>> >> > in Spark. I already tried the setting
>> >> > "auto.offset.reset=largest", but every batch still gets only 200
>> >> > messages.
>> >> >
>> >> > Could you please tell me how to fix this?
>> >> > Thank you so much.
>> >> >
>> >> > Best regards,
>> >> > Alex
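To make the backpressure discussion above concrete, here is a sketch of
combining it with a hard cap for a direct stream. Both property names
are the documented Spark settings from the configuration page linked at
the top; the app name and rate value are illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-to-hive")
  // let Spark adapt the ingestion rate to observed processing speed
  // (the mechanism suspected of misbehaving in SPARK-18371)
  .set("spark.streaming.backpressure.enabled", "true")
  // hard per-partition ceiling; this also bounds the very first
  // batches, before the backpressure estimator has any feedback
  .set("spark.streaming.kafka.maxRatePerPartition", "2000")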