I am sorry, I don't understand your idea. What do you mean exactly?

On Fri, Nov 18, 2016 at 1:50 AM, Cody Koeninger <c...@koeninger.org> wrote:
> Ok, I don't think I'm clear on the issue then. Can you say what the
> expected behavior is, and what the observed behavior is?
>
> On Thu, Nov 17, 2016 at 10:48 AM, Hoang Bao Thien <hbthien0...@gmail.com> wrote:
> > Hi,
> >
> > Thanks for your comments. But in fact, I don't want to limit the size of
> > the batches; they can be as large as they need to be.
> >
> > Thien
> >
> > On Fri, Nov 18, 2016 at 1:17 AM, Cody Koeninger <c...@koeninger.org> wrote:
> >> If you want a consistent limit on the size of batches, use
> >> spark.streaming.kafka.maxRatePerPartition (assuming you're using
> >> createDirectStream):
> >>
> >> http://spark.apache.org/docs/latest/configuration.html#spark-streaming
> >>
> >> On Thu, Nov 17, 2016 at 12:52 AM, Hoang Bao Thien <hbthien0...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I push CSV and other text files into Kafka just to test Kafka + Spark
> >> > Streaming with the direct stream; that's why I don't want Spark
> >> > Streaming to read the CSV or text files directly.
> >> > In addition, I don't want one giant batch of records like in the link
> >> > you sent. The problem is that all batches should receive a similar
> >> > number of records, but instead the first two or three batches are very
> >> > large (e.g., 100K records) while the last 1000 batches get only 200
> >> > records each.
> >> >
> >> > I know the problem is not caused by auto.offset.reset=largest, but I
> >> > don't know what I can do in this case.
> >> >
> >> > Could you or anyone else suggest some solutions, please? This seems
> >> > like a common situation with Kafka + Spark Streaming.
> >> >
> >> > Thanks.
> >> > Alex
> >> >
> >> > On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger <c...@koeninger.org> wrote:
> >> >> Yeah, if you're reporting issues, please be clear as to whether
> >> >> backpressure is enabled, and whether maxRatePerPartition is set.
> >> >>
> >> >> I expect that there is something wrong with backpressure; see e.g.
> >> >> https://issues.apache.org/jira/browse/SPARK-18371
> >> >>
> >> >> On Wed, Nov 16, 2016 at 5:05 PM, bo yang <bobyan...@gmail.com> wrote:
> >> >> > I hit a similar issue with Spark Streaming. The batch size seemed a
> >> >> > little random: sometimes it was large, with many Kafka messages in
> >> >> > the same batch, and sometimes it was very small, with just a few
> >> >> > messages. Is it possible that this was caused by the backpressure
> >> >> > implementation in Spark Streaming?
> >> >> >
> >> >> > On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <c...@koeninger.org> wrote:
> >> >> >> Moved to user list.
> >> >> >>
> >> >> >> I'm not really clear on what you're trying to accomplish (why put
> >> >> >> the csv file through Kafka instead of reading it directly with
> >> >> >> Spark?)
> >> >> >>
> >> >> >> auto.offset.reset=largest just means that when starting the job
> >> >> >> without any defined offsets, it will start at the highest (most
> >> >> >> recent) available offsets. That's probably not what you want if
> >> >> >> you've already loaded csv lines into kafka.
> >> >> >>
> >> >> >> On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien <hbthien0...@gmail.com> wrote:
> >> >> >> > Hi all,
> >> >> >> >
> >> >> >> > I would like to ask a question about the size of a Kafka stream.
> >> >> >> > I want to put data (e.g., a *.csv file) into Kafka, then use
> >> >> >> > Spark Streaming to read it back from Kafka and save it to Hive
> >> >> >> > with SparkSQL. The CSV file is about 100 MB, with ~250K
> >> >> >> > messages/rows (each row has about 10 integer fields).
> >> >> >> > I see that Spark Streaming first received two
> >> >> >> > partitions/batches: the first with 60K messages and the second
> >> >> >> > with 50K. But from the third batch on, Spark received only 200
> >> >> >> > messages per batch (or partition).
> >> >> >> > I think this problem comes from Kafka or from some configuration
> >> >> >> > in Spark. I already tried setting "auto.offset.reset=largest",
> >> >> >> > but every batch still gets only 200 messages.
> >> >> >> >
> >> >> >> > Could you please tell me how to fix this problem?
> >> >> >> > Thank you so much.
> >> >> >> >
> >> >> >> > Best regards,
> >> >> >> > Alex
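[Editor's note] The settings discussed in this thread (spark.streaming.kafka.maxRatePerPartition, spark.streaming.backpressure.enabled, and auto.offset.reset) all apply to the direct stream. Below is a minimal sketch of how they fit together, assuming the Spark 1.x-era spark-streaming-kafka (0.8) API; the broker address, topic name, and rate values are placeholders, not taken from the thread:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object RateLimitedDirectStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("csv-from-kafka")
      // Hard cap: each batch holds at most
      // maxRatePerPartition * batchSeconds records per Kafka partition.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
      // Adaptive rate control based on recent batch processing times
      // (the behavior questioned above; see SPARK-18371).
      .set("spark.streaming.backpressure.enabled", "true")

    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map(
      "metadata.broker.list" -> "localhost:9092", // placeholder broker
      // "smallest" replays from the beginning of the topic when no offsets
      // are stored; "largest" (the default) starts at the newest offsets,
      // skipping csv lines already loaded into the topic.
      "auto.offset.reset" -> "smallest"
    )

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("csv-topic")) // placeholder topic

    stream.map(_._2).foreachRDD { rdd =>
      // parse CSV lines and write to Hive via SparkSQL here
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With a 5-second batch interval and 1000 records/second/partition, a single-partition topic would yield at most 5000 records per batch; without the cap, the first batch after a large preload can contain the whole backlog.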