Hi,

Thanks for your comments. But in fact, I don't want to limit the size of the
batches; they can be as large as they are now, or even larger.

Thien

On Fri, Nov 18, 2016 at 1:17 AM, Cody Koeninger <c...@koeninger.org> wrote:

> If you want a consistent limit on the size of batches, use
> spark.streaming.kafka.maxRatePerPartition  (assuming you're using
> createDirectStream)
>
> http://spark.apache.org/docs/latest/configuration.html#spark-streaming
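>
> For example, a minimal sketch against the 0.8 direct API (the broker address,
> topic name, and rate value are just placeholders):
>
> import kafka.serializer.StringDecoder
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
> import org.apache.spark.streaming.kafka.KafkaUtils
>
> val conf = new SparkConf()
>   .setAppName("kafka-direct-test")
>   // records per second per partition; with a 5s batch interval this caps
>   // each partition at roughly 5000 records per batch
>   .set("spark.streaming.kafka.maxRatePerPartition", "1000")
>
> val ssc = new StreamingContext(conf, Seconds(5))
> val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
> val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
>   ssc, kafkaParams, Set("csv-topic"))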
>
> On Thu, Nov 17, 2016 at 12:52 AM, Hoang Bao Thien <hbthien0...@gmail.com>
> wrote:
> > Hi,
> >
> > I feed CSV and other text files into Kafka just to test Kafka + Spark
> > Streaming with the direct stream approach. That's why I don't want Spark
> > Streaming to read the CSV or text files directly.
> > In addition, I don't want a giant batch of records like in the link you
> > sent. The problem is that we should receive a similar number of records in
> > every batch, instead of the first two or three batches having a very large
> > number of records (e.g., 100K) while the last 1000 batches have only 200
> > records each.
> >
> > I know that the problem is not caused by auto.offset.reset=largest, but I
> > don't know what I can do in this case.
> >
> > Could you or anyone else please suggest some solutions, as this seems to be
> > a common situation with Kafka + Spark Streaming?
> >
> > Thanks.
> > Alex
> >
> >
> >
> > On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger <c...@koeninger.org> wrote:
> >>
> >> Yeah, if you're reporting issues, please be clear as to whether
> >> backpressure is enabled, and whether maxRatePerPartition is set.
> >>
> >> I expect that there is something wrong with backpressure, see e.g.
> >> https://issues.apache.org/jira/browse/SPARK-18371
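> >>
> >> If it helps, a minimal sketch of making both settings explicit (the values
> >> here are only examples) and logging what the job actually runs with:
> >>
> >> import org.apache.spark.{SparkConf, SparkContext}
> >>
> >> val conf = new SparkConf()
> >>   .setAppName("rate-check")
> >>   .set("spark.streaming.backpressure.enabled", "true")
> >>   .set("spark.streaming.kafka.maxRatePerPartition", "1000")
> >> val sc = new SparkContext(conf)
> >> // print the effective values so they can be included in an issue report
> >> println("backpressure.enabled = " +
> >>   sc.getConf.getOption("spark.streaming.backpressure.enabled"))
> >> println("maxRatePerPartition = " +
> >>   sc.getConf.getOption("spark.streaming.kafka.maxRatePerPartition"))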
> >>
> >> On Wed, Nov 16, 2016 at 5:05 PM, bo yang <bobyan...@gmail.com> wrote:
> >> > I hit a similar issue with Spark Streaming. The batch size seemed a
> >> > little random. Sometimes it was large, with many Kafka messages inside
> >> > the same batch; sometimes it was very small, with just a few messages.
> >> > Is it possible that this was caused by the backpressure implementation
> >> > in Spark Streaming?
> >> >
> >> > On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <c...@koeninger.org>
> >> > wrote:
> >> >>
> >> >> Moved to user list.
> >> >>
> >> >> I'm not really clear on what you're trying to accomplish (why put the
> >> >> csv file through Kafka instead of reading it directly with spark?)
> >> >>
> >> >> auto.offset.reset=largest just means that when starting the job
> >> >> without any defined offsets, it will start at the highest (most
> >> >> recent) available offsets.  That's probably not what you want if
> >> >> you've already loaded csv lines into kafka.
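> >> >>
> >> >> For instance, a rough sketch with the 0.8 direct API (broker and topic
> >> >> names are made up) that starts from the earliest available offsets
> >> >> instead:
> >> >>
> >> >> import kafka.serializer.StringDecoder
> >> >> import org.apache.spark.SparkConf
> >> >> import org.apache.spark.streaming.{Seconds, StreamingContext}
> >> >> import org.apache.spark.streaming.kafka.KafkaUtils
> >> >>
> >> >> val ssc = new StreamingContext(new SparkConf().setAppName("csv-test"), Seconds(5))
> >> >> val kafkaParams = Map(
> >> >>   "metadata.broker.list" -> "localhost:9092",
> >> >>   // "smallest" = read from the earliest available offsets when the job
> >> >>   // has no stored offsets; "largest" would skip the csv lines already
> >> >>   // loaded into the topic
> >> >>   "auto.offset.reset" -> "smallest")
> >> >> val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
> >> >>   ssc, kafkaParams, Set("csv-topic"))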
> >> >>
> >> >> On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien
> >> >> <hbthien0...@gmail.com>
> >> >> wrote:
> >> >> > Hi all,
> >> >> >
> >> >> > I would like to ask a question related to the size of the Kafka stream.
> >> >> > I want to put data (e.g., a *.csv file) into Kafka, then use Spark
> >> >> > Streaming to get the output from Kafka and save it to Hive using
> >> >> > SparkSQL. The CSV file is about 100MB with ~250K messages/rows (each row
> >> >> > has about 10 integer fields). I see that Spark Streaming first receives
> >> >> > two partitions/batches: the first has 60K messages and the second has
> >> >> > 50K. But from the third batch on, Spark receives only 200 messages per
> >> >> > batch (or partition).
> >> >> > I think this problem comes from Kafka or some configuration in Spark. I
> >> >> > already tried the setting "auto.offset.reset=largest", but every batch
> >> >> > still gets only 200 messages.
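> >> >> >
> >> >> > In case it helps, this is roughly what the job looks like (a simplified
> >> >> > sketch; the topic, table, and column names are placeholders, and the
> >> >> > real CSV rows have ~10 fields):
> >> >> >
> >> >> > import kafka.serializer.StringDecoder
> >> >> > import org.apache.spark.SparkConf
> >> >> > import org.apache.spark.sql.SparkSession
> >> >> > import org.apache.spark.streaming.{Seconds, StreamingContext}
> >> >> > import org.apache.spark.streaming.kafka.KafkaUtils
> >> >> >
> >> >> > val ssc = new StreamingContext(new SparkConf().setAppName("csv-to-hive"), Seconds(5))
> >> >> > val kafkaParams = Map("metadata.broker.list" -> "localhost:9092",
> >> >> >                       "auto.offset.reset" -> "largest")
> >> >> > val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
> >> >> >   ssc, kafkaParams, Set("csv-topic"))
> >> >> >
> >> >> > stream.foreachRDD { rdd =>
> >> >> >   val spark = SparkSession.builder
> >> >> >     .config(rdd.sparkContext.getConf).enableHiveSupport().getOrCreate()
> >> >> >   import spark.implicits._
> >> >> >   // each Kafka message value is one csv line; only two fields kept here
> >> >> >   val df = rdd.map(_._2.split(","))
> >> >> >     .map(f => (f(0).toInt, f(1).toInt))
> >> >> >     .toDF("field1", "field2")
> >> >> >   df.write.mode("append").saveAsTable("my_table")
> >> >> > }
> >> >> >
> >> >> > ssc.start()
> >> >> > ssc.awaitTermination()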
> >> >> >
> >> >> > Could you please tell me how to fix this problem?
> >> >> > Thank you so much.
> >> >> >
> >> >> > Best regards,
> >> >> > Alex
> >> >> >
> >> >>
> >> >
> >
> >
>
