I'm not sure the lessons here are completely generalizable.

In this case there is a production system that works pretty well and
has many dependencies (i.e., many producers and consumers, some
critical). I just needed to add a non-critical weekly (or possibly
just one-time, if the POC does not go well) dump to HDFS. In my
opinion, speeding up the HDFS dump didn't justify taking any risk with
the production system. There is some risk in adding partitions: not
large perhaps, but some due diligence regarding producers, consumers
and disk space is required. On the other hand, adding an
intra-partition split to a MapReduce system is something I have done
successfully before, so it seemed not too difficult, not too risky and
a fun challenge too.
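
For anyone curious, the split itself is simple. Here is a rough sketch
of the idea, not the actual Camus change: take a partition's start and
end offsets and divide them into contiguous ranges, one per mapper.
The function name and parameters are made up for illustration, and I'm
assuming a half-open range where the end offset is exclusive.

```python
def split_offset_range(start, end, num_splits):
    """Divide the half-open offset range [start, end) into contiguous chunks.

    Hypothetical helper for illustration only; it is not part of Camus
    or of any Kafka client API.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be >= 1")
    total = end - start
    if total <= 0:
        return []  # nothing to read
    # Never create more splits than there are messages.
    num_splits = min(num_splits, total)
    base, extra = divmod(total, num_splits)
    ranges = []
    lo = start
    for i in range(num_splits):
        # The first `extra` splits take one additional offset each,
        # so sizes differ by at most one.
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


if __name__ == "__main__":
    for lo, hi in split_offset_range(0, 10, 3):
        print(lo, hi)
```

Each mapper would then read only its own range, so ordering holds
within a range but not across mappers. That is fine for a batch dump,
and it leaves the topic's partition count alone.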

Obviously, if there were a critical real-time stream-processing system
that required speeding up, figuring out a good way to add partitions
without screwing up anything else in the system would have been worth
it.

It all depends :)

Gwen

On Thu, Jul 23, 2015 at 9:10 PM, Ewen Cheslack-Postava
<e...@confluent.io> wrote:
> Gwen,
>
> I'm curious about this use case. Given the Kafka -> HDFS flow, it obviously
> relates to Copycat. More generally, this could be a problem even when
> streaming data if your processing takes too long such that your consumer
> simply can't keep up with the rate at which messages are produced.
>
> The "easy" solution would have been to use more partitions since the
> problem in both the batch and streaming cases is that you need more
> processing throughput. In the case that required modifying Camus, was this
> not an option simply because making that modification was too painful
> (i.e., if there had been more partitions to start with, it might not have
> been needed at all) or because there were other constraints on partitioning?
>
> -Ewen
>
> On Thu, Jul 23, 2015 at 2:45 PM, Gwen Shapira <gshap...@cloudera.com> wrote:
>
>> Agree.
>>
>> On Thu, Jul 23, 2015 at 2:43 PM, Jiangjie Qin <j...@linkedin.com.invalid>
>> wrote:
>> > Ah, I see. Thanks for the use case, Gwen. I guess in that case it seems
>> > the time to use low level consumer.
>> >
>> > Jiangjie (Becket) Qin
>> >
>> > On Thu, Jul 23, 2015 at 9:52 AM, Gwen Shapira <gshap...@cloudera.com> wrote:
>> >
>> >> As crazy as it sounds, there is an actual use-case there.
>> >>
>> >> Writing to HDFS can be very slow, so if you do a batch dump from a
>> >> topic to HDFS, you may want more consumers reading from the topic than
>> >> for "normal" streaming use-cases. We ended up modifying Camus to split
>> >> a partition between multiple mappers (take start and end offsets and
>> >> divide into ranges) to solve this problem. Not exactly a round-robin
>> >> but same idea.
>> >>
>> >> I think that's what J A was referring to in "decoupling consumers" -
>> >> different consumers have slightly different requirements.
>> >>
>> >> Gwen
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Thu, Jul 23, 2015 at 9:44 AM, Jiangjie Qin <j...@linkedin.com.invalid>
>> >> wrote:
>> >> > J A,
>> >> >
>> >> > It looks to me that in your case you actually want to scale the topic,
>> >> > right? Otherwise wouldn't a single consumer be enough?
>> >> >
>> >> > Jiangjie (Becket) Qin
>> >> >
>> >> > On Wed, Jul 22, 2015 at 7:39 PM, J A <mbatth...@gmail.com> wrote:
>> >> >
>> >> >> Why have partition at all, if I don't need to scale topic. Coupling topic
>> >> >> scalability with consumer scalability just goes against messaging systems
>> >> >> core principle of decoupling consumer and producers
>> >> >>
>> >> >> On Wednesday, July 22, 2015, Aditya Auradkar
>> >> >> <aaurad...@linkedin.com.invalid>
>> >> >> wrote:
>> >> >>
>> >> >> > Hi,
>> >> >> >
>> >> >> > Why not simply have as many partitions as the set of consumers you want to
>> >> >> > round robin across?
>> >> >> >
>> >> >> > Aditya
>> >> >> >
>> >> >> > On Wed, Jul 22, 2015 at 2:37 PM, Ashish Singh <asi...@cloudera.com> wrote:
>> >> >> >
>> >> >> > > Hey, don't you think that would be against the basic ordering guarantees
>> >> >> > > Kafka provides?
>> >> >> > >
>> >> >> > > On Wed, Jul 22, 2015 at 2:14 PM, J A <mbatth...@gmail.com> wrote:
>> >> >> > >
>> >> >> > > > Hi, This is reference to stackoverflow question
>> >> >> > > > "http://stackoverflow.com/questions/31547216/kafka-log-deletion-and-load-balancing-across-consumers"
>> >> >> > > > Since Kafka 0.8 already maintains a client offset, I would like to request
>> >> >> > > > a feature, where a single partition consumption can be round robin across a
>> >> >> > > > set of consumers. The message delivery strategy should be an option chosen
>> >> >> > > > by the consumer.
>> >> >> > > >
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > > --
>> >> >> > >
>> >> >> > > Regards,
>> >> >> > > Ashish
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>>
>
>
>
> --
> Thanks,
> Ewen
