You could then chunk the data (wrapped in an outer message so you carry
metadata like the file name, total size, and current chunk size) and produce
each chunk with the file name as the partition key.
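
A rough sketch of that approach (not production code - this assumes the newer
Java producer client, and the "file-chunks" topic name and envelope layout are
just illustrative):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Properties;

public class ChunkedFileProducer {
    // 1 MB chunks keep each message well under typical broker message size limits.
    private static final int CHUNK_SIZE = 1024 * 1024;

    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]);
        long totalSize = Files.size(file);
        String fileName = file.getFileName().toString();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
             InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[CHUNK_SIZE];
            int chunkIndex = 0;
            int read;
            while ((read = in.read(buf)) > 0) {
                byte[] chunk = Arrays.copyOf(buf, read);
                // Envelope: file name, total size, chunk index, chunk size, then the payload.
                byte[] nameBytes = fileName.getBytes(StandardCharsets.UTF_8);
                ByteBuffer envelope = ByteBuffer.allocate(4 + nameBytes.length + 8 + 4 + 4 + chunk.length);
                envelope.putInt(nameBytes.length).put(nameBytes)
                        .putLong(totalSize)
                        .putInt(chunkIndex++)
                        .putInt(chunk.length)
                        .put(chunk);
                // Keying by file name keeps all chunks of one file in order on one partition.
                producer.send(new ProducerRecord<>("file-chunks", fileName, envelope.array()));
            }
        }
    }
}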

We are working on a system for loading files into Kafka, which will
eventually support both chunking and pointers (initially chunking line by
line, since the first use case is reading from a closed file handle
location): https://github.com/stealthly/f2k (there is not much there yet,
maybe in the next few days / later this week). It may be useful for your use
case, or we could eventually add your use case to it.
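
On the consuming side, the same envelope lets you regroup chunks back into
complete files. Another hedged sketch, using the same assumed topic and
envelope layout as above and the newer Java consumer client:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ChunkedFileConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "file-reassembler");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        // Buffers of partially received files, keyed by file name.
        Map<String, ByteArrayOutputStream> partialFiles = new HashMap<>();

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("file-chunks"));
            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, byte[]> record : records) {
                    // Decode the envelope written by the producer sketch above.
                    ByteBuffer envelope = ByteBuffer.wrap(record.value());
                    byte[] nameBytes = new byte[envelope.getInt()];
                    envelope.get(nameBytes);
                    String fileName = new String(nameBytes, StandardCharsets.UTF_8);
                    long totalSize = envelope.getLong();
                    envelope.getInt();                 // chunk index (ordering comes from the partition)
                    byte[] chunk = new byte[envelope.getInt()];
                    envelope.get(chunk);

                    partialFiles.computeIfAbsent(fileName, f -> new ByteArrayOutputStream())
                                .write(chunk, 0, chunk.length);

                    if (partialFiles.get(fileName).size() >= totalSize) {
                        // All chunks received: hand the complete file to downstream processing.
                        byte[] complete = partialFiles.remove(fileName).toByteArray();
                        System.out.printf("reassembled %s (%d bytes)%n", fileName, complete.length);
                    }
                }
            }
        }
    }
}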

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/


On Tue, Jun 24, 2014 at 12:37 PM, Denny Lee <denny.g....@gmail.com> wrote:

> Hey Joe,
>
> Yes, I have - my original plan was to do something similar to what you
> suggested, which was to simply push the data into HDFS / S3 and then have
> only the event information within Kafka, so that multiple consumers can
> just read the event information and ping HDFS/S3 for the actual message
> itself.
>
> Part of the reason for considering pushing the entire message is that we
> may have a firehose of messages of this size and will need to push the data
> to multiple locations.
>
> Thanks,
> Denny
>
> On June 24, 2014 at 9:26:49 AM, Joe Stein (joe.st...@stealth.ly) wrote:
>
> Hi Denny, have you considered saving those files to HDFS and sending the
> "event" information to Kafka?
>
> You could then pass that off to Apache Spark in a consumer and get data
> locality for the file saved (or something of the sort [no pun intended]).
>
> You could also stream every line (or however you want to "chunk" it) in the
> file as a separate message to the broker with a wrapping message object (so
> you know which file you are dealing with when consuming).
>
> What you plan to do with the data has a lot to do with how you are going to
> process and manage it.
>
> /*******************************************
> Joe Stein
> Founder, Principal Consultant
> Big Data Open Source Security LLC
> http://www.stealth.ly
> Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
> ********************************************/
>
>
> On Tue, Jun 24, 2014 at 11:35 AM, Denny Lee <denny.g....@gmail.com>
> wrote:
>
> > By any chance, has anyone worked with Kafka using messages that are
> > approximately 50 MB in size? Based on some of the previous threads, there
> > are probably some concerns about memory pressure due to compression on
> > the broker and decompression on the consumer, and best practices around
> > ensuring batch size (so the compressed message does not ultimately exceed
> > the message size limit).
> >
> > Any other best practices or thoughts concerning this scenario?
> >
> > Thanks!
> > Denny
> >
> >
>
>
