The sources (devices) are sending messages to AWS SQS (not Kafka). Each
message contains the path of a file on S3. (We have no control over the
sources; they won't change the way it's being done.)

SSS will listen to the SQS queue. We are thinking SSS will read each SQS
message, get the file location, read the file, clean it up as needed &
build a message that will be written to Kafka.

Let's just say millions of messages per minute will come in from SQS into
SSS. As for SSS, we are thinking the batch window size could be 5 seconds
(configurable).
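
Roughly, we're picturing something like the sketch below (PySpark + boto3;
the queue URL, broker, topic & clean-up step are placeholders, and it
assumes the spark-sql-kafka package is on the classpath). As far as we can
tell Spark has no built-in SQS source, so the driver drains the queue once
per batch window, then each batch of S3 files is read, cleaned & written
to Kafka:

    import time
    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sqs-to-kafka").getOrCreate()
    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/device-files"  # placeholder

    def poll_batch(window_secs=5, max_paths=10000):
        """Drain SQS for up to window_secs; return S3 paths + receipt handles."""
        paths, handles = [], []
        deadline = time.time() + window_secs
        while time.time() < deadline and len(paths) < max_paths:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                       MaxNumberOfMessages=10,
                                       WaitTimeSeconds=1)
            for m in resp.get("Messages", []):
                paths.append(m["Body"])            # each body holds one S3 path
                handles.append(m["ReceiptHandle"])
        return paths, handles

    # driver-side loop; a real job would add retries and graceful shutdown
    while True:
        paths, handles = poll_batch(window_secs=5)   # 5s window, configurable
        if not paths:
            continue
        df = spark.read.json(paths)                  # assuming JSON; adjust per format
        cleaned = df.dropna(subset=["device_id"])    # stand-in for the real clean-up
        (cleaned.selectExpr("CAST(device_id AS STRING) AS key",
                            "to_json(struct(*)) AS value")
                .write.format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
                .option("topic", "cleaned-device-events")           # placeholder
                .save())
        for h in handles:                            # ack only after a successful write
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=h)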


On Thu, Apr 29, 2021 at 10:43 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Ok thanks for the info.
>
> One question I forgot to ask: at what streaming interval is the source
> sending messages to Kafka to be processed inside SSS? For example, is
> this market data etc.?
>
> HTH
>
> On Thu, 29 Apr 2021 at 18:35, Eric Beabes <mailinglist...@gmail.com>
> wrote:
>
> > We're thinking Kafka will allow us to scale to billions of messages in a
> > day. That's the promise of Kafka, right? No other reason really. The main
> > goal is to "batch" the messages per hour and create file(s) on S3, sorted
> > by device_id, so that we can do more aggregations which can later be
> > sliced & diced using the UI.
> >
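> > Roughly, what we have in mind for the hourly job is the sketch below
> > (PySpark; event_ts, the S3 paths & names other than device_id are
> > placeholders):
> >
> >     from pyspark.sql import SparkSession, functions as F
> >
> >     spark = SparkSession.builder.appName("hourly-batch").getOrCreate()
> >     df = spark.read.json("s3://our-bucket/raw/")  # placeholder input
> >
> >     # tag each record with its hour, then sort within the hourly partition
> >     hourly = (df.withColumn("hour", F.date_format("event_ts", "yyyy-MM-dd-HH"))
> >                 .repartition("hour")
> >                 .sortWithinPartitions("device_id"))
> >     (hourly.write.mode("append")
> >            .partitionBy("hour")
> >            .parquet("s3://our-bucket/device-events/"))  # one directory per hour
> >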
> > Feel free to suggest alternatives. Thanks.
> >
> >
> > On Thu, Apr 29, 2021 at 10:22 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> > wrote:
> >
> > > Hi Eric,
> > >
> > > On your second point, "Is there a better way to do this":
> > >
> > > You are going to use Spark Structured Streaming (SSS) to clean and
> > > enrich the data and then push the messages to Kafka.
> > >
> > > I assume you will be using foreachBatch in this case. What purpose is
> > > there for Kafka to receive the enriched data from SSS? Is there any
> > > reason other than the hourly partitioning of your data?
> > >
> > > HTH
> > >
> > > On Thu, 29 Apr 2021 at 18:07, Eric Beabes <mailinglist...@gmail.com>
> > > wrote:
> > >
> > > > We’ve a use case where lots of messages will come in via AWS SQS from
> > > > various devices. We’re thinking of reading these messages using Spark
> > > > Structured Streaming, cleaning them up as needed & saving each message
> > > > on Kafka. Later we’re thinking of using the Kafka S3 Connector to push
> > > > them to S3 on an hourly basis, meaning there will be a different
> > > > directory for each hour. The challenge is that, within this hourly
> > > > “partition”, the messages need to be “sorted by” a certain field
> > > > (let’s say device_id), because we’re planning to create an EXTERNAL
> > > > table on it with BUCKETS on device_id. This will speed up the
> > > > subsequent aggregation jobs.
> > > >
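> > > > For illustration, if we wrote the hourly load with Spark itself rather
> > > > than the connector, we imagine the bucketed layout could be produced
> > > > roughly like this (PySpark; table name, input path, hour column &
> > > > bucket count are placeholders):
> > > >
> > > >     from pyspark.sql import SparkSession
> > > >
> > > >     spark = (SparkSession.builder.enableHiveSupport()
> > > >              .appName("bucketed-load").getOrCreate())
> > > >     df = spark.read.parquet("s3://our-bucket/device-events/")  # placeholder
> > > >     (df.write.mode("append")
> > > >        .partitionBy("hour")
> > > >        .bucketBy(64, "device_id")  # bucket count is a guess
> > > >        .sortBy("device_id")        # bucketBy/sortBy require saveAsTable
> > > >        .saveAsTable("device_events"))
> > > >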
> > > > Questions:
> > > >
> > > > 1) Does the Kafka S3 Connector allow messages to be sorted by a
> > > > particular field within a partition, or do we need to extend it?
> > > > 2) Is there a better way to do this?
> > > >
> > >
> >
>
