Hi Michael,
{quote}
Do you mean that in such a case Samza should be combined with another
Stream processing framework (such as Storm)?
{quote}
No. I didn't mean combining it with any other framework.

{quote}
"the job bootstraps the data from the source" - do you mean that
you have a background process for this purpose or just listen to an
additional stream of change log from some other framework?
{quote}
I didn't mean a background process. I meant just listening to a change-log
stream from a data source.

At LinkedIn, we use databus. Each job configures databus (for a given
data source) as one of its input streams. Databus is a source-agnostic
distributed change data capture system. You can find more
information here <https://github.com/linkedin/databus>. The advantage is
that the databus client is capable of "bootstrapping" from the source
automatically and then switching to simply capturing changes from the data
source. In this scenario, Samza doesn't do anything special, except that it
continues consuming from the databus stream while bootstrapping. Once
bootstrap is complete, the job can start processing events from other input
streams as well.

I hope this answers your question. :)

Thanks!
Navina


On Mon, Sep 21, 2015 at 1:56 AM, Michael Sklyar <mikesk...@gmail.com> wrote:

> Thank you for your replies,
>
> I understand that making an external blocking request in a single event
> thread will result in extremely low throughput. However, this can be solved
> by multithreading and/or an asynchronous approach. It is clear that in any
> case using external services can never achieve the throughput of simple
> transformations. However, most stream processing jobs need, from time to
> time, to query some external storage, web service, etc.
>
> Do you mean that in such a case Samza should be combined with another
> Stream processing framework (such as Storm)?
>
> Navina, "the job bootstraps the data from the source" - do you mean that
> you have a background process for this purpose or just listen to an
> additional stream of change log from some other framework?
>
> Thanks,
> Michael
>
> On Mon, Sep 21, 2015 at 6:52 AM, Navina Ramesh <nram...@linkedin.com.invalid>
> wrote:
>
> > Hi Michael,
> > I agree with what Yan said. While nothing stops you from doing it, it is
> > not encouraged as it affects throughput and realtime processing.
> >
> > {quote}
> > It seems that Samza design suits very well "data transformation" scenarios,
> > what is not clear is how well can it support external services?
> > {quote}
> > We have some similar use-cases at LinkedIn where the Samza jobs need to
> > query external data sources. We use a pattern where the job
> > bootstraps the data from the source using a change-capture system like
> > databus and buffers it locally, before processing from input streams.
> > Depending on the scale of your data, this model may or may not work for
> > you. However, there is no built-in support for this in Samza.
> >
> > Thanks!
> > Navina
> >
> > On Sun, Sep 20, 2015 at 7:55 PM, Yan Fang <yanfang...@gmail.com> wrote:
> >
> > > Hi Michael,
> > >
> > > Samza is designed for high-throughput and realtime processing. If you are
> > > using HTTP requests or an external service, you may not get the same
> > > performance as without them. However, technically speaking, there is
> > > nothing stopping you from doing this (well, it is discouraged anyway :).
> > > Samza by default does not provide this feature, so you may want to be a
> > > little cautious when implementing this.
> > >
> > > Thanks,
> > >
> > > Fang, Yan
> > > yanfang...@gmail.com
> > >
> > > On Sun, Sep 20, 2015 at 4:28 PM, Michael Sklyar <mikesk...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > What would be the best approach for doing "blocking" operations in Samza?
> > > >
> > > > For example, we have a kafka stream of urls for which we need to gather
> > > > external data via HTTP (such as alexa rank, get the page title and
> > > > headers..). Other scenarios include database access and decision making
> > > > via a rule engine.
> > > >
> > > > Samza processes messages in a single thread, and HTTP requests might take
> > > > hundreds of milliseconds. With the single-threaded design the throughput
> > > > would be very limited, which can be solved with an asynchronous approach.
> > > > However, the Samza documentation explicitly states
> > > > "*You are strongly discouraged from using threads in your job’s code*".
> > > >
> > > > It seems that Samza design suits very well "data transformation" scenarios,
> > > > what is not clear is how well can it support external services?
> > > >
> > > > Thanks,
> > > > Michael Sklyar
> > > >
> > >
> >
> >
> >
> > --
> > Navina R.
> >
>



-- 
Navina R.
