Right, I agree with Bryan, so let me expand a bit.

There are some key primitives around joining two live streams that
dedicated stream processing systems are designed to solve well.  NiFi
offers nothing special/unique in that space.

Now, as Bryan pointed out, a really common case in NiFi is to have a
reference/lookup dataset accessible through one of our LookupService
implementations (and you can write your own).  The typical use case is a
live stream of event data coming in with a customer id, IP address, or
some other lookup key, and you then want to look up that value against a
reference dataset and merge in the results, such as geolocation or
customer address/billing data.  This is very doable; a rough sketch of a
custom service follows below.
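
To make that concrete, here is a minimal sketch of writing your own
lookup service.  The class name and the in-memory reference data are
made up for illustration, and I'm writing the interface signatures from
memory of nifi-lookup-service-api, so they may differ slightly by NiFi
version.  In a real flow you'd register a service like this (or one of
the built-in ones) and wire it into a processor such as LookupRecord:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

import org.apache.nifi.controller.AbstractControllerService;
import org.apache.nifi.lookup.StringLookupService;

// Hypothetical service that enriches events carrying a customer id with
// that customer's billing address from a static reference map.
public class CustomerBillingLookupService extends AbstractControllerService
        implements StringLookupService {

    // Stand-in reference dataset; a real service would load this from a
    // file, database, cache, etc., and refresh it on some schedule.
    private static final Map<String, String> BILLING_BY_CUSTOMER_ID = new HashMap<>();
    static {
        BILLING_BY_CUSTOMER_ID.put("cust-1001", "123 Main St, Springfield");
        BILLING_BY_CUSTOMER_ID.put("cust-1002", "9 Elm Ave, Shelbyville");
    }

    @Override
    public Optional<String> lookup(final Map<String, Object> coordinates) {
        final Object key = coordinates.get("key");
        return Optional.ofNullable(
                key == null ? null : BILLING_BY_CUSTOMER_ID.get(key.toString()));
    }

    @Override
    public Set<String> getRequiredKeys() {
        // The coordinate name the processor must supply, e.g. mapped from
        // the record field or attribute holding the customer id.
        return Collections.singleton("key");
    }
}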

The key difference here is the liveness of the streams of information.
If both streams are always changing/updating/morphing, then you want the
full complement of stream-join features.  If one of the datasets updates
infrequently, like geolocation data or customer information, then our
existing capabilities work quite well.
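
For contrast, here is roughly what a windowed stream-stream join looks
like in a dedicated engine, using Kafka Streams as the example.  The
topic names and the five-minute window are illustrative assumptions, and
the exact JoinWindows API has shifted across Kafka versions; the point
is just the shape of the windowed-state machinery NiFi does not try to
replicate:

import java.time.Duration;

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class StreamStreamJoinSketch {
    public static void main(final String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();

        // Two live streams that are both continuously changing.
        final KStream<String, String> orders = builder.stream("orders");
        final KStream<String, String> payments = builder.stream("payments");

        // Windowed join: records pair up only when events with the same key
        // arrive within five minutes of each other.
        final KStream<String, String> joined = orders.join(
                payments,
                (order, payment) -> order + "|" + payment,
                JoinWindows.of(Duration.ofMinutes(5)));

        joined.to("orders-with-payments");

        // Building and starting the KafkaStreams topology is omitted here.
    }
}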

You might then ask, 'well, why don't I just use a stream processing
system for all of this?'  That ties into the fact that those systems are
designed with different tradeoffs: memory usage, APIs for handling
small/large objects, no data provenance, etc.  NiFi is a stream
processing system with a bias toward managing the flow of information,
large/small/fast/batch/etc.  Stream processing/analytics systems, by
contrast, are built with a bias toward analytic processing; they often
run in memory or checkpoint at key points into Kafka or other systems
(which also implies certain per-event size constraints, etc.).

Hopefully this distinction helps.

Thanks
Joe

On Fri, Feb 22, 2019 at 9:51 AM Bryan Bende <bbe...@gmail.com> wrote:

> Hi Boris,
>
> Joining across two different data streams is not really something NiFi
> is aiming to solve.
>
> Generally I think we'd say that you'd use one of the stream processing
> systems like Flink, Spark, Storm, etc.
>
> Another possible option might be to pull the data and land it in a
> common location like Hive, then you can run a single query against
> Hive that joins the tables.
>
> Others may have more experience with solving this than I do, so
> curious to hear other approaches people have taken.
>
> -Bryan
>
> On Fri, Feb 22, 2019 at 9:08 AM Boris Tyukin <bo...@boristyukin.com>
> wrote:
> >
> > Hi guys,
> >
> > I pull two datasets from two different databases on a schedule and need
> to join both on some ID and then publish the combined dataset to Kafka.
> >
> > What is the best way to do this? Puzzled how I would synchronize the two
> data pulls so the data is joined for the exact flowfiles I need, i.e. if
> there are errors anywhere, I do not want to join an older flowfile with a
> newer one.
> >
> > Thanks!
> > Boris
>
