Re: Integration of DataSketches into Flink

leerho Mon, 27 Apr 2020 10:38:41 -0700

Hi Arvid,

Note: I am dual listing this thread on both dev lists for better tracking.

>    1. I'm curious on how you would estimate the effort to port datasketches
>    to Flink? It already has a Java API, but how difficult would it be to
>    subdivide the tasks into parallel chunks of work? Since it's already
>    ported on Pig, I think we could use this port as a baseline

Most systems (including Druid, Hive, Pig, Spark, PostgreSQL and other
databases, streaming platforms, map-reduce platforms, etc.) have some sort
of aggregation API that allows users to plug in custom aggregation
functions.  Typical functions found in these APIs are Initialize(),
Update() (or Add()), Merge(), and getResult().  How these are named and
how they operate varies considerably from system to system.  These APIs
are sometimes called User Defined Functions (UDFs) or User Defined
Aggregation Functions (UDAFs).
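
As a rough illustration of that four-function shape, here is a minimal Java
sketch.  The interface and class names (Aggregation, ExactDistinctCount,
UdafShapeDemo) are made up for this example and do not come from any of the
systems above.  The accumulator is a plain HashSet, so this is an exact
counter and not a sketch; it only shows how Initialize, Update, Merge and
getResult fit together when two partitions are aggregated separately.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical interface with the typical four-function shape.
// It is an illustration only, not the API of any real system.
interface Aggregation<IN, ACC, OUT> {
    ACC initialize();                 // create an empty accumulator
    ACC update(ACC acc, IN value);    // fold one input item into it
    ACC merge(ACC a, ACC b);          // combine two partial accumulators
    OUT getResult(ACC acc);           // extract the final answer
}

// Toy exact distinct counter. NOT a sketch: memory grows with the
// number of distinct items, which is what a real sketch avoids.
class ExactDistinctCount implements Aggregation<String, Set<String>, Long> {
    public Set<String> initialize() { return new HashSet<>(); }
    public Set<String> update(Set<String> acc, String value) { acc.add(value); return acc; }
    public Set<String> merge(Set<String> a, Set<String> b) { a.addAll(b); return a; }
    public Long getResult(Set<String> acc) { return (long) acc.size(); }
}

class UdafShapeDemo {
    // Aggregates two partitions independently, then merges the partial results.
    static long countDistinct(List<String> partitionA, List<String> partitionB) {
        ExactDistinctCount agg = new ExactDistinctCount();
        Set<String> accA = agg.initialize();
        for (String s : partitionA) accA = agg.update(accA, s);
        Set<String> accB = agg.initialize();
        for (String s : partitionB) accB = agg.update(accB, s);
        return agg.getResult(agg.merge(accA, accB));
    }

    public static void main(String[] args) {
        System.out.println(countDistinct(Arrays.asList("a", "b", "a"), Arrays.asList("b", "c")));
    }
}
```

The Merge() step is what lets a system split the input into parallel chunks
and combine the partial results afterwards.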

DataSketches is a library of sketching (streaming) aggregation functions,
each of which performs a specific type of aggregation: for example,
counting unique items, determining quantiles and histograms of unknown
distributions, identifying the most frequent items (heavy hitters) in a
stream, etc.   The advantage of using DataSketches is that the sketches
are extremely fast, small in size, and have well-defined error properties
established by the published scientific papers that define the underlying
mathematics.

The task of porting DataSketches is usually developing a thin wrapping
layer that translates the specific UDAF API of Flink to the equivalent API
methods of the targeted sketches in the library.  This is best done by
someone with deep knowledge of the UDAF code of the targeted system.   We
are certainly available to answer questions about the DataSketches APIs.
Although we did write the UDAF layers for Hive and Pig, we did that as a
proof of concept and as an example of how to write such layers.  We are a
small team and are not in a position to support these integration layers
for every system out there.
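
To make the idea of a thin wrapping layer concrete, here is a
dependency-free Java sketch under stated assumptions.  Everything in it is a
stand-in: AggregateFunctionShape is a local interface that only mirrors the
method names of Flink's AggregateFunction (createAccumulator, add,
getResult, merge), and ToyKmvSketch is a toy K-Minimum-Values distinct
counter written for this example, not a class from the DataSketches
library.  A real integration would presumably implement Flink's own
interface and delegate to a library sketch instead.

```java
import java.util.TreeSet;

// Toy stand-in for a library sketch (hypothetical, NOT the DataSketches API).
// K-Minimum-Values: retains only the k smallest 64-bit hash values seen.
class ToyKmvSketch {
    private final int k;
    private final TreeSet<Long> smallest = new TreeSet<Long>(Long::compareUnsigned);

    ToyKmvSketch(int k) { this.k = k; }

    // splitmix64-style finalizer that spreads String.hashCode() over 64 bits.
    private static long hash(String s) {
        long z = s.hashCode() + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    private void insert(long h) {
        smallest.add(h);
        if (smallest.size() > k) smallest.pollLast();  // drop the largest hash
    }

    void update(String item) { insert(hash(item)); }

    void mergeFrom(ToyKmvSketch other) {
        for (long h : other.smallest) insert(h);
    }

    double getEstimate() {
        if (smallest.size() < k) return smallest.size();  // exact below capacity
        // Map the k-th smallest hash to a fraction in [0, 1) using its top 53 bits.
        double kth = (smallest.last() >>> 11) / (double) (1L << 53);
        return (k - 1) / kth;
    }
}

// Local stand-in mirroring the method names of Flink's AggregateFunction.
interface AggregateFunctionShape<IN, ACC, OUT> {
    ACC createAccumulator();
    ACC add(IN value, ACC accumulator);
    OUT getResult(ACC accumulator);
    ACC merge(ACC a, ACC b);
}

// The thin wrapping layer: each engine-side method delegates to one sketch method.
class DistinctCountAdapter implements AggregateFunctionShape<String, ToyKmvSketch, Double> {
    public ToyKmvSketch createAccumulator() { return new ToyKmvSketch(1024); }
    public ToyKmvSketch add(String value, ToyKmvSketch acc) { acc.update(value); return acc; }
    public Double getResult(ToyKmvSketch acc) { return acc.getEstimate(); }
    public ToyKmvSketch merge(ToyKmvSketch a, ToyKmvSketch b) { a.mergeFrom(b); return a; }
}
```

The adapter itself is only four one-line methods; the remaining work in a
real port is likely to be engine-specific, such as how the accumulator is
serialized and registered, which this example does not attempt to cover.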

>    2. Do you have any idea who is usually driving the adoptions?

To start, you only need to write the UDAF layer for the sketches that you
think would be in most demand by your users.  The big 4 categories are
distinct (unique) counting, quantiles, frequent-items, and sampling.  This
is a natural way of subdividing the task: choose the sketches you want to
adapt and in what order.  Each sketch is independent so it can be adapted
whenever it is needed.

Please let us know if you have any further questions :)

Lee.

On Mon, Apr 27, 2020 at 2:11 AM Arvid Heise <ar...@ververica.com> wrote:

> Hi Lee,
>
> I must admit that I also heard of data sketches for the first time (there
> are really many Apache projects).
>
> Datasketches sounds really exciting. As a (former) data engineer, I can
> 100% say that this is something that (end-)users want and need and it would
> make so much sense to have it in Flink from the get-go.
> Flink, however, is already quite an old project, which grew at a strong
> pace, leading to some 150 modules in the core. We are currently in the
> process of restructuring that and reducing the number of things in the
> core, such that build times and stability improve.
>
> To counter that we created Flink packages [1], which includes everything
> new that we deem to not be essential. I'd propose to incorporate a Flink
> datasketch package there. If it seems like it's becoming essential, we can
> still move it to core at a later point.
>
> As I have seen on the page, there are already plenty of adoptions. That
> leaves a few questions to me.
>
>    1. I'm curious on how you would estimate the effort to port datasketches
>    to Flink? It already has a Java API, but how difficult would it be to
>    subdivide the tasks into parallel chunks of work? Since it's already
>    ported on Pig, I think we could use this port as a baseline.
>    2. Do you have any idea who is usually driving the adoptions?
>
>
> [1] https://flink-packages.org/
>
> On Sun, Apr 26, 2020 at 8:07 AM leerho <lee...@gmail.com> wrote:
>
> > Hello All,
> >
> > I am a committer on DataSketches.apache.org
> > <http://datasketches.apache.org/> and just learning about Flink.  Since
> > Flink is designed for stateful stream processing, I would think it would
> > make sense to have the DataSketches library integrated into its core so
> > all users of Flink could take advantage of these advanced streaming
> > algorithms.  If there is interest in the Flink community for this
> > capability, please contact us at d...@datasketches.apache.org or on our
> > datasketches-dev Slack channel.
> > Cheers,
> > Lee.
> >
>
>
> --
>
> Arvid Heise | Senior Java Developer
>
> <https://www.ververica.com/>
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
> (Toni) Cheng
>
