Re: Integration of DataSketches into Flink

Flavio Pompermaier Mon, 27 Apr 2020 10:58:39 -0700

If this can encourage Lee, I'm one of the Flink users who already use
datasketches, and I have found it an amazing library.
When I was trying it out (last year) I tried to stimulate some discussion [1],
but at that time it was probably too early.
I really hope that things are now mature enough for both communities!


[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-sketches-td26852.html

Best,
Flavio

On Mon, Apr 27, 2020 at 7:37 PM leerho <lee...@gmail.com> wrote:

> Hi Arvid,
>
> Note: I am dual listing this thread on both dev lists for better tracking.
>
> >    1. I'm curious on how you would estimate the effort to port datasketches
> >    to Flink? It already has a Java API, but how difficult would it be to
> >    subdivide the tasks into parallel chunks of work? Since it's already
> >    ported on Pig, I think we could use this port as a baseline
>
>
> Most systems (including Druid, Hive, Pig, Spark, PostgreSQL, other
> databases, streaming platforms, Map-Reduce platforms, etc.) have some sort
> of aggregation API, which allows users to plug in custom aggregation
> functions.  Typical functions found in these APIs are Initialize(),
> Update() (or Add()), Merge(), and getResult().  How these are named and
> how they operate varies considerably from system to system.  These APIs
> are sometimes called User Defined Functions (UDFs) or User Defined
> Aggregation Functions (UDAFs).
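
To make the four calls concrete, here is a minimal sketch of that lifecycle
in Python, using a running mean as the aggregation. The names
(MeanAccumulator, initialize, update, merge, get_result) are my own
placeholders that mirror the functions listed above; they are not taken from
any particular system's API.

```python
from dataclasses import dataclass


@dataclass
class MeanAccumulator:
    """Intermediate state for a running mean (hypothetical name)."""
    total: float = 0.0
    count: int = 0


def initialize() -> MeanAccumulator:
    # Create an empty accumulator.
    return MeanAccumulator()


def update(acc: MeanAccumulator, value: float) -> None:
    # Fold one input value into the accumulator.
    acc.total += value
    acc.count += 1


def merge(a: MeanAccumulator, b: MeanAccumulator) -> MeanAccumulator:
    # Combine two partial accumulators, e.g. from parallel workers.
    return MeanAccumulator(a.total + b.total, a.count + b.count)


def get_result(acc: MeanAccumulator) -> float:
    # Turn the accumulator into the final answer.
    return acc.total / acc.count if acc.count else float("nan")


def aggregate(values) -> MeanAccumulator:
    acc = initialize()
    for v in values:
        update(acc, v)
    return acc


data = [3.0, 5.0, 8.0, 4.0]
single = aggregate(data)                                   # one worker
merged = merge(aggregate(data[:2]), aggregate(data[2:]))   # two workers
print(get_result(single), get_result(merged))
```

The point of merge() is that the data can be split into parallel chunks,
aggregated independently, and combined afterwards with the same result.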
>
> DataSketches is a library of sketching (streaming) aggregation functions,
> each of which performs a specific type of aggregation: for example,
> counting unique items, determining quantiles and histograms of unknown
> distributions, identifying the most frequent items (heavy hitters) in a
> stream, etc.  The advantage of using DataSketches is that the sketches are
> extremely fast, small in size, and have well-defined error properties
> established by the published scientific papers that describe the
> underlying mathematics.
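
As an illustration of the idea (not of the DataSketches implementation), here
is a toy K-Minimum-Values estimator for distinct counting in Python. It keeps
only the k smallest hash values it has seen, so its size is bounded no matter
how long the stream is. The class name, the choice of SHA-256, and k = 1024
are all my assumptions for the example; the relative standard error of this
estimator is roughly 1/sqrt(k).

```python
import hashlib
import heapq


class KmvSketch:
    """Toy K-Minimum-Values distinct counter (illustrative only)."""

    def __init__(self, k: int = 1024):
        self.k = k
        self._heap = []        # max-heap (negated) of the k smallest hashes
        self._members = set()  # same hashes, for duplicate detection

    @staticmethod
    def _hash(item) -> float:
        # Map an item to a pseudo-uniform value in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def _add_hash(self, h: float) -> None:
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the current largest retained hash with the smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item) -> None:
        self._add_hash(self._hash(item))

    def merge(self, other: "KmvSketch") -> "KmvSketch":
        # The k smallest hashes of the union are all that is needed.
        out = KmvSketch(min(self.k, other.k))
        for h in self._members | other._members:
            out._add_hash(h)
        return out

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))   # fewer than k items: exact
        return (self.k - 1) / -self._heap[0]


# 50,000 events, but only 20,000 distinct keys.
items = [f"user-{i % 20000}" for i in range(50000)]

single = KmvSketch()
for it in items:
    single.update(it)

part_a, part_b = KmvSketch(), KmvSketch()
for it in items[:30000]:
    part_a.update(it)
for it in items[20000:]:
    part_b.update(it)
merged = part_a.merge(part_b)

print(round(single.estimate()), round(merged.estimate()))
```

Merging the two overlapping partitions retains exactly the same hash values
as the single pass, so both paths give the same estimate.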
>
> The task of porting DataSketches is usually a matter of developing a thin
> wrapping layer that translates the specific UDAF API of the target system
> (here, Flink) to the equivalent API methods of the targeted sketches in the
> library.  This is best done by someone with deep knowledge of the UDAF code
> of the targeted system.  We are certainly available to answer questions
> about the DataSketches APIs.  Although we did write the UDAF layers for
> Hive and Pig, we did that as a proof of concept and as an example of how to
> write such layers.  We are a small team and are not in a position to
> support these integration layers for every system out there.
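
A rough picture of what such a wrapping layer might look like, again in
Python for brevity. The adapter's method names are modeled on Flink's Java
AggregateFunction interface (createAccumulator, add, getResult, merge); the
"sketch" underneath is an exact, set-based stand-in that I made up for the
example, not a DataSketches class, and a real port would delegate to the
library's own update/union/estimate calls instead.

```python
class ExactDistinctSketch:
    """Stand-in for a real sketch (hypothetical, exact, unbounded size)."""

    def __init__(self):
        self.items = set()

    def update(self, item) -> None:
        self.items.add(item)

    def union(self, other: "ExactDistinctSketch") -> "ExactDistinctSketch":
        out = ExactDistinctSketch()
        out.items = self.items | other.items
        return out

    def estimate(self) -> float:
        return float(len(self.items))


class DistinctCountAggregate:
    """Thin adapter: host-system aggregation calls -> sketch calls."""

    def create_accumulator(self) -> ExactDistinctSketch:
        return ExactDistinctSketch()

    def add(self, value, accumulator: ExactDistinctSketch) -> ExactDistinctSketch:
        accumulator.update(value)
        return accumulator

    def merge(self, a: ExactDistinctSketch, b: ExactDistinctSketch) -> ExactDistinctSketch:
        return a.union(b)

    def get_result(self, accumulator: ExactDistinctSketch) -> float:
        return accumulator.estimate()


# Simulate what the host system would do with two parallel tasks.
agg = DistinctCountAggregate()
task1, task2 = agg.create_accumulator(), agg.create_accumulator()
for v in ["a", "b", "a"]:
    agg.add(v, task1)
for v in ["b", "c"]:
    agg.add(v, task2)
result = agg.get_result(agg.merge(task1, task2))
print(result)
```

The adapter holds no logic of its own, which is why one such layer per
sketch family can be written independently of the others.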
>
> >    2. Do you have any idea who is usually driving the adoptions?
>
>
> To start, you only need to write the UDAF layer for the sketches that you
> think would be in most demand by your users.  The big 4 categories are
> distinct (unique) counting, quantiles, frequent-items, and sampling.  This
> is a natural way of subdividing the task: choose the sketches you want to
> adapt and in what order.  Each sketch is independent so it can be adapted
> whenever it is needed.
>
> Please let us know if you have any further questions :)
>
> Lee.
>
>
>
>
> On Mon, Apr 27, 2020 at 2:11 AM Arvid Heise <ar...@ververica.com> wrote:
>
> > Hi Lee,
> >
> > I must admit that I also heard of data sketches for the first time (there
> > are really many Apache projects).
> >
> > Datasketches sounds really exciting. As a (former) data engineer, I can
> > 100% say that this is something that (end-)users want and need and it
> > would make so much sense to have it in Flink from the get-go.
> > Flink, however, is a quite old project already, which grew at a strong
> > pace leading to some 150 modules in the core. We are currently in the
> > process to restructure that and reduce the number of things in the core,
> > such that build times and stability improve.
> >
> > To counter that we created Flink packages [1], which includes everything
> > new that we deem to not be essential. I'd propose to incorporate a Flink
> > datasketch package there. If it seems like it's becoming essential, we
> > can still move it to core at a later point.
> >
> > As I have seen on the page, there are already plenty of adoptions. That
> > leaves a few questions to me.
> >
> >    1. I'm curious on how you would estimate the effort to port
> >    datasketches to Flink? It already has a Java API, but how difficult
> >    would it be to subdivide the tasks into parallel chunks of work? Since
> >    it's already ported on Pig, I think we could use this port as a
> >    baseline.
> >    2. Do you have any idea who is usually driving the adoptions?
> >
> >
> > [1] https://flink-packages.org/
> >
> > On Sun, Apr 26, 2020 at 8:07 AM leerho <lee...@gmail.com> wrote:
> >
> > > Hello All,
> > >
> > > I am a committer on DataSketches.apache.org
> > > <http://datasketches.apache.org/> and just learning about Flink. Since
> > > Flink is designed for stateful stream processing I would think it would
> > > make sense to have the DataSketches library integrated into its core so
> > > all users of Flink could take advantage of these advanced streaming
> > > algorithms.  If there is interest in the Flink community for this
> > > capability, please contact us at d...@datasketches.apache.org or on our
> > > datasketches-dev Slack channel.
> > > Cheers,
> > > Lee.
> > >
> >
> >
> > --
> >
> > Arvid Heise | Senior Java Developer
