Seth,
Thanks for the enthusiastic reply.

However, I have some questions ... and concerns :)

1) Create a page on the flink packages website.


I looked at this website and it raises a number of red flags for me:

   - There are no instructions anywhere on the site on how to add a listing.
   - The "Login with Github" raises security concerns and without any
   explanation:
      - Why would I want or need to authorize this site to have "access to
      my email account"!  Whoa!
      - This site has registered fewer than 100 GitHub users.  That is a
      very small number. It seems a lot of GitHub users have the same concerns
      that I have.
   - The packages listed are "not endorsed by Apache Flink project or
   Ververica.  This site is not affiliated with or released by Apache Flink".
   There is no verification of licensing.
   - In other words, this site carries zero or even negative weight.  Why
   would I want to add a listing for our high-quality, properly licensed
   Apache DataSketches project alongside other listings that are possibly
   junk?


2) Implement TypeInformation for DataSketches


In terms of serialization and deserialization, the sketches in our library
have their own serialization to and from a byte array, which is also
language-independent across Java, C++, and Python.  How to transport bytes
from one system to another is system-dependent and external to the
DataSketches library.  Some systems use Base64, Protobuf, Kryo, Kafka, or
whatever.  As long as we can deserialize (or wrap) the same byte array
that was serialized, we are fine; a minimal round trip is sketched below.
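
For illustration, here is a minimal example of that round trip in Java (this
assumes the HLL sketch and the current org.apache.datasketches.hll API; the
same pattern applies to the other sketch families):

    import org.apache.datasketches.hll.HllSketch;
    import org.apache.datasketches.memory.Memory;

    public class RoundTrip {
      public static void main(String[] args) {
        // Build a sketch and feed it some items.
        HllSketch sketch = new HllSketch(12);            // lgK = 12
        for (int i = 0; i < 1_000_000; i++) {
          sketch.update("item-" + i);
        }

        // Serialize to a compact, language-independent byte array.  How these
        // bytes travel (Kafka, Protobuf, Base64, ...) is up to the host system.
        byte[] bytes = sketch.toCompactByteArray();

        // Deserialize on the receiving side: heapify copies onto the Java heap,
        // wrap gives a read-only view of the bytes without copying.
        HllSketch onHeap = HllSketch.heapify(bytes);
        HllSketch wrapped = HllSketch.wrap(Memory.wrap(bytes));
        System.out.println(onHeap.getEstimate() + " ~ " + wrapped.getEstimate());
      }
    }

The byte array in the middle is the only thing that ever needs to cross a
system boundary.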

If you are asking for metadata about a specific blob of bytes, such as
which sketch created it, we can perhaps do that, but the documentation is
not clear about how much metadata is really required, because our library
does not need any.  So we could use some help here in defining what is
actually required.  Be aware that metadata also increases the stored size
of an object, and we have worked very hard to keep the stored size of our
sketches small, because that is one of the key advantages of using
sketches.  This is also why we don't use Java serialization; it is far too
heavy.

3) Implement Sketch UDFs


Thanks for the references, but this is getting too deep into the weeds for
me right now.  I would suggest we start simple and build these UDFs later;
if I understand your comments correctly, they are optional.
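
To make "start simple" concrete, the kind of thing I have in mind is a thin
adapter over Flink's DataStream AggregateFunction, roughly like the sketch
below.  This is untested and only my guess at how the pieces fit together;
the DataSketches classes (Union and its methods) are real, but your
developers would need to confirm the Flink side:

    import org.apache.datasketches.hll.Union;
    import org.apache.flink.api.common.functions.AggregateFunction;

    // Approximate distinct count of String items: the accumulator is an HLL
    // Union, the result is the distinct-count estimate.
    public class ApproxDistinctCount
        implements AggregateFunction<String, Union, Double> {

      @Override
      public Union createAccumulator() {
        return new Union(12);            // lgK = 12, roughly 1.6% relative error
      }

      @Override
      public Union add(String value, Union acc) {
        acc.update(value);               // stream one item into the sketch
        return acc;
      }

      @Override
      public Double getResult(Union acc) {
        return acc.getEstimate();        // approximate number of distinct items
      }

      @Override
      public Union merge(Union a, Union b) {
        a.update(b.getResult());         // union the two sketches
        return a;
      }
    }

Used roughly as stream.keyBy(...).window(...).aggregate(new
ApproxDistinctCount()).  For Flink to checkpoint the Union accumulator
efficiently it would presumably still need the TypeInformation/serializer
wrapper from your point 2, which is exactly where we need guidance.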

I would suggest we set up a video call with a couple of your key developers
who could steer us quickly through the options.

Please be aware that we are *extremely* resource limited; Flink is at least
10 times our size, so we could use some help in getting started.  Ideally,
someone in your community who is interested in seeing DataSketches
integrated into Flink would work with us on the integration.

I am looking forward to working with Flink to make this happen.

Cheers,

Lee.


On Mon, Apr 27, 2020 at 2:15 PM Seth Wiesman <sjwies...@gmail.com> wrote:

> One more point I forgot to mention.
>
> Flink SQL supports Hive UDF's[1]. I haven't tested it, but the datasketch
> hive package should just work out of the box.
>
> Seth
>
> [1]
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/hive/hive_functions.html
>
> On Mon, Apr 27, 2020 at 2:27 PM Seth Wiesman <sjwies...@gmail.com> wrote:
>
> > Hi Lee,
> >
> > I really like this project, I used it with Flink a few years ago when it
> > was still Yahoo DataSketches. The projects clearly complement each other.
> > As Arvid mentioned, the Flink community is trying to foster an ecosystem
> > larger than what is in the main Flink repository. The reason is that the
> > project has grown to such a scale that it cannot reasonably maintain
> > everything. To encourage that sort of growth, Flink is extensively
> > pluggable which means that components do not need to live within the main
> > repository to be treated first-class.
> >
> > I'd like to outline some things the DataSketch community could do to
> > integrate with Flink.
> >
> > 1) Create a page on the flink packages website.
> >
> > The flink community hosts a website called flink packages to increase the
> > visibility of ecosystem projects with the flink user base[1].
> Datasketches
> > are usable from Flink today so I'd encourage you to create a page right
> > away.
> >
> > 2) Implement TypeInformation for DataSketches
> >
> > TypeInformation is Flink's internal type system and is used as a factory
> > for creating serializers for different types. These serializers are what
> > Flink uses when shuffling data around the cluster and when storing
> records
> > in state backends as state. Provide type information instances for the
> > different sketch types; these would just be wrappers around the existing
> > serializers in the DataSketches codebase. This should be relatively
> > straightforward. There is no DataStream aggregation API in the way you
> are
> > describing so this is the *only* step you would need to take to provide
> > first-class support for Flink DataStream API[2][3].
> >
> > 3) Implement sketch UDFs
> >
> > Along with its Java API, Flink also offers a relational API and UDFs. The
> > community could provide UDFs for datasketches like Hive. To do so only
> > requires implementing the aggregation function interface[4]. Flink SQL
> > offers the concept of modules, which are a collection of SQL UDFs that
> can
> > easily be loaded in the system[5]. A DataSketch SQL module would provide
> a
> > simple way for users to get started and expose these UDFs as if they were
> > native to Flink.
> >
> > I hope this helps, I look forward to watching the DataSketch community
> > grow!
> >
> > Seth
> >
> > [1] https://flink-packages.org/
> > [2]
> >
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/types_serialization.html
> > [3]
> >
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html
> > [4]
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/functions/udfs.html#aggregation-functions
> > [5]
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/modules.html
> >
> >
> > On Mon, Apr 27, 2020 at 12:57 PM Flavio Pompermaier <
> pomperma...@okkam.it>
> > wrote:
> >
> >> If this can encourage Lee I'm one of the Flink users that already use
> >> datasketches and I found it an amazing library.
> >> When I was trying it out (last year) I tried to stimulate some
> >> discussion[1]
> >> but at that time it was probably too early..
> >> I really hope that now things are mature for both communities!
> >>
> >> [1]
> >>
> >>
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-sketches-td26852.html
> >>
> >> Best,
> >> Flavio
> >>
> >> On Mon, Apr 27, 2020 at 7:37 PM leerho <lee...@gmail.com> wrote:
> >>
> >> > Hi Arvid,
> >> >
> >> > Note: I am dual listing this thread on both dev lists for better
> >> tracking.
> >> >
> >> >    1. I'm curious on how you would estimate the effort to port
> >> datasketches
> >> > >    to Flink? It already has a Java API, but how difficult would it
> be
> >> to
> >> > >    subdivide the tasks into parallel chunks of work? Since it's
> >> already
> >> > > ported
> >> > >    on Pig, I think we could use this port as a baseline
> >> >
> >> >
> >> > Most systems (including systems like Druid, Hive, Pig, Spark,
> >> PostgreSQL,
> >> > Databases, Streaming Platforms, Map-Reduce Platforms, etc) have some
> >> sort
> >> > of aggregation API, which allows users to plug in custom aggregation
> >> > functions.  Typical API functions found in these APIs are
> Initialize(),
> >> > Update() (or Add()), Merge(), and getResult().  How these are named
> and
> >> > operate vary considerably from system to system.  These APIs are
> >> sometimes
> >> > called User Defined Functions (UDFs) or User Defined Aggregation
> >> Functions
> >> > (UDAFs).
> >> >
> >> > DataSketches is a library of Sketching (streaming) aggregation
> >> functions,
> >> > each of which perform specific types of aggregation. For example,
> >> counting
> >> > unique items, determining quantiles and histograms of unknown
> >> > distributions, identifying most frequent items (heavy hitters) from a
> >> > stream, etc.   The advantage of using DataSketches is that they are
> >> > extremely fast, small in size, and have well defined error properties
> >> > defined by published scientific papers that define the underlying
> >> > mathematics.
> >> >
> >> > The task of porting DataSketches is usually developing a thin wrapping
> >> > layer that translates the specific UDAF API of Flink to the equivalent
> >> API
> >> > methods of the targeted sketches in the library.  This is best done by
> >> > someone with deep knowledge of the UDAF code of the targeted system.
> >>  We
> >> > are certainly available to answer questions about the DataSketches APIs.
> >> >  Although we did write the UDAF layers for Hive and Pig, we did that
> as
> >> a
> >> > proof of concept and example on how to write such layers.  We are a
> >> small
> >> > team and are not in a position to support these integration layers for
> >> > every system out there.
> >> >
> >> > 2. Do you have any idea who is usually driving the adoptions?
> >> >
> >> >
> >> > To start, you only need to write the UDAF layer for the sketches that
> >> you
> >> > think would be in most demand by your users.  The big 4 categories are
> >> > distinct (unique) counting, quantiles, frequent-items, and sampling.
> >> This
> >> > is a natural way of subdividing the task: choose the sketches you want
> >> to
> >> > adapt and in what order.  Each sketch is independent so it can be
> >> adapted
> >> > whenever it is needed.
> >> >
> >> > Please let us know if you have any further questions :)
> >> >
> >> > Lee.
> >> >
> >> >
> >> >
> >> >
> >> > On Mon, Apr 27, 2020 at 2:11 AM Arvid Heise <ar...@ververica.com>
> >> wrote:
> >> >
> >> > > Hi Lee,
> >> > >
> >> > > I must admit that I also heard of data sketches for the first time
> >> (there
> >> > > are really many Apache projects).
> >> > >
> >> > > Datasketches sounds really exciting. As a (former) data engineer, I
> >> can
> >> > > 100% say that this is something that (end-)users want and need and
> it
> >> > would
> >> > > make so much sense to have it in Flink from the get-go.
> >> > > Flink, however, is a quite old project already, which grew at a
> strong
> >> > pace
> >> > > leading to some 150 modules in the core. We are currently in the
> >> process
> >> > to
> >> > > restructure that and reduce the number of things in the core, such
> >> that
> >> > > build times and stability improve.
> >> > >
> >> > > To counter that we created Flink packages [1], which includes
> >> everything
> >> > > new that we deem to not be essential. I'd propose to incorporate a
> >> Flink
> >> > > datasketch package there. If it seems like it's becoming essential,
> we
> >> > can
> >> > > still move it to core at a later point.
> >> > >
> >> > > As I have seen on the page, there are already plenty of adoptions.
> >> That
> >> > > leaves a few questions to me.
> >> > >
> >> > >    1. I'm curious on how you would estimate the effort to port
> >> > datasketches
> >> > >    to Flink? It already has a Java API, but how difficult would it
> be
> >> to
> >> > >    subdivide the tasks into parallel chunks of work? Since it's
> >> already
> >> > > ported
> >> > >    on Pig, I think we could use this port as a baseline.
> >> > >    2. Do you have any idea who is usually driving the adoptions?
> >> > >
> >> > >
> >> > > [1] https://flink-packages.org/
> >> > >
> >> > > On Sun, Apr 26, 2020 at 8:07 AM leerho <lee...@gmail.com> wrote:
> >> > >
> >> > > > Hello All,
> >> > > >
> >> > > > I am a committer on DataSketches.apache.org
> >> > > > <http://datasketches.apache.org/> and just learning about Flink.
> >> > Since
> >> > > > Flink is designed for stateful stream processing I would think it
> >> would
> >> > > > make sense to have the DataSketches library integrated into its
> >> core so
> >> > > all
> >> > > > users of Flink could take advantage of these advanced streaming
> >> > > > algorithms.  If there is interest in the Flink community for this
> >> > > > capability, please contact us at d...@datasketches.apache.org or
> on
> >> our
> >> > > > datasketches-dev Slack channel.
> >> > > > Cheers,
> >> > > > Lee.
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > >
> >> > > Arvid Heise | Senior Java Developer
> >> > >
> >> > > <https://www.ververica.com/>
> >> > >
> >> > > Follow us @VervericaData
> >> > >
> >> > > --
> >> > >
> >> > > Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> >> > > Conference
> >> > >
> >> > > Stream Processing | Event Driven | Real Time
> >> > >
> >> > > --
> >> > >
> >> > > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> >> > >
> >> > > --
> >> > > Ververica GmbH
> >> > > Registered at Amtsgericht Charlottenburg: HRB 158244 B
> >> > > Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason,
> >> Ji
> >> > > (Toni) Cheng
> >> > >
> >>
> >
>
