Re: Integration of DataSketches into Flink

2020-04-29 Thread leerho
Seth,
Thanks for the enthusiastic reply.

However, I have some questions ... and concerns :)

1) Create a page on the flink packages website.


I looked at this website and it raises a number of red flags for me:

   - There are no instructions anywhere on the site on how to add a listing.
   - The "Login with GitHub" flow raises security concerns, and without any
   explanation:
  - Why would I want or need to authorize this site to have "access to
  my email account"?  Whoa!
  - This site has registered fewer than 100 GitHub users.  That is a
  very small number. It seems a lot of GitHub users have the same concerns
  that I have.
   - The packages listed are "not endorsed by Apache Flink project or
   Ververica.  This site is not affiliated with or released by Apache Flink".
   There is no verification of licensing.
   - In other words, this site carries zero or even negative weight.  Why
   would I want to add a listing for our very high quality and properly
   licensed Apache DataSketches product alongside other listings that are
   possibly junk?


2) Implement TypeInformation for DataSketches


In terms of serialization and deserialization, the sketches in our library
have their own serialization to and from a byte array, which is also
language independent across Java, C++, and Python.  How to transport bytes
from one system to another is system dependent and external to the
DataSketches library.  Some systems use Base64, ProtoBuf, Kryo, Kafka, or
whatever.  As long as we can deserialize (or wrap) the same byte array
that was serialized, we are fine.

If you are asking for metadata about a specific blob of bytes, such as
which sketch created the blob of bytes, we can perhaps do that, but the
documentation is not clear about how much metadata is really required,
because our library does not need it.  So we could use some help here in
defining what is really required.  Be aware that metadata also increases
the storage for an object, and we have worked very hard to keep the stored
size of our sketches very small, because that is one of the key advantages
of using sketches.  This is also why we don't use Java serialization: it is
way too heavy!
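To make the byte-array contract concrete, here is a toy (and entirely invented) example of the pattern: an aggregate that serializes itself to a compact byte array and is reconstructed from the same bytes on the other side, regardless of how the bytes were transported. The ToyMinMaxSketch class and its 24-byte layout are illustrations only, not the real DataSketches binary format.

```java
import java.nio.ByteBuffer;

// Toy "sketch": a running count plus a min and max.  Invented layout:
// 8-byte count, 8-byte min, 8-byte max -- NOT a real DataSketches format.
final class ToyMinMaxSketch {
    long count = 0;
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;

    void update(long item) {
        count++;
        if (item < min) min = item;
        if (item > max) max = item;
    }

    // Serialize to a compact, language-independent byte array.
    byte[] toByteArray() {
        ByteBuffer buf = ByteBuffer.allocate(24);
        buf.putLong(count).putLong(min).putLong(max);
        return buf.array();
    }

    // Reconstruct from the same bytes, however they were transported
    // (Base64 strings, Kafka messages, ProtoBuf payloads, files, ...).
    static ToyMinMaxSketch fromByteArray(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        ToyMinMaxSketch s = new ToyMinMaxSketch();
        s.count = buf.getLong();
        s.min = buf.getLong();
        s.max = buf.getLong();
        return s;
    }
}

public class RoundTripDemo {
    public static void main(String[] args) {
        ToyMinMaxSketch s = new ToyMinMaxSketch();
        for (long i = 1; i <= 100; i++) s.update(i);
        ToyMinMaxSketch restored = ToyMinMaxSketch.fromByteArray(s.toByteArray());
        if (restored.count != 100 || restored.min != 1 || restored.max != 100)
            throw new AssertionError("round trip failed");
        System.out.println("count=" + restored.count
            + " min=" + restored.min + " max=" + restored.max);
    }
}
```

The transport layer never needs to understand the bytes; only the serializing and deserializing ends do.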

3) Implementing Sketch UDFs


Thanks for the references, but this is getting way too deep into the weeds
for me right now.  I would suggest we start simple and then build these
UDFs later, as they seem optional, if I understand your comments correctly.

I would suggest we set up a video call with a couple of your key developers
that could steer us quickly through the options.

Please be aware that we are *extremely* resource limited; Flink is at least
10 times our size, so we could use some help in getting started.  Ideally,
someone in your community who is interested in seeing DataSketches
integrated into Flink would work with us on making it happen.

I am looking forward to working with Flink to make this happen.

Cheers,

Lee.



Re: Integration of DataSketches into Flink

2020-04-27 Thread Seth Wiesman
One more point I forgot to mention.

Flink SQL supports Hive UDFs[1]. I haven't tested it, but the DataSketches
Hive package should just work out of the box.

Seth

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/hive/hive_functions.html


Re: Integration of DataSketches into Flink

2020-04-27 Thread Seth Wiesman
Hi Lee,

I really like this project, I used it with Flink a few years ago when it
was still Yahoo DataSketches. The projects clearly complement each other.
As Arvid mentioned, the Flink community is trying to foster an ecosystem
larger than what is in the main Flink repository. The reason is that the
project has grown to such a scale that it cannot reasonably maintain
everything. To encourage that sort of growth, Flink is extensively
pluggable which means that components do not need to live within the main
repository to be treated first-class.

I'd like to outline some things the DataSketches community could do to
integrate with Flink.

1) Create a page on the flink packages website.

The Flink community hosts a website called flink-packages to increase the
visibility of ecosystem projects with the Flink user base[1]. DataSketches
is usable from Flink today, so I'd encourage you to create a page right
away.

2) Implement TypeInformation for DataSketches

TypeInformation is Flink's internal type system and is used as a factory
for creating serializers for different types. These serializers are what
Flink uses when shuffling data around the cluster and when storing records
in state backends as state. Providing TypeInformation instances for the
different sketch types would just mean wrapping the existing serializers
in the DataSketches codebase. This should be relatively straightforward.
There is no DataStream aggregation API in the way you are describing, so
this is the *only* step you would need to take to provide first-class
support for the Flink DataStream API[2][3].
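Flink's actual TypeInformation and TypeSerializer interfaces carry many more methods (arity, serializer snapshots, etc.), but the factory-plus-serializer shape described here can be sketched with simplified stand-ins. All names below (MiniTypeInfo, MiniSerializer, FakeSketch) are invented for illustration; the key point is that the Flink-side serializer would be a thin wrapper around the sketch's own byte-array form.

```java
import java.io.*;

// Minimal stand-ins for the shape of Flink's TypeInformation /
// TypeSerializer pair (the real interfaces have many more methods).
interface MiniTypeInfo<T> {
    MiniSerializer<T> createSerializer();
}

interface MiniSerializer<T> {
    void serialize(T record, DataOutput out) throws IOException;
    T deserialize(DataInput in) throws IOException;
}

// A stand-in "sketch" whose native form is a byte array, as in
// DataSketches: the serializer just frames length + sketch bytes.
final class FakeSketch {
    final byte[] image;                 // what toByteArray() would return
    FakeSketch(byte[] image) { this.image = image; }
}

final class FakeSketchTypeInfo implements MiniTypeInfo<FakeSketch> {
    public MiniSerializer<FakeSketch> createSerializer() {
        return new MiniSerializer<FakeSketch>() {
            public void serialize(FakeSketch s, DataOutput out) throws IOException {
                out.writeInt(s.image.length);  // delegate to the sketch's own bytes
                out.write(s.image);
            }
            public FakeSketch deserialize(DataInput in) throws IOException {
                byte[] image = new byte[in.readInt()];
                in.readFully(image);
                return new FakeSketch(image);  // would wrap/heapify in a real library
            }
        };
    }
}

public class TypeInfoDemo {
    public static void main(String[] args) throws Exception {
        MiniSerializer<FakeSketch> ser = new FakeSketchTypeInfo().createSerializer();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ser.serialize(new FakeSketch(new byte[]{1, 2, 3}), new DataOutputStream(bos));
        FakeSketch back = ser.deserialize(
            new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        if (back.image.length != 3 || back.image[2] != 3) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Because the sketch already knows how to turn itself into bytes, the wrapper is little more than length framing and delegation.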

3) Implement sketch UDFs

Along with its Java API, Flink also offers a relational API and UDFs. The
community could provide UDFs for DataSketches, as Hive does. Doing so only
requires implementing the aggregation function interface[4]. Flink SQL
offers the concept of modules, which are collections of SQL UDFs that can
easily be loaded into the system[5]. A DataSketches SQL module would
provide a simple way for users to get started and would expose these UDFs
as if they were native to Flink.
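The aggregation-function contract can be sketched with a simplified interface patterned loosely after Flink's AggregateFunction<IN, ACC, OUT> (the real Table API contract differs in detail; these names are invented). The accumulator below is an exact HashSet for clarity; a real DataSketches module would hold a sketch in the accumulator and return an estimate instead.

```java
import java.util.HashSet;
import java.util.Set;

// The shape of an aggregate-function interface, simplified for illustration.
interface MiniAggregateFunction<IN, ACC, OUT> {
    ACC createAccumulator();
    ACC add(IN value, ACC acc);      // consume one input record
    ACC merge(ACC a, ACC b);         // combine partial aggregates
    OUT getResult(ACC acc);          // final answer
}

// A stand-in "distinct count" UDF.  Exact HashSet here; a sketch-backed
// version would trade exactness for constant space and mergeability.
final class DistinctCountUdf
        implements MiniAggregateFunction<String, Set<String>, Long> {
    public Set<String> createAccumulator() { return new HashSet<>(); }
    public Set<String> add(String v, Set<String> acc) { acc.add(v); return acc; }
    public Set<String> merge(Set<String> a, Set<String> b) { a.addAll(b); return a; }
    public Long getResult(Set<String> acc) { return (long) acc.size(); }
}

public class UdfDemo {
    public static void main(String[] args) {
        DistinctCountUdf f = new DistinctCountUdf();
        Set<String> acc1 = f.createAccumulator();
        for (String s : new String[]{"a", "b", "a"}) acc1 = f.add(s, acc1);
        Set<String> acc2 = f.createAccumulator();
        acc2 = f.add("b", acc2);
        acc2 = f.add("c", acc2);
        long result = f.getResult(f.merge(acc1, acc2));  // merge partials
        if (result != 3) throw new AssertionError("expected 3, got " + result);
        System.out.println("distinct = " + result);
    }
}
```

The merge step is what lets the engine compute partial aggregates in parallel and combine them, which is exactly the property sketches are designed to preserve.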

I hope this helps; I look forward to watching the DataSketches community grow!

Seth

[1] https://flink-packages.org/
[2]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/types_serialization.html
[3]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html
[4]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/functions/udfs.html#aggregation-functions
[5]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/modules.html



Re: Integration of DataSketches into Flink

2020-04-27 Thread Flavio Pompermaier
If this can encourage Lee: I'm one of the Flink users who already uses
DataSketches, and I find it an amazing library.
When I was trying it out (last year) I tried to stimulate some discussion[1],
but at that time it was probably too early.
I really hope that now things are mature enough for both communities!

[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-sketches-td26852.html

Best,
Flavio


Re: Integration of DataSketches into Flink

2020-04-27 Thread leerho
Hi Arvid,

Note: I am dual listing this thread on both dev lists for better tracking.

> 1. I'm curious on how you would estimate the effort to port datasketches
> to Flink? It already has a Java API, but how difficult would it be to
> subdivide the tasks into parallel chunks of work? Since it's already
> ported on Pig, I think we could use this port as a baseline.


Most systems (including Druid, Hive, Pig, Spark, PostgreSQL, databases,
streaming platforms, map-reduce platforms, etc.) have some sort of
aggregation API, which allows users to plug in custom aggregation
functions.  Typical functions found in these APIs are Initialize(),
Update() (or Add()), Merge(), and getResult().  How these are named and
how they operate varies considerably from system to system.  These APIs
are sometimes called User Defined Functions (UDFs) or User Defined
Aggregation Functions (UDAFs).

DataSketches is a library of sketching (streaming) aggregation functions,
each of which performs a specific type of aggregation: for example,
counting unique items, determining quantiles and histograms of unknown
distributions, identifying most frequent items (heavy hitters) from a
stream, etc.  The advantage of using DataSketches is that the sketches are
extremely fast, small in size, and have well-defined error properties
established by published scientific papers on the underlying mathematics.
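As a self-contained illustration of what such a sketching aggregation looks like, here is a toy KMV ("k minimum values") distinct-count sketch with the update/merge/estimate lifecycle described above. It is an invented teaching example, far cruder than the real DataSketches theta and HLL sketches.

```java
import java.util.TreeSet;

// Toy KMV sketch: keep the k smallest 64-bit hashes seen.  The k-th
// smallest, as a fraction of the hash space, estimates the distinct count.
final class ToyKmvSketch {
    private final int k;
    private final TreeSet<Long> minHashes = new TreeSet<>();

    ToyKmvSketch(int k) { this.k = k; }

    // splitmix64 finalizer: a well-known 64-bit mixing function.
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    private void insertHash(long h) {
        if (minHashes.size() < k) {
            minHashes.add(h);
        } else if (h < minHashes.last() && minHashes.add(h)) {
            minHashes.pollLast();            // evict largest, keep only k
        }
    }

    void update(long item) { insertHash(mix(item) >>> 1); }  // hash in [0, 2^63)

    // Merging two sketches is just merging their retained hashes.
    void merge(ToyKmvSketch other) {
        for (long h : other.minHashes) insertHash(h);
    }

    double estimate() {
        if (minHashes.size() < k) return minHashes.size();   // exact below k
        double kthFraction = minHashes.last() / Math.pow(2, 63);
        return (k - 1) / kthFraction;        // classic KMV estimator
    }
}

public class KmvDemo {
    public static void main(String[] args) {
        ToyKmvSketch a = new ToyKmvSketch(1024);
        ToyKmvSketch b = new ToyKmvSketch(1024);
        for (long i = 0; i < 60_000; i++) a.update(i);        // 60k distinct
        for (long i = 40_000; i < 100_000; i++) b.update(i);  // overlaps 40k-60k
        a.merge(b);                                           // union: 100k distinct
        double est = a.estimate();
        if (est < 80_000 || est > 120_000)
            throw new AssertionError("estimate off: " + est);
        System.out.println("estimated distinct ~ " + Math.round(est)
            + " (true value 100000)");
    }
}
```

Note that the sketch stores at most k hashes no matter how large the stream is, and merging two sketches yields the sketch of the union, which is what makes this family of algorithms so useful for distributed aggregation.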

The task of porting DataSketches usually consists of developing a thin
wrapper layer that translates the specific UDAF API of Flink to the
equivalent API methods of the targeted sketches in the library.  This is
best done by someone with deep knowledge of the UDAF code of the targeted
system.  We are certainly available to answer questions about the
DataSketches APIs.  Although we did write the UDAF layers for Hive and
Pig, we did that as a proof of concept and an example of how to write such
layers.  We are a small team and are not in a position to support these
integration layers for every system out there.
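The "thin wrapper layer" can be pictured as an adapter from a host system's UDAF hooks onto the sketch's native methods. Both the HostUdaf interface and the LibSketch internals below are invented stand-ins (real hooks such as Hive's GenericUDAF or Flink's aggregate functions differ in detail); the point is how little code the translation layer needs once the sketch does the real work.

```java
// An invented host-system UDAF hook, for illustration only.
interface HostUdaf<IN, OUT> {
    void initialize();
    void update(IN value);
    void merge(HostUdaf<IN, OUT> other);
    OUT getResult();
}

// A stand-in for a library sketch with its own native method names.
// (Toy exact-set internals; a real sketch would be approximate.)
final class LibSketch {
    private final java.util.Set<Long> seen = new java.util.HashSet<>();
    void updateItem(long item) { seen.add(item); }
    void union(LibSketch other) { seen.addAll(other.seen); }
    double getEstimate() { return seen.size(); }
}

// The whole "port": each host hook delegates to the sketch.
final class SketchUdafAdapter implements HostUdaf<Long, Double> {
    private LibSketch sketch;
    public void initialize() { sketch = new LibSketch(); }
    public void update(Long v) { sketch.updateItem(v); }
    public void merge(HostUdaf<Long, Double> other) {
        sketch.union(((SketchUdafAdapter) other).sketch);
    }
    public Double getResult() { return sketch.getEstimate(); }
}

public class AdapterDemo {
    public static void main(String[] args) {
        SketchUdafAdapter p1 = new SketchUdafAdapter();
        SketchUdafAdapter p2 = new SketchUdafAdapter();
        p1.initialize(); p2.initialize();
        for (long i = 0; i < 10; i++) p1.update(i);   // partition 1
        for (long i = 5; i < 15; i++) p2.update(i);   // partition 2, overlapping
        p1.merge(p2);                                  // union of 0..14
        if (p1.getResult() != 15.0) throw new AssertionError();
        System.out.println("distinct = " + p1.getResult());
    }
}
```

The hard part of a port is therefore not code volume but knowing the host system's lifecycle well enough to map each hook correctly.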

2. Do you have any idea who is usually driving the adoptions?


To start, you only need to write the UDAF layer for the sketches that you
think would be in most demand by your users.  The big 4 categories are
distinct (unique) counting, quantiles, frequent-items, and sampling.  This
is a natural way of subdividing the task: choose the sketches you want to
adapt and in what order.  Each sketch is independent so it can be adapted
whenever it is needed.

Please let us know if you have any further questions :)

Lee.





Re: Integration of DataSketches into Flink

2020-04-27 Thread Arvid Heise
Hi Lee,

I must admit that I also heard of data sketches for the first time (there
are really many Apache projects).

Datasketches sounds really exciting. As a (former) data engineer, I can
100% say that this is something that (end-)users want and need, and it
would make so much sense to have it in Flink from the get-go.
Flink, however, is quite an old project already, which grew at a strong
pace, leading to some 150 modules in the core. We are currently in the
process of restructuring that and reducing the number of things in the
core, so that build times and stability improve.

To counter that, we created flink-packages [1], which hosts everything new
that we deem not to be essential. I'd propose incorporating a Flink
datasketch package there. If it seems like it's becoming essential, we can
still move it to the core at a later point.

As I have seen on the page, there are already plenty of adoptions. That
leaves me with a few questions.

   1. I'm curious how you would estimate the effort to port datasketches
   to Flink? It already has a Java API, but how difficult would it be to
   subdivide the tasks into parallel chunks of work? Since it's already
   ported to Pig, I think we could use this port as a baseline.
   2. Do you have any idea who is usually driving the adoptions?


[1] https://flink-packages.org/

On Sun, Apr 26, 2020 at 8:07 AM leerho  wrote:

> Hello All,
>
> I am a committer on DataSketches.apache.org
> and just learning about Flink.  Since
> Flink is designed for stateful stream processing, I would think it would
> make sense to have the DataSketches library integrated into its core so all
> users of Flink could take advantage of these advanced streaming
> algorithms.  If there is interest in the Flink community for this
> capability, please contact us at d...@datasketches.apache.org or on our
> datasketches-dev Slack channel.
> Cheers,
> Lee.
>


-- 

Arvid Heise | Senior Java Developer



Follow us @VervericaData

--

Join Flink Forward  - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
(Toni) Cheng