Hey Quanlong,
Initially I added the first library as copying all the required files into
the same dir. It was ~1.5 years ago so my memories are faint but as I
remember there was an issue with the library having compilation issues if
we kept the structure of the files. As a Quick workaround came the idea to
copy the files to a common dir just to unblock us and start working on
adding sketching functionalities.
This was the first commit: https://gerrit.cloudera.org/#/c/15746/
We can open a Jira for sure to move the Datasketches library to toolchain
and let's see if anyone has some free cycles to take care of that.
https://issues.apache.org/jira/browse/IMPALA-10868

Cheers,
Gabor

On Mon, Aug 16, 2021 at 3:44 PM Quanlong Huang <huangquanl...@gmail.com>
wrote:

> Thank Fucun for creating the JIRAs!
>
> Regarding the dependency. I see that the current approach is to copy all
> > files from Datasketches into a single pile. Is there a better way?
>
> Is there a historical reason that we don't add DataSketches into the
> native-toolchain?
>
> Regards,
> Quanlong
>
>
> On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im.fu...@gmail.com> wrote:
>
> > Hi,
> >
> > 1. Upgrade DataSketches to version 3.1.0
> > Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863
> >
> > 2. Use bytes to serializing sketches
> > Impala currently does not support the BINARY data type, we can write
> > sketches as binary instead of strings once it's supported:
> > https://issues.apache.org/jira/browse/IMPALA-9482.
> >
> > 3. Regarding deserialization
> > Using std::unique_ptr can reduce unnecessary constructor, here is to
> solve
> > the problem that `compact_theta_sketch` cannot be directly constructed.
> >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104
> >
> > 4. DsHllMerge function optimization
> > One solution is to use hll_union instead of hll_sketch, which provides
> > support for updating hll_sketch sketches and primitive data types, But
> > other union sketches (cpc theta) do not provide primitive type support.
> >
> >
> https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581
> >
> > https://issues.apache.org/jira/browse/IMPALA-10864
> >
> > Regards.
> > Fucun
> >
> > Alexander Saydakov <sayda...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道:
> >
> > > Hi Impala development community,
> > > I am a member of Apache DataSketches team. I looked at the DataSketches
> > > functions in Impala and I have a few questions and suggestions.
> > >
> > > Regarding the dependency. I see that the current approach is to copy
> all
> > > files from Datasketches into a single pile. Is there a better way?
> > >
> > > I see that you are using version 3.0.0 currently. We released version
> > 3.1.0
> > > about a month ago. It has some important bug fixes, and a Theta sketch
> > > feature to avoid some cost of deserialization. This feature can speed
> up
> > > Theta union operations.
> > >
> > > Regarding your approach to serializing sketches. We have two ways of
> > > serializing and deserializing sketches: stream and bytes. You are using
> > > bytes for deserializing, but a temporary stringstream for serializing
> > >
> > >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
> > > just to copy the internals of this stream as bytes (using memcopy) in
> the
> > > StringStreamToStringVal function
> > >
> > >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
> > > I would think it should be faster to use serialization to bytes, and
> the
> > > StringStreamToStringVal function can be eliminated.
> > >
> > > Regarding deserialization. I see in some cases that a sketch
> constructor
> > is
> > > called just to replace this instance with a deserialized one. This
> extra
> > > construction seems unnecessary.
> > >
> > >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819
> > >
> > > Looking at this DsHllMerge function
> > >
> > >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
> > > it seems that the merge is done pairwise. Is it possible to arrange
> this
> > > process as init, multiple merges and finalize (serialize) at the end?
> It
> > is
> > > quite costly to initialize a union, update it with two sketches and
> then
> > > call get_result(). If many such merges happen, the overhead of
> > initializing
> > > a fresh union and finalizing it for each pair can be substantial.
> > >
> > > Regards,
> > > Alexander.
> > >
> >
>

Reply via email to