Thank Fucun for creating the JIRAs!

Regarding the dependency. I see that the current approach is to copy all
> files from Datasketches into a single pile. Is there a better way?

Is there a historical reason that we don't add DataSketches into the
native-toolchain?

Regards,
Quanlong


On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im.fu...@gmail.com> wrote:

> Hi,
>
> 1. Upgrade DataSketches to version 3.1.0
> Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863
>
> 2. Use bytes to serializing sketches
> Impala currently does not support the BINARY data type, we can write
> sketches as binary instead of strings once it's supported:
> https://issues.apache.org/jira/browse/IMPALA-9482.
>
> 3. Regarding deserialization
> Using std::unique_ptr can reduce unnecessary constructor, here is to solve
> the problem that `compact_theta_sketch` cannot be directly constructed.
>
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104
>
> 4. DsHllMerge function optimization
> One solution is to use hll_union instead of hll_sketch, which provides
> support for updating hll_sketch sketches and primitive data types, But
> other union sketches (cpc theta) do not provide primitive type support.
>
> https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581
>
> https://issues.apache.org/jira/browse/IMPALA-10864
>
> Regards.
> Fucun
>
> Alexander Saydakov <sayda...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道:
>
> > Hi Impala development community,
> > I am a member of Apache DataSketches team. I looked at the DataSketches
> > functions in Impala and I have a few questions and suggestions.
> >
> > Regarding the dependency. I see that the current approach is to copy all
> > files from Datasketches into a single pile. Is there a better way?
> >
> > I see that you are using version 3.0.0 currently. We released version
> 3.1.0
> > about a month ago. It has some important bug fixes, and a Theta sketch
> > feature to avoid some cost of deserialization. This feature can speed up
> > Theta union operations.
> >
> > Regarding your approach to serializing sketches. We have two ways of
> > serializing and deserializing sketches: stream and bytes. You are using
> > bytes for deserializing, but a temporary stringstream for serializing
> >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
> > just to copy the internals of this stream as bytes (using memcopy) in the
> > StringStreamToStringVal function
> >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
> > I would think it should be faster to use serialization to bytes, and the
> > StringStreamToStringVal function can be eliminated.
> >
> > Regarding deserialization. I see in some cases that a sketch constructor
> is
> > called just to replace this instance with a deserialized one. This extra
> > construction seems unnecessary.
> >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819
> >
> > Looking at this DsHllMerge function
> >
> >
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
> > it seems that the merge is done pairwise. Is it possible to arrange this
> > process as init, multiple merges and finalize (serialize) at the end? It
> is
> > quite costly to initialize a union, update it with two sketches and then
> > call get_result(). If many such merges happen, the overhead of
> initializing
> > a fresh union and finalizing it for each pair can be substantial.
> >
> > Regards,
> > Alexander.
> >
>

Reply via email to