Hi,

1. Upgrade DataSketches to version 3.1.0
Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863

2. Use bytes to serializing sketches
Impala currently does not support the BINARY data type, we can write
sketches as binary instead of strings once it's supported:
https://issues.apache.org/jira/browse/IMPALA-9482.

3. Regarding deserialization
Using std::unique_ptr can reduce unnecessary constructor, here is to solve
the problem that `compact_theta_sketch` cannot be directly constructed.
https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104

4. DsHllMerge function optimization
One solution is to use hll_union instead of hll_sketch, which provides
support for updating hll_sketch sketches and primitive data types, But
other union sketches (cpc theta) do not provide primitive type support.
https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581

https://issues.apache.org/jira/browse/IMPALA-10864

Regards.
Fucun

Alexander Saydakov <sayda...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道:

> Hi Impala development community,
> I am a member of Apache DataSketches team. I looked at the DataSketches
> functions in Impala and I have a few questions and suggestions.
>
> Regarding the dependency. I see that the current approach is to copy all
> files from Datasketches into a single pile. Is there a better way?
>
> I see that you are using version 3.0.0 currently. We released version 3.1.0
> about a month ago. It has some important bug fixes, and a Theta sketch
> feature to avoid some cost of deserialization. This feature can speed up
> Theta union operations.
>
> Regarding your approach to serializing sketches. We have two ways of
> serializing and deserializing sketches: stream and bytes. You are using
> bytes for deserializing, but a temporary stringstream for serializing
>
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
> just to copy the internals of this stream as bytes (using memcopy) in the
> StringStreamToStringVal function
>
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
> I would think it should be faster to use serialization to bytes, and the
> StringStreamToStringVal function can be eliminated.
>
> Regarding deserialization. I see in some cases that a sketch constructor is
> called just to replace this instance with a deserialized one. This extra
> construction seems unnecessary.
>
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819
>
> Looking at this DsHllMerge function
>
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
> it seems that the merge is done pairwise. Is it possible to arrange this
> process as init, multiple merges and finalize (serialize) at the end? It is
> quite costly to initialize a union, update it with two sketches and then
> call get_result(). If many such merges happen, the overhead of initializing
> a fresh union and finalizing it for each pair can be substantial.
>
> Regards,
> Alexander.
>

Reply via email to