Hi, 1. Upgrade DataSketches to version 3.1.0 Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863
2. Use bytes to serializing sketches Impala currently does not support the BINARY data type, we can write sketches as binary instead of strings once it's supported: https://issues.apache.org/jira/browse/IMPALA-9482. 3. Regarding deserialization Using std::unique_ptr can reduce unnecessary constructor, here is to solve the problem that `compact_theta_sketch` cannot be directly constructed. https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104 4. DsHllMerge function optimization One solution is to use hll_union instead of hll_sketch, which provides support for updating hll_sketch sketches and primitive data types, But other union sketches (cpc theta) do not provide primitive type support. https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581 https://issues.apache.org/jira/browse/IMPALA-10864 Regards. Fucun Alexander Saydakov <sayda...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道: > Hi Impala development community, > I am a member of Apache DataSketches team. I looked at the DataSketches > functions in Impala and I have a few questions and suggestions. > > Regarding the dependency. I see that the current approach is to copy all > files from Datasketches into a single pile. Is there a better way? > > I see that you are using version 3.0.0 currently. We released version 3.1.0 > about a month ago. It has some important bug fixes, and a Theta sketch > feature to avoid some cost of deserialization. This feature can speed up > Theta union operations. > > Regarding your approach to serializing sketches. We have two ways of > serializing and deserializing sketches: stream and bytes. You are using > bytes for deserializing, but a temporary stringstream for serializing > > https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624 > just to copy the internals of this stream as bytes (using memcopy) in the > StringStreamToStringVal function > > https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80 > I would think it should be faster to use serialization to bytes, and the > StringStreamToStringVal function can be eliminated. > > Regarding deserialization. I see in some cases that a sketch constructor is > called just to replace this instance with a deserialized one. This extra > construction seems unnecessary. > > https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819 > > Looking at this DsHllMerge function > > https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759 > it seems that the merge is done pairwise. Is it possible to arrange this > process as init, multiple merges and finalize (serialize) at the end? It is > quite costly to initialize a union, update it with two sketches and then > call get_result(). If many such merges happen, the overhead of initializing > a fresh union and finalizing it for each pair can be substantial. > > Regards, > Alexander. >