Thank Fucun for creating the JIRAs! Regarding the dependency. I see that the current approach is to copy all > files from Datasketches into a single pile. Is there a better way?
Is there a historical reason that we don't add DataSketches into the native-toolchain? Regards, Quanlong On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im.fu...@gmail.com> wrote: > Hi, > > 1. Upgrade DataSketches to version 3.1.0 > Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863 > > 2. Use bytes to serializing sketches > Impala currently does not support the BINARY data type, we can write > sketches as binary instead of strings once it's supported: > https://issues.apache.org/jira/browse/IMPALA-9482. > > 3. Regarding deserialization > Using std::unique_ptr can reduce unnecessary constructor, here is to solve > the problem that `compact_theta_sketch` cannot be directly constructed. > > https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104 > > 4. DsHllMerge function optimization > One solution is to use hll_union instead of hll_sketch, which provides > support for updating hll_sketch sketches and primitive data types, But > other union sketches (cpc theta) do not provide primitive type support. > > https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581 > > https://issues.apache.org/jira/browse/IMPALA-10864 > > Regards. > Fucun > > Alexander Saydakov <sayda...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道: > > > Hi Impala development community, > > I am a member of Apache DataSketches team. I looked at the DataSketches > > functions in Impala and I have a few questions and suggestions. > > > > Regarding the dependency. I see that the current approach is to copy all > > files from Datasketches into a single pile. Is there a better way? > > > > I see that you are using version 3.0.0 currently. We released version > 3.1.0 > > about a month ago. It has some important bug fixes, and a Theta sketch > > feature to avoid some cost of deserialization. This feature can speed up > > Theta union operations. > > > > Regarding your approach to serializing sketches. We have two ways of > > serializing and deserializing sketches: stream and bytes. You are using > > bytes for deserializing, but a temporary stringstream for serializing > > > > > https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624 > > just to copy the internals of this stream as bytes (using memcopy) in the > > StringStreamToStringVal function > > > > > https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80 > > I would think it should be faster to use serialization to bytes, and the > > StringStreamToStringVal function can be eliminated. > > > > Regarding deserialization. I see in some cases that a sketch constructor > is > > called just to replace this instance with a deserialized one. This extra > > construction seems unnecessary. > > > > > https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819 > > > > Looking at this DsHllMerge function > > > > > https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759 > > it seems that the merge is done pairwise. Is it possible to arrange this > > process as init, multiple merges and finalize (serialize) at the end? It > is > > quite costly to initialize a union, update it with two sketches and then > > call get_result(). If many such merges happen, the overhead of > initializing > > a fresh union and finalizing it for each pair can be substantial. > > > > Regards, > > Alexander. > > >