I submitted a pull request with some changes I tried to explain here. https://github.com/apache/impala/pull/30
There are still open questions for me regarding: - better dependency mechanism - updating dependency to the latest 3.1.0 - process flow in aggregate functions (avoiding overhead of pairwise merge) On Tue, Aug 24, 2021 at 7:36 PM Alexander Saydakov < sayda...@verizonmedia.com> wrote: > I am afraid that I was misunderstood regarding a few points. Let me try to > clarify. > > Regarding serialization using bytes as opposed to a stream. This has > nothing to do with BINARY data type in Impala. > Currently I see in the Impala code something like this (simplified): > std::stringstream tmp; > sketch.serialize(tmp); > std::string str = tmp.str(); // in StringStreamToStringVal > StringVal result(context, str.size()); > memcpy(result.ptr, str.c_str(), str.size()); > > You could do it faster like this: > auto bytes = sketch.serialize(); > StringVal result(context, bytes.size()); > memcpy(result.ptr, bytes.data() bytes.size()); > > Regarding unnecessary constructor during deserialization. I see a code > like this (HLL is an example, but the pattern is the same): > datasketches::hll_sketch src_sketch(DS_SKETCH_CONFIG, DS_HLL_TYPE); // > construct an empty sketch, which is not needed > DeserializeDsSketch(src, &src_sketch); // pass it into a function, which > will replace it by an assignment (hopefully a move, not copy) > // in the function > *sketch = T::deserialize((void*)serialized_sketch.ptr, > serialized_sketch.len); > > This can be accomplished like so avoiding unnecessary constructor: > datasketches::hll_sketch src_sketch = > datasketches::hll_sketch::deserialize((void*)serialized_sketch.ptr, > serialized_sketch.len); > > Regarding merge (not just HLL). I am talking about the overall process > flow. Looking at this UDA documentation: > > https://github.com/apache/impala/blob/b1ca0894462e4b9c040fda41622f8b97b1ec15e1/be/src/udf/udf.h#L340 > it seems to me that it should be possible to have a flow like Init(), > Merge() ... Merge(), Finalize() > So why the implementation here finalizes each time?: > > https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759 > Instantiating a union for each pairwise merge and especially calling > get_result() has a lot of overhead. > > > > On Wed, Aug 18, 2021 at 12:49 AM Gabor Kaszab <gaborkas...@apache.org> > wrote: > >> Hey Quanlong, >> Initially I added the first library as copying all the required files >> into the same dir. It was ~1.5 years ago so my memories are faint but as I >> remember there was an issue with the library having compilation issues if >> we kept the structure of the files. As a Quick workaround came the idea to >> copy the files to a common dir just to unblock us and start working on >> adding sketching functionalities. >> This was the first commit: https://gerrit.cloudera.org/#/c/15746/ >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__gerrit.cloudera.org_-23_c_15746_&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=Gux0Patz_D0kWZBeFg7ElKCiUah7Q9_PLHogwHpGZuk&e=> >> We can open a Jira for sure to move the Datasketches library to toolchain >> and let's see if anyone has some free cycles to take care of that. >> https://issues.apache.org/jira/browse/IMPALA-10868 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10868&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=ixGxYvTuy2tUFFHn2T7Y8Mhev5G5f2D9_LdIMlWD-ZM&e=> >> >> Cheers, >> Gabor >> >> On Mon, Aug 16, 2021 at 3:44 PM Quanlong Huang <huangquanl...@gmail.com> >> wrote: >> >>> Thank Fucun for creating the JIRAs! >>> >>> Regarding the dependency. I see that the current approach is to copy all >>> > files from Datasketches into a single pile. Is there a better way? >>> >>> Is there a historical reason that we don't add DataSketches into the >>> native-toolchain? >>> >>> Regards, >>> Quanlong >>> >>> >>> On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im.fu...@gmail.com> wrote: >>> >>> > Hi, >>> > >>> > 1. Upgrade DataSketches to version 3.1.0 >>> > Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863 >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10863&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=hT4PA0h2zT7dWD90xKlsww9OV5B8BrLbDtUiCfrxc24&e=> >>> > >>> > 2. Use bytes to serializing sketches >>> > Impala currently does not support the BINARY data type, we can write >>> > sketches as binary instead of strings once it's supported: >>> > https://issues.apache.org/jira/browse/IMPALA-9482 >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D9482&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=XUz0h0DnWnGcE0Y3LQ1efRqEnrkWERW5jbdT9piEmE8&e=> >>> . >>> > >>> > 3. Regarding deserialization >>> > Using std::unique_ptr can reduce unnecessary constructor, here is to >>> solve >>> > the problem that `compact_theta_sketch` cannot be directly constructed. >>> > >>> > >>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104 >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L2104&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=a4LLYpJUgaAFcpgL9zFKqHeKgVffcERTdJP2623G-TQ&e=> >>> > >>> > 4. DsHllMerge function optimization >>> > One solution is to use hll_union instead of hll_sketch, which provides >>> > support for updating hll_sketch sketches and primitive data types, But >>> > other union sketches (cpc theta) do not provide primitive type support. >>> > >>> > >>> https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581 >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dcpp_blob_master_hll_include_hll.hpp-23L521-2DL581&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=lkmVnnuioQT2zQgmBUbcts8c86LUALm2Oj6eSYogqfY&e=> >>> > >>> > https://issues.apache.org/jira/browse/IMPALA-10864 >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10864&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=rbOAZsy1yN_8nMOmcr4BbDEOS67DSBeE1ArHwnZXsR0&e=> >>> > >>> > Regards. >>> > Fucun >>> > >>> > Alexander Saydakov <sayda...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道: >>> > >>> > > Hi Impala development community, >>> > > I am a member of Apache DataSketches team. I looked at the >>> DataSketches >>> > > functions in Impala and I have a few questions and suggestions. >>> > > >>> > > Regarding the dependency. I see that the current approach is to copy >>> all >>> > > files from Datasketches into a single pile. Is there a better way? >>> > > >>> > > I see that you are using version 3.0.0 currently. We released version >>> > 3.1.0 >>> > > about a month ago. It has some important bug fixes, and a Theta >>> sketch >>> > > feature to avoid some cost of deserialization. This feature can >>> speed up >>> > > Theta union operations. >>> > > >>> > > Regarding your approach to serializing sketches. We have two ways of >>> > > serializing and deserializing sketches: stream and bytes. You are >>> using >>> > > bytes for deserializing, but a temporary stringstream for serializing >>> > > >>> > > >>> > >>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624 >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1624&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=gR9Ktn2vYHkI1ey_3E5_C6Ox9DU61vHxn5T6qH8wmHM&e=> >>> > > just to copy the internals of this stream as bytes (using memcopy) >>> in the >>> > > StringStreamToStringVal function >>> > > >>> > > >>> > >>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80 >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_datasketches-2Dcommon.cc-23L80&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=4SRfbNoCr2si6yIXDPrXS6hN__TGtw9U1hbmnj85I1c&e=> >>> > > I would think it should be faster to use serialization to bytes, and >>> the >>> > > StringStreamToStringVal function can be eliminated. >>> > > >>> > > Regarding deserialization. I see in some cases that a sketch >>> constructor >>> > is >>> > > called just to replace this instance with a deserialized one. This >>> extra >>> > > construction seems unnecessary. >>> > > >>> > > >>> > >>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819 >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1819&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=NbstsM1-dJh1LimtdmtM98ohuL6WtiulY47qGopnmFE&e=> >>> > > >>> > > Looking at this DsHllMerge function >>> > > >>> > > >>> > >>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759 >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1759&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=Yj4JWLjD9OCNAuLI2ZcTb-EZj0StX4lJ8ODziVrgbxg&e=> >>> > > it seems that the merge is done pairwise. Is it possible to arrange >>> this >>> > > process as init, multiple merges and finalize (serialize) at the >>> end? It >>> > is >>> > > quite costly to initialize a union, update it with two sketches and >>> then >>> > > call get_result(). If many such merges happen, the overhead of >>> > initializing >>> > > a fresh union and finalizing it for each pair can be substantial. >>> > > >>> > > Regards, >>> > > Alexander. >>> > > >>> > >>> >>