I am afraid that I was misunderstood regarding a few points. Let me try to clarify.
Regarding serialization using bytes as opposed to a stream. This has nothing to do with BINARY data type in Impala. Currently I see in the Impala code something like this (simplified): std::stringstream tmp; sketch.serialize(tmp); std::string str = tmp.str(); // in StringStreamToStringVal StringVal result(context, str.size()); memcpy(result.ptr, str.c_str(), str.size()); You could do it faster like this: auto bytes = sketch.serialize(); StringVal result(context, bytes.size()); memcpy(result.ptr, bytes.data() bytes.size()); Regarding unnecessary constructor during deserialization. I see a code like this (HLL is an example, but the pattern is the same): datasketches::hll_sketch src_sketch(DS_SKETCH_CONFIG, DS_HLL_TYPE); // construct an empty sketch, which is not needed DeserializeDsSketch(src, &src_sketch); // pass it into a function, which will replace it by an assignment (hopefully a move, not copy) // in the function *sketch = T::deserialize((void*)serialized_sketch.ptr, serialized_sketch.len); This can be accomplished like so avoiding unnecessary constructor: datasketches::hll_sketch src_sketch = datasketches::hll_sketch::deserialize((void*)serialized_sketch.ptr, serialized_sketch.len); Regarding merge (not just HLL). I am talking about the overall process flow. Looking at this UDA documentation: https://github.com/apache/impala/blob/b1ca0894462e4b9c040fda41622f8b97b1ec15e1/be/src/udf/udf.h#L340 it seems to me that it should be possible to have a flow like Init(), Merge() ... Merge(), Finalize() So why the implementation here finalizes each time?: https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759 Instantiating a union for each pairwise merge and especially calling get_result() has a lot of overhead. On Wed, Aug 18, 2021 at 12:49 AM Gabor Kaszab <gaborkas...@apache.org> wrote: > Hey Quanlong, > Initially I added the first library as copying all the required files into > the same dir. It was ~1.5 years ago so my memories are faint but as I > remember there was an issue with the library having compilation issues if > we kept the structure of the files. As a Quick workaround came the idea to > copy the files to a common dir just to unblock us and start working on > adding sketching functionalities. > This was the first commit: https://gerrit.cloudera.org/#/c/15746/ > <https://urldefense.proofpoint.com/v2/url?u=https-3A__gerrit.cloudera.org_-23_c_15746_&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=Gux0Patz_D0kWZBeFg7ElKCiUah7Q9_PLHogwHpGZuk&e=> > We can open a Jira for sure to move the Datasketches library to toolchain > and let's see if anyone has some free cycles to take care of that. > https://issues.apache.org/jira/browse/IMPALA-10868 > <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10868&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=ixGxYvTuy2tUFFHn2T7Y8Mhev5G5f2D9_LdIMlWD-ZM&e=> > > Cheers, > Gabor > > On Mon, Aug 16, 2021 at 3:44 PM Quanlong Huang <huangquanl...@gmail.com> > wrote: > >> Thank Fucun for creating the JIRAs! >> >> Regarding the dependency. I see that the current approach is to copy all >> > files from Datasketches into a single pile. Is there a better way? >> >> Is there a historical reason that we don't add DataSketches into the >> native-toolchain? >> >> Regards, >> Quanlong >> >> >> On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im.fu...@gmail.com> wrote: >> >> > Hi, >> > >> > 1. Upgrade DataSketches to version 3.1.0 >> > Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10863&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=hT4PA0h2zT7dWD90xKlsww9OV5B8BrLbDtUiCfrxc24&e=> >> > >> > 2. Use bytes to serializing sketches >> > Impala currently does not support the BINARY data type, we can write >> > sketches as binary instead of strings once it's supported: >> > https://issues.apache.org/jira/browse/IMPALA-9482 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D9482&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=XUz0h0DnWnGcE0Y3LQ1efRqEnrkWERW5jbdT9piEmE8&e=> >> . >> > >> > 3. Regarding deserialization >> > Using std::unique_ptr can reduce unnecessary constructor, here is to >> solve >> > the problem that `compact_theta_sketch` cannot be directly constructed. >> > >> > >> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L2104&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=a4LLYpJUgaAFcpgL9zFKqHeKgVffcERTdJP2623G-TQ&e=> >> > >> > 4. DsHllMerge function optimization >> > One solution is to use hll_union instead of hll_sketch, which provides >> > support for updating hll_sketch sketches and primitive data types, But >> > other union sketches (cpc theta) do not provide primitive type support. >> > >> > >> https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dcpp_blob_master_hll_include_hll.hpp-23L521-2DL581&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=lkmVnnuioQT2zQgmBUbcts8c86LUALm2Oj6eSYogqfY&e=> >> > >> > https://issues.apache.org/jira/browse/IMPALA-10864 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10864&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=rbOAZsy1yN_8nMOmcr4BbDEOS67DSBeE1ArHwnZXsR0&e=> >> > >> > Regards. >> > Fucun >> > >> > Alexander Saydakov <sayda...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道: >> > >> > > Hi Impala development community, >> > > I am a member of Apache DataSketches team. I looked at the >> DataSketches >> > > functions in Impala and I have a few questions and suggestions. >> > > >> > > Regarding the dependency. I see that the current approach is to copy >> all >> > > files from Datasketches into a single pile. Is there a better way? >> > > >> > > I see that you are using version 3.0.0 currently. We released version >> > 3.1.0 >> > > about a month ago. It has some important bug fixes, and a Theta sketch >> > > feature to avoid some cost of deserialization. This feature can speed >> up >> > > Theta union operations. >> > > >> > > Regarding your approach to serializing sketches. We have two ways of >> > > serializing and deserializing sketches: stream and bytes. You are >> using >> > > bytes for deserializing, but a temporary stringstream for serializing >> > > >> > > >> > >> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1624&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=gR9Ktn2vYHkI1ey_3E5_C6Ox9DU61vHxn5T6qH8wmHM&e=> >> > > just to copy the internals of this stream as bytes (using memcopy) in >> the >> > > StringStreamToStringVal function >> > > >> > > >> > >> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_datasketches-2Dcommon.cc-23L80&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=4SRfbNoCr2si6yIXDPrXS6hN__TGtw9U1hbmnj85I1c&e=> >> > > I would think it should be faster to use serialization to bytes, and >> the >> > > StringStreamToStringVal function can be eliminated. >> > > >> > > Regarding deserialization. I see in some cases that a sketch >> constructor >> > is >> > > called just to replace this instance with a deserialized one. This >> extra >> > > construction seems unnecessary. >> > > >> > > >> > >> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1819&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=NbstsM1-dJh1LimtdmtM98ohuL6WtiulY47qGopnmFE&e=> >> > > >> > > Looking at this DsHllMerge function >> > > >> > > >> > >> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1759&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=Yj4JWLjD9OCNAuLI2ZcTb-EZj0StX4lJ8ODziVrgbxg&e=> >> > > it seems that the merge is done pairwise. Is it possible to arrange >> this >> > > process as init, multiple merges and finalize (serialize) at the end? >> It >> > is >> > > quite costly to initialize a union, update it with two sketches and >> then >> > > call get_result(). If many such merges happen, the overhead of >> > initializing >> > > a fresh union and finalizing it for each pair can be substantial. >> > > >> > > Regards, >> > > Alexander. >> > > >> > >> >