I am afraid that I was misunderstood regarding a few points. Let me try to
clarify.

Regarding serialization using bytes as opposed to a stream. This has
nothing to do with BINARY data type in Impala.
Currently I see in the Impala code something like this (simplified):
std::stringstream tmp;
sketch.serialize(tmp);
std::string str = tmp.str(); // in StringStreamToStringVal
StringVal result(context, str.size());
memcpy(result.ptr, str.c_str(), str.size());

You could do it faster like this:
auto bytes = sketch.serialize();
StringVal result(context, bytes.size());
memcpy(result.ptr, bytes.data() bytes.size());

Regarding unnecessary constructor during deserialization. I see a code like
this (HLL is an example, but the pattern is the same):
datasketches::hll_sketch src_sketch(DS_SKETCH_CONFIG, DS_HLL_TYPE); //
construct an empty sketch, which is not needed
DeserializeDsSketch(src, &src_sketch); // pass it into a function, which
will replace it by an assignment (hopefully a move, not copy)
// in the function
*sketch = T::deserialize((void*)serialized_sketch.ptr,
serialized_sketch.len);

This can be accomplished like so avoiding unnecessary constructor:
datasketches::hll_sketch src_sketch =
datasketches::hll_sketch::deserialize((void*)serialized_sketch.ptr,
serialized_sketch.len);

Regarding merge (not just HLL). I am talking about the overall process
flow. Looking at this UDA documentation:
https://github.com/apache/impala/blob/b1ca0894462e4b9c040fda41622f8b97b1ec15e1/be/src/udf/udf.h#L340
it seems to me that it should be possible to have a flow like Init(),
Merge() ... Merge(), Finalize()
So why the implementation here finalizes each time?:
https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
Instantiating a union for each pairwise merge and especially calling
get_result() has a lot of overhead.



On Wed, Aug 18, 2021 at 12:49 AM Gabor Kaszab <gaborkas...@apache.org>
wrote:

> Hey Quanlong,
> Initially I added the first library as copying all the required files into
> the same dir. It was ~1.5 years ago so my memories are faint but as I
> remember there was an issue with the library having compilation issues if
> we kept the structure of the files. As a Quick workaround came the idea to
> copy the files to a common dir just to unblock us and start working on
> adding sketching functionalities.
> This was the first commit: https://gerrit.cloudera.org/#/c/15746/
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__gerrit.cloudera.org_-23_c_15746_&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=Gux0Patz_D0kWZBeFg7ElKCiUah7Q9_PLHogwHpGZuk&e=>
> We can open a Jira for sure to move the Datasketches library to toolchain
> and let's see if anyone has some free cycles to take care of that.
> https://issues.apache.org/jira/browse/IMPALA-10868
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10868&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=ixGxYvTuy2tUFFHn2T7Y8Mhev5G5f2D9_LdIMlWD-ZM&e=>
>
> Cheers,
> Gabor
>
> On Mon, Aug 16, 2021 at 3:44 PM Quanlong Huang <huangquanl...@gmail.com>
> wrote:
>
>> Thank Fucun for creating the JIRAs!
>>
>> Regarding the dependency. I see that the current approach is to copy all
>> > files from Datasketches into a single pile. Is there a better way?
>>
>> Is there a historical reason that we don't add DataSketches into the
>> native-toolchain?
>>
>> Regards,
>> Quanlong
>>
>>
>> On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im.fu...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > 1. Upgrade DataSketches to version 3.1.0
>> > Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10863&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=hT4PA0h2zT7dWD90xKlsww9OV5B8BrLbDtUiCfrxc24&e=>
>> >
>> > 2. Use bytes to serializing sketches
>> > Impala currently does not support the BINARY data type, we can write
>> > sketches as binary instead of strings once it's supported:
>> > https://issues.apache.org/jira/browse/IMPALA-9482
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D9482&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=XUz0h0DnWnGcE0Y3LQ1efRqEnrkWERW5jbdT9piEmE8&e=>
>> .
>> >
>> > 3. Regarding deserialization
>> > Using std::unique_ptr can reduce unnecessary constructor, here is to
>> solve
>> > the problem that `compact_theta_sketch` cannot be directly constructed.
>> >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L2104&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=a4LLYpJUgaAFcpgL9zFKqHeKgVffcERTdJP2623G-TQ&e=>
>> >
>> > 4. DsHllMerge function optimization
>> > One solution is to use hll_union instead of hll_sketch, which provides
>> > support for updating hll_sketch sketches and primitive data types, But
>> > other union sketches (cpc theta) do not provide primitive type support.
>> >
>> >
>> https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dcpp_blob_master_hll_include_hll.hpp-23L521-2DL581&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=lkmVnnuioQT2zQgmBUbcts8c86LUALm2Oj6eSYogqfY&e=>
>> >
>> > https://issues.apache.org/jira/browse/IMPALA-10864
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10864&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=rbOAZsy1yN_8nMOmcr4BbDEOS67DSBeE1ArHwnZXsR0&e=>
>> >
>> > Regards.
>> > Fucun
>> >
>> > Alexander Saydakov <sayda...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道:
>> >
>> > > Hi Impala development community,
>> > > I am a member of Apache DataSketches team. I looked at the
>> DataSketches
>> > > functions in Impala and I have a few questions and suggestions.
>> > >
>> > > Regarding the dependency. I see that the current approach is to copy
>> all
>> > > files from Datasketches into a single pile. Is there a better way?
>> > >
>> > > I see that you are using version 3.0.0 currently. We released version
>> > 3.1.0
>> > > about a month ago. It has some important bug fixes, and a Theta sketch
>> > > feature to avoid some cost of deserialization. This feature can speed
>> up
>> > > Theta union operations.
>> > >
>> > > Regarding your approach to serializing sketches. We have two ways of
>> > > serializing and deserializing sketches: stream and bytes. You are
>> using
>> > > bytes for deserializing, but a temporary stringstream for serializing
>> > >
>> > >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1624&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=gR9Ktn2vYHkI1ey_3E5_C6Ox9DU61vHxn5T6qH8wmHM&e=>
>> > > just to copy the internals of this stream as bytes (using memcopy) in
>> the
>> > > StringStreamToStringVal function
>> > >
>> > >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_datasketches-2Dcommon.cc-23L80&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=4SRfbNoCr2si6yIXDPrXS6hN__TGtw9U1hbmnj85I1c&e=>
>> > > I would think it should be faster to use serialization to bytes, and
>> the
>> > > StringStreamToStringVal function can be eliminated.
>> > >
>> > > Regarding deserialization. I see in some cases that a sketch
>> constructor
>> > is
>> > > called just to replace this instance with a deserialized one. This
>> extra
>> > > construction seems unnecessary.
>> > >
>> > >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1819&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=NbstsM1-dJh1LimtdmtM98ohuL6WtiulY47qGopnmFE&e=>
>> > >
>> > > Looking at this DsHllMerge function
>> > >
>> > >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1759&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=Yj4JWLjD9OCNAuLI2ZcTb-EZj0StX4lJ8ODziVrgbxg&e=>
>> > > it seems that the merge is done pairwise. Is it possible to arrange
>> this
>> > > process as init, multiple merges and finalize (serialize) at the end?
>> It
>> > is
>> > > quite costly to initialize a union, update it with two sketches and
>> then
>> > > call get_result(). If many such merges happen, the overhead of
>> > initializing
>> > > a fresh union and finalizing it for each pair can be substantial.
>> > >
>> > > Regards,
>> > > Alexander.
>> > >
>> >
>>
>

Reply via email to