I submitted a pull request with some changes I tried to explain here.
https://github.com/apache/impala/pull/30

There are still open questions for me regarding:
- better dependency mechanism
- updating dependency to the latest 3.1.0
- process flow in aggregate functions (avoiding overhead of pairwise merge)


On Tue, Aug 24, 2021 at 7:36 PM Alexander Saydakov <
sayda...@verizonmedia.com> wrote:

> I am afraid that I was misunderstood regarding a few points. Let me try to
> clarify.
>
> Regarding serialization using bytes as opposed to a stream. This has
> nothing to do with BINARY data type in Impala.
> Currently I see in the Impala code something like this (simplified):
> std::stringstream tmp;
> sketch.serialize(tmp);
> std::string str = tmp.str(); // in StringStreamToStringVal
> StringVal result(context, str.size());
> memcpy(result.ptr, str.c_str(), str.size());
>
> You could do it faster like this:
> auto bytes = sketch.serialize();
> StringVal result(context, bytes.size());
> memcpy(result.ptr, bytes.data() bytes.size());
>
> Regarding unnecessary constructor during deserialization. I see a code
> like this (HLL is an example, but the pattern is the same):
> datasketches::hll_sketch src_sketch(DS_SKETCH_CONFIG, DS_HLL_TYPE); //
> construct an empty sketch, which is not needed
> DeserializeDsSketch(src, &src_sketch); // pass it into a function, which
> will replace it by an assignment (hopefully a move, not copy)
> // in the function
> *sketch = T::deserialize((void*)serialized_sketch.ptr,
> serialized_sketch.len);
>
> This can be accomplished like so avoiding unnecessary constructor:
> datasketches::hll_sketch src_sketch =
> datasketches::hll_sketch::deserialize((void*)serialized_sketch.ptr,
> serialized_sketch.len);
>
> Regarding merge (not just HLL). I am talking about the overall process
> flow. Looking at this UDA documentation:
>
> https://github.com/apache/impala/blob/b1ca0894462e4b9c040fda41622f8b97b1ec15e1/be/src/udf/udf.h#L340
> it seems to me that it should be possible to have a flow like Init(),
> Merge() ... Merge(), Finalize()
> So why the implementation here finalizes each time?:
>
> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
> Instantiating a union for each pairwise merge and especially calling
> get_result() has a lot of overhead.
>
>
>
> On Wed, Aug 18, 2021 at 12:49 AM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
>
>> Hey Quanlong,
>> Initially I added the first library as copying all the required files
>> into the same dir. It was ~1.5 years ago so my memories are faint but as I
>> remember there was an issue with the library having compilation issues if
>> we kept the structure of the files. As a Quick workaround came the idea to
>> copy the files to a common dir just to unblock us and start working on
>> adding sketching functionalities.
>> This was the first commit: https://gerrit.cloudera.org/#/c/15746/
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__gerrit.cloudera.org_-23_c_15746_&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=Gux0Patz_D0kWZBeFg7ElKCiUah7Q9_PLHogwHpGZuk&e=>
>> We can open a Jira for sure to move the Datasketches library to toolchain
>> and let's see if anyone has some free cycles to take care of that.
>> https://issues.apache.org/jira/browse/IMPALA-10868
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10868&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=ixGxYvTuy2tUFFHn2T7Y8Mhev5G5f2D9_LdIMlWD-ZM&e=>
>>
>> Cheers,
>> Gabor
>>
>> On Mon, Aug 16, 2021 at 3:44 PM Quanlong Huang <huangquanl...@gmail.com>
>> wrote:
>>
>>> Thank Fucun for creating the JIRAs!
>>>
>>> Regarding the dependency. I see that the current approach is to copy all
>>> > files from Datasketches into a single pile. Is there a better way?
>>>
>>> Is there a historical reason that we don't add DataSketches into the
>>> native-toolchain?
>>>
>>> Regards,
>>> Quanlong
>>>
>>>
>>> On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im.fu...@gmail.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > 1. Upgrade DataSketches to version 3.1.0
>>> > Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10863&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=hT4PA0h2zT7dWD90xKlsww9OV5B8BrLbDtUiCfrxc24&e=>
>>> >
>>> > 2. Use bytes to serializing sketches
>>> > Impala currently does not support the BINARY data type, we can write
>>> > sketches as binary instead of strings once it's supported:
>>> > https://issues.apache.org/jira/browse/IMPALA-9482
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D9482&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=XUz0h0DnWnGcE0Y3LQ1efRqEnrkWERW5jbdT9piEmE8&e=>
>>> .
>>> >
>>> > 3. Regarding deserialization
>>> > Using std::unique_ptr can reduce unnecessary constructor, here is to
>>> solve
>>> > the problem that `compact_theta_sketch` cannot be directly constructed.
>>> >
>>> >
>>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L2104&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=a4LLYpJUgaAFcpgL9zFKqHeKgVffcERTdJP2623G-TQ&e=>
>>> >
>>> > 4. DsHllMerge function optimization
>>> > One solution is to use hll_union instead of hll_sketch, which provides
>>> > support for updating hll_sketch sketches and primitive data types, But
>>> > other union sketches (cpc theta) do not provide primitive type support.
>>> >
>>> >
>>> https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dcpp_blob_master_hll_include_hll.hpp-23L521-2DL581&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=lkmVnnuioQT2zQgmBUbcts8c86LUALm2Oj6eSYogqfY&e=>
>>> >
>>> > https://issues.apache.org/jira/browse/IMPALA-10864
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10864&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=rbOAZsy1yN_8nMOmcr4BbDEOS67DSBeE1ArHwnZXsR0&e=>
>>> >
>>> > Regards.
>>> > Fucun
>>> >
>>> > Alexander Saydakov <sayda...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道:
>>> >
>>> > > Hi Impala development community,
>>> > > I am a member of Apache DataSketches team. I looked at the
>>> DataSketches
>>> > > functions in Impala and I have a few questions and suggestions.
>>> > >
>>> > > Regarding the dependency. I see that the current approach is to copy
>>> all
>>> > > files from Datasketches into a single pile. Is there a better way?
>>> > >
>>> > > I see that you are using version 3.0.0 currently. We released version
>>> > 3.1.0
>>> > > about a month ago. It has some important bug fixes, and a Theta
>>> sketch
>>> > > feature to avoid some cost of deserialization. This feature can
>>> speed up
>>> > > Theta union operations.
>>> > >
>>> > > Regarding your approach to serializing sketches. We have two ways of
>>> > > serializing and deserializing sketches: stream and bytes. You are
>>> using
>>> > > bytes for deserializing, but a temporary stringstream for serializing
>>> > >
>>> > >
>>> >
>>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1624&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=gR9Ktn2vYHkI1ey_3E5_C6Ox9DU61vHxn5T6qH8wmHM&e=>
>>> > > just to copy the internals of this stream as bytes (using memcopy)
>>> in the
>>> > > StringStreamToStringVal function
>>> > >
>>> > >
>>> >
>>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_datasketches-2Dcommon.cc-23L80&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=4SRfbNoCr2si6yIXDPrXS6hN__TGtw9U1hbmnj85I1c&e=>
>>> > > I would think it should be faster to use serialization to bytes, and
>>> the
>>> > > StringStreamToStringVal function can be eliminated.
>>> > >
>>> > > Regarding deserialization. I see in some cases that a sketch
>>> constructor
>>> > is
>>> > > called just to replace this instance with a deserialized one. This
>>> extra
>>> > > construction seems unnecessary.
>>> > >
>>> > >
>>> >
>>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1819&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=NbstsM1-dJh1LimtdmtM98ohuL6WtiulY47qGopnmFE&e=>
>>> > >
>>> > > Looking at this DsHllMerge function
>>> > >
>>> > >
>>> >
>>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1759&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=9P5FJbSuxeno178bgZx3UyCSCaWXANsxg9OX4di1U-Q&s=Yj4JWLjD9OCNAuLI2ZcTb-EZj0StX4lJ8ODziVrgbxg&e=>
>>> > > it seems that the merge is done pairwise. Is it possible to arrange
>>> this
>>> > > process as init, multiple merges and finalize (serialize) at the
>>> end? It
>>> > is
>>> > > quite costly to initialize a union, update it with two sketches and
>>> then
>>> > > call get_result(). If many such merges happen, the overhead of
>>> > initializing
>>> > > a fresh union and finalizing it for each pair can be substantial.
>>> > >
>>> > > Regards,
>>> > > Alexander.
>>> > >
>>> >
>>>
>>

Reply via email to