I am away for a few days. I will have a look soon.
Thank you.

On Mon, Aug 16, 2021 at 9:44 AM Quanlong Huang <huangquanl...@gmail.com>
wrote:

> Thank Fucun for creating the JIRAs!
>
> Regarding the dependency. I see that the current approach is to copy all
>> files from Datasketches into a single pile. Is there a better way?
>
> Is there a historical reason that we don't add DataSketches into the
> native-toolchain?
>
> Regards,
> Quanlong
>
>
> On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im.fu...@gmail.com> wrote:
>
>> Hi,
>>
>> 1. Upgrade DataSketches to version 3.1.0
>> Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10863&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=iBQtxz4UGDBMoYcnKDNYATQynGLPLZBstgSkA277VRo&e=>
>>
>> 2. Use bytes to serializing sketches
>> Impala currently does not support the BINARY data type, we can write
>> sketches as binary instead of strings once it's supported:
>> https://issues.apache.org/jira/browse/IMPALA-9482
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D9482&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=32MbroPcSDWqe2FIkHG2aRBRa8JaOkG6LWbAFbFVcdY&e=>
>> .
>>
>> 3. Regarding deserialization
>> Using std::unique_ptr can reduce unnecessary constructor, here is to solve
>> the problem that `compact_theta_sketch` cannot be directly constructed.
>>
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L2104&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=i6YX1VlcX-Q-y5JX3UNYOUl_1Xtw9vMeq9teRzx7KQg&e=>
>>
>> 4. DsHllMerge function optimization
>> One solution is to use hll_union instead of hll_sketch, which provides
>> support for updating hll_sketch sketches and primitive data types, But
>> other union sketches (cpc theta) do not provide primitive type support.
>>
>> https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dcpp_blob_master_hll_include_hll.hpp-23L521-2DL581&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=qed-km3TAPsXHtV0ITYtnSySRTmWflHeYwqLjVmUfdQ&e=>
>>
>> https://issues.apache.org/jira/browse/IMPALA-10864
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10864&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=4UbfiVJC5jw0Ig6qpNu7tHzoRUI0A8ZWlqlODyYzWDI&e=>
>>
>> Regards.
>> Fucun
>>
>> Alexander Saydakov <sayda...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道:
>>
>> > Hi Impala development community,
>> > I am a member of Apache DataSketches team. I looked at the DataSketches
>> > functions in Impala and I have a few questions and suggestions.
>> >
>> > Regarding the dependency. I see that the current approach is to copy all
>> > files from Datasketches into a single pile. Is there a better way?
>> >
>> > I see that you are using version 3.0.0 currently. We released version
>> 3.1.0
>> > about a month ago. It has some important bug fixes, and a Theta sketch
>> > feature to avoid some cost of deserialization. This feature can speed up
>> > Theta union operations.
>> >
>> > Regarding your approach to serializing sketches. We have two ways of
>> > serializing and deserializing sketches: stream and bytes. You are using
>> > bytes for deserializing, but a temporary stringstream for serializing
>> >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1624&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=KXZIqKu1hXSFCI2PLEXAYueCXGUkEyHYU-_kKQLGah0&e=>
>> > just to copy the internals of this stream as bytes (using memcopy) in
>> the
>> > StringStreamToStringVal function
>> >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_datasketches-2Dcommon.cc-23L80&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=GsWAdaRvX0pzxSsKvulW_VAU4I8cy1OGwabCdU8-7uA&e=>
>> > I would think it should be faster to use serialization to bytes, and the
>> > StringStreamToStringVal function can be eliminated.
>> >
>> > Regarding deserialization. I see in some cases that a sketch
>> constructor is
>> > called just to replace this instance with a deserialized one. This extra
>> > construction seems unnecessary.
>> >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1819&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=5GZQAUF5LKvMP1QcjFxrWXWDQSdX2ICzm7POKMLi-1Y&e=>
>> >
>> > Looking at this DsHllMerge function
>> >
>> >
>> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1759&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=AVmB8Varo7CKBuIObB9AYVGbIyU1Jhmi4Xh7CkLfLJk&e=>
>> > it seems that the merge is done pairwise. Is it possible to arrange this
>> > process as init, multiple merges and finalize (serialize) at the end?
>> It is
>> > quite costly to initialize a union, update it with two sketches and then
>> > call get_result(). If many such merges happen, the overhead of
>> initializing
>> > a fresh union and finalizing it for each pair can be substantial.
>> >
>> > Regards,
>> > Alexander.
>> >
>>
>

Reply via email to