I am away for a few days. I will have a look soon. Thank you. On Mon, Aug 16, 2021 at 9:44 AM Quanlong Huang <huangquanl...@gmail.com> wrote:
> Thank Fucun for creating the JIRAs! > > Regarding the dependency. I see that the current approach is to copy all >> files from Datasketches into a single pile. Is there a better way? > > Is there a historical reason that we don't add DataSketches into the > native-toolchain? > > Regards, > Quanlong > > > On Mon, Aug 16, 2021 at 3:00 PM fucun chu <im.fu...@gmail.com> wrote: > >> Hi, >> >> 1. Upgrade DataSketches to version 3.1.0 >> Issue tracking: https://issues.apache.org/jira/browse/IMPALA-10863 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10863&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=iBQtxz4UGDBMoYcnKDNYATQynGLPLZBstgSkA277VRo&e=> >> >> 2. Use bytes to serializing sketches >> Impala currently does not support the BINARY data type, we can write >> sketches as binary instead of strings once it's supported: >> https://issues.apache.org/jira/browse/IMPALA-9482 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D9482&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=32MbroPcSDWqe2FIkHG2aRBRa8JaOkG6LWbAFbFVcdY&e=> >> . >> >> 3. Regarding deserialization >> Using std::unique_ptr can reduce unnecessary constructor, here is to solve >> the problem that `compact_theta_sketch` cannot be directly constructed. >> >> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L2104 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L2104&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=i6YX1VlcX-Q-y5JX3UNYOUl_1Xtw9vMeq9teRzx7KQg&e=> >> >> 4. DsHllMerge function optimization >> One solution is to use hll_union instead of hll_sketch, which provides >> support for updating hll_sketch sketches and primitive data types, But >> other union sketches (cpc theta) do not provide primitive type support. >> >> https://github.com/apache/datasketches-cpp/blob/master/hll/include/hll.hpp#L521-L581 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_datasketches-2Dcpp_blob_master_hll_include_hll.hpp-23L521-2DL581&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=qed-km3TAPsXHtV0ITYtnSySRTmWflHeYwqLjVmUfdQ&e=> >> >> https://issues.apache.org/jira/browse/IMPALA-10864 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_IMPALA-2D10864&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=4UbfiVJC5jw0Ig6qpNu7tHzoRUI0A8ZWlqlODyYzWDI&e=> >> >> Regards. >> Fucun >> >> Alexander Saydakov <sayda...@verizonmedia.com> 于2021年8月13日周五 上午4:58写道: >> >> > Hi Impala development community, >> > I am a member of Apache DataSketches team. I looked at the DataSketches >> > functions in Impala and I have a few questions and suggestions. >> > >> > Regarding the dependency. I see that the current approach is to copy all >> > files from Datasketches into a single pile. Is there a better way? >> > >> > I see that you are using version 3.0.0 currently. We released version >> 3.1.0 >> > about a month ago. It has some important bug fixes, and a Theta sketch >> > feature to avoid some cost of deserialization. This feature can speed up >> > Theta union operations. >> > >> > Regarding your approach to serializing sketches. We have two ways of >> > serializing and deserializing sketches: stream and bytes. You are using >> > bytes for deserializing, but a temporary stringstream for serializing >> > >> > >> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1624&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=KXZIqKu1hXSFCI2PLEXAYueCXGUkEyHYU-_kKQLGah0&e=> >> > just to copy the internals of this stream as bytes (using memcopy) in >> the >> > StringStreamToStringVal function >> > >> > >> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_datasketches-2Dcommon.cc-23L80&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=GsWAdaRvX0pzxSsKvulW_VAU4I8cy1OGwabCdU8-7uA&e=> >> > I would think it should be faster to use serialization to bytes, and the >> > StringStreamToStringVal function can be eliminated. >> > >> > Regarding deserialization. I see in some cases that a sketch >> constructor is >> > called just to replace this instance with a deserialized one. This extra >> > construction seems unnecessary. >> > >> > >> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1819&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=5GZQAUF5LKvMP1QcjFxrWXWDQSdX2ICzm7POKMLi-1Y&e=> >> > >> > Looking at this DsHllMerge function >> > >> > >> https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_impala_blob_3d365276ea00f349df3629944b731eb4408d2c4f_be_src_exprs_aggregate-2Dfunctions-2Dir.cc-23L1759&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=0TpvE_u2hS1ubQhK3gLhy94YgZm2k_r8JHJnqgjOXx4&m=2lu5xmqLIi4uz1vevF2ODxyyofcIfYI3Nup18Na-KEo&s=AVmB8Varo7CKBuIObB9AYVGbIyU1Jhmi4Xh7CkLfLJk&e=> >> > it seems that the merge is done pairwise. Is it possible to arrange this >> > process as init, multiple merges and finalize (serialize) at the end? >> It is >> > quite costly to initialize a union, update it with two sketches and then >> > call get_result(). If many such merges happen, the overhead of >> initializing >> > a fresh union and finalizing it for each pair can be substantial. >> > >> > Regards, >> > Alexander. >> > >> >