Hi Gabor,

> My quick question would be that taking into account that the order of the
> items provided to datasketches:hll_sketch is not deterministic is it normal
> behaviour that for the same dataset I get a different estimate each time I
> run my query?
> I'm trying to figure out if this is due to some issues with my code or
> normal characteristics of the C++ library of DataSketches.

Please refer to our documentation, specifically where we discuss Data Insensitivity <https://datasketches.apache.org/docs/Architecture/SketchCriteria.html> and Order Sensitivity <https://datasketches.apache.org/docs/Architecture/OrderSensitivity.html>.

In general, because sketches are probabilistic algorithms and often depend on internal randomization, Absolute Order Insensitivity (AOI) is not guaranteed. What is guaranteed is that the result will be within the specified error bounds with the specified confidence, which is what we call Bounded Order Insensitivity (BOI). So yes, getting a somewhat different estimate on each run over the same dataset is normal behaviour when the input order changes, as long as each estimate stays within those bounds.

Some of the sketch algorithms can be AOI under certain conditions, but we do not recommend that you depend on that in your testing. Instead of comparing a new estimate for exact equality with some previous estimate, it is better to check that the new estimate is within the specified error bounds and confidence interval for that sketch and its configuration.

> My second question would be that in case Hive uses the Hive connectors from
> DataSketches and Impala uses the provided C++ library is it guaranteed that
> whatever sketch is written by any of these systems it can be correctly read
> with the other? I see binary compatibility mentioned on the official web
> page just wanted to double check if there are any exceptions to this.

We do our best to guarantee "binary compatibility" across C++, Java and Python, and we are doing a lot of cross-language testing to ensure that. What this means, for example, is that a sketch generated in C++ and serialized to its binary image can be deserialized and read in Java, or vice versa.

Note that our fundamental serialization format is an array of bytes. It is up to the specific environment to choose how to transport this array of bytes to other systems without corruption. Typical transport schemes include Base64, Kafka, ProtoBuf, etc.

I am pleased that you are integrating DataSketches into Impala. Please continue to post questions to us if you need further help!

Lee.
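
The two recommendations in the reply above (test against error bounds rather than for exact equality, and move the serialized byte array through a transport that cannot corrupt it) can be illustrated with a minimal sketch. It uses only the Python standard library and does not call DataSketches at all: the function names, the numeric bounds, and the payload bytes are hypothetical placeholders. In a real test the bounds would come from the sketch itself (for example from its lower-bound and upper-bound accessors at a chosen number of standard deviations), and the payload would be the sketch's serialized image.

```python
import base64


def is_within_bounds(true_count, lower_bound, upper_bound):
    """Return True if the known true count lies inside the sketch's bounds.

    lower_bound and upper_bound are assumed to be reported by the sketch
    for the chosen confidence level; they are plain numbers here.
    """
    return lower_bound <= true_count <= upper_bound


def encode_for_transport(sketch_bytes):
    """Encode a serialized sketch (raw bytes) as Base64 ASCII text."""
    return base64.b64encode(sketch_bytes).decode("ascii")


def decode_from_text(text):
    """Recover the original raw bytes from Base64 text."""
    return base64.b64decode(text.encode("ascii"))


if __name__ == "__main__":
    # Illustrative numbers only, not the output of a real sketch.
    print(is_within_bounds(1000, 980.0, 1025.0))

    # Placeholder bytes standing in for a serialized sketch image.
    payload = bytes([0x02, 0x01, 0x07, 0x0C, 0xFF, 0x00])
    restored = decode_from_text(encode_for_transport(payload))
    print(restored == payload)
```

A test written this way passes for any input order that keeps the estimate within the sketch's stated bounds, so it does not depend on a sketch happening to be order insensitive.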

On Mon, Apr 27, 2020 at 6:19 AM Gabor Kaszab <gaborkas...@apache.org> wrote:

> Hey,
>
> I'm an Apache Impala (distributed, fast, SQL query engine on big data)
> contributor and recently started working on pulling in HLL sketching from
> DataSketches. I managed to put a PoC together where Impala runs a
> count(distinct) estimate on a column of a table where in the background it
> uses Datasketches' HLL algorithm from apache/incubator-datasketches-cpp to
> produce the results.
>
> My quick question would be that taking into account that the order of the
> items provided to datasketches:hll_sketch is not deterministic is it normal
> behaviour that for the same dataset I get a different estimate each time I
> run my query?
> I'm trying to figure out if this is due to some issues with my code or
> normal characteristics of the C++ library of DataSketches.
>
> My second question would be that in case Hive uses the Hive connectors
> from DataSketches and Impala uses the provided C++ library is it guaranteed
> that whatever sketch is written by any of these systems it can be correctly
> read with the other? I see binary compatibility mentioned on the official
> web page just wanted to double check if there are any exceptions to this.
>
> Cheers,
> Gabor