Hello Štefan,

We did a major study and comparison <https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html> of the DataSketches HLL sketch against the Clearspring implementation of the HLL++ sketch back in 2017. We found that the Clearspring sketch had serious error problems, did not implement the Google HLL++ paper correctly, and was slow.
To answer your question as to whether any of your CS HLL sketch data can be recovered, I would say no. And even if it could be recovered, with the serious problems of the CS implementation, I wouldn't trust it.

On Tue, Mar 11, 2025 at 6:24 AM Štefan Miklošovič <smikloso...@apache.org> wrote:

> Hello Datasketches community,
>
> I am from Apache Cassandra where we use Clearspring (1) for estimating the
> cardinalities for rows in Cassandra's SSTables. We serialize the whole
> HyperLogLog from (1) (more or less) to the disk and then we deserialize it
> back and we merge all logs together to know the final result across the
> whole data.
>
> (1) is, as you probably know, archived / not actively contributed anymore.
> Hence, we are looking for replacements.
>
> Datasketches are quite an obvious choice but I would like to know some
> answers to the questions before the transition.
>
> We need to work with old data as well. If there is an SSTable on a disk
> with HLL from Clearspring, then we can not merge this to Datasketches,
> right? In other words, this is not possible:
>
>     @Test
>     public void testMerging() throws Throwable
>     {
>         // wrapper around Clearspring
>         LegacyCardinality clearspringCardinality = new LegacyCardinality(new HyperLogLogPlus(13, 25));
>         clearspringCardinality.offerHashed(12345);
>
>         // wrapper around Datasketches HLL
>         DefaultCardinality datasketchesCardinality = new DefaultCardinality();
>         datasketchesCardinality.offerHashed(23456);
>
>         // this fails, as well as similar variations of that
>         clearspringCardinality.merge(new
>             LegacyCardinality(HyperLogLogPlus.Builder.build(datasketchesCardinality.getBytes())).getCardinality());
>     }
>
> It would be great if you confirmed (or denied) that there is no way to
> merge these two together. How would you go around this problem in general?
> If they are not mergeable, then we would need to find another way to deal
> with this but that is another story.
>
> I see that there is (2) which is a great in-depth description of
> differences between two but there is no information to my knowledge which
> would say if one is convertible to another.
>
> Thank you and regards
>
> Stefan Miklosovic
>
> (1) https://github.com/addthis/stream-lib/tree/master
> (2) https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
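
As a rough illustration of why the merge asked about in the quoted message cannot work in general: an HLL merge takes the register-wise maximum of two sketches, which is only meaningful when both were built with the same hash function and register layout. The sketch below is a toy HLL written only for this purpose. It is not the Clearspring or the DataSketches implementation, and `HllMergeDemo`, `hashA`, and `hashB` are hypothetical names, with the two mixers standing in for "two libraries that hash differently". Both sketches see the same items, so a correct merge should stay near the true count, while the mixed-hash merge should drift toward roughly double.

```java
import java.util.function.LongUnaryOperator;

// Toy HyperLogLog, for illustration only (not either library's code).
final class HllMergeDemo {
    static final int P = 12;        // precision: 2^12 registers
    static final int M = 1 << P;

    // Hypothetical stand-in for library A's hash (splitmix64-style mixer).
    static long hashA(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // Hypothetical stand-in for library B's hash (a different 64-bit mixer).
    static long hashB(long x) {
        x ^= x >>> 33; x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33; x *= 0xc4ceb9fe1a85ec53L;
        return x ^ (x >>> 33);
    }

    // Builds a sketch over the items 1..n using the given hash function.
    static byte[] build(int n, LongUnaryOperator hash) {
        byte[] regs = new byte[M];
        for (long item = 1; item <= n; item++) {
            long h = hash.applyAsLong(item);
            int idx = (int) (h >>> (64 - P));            // top P bits pick the register
            long rest = (h << P) | (1L << (P - 1));      // sentinel bit bounds the rank
            byte rank = (byte) (Long.numberOfLeadingZeros(rest) + 1);
            if (rank > regs[idx]) regs[idx] = rank;
        }
        return regs;
    }

    // HLL merge: register-wise maximum.
    static byte[] merge(byte[] a, byte[] b) {
        byte[] out = new byte[M];
        for (int i = 0; i < M; i++) out[i] = (byte) Math.max(a[i], b[i]);
        return out;
    }

    // Classic HLL estimator with the small-range (linear counting) correction.
    static double estimate(byte[] regs) {
        double sum = 0;
        int zeros = 0;
        for (byte r : regs) {
            sum += Math.pow(2.0, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1.0 + 1.079 / M);
        double raw = alpha * M * (double) M / sum;
        if (raw <= 2.5 * M && zeros > 0) return M * Math.log((double) M / zeros);
        return raw;
    }

    public static void main(String[] args) {
        int n = 100_000;
        byte[] a = build(n, HllMergeDemo::hashA);
        byte[] sameHash = build(n, HllMergeDemo::hashA);   // same items, same hash
        byte[] otherHash = build(n, HllMergeDemo::hashB);  // same items, different hash
        System.out.printf("true distinct count: %d%n", n);
        System.out.printf("merge, same hash:    %.0f%n", estimate(merge(a, sameHash)));
        System.out.printf("merge, mixed hashes: %.0f%n", estimate(merge(a, otherHash)));
    }
}
```

With mixed hashes, every item lands in an unrelated register with an unrelated rank in the second sketch, so the merged sketch treats the shared items as new ones. Real libraries differ in serialization format as well, which is why the byte-level attempt in the quoted test fails before it even gets this far.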