Hello! I'm working on integrating DataSketches' HllSketch into Apache Spark, so that we can write out and re-aggregate intermediate sketches (not currently supported via approx_count_distinct's HLL++ implementation). I had a few questions about best practices.
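For context, the kind of round trip I'm after looks roughly like this (a minimal on-heap sketch assuming the org.apache.datasketches.hll API; lgK = 12 and the partition names are just illustrative):

import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.TgtHllType;
import org.apache.datasketches.hll.Union;

public class HllRoundTrip {
  public static void main(String[] args) {
    // Two "partial" sketches, e.g. one per Spark partition.
    HllSketch partitionA = new HllSketch(12, TgtHllType.HLL_4);
    HllSketch partitionB = new HllSketch(12, TgtHllType.HLL_4);
    for (long i = 0; i < 1000; i++) { partitionA.update(i); }
    for (long i = 500; i < 1500; i++) { partitionB.update(i); }

    // Write the intermediate sketches out as byte arrays (e.g. to a binary column).
    byte[] bytesA = partitionA.toCompactByteArray();
    byte[] bytesB = partitionB.toCompactByteArray();

    // Later: heapify the stored bytes and re-aggregate via Union.
    Union union = new Union(12);
    union.update(HllSketch.heapify(bytesA));
    union.update(HllSketch.heapify(bytesB));
    System.out.println("merged estimate: " + union.getResult().getEstimate());
  }
}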
I'm working on an implementation that uses a fixed-length byte array within Spark's aggregation buffer, wrapped in a WritableMemory instance. I then wrap that in an HllSketch instance when I want to update the sketch, or in a Union instance when I want to merge sketches. Hoping someone can give me some guidance on the following:

- I initially had the HllSketch instances operate on-heap, serializing them out and heapifying them back into existence as often as Spark requires. My bet is that passing around a raw byte array (and wrapping it with WritableMemory/HllSketch/Union instances as needed) will reduce serialization/deserialization/garbage-collection overhead. Can someone confirm this is the intended usage/benefit of the writableWrap() functionality?
- Using the raw byte array requires that I initialize a max-sized buffer (given the HllSketch config) up-front, so it seems the tradeoff is that I'm allocating more memory up-front than I may need. Is my understanding of the tradeoff correct?
- The Union implementation will only wrap an HLL_8-typed buffer; when the HllSketches aren't configured as HLL_8, I currently have the Union merge the sketches on-heap and then overwrite the Spark byte buffer with the Union's updatable byte array. I think this is expected, but wanted to confirm.

To make the questions concrete, I've included a stripped-down sketch of my current approach below.
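Here's roughly what the buffer handling looks like today (a simplified sketch, assuming recent datasketches-java / datasketches-memory APIs; the class and method names and lgK = 12 are just for illustration, and error handling is omitted):

import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.TgtHllType;
import org.apache.datasketches.hll.Union;
import org.apache.datasketches.memory.WritableMemory;

public class BufferBackedHll {
  private static final int LG_K = 12;                      // example config
  private static final TgtHllType TYPE = TgtHllType.HLL_4; // i.e. not HLL_8

  // Allocate the max-sized buffer up-front; this byte array lives in Spark's
  // aggregation buffer.
  static byte[] emptyBuffer() {
    int maxBytes = HllSketch.getMaxUpdatableSerializationBytes(LG_K, TYPE);
    byte[] buf = new byte[maxBytes];
    // Initialize an empty sketch image directly into the buffer.
    new HllSketch(LG_K, TYPE, WritableMemory.writableWrap(buf));
    return buf;
  }

  // Update path: wrap the existing buffer; no serialize/heapify per row.
  static void update(byte[] buf, long value) {
    HllSketch.writableWrap(WritableMemory.writableWrap(buf)).update(value);
  }

  // Merge path for non-HLL_8 sketches: union on-heap, then overwrite the
  // Spark buffer with the union result's updatable byte array.
  // (With HLL_8, Union.writableWrap(...) could operate on the buffer directly.)
  static void merge(byte[] buf, byte[] otherBuf) {
    Union union = new Union(LG_K);
    union.update(HllSketch.wrap(WritableMemory.writableWrap(buf)));
    union.update(HllSketch.wrap(WritableMemory.writableWrap(otherBuf)));
    byte[] merged = union.getResult(TYPE).toUpdatableByteArray();
    System.arraycopy(merged, 0, buf, 0, merged.length);
  }

  // Final result: read the sketch straight out of the buffer.
  static double estimate(byte[] buf) {
    return HllSketch.wrap(WritableMemory.writableWrap(buf)).getEstimate();
  }
}

The update path is where I'm expecting writableWrap() to pay off, since nothing gets copied or heapified per row; the merge path is the part I'm less sure about.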
I have a few follow-up questions about Theta sketches, but figured I'd start with HllSketch before broadening the implementation. Thanks!

Ryan Berti
Senior Data Engineer | Ads DE
M 7023217573
5808 W Sunset Blvd | Los Angeles, CA 90028