Hello!

I'm working on integrating DataSketches' HllSketch into Apache Spark, so
that we can write out and re-aggregate intermediate sketches (which isn't
currently supported via approx_count_distinct's HLL++ implementation). I
have a few questions about best practices.

I'm working on an implementation that uses a fixed-length byte array in
Spark's aggregation buffer, wrapped in a WritableMemory instance. I then
wrap that in an HllSketch instance when I want to update the sketch, or in
a Union instance when I want to merge sketches (rough sketches of both code
paths are included after the questions below). Hoping someone can give me
some guidance on the following:

   - Initially, I had the HllSketch instances operate 'on-heap',
   serializing them out / heapifying them back into existence as often as
   Spark requires. My bet is that passing around a raw byte array (and
   wrapping it with WritableMemory/HllSketch/Union instances as needed) will
   reduce serialization/deserialization/garbage-collection overhead. Can
   someone confirm this is the intended usage/benefit of the writableWrap()
   functionality?
   - Using the raw byte array requires that I allocate a max-sized buffer
   (given the HllSketch config) up front, so it seems the tradeoff is that
   I'm allocating more memory up front than I may need. Is my understanding
   of that tradeoff correct?
   - The Union implementation will only wrap an HLL_8-typed buffer; when
   the HllSketches aren't configured as HLL_8, I currently have the Union
   merge the sketches 'on-heap' and then overwrite the Spark byte buffer
   with the Union result's updatable byte array. I think this is expected,
   but I wanted to confirm.
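
For reference, here's roughly the update path I'm describing. This is just
a sketch, not my actual Spark code: the lgConfigK/tgtHllType values and the
input value are placeholders, and I've left out the aggregation-buffer
plumbing.

    import org.apache.datasketches.hll.HllSketch;
    import org.apache.datasketches.hll.TgtHllType;
    import org.apache.datasketches.memory.WritableMemory;

    int lgConfigK = 12;                           // placeholder config
    TgtHllType tgtHllType = TgtHllType.HLL_8;     // placeholder config

    // Allocate the max-sized buffer once, up front; this byte[] is what
    // lives in the Spark aggregation buffer.
    byte[] buf =
        new byte[HllSketch.getMaxUpdatableSerializationBytes(lgConfigK, tgtHllType)];

    // Initialize a sketch image directly in that memory (no separate on-heap copy).
    new HllSketch(lgConfigK, tgtHllType, WritableMemory.writableWrap(buf));

    // For each input row, re-wrap the buffer and update the sketch in place.
    HllSketch sketch = HllSketch.writableWrap(WritableMemory.writableWrap(buf));
    sketch.update("some-value");                  // placeholder input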
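
For contrast, the on-heap round trip I started with looked roughly like
this (same placeholders as above; buf holds a valid serialized sketch
image):

    // Heapify the buffer into a new on-heap sketch, update it, then
    // re-serialize the updated image back into the Spark buffer.
    HllSketch onHeap = HllSketch.heapify(buf);
    onHeap.update("some-value");                  // placeholder input
    buf = onHeap.toUpdatableByteArray();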
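
And here's the merge-path workaround I described for the non-HLL_8 case
(bufA/bufB are just illustrative names: bufA is the Spark aggregation
buffer, bufB is the other serialized sketch being merged in):

    import org.apache.datasketches.hll.Union;
    import org.apache.datasketches.memory.Memory;

    // Union's internal state is HLL_8, so for HLL_4/HLL_6 I merge on-heap
    // and then overwrite the Spark buffer with the result's updatable bytes.
    Union union = new Union(lgConfigK);
    union.update(HllSketch.wrap(Memory.wrap(bufA)));
    union.update(HllSketch.wrap(Memory.wrap(bufB)));
    byte[] merged = union.getResult(tgtHllType).toUpdatableByteArray();
    // merged fits because bufA was allocated at the max updatable size.
    System.arraycopy(merged, 0, bufA, 0, merged.length);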

I have a few follow-up questions about Theta sketches, but I figured I'd
start with HllSketch before broadening the implementation.

Thanks!

Ryan Berti

Senior Data Engineer  |  Ads DE

M 7023217573

5808 W Sunset Blvd  |  Los Angeles, CA 90028
