Re: [DISCUSS] Enhancements to Tuple Sketch.

David Cromberge Thu, 28 May 2020 07:40:25 -0700

Lee, the array of bytes implementation is a better approach than adding 
off-heap implementations for the integer, double and strings Tuple sketches.


I originally overlooked the ArrayOfDoubles sketch for the purposes of tracking 
engagement, since it implies that many values could be associated with a hashed 
key, which doesn’t quite fit the use case.  

Having said that, I have now looked through the implementation and have 
switched to the array of doubles sketch instead - after all, you pointed out 
that it should suffice. 

I have run some initial benchmarks on sizing, and in compacted form I did not 
get a reduction in the size of a compacted sketch generated from a real data 
set, despite the benefits of using primitive doubles.  I realise that this is 
dependent on the test case and tuning / configuration parameters, so we could 
add a TODO item to add a characterisation test for this, if there is not one 
already.

We don’t seem to have discussed how the Theta sketches may be included for 
intersections and unions regarding an AOD sketch.   For unions, the behaviour 
is delegated to a merge operation on the sketch, which ultimately adds values 
together for the same key.  Concerning intersection, a combiner implementation 
is used to determine how values should be combined.  It is noteworthy that in 
the druid extension, the values are summed together, with a comment noting that 
this may not apply to all circumstances.

I would propose a similar mechanism for both union and intersection on the 
other sketches, where a default array of values can be provided for a tuple 
sketch:

        public void update(final org.apache.datasketches.theta.Sketch sketchIn, 
double[] defaultValues, ArrayOfDoublesCombiner c) {...}

Of course, a factory could be provided that creates a combiner according to a 
summary mode.  The use case for this suggestion is to have context specific 
behaviour with regard to merging / combining values in the case of unions and 
intersections, which could be sourced from the user or user query.

I would be interested to hear your thoughts on the ArrayOfDoubles/ArrayOfBytes 
Tuple sketch integration with Theta sketches,
David


> On 28 May 2020, at 04:32, leerho <[email protected]> wrote:
> 
> David,  In fact, double values do just fine with integer data.  They are 
> twice as big at the primitive level, however, the dedicated ArrayOfDoubles 
> implementation in total might actually be smaller, since it is not carrying 
> all the object overhead that are required to do generics. Plus, it will be a 
> whole lot faster! It is already fully implemented with all the set 
> operations, off-heap memory, serialization /deserialization and a full test 
> suite.  
> 
> Designing a similar dedicated ArrayOfIntegers would be a lot of work and 
> wouldn't be my top priority for the next dedicated Tuple sketch to build.  
> What would be more flexible would actually be a dedicated ArrayOfBytes 
> implementation, because bytes are the foundation from which we can derive 
> almost any summary we want.  
> 
> Think about it.
> 
> Lee.
> 
> 
> 
> On Wed, May 27, 2020 at 5:54 PM leerho <[email protected] 
> <mailto:[email protected]>> wrote:
> David,  This is good feedback.  I didn't realize you wanted off-heap 
> operation.  That changes a lot of things.  I have to go right now, but let me 
> think about this.  Attempting to leverage generics off-heap is messy.  For 
> your case it may be better to either leverage the ArrayOfDoubles 
> implementation, which already operates off-heap, or think about emulating the 
> AOD and creating a dedicated AOIntegers.  
> 
> Lee.
> 
> On Wed, May 27, 2020 at 3:04 PM David Cromberge 
> <[email protected] <mailto:[email protected]>> wrote:
> Hi Lee,
> 
> Thanks for providing such a detailed plan of action for the Tuple sketch 
> package.
> The enhancements that you have listed are interesting, and I will certainly 
> check out your branch to get a clearer understanding of how the library is 
> evolving.
> 
> For what it’s worth, here is a record of my attempt to integrate Theta 
> sketches into the Tuple sketch set operations:
> https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5
>  
> <https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5>
> 
> Although I have a cursory understanding of the library’s internals, I 
> included the commit above because there were some interesting tradeoffs to 
> the implementation, and it gave me a better appreciation for the the internal 
> workings of the existing Tuple sketch as well as some of the finer points in 
> your improvement work.   To a lesser degree, it also serves to independently 
> confirm your argument for adding new variants of the update methods!
> 
> During implementation, I was also faced with the decision as to whether to 
> duplicate the methods or to convert a Theta sketch to a Tuple sketch first 
> and delegate to the existing methods.  But, as you noted, this requires an 
> additional iteration through the result set and incurs a performance penalty. 
>  Therefore, I also duplicated the existing update methods, with some changes 
> for result extraction.  To ensure correctness, I found it necessary to 
> duplicate a large portion of the existing test cases as well - replicating so 
> many of the existing tests was not ideal, but helped verify that the 
> implementation was correct.
> It’s also worth mentioning that I had some difficulty implementing the AnotB 
> functionality, and in fact the results were incorrect when the sketch crossed 
> into estimation mode (see ignored tests).  I’m pleased to have attempted the 
> exercise because it will give much better context as I study your branch 
> further - especially the AnotB refactoring.
> 
> There is one addition that I would like to suggest to your list of TODO items 
> - namely, off-heap implementations.  I am considering using the integer tuple 
> sketch for our engagement use case and would prefer to prevent memory 
> pressure by loading many sketches onto the heap.  I have noticed this come up 
> in the past on #datasketches slack channel in a conversation with some Druid 
> team members.  It appears that the off-heap implementations were omitted from 
> the library due to time constraints, and this is area where I could also 
> potentially provide default implementations for the other tuple sketches.  I 
> think this is important to consider, because the existing ArrayOfDoubles 
> implementation uses an abstract class for the parent.  Making the other Tuple 
> sketches abstract in a similar manner is potentially a breaking change as 
> well.
> 
> I am excited to collaborate on this together on this feature, and I would be 
> happy to contribute in any possible way and coordinate through the project 
> TODO page <https://github.com/apache/incubator-datasketches-java/projects/1>!
> 
> David
> 
> 
> 
>> On 27 May 2020, at 20:04, leerho <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> David,
>> 
>> Thanks.  I have been putting a lot of thought into it as well and decided 
>> that it was time to make some other long-needed changes in the Tuple Family 
>> of sketches as well including the package layout, which has been quite 
>> cumbersome.  I would suggest holding back on you actual implementation work 
>> until you understand what I have changed so far and then we can strategize 
>> on how to finish the work.   I have checked in my changes so far into the 
>> "Tuple_Theta_Extension" branch, which you can check out to see what I have 
>> been up to :)
>> 
>> The family of tuple sketches have evolved over time and somewhat 
>> haphazardly.  So the first think I decided to do was to do some rearranging 
>> of the package structure to make future downstream improvements and 
>> extensions easier.  
>> 
>> 1.  The first problem is that the root tuple directory was cluttered with 
>> two different groups of classes that made it difficult for anyone to figure 
>> out what is going on.  One group of classes form the base generic classes of 
>> the tuple sketch on which the concrete extensions "adouble" (a single 
>> double), "aninteger" (a single integer), and "strings" (array of strings) 
>> depend.  These three concrete extensions are already in their own sub 
>> directories.  
>> 
>> The second, largest group of classes were a dedicated non-generic 
>> implementation of the tuple sketch, which implemented an array of doubles.  
>> All of these classes had "ArrayOfDoubles" in their name.  These classes 
>> shared no code with the root generic tuple classes except for a few methods 
>> in the SerializerDeserializer and the Util classes.  By making a few methods 
>> public, I was able to move all of the "ArrayOfDoubles" classes into their 
>> own subdirectory.  This creates an incompatible API break, which will force 
>> us to move to a 2.0.0 for the next version.   Now the tuple root directory 
>> is much cleaner and easier to navigate and understand.  There are several 
>> reasons for this separate dedicated implementation. First, we felt that a 
>> configurable array of doubles would be a relatively common use case.  
>> Second, we wanted a full concrete example of the tuple sketch as an example 
>> of what it would look like including both on-heap and off-heap variants.   
>> It is this ArrayOfDoubles implementation that has been integrated into 
>> Druid, for example. 
>> 
>> 2. Now that the package directories are cleaned up I was able to focus on 
>> what it would mean to allow Tuple sketches to perform set operations with 
>> Theta sketches.  
>> 
>> One approach would be to just provide a converter to take in a Theta sketch 
>> and produce a Tuple sketch with some default or configured summary and leave 
>> everything else the way it is.  But this is less efficient as it requires 
>> more object creation and copying than a direct integration would.  It turns 
>> out that modifying the generic Union and Intersection classes only required 
>> adding one method to each.  I did some minor code cleanup and code 
>> documentation at the same time.  
>> 
>> The AnotB operator is another story.  We have never been really happy with 
>> how this was implemented the first time.  The current API is clumsy.  So I 
>> have taken the opportunity to redesign the API for this class.  It still has 
>> the current API methods but deprecated.  With the new modified class the 
>> user has several ways of performing AnotB.
>> 
>> As stateless operations:
>> With Tuple: resultSk = aNotB(skTupleA, skTupleB);
>> With Theta: resultSk = aNotB(skTupleA, skThetaB);
>> As stateful, sequential operations:
>> void setA(skTupleA);
>> void notB(skTupleB);   or   void notB(skThetaB);   //These are 
>> interchangable.
>> ...
>> void notB(skTupleB);   or   void notB(skThetaB);   //These are 
>> interchangable.
>> resultSk = getResult(reset = false);  // This allows getting an intermediate 
>> result
>> void notB(skTupleB);   or   void notB(skThetaB);   //Continue...
>> resultSK = getResult(reset = true); //This returns the result and clears the 
>> internal state to empty.
>> This I think is pretty slick and flexible.  
>> 
>> Work yet to be done on main:
>> Reexamine the Union and Intersection APIs to add the option of an 
>> intermediate result.
>> Update the other concrete extensions to take advantage of the above new API: 
>> "aninteger", "strings".
>> Examine the dedicated "ArrayOfDoubes" implementation to see how hard it 
>> would be to make the same changes as above.  Implement. Test.
>> Work yet to be done on test:
>> 
>> I did major redesign of the testing class for the AnotB generic class using 
>> the "adouble" concrete extension.  You can see this in 
>> AdoubleAnotBTest.java.  This is essentially a deep exhaustive test of the 
>> base AnotB classes via the concrete extension.   
>> With the deep testing using the "adouble" done, we still need to design new 
>> tests for the "aninteger" and "strings" extensions.  These can be shallow 
>> tests.
>> If we decide to do the same API extensions on the ArrayOfDoubles classes, 
>> those will need to be tested.
>> Work to be done on documentation:
>> The website documentation is still rather thin on the whole Tuple family.  
>> Having someone that is a real user of these classes contribute to the 
>> documentation to make it more understandable would be outstanding!
>> Work to be done on characterization.
>> The Tuple family has some characterization, but it is sparse and a lot more 
>> would work here would give users a sense of the performance they could 
>> expect.  We have also found that characterization is a powerful way to find 
>> statistical bugs that don't show up in unit tests.   I could guide you 
>> through how to set up the various "test harnesses", which is really pretty 
>> simple, but the real thinking goes into the design of the test and 
>> understanding the output.  This is a great way to really understand how 
>> these sketches behave and why.  
>> Work to be done on code reviews:
>> Having independent set of eyes going over the code would also be a huge 
>> contribution.  
>> Once you have had a chance to study this we should talk about how you want 
>> to contribute.  Clearly a lot of what I have done so far required deep 
>> understanding of the Tuple and Theta classes and was was much more efficient 
>> for me to do.  It would have been a hard slog for anyone new to the library 
>> to undertake.
>> 
>> Once we decide on a strategy, we should put kanban cards in the project TODO 
>> page <https://github.com/apache/incubator-datasketches-java/projects/1>.
>> 
>> Please let me know what you think!
>> 
>> Lee.
>> 
>>    
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Wed, May 27, 2020 at 7:53 AM David Cromberge 
>> <[email protected] <mailto:[email protected]>> wrote:
>> Thank you Lee for your proposal regarding my use case and Tuple sketches.
>> 
>> I have spent some time considering the proposal, and I have started 
>> implementing a potential solution.
>> 
>> At what stage of the pipeline should characterisation tests be proposed, 
>> since they would obviously depend on a new SNAPSHOT version of the core 
>> library being available? 
>> 
>> I would be grateful for any input about the characterisation workflow.
>> 
>> Thank you,
>> David
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected] 
>> <mailto:[email protected]>
>> For additional commands, e-mail: [email protected] 
>> <mailto:[email protected]>
>> 
>

Re: [DISCUSS] Enhancements to Tuple Sketch.

Reply via email to