Re: [DISCUSS] Enhancements to Tuple Sketch.

David Cromberge Fri, 29 May 2020 04:08:13 -0700

Hi Lee,

I have studied our usage of the off-heap features of the library.  Originally 
there was a compelling reason to leverage this whilst decoding from serialised 
bytes in storage.  When servicing queries on behalf of many clients, decoding 
sketches into the heap was potentially problematic from a scalability 
perspective.  However, for various reasons(unrelated to the library), we do not 
currently make use of the off-heap features.
Regardless, it would be interesting to hear some of the learnings from the 
Druid case (whenever these can be provided), as I am busy collecting notes and 
intend to supplement the website documentation where applicable.
To sum up, off-heap may be reconsidered in the future, but is not currently a 
priority.


If a byte-array Tuple sketch is being proposed, it may be possible in the end 
to define the various specialisations (integer, double, strings and doubles) in 
terms of the byte-array implementation.  It could also be of interest to define 
a position in the preamble for Tuple sketches where there is a flag to identify 
the summary type, where the defaults (integer, double, strings and doubles) may 
reserve an identifier.  This would allow for a powerful Tuple deserialiser that 
generalises over the summary.  However, I realise that there are other concerns 
that come into play with backwards compatibility, as well as performance 
criteria.

David.


> On 28 May 2020, at 16:12, leerho <[email protected]> wrote:
> 
> David, you are correct.  I also looked at the AOD implementation last night 
> and it lacks some notable capabilities that I think you need, but others 
> would want as well.
> Integration of Theta (as you noted)
> There is no selectable "mode" equivalent for a Union on how to combine two 
> double summaries. Intersection has a dedicated combiner, but doesn't have the 
> "mode" capability to allow easy choices between a set of modes.
> There may be other issues, but I haven't studied this AOD code for several 
> years :(
> I am wondering if, for your case, if we had a generic, only-on-heap solution 
> relatively quickly that you could use and get some experience with and could 
> characterize its performance in your environment.  And then work on a more 
> efficient off-heap solution later. 
> 
> Using sketches off-heap requires some significant design decisions at the 
> system level to make it work really well.  We have worked with the Druid 
> folks for quite a while and have learned some things that may be helpful for 
> you.  But these issues are outside and beyond the issues of designing a 
> sketch that can work off-heap.   It would be helpful to get a sense of where 
> you are in your thinking about going off-heap.
> 
> Lee.
> 
> 
> 
> On Thu, May 28, 2020 at 7:45 AM David Cromberge 
> <[email protected] <mailto:[email protected]>> wrote:
> As a follow-up to the proposal below, it may not be necessary to provide a 
> combiner and default set of values together, since there may be some 
> redundancy here.
> It’s probably better for me to re-iterate the intention - to provide a 
> meaningful way to interpret the presence of a key in a theta with some 
> suitable default value.
> 
>> On 28 May 2020, at 15:39, David Cromberge <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> Lee, the array of bytes implementation is a better approach than adding 
>> off-heap implementations for the integer, double and strings Tuple sketches.
>> 
>> I originally overlooked the ArrayOfDoubles sketch for the purposes of 
>> tracking engagement, since it implies that many values could be associated 
>> with a hashed key, which doesn’t quite fit the use case.  
>> 
>> Having said that, I have now looked through the implementation and have 
>> switched to the array of doubles sketch instead - after all, you pointed out 
>> that it should suffice. 
>> 
>> I have run some initial benchmarks on sizing, and in compacted form I did 
>> not get a reduction in the size of a compacted sketch generated from a real 
>> data set, despite the benefits of using primitive doubles.  I realise that 
>> this is dependent on the test case and tuning / configuration parameters, so 
>> we could add a TODO item to add a characterisation test for this, if there 
>> is not one already.
>> 
>> We don’t seem to have discussed how the Theta sketches may be included for 
>> intersections and unions regarding an AOD sketch.   For unions, the 
>> behaviour is delegated to a merge operation on the sketch, which ultimately 
>> adds values together for the same key.  Concerning intersection, a combiner 
>> implementation is used to determine how values should be combined.  It is 
>> noteworthy that in the druid extension, the values are summed together, with 
>> a comment noting that this may not apply to all circumstances.
>> 
>> I would propose a similar mechanism for both union and intersection on the 
>> other sketches, where a default array of values can be provided for a tuple 
>> sketch:
>> 
>>      public void update(final org.apache.datasketches.theta.Sketch sketchIn, 
>> double[] defaultValues, ArrayOfDoublesCombiner c) {...}
>> 
>> Of course, a factory could be provided that creates a combiner according to 
>> a summary mode.  The use case for this suggestion is to have context 
>> specific behaviour with regard to merging / combining values in the case of 
>> unions and intersections, which could be sourced from the user or user query.
>> 
>> I would be interested to hear your thoughts on the 
>> ArrayOfDoubles/ArrayOfBytes Tuple sketch integration with Theta sketches,
>> David
>> 
>> 
>>> On 28 May 2020, at 04:32, leerho <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> David,  In fact, double values do just fine with integer data.  They are 
>>> twice as big at the primitive level, however, the dedicated ArrayOfDoubles 
>>> implementation in total might actually be smaller, since it is not carrying 
>>> all the object overhead that are required to do generics. Plus, it will be 
>>> a whole lot faster! It is already fully implemented with all the set 
>>> operations, off-heap memory, serialization /deserialization and a full test 
>>> suite.  
>>> 
>>> Designing a similar dedicated ArrayOfIntegers would be a lot of work and 
>>> wouldn't be my top priority for the next dedicated Tuple sketch to build.  
>>> What would be more flexible would actually be a dedicated ArrayOfBytes 
>>> implementation, because bytes are the foundation from which we can derive 
>>> almost any summary we want.  
>>> 
>>> Think about it.
>>> 
>>> Lee.
>>> 
>>> 
>>> 
>>> On Wed, May 27, 2020 at 5:54 PM leerho <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> David,  This is good feedback.  I didn't realize you wanted off-heap 
>>> operation.  That changes a lot of things.  I have to go right now, but let 
>>> me think about this.  Attempting to leverage generics off-heap is messy.  
>>> For your case it may be better to either leverage the ArrayOfDoubles 
>>> implementation, which already operates off-heap, or think about emulating 
>>> the AOD and creating a dedicated AOIntegers.  
>>> 
>>> Lee.
>>> 
>>> On Wed, May 27, 2020 at 3:04 PM David Cromberge 
>>> <[email protected] <mailto:[email protected]>> 
>>> wrote:
>>> Hi Lee,
>>> 
>>> Thanks for providing such a detailed plan of action for the Tuple sketch 
>>> package.
>>> The enhancements that you have listed are interesting, and I will certainly 
>>> check out your branch to get a clearer understanding of how the library is 
>>> evolving.
>>> 
>>> For what it’s worth, here is a record of my attempt to integrate Theta 
>>> sketches into the Tuple sketch set operations:
>>> https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5
>>>  
>>> <https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5>
>>> 
>>> Although I have a cursory understanding of the library’s internals, I 
>>> included the commit above because there were some interesting tradeoffs to 
>>> the implementation, and it gave me a better appreciation for the the 
>>> internal workings of the existing Tuple sketch as well as some of the finer 
>>> points in your improvement work.   To a lesser degree, it also serves to 
>>> independently confirm your argument for adding new variants of the update 
>>> methods!
>>> 
>>> During implementation, I was also faced with the decision as to whether to 
>>> duplicate the methods or to convert a Theta sketch to a Tuple sketch first 
>>> and delegate to the existing methods.  But, as you noted, this requires an 
>>> additional iteration through the result set and incurs a performance 
>>> penalty.  Therefore, I also duplicated the existing update methods, with 
>>> some changes for result extraction.  To ensure correctness, I found it 
>>> necessary to duplicate a large portion of the existing test cases as well - 
>>> replicating so many of the existing tests was not ideal, but helped verify 
>>> that the implementation was correct.
>>> It’s also worth mentioning that I had some difficulty implementing the 
>>> AnotB functionality, and in fact the results were incorrect when the sketch 
>>> crossed into estimation mode (see ignored tests).  I’m pleased to have 
>>> attempted the exercise because it will give much better context as I study 
>>> your branch further - especially the AnotB refactoring.
>>> 
>>> There is one addition that I would like to suggest to your list of TODO 
>>> items - namely, off-heap implementations.  I am considering using the 
>>> integer tuple sketch for our engagement use case and would prefer to 
>>> prevent memory pressure by loading many sketches onto the heap.  I have 
>>> noticed this come up in the past on #datasketches slack channel in a 
>>> conversation with some Druid team members.  It appears that the off-heap 
>>> implementations were omitted from the library due to time constraints, and 
>>> this is area where I could also potentially provide default implementations 
>>> for the other tuple sketches.  I think this is important to consider, 
>>> because the existing ArrayOfDoubles implementation uses an abstract class 
>>> for the parent.  Making the other Tuple sketches abstract in a similar 
>>> manner is potentially a breaking change as well.
>>> 
>>> I am excited to collaborate on this together on this feature, and I would 
>>> be happy to contribute in any possible way and coordinate through the 
>>> project TODO page 
>>> <https://github.com/apache/incubator-datasketches-java/projects/1>!
>>> 
>>> David
>>> 
>>> 
>>> 
>>>> On 27 May 2020, at 20:04, leerho <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> David,
>>>> 
>>>> Thanks.  I have been putting a lot of thought into it as well and decided 
>>>> that it was time to make some other long-needed changes in the Tuple 
>>>> Family of sketches as well including the package layout, which has been 
>>>> quite cumbersome.  I would suggest holding back on you actual 
>>>> implementation work until you understand what I have changed so far and 
>>>> then we can strategize on how to finish the work.   I have checked in my 
>>>> changes so far into the "Tuple_Theta_Extension" branch, which you can 
>>>> check out to see what I have been up to :)
>>>> 
>>>> The family of tuple sketches have evolved over time and somewhat 
>>>> haphazardly.  So the first think I decided to do was to do some 
>>>> rearranging of the package structure to make future downstream 
>>>> improvements and extensions easier.  
>>>> 
>>>> 1.  The first problem is that the root tuple directory was cluttered with 
>>>> two different groups of classes that made it difficult for anyone to 
>>>> figure out what is going on.  One group of classes form the base generic 
>>>> classes of the tuple sketch on which the concrete extensions "adouble" (a 
>>>> single double), "aninteger" (a single integer), and "strings" (array of 
>>>> strings) depend.  These three concrete extensions are already in their own 
>>>> sub directories.  
>>>> 
>>>> The second, largest group of classes were a dedicated non-generic 
>>>> implementation of the tuple sketch, which implemented an array of doubles. 
>>>>  All of these classes had "ArrayOfDoubles" in their name.  These classes 
>>>> shared no code with the root generic tuple classes except for a few 
>>>> methods in the SerializerDeserializer and the Util classes.  By making a 
>>>> few methods public, I was able to move all of the "ArrayOfDoubles" classes 
>>>> into their own subdirectory.  This creates an incompatible API break, 
>>>> which will force us to move to a 2.0.0 for the next version.   Now the 
>>>> tuple root directory is much cleaner and easier to navigate and 
>>>> understand.  There are several reasons for this separate dedicated 
>>>> implementation. First, we felt that a configurable array of doubles would 
>>>> be a relatively common use case.  Second, we wanted a full concrete 
>>>> example of the tuple sketch as an example of what it would look like 
>>>> including both on-heap and off-heap variants.   It is this ArrayOfDoubles 
>>>> implementation that has been integrated into Druid, for example. 
>>>> 
>>>> 2. Now that the package directories are cleaned up I was able to focus on 
>>>> what it would mean to allow Tuple sketches to perform set operations with 
>>>> Theta sketches.  
>>>> 
>>>> One approach would be to just provide a converter to take in a Theta 
>>>> sketch and produce a Tuple sketch with some default or configured summary 
>>>> and leave everything else the way it is.  But this is less efficient as it 
>>>> requires more object creation and copying than a direct integration would. 
>>>>  It turns out that modifying the generic Union and Intersection classes 
>>>> only required adding one method to each.  I did some minor code cleanup 
>>>> and code documentation at the same time.  
>>>> 
>>>> The AnotB operator is another story.  We have never been really happy with 
>>>> how this was implemented the first time.  The current API is clumsy.  So I 
>>>> have taken the opportunity to redesign the API for this class.  It still 
>>>> has the current API methods but deprecated.  With the new modified class 
>>>> the user has several ways of performing AnotB.
>>>> 
>>>> As stateless operations:
>>>> With Tuple: resultSk = aNotB(skTupleA, skTupleB);
>>>> With Theta: resultSk = aNotB(skTupleA, skThetaB);
>>>> As stateful, sequential operations:
>>>> void setA(skTupleA);
>>>> void notB(skTupleB);   or   void notB(skThetaB);   //These are 
>>>> interchangable.
>>>> ...
>>>> void notB(skTupleB);   or   void notB(skThetaB);   //These are 
>>>> interchangable.
>>>> resultSk = getResult(reset = false);  // This allows getting an 
>>>> intermediate result
>>>> void notB(skTupleB);   or   void notB(skThetaB);   //Continue...
>>>> resultSK = getResult(reset = true); //This returns the result and clears 
>>>> the internal state to empty.
>>>> This I think is pretty slick and flexible.  
>>>> 
>>>> Work yet to be done on main:
>>>> Reexamine the Union and Intersection APIs to add the option of an 
>>>> intermediate result.
>>>> Update the other concrete extensions to take advantage of the above new 
>>>> API: "aninteger", "strings".
>>>> Examine the dedicated "ArrayOfDoubes" implementation to see how hard it 
>>>> would be to make the same changes as above.  Implement. Test.
>>>> Work yet to be done on test:
>>>> 
>>>> I did major redesign of the testing class for the AnotB generic class 
>>>> using the "adouble" concrete extension.  You can see this in 
>>>> AdoubleAnotBTest.java.  This is essentially a deep exhaustive test of the 
>>>> base AnotB classes via the concrete extension.   
>>>> With the deep testing using the "adouble" done, we still need to design 
>>>> new tests for the "aninteger" and "strings" extensions.  These can be 
>>>> shallow tests.
>>>> If we decide to do the same API extensions on the ArrayOfDoubles classes, 
>>>> those will need to be tested.
>>>> Work to be done on documentation:
>>>> The website documentation is still rather thin on the whole Tuple family.  
>>>> Having someone that is a real user of these classes contribute to the 
>>>> documentation to make it more understandable would be outstanding!
>>>> Work to be done on characterization.
>>>> The Tuple family has some characterization, but it is sparse and a lot 
>>>> more would work here would give users a sense of the performance they 
>>>> could expect.  We have also found that characterization is a powerful way 
>>>> to find statistical bugs that don't show up in unit tests.   I could guide 
>>>> you through how to set up the various "test harnesses", which is really 
>>>> pretty simple, but the real thinking goes into the design of the test and 
>>>> understanding the output.  This is a great way to really understand how 
>>>> these sketches behave and why.  
>>>> Work to be done on code reviews:
>>>> Having independent set of eyes going over the code would also be a huge 
>>>> contribution.  
>>>> Once you have had a chance to study this we should talk about how you want 
>>>> to contribute.  Clearly a lot of what I have done so far required deep 
>>>> understanding of the Tuple and Theta classes and was was much more 
>>>> efficient for me to do.  It would have been a hard slog for anyone new to 
>>>> the library to undertake.
>>>> 
>>>> Once we decide on a strategy, we should put kanban cards in the project 
>>>> TODO page 
>>>> <https://github.com/apache/incubator-datasketches-java/projects/1>.
>>>> 
>>>> Please let me know what you think!
>>>> 
>>>> Lee.
>>>> 
>>>>    
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Wed, May 27, 2020 at 7:53 AM David Cromberge 
>>>> <[email protected] <mailto:[email protected]>> 
>>>> wrote:
>>>> Thank you Lee for your proposal regarding my use case and Tuple sketches.
>>>> 
>>>> I have spent some time considering the proposal, and I have started 
>>>> implementing a potential solution.
>>>> 
>>>> At what stage of the pipeline should characterisation tests be proposed, 
>>>> since they would obviously depend on a new SNAPSHOT version of the core 
>>>> library being available? 
>>>> 
>>>> I would be grateful for any input about the characterisation workflow.
>>>> 
>>>> Thank you,
>>>> David
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected] 
>>>> <mailto:[email protected]>
>>>> For additional commands, e-mail: [email protected] 
>>>> <mailto:[email protected]>
>>>> 
>>> 
>> 
>

Re: [DISCUSS] Enhancements to Tuple Sketch.

Reply via email to