Re: [DISCUSS] Enhancements to Tuple Sketch.

leerho Wed, 27 May 2020 17:55:24 -0700

David,  This is good feedback.  I didn't realize you wanted off-heap
operation.  That changes a lot of things.  I have to go right now, but let
me think about this.  Attempting to leverage generics off-heap is messy.
For your case it may be better to either leverage the ArrayOfDoubles
implementation, which already operates off-heap, or think about emulating
the AOD and creating a dedicated AOIntegers.


Lee.

On Wed, May 27, 2020 at 3:04 PM David Cromberge <
[email protected]> wrote:

> Hi Lee,
>
> Thanks for providing such a detailed plan of action for the Tuple sketch
> package.
> The enhancements that you have listed are interesting, and I will
> certainly check out your branch to get a clearer understanding of how the
> library is evolving.
>
> For what it’s worth, here is a record of my attempt to integrate Theta
> sketches into the Tuple sketch set operations:
>
> https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5
>
> Although I have a cursory understanding of the library’s internals, I
> included the commit above because there were some interesting tradeoffs to
> the implementation, and it gave me a better appreciation for the the
> internal workings of the existing Tuple sketch as well as some of the finer
> points in your improvement work.   To a lesser degree, it also serves to
> independently confirm your argument for adding new variants of the update
> methods!
>
> During implementation, I was also faced with the decision as to whether to
> duplicate the methods or to convert a Theta sketch to a Tuple sketch first
> and delegate to the existing methods.  But, as you noted, this requires an
> additional iteration through the result set and incurs a performance
> penalty.  Therefore, I also duplicated the existing update methods, with
> some changes for result extraction.  To ensure correctness, I found it
> necessary to duplicate a large portion of the existing test cases as well -
> replicating so many of the existing tests was not ideal, but helped verify
> that the implementation was correct.
> It’s also worth mentioning that I had some difficulty implementing the
> AnotB functionality, and in fact the results were incorrect when the sketch
> crossed into estimation mode (see ignored tests).  I’m pleased to have
> attempted the exercise because it will give much better context as I study
> your branch further - especially the AnotB refactoring.
>
> There is one addition that I would like to suggest to your list of TODO
> items - namely, off-heap implementations.  I am considering using the
> integer tuple sketch for our engagement use case and would prefer to
> prevent memory pressure by loading many sketches onto the heap.  I have
> noticed this come up in the past on #datasketches slack channel in a
> conversation with some Druid team members.  It appears that the off-heap
> implementations were omitted from the library due to time constraints, and
> this is area where I could also potentially provide default implementations
> for the other tuple sketches.  I think this is important to consider,
> because the existing ArrayOfDoubles implementation uses an abstract class
> for the parent.  Making the other Tuple sketches abstract in a similar
> manner is potentially a breaking change as well.
>
> I am excited to collaborate on this together on this feature, and I would
> be happy to contribute in any possible way and coordinate through the project
> TODO page
> <https://github.com/apache/incubator-datasketches-java/projects/1>!
>
> David
>
>
>
> On 27 May 2020, at 20:04, leerho <[email protected]> wrote:
>
> David,
>
> Thanks.  I have been putting a lot of thought into it as well and decided
> that it was time to make some other long-needed changes in the Tuple Family
> of sketches as well including the package layout, which has been quite
> cumbersome.  I would suggest holding back on you actual implementation work
> until you understand what I have changed so far and then we can strategize
> on how to finish the work.   I have checked in my changes so far into the
> "Tuple_Theta_Extension" branch, which you can check out to see what I have
> been up to :)
>
> The family of tuple sketches have evolved over time and somewhat
> haphazardly.  So the first think I decided to do was to do some rearranging
> of the package structure to make future downstream improvements and
> extensions easier.
>
> 1.  The first problem is that the root tuple directory was cluttered with
> two different groups of classes that made it difficult for anyone to figure
> out what is going on.  One group of classes form the base generic classes
> of the tuple sketch on which the concrete extensions "adouble" (a single
> double), "aninteger" (a single integer), and "strings" (array of strings)
> depend.  These three concrete extensions are already in their own sub
> directories.
>
> The second, largest group of classes were a dedicated non-generic
> implementation of the tuple sketch, which implemented an array of doubles.
> All of these classes had "ArrayOfDoubles" in their name.  These classes
> shared no code with the root generic tuple classes except for a few methods
> in the SerializerDeserializer and the Util classes.  By making a few
> methods public, I was able to move all of the "ArrayOfDoubles" classes into
> their own subdirectory.  This creates an incompatible API break, which will
> force us to move to a 2.0.0 for the next version.   Now the tuple root
> directory is much cleaner and easier to navigate and understand.  There are
> several reasons for this separate dedicated implementation. First, we felt
> that a configurable array of doubles would be a relatively common use
> case.  Second, we wanted a full concrete example of the tuple sketch as an
> example of what it would look like including both on-heap and off-heap
> variants.   It is this ArrayOfDoubles implementation that has been
> integrated into Druid, for example.
>
> 2. Now that the package directories are cleaned up I was able to focus on
> what it would mean to allow Tuple sketches to perform set operations with
> Theta sketches.
>
> One approach would be to just provide a converter to take in a Theta
> sketch and produce a Tuple sketch with some default or configured summary
> and leave everything else the way it is.  But this is less efficient as it
> requires more object creation and copying than a direct integration would.
> It turns out that modifying the generic Union and Intersection classes only
> required adding one method to each.  I did some minor code cleanup and code
> documentation at the same time.
>
> The AnotB operator is another story.  We have never been really happy with
> how this was implemented the first time.  The current API is clumsy.  So I
> have taken the opportunity to redesign the API for this class.  It still
> has the current API methods but deprecated.  With the new modified class
> the user has several ways of performing AnotB.
>
> As stateless operations:
>
>    - With Tuple: resultSk = aNotB(skTupleA, skTupleB);
>    - With Theta: resultSk = aNotB(skTupleA, skThetaB);
>
> As stateful, sequential operations:
>
>    - void setA(skTupleA);
>    - void notB(skTupleB);   or   void notB(skThetaB);   //These are
>    interchangable.
>    - ...
>    - void notB(skTupleB);   or   void notB(skThetaB);   //These are
>    interchangable.
>    - resultSk = getResult(reset = false);  // This allows getting an
>    intermediate result
>    - void notB(skTupleB);   or   void notB(skThetaB);   //Continue...
>    - resultSK = getResult(reset = true); //This returns the result and
>    clears the internal state to empty.
>
> This I think is pretty slick and flexible.
>
> Work yet to be done on main:
>
>    - Reexamine the Union and Intersection APIs to add the option of an
>    intermediate result.
>    - Update the other concrete extensions to take advantage of the above
>    new API: "aninteger", "strings".
>    - Examine the dedicated "ArrayOfDoubes" implementation to see how hard
>    it would be to make the same changes as above.  Implement. Test.
>
> Work yet to be done on test:
>
> I did major redesign of the testing class for the AnotB generic class
> using the "adouble" concrete extension.  You can see this in
> AdoubleAnotBTest.java.  This is essentially a deep exhaustive test of the
> base AnotB classes via the concrete extension.
>
>    - With the deep testing using the "adouble" done, we still need to
>    design new tests for the "aninteger" and "strings" extensions.  These can
>    be shallow tests.
>    - If we decide to do the same API extensions on the ArrayOfDoubles
>    classes, those will need to be tested.
>
> Work to be done on documentation:
>
>    - The website documentation is still rather thin on the whole Tuple
>    family.  Having someone that is a real user of these classes contribute to
>    the documentation to make it more understandable would be outstanding!
>
> Work to be done on characterization.
>
>    - The Tuple family has some characterization, but it is sparse and a
>    lot more would work here would give users a sense of the performance they
>    could expect.  We have also found that characterization is a powerful way
>    to find statistical bugs that don't show up in unit tests.   I could guide
>    you through how to set up the various "test harnesses", which is really
>    pretty simple, but the real thinking goes into the design of the test and
>    understanding the output.  This is a great way to really understand how
>    these sketches behave and why.
>
> Work to be done on code reviews:
>
>    - Having independent set of eyes going over the code would also be a
>    huge contribution.
>
> Once you have had a chance to study this we should talk about how you want
> to contribute.  Clearly a lot of what I have done so far required deep
> understanding of the Tuple and Theta classes and was was much more
> efficient for me to do.  It would have been a hard slog for anyone new to
> the library to undertake.
>
> Once we decide on a strategy, we should put kanban cards in the project
> TODO page
> <https://github.com/apache/incubator-datasketches-java/projects/1>.
>
> Please let me know what you think!
>
> Lee.
>
>
>
>
>
>
>
>
>
>
>
> On Wed, May 27, 2020 at 7:53 AM David Cromberge <
> [email protected]> wrote:
>
>> Thank you Lee for your proposal regarding my use case and Tuple sketches.
>>
>> I have spent some time considering the proposal, and I have started
>> implementing a potential solution.
>>
>> At what stage of the pipeline should characterisation tests be proposed,
>> since they would obviously depend on a new SNAPSHOT version of the core
>> library being available?
>>
>> I would be grateful for any input about the characterisation workflow.
>>
>> Thank you,
>> David
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>

Re: [DISCUSS] Enhancements to Tuple Sketch.

Reply via email to