Re: [DISCUSS] Enhancements to Tuple Sketch.

leerho Wed, 27 May 2020 12:05:09 -0700

David,

Thanks.  I have been putting a lot of thought into it as well and decided
that it was time to make some other long-needed changes in the Tuple Family
of sketches as well including the package layout, which has been quite
cumbersome.  I would suggest holding back on you actual implementation work
until you understand what I have changed so far and then we can strategize
on how to finish the work.   I have checked in my changes so far into the
"Tuple_Theta_Extension" branch, which you can check out to see what I have
been up to :)

The family of tuple sketches have evolved over time and somewhat
haphazardly.  So the first think I decided to do was to do some rearranging
of the package structure to make future downstream improvements and
extensions easier.

1.  The first problem is that the root tuple directory was cluttered with
two different groups of classes that made it difficult for anyone to figure
out what is going on.  One group of classes form the base generic classes
of the tuple sketch on which the concrete extensions "adouble" (a single
double), "aninteger" (a single integer), and "strings" (array of strings)
depend.  These three concrete extensions are already in their own sub
directories.

The second, largest group of classes were a dedicated non-generic
implementation of the tuple sketch, which implemented an array of doubles.
All of these classes had "ArrayOfDoubles" in their name.  These classes
shared no code with the root generic tuple classes except for a few methods
in the SerializerDeserializer and the Util classes.  By making a few
methods public, I was able to move all of the "ArrayOfDoubles" classes into
their own subdirectory.  This creates an incompatible API break, which will
force us to move to a 2.0.0 for the next version.   Now the tuple root
directory is much cleaner and easier to navigate and understand.  There are
several reasons for this separate dedicated implementation. First, we felt
that a configurable array of doubles would be a relatively common use
case.  Second, we wanted a full concrete example of the tuple sketch as an
example of what it would look like including both on-heap and off-heap
variants.   It is this ArrayOfDoubles implementation that has been
integrated into Druid, for example.

2. Now that the package directories are cleaned up I was able to focus on
what it would mean to allow Tuple sketches to perform set operations with
Theta sketches.

One approach would be to just provide a converter to take in a Theta sketch
and produce a Tuple sketch with some default or configured summary and
leave everything else the way it is.  But this is less efficient as it
requires more object creation and copying than a direct integration would.
It turns out that modifying the generic Union and Intersection classes only
required adding one method to each.  I did some minor code cleanup and code
documentation at the same time.

The AnotB operator is another story.  We have never been really happy with
how this was implemented the first time.  The current API is clumsy.  So I
have taken the opportunity to redesign the API for this class.  It still
has the current API methods but deprecated.  With the new modified class
the user has several ways of performing AnotB.

As stateless operations:

   - With Tuple: resultSk = aNotB(skTupleA, skTupleB);
   - With Theta: resultSk = aNotB(skTupleA, skThetaB);

As stateful, sequential operations:

   - void setA(skTupleA);
   - void notB(skTupleB);   or   void notB(skThetaB);   //These are
   interchangable.
   - ...
   - void notB(skTupleB);   or   void notB(skThetaB);   //These are
   interchangable.
   - resultSk = getResult(reset = false);  // This allows getting an
   intermediate result
   - void notB(skTupleB);   or   void notB(skThetaB);   //Continue...
   - resultSK = getResult(reset = true); //This returns the result and
   clears the internal state to empty.

This I think is pretty slick and flexible.

Work yet to be done on main:

   - Reexamine the Union and Intersection APIs to add the option of an
   intermediate result.
   - Update the other concrete extensions to take advantage of the above
   new API: "aninteger", "strings".
   - Examine the dedicated "ArrayOfDoubes" implementation to see how hard
   it would be to make the same changes as above.  Implement. Test.

Work yet to be done on test:

I did major redesign of the testing class for the AnotB generic class using
the "adouble" concrete extension.  You can see this in
AdoubleAnotBTest.java.  This is essentially a deep exhaustive test of the
base AnotB classes via the concrete extension.

   - With the deep testing using the "adouble" done, we still need to
   design new tests for the "aninteger" and "strings" extensions.  These can
   be shallow tests.
   - If we decide to do the same API extensions on the ArrayOfDoubles
   classes, those will need to be tested.

Work to be done on documentation:

   - The website documentation is still rather thin on the whole Tuple
   family.  Having someone that is a real user of these classes contribute to
   the documentation to make it more understandable would be outstanding!

Work to be done on characterization.

   - The Tuple family has some characterization, but it is sparse and a lot
   more would work here would give users a sense of the performance they could
   expect.  We have also found that characterization is a powerful way to find
   statistical bugs that don't show up in unit tests.   I could guide you
   through how to set up the various "test harnesses", which is really pretty
   simple, but the real thinking goes into the design of the test and
   understanding the output.  This is a great way to really understand how
   these sketches behave and why.

Work to be done on code reviews:

   - Having independent set of eyes going over the code would also be a
   huge contribution.

Once you have had a chance to study this we should talk about how you want
to contribute.  Clearly a lot of what I have done so far required deep
understanding of the Tuple and Theta classes and was was much more
efficient for me to do.  It would have been a hard slog for anyone new to
the library to undertake.

Once we decide on a strategy, we should put kanban cards in the project
TODO page <https://github.com/apache/incubator-datasketches-java/projects/1>
.

Please let me know what you think!

Lee.

On Wed, May 27, 2020 at 7:53 AM David Cromberge <
[email protected]> wrote:

> Thank you Lee for your proposal regarding my use case and Tuple sketches.
>
> I have spent some time considering the proposal, and I have started
> implementing a potential solution.
>
> At what stage of the pipeline should characterisation tests be proposed,
> since they would obviously depend on a new SNAPSHOT version of the core
> library being available?
>
> I would be grateful for any input about the characterisation workflow.
>
> Thank you,
> David
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: [DISCUSS] Enhancements to Tuple Sketch.

Reply via email to