Re: [DISCUSS] Enhancements to Tuple Sketch.

leerho Thu, 28 May 2020 08:13:23 -0700

David, you are correct.  I also looked at the AOD implementation last night
and it lacks some notable capabilities that I think you need, but others
would want as well.


   - Integration of Theta (as you noted)
   - There is no selectable "mode" equivalent for a Union on how to combine
   two double summaries. Intersection has a dedicated combiner, but doesn't
   have the "mode" capability to allow easy choices between a set of modes.
   - There may be other issues, but I haven't studied this AOD code for
   several years :(

I am wondering if, for your case, if we had a generic, only-on-heap
solution relatively quickly that you could use and get some experience with
and could characterize its performance in your environment.  And then work
on a more efficient off-heap solution later.

Using sketches off-heap requires some significant design decisions at the
system level to make it work really well.  We have worked with the Druid
folks for quite a while and have learned some things that may be helpful
for you.  But these issues are outside and beyond the issues of designing a
sketch that can work off-heap.   It would be helpful to get a sense of
where you are in your thinking about going off-heap.

Lee.



On Thu, May 28, 2020 at 7:45 AM David Cromberge <
[email protected]> wrote:

> As a follow-up to the proposal below, it may not be necessary to provide a
> combiner and default set of values together, since there may be some
> redundancy here.
> It’s probably better for me to re-iterate the intention - to provide a
> meaningful way to interpret the presence of a key in a theta with some
> suitable default value.
>
> On 28 May 2020, at 15:39, David Cromberge <[email protected]>
> wrote:
>
> Lee, the array of bytes implementation is a better approach than adding
> off-heap implementations for the integer, double and strings Tuple sketches.
>
> I originally overlooked the ArrayOfDoubles sketch for the purposes of
> tracking engagement, since it implies that many values could be associated
> with a hashed key, which doesn’t quite fit the use case.
>
> Having said that, I have now looked through the implementation and have
> switched to the array of doubles sketch instead - after all, you pointed
> out that it should suffice.
>
> I have run some initial benchmarks on sizing, and in compacted form I did
> not get a reduction in the size of a compacted sketch generated from a real
> data set, despite the benefits of using primitive doubles.  I realise that
> this is dependent on the test case and tuning / configuration parameters,
> so we could add a TODO item to add a characterisation test for this, if
> there is not one already.
>
> We don’t seem to have discussed how the Theta sketches may be included for
> intersections and unions regarding an AOD sketch.   For unions, the
> behaviour is delegated to a merge operation on the sketch, which ultimately
> adds values together for the same key.  Concerning intersection, a combiner
> implementation is used to determine how values should be combined.  It is
> noteworthy that in the druid extension, the values are summed together,
> with a comment noting that this may not apply to all circumstances.
>
> I would propose a similar mechanism for both union and intersection on the
> other sketches, where a default array of values can be provided for a tuple
> sketch:
>
> public void update(final org.apache.datasketches.theta.Sketch sketchIn,
> double[] defaultValues, ArrayOfDoublesCombiner c) {...}
>
> Of course, a factory could be provided that creates a combiner according
> to a summary mode.  The use case for this suggestion is to have context
> specific behaviour with regard to merging / combining values in the case of
> unions and intersections, which could be sourced from the user or user
> query.
>
> I would be interested to hear your thoughts on the
> ArrayOfDoubles/ArrayOfBytes Tuple sketch integration with Theta sketches,
> David
>
>
> On 28 May 2020, at 04:32, leerho <[email protected]> wrote:
>
> David,  In fact, double values do just fine with integer data.  They are
> twice as big at the primitive level, however, the dedicated ArrayOfDoubles
> implementation in total might actually be smaller, since it is not carrying
> all the object overhead that are required to do generics. Plus, it will be
> a whole lot faster! It is already fully implemented with all the set
> operations, off-heap memory, serialization /deserialization and a full test
> suite.
>
> Designing a similar dedicated ArrayOfIntegers would be a lot of work and
> wouldn't be my top priority for the next dedicated Tuple sketch to build.
> What would be more flexible would actually be a dedicated ArrayOfBytes
> implementation, because bytes are the foundation from which we can derive
> almost any summary we want.
>
> Think about it.
>
> Lee.
>
>
>
> On Wed, May 27, 2020 at 5:54 PM leerho <[email protected]> wrote:
>
>> David,  This is good feedback.  I didn't realize you wanted off-heap
>> operation.  That changes a lot of things.  I have to go right now, but let
>> me think about this.  Attempting to leverage generics off-heap is messy.
>> For your case it may be better to either leverage the ArrayOfDoubles
>> implementation, which already operates off-heap, or think about emulating
>> the AOD and creating a dedicated AOIntegers.
>>
>> Lee.
>>
>> On Wed, May 27, 2020 at 3:04 PM David Cromberge <
>> [email protected]> wrote:
>>
>>> Hi Lee,
>>>
>>> Thanks for providing such a detailed plan of action for the Tuple sketch
>>> package.
>>> The enhancements that you have listed are interesting, and I will
>>> certainly check out your branch to get a clearer understanding of how the
>>> library is evolving.
>>>
>>> For what it’s worth, here is a record of my attempt to integrate Theta
>>> sketches into the Tuple sketch set operations:
>>>
>>> https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5
>>>
>>> Although I have a cursory understanding of the library’s internals, I
>>> included the commit above because there were some interesting tradeoffs to
>>> the implementation, and it gave me a better appreciation for the the
>>> internal workings of the existing Tuple sketch as well as some of the finer
>>> points in your improvement work.   To a lesser degree, it also serves to
>>> independently confirm your argument for adding new variants of the update
>>> methods!
>>>
>>> During implementation, I was also faced with the decision as to whether
>>> to duplicate the methods or to convert a Theta sketch to a Tuple sketch
>>> first and delegate to the existing methods.  But, as you noted, this
>>> requires an additional iteration through the result set and incurs a
>>> performance penalty.  Therefore, I also duplicated the existing update
>>> methods, with some changes for result extraction.  To ensure correctness, I
>>> found it necessary to duplicate a large portion of the existing test cases
>>> as well - replicating so many of the existing tests was not ideal, but
>>> helped verify that the implementation was correct.
>>> It’s also worth mentioning that I had some difficulty implementing the
>>> AnotB functionality, and in fact the results were incorrect when the sketch
>>> crossed into estimation mode (see ignored tests).  I’m pleased to have
>>> attempted the exercise because it will give much better context as I study
>>> your branch further - especially the AnotB refactoring.
>>>
>>> There is one addition that I would like to suggest to your list of TODO
>>> items - namely, off-heap implementations.  I am considering using the
>>> integer tuple sketch for our engagement use case and would prefer to
>>> prevent memory pressure by loading many sketches onto the heap.  I have
>>> noticed this come up in the past on #datasketches slack channel in a
>>> conversation with some Druid team members.  It appears that the off-heap
>>> implementations were omitted from the library due to time constraints, and
>>> this is area where I could also potentially provide default implementations
>>> for the other tuple sketches.  I think this is important to consider,
>>> because the existing ArrayOfDoubles implementation uses an abstract class
>>> for the parent.  Making the other Tuple sketches abstract in a similar
>>> manner is potentially a breaking change as well.
>>>
>>> I am excited to collaborate on this together on this feature, and I
>>> would be happy to contribute in any possible way and coordinate through the 
>>> project
>>> TODO page
>>> <https://github.com/apache/incubator-datasketches-java/projects/1>!
>>>
>>> David
>>>
>>>
>>>
>>> On 27 May 2020, at 20:04, leerho <[email protected]> wrote:
>>>
>>> David,
>>>
>>> Thanks.  I have been putting a lot of thought into it as well and
>>> decided that it was time to make some other long-needed changes in the
>>> Tuple Family of sketches as well including the package layout, which has
>>> been quite cumbersome.  I would suggest holding back on you actual
>>> implementation work until you understand what I have changed so far and
>>> then we can strategize on how to finish the work.   I have checked in my
>>> changes so far into the "Tuple_Theta_Extension" branch, which you can check
>>> out to see what I have been up to :)
>>>
>>> The family of tuple sketches have evolved over time and somewhat
>>> haphazardly.  So the first think I decided to do was to do some rearranging
>>> of the package structure to make future downstream improvements and
>>> extensions easier.
>>>
>>> 1.  The first problem is that the root tuple directory was cluttered
>>> with two different groups of classes that made it difficult for anyone to
>>> figure out what is going on.  One group of classes form the base generic
>>> classes of the tuple sketch on which the concrete extensions "adouble" (a
>>> single double), "aninteger" (a single integer), and "strings" (array of
>>> strings) depend.  These three concrete extensions are already in their own
>>> sub directories.
>>>
>>> The second, largest group of classes were a dedicated non-generic
>>> implementation of the tuple sketch, which implemented an array of doubles.
>>> All of these classes had "ArrayOfDoubles" in their name.  These classes
>>> shared no code with the root generic tuple classes except for a few methods
>>> in the SerializerDeserializer and the Util classes.  By making a few
>>> methods public, I was able to move all of the "ArrayOfDoubles" classes into
>>> their own subdirectory.  This creates an incompatible API break, which will
>>> force us to move to a 2.0.0 for the next version.   Now the tuple root
>>> directory is much cleaner and easier to navigate and understand.  There are
>>> several reasons for this separate dedicated implementation. First, we felt
>>> that a configurable array of doubles would be a relatively common use
>>> case.  Second, we wanted a full concrete example of the tuple sketch as an
>>> example of what it would look like including both on-heap and off-heap
>>> variants.   It is this ArrayOfDoubles implementation that has been
>>> integrated into Druid, for example.
>>>
>>> 2. Now that the package directories are cleaned up I was able to focus
>>> on what it would mean to allow Tuple sketches to perform set operations
>>> with Theta sketches.
>>>
>>> One approach would be to just provide a converter to take in a Theta
>>> sketch and produce a Tuple sketch with some default or configured summary
>>> and leave everything else the way it is.  But this is less efficient as it
>>> requires more object creation and copying than a direct integration would.
>>> It turns out that modifying the generic Union and Intersection classes only
>>> required adding one method to each.  I did some minor code cleanup and code
>>> documentation at the same time.
>>>
>>> The AnotB operator is another story.  We have never been really happy
>>> with how this was implemented the first time.  The current API is clumsy.
>>> So I have taken the opportunity to redesign the API for this class.  It
>>> still has the current API methods but deprecated.  With the new
>>> modified class the user has several ways of performing AnotB.
>>>
>>> As stateless operations:
>>>
>>>    - With Tuple: resultSk = aNotB(skTupleA, skTupleB);
>>>    - With Theta: resultSk = aNotB(skTupleA, skThetaB);
>>>
>>> As stateful, sequential operations:
>>>
>>>    - void setA(skTupleA);
>>>    - void notB(skTupleB);   or   void notB(skThetaB);   //These are
>>>    interchangable.
>>>    - ...
>>>    - void notB(skTupleB);   or   void notB(skThetaB);   //These are
>>>    interchangable.
>>>    - resultSk = getResult(reset = false);  // This allows getting an
>>>    intermediate result
>>>    - void notB(skTupleB);   or   void notB(skThetaB);   //Continue...
>>>    - resultSK = getResult(reset = true); //This returns the result and
>>>    clears the internal state to empty.
>>>
>>> This I think is pretty slick and flexible.
>>>
>>> Work yet to be done on main:
>>>
>>>    - Reexamine the Union and Intersection APIs to add the option of an
>>>    intermediate result.
>>>    - Update the other concrete extensions to take advantage of the
>>>    above new API: "aninteger", "strings".
>>>    - Examine the dedicated "ArrayOfDoubes" implementation to see how
>>>    hard it would be to make the same changes as above.  Implement. Test.
>>>
>>> Work yet to be done on test:
>>>
>>> I did major redesign of the testing class for the AnotB generic class
>>> using the "adouble" concrete extension.  You can see this in
>>> AdoubleAnotBTest.java.  This is essentially a deep exhaustive test of the
>>> base AnotB classes via the concrete extension.
>>>
>>>    - With the deep testing using the "adouble" done, we still need to
>>>    design new tests for the "aninteger" and "strings" extensions.  These can
>>>    be shallow tests.
>>>    - If we decide to do the same API extensions on the ArrayOfDoubles
>>>    classes, those will need to be tested.
>>>
>>> Work to be done on documentation:
>>>
>>>    - The website documentation is still rather thin on the whole Tuple
>>>    family.  Having someone that is a real user of these classes contribute 
>>> to
>>>    the documentation to make it more understandable would be outstanding!
>>>
>>> Work to be done on characterization.
>>>
>>>    - The Tuple family has some characterization, but it is sparse and a
>>>    lot more would work here would give users a sense of the performance they
>>>    could expect.  We have also found that characterization is a powerful way
>>>    to find statistical bugs that don't show up in unit tests.   I could 
>>> guide
>>>    you through how to set up the various "test harnesses", which is really
>>>    pretty simple, but the real thinking goes into the design of the test and
>>>    understanding the output.  This is a great way to really understand how
>>>    these sketches behave and why.
>>>
>>> Work to be done on code reviews:
>>>
>>>    - Having independent set of eyes going over the code would also be a
>>>    huge contribution.
>>>
>>> Once you have had a chance to study this we should talk about how you
>>> want to contribute.  Clearly a lot of what I have done so far required deep
>>> understanding of the Tuple and Theta classes and was was much more
>>> efficient for me to do.  It would have been a hard slog for anyone new to
>>> the library to undertake.
>>>
>>> Once we decide on a strategy, we should put kanban cards in the project
>>> TODO page
>>> <https://github.com/apache/incubator-datasketches-java/projects/1>.
>>>
>>> Please let me know what you think!
>>>
>>> Lee.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, May 27, 2020 at 7:53 AM David Cromberge <
>>> [email protected]> wrote:
>>>
>>>> Thank you Lee for your proposal regarding my use case and Tuple
>>>> sketches.
>>>>
>>>> I have spent some time considering the proposal, and I have started
>>>> implementing a potential solution.
>>>>
>>>> At what stage of the pipeline should characterisation tests be
>>>> proposed, since they would obviously depend on a new SNAPSHOT version of
>>>> the core library being available?
>>>>
>>>> I would be grateful for any input about the characterisation workflow.
>>>>
>>>> Thank you,
>>>> David
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>>
>>>
>
>

Re: [DISCUSS] Enhancements to Tuple Sketch.

Reply via email to