Re: [DISCUSS] Enhancements to Tuple Sketch.

leerho Fri, 29 May 2020 11:57:15 -0700

David,

This is great, you are getting up-to-speed fast!.  I really appreciate your
digging into this :)


Putting sketches off-heap is an interesting and deep topic.  I will try to
share some of the concepts here, but ultimately, this kind of information
needs to be in some sort of tutorial section on the web site.  Perhaps you
might have suggestions on how this could be better presented.

Large Java system clusters have large amounts of RAM.  A few years ago a
single machine in a cluster might have 24 CPUs, 48 hyperthreads, and 256GB
of RAM.  A medium-sized cluster of such machines might consist of 100 such
machines, where the cluster RAM is now 25TB.  That is a good chunk of
memory!

One model might be where each machine might have only one JVM acting like a
supervisor and configured with perhaps 16 GB of RAM.   All the rest of
memory is allocated to data, which is paged in and out from disk or even
ingested directly from stream feeds or back-end systems such as Hadoop.

Traditionally, most of the data was in the form of primitives and strings,
but one underlying assumption historically has been that the data is static
in size.  When the data is read in, you know what the size is and that
doesn't change for the lifetime that that data exists in memory. Analytic
processing, of course, create large amounts of intermediate storage but
even in these cases, it is generally pretty easy to predict how much
storage is required.

Sketches, viewed as data, especially when there are billions of sketches
that need to be processed present some new challenges.  Suppose in my query
processing I need to merge millions of sketches together and all of these
sketches are sitting in off-heap memory, as they were preprocessed and
built in the back-end system and ingested into my analytic query engine
cluster.  If we had to "heapify" each sketch image into a sketch object
prior to its being merged would be very costly as that requires a
deserialization and copy for each sketch.  So the first objective of having
sketches off-heap is to provide the ability to interpret the sketch image
for merging purposes without having to copy or deserialize the image.
 This we do with our "Wrap(Memory)" functions.

But the query engine needs to allocate perhaps thousands of set operators
(Union, Intersection, Difference) that do the merging.  It would also be
useful to have these allocated in large segment columns off-heap and manage
their memory allocation directly in order to reduce the pressure on the
garbage collector.  There are two challenges with this.  First, these
operators grow dynamically, and second, Java does not really support the
concept of programming dynamic objects off-heap. The only mechanism that
Java provides has been the ByteBuffer, which has severe limitations.

This is why we created the Memory component.  You can think of it as a
ByteBuffer replacement, but it is much more.  The Memory component is what
allows us to do updating and merging of sketches off-heap.  Once we decide
to do this, we are back into more of a C/C++ style of programming where we
need to do our own "malloc()" and "free()" operations directly.   But this
is exactly what large, real-time, query and analysis engines have been
doing for a while.  And these systems have been taking advantage of hidden
capabilities in Java, such as Unsafe, in order to achieve unparalleled
performance.  Use of Unsafe is a hotly debated topic in the Java community,
nonetheless, our Memory component also takes advantage of Unsafe (as does
the ByteBuffer).  (How we move beyond Java 8 is a whole different topic!)

The first attempt to allocate a segment of, say, 1000 dynamic sketches (or
set operators) in off-heap memory if often to choose a slot size that is
the maximum size that a sketch can grow to, given its K configuration, and
then evenly divide the segment into 1000 slots of that size.  This turns
out to be horribly wasteful.  Big data is almost never just one chunk, but
is often highly partitioned into multiple dimensions.  And the result of
this natural fragmentation is that the sizes of all the combinations of
these dimensions tend to follow a power-law distribution, also called "the
long tail".  This happens in nature and almost anything categorized by
humans.  What this means is that if you have millions of sketches that have
processed the millions of streams of all the dimensional combinations, the
vast majority of the sketches will have 1 or a few entries and be very
small in size.  If each of these tiny sketches occupy a slot in memory that
was set to the maximum size to which that sketch can grow, we have wasted a
huge amount of memory.

A smarter approach is one we learned as we were integrating our C++
datasketches library into PostgreSQL.  Of course C++ doesn't have the
off-heap problem at all. Nevertheless, the PostgreSQL system needs to
manage and track what gets allocated and deallocated in memory.  So
PostgreSQL created "palloc()" and "pfree()" functions for the
user-developer, where the PostgreSQL can intercept and manage the
underlying malloc and free processes.   This is also what we have proposed
to Druid and I think they like this approach, but they also have a long
history of using fixed sized slots and transitioning to dynamically sized
slots initiated by the user process is a major transition for them and will
take a while.

The question still remains on how to manage the overall system memory
requirements.  Although we cannot predict which sketches among the millions
of sketches will need only small slots and which ones will grow to the
maximum size, in aggregate, we can learn (from the data) and predict how
much memory we will need which tends to be pretty stable.

Cheers,

Lee.











On Fri, May 29, 2020 at 4:08 AM David Cromberge <
[email protected]> wrote:

> Hi Lee,
>
> I have studied our usage of the off-heap features of the library.
> Originally there was a compelling reason to leverage this whilst decoding
> from serialised bytes in storage.  When servicing queries on behalf of many
> clients, decoding sketches into the heap was potentially problematic from a
> scalability perspective.  However, for various reasons(unrelated to the
> library), we do not currently make use of the off-heap features.
> Regardless, it would be interesting to hear some of the learnings from the
> Druid case (whenever these can be provided), as I am busy collecting notes
> and intend to supplement the website documentation where applicable.
> To sum up, off-heap may be reconsidered in the future, but is not
> currently a priority.
>
> If a byte-array Tuple sketch is being proposed, it may be possible in the
> end to define the various specialisations (integer, double, strings and
> doubles) in terms of the byte-array implementation.  It could also be of
> interest to define a position in the preamble for Tuple sketches where
> there is a flag to identify the summary type, where the defaults (integer,
> double, strings and doubles) may reserve an identifier.  This would allow
> for a powerful Tuple deserialiser that generalises over the summary.
> However, I realise that there are other concerns that come into play with
> backwards compatibility, as well as performance criteria.
>
> David.
>
>
> On 28 May 2020, at 16:12, leerho <[email protected]> wrote:
>
> David, you are correct.  I also looked at the AOD implementation last
> night and it lacks some notable capabilities that I think you need, but
> others would want as well.
>
>    - Integration of Theta (as you noted)
>    - There is no selectable "mode" equivalent for a Union on how to
>    combine two double summaries. Intersection has a dedicated combiner, but
>    doesn't have the "mode" capability to allow easy choices between a set of
>    modes.
>    - There may be other issues, but I haven't studied this AOD code for
>    several years :(
>
> I am wondering if, for your case, if we had a generic, only-on-heap
> solution relatively quickly that you could use and get some experience with
> and could characterize its performance in your environment.  And then work
> on a more efficient off-heap solution later.
>
> Using sketches off-heap requires some significant design decisions at the
> system level to make it work really well.  We have worked with the Druid
> folks for quite a while and have learned some things that may be helpful
> for you.  But these issues are outside and beyond the issues of designing a
> sketch that can work off-heap.   It would be helpful to get a sense of
> where you are in your thinking about going off-heap.
>
> Lee.
>
>
>
> On Thu, May 28, 2020 at 7:45 AM David Cromberge <
> [email protected]> wrote:
>
>> As a follow-up to the proposal below, it may not be necessary to provide
>> a combiner and default set of values together, since there may be some
>> redundancy here.
>> It’s probably better for me to re-iterate the intention - to provide a
>> meaningful way to interpret the presence of a key in a theta with some
>> suitable default value.
>>
>> On 28 May 2020, at 15:39, David Cromberge <[email protected]>
>> wrote:
>>
>> Lee, the array of bytes implementation is a better approach than adding
>> off-heap implementations for the integer, double and strings Tuple sketches.
>>
>> I originally overlooked the ArrayOfDoubles sketch for the purposes of
>> tracking engagement, since it implies that many values could be associated
>> with a hashed key, which doesn’t quite fit the use case.
>>
>> Having said that, I have now looked through the implementation and have
>> switched to the array of doubles sketch instead - after all, you pointed
>> out that it should suffice.
>>
>> I have run some initial benchmarks on sizing, and in compacted form I did
>> not get a reduction in the size of a compacted sketch generated from a real
>> data set, despite the benefits of using primitive doubles.  I realise that
>> this is dependent on the test case and tuning / configuration parameters,
>> so we could add a TODO item to add a characterisation test for this, if
>> there is not one already.
>>
>> We don’t seem to have discussed how the Theta sketches may be included
>> for intersections and unions regarding an AOD sketch.   For unions, the
>> behaviour is delegated to a merge operation on the sketch, which ultimately
>> adds values together for the same key.  Concerning intersection, a combiner
>> implementation is used to determine how values should be combined.  It is
>> noteworthy that in the druid extension, the values are summed together,
>> with a comment noting that this may not apply to all circumstances.
>>
>> I would propose a similar mechanism for both union and intersection on
>> the other sketches, where a default array of values can be provided for a
>> tuple sketch:
>>
>> public void update(final org.apache.datasketches.theta.Sketch sketchIn,
>> double[] defaultValues, ArrayOfDoublesCombiner c) {...}
>>
>> Of course, a factory could be provided that creates a combiner according
>> to a summary mode.  The use case for this suggestion is to have context
>> specific behaviour with regard to merging / combining values in the case of
>> unions and intersections, which could be sourced from the user or user
>> query.
>>
>> I would be interested to hear your thoughts on the
>> ArrayOfDoubles/ArrayOfBytes Tuple sketch integration with Theta sketches,
>> David
>>
>>
>> On 28 May 2020, at 04:32, leerho <[email protected]> wrote:
>>
>> David,  In fact, double values do just fine with integer data.  They are
>> twice as big at the primitive level, however, the dedicated ArrayOfDoubles
>> implementation in total might actually be smaller, since it is not carrying
>> all the object overhead that are required to do generics. Plus, it will be
>> a whole lot faster! It is already fully implemented with all the set
>> operations, off-heap memory, serialization /deserialization and a full test
>> suite.
>>
>> Designing a similar dedicated ArrayOfIntegers would be a lot of work and
>> wouldn't be my top priority for the next dedicated Tuple sketch to build.
>> What would be more flexible would actually be a dedicated ArrayOfBytes
>> implementation, because bytes are the foundation from which we can derive
>> almost any summary we want.
>>
>> Think about it.
>>
>> Lee.
>>
>>
>>
>> On Wed, May 27, 2020 at 5:54 PM leerho <[email protected]> wrote:
>>
>>> David,  This is good feedback.  I didn't realize you wanted off-heap
>>> operation.  That changes a lot of things.  I have to go right now, but let
>>> me think about this.  Attempting to leverage generics off-heap is messy.
>>> For your case it may be better to either leverage the ArrayOfDoubles
>>> implementation, which already operates off-heap, or think about emulating
>>> the AOD and creating a dedicated AOIntegers.
>>>
>>> Lee.
>>>
>>> On Wed, May 27, 2020 at 3:04 PM David Cromberge <
>>> [email protected]> wrote:
>>>
>>>> Hi Lee,
>>>>
>>>> Thanks for providing such a detailed plan of action for the Tuple
>>>> sketch package.
>>>> The enhancements that you have listed are interesting, and I will
>>>> certainly check out your branch to get a clearer understanding of how the
>>>> library is evolving.
>>>>
>>>> For what it’s worth, here is a record of my attempt to integrate Theta
>>>> sketches into the Tuple sketch set operations:
>>>>
>>>> https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5
>>>>
>>>> Although I have a cursory understanding of the library’s internals, I
>>>> included the commit above because there were some interesting tradeoffs to
>>>> the implementation, and it gave me a better appreciation for the the
>>>> internal workings of the existing Tuple sketch as well as some of the finer
>>>> points in your improvement work.   To a lesser degree, it also serves to
>>>> independently confirm your argument for adding new variants of the update
>>>> methods!
>>>>
>>>> During implementation, I was also faced with the decision as to whether
>>>> to duplicate the methods or to convert a Theta sketch to a Tuple sketch
>>>> first and delegate to the existing methods.  But, as you noted, this
>>>> requires an additional iteration through the result set and incurs a
>>>> performance penalty.  Therefore, I also duplicated the existing update
>>>> methods, with some changes for result extraction.  To ensure correctness, I
>>>> found it necessary to duplicate a large portion of the existing test cases
>>>> as well - replicating so many of the existing tests was not ideal, but
>>>> helped verify that the implementation was correct.
>>>> It’s also worth mentioning that I had some difficulty implementing the
>>>> AnotB functionality, and in fact the results were incorrect when the sketch
>>>> crossed into estimation mode (see ignored tests).  I’m pleased to have
>>>> attempted the exercise because it will give much better context as I study
>>>> your branch further - especially the AnotB refactoring.
>>>>
>>>> There is one addition that I would like to suggest to your list of TODO
>>>> items - namely, off-heap implementations.  I am considering using the
>>>> integer tuple sketch for our engagement use case and would prefer to
>>>> prevent memory pressure by loading many sketches onto the heap.  I have
>>>> noticed this come up in the past on #datasketches slack channel in a
>>>> conversation with some Druid team members.  It appears that the off-heap
>>>> implementations were omitted from the library due to time constraints, and
>>>> this is area where I could also potentially provide default implementations
>>>> for the other tuple sketches.  I think this is important to consider,
>>>> because the existing ArrayOfDoubles implementation uses an abstract class
>>>> for the parent.  Making the other Tuple sketches abstract in a similar
>>>> manner is potentially a breaking change as well.
>>>>
>>>> I am excited to collaborate on this together on this feature, and I
>>>> would be happy to contribute in any possible way and coordinate through 
>>>> the project
>>>> TODO page
>>>> <https://github.com/apache/incubator-datasketches-java/projects/1>!
>>>>
>>>> David
>>>>
>>>>
>>>>
>>>> On 27 May 2020, at 20:04, leerho <[email protected]> wrote:
>>>>
>>>> David,
>>>>
>>>> Thanks.  I have been putting a lot of thought into it as well and
>>>> decided that it was time to make some other long-needed changes in the
>>>> Tuple Family of sketches as well including the package layout, which has
>>>> been quite cumbersome.  I would suggest holding back on you actual
>>>> implementation work until you understand what I have changed so far and
>>>> then we can strategize on how to finish the work.   I have checked in my
>>>> changes so far into the "Tuple_Theta_Extension" branch, which you can check
>>>> out to see what I have been up to :)
>>>>
>>>> The family of tuple sketches have evolved over time and somewhat
>>>> haphazardly.  So the first think I decided to do was to do some rearranging
>>>> of the package structure to make future downstream improvements and
>>>> extensions easier.
>>>>
>>>> 1.  The first problem is that the root tuple directory was cluttered
>>>> with two different groups of classes that made it difficult for anyone to
>>>> figure out what is going on.  One group of classes form the base generic
>>>> classes of the tuple sketch on which the concrete extensions "adouble" (a
>>>> single double), "aninteger" (a single integer), and "strings" (array of
>>>> strings) depend.  These three concrete extensions are already in their own
>>>> sub directories.
>>>>
>>>> The second, largest group of classes were a dedicated non-generic
>>>> implementation of the tuple sketch, which implemented an array of doubles.
>>>> All of these classes had "ArrayOfDoubles" in their name.  These classes
>>>> shared no code with the root generic tuple classes except for a few methods
>>>> in the SerializerDeserializer and the Util classes.  By making a few
>>>> methods public, I was able to move all of the "ArrayOfDoubles" classes into
>>>> their own subdirectory.  This creates an incompatible API break, which will
>>>> force us to move to a 2.0.0 for the next version.   Now the tuple root
>>>> directory is much cleaner and easier to navigate and understand.  There are
>>>> several reasons for this separate dedicated implementation. First, we felt
>>>> that a configurable array of doubles would be a relatively common use
>>>> case.  Second, we wanted a full concrete example of the tuple sketch as an
>>>> example of what it would look like including both on-heap and off-heap
>>>> variants.   It is this ArrayOfDoubles implementation that has been
>>>> integrated into Druid, for example.
>>>>
>>>> 2. Now that the package directories are cleaned up I was able to focus
>>>> on what it would mean to allow Tuple sketches to perform set operations
>>>> with Theta sketches.
>>>>
>>>> One approach would be to just provide a converter to take in a Theta
>>>> sketch and produce a Tuple sketch with some default or configured summary
>>>> and leave everything else the way it is.  But this is less efficient as it
>>>> requires more object creation and copying than a direct integration would.
>>>> It turns out that modifying the generic Union and Intersection classes only
>>>> required adding one method to each.  I did some minor code cleanup and code
>>>> documentation at the same time.
>>>>
>>>> The AnotB operator is another story.  We have never been really happy
>>>> with how this was implemented the first time.  The current API is clumsy.
>>>> So I have taken the opportunity to redesign the API for this class.  It
>>>> still has the current API methods but deprecated.  With the new
>>>> modified class the user has several ways of performing AnotB.
>>>>
>>>> As stateless operations:
>>>>
>>>>    - With Tuple: resultSk = aNotB(skTupleA, skTupleB);
>>>>    - With Theta: resultSk = aNotB(skTupleA, skThetaB);
>>>>
>>>> As stateful, sequential operations:
>>>>
>>>>    - void setA(skTupleA);
>>>>    - void notB(skTupleB);   or   void notB(skThetaB);   //These are
>>>>    interchangable.
>>>>    - ...
>>>>    - void notB(skTupleB);   or   void notB(skThetaB);   //These are
>>>>    interchangable.
>>>>    - resultSk = getResult(reset = false);  // This allows getting an
>>>>    intermediate result
>>>>    - void notB(skTupleB);   or   void notB(skThetaB);   //Continue...
>>>>    - resultSK = getResult(reset = true); //This returns the result and
>>>>    clears the internal state to empty.
>>>>
>>>> This I think is pretty slick and flexible.
>>>>
>>>> Work yet to be done on main:
>>>>
>>>>    - Reexamine the Union and Intersection APIs to add the option of an
>>>>    intermediate result.
>>>>    - Update the other concrete extensions to take advantage of the
>>>>    above new API: "aninteger", "strings".
>>>>    - Examine the dedicated "ArrayOfDoubes" implementation to see how
>>>>    hard it would be to make the same changes as above.  Implement. Test.
>>>>
>>>> Work yet to be done on test:
>>>>
>>>> I did major redesign of the testing class for the AnotB generic class
>>>> using the "adouble" concrete extension.  You can see this in
>>>> AdoubleAnotBTest.java.  This is essentially a deep exhaustive test of the
>>>> base AnotB classes via the concrete extension.
>>>>
>>>>    - With the deep testing using the "adouble" done, we still need to
>>>>    design new tests for the "aninteger" and "strings" extensions.  These 
>>>> can
>>>>    be shallow tests.
>>>>    - If we decide to do the same API extensions on the ArrayOfDoubles
>>>>    classes, those will need to be tested.
>>>>
>>>> Work to be done on documentation:
>>>>
>>>>    - The website documentation is still rather thin on the whole Tuple
>>>>    family.  Having someone that is a real user of these classes contribute 
>>>> to
>>>>    the documentation to make it more understandable would be outstanding!
>>>>
>>>> Work to be done on characterization.
>>>>
>>>>    - The Tuple family has some characterization, but it is sparse and
>>>>    a lot more would work here would give users a sense of the performance 
>>>> they
>>>>    could expect.  We have also found that characterization is a powerful 
>>>> way
>>>>    to find statistical bugs that don't show up in unit tests.   I could 
>>>> guide
>>>>    you through how to set up the various "test harnesses", which is really
>>>>    pretty simple, but the real thinking goes into the design of the test 
>>>> and
>>>>    understanding the output.  This is a great way to really understand how
>>>>    these sketches behave and why.
>>>>
>>>> Work to be done on code reviews:
>>>>
>>>>    - Having independent set of eyes going over the code would also be
>>>>    a huge contribution.
>>>>
>>>> Once you have had a chance to study this we should talk about how you
>>>> want to contribute.  Clearly a lot of what I have done so far required deep
>>>> understanding of the Tuple and Theta classes and was was much more
>>>> efficient for me to do.  It would have been a hard slog for anyone new to
>>>> the library to undertake.
>>>>
>>>> Once we decide on a strategy, we should put kanban cards in the project
>>>> TODO page
>>>> <https://github.com/apache/incubator-datasketches-java/projects/1>.
>>>>
>>>> Please let me know what you think!
>>>>
>>>> Lee.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, May 27, 2020 at 7:53 AM David Cromberge <
>>>> [email protected]> wrote:
>>>>
>>>>> Thank you Lee for your proposal regarding my use case and Tuple
>>>>> sketches.
>>>>>
>>>>> I have spent some time considering the proposal, and I have started
>>>>> implementing a potential solution.
>>>>>
>>>>> At what stage of the pipeline should characterisation tests be
>>>>> proposed, since they would obviously depend on a new SNAPSHOT version of
>>>>> the core library being available?
>>>>>
>>>>> I would be grateful for any input about the characterisation workflow.
>>>>>
>>>>> Thank you,
>>>>> David
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>>
>>>>
>>
>>
>

Re: [DISCUSS] Enhancements to Tuple Sketch.

Reply via email to