I think community extensions would be a great idea, Lee.  I have a few ideas 
that could be included in that space.

Previously, the difficulty that I had with the API for conversion was the 
requirement to access the internal state of a sketch, which is inaccessible 
by design.  The specific example is accessing the hashed keys and creating a 
new sketch from them without rehashing them.
If a summary conversion can safely be done as an extension, this would be an 
ideal place for it.

Thanks for the background on the Druid off-heap use case, that’s useful 
information!

I will try to explain my concern about compatibility on the new AOD thread that 
you have created.

Dave.

> On 2 Jun 2020, at 20:31, leerho <[email protected]> wrote:
> 
> Dave,
> 
> Thank you for your thoughtful responses!  We really value your feedback!  
> 
> I was thinking through how this could be implemented, and it would be 
> difficult to do without introducing problems with backwards compatibility.
> 
> Could you elaborate on this a bit more?  What is an example of backwards 
> compatibility you were thinking of?
> 
> In my mind I was thinking of using the Memory package to interpret the AOB 
> summary into whatever the user wanted.  This would be external to the library 
> and would be written by the user.  We could certainly provide examples of how 
> to do this.  
> 
> The AOB model in theory also allows for C struct types where each of the 
> "columns" could represent different types, but packed very efficiently into 
> an overall byte array.  But it is up to the user to define what the structure 
> is.  Again, Memory can facilitate this encoding and decoding of the bytes.  
> There are other encoding/decoding schemes that could play a role here such as 
> Google's protobuf, flatbuffers, or flexbuffers.  Now, how do we make the core 
> tuple code flexible enough to handle the updating of these varied columns?
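The packed-columns idea above can be sketched in plain Java.  This is only an illustration: `ByteBuffer` stands in for the DataSketches Memory component, and the field layout (count, sum, lastSeen) is a hypothetical summary structure, not anything defined by the library.

```java
import java.nio.ByteBuffer;

// Hypothetical packed "C struct" summary: {int count; double sum; long lastSeen}
// encoded into a fixed-width byte record and decoded in place, with no
// per-field object overhead.  ByteBuffer plays the role Memory would play.
public class PackedSummaryDemo {
  static final int COUNT_OFF = 0;   // int,    4 bytes
  static final int SUM_OFF   = 4;   // double, 8 bytes
  static final int LAST_OFF  = 12;  // long,   8 bytes
  static final int REC_BYTES = 20;

  static byte[] encode(int count, double sum, long lastSeen) {
    ByteBuffer bb = ByteBuffer.allocate(REC_BYTES);
    bb.putInt(COUNT_OFF, count);      // absolute puts at fixed offsets
    bb.putDouble(SUM_OFF, sum);
    bb.putLong(LAST_OFF, lastSeen);
    return bb.array();
  }

  public static void main(String[] args) {
    byte[] record = encode(3, 42.5, 1591111111L);
    // Wrap the bytes and read fields in place -- no deserialization step.
    ByteBuffer view = ByteBuffer.wrap(record);
    System.out.println(view.getInt(COUNT_OFF));   // 3
    System.out.println(view.getDouble(SUM_OFF));  // 42.5
    System.out.println(view.getLong(LAST_OFF));   // 1591111111
  }
}
```

The same fixed-offset decoding is what makes it possible for user code, rather than the core library, to interpret each "column" of a byte-array summary.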
> 
> Your idea of converters is interesting, where one wants to convert, for 
> example, an array of integers to an array of doubles (which is pretty much a 
> one-way conversion).  Nevertheless, couldn't those be external to the core 
> library?  Interestingly, my colleagues and I were talking just this morning 
> about the idea of a contributor's extensions area where this kind of stuff 
> could be shared.  The question that comes up is how these extensions would 
> be coordinated or specified to be compatible with specific core 
> releases.  Eventually, we may need a separate repository for that.
> 
> ****
> 
> assuming that the write workload is independent of the read workload, it is 
> fair to assume that the data being queried is immutable, and of a fixed size, 
> such as an update sketch (write) vs compact sketch (read).
> 
> Not always.  Several years ago our system colleagues at Yahoo implemented a 
> real-time query engine/database, where the raw data was continuous, real-time 
> streams of data from web servers all over the world.  These streams were 
> split into many dimensions by a back-end Storm system and ingested directly 
> into Druid.  In Druid these streams were then directed into many rotating 
> time-window buffers of sketches.  Each time-window buffer had sketches on 1 
> minute time intervals, 48 hours deep.  Each event was directed to the proper 
> dimension, and its timestamp directed it to the correct sketch in the 
> time-window.  Because the sketches are essentially "additive", they allowed 
> for late-data processing, which is particularly a problem with data collected 
> from mobile phones (it can easily be 24 hours late!).  All of these sketches 
> are being updated continuously and there are millions of them.  When the 
> query comes it can query the sketches of the specified time range and 
> dimensions and produce results to the user in seconds.   At the time, all of 
> these sketches were allocated fixed-sized slots, so the memory usage was not 
> all that efficient.  Nevertheless, this system produced real-time results to 
> queries that were virtually impossible to do using exact methods, and with a 
> smaller system footprint!   This system processed over 1 billion sketches 
> every day.  They were not on the heap.
> 
> 
> Cheers!  Keep the ideas coming!
> 
> Lee.
> 
> On Tue, Jun 2, 2020 at 7:49 AM David Cromberge <[email protected] 
> <mailto:[email protected]>> wrote:
> Thanks for such a detailed response, Lee!  You and your team have really 
> raised the bar and made Apache DataSketches a very welcoming and approachable 
> community.
> 
> I had to re-read your response several times to appreciate the finer details 
> of the difficulties that you were referring to.  For an analytics 
> architecture, I have anecdotal experience that it is often useful to leverage 
> some intrinsic properties of analytic data to make simplifications.
> In this instance, assuming that the write workload is independent of the read 
> workload, it is fair to assume that the data being queried is immutable, and 
> of a fixed size, such as an update sketch (write) vs compact sketch (read).  
> However, I didn’t account for storing intermediate results of set operations 
> off-heap as well.    At first, it seemed natural to store the intermediate 
> accumulator sketch on the heap, and combine this in a pairwise fashion with 
> off-heap sketches.  This is probably a fundamental problem with my 
> understanding of off-heap allocation, because mixing the two memory sources 
> during an operation would not be possible.  You have provided a lot of 
> insight here regarding all the problems with dynamically sized slot 
> allocation, and its challenges.
> 
> In the past, I have used a memory-mapped file to allow the OS to manage 
> loading/unloading file segments into memory on demand, which is used in 
> Influx and Prometheus TSDB.  I had this in mind when you mentioned Postgres 
> managing memory allocations.  
> 
> Not to derail the thread, but I would like to raise another discussion at 
> some point on the broader topic of whether it is a good idea to perform 
> 1000s of set operations at query time, or whether pre-combining and 
> eliminating the long tail and uninteresting data (with a tool like 
> Macrobase), would help minimise the error introduced by all the set 
> operations.  As mentioned on the site, using pre-computations to deal with 
> power-law distributions would lead to many small sketches which would store 
> every key anyway.
> 
> Returning to the topic at hand, we previously discussed having a generic byte 
> implementation for a Tuple Sketch, which could unify the different concrete 
> summary types.  I was thinking through how this could be implemented, and it 
> would be difficult to do without introducing problems with backwards 
> compatibility.  I had an alternative idea that also serves to further widen 
> the utility of different Tuple Sketches - namely, introducing a conversion 
> between summary types.
> 
> This function could be defined as:
> 
>       public <T extends Summary> Sketch<T> convertSummaries(Function<S, T> 
> converter) { … }
> 
> This approach could be used to freely convert between summary representations 
> such as integers, strings, doubles, etc.  For ArrayOfDoubles, a more concrete 
> signature would be needed:
> 
>       public <S extends Summary> Sketch<S> convertSummaries(Function<double[], 
> S> converter) { … }
> 
> Such adaptation is compelling because it prevents an all-or-nothing approach 
> to storing sketches in some external datastore.  If we decided today to use 
> a TupleInteger, and ran into limitations, we could adapt them to double 
> implementations and write these back to storage on access, in a lazy manner.  
> Moreover, there may be cases where sketches of different datatypes could be 
> mixed in the set operations.  The downside is that some summary types may not 
> be coercible, but exposing this as a function allows the user to decide.
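A rough sketch of how such a converter could live entirely outside the core library, under a deliberately simplified model in which a "sketch" is reduced to a map from retained hash keys to summaries.  The name `convertSummaries` and the map model are illustrative only, not the actual Tuple sketch API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative external converter: apply a per-summary function to every
// retained entry, carrying the hash keys over untouched -- i.e., no rehashing
// of the original items, which is the property the conversion needs.
public class SummaryConversionDemo {
  static <S, T> Map<Long, T> convertSummaries(Map<Long, S> retained,
                                              Function<S, T> converter) {
    Map<Long, T> out = new HashMap<>();
    for (Map.Entry<Long, S> e : retained.entrySet()) {
      out.put(e.getKey(), converter.apply(e.getValue()));
    }
    return out;
  }

  public static void main(String[] args) {
    Map<Long, Integer> intSummaries = new HashMap<>();
    intSummaries.put(0x9ae1L, 7);
    intSummaries.put(0x3c02L, 11);
    // One-way widening conversion: Integer summary -> Double summary.
    Map<Long, Double> dblSummaries =
        convertSummaries(intSummaries, i -> (double) i);
    System.out.println(dblSummaries.get(0x9ae1L)); // 7.0
  }
}
```

The real obstacle, as noted above, is that the library's retained keys are internal state; this sketch assumes some supported way to iterate (key, summary) pairs.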
> 
> Concerning my situation, I have started adopting your latest branch, where I 
> have successfully started combining Integer Tuple sketches with Theta 
> sketches for the engagements use case.  It is a substantial victory not to 
> have to re-encode all our Theta sketches!
> 
> Thanks again for the background material, you have helped me and my team 
> immensely.
> 
> David
> 
> 
> 
>> On 29 May 2020, at 19:56, leerho <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> David, 
>> 
>> This is great, you are getting up-to-speed fast!  I really appreciate your 
>> digging into this :)
>> 
>> Putting sketches off-heap is an interesting and deep topic.  I will try to 
>> share some of the concepts here, but ultimately, this kind of information 
>> needs to be in some sort of tutorial section on the web site.  Perhaps you 
>> might have suggestions on how this could be better presented.
>> 
>> Large Java system clusters have large amounts of RAM.  A few years ago a 
>> single machine in a cluster might have 24 CPUs, 48 hyperthreads, and 256GB 
>> of RAM.  A medium-sized cluster might consist of 100 such 
>> machines, where the cluster RAM is now 25TB.  That is a good chunk of 
>> memory! 
>> 
>> One model might be where each machine has only one JVM acting as a 
>> supervisor and configured with perhaps 16 GB of RAM.   All the rest of 
>> memory is allocated to data, which is paged in and out from disk or even 
>> ingested directly from stream feeds or back-end systems such as Hadoop.  
>> 
>> Traditionally, most of the data was in the form of primitives and strings, 
>> but one underlying assumption historically has been that the data is static 
>> in size.  When the data is read in, you know what the size is and that 
>> doesn't change for the lifetime that the data exists in memory.  Analytic 
>> processing, of course, creates large amounts of intermediate storage, but 
>> even in these cases, it is generally pretty easy to predict how much storage 
>> is 
>> required.
>> 
>> Sketches, viewed as data, present some new challenges, especially when there 
>> are billions of sketches that need to be processed.  Suppose in my query 
>> processing I need to merge millions of sketches together, and all of these 
>> sketches are sitting in off-heap memory, as they were preprocessed and built 
>> in the back-end system and ingested into my analytic query engine cluster.  
>> Having to "heapify" each sketch image into a sketch object prior to its 
>> being merged would be very costly, as that requires a deserialization and 
>> copy for each sketch.  So the first objective of having sketches off-heap is 
>> to provide the ability to interpret the sketch image for merging purposes 
>> without having to copy or deserialize the image.   This we do with our 
>> "Wrap(Memory)" functions. 
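The wrap-versus-heapify distinction can be shown with a toy model in plain Java, where each "sketch image" is a byte array holding a count followed by hash keys, and `ByteBuffer.wrap` stands in for the Memory wrap.  None of this is the library's actual serialization format.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

// Toy model: each image is [int n][n longs of hash keys].  The union reads
// keys directly through a ByteBuffer view instead of first deserializing
// every image into a full sketch object -- no per-image copy is made.
public class WrapUnionDemo {
  static byte[] image(long... keys) {
    ByteBuffer bb = ByteBuffer.allocate(4 + 8 * keys.length);
    bb.putInt(keys.length);
    for (long k : keys) { bb.putLong(k); }
    return bb.array();
  }

  static Set<Long> union(byte[][] images) {
    Set<Long> result = new HashSet<>();
    for (byte[] img : images) {
      ByteBuffer view = ByteBuffer.wrap(img);  // zero-copy view of the image
      int n = view.getInt();
      for (int i = 0; i < n; i++) { result.add(view.getLong()); }
    }
    return result;
  }

  public static void main(String[] args) {
    byte[][] images = { image(1L, 2L, 3L), image(3L, 4L) };
    System.out.println(union(images).size()); // 4
  }
}
```

With millions of images, avoiding the per-sketch deserialize-and-copy step is exactly what the Wrap(Memory) functions buy.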
>> 
>> But the query engine needs to allocate perhaps thousands of set operators 
>> (Union, Intersection, Difference) that do the merging.  It would also be 
>> useful to have these allocated in large segment columns off-heap and manage 
>> their memory allocation directly in order to reduce the pressure on the 
>> garbage collector.  There are two challenges with this.  First, these 
>> operators grow dynamically, and second, Java does not really support the 
>> concept of programming dynamic objects off-heap. The only mechanism that 
>> Java provides has been the ByteBuffer, which has severe limitations.
>> 
>> This is why we created the Memory component.  You can think of it as a 
>> ByteBuffer replacement, but it is much more.  The Memory component is what 
>> allows us to do updating and merging of sketches off-heap.  Once we decide 
>> to do this, we are back into more of a C/C++ style of programming where we 
>> need to do our own "malloc()" and "free()" operations directly.   But this 
>> is exactly what large, real-time, query and analysis engines have been doing 
>> for a while.  And these systems have been taking advantage of hidden 
>> capabilities in Java, such as Unsafe, in order to achieve unparalleled 
>> performance.  Use of Unsafe is a hotly debated topic in the Java community, 
>> nonetheless, our Memory component also takes advantage of Unsafe (as does 
>> the ByteBuffer).  (How we move beyond Java 8 is a whole different topic!)
>> 
>> The first attempt to allocate a segment of, say, 1000 dynamic sketches (or 
>> set operators) in off-heap memory is often to choose a slot size that is the 
>> maximum size that a sketch can grow to, given its K configuration, and then 
>> evenly divide the segment into 1000 slots of that size.  This turns out to 
>> be horribly wasteful.  Big data is almost never just one chunk, but is often 
>> highly partitioned into multiple dimensions.  And the result of this natural 
>> fragmentation is that the sizes of all the combinations of these dimensions 
>> tend to follow a power-law distribution, also called "the long tail".  This 
>> happens in nature and almost anything categorized by humans.  What this 
>> means is that if you have millions of sketches that have processed the 
>> millions of streams of all the dimensional combinations, the vast majority 
>> of the sketches will have 1 or a few entries and be very small in size.  If 
>> each of these tiny sketches occupy a slot in memory that was set to the 
>> maximum size to which that sketch can grow, we have wasted a huge amount of 
>> memory.
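A back-of-the-envelope calculation makes the waste concrete.  All numbers here are hypothetical (a 64 KB maximum slot, one million slots, 90% of sketches tiny), chosen only to illustrate the long-tail argument, not measured from any real system.

```java
// Slot-waste illustration: fixed maximum-size slots versus what a long-tail
// population of sketches actually needs.  All inputs are assumptions.
public class SlotWasteDemo {
  public static void main(String[] args) {
    final long slots = 1_000_000L;
    final long maxSlotBytes = 64 * 1024L;   // slot sized for the worst case
    final long tinySketchBytes = 256L;      // what a few-entry sketch needs
    final double tinyFraction = 0.90;       // long tail: most sketches are tiny

    long fixedTotal = slots * maxSlotBytes;
    long tiny = (long) (slots * tinyFraction);
    long actualTotal = tiny * tinySketchBytes + (slots - tiny) * maxSlotBytes;

    System.out.printf("fixed-slot total: %d GB%n", fixedTotal >> 30);
    System.out.printf("actual need:      %d GB%n", actualTotal >> 30);
    System.out.printf("wasted:           %.0f%%%n",
        100.0 * (fixedTotal - actualTotal) / fixedTotal);
  }
}
```

Under these assumed numbers, roughly 90% of the fixed-slot allocation is waste, which is the motivation for dynamically sized slots.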
>> 
>> A smarter approach is one we learned as we were integrating our C++ 
>> datasketches library into PostgreSQL.  Of course C++ doesn't have the 
>> off-heap problem at all. Nevertheless, the PostgreSQL system needs to manage 
>> and track what gets allocated and deallocated in memory.  So PostgreSQL 
>> created "palloc()" and "pfree()" functions for the user-developer, where 
>> PostgreSQL can intercept and manage the underlying malloc and free 
>> processes.  This is also what we have proposed to Druid, and I think they 
>> like this approach, but they have a long history of using fixed-size 
>> slots; transitioning to dynamically sized slots initiated by the user 
>> process is a major change for them and will take a while.
>> 
>> The question still remains on how to manage the overall system memory 
>> requirements.  Although we cannot predict which sketches among the millions 
>> of sketches will need only small slots and which ones will grow to the 
>> maximum size, in aggregate, we can learn (from the data) and predict how 
>> much memory we will need, which tends to be pretty stable.  
>> 
>> Cheers,
>> 
>> Lee.
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Fri, May 29, 2020 at 4:08 AM David Cromberge 
>> <[email protected] <mailto:[email protected]>> wrote:
>> Hi Lee,
>> 
>> I have studied our usage of the off-heap features of the library.  
>> Originally there was a compelling reason to leverage this whilst decoding 
>> from serialised bytes in storage.  When servicing queries on behalf of many 
>> clients, decoding sketches into the heap was potentially problematic from a 
>> scalability perspective.  However, for various reasons (unrelated to the 
>> library), we do not currently make use of the off-heap features.
>> Regardless, it would be interesting to hear some of the learnings from the 
>> Druid case (whenever these can be provided), as I am busy collecting notes 
>> and intend to supplement the website documentation where applicable.
>> To sum up, off-heap may be reconsidered in the future, but is not currently 
>> a priority.
>> 
>> If a byte-array Tuple sketch is being proposed, it may be possible in the 
>> end to define the various specialisations (integer, double, strings and 
>> doubles) in terms of the byte-array implementation.  It could also be of 
>> interest to define a position in the preamble for Tuple sketches where there 
>> is a flag to identify the summary type, where the defaults (integer, double, 
>> strings and doubles) may reserve an identifier.  This would allow for a 
>> powerful Tuple deserialiser that generalises over the summary.  However, I 
>> realise that there are other concerns that come into play with backwards 
>> compatibility, as well as performance criteria.
>> 
>> David.
>> 
>> 
>>> On 28 May 2020, at 16:12, leerho <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> David, you are correct.  I also looked at the AOD implementation last night 
>>> and it lacks some notable capabilities that I think you need, but others 
>>> would want as well:
>>> - Integration of Theta (as you noted).
>>> - There is no selectable "mode" equivalent for a Union on how to combine two 
>>> double summaries.  Intersection has a dedicated combiner, but doesn't have 
>>> the "mode" capability to allow easy choices between a set of modes.
>>> - There may be other issues, but I haven't studied this AOD code for several 
>>> years :(
>>> I am wondering if, for your case, we could have a generic, on-heap-only 
>>> solution relatively quickly that you could use, get some experience with, 
>>> and characterize its performance in your environment, and then work 
>>> on a more efficient off-heap solution later. 
>>> 
>>> Using sketches off-heap requires some significant design decisions at the 
>>> system level to make it work really well.  We have worked with the Druid 
>>> folks for quite a while and have learned some things that may be helpful 
>>> for you.  But these issues are outside and beyond the issues of designing a 
>>> sketch that can work off-heap.   It would be helpful to get a sense of 
>>> where you are in your thinking about going off-heap.
>>> 
>>> Lee.
>>> 
>>> 
>>> 
>>> On Thu, May 28, 2020 at 7:45 AM David Cromberge 
>>> <[email protected] <mailto:[email protected]>> 
>>> wrote:
>>> As a follow-up to the proposal below, it may not be necessary to provide a 
>>> combiner and default set of values together, since there may be some 
>>> redundancy here.
>>> It’s probably better for me to reiterate the intention - to provide a 
>>> meaningful way to interpret the presence of a key in a Theta sketch with 
>>> some suitable default value.
>>> 
>>>> On 28 May 2020, at 15:39, David Cromberge <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> Lee, the array of bytes implementation is a better approach than adding 
>>>> off-heap implementations for the integer, double and strings Tuple 
>>>> sketches.
>>>> 
>>>> I originally overlooked the ArrayOfDoubles sketch for the purposes of 
>>>> tracking engagement, since it implies that many values could be associated 
>>>> with a hashed key, which doesn’t quite fit the use case.  
>>>> 
>>>> Having said that, I have now looked through the implementation and have 
>>>> switched to the array of doubles sketch instead - after all, you pointed 
>>>> out that it should suffice. 
>>>> 
>>>> I have run some initial benchmarks on sizing, and in compacted form I did 
>>>> not get a reduction in the size of a compacted sketch generated from a 
>>>> real data set, despite the benefits of using primitive doubles.  I realise 
>>>> that this is dependent on the test case and tuning / configuration 
>>>> parameters, so we could add a TODO item to add a characterisation test for 
>>>> this, if there is not one already.
>>>> 
>>>> We don’t seem to have discussed how the Theta sketches may be included for 
>>>> intersections and unions regarding an AOD sketch.   For unions, the 
>>>> behaviour is delegated to a merge operation on the sketch, which 
>>>> ultimately adds values together for the same key.  Concerning 
>>>> intersection, a combiner implementation is used to determine how values 
>>>> should be combined.  It is noteworthy that in the druid extension, the 
>>>> values are summed together, with a comment noting that this may not apply 
>>>> to all circumstances.
>>>> 
>>>> I would propose a similar mechanism for both union and intersection on the 
>>>> other sketches, where a default array of values can be provided for a 
>>>> tuple sketch:
>>>> 
>>>>    public void update(final org.apache.datasketches.theta.Sketch sketchIn, 
>>>> double[] defaultValues, ArrayOfDoublesCombiner c) {...}
>>>> 
>>>> Of course, a factory could be provided that creates a combiner according 
>>>> to a summary mode.  The use case for this suggestion is to have context 
>>>> specific behaviour with regard to merging / combining values in the case 
>>>> of unions and intersections, which could be sourced from the user or user 
>>>> query.
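A minimal sketch of the mode-based combiner factory being proposed, using a local functional interface that mirrors the shape of the library's ArrayOfDoublesCombiner.  The `Mode` enum and `forMode` factory are hypothetical names for illustration, not existing library API.

```java
import java.util.Arrays;

// Hypothetical mode-based factory producing element-wise combiners for the
// double[] values carried by an ArrayOfDoubles-style summary.
public class CombinerDemo {
  // Local stand-in mirroring the combiner's shape: (a, b) -> combined values.
  interface ArrayOfDoublesCombiner {
    double[] combine(double[] a, double[] b);
  }

  enum Mode { SUM, MAX }

  static ArrayOfDoublesCombiner forMode(Mode mode) {
    switch (mode) {
      case SUM: return (a, b) -> {
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++) { r[i] = a[i] + b[i]; }
        return r;
      };
      case MAX: return (a, b) -> {
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++) { r[i] = Math.max(a[i], b[i]); }
        return r;
      };
      default: throw new IllegalArgumentException("unknown mode");
    }
  }

  public static void main(String[] args) {
    double[] a = {1.0, 5.0};
    double[] b = {2.0, 3.0};
    // Druid-style summing is just one mode among several the user could pick.
    System.out.println(Arrays.toString(forMode(Mode.SUM).combine(a, b))); // [3.0, 8.0]
    System.out.println(Arrays.toString(forMode(Mode.MAX).combine(a, b))); // [2.0, 5.0]
  }
}
```

The point is that the union/intersection call sites stay fixed while the combining behaviour is sourced from the user or the user query.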
>>>> 
>>>> I would be interested to hear your thoughts on the 
>>>> ArrayOfDoubles/ArrayOfBytes Tuple sketch integration with Theta sketches,
>>>> David
>>>> 
>>>> 
>>>>> On 28 May 2020, at 04:32, leerho <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> David,  In fact, double values do just fine with integer data.  They are 
>>>>> twice as big at the primitive level, however, the dedicated 
>>>>> ArrayOfDoubles implementation in total might actually be smaller, since 
>>>>> it is not carrying all the object overhead that are required to do 
>>>>> generics. Plus, it will be a whole lot faster! It is already fully 
>>>>> implemented with all the set operations, off-heap memory, serialization 
>>>>> /deserialization and a full test suite.  
>>>>> 
>>>>> Designing a similar dedicated ArrayOfIntegers would be a lot of work and 
>>>>> wouldn't be my top priority for the next dedicated Tuple sketch to build. 
>>>>>  What would be more flexible would actually be a dedicated ArrayOfBytes 
>>>>> implementation, because bytes are the foundation from which we can derive 
>>>>> almost any summary we want.  
>>>>> 
>>>>> Think about it.
>>>>> 
>>>>> Lee.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, May 27, 2020 at 5:54 PM leerho <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> David,  This is good feedback.  I didn't realize you wanted off-heap 
>>>>> operation.  That changes a lot of things.  I have to go right now, but 
>>>>> let me think about this.  Attempting to leverage generics off-heap is 
>>>>> messy.  For your case it may be better to either leverage the 
>>>>> ArrayOfDoubles implementation, which already operates off-heap, or think 
>>>>> about emulating the AOD and creating a dedicated AOIntegers.  
>>>>> 
>>>>> Lee.
>>>>> 
>>>>> On Wed, May 27, 2020 at 3:04 PM David Cromberge 
>>>>> <[email protected] <mailto:[email protected]>> 
>>>>> wrote:
>>>>> Hi Lee,
>>>>> 
>>>>> Thanks for providing such a detailed plan of action for the Tuple sketch 
>>>>> package.
>>>>> The enhancements that you have listed are interesting, and I will 
>>>>> certainly check out your branch to get a clearer understanding of how the 
>>>>> library is evolving.
>>>>> 
>>>>> For what it’s worth, here is a record of my attempt to integrate Theta 
>>>>> sketches into the Tuple sketch set operations:
>>>>> https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5
>>>>>  
>>>>> <https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5>
>>>>> 
>>>>> Although I have a cursory understanding of the library’s internals, I 
>>>>> included the commit above because there were some interesting tradeoffs 
>>>>> to the implementation, and it gave me a better appreciation for the 
>>>>> internal workings of the existing Tuple sketch as well as some of the 
>>>>> finer points in your improvement work.   To a lesser degree, it also 
>>>>> serves to independently confirm your argument for adding new variants of 
>>>>> the update methods!
>>>>> 
>>>>> During implementation, I was also faced with the decision as to whether 
>>>>> to duplicate the methods or to convert a Theta sketch to a Tuple sketch 
>>>>> first and delegate to the existing methods.  But, as you noted, this 
>>>>> requires an additional iteration through the result set and incurs a 
>>>>> performance penalty.  Therefore, I also duplicated the existing update 
>>>>> methods, with some changes for result extraction.  To ensure correctness, 
>>>>> I found it necessary to duplicate a large portion of the existing test 
>>>>> cases as well - replicating so many of the existing tests was not ideal, 
>>>>> but helped verify that the implementation was correct.
>>>>> It’s also worth mentioning that I had some difficulty implementing the 
>>>>> AnotB functionality, and in fact the results were incorrect when the 
>>>>> sketch crossed into estimation mode (see ignored tests).  I’m pleased to 
>>>>> have attempted the exercise because it will give much better context as I 
>>>>> study your branch further - especially the AnotB refactoring.
>>>>> 
>>>>> There is one addition that I would like to suggest to your list of TODO 
>>>>> items - namely, off-heap implementations.  I am considering using the 
>>>>> integer tuple sketch for our engagement use case and would prefer to 
>>>>> prevent memory pressure by loading many sketches onto the heap.  I have 
>>>>> noticed this come up in the past on #datasketches slack channel in a 
>>>>> conversation with some Druid team members.  It appears that the off-heap 
>>>>> implementations were omitted from the library due to time constraints, 
>>>>> and this is an area where I could also potentially provide default 
>>>>> implementations for the other tuple sketches.  I think this is important 
>>>>> to consider, because the existing ArrayOfDoubles implementation uses an 
>>>>> abstract class for the parent.  Making the other Tuple sketches abstract 
>>>>> in a similar manner is potentially a breaking change as well.
>>>>> 
>>>>> I am excited to collaborate together on this feature, and I would 
>>>>> be happy to contribute in any possible way and coordinate through the 
>>>>> project TODO page 
>>>>> <https://github.com/apache/incubator-datasketches-java/projects/1>!
>>>>> 
>>>>> David
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 27 May 2020, at 20:04, leerho <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> 
>>>>>> David,
>>>>>> 
>>>>>> Thanks.  I have been putting a lot of thought into it as well and 
>>>>>> decided that it was time to make some other long-needed changes in the 
>>>>>> Tuple Family of sketches as well including the package layout, which has 
>>>>>> been quite cumbersome.  I would suggest holding back on you actual 
>>>>>> implementation work until you understand what I have changed so far and 
>>>>>> then we can strategize on how to finish the work.   I have checked in my 
>>>>>> changes so far into the "Tuple_Theta_Extension" branch, which you can 
>>>>>> check out to see what I have been up to :)
>>>>>> 
>>>>>> The family of tuple sketches has evolved over time, and somewhat 
>>>>>> haphazardly.  So the first thing I decided to do was some 
>>>>>> rearranging of the package structure to make future downstream 
>>>>>> improvements and extensions easier.  
>>>>>> 
>>>>>> 1.  The first problem is that the root tuple directory was cluttered 
>>>>>> with two different groups of classes that made it difficult for anyone 
>>>>>> to figure out what is going on.  One group of classes form the base 
>>>>>> generic classes of the tuple sketch on which the concrete extensions 
>>>>>> "adouble" (a single double), "aninteger" (a single integer), and 
>>>>>> "strings" (array of strings) depend.  These three concrete extensions 
>>>>>> are already in their own sub directories.  
>>>>>> 
>>>>>> The second, larger group of classes was a dedicated non-generic 
>>>>>> implementation of the tuple sketch, which implemented an array of 
>>>>>> doubles.  All of these classes had "ArrayOfDoubles" in their name.  
>>>>>> These classes shared no code with the root generic tuple classes except 
>>>>>> for a few methods in the SerializerDeserializer and the Util classes.  
>>>>>> By making a few methods public, I was able to move all of the 
>>>>>> "ArrayOfDoubles" classes into their own subdirectory.  This creates an 
>>>>>> incompatible API break, which will force us to move to a 2.0.0 for the 
>>>>>> next version.   Now the tuple root directory is much cleaner and easier 
>>>>>> to navigate and understand.  There are several reasons for this separate 
>>>>>> dedicated implementation. First, we felt that a configurable array of 
>>>>>> doubles would be a relatively common use case.  Second, we wanted a full 
>>>>>> concrete example of the tuple sketch as an example of what it would look 
>>>>>> like including both on-heap and off-heap variants.   It is this 
>>>>>> ArrayOfDoubles implementation that has been integrated into Druid, for 
>>>>>> example. 
>>>>>> 
>>>>>> 2. Now that the package directories are cleaned up I was able to focus 
>>>>>> on what it would mean to allow Tuple sketches to perform set operations 
>>>>>> with Theta sketches.  
>>>>>> 
>>>>>> One approach would be to just provide a converter to take in a Theta 
>>>>>> sketch and produce a Tuple sketch with some default or configured 
>>>>>> summary and leave everything else the way it is.  But this is less 
>>>>>> efficient as it requires more object creation and copying than a direct 
>>>>>> integration would.  It turns out that modifying the generic Union and 
>>>>>> Intersection classes only required adding one method to each.  I did 
>>>>>> some minor code cleanup and code documentation at the same time.  
>>>>>> 
>>>>>> The AnotB operator is another story.  We have never been really happy 
>>>>>> with how this was implemented the first time.  The current API is 
>>>>>> clumsy.  So I have taken the opportunity to redesign the API for this 
>>>>>> class.  It still has the current API methods but deprecated.  With the 
>>>>>> new modified class the user has several ways of performing AnotB.
>>>>>> 
>>>>>> As stateless operations:
>>>>>>   With Tuple: resultSk = aNotB(skTupleA, skTupleB);
>>>>>>   With Theta: resultSk = aNotB(skTupleA, skThetaB);
>>>>>> As stateful, sequential operations:
>>>>>>   void setA(skTupleA);
>>>>>>   void notB(skTupleB);   or   void notB(skThetaB);   // These are 
>>>>>> interchangeable.
>>>>>>   ...
>>>>>>   void notB(skTupleB);   or   void notB(skThetaB);   // These are 
>>>>>> interchangeable.
>>>>>>   resultSk = getResult(reset = false);  // This allows getting an 
>>>>>> intermediate result.
>>>>>>   void notB(skTupleB);   or   void notB(skThetaB);   // Continue...
>>>>>>   resultSk = getResult(reset = true);   // This returns the result and 
>>>>>> clears the internal state to empty.
>>>>>> This I think is pretty slick and flexible.  
>>>>>> 
>>>>>> Work yet to be done on main:
>>>>>> - Reexamine the Union and Intersection APIs to add the option of an 
>>>>>> intermediate result.
>>>>>> - Update the other concrete extensions to take advantage of the above new 
>>>>>> API: "aninteger", "strings".
>>>>>> - Examine the dedicated "ArrayOfDoubles" implementation to see how hard 
>>>>>> it would be to make the same changes as above.  Implement.  Test.
>>>>>> Work yet to be done on test:
>>>>>> 
>>>>>> - I did a major redesign of the testing class for the AnotB generic class 
>>>>>> using the "adouble" concrete extension.  You can see this in 
>>>>>> AdoubleAnotBTest.java.  This is essentially a deep, exhaustive test of 
>>>>>> the base AnotB classes via the concrete extension.
>>>>>> - With the deep testing using "adouble" done, we still need to design 
>>>>>> new tests for the "aninteger" and "strings" extensions.  These can be 
>>>>>> shallow tests.
>>>>>> - If we decide to do the same API extensions on the ArrayOfDoubles 
>>>>>> classes, those will need to be tested.
>>>>>> Work to be done on documentation:
>>>>>> - The website documentation is still rather thin on the whole Tuple 
>>>>>> family.  Having someone who is a real user of these classes contribute 
>>>>>> to the documentation to make it more understandable would be outstanding!
>>>>>> Work to be done on characterization:
>>>>>> - The Tuple family has some characterization, but it is sparse, and a lot 
>>>>>> more work here would give users a sense of the performance they 
>>>>>> could expect.  We have also found that characterization is a powerful 
>>>>>> way to find statistical bugs that don't show up in unit tests.  I could 
>>>>>> guide you through how to set up the various "test harnesses", which is 
>>>>>> really pretty simple, but the real thinking goes into the design of the 
>>>>>> test and understanding the output.  This is a great way to really 
>>>>>> understand how these sketches behave and why.
>>>>>> Work to be done on code reviews:
>>>>>> - Having an independent set of eyes going over the code would also be a 
>>>>>> huge contribution.
>>>>>> Once you have had a chance to study this, we should talk about how you 
>>>>>> want to contribute.  Clearly, a lot of what I have done so far required a 
>>>>>> deep understanding of the Tuple and Theta classes and was much more 
>>>>>> efficient for me to do.  It would have been a hard slog for anyone new 
>>>>>> to the library to undertake.
>>>>>> 
>>>>>> Once we decide on a strategy, we should put kanban cards in the project 
>>>>>> TODO page 
>>>>>> <https://github.com/apache/incubator-datasketches-java/projects/1>.
>>>>>> 
>>>>>> Please let me know what you think!
>>>>>> 
>>>>>> Lee.
>>>>>> 
>>>>>>    
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, May 27, 2020 at 7:53 AM David Cromberge 
>>>>>> <[email protected] <mailto:[email protected]>> 
>>>>>> wrote:
>>>>>> Thank you Lee for your proposal regarding my use case and Tuple sketches.
>>>>>> 
>>>>>> I have spent some time considering the proposal, and I have started 
>>>>>> implementing a potential solution.
>>>>>> 
>>>>>> At what stage of the pipeline should characterisation tests be proposed, 
>>>>>> since they would obviously depend on a new SNAPSHOT version of the core 
>>>>>> library being available? 
>>>>>> 
>>>>>> I would be grateful for any input about the characterisation workflow.
>>>>>> 
>>>>>> Thank you,
>>>>>> David
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected] 
>>>>>> <mailto:[email protected]>
>>>>>> For additional commands, e-mail: [email protected] 
>>>>>> <mailto:[email protected]>
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
