Dave,

Thank you for your thoughtful responses!  We really value your feedback!

> I was thinking through how this could be implemented, and it would be
> difficult to do without introducing problems with backwards compatibility.


Could you elaborate on this a bit more?  What is an example of backwards
compatibility you were thinking of?

In my mind I was thinking of using the Memory package to interpret the AOB
summary into whatever the user wanted.  This would be external to the
library and would be written by the user.  We could certainly provide
examples of how to do this.

The AOB model in theory also allows for *C struct* types where each of the
"columns" could represent different types, but packed very efficiently into
an overall byte array.  But it is up to the user to define what the
structure is.  Again, Memory can facilitate this encoding and decoding of
the bytes.  There are other encoding/decoding schemes that could play a
role here such as Google's protobuf, flatbuffers, or flexbuffers.  Now, how
do we make the core tuple code flexible enough to handle the updating of
these varied columns?
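To make the struct idea concrete, here is a toy sketch (my own illustration,
not library code; the layout and field names here are hypothetical, a real
layout would be user-defined, and the Memory component would stand in for
ByteBuffer):

```java
import java.nio.ByteBuffer;

public class PackedSummary {
    // Hypothetical fixed layout: a long "count", a float "score", and a
    // byte "flags", packed into 13 bytes with no per-field object overhead.
    static final int BYTES = Long.BYTES + Float.BYTES + Byte.BYTES;

    // Encode the three "columns" into a single packed byte array.
    static byte[] encode(long count, float score, byte flags) {
        ByteBuffer bb = ByteBuffer.allocate(BYTES);
        bb.putLong(count).putFloat(score).put(flags);
        return bb.array();
    }

    // Decode just the first column back out of the packed bytes.
    static long decodeCount(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong(0);
    }

    public static void main(String[] args) {
        byte[] packed = encode(42L, 0.5f, (byte) 1);
        System.out.println(packed.length);        // 13
        System.out.println(decodeCount(packed));  // 42
    }
}
```

The point is only that the encoding/decoding logic lives entirely outside the
core tuple code, which sees nothing but an opaque byte array.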

Your idea of converters is interesting, where one wants to convert, for
example, an array of integers to an array of doubles (which is pretty much
a one-way conversion). Nevertheless, couldn't those be external to the core
library?  Interestingly, my colleagues and I were talking just this morning
about the idea of a contributor's *extensions* area where this kind of
stuff could be shared.  The question that comes up is how these extensions
would be coordinated, or specified to be compatible, with specific
core releases.  Eventually, we may need a separate repository for that.
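To make the converter idea concrete, here is a rough sketch of what such an
external utility might look like (the `SummaryConverter` interface and its
names are my own invention for illustration, not part of the library):

```java
import java.util.Arrays;

public class ConverterSketch {

    // Hypothetical functional interface for a one-way summary conversion,
    // living outside the core library.
    @FunctionalInterface
    interface SummaryConverter<S, T> {
        T convert(S summary);
    }

    public static void main(String[] args) {
        // Example: convert an integer-array summary to a double-array summary.
        // This direction is lossless; the reverse (double -> int) would be
        // the one-way, lossy case mentioned above.
        SummaryConverter<int[], double[]> intsToDoubles =
            ints -> Arrays.stream(ints).asDoubleStream().toArray();

        double[] result = intsToDoubles.convert(new int[] {1, 2, 3});
        System.out.println(Arrays.toString(result)); // [1.0, 2.0, 3.0]
    }
}
```

An extensions area could collect a family of such converters without the core
release having to know about any of them.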

****

> assuming that the write workload is independent of the read workload, it is
> fair to assume that the data being queried is immutable, and of a fixed
> size, such as an update sketch (write) vs compact sketch (read).


Not always.  Several years ago our system colleagues at Yahoo implemented a
real-time query engine/database, where the raw data was continuous,
real-time streams of data from web servers all over the world.  These
streams were split into many dimensions by a back-end Storm system and
ingested directly into Druid.  In Druid these streams were then directed
into many rotating time-window buffers of sketches.  Each time-window
buffer held sketches on 1-minute time intervals, 48 hours deep.  Each event
was directed to the proper dimension, and its timestamp directed it to the
correct sketch in the time-window.  Because the sketches are essentially
"additive", they allowed for late-data processing, which is a particular
problem with data collected from mobile phones (it can easily be 24 hours
late!).  All of these sketches are being updated continuously and there are
millions of them.  When the query comes it can query the sketches of the
specified time range and dimensions and produce results to the user in
seconds.   At the time, all of these sketches were allocated fixed-sized
slots, so the memory usage was not all that efficient.  Nevertheless, this
system produced real-time results to queries that were virtually impossible
to do using exact methods, and with a smaller system footprint!   This
system processed over 1 billion sketches every day.  They were not on the
heap.
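For illustration only, the routing arithmetic might look something like this
(my own reconstruction of the idea, not the actual Druid code): with 1-minute
intervals held 48 hours deep, a timestamp maps to one of 2880 rotating slots,
and a late event simply lands in an older, still-live slot.

```java
public class TimeWindowIndex {
    static final long MINUTE_MS = 60_000L;
    static final int SLOTS = 48 * 60; // 48 hours of 1-minute buckets = 2880

    // Map an event timestamp (ms since epoch) to its rotating slot index.
    static int slotFor(long timestampMs) {
        return (int) ((timestampMs / MINUTE_MS) % SLOTS);
    }

    public static void main(String[] args) {
        long now = 1_591_100_000_000L;          // some fixed instant
        long late = now - 24 * 60 * MINUTE_MS;  // arrives 24 hours late
        // Both events still fall inside the 48-hour window, just in
        // different slots, so the late event remains mergeable.
        System.out.println(slotFor(now) != slotFor(late)); // true
    }
}
```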


Cheers!  Keep the ideas coming!

Lee.

On Tue, Jun 2, 2020 at 7:49 AM David Cromberge <
[email protected]> wrote:

> Thanks for such a detailed response, Lee!  You and your team have really raised
> the bar and made Apache data-sketches a very welcoming and approachable
> community.
>
> I had to re-read your response several times to appreciate the finer
> details of the difficulties that you were referring to.  For an analytics
> architecture, I have anecdotal experience that it is often useful to
> leverage some intrinsic properties of analytic data to make simplifications.
> In this instance, assuming that the write workload is independent of the
> read workload, it is fair to assume that the data being queried is
> immutable, and of a fixed size, such as an update sketch (write) vs compact
> sketch (read).
> However, I didn’t account for storing intermediate results of set
> operations off-heap as well.    At first, it seemed natural to store the
> intermediate accumulator sketch on the heap, and combine this in a pairwise
> fashion with off-heap sketches.  This is probably a fundamental problem
> with my understanding of off-heap allocation, because mixing the two memory
> sources during an operation would not be possible.  You have provided a lot
> of insight here regarding all the problems with dynamically sized slot
> allocation, and its challenges.
>
> In the past, I have used a memory-mapped file to allow the OS to manage
> loading/unloading file segments into memory on demand, which is used in
> Influx and Prometheus TSDB.  I had this in mind when you mentioned Postgres
> managing memory allocations.
>
> Not to derail the thread, but I would like to raise another discussion at
> some point on the broader topic of whether it is a good idea to perform
> 1000s of set operations at query time, or whether pre-combining and
> eliminating the long tail and uninteresting data (with a tool like
> Macrobase), would help minimise the error introduced by all the set
> operations.  As mentioned on the site, using pre-computations to deal with
> power-law distributions would lead to many small sketches which would store
> every key anyway.
>
> Returning to the topic at hand, we previously discussed having a generic
> byte implementation for a Tuple Sketch, which could unify the different
> concrete summary types.  I was thinking through how this could be
> implemented, and it would be difficult to do without introducing problems
> with backwards compatibility.  I had an alternative idea that also serves
> to further widen the utility of different Tuple Sketches - namely,
> introducing a conversion between summary types.
>
> This function could be defined as:
>
> public <T extends Summary> Sketch<T> convertSummaries(Function<S, T>
> converter) { … }
>
> This approach may be used to freely convert between summary
> representations, and integer, strings, double etc.  For ArrayOfDoubles, a
> more concrete representation would be needed:
>
> public <S extends Summary> Sketch<S> convertSummaries(Function<double[],
> S> converter) { … }
>
> Such adaptation is compelling because it prevents an all-or-nothing
> approach to storing sketches, in some external datastore.  If we decided
> today to use a TupleInteger, and ran into limitations, we could adapt them
> to double implementations and write these back to storage on access, in a
> lazy manner.  Moreover, there may be cases where sketches between datatypes
> could be mixed in the set operations.  The downside is, that some summary
> types may not be coercible, but exposing this as a function allows the user
> to decide.
>
> Concerning my situation, I have started adopting your latest branch, where
> I have successfully started combining Integer Tuple sketches with Theta
> sketches for the engagements use case.  It is a substantial victory not to
> have to re-encode all our Theta sketches!
>
> Thanks again for the background material, you have helped me and my team
> immensely.
>
> David
>
>
>
> On 29 May 2020, at 19:56, leerho <[email protected]> wrote:
>
> David,
>
> This is great, you are getting up-to-speed fast!  I really appreciate
> your digging into this :)
>
> Putting sketches off-heap is an interesting and deep topic.  I will try to
> share some of the concepts here, but ultimately, this kind of information
> needs to be in some sort of tutorial section on the web site.  Perhaps you
> might have suggestions on how this could be better presented.
>
> Large Java system clusters have large amounts of RAM.  A few years ago a
> single machine in a cluster might have 24 CPUs, 48 hyperthreads, and 256GB
> of RAM.  A medium-sized cluster of such machines might consist of 100 such
> machines, where the cluster RAM is now 25TB.  That is a good chunk of
> memory!
>
> One model might be where each machine has only one JVM acting as a
> supervisor and configured with perhaps 16 GB of RAM.   All the rest of
> memory is allocated to data, which is paged in and out from disk or even
> ingested directly from stream feeds or back-end systems such as Hadoop.
>
> Traditionally, most of the data was in the form of primitives and strings,
> but one underlying assumption historically has been that the data is static
> in size.  When the data is read in, you know what the size is and that
> doesn't change for the lifetime that that data exists in memory. Analytic
> processing, of course, create large amounts of intermediate storage but
> even in these cases, it is generally pretty easy to predict how much
> storage is required.
>
> Sketches, viewed as data, present some new challenges, especially when
> there are billions of sketches that need to be processed.  Suppose in my
> query processing I need to merge millions of sketches together, and all of
> these sketches are sitting in off-heap memory, as they were preprocessed
> and built in the back-end system and ingested into my analytic query engine
> cluster.  Having to "heapify" each sketch image into a sketch object prior
> to merging would be very costly, as that requires a deserialization and
> copy for each sketch.  So the first objective of having
> sketches off-heap is to provide the ability to interpret the sketch image
> for merging purposes without having to copy or deserialize the image.
>  This we do with our "Wrap(Memory)" functions.
>
> But the query engine needs to allocate perhaps thousands of set operators
> (Union, Intersection, Difference) that do the merging.  It would also be
> useful to have these allocated in large segment columns off-heap and manage
> their memory allocation directly in order to reduce the pressure on the
> garbage collector.  There are two challenges with this.  First, these
> operators grow dynamically, and second, Java does not really support the
> concept of programming dynamic objects off-heap. The only mechanism that
> Java provides has been the ByteBuffer, which has severe limitations.
>
> This is why we created the Memory component.  You can think of it as a
> ByteBuffer replacement, but it is much more.  The Memory component is what
> allows us to do updating and merging of sketches off-heap.  Once we decide
> to do this, we are back into more of a C/C++ style of programming where we
> need to do our own "malloc()" and "free()" operations directly.   But this
> is exactly what large, real-time, query and analysis engines have been
> doing for a while.  And these systems have been taking advantage of hidden
> capabilities in Java, such as Unsafe, in order to achieve unparalleled
> performance.  Use of Unsafe is a hotly debated topic in the Java community,
> nonetheless, our Memory component also takes advantage of Unsafe (as does
> the ByteBuffer).  (How we move beyond Java 8 is a whole different topic!)
>
> The first attempt to allocate a segment of, say, 1000 dynamic sketches (or
> set operators) in off-heap memory is often to choose a slot size that is
> the maximum size that a sketch can grow to, given its K configuration, and
> then evenly divide the segment into 1000 slots of that size.  This turns
> out to be horribly wasteful.  Big data is almost never just one chunk, but
> is often highly partitioned into multiple dimensions.  And the result of
> this natural fragmentation is that the sizes of all the combinations of
> these dimensions tend to follow a power-law distribution, also called "the
> long tail".  This happens in nature and almost anything categorized by
> humans.  What this means is that if you have millions of sketches that have
> processed the millions of streams of all the dimensional combinations, the
> vast majority of the sketches will have 1 or a few entries and be very
> small in size.  If each of these tiny sketches occupies a slot in memory that
> was set to the maximum size to which that sketch can grow, we have wasted a
> huge amount of memory.
>
> A smarter approach is one we learned as we were integrating our C++
> datasketches library into PostgreSQL.  Of course C++ doesn't have the
> off-heap problem at all. Nevertheless, the PostgreSQL system needs to
> manage and track what gets allocated and deallocated in memory.  So
> PostgreSQL created "palloc()" and "pfree()" functions for the
> user-developer, where PostgreSQL can intercept and manage the
> underlying malloc and free processes.   This is also what we have proposed
> to Druid and I think they like this approach, but they also have a long
> history of using fixed-sized slots, and transitioning to dynamically sized
> slots initiated by the user process is a major change for them that will
> take a while.
>
> The question still remains on how to manage the overall system memory
> requirements.  Although we cannot predict which sketches among the millions
> of sketches will need only small slots and which ones will grow to the
> maximum size, in aggregate, we can learn (from the data) and predict how
> much memory we will need, which tends to be pretty stable.
>
> Cheers,
>
> Lee.
>
>
>
>
>
>
>
>
>
>
>
> On Fri, May 29, 2020 at 4:08 AM David Cromberge <
> [email protected]> wrote:
>
>> Hi Lee,
>>
>> I have studied our usage of the off-heap features of the library.
>> Originally there was a compelling reason to leverage this whilst decoding
>> from serialised bytes in storage.  When servicing queries on behalf of many
>> clients, decoding sketches into the heap was potentially problematic from a
>> scalability perspective.  However, for various reasons (unrelated to the
>> library), we do not currently make use of the off-heap features.
>> Regardless, it would be interesting to hear some of the learnings from
>> the Druid case (whenever these can be provided), as I am busy collecting
>> notes and intend to supplement the website documentation where applicable.
>> To sum up, off-heap may be reconsidered in the future, but is not
>> currently a priority.
>>
>> If a byte-array Tuple sketch is being proposed, it may be possible in the
>> end to define the various specialisations (integer, double, strings and
>> doubles) in terms of the byte-array implementation.  It could also be of
>> interest to define a position in the preamble for Tuple sketches where
>> there is a flag to identify the summary type, where the defaults (integer,
>> double, strings and doubles) may reserve an identifier.  This would allow
>> for a powerful Tuple deserialiser that generalises over the summary.
>> However, I realise that there are other concerns that come into play with
>> backwards compatibility, as well as performance criteria.
>>
>> David.
>>
>>
>> On 28 May 2020, at 16:12, leerho <[email protected]> wrote:
>>
>> David, you are correct.  I also looked at the AOD implementation last
>> night and it lacks some notable capabilities that I think you need, but
>> others would want as well.
>>
>>    - Integration of Theta (as you noted)
>>    - There is no selectable "mode" equivalent for a Union on how to
>>    combine two double summaries. Intersection has a dedicated combiner, but
>>    doesn't have the "mode" capability to allow easy choices between a set of
>>    modes.
>>    - There may be other issues, but I haven't studied this AOD code for
>>    several years :(
>>
>> I am wondering whether, for your case, we could have a generic,
>> on-heap-only solution relatively quickly that you could use, get some
>> experience with, and characterize for performance in your environment, and
>> then work on a more efficient off-heap solution later.
>>
>> Using sketches off-heap requires some significant design decisions at the
>> system level to make it work really well.  We have worked with the Druid
>> folks for quite a while and have learned some things that may be helpful
>> for you.  But these issues are outside and beyond the issues of designing a
>> sketch that can work off-heap.   It would be helpful to get a sense of
>> where you are in your thinking about going off-heap.
>>
>> Lee.
>>
>>
>>
>> On Thu, May 28, 2020 at 7:45 AM David Cromberge <
>> [email protected]> wrote:
>>
>>> As a follow-up to the proposal below, it may not be necessary to provide
>>> a combiner and default set of values together, since there may be some
>>> redundancy here.
>>> It’s probably better for me to re-iterate the intention - to provide a
>>> meaningful way to interpret the presence of a key in a theta with some
>>> suitable default value.
>>>
>>> On 28 May 2020, at 15:39, David Cromberge <[email protected]>
>>> wrote:
>>>
>>> Lee, the array of bytes implementation is a better approach than adding
>>> off-heap implementations for the integer, double and strings Tuple sketches.
>>>
>>> I originally overlooked the ArrayOfDoubles sketch for the purposes of
>>> tracking engagement, since it implies that many values could be associated
>>> with a hashed key, which doesn’t quite fit the use case.
>>>
>>> Having said that, I have now looked through the implementation and have
>>> switched to the array of doubles sketch instead - after all, you pointed
>>> out that it should suffice.
>>>
>>> I have run some initial benchmarks on sizing, and in compacted form I
>>> did not get a reduction in the size of a compacted sketch generated from a
>>> real data set, despite the benefits of using primitive doubles.  I realise
>>> that this is dependent on the test case and tuning / configuration
>>> parameters, so we could add a TODO item to add a characterisation test for
>>> this, if there is not one already.
>>>
>>> We don’t seem to have discussed how the Theta sketches may be included
>>> for intersections and unions regarding an AOD sketch.   For unions, the
>>> behaviour is delegated to a merge operation on the sketch, which ultimately
>>> adds values together for the same key.  Concerning intersection, a combiner
>>> implementation is used to determine how values should be combined.  It is
>>> noteworthy that in the druid extension, the values are summed together,
>>> with a comment noting that this may not apply to all circumstances.
>>>
>>> I would propose a similar mechanism for both union and intersection on
>>> the other sketches, where a default array of values can be provided for a
>>> tuple sketch:
>>>
>>> public void update(final org.apache.datasketches.theta.Sketch sketchIn,
>>> double[] defaultValues, ArrayOfDoublesCombiner c) {...}
>>>
>>> Of course, a factory could be provided that creates a combiner according
>>> to a summary mode.  The use case for this suggestion is to have context
>>> specific behaviour with regard to merging / combining values in the case of
>>> unions and intersections, which could be sourced from the user or user
>>> query.
>>>
>>> I would be interested to hear your thoughts on the
>>> ArrayOfDoubles/ArrayOfBytes Tuple sketch integration with Theta sketches,
>>> David
>>>
>>>
>>> On 28 May 2020, at 04:32, leerho <[email protected]> wrote:
>>>
>>> David,  In fact, double values do just fine with integer data.  They are
>>> twice as big at the primitive level, however, the dedicated ArrayOfDoubles
>>> implementation in total might actually be smaller, since it is not carrying
>>> all the object overhead that is required to do generics. Plus, it will be
>>> a whole lot faster! It is already fully implemented with all the set
>>> operations, off-heap memory, serialization /deserialization and a full test
>>> suite.
>>>
>>> Designing a similar dedicated ArrayOfIntegers would be a lot of work and
>>> wouldn't be my top priority for the next dedicated Tuple sketch to build.
>>> What would be more flexible would actually be a dedicated ArrayOfBytes
>>> implementation, because bytes are the foundation from which we can derive
>>> almost any summary we want.
>>>
>>> Think about it.
>>>
>>> Lee.
>>>
>>>
>>>
>>> On Wed, May 27, 2020 at 5:54 PM leerho <[email protected]> wrote:
>>>
>>>> David,  This is good feedback.  I didn't realize you wanted off-heap
>>>> operation.  That changes a lot of things.  I have to go right now, but let
>>>> me think about this.  Attempting to leverage generics off-heap is messy.
>>>> For your case it may be better to either leverage the ArrayOfDoubles
>>>> implementation, which already operates off-heap, or think about emulating
>>>> the AOD and creating a dedicated AOIntegers.
>>>>
>>>> Lee.
>>>>
>>>> On Wed, May 27, 2020 at 3:04 PM David Cromberge <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Lee,
>>>>>
>>>>> Thanks for providing such a detailed plan of action for the Tuple
>>>>> sketch package.
>>>>> The enhancements that you have listed are interesting, and I will
>>>>> certainly check out your branch to get a clearer understanding of how the
>>>>> library is evolving.
>>>>>
>>>>> For what it’s worth, here is a record of my attempt to integrate Theta
>>>>> sketches into the Tuple sketch set operations:
>>>>>
>>>>> https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5
>>>>>
>>>>> Although I have a cursory understanding of the library’s internals, I
>>>>> included the commit above because there were some interesting tradeoffs to
>>>>> the implementation, and it gave me a better appreciation for the
>>>>> internal workings of the existing Tuple sketch as well as some of the 
>>>>> finer
>>>>> points in your improvement work.   To a lesser degree, it also serves to
>>>>> independently confirm your argument for adding new variants of the update
>>>>> methods!
>>>>>
>>>>> During implementation, I was also faced with the decision as to
>>>>> whether to duplicate the methods or to convert a Theta sketch to a Tuple
>>>>> sketch first and delegate to the existing methods.  But, as you noted, 
>>>>> this
>>>>> requires an additional iteration through the result set and incurs a
>>>>> performance penalty.  Therefore, I also duplicated the existing update
>>>>> methods, with some changes for result extraction.  To ensure correctness, 
>>>>> I
>>>>> found it necessary to duplicate a large portion of the existing test cases
>>>>> as well - replicating so many of the existing tests was not ideal, but
>>>>> helped verify that the implementation was correct.
>>>>> It’s also worth mentioning that I had some difficulty implementing the
>>>>> AnotB functionality, and in fact the results were incorrect when the 
>>>>> sketch
>>>>> crossed into estimation mode (see ignored tests).  I’m pleased to have
>>>>> attempted the exercise because it will give much better context as I study
>>>>> your branch further - especially the AnotB refactoring.
>>>>>
>>>>> There is one addition that I would like to suggest to your list of
>>>>> TODO items - namely, off-heap implementations.  I am considering using the
>>>>> integer tuple sketch for our engagement use case and would prefer to
>>>>> prevent memory pressure by loading many sketches onto the heap.  I have
>>>>> noticed this come up in the past on #datasketches slack channel in a
>>>>> conversation with some Druid team members.  It appears that the off-heap
>>>>> implementations were omitted from the library due to time constraints, and
>>>>> this is an area where I could also potentially provide default 
>>>>> implementations
>>>>> for the other tuple sketches.  I think this is important to consider,
>>>>> because the existing ArrayOfDoubles implementation uses an abstract class
>>>>> for the parent.  Making the other Tuple sketches abstract in a similar
>>>>> manner is potentially a breaking change as well.
>>>>>
>>>>> I am excited to collaborate together on this feature, and I
>>>>> would be happy to contribute in any possible way and coordinate through 
>>>>> the project
>>>>> TODO page
>>>>> <https://github.com/apache/incubator-datasketches-java/projects/1>!
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>>
>>>>> On 27 May 2020, at 20:04, leerho <[email protected]> wrote:
>>>>>
>>>>> David,
>>>>>
>>>>> Thanks.  I have been putting a lot of thought into it as well and
>>>>> decided that it was time to make some other long-needed changes in the
>>>>> Tuple Family of sketches as well including the package layout, which has
>>>>> been quite cumbersome.  I would suggest holding back on your actual
>>>>> implementation work until you understand what I have changed so far and
>>>>> then we can strategize on how to finish the work.   I have checked in my
>>>>> changes so far into the "Tuple_Theta_Extension" branch, which you can 
>>>>> check
>>>>> out to see what I have been up to :)
>>>>>
>>>>> The family of tuple sketches has evolved over time and somewhat
>>>>> haphazardly.  So the first thing I decided to do was some rearranging
>>>>> of the package structure to make future downstream improvements and
>>>>> extensions easier.
>>>>>
>>>>> 1.  The first problem is that the root tuple directory was cluttered
>>>>> with two different groups of classes that made it difficult for anyone to
>>>>> figure out what is going on.  One group of classes forms the base generic
>>>>> classes of the tuple sketch on which the concrete extensions "adouble" (a
>>>>> single double), "aninteger" (a single integer), and "strings" (array of
>>>>> strings) depend.  These three concrete extensions are already in their own
>>>>> sub directories.
>>>>>
>>>>> The second, largest group of classes was a dedicated non-generic
>>>>> implementation of the tuple sketch, which implemented an array of doubles.
>>>>> All of these classes had "ArrayOfDoubles" in their name.  These classes
>>>>> shared no code with the root generic tuple classes except for a few 
>>>>> methods
>>>>> in the SerializerDeserializer and the Util classes.  By making a few
>>>>> methods public, I was able to move all of the "ArrayOfDoubles" classes 
>>>>> into
>>>>> their own subdirectory.  This creates an incompatible API break, which 
>>>>> will
>>>>> force us to move to a 2.0.0 for the next version.   Now the tuple root
>>>>> directory is much cleaner and easier to navigate and understand.  There 
>>>>> are
>>>>> several reasons for this separate dedicated implementation. First, we felt
>>>>> that a configurable array of doubles would be a relatively common use
>>>>> case.  Second, we wanted a full concrete example of the tuple sketch as an
>>>>> example of what it would look like including both on-heap and off-heap
>>>>> variants.   It is this ArrayOfDoubles implementation that has been
>>>>> integrated into Druid, for example.
>>>>>
>>>>> 2. Now that the package directories are cleaned up I was able to focus
>>>>> on what it would mean to allow Tuple sketches to perform set operations
>>>>> with Theta sketches.
>>>>>
>>>>> One approach would be to just provide a converter to take in a Theta
>>>>> sketch and produce a Tuple sketch with some default or configured summary
>>>>> and leave everything else the way it is.  But this is less efficient as it
>>>>> requires more object creation and copying than a direct integration would.
>>>>> It turns out that modifying the generic Union and Intersection classes 
>>>>> only
>>>>> required adding one method to each.  I did some minor code cleanup and 
>>>>> code
>>>>> documentation at the same time.
>>>>>
>>>>> The AnotB operator is another story.  We have never been really happy
>>>>> with how this was implemented the first time.  The current API is clumsy.
>>>>> So I have taken the opportunity to redesign the API for this class.  It
>>>>> still has the current API methods but deprecated.  With the new
>>>>> modified class the user has several ways of performing AnotB.
>>>>>
>>>>> As stateless operations:
>>>>>
>>>>>    - With Tuple: resultSk = aNotB(skTupleA, skTupleB);
>>>>>    - With Theta: resultSk = aNotB(skTupleA, skThetaB);
>>>>>
>>>>> As stateful, sequential operations:
>>>>>
>>>>>    - void setA(skTupleA);
>>>>>    - void notB(skTupleB);   or   void notB(skThetaB);   //These are
>>>>>    interchangeable.
>>>>>    - ...
>>>>>    - void notB(skTupleB);   or   void notB(skThetaB);   //These are
>>>>>    interchangeable.
>>>>>    - resultSk = getResult(reset = false);  // This allows getting an
>>>>>    intermediate result
>>>>>    - void notB(skTupleB);   or   void notB(skThetaB);   //Continue...
>>>>>    - resultSK = getResult(reset = true); //This returns the result
>>>>>    and clears the internal state to empty.
>>>>>
>>>>> This I think is pretty slick and flexible.
>>>>>
>>>>> Work yet to be done on main:
>>>>>
>>>>>    - Reexamine the Union and Intersection APIs to add the option of
>>>>>    an intermediate result.
>>>>>    - Update the other concrete extensions to take advantage of the
>>>>>    above new API: "aninteger", "strings".
>>>>>    - Examine the dedicated "ArrayOfDoubles" implementation to see how
>>>>>    hard it would be to make the same changes as above.  Implement. Test.
>>>>>
>>>>> Work yet to be done on test:
>>>>>
>>>>> I did major redesign of the testing class for the AnotB generic class
>>>>> using the "adouble" concrete extension.  You can see this in
>>>>> AdoubleAnotBTest.java.  This is essentially a deep exhaustive test of the
>>>>> base AnotB classes via the concrete extension.
>>>>>
>>>>>    - With the deep testing using the "adouble" done, we still need to
>>>>>    design new tests for the "aninteger" and "strings" extensions.  These 
>>>>> can
>>>>>    be shallow tests.
>>>>>    - If we decide to do the same API extensions on the ArrayOfDoubles
>>>>>    classes, those will need to be tested.
>>>>>
>>>>> Work to be done on documentation:
>>>>>
>>>>>    - The website documentation is still rather thin on the whole
>>>>>    Tuple family.  Having someone that is a real user of these classes
>>>>>    contribute to the documentation to make it more understandable would be
>>>>>    outstanding!
>>>>>
>>>>> Work to be done on characterization.
>>>>>
>>>>>    - The Tuple family has some characterization, but it is sparse and
>>>>>    a lot more work here would give users a sense of the performance 
>>>>> they
>>>>>    could expect.  We have also found that characterization is a powerful 
>>>>> way
>>>>>    to find statistical bugs that don't show up in unit tests.   I could 
>>>>> guide
>>>>>    you through how to set up the various "test harnesses", which is really
>>>>>    pretty simple, but the real thinking goes into the design of the test 
>>>>> and
>>>>>    understanding the output.  This is a great way to really understand how
>>>>>    these sketches behave and why.
>>>>>
>>>>> Work to be done on code reviews:
>>>>>
>>>>>    - Having independent set of eyes going over the code would also be
>>>>>    a huge contribution.
>>>>>
>>>>> Once you have had a chance to study this we should talk about how you
>>>>> want to contribute.  Clearly a lot of what I have done so far required 
>>>>> deep
>>>>> understanding of the Tuple and Theta classes and was much more
>>>>> efficient for me to do.  It would have been a hard slog for anyone new to
>>>>> the library to undertake.
>>>>>
>>>>> Once we decide on a strategy, we should put kanban cards in the project
>>>>> TODO page
>>>>> <https://github.com/apache/incubator-datasketches-java/projects/1>.
>>>>>
>>>>> Please let me know what you think!
>>>>>
>>>>> Lee.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 27, 2020 at 7:53 AM David Cromberge <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thank you Lee for your proposal regarding my use case and Tuple
>>>>>> sketches.
>>>>>>
>>>>>> I have spent some time considering the proposal, and I have started
>>>>>> implementing a potential solution.
>>>>>>
>>>>>> At what stage of the pipeline should characterisation tests be
>>>>>> proposed, since they would obviously depend on a new SNAPSHOT version of
>>>>>> the core library being available?
>>>>>>
>>>>>> I would be grateful for any input about the characterisation workflow.
>>>>>>
>>>>>> Thank you,
>>>>>> David
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>
>
