I think community extensions would be a great idea, Lee. I have a few ideas that could be included in that space.
Previously, the difficulties that I had with the API for conversion involved the requirement to access the internal state of a sketch, which is inaccessible by design. The specific example is accessing the hashed keys, and creating a new sketch from them without rehashing them. If a summary conversion can safely be done as an extension, this would be an ideal place for it. Thanks for the background on the Druid off-heap use case, that’s useful information! I will try to explain my concern about compatibility on the new AOD thread that you have created. Dave. > On 2 Jun 2020, at 20:31, leerho <[email protected]> wrote: > > Dave, > > Thank you for your thoughtful responses! We really value your feedback! > > I was thinking through how this could be implemented, and it would be > difficult to do without introducing problems with backwards compatibility. > > Could you elaborate on this a bit more? What is an example of backwards > compatibility you were thinking of? > > In my mind I was thinking of using the Memory package to interpret the AOB > summary into whatever the user wanted. This would be external to the library > and would be written by the user. We could certainly provide examples of how > to do this. > > The AOB model in theory also allows for C struct types where each of the > "columns" could represent different types, but packed very efficiently into > an overall byte array. But it is up to the user to define what the structure > is. Again, Memory can facilitate this encoding and decoding of the bytes. > There are other encoding/decoding schemes that could play a role here such as > Google's protobuf, flatbuffers, or flexbuffers. Now, how do we make the core > tuple code flexible enough to handle the updating of these varied columns? > > Your idea of converters is interesting, where one wants to convert, for > example, an array of integers to an array of doubles (which is pretty much a > one-way conversion). 
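Lee's "C struct" idea above, with typed "columns" packed into one flat byte array, can be sketched with plain java.nio.ByteBuffer. This is only an illustration: the layout and field names below are hypothetical, not anything defined by the DataSketches library.

```java
import java.nio.ByteBuffer;

// Hypothetical struct-like summary: a long count plus a float score,
// packed into a fixed 12-byte slot of a flat byte-array "column".
public class StructSummary {
    static final int BYTES = Long.BYTES + Float.BYTES; // 12 bytes per summary

    // Encode one summary into slot `slot` of the packed array.
    static void put(byte[] column, int slot, long count, float score) {
        ByteBuffer bb = ByteBuffer.wrap(column);
        bb.putLong(slot * BYTES, count);
        bb.putFloat(slot * BYTES + Long.BYTES, score);
    }

    // Decode the count field without touching the rest of the slot.
    static long getCount(byte[] column, int slot) {
        return ByteBuffer.wrap(column).getLong(slot * BYTES);
    }

    // Decode the score field, skipping past the count.
    static float getScore(byte[] column, int slot) {
        return ByteBuffer.wrap(column).getFloat(slot * BYTES + Long.BYTES);
    }
}
```

Memory plays the same role off-heap, and protobuf/flatbuffers could replace a hand-rolled layout like this once the schema gets richer.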
Nevertheless, couldn't those be external to the core > library? Interesting, my colleagues and I were talking just this morning > about the idea of a contributor's extensions area where this kind of stuff > could be shared. The question that comes up is how these extensions would need to > be somehow coordinated or specified to be compatible with specific core > releases. Eventually, we may need a separate repository for that. > > **** > > assuming that the write workload is independent of the read workload, it is > fair to assume that the data being queried is immutable, and of a fixed size, > such as an update sketch (write) vs compact sketch (read). > > Not always. Several years ago our system colleagues at Yahoo implemented a > real-time query engine/database, where the raw data was continuous, real-time > streams of data from web servers all over the world. These streams were > split into many dimensions by a back-end Storm system and ingested directly > into Druid. In Druid these streams were then directed into many rotating > time-window buffers of sketches. Each time-window buffer had sketches on 1 > minute time intervals, 48 hours deep. Each event was directed to the proper > dimension and its time stamp directed it to the correct sketch in the > time-window. Because the sketches are essentially "additive", they allowed > for late-data processing, which is a particular problem with data collected > from mobile phones (it can easily be 24 hours late!). All of these sketches > are being updated continuously and there are millions of them. When the > query comes it can query the sketches of the specified time range and > dimensions and produce results to the user in seconds. At the time, all of > these sketches were allocated fixed-sized slots, so the memory usage was not > all that efficient. Nevertheless, this system produced real-time results to > queries that were virtually impossible to do using exact methods, and with a > smaller system footprint! 
This system processed over 1 billion sketches > every day. They were not on the heap. > > > Cheers! Keep the ideas coming! > > Lee. > > On Tue, Jun 2, 2020 at 7:49 AM David Cromberge <[email protected] > <mailto:[email protected]>> wrote: > Thanks for such a detailed response Lee! You and your team have really raised the > bar and made Apache DataSketches a very welcoming and approachable community. > > I had to re-read your response several times to appreciate the finer details > of the difficulties that you were referring to. For an analytics > architecture, I have anecdotal experience that it is often useful to leverage > some intrinsic properties of analytic data to make simplifications. > In this instance, assuming that the write workload is independent of the read > workload, it is fair to assume that the data being queried is immutable, and > of a fixed size, such as an update sketch (write) vs compact sketch (read). > However, I didn’t account for storing intermediate results of set operations > off-heap as well. At first, it seemed natural to store the intermediate > accumulator sketch on the heap, and combine this in a pairwise fashion with > off-heap sketches. This is probably a fundamental problem with my > understanding of off-heap allocation, because mixing the two memory sources > during an operation would not be possible. You have provided a lot of > insight here regarding all the challenges of dynamically sized slot > allocation. > > In the past, I have used a memory-mapped file to allow the OS to manage > loading/unloading file segments into memory on demand, which is used in > Influx and Prometheus TSDB. I had this in mind when you mentioned Postgres > managing memory allocations. 
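The memory-mapped-file approach David mentions can be demonstrated with the JDK's own FileChannel.map, which leaves the paging of file segments to the OS. This is a minimal sketch of the mechanism, not how Influx or Prometheus actually implement their storage:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Read a long from a file via a memory mapping: the OS pages the
// mapped segment in on first access and may evict it under pressure.
public class MmapRead {
    static long readLongAt(Path file, long offset) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map =
                ch.map(FileChannel.MapMode.READ_ONLY, offset, Long.BYTES);
            return map.getLong(); // served from the page cache, not a heap copy of the file
        }
    }
}
```

The mapping survives the channel being closed, so long-lived read paths can hold just the MappedByteBuffer.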
> > Not to derail the thread, but I would like to raise another discussion at > some point on the broader topic of whether it is a good idea to perform > 1000s of set operations at query time, or whether pre-combining and > eliminating the long tail and uninteresting data (with a tool like > Macrobase), would help minimise the error introduced by all the set > operations. As mentioned on the site, using pre-computations to deal with > power-law distributions would lead to many small sketches which would store > every key anyway. > > Returning to the topic at hand, we previously discussed having a generic byte > implementation for a Tuple Sketch, which could unify the different concrete > summary types. I was thinking through how this could be implemented, and it > would be difficult to do without introducing problems with backwards > compatibility. I had an alternative idea that also serves to further widen > the utility of different Tuple Sketches - namely, introducing a conversion > between summary types. > > This function could be defined as: > > public <T extends Summary> Sketch<T> convertSummaries(Function<S, T> converter) > { … } > > This approach may be used to freely convert between summary representations > such as integer, string, double, etc. For ArrayOfDoubles, a more concrete > representation would be needed: > > public <S extends Summary> Sketch<S> convertSummaries(Function<double[], S> > converter) { … } > > Such adaptation is compelling because it prevents an all-or-nothing approach > to storing sketches in some external datastore. If we decided today to use > a TupleInteger, and ran into limitations, we could adapt them to double > implementations and write these back to storage on access, in a lazy manner. > Moreover, there may be cases where sketches between datatypes could be mixed > in the set operations. The downside is that some summary types may not be > coercible, but exposing this as a function allows the user to decide. 
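To make the proposed convertSummaries concrete, here is a toy version over a plain key-to-summary map. The hashed keys are carried across untouched (only the summaries pass through the converter), which matches the no-rehashing requirement mentioned earlier. The types are hypothetical stand-ins, not the library's actual Sketch or Summary classes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model of a Tuple sketch's retained entries: hashed key -> summary.
public class SummaryConvert {
    static <S, T> Map<Long, T> convertSummaries(Map<Long, S> retained,
                                                Function<S, T> converter) {
        Map<Long, T> out = new HashMap<>(retained.size());
        for (Map.Entry<Long, S> e : retained.entrySet()) {
            // Keys are reused as-is; the original items are never rehashed.
            out.put(e.getKey(), converter.apply(e.getValue()));
        }
        return out;
    }
}
```

An integer-to-double conversion is then just convertSummaries(entries, i -> (double) i), and, as David notes, it remains the user's job to decide whether a given coercion is meaningful.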
> > Concerning my situation, I have started adopting your latest branch, where I > have successfully started combining Integer Tuple sketches with Theta > sketches for the engagements use case. It is a substantial victory not to > have to re-encode all our Theta sketches! > > Thanks again for the background material, you have helped me and my team > immensely. > > David > > > >> On 29 May 2020, at 19:56, leerho <[email protected] >> <mailto:[email protected]>> wrote: >> >> David, >> >> This is great, you are getting up-to-speed fast! I really appreciate your >> digging into this :) >> >> Putting sketches off-heap is an interesting and deep topic. I will try to >> share some of the concepts here, but ultimately, this kind of information >> needs to be in some sort of tutorial section on the web site. Perhaps you >> might have suggestions on how this could be better presented. >> >> Large Java system clusters have large amounts of RAM. A few years ago a >> single machine in a cluster might have 24 CPUs, 48 hyperthreads, and 256GB >> of RAM. A medium-sized cluster of such machines might consist of 100 such >> machines, where the cluster RAM is now 25TB. That is a good chunk of >> memory! >> >> One model might be where each machine might have only one JVM acting like a >> supervisor and configured with perhaps 16 GB of RAM. All the rest of >> memory is allocated to data, which is paged in and out from disk or even >> ingested directly from stream feeds or back-end systems such as Hadoop. >> >> Traditionally, most of the data was in the form of primitives and strings, >> but one underlying assumption historically has been that the data is static >> in size. When the data is read in, you know what the size is and that >> doesn't change for the lifetime that that data exists in memory. Analytic >> processing, of course, creates large amounts of intermediate storage but even >> in these cases, it is generally pretty easy to predict how much storage is >> required. 
>> >> Sketches, viewed as data, present some new challenges, especially when there >> are billions of sketches that need to be processed. Suppose in my query >> processing I need to merge millions of sketches together and all of these >> sketches are sitting in off-heap memory, as they were preprocessed and built >> in the back-end system and ingested into my analytic query engine cluster. >> Having to "heapify" each sketch image into a sketch object prior to its >> being merged would be very costly, as that requires a deserialization and >> copy for each sketch. So the first objective of having sketches off-heap is >> to provide the ability to interpret the sketch image for merging purposes >> without having to copy or deserialize the image. This we do with our >> "Wrap(Memory)" functions. >> >> But the query engine needs to allocate perhaps thousands of set operators >> (Union, Intersection, Difference) that do the merging. It would also be >> useful to have these allocated in large segment columns off-heap and manage >> their memory allocation directly in order to reduce the pressure on the >> garbage collector. There are two challenges with this. First, these >> operators grow dynamically, and second, Java does not really support the >> concept of programming dynamic objects off-heap. The only mechanism that >> Java provides has been the ByteBuffer, which has severe limitations. >> >> This is why we created the Memory component. You can think of it as a >> ByteBuffer replacement, but it is much more. The Memory component is what >> allows us to do updating and merging of sketches off-heap. Once we decide >> to do this, we are back into more of a C/C++ style of programming where we >> need to do our own "malloc()" and "free()" operations directly. But this >> is exactly what large, real-time, query and analysis engines have been doing >> for a while. 
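The "Wrap(Memory)" idea, interpreting a serialized image in place rather than deserializing it, is analogous to reading fields straight out of a wrapped ByteBuffer. A stdlib-only analogy follows; the layout below is invented for illustration and is not the real sketch preamble:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// "Wrap" a serialized image: read a field at a known offset directly
// from the bytes, with no object materialization and no copying.
public class WrapImage {
    // Hypothetical image layout: [8-byte header][8-byte entry count][entries...]
    static long retainedEntries(byte[] image) {
        return ByteBuffer.wrap(image)
                         .order(ByteOrder.LITTLE_ENDIAN)
                         .getLong(8);
    }
}
```

Memory generalizes this to off-heap addresses and memory-mapped regions, which plain byte arrays cannot reach.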
And these systems have been taking advantage of hidden >> capabilities in Java, such as Unsafe, in order to achieve unparalleled >> performance. Use of Unsafe is a hotly debated topic in the Java community; >> nonetheless, our Memory component also takes advantage of Unsafe (as does >> the ByteBuffer). (How we move beyond Java 8 is a whole different topic!) >> >> The first attempt to allocate a segment of, say, 1000 dynamic sketches (or >> set operators) in off-heap memory is often to choose a slot size that is the >> maximum size that a sketch can grow to, given its K configuration, and then >> evenly divide the segment into 1000 slots of that size. This turns out to >> be horribly wasteful. Big data is almost never just one chunk, but is often >> highly partitioned into multiple dimensions. And the result of this natural >> fragmentation is that the sizes of all the combinations of these dimensions >> tend to follow a power-law distribution, also called "the long tail". This >> happens in nature and in almost anything categorized by humans. What this >> means is that if you have millions of sketches that have processed the >> millions of streams of all the dimensional combinations, the vast majority >> of the sketches will have 1 or a few entries and be very small in size. If >> each of these tiny sketches occupies a slot in memory that was set to the >> maximum size to which that sketch can grow, we have wasted a huge amount of >> memory. >> >> A smarter approach is one we learned as we were integrating our C++ >> datasketches library into PostgreSQL. Of course C++ doesn't have the >> off-heap problem at all. Nevertheless, the PostgreSQL system needs to manage >> and track what gets allocated and deallocated in memory. So PostgreSQL >> created "palloc()" and "pfree()" functions for the user-developer, where >> PostgreSQL can intercept and manage the underlying malloc and free >> processes. 
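The waste from max-size slots under a long-tail distribution is easy to quantify. Using illustrative numbers (8 bytes per retained entry and a 24-byte header, neither taken from the library):

```java
// Compare a segment carved into fixed max-size slots against
// right-sized slots when almost every sketch holds one entry.
public class SlotWaste {
    static final int HEADER = 24;      // illustrative per-sketch overhead
    static final int ENTRY_BYTES = 8;  // illustrative bytes per retained entry

    // Every slot sized for the maximum the sketch could grow to.
    static long fixedSlots(int numSketches, int maxEntries) {
        return (long) numSketches * (HEADER + (long) ENTRY_BYTES * maxEntries);
    }

    // Each slot sized for what the sketch actually retains.
    static long rightSized(int[] entriesPerSketch) {
        long total = 0;
        for (int e : entriesPerSketch) total += HEADER + (long) ENTRY_BYTES * e;
        return total;
    }
}
```

With 1000 sketches, a maximum of 4096 entries each, and all but one sketch holding a single entry, the fixed-slot segment comes out roughly 500x larger, which is the "horribly wasteful" outcome Lee describes.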
This is also what we have proposed to Druid and I think they >> like this approach, but they also have a long history of using fixed-size >> slots, and transitioning to dynamically sized slots initiated by the user >> process is a major transition for them and will take a while. >> >> The question still remains on how to manage the overall system memory >> requirements. Although we cannot predict which sketches among the millions >> of sketches will need only small slots and which ones will grow to the >> maximum size, in aggregate, we can learn (from the data) and predict how >> much memory we will need, which tends to be pretty stable. >> >> Cheers, >> >> Lee. >> >> >> >> >> >> >> >> >> >> >> >> On Fri, May 29, 2020 at 4:08 AM David Cromberge >> <[email protected] <mailto:[email protected]>> wrote: >> Hi Lee, >> >> I have studied our usage of the off-heap features of the library. >> Originally there was a compelling reason to leverage this whilst decoding >> from serialised bytes in storage. When servicing queries on behalf of many >> clients, decoding sketches into the heap was potentially problematic from a >> scalability perspective. However, for various reasons (unrelated to the >> library), we do not currently make use of the off-heap features. >> Regardless, it would be interesting to hear some of the learnings from the >> Druid case (whenever these can be provided), as I am busy collecting notes >> and intend to supplement the website documentation where applicable. >> To sum up, off-heap may be reconsidered in the future, but is not currently >> a priority. >> >> If a byte-array Tuple sketch is being proposed, it may be possible in the >> end to define the various specialisations (integer, double, strings and >> doubles) in terms of the byte-array implementation. 
It could also be of >> interest to define a position in the preamble for Tuple sketches where there >> is a flag to identify the summary type, where the defaults (integer, double, >> strings and doubles) may reserve an identifier. This would allow for a >> powerful Tuple deserialiser that generalises over the summary. However, I >> realise that there are other concerns that come into play with backwards >> compatibility, as well as performance criteria. >> >> David. >> >> >>> On 28 May 2020, at 16:12, leerho <[email protected] >>> <mailto:[email protected]>> wrote: >>> >>> David, you are correct. I also looked at the AOD implementation last night >>> and it lacks some notable capabilities that I think you need, but others >>> would want as well. >>> Integration of Theta (as you noted) >>> There is no selectable "mode" equivalent for a Union on how to combine two >>> double summaries. Intersection has a dedicated combiner, but doesn't have >>> the "mode" capability to allow easy choices between a set of modes. >>> There may be other issues, but I haven't studied this AOD code for several >>> years :( >>> I am wondering if, for your case, we could have a generic, on-heap-only >>> solution relatively quickly that you could use, get some experience with, >>> and characterize for performance in your environment, and then work on a >>> more efficient off-heap solution later. >>> >>> Using sketches off-heap requires some significant design decisions at the >>> system level to make it work really well. We have worked with the Druid >>> folks for quite a while and have learned some things that may be helpful >>> for you. But these issues are outside and beyond the issues of designing a >>> sketch that can work off-heap. It would be helpful to get a sense of >>> where you are in your thinking about going off-heap. >>> >>> Lee. 
>>> >>> >>> On Thu, May 28, 2020 at 7:45 AM David Cromberge >>> <[email protected] <mailto:[email protected]>> >>> wrote: >>> As a follow-up to the proposal below, it may not be necessary to provide a >>> combiner and default set of values together, since there may be some >>> redundancy here. >>> It’s probably better for me to reiterate the intention - to provide a >>> meaningful way to interpret the presence of a key in a Theta sketch with >>> some suitable default value. >>> >>>> On 28 May 2020, at 15:39, David Cromberge <[email protected] >>>> <mailto:[email protected]>> wrote: >>>> >>>> Lee, the array of bytes implementation is a better approach than adding >>>> off-heap implementations for the integer, double and strings Tuple >>>> sketches. >>>> >>>> I originally overlooked the ArrayOfDoubles sketch for the purposes of >>>> tracking engagement, since it implies that many values could be associated >>>> with a hashed key, which doesn’t quite fit the use case. >>>> >>>> Having said that, I have now looked through the implementation and have >>>> switched to the array of doubles sketch instead - after all, you pointed >>>> out that it should suffice. >>>> >>>> I have run some initial benchmarks on sizing, and in compact form I did >>>> not get a reduction in the size of a sketch generated from a real data >>>> set, despite the benefits of using primitive doubles. I realise >>>> that this is dependent on the test case and tuning / configuration >>>> parameters, so we could add a TODO item to add a characterisation test for >>>> this, if there is not one already. >>>> >>>> We don’t seem to have discussed how the Theta sketches may be included for >>>> intersections and unions regarding an AOD sketch. For unions, the >>>> behaviour is delegated to a merge operation on the sketch, which >>>> ultimately adds values together for the same key. Concerning >>>> intersection, a combiner implementation is used to determine how values >>>> should be combined. 
It is noteworthy that in the Druid extension, the >>>> values are summed together, with a comment noting that this may not apply >>>> to all circumstances. >>>> >>>> I would propose a similar mechanism for both union and intersection on the >>>> other sketches, where a default array of values can be provided for a >>>> tuple sketch: >>>> >>>> public void update(final org.apache.datasketches.theta.Sketch sketchIn, >>>> double[] defaultValues, ArrayOfDoublesCombiner c) {...} >>>> >>>> Of course, a factory could be provided that creates a combiner according >>>> to a summary mode. The use case for this suggestion is to have >>>> context-specific behaviour with regard to merging / combining values in >>>> the case of unions and intersections, which could be sourced from the user >>>> or user query. >>>> >>>> I would be interested to hear your thoughts on the >>>> ArrayOfDoubles/ArrayOfBytes Tuple sketch integration with Theta sketches, >>>> David >>>> >>>> >>>>> On 28 May 2020, at 04:32, leerho <[email protected] >>>>> <mailto:[email protected]>> wrote: >>>>> >>>>> David, In fact, double values do just fine with integer data. They are >>>>> twice as big at the primitive level, however, the dedicated >>>>> ArrayOfDoubles implementation in total might actually be smaller, since >>>>> it is not carrying all the object overhead that is required to do >>>>> generics. Plus, it will be a whole lot faster! It is already fully >>>>> implemented with all the set operations, off-heap memory, serialization >>>>> /deserialization and a full test suite. >>>>> >>>>> Designing a similar dedicated ArrayOfIntegers would be a lot of work and >>>>> wouldn't be my top priority for the next dedicated Tuple sketch to build. >>>>> What would be more flexible would actually be a dedicated ArrayOfBytes >>>>> implementation, because bytes are the foundation from which we can derive >>>>> almost any summary we want. >>>>> >>>>> Think about it. >>>>> >>>>> Lee. 
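The combiner David proposes is essentially a binary function over value arrays, selected per query. A minimal sketch of that shape (the interface here is written from scratch to mirror the proposal, not taken from the library):

```java
// Per-key combining strategy for intersections/unions over value arrays.
public class Combiners {
    interface ArrayOfDoublesCombiner {
        double[] combine(double[] a, double[] b);
    }

    // The Druid-extension behaviour: sum the values for the same key.
    static final ArrayOfDoublesCombiner SUM = (a, b) -> {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[i] + b[i];
        return out;
    };

    // An alternative "mode", e.g. keep the minimum per column.
    static final ArrayOfDoublesCombiner MIN = (a, b) -> {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) out[i] = Math.min(a[i], b[i]);
        return out;
    };
}
```

A factory keyed on a summary-mode enum could then hand back SUM, MIN, MAX, and so on, giving the context-specific behaviour sourced from the user or the user query.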
>>>>> >>>>> >>>>> On Wed, May 27, 2020 at 5:54 PM leerho <[email protected] >>>>> <mailto:[email protected]>> wrote: >>>>> David, This is good feedback. I didn't realize you wanted off-heap >>>>> operation. That changes a lot of things. I have to go right now, but >>>>> let me think about this. Attempting to leverage generics off-heap is >>>>> messy. For your case it may be better to either leverage the >>>>> ArrayOfDoubles implementation, which already operates off-heap, or think >>>>> about emulating the AOD and creating a dedicated AOIntegers. >>>>> >>>>> Lee. >>>>> >>>>> On Wed, May 27, 2020 at 3:04 PM David Cromberge >>>>> <[email protected] <mailto:[email protected]>> >>>>> wrote: >>>>> Hi Lee, >>>>> >>>>> Thanks for providing such a detailed plan of action for the Tuple sketch >>>>> package. >>>>> The enhancements that you have listed are interesting, and I will >>>>> certainly check out your branch to get a clearer understanding of how the >>>>> library is evolving. >>>>> >>>>> For what it’s worth, here is a record of my attempt to integrate Theta >>>>> sketches into the Tuple sketch set operations: >>>>> https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5 >>>>> >>>>> <https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5> >>>>> >>>>> Although I have a cursory understanding of the library’s internals, I >>>>> included the commit above because there were some interesting tradeoffs >>>>> to the implementation, and it gave me a better appreciation for the >>>>> internal workings of the existing Tuple sketch as well as some of the >>>>> finer points in your improvement work. To a lesser degree, it also >>>>> serves to independently confirm your argument for adding new variants of >>>>> the update methods! 
>>>>> >>>>> During implementation, I was also faced with the decision as to whether >>>>> to duplicate the methods or to convert a Theta sketch to a Tuple sketch >>>>> first and delegate to the existing methods. But, as you noted, this >>>>> requires an additional iteration through the result set and incurs a >>>>> performance penalty. Therefore, I also duplicated the existing update >>>>> methods, with some changes for result extraction. To ensure correctness, >>>>> I found it necessary to duplicate a large portion of the existing test >>>>> cases as well - replicating so many of the existing tests was not ideal, >>>>> but helped verify that the implementation was correct. >>>>> It’s also worth mentioning that I had some difficulty implementing the >>>>> AnotB functionality, and in fact the results were incorrect when the >>>>> sketch crossed into estimation mode (see ignored tests). I’m pleased to >>>>> have attempted the exercise because it will give much better context as I >>>>> study your branch further - especially the AnotB refactoring. >>>>> >>>>> There is one addition that I would like to suggest to your list of TODO >>>>> items - namely, off-heap implementations. I am considering using the >>>>> integer tuple sketch for our engagement use case and would prefer to >>>>> avoid the memory pressure caused by loading many sketches onto the heap. >>>>> I have noticed this come up in the past on the #datasketches Slack >>>>> channel in a conversation with some Druid team members. It appears that >>>>> the off-heap implementations were omitted from the library due to time >>>>> constraints, and this is an area where I could also potentially provide >>>>> default implementations for the other tuple sketches. I think this is >>>>> important to consider, because the existing ArrayOfDoubles implementation >>>>> uses an abstract class for the parent. Making the other Tuple sketches >>>>> abstract in a similar manner is potentially a breaking change as well. 
>>>>> >>>>> I am excited to collaborate together on this feature, and I would >>>>> be happy to contribute in any possible way and coordinate through the >>>>> project TODO page >>>>> <https://github.com/apache/incubator-datasketches-java/projects/1>! >>>>> >>>>> David >>>>> >>>>> >>>>> >>>>>> On 27 May 2020, at 20:04, leerho <[email protected] >>>>>> <mailto:[email protected]>> wrote: >>>>>> >>>>>> David, >>>>>> >>>>>> Thanks. I have been putting a lot of thought into it as well and >>>>>> decided that it was time to make some other long-needed changes in the >>>>>> Tuple Family of sketches, including the package layout, which has >>>>>> been quite cumbersome. I would suggest holding back on your actual >>>>>> implementation work until you understand what I have changed so far and >>>>>> then we can strategize on how to finish the work. I have checked in my >>>>>> changes so far into the "Tuple_Theta_Extension" branch, which you can >>>>>> check out to see what I have been up to :) >>>>>> >>>>>> The family of tuple sketches has evolved over time and somewhat >>>>>> haphazardly. So the first thing I decided to do was some >>>>>> rearranging of the package structure to make future downstream >>>>>> improvements and extensions easier. >>>>>> >>>>>> 1. The first problem is that the root tuple directory was cluttered >>>>>> with two different groups of classes that made it difficult for anyone >>>>>> to figure out what is going on. One group of classes forms the base >>>>>> generic classes of the tuple sketch on which the concrete extensions >>>>>> "adouble" (a single double), "aninteger" (a single integer), and >>>>>> "strings" (array of strings) depend. These three concrete extensions >>>>>> are already in their own sub directories. >>>>>> >>>>>> The second, larger group of classes was a dedicated non-generic >>>>>> implementation of the tuple sketch, which implemented an array of >>>>>> doubles. 
All of these classes had "ArrayOfDoubles" in their name. >>>>>> These classes shared no code with the root generic tuple classes except >>>>>> for a few methods in the SerializerDeserializer and the Util classes. >>>>>> By making a few methods public, I was able to move all of the >>>>>> "ArrayOfDoubles" classes into their own subdirectory. This creates an >>>>>> incompatible API break, which will force us to move to a 2.0.0 for the >>>>>> next version. Now the tuple root directory is much cleaner and easier >>>>>> to navigate and understand. There are several reasons for this separate >>>>>> dedicated implementation. First, we felt that a configurable array of >>>>>> doubles would be a relatively common use case. Second, we wanted a full >>>>>> concrete example of the tuple sketch, showing what it would look like >>>>>> including both on-heap and off-heap variants. It is this >>>>>> ArrayOfDoubles implementation that has been integrated into Druid, for >>>>>> example. >>>>>> >>>>>> 2. Now that the package directories are cleaned up I was able to focus >>>>>> on what it would mean to allow Tuple sketches to perform set operations >>>>>> with Theta sketches. >>>>>> >>>>>> One approach would be to just provide a converter to take in a Theta >>>>>> sketch and produce a Tuple sketch with some default or configured >>>>>> summary and leave everything else the way it is. But this is less >>>>>> efficient as it requires more object creation and copying than a direct >>>>>> integration would. It turns out that modifying the generic Union and >>>>>> Intersection classes only required adding one method to each. I did >>>>>> some minor code cleanup and code documentation at the same time. >>>>>> >>>>>> The AnotB operator is another story. We have never been really happy >>>>>> with how this was implemented the first time. The current API is >>>>>> clumsy. So I have taken the opportunity to redesign the API for this >>>>>> class. 
It still has the current API methods but deprecated. With the >>>>>> new modified class the user has several ways of performing AnotB. >>>>>> >>>>>> As stateless operations: >>>>>> With Tuple: resultSk = aNotB(skTupleA, skTupleB); >>>>>> With Theta: resultSk = aNotB(skTupleA, skThetaB); >>>>>> As stateful, sequential operations: >>>>>> void setA(skTupleA); >>>>>> void notB(skTupleB); or void notB(skThetaB); //These are >>>>>> interchangeable. >>>>>> ... >>>>>> void notB(skTupleB); or void notB(skThetaB); //These are >>>>>> interchangeable. >>>>>> resultSk = getResult(reset = false); // This allows getting an >>>>>> intermediate result >>>>>> void notB(skTupleB); or void notB(skThetaB); //Continue... >>>>>> resultSK = getResult(reset = true); //This returns the result and clears >>>>>> the internal state to empty. >>>>>> This I think is pretty slick and flexible. >>>>>> >>>>>> Work yet to be done on main: >>>>>> Reexamine the Union and Intersection APIs to add the option of an >>>>>> intermediate result. >>>>>> Update the other concrete extensions to take advantage of the above new >>>>>> API: "aninteger", "strings". >>>>>> Examine the dedicated "ArrayOfDoubles" implementation to see how hard it >>>>>> would be to make the same changes as above. Implement. Test. >>>>>> Work yet to be done on test: >>>>>> >>>>>> I did a major redesign of the testing class for the AnotB generic class >>>>>> using the "adouble" concrete extension. You can see this in >>>>>> AdoubleAnotBTest.java. This is essentially a deep exhaustive test of >>>>>> the base AnotB classes via the concrete extension. >>>>>> With the deep testing using the "adouble" done, we still need to design >>>>>> new tests for the "aninteger" and "strings" extensions. These can be >>>>>> shallow tests. >>>>>> If we decide to do the same API extensions on the ArrayOfDoubles >>>>>> classes, those will need to be tested. 
>>>>>> Work to be done on documentation: >>>>>> The website documentation is still rather thin on the whole Tuple >>>>>> family. Having someone that is a real user of these classes contribute >>>>>> to the documentation to make it more understandable would be outstanding! >>>>>> Work to be done on characterization. >>>>>> The Tuple family has some characterization, but it is sparse and a lot >>>>>> more work here would give users a sense of the performance they >>>>>> could expect. We have also found that characterization is a powerful >>>>>> way to find statistical bugs that don't show up in unit tests. I could >>>>>> guide you through how to set up the various "test harnesses", which is >>>>>> really pretty simple, but the real thinking goes into the design of the >>>>>> test and understanding the output. This is a great way to really >>>>>> understand how these sketches behave and why. >>>>>> Work to be done on code reviews: >>>>>> Having an independent set of eyes going over the code would also be a huge >>>>>> contribution. >>>>>> Once you have had a chance to study this we should talk about how you >>>>>> want to contribute. Clearly a lot of what I have done so far required >>>>>> deep understanding of the Tuple and Theta classes and was much more >>>>>> efficient for me to do. It would have been a hard slog for anyone new >>>>>> to the library to undertake. >>>>>> >>>>>> Once we decide on a strategy, we should put kanban cards in the project >>>>>> TODO page >>>>>> <https://github.com/apache/incubator-datasketches-java/projects/1>. >>>>>> >>>>>> Please let me know what you think! >>>>>> >>>>>> Lee. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Wed, May 27, 2020 at 7:53 AM David Cromberge >>>>>> <[email protected] <mailto:[email protected]>> >>>>>> wrote: >>>>>> Thank you Lee for your proposal regarding my use case and Tuple sketches. 
>>>>>> >>>>>> I have spent some time considering the proposal, and I have started >>>>>> implementing a potential solution. >>>>>> >>>>>> At what stage of the pipeline should characterisation tests be proposed, >>>>>> since they would obviously depend on a new SNAPSHOT version of the core >>>>>> library being available? >>>>>> >>>>>> I would be grateful for any input about the characterisation workflow. >>>>>> >>>>>> Thank you, >>>>>> David >>>>>> --------------------------------------------------------------------- >>>>>> To unsubscribe, e-mail: [email protected] >>>>>> <mailto:[email protected]> >>>>>> For additional commands, e-mail: [email protected] >>>>>> <mailto:[email protected]> >>>>>> >>>>> >>>> >>> >> >
