Hi Lee, I have studied our usage of the off-heap features of the library. Originally there was a compelling reason to leverage this whilst decoding from serialised bytes in storage. When servicing queries on behalf of many clients, decoding sketches into the heap was potentially problematic from a scalability perspective. However, for various reasons(unrelated to the library), we do not currently make use of the off-heap features. Regardless, it would be interesting to hear some of the learnings from the Druid case (whenever these can be provided), as I am busy collecting notes and intend to supplement the website documentation where applicable. To sum up, off-heap may be reconsidered in the future, but is not currently a priority.
If a byte-array Tuple sketch is being proposed, it may be possible in the end to define the various specialisations (integer, double, strings and doubles) in terms of the byte-array implementation. It could also be of interest to define a position in the preamble for Tuple sketches where there is a flag to identify the summary type, where the defaults (integer, double, strings and doubles) may reserve an identifier. This would allow for a powerful Tuple deserialiser that generalises over the summary. However, I realise that there are other concerns that come into play with backwards compatibility, as well as performance criteria. David. > On 28 May 2020, at 16:12, leerho <[email protected]> wrote: > > David, you are correct. I also looked at the AOD implementation last night > and it lacks some notable capabilities that I think you need, but others > would want as well. > Integration of Theta (as you noted) > There is no selectable "mode" equivalent for a Union on how to combine two > double summaries. Intersection has a dedicated combiner, but doesn't have the > "mode" capability to allow easy choices between a set of modes. > There may be other issues, but I haven't studied this AOD code for several > years :( > I am wondering if, for your case, if we had a generic, only-on-heap solution > relatively quickly that you could use and get some experience with and could > characterize its performance in your environment. And then work on a more > efficient off-heap solution later. > > Using sketches off-heap requires some significant design decisions at the > system level to make it work really well. We have worked with the Druid > folks for quite a while and have learned some things that may be helpful for > you. But these issues are outside and beyond the issues of designing a > sketch that can work off-heap. It would be helpful to get a sense of where > you are in your thinking about going off-heap. > > Lee. > > > > On Thu, May 28, 2020 at 7:45 AM David Cromberge > <[email protected] <mailto:[email protected]>> wrote: > As a follow-up to the proposal below, it may not be necessary to provide a > combiner and default set of values together, since there may be some > redundancy here. > It’s probably better for me to re-iterate the intention - to provide a > meaningful way to interpret the presence of a key in a theta with some > suitable default value. > >> On 28 May 2020, at 15:39, David Cromberge <[email protected] >> <mailto:[email protected]>> wrote: >> >> Lee, the array of bytes implementation is a better approach than adding >> off-heap implementations for the integer, double and strings Tuple sketches. >> >> I originally overlooked the ArrayOfDoubles sketch for the purposes of >> tracking engagement, since it implies that many values could be associated >> with a hashed key, which doesn’t quite fit the use case. >> >> Having said that, I have now looked through the implementation and have >> switched to the array of doubles sketch instead - after all, you pointed out >> that it should suffice. >> >> I have run some initial benchmarks on sizing, and in compacted form I did >> not get a reduction in the size of a compacted sketch generated from a real >> data set, despite the benefits of using primitive doubles. I realise that >> this is dependent on the test case and tuning / configuration parameters, so >> we could add a TODO item to add a characterisation test for this, if there >> is not one already. >> >> We don’t seem to have discussed how the Theta sketches may be included for >> intersections and unions regarding an AOD sketch. For unions, the >> behaviour is delegated to a merge operation on the sketch, which ultimately >> adds values together for the same key. Concerning intersection, a combiner >> implementation is used to determine how values should be combined. It is >> noteworthy that in the druid extension, the values are summed together, with >> a comment noting that this may not apply to all circumstances. >> >> I would propose a similar mechanism for both union and intersection on the >> other sketches, where a default array of values can be provided for a tuple >> sketch: >> >> public void update(final org.apache.datasketches.theta.Sketch sketchIn, >> double[] defaultValues, ArrayOfDoublesCombiner c) {...} >> >> Of course, a factory could be provided that creates a combiner according to >> a summary mode. The use case for this suggestion is to have context >> specific behaviour with regard to merging / combining values in the case of >> unions and intersections, which could be sourced from the user or user query. >> >> I would be interested to hear your thoughts on the >> ArrayOfDoubles/ArrayOfBytes Tuple sketch integration with Theta sketches, >> David >> >> >>> On 28 May 2020, at 04:32, leerho <[email protected] >>> <mailto:[email protected]>> wrote: >>> >>> David, In fact, double values do just fine with integer data. They are >>> twice as big at the primitive level, however, the dedicated ArrayOfDoubles >>> implementation in total might actually be smaller, since it is not carrying >>> all the object overhead that are required to do generics. Plus, it will be >>> a whole lot faster! It is already fully implemented with all the set >>> operations, off-heap memory, serialization /deserialization and a full test >>> suite. >>> >>> Designing a similar dedicated ArrayOfIntegers would be a lot of work and >>> wouldn't be my top priority for the next dedicated Tuple sketch to build. >>> What would be more flexible would actually be a dedicated ArrayOfBytes >>> implementation, because bytes are the foundation from which we can derive >>> almost any summary we want. >>> >>> Think about it. >>> >>> Lee. >>> >>> >>> >>> On Wed, May 27, 2020 at 5:54 PM leerho <[email protected] >>> <mailto:[email protected]>> wrote: >>> David, This is good feedback. I didn't realize you wanted off-heap >>> operation. That changes a lot of things. I have to go right now, but let >>> me think about this. Attempting to leverage generics off-heap is messy. >>> For your case it may be better to either leverage the ArrayOfDoubles >>> implementation, which already operates off-heap, or think about emulating >>> the AOD and creating a dedicated AOIntegers. >>> >>> Lee. >>> >>> On Wed, May 27, 2020 at 3:04 PM David Cromberge >>> <[email protected] <mailto:[email protected]>> >>> wrote: >>> Hi Lee, >>> >>> Thanks for providing such a detailed plan of action for the Tuple sketch >>> package. >>> The enhancements that you have listed are interesting, and I will certainly >>> check out your branch to get a clearer understanding of how the library is >>> evolving. >>> >>> For what it’s worth, here is a record of my attempt to integrate Theta >>> sketches into the Tuple sketch set operations: >>> https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5 >>> >>> <https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5> >>> >>> Although I have a cursory understanding of the library’s internals, I >>> included the commit above because there were some interesting tradeoffs to >>> the implementation, and it gave me a better appreciation for the the >>> internal workings of the existing Tuple sketch as well as some of the finer >>> points in your improvement work. To a lesser degree, it also serves to >>> independently confirm your argument for adding new variants of the update >>> methods! >>> >>> During implementation, I was also faced with the decision as to whether to >>> duplicate the methods or to convert a Theta sketch to a Tuple sketch first >>> and delegate to the existing methods. But, as you noted, this requires an >>> additional iteration through the result set and incurs a performance >>> penalty. Therefore, I also duplicated the existing update methods, with >>> some changes for result extraction. To ensure correctness, I found it >>> necessary to duplicate a large portion of the existing test cases as well - >>> replicating so many of the existing tests was not ideal, but helped verify >>> that the implementation was correct. >>> It’s also worth mentioning that I had some difficulty implementing the >>> AnotB functionality, and in fact the results were incorrect when the sketch >>> crossed into estimation mode (see ignored tests). I’m pleased to have >>> attempted the exercise because it will give much better context as I study >>> your branch further - especially the AnotB refactoring. >>> >>> There is one addition that I would like to suggest to your list of TODO >>> items - namely, off-heap implementations. I am considering using the >>> integer tuple sketch for our engagement use case and would prefer to >>> prevent memory pressure by loading many sketches onto the heap. I have >>> noticed this come up in the past on #datasketches slack channel in a >>> conversation with some Druid team members. It appears that the off-heap >>> implementations were omitted from the library due to time constraints, and >>> this is area where I could also potentially provide default implementations >>> for the other tuple sketches. I think this is important to consider, >>> because the existing ArrayOfDoubles implementation uses an abstract class >>> for the parent. Making the other Tuple sketches abstract in a similar >>> manner is potentially a breaking change as well. >>> >>> I am excited to collaborate on this together on this feature, and I would >>> be happy to contribute in any possible way and coordinate through the >>> project TODO page >>> <https://github.com/apache/incubator-datasketches-java/projects/1>! >>> >>> David >>> >>> >>> >>>> On 27 May 2020, at 20:04, leerho <[email protected] >>>> <mailto:[email protected]>> wrote: >>>> >>>> David, >>>> >>>> Thanks. I have been putting a lot of thought into it as well and decided >>>> that it was time to make some other long-needed changes in the Tuple >>>> Family of sketches as well including the package layout, which has been >>>> quite cumbersome. I would suggest holding back on you actual >>>> implementation work until you understand what I have changed so far and >>>> then we can strategize on how to finish the work. I have checked in my >>>> changes so far into the "Tuple_Theta_Extension" branch, which you can >>>> check out to see what I have been up to :) >>>> >>>> The family of tuple sketches have evolved over time and somewhat >>>> haphazardly. So the first think I decided to do was to do some >>>> rearranging of the package structure to make future downstream >>>> improvements and extensions easier. >>>> >>>> 1. The first problem is that the root tuple directory was cluttered with >>>> two different groups of classes that made it difficult for anyone to >>>> figure out what is going on. One group of classes form the base generic >>>> classes of the tuple sketch on which the concrete extensions "adouble" (a >>>> single double), "aninteger" (a single integer), and "strings" (array of >>>> strings) depend. These three concrete extensions are already in their own >>>> sub directories. >>>> >>>> The second, largest group of classes were a dedicated non-generic >>>> implementation of the tuple sketch, which implemented an array of doubles. >>>> All of these classes had "ArrayOfDoubles" in their name. These classes >>>> shared no code with the root generic tuple classes except for a few >>>> methods in the SerializerDeserializer and the Util classes. By making a >>>> few methods public, I was able to move all of the "ArrayOfDoubles" classes >>>> into their own subdirectory. This creates an incompatible API break, >>>> which will force us to move to a 2.0.0 for the next version. Now the >>>> tuple root directory is much cleaner and easier to navigate and >>>> understand. There are several reasons for this separate dedicated >>>> implementation. First, we felt that a configurable array of doubles would >>>> be a relatively common use case. Second, we wanted a full concrete >>>> example of the tuple sketch as an example of what it would look like >>>> including both on-heap and off-heap variants. It is this ArrayOfDoubles >>>> implementation that has been integrated into Druid, for example. >>>> >>>> 2. Now that the package directories are cleaned up I was able to focus on >>>> what it would mean to allow Tuple sketches to perform set operations with >>>> Theta sketches. >>>> >>>> One approach would be to just provide a converter to take in a Theta >>>> sketch and produce a Tuple sketch with some default or configured summary >>>> and leave everything else the way it is. But this is less efficient as it >>>> requires more object creation and copying than a direct integration would. >>>> It turns out that modifying the generic Union and Intersection classes >>>> only required adding one method to each. I did some minor code cleanup >>>> and code documentation at the same time. >>>> >>>> The AnotB operator is another story. We have never been really happy with >>>> how this was implemented the first time. The current API is clumsy. So I >>>> have taken the opportunity to redesign the API for this class. It still >>>> has the current API methods but deprecated. With the new modified class >>>> the user has several ways of performing AnotB. >>>> >>>> As stateless operations: >>>> With Tuple: resultSk = aNotB(skTupleA, skTupleB); >>>> With Theta: resultSk = aNotB(skTupleA, skThetaB); >>>> As stateful, sequential operations: >>>> void setA(skTupleA); >>>> void notB(skTupleB); or void notB(skThetaB); //These are >>>> interchangable. >>>> ... >>>> void notB(skTupleB); or void notB(skThetaB); //These are >>>> interchangable. >>>> resultSk = getResult(reset = false); // This allows getting an >>>> intermediate result >>>> void notB(skTupleB); or void notB(skThetaB); //Continue... >>>> resultSK = getResult(reset = true); //This returns the result and clears >>>> the internal state to empty. >>>> This I think is pretty slick and flexible. >>>> >>>> Work yet to be done on main: >>>> Reexamine the Union and Intersection APIs to add the option of an >>>> intermediate result. >>>> Update the other concrete extensions to take advantage of the above new >>>> API: "aninteger", "strings". >>>> Examine the dedicated "ArrayOfDoubes" implementation to see how hard it >>>> would be to make the same changes as above. Implement. Test. >>>> Work yet to be done on test: >>>> >>>> I did major redesign of the testing class for the AnotB generic class >>>> using the "adouble" concrete extension. You can see this in >>>> AdoubleAnotBTest.java. This is essentially a deep exhaustive test of the >>>> base AnotB classes via the concrete extension. >>>> With the deep testing using the "adouble" done, we still need to design >>>> new tests for the "aninteger" and "strings" extensions. These can be >>>> shallow tests. >>>> If we decide to do the same API extensions on the ArrayOfDoubles classes, >>>> those will need to be tested. >>>> Work to be done on documentation: >>>> The website documentation is still rather thin on the whole Tuple family. >>>> Having someone that is a real user of these classes contribute to the >>>> documentation to make it more understandable would be outstanding! >>>> Work to be done on characterization. >>>> The Tuple family has some characterization, but it is sparse and a lot >>>> more would work here would give users a sense of the performance they >>>> could expect. We have also found that characterization is a powerful way >>>> to find statistical bugs that don't show up in unit tests. I could guide >>>> you through how to set up the various "test harnesses", which is really >>>> pretty simple, but the real thinking goes into the design of the test and >>>> understanding the output. This is a great way to really understand how >>>> these sketches behave and why. >>>> Work to be done on code reviews: >>>> Having independent set of eyes going over the code would also be a huge >>>> contribution. >>>> Once you have had a chance to study this we should talk about how you want >>>> to contribute. Clearly a lot of what I have done so far required deep >>>> understanding of the Tuple and Theta classes and was was much more >>>> efficient for me to do. It would have been a hard slog for anyone new to >>>> the library to undertake. >>>> >>>> Once we decide on a strategy, we should put kanban cards in the project >>>> TODO page >>>> <https://github.com/apache/incubator-datasketches-java/projects/1>. >>>> >>>> Please let me know what you think! >>>> >>>> Lee. >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Wed, May 27, 2020 at 7:53 AM David Cromberge >>>> <[email protected] <mailto:[email protected]>> >>>> wrote: >>>> Thank you Lee for your proposal regarding my use case and Tuple sketches. >>>> >>>> I have spent some time considering the proposal, and I have started >>>> implementing a potential solution. >>>> >>>> At what stage of the pipeline should characterisation tests be proposed, >>>> since they would obviously depend on a new SNAPSHOT version of the core >>>> library being available? >>>> >>>> I would be grateful for any input about the characterisation workflow. >>>> >>>> Thank you, >>>> David >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: [email protected] >>>> <mailto:[email protected]> >>>> For additional commands, e-mail: [email protected] >>>> <mailto:[email protected]> >>>> >>> >> >
