David, you are correct. I also looked at the AOD implementation last night and it lacks some notable capabilities that I think you need, but others would want as well.
- Integration of Theta (as you noted) - There is no selectable "mode" equivalent for a Union on how to combine two double summaries. Intersection has a dedicated combiner, but doesn't have the "mode" capability to allow easy choices between a set of modes. - There may be other issues, but I haven't studied this AOD code for several years :( I am wondering if, for your case, if we had a generic, only-on-heap solution relatively quickly that you could use and get some experience with and could characterize its performance in your environment. And then work on a more efficient off-heap solution later. Using sketches off-heap requires some significant design decisions at the system level to make it work really well. We have worked with the Druid folks for quite a while and have learned some things that may be helpful for you. But these issues are outside and beyond the issues of designing a sketch that can work off-heap. It would be helpful to get a sense of where you are in your thinking about going off-heap. Lee. On Thu, May 28, 2020 at 7:45 AM David Cromberge < [email protected]> wrote: > As a follow-up to the proposal below, it may not be necessary to provide a > combiner and default set of values together, since there may be some > redundancy here. > It’s probably better for me to re-iterate the intention - to provide a > meaningful way to interpret the presence of a key in a theta with some > suitable default value. > > On 28 May 2020, at 15:39, David Cromberge <[email protected]> > wrote: > > Lee, the array of bytes implementation is a better approach than adding > off-heap implementations for the integer, double and strings Tuple sketches. > > I originally overlooked the ArrayOfDoubles sketch for the purposes of > tracking engagement, since it implies that many values could be associated > with a hashed key, which doesn’t quite fit the use case. > > Having said that, I have now looked through the implementation and have > switched to the array of doubles sketch instead - after all, you pointed > out that it should suffice. > > I have run some initial benchmarks on sizing, and in compacted form I did > not get a reduction in the size of a compacted sketch generated from a real > data set, despite the benefits of using primitive doubles. I realise that > this is dependent on the test case and tuning / configuration parameters, > so we could add a TODO item to add a characterisation test for this, if > there is not one already. > > We don’t seem to have discussed how the Theta sketches may be included for > intersections and unions regarding an AOD sketch. For unions, the > behaviour is delegated to a merge operation on the sketch, which ultimately > adds values together for the same key. Concerning intersection, a combiner > implementation is used to determine how values should be combined. It is > noteworthy that in the druid extension, the values are summed together, > with a comment noting that this may not apply to all circumstances. > > I would propose a similar mechanism for both union and intersection on the > other sketches, where a default array of values can be provided for a tuple > sketch: > > public void update(final org.apache.datasketches.theta.Sketch sketchIn, > double[] defaultValues, ArrayOfDoublesCombiner c) {...} > > Of course, a factory could be provided that creates a combiner according > to a summary mode. The use case for this suggestion is to have context > specific behaviour with regard to merging / combining values in the case of > unions and intersections, which could be sourced from the user or user > query. > > I would be interested to hear your thoughts on the > ArrayOfDoubles/ArrayOfBytes Tuple sketch integration with Theta sketches, > David > > > On 28 May 2020, at 04:32, leerho <[email protected]> wrote: > > David, In fact, double values do just fine with integer data. They are > twice as big at the primitive level, however, the dedicated ArrayOfDoubles > implementation in total might actually be smaller, since it is not carrying > all the object overhead that are required to do generics. Plus, it will be > a whole lot faster! It is already fully implemented with all the set > operations, off-heap memory, serialization /deserialization and a full test > suite. > > Designing a similar dedicated ArrayOfIntegers would be a lot of work and > wouldn't be my top priority for the next dedicated Tuple sketch to build. > What would be more flexible would actually be a dedicated ArrayOfBytes > implementation, because bytes are the foundation from which we can derive > almost any summary we want. > > Think about it. > > Lee. > > > > On Wed, May 27, 2020 at 5:54 PM leerho <[email protected]> wrote: > >> David, This is good feedback. I didn't realize you wanted off-heap >> operation. That changes a lot of things. I have to go right now, but let >> me think about this. Attempting to leverage generics off-heap is messy. >> For your case it may be better to either leverage the ArrayOfDoubles >> implementation, which already operates off-heap, or think about emulating >> the AOD and creating a dedicated AOIntegers. >> >> Lee. >> >> On Wed, May 27, 2020 at 3:04 PM David Cromberge < >> [email protected]> wrote: >> >>> Hi Lee, >>> >>> Thanks for providing such a detailed plan of action for the Tuple sketch >>> package. >>> The enhancements that you have listed are interesting, and I will >>> certainly check out your branch to get a clearer understanding of how the >>> library is evolving. >>> >>> For what it’s worth, here is a record of my attempt to integrate Theta >>> sketches into the Tuple sketch set operations: >>> >>> https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5 >>> >>> Although I have a cursory understanding of the library’s internals, I >>> included the commit above because there were some interesting tradeoffs to >>> the implementation, and it gave me a better appreciation for the the >>> internal workings of the existing Tuple sketch as well as some of the finer >>> points in your improvement work. To a lesser degree, it also serves to >>> independently confirm your argument for adding new variants of the update >>> methods! >>> >>> During implementation, I was also faced with the decision as to whether >>> to duplicate the methods or to convert a Theta sketch to a Tuple sketch >>> first and delegate to the existing methods. But, as you noted, this >>> requires an additional iteration through the result set and incurs a >>> performance penalty. Therefore, I also duplicated the existing update >>> methods, with some changes for result extraction. To ensure correctness, I >>> found it necessary to duplicate a large portion of the existing test cases >>> as well - replicating so many of the existing tests was not ideal, but >>> helped verify that the implementation was correct. >>> It’s also worth mentioning that I had some difficulty implementing the >>> AnotB functionality, and in fact the results were incorrect when the sketch >>> crossed into estimation mode (see ignored tests). I’m pleased to have >>> attempted the exercise because it will give much better context as I study >>> your branch further - especially the AnotB refactoring. >>> >>> There is one addition that I would like to suggest to your list of TODO >>> items - namely, off-heap implementations. I am considering using the >>> integer tuple sketch for our engagement use case and would prefer to >>> prevent memory pressure by loading many sketches onto the heap. I have >>> noticed this come up in the past on #datasketches slack channel in a >>> conversation with some Druid team members. It appears that the off-heap >>> implementations were omitted from the library due to time constraints, and >>> this is area where I could also potentially provide default implementations >>> for the other tuple sketches. I think this is important to consider, >>> because the existing ArrayOfDoubles implementation uses an abstract class >>> for the parent. Making the other Tuple sketches abstract in a similar >>> manner is potentially a breaking change as well. >>> >>> I am excited to collaborate on this together on this feature, and I >>> would be happy to contribute in any possible way and coordinate through the >>> project >>> TODO page >>> <https://github.com/apache/incubator-datasketches-java/projects/1>! >>> >>> David >>> >>> >>> >>> On 27 May 2020, at 20:04, leerho <[email protected]> wrote: >>> >>> David, >>> >>> Thanks. I have been putting a lot of thought into it as well and >>> decided that it was time to make some other long-needed changes in the >>> Tuple Family of sketches as well including the package layout, which has >>> been quite cumbersome. I would suggest holding back on you actual >>> implementation work until you understand what I have changed so far and >>> then we can strategize on how to finish the work. I have checked in my >>> changes so far into the "Tuple_Theta_Extension" branch, which you can check >>> out to see what I have been up to :) >>> >>> The family of tuple sketches have evolved over time and somewhat >>> haphazardly. So the first think I decided to do was to do some rearranging >>> of the package structure to make future downstream improvements and >>> extensions easier. >>> >>> 1. The first problem is that the root tuple directory was cluttered >>> with two different groups of classes that made it difficult for anyone to >>> figure out what is going on. One group of classes form the base generic >>> classes of the tuple sketch on which the concrete extensions "adouble" (a >>> single double), "aninteger" (a single integer), and "strings" (array of >>> strings) depend. These three concrete extensions are already in their own >>> sub directories. >>> >>> The second, largest group of classes were a dedicated non-generic >>> implementation of the tuple sketch, which implemented an array of doubles. >>> All of these classes had "ArrayOfDoubles" in their name. These classes >>> shared no code with the root generic tuple classes except for a few methods >>> in the SerializerDeserializer and the Util classes. By making a few >>> methods public, I was able to move all of the "ArrayOfDoubles" classes into >>> their own subdirectory. This creates an incompatible API break, which will >>> force us to move to a 2.0.0 for the next version. Now the tuple root >>> directory is much cleaner and easier to navigate and understand. There are >>> several reasons for this separate dedicated implementation. First, we felt >>> that a configurable array of doubles would be a relatively common use >>> case. Second, we wanted a full concrete example of the tuple sketch as an >>> example of what it would look like including both on-heap and off-heap >>> variants. It is this ArrayOfDoubles implementation that has been >>> integrated into Druid, for example. >>> >>> 2. Now that the package directories are cleaned up I was able to focus >>> on what it would mean to allow Tuple sketches to perform set operations >>> with Theta sketches. >>> >>> One approach would be to just provide a converter to take in a Theta >>> sketch and produce a Tuple sketch with some default or configured summary >>> and leave everything else the way it is. But this is less efficient as it >>> requires more object creation and copying than a direct integration would. >>> It turns out that modifying the generic Union and Intersection classes only >>> required adding one method to each. I did some minor code cleanup and code >>> documentation at the same time. >>> >>> The AnotB operator is another story. We have never been really happy >>> with how this was implemented the first time. The current API is clumsy. >>> So I have taken the opportunity to redesign the API for this class. It >>> still has the current API methods but deprecated. With the new >>> modified class the user has several ways of performing AnotB. >>> >>> As stateless operations: >>> >>> - With Tuple: resultSk = aNotB(skTupleA, skTupleB); >>> - With Theta: resultSk = aNotB(skTupleA, skThetaB); >>> >>> As stateful, sequential operations: >>> >>> - void setA(skTupleA); >>> - void notB(skTupleB); or void notB(skThetaB); //These are >>> interchangable. >>> - ... >>> - void notB(skTupleB); or void notB(skThetaB); //These are >>> interchangable. >>> - resultSk = getResult(reset = false); // This allows getting an >>> intermediate result >>> - void notB(skTupleB); or void notB(skThetaB); //Continue... >>> - resultSK = getResult(reset = true); //This returns the result and >>> clears the internal state to empty. >>> >>> This I think is pretty slick and flexible. >>> >>> Work yet to be done on main: >>> >>> - Reexamine the Union and Intersection APIs to add the option of an >>> intermediate result. >>> - Update the other concrete extensions to take advantage of the >>> above new API: "aninteger", "strings". >>> - Examine the dedicated "ArrayOfDoubes" implementation to see how >>> hard it would be to make the same changes as above. Implement. Test. >>> >>> Work yet to be done on test: >>> >>> I did major redesign of the testing class for the AnotB generic class >>> using the "adouble" concrete extension. You can see this in >>> AdoubleAnotBTest.java. This is essentially a deep exhaustive test of the >>> base AnotB classes via the concrete extension. >>> >>> - With the deep testing using the "adouble" done, we still need to >>> design new tests for the "aninteger" and "strings" extensions. These can >>> be shallow tests. >>> - If we decide to do the same API extensions on the ArrayOfDoubles >>> classes, those will need to be tested. >>> >>> Work to be done on documentation: >>> >>> - The website documentation is still rather thin on the whole Tuple >>> family. Having someone that is a real user of these classes contribute >>> to >>> the documentation to make it more understandable would be outstanding! >>> >>> Work to be done on characterization. >>> >>> - The Tuple family has some characterization, but it is sparse and a >>> lot more would work here would give users a sense of the performance they >>> could expect. We have also found that characterization is a powerful way >>> to find statistical bugs that don't show up in unit tests. I could >>> guide >>> you through how to set up the various "test harnesses", which is really >>> pretty simple, but the real thinking goes into the design of the test and >>> understanding the output. This is a great way to really understand how >>> these sketches behave and why. >>> >>> Work to be done on code reviews: >>> >>> - Having independent set of eyes going over the code would also be a >>> huge contribution. >>> >>> Once you have had a chance to study this we should talk about how you >>> want to contribute. Clearly a lot of what I have done so far required deep >>> understanding of the Tuple and Theta classes and was was much more >>> efficient for me to do. It would have been a hard slog for anyone new to >>> the library to undertake. >>> >>> Once we decide on a strategy, we should put kanban cards in the project >>> TODO page >>> <https://github.com/apache/incubator-datasketches-java/projects/1>. >>> >>> Please let me know what you think! >>> >>> Lee. >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> On Wed, May 27, 2020 at 7:53 AM David Cromberge < >>> [email protected]> wrote: >>> >>>> Thank you Lee for your proposal regarding my use case and Tuple >>>> sketches. >>>> >>>> I have spent some time considering the proposal, and I have started >>>> implementing a potential solution. >>>> >>>> At what stage of the pipeline should characterisation tests be >>>> proposed, since they would obviously depend on a new SNAPSHOT version of >>>> the core library being available? >>>> >>>> I would be grateful for any input about the characterisation workflow. >>>> >>>> Thank you, >>>> David >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: [email protected] >>>> For additional commands, e-mail: [email protected] >>>> >>>> >>> > >
