David, This is good feedback. I didn't realize you wanted off-heap operation. That changes a lot of things. I have to go right now, but let me think about this. Attempting to leverage generics off-heap is messy. For your case it may be better to either leverage the ArrayOfDoubles implementation, which already operates off-heap, or think about emulating the AOD and creating a dedicated AOIntegers.
Lee. On Wed, May 27, 2020 at 3:04 PM David Cromberge < [email protected]> wrote: > Hi Lee, > > Thanks for providing such a detailed plan of action for the Tuple sketch > package. > The enhancements that you have listed are interesting, and I will > certainly check out your branch to get a clearer understanding of how the > library is evolving. > > For what it’s worth, here is a record of my attempt to integrate Theta > sketches into the Tuple sketch set operations: > > https://github.com/davecromberge/incubator-datasketches-java/commit/961ad48bbe709ccfcb973a7fab69e53088f113a5 > > Although I have a cursory understanding of the library’s internals, I > included the commit above because there were some interesting tradeoffs to > the implementation, and it gave me a better appreciation for the the > internal workings of the existing Tuple sketch as well as some of the finer > points in your improvement work. To a lesser degree, it also serves to > independently confirm your argument for adding new variants of the update > methods! > > During implementation, I was also faced with the decision as to whether to > duplicate the methods or to convert a Theta sketch to a Tuple sketch first > and delegate to the existing methods. But, as you noted, this requires an > additional iteration through the result set and incurs a performance > penalty. Therefore, I also duplicated the existing update methods, with > some changes for result extraction. To ensure correctness, I found it > necessary to duplicate a large portion of the existing test cases as well - > replicating so many of the existing tests was not ideal, but helped verify > that the implementation was correct. > It’s also worth mentioning that I had some difficulty implementing the > AnotB functionality, and in fact the results were incorrect when the sketch > crossed into estimation mode (see ignored tests). I’m pleased to have > attempted the exercise because it will give much better context as I study > your branch further - especially the AnotB refactoring. > > There is one addition that I would like to suggest to your list of TODO > items - namely, off-heap implementations. I am considering using the > integer tuple sketch for our engagement use case and would prefer to > prevent memory pressure by loading many sketches onto the heap. I have > noticed this come up in the past on #datasketches slack channel in a > conversation with some Druid team members. It appears that the off-heap > implementations were omitted from the library due to time constraints, and > this is area where I could also potentially provide default implementations > for the other tuple sketches. I think this is important to consider, > because the existing ArrayOfDoubles implementation uses an abstract class > for the parent. Making the other Tuple sketches abstract in a similar > manner is potentially a breaking change as well. > > I am excited to collaborate on this together on this feature, and I would > be happy to contribute in any possible way and coordinate through the project > TODO page > <https://github.com/apache/incubator-datasketches-java/projects/1>! > > David > > > > On 27 May 2020, at 20:04, leerho <[email protected]> wrote: > > David, > > Thanks. I have been putting a lot of thought into it as well and decided > that it was time to make some other long-needed changes in the Tuple Family > of sketches as well including the package layout, which has been quite > cumbersome. I would suggest holding back on you actual implementation work > until you understand what I have changed so far and then we can strategize > on how to finish the work. I have checked in my changes so far into the > "Tuple_Theta_Extension" branch, which you can check out to see what I have > been up to :) > > The family of tuple sketches have evolved over time and somewhat > haphazardly. So the first think I decided to do was to do some rearranging > of the package structure to make future downstream improvements and > extensions easier. > > 1. The first problem is that the root tuple directory was cluttered with > two different groups of classes that made it difficult for anyone to figure > out what is going on. One group of classes form the base generic classes > of the tuple sketch on which the concrete extensions "adouble" (a single > double), "aninteger" (a single integer), and "strings" (array of strings) > depend. These three concrete extensions are already in their own sub > directories. > > The second, largest group of classes were a dedicated non-generic > implementation of the tuple sketch, which implemented an array of doubles. > All of these classes had "ArrayOfDoubles" in their name. These classes > shared no code with the root generic tuple classes except for a few methods > in the SerializerDeserializer and the Util classes. By making a few > methods public, I was able to move all of the "ArrayOfDoubles" classes into > their own subdirectory. This creates an incompatible API break, which will > force us to move to a 2.0.0 for the next version. Now the tuple root > directory is much cleaner and easier to navigate and understand. There are > several reasons for this separate dedicated implementation. First, we felt > that a configurable array of doubles would be a relatively common use > case. Second, we wanted a full concrete example of the tuple sketch as an > example of what it would look like including both on-heap and off-heap > variants. It is this ArrayOfDoubles implementation that has been > integrated into Druid, for example. > > 2. Now that the package directories are cleaned up I was able to focus on > what it would mean to allow Tuple sketches to perform set operations with > Theta sketches. > > One approach would be to just provide a converter to take in a Theta > sketch and produce a Tuple sketch with some default or configured summary > and leave everything else the way it is. But this is less efficient as it > requires more object creation and copying than a direct integration would. > It turns out that modifying the generic Union and Intersection classes only > required adding one method to each. I did some minor code cleanup and code > documentation at the same time. > > The AnotB operator is another story. We have never been really happy with > how this was implemented the first time. The current API is clumsy. So I > have taken the opportunity to redesign the API for this class. It still > has the current API methods but deprecated. With the new modified class > the user has several ways of performing AnotB. > > As stateless operations: > > - With Tuple: resultSk = aNotB(skTupleA, skTupleB); > - With Theta: resultSk = aNotB(skTupleA, skThetaB); > > As stateful, sequential operations: > > - void setA(skTupleA); > - void notB(skTupleB); or void notB(skThetaB); //These are > interchangable. > - ... > - void notB(skTupleB); or void notB(skThetaB); //These are > interchangable. > - resultSk = getResult(reset = false); // This allows getting an > intermediate result > - void notB(skTupleB); or void notB(skThetaB); //Continue... > - resultSK = getResult(reset = true); //This returns the result and > clears the internal state to empty. > > This I think is pretty slick and flexible. > > Work yet to be done on main: > > - Reexamine the Union and Intersection APIs to add the option of an > intermediate result. > - Update the other concrete extensions to take advantage of the above > new API: "aninteger", "strings". > - Examine the dedicated "ArrayOfDoubes" implementation to see how hard > it would be to make the same changes as above. Implement. Test. > > Work yet to be done on test: > > I did major redesign of the testing class for the AnotB generic class > using the "adouble" concrete extension. You can see this in > AdoubleAnotBTest.java. This is essentially a deep exhaustive test of the > base AnotB classes via the concrete extension. > > - With the deep testing using the "adouble" done, we still need to > design new tests for the "aninteger" and "strings" extensions. These can > be shallow tests. > - If we decide to do the same API extensions on the ArrayOfDoubles > classes, those will need to be tested. > > Work to be done on documentation: > > - The website documentation is still rather thin on the whole Tuple > family. Having someone that is a real user of these classes contribute to > the documentation to make it more understandable would be outstanding! > > Work to be done on characterization. > > - The Tuple family has some characterization, but it is sparse and a > lot more would work here would give users a sense of the performance they > could expect. We have also found that characterization is a powerful way > to find statistical bugs that don't show up in unit tests. I could guide > you through how to set up the various "test harnesses", which is really > pretty simple, but the real thinking goes into the design of the test and > understanding the output. This is a great way to really understand how > these sketches behave and why. > > Work to be done on code reviews: > > - Having independent set of eyes going over the code would also be a > huge contribution. > > Once you have had a chance to study this we should talk about how you want > to contribute. Clearly a lot of what I have done so far required deep > understanding of the Tuple and Theta classes and was was much more > efficient for me to do. It would have been a hard slog for anyone new to > the library to undertake. > > Once we decide on a strategy, we should put kanban cards in the project > TODO page > <https://github.com/apache/incubator-datasketches-java/projects/1>. > > Please let me know what you think! > > Lee. > > > > > > > > > > > > On Wed, May 27, 2020 at 7:53 AM David Cromberge < > [email protected]> wrote: > >> Thank you Lee for your proposal regarding my use case and Tuple sketches. >> >> I have spent some time considering the proposal, and I have started >> implementing a potential solution. >> >> At what stage of the pipeline should characterisation tests be proposed, >> since they would obviously depend on a new SNAPSHOT version of the core >> library being available? >> >> I would be grateful for any input about the characterisation workflow. >> >> Thank you, >> David >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> >> >
