David, Thanks. I have been putting a lot of thought into it as well and decided that it was time to make some other long-needed changes in the Tuple Family of sketches as well including the package layout, which has been quite cumbersome. I would suggest holding back on you actual implementation work until you understand what I have changed so far and then we can strategize on how to finish the work. I have checked in my changes so far into the "Tuple_Theta_Extension" branch, which you can check out to see what I have been up to :)
The family of tuple sketches have evolved over time and somewhat haphazardly. So the first think I decided to do was to do some rearranging of the package structure to make future downstream improvements and extensions easier. 1. The first problem is that the root tuple directory was cluttered with two different groups of classes that made it difficult for anyone to figure out what is going on. One group of classes form the base generic classes of the tuple sketch on which the concrete extensions "adouble" (a single double), "aninteger" (a single integer), and "strings" (array of strings) depend. These three concrete extensions are already in their own sub directories. The second, largest group of classes were a dedicated non-generic implementation of the tuple sketch, which implemented an array of doubles. All of these classes had "ArrayOfDoubles" in their name. These classes shared no code with the root generic tuple classes except for a few methods in the SerializerDeserializer and the Util classes. By making a few methods public, I was able to move all of the "ArrayOfDoubles" classes into their own subdirectory. This creates an incompatible API break, which will force us to move to a 2.0.0 for the next version. Now the tuple root directory is much cleaner and easier to navigate and understand. There are several reasons for this separate dedicated implementation. First, we felt that a configurable array of doubles would be a relatively common use case. Second, we wanted a full concrete example of the tuple sketch as an example of what it would look like including both on-heap and off-heap variants. It is this ArrayOfDoubles implementation that has been integrated into Druid, for example. 2. Now that the package directories are cleaned up I was able to focus on what it would mean to allow Tuple sketches to perform set operations with Theta sketches. One approach would be to just provide a converter to take in a Theta sketch and produce a Tuple sketch with some default or configured summary and leave everything else the way it is. But this is less efficient as it requires more object creation and copying than a direct integration would. It turns out that modifying the generic Union and Intersection classes only required adding one method to each. I did some minor code cleanup and code documentation at the same time. The AnotB operator is another story. We have never been really happy with how this was implemented the first time. The current API is clumsy. So I have taken the opportunity to redesign the API for this class. It still has the current API methods but deprecated. With the new modified class the user has several ways of performing AnotB. As stateless operations: - With Tuple: resultSk = aNotB(skTupleA, skTupleB); - With Theta: resultSk = aNotB(skTupleA, skThetaB); As stateful, sequential operations: - void setA(skTupleA); - void notB(skTupleB); or void notB(skThetaB); //These are interchangable. - ... - void notB(skTupleB); or void notB(skThetaB); //These are interchangable. - resultSk = getResult(reset = false); // This allows getting an intermediate result - void notB(skTupleB); or void notB(skThetaB); //Continue... - resultSK = getResult(reset = true); //This returns the result and clears the internal state to empty. This I think is pretty slick and flexible. Work yet to be done on main: - Reexamine the Union and Intersection APIs to add the option of an intermediate result. - Update the other concrete extensions to take advantage of the above new API: "aninteger", "strings". - Examine the dedicated "ArrayOfDoubes" implementation to see how hard it would be to make the same changes as above. Implement. Test. Work yet to be done on test: I did major redesign of the testing class for the AnotB generic class using the "adouble" concrete extension. You can see this in AdoubleAnotBTest.java. This is essentially a deep exhaustive test of the base AnotB classes via the concrete extension. - With the deep testing using the "adouble" done, we still need to design new tests for the "aninteger" and "strings" extensions. These can be shallow tests. - If we decide to do the same API extensions on the ArrayOfDoubles classes, those will need to be tested. Work to be done on documentation: - The website documentation is still rather thin on the whole Tuple family. Having someone that is a real user of these classes contribute to the documentation to make it more understandable would be outstanding! Work to be done on characterization. - The Tuple family has some characterization, but it is sparse and a lot more would work here would give users a sense of the performance they could expect. We have also found that characterization is a powerful way to find statistical bugs that don't show up in unit tests. I could guide you through how to set up the various "test harnesses", which is really pretty simple, but the real thinking goes into the design of the test and understanding the output. This is a great way to really understand how these sketches behave and why. Work to be done on code reviews: - Having independent set of eyes going over the code would also be a huge contribution. Once you have had a chance to study this we should talk about how you want to contribute. Clearly a lot of what I have done so far required deep understanding of the Tuple and Theta classes and was was much more efficient for me to do. It would have been a hard slog for anyone new to the library to undertake. Once we decide on a strategy, we should put kanban cards in the project TODO page <https://github.com/apache/incubator-datasketches-java/projects/1> . Please let me know what you think! Lee. On Wed, May 27, 2020 at 7:53 AM David Cromberge < [email protected]> wrote: > Thank you Lee for your proposal regarding my use case and Tuple sketches. > > I have spent some time considering the proposal, and I have started > implementing a potential solution. > > At what stage of the pipeline should characterisation tests be proposed, > since they would obviously depend on a new SNAPSHOT version of the core > library being available? > > I would be grateful for any input about the characterisation workflow. > > Thank you, > David > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > >
