Greetings, all, I wanted to provide an update on my hacking activities this past week so that you have an idea of where things are heading with DailyProjectData. As you all know, I've been redesigning the DailyProjectFileMetric class, which was an 800 LOC behemoth that reflected the heroic efforts on the part of Hongbing and Cedric to create a system that could deal with the processing of literally hundreds of thousands of FileMetric instances in very general ways with acceptable runtime performance and without blowing out the heap. DailyProjectFileMetric presents a very interesting set of design challenges, and Hongbing and Cedric did a really great job of satisfying them: I learned a lot from reading the code.
However, the original class has visible bugs and is quite difficult for a newcomer to understand and maintain. It basically represents a shotgun marriage between the original implementation, which pre-dated telemetry and was intended to just produce DailyProjectDetails summaries and drilldowns, with the later needs of telemetry, which requires much more flexibility in the kinds of size data to compute and cache. This marriage, alas, is not a happy one, and indeed Cedric at one point suggested divorce: producing two entirely separate classes, one specifically for telemetry and one specifically for DailyProjectDetails. In some sense, the two functionalities were already sleeping in separate bedrooms within the single class, with their own individual intermediate data structures and so forth. I resisted the idea of divorce, not because of some antiquated attachment to the idea that "What Hackystat hath put together, let no Hacker put asunder", but rather because at a conceptual level, the two functions (telemetry and DailyProjectDetails) simply have a lot in common. I want to take a few minutes to outline the new design so that anyone who is currently fiddling with DailyProject classes can see where I'm heading and hopefully get some ideas to assist them in their efforts. Sometime during the 7.5 release cycle, I will more formally document this stuff in the Design DailyProject Classes chapter of the developer's guide. Until then, Use the Source, Luke. In the new design, the 800 LOC class has been restructured into four new classes: 1. DailyProjectFileMetric2: This is the new top-level class, about 250 LOC, with all methods around 20 LOC or less. It implements a newly discovered DailyProjectData design pattern: a DailyProjectData constructor must be light-weight and do no raw sensor data processing. (The current DailyProjectFileMetric constructor does a lot of raw sensor processing in the constructor, and can take several seconds to complete.) I came to this conclusion after realizing that while we do guarantee that the DailyProjectData cache will *return* a single instance of a DailyProjectData subclass, we _do not_ guarantee that only a single instance will be instantiated! This was a revelation to me, and could potentially explain some interesting performance situations (such as when the telemetry wall sends a bunch of requests nearly simultaneously to the server, potentially resulting in many instantiations of the same DailyProjectFileMetric class.) After considering a number of alternatives, the design I am currently using involves a private initialize() method that is always called first thing in every top-level, synchronized public method, and which sets a boolean after completing so that subsequent invocations do no work. This means that no processing occurs for a given DailyProjectFileMetric[Project, Day] instance until after the instance is returned from the DailyProjectData cache. Since the public methods are all synchronized, all of the threads will wait for the first one to finish initialization. This should guarantee that the initialization occurs only once regardless of how many instantiations actually occur. The remaining three classes implement abstract data types that encapsulate the important data structures and processing results for FileMetric processing: 2. FileMetricRawDataMap: This class, about 300 LOC, reads in the raw FileMetric sensor data from all of the users associated with the Project and constructs a mapping from [Tool, TopLevelWorkspaceName] to instances of an inner class called FileMetricLatestBatch. This results of this process includes two levels of filtering: only those FileMetric instances associated with the given Project are included, and only those FileMetric instances with the latest "runtime" value for a given top-level [Tool, TopLevelWorkspace] are included. By making sure that instances of this class are only bound to local variables, it is easy to ensure that the FileMetric sensor data is available for GC and that FileMetricRawDataMaps are always temporary, intermediate data structures. The class also includes a method called getDefaultPreferredTool. One of the problems with the original DailyProjectFileMetric class is inconsistency in the way FileMetric data generated by different tools but regarding the same underlying source objects is selected. In some situations, the latest sent data is selected, while in others, metric data from different tools will be merged together based upon which tool sent the highest value. There might even be other implicit selection techniques that I didn't discover. The FileMetricRawDataMap class provides the getDefaultPreferredTool method as a way to provide a simple, consistent approach to dealing with the presence of FileMetric data from multiple tools on the same underlying code base. This method samples a runtime value from each tool's latest batch, and returns the tool with the latest runtime. As a "default" method, it is possible for future clients to provide alternative selection criteria, but this at least implements a consistent, simple to understand approach. 3. FileMetricFilePatternCache: This class, about 400 LOC, uses the FileMetricRawDataMap to build a mapping from [FilePattern, FileType, SizeMetricName] to Long, where the Long represents the actual size metric value associated with the tuple. Unlike the FileMetricRawDataMap, which is a transitory class that does not exist beyond a single method invocation, the FileMetricFilePatternCache is bound to an instance variable and thus persists over the life of the DailyProjectFileMetric instance with which it is associated. The current design enables us to play around pretty easily with the way this cache is managed over time. In the current implementation, what happens is that the initialize() method builds a FileMetricRawDataMap and then populates the cache with [FilePattern, FileType, SizeMetricName] tuples representing all top-level workspaces, all known filetypes, and three size metrics (numFiles, sourceLines, totalLines). Thereafter, when a request comes in for the metric associated with a [FilePattern, FileType, SizeMetricName] tuple, we check the cache to see if it is already present and if so, return the value directly. Things get very interesting when the tuple is not present. I won't go into details here, but the gist of it is that if the new tuple consists of a known metric and wild cards (i.e. the wild-card FilePattern "**" or the wildcard FileType "*"), then we can compute the answer from the data in the cache and return it (updating the cache) without needing to go back to the raw data. On the other hand, if the new tuple consists of a size metric we haven't seen before (such as "WMC" or "commentLines"), or a FilePattern we haven't seen before (such as "**/*.java"), then we have to go back to the raw sensor data. So, we build a new (local) instance of FileMetricRawDataMap, do the processing, update the cache with the new info, and return. One of the interesting outcomes from this cache maintenance design is that I was unable to come up with an approach that dealt well with "common" scenarios without also being susceptible to certain "pathological" cases where performance has the potential to be quite bad! Well, we'll just have to see what happens. The good news is that because of the more modular nature of the new design, it should be pretty easy to instrument in order to figure out what's going on when badness occurs. 4. FileMetricSummaryGrid: The final one of the four classes, about 250 LOC, is used to generate the data structures required by the DailyProjectDetails analysis command (i.e. the summary string and the drilldown by top-level workspace.) The constructor for this class takes the FileMetricFilePatternCache instance, which was conveniently initialized with exactly the right set of data required to generate the summary string and drilldowns. This class creates some intermediate data structures, and so the appropriate way to use it is to create it in a local variable, get the summary string and drilldown data structures from it, cache those results, and then let the FileMetricSummaryGrid instance go out of scope. What's also nice about this organization is that the FileMetricSummaryGrid is constructed only when the summary string or drilldown is requested for the first time. This saves some computation in the case of telemetry-induced DailyProjectData instances where no summary string/drilldowns will be needed. I am a bit embarrassed to note that the results of all of this restructuring is now a total of 1200 LOC across the four classes! In my defense, I believe that the extra 400 LOC can be more than accounted for by (a) additional comments---the new classes have significantly more documentation; (b) debugging utilities--- for example each of the three ADT classes has a toString() method which together probably account for 100 LOC out of the 400, and (c) a more complicated cache maintenance algorithm in the FileMetricFilePattern cache, which should hopefully result in improved performance for common cases. Over the next several days, I will be writing unit tests to evaluate the functional correctness of the API, and also playing with the telemetry stuff on my laptop to see how it behaves. My hope is that by late this week, it will be sufficiently stable that we can replace the original DailyProjectFileMetric class by this one, and check it out on the Telemetry Wall. You're certainly free to peruse the code at any time, but you might want to wait a few more days until I've worked the kinks out. I am sure there are a few more surprises in store for me. Cheers, Philip
