[HACKYSTAT-DEV-L] DailyProjectData design patterns (DailyProjectFileMetrics redesign)

Philip Johnson Mon, 29 May 2006 05:38:37 -0700

Greetings, all,

I wanted to provide an update on my hacking activities this past week so
that you have an idea of where things are heading with DailyProjectData.
As you all know, I've been redesigning the DailyProjectFileMetric class,
which was an 800 LOC behemoth that reflected the heroic efforts on the part
of Hongbing and Cedric to create a system that could deal with the
processing of literally hundreds of thousands of FileMetric instances in
very general ways with acceptable runtime performance and without blowing
out the heap.  DailyProjectFileMetric presents a very interesting set of
design challenges, and Hongbing and Cedric did a really great job of
satisfying them: I learned a lot from reading the code.


However, the original class has visible bugs and is quite difficult for a
newcomer to understand and maintain.  It basically represents a shotgun
marriage between the original implementation, which pre-dated telemetry and
was intended to just produce DailyProjectDetails summaries and drilldowns,
with the later needs of telemetry, which requires much more flexibility in
the kinds of size data to compute and cache.

This marriage, alas, is not a happy one, and indeed Cedric at one point
suggested divorce: producing two entirely separate classes, one
specifically for telemetry and one specifically for DailyProjectDetails.
In some sense, the two functionalities were already sleeping in separate
bedrooms within the single class, with their own individual intermediate
data structures and so forth.

I resisted the idea of divorce, not because of some antiquated attachment
to the idea that "What Hackystat hath put together, let no Hacker put
asunder", but rather because at a conceptual level, the two functions
(telemetry and DailyProjectDetails) simply have a lot in common.

I want to take a few minutes to outline the new design so that anyone who
is currently fiddling with DailyProject classes can see where I'm heading
and hopefully get some ideas to assist them in their efforts.  Sometime
during the 7.5 release cycle, I will more formally document this stuff in
the Design DailyProject Classes chapter of the developer's guide.  Until
then, Use the Source, Luke.

In the new design, the 800 LOC class has been restructured into four new
classes:

1. DailyProjectFileMetric2: This is the new top-level class, about 250 LOC,
with all methods around 20 LOC or less.  It implements a newly discovered
DailyProjectData design pattern: a DailyProjectData constructor must be
light-weight and do no raw sensor data processing. (The current
DailyProjectFileMetric constructor does a lot of raw sensor processing in the
constructor, and can take several seconds to complete.)  I came to this
conclusion after realizing that while we do guarantee that the
DailyProjectData cache will *return* a single instance of a DailyProjectData
subclass, we _do not_ guarantee that only a single instance will be
instantiated!  This was a revelation to me, and could potentially explain
some interesting performance situations (such as when the telemetry wall
sends a bunch of requests nearly simultaneously to the server, potentially
resulting in many instantiations of the same DailyProjectFileMetric class.)

After considering a number of alternatives, the design I am currently
using involves a private initialize() method that is always called
first thing in every top-level, synchronized public method, and which sets
a boolean after completing so that subsequent invocations do no work.  This
means that no processing occurs for a given DailyProjectFileMetric[Project,
Day] instance until after the instance is returned from the
DailyProjectData cache. Since the public methods are all synchronized, all
of the threads will wait for the first one to finish initialization. This
should guarantee that the initialization occurs only once regardless of how
many instantiations actually occur.

The remaining three classes implement abstract data types that encapsulate
the important data structures and processing results for FileMetric
processing:

2. FileMetricRawDataMap: This class, about 300 LOC, reads in the raw
FileMetric sensor data from all of the users associated with the Project
and constructs a mapping from [Tool, TopLevelWorkspaceName] to instances of
an inner class called FileMetricLatestBatch.  This results of this process
includes two levels of filtering: only those FileMetric instances associated
with the given Project are included, and only those FileMetric instances
with the latest "runtime" value for a given top-level [Tool,
TopLevelWorkspace] are included.  

By making sure that instances of this class are only bound to local
variables, it is easy to ensure that the FileMetric sensor data is
available for GC and that FileMetricRawDataMaps are always temporary,
intermediate data structures.

The class also includes a method called getDefaultPreferredTool.  One of
the problems with the original DailyProjectFileMetric class is
inconsistency in the way FileMetric data generated by different tools but
regarding the same underlying source objects is selected. In some
situations, the latest sent data is selected, while in others, metric data
from different tools will be merged together based upon which tool sent
the highest value.  There might even be other implicit selection techniques
that I didn't discover. 

The FileMetricRawDataMap class provides the getDefaultPreferredTool method
as a way to provide a simple, consistent approach to dealing with the
presence of FileMetric data from multiple tools on the same underlying code
base. This method samples a runtime value from each tool's latest batch,
and returns the tool with the latest runtime.  As a "default" method, it is
possible for future clients to provide alternative selection criteria, but
this at least implements a consistent, simple to understand approach.

3. FileMetricFilePatternCache: This class, about 400 LOC, uses the
FileMetricRawDataMap to build a mapping from [FilePattern, FileType,
SizeMetricName] to Long, where the Long represents the actual size metric
value associated with the tuple.

Unlike the FileMetricRawDataMap, which is a transitory class that does not
exist beyond a single method invocation, the FileMetricFilePatternCache is
bound to an instance variable and thus persists over the life of the
DailyProjectFileMetric instance with which it is associated.   

The current design enables us to play around pretty easily with the way
this cache is managed over time. In the current implementation, what happens is 
that the
initialize() method builds a FileMetricRawDataMap and then populates the
cache with [FilePattern, FileType, SizeMetricName] tuples representing all
top-level workspaces, all known filetypes, and three size metrics
(numFiles, sourceLines, totalLines).  

Thereafter, when a request comes in for the metric associated with a
[FilePattern, FileType, SizeMetricName] tuple, we check the cache to see if
it is already present and if so, return the value directly.  Things get
very interesting when the tuple is not present.  I won't go into details
here, but the gist of it is that if the new tuple consists of a known
metric and wild cards (i.e. the wild-card FilePattern "**" or the wildcard
FileType "*"), then we can compute the answer from the data in the cache
and return it (updating the cache) without needing to go back to the raw
data.  On the other hand, if the new tuple consists of a size metric we
haven't seen before (such as "WMC" or "commentLines"), or a FilePattern we
haven't seen before (such as "**/*.java"), then we have to go back to the
raw sensor data.  So, we build a new (local) instance of
FileMetricRawDataMap, do the processing, update the cache with the new
info, and return.

One of the interesting outcomes from this cache maintenance design is that
I was unable to come up with an approach that dealt well with "common"
scenarios without also being susceptible to certain "pathological" cases
where performance has the potential to be quite bad! Well, we'll just have
to see what happens.  The good news is that because of the more modular 
nature of the new design, it should be pretty easy to instrument in order to 
figure out what's going on when badness occurs. 

4. FileMetricSummaryGrid: The final one of the four classes, about 250 LOC, is
used to generate the data structures required by the DailyProjectDetails
analysis command (i.e. the summary string and the drilldown by top-level
workspace.)  The constructor for this class takes the
FileMetricFilePatternCache instance, which was conveniently initialized
with exactly the right set of data required to generate the summary string
and drilldowns.  This class creates some intermediate data structures, and
so the appropriate way to use it is to create it in a local variable, get
the summary string and drilldown data structures from it, cache those
results, and then let the FileMetricSummaryGrid instance go out of scope.  
What's also nice about
this organization is that the FileMetricSummaryGrid is constructed only
when the summary string or drilldown is requested for the first time.  This
saves some computation in the case of telemetry-induced DailyProjectData
instances where no summary string/drilldowns will be needed.

I am a bit embarrassed to note that the results of all of this
restructuring is now a total of 1200 LOC across the four classes!  In my
defense, I believe that the extra 400 LOC can be more than accounted for by
(a) additional comments---the new classes have significantly more
documentation; (b) debugging utilities--- for example each of the three ADT
classes has a toString() method which together probably account for 100 LOC
out of the 400, and (c) a more complicated cache maintenance algorithm in
the FileMetricFilePattern cache, which should hopefully result in improved
performance for common cases.

Over the next several days, I will be writing unit tests to evaluate the
functional correctness of the API, and also playing with the telemetry
stuff on my laptop to see how it behaves.  My hope is that by late this
week, it will be sufficiently stable that we can replace the original
DailyProjectFileMetric class by this one, and check it out on the Telemetry
Wall. 

You're certainly free to peruse the code at any time, but you might want to
wait a few more days until I've worked the kinks out. I am sure there are a
few more surprises in store for me.

Cheers,
Philip

[HACKYSTAT-DEV-L] DailyProjectData design patterns (DailyProjectFileMetrics redesign)

Reply via email to