Hi,
I have been working on improving Hackystat performance over the past week.
http://hackydev.ics.hawaii.edu:8080/browse/HACK-595
The total data we have, in compressed zip format, is around 220MB.
Considering how redundant our XML data is, I would estimate that it
could fit in a normalized database of less than 100MB. A server with 2
Xeon CPUs and 2G of RAM should have no problem processing this amount
of data. But...
Following are my profiling-guided findings:
(1) When sensor data is cached in memory, almost 100% of the time is
spent determining whether a file name or a class name belongs to a
project. Determining membership for a class name is especially
time-consuming, because it involves a loop over all the file names
associated with that class name.
(2) The Workspace method is called hundreds of thousands of times, so
any minor improvement will have a huge impact on performance.
(3) When sensor data is not cached in memory, using FileMetric data
computation as an example, roughly half of the time is spent creating
SensorData instances (the cost of loading the XML data itself is
negligible; more than 80% of the time is spent in the call to
recognizeData()), and the other half is spent determining whether a
file belongs to a project.
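Since project-membership checks dominate both profiles, one obvious mitigation is to memoize the answer per file path. The sketch below illustrates the idea only; the class and method names are hypothetical, not the actual Hackystat API, and the startsWith() test stands in for the real (more expensive) workspace comparison.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of memoizing project-membership lookups so the expensive
 * test runs at most once per distinct file path.
 * All names here are hypothetical, not the real Hackystat API.
 */
public class MembershipCache {
  private final Map<String, Boolean> cache = new HashMap<String, Boolean>();
  private final String projectRoot;

  public MembershipCache(String projectRoot) {
    this.projectRoot = projectRoot;
  }

  /** Returns whether the file belongs to the project, caching the answer. */
  public boolean belongsToProject(String filePath) {
    Boolean cached = cache.get(filePath);
    if (cached != null) {
      return cached.booleanValue();
    }
    // Stand-in for the real workspace membership test.
    boolean result = filePath.startsWith(projectRoot);
    cache.put(filePath, Boolean.valueOf(result));
    return result;
  }
}
```

Because the same file names recur across days and sensor data types, the hit rate of such a cache should be very high.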
I have made changes to some badly-behaved DailyProjectObjects, and
Hongbing has improved the Workspace.isSubWorkspace() code. We should be
OK for a while.
I am going to close HACK-595. Following are my recommendations for
future releases (starting with the lowest-hanging fruit):
(1) Get rid of the Java class mapper. Modify the Emma, UnitTest, and
JDepend sensors to send file names instead of class names.
(2) Review Workspace code performance. Extract every tiny bit of
performance we can get.
(3) Review the performance of the Sdt code, especially the parts
related to evolutionary sensor data, starting with the types for which
we have the most data (e.g. FileMetric).
(4) Make sure that future implementations of DailyProjectObject do not
hold references to SensorData. DailyProjectObject should also have a
small footprint (e.g. it should not hold references to thousands of
String objects).
(5) For reducers that return multiple telemetry streams, change them to
do the computation column-wise instead of row-wise. For example, a
reducer that computes size telemetry for each language should finish
all computations for the first day before moving on to the next day. It
should not compute Java size telemetry first, then C size telemetry,
then Perl telemetry, and so on. This reduces the chance that raw
SensorData needs to be reloaded from disk.
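To make recommendation (5) concrete, the sketch below contrasts the two loop orders by counting disk loads. It assumes the cache cannot hold every day's raw SensorData, so each access in the inner pass counts as a load; the names are hypothetical and the telemetry computation itself is elided.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Sketch contrasting row-wise (language-major) and column-wise
 * (day-major) reducer loops. loadDay() stands in for reloading raw
 * SensorData from disk; names are hypothetical, not the Hackystat API.
 */
public class ReducerOrder {
  static int loadCount = 0;

  /** Pretend disk load of one day's raw SensorData. */
  static void loadDay(String day) {
    loadCount++;
  }

  /** Row-wise: one full pass over the days per language. */
  static int rowWise(List<String> languages, List<String> days) {
    loadCount = 0;
    for (String lang : languages) {
      for (String day : days) {
        loadDay(day);
        // ... compute size telemetry for (lang, day) ...
      }
    }
    return loadCount;  // languages x days loads
  }

  /** Column-wise: load each day once, then compute every stream for it. */
  static int columnWise(List<String> languages, List<String> days) {
    loadCount = 0;
    for (String day : days) {
      loadDay(day);
      for (String lang : languages) {
        // ... compute size telemetry for (lang, day) ...
      }
    }
    return loadCount;  // days loads, regardless of stream count
  }
}
```

With three languages over a 30-day interval, the row-wise order would touch the raw data 90 times versus 30 for the column-wise order, and each additional stream multiplies the row-wise cost further.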
Cheers,
Cedric