Hi,

I've been working on improving Hackystat performance over the past week:
 http://hackydev.ics.hawaii.edu:8080/browse/HACK-595

The total data we have, in compressed zip format, is around 220MB. Considering how redundant our XML data is, I can almost say that our data would fit in a normalized database in less than 100MB. There should be no problem for a server with 2 Xeon CPUs and 2GB of RAM to process this amount of data. But...

Following are my profiling-guided findings:

(1) When sensor data is cached in memory, almost 100% of the time is spent determining whether a file name or a class name belongs to a project. Determining this for a class name is especially time consuming, because it involves a loop over all the file names associated with that class name.

(2) The Workspace method is called hundreds of thousands of times. Any minor improvement there will have a huge impact on performance.

(3) When sensor data is not cached in memory, using FileMetric data computation as an example, roughly half of the time is spent creating SensorData instances (the cost of loading the XML data is negligible; more than 80% of the time is spent in the call to recognizeData()), and the other half is spent determining whether a file belongs to a project.
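Since the membership check dominates the profile, one cheap mitigation is to memoize its answer so the expensive comparison runs at most once per (project root, path) pair. A minimal sketch; the class and method names below are illustrative, not the actual Hackystat API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: cache membership answers so the expensive
// check runs at most once per (project root, path) pair.
public class MembershipCache {
  private final Map<String, Boolean> cache = new HashMap<String, Boolean>();

  // Stand-in for the real, expensive membership logic that profiling
  // shows is called hundreds of thousands of times.
  private boolean computeIsMember(String projectRoot, String path) {
    return path.startsWith(projectRoot);
  }

  public boolean isMember(String projectRoot, String path) {
    // NUL separator avoids accidental key collisions between the two parts.
    String key = projectRoot + '\u0000' + path;
    Boolean cached = cache.get(key);
    if (cached == null) {
      cached = Boolean.valueOf(computeIsMember(projectRoot, path));
      cache.put(key, cached);
    }
    return cached.booleanValue();
  }
}
```

The cache trades memory for time, so it only pays off if the same paths are queried repeatedly, which is exactly what the profile suggests.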

I have made changes to some badly-behaved DailyProjectObjects, and Hongbing has improved the Workspace.isSubWorkspace() code. We should be OK for a while.

I am going to close HACK-595. Following are my recommendations for future releases (starting with the lowest-hanging fruit):

(1) Get rid of the Java class mapper. Modify the Emma, UnitTest, and JDepend sensors to send file names instead of class names.

(2) Review Workspace code performance. Extract every tiny bit of performance we can get.

(3) Review the performance of the Sdt code, especially the parts related to evolutionary sensor data, starting with the types for which we have the most data (e.g., FileMetric).

(4) Make sure that future implementations of DailyProjectObject do not hold references to SensorData. DailyProjectObject should also have a small footprint (e.g., it should not hold references to thousands of String objects).
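To illustrate the footprint point in (4): instead of retaining SensorData instances or thousands of Strings, a daily object can fold each datum into primitive aggregates as the data streams by, letting the raw objects be garbage-collected. The class and field names below are illustrative, not the real DailyProjectObject API:

```java
// Illustrative sketch: aggregate on the fly, retain no references.
public class DailyFileMetricSummary {
  private int fileCount = 0;
  private long totalLoc = 0;

  // Fold one datum into primitive counters; the caller can discard
  // the SensorData instance immediately afterward.
  public void add(String fileName, int linesOfCode) {
    fileCount++;
    totalLoc += linesOfCode;
  }

  public int getFileCount() { return fileCount; }
  public long getTotalLoc() { return totalLoc; }
}
```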

(5) For reducers that return multiple telemetry streams, change them to do the computation column-wise instead of row-wise. For example, a reducer that computes size telemetry for each language should finish all computations for the first day before moving to the next day. It should not compute the Java size telemetry first, then the C size telemetry, then the Perl telemetry, and so on. This reduces the chance that raw SensorData needs to be reloaded from disk.
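The loop-order change in (5) amounts to making days the outer loop, so each day's raw data is loaded once and shared by every stream. A schematic sketch; the types and helper methods are placeholders, not Telemetry API classes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of column-wise telemetry computation:
// load each day's sensor data once, then update every language
// stream from that single in-memory batch.
public class ColumnWiseReducer {
  public Map<String, int[]> computeSize(List<String> days, List<String> languages) {
    Map<String, int[]> streams = new HashMap<String, int[]>();
    for (String lang : languages) {
      streams.put(lang, new int[days.size()]);
    }
    for (int day = 0; day < days.size(); day++) {    // outer loop: days
      Object batch = loadSensorData(days.get(day));  // loaded exactly once per day
      for (String lang : languages) {                // inner loop: streams
        streams.get(lang)[day] = sizeFor(batch, lang);
      }
    }
    return streams;
  }

  // Stand-ins for the real (and expensive) data access.
  private Object loadSensorData(String day) { return day; }
  private int sizeFor(Object batch, String lang) {
    return batch.toString().length() + lang.length();
  }
}
```

Row-wise order would call loadSensorData() once per day *per language*; column-wise order calls it once per day regardless of how many streams the reducer produces.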

Cheers,

Cedric
