Hi,
I have been working on improving Hackystat performance over the past week.
http://hackydev.ics.hawaii.edu:8080/browse/HACK-595
The total data we have, in compressed zip format, is around 220MB.
Considering how redundant our XML data is, I would estimate that it
could fit in a normalized database of less than 100MB. A server with 2
Xeon CPUs and 2G of RAM should have no problem processing this amount
of data. But...
Following are my profiling-guided findings:
(1) When sensor data is cached in memory, almost 100% of the time is
spent determining whether a file name or a class name belongs to a
project. Determining membership for a class name is especially
time-consuming, because it involves a loop over all the file names
associated with that class name.
(2) The Workspace method is called hundreds of thousands of times, so
any minor improvement will have a huge impact on performance.
(3) When sensor data is not cached in memory, using FileMetric data
computation as an example, roughly half of the time is spent creating
SensorData instances (the cost of loading the XML data itself is
negligible; more than 80% of the time is spent in the call to
recognizeData()), and the other half is spent determining whether a
file belongs to a project.
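Since project-membership checks dominate both profiles, one obvious mitigation is to memoize the answer per file path. The sketch below illustrates the idea only; the class and method names are hypothetical, not the actual Hackystat API, and the startsWith() test stands in for the real (more expensive) workspace comparison.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of memoizing project-membership lookups so the expensive
 * test runs at most once per distinct file path.
 * All names here are hypothetical, not the real Hackystat API.
 */
public class MembershipCache {
  private final Map<String, Boolean> cache = new HashMap<String, Boolean>();
  private final String projectRoot;

  public MembershipCache(String projectRoot) {
    this.projectRoot = projectRoot;
  }

  /** Returns whether the file belongs to the project, caching the answer. */
  public boolean belongsToProject(String filePath) {
    Boolean cached = cache.get(filePath);
    if (cached != null) {
      return cached.booleanValue();
    }
    // Stand-in for the real workspace membership test.
    boolean result = filePath.startsWith(projectRoot);
    cache.put(filePath, Boolean.valueOf(result));
    return result;
  }
}
```

Because the same file names recur across days and sensor data types, the hit rate of such a cache should be very high.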
I have made changes to some badly-behaved DailyProjectObjects, and
Hongbing has improved the Workspace.isSubWorkspace() code. We should be
OK for a while.
I am going to close HACK-595. Following are my recommendations for
future releases (starting with the lowest-hanging fruit):
(1) Get rid of the Java class mapper. Modify the Emma, UnitTest, and
JDepend sensors to send file names instead of class names.
(2) Review Workspace code performance. Extract every tiny bit of
performance we can get.
(3) Review the performance of the Sdt code, especially the parts
related to evolutionary sensor data, starting with the types for which
we have the most data (e.g. FileMetric).
(4) Make sure that future implementations of DailyProjectObject do not
hold references to SensorData. DailyProjectObject should also have a
small footprint (e.g. it should not hold references to thousands of
String objects).
(5) For reducers that return multiple telemetry streams, change them to
do the computation column-wise instead of row-wise. For example, a
reducer that computes size telemetry for each language should finish
all computations for the first day before moving on to the next day. It
should not compute Java size telemetry first, then C size telemetry,
then Perl telemetry, and so on. This reduces the chance that raw
SensorData needs to be reloaded from disk.
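To make recommendation (5) concrete, the sketch below contrasts the two loop orders by counting disk loads. It assumes the cache cannot hold every day's raw SensorData, so each access in the inner pass counts as a load; the names are hypothetical and the telemetry computation itself is elided.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Sketch contrasting row-wise (language-major) and column-wise
 * (day-major) reducer loops. loadDay() stands in for reloading raw
 * SensorData from disk; names are hypothetical, not the Hackystat API.
 */
public class ReducerOrder {
  static int loadCount = 0;

  /** Pretend disk load of one day's raw SensorData. */
  static void loadDay(String day) {
    loadCount++;
  }

  /** Row-wise: one full pass over the days per language. */
  static int rowWise(List<String> languages, List<String> days) {
    loadCount = 0;
    for (String lang : languages) {
      for (String day : days) {
        loadDay(day);
        // ... compute size telemetry for (lang, day) ...
      }
    }
    return loadCount;  // languages x days loads
  }

  /** Column-wise: load each day once, then compute every stream for it. */
  static int columnWise(List<String> languages, List<String> days) {
    loadCount = 0;
    for (String day : days) {
      loadDay(day);
      for (String lang : languages) {
        // ... compute size telemetry for (lang, day) ...
      }
    }
    return loadCount;  // days loads, regardless of stream count
  }
}
```

With three languages over a 30-day interval, the row-wise order would touch the raw data 90 times versus 30 for the column-wise order, and each additional stream multiplies the row-wise cost further.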
Cheers,
Cedric