> that our data could fit in a normalized database of less than 100MB. 

I'm not sure if I shared this or not, but I did spend a little while getting 
Hackystat to work with Berkeley DB XML. Basically, I mapped our XML directly 
into the database, so it is not normalized at all. I thought it would improve 
performance a little, but I don't think it did. 

It would only take about a week or so of effort to fully integrate Berkeley DB 
XML with Hackystat. So, if anyone is interested in trying it out, let me know.

thanks, aaron

----- Original Message -----
From: "(Cedric) Qin ZHANG" <[EMAIL PROTECTED]>
Date: Wednesday, March 15, 2006 1:32 pm
Subject: [HACKYSTAT-DEV-L] Summary report on Hackystat performance analysis
To: [email protected]

> Hi,
> 
> I was working on improving hackystat performance in the past week.
>  http://hackydev.ics.hawaii.edu:8080/browse/HACK-595
> 
> The total data we have in compressed zip format is around 220MB. 
> Considering how redundant our xml data is, I can almost say that our 
> data could fit in a normalized database of less than 100MB. There 
> should be no problem for a server with 2 Xeon CPUs and 2GB of RAM to 
> process this amount of data. But...
> 
> Following are my profiling-guided findings:
> 
> (1) When sensor data is cached in memory, almost 100% of the time is 
> spent determining whether a file name or a class name belongs to a 
> project. Determining this for a class name is especially time 
> consuming, because it involves a loop over all the file names 
> associated with that class name.
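
The class-name check in finding (1) can be sketched roughly as follows. All names here (`ProjectMembership`, `classToFiles`, the memoizing cache) are hypothetical stand-ins of mine, not the actual Hackystat API; the cache is just one cheap way to avoid re-running the loop on repeated queries:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the hot path: deciding whether a class name belongs to a
// project means looping over every file name mapped to that class.
class ProjectMembership {
  private final Map<String, List<String>> classToFiles;       // class name -> file names
  private final Map<String, Boolean> cache = new HashMap<>(); // memoized results

  ProjectMembership(Map<String, List<String>> classToFiles) {
    this.classToFiles = classToFiles;
  }

  /** True if any file associated with the class lies under the project root. */
  boolean classBelongsToProject(String className, String projectRoot) {
    String key = className + "|" + projectRoot;
    Boolean hit = cache.get(key);
    if (hit != null) {
      return hit;                        // skip the loop on a repeat query
    }
    boolean result = false;
    List<String> files = classToFiles.getOrDefault(className, List.of());
    for (String file : files) {          // the loop identified as costly above
      if (file.startsWith(projectRoot)) {
        result = true;
        break;
      }
    }
    cache.put(key, result);
    return result;
  }
}
```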
> 
> (2) The Workspace method is called hundreds of thousands of times. Any 
> minor improvement will have a huge impact on performance.
> 
> (3) When sensor data is not cached in memory, using FileMetric data 
> computation as an example, roughly half of the time is spent creating 
> SensorData instances (the cost of loading the xml data is negligible; 
> more than 80% of the time is spent in the call to recognizeData()), and 
> the other half is spent determining whether a file belongs to a project.
> 
> I have made changes to some badly-behaved DailyProjectObjects, and 
> Hongbing has improved the Workspace.isSubWorkspace() code. We should be 
> OK for a while.
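
Hongbing's actual fix isn't shown here, but a method like isSubWorkspace() can often be reduced to a cheap string-prefix test along these lines (illustrative only; the class name and normalization are assumptions of mine):

```java
// Illustrative sketch of a prefix-based sub-workspace test.
class WorkspaceUtil {
  /** True if 'child' is the same workspace as 'parent' or nested under it. */
  static boolean isSubWorkspace(String parent, String child) {
    if (child.equals(parent)) {
      return true;
    }
    // Require a trailing separator so "src" does not match "srcgen".
    String prefix = parent.endsWith("/") ? parent : parent + "/";
    return child.startsWith(prefix);
  }
}
```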
> 
> I am going to close HACK-595. Following are my recommendations for 
> future releases (starting from the lowest-hanging fruit):
> 
> (1) Get rid of the Java class mapper. Modify the Emma, UnitTest, and 
> JDepend sensors to send file names instead of class names.
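
For recommendation (1), a sensor that only knows a Java class name could derive the source file name itself before sending, so the server never needs a class mapper. The helper below is a hypothetical sketch (name and placement are mine), relying on the standard `$` separator for nested classes:

```java
// Hypothetical sensor-side helper: derive a source file name from a class name.
class ClassNameMapper {
  /** "com.foo.Bar$Inner" -> "com/foo/Bar.java" (inner classes share the file). */
  static String toFileName(String className) {
    int dollar = className.indexOf('$');
    String outer = (dollar >= 0) ? className.substring(0, dollar) : className;
    return outer.replace('.', '/') + ".java";
  }
}
```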
> 
> (2) Review Workspace code performance. Extract every tiny bit of 
> performance we can get.
> 
> (3) Review the performance of the Sdt code, especially the part related 
> to evolutionary sensor data, starting with the type for which we have 
> the most data (e.g. FileMetric).
> 
> (4) Make sure that future implementations of DailyProjectObject do not 
> hold references to SensorData. DailyProjectObject should also have a 
> small footprint (e.g. it should not hold references to thousands of 
> String objects).
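
One way to read recommendation (4): compute the aggregates at construction time and let the raw data go, rather than retaining it in fields. This is a minimal sketch with simplified stand-in classes (the real SensorData and DailyProjectObject look nothing like this):

```java
import java.util.List;

// Sketch: aggregate on construction, keep no reference to the raw data.
class DailySizeSummary {
  // Minimal stand-in for a real SensorData instance.
  static class SensorData {
    final long linesOfCode;
    SensorData(long loc) { this.linesOfCode = loc; }
  }

  private final int fileCount;   // small footprint: two primitives,
  private final long totalLoc;   // not thousands of object references

  DailySizeSummary(List<SensorData> sensorData) {
    int files = 0;
    long loc = 0;
    for (SensorData d : sensorData) {
      files++;
      loc += d.linesOfCode;
    }
    this.fileCount = files;
    this.totalLoc = loc;
    // 'sensorData' is not stored, so the raw instances can be collected.
  }

  int getFileCount() { return fileCount; }
  long getTotalLoc() { return totalLoc; }
}
```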
> 
> (5) For reducers that return multiple telemetry streams, change them to 
> do the computation column-wise instead of row-wise. For example, a 
> reducer that computes size telemetry for each language should finish 
> all computations for the first day before moving on to the next day. It 
> should not compute the java size telemetry first, then the c size 
> telemetry, then the perl telemetry, and so on. This reduces the chance 
> that raw SensorData needs to be reloaded from disk.
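
The column-wise ordering in recommendation (5) is just a loop-order change: days in the outer loop, languages in the inner loop, so each day's raw data is touched once and shared by every stream. A hedged sketch, with per-day language sizes standing in for the real SensorData loading:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a column-wise multi-stream reducer (all names hypothetical).
class ColumnWiseReducer {
  /** Returns one per-day size series per language, in day order. */
  static Map<String, long[]> reduce(List<Map<String, Long>> dayData,
                                    List<String> languages) {
    Map<String, long[]> streams = new LinkedHashMap<>();
    for (String lang : languages) {
      streams.put(lang, new long[dayData.size()]);
    }
    // Column-wise: finish every language for day i before touching day
    // i+1, instead of one full pass over all days per language (row-wise).
    for (int day = 0; day < dayData.size(); day++) {
      Map<String, Long> sizes = dayData.get(day); // stand-in for loading raw data once
      for (String lang : languages) {
        streams.get(lang)[day] = sizes.getOrDefault(lang, 0L);
      }
    }
    return streams;
  }
}
```

With row-wise ordering, day i's raw data would be loaded once per language; column-wise it is loaded once total.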
> 
> Cheers,
> 
> Cedric
> 
