> that our data can fit in a normalized database less than 100MB.

I'm not sure whether I shared this or not, but I did spend a little while getting Hackystat to work with Berkeley DB XML. Basically, I mapped our XML directly into the database, so it is not normalized at all. I expected that to improve performance a little, but it doesn't appear to have done so.
It would take only about a week of effort to fully integrate Berkeley DB XML with Hackystat, so if anyone is interested in trying it out, let me know.

thanks,
aaron

----- Original Message -----
From: "(Cedric) Qin ZHANG" <[EMAIL PROTECTED]>
Date: Wednesday, March 15, 2006 1:32 pm
Subject: [HACKYSTAT-DEV-L] Summary report on Hackystat performance analysis
To: [email protected]

> Hi,
>
> I was working on improving Hackystat performance in the past week:
> http://hackydev.ics.hawaii.edu:8080/browse/HACK-595
>
> The total data we have, in compressed zip format, is around 220MB.
> Considering how redundant our xml data is, I can almost say that our
> data would fit in a normalized database of less than 100MB. There
> should be no problem for a server with 2 Xeon CPUs and 2G RAM to
> process this amount of data. But...
>
> Following are my profiling-guided findings:
>
> (1) When sensor data is cached in memory, almost 100% of the time is
> spent determining whether a file name or a class name belongs to a
> project. Determining this for a class name is especially time
> consuming, because it involves a loop over all the file names
> associated with that class name.
>
> (2) The Workspace method is called hundreds of thousands of times,
> so any minor improvement will have a huge impact on performance.
>
> (3) When sensor data is not cached in memory, using the FileMetric
> data computation as an example, roughly half of the time is spent
> creating SensorData instances (the cost of loading the xml data is
> negligible; more than 80% of the time is spent in the call to
> recognizeData()), and the other half is spent determining whether a
> file belongs to a project.
>
> I have made changes to some badly behaved DailyProjectObjects, and
> Hongbing has improved the Workspace.isSubWorkspace() code. We should
> be OK for a while.
>
> I am going to close HACK-595. Following are my recommendations for
> future releases (starting from the lowest-hanging fruit):
>
> (1) Get rid of the Java class mapper.
> Modify the Emma, UnitTest, and JDepend sensors to send file names
> instead of class names.
>
> (2) Review Workspace code performance. Extract every tiny bit of
> performance we can get.
>
> (3) Review the performance of the Sdt code, especially the part
> related to evolutionary sensor data, starting from the types for
> which we have the most data (e.g. FileMetric).
>
> (4) Make sure that future implementations of DailyProjectObject do
> not hold references to SensorData. DailyProjectObject should also
> have a small footprint (e.g. it should not hold references to
> thousands of String objects).
>
> (5) For reducers that return multiple telemetry streams, change them
> to do the computation column-wise instead of row-wise. For example,
> a reducer that computes size telemetry for each type of language
> should finish all computations for the first day before moving on to
> the next day. It should not compute the Java size telemetry first,
> then the C size telemetry, then the Perl telemetry, and so on. This
> reduces the chance that raw SensorData needs to be reloaded from
> disk.
>
> Cheers,
>
> Cedric
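Since Workspace.isSubWorkspace() sits on the hot path in Cedric's findings (1) and (2), it may be useful to show what the cheap version of that check looks like. The sketch below is hypothetical and does not reflect the actual Hackystat signature or semantics; it simply assumes workspaces are path-like strings and reduces the membership test to a single normalized prefix comparison:

```java
/**
 * Hypothetical sketch of a fast sub-workspace test. Assumes workspaces
 * are slash-separated path strings; the real Hackystat API may differ.
 */
public final class WorkspaceUtil {

  /** Appends a trailing slash if one is missing, so that "src" and
      "src/" compare identically and "srcfoo" is not treated as
      being under "src". */
  private static String normalize(String workspace) {
    return workspace.endsWith("/") ? workspace : workspace + "/";
  }

  /** True iff child equals parent or lies underneath it.
      (Whether "equal" should count as a sub-workspace is a policy
      choice; this sketch says yes.) */
  public static boolean isSubWorkspace(String parent, String child) {
    // A single prefix comparison: no loops over associated file names.
    return normalize(child).startsWith(normalize(parent));
  }

  public static void main(String[] args) {
    System.out.println(isSubWorkspace("src/", "src/org/hackystat/")); // true
    System.out.println(isSubWorkspace("src/", "srcfoo/Bar.java"));    // false
  }
}
```

Even when the real check is this simple, finding (2) suggests the bigger win is calling it less often, e.g. by caching per-project membership results.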
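Recommendation (5) is essentially a loop interchange: put days in the outer loop and streams in the inner loop, so each day's raw SensorData is loaded at most once no matter how many streams are produced. The names below (reduce, loadDay, the per-language size map) are invented for illustration and are not the real Hackystat reducer API; a minimal sketch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of column-wise telemetry reduction. */
public final class ColumnWiseReducer {

  /**
   * Produces one stream (a list of daily values) per language.
   * loadDay stands in for pulling one day's raw SensorData from disk;
   * because all languages for a day are computed back-to-back, each
   * day is loaded exactly once.
   */
  public static Map<String, List<Integer>> reduce(
      List<String> days, List<String> languages,
      Function<String, Map<String, Integer>> loadDay) {
    Map<String, List<Integer>> streams = new HashMap<>();
    for (String lang : languages) {
      streams.put(lang, new ArrayList<>());
    }
    for (String day : days) {                            // outer loop: days
      Map<String, Integer> daySizes = loadDay.apply(day); // one load per day
      for (String lang : languages) {                    // inner loop: streams
        streams.get(lang).add(daySizes.getOrDefault(lang, 0));
      }
    }
    return streams;
  }
}
```

The row-wise alternative (outer loop over languages) would call loadDay once per day *per language*, which is exactly the repeated SensorData reloading the recommendation is trying to avoid.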
