Hi, Aaron,
You misunderstood me. My point is that the size of our data is not that
large, and our server should be more than enough to handle it. I am
not suggesting that we migrate to a DB, because the cost of reading an
XML file from disk and parsing it is almost negligible.
Cheers,
Cedric
Aaron Akihisa Kagawa wrote:
> that our data can fit in a normalized database less than 100MB.
I'm not sure if I shared this or not, but I did spend a little while getting Hackystat to work with Berkeley DB XML. Basically, I mapped our XML directly into the database, so it is not normalized at all. I thought it would improve performance a little, but I don't think it did.
It would take only about a week of effort to fully integrate Berkeley DB
XML with Hackystat. So, if anyone is interested in trying it out, let me know.
thanks, aaron
----- Original Message -----
From: "(Cedric) Qin ZHANG" <[EMAIL PROTECTED]>
Date: Wednesday, March 15, 2006 1:32 pm
Subject: [HACKYSTAT-DEV-L] Summary report on Hackystat performance analysis
To: [email protected]
Hi,
I have been working on improving Hackystat performance over the past week:
http://hackydev.ics.hawaii.edu:8080/browse/HACK-595
The total data we have, in compressed zip format, is around 220MB.
Considering how redundant our XML data is, I can almost say that our
data would fit in a normalized database in less than 100MB. There
should be no problem for a server with 2 Xeon CPUs and 2GB of RAM to
process this amount of data. But...
Following are my profiling-guided findings:
(1) When sensor data is cached in memory, almost 100% of the time is
spent determining whether a file name or a class name belongs to a
project. Determining this for a class name is especially time-consuming,
because it involves a loop over all the file names associated with that
class name.
(2) The Workspace method is called hundreds of thousands of times. Any
minor improvement will have a huge impact on performance.
(3) When sensor data is not cached in memory, using FileMetric data
computation as an example, roughly half of the time is spent creating
SensorData instances (the cost of loading the XML data is negligible;
more than 80% of the time is spent in the call to recognizeData()), and
the other half is spent determining whether a file belongs to a project.
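To make finding (1) concrete, here is a minimal sketch of why the class-name check is so much more expensive than the file-name check. All names below are hypothetical stand-ins, not the actual Hackystat API: the point is only the nested loop, where every class-name query fans out over all files associated with that class, and each file is then tested against every workspace prefix.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the project-membership checks described above. */
public class ProjectChecker {

  /** A file belongs to a project if it falls under one of the project's
   *  workspace prefixes (simplified stand-in for the real test). */
  public static boolean fileBelongsToProject(String fileName, List<String> workspaces) {
    for (String workspace : workspaces) {        // O(workspaces)
      if (fileName.startsWith(workspace)) {
        return true;
      }
    }
    return false;
  }

  /** A class belongs to a project if ANY file associated with it does.
   *  This extra loop over all associated file names is what makes
   *  class-name checks dominate the profile. */
  public static boolean classBelongsToProject(String className,
      Map<String, List<String>> classToFiles, List<String> workspaces) {
    List<String> files = classToFiles.getOrDefault(className, List.of());
    for (String file : files) {                  // O(files per class)
      if (fileBelongsToProject(file, workspaces)) {
        return true;
      }
    }
    return false;
  }
}
```

This also illustrates why recommendation (1) below helps: sensors that send file names directly skip the outer loop entirely.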
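One way to exploit finding (2), that the same Workspace queries repeat hundreds of thousands of times, is to memoize the answers. This is only a sketch with invented names, not the actual fix that was made to Workspace.isSubWorkspace():

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: cache containment results so repeated queries become O(1)
 *  hash lookups. Names are illustrative, not the real Workspace API. */
public class WorkspaceCache {
  private final Map<String, Boolean> cache = new HashMap<>();
  private final String workspace;

  public WorkspaceCache(String workspace) {
    this.workspace = workspace;
  }

  /** Stand-in for an isSubWorkspace-style containment test. */
  public boolean contains(String path) {
    // The string comparison runs only on a cache miss; the hundreds of
    // thousands of repeat calls are served from the map.
    return cache.computeIfAbsent(path, p -> p.startsWith(workspace));
  }
}
```

Because the method is so hot, even this trivial cache trades a small, bounded amount of memory for a large cut in repeated string work.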
I have made changes to some badly-behaved DailyProjectObjects, and
Hongbing has improved the Workspace.isSubWorkspace() code. We should be
OK for a while.
I am going to close HACK-595. Following are my recommendations for
future releases (starting from the lowest-hanging fruit):
(1) Get rid of the Java class mapper. Modify the Emma, UnitTest, and
JDepend sensors to send file names instead of class names.
(2) Review Workspace code performance. Extract every tiny bit of
performance we can get.
(3) Review the performance of the Sdt code, especially the part related
to evolutionary sensor data, starting from the types we have the most
data for (e.g. FileMetric).
(4) Make sure that future implementations of DailyProjectObject do not
hold references to SensorData. DailyProjectObject should also have a
small footprint (e.g. it should not hold references to thousands of
String objects).
(5) For reducers that return multiple telemetry streams, change them to
do the computation column-wise instead of row-wise. For example, a
reducer that computes size telemetry for each language should finish
all computations for the first day before moving on to the next day.
It should not compute the Java size telemetry first, then the C size
telemetry, then the Perl telemetry, and so on. This reduces the chance
that raw SensorData needs to be reloaded from disk.
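The idea in (4) can be sketched as follows, with hypothetical names: fold each sensor-data entry into an aggregate at construction time and let the raw instances be garbage-collected, instead of keeping them (or thousands of their Strings) alive for the lifetime of the daily object.

```java
import java.util.List;

/** Hypothetical sketch of a small-footprint daily project object. */
public class DailySizeSummary {
  // Keep only the aggregates, never the raw entries or their Strings.
  private final long totalLoc;
  private final int entryCount;

  /** Hypothetical stand-in for a SensorData record carrying a size metric. */
  public record SizeEntry(String fileName, long loc) { }

  public DailySizeSummary(List<SizeEntry> entries) {
    long loc = 0;
    for (SizeEntry e : entries) {
      loc += e.loc();            // fold each entry into the aggregate...
    }                            // ...and let the entries be collected
    this.totalLoc = loc;
    this.entryCount = entries.size();
  }

  public long getTotalLoc() { return totalLoc; }
  public int getEntryCount() { return entryCount; }
}
```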
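The loop-ordering change in (5) is just a transposition: put the day in the outer loop so each day's raw data is loaded once and shared by every language stream, rather than reloaded once per language. A sketch under invented names (the real reducer API may look quite different); the load counter makes the effect observable:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of column-wise telemetry reduction: outer loop over days,
 *  inner loop over languages, so each day's data is loaded only once. */
public class ColumnWiseReducer {

  /** Returns how many times each day's raw data would be loaded. */
  public static Map<String, Integer> loadCounts(List<String> days, List<String> languages) {
    Map<String, Integer> loads = new LinkedHashMap<>();
    for (String day : days) {              // column-wise: day is the outer loop
      loads.merge(day, 1, Integer::sum);   // one load per day...
      for (String language : languages) {
        computeSizeFor(day, language);     // ...shared by every language stream
      }
    }
    return loads;
  }

  private static void computeSizeFor(String day, String language) {
    // placeholder for the per-day, per-language size computation
  }
}
```

With the loops the other way around (language outer, day inner), the same computation would touch each day once per language, multiplying the chance that a day's SensorData has been evicted and must be reloaded from disk.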
Cheers,
Cedric