Hi, Aaron,
You misunderstood me. My point is that the size of our data is not that
large, and our server should be more than enough to handle it. I am
not suggesting that we migrate to a DB, because the cost of reading an
XML file from disk and parsing it is almost negligible.
Cheers,
Cedric
Aaron Akihisa Kagawa wrote:
> that our data can fit in a normalized database less than 100MB.
I'm not sure if I shared this or not, but I did spend a little while getting Hackystat to work with Berkeley DB XML. Basically, I mapped our XML directly into the database, so it is not normalized at all. I thought it would improve performance a little, but I don't think it did.
It would take only about a week of effort to fully integrate Berkeley DB
XML with Hackystat. So, if anyone is interested in trying it out, let me know.
thanks, aaron
----- Original Message -----
From: "(Cedric) Qin ZHANG" <[EMAIL PROTECTED]>
Date: Wednesday, March 15, 2006 1:32 pm
Subject: [HACKYSTAT-DEV-L] Summary report on Hackystat performance analysis
To: [email protected]
Hi,
I have been working on improving Hackystat performance over the past week:
http://hackydev.ics.hawaii.edu:8080/browse/HACK-595
The total data we have, in compressed zip format, is around 220MB.
Considering how redundant our XML data is, I can almost say that our
data would fit in a normalized database in less than 100MB. There
should be no problem for a server with 2 Xeon CPUs and 2GB of RAM to
process this amount of data. But...
Following are my profiling-guided findings:
(1) When sensor data is cached in memory, almost 100% of the time is
spent determining whether a file name or a class name belongs to a
project. Determining this for a class name is especially time-consuming,
because it involves a loop over all the file names associated with that
class name.
(2) The Workspace method is called hundreds of thousands of times. Any
minor improvement will have a huge impact on performance.
(3) When sensor data is not cached in memory, using FileMetric data
computation as an example, roughly half of the time is spent creating
SensorData instances (the cost of loading the XML data is negligible;
more than 80% of the time is spent in the call to recognizeData()), and
the other half is spent determining whether a file belongs to a project.
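To make finding (1) concrete, here is a minimal sketch of why the class-name check is so much more expensive than the file-name check. All names below are hypothetical stand-ins, not the actual Hackystat API: the point is only the nested loop, where every class-name query fans out over all files associated with that class, and each file is then tested against every workspace prefix.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the project-membership checks described above. */
public class ProjectChecker {

  /** A file belongs to a project if it falls under one of the project's
   *  workspace prefixes (simplified stand-in for the real test). */
  public static boolean fileBelongsToProject(String fileName, List<String> workspaces) {
    for (String workspace : workspaces) {        // O(workspaces)
      if (fileName.startsWith(workspace)) {
        return true;
      }
    }
    return false;
  }

  /** A class belongs to a project if ANY file associated with it does.
   *  This extra loop over all associated file names is what makes
   *  class-name checks dominate the profile. */
  public static boolean classBelongsToProject(String className,
      Map<String, List<String>> classToFiles, List<String> workspaces) {
    List<String> files = classToFiles.getOrDefault(className, List.of());
    for (String file : files) {                  // O(files per class)
      if (fileBelongsToProject(file, workspaces)) {
        return true;
      }
    }
    return false;
  }
}
```

This also illustrates why recommendation (1) below helps: sensors that send file names directly skip the outer loop entirely.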
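One way to exploit finding (2), that the same Workspace queries repeat hundreds of thousands of times, is to memoize the answers. This is only a sketch with invented names, not the actual fix that was made to Workspace.isSubWorkspace():

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: cache containment results so repeated queries become O(1)
 *  hash lookups. Names are illustrative, not the real Workspace API. */
public class WorkspaceCache {
  private final Map<String, Boolean> cache = new HashMap<>();
  private final String workspace;

  public WorkspaceCache(String workspace) {
    this.workspace = workspace;
  }

  /** Stand-in for an isSubWorkspace-style containment test. */
  public boolean contains(String path) {
    // The string comparison runs only on a cache miss; the hundreds of
    // thousands of repeat calls are served from the map.
    return cache.computeIfAbsent(path, p -> p.startsWith(workspace));
  }
}
```

Because the method is so hot, even this trivial cache trades a small, bounded amount of memory for a large cut in repeated string work.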
I have made changes to some badly-behaved DailyProjectObjects, and
Hongbing has improved the Workspace.isSubWorkspace() code. We should be
OK for a while.
I am going to close HACK-595. Following are my recommendations for
future releases (starting from the lowest-hanging fruit):
(1) Get rid of the Java class mapper. Modify the Emma, UnitTest, and
JDepend sensors to send file names instead of class names.
(2) Review Workspace code performance. Extract every tiny bit of
performance we can get.
(3) Review the performance of the Sdt code, especially the part related
to evolutionary sensor data, starting from the types we have the most
data for (e.g. FileMetric).
(4) Make sure that future implementations of DailyProjectObject do not
hold references to SensorData. DailyProjectObject should also have a
small footprint (e.g. it should not hold references to thousands of
String objects).
(5) For reducers that return multiple telemetry streams, change them to
do the computation column-wise instead of row-wise. For example, a
reducer that computes size telemetry for each language should finish
all computations for the first day before moving on to the next day.
It should not compute the Java size telemetry first, then the C size
telemetry, then the Perl telemetry, and so on. This reduces the chance
that raw SensorData needs to be reloaded from disk.
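The idea in (4) can be sketched as follows, with hypothetical names: fold each sensor-data entry into an aggregate at construction time and let the raw instances be garbage-collected, instead of keeping them (or thousands of their Strings) alive for the lifetime of the daily object.

```java
import java.util.List;

/** Hypothetical sketch of a small-footprint daily project object. */
public class DailySizeSummary {
  // Keep only the aggregates, never the raw entries or their Strings.
  private final long totalLoc;
  private final int entryCount;

  /** Hypothetical stand-in for a SensorData record carrying a size metric. */
  public record SizeEntry(String fileName, long loc) { }

  public DailySizeSummary(List<SizeEntry> entries) {
    long loc = 0;
    for (SizeEntry e : entries) {
      loc += e.loc();            // fold each entry into the aggregate...
    }                            // ...and let the entries be collected
    this.totalLoc = loc;
    this.entryCount = entries.size();
  }

  public long getTotalLoc() { return totalLoc; }
  public int getEntryCount() { return entryCount; }
}
```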
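The loop-ordering change in (5) is just a transposition: put the day in the outer loop so each day's raw data is loaded once and shared by every language stream, rather than reloaded once per language. A sketch under invented names (the real reducer API may look quite different); the load counter makes the effect observable:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of column-wise telemetry reduction: outer loop over days,
 *  inner loop over languages, so each day's data is loaded only once. */
public class ColumnWiseReducer {

  /** Returns how many times each day's raw data would be loaded. */
  public static Map<String, Integer> loadCounts(List<String> days, List<String> languages) {
    Map<String, Integer> loads = new LinkedHashMap<>();
    for (String day : days) {              // column-wise: day is the outer loop
      loads.merge(day, 1, Integer::sum);   // one load per day...
      for (String language : languages) {
        computeSizeFor(day, language);     // ...shared by every language stream
      }
    }
    return loads;
  }

  private static void computeSizeFor(String day, String language) {
    // placeholder for the per-day, per-language size computation
  }
}
```

With the loops the other way around (language outer, day inner), the same computation would touch each day once per language, multiplying the chance that a day's SensorData has been evicted and must be reloaded from disk.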
Cheers,
Cedric