Great work, Cedric, and very useful observations on what to do next.
Getting rid of the workspace mapper is a top priority for 7.4.
Evolution (the reorganizeData() method call you talk about below) can
indeed become expensive. I want to evolve several other SDTs in the next
version, so we'll need to keep an eye on this overhead. One way to minimize
it is to allow sensor data to write out its evolved form to disk
after doing the reorganization; that would make the cost one-time-only.
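
To make the "one-time-only" idea concrete, here is a minimal sketch in Java. It is not Hackystat code: the class name EvolveOnceStore, the "#version=2" header line, and the toy evolution step are all assumptions made up for illustration. The point is only that once the evolved form is written back, a later load finds the current format and skips the evolution.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: evolve old-format data once, then persist the result. */
class EvolveOnceStore {
  /** Assumed marker for the current format; not a real Hackystat convention. */
  static final String CURRENT_HEADER = "#version=2";

  /** Counts how often the (expensive) evolution step actually ran. */
  int evolutions = 0;

  /** Toy evolution: prepend the header and add a placeholder third field. */
  private List<String> evolve(List<String> oldLines) {
    evolutions++;
    List<String> evolved = new ArrayList<>();
    evolved.add(CURRENT_HEADER);
    for (String line : oldLines) {
      evolved.add(line + ",unknown");
    }
    return evolved;
  }

  /** Loads the entries, evolving and rewriting the file only if it is old-format. */
  List<String> load(Path file) throws IOException {
    List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
    if (lines.isEmpty() || !lines.get(0).equals(CURRENT_HEADER)) {
      lines = evolve(lines);
      // Write the evolved form back so the next load skips evolve().
      Files.write(file, lines, StandardCharsets.UTF_8);
    }
    return lines.subList(1, lines.size());
  }
}
```

Loading the same old-format file twice through this class runs evolve() only on the first call. The trade-off is that the original file is overwritten, so a real version would probably want a backup or a separate cache location.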
The design pattern of doing telemetry computations column-wise rather than
row-wise is very insightful. I'd never thought of that issue before. Nice
thinking!
Cheers,
Philip
--On Wednesday, March 15, 2006 1:32 PM -1000 "(Cedric) Qin ZHANG"
<[EMAIL PROTECTED]> wrote:
Hi,
I have been working on improving Hackystat performance over the past week.
http://hackydev.ics.hawaii.edu:8080/browse/HACK-595
The total data we have in compressed zip format is around 220MB.
Considering how redundant our XML data is, I would estimate that
it could fit in a normalized database of less than 100MB. There should
be no problem for a server with 2 Xeon CPUs and 2GB of RAM to process this
amount of data. But...
Following are my profiling-guided findings:
(1) When sensor data is cached in memory, almost 100% of the time is spent
determining whether a file name or a class name belongs to a project.
Determining this for a class name is especially time consuming, because it
involves a loop over all the file names associated with that class name.
(2) Workspace methods are called hundreds of thousands of times. Any minor
improvement will have a huge impact on performance.
(3) When sensor data is not cached in memory, using the FileMetric data
computation as an example, roughly half of the time is spent creating
SensorData instances (the cost of loading the XML data is negligible; more
than 80% of this time is spent in the call recognizeData()), and the other
half is spent determining whether a file belongs to a project.
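
To illustrate why the class-name case in (1) costs more than the file-name case, here is a minimal sketch in Java. Everything in it is hypothetical: MembershipCostSketch, the prefix test standing in for the real membership logic, and the class-to-files map are assumptions for illustration, not the actual Workspace or class mapper code.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the real Workspace or class mapper code. */
class MembershipCostSketch {
  /** Number of file-level membership checks performed so far. */
  int fileChecks = 0;

  private final String projectRoot;
  private final Map<String, List<String>> classToFiles;

  MembershipCostSketch(String projectRoot, Map<String, List<String>> classToFiles) {
    this.projectRoot = projectRoot;
    this.classToFiles = classToFiles;
  }

  /** One check per file name (a simple prefix test stands in for the real logic). */
  boolean fileBelongs(String fileName) {
    fileChecks++;
    return fileName.startsWith(projectRoot);
  }

  /** A class name costs one file check per mapped file name until a match is found. */
  boolean classBelongs(String className) {
    List<String> files =
        classToFiles.getOrDefault(className, Collections.<String>emptyList());
    for (String fileName : files) {
      if (fileBelongs(fileName)) {
        return true;
      }
    }
    return false;
  }
}
```

In this sketch a file name always costs one check, while a class name mapped to N file names can cost up to N checks. Multiplied by hundreds of thousands of calls, that difference is what shows up in the profile.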
I have made changes to some badly-behaved DailyProjectObjects, and
Hongbing has improved the Workspace.isSubWorkspace() code. We should be OK
for a while.
I am going to close HACK-595. Following are my recommendations for future
releases (starting with the lowest-hanging fruit):
(1) Get rid of the Java class mapper. Modify the Emma, UnitTest, and JDepend
sensors to send file names instead of class names.
(2) Workspace code performance review. Extract every bit of performance
we can get.
(3) Performance review of the Sdt code, especially the part related to
evolutionary sensor data, starting with the type for which we have the most
data (e.g. FileMetric).
(4) Make sure that future implementations of DailyProjectObject do not
hold references to SensorData. DailyProjectObject should also have a small
footprint (e.g. it should not hold references to thousands of String objects).
(5) For reducers that return multiple telemetry streams, change them to
do computation column-wise instead of row-wise. For example, a reducer that
computes size telemetry for each language should finish all
computations for the first day before moving to the next day. It should
not compute Java size telemetry first, and then compute C size telemetry,
and then Perl telemetry, etc. This is to reduce the chance that raw
SensorData needs to be reloaded from disk.
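
As a rough illustration of (5), here is a minimal sketch in Java contrasting the two loop orders. The names (DayMajorSizeReducer, DayLoader, CountingLoader) are hypothetical and not part of the Hackystat reducer API. The sketch also assumes the worst case, in which nothing stays cached between loads, so every call to load() stands for a reload from disk.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; names are illustrative, not Hackystat's reducer API. */
class DayMajorSizeReducer {

  /** Stand-in for an expensive load of one day's raw data: language -> size. */
  interface DayLoader {
    Map<String, Integer> load(int day);
  }

  /** Wraps a loader and counts how often raw data is fetched. */
  static class CountingLoader implements DayLoader {
    private final DayLoader delegate;
    int loads = 0;

    CountingLoader(DayLoader delegate) {
      this.delegate = delegate;
    }

    public Map<String, Integer> load(int day) {
      loads++;
      return delegate.load(day);
    }
  }

  /** Recommended order: load each day once, then update every language's stream. */
  static Map<String, List<Integer>> perDay(DayLoader loader, List<String> languages, int days) {
    Map<String, List<Integer>> streams = new LinkedHashMap<>();
    for (String lang : languages) {
      streams.put(lang, new ArrayList<>());
    }
    for (int day = 0; day < days; day++) {
      Map<String, Integer> sizes = loader.load(day);
      for (String lang : languages) {
        streams.get(lang).add(sizes.getOrDefault(lang, 0));
      }
    }
    return streams;
  }

  /** Discouraged order: finish one language's stream before starting the next. */
  static Map<String, List<Integer>> perStream(DayLoader loader, List<String> languages, int days) {
    Map<String, List<Integer>> streams = new LinkedHashMap<>();
    for (String lang : languages) {
      List<Integer> stream = new ArrayList<>();
      for (int day = 0; day < days; day++) {
        // Each language walks over all days again, so each day is loaded again.
        stream.add(loader.load(day).getOrDefault(lang, 0));
      }
      streams.put(lang, stream);
    }
    return streams;
  }
}
```

Both methods return the same streams. Under the no-cache assumption, perDay() loads each day once, while perStream() loads each day once per language.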
Cheers,
Cedric