haha, yeah, you misunderstood me too. i was totally off subject; i was just saying that i tried a database with hackystat. it was an interesting task that i no longer have time for.
thanks, aaron

----- Original Message -----
From: "(Cedric) Qin ZHANG" <[EMAIL PROTECTED]>
Date: Wednesday, March 15, 2006 3:39 pm
Subject: Re: [HACKYSTAT-DEV-L] Summary report on Hackystat performance analysis
To: Aaron Akihisa Kagawa <[EMAIL PROTECTED]>
Cc: [email protected]

> Hi, Aaron,
>
> You misunderstood me. My point is that the size of our data is not that
> large, and our server should be more than enough to handle it. I am not
> suggesting that we migrate to a DB, because the cost of reading an XML
> file from disk and parsing it is almost negligible.
>
> Cheers,
>
> Cedric
>
> Aaron Akihisa Kagawa wrote:
> >> that our data can fit in a normalized database of less than 100MB.
> >
> > I'm not sure if I shared this or not, but I did spend a little while
> > getting Hackystat to work with Berkeley DB XML. Basically, I mapped
> > our XML directly into the database, so it is not normalized at all. I
> > thought it would increase performance a little, but I don't think it
> > did.
> >
> > It would take only about a week or so of effort to fully integrate
> > Berkeley DB XML with Hackystat. So, if anyone is interested in trying
> > it out, let me know.
> >
> > thanks, aaron
> >
> > ----- Original Message -----
> > From: "(Cedric) Qin ZHANG" <[EMAIL PROTECTED]>
> > Date: Wednesday, March 15, 2006 1:32 pm
> > Subject: [HACKYSTAT-DEV-L] Summary report on Hackystat performance analysis
> > To: [email protected]
> >
> >> Hi,
> >>
> >> I was working on improving Hackystat performance in the past week.
> >> http://hackydev.ics.hawaii.edu:8080/browse/HACK-595
> >>
> >> The total data we have, in compressed zip format, is around 220MB.
> >> Considering how redundant our XML data is, I can almost say that our
> >> data could fit in a normalized database of less than 100MB. There
> >> should be no problem for a server with 2 Xeon CPUs and 2GB of RAM to
> >> process this amount of data. But...
> >>
> >> Following are my profiling-guided findings:
> >>
> >> (1) When sensor data is cached in memory, almost 100% of the time is
> >> spent determining whether a file name or a class name belongs to a
> >> project. Determining this for a class name is especially time
> >> consuming, because it involves a loop over all the file names
> >> associated with that class name.
> >>
> >> (2) The Workspace method is called hundreds of thousands of times.
> >> Any minor improvement will have a huge impact on performance.
> >>
> >> (3) When sensor data is not cached in memory, using FileMetric data
> >> computation as an example, roughly half of the time is spent creating
> >> SensorData instances (the cost of loading the XML data is negligible;
> >> more than 80% of the time is spent in the call to recognizeData()),
> >> and the other half is spent determining whether a file belongs to a
> >> project.
> >>
> >> I have made changes to some badly-behaved DailyProjectObjects, and
> >> Hongbing has improved the Workspace.isSubWorkspace() code. We should
> >> be OK for a while.
> >>
> >> I am going to close HACK-595. Following are my recommendations for
> >> future releases (starting from the lowest-hanging fruit):
> >>
> >> (1) Get rid of the Java class mapper. Modify the Emma, UnitTest, and
> >> JDepend sensors to send file names instead of class names.
> >>
> >> (2) Review Workspace code performance. Extract every tiny bit of
> >> performance we can get.
> >>
> >> (3) Review the performance of the Sdt code, especially the part
> >> related to evolutionary sensor data, starting from the type we have
> >> the most data for (e.g. FileMetric).
> >>
> >> (4) Make sure that future implementations of DailyProjectObject do
> >> not hold references to SensorData. DailyProjectObject should also
> >> have a small footprint (e.g. not hold references to thousands of
> >> String objects).
> >> (5) For reducers that return multiple telemetry streams, change them
> >> to do the computation column-wise instead of row-wise. For example, a
> >> reducer that computes size telemetry for each language should finish
> >> all computations for the first day before moving on to the next day.
> >> It should not compute the Java size telemetry first, then the C size
> >> telemetry, then the Perl telemetry, and so on. This reduces the
> >> chance that raw SensorData needs to be reloaded from disk.
> >>
> >> Cheers,
> >>
> >> Cedric
> >>
> >
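
The thread mentions that Hongbing improved Workspace.isSubWorkspace() but doesn't show the change. As a rough illustration of why this check matters on a hot path called hundreds of thousands of times, here is a minimal sketch of a prefix-based membership test. Only the class name Workspace and the method name isSubWorkspace() come from the thread; the path-string representation and the contains() helper are assumptions for the sketch.

```java
// Hypothetical sketch of a prefix-based workspace membership check.
// Only the names Workspace and isSubWorkspace() appear in the thread;
// representing a workspace as a normalized path string is an assumption.
public class Workspace {
  private final String path;

  public Workspace(String path) {
    // Normalize once at construction so each hot-path check is a
    // single startsWith() call instead of repeated string handling.
    this.path = path.endsWith("/") ? path : path + "/";
  }

  /** True if the other workspace lies under this one. */
  public boolean isSubWorkspace(Workspace other) {
    return other.path.startsWith(this.path);
  }

  /** True if the given file path falls under this workspace. */
  public boolean contains(String filePath) {
    return filePath.startsWith(this.path);
  }

  public static void main(String[] args) {
    Workspace src = new Workspace("/project/src");
    System.out.println(src.contains("/project/src/Main.java"));            // true
    System.out.println(src.isSubWorkspace(new Workspace("/project/src/util"))); // true
    System.out.println(src.contains("/other/Main.java"));                  // false
  }
}
```

Normalizing the trailing slash up front also avoids false positives such as "/project/srcfoo" matching "/project/src".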
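
Recommendation (5) above can be sketched as follows: with the day loop on the outside and the language loop on the inside, each day's raw sensor data is loaded at most once, instead of once per (day, language) pair. All class and method names here are hypothetical stand-ins, not Hackystat's actual API; the load counter stands in for the expensive XML parse and recognizeData() call.

```java
import java.util.*;

// Sketch of column-wise telemetry reduction: finish all languages for
// one day before moving to the next day, so raw per-day data is loaded
// from disk once per day. Names are hypothetical, not Hackystat's API.
public class ColumnWiseReducer {
  /** Counts simulated disk loads; stands in for XML parsing cost. */
  static int loads = 0;

  /** Simulated per-day raw data: language -> size metric. */
  static Map<String, Integer> loadDayData(String day) {
    loads++;
    return Map.of("java", 100, "c", 50, "perl", 10);
  }

  /** Returns one telemetry stream per language (one value per day). */
  static Map<String, List<Integer>> reduce(List<String> days, List<String> langs) {
    Map<String, List<Integer>> streams = new HashMap<>();
    for (String lang : langs) streams.put(lang, new ArrayList<>());
    for (String day : days) {                       // outer loop: days
      Map<String, Integer> data = loadDayData(day); // loaded once per day
      for (String lang : langs) {                   // inner loop: languages
        streams.get(lang).add(data.getOrDefault(lang, 0));
      }
    }
    return streams;
  }

  public static void main(String[] args) {
    List<String> days = List.of("2006-03-13", "2006-03-14", "2006-03-15");
    reduce(days, List.of("java", "c", "perl"));
    // 3 days, 3 languages: 3 loads rather than 9.
    System.out.println(loads);
  }
}
```

A row-wise reducer would invert the loops (language outside, day inside) and, without a large cache, reload each day's data once per language.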
