haha. yeah. you misunderstood me too.  i was totally off subject. 

i was just saying that i tried a database with hackystat. it was an interesting 
task that i have no time for any more. 

thanks, aaron

----- Original Message -----
From: "(Cedric) Qin ZHANG" <[EMAIL PROTECTED]>
Date: Wednesday, March 15, 2006 3:39 pm
Subject: Re: [HACKYSTAT-DEV-L] Summary report on Hackystat performance analysis
To: Aaron Akihisa Kagawa <[EMAIL PROTECTED]>
Cc: [email protected]

> Hi, Aaron,
> 
> You misunderstood me. My point is that the size of our data is not
> that large, and our server should be more than enough to handle it. I
> am not suggesting that we migrate to a DB, because the cost of reading
> an xml file from disk and parsing it is almost negligible.
> 
> Cheers,
> 
> Cedric
> 
> Aaron Akihisa Kagawa wrote:
> >> that our data can fit in a normalized database less than 100MB. 
> >>     
> >
> > I'm not sure if I shared this or not, but I did spend a little while
> > getting Hackystat to work with Berkeley DB XML. Basically, I mapped
> > our XML directly into the database, so it is not normalized at all.
> > I thought it would improve performance a little, but I don't think
> > it did.
> >
> > It would take only about a week of effort to fully integrate
> > Berkeley DB XML with Hackystat. So, if anyone is interested in
> > trying it out, let me know.
> >
> > thanks, aaron
> >
> > ----- Original Message -----
> > From: "(Cedric) Qin ZHANG" <[EMAIL PROTECTED]>
> > Date: Wednesday, March 15, 2006 1:32 pm
> > Subject: [HACKYSTAT-DEV-L] Summary report on Hackystat performance analysis
> > To: [email protected]
> >
> >   
> >> Hi,
> >>
> >> I have been working on improving Hackystat performance over the past week.
> >>  http://hackydev.ics.hawaii.edu:8080/browse/HACK-595
> >>
> >> The total data we have in compressed zip format is around 220MB.
> >> Considering how redundant our xml data is, I can almost say that our
> >> data would fit in a normalized database of less than 100MB. There
> >> should be no problem for a server with 2 Xeon CPUs and 2GB of RAM to
> >> process this amount of data. But...
> >>
> >> Following are my profiling-guided findings:
> >>
> >> (1) When sensor data is cached in memory, almost 100% of the time
> >> is spent determining whether a file name or a class name belongs to
> >> a project. Determining membership for a class name is especially
> >> time consuming, because it involves a loop over all the file names
> >> associated with that class name.
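
Finding (1) above points at a lookup that is recomputed on every query. Here is a minimal, hypothetical sketch (ClassNameIndex and its methods are my own inventions, not the real Hackystat API): build a map from class name to file name once, so each class-name membership check becomes an O(1) lookup instead of a loop over all file names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: an index built once over all known file names,
 * so each class-name lookup avoids scanning the whole file list.
 */
public class ClassNameIndex {
  private final Map<String, String> classToFile = new HashMap<>();

  /** Build the index once from the known file names. */
  public ClassNameIndex(List<String> fileNames) {
    for (String file : fileNames) {
      // Derive a class name from a path, e.g. "src/org/foo/Bar.java" -> "org.foo.Bar".
      if (file.endsWith(".java")) {
        String cls = file.substring(file.indexOf('/') + 1, file.length() - ".java".length())
                         .replace('/', '.');
        classToFile.put(cls, file);
      }
    }
  }

  /** O(1) amortized lookup instead of a loop per query; null if unknown. */
  public String fileFor(String className) {
    return classToFile.get(className);
  }

  public static void main(String[] args) {
    ClassNameIndex idx =
        new ClassNameIndex(List.of("src/org/foo/Bar.java", "src/org/foo/Baz.java"));
    System.out.println(idx.fileFor("org.foo.Bar"));
  }
}
```

The trade-off is a one-time build cost and a small memory overhead, which is cheap relative to hundreds of thousands of lookups.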
> >> (2) The Workspace method is called hundreds of thousands of times.
> >> Any minor improvement will have a huge impact on performance.
> >>
> >> (3) When sensor data is not cached in memory, using FileMetric data
> >> computation as an example, roughly half of the time is spent
> >> creating SensorData instances (the cost of loading the xml data is
> >> negligible; more than 80% of the time is spent in the call to
> >> recognizeData()), and the other half is spent determining whether a
> >> file belongs to a project.
> >>
> >> I have made changes to some badly-behaved DailyProjectObjects, and
> >> Hongbing has improved the Workspace.isSubWorkspace() code. We
> >> should be OK for a while.
> >>
> >> I am going to close HACK-595. Following are my recommendations for
> >> future releases (starting with the lowest-hanging fruit):
> >>
> >> (1) Get rid of the Java class mapper. Modify the Emma, UnitTest,
> >> and JDepend sensors to send file names instead of class names.
> >>
> >> (2) Review Workspace code performance. Extract every tiny bit of
> >> performance we can get.
> >>
> >> (3) Review the performance of the Sdt code, especially the part
> >> related to evolutionary sensor data, starting from the type for
> >> which we have the most data (e.g. FileMetric).
> >>
> >> (4) Make sure that future implementations of DailyProjectObject do
> >> not hold references to SensorData. DailyProjectObject should also
> >> have a small footprint (e.g., it should not hold references to
> >> thousands of String objects).
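
Recommendation (4) can be sketched roughly as follows (class and field names are assumptions, not actual Hackystat code): aggregate into primitives at construction time, so the long-lived object retains no references to SensorData or to per-entry Strings once it is built.

```java
import java.util.List;

/**
 * Hypothetical sketch: a daily summary object that keeps only primitive
 * aggregates, letting the raw per-entry data be garbage collected.
 */
public class DailyFileMetric {
  private final int fileCount;   // primitives only, tiny footprint
  private final long totalSize;

  /** Aggregate at construction; each entry is a one-element array {size}. */
  public DailyFileMetric(List<long[]> sensorData) {
    int count = 0;
    long total = 0;
    for (long[] entry : sensorData) {
      count++;
      total += entry[0];
    }
    // No reference to sensorData survives past this constructor.
    this.fileCount = count;
    this.totalSize = total;
  }

  public int getFileCount() { return fileCount; }
  public long getTotalSize() { return totalSize; }

  public static void main(String[] args) {
    DailyFileMetric m = new DailyFileMetric(List.of(new long[]{10}, new long[]{20}));
    System.out.println(m.getFileCount() + " " + m.getTotalSize());
  }
}
```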
> >> (5) For reducers that return multiple telemetry streams, change
> >> them to do the computation column-wise instead of row-wise. For
> >> example, a reducer that computes size telemetry for each language
> >> should finish all computations for the first day before moving on
> >> to the next day. It should not compute the java size telemetry
> >> first, then the c size telemetry, then the perl telemetry, and so
> >> on. This reduces the chance that raw SensorData needs to be
> >> reloaded from disk.
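
The column-wise ordering in recommendation (5) can be sketched as follows (all names are hypothetical, and the expensive disk load is simulated with a counter): the outer loop walks days and the inner loop walks languages, so each day's raw data is loaded at most once no matter how many streams are produced.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: a multi-stream reducer ordered column-wise.
 * Day is the outer loop, language the inner loop, so each day's raw
 * sensor data is fetched once and shared by every telemetry stream.
 */
public class ColumnWiseReducer {
  static int loads = 0; // counts simulated disk loads

  /** Pretend to load one day's sensor data from disk (expensive in reality). */
  static Map<String, Integer> loadDay(int day) {
    loads++;
    Map<String, Integer> sizes = new LinkedHashMap<>();
    sizes.put("java", 100 + day);
    sizes.put("c", 50 + day);
    return sizes;
  }

  public static void main(String[] args) {
    List<String> languages = List.of("java", "c");
    Map<String, List<Integer>> streams = new LinkedHashMap<>();
    languages.forEach(l -> streams.put(l, new ArrayList<>()));

    // Column-wise: finish every language for day d before moving to day d+1.
    for (int day = 0; day < 3; day++) {
      Map<String, Integer> sizes = loadDay(day); // one load serves all streams
      for (String lang : languages) {
        streams.get(lang).add(sizes.getOrDefault(lang, 0));
      }
    }
    // Three days of data means exactly three loads, not (days x languages).
    System.out.println(streams + " loads=" + loads);
  }
}
```

A row-wise version would call loadDay() once per (language, day) pair, doubling the loads here and growing linearly with the number of streams.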
> >>
> >> Cheers,
> >>
> >> Cedric
> >>
> >>     
> 
> 
