On Fri, Sep 21, 2012 at 08:56:34AM +0100, Simon Wistow wrote:
> On Thu, Sep 20, 2012 at 12:35:18PM +0100, Nicholas Clark said:
> > Lots of "one trick pony" type benchmarks exist, but very few that actually
> > try to look like they are doing typical things typical programs do, at the
> > typical scales real programs work out, so
>
> As a search engineer (recovering) I'm inclined to say - get a corpus of
> docs, build an inverted index out of it and then do some searches. This
> will test
>
> 1) File/IO Performance (Reading in the corpus)
> 2) Text manipulation (Tokenizing, Stop word removal, Stemming)
> 3) Data structure performance (Building the index)
> 4) Maths Calculation (performing TF/IDF searches)
>
> All in pretty good, discrete steps. Plus by tweaking the size of the
> corpus you can stress memory as well.

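For reference, the four steps quoted above can be sketched roughly as follows. This is only an illustrative outline in Python, not Perl benchmark code and not anything proposed in the thread. The stop-word list, the suffix-stripping `crude_stem`, the function names, the in-memory corpus and the plain `tf * log(N / df)` weighting are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "in", "to", "is"}


def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Step 2: lowercase, split on non-letters, drop stop words, stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Step 3: inverted index mapping term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, n_docs, query):
    """Step 4: rank documents by summed tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Step 1 would normally read files from disk; a small in-memory corpus
# keeps this sketch self-contained.
corpus = {
    "d1": "The cat sat on the mat",
    "d2": "Dogs and cats make good pets",
    "d3": "Benchmarking the interpreter",
}
index = build_index(corpus)
results = search(index, len(corpus), "cats")
```

Each function corresponds to one of the numbered steps, so a benchmark built along these lines could time them separately, and growing the corpus would stress memory as suggested.
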
Thanks, this is a useful suggestion, but...

I'm not a search engineer (recovering or otherwise), so this represents
rather more work than I wanted to do. I would first have to learn enough
of how to *be* a search engineer to figure out how to write the above code
to do something useful, *then* how to turn such code into a reasonably
performant production version, and then how to turn that working code into
something sufficiently standalone to be a benchmark.

I don't want to be spending my time figuring out the right way to do all the
above algorithms in Perl. I want to get as fast as possible to the point of
figuring out how the perl interpreter (mis)behaves when presented with extant
decent code to do the above.

Unless there's a CPAN-in-a-box for doing most of the four steps (one which
doesn't depend on external C libraries; that was one of my "preferably"
criteria).

So, next question - if I wanted to be as lazy as possible and write a search
engine (as described above) using as much of CPAN as possible, which modules
are recommended? :-)

Nicholas Clark