On 21/09/2012, at 19:22, Nicholas Clark <n...@ccl4.org> wrote:

> On Fri, Sep 21, 2012 at 08:56:34AM +0100, Simon Wistow wrote:
>> On Thu, Sep 20, 2012 at 12:35:18PM +0100, Nicholas Clark said:
>>> Lots of "one trick pony" type benchmarks exist, but very few that actually
>>> try to look like they are doing typical things typical programs do, at the
>>> typical scales real programs work out, so
>>
>> As a search engineer (recovering) I'm inclined to say - get a corpus of
>> docs, build an inverted index out of it and then do some searches. This
>> will test
>>
>> 1) File/IO Performance (Reading in the corpus)
>> 2) Text manipulation (Tokenizing, Stop word removal, Stemming)
>> 3) Data structure performance (Building the index)
>> 4) Maths Calculation (performing TF/IDF searches)
>>
>> All in pretty good, discrete steps. Plus by tweaking the size of the
>> corpus you can stress memory as well.
>
> Thanks, this is a useful suggestion, but...
>
> I'm not a search engineer (recovering or otherwise), so this represents
> rather more work that I wanted to do. In that I first have to learn enough
> of how to *be* a search engineer to figure out how to write the above code
> to do something useful, and *then* how to write such code to a reasonably
> performant production versions, and then to turn working code into something
> sufficiently stand alone to be a benchmark.
>
> I don't want to be spending my time figuring out the right way to do all the
> above algorithms in Perl. I want to get as fast as possible to the point of
> figuring out how the perl interpreter (mis)behaves when presented with
> extant decent code to do the above.
>
> Unless there's a CPAN-in-a-box for doing most of the four steps.
> (which doesn't depend on external C libraries. That was one of my
> "preferably" criteria)
>
> So, next question - if I wanted to be as lazy as possible and write a search
> engine (as described above) using as much of CPAN as possible, which modules
> are recommended? :-)
>

I think you want Plucene. But please let someone else correct me if I'm wrong.

> Nicholas Clark
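
For anyone who wants to see the shape of the four steps quoted above before reaching for a module, here is a rough, language-neutral sketch. It is written in Python purely for brevity and is not how Plucene or any CPAN module does it. Everything in it is an assumption made for illustration: the stop word list, the crude `stem` suffix stripper (a stand-in for a real stemmer), the in-memory `docs` dict standing in for reading a corpus from disk, and the `log(1 + N/df)` weighting, which is only one of several common TF/IDF variants.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop word list only; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on"}


def stem(token):
    """Very crude suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Step 2: lowercase, split into words, drop stop words, stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Step 3: inverted index mapping term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, n_docs, query):
    """Step 4: rank documents by summed tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Assumed weighting: log(1 + N/df), so common terms still count a little.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Step 1 would read these from files; a tiny in-memory corpus stands in here.
docs = {
    "d1": "Perl benchmarks measure the interpreter",
    "d2": "Searching an inverted index of documents",
    "d3": "The interpreter runs benchmarks and searches",
}
index = build_index(docs)
results = search(index, len(docs), "searching benchmarks")
print(results)
```

Growing `docs` is the knob for stressing memory, and timing `tokenize`, `build_index` and `search` separately gives the discrete steps mentioned in the quoted suggestion.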