Re: Testing Precision and Recall on Lucene

Murat Yakici Thu, 15 Jan 2009 05:44:13 -0800

I am not aware of any open source LSA framework out there. If you are
interested in PLSA, Lemur has got an implementation.

In the "simplest" sense, Lucene uses a type of TFIDF scoring mechanism.
If you are not really concerned with Lucene's particular implementation,
then just use Lemur for your research purposes. I think Lemur is a better
choice for research than Lucene. It has lots of other IR model
implementations that you can benchmark against. On the other hand, Lucene
is an excellent choice for query representations, fielded searching, etc.
And in my humble experience, Lucene is much faster than Lemur (both as
multi-threaded search apps on 80 cores with 256G of memory, tested for
TFIDF only). Or, if you think a plain TFIDF implementation is OK for you,
maybe just search some conference papers (SIGIR, ECIR, etc.) with your
favourite search engine; you may find a paper that has already compared
TFIDF vs. LSA vs. BM25, etc. (I am sure there is such a paper, but I can't
remember the title.) At least in the Terabyte Track at TREC, a group ran
some benchmarks that included Lucene a couple of years ago.
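
To make "plain TFIDF" concrete, here is a minimal sketch of a textbook
TF-IDF score (raw term frequency times log inverse document frequency).
It is NOT Lucene's actual scoring formula, which adds normalisation and
boosts; the function name and the toy documents are made up for
illustration.

```python
import math
from collections import Counter

def tfidf_scores(query, docs):
    """Score each document against the query with a plain textbook TF-IDF sum.

    Assumption: whitespace tokenisation and idf = log(N / df); this is a
    teaching variant, not the formula used by Lucene or Lemur.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:
                score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores

# Hypothetical toy collection.
docs = [
    "lucene is a search library",
    "lemur is a research toolkit for retrieval",
    "latent semantic analysis uses a matrix decomposition",
]
scores = tfidf_scores("lucene search", docs)
# Document indices ordered from highest to lowest score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

Swapping the weighting inside the loop (for example for a BM25-style
weight) changes the ranking without changing the query, which is why
the scoring function has to be fixed when comparing systems.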

For the metrics, if you make sure the output from your query runs is in
TREC format, you can again use Lemur's evaluation application to get a
variety of metrics (including standard precision and recall). Lucene also
has a benchmark contribution, but I have never worked with it.
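
As a rough illustration of the above, the sketch below writes a ranked
list in the usual TREC run layout (query id, the literal "Q0", document
id, rank, score, run tag) and computes set-based precision and recall
at a cutoff. The helper names, the run tag "myrun" and the document ids
are hypothetical; for real experiments an evaluation tool is the safer
route.

```python
def format_trec_run(query_id, ranked, run_tag="myrun"):
    """Turn (doc_id, score) pairs, best first, into TREC run lines."""
    return [
        f"{query_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}"
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]

def precision_recall(run_lines, relevant, cutoff):
    """Precision and recall over the top `cutoff` lines of one query's run."""
    # The document id is the third whitespace-separated field of a run line.
    retrieved = [line.split()[2] for line in run_lines[:cutoff]]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranked output for one query and its relevance judgements.
ranked = [("doc3", 2.5), ("doc7", 1.9), ("doc1", 1.2), ("doc9", 0.4)]
run_lines = format_trec_run("q1", ranked)
relevant = {"doc3", "doc1", "doc5"}

p, r = precision_recall(run_lines, relevant, cutoff=3)
print("\n".join(run_lines))
print(f"P@3 = {p:.3f}, R@3 = {r:.3f}")
```

Both numbers only make sense if every system is run against the same
collection and the same relevance judgements.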

LSA is a bit demanding, isn't it? You have to construct a term/document
matrix and then decompose it (SVD). This could be a matrix larger than
1M x 700K. Are there any matrix decomposition tools that support this size?
Apache Mahout, maybe. I don't know how far that project has progressed with
its matrix implementation. You may want to check out their mailing list.
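
To give a feel for the size, here is a back-of-the-envelope estimate.
It assumes 1M terms by 700K documents, 8-byte values, 4-byte indices,
and an average of 200 distinct terms per document; that last figure and
the row/column orientation are pure assumptions for illustration.

```python
def dense_bytes(rows, cols, bytes_per_value=8):
    """Memory for a dense matrix that stores every cell."""
    return rows * cols * bytes_per_value

def sparse_bytes(nonzeros, bytes_per_value=8, bytes_per_index=4):
    """Memory for a coordinate-style sparse layout:
    one value plus a row index and a column index per nonzero cell."""
    return nonzeros * (bytes_per_value + 2 * bytes_per_index)

terms, docs = 1_000_000, 700_000
avg_distinct_terms_per_doc = 200  # hypothetical, not a measured figure

dense_tb = dense_bytes(terms, docs) / 1e12
sparse_gb = sparse_bytes(docs * avg_distinct_terms_per_doc) / 1e9

print(f"dense:  {dense_tb:.1f} TB")
print(f"sparse: {sparse_gb:.2f} GB")
```

Under these assumptions the dense matrix is far too large to hold in
memory, while the sparse one is manageable, so any tool for this job
would need to work on sparse input and compute only the leading
singular vectors rather than a full SVD.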

Murat

> Dear All,
> Thanks for your feedback.
> I want to do research on how Lucene performs compared to Latent Semantic
> Analysis in terms of recall and precision.
> I welcome ideas on this. Does anyone know of a software tool using latent
> semantic analysis that I could download and try? At the moment I am
> doing some simulations using MATLAB.
> Thank you
>
> David
>
> --- On Thu, 15/1/09, Murat Yakici <[email protected]> wrote:
> From: Murat Yakici <[email protected]>
> Subject: Re: Testing Precision and Recall on Lucene
> To: [email protected]
> Date: Thursday, 15 January, 2009, 12:17 PM
>
> Let's please not forget the scoring function. Yes, the *query* is
> important; however, everyone in IR knows that two different scoring
> functions may return two different sets of results for the same query!
>
> David, I think you have to be more explicit here. What exactly are you
> trying to do? Are you going to benchmark Lucene's performance across
> similar queries (as Donna suggests) or across different frameworks? To use
> metrics like precision and recall, you also need to run the tests on the
> same collection.
>
> Murat
>
>> I don't think this question makes a whole lot of sense in isolation--
>> precision and recall is all about the *query* and that is the art of the
>> developer; what is the appropriate query for your particular
>> application.
>> Lucene does just great telling you which documents had which terms and
>> which term combinations, and does a very good job of scoring them and
>> ranking them once you give a query. But *you* have to decide what the
>> appropriate query is for your application. The simplest approach, and
>> perhaps what you are looking for, is the MoreLikeThis query generator,
>> which uses (pretty good) heuristics for telling you which documents are
>> "like" a given document, but I don't know if that's the direction you are
>> looking for.
>>
>>
>> david muchangi <[email protected]> wrote on 01/14/2009 02:00:33 PM:
>>
>>> Dear All,
>>> I wish to have a quick test on how Lucene performs in terms of
>>> precision and recall. Does anyone have a small application that I can
>>> use quickly without having to program using the APIs?
>>> Thanks.
>>> David
>
>
> Murat Yakici
> Department of Computer & Information Sciences
> University of Strathclyde
> Glasgow, UK
> -------------------------------------------
> The University of Strathclyde is a charitable body, registered in
> Scotland,
> with registration number SC015263.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]

Murat Yakici
Department of Computer & Information Sciences
University of Strathclyde
Glasgow, UK
-------------------------------------------
The University of Strathclyde is a charitable body, registered in Scotland,
with registration number SC015263.
