Re: Beyond Lucene 2.0 Index Design

jian chen Thu, 11 Jan 2007 14:30:27 -0800

I have the same question. It seems very hard to do phrase-based queries
efficiently.


I think most search engines do phrase-based queries, or at least appear to.
For example, in Google, every result must contain all the words the user
searched for.

It seems to me that the impact-sorted list makes sense if you are trying
to do pure vector-space ranking. That is what I have gathered from the
research papers: they all talk about how to optimize the vector space
model using the impact-sorted list approach.

Unfortunately, the vector space model has serious drawbacks. It does not
take inter-word relations into account, so it can produce a result list
where documents matching only some of the keywords rank higher than
documents matching all of them.
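
To make that drawback concrete, here is a toy sketch in Python. The scoring
function (square root of term frequency times idf, summed over the query
terms) and the idf values are assumptions picked for illustration; this is
not Lucene's scoring formula.

```python
import math
from collections import Counter


def toy_score(query, doc_tokens, idf):
    """Sum of sqrt(tf) * idf over the query terms (hypothetical weighting)."""
    tf = Counter(doc_tokens)
    return sum(math.sqrt(tf[term]) * idf.get(term, 0.0) for term in query)


# Hypothetical idf values: "york" is assumed to be the rarer term.
idf = {"new": 1.0, "york": 2.0}
query = ["new", "york"]

partial_match = ["york"] * 9      # contains only one of the two query terms
full_match = ["new", "york"]      # contains both query terms

print(toy_score(query, partial_match, idf))  # 6.0
print(toy_score(query, full_match, idf))     # 3.0
```

Under these assumed weights the document that repeats a single rare term
outscores the document that contains the whole query.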

I have yet to see whether the impact-sorted list approach can handle this
efficiently.

Cheers,

Jian

On 1/11/07, Marvin Humphrey <[EMAIL PROTECTED]> wrote:

On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote:
> e. <impact, num_docs, (doc1,...docN)>
> f. <impact, num_docs, ([doc1, freq ,<positions>],...[docN, freq
> ,<positions>])>
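
For illustration, format (e) could be built from per-document impacts
roughly as below. This is only a sketch: the function name, the integer
impact values, and the sample data are assumptions, not part of the
proposal. It also shows why doc numbers no longer come out in ascending
order once the list is read front to back.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def impact_sorted(doc_impacts: Dict[int, int]) -> List[Tuple[int, int, List[int]]]:
    """Group doc numbers by impact, highest impact first, as in format (e)."""
    groups = defaultdict(list)
    # Visit docs in ascending order so each group stays doc-ordered internally.
    for doc, impact in sorted(doc_impacts.items()):
        groups[impact].append(doc)
    return [(impact, len(docs), docs)
            for impact, docs in sorted(groups.items(), reverse=True)]


# Hypothetical postings for one term: doc number -> quantized impact.
blocks = impact_sorted({2: 1, 5: 3, 9: 1, 11: 3, 14: 2})
print(blocks)  # [(3, 2, [5, 11]), (2, 1, [14]), (1, 2, [2, 9])]

# Reading the list front to back yields docs 5, 11, 14, 2, 9.
read_order = [doc for _, _, docs in blocks for doc in docs]
print(read_order)
```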

How do you build an efficient PhraseScorer to work with an impact-
sorted posting list?

The way PhraseScorer currently works is: find a doc that contains all
terms, then see if the terms occur consecutively in phrase order,
then determine a score.  The TermDocs objects feeding PhraseScorer
return doc_nums in ascending order, so finding an intersection is
easy.  But if the document numbers are returned in what looks to the
PhraseScorer like random order... ??
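
The doc-ordered approach described above can be sketched as follows. This
is a simplified Python illustration, not Lucene's PhraseScorer: the
in-memory dicts, the names intersect_sorted and phrase_docs, and the sample
postings are all hypothetical, and the scoring step is left out.

```python
from typing import Dict, List

# Hypothetical posting list for one term: doc number -> sorted positions.
Postings = Dict[int, List[int]]


def intersect_sorted(doc_lists: List[List[int]]) -> List[int]:
    """Walk ascending doc number lists in step; return docs present in all."""
    if not doc_lists or any(not docs for docs in doc_lists):
        return []
    cursors = [0] * len(doc_lists)
    matches = []
    while True:
        # The largest doc under any cursor is the next candidate.
        target = max(docs[c] for docs, c in zip(doc_lists, cursors))
        aligned = True
        for i, docs in enumerate(doc_lists):
            # Skip forward until this list reaches or passes the candidate.
            while cursors[i] < len(docs) and docs[cursors[i]] < target:
                cursors[i] += 1
            if cursors[i] == len(docs):
                return matches  # one list is exhausted, nothing more can match
            if docs[cursors[i]] != target:
                aligned = False
        if aligned:
            matches.append(target)
            cursors[0] += 1
            if cursors[0] == len(doc_lists[0]):
                return matches


def phrase_docs(term_postings: List[Postings]) -> List[int]:
    """Docs in which the terms occur consecutively, in phrase order."""
    doc_lists = [sorted(postings) for postings in term_postings]
    hits = []
    for doc in intersect_sorted(doc_lists):
        starts = term_postings[0][doc]
        rest = [set(postings[doc]) for postings in term_postings[1:]]
        # Term k of the phrase must sit exactly k positions after the first.
        if any(all(start + k + 1 in positions for k, positions in enumerate(rest))
               for start in starts):
            hits.append(doc)
    return hits


new = {1: [0], 3: [4], 7: [2, 9]}
york = {3: [5], 5: [1], 7: [4]}
print(phrase_docs([new, york]))  # [3]
```

The single forward pass in intersect_sorted relies on every list being in
ascending doc order, which is exactly the property an impact-sorted list
gives up.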

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
