Paul Elschot wrote:
> There is one indexing parameter that might help performance
> for BooleanScorer2: the skip interval in Lucene's TermInfosWriter.
> The current value is 16, and there was a question about it
> on 16 Oct 2005 on java-dev with the title "skipInterval".
> I don't know how the value of skipInterval was initially determined.
> It's possible that a larger value gives somewhat better query
> performance in this case.
> Changing the skip interval might require reindexing, though.
In Nutch the default is 128. And yes, changing this requires re-creating
the index (actually, it's enough to optimize it, so that the .tii file
is re-written).
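
To make the trade-off concrete, here is a toy sketch of a single-level skip
list over one term's postings. It is not Lucene's code: the class
SkipIntervalSketch, its fields, and the cost counter are made up for
illustration, and unlike a real scorer it restarts from the first posting on
every call. It only shows that a larger interval means fewer skip entries to
walk but a longer linear scan after the last useful skip.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy single-level skip list over one term's sorted doc ids (not Lucene code). */
class SkipIntervalSketch {
    private final int[] docs;                               // sorted, distinct doc ids
    private final List<Integer> skips = new ArrayList<>();  // posting indexes that have a skip entry
    int touched;                                            // entries read by the last skipTo call

    /** Assumes skipInterval > 0; one skip entry is kept per skipInterval postings. */
    SkipIntervalSketch(int[] sortedDocs, int skipInterval) {
        this.docs = sortedDocs;
        for (int i = skipInterval; i < sortedDocs.length; i += skipInterval) {
            skips.add(i);
        }
    }

    /** Returns the index of the first posting with doc id >= target, or docs.length. */
    int skipTo(int target) {
        touched = 0;
        int pos = 0;
        // Walk the skip entries until one points past the target.
        for (int skipPos : skips) {
            touched++;
            if (docs[skipPos] > target) break;
            pos = skipPos;
        }
        // Finish with a linear scan inside the last block.
        while (pos < docs.length && docs[pos] < target) {
            pos++;
            touched++;
        }
        return pos;
    }

    public static void main(String[] args) {
        int[] docs = new int[100_000];
        for (int i = 0; i < docs.length; i++) docs[i] = 3 * i;
        for (int interval : new int[] {16, 128}) {
            SkipIntervalSketch s = new SkipIntervalSketch(docs, interval);
            int pos = s.skipTo(150_000);
            System.out.println("interval=" + interval + " pos=" + pos + " touched=" + s.touched);
        }
    }
}
```

Which interval wins depends on how far apart the skipTo targets are, so this
would have to be measured on a real index rather than read off the sketch.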
> I considered a specialised scorer for the earlier query:
> +(url:term1^4.0 anchor:term1^2.0 content:term1
> title:term1^1.5 host:term1^2.0)
> +(url:term2^4.0 anchor:term2^2.0 content:term2
> title:term2^1.5 host:term2^2.0)
> url:"term1 term2"~2147483647^4.0
> anchor:"term1 term2"~4^2.0
> content:"term1 term2"~2147483647
> title:"term1 term2"~2147483647^1.5
> host:"term1 term2"~2147483647^2.0
[...]
Thank you for the detailed analysis. We are currently pursuing a totally
different approach: limiting the size of the index by carefully selecting
the most promising postings, and re-sorting the posting lists by a
"pagerank"-like value, so that we can skip postings coming from less
significant docs. Please see the nutch-dev discussion for more details.
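
A rough sketch of the idea, under one simplifying assumption that is mine
and not necessarily what Nutch will end up doing: documents are renumbered so
that doc 0 has the highest static rank. Posting lists sorted by doc id are
then also sorted by rank, and a conjunction can stop as soon as it has enough
hits. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy early-terminating intersection over rank-ordered postings (illustrative only). */
class RankOrderedPostingsSketch {

    /**
     * Intersects two posting lists whose doc ids were assigned in descending
     * order of a static rank (smaller id = more significant document), and
     * stops as soon as maxHits matches are found. Postings further down both
     * lists, which belong to less significant docs, are never read.
     */
    static List<Integer> topMatches(int[] a, int[] b, int maxHits) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length && hits.size() < maxHits) {
            if (a[i] == b[j]) {
                hits.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] term1 = {1, 4, 7, 9, 12};
        int[] term2 = {4, 5, 9, 12, 20};
        System.out.println(topMatches(term1, term2, 2));
    }
}
```

The hits come back in static-rank order, not in order of the query-dependent
score, so this trades some ranking quality for speed.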
Oh, BTW: I just found the DisjunctionMaxQuery class, which seems to have
been added recently. Do you think this query structure could benefit from
using it instead of the BooleanQuery?
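
For reference, the scoring difference as I understand it, reduced to plain
arithmetic: a BooleanQuery of optional clauses adds up the per-field scores,
while DisjunctionMaxQuery takes the best field score plus a tie-breaker
fraction of the others. The sketch below ignores coord and normalization, and
the class name and the sample scores are invented for illustration.

```java
/** Toy comparison of sum vs. max-plus-tie-breaker combination (illustrative only). */
class DisMaxScoreSketch {

    /** Sum of per-field scores, as for optional clauses of a boolean query (coord ignored). */
    static double sumScore(double[] fieldScores) {
        double sum = 0.0;
        for (double s : fieldScores) sum += s;
        return sum;
    }

    /** Best field score plus tieBreaker times the sum of the remaining field scores. */
    static double disMaxScore(double[] fieldScores, double tieBreaker) {
        if (fieldScores.length == 0) return 0.0;
        double max = fieldScores[0];
        double sum = 0.0;
        for (double s : fieldScores) {
            sum += s;
            if (s > max) max = s;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Hypothetical per-field scores for one term, e.g. url, anchor, content.
        double[] scores = {4.0, 2.0, 1.0};
        System.out.println("sum    = " + sumScore(scores));
        System.out.println("dismax = " + disMaxScore(scores, 0.0));
    }
}
```

With a tie-breaker of 0 a document matching a term in one strong field is not
outranked by one matching it weakly in many fields, which is the main reason
to consider it for the per-term clauses here.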
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com