RE: Boost more recent document

Zhang, Lisheng Thu, 01 Dec 2011 11:51:43 -0800

Currently we use lucene 2.3.2, the reason why we recreate searcher each time
is that within one server we managed a few thousand independent lucene index
data folders. Those folders have different sizes, the large ones have about
200K docs (but growing).


Thanks very much for helps, Lisheng

-----Original Message-----
From: Simon Willnauer [mailto:[email protected]]
Sent: Thursday, December 01, 2011 11:34 AM
To: Zhang, Lisheng
Cc: [email protected]
Subject: Re: Boost more recent document


On Thu, Dec 1, 2011 at 8:30 PM, Zhang, Lisheng
<[email protected]> wrote:
> Hi Simon,
>
> 1) Thanks for suggesting lucene 4.0 feature, we will make use of it as soon as
>   we upgrade lucene.
>
> 2) Currently we recreate IndexSearcher for each query, which means recreate
>   underlying IndexReader for each query (I should have said IndexReader), but
>   sort performance is OK, so I would like to try CustomScoreQuery without 
> cache
>   first?

WOW - why do you do this? Can't you use the SearcherManager in Lucene 3.5?

simon
>
> Thanks very much for helps, Lisheng
>
> -----Original Message-----
> From: Simon Willnauer [mailto:[email protected]]
> Sent: Thursday, December 01, 2011 11:21 AM
> To: Zhang, Lisheng
> Cc: [email protected]
> Subject: Re: Boost more recent document
>
>
> On Thu, Dec 1, 2011 at 7:36 AM, Zhang, Lisheng
> <[email protected]> wrote:
>> Hi Simon,
>>
>> Sorry I found that I cannot use payload for this purpose because payload
>> can be accessed only through term positions but we did not use timestamp
>> for query. Ideally it would be great if we can have some doc-level "payload"
>> accessible through docId?
>
> lucene 4 has a feature called IndexDocValues which is essentially a
> payload per document per field.
>
> you can read about it here:
> http://www.searchworkings.org/blog/-/blogs/introducing-lucene-index-doc-values
> http://www.searchworkings.org/blog/-/blogs/apache-lucene-flexiblescoring-with-indexdocvalues
> http://www.searchworkings.org/blog/-/blogs/indexdocvalues-their-applications
>>
>> Then your initial suggestion to use CustomScoreQuery would be our solution,
>> from source code I see sort is implemented by FieldCache and its performance
>> seems OK even though we didnot cache reader. So we will use CustomeScoreQuery
>> without cache for now (cutting time stamp to hour or day may help), if too
>> slow we may consider selected cache.
>
> what do you mean by cache readers?
>
> simon
>>
>> Thanks very much for all your great helps, please point out if you see wrong
>> in above statements?
>>
>> Best regards, Lisheng
>>
>> -----Original Message-----
>> From: Zhang, Lisheng [mailto:[email protected]]
>> Sent: Wednesday, November 30, 2011 1:40 PM
>> To: [email protected]; [email protected]
>> Subject: RE: Boost more recent document
>>
>>
>> Hi,
>>
>> Thanks for the very interesting idea!
>>
>> Currently we use lucene 2.3.2 and we just use default merge policy (at
>> any time we have a few segments and after some accumulation small segments
>> are merged into big ones). I need to double check if docId can reflect doc
>> age.
>>
>> But I have one concern: docId may not reflect true age interval, like docId
>> difference by 2 may reflect 2m or 1h. If no better choice I may just use
>> payload and adapt a few query classes?
>>
>> Thanks very much for helps, Lisheng
>>
>> -----Original Message-----
>> From: Simon Willnauer [mailto:[email protected]]
>> Sent: Wednesday, November 30, 2011 1:02 PM
>> To: [email protected]
>> Subject: Re: Boost more recent document
>>
>>
>> If you use LogMergePolicy ie. do merges in order you could use the
>> absolute docID as a relative age value. Smaller docIDs mean younger
>> documents. Maybe this works for you?
>>
>> simon
>>
>> On Wed, Nov 30, 2011 at 9:08 PM, Zhang, Lisheng
>> <[email protected]> wrote:
>>> Thanks very much for your helps! I got the point, only problem is that
>>> I cannot afford to to use FieldCache because in our app we have many
>>> lucene index data folders, is there another simple way?
>>>
>>> Thanks again, Lisheng
>>>
>>> -----Original Message-----
>>> From: Simon Willnauer [mailto:[email protected]]
>>> Sent: Wednesday, November 30, 2011 11:40 AM
>>> To: [email protected]
>>> Subject: Re: Boost more recent document
>>>
>>>
>>> On Wed, Nov 30, 2011 at 6:59 PM, Zhang, Lisheng
>>> <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> We need to boost document which is more recent (each doc has time stamp 
>>>> attribute). It seems that
>>>> we cannot use doc boost at index time because it will be condensed into 
>>>> one byte (cannot differentiate
>>>> 365 days), so we may use payload (save time stamp as payload) to boost at 
>>>> search time.
>>>>
>>>> In our app we let user enter query at browser and use QueryParser to 
>>>> generate query, the query can
>>>> be different types (TermQuery, BooleanQuery, WildcardQuery, ...), then it 
>>>> seems we need to create
>>>> each customized query class similar to PayloadTermQuery, is there another 
>>>> simpler way?
>>>
>>> you can simply index your timestamp (untokenzied) and wrap your query
>>> in a CustomScoreQuery. This query accepts your user query and a
>>> ValueSource. During search CustomScoreQuery calls your valuesource for
>>> each document that the user query scores and multiplies the result of
>>> the ValueSource into the score. Inside your valuesource you can simply
>>> get the timestamps from the FieldCache and calculate your custom
>>> boost...
>>>
>>> hope that helps
>>>
>>> simon
>>>>
>>>> Thanks very much for helps, Lisheng
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>

RE: Boost more recent document

Reply via email to