Hi, Uwe.
Thanks for your advice.
After implementing your suggestion, our calculation time dropped from ~20 days
to 3.5 hours.
/**
 * Streams every matching document to the DocumentFound callback instead of
 * paging through TopDocs.
 *
 * @param found callback invoked once per matching document
 */
public void iterate(SearchOptions options, final DocumentFound found,
        final Set<String> loadFields) throws Exception {
    Query query = options.getQuery();
    Filter queryFilter = options.getQueryFilter();
    final IndexSearcher indexSearcher =
            new VolumeSearcher(options).newIndexSearcher(Executors.newSingleThreadExecutor());
    indexSearcher.search(query, queryFilter, new Collector() {
        private int docBase;

        @Override
        public void setScorer(Scorer scorer) throws IOException {
            // scores are not needed when streaming all hits
        }

        @Override
        public void setNextReader(AtomicReaderContext context) throws IOException {
            // collect() receives segment-relative IDs; remember this segment's base
            docBase = context.docBase;
        }

        @Override
        public void collect(int docID) throws IOException {
            // load only the requested stored fields and hand off the document
            Document doc = indexSearcher.doc(docBase + docID, loadFields);
            found.found(doc);
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true; // hit order is irrelevant here
        }
    });
}
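For anyone reading this in the archives: the push-style pattern above, minus the Lucene specifics, can be sketched with plain JDK types. Everything below (StreamingSketch, the sample addresses) is made up for illustration; the point is just that each hit goes straight to a callback, so memory stays constant no matter how many documents match:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class StreamingSketch {

    // Analogous to Collector.collect(docID): each match is pushed to the
    // callback as it is found, never accumulated in a result list.
    static void iterate(String[] toAddresses, Consumer<String> found) {
        for (String to : toAddresses) {
            found.accept(to); // one document at a time
        }
    }

    public static void main(String[] args) {
        String[] hits = {"[email protected]", "[email protected]", "[email protected]"};
        Map<String, Integer> counts = new HashMap<>();
        // the "statistics of «to»" use case: count occurrences while streaming
        iterate(hits, to -> counts.merge(to, 1, Integer::sum));
        System.out.println(counts.get("[email protected]")); // 2
        System.out.println(counts.get("[email protected]")); // 1
    }
}
```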
> On 12 Nov 2015, at 21:15, Uwe Schindler <[email protected]> wrote:
>
> Hi,
>
>>> The big question is: Do you need the results paged at all?
>>
>> Yup, because if we return all results, we get OME.
>
> You get the OME because the paging collector cannot handle that, so this is
> an XY problem. Would it not be better if your application just got the
> results as a stream and processed them one after another? If that is the
> case (and most statistics jobs work that way), you are much better off NOT
> USING TOPDOCS! Your requirement is diametrically opposed to getting
> top-scoring documents: you want ALL results as a sequence.
>
>>> Do you need them sorted?
>>
>> Nope.
>
> OK, so unsorted streaming is the right approach.
>
>>> If not, the easiest approach is to use a custom Collector that does no
>>> sorting and just consumes the results.
>>
>> The main bottleneck, as I see it, comes from the next-page search, which
>> took ~2-4 seconds.
>
> This is because, when paging, the collector has to re-execute the whole query
> and sort all results again, just with a larger window. So if you have result
> pages of 50,000 results and you want to get the second page, it will
> internally sort 100,000 results, because the first page needs to be
> calculated, too. As you go forward through the results, the window gets
> larger and larger, until it finally collects all results.
>
> So just getting the results as a stream by implementing the Collector API is
> the right way to do this.
>
>>>
>>> Uwe
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: [email protected]
>>>
>>>> -----Original Message-----
>>>> From: Valentin Popov [mailto:[email protected]]
>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>> To: [email protected]
>>>> Subject: Re: 500 millions document for loop.
>>>>
>>>> Toke, thanks!
>>>>
>>>> We will look at this solution, looks like this is that what we need.
>>>>
>>>>
>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <[email protected]> wrote:
>>>>>
>>>>> Valentin Popov <[email protected]> wrote:
>>>>>
>>>>>> We have ~10 indexes for 500M documents; each document
>>>>>> has an «archive date» and a «to» address. One of our tasks is to
>>>>>> calculate statistics on «to» for the last year. Right now we
>>>>>> search archive_date:(current_date - 1 year) and paginate the
>>>>>> results at 50k records per page. The bottleneck of that approach is
>>>>>> that pagination takes too long: even on a powerful server it takes
>>>>>> ~20 days to execute, which is far too long.
>>>>>
>>>>> Lucene does not like deep page requests due to the way the internal
>>>>> Priority Queue works. Solr has CursorMark, which should be fairly simple
>>>>> to emulate in your Lucene handling code:
>>>>>
>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>>>
>>>>> - Toke Eskildsen
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>
>>>> Regards,
>>>> Valentin Popov
>>
>>
>> Best regards,
>> Valentin Popov
>>
Best regards,
Valentin Popov