Hi, Uwe.
Thanks for your advice.
After implementing your suggestion, our calculation time dropped from ~20 days
to 3.5 hours.
/**
 * Streams every matching document to the DocumentFound callback instead of
 * paging through TopDocs.
 *
 * @param found callback invoked once per matching document
 */
public void iterate(SearchOptions options, final DocumentFound found,
        final Set<String> loadFields) throws Exception {
    Query query = options.getQuery();
    Filter queryFilter = options.getQueryFilter();
    final IndexSearcher indexSearcher =
            new VolumeSearcher(options).newIndexSearcher(Executors.newSingleThreadExecutor());
    indexSearcher.search(query, queryFilter, new Collector() {
        private int docBase;

        @Override
        public void setScorer(Scorer scorer) throws IOException {
            // scores are not needed when streaming all hits
        }

        @Override
        public void setNextReader(AtomicReaderContext context) throws IOException {
            // collect() receives segment-relative IDs; remember this segment's base
            docBase = context.docBase;
        }

        @Override
        public void collect(int docID) throws IOException {
            // load only the requested stored fields and hand off the document
            Document doc = indexSearcher.doc(docBase + docID, loadFields);
            found.found(doc);
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true; // hit order is irrelevant here
        }
    });
}
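For anyone reading this in the archives: the push-style pattern above, minus the Lucene specifics, can be sketched with plain JDK types. Everything below (StreamingSketch, the sample addresses) is made up for illustration; the point is just that each hit goes straight to a callback, so memory stays constant no matter how many documents match:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class StreamingSketch {

    // Analogous to Collector.collect(docID): each match is pushed to the
    // callback as it is found, never accumulated in a result list.
    static void iterate(String[] toAddresses, Consumer<String> found) {
        for (String to : toAddresses) {
            found.accept(to); // one document at a time
        }
    }

    public static void main(String[] args) {
        String[] hits = {"[email protected]", "[email protected]", "[email protected]"};
        Map<String, Integer> counts = new HashMap<>();
        // the "statistics of «to»" use case: count occurrences while streaming
        iterate(hits, to -> counts.merge(to, 1, Integer::sum));
        System.out.println(counts.get("[email protected]")); // 2
        System.out.println(counts.get("[email protected]")); // 1
    }
}
```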
> On 12 Nov 2015, at 21:15, Uwe Schindler <[email protected]> wrote:
>
> Hi,
>
>>> The big question is: Do you need the results paged at all?
>>
>> Yup, because if we return all results, we get OME.
>
> You get the OME because the paging collector cannot handle that, so this is
> an XY problem. Would it not be better if your application just got the
> results as a stream and processed them one after another? If that is the
> case (and most statistics jobs work that way), you are much better off NOT
> USING TOPDOCS! Your requirement is diametrically opposed to getting
> top-scoring documents: you want ALL results as a sequence.
>
>>> Do you need them sorted?
>>
>> Nope.
>
> OK, so unsorted streaming is the right approach.
>
>>> If not, the easiest approach is to use a custom Collector that does no
>>> sorting and just consumes the results.
>>
>> The main bottleneck, as I see it, comes from the next-page search, which
>> took ~2-4 seconds.
>
> This is because, when paging, the collector has to re-execute the whole query
> and sort all results again, just with a larger window. So if you have result
> pages of 50,000 results and you want to get the second page, it will
> internally sort 100,000 results, because the first page needs to be
> calculated, too. As you go forward through the results, the window gets
> larger and larger, until it finally collects all results.
>
> So just getting the results as a stream by implementing the Collector API is
> the right way to do this.
>
>>>
>>> Uwe
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: [email protected]
>>>
>>>> -----Original Message-----
>>>> From: Valentin Popov [mailto:[email protected]]
>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>> To: [email protected]
>>>> Subject: Re: 500 millions document for loop.
>>>>
>>>> Toke, thanks!
>>>>
>>>> We will look at this solution, looks like this is that what we need.
>>>>
>>>>
>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <[email protected]> wrote:
>>>>>
>>>>> Valentin Popov <[email protected]> wrote:
>>>>>
>>>>>> We have ~10 indexes for 500M documents; each document
>>>>>> has an «archive date» and a «to» address. One of our tasks is to
>>>>>> calculate statistics on «to» for the last year. Right now we
>>>>>> search archive_date:(current_date - 1 year) and paginate the
>>>>>> results at 50k records per page. The bottleneck of that approach is
>>>>>> that pagination takes too long: even on a powerful server it takes
>>>>>> ~20 days to execute, which is far too long.
>>>>>
>>>>> Lucene does not like deep page requests due to the way the internal
>>>>> Priority Queue works. Solr has CursorMark, which should be fairly simple
>>>>> to emulate in your Lucene handling code:
>>>>>
>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>>>
>>>>> - Toke Eskildsen
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>
>>>> Regards,
>>>> Valentin Popov
>>
>>
>> Best regards,
>> Valentin Popov
>>
Best regards,
Valentin Popov