Thank you very much!
> On 14 Nov 2015, at 15:49, Uwe Schindler <[email protected]> wrote:
>
> Hi,
>
> This code is buggy! The collect() call of the collector does not get a
> document ID relative to the top-level IndexSearcher; it only gets a
> document ID relative to the reader reported in setNextReader (which is an
> atomic reader responsible for a single Lucene index segment).
>
> In setNextReader, save a reference to the "current" reader, and use this
> "current" reader to get the stored fields:
>
> indexSearcher.search(query, queryFilter, new Collector() {
>     AtomicReader current;
>
>     @Override
>     public void setScorer(Scorer arg0) throws IOException {
>     }
>
>     @Override
>     public void setNextReader(AtomicReaderContext ctx) throws IOException {
>         // save the per-segment reader; collect() reports IDs relative to it
>         current = ctx.reader();
>     }
>
>     @Override
>     public void collect(int docID) throws IOException {
>         Document doc = current.document(docID, loadFields);
>         found.found(doc);
>     }
>
>     @Override
>     public boolean acceptsDocsOutOfOrder() {
>         return true;
>     }
> });
>
> Otherwise you get wrong document IDs reported!
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
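If the top-level document ID is ever genuinely needed (say, to resolve it through the IndexSearcher as the original code tried to do), it can be derived from the per-segment one: in the Lucene 4.x API used in this thread, AtomicReaderContext exposes docBase, the segment's start offset in the composite reader. A minimal sketch of that variant; it is less direct than the code above, because the top-level doc() call has to locate the segment again:

    indexSearcher.search(query, queryFilter, new Collector() {
        int docBase; // start offset of the current segment in the composite reader

        @Override
        public void setScorer(Scorer scorer) throws IOException {
        }

        @Override
        public void setNextReader(AtomicReaderContext ctx) throws IOException {
            docBase = ctx.docBase; // remember where this segment starts
        }

        @Override
        public void collect(int docID) throws IOException {
            // docBase + docID is a top-level ID, valid against indexSearcher
            Document doc = indexSearcher.doc(docBase + docID, loadFields);
            found.found(doc);
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true;
        }
    });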
>> -----Original Message-----
>> From: Valentin Popov [mailto:[email protected]]
>> Sent: Saturday, November 14, 2015 1:04 PM
>> To: [email protected]
>> Subject: Re: 500 millions document for loop.
>>
>> Hi, Uwe.
>>
>> Thanks for your advice.
>>
>> After implementing your suggestion, our calculation time dropped from
>> ~20 days to 3.5 hours.
>>
>> /**
>>  * DocumentFound - callback invoked for each found document.
>>  */
>> public void iterate(SearchOptions options, final DocumentFound found,
>>         final Set<String> loadFields) throws Exception {
>>     Query query = options.getQuery();
>>     Filter queryFilter = options.getQueryFilter();
>>     final IndexSearcher indexSearcher = new VolumeSearcher(options)
>>             .newIndexSearcher(Executors.newSingleThreadExecutor());
>>
>>     indexSearcher.search(query, queryFilter, new Collector() {
>>
>>         @Override
>>         public void setScorer(Scorer arg0) throws IOException {
>>         }
>>
>>         @Override
>>         public void setNextReader(AtomicReaderContext arg0) throws IOException {
>>         }
>>
>>         @Override
>>         public void collect(int docID) throws IOException {
>>             Document doc = indexSearcher.doc(docID, loadFields);
>>             found.found(doc);
>>         }
>>
>>         @Override
>>         public boolean acceptsDocsOutOfOrder() {
>>             return true;
>>         }
>>     });
>> }
>>
>>> On 12 Nov 2015, at 21:15, Uwe Schindler <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>>>> The big question is: Do you need the results paged at all?
>>>>
>>>> Yup, because if we return all results, we get OME.
>>>
>>> You get the OME because the paging collector cannot handle that, so this
>>> is an XY problem. Would it not be better if your application just got the
>>> results as a stream and processed them one after another? If that is the
>>> case (and most statistics jobs need it like that), you are much better
>>> off NOT USING TOPDOCS! Your requirement is diametrically opposed to
>>> getting top-scoring documents: you want ALL results as a sequence.
>>>
>>>>> Do you need them sorted?
>>>>
>>>> Nope.
>>>
>>> OK, so unsorted streaming is the right approach.
>>>
>>>>> If not, the easiest approach is to use a custom Collector that does no
>>>>> sorting and just consumes the results.
>>>>
>>>> The main bottleneck, as I see it, comes from the next-page search, which
>>>> takes ~2-4 seconds.
>>>
>>> This is because, when paging, the collector has to re-execute the whole
>>> query and sort all results again, just with a larger window. So if you
>>> have result pages of 50000 results and you want the second page, it will
>>> internally sort 100000 results, because the first page needs to be
>>> calculated, too. As you go forward in the results, the window gets larger
>>> and larger, until it finally covers all results.
>>>
>>> So just getting the results as a stream, by implementing the Collector
>>> API, is the right way to do this.
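The arithmetic behind that growing window is what kills deep paging: with 50000 hits per page, page k internally collects (k + 1) * 50000 hits, so reading all pages of an N-hit result collects on the order of N^2 / (2 * pageSize) hits overall; for 500M hits that is roughly 2.5 trillion collected hits across 10,000 pages. A minimal sketch of the naive pattern being described, assuming the Lucene 4.x API (process() is a hypothetical consumer):

    int pageSize = 50000;
    for (int page = 0; ; page++) {
        // Every page re-runs the query with a window that also covers all
        // earlier pages, so page k collects and sorts (k + 1) * pageSize hits.
        TopDocs top = indexSearcher.search(query, (page + 1) * pageSize);
        if (top.scoreDocs.length <= page * pageSize) {
            break; // no hits beyond the previous window: done
        }
        for (int i = page * pageSize; i < top.scoreDocs.length; i++) {
            process(indexSearcher.doc(top.scoreDocs[i].doc)); // hypothetical consumer
        }
    }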
>>>>>
>>>>> Uwe
>>>>>
>>>>> -----
>>>>> Uwe Schindler
>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>> http://www.thetaphi.de
>>>>> eMail: [email protected]
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Valentin Popov [mailto:[email protected]]
>>>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>>>> To: [email protected]
>>>>>> Subject: Re: 500 millions document for loop.
>>>>>>
>>>>>> Toke, thanks!
>>>>>>
>>>>>> We will look at this solution; it looks like exactly what we need.
>>>>>>
>>>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <[email protected]> wrote:
>>>>>>>
>>>>>>> Valentin Popov <[email protected]> wrote:
>>>>>>>
>>>>>>>> We have ~10 indexes with 500M documents; each document has an
>>>>>>>> «archive date» and a «to» address, and one of our tasks is to
>>>>>>>> calculate statistics on «to» for the last year. Right now we
>>>>>>>> search archive_date:(current_date - 1 year) and paginate the
>>>>>>>> results at 50k records per page. The bottleneck of that approach
>>>>>>>> is pagination: it takes too long, and even on a powerful server
>>>>>>>> the whole run takes ~20 days.
>>>>>>>
>>>>>>> Lucene does not like deep page requests due to the way the internal
>>>>>>> Priority Queue works. Solr has CursorMark, which should be fairly
>>>>>>> simple to emulate in your Lucene handling code:
>>>>>>>
>>>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>>>>>
>>>>>>> - Toke Eskildsen
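If page-shaped results are still needed somewhere, plain Lucene can emulate that cursor with IndexSearcher.searchAfter(), available in the 4.x API: each call resumes after the last ScoreDoc of the previous page, so the priority queue only ever holds one page of hits instead of all earlier pages. A sketch under the same assumptions as above (process() is again a hypothetical consumer):

    int pageSize = 50000;
    ScoreDoc after = null;
    while (true) {
        // searchAfter() skips everything up to and including 'after', so
        // each call collects at most pageSize hits, regardless of depth.
        TopDocs page = (after == null)
                ? indexSearcher.search(query, pageSize)
                : indexSearcher.searchAfter(after, query, pageSize);
        if (page.scoreDocs.length == 0) {
            break; // past the last page
        }
        for (ScoreDoc sd : page.scoreDocs) {
            process(indexSearcher.doc(sd.doc)); // ScoreDoc.doc is a top-level docID
        }
        after = page.scoreDocs[page.scoreDocs.length - 1];
    }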
>>>>>>
>>>>>> Regards,
>>>>>> Valentin Popov
>>>>
>>>> Best regards,
>>>> Valentin Popov
>>
>> Best regards,
>> Valentin Popov

Best regards,
Valentin Popov


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]