Thank you very much!
> On 14 Nov 2015, at 15:49, Uwe Schindler <[email protected]> wrote:
>
> Hi,
>
> This code is buggy! The collect() call of the collector does not get a
> document ID relative to the top-level IndexSearcher; it only gets a
> document ID relative to the reader reported in setNextReader (which is an
> atomic reader responsible for a single Lucene index segment).
>
> In setNextReader, save a reference to the "current" reader, and use this
> "current" reader to get the stored fields:
>
> indexSearcher.search(query, queryFilter, new Collector() {
>     AtomicReader current;
>
>     @Override
>     public void setScorer(Scorer arg0) throws IOException {
>     }
>
>     @Override
>     public void setNextReader(AtomicReaderContext ctx) throws IOException {
>         // save the per-segment reader; collect() reports IDs relative to it
>         current = ctx.reader();
>     }
>
>     @Override
>     public void collect(int docID) throws IOException {
>         Document doc = current.document(docID, loadFields);
>         found.found(doc);
>     }
>
>     @Override
>     public boolean acceptsDocsOutOfOrder() {
>         return true;
>     }
> });
>
> Otherwise you get wrong document IDs reported!
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
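If the top-level document ID is ever genuinely needed (say, to resolve it through the IndexSearcher as the original code tried to do), it can be derived from the per-segment one: in the Lucene 4.x API used in this thread, AtomicReaderContext exposes docBase, the segment's start offset in the composite reader. A minimal sketch of that variant; it is less direct than the code above, because the top-level doc() call has to locate the segment again:

    indexSearcher.search(query, queryFilter, new Collector() {
        int docBase; // start offset of the current segment in the composite reader

        @Override
        public void setScorer(Scorer scorer) throws IOException {
        }

        @Override
        public void setNextReader(AtomicReaderContext ctx) throws IOException {
            docBase = ctx.docBase; // remember where this segment starts
        }

        @Override
        public void collect(int docID) throws IOException {
            // docBase + docID is a top-level ID, valid against indexSearcher
            Document doc = indexSearcher.doc(docBase + docID, loadFields);
            found.found(doc);
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true;
        }
    });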
>> -----Original Message-----
>> From: Valentin Popov [mailto:[email protected]]
>> Sent: Saturday, November 14, 2015 1:04 PM
>> To: [email protected]
>> Subject: Re: 500 millions document for loop.
>>
>> Hi, Uwe.
>>
>> Thanks for your advice.
>>
>> After implementing your suggestion, our calculation time dropped from
>> ~20 days to 3.5 hours.
>>
>> /**
>>  * DocumentFound - callback invoked for each found document.
>>  */
>> public void iterate(SearchOptions options, final DocumentFound found,
>>         final Set<String> loadFields) throws Exception {
>>     Query query = options.getQuery();
>>     Filter queryFilter = options.getQueryFilter();
>>     final IndexSearcher indexSearcher = new VolumeSearcher(options)
>>             .newIndexSearcher(Executors.newSingleThreadExecutor());
>>
>>     indexSearcher.search(query, queryFilter, new Collector() {
>>
>>         @Override
>>         public void setScorer(Scorer arg0) throws IOException {
>>         }
>>
>>         @Override
>>         public void setNextReader(AtomicReaderContext arg0) throws IOException {
>>         }
>>
>>         @Override
>>         public void collect(int docID) throws IOException {
>>             Document doc = indexSearcher.doc(docID, loadFields);
>>             found.found(doc);
>>         }
>>
>>         @Override
>>         public boolean acceptsDocsOutOfOrder() {
>>             return true;
>>         }
>>     });
>> }
>>
>>> On 12 Nov 2015, at 21:15, Uwe Schindler <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>>>> The big question is: Do you need the results paged at all?
>>>>
>>>> Yup, because if we return all results, we get OME.
>>>
>>> You get the OME because the paging collector cannot handle that, so this
>>> is an XY problem. Would it not be better if your application just got the
>>> results as a stream and processed them one after another? If that is the
>>> case (and most statistics jobs need it like that), you are much better
>>> off NOT USING TOPDOCS! Your requirement is diametrically opposed to
>>> getting top-scoring documents: you want ALL results as a sequence.
>>>
>>>>> Do you need them sorted?
>>>>
>>>> Nope.
>>>
>>> OK, so unsorted streaming is the right approach.
>>>
>>>>> If not, the easiest approach is to use a custom Collector that does no
>>>>> sorting and just consumes the results.
>>>>
>>>> The main bottleneck, as I see it, comes from the next-page search, which
>>>> takes ~2-4 seconds.
>>>
>>> This is because, when paging, the collector has to re-execute the whole
>>> query and sort all results again, just with a larger window. So if you
>>> have result pages of 50000 results and you want the second page, it will
>>> internally sort 100000 results, because the first page needs to be
>>> calculated, too. As you go forward in the results, the window gets larger
>>> and larger, until it finally covers all results.
>>>
>>> So just getting the results as a stream, by implementing the Collector
>>> API, is the right way to do this.
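The arithmetic behind that growing window is what kills deep paging: with 50000 hits per page, page k internally collects (k + 1) * 50000 hits, so reading all pages of an N-hit result collects on the order of N^2 / (2 * pageSize) hits overall; for 500M hits that is roughly 2.5 trillion collected hits across 10,000 pages. A minimal sketch of the naive pattern being described, assuming the Lucene 4.x API (process() is a hypothetical consumer):

    int pageSize = 50000;
    for (int page = 0; ; page++) {
        // Every page re-runs the query with a window that also covers all
        // earlier pages, so page k collects and sorts (k + 1) * pageSize hits.
        TopDocs top = indexSearcher.search(query, (page + 1) * pageSize);
        if (top.scoreDocs.length <= page * pageSize) {
            break; // no hits beyond the previous window: done
        }
        for (int i = page * pageSize; i < top.scoreDocs.length; i++) {
            process(indexSearcher.doc(top.scoreDocs[i].doc)); // hypothetical consumer
        }
    }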
>>>>>
>>>>> Uwe
>>>>>
>>>>> -----
>>>>> Uwe Schindler
>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>> http://www.thetaphi.de
>>>>> eMail: [email protected]
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Valentin Popov [mailto:[email protected]]
>>>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>>>> To: [email protected]
>>>>>> Subject: Re: 500 millions document for loop.
>>>>>>
>>>>>> Toke, thanks!
>>>>>>
>>>>>> We will look at this solution; it looks like exactly what we need.
>>>>>>
>>>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <[email protected]> wrote:
>>>>>>>
>>>>>>> Valentin Popov <[email protected]> wrote:
>>>>>>>
>>>>>>>> We have ~10 indexes with 500M documents; each document has an
>>>>>>>> «archive date» and a «to» address, and one of our tasks is to
>>>>>>>> calculate statistics on «to» for the last year. Right now we
>>>>>>>> search archive_date:(current_date - 1 year) and paginate the
>>>>>>>> results at 50k records per page. The bottleneck of that approach
>>>>>>>> is pagination: it takes too long, and even on a powerful server
>>>>>>>> the whole run takes ~20 days.
>>>>>>>
>>>>>>> Lucene does not like deep page requests due to the way the internal
>>>>>>> Priority Queue works. Solr has CursorMark, which should be fairly
>>>>>>> simple to emulate in your Lucene handling code:
>>>>>>>
>>>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>>>>>
>>>>>>> - Toke Eskildsen
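If page-shaped results are still needed somewhere, plain Lucene can emulate that cursor with IndexSearcher.searchAfter(), available in the 4.x API: each call resumes after the last ScoreDoc of the previous page, so the priority queue only ever holds one page of hits instead of all earlier pages. A sketch under the same assumptions as above (process() is again a hypothetical consumer):

    int pageSize = 50000;
    ScoreDoc after = null;
    while (true) {
        // searchAfter() skips everything up to and including 'after', so
        // each call collects at most pageSize hits, regardless of depth.
        TopDocs page = (after == null)
                ? indexSearcher.search(query, pageSize)
                : indexSearcher.searchAfter(after, query, pageSize);
        if (page.scoreDocs.length == 0) {
            break; // past the last page
        }
        for (ScoreDoc sd : page.scoreDocs) {
            process(indexSearcher.doc(sd.doc)); // ScoreDoc.doc is a top-level docID
        }
        after = page.scoreDocs[page.scoreDocs.length - 1];
    }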
>>>>>>
>>>>>> Regards,
>>>>>> Valentin Popov
>>>>
>>>> Best regards,
>>>> Valentin Popov
>>
>> Best regards,
>> Valentin Popov

Best regards,
Valentin Popov


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]