Uwe, I’m very sorry, I misspelled your name ;)

> On 12 Nov 2015, at 21:15, Uwe Schindler <[email protected]> wrote:
>
> Hi,
>
>>> The big question is: Do you need the results paged at all?
>>
>> Yup, because if we return all results, we get OME.
>
> You get the OME because the paging collector cannot handle that, so this is
> an XY problem. Would it not be better if your application just got the
> results as a stream and processed them one after another? If that is the
> case (and most statistics jobs need it like that), you are much better off
> NOT USING TOPDOCS! Your requirement is diametrically opposed to getting
> top-scoring documents: you want ALL results as a sequence.
>
>>> Do you need them sorted?
>>
>> Nope.
>
> OK, so unsorted streaming is the right approach.
>
>>> If not, the easiest approach is to use a custom Collector that does no
>>> sorting and just consumes the results.
>>
>> The main bottleneck, as I see it, comes from the next-page search, which
>> takes ~2-4 seconds.
>
> This is because, when paging, the collector has to re-execute the whole query
> and sort all results again, just with a larger window. So if you have result
> pages of 50,000 results and you want the second page, it will internally
> sort 100,000 results, because the first page needs to be calculated, too. As
> you page forward through the results, the window gets larger and larger,
> until it finally collects all results.
>
> So getting the results as a stream by implementing the Collector API is the
> right way to do this.
>
>>> Uwe
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: [email protected]
>>>
>>>> -----Original Message-----
>>>> From: Valentin Popov [mailto:[email protected]]
>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>> To: [email protected]
>>>> Subject: Re: 500 millions document for loop.
>>>>
>>>> Toke, thanks!
>>>>
>>>> We will look at this solution; it looks like it is exactly what we need.
>>>>
>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <[email protected]> wrote:
>>>>>
>>>>> Valentin Popov <[email protected]> wrote:
>>>>>
>>>>>> We have ~10 indexes holding 500M documents; each document has an
>>>>>> «archive date» and a «to» address, and one of our tasks is to
>>>>>> calculate statistics on «to» for the last year. Right now we
>>>>>> search archive_date:(current_date - 1 year) and paginate the
>>>>>> results at 50k records per page. The bottleneck of that approach
>>>>>> is that pagination takes far too long: even on a powerful server
>>>>>> it takes ~20 days to execute.
>>>>>
>>>>> Lucene does not like deep page requests due to the way the internal
>>>>> Priority Queue works.
>>>>> Solr has CursorMark, which should be fairly simple to emulate in
>>>>> your Lucene handling code:
>>>>>
>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>>>
>>>>> - Toke Eskildsen
>>>>
>>>> Regards,
>>>> Valentin Popov
>>
>> Regards,
>> Valentin Popov
Regards,
Valentin Popov
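
For reference, a minimal sketch of the unsorted streaming Collector that Uwe
describes above, written against the Lucene 5.x API current at the time. It
assumes the «to» address is indexed as a SortedDocValuesField named "to"; the
class name and the statistic it gathers are illustrative only, not anything
from the original thread.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.SortedDocValues;
    import org.apache.lucene.search.SimpleCollector;

    // Streams every hit unsorted: no priority queue or ScoreDoc array is
    // ever built, so memory stays flat no matter how many documents match.
    public class ToAddressStatsCollector extends SimpleCollector {

        private final Map<String, Long> counts = new HashMap<>();
        private SortedDocValues toValues; // current segment's doc values

        @Override
        protected void doSetNextReader(LeafReaderContext context) throws IOException {
            // Called once per segment, before that segment's hits arrive.
            toValues = context.reader().getSortedDocValues("to");
        }

        @Override
        public void collect(int doc) throws IOException {
            if (toValues == null) {
                return; // this segment has no "to" doc values
            }
            // Lucene 5.x random-access doc values API (iterator-based in 7.x+).
            String to = toValues.get(doc).utf8ToString();
            counts.merge(to, 1L, Long::sum);
        }

        @Override
        public boolean needsScores() {
            return false; // scores are never used, so Lucene skips computing them
        }

        public Map<String, Long> counts() {
            return counts;
        }
    }

Usage is a single pass: call searcher.search(query, collector) once, then read
collector.counts(); there is no paging and therefore no growing sort window.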
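If paging must be kept, a sketch of the cursor-style iteration Toke links to,
emulated with Lucene's IndexSearcher.searchAfter. Sorting by index order skips
score computation and keeps the collector's priority queue at one page rather
than page-number x page-size; the class, method, and visitor interface names
here are hypothetical.

    import java.io.IOException;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.TopDocs;

    public final class CursorPaging {

        /** Visits every hit page by page, resuming after the previous page's
         *  last hit instead of re-sorting an ever-growing window. */
        public static void forEachHit(IndexSearcher searcher, Query query,
                                      int pageSize, HitVisitor visitor) throws IOException {
            ScoreDoc after = null; // the "cursor mark"
            while (true) {
                TopDocs page = searcher.searchAfter(after, query, pageSize, Sort.INDEXORDER);
                if (page.scoreDocs.length == 0) {
                    break; // past the last page
                }
                for (ScoreDoc hit : page.scoreDocs) {
                    visitor.visit(hit.doc);
                }
                // Resume after the last hit of this page on the next call.
                after = page.scoreDocs[page.scoreDocs.length - 1];
            }
        }

        public interface HitVisitor {
            void visit(int globalDocId) throws IOException;
        }
    }

Note that each searchAfter call still walks all matching documents, so for
consuming the complete result set the streaming Collector above remains the
cheaper option, exactly as Uwe argues in the thread.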
