FWIW, I built a prototype that differs in two ways:
- it streams straight to the socket output stream
- it streams continuously while collecting, so there is no need to store a bitset.

It may only be useful in a few extreme cases. Is anyone interested?
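To make the difference concrete, here is a rough sketch in plain Python rather than Solr/Lucene code. Everything in it is an illustrative assumption: the function names, the `fetch_doc` callback, and the Python set standing in for a real bitset. It only contrasts the two shapes: collect the matching ids first and dump afterwards, versus write each document as soon as it is collected.

```python
import io
import json


def dump_via_bitset(matches, fetch_doc, out):
    """Two phases: remember every matching doc id, then write the docs."""
    bitset = set()  # stand-in for a real bitset of doc ids (assumption)
    for doc_id in matches:  # phase 1: collect
        bitset.add(doc_id)
    for doc_id in sorted(bitset):  # phase 2: dump in doc-id order
        out.write(json.dumps(fetch_doc(doc_id)) + "\n")
    return len(bitset)


def dump_while_collecting(matches, fetch_doc, out):
    """One phase: write each doc as soon as it is collected, keep no ids."""
    count = 0
    for doc_id in matches:
        out.write(json.dumps(fetch_doc(doc_id)) + "\n")
        count += 1
    return count


# Hypothetical in-memory "index" and output buffer, for illustration only.
docs = {0: {"id": "a"}, 1: {"id": "b"}, 2: {"id": "c"}}
out = io.StringIO()
written = dump_while_collecting(iter([2, 0]), docs.__getitem__, out)
```

In a real server `out` would wrap the response socket instead of a `StringIO` (for example the file object returned by `socket.makefile("w")`). The streaming variant keeps no record of which documents matched, and documents come out in collection order rather than doc-id order.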
On Wed, Jul 24, 2013 at 7:19 PM, Roman Chyla <roman.ch...@gmail.com> wrote:

> On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber <mlie...@impetus.com> wrote:
>
> > That sounds like a satisfactory solution for the time being -
> > I am assuming you dump the data from Solr in a csv format?
>
> JSON
>
> > How did you implement the streaming processor ? (what tool did you use for
> > this? Not familiar with that)
>
> this is what dumps the docs:
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/response/JSONDumper.java
>
> it is called by one of our batch processors, which can pass it a bitset of recs
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchProviderDumpIndex.java
>
> as far as streaming is concerned, we were all very nicely surprised, a few
> GB file (on local network) took ridiculously short time - in fact, a
> colleague of mine was assuming it is not working, until we looked into the
> downloaded file ;-), you may want to look at line 463
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchHandler.java
>
> roman
>
> > You say it takes a few minutes only to dump the data - how long does it to
> > stream it back in, are performances acceptable (~ within minutes) ?
> >
> > Thanks,
> > Matt
> >
> > On 7/23/13 6:57 PM, "Roman Chyla" <roman.ch...@gmail.com> wrote:
> >
> > >Hello Matt,
> > >
> > >You can consider writing a batch processing handler, which receives a query
> > >and instead of sending results back, it writes them into a file which is
> > >then available for streaming (it has its own UUID). I am dumping many GBs
> > >of data from solr in few minutes - your query + streaming writer can go
> > >very long way :)
> > >
> > >roman
> > >
> > >On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber <mlie...@impetus.com> wrote:
> > >
> > >> Hello Solr users,
> > >>
> > >> Question regarding processing a lot of docs returned from a query; I
> > >> potentially have millions of documents returned back from a query. What is
> > >> the common design to deal with this ?
> > >>
> > >> 2 ideas I have are:
> > >> - create a client service that is multithreaded to handled this
> > >> - Use the Solr "pagination" to retrieve a batch of rows at a time
> > >>   ("start, rows" in Solr Admin console )
> > >>
> > >> Any other ideas that I may be missing ?
> > >>
> > >> Thanks,
> > >> Matt
> > >>
> > >> NOTE: This message may contain information that is confidential,
> > >> proprietary, privileged or otherwise protected by law. The message is
> > >> intended solely for the named addressee. If received in error, please
> > >> destroy and notify the sender. Any use of this email is prohibited when
> > >> received in error. Impetus does not represent, warrant and/or guarantee,
> > >> that the integrity of this communication has been maintained nor that the
> > >> communication is free of errors, virus, interception or interference.

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>