Hi Erick,

This is how I use the streaming approach.
Here is the solrconfig block:

    <requestHandler name="/export" class="solr.SearchHandler">
      <lst name="invariants">
        <str name="rq">{!xport}</str>
        <str name="wt">xsort</str>
        <str name="distrib">false</str>
      </lst>
      <arr name="components">
        <str>query</str>
      </arr>
    </requestHandler>

And here is the code in which SolrJ is being used:

    String zkHost = args[0];
    String collection = args[1];

    Map<String, String> props = new HashMap<>();
    props.put("q", "*:*");
    props.put("qt", "/export");
    props.put("sort", "fieldA asc");
    props.put("fl", "fieldA,fieldB,fieldC");

    CloudSolrStream cloudstream = new CloudSolrStream(zkHost, collection, props);

I then iterate through the cloud stream (TupleStream), so I am using streaming expressions (SolrJ); see the sketch below.

I have not looked at the Solr logs from when I started getting the JSON parsing exceptions, but I will let you know what I see the next time I run into them.
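For reference, the read loop over the TupleStream looks roughly like this. It is a minimal sketch against the Solr 5.x streaming API with error handling omitted; the field names match the fl above:

    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

    // zkHost, collection, and props as above
    CloudSolrStream cloudstream = new CloudSolrStream(zkHost, collection, props);
    try {
      cloudstream.open();                    // connects to the /export handler on each shard
      while (true) {
        Tuple tuple = cloudstream.read();    // next tuple, merged in sort order across shards
        if (tuple.EOF) {                     // sentinel tuple marking the end of the stream
          break;
        }
        String fieldA = tuple.getString("fieldA");
        // ... read fieldB and fieldC the same way and process the tuple ...
      }
    } finally {
      cloudstream.close();
    }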
Thanks

On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Hmmm, /export is supposed to handle result sets in the tens of millions.
> I know of a situation where the Streaming Aggregation functionality,
> back-ported to Solr 4.10, processes on that scale. So do you have any
> clue what exactly is failing? Is there anything in the Solr logs?
>
> _How_ are you using /export: through Streaming Aggregation (SolrJ) or
> just the raw xport handler? If you're not using SolrJ, it might be worth
> trying this from SolrJ; it should be a very quick program to write just
> to test, we're talking 100 lines max.
>
> You could always roll your own cursorMark stuff by partitioning the
> data amongst N threads/processes, if you have any reasonable
> expectation that you could form filter queries that partition the
> result set anywhere near evenly.
>
> For example, let's say you have a field with random numbers between 0
> and 100. You could spin off 10 cursorMark-aware processes, each with
> its own fq clause, like:
>
> fq=partition_field:[0 TO 10}
> fq=partition_field:[10 TO 20}
> ....
> fq=partition_field:[90 TO 100]
>
> Note the use of inclusive/exclusive end points....
>
> Each one would be totally independent of all the others, with no
> overlapping documents. And since the fq's would presumably be cached,
> you should be able to go as fast as you can drive your cluster. Of
> course you lose query-wide sorting and the like; if that's important,
> you'd need to figure something out there.
>
> Do be aware of a potential issue. When regular doc fields are
> returned, a 16K block of data is decompressed for each returned
> document to get the stored field data. Streaming Aggregation
> (/xport) reads docValues entries, which are held in MMapDirectory
> space, so it will be much, much faster. As of Solr 5.5 you can override
> the decompression stuff for fields that are both stored and docValues;
> see https://issues.apache.org/jira/browse/SOLR-8220.
>
> Best,
> Erick
>
> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> > Thanks Yonik for the explanation.
> >
> > Hi Erick,
> > I was using the /xport functionality, but it hasn't been stable (Solr
> > 5.5.0). I started running into runtime exceptions (JSON parsing
> > exceptions) while reading the stream of Tuples. This started happening
> > as the size of my collection increased 3x and I started running queries
> > that return millions of documents (>10 million). I don't know if it is
> > the query result size or the actual data size (total number of docs in
> > the collection) that is causing the instability.
> >
> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/I","space":"uuid","timestamp":15' AFTER='DB6474294954},{"uuid":"0lG99sHT8P5e'
> >
> > I won't be able to move to Solr 6.0 due to some constraints in our
> > production environment, and hence I am moving back to the cursor
> > approach. Do you have any other suggestions for me?
> >
> > Thanks,
> > Chetas.
> >
> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Have you considered the /xport functionality?
> >>
> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ysee...@gmail.com> wrote:
> >> > No, you can't get cursor-marks ahead of time.
> >> > They are the serialized representation of the last sort values
> >> > encountered (hence not known ahead of time).
> >> >
> >> > -Yonik
> >> >
> >> >
> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> >> >> Hi,
> >> >>
> >> >> I am using the cursor approach to fetch results from Solr (5.5.0).
> >> >> Most of my queries return millions of results. Is there a way I can
> >> >> read the pages in parallel? Is there a way I can get all the cursors
> >> >> well in advance?
> >> >>
> >> >> Let's say my query returns 2M documents and I have set rows=100,000.
> >> >> Can I have multiple threads iterating over different pages, like
> >> >> Thread1 -> docs 1 to 100K
> >> >> Thread2 -> docs 101K to 200K
> >> >> ......
> >> >> ......
> >> >>
> >> >> For this to happen, can I get all the cursorMarks for a given query
> >> >> so that I can leverage the following code in parallel?
> >> >>
> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> >> >> val rsp: QueryResponse = c.query(cursorQ)
> >> >>
> >> >> Thank you,
> >> >> Chetas.
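P.S. If I do end up partitioning the work across cursorMark-aware threads as you suggest above, each worker would look roughly like the sketch below. The partition_field bounds and the "id" uniqueKey are placeholder assumptions; every worker gets its own non-overlapping fq:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    CloudSolrClient client = new CloudSolrClient(zkHost);
    client.setDefaultCollection(collection);

    SolrQuery q = new SolrQuery("*:*");
    q.setRows(100000);
    q.setSort(SolrQuery.SortClause.asc("id"));      // cursorMark requires a sort on the uniqueKey; "id" assumed
    q.addFilterQuery("partition_field:[0 TO 10}");  // this worker's slice; note the exclusive upper bound

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = client.query(q);
      // ... process rsp.getResults() ...
      String next = rsp.getNextCursorMark();
      if (next.equals(cursorMark)) {                // unchanged cursorMark means no more pages
        break;
      }
      cursorMark = next;
    }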