You're executing the entire query once just to collect the cursor marks before you even start fetching data, which seems very inefficient. My suggestion doesn't require that first step. Perhaps it was confusing because I mentioned "your own cursorMark". Really I meant bypass that entirely: just form N queries, each restricted to one of N disjoint subsets of the data, and process them all in parallel, either with /export or /select.
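A rough, untested SolrJ sketch of that idea, streaming each disjoint slice through /export in its own thread. Here partition_field is a hypothetical numeric field with values spread between 0 and 100, and fieldA/fieldB/fieldC are just the placeholder field names used elsewhere in this thread:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

public class ParallelExport {

    public static void main(String[] args) throws Exception {
        String zkHost = args[0];
        String collection = args[1];
        int numPartitions = 10;

        ExecutorService pool = Executors.newFixedThreadPool(numPartitions);
        for (int i = 0; i < numPartitions; i++) {
            final int lower = i * 10;
            final int upper = (i + 1) * 10;
            pool.submit(() -> readPartition(zkHost, collection, lower, upper));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for all partitions to drain
    }

    // Each worker streams its own disjoint slice of the result set via /export.
    private static void readPartition(String zkHost, String collection, int lower, int upper) {
        // partition_field is a hypothetical numeric field used only to split the data.
        // The last bucket is upper-inclusive so the ranges stay disjoint but cover everything.
        String fq = (upper == 100)
            ? "partition_field:[" + lower + " TO " + upper + "]"
            : "partition_field:[" + lower + " TO " + upper + "}";

        Map props = new HashMap();
        props.put("q", "*:*");
        props.put("fq", fq);
        props.put("qt", "/export");
        props.put("sort", "fieldA asc");
        props.put("fl", "fieldA,fieldB,fieldC");

        try {
            CloudSolrStream stream = new CloudSolrStream(zkHost, collection, props);
            stream.open();
            Tuple tuple = stream.read();
            while (!tuple.EOF) {
                // process tuple.get("fieldA") etc. here
                tuple = stream.read();
            }
            stream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

If you need stored (non-docValues) fields, the same partitioning works against /select instead; you keep the parallelism but pay the stored-field decompression cost mentioned further down the thread.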
Best,
Erick

On Mon, Nov 14, 2016 at 3:53 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> Thanks Joel for the explanation.
>
> Hi Erick,
>
> One of the ways I am trying to parallelize the cursor approach is by iterating the result set twice.
> (1) Once just to get all the cursor marks
>
> val q: SolrQuery = new solrj.SolrQuery()
> q.set("q", query)
> q.add("fq", query)
> q.add("rows", batchSize.toString)
> q.add("collection", collection)
> q.add("fl", "null")
> q.add("sort", "id asc")
>
> Here I am not asking for any field values ( "fl" -> null )
>
> (2) Once I get all the cursor marks, I can start parallel threads to get the results in parallel.
>
> However, the first step in fact takes a lot of time. Even more than when I would actually iterate through the results with "fl" -> field1, field2, field3
>
> Why is this happening?
>
> Thanks!
>
> On Thu, Nov 10, 2016 at 8:22 PM, Joel Bernstein <joels...@gmail.com> wrote:
>
>> Solr 5 was very early days for Streaming Expressions. Streaming Expressions and SQL use Java 8 so development switched to the 6.0 branch five months before the 6.0 release. So there was a very large jump in features and bug fixes from Solr 5 to Solr 6 in Streaming Expressions.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Thu, Nov 10, 2016 at 11:14 PM, Joel Bernstein <joels...@gmail.com> wrote:
>>
>> > In Solr 5 the /export handler wasn't escaping json text fields, which would produce json parse exceptions. This was fixed in Solr 6.0.
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >
>> >> Hmm, that should work fine. Let us know what the logs show if anything because this is weird.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
>> >> > Hi Erick,
>> >> >
>> >> > This is how I use the streaming approach.
>> >> >
>> >> > Here is the solrconfig block.
>> >> >
>> >> > <requestHandler name="/export" class="solr.SearchHandler">
>> >> >   <lst name="invariants">
>> >> >     <str name="rq">{!xport}</str>
>> >> >     <str name="wt">xsort</str>
>> >> >     <str name="distrib">false</str>
>> >> >   </lst>
>> >> >   <arr name="components">
>> >> >     <str>query</str>
>> >> >   </arr>
>> >> > </requestHandler>
>> >> >
>> >> > And here is the code in which SolrJ is being used.
>> >> >
>> >> > String zkHost = args[0];
>> >> > String collection = args[1];
>> >> >
>> >> > Map props = new HashMap();
>> >> > props.put("q", "*:*");
>> >> > props.put("qt", "/export");
>> >> > props.put("sort", "fieldA asc");
>> >> > props.put("fl", "fieldA,fieldB,fieldC");
>> >> >
>> >> > CloudSolrStream cloudstream = new CloudSolrStream(zkHost, collection, props);
>> >> >
>> >> > And then I iterate through the cloud stream (TupleStream). So I am using streaming expressions (SolrJ).
>> >> >
>> >> > I have not looked at the Solr logs while I started getting the JSON parsing exceptions. But I will let you know what I see the next time I run into the same exceptions.
>> >> >
>> >> > Thanks
>> >> >
>> >> > On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >> >
>> >> >> Hmmm, export is supposed to handle 10s of million result sets. I know of a situation where the Streaming Aggregation functionality back ported to Solr 4.10 processes on that scale.
>> >> >> So do you have any clue what exactly is failing? Is there anything in the Solr logs?
>> >> >>
>> >> >> _How_ are you using /export, through Streaming Aggregation (SolrJ) or just the raw xport handler? It might be worth trying to do this from SolrJ if you're not; it should be a very quick program to write, just to test. We're talking 100 lines max.
>> >> >>
>> >> >> You could always roll your own cursorMark stuff by partitioning the data amongst N threads/processes if you have any reasonable expectation that you could form filter queries that partition the result set anywhere near evenly.
>> >> >>
>> >> >> For example, let's say you have a field with random numbers between 0 and 100. You could spin off 10 cursorMark-aware processes, each with its own fq clause like
>> >> >>
>> >> >> fq=partition_field:[0 TO 10}
>> >> >> fq=[10 TO 20}
>> >> >> ....
>> >> >> fq=[90 TO 100]
>> >> >>
>> >> >> Note the use of inclusive/exclusive end points....
>> >> >>
>> >> >> Each one would be totally independent of all the others with no overlapping documents. And since the fq's would presumably be cached, you should be able to go as fast as you can drive your cluster. Of course you lose query-wide sorting and the like; if that's important you'd need to figure something out there.
>> >> >>
>> >> >> Do be aware of a potential issue. When regular doc fields are returned, a 16K block of data will be decompressed for each document returned to get the stored field data. Streaming Aggregation (/xport) reads docValues entries which are held in MMapDirectory space, so it will be much, much faster. As of Solr 5.5 you can override the decompression stuff for fields that are both stored and docValues, see: https://issues.apache.org/jira/browse/SOLR-8220
>> >> >>
>> >> >> Best,
>> >> >> Erick
>> >> >>
>> >> >> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
>> >> >> > Thanks Yonik for the explanation.
>> >> >> >
>> >> >> > Hi Erick,
>> >> >> > I was using the /xport functionality. But it hasn't been stable (Solr 5.5.0). I started running into runtime exceptions (JSON parsing exceptions) while reading the stream of Tuples. This started happening as the size of my collection increased 3 times and I started running queries that return millions of documents (>10mm). I don't know if it is the query result size or the actual data size (total number of docs in the collection) that is causing the instability.
>> >> >> >
>> >> >> > org.noggit.JSONParser$ParseException: Expected ',' or '}': char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/ I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":" 0lG99sHT8P5e'
>> >> >> >
>> >> >> > I won't be able to move to Solr 6.0 due to some constraints in our production environment and hence I am moving back to the cursor approach. Do you have any other suggestion for me?
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Chetas.
>> >> >> >
>> >> >> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >> >> >
>> >> >> >> Have you considered the /xport functionality?
>> >> >> >>
>> >> >> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ysee...@gmail.com> wrote:
>> >> >> >> > No, you can't get cursor-marks ahead of time.
>> >> >> >> > They are the serialized representation of the last sort values encountered (hence not known ahead of time).
>> >> >> >> >
>> >> >> >> > -Yonik
>> >> >> >> >
>> >> >> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
>> >> >> >> >> Hi,
>> >> >> >> >>
>> >> >> >> >> I am using the cursor approach to fetch results from Solr (5.5.0). Most of my queries return millions of results. Is there a way I can read the pages in parallel? Is there a way I can get all the cursors well in advance?
>> >> >> >> >>
>> >> >> >> >> Let's say my query returns 2M documents and I have set rows=100,000.
>> >> >> >> >> Can I have multiple threads iterating over different pages like
>> >> >> >> >> Thread1 -> docs 1 to 100K
>> >> >> >> >> Thread2 -> docs 101K to 200K
>> >> >> >> >> ......
>> >> >> >> >>
>> >> >> >> >> For this to happen, can I get all the cursorMarks for a given query so that I can leverage the following code in parallel?
>> >> >> >> >>
>> >> >> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
>> >> >> >> >> val rsp: QueryResponse = c.query(cursorQ)
>> >> >> >> >>
>> >> >> >> >> Thank you,
>> >> >> >> >> Chetas.
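For completeness, a rough, untested sketch of the /select variant of the same idea: each worker runs the normal cursorMark loop but adds its own fq so the slices stay disjoint. Again, partition_field and fieldA/fieldB/fieldC are just the placeholder names used earlier in the thread, and "id" is assumed to be the collection's uniqueKey:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class PartitionCursorWorker implements Runnable {

    private final String zkHost;
    private final String collection;
    private final String fq;   // e.g. "partition_field:[0 TO 10}"

    public PartitionCursorWorker(String zkHost, String collection, String fq) {
        this.zkHost = zkHost;
        this.collection = collection;
        this.fq = fq;
    }

    @Override
    public void run() {
        try (CloudSolrClient client = new CloudSolrClient(zkHost)) {
            client.setDefaultCollection(collection);

            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery(fq);                  // restricts this worker to its own slice
            q.setFields("fieldA", "fieldB", "fieldC");
            q.setRows(100000);
            q.setSort("id", SolrQuery.ORDER.asc);  // cursorMark requires a sort on the uniqueKey

            String cursorMark = CursorMarkParams.CURSOR_MARK_START;
            boolean done = false;
            while (!done) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse rsp = client.query(q);
                // process rsp.getResults() here
                String nextCursorMark = rsp.getNextCursorMark();
                done = cursorMark.equals(nextCursorMark);
                cursorMark = nextCursorMark;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Each worker's cursorMark is independent of the others, so no marks need to be known ahead of time; the fq ranges are what keep the threads from overlapping.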