You're executing the whole set of queries just to collect the cursor marks
before the parallel work even starts. That seems very inefficient. My
suggestion doesn't require this first step.
Perhaps it was confusing because I mentioned "your own cursorMark".
Really I meant bypass that entirely: just form N queries that are each
restricted to one of N disjoint subsets of the data and process them all in
parallel, either with /export or /select.

Best,
Erick

On Mon, Nov 14, 2016 at 3:53 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> Thanks Joel for the explanation.
>
> Hi Erick,
>
> One of the ways I am trying to parallelize the cursor approach is by
> iterating over the result set twice:
> (1) once just to get all the cursor marks
>
> val q: SolrQuery = new solrj.SolrQuery()
> q.set("q", query)
> q.add("fq", query)
> q.add("rows", batchSize.toString)
> q.add("collection", collection)
> q.add("fl", "null")
> q.add("sort", "id asc")
>
> Here I am not asking for any field values ( "fl" -> null )
>
> (2) Once I get all the cursor marks, I can start parallel threads to get
> the results in parallel.
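>
> The loop for step (1) looks roughly like this (a Java sketch of what my Scala
> does; solrClient is a CloudSolrClient, "id" is the uniqueKey, and query,
> batchSize and collection are the same values as above):
>
> // java.util.List/ArrayList plus the SolrJ SolrQuery, QueryResponse and
> // CursorMarkParams imports omitted for brevity
> List<String> marks = new ArrayList<>();
> SolrQuery q = new SolrQuery(query);
> q.setRows(batchSize);
> q.set("collection", collection);
> q.setFields("id");                                  // uniqueKey only, no other fields
> q.setSort(SolrQuery.SortClause.asc("id"));
>
> String mark = CursorMarkParams.CURSOR_MARK_START;   // "*"
> while (true) {
>   q.set(CursorMarkParams.CURSOR_MARK_PARAM, mark);
>   QueryResponse rsp = solrClient.query(q);
>   String next = rsp.getNextCursorMark();
>   if (mark.equals(next)) break;                     // no more pages
>   marks.add(next);
>   mark = next;
> }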
>
> However, the first step in fact takes a lot of time, even more than when I
> actually iterate through the results with "fl" -> field1, field2, field3.
>
> Why is this happening?
>
> Thanks!
>
>
> On Thu, Nov 10, 2016 at 8:22 PM, Joel Bernstein <joels...@gmail.com> wrote:
>
>> Solr 5 was very early days for Streaming Expressions. Streaming Expressions
>> and SQL use Java 8, so development switched to the 6.0 branch five months
>> before the 6.0 release. So there was a very large jump in features and bug
>> fixes from Solr 5 to Solr 6 in Streaming Expressions.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Thu, Nov 10, 2016 at 11:14 PM, Joel Bernstein <joels...@gmail.com> wrote:
>>
>> > In Solr 5 the /export handler wasn't escaping JSON text fields, which
>> > would produce JSON parse exceptions. This was fixed in Solr 6.0.
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >
>> >> Hmm, that should work fine. Let us know what the logs show, if anything,
>> >> because this is weird.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
>> >> > Hi Erick,
>> >> >
>> >> > This is how I use the streaming approach.
>> >> >
>> >> > Here is the solrconfig block.
>> >> >
>> >> > <requestHandler name="/export" class="solr.SearchHandler">
>> >> >     <lst name="invariants">
>> >> >         <str name="rq">{!xport}</str>
>> >> >         <str name="wt">xsort</str>
>> >> >         <str name="distrib">false</str>
>> >> >     </lst>
>> >> >     <arr name="components">
>> >> >         <str>query</str>
>> >> >     </arr>
>> >> > </requestHandler>
>> >> >
>> >> > And here is the code in which SolrJ is being used.
>> >> >
>> >> > String zkHost = args[0];
>> >> > String collection = args[1];
>> >> >
>> >> > Map props = new HashMap();
>> >> > props.put("q", "*:*");
>> >> > props.put("qt", "/export");
>> >> > props.put("sort", "fieldA asc");
>> >> > props.put("fl", "fieldA,fieldB,fieldC");
>> >> >
>> >> > CloudSolrStream cloudstream = new CloudSolrStream(zkHost, collection, props);
>> >> >
>> >> > And then I iterate through the cloud stream (TupleStream).
>> >> > So I am using streaming expressions (SolrJ).
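>> >> >
>> >> > The iteration itself is basically this (continuing the snippet above; it
>> >> > needs org.apache.solr.client.solrj.io.Tuple, error handling stripped):
>> >> >
>> >> > cloudstream.open();
>> >> > try {
>> >> >   while (true) {
>> >> >     Tuple tuple = cloudstream.read();
>> >> >     if (tuple.EOF) {
>> >> >       break;                        // sentinel tuple marks end of stream
>> >> >     }
>> >> >     String a = tuple.getString("fieldA");
>> >> >     // ... consume fieldB / fieldC ...
>> >> >   }
>> >> > } finally {
>> >> >   cloudstream.close();
>> >> > }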
>> >> >
>> >> > I have not looked at the Solr logs from when I started getting the JSON
>> >> > parsing exceptions, but I will let you know what I see the next time I run
>> >> > into the same exceptions.
>> >> >
>> >> > Thanks
>> >> >
>> >> > On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >> >
>> >> >> Hmmm, export is supposed to handle result sets of tens of millions. I know
>> >> >> of a situation where the Streaming Aggregation functionality back-ported
>> >> >> to Solr 4.10 processes data on that scale. So do you have any clue
>> >> >> what exactly is failing? Is there anything in the Solr logs?
>> >> >>
>> >> >> _How_ are you using /export: through Streaming Aggregation (SolrJ) or
>> >> >> just the raw xport handler? It might be worth trying to do this from
>> >> >> SolrJ if you're not; it should be a very quick program to write, just
>> >> >> to test. We're talking 100 lines max.
>> >> >>
>> >> >> You could always roll your own cursor mark stuff by partitioning the
>> >> >> data amongst N threads/processes if you have any reasonable
>> >> >> expectation that you could form filter queries that partition the
>> >> >> result set anywhere near evenly.
>> >> >>
>> >> >> For example, let's say you have a field with random numbers between 0
>> >> >> and 100. You could spin off 10 cursorMark-aware processes each with
>> >> >> its own fq clause like
>> >> >>
>> >> >> fq=partition_field:[0 TO 10}
>> >> >> fq=partition_field:[10 TO 20}
>> >> >> ....
>> >> >> fq=partition_field:[90 TO 100]
>> >> >>
>> >> >> Note the use of inclusive/exclusive end points....
>> >> >>
>> >> >> Each one would be totally independent of all others with no
>> >> >> overlapping documents. And since the fq's would presumably be cached
>> >> >> you should be able to go as fast as you can drive your cluster. Of
>> >> >> course you lose query-wide sorting and the like, if that's important
>> >> >> you'd need to figure something out there.
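>> >> >>
>> >> >> Each worker is just the normal cursorMark loop scoped by its own fq,
>> >> >> something like this (a sketch only, names made up; solrClient is a
>> >> >> CloudSolrClient):
>> >> >>
>> >> >> SolrQuery q = new SolrQuery("*:*");
>> >> >> q.addFilterQuery("partition_field:[0 TO 10}");   // this worker's slice
>> >> >> q.setRows(100000);
>> >> >> q.setFields("id,fieldA,fieldB");
>> >> >> q.setSort(SolrQuery.SortClause.asc("id"));
>> >> >>
>> >> >> String mark = CursorMarkParams.CURSOR_MARK_START;
>> >> >> while (true) {
>> >> >>   q.set(CursorMarkParams.CURSOR_MARK_PARAM, mark);
>> >> >>   QueryResponse rsp = solrClient.query(q);
>> >> >>   // process rsp.getResults() for this slice ...
>> >> >>   String next = rsp.getNextCursorMark();
>> >> >>   if (mark.equals(next)) break;
>> >> >>   mark = next;
>> >> >> }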
>> >> >>
>> >> >> Do be aware of a potential issue. When regular doc fields are
>> >> >> returned, for each document returned, a 16K block of data will be
>> >> >> decompressed to get the stored field data. Streaming Aggregation
>> >> >> (/xport) reads docValues entries, which are held in MMapDirectory space,
>> >> >> so it will be much, much faster. As of Solr 5.5 you can override the
>> >> >> decompression stuff, see:
>> >> >> https://issues.apache.org/jira/browse/SOLR-8220 for fields that are
>> >> >> both stored and docValues...
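>> >> >>
>> >> >> If I remember right, the knob SOLR-8220 added is a schema attribute,
>> >> >> roughly like this (attribute name from memory, double-check the JIRA):
>> >> >>
>> >> >> <field name="fieldA" type="string" indexed="true" stored="true"
>> >> >>        docValues="true" useDocValuesAsStored="true"/>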
>> >> >>
>> >> >> Best,
>> >> >> Erick
>> >> >>
>> >> >> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
>> >> >> > Thanks Yonik for the explanation.
>> >> >> >
>> >> >> > Hi Erick,
>> >> >> > I was using the /xport functionality. But it hasn't been stable (Solr
>> >> >> > 5.5.0). I started running into runtime exceptions (JSON parsing
>> >> >> > exceptions) while reading the stream of Tuples. This started happening as
>> >> >> > the size of my collection increased 3 times and I started running queries
>> >> >> > that return millions of documents (>10mm). I don't know if it is the query
>> >> >> > result size or the actual data size (total number of docs in the
>> >> >> > collection) that is causing the instability.
>> >> >> >
>> >> >> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
>> >> >> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
>> >> >> > I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
>> >> >> > 0lG99sHT8P5e'
>> >> >> >
>> >> >> > I won't be able to move to Solr 6.0 due to some constraints in our
>> >> >> > production environment and hence am moving back to the cursor approach.
>> >> >> > Do you have any other suggestion for me?
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Chetas.
>> >> >> >
>> >> >> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >> >> >
>> >> >> >> Have you considered the /xport functionality?
>> >> >> >>
>> >> >> >> > On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ysee...@gmail.com> wrote:
>> >> >> >> > No, you can't get cursor-marks ahead of time.
>> >> >> >> > They are the serialized representation of the last sort values
>> >> >> >> > encountered (hence not known ahead of time).
>> >> >> >> >
>> >> >> >> > -Yonik
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
>> >> >> >> >> Hi,
>> >> >> >> >>
>> >> >> >> >> I am using the cursor approach to fetch results from Solr (5.5.0). Most of
>> >> >> >> >> my queries return millions of results. Is there a way I can read the pages
>> >> >> >> >> in parallel? Is there a way I can get all the cursors well in advance?
>> >> >> >> >>
>> >> >> >> >> Let's say my query returns 2M documents and I have set rows=100,000.
>> >> >> >> >> Can I have multiple threads iterating over different pages like
>> >> >> >> >> Thread1 -> docs 1 to 100K
>> >> >> >> >> Thread2 -> docs 101K to 200K
>> >> >> >> >> ......
>> >> >> >> >> ......
>> >> >> >> >>
>> >> >> >> >> For this to happen, can I get all the cursorMarks for a given query so that
>> >> >> >> >> I can leverage the following code in parallel?
>> >> >> >> >>
>> >> >> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
>> >> >> >> >> val rsp: QueryResponse = c.query(cursorQ)
>> >> >> >> >>
>> >> >> >> >> Thank you,
>> >> >> >> >> Chetas.
>> >> >> >>
>> >> >>
>> >>
>> >
>> >
>>
