I got it when you said to form N queries. I just wanted to try the "get all
cursorMarks first" approach, but I realized it would be very inefficient, as
you said: a cursorMark is the serialized version of the last sort value you
received, so Solr still reads through the results even though "fl" -> null.

I wanted to try that approach because I need everything sorted. In submitting
N queries, I will have to merge-sort the results of the N queries, but that
should be way better than the first approach I tried. Roughly what I have in
mind is sketched below.
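
Here is a minimal sketch of what I mean (the partition field, its bounds,
and fetchPartition are placeholders; it assumes each partition query comes
back already sorted by id, e.g. the cursor loop above with an extra fq
clause):

import org.apache.solr.common.SolrDocument
import scala.collection.mutable
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// N disjoint filter queries over a hypothetical, roughly evenly distributed field.
val partitions = Seq(
  "partition_field:[0 TO 25}",
  "partition_field:[25 TO 50}",
  "partition_field:[50 TO 75}",
  "partition_field:[75 TO 100]")

// Placeholder: runs the cursor-paged query restricted by the given fq and
// returns that partition's documents sorted by id.
def fetchPartition(fq: String): Seq[SolrDocument] = ???

// Run the N partition queries in parallel.
val parts: Seq[Seq[SolrDocument]] =
  Await.result(Future.sequence(partitions.map(p => Future(fetchPartition(p)))), Duration.Inf)

// K-way merge of the per-partition results; since the partitions are disjoint
// and each is sorted by id, the merged output is globally sorted.
def mergeById(parts: Seq[Seq[SolrDocument]]): Seq[SolrDocument] = {
  type Head = (String, Iterator[SolrDocument], SolrDocument)
  val heap = mutable.PriorityQueue.empty[Head](Ordering.by[Head, String](_._1).reverse) // min-heap on id
  parts.map(_.iterator).foreach { it =>
    if (it.hasNext) { val d = it.next(); heap.enqueue((d.getFieldValue("id").toString, it, d)) }
  }
  val out = mutable.Buffer[SolrDocument]()
  while (heap.nonEmpty) {
    val (_, it, d) = heap.dequeue()
    out += d
    if (it.hasNext) { val n = it.next(); heap.enqueue((n.getFieldValue("id").toString, it, n)) }
  }
  out
}

To keep memory bounded I would probably merge page by page rather than
materializing whole partitions, but the idea is the same.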

Thanks!

On Mon, Nov 14, 2016 at 3:58 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> You're executing all the queries just to set up the parallelization before
> even starting, which seems very inefficient. My suggestion doesn't require
> that first step. Perhaps it was confusing because I mentioned "your own
> cursorMark". Really, I meant bypass that entirely: just form N queries
> restricted to N disjoint subsets of the data and process them all in
> parallel, either with /export or /select.
>
> Best,
> Erick
>
> On Mon, Nov 14, 2016 at 3:53 PM, Chetas Joshi <chetas.jo...@gmail.com>
> wrote:
> > Thanks Joel for the explanation.
> >
> > Hi Erick,
> >
> > One of the ways I am trying to parallelize the cursor approach is by
> > iterating the result set twice.
> > (1) Once just to get all the cursor marks
> >
> > val q: SolrQuery = new solrj.SolrQuery()
> > q.set("q", query)
> > q.add("fq", query)
> > q.add("rows", batchSize.toString)
> > q.add("collection", collection)
> > q.add("fl", "null")
> > q.add("sort", "id asc")
> >
> > Here I am not asking for any field values ( "fl" -> null )
> >
> > (2) Once I get all the cursor marks, I can start parallel threads to get
> > the results in parallel.
> >
> > However, the first step in fact takes a lot of time, even more than when
> > I actually iterate through the results with "fl" -> field1, field2,
> > field3.
> >
> > Why is this happening?
> >
> > Thanks!
> >
> >
> > On Thu, Nov 10, 2016 at 8:22 PM, Joel Bernstein <joels...@gmail.com>
> wrote:
> >
> >> Solr 5 was very early days for Streaming Expressions. Streaming
> Expressions
> >> and SQL use Java 8 so development switched to the 6.0 branch five months
> >> before the 6.0 release. So there was a very large jump in features and
> bug
> >> fixes from Solr 5 to Solr 6 in Streaming Expressions.
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >> On Thu, Nov 10, 2016 at 11:14 PM, Joel Bernstein <joels...@gmail.com>
> >> wrote:
> >>
> >> > In Solr 5 the /export handler wasn't escaping json text fields, which
> >> > would produce json parse exceptions. This was fixed in Solr 6.0.
> >> >
> >> > Joel Bernstein
> >> > http://joelsolr.blogspot.com/
> >> >
> >> > On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson <
> erickerick...@gmail.com>
> >> > wrote:
> >> >
> >> >> Hmm, that should work fine. Let us know what the logs show if
> anything
> >> >> because this is weird.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi <chetas.jo...@gmail.com
> >
> >> >> wrote:
> >> >> > Hi Erick,
> >> >> >
> >> >> > This is how I use the streaming approach.
> >> >> >
> >> >> > Here is the solrconfig block.
> >> >> >
> >> >> > <requestHandler name="/export" class="solr.SearchHandler">
> >> >> >     <lst name="invariants">
> >> >> >         <str name="rq">{!xport}</str>
> >> >> >         <str name="wt">xsort</str>
> >> >> >         <str name="distrib">false</str>
> >> >> >     </lst>
> >> >> >     <arr name="components">
> >> >> >         <str>query</str>
> >> >> >     </arr>
> >> >> > </requestHandler>
> >> >> >
> >> >> > And here is the code in which SolrJ is being used.
> >> >> >
> >> >> > String zkHost = args[0];
> >> >> > String collection = args[1];
> >> >> >
> >> >> > Map props = new HashMap();
> >> >> > props.put("q", "*:*");
> >> >> > props.put("qt", "/export");
> >> >> > props.put("sort", "fieldA asc");
> >> >> > props.put("fl", "fieldA,fieldB,fieldC");
> >> >> >
> >> >> > CloudSolrStream cloudstream = new CloudSolrStream(zkHost, collection, props);
> >> >> >
> >> >> > And then I iterate through the cloud stream (TupleStream).
> >> >> > So I am using streaming expressions (SolrJ).
> >> >> >
> >> >> > I have not looked at the Solr logs from when I started getting the
> >> >> > JSON parsing exceptions, but I will let you know what I see the next
> >> >> > time I run into them.
> >> >> >
> >> >> > Thanks
> >> >> >
> >> >> > On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <
> >> erickerick...@gmail.com
> >> >> >
> >> >> > wrote:
> >> >> >
> >> >> >> Hmmm, export is supposed to handle result sets in the tens of
> >> >> >> millions. I know of a situation where the Streaming Aggregation
> >> >> >> functionality, back-ported to Solr 4.10, processes data on that
> >> >> >> scale. So do you have any clue what exactly is failing? Is there
> >> >> >> anything in the Solr logs?
> >> >> >>
> >> >> >> _How_ are you using /export: through Streaming Aggregation (SolrJ)
> >> >> >> or just the raw xport handler? If you're not using SolrJ, it might
> >> >> >> be worth trying this from it; it should be a very quick program to
> >> >> >> write just to test, we're talking 100 lines max.
> >> >> >>
> >> >> >> You could always roll your own cursor mark stuff by partitioning
> the
> >> >> >> data amongst N threads/processes if you have any reasonable
> >> >> >> expectation that you could form filter queries that partition the
> >> >> >> result set anywhere near evenly.
> >> >> >>
> >> >> >> For example, let's say you have a field with random numbers
> between 0
> >> >> >> and 100. You could spin off 10 cursorMark-aware processes each
> with
> >> >> >> its own fq clause like
> >> >> >>
> >> >> >> fq=partition_field:[0 TO 10}
> >> >> >> fq=partition_field:[10 TO 20}
> >> >> >> ....
> >> >> >> fq=partition_field:[90 TO 100]
> >> >> >>
> >> >> >> Note the use of inclusive/exclusive end points....
> >> >> >>
> >> >> >> Each one would be totally independent of all others with no
> >> >> >> overlapping documents. And since the fq's would presumably be
> cached
> >> >> >> you should be able to go as fast as you can drive your cluster. Of
> >> >> >> course you lose query-wide sorting and the like, if that's
> important
> >> >> >> you'd need to figure something out there.
> >> >> >>
> >> >> >> Do be aware of a potential issue. When regular doc fields are
> >> >> >> returned, a 16K block of data has to be decompressed for each
> >> >> >> document to get the stored field data. Streaming Aggregation
> >> >> >> (/xport) reads docValues entries, which are held in MMapDirectory
> >> >> >> space, so it will be much, much faster. As of Solr 5.5 you can
> >> >> >> override the decompression behavior for fields that are both stored
> >> >> >> and docValues; see:
> >> >> >> https://issues.apache.org/jira/browse/SOLR-8220
> >> >> >>
> >> >> >> Best,
> >> >> >> Erick
> >> >> >>
> >> >> >> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <
> chetas.jo...@gmail.com
> >> >
> >> >> >> wrote:
> >> >> >> > Thanks Yonik for the explanation.
> >> >> >> >
> >> >> >> > Hi Erick,
> >> >> >> > I was using the /xport functionality, but it hasn't been stable
> >> >> >> > (Solr 5.5.0). I started running into runtime exceptions (JSON
> >> >> >> > parsing exceptions) while reading the stream of Tuples. This
> >> >> >> > started happening as the size of my collection increased 3 times
> >> >> >> > and I started running queries that return millions of documents
> >> >> >> > (>10mm). I don't know if it is the query result size or the actual
> >> >> >> > data size (total number of docs in the collection) that is causing
> >> >> >> > the instability.
> >> >> >> >
> >> >> >> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
> >> >> >> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
> >> >> >> > I","space":"uuid","timestamp":1 5' AFTER='DB6
> 474294954},{"uuid":"
> >> >> >> > 0lG99sHT8P5e'
> >> >> >> >
> >> >> >> > I won't be able to move to Solr 6.0 due to some constraints in our
> >> >> >> > production environment, and hence I am moving back to the cursor
> >> >> >> > approach. Do you have any other suggestion for me?
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Chetas.
> >> >> >> >
> >> >> >> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <
> >> >> erickerick...@gmail.com
> >> >> >> >
> >> >> >> > wrote:
> >> >> >> >
> >> >> >> >> Have you considered the /xport functionality?
> >> >> >> >>
> >> >> >> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <
> ysee...@gmail.com>
> >> >> wrote:
> >> >> >> >> > No, you can't get cursor-marks ahead of time.
> >> >> >> >> > They are the serialized representation of the last sort
> values
> >> >> >> >> > encountered (hence not known ahead of time).
> >> >> >> >> >
> >> >> >> >> > -Yonik
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <
> >> >> chetas.jo...@gmail.com>
> >> >> >> >> wrote:
> >> >> >> >> >> Hi,
> >> >> >> >> >>
> >> >> >> >> >> I am using the cursor approach to fetch results from Solr
> >> >> (5.5.0).
> >> >> >> Most
> >> >> >> >> of
> >> >> >> >> >> my queries return millions of results. Is there a way I can
> >> read
> >> >> the
> >> >> >> >> pages
> >> >> >> >> >> in parallel? Is there a way I can get all the cursors well
> in
> >> >> >> advance?
> >> >> >> >> >>
> >> >> >> >> >> Let's say my query returns 2M documents and I have set
> >> >> rows=100,000.
> >> >> >> >> >> Can I have multiple threads iterating over different pages
> like
> >> >> >> >> >> Thread1 -> docs 1 to 100K
> >> >> >> >> >> Thread2 -> docs 101K to 200K
> >> >> >> >> >> ......
> >> >> >> >> >> ......
> >> >> >> >> >>
> >> >> >> >> >> for this to happen, can I get all the cursorMarks for a
> given
> >> >> query
> >> >> >> so
> >> >> >> >> that
> >> >> >> >> >> I can leverage the following code in parallel
> >> >> >> >> >>
> >> >> >> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> >> >> >> >> >> val rsp: QueryResponse = c.query(cursorQ)
> >> >> >> >> >>
> >> >> >> >> >> Thank you,
> >> >> >> >> >> Chetas.
> >> >> >> >>
> >> >> >>
> >> >>
> >> >
> >> >
> >>
>
