Hi Erick,

This is how I use the streaming approach.
Here is the solrconfig block:

    <requestHandler name="/export" class="solr.SearchHandler">
      <lst name="invariants">
        <str name="rq">{!xport}</str>
        <str name="wt">xsort</str>
        <str name="distrib">false</str>
      </lst>
      <arr name="components">
        <str>query</str>
      </arr>
    </requestHandler>

And here is the code in which SolrJ is being used:

    String zkHost = args[0];
    String collection = args[1];

    Map<String, String> props = new HashMap<>();
    props.put("q", "*:*");
    props.put("qt", "/export");
    props.put("sort", "fieldA asc");
    props.put("fl", "fieldA,fieldB,fieldC");

    CloudSolrStream cloudstream = new CloudSolrStream(zkHost, collection, props);

I then iterate through the cloud stream (TupleStream), so I am using streaming expressions (SolrJ); see the sketch below.

I have not looked at the Solr logs from when I started getting the JSON parsing exceptions, but I will let you know what I see the next time I run into them.
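For reference, the read loop over the TupleStream looks roughly like this. It is a minimal sketch against the Solr 5.x streaming API with error handling omitted; the field names match the fl above:

    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

    // zkHost, collection, and props as above
    CloudSolrStream cloudstream = new CloudSolrStream(zkHost, collection, props);
    try {
      cloudstream.open();                    // connects to the /export handler on each shard
      while (true) {
        Tuple tuple = cloudstream.read();    // next tuple, merged in sort order across shards
        if (tuple.EOF) {                     // sentinel tuple marking the end of the stream
          break;
        }
        String fieldA = tuple.getString("fieldA");
        // ... read fieldB and fieldC the same way and process the tuple ...
      }
    } finally {
      cloudstream.close();
    }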
Thanks

On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Hmmm, /export is supposed to handle result sets in the tens of millions.
> I know of a situation where the Streaming Aggregation functionality,
> back-ported to Solr 4.10, processes on that scale. So do you have any
> clue what exactly is failing? Is there anything in the Solr logs?
>
> _How_ are you using /export: through Streaming Aggregation (SolrJ) or
> just the raw xport handler? If you're not using SolrJ, it might be worth
> trying this from SolrJ; it should be a very quick program to write just
> to test, we're talking 100 lines max.
>
> You could always roll your own cursorMark stuff by partitioning the
> data amongst N threads/processes, if you have any reasonable
> expectation that you could form filter queries that partition the
> result set anywhere near evenly.
>
> For example, let's say you have a field with random numbers between 0
> and 100. You could spin off 10 cursorMark-aware processes, each with
> its own fq clause, like:
>
> fq=partition_field:[0 TO 10}
> fq=partition_field:[10 TO 20}
> ....
> fq=partition_field:[90 TO 100]
>
> Note the use of inclusive/exclusive end points....
>
> Each one would be totally independent of all the others, with no
> overlapping documents. And since the fq's would presumably be cached,
> you should be able to go as fast as you can drive your cluster. Of
> course you lose query-wide sorting and the like; if that's important,
> you'd need to figure something out there.
>
> Do be aware of a potential issue. When regular doc fields are
> returned, a 16K block of data is decompressed for each returned
> document to get the stored field data. Streaming Aggregation
> (/xport) reads docValues entries, which are held in MMapDirectory
> space, so it will be much, much faster. As of Solr 5.5 you can override
> the decompression stuff for fields that are both stored and docValues;
> see https://issues.apache.org/jira/browse/SOLR-8220.
>
> Best,
> Erick
>
> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> > Thanks Yonik for the explanation.
> >
> > Hi Erick,
> > I was using the /xport functionality, but it hasn't been stable (Solr
> > 5.5.0). I started running into runtime exceptions (JSON parsing
> > exceptions) while reading the stream of Tuples. This started happening
> > as the size of my collection increased 3x and I started running queries
> > that return millions of documents (>10 million). I don't know if it is
> > the query result size or the actual data size (total number of docs in
> > the collection) that is causing the instability.
> >
> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/I","space":"uuid","timestamp":15' AFTER='DB6474294954},{"uuid":"0lG99sHT8P5e'
> >
> > I won't be able to move to Solr 6.0 due to some constraints in our
> > production environment, and hence I am moving back to the cursor
> > approach. Do you have any other suggestions for me?
> >
> > Thanks,
> > Chetas.
> >
> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Have you considered the /xport functionality?
> >>
> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ysee...@gmail.com> wrote:
> >> > No, you can't get cursor-marks ahead of time.
> >> > They are the serialized representation of the last sort values
> >> > encountered (hence not known ahead of time).
> >> >
> >> > -Yonik
> >> >
> >> >
> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> >> >> Hi,
> >> >>
> >> >> I am using the cursor approach to fetch results from Solr (5.5.0).
> >> >> Most of my queries return millions of results. Is there a way I can
> >> >> read the pages in parallel? Is there a way I can get all the cursors
> >> >> well in advance?
> >> >>
> >> >> Let's say my query returns 2M documents and I have set rows=100,000.
> >> >> Can I have multiple threads iterating over different pages, like
> >> >> Thread1 -> docs 1 to 100K
> >> >> Thread2 -> docs 101K to 200K
> >> >> ......
> >> >> ......
> >> >>
> >> >> For this to happen, can I get all the cursorMarks for a given query
> >> >> so that I can leverage the following code in parallel?
> >> >>
> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> >> >> val rsp: QueryResponse = c.query(cursorQ)
> >> >>
> >> >> Thank you,
> >> >> Chetas.
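P.S. If I do end up partitioning the work across cursorMark-aware threads as you suggest above, each worker would look roughly like the sketch below. The partition_field bounds and the "id" uniqueKey are placeholder assumptions; every worker gets its own non-overlapping fq:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    CloudSolrClient client = new CloudSolrClient(zkHost);
    client.setDefaultCollection(collection);

    SolrQuery q = new SolrQuery("*:*");
    q.setRows(100000);
    q.setSort(SolrQuery.SortClause.asc("id"));      // cursorMark requires a sort on the uniqueKey; "id" assumed
    q.addFilterQuery("partition_field:[0 TO 10}");  // this worker's slice; note the exclusive upper bound

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = client.query(q);
      // ... process rsp.getResults() ...
      String next = rsp.getNextCursorMark();
      if (next.equals(cursorMark)) {                // unchanged cursorMark means no more pages
        break;
      }
      cursorMark = next;
    }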