Hmm, that should work fine. Let us know what the logs show, if anything,
because this is weird.

Best,
Erick

On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:
> Hi Erick,
>
> This is how I use the streaming approach.
>
> Here is the solrconfig block.
>
> <requestHandler name="/export" class="solr.SearchHandler">
>     <lst name="invariants">
>         <str name="rq">{!xport}</str>
>         <str name="wt">xsort</str>
>         <str name="distrib">false</str>
>     </lst>
>     <arr name="components">
>         <str>query</str>
>     </arr>
> </requestHandler>
>
> And here is the code in which SolrJ is being used.
>
> String zkHost = args[0];
> String collection = args[1];
>
> Map<String, String> props = new HashMap<>();
> props.put("q", "*:*");
> props.put("qt", "/export");
> props.put("sort", "fieldA asc");
> props.put("fl", "fieldA,fieldB,fieldC");
>
> CloudSolrStream cloudstream = new CloudSolrStream(zkHost, collection, props);
>
> And then I iterate through the cloud stream (TupleStream).
> So I am using streaming expressions (SolrJ).
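>
> Roughly, the iteration looks like this (just a sketch of my loop, with
> error handling trimmed):
>
> try {
>     cloudstream.open();
>     Tuple tuple = cloudstream.read();
>     while (!tuple.EOF) {
>         // use the tuple, e.g. tuple.getString("fieldA")
>         tuple = cloudstream.read();
>     }
> } finally {
>     cloudstream.close();
> }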
>
> I have not looked at the Solr logs from when I started getting the JSON
> parsing exceptions, but I will let you know what I see the next time I run
> into the same exceptions.
>
> Thanks
>
> On Sat, Nov 5, 2016 at 9:32 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Hmmm, export is supposed to handle result sets in the tens of millions. I
>> know of a situation where the Streaming Aggregation functionality,
>> back-ported to Solr 4.10, processes at that scale. So do you have any clue
>> what exactly is failing? Is there anything in the Solr logs?
>>
>> _How_ are you using /export: through Streaming Aggregation (SolrJ) or
>> just the raw xport handler? If you're not using SolrJ, it might be worth
>> trying this from there; it should be a very quick program to write just
>> to test, we're talking 100 lines max.
>>
>> You could always roll your own cursor mark stuff by partitioning the
>> data amongst N threads/processes if you have any reasonable
>> expectation that you could form filter queries that partition the
>> result set anywhere near evenly.
>>
>> For example, let's say you have a field with random numbers between 0
>> and 100. You could spin off 10 cursorMark-aware processes each with
>> its own fq clause like
>>
>> fq=partition_field:[0 TO 10}
>> fq=partition_field:[10 TO 20}
>> ....
>> fq=partition_field:[90 TO 100]
>>
>> Note the use of inclusive/exclusive end points....
>>
>> Each one would be totally independent of all the others, with no
>> overlapping documents. And since the fq's would presumably be cached, you
>> should be able to go as fast as you can drive your cluster. Of course you
>> lose query-wide sorting and the like; if that's important, you'd need to
>> figure something out there.
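>>
>> A minimal sketch of one such worker in SolrJ (assuming "client" is a
>> CloudSolrClient, "id" is your uniqueKey, and "partition_field" is just an
>> illustrative name):
>>
>> SolrQuery q = new SolrQuery("*:*");
>> q.addFilterQuery("partition_field:[0 TO 10}"); // this worker's slice
>> q.setRows(100000);
>> q.setSort(SolrQuery.SortClause.asc("id"));     // cursorMark needs a sort on the uniqueKey
>> String cursorMark = CursorMarkParams.CURSOR_MARK_START;
>> boolean done = false;
>> while (!done) {
>>     q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
>>     QueryResponse rsp = client.query(q);
>>     // process rsp.getResults() here
>>     String nextMark = rsp.getNextCursorMark();
>>     done = cursorMark.equals(nextMark);
>>     cursorMark = nextMark;
>> }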
>>
>> Do be aware of a potential issue. When regular doc fields are returned, a
>> 16K block of data is decompressed for each document returned in order to
>> get the stored field data. Streaming Aggregation (/xport) reads docValues
>> entries, which are held in MMapDirectory space, so it will be much, much
>> faster. As of Solr 5.5 you can override the decompression stuff; see
>> https://issues.apache.org/jira/browse/SOLR-8220 for fields that are
>> both stored and docValues...
>>
>> Best,
>> Erick
>>
>> On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi <chetas.jo...@gmail.com>
>> wrote:
>> > Thanks Yonik for the explanation.
>> >
>> > Hi Erick,
>> > I was using the /xport functionality, but it hasn't been stable (Solr
>> > 5.5.0). I started running into runtime exceptions (JSON parsing
>> > exceptions) while reading the stream of Tuples. This started happening
>> > after the size of my collection tripled and I started running queries
>> > that return millions of documents (>10MM). I don't know if it is the
>> > query result size or the actual data size (total number of docs in the
>> > collection) that is causing the instability.
>> >
>> > org.noggit.JSONParser$ParseException: Expected ',' or '}':
>> > char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
>> > I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
>> > 0lG99sHT8P5e'
>> >
>> > I won't be able to move to Solr 6.0 due to some constraints in our
>> > production environment, so I am moving back to the cursor approach. Do
>> > you have any other suggestions for me?
>> >
>> > Thanks,
>> > Chetas.
>> >
>> > On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson <erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Have you considered the /xport functionality?
>> >>
>> >> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley <ysee...@gmail.com> wrote:
>> >> > No, you can't get cursor-marks ahead of time.
>> >> > They are the serialized representation of the last sort values
>> >> > encountered (hence not known ahead of time).
>> >> >
>> >> > -Yonik
>> >> >
>> >> >
>> >> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi <chetas.jo...@gmail.com>
>> >> wrote:
>> >> >> Hi,
>> >> >>
>> >> >> I am using the cursor approach to fetch results from Solr (5.5.0).
>> >> >> Most of my queries return millions of results. Is there a way I can
>> >> >> read the pages in parallel? Is there a way I can get all the cursors
>> >> >> well in advance?
>> >> >>
>> >> >> Let's say my query returns 2M documents and I have set rows=100,000.
>> >> >> Can I have multiple threads iterating over different pages like
>> >> >> Thread1 -> docs 1 to 100K
>> >> >> Thread2 -> docs 101K to 200K
>> >> >> ......
>> >> >> ......
>> >> >>
>> >> >> For this to happen, can I get all the cursorMarks for a given query
>> >> >> so that I can leverage the following code in parallel?
>> >> >>
>> >> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
>> >> >> val rsp: QueryResponse = c.query(cursorQ)
>> >> >>
>> >> >> Thank you,
>> >> >> Chetas.
>> >>
>>
