Re: Parallelize Cursor approach

2016-11-14 Thread Chetas Joshi
I got it when you said "form N queries." I just wanted to try the "get all cursorMarks first" approach, but I realized it would be very inefficient, as you said, since a cursorMark is a serialized version of the last sort value you received, and hence you are still reading the results from Solr although

Re: Parallelize Cursor approach

2016-11-14 Thread Erick Erickson
You're executing all the queries to parallelize before even starting, which seems very inefficient. My suggestion doesn't require this first step. Perhaps it was confusing because I mentioned "your own cursorMark"; really I meant bypass that entirely and just form N queries that were restricted to N
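A minimal sketch of that "N independent queries" idea. The exact restriction Erick describes is cut off in the archive preview, so this assumes the common variant: N disjoint filter queries (here, ranges on a hypothetical numeric field "bucket"), each walked by its own worker with its own cursorMark. The ZK host, collection, field names, and sizing are placeholders, not anything from the thread.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class ParallelSlices {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zkhost1:2181");   // hypothetical ZK ensemble
    String collection = "collection1";                              // placeholder collection
    String query = "*:*";                                           // placeholder query
    int numSlices = 4;
    int sliceWidth = 1000;                                          // hypothetical width of each bucket range

    ExecutorService executor = Executors.newFixedThreadPool(numSlices);
    for (int i = 0; i < numSlices; i++) {
      final int slice = i;
      executor.submit(() -> {
        SolrQuery q = new SolrQuery(query);
        // Restrict this worker to one disjoint slice of the data (assumed field "bucket").
        q.addFilterQuery("bucket:[" + slice * sliceWidth + " TO " + (slice * sliceWidth + sliceWidth - 1) + "]");
        q.setRows(100000);
        q.setSort(SolrQuery.SortClause.asc("id"));                  // cursors need a sort ending on the uniqueKey
        String mark = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
          q.set(CursorMarkParams.CURSOR_MARK_PARAM, mark);
          QueryResponse rsp = client.query(collection, q);
          // process rsp.getResults() here ...
          String next = rsp.getNextCursorMark();
          if (mark.equals(next)) break;                             // same mark back => this slice is done
          mark = next;
        }
        return null;                                                // Callable, so checked exceptions are allowed
      });
    }
    executor.shutdown();
  }
}

Each worker pages through only its own slice, so no cursorMarks need to be known in advance and Solr never re-reads another worker's portion of the result set.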

Re: Parallelize Cursor approach

2016-11-14 Thread Chetas Joshi
Thanks, Joel, for the explanation. Hi Erick, one of the ways I am trying to parallelize the cursor approach is by iterating over the result set twice: (1) once just to get all the cursor marks. val q: SolrQuery = new solrj.SolrQuery() q.set("q", query) q.add("fq", query) q.add("rows",
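For reference, that first pass looks roughly like the helper below in SolrJ (Java here rather than the Scala above; the client, collection, query, and "id" uniqueKey are assumed placeholders, and the method would live in whatever class drives the export). It collects every cursorMark by paging through the whole result set once with fl limited to the uniqueKey, which is exactly the extra full read that makes this approach expensive.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

/** Pass 1: walk the whole result set once, recording the cursorMark that starts each non-empty page. */
static List<String> collectCursorMarks(SolrClient client, String collection, String query, int rows) throws Exception {
  List<String> marks = new ArrayList<>();
  SolrQuery q = new SolrQuery(query);
  q.setRows(rows);
  q.setFields("id");                               // uniqueKey only, to keep this pass as cheap as possible
  q.setSort(SolrQuery.SortClause.asc("id"));       // cursors need a sort that ends on the uniqueKey
  String mark = CursorMarkParams.CURSOR_MARK_START;
  while (true) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, mark);
    QueryResponse rsp = client.query(collection, q);
    if (!rsp.getResults().isEmpty()) {
      marks.add(mark);                             // this mark starts a non-empty page
    }
    String next = rsp.getNextCursorMark();
    if (mark.equals(next)) break;                  // cursor exhausted
    mark = next;
  }
  return marks;
}

Pass 2 (not shown) would hand each recorded mark to a worker that re-runs the query with the full field list, but as the thread points out, pass 1 already streams the entire result set out of Solr before any parallel work begins.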

Re: Parallelize Cursor approach

2016-11-10 Thread Joel Bernstein
Solr 5 was very early days for Streaming Expressions. Streaming Expressions and SQL use Java 8 so development switched to the 6.0 branch five months before the 6.0 release. So there was a very large jump in features and bug fixes from Solr 5 to Solr 6 in Streaming Expressions. Joel Bernstein

Re: Parallelize Cursor approach

2016-11-10 Thread Joel Bernstein
In Solr 5 the /export handler wasn't escaping JSON text fields, which would produce JSON parse exceptions. This was fixed in Solr 6.0. Joel Bernstein http://joelsolr.blogspot.com/ On Tue, Nov 8, 2016 at 6:17 PM, Erick Erickson wrote: > Hmm, that should work fine. Let

Re: Parallelize Cursor approach

2016-11-08 Thread Erick Erickson
Hmm, that should work fine. Let us know what the logs show, if anything, because this is weird. Best, Erick On Tue, Nov 8, 2016 at 1:00 PM, Chetas Joshi wrote: > Hi Erick, > > This is how I use the streaming approach. > > Here is the solrconfig block.

Re: Parallelize Cursor approach

2016-11-08 Thread Chetas Joshi
Hi Erick, this is how I use the streaming approach. Here is the solrconfig block: {!xport} xsort false query. And here is the code in which SolrJ is being used: String zkHost = args[0]; String collection = args[1]; Map props = new
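The solrconfig fragment above has had its XML tags stripped by the archive; the surviving values ({!xport}, xsort, false, query) match the stock Solr 5.x /export handler definition (rq={!xport}, wt=xsort, distrib=false, components=[query]). The SolrJ side typically looks something like the sketch below, using the Solr 5.x CloudSolrStream(zkHost, collection, Map) constructor; this is a minimal reconstruction under those assumptions, not Chetas's exact code, and the query, sort, and field names are placeholders.

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

public class ExportStreamExample {
  public static void main(String[] args) throws Exception {
    String zkHost = args[0];
    String collection = args[1];

    // /export requires every returned field to have docValues and a sort over returned fields.
    Map<String, String> props = new HashMap<>();
    props.put("q", "*:*");
    props.put("qt", "/export");
    props.put("fl", "id");
    props.put("sort", "id asc");

    CloudSolrStream stream = new CloudSolrStream(zkHost, collection, props);
    stream.open();
    try {
      while (true) {
        Tuple tuple = stream.read();
        if (tuple.EOF) {
          break;                         // end of the stream
        }
        String id = tuple.getString("id");
        // process each tuple here ...
      }
    } finally {
      stream.close();
    }
  }
}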

Re: Parallelize Cursor approach

2016-11-05 Thread Erick Erickson
Hmmm, /export is supposed to handle result sets in the tens of millions. I know of a situation where the Streaming Aggregation functionality, back-ported to Solr 4.10, processes on that scale. So do you have any clue what exactly is failing? Is there anything in the Solr logs? _How_ are you using /export,

Re: Parallelize Cursor approach

2016-11-05 Thread Chetas Joshi
Thanks, Yonik, for the explanation. Hi Erick, I was using the /xport functionality, but it hasn't been stable (Solr 5.5.0). I started running into runtime exceptions (JSON parsing exceptions) while reading the stream of Tuples. This started happening as the size of my collection increased 3 times

Re: Parallelize Cursor approach

2016-11-04 Thread Erick Erickson
Have you considered the /xport functionality? On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley wrote: > No, you can't get cursor-marks ahead of time. > They are the serialized representation of the last sort values > encountered (hence not known ahead of time). > > -Yonik > > > On

Re: Parallelize Cursor approach

2016-11-04 Thread Yonik Seeley
No, you can't get cursor-marks ahead of time. They are the serialized representation of the last sort values encountered (hence not known ahead of time). -Yonik On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi wrote: > Hi, > > I am using the cursor approach to fetch results

Parallelize Cursor approach

2016-11-04 Thread Chetas Joshi
Hi, I am using the cursor approach to fetch results from Solr (5.5.0). Most of my queries return millions of results. Is there a way I can read the pages in parallel? Is there a way I can get all the cursors well in advance? Let's say my query returns 2M documents and I have set rows=100,000.
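For context, the "cursor approach" described here is the standard SolrJ deep-paging loop: each request carries the cursorMark returned by the previous response, which is why the marks cannot be known in advance. A minimal sketch (ZK host, collection, query, and the "id" uniqueKey are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorPaging {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zkhost1:2181");   // hypothetical ZK ensemble
    SolrQuery q = new SolrQuery("*:*");                             // placeholder query
    q.setRows(100000);
    q.setSort(SolrQuery.SortClause.asc("id"));                      // sort must end on the uniqueKey
    String mark = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, mark);
      QueryResponse rsp = client.query("collection1", q);
      for (SolrDocument doc : rsp.getResults()) {
        // process each document ...
      }
      String next = rsp.getNextCursorMark();
      if (mark.equals(next)) break;                                 // same mark back => no more results
      mark = next;
    }
    client.close();
  }
}

With rows=100,000 and 2M matching documents, this loop runs about 20 sequential requests, each one dependent on the mark produced by the one before it.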