If you could provide the JSON parse exception stack trace, it might help to pinpoint the issue.
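In the meantime, here is roughly how I would capture it while reading the
TupleStream (an untested sketch against the SolrJ 5.5 API; the zkHost, the
collection name, and the field list below are placeholders, so adjust them
to your setup):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

    public class StreamTrace {
      public static void main(String[] args) throws Exception {
        String zkHost = "zk1:2181,zk2:2181,zk3:2181/solr"; // placeholder
        Map<String, String> props = new HashMap<>();
        props.put("q", "*:*");
        props.put("fl", "id");          // placeholder field list
        props.put("sort", "id asc");
        props.put("qt", "/export");     // stream the full sorted result set

        CloudSolrStream stream =
            new CloudSolrStream(zkHost, "myCollection", props);
        try {
          stream.open();
          while (true) {
            Tuple tuple = stream.read();
            if (tuple.EOF) {
              break;
            }
            // process the tuple here
          }
        } catch (Exception e) {
          e.printStackTrace(); // this is the trace we need to see
        } finally {
          stream.close();
        }
      }
    }

The full trace should show whether the parse fails on bad data or on a
response that was cut off mid-stream.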
On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:

> Hi Joel,
>
> The only non-alphanumeric characters I have in my data are '+' and '/'. I
> don't have any backslashes.
>
> If the special characters were the issue, I should get the JSON parsing
> exceptions every time, irrespective of the index size and of the available
> memory on the machine. That is not the case here. The streaming API
> successfully returns all the documents when the index size is small and
> fits in the available memory. That's the reason I am confused.
>
> Thanks!
>
> On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein <joels...@gmail.com>
> wrote:
>
> > The Streaming API may have been throwing exceptions because the JSON
> > special characters were not escaped. This was fixed in Solr 6.0.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi <chetas.jo...@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I am running Solr 5.5.0.
> > > It is a SolrCloud of 50 nodes and I have the following config for all
> > > the collections:
> > > maxShardsPerNode: 1
> > > replicationFactor: 1
> > >
> > > I was using the Streaming API to get back results from Solr. It worked
> > > fine for a while, until the index data size grew beyond 40 GB per
> > > shard (i.e. per node). It then started throwing JSON parsing
> > > exceptions while reading the TupleStream data. FYI: I have other
> > > services (Yarn, Spark) deployed on the same boxes on which the Solr
> > > shards are running. Spark jobs also use a lot of disk cache, so the
> > > free disk cache available on the boxes varies a lot depending on what
> > > else is running on the box.
> > >
> > > Due to this issue, I moved to using the cursor approach, and it works
> > > fine, but as we all know it is way slower than the streaming approach.
> > >
> > > Currently the index size per shard is 80 GB (the machine has 512 GB
> > > of RAM, shared by different services/programs: heap/off-heap and the
> > > disk cache requirements).
> > >
> > > When I have enough RAM available on the machine (more than 80 GB, so
> > > that all the index data can fit in memory), the streaming API
> > > succeeds without running into any exceptions.
> > >
> > > Questions:
> > > How different is the index data caching mechanism (for HDFS) for the
> > > Streaming API compared with the cursorMark approach?
> > > Why does the cursor work every time, while streaming works only when
> > > there is a lot of free disk cache?
> > >
> > > Thank you.
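For reference, the cursorMark loop being compared against would look
roughly like this (again an untested sketch with SolrJ 5.5; the zkHost and
collection name are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorTrace {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client =
                 new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr")) {
          client.setDefaultCollection("myCollection"); // placeholder
          SolrQuery query = new SolrQuery("*:*");
          query.setRows(1000);
          // cursorMark requires a sort that ends on the uniqueKey field
          query.setSort("id", SolrQuery.ORDER.asc);

          String cursorMark = CursorMarkParams.CURSOR_MARK_START;
          while (true) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = client.query(query);
            for (SolrDocument doc : rsp.getResults()) {
              // process the document here
            }
            String next = rsp.getNextCursorMark();
            if (cursorMark.equals(next)) {
              break; // cursor did not advance: all documents fetched
            }
            cursorMark = next;
          }
        }
      }
    }

One difference worth noting: the cursor makes many small /select requests,
while /export streams the entire sorted result set in one pass, which may
be why streaming is more sensitive to how much of the index fits in cache.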