If you could provide the JSON parse exception stack trace, it might help to pinpoint the issue.
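In the meantime, here is roughly how I would capture it while reading the
TupleStream (an untested sketch against the SolrJ 5.5 API; the zkHost, the
collection name, and the field list below are placeholders, so adjust them
to your setup):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

    public class StreamTrace {
      public static void main(String[] args) throws Exception {
        String zkHost = "zk1:2181,zk2:2181,zk3:2181/solr"; // placeholder
        Map<String, String> props = new HashMap<>();
        props.put("q", "*:*");
        props.put("fl", "id");          // placeholder field list
        props.put("sort", "id asc");
        props.put("qt", "/export");     // stream the full sorted result set

        CloudSolrStream stream =
            new CloudSolrStream(zkHost, "myCollection", props);
        try {
          stream.open();
          while (true) {
            Tuple tuple = stream.read();
            if (tuple.EOF) {
              break;
            }
            // process the tuple here
          }
        } catch (Exception e) {
          e.printStackTrace(); // this is the trace we need to see
        } finally {
          stream.close();
        }
      }
    }

The full trace should show whether the parse fails on bad data or on a
response that was cut off mid-stream.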
On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi <chetas.jo...@gmail.com> wrote:

> Hi Joel,
>
> The only non-alphanumeric characters I have in my data are '+' and '/'. I
> don't have any backslashes.
>
> If the special characters were the issue, I should get the JSON parsing
> exceptions every time, irrespective of the index size and of the available
> memory on the machine. That is not the case here. The streaming API
> successfully returns all the documents when the index size is small and
> fits in the available memory. That's the reason I am confused.
>
> Thanks!
>
> On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein <joels...@gmail.com>
> wrote:
>
> > The Streaming API may have been throwing exceptions because the JSON
> > special characters were not escaped. This was fixed in Solr 6.0.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi <chetas.jo...@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I am running Solr 5.5.0.
> > > It is a SolrCloud of 50 nodes and I have the following config for all
> > > the collections:
> > > maxShardsPerNode: 1
> > > replicationFactor: 1
> > >
> > > I was using the Streaming API to get back results from Solr. It worked
> > > fine for a while, until the index data size grew beyond 40 GB per
> > > shard (i.e. per node). It then started throwing JSON parsing
> > > exceptions while reading the TupleStream data. FYI: I have other
> > > services (Yarn, Spark) deployed on the same boxes on which the Solr
> > > shards are running. Spark jobs also use a lot of disk cache, so the
> > > free disk cache available on the boxes varies a lot depending on what
> > > else is running on the box.
> > >
> > > Due to this issue, I moved to using the cursor approach, and it works
> > > fine, but as we all know it is way slower than the streaming approach.
> > >
> > > Currently the index size per shard is 80 GB (the machine has 512 GB
> > > of RAM, shared by different services/programs: heap/off-heap and the
> > > disk cache requirements).
> > >
> > > When I have enough RAM available on the machine (more than 80 GB, so
> > > that all the index data can fit in memory), the streaming API
> > > succeeds without running into any exceptions.
> > >
> > > Questions:
> > > How different is the index data caching mechanism (for HDFS) for the
> > > Streaming API compared with the cursorMark approach?
> > > Why does the cursor work every time, while streaming works only when
> > > there is a lot of free disk cache?
> > >
> > > Thank you.
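For reference, the cursorMark loop being compared against would look
roughly like this (again an untested sketch with SolrJ 5.5; the zkHost and
collection name are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorTrace {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client =
                 new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr")) {
          client.setDefaultCollection("myCollection"); // placeholder
          SolrQuery query = new SolrQuery("*:*");
          query.setRows(1000);
          // cursorMark requires a sort that ends on the uniqueKey field
          query.setSort("id", SolrQuery.ORDER.asc);

          String cursorMark = CursorMarkParams.CURSOR_MARK_START;
          while (true) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = client.query(query);
            for (SolrDocument doc : rsp.getResults()) {
              // process the document here
            }
            String next = rsp.getNextCursorMark();
            if (cursorMark.equals(next)) {
              break; // cursor did not advance: all documents fetched
            }
            cursorMark = next;
          }
        }
      }
    }

One difference worth noting: the cursor makes many small /select requests,
while /export streams the entire sorted result set in one pass, which may
be why streaming is more sensitive to how much of the index fits in cache.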