Hello,

I am running Solr 5.5.0.
It is a SolrCloud cluster of 50 nodes, and I have the following config for all
the collections:
maxShardsPerNode: 1
replicationFactor: 1
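
For reference, the collections were created along these lines (a minimal SolrJ
sketch; the ZooKeeper hosts and the collection/config names are placeholders,
not my actual ones):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollection {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and names.
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr")) {
            CollectionAdminRequest.Create create = new CollectionAdminRequest.Create();
            create.setCollectionName("my_collection");
            create.setConfigName("my_config");
            create.setNumShards(50);            // one shard per node
            create.setReplicationFactor(1);
            create.setMaxShardsPerNode(1);
            create.process(client);
        }
    }
}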

I was using the Streaming API to get results back from Solr. It worked fine
for a while, until the index data size grew beyond 40 GB per shard (i.e. per
node). At that point it started throwing JSON parsing exceptions while reading
the TupleStream data. FYI: I have other services (Yarn, Spark) deployed on the
same boxes on which the Solr shards are running. Spark jobs also use a lot of
disk cache, so the free disk cache available on a box varies a lot depending
upon what else is running on it.
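
For context, the streaming read is roughly the following (a minimal sketch of
how I use CloudSolrStream; the ZooKeeper hosts, collection, query, and field
names are placeholders for my actual ones):

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        // Placeholder query/fields; /export streams the whole sorted result set.
        Map<String, String> props = new HashMap<>();
        props.put("q", "*:*");
        props.put("fl", "id,field_a");
        props.put("sort", "id asc");
        props.put("qt", "/export");

        CloudSolrStream stream =
                new CloudSolrStream("zk1:2181,zk2:2181,zk3:2181/solr", "my_collection", props);
        try {
            stream.open();
            while (true) {
                Tuple tuple = stream.read();
                if (tuple.EOF) {
                    break;              // end-of-stream marker
                }
                // process the tuple, e.g. tuple.getString("id")
            }
        } finally {
            stream.close();
        }
    }
}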

Due to this issue, I moved to the cursor approach, and it works fine, but as
we all know it is much slower than the streaming approach.
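
The cursor-based fetch I switched to looks roughly like this (again a sketch
with the same placeholder names; the sort includes the uniqueKey field as
cursorMark requires):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorRead {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr")) {
            client.setDefaultCollection("my_collection");

            SolrQuery query = new SolrQuery("*:*");
            query.setFields("id", "field_a");
            query.setRows(10000);
            query.setSort(SolrQuery.SortClause.asc("id")); // sort on the uniqueKey

            String cursorMark = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse response = client.query(query);
                for (SolrDocument doc : response.getResults()) {
                    // process the document
                }
                String nextCursorMark = response.getNextCursorMark();
                if (cursorMark.equals(nextCursorMark)) {
                    break;  // cursor did not advance: all results consumed
                }
                cursorMark = nextCursorMark;
            }
        }
    }
}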

Currently the index size per shard is 80 GB. (Each machine has 512 GB of RAM,
which is shared by the different services/programs: their heap, off-heap, and
disk cache requirements.)

When there is enough RAM available on the machine (more than 80 GB, so that
all the index data can fit in memory), the Streaming API succeeds without
running into any exceptions.

Questions:
How does the index data caching mechanism (for HDFS) differ between the
Streaming API and the cursorMark approach?
Why does the cursor approach work every time, while streaming works only when
there is a lot of free disk cache?

Thank you.
