Hello, I am running Solr 5.5.0 on a 50-node SolrCloud, with the following config for all collections: maxShardsPerNode: 1, replicationFactor: 1.
I was using the Streaming API to pull results back from Solr. It worked fine until the index grew beyond roughly 40 GB per shard (i.e. per node), at which point it started throwing JSON parsing exceptions while reading the TupleStream data. FYI: other services (YARN, Spark) are deployed on the same boxes that host the Solr shards, and the Spark jobs also use a lot of disk cache, so the free disk cache available on a box varies a lot depending on what else is running on it.

Because of this issue I switched to the cursorMark approach, which works fine, but as we all know it is much slower than streaming. The index is currently 80 GB per shard (each machine has 512 GB of RAM, shared by the various services/programs: heap, off-heap, and disk cache). When enough RAM is free on the machine (more than 80 GB, so that all the index data can fit in memory), the Streaming API succeeds without running into any exceptions. For reference, the two sketches at the end of this post show roughly how I call each API.

Question: how does the index data caching mechanism (for HDFS) differ between the Streaming API and the cursorMark approach? Why does the cursor work every time, while streaming works only when there is a lot of free disk cache?

Thank you.
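Here is a minimal sketch of the streaming read, assuming SolrJ 5.5 on the classpath; the ZooKeeper address, collection name, and field list are placeholders:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

public class StreamingRead {
  public static void main(String[] args) throws Exception {
    // Placeholder ZooKeeper ensemble and collection name.
    String zkHost = "zk1:2181,zk2:2181,zk3:2181/solr";
    String collection = "my_collection";

    // The /export handler streams the full sorted result set;
    // the fl/sort fields must be docValues fields.
    Map<String, String> props = new HashMap<>();
    props.put("q", "*:*");
    props.put("fl", "id,field_a");
    props.put("sort", "id asc");
    props.put("qt", "/export");

    CloudSolrStream stream = new CloudSolrStream(zkHost, collection, props);
    try {
      stream.open();
      while (true) {
        Tuple tuple = stream.read();
        if (tuple.EOF) {
          break;  // sentinel tuple marks the end of the stream
        }
        // process tuple.getString("id"), etc.
      }
    } finally {
      stream.close();
    }
  }
}
```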
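And the cursorMark loop that works reliably, again a sketch with placeholder values, following the standard SolrJ cursor pattern:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorRead {
  public static void main(String[] args) throws Exception {
    // Placeholder ZooKeeper ensemble and collection name.
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
    client.setDefaultCollection("my_collection");

    SolrQuery q = new SolrQuery("*:*");
    q.setRows(1000);                              // page size
    q.setSort(SolrQuery.SortClause.asc("id"));    // cursor requires a sort on the uniqueKey

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;  // "*"
    boolean done = false;
    while (!done) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = client.query(q);
      for (SolrDocument doc : rsp.getResults()) {
        // process doc.getFieldValue("id"), etc.
      }
      String nextCursorMark = rsp.getNextCursorMark();
      // when the cursor stops advancing, the result set is exhausted
      done = cursorMark.equals(nextCursorMark);
      cursorMark = nextCursorMark;
    }
    client.close();
  }
}
```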