Re: Solr on HDFS: increase in query time with increase in data

Chetas Joshi Fri, 16 Dec 2016 10:58:55 -0800

Thank you, everyone. I will add nodes to the SolrCloud cluster and split the
shards.


Shawn,

Thank you for explaining why putting index data on the local file system
could be a better idea than using HDFS. I need to find out how HDFS caches
the index files in a resource-constrained environment.

I would also like to add that when I use the Streaming API instead of the
cursor approach, it runs into JSON parsing exceptions whenever my nodes
(running Solr shards) don't have enough RAM to fit the entire index into
memory. FYI: I have other services (Yarn, Spark) deployed on the same
boxes as well, and the Spark jobs also use a lot of disk cache.
When I have enough RAM (more than 70 GB, so that all the index data can
fit in memory), the Streaming API succeeds without any exceptions. How does
the index data caching mechanism for the Streaming API differ from that of
the cursor approach?

Thanks!



On Fri, Dec 16, 2016 at 6:52 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 12/14/2016 11:58 AM, Chetas Joshi wrote:
> > I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> > the following config.
> > maxShardsperNode: 1
> > replicationFactor: 1
> >
> > I have been ingesting data into Solr for the last 3 months. With increase
> > in data, I am observing increase in the query time. Currently the size of
> > my indices is 70 GB per shard (i.e. per node).
>
> Query times will increase as the index size increases, but significant
> jumps in the query time may be an indication of a performance problem.
> Performance problems are usually caused by insufficient resources,
> memory in particular.
>
> With HDFS, I am honestly not sure *where* the cache memory is needed.  I
> would assume that it's needed on the HDFS hosts, that a lot of spare
> memory on the Solr (HDFS client) hosts probably won't make much
> difference.  I could be wrong -- I have no idea what kind of caching
> HDFS does.  If the HDFS client can cache data, then you probably would
> want extra memory on the Solr machines.
>
> > I am using cursor approach (/export handler) using SolrJ client to get back
> > results from Solr. All the fields I am querying on and all the fields that
> > I get back from Solr are indexed and have docValues enabled as well. What
> > could be the reason behind increase in query time?
>
> If actual disk access is required to satisfy a query, Solr is going to
> be slow.  Caching is absolutely required for good performance.  If your
> query times are really long but used to be short, chances are that your
> index size has exceeded your system's ability to cache it effectively.
>
> One thing to keep in mind:  Gigabit Ethernet is comparable in speed to
> the sustained transfer rate of a single modern SATA magnetic disk, so if
> the data has to traverse a gigabit network, it probably will be nearly
> as slow as it would be if it were coming from a single disk.  Having a
> 10gig network for your storage is probably a good idea ... but current
> fast memory chips can leave 10gig in the dust, so if the data can come
> from cache and the chips are new enough, then it can be faster than
> network storage.
>
> Because the network can be a potential bottleneck, I strongly recommend
> putting index data on local disks.  If you have enough memory, the disk
> doesn't even need to be super-fast.
>
> > Has this got something to do with the OS disk cache that is used for
> > loading the Solr indices? When a query is fired, will Solr wait for all
> > (70GB) of disk cache being available so that it can load the index file?
>
> Caching the files on the disk is not handled by Solr, so Solr won't wait
> for the entire index to be cached unless the underlying storage waits
> for some reason.  The caching is usually handled by the OS.  For HDFS,
> it might be handled by a combination of the OS and Hadoop, but I don't
> know enough about HDFS to comment.  Solr makes a request for the parts
> of the index files that it needs to satisfy the request.  If the
> underlying system is capable of caching the data, if that feature is
> enabled, and if there's memory available for that purpose, then it gets
> cached.
>
> Thanks,
> Shawn
>
>
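
To put rough numbers on the bandwidth comparison in the quoted reply
(Gigabit Ethernet vs. a single SATA disk vs. memory), here is a
back-of-the-envelope sketch. The network figures are theoretical line rates
(bits per second divided by 8); the disk and RAM throughputs are assumed
ballpark values, not measurements from this cluster, and real HDFS reads
will be slower because of protocol overhead.

```python
# Rough time to read one 70 GB shard once at various sustained throughputs.
INDEX_GB = 70  # per-shard index size mentioned in the thread

# Throughputs in MB/s. Network values are theoretical line rates;
# the disk and RAM values are ASSUMED ballpark figures for illustration.
THROUGHPUT_MB_S = {
    "1 GbE (line rate)": 1_000 / 8,        # 1000 Mbit/s -> 125 MB/s
    "SATA magnetic disk (assumed)": 150,
    "10 GbE (line rate)": 10_000 / 8,      # 10000 Mbit/s -> 1250 MB/s
    "RAM (assumed)": 10_000,
}


def seconds_to_read(size_gb: float, mb_per_s: float) -> float:
    """Seconds to read size_gb (decimal GB) at mb_per_s (MB/s)."""
    return size_gb * 1000 / mb_per_s


for name, rate in THROUGHPUT_MB_S.items():
    secs = seconds_to_read(INDEX_GB, rate)
    print(f"{name:30s} {rate:8.0f} MB/s  {secs:8.1f} s")
```

Under these assumptions, a full pass over an uncached 70 GB shard takes
minutes over gigabit Ethernet or from a single disk, about a minute over
10gig, and only seconds from memory, which is why the amount of cache
available on each node matters so much.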
