The only other options would be to configure Solr with a replication
factor of 1, or HDFS with no replication. I would go for the middle
ground and configure both with a factor of 2. That way a single failure
in either HDFS or Solr is not a problem, whereas with a 1/3 or 3/1
setup a single server failure could bring the collection down.
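A middle-ground setup like that can be requested on the Solr side when
the collection is created, via the Collections API (the host, collection
name, and shard count below are just example values):

```shell
# Create a collection with 2 SolrCloud replicas per shard.
# "mycollection" and numShards=2 are placeholders; replicationFactor=2
# is the Solr half of the 2/2 setup described above.
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2"
```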
Setting the HDFS replication factor is a bit tricky: in some places
Solr uses the default replication factor configured on HDFS, and in
others it takes the default from the client side. HDFS allows you to
set a replication factor for every file individually.
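For index files that already exist, the per-file replication factor can
be changed afterwards with the HDFS shell (the path below is just an
example for Solr's HDFS home directory):

```shell
# Set the replication factor to 2 for everything under /solr.
# -w waits until the re-replication has actually completed.
hdfs dfs -setrep -w 2 /solr
```

The client-side default mentioned above comes from dfs.replication in
the hdfs-site.xml that the client reads.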
regards,
Hendrik
On 07.06.2018 15:30, Shawn Heisey wrote:
On 6/7/2018 6:41 AM, Greenhorn Techie wrote:
As HDFS has its own replication mechanism, with an HDFS replication
factor of 3 and then a SolrCloud replication factor of 3, does that
mean each document will probably have around 9 copies replicated
underneath in HDFS? If so, is there a way to configure HDFS or Solr
such that only three copies are maintained overall?
Yes, that is exactly what happens.
SolrCloud replication assumes that each of its replicas is a
completely independent index. I am not aware of anything in Solr's
HDFS support that can use one HDFS index directory for multiple
replicas. At the most basic level, a Solr index is a Lucene index.
Lucene goes to great lengths to make sure that an index *CANNOT* be
used in more than one place.
Perhaps somebody who is more familiar with HDFSDirectoryFactory can
offer you a solution. But as far as I know, there isn't one.
Thanks,
Shawn