On 4/12/2018 4:57 AM, neotorand wrote:
I read from the link you shared that
"Shard cannot contain more than 2 billion documents since Lucene is using
integer for internal IDs."

In which Java class of the Solr implementation repository can this be found?

The 2 billion limit is a *hard* limit from Lucene.  It's not in Solr.  It's pretty much the only hard limit that Lucene actually has - there's a workaround for everything else.  Solr can overcome this limit for a single index by sharding the index into multiple physical indexes across multiple servers, which is more automated in SolrCloud than in standalone mode.

The 2 billion limit per individual index can't be raised.  Lucene uses an "int" datatype to hold the internal ID everywhere it's used.  Java numeric types are signed, which means the maximum value a 32-bit integer can hold is 2147483647.  This is the value returned by the Java constant Integer.MAX_VALUE.  Lucene subtracts a small safety margin from that value to obtain the maximum document count it will actually attempt to use, to be absolutely sure it can't go over.

https://issues.apache.org/jira/browse/LUCENE-5843
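
If you want to see where this surfaces in the API, here's a minimal sketch (assuming lucene-core is on the classpath; IndexWriter.MAX_DOCS is the constant added by LUCENE-5843 above, and its exact value could differ between Lucene versions):

  import org.apache.lucene.index.IndexWriter;

  public class DocLimit {
      public static void main(String[] args) {
          // Lucene's internal document IDs are Java ints, so the ceiling is:
          System.out.println(Integer.MAX_VALUE);    // 2147483647
          // IndexWriter stops a little short of that ceiling on purpose:
          System.out.println(IndexWriter.MAX_DOCS); // Integer.MAX_VALUE minus a small safety margin
      }
  }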

Raising the limit is theoretically possible, but not without *MAJOR* surgery to an extremely large portion of Lucene's code.  The risk of introducing bugs with that change is *VERY* high -- it could literally take months to find them all and fix them.

The two most popular search engines using Lucene are Solr and elasticsearch. Both of these packages can overcome the 2 billion limit with sharding.
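
As a rough illustration only (a SolrJ sketch with a made-up collection name and ZooKeeper address, not something taken from your setup), creating a SolrCloud collection with several shards gives each shard its own Lucene index, and each of those indexes has its own 2 billion ceiling:

  import java.util.Collections;
  import java.util.Optional;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.client.solrj.request.CollectionAdminRequest;

  public class CreateShardedCollection {
      public static void main(String[] args) throws Exception {
          // Hypothetical ZooKeeper address and collection name, for illustration only.
          try (CloudSolrClient client = new CloudSolrClient.Builder(
                  Collections.singletonList("localhost:9983"), Optional.empty()).build()) {
              // Four shards, one replica each -- every shard is a separate physical index.
              CollectionAdminRequest
                  .createCollection("bigcollection", "_default", 4, 1)
                  .process(client);
          }
      }
  }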

Summary: The 2 billion document limit can be frustrating, but an index that large on a single machine is most likely not going to perform well and should be split across several machines anyway, so there's almost no value in raising the limit and risking a large number of software bugs.

Thanks,
Shawn