On 4/12/2018 4:57 AM, neotorand wrote:
I read from the link you shared that
"Shard cannot contain more than 2 billion documents since Lucene is using
integer for internal IDs."

In which Java class of the Solr implementation repository can this be found?

The 2 billion limit is a *hard* limit from Lucene.  It's not in Solr.  It's pretty much the only hard limit that Lucene actually has - there's a workaround for everything else.  Solr can overcome this limit for a single index by sharding the index into multiple physical indexes across multiple servers, which is more automated in SolrCloud than in standalone mode.

The 2 billion limit per individual index can't be raised.  Lucene uses an "int" datatype to hold the internal ID everywhere it's used.  Java numeric types are signed, which means the maximum value a 32-bit integer can hold is 2147483647.  This is the value returned by the Java constant Integer.MAX_VALUE.  Lucene subtracts a small safety margin from that value to obtain the maximum document count it will actually attempt to use, to be absolutely sure it can't go over.

https://issues.apache.org/jira/browse/LUCENE-5843
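
If you want to see where this surfaces in the API, here's a minimal sketch (assuming lucene-core is on the classpath; IndexWriter.MAX_DOCS is the constant added by LUCENE-5843 above, and its exact value could differ between Lucene versions):

  import org.apache.lucene.index.IndexWriter;

  public class DocLimit {
      public static void main(String[] args) {
          // Lucene's internal document IDs are Java ints, so the ceiling is:
          System.out.println(Integer.MAX_VALUE);    // 2147483647
          // IndexWriter stops a little short of that ceiling on purpose:
          System.out.println(IndexWriter.MAX_DOCS); // Integer.MAX_VALUE minus a small safety margin
      }
  }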

Raising the limit is theoretically possible, but not without *MAJOR* surgery to an extremely large portion of Lucene's code.  The risk of introducing bugs with that change is *VERY* high -- it could literally take months to find them all and fix them.

The two most popular search engines using Lucene are Solr and elasticsearch. Both of these packages can overcome the 2 billion limit with sharding.
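
As a rough illustration only (a SolrJ sketch with a made-up collection name and ZooKeeper address, not something taken from your setup), creating a SolrCloud collection with several shards gives each shard its own Lucene index, and each of those indexes has its own 2 billion ceiling:

  import java.util.Collections;
  import java.util.Optional;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.client.solrj.request.CollectionAdminRequest;

  public class CreateShardedCollection {
      public static void main(String[] args) throws Exception {
          // Hypothetical ZooKeeper address and collection name, for illustration only.
          try (CloudSolrClient client = new CloudSolrClient.Builder(
                  Collections.singletonList("localhost:9983"), Optional.empty()).build()) {
              // Four shards, one replica each -- every shard is a separate physical index.
              CollectionAdminRequest
                  .createCollection("bigcollection", "_default", 4, 1)
                  .process(client);
          }
      }
  }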

Summary: The 2 billion document limit can be frustrating, but an index that large on a single machine is most likely not going to perform well and should be split across several machines anyway, so there's almost no value in raising the limit and risking a large number of software bugs.

Thanks,
Shawn