Hi All,

I too would like to have doc ids larger than int32. Not today, but in four years that would be very nice ;) Already we are splitting some indexes that would be nicer kept together (mostly so that more Lucene code could be used instead of our own).

On the other hand, we are not the default Lucene use case. We index once a month and then have a frozen index. After "freezing" the index we use the Lucene doc ids to link search results to our document storage. We could use a stored field value instead, but so far using the internal Lucene id has been a nice optimization.
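For comparison, the stored-field approach would look something like the minimal sketch below (the "storage_key" field name is hypothetical; searcher.doc() is the classic stored-document lookup):

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class StoredKeyLookup {
        // Resolve hits to external storage keys via a stored field rather than
        // the transient internal doc id (which can change across merges/rebuilds).
        static void printStorageKeys(IndexSearcher searcher, Query query) throws IOException {
            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                Document doc = searcher.doc(hit.doc);       // hit.doc is the internal id
                System.out.println(doc.get("storage_key")); // hypothetical stored field
            }
        }
    }

The extra cost is one stored-field lookup per hit, which is exactly the overhead we currently avoid.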

The closest we come to this maximum is in an index of how our (uniprot.org) database links to other databases. These are stored as very small documents, and we have 892,236,174 of them. We can split this into lots of smaller indexes without too much hassle. On the other hand, it would be even nicer to merge them all into one larger index of roughly 1.5 billion documents, as that would allow us to use the Lucene document joining logic. For now we have our own cross-index joining logic, which is optimized but not optimal.
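For reference, the single-index join we would like to switch to looks roughly like the sketch below; the field names "accession" and "entry_ref" are hypothetical, and JoinUtil expects the from field to be indexed with doc values:

    import java.io.IOException;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.join.JoinUtil;
    import org.apache.lucene.search.join.ScoreMode;

    public class CrossReferenceJoin {
        // Join from the entry documents matching entryQuery to the
        // cross-reference documents that point back at them.
        static TopDocs searchCrossReferences(IndexSearcher searcher, Query entryQuery)
                throws IOException {
            Query joined = JoinUtil.createJoinQuery(
                    "accession",     // fromField (indexed with doc values)
                    false,           // one value per "from" document
                    "entry_ref",     // toField on the cross-reference documents
                    entryQuery,      // selects the "from" side
                    searcher,        // searcher over the single index holding both sides
                    ScoreMode.None); // no score propagation across the join
            return searcher.search(joined, 10);
        }
    }

Note that JoinUtil joins within one IndexSearcher, which is why the 2^31 ceiling forces us to keep our own cross-index logic.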

We get into this problem because we somewhat abuse Lucene, treating it as more than just a text retrieval engine. We have a number of custom query objects that let users integrate certain compute results into a Lucene search.
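As a generic illustration of the idea (not our actual query objects), a precomputed result set can be folded into a search as a set query over an indexed key field. A minimal sketch, assuming a hypothetical "storage_key" field and a Lucene version with TermInSetQuery in core:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermInSetQuery;
    import org.apache.lucene.util.BytesRef;

    public class ComputedResultQuery {
        // Turn ids produced by an external computation into a Lucene query
        // that can be combined with ordinary text queries.
        static Query fromComputedIds(Iterable<String> computedIds) {
            List<BytesRef> terms = new ArrayList<>();
            for (String id : computedIds) {
                terms.add(new BytesRef(id));
            }
            return new TermInSetQuery("storage_key", terms);
        }
    }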

Now I understand that splitting indexes into shards is a completely reasonable direction. On the other hand, we have more than acceptable search performance on 800-million-document indexes and see no reason why that would not also hold at five times the size, especially considering this performance is achieved today on machines with 32 GB of RAM (18 GB heap) and 8 cores. In other words, for us it would be far cheaper to buy bigger machines than to re-architect. I expect that with improvements in the JVM and GC it would make sense to run 1 or 2 Solr/Elasticsearch nodes on one large machine instead of the 5 to 10 we hear about in some deployments.

Some of the decisions behind what we built we would not make today if starting from scratch. But considering that we started using Lucene 10 years ago and are current with the latest release, continuing with our madness makes sense, and it would remain possible for another 10 years if we had 64 bits for a doc id.

Again, not something for now, but something that would be interesting in the Java 10 time frame.

Regards,
Jerven

P.S. Thank you very much for building a great search library and ecosystem.

P.P.S. If you want to see the madness in action, visit uniprot.org.


On 08/18/2016 05:43 PM, Greg Bowyer wrote:
What are you trying to index that has more than 2 billion documents per
shard/index and cannot be split as Adrien suggests?



On Thu, Aug 18, 2016, at 07:35 AM, Cristian Lorenzetto wrote:
Maybe Lucene has a maximum size of 2^31 because result sets are Java arrays, whose length is an int. A suggestion for a possible future change: return documents through an Iterator rather than a Java array. An Iterator is a more scalable abstract data type and does not consume memory to hold the returned documents.
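For what it's worth, Lucene's collector API already has the streaming shape I mean: matches are handed over one at a time rather than materialized in an array, although each document id is still an int. A minimal sketch against the 6.x-era API:

    import java.io.IOException;

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.SimpleCollector;

    // Handles each hit as it matches; nothing is buffered.
    public class StreamingCollector extends SimpleCollector {
        private int docBase;

        @Override
        protected void doSetNextReader(LeafReaderContext context) {
            this.docBase = context.docBase; // segment-relative ids need the base added
        }

        @Override
        public void collect(int doc) throws IOException {
            int globalDoc = docBase + doc;
            // process globalDoc here, one hit at a time
        }

        @Override
        public boolean needsScores() {
            return false; // only doc ids are needed, so scoring can be skipped
        }
    }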


2016-08-18 16:03 GMT+02:00 Glen Newton <glen.new...@gmail.com>:

Or maybe it is time Lucene re-examined this limit.

There are use cases out there where >2^31 documents in a single index does
make sense (huge numbers of tiny docs).

Also, I think the underlying hardware and the JDK have advanced enough to
make this more defensible.

Constructively,
Glen


On Thu, Aug 18, 2016 at 9:55 AM, Adrien Grand <jpou...@gmail.com> wrote:

No, IndexWriter enforces that the number of documents cannot go over
IndexWriter.MAX_DOCS (which is a bit less than 2^31), and
BaseCompositeReader computes the number of documents in a long variable
and ensures it is less than 2^31, so you cannot have indexes that contain
more than 2^31 documents.
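To make the bound concrete, the constant is part of the public API; a tiny sketch:

    import org.apache.lucene.index.IndexWriter;

    public class MaxDocsDemo {
        public static void main(String[] args) {
            // MAX_DOCS is Integer.MAX_VALUE - 128; IndexWriter throws when an
            // add would exceed it, and composite readers enforce the same bound
            // when summing the maxDoc() of their sub-readers.
            System.out.println(IndexWriter.MAX_DOCS);                     // 2147483519
            System.out.println(Integer.MAX_VALUE - IndexWriter.MAX_DOCS); // 128
        }
    }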

Larger collections should be written to multiple shards, and
TopDocs.merge should be used to merge results.
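A minimal sketch of that pattern, with one IndexSearcher per shard:

    import java.io.IOException;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class ShardedSearch {
        // Search each shard independently, then merge the per-shard top hits
        // into a single global top-10.
        static TopDocs searchAllShards(IndexSearcher[] shards, Query query) throws IOException {
            TopDocs[] perShard = new TopDocs[shards.length];
            for (int i = 0; i < shards.length; i++) {
                perShard[i] = shards[i].search(query, 10);
            }
            return TopDocs.merge(10, perShard);
        }
    }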

On Thu, Aug 18, 2016 at 3:38 PM, Cristian Lorenzetto <cristian.lorenze...@gmail.com> wrote:

The docid is a signed int32, so it is not that big, but a docid does not
really seem to be an immutable primary key; it looks like a temporary id
for the view related to a specific search.

So a repository could contain more than 2^31 documents.

Is my deduction correct? Is there a maximum size for a Lucene index?





--
-------------------------------------------------------------------
Jerven Bolleman                        Jerven.Bolleman@sib.swiss
SIB Swiss Institute of Bioinformatics  Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland     www.sib.swiss - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------
