Hi All,

I too would like to have doc ids larger than int32. Not today, but in four years that would be very nice ;) Already we are splitting some indexes that would be nicer kept together (mostly so that more Lucene code could be used instead of our own).

On the other hand, we are not the default Lucene use case. We index once a month and then have a frozen index. After "freezing" the index we use the Lucene doc ids to link search results to our document storage. We could use a stored field value instead, but so far using the internal Lucene id has been a nice optimization.
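For comparison, the stored-field approach would look something like the minimal sketch below (the "storage_key" field name is hypothetical; searcher.doc() is the classic stored-document lookup):

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class StoredKeyLookup {
        // Resolve hits to external storage keys via a stored field rather than
        // the transient internal doc id (which can change across merges/rebuilds).
        static void printStorageKeys(IndexSearcher searcher, Query query) throws IOException {
            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                Document doc = searcher.doc(hit.doc);       // hit.doc is the internal id
                System.out.println(doc.get("storage_key")); // hypothetical stored field
            }
        }
    }

The extra cost is one stored-field lookup per hit, which is exactly the overhead we currently avoid.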

The closest we come to this maximum is in an index of how our (uniprot.org) database links to other databases. These are stored as very small documents, and we have 892,236,174 of them. We can split this into lots of smaller indexes without too much hassle. On the other hand, it would be even nicer to merge them all into one larger index of roughly 1.5 billion documents, as that would allow us to use the Lucene document joining logic. For now we have our own cross-index joining logic, which is optimized but not optimal.
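For reference, the single-index join we would like to switch to looks roughly like the sketch below; the field names "accession" and "entry_ref" are hypothetical, and JoinUtil expects the from field to be indexed with doc values:

    import java.io.IOException;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.join.JoinUtil;
    import org.apache.lucene.search.join.ScoreMode;

    public class CrossReferenceJoin {
        // Join from the entry documents matching entryQuery to the
        // cross-reference documents that point back at them.
        static TopDocs searchCrossReferences(IndexSearcher searcher, Query entryQuery)
                throws IOException {
            Query joined = JoinUtil.createJoinQuery(
                    "accession",     // fromField (indexed with doc values)
                    false,           // one value per "from" document
                    "entry_ref",     // toField on the cross-reference documents
                    entryQuery,      // selects the "from" side
                    searcher,        // searcher over the single index holding both sides
                    ScoreMode.None); // no score propagation across the join
            return searcher.search(joined, 10);
        }
    }

Note that JoinUtil joins within one IndexSearcher, which is why the 2^31 ceiling forces us to keep our own cross-index logic.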

We get into this problem because we somewhat abuse Lucene, treating it as more than just a text retrieval engine. We have a number of custom query objects that let users integrate certain compute results into a Lucene search.
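As a generic illustration of the idea (not our actual query objects), a precomputed result set can be folded into a search as a set query over an indexed key field. A minimal sketch, assuming a hypothetical "storage_key" field and a Lucene version with TermInSetQuery in core:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermInSetQuery;
    import org.apache.lucene.util.BytesRef;

    public class ComputedResultQuery {
        // Turn ids produced by an external computation into a Lucene query
        // that can be combined with ordinary text queries.
        static Query fromComputedIds(Iterable<String> computedIds) {
            List<BytesRef> terms = new ArrayList<>();
            for (String id : computedIds) {
                terms.add(new BytesRef(id));
            }
            return new TermInSetQuery("storage_key", terms);
        }
    }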

Now I understand that splitting indexes into shards is a completely reasonable direction. On the other hand, we have more than acceptable search performance on 800-million-document indexes and see no reason why that would not also hold at five times the size, especially considering this performance is achieved today on machines with 32 GB of RAM (18 GB heap) and 8 cores. In other words, for us it would be far cheaper to buy bigger machines than to re-architect. I expect that with improvements in the JVM and GC it would make sense to run 1 or 2 Solr/Elasticsearch nodes on one large machine instead of the 5 to 10 we hear about in some deployments.

Some of the decisions behind what we built we would not make today if starting from scratch. But considering that we started using Lucene 10 years ago and are current with the latest release, continuing with our madness makes sense, and it would remain possible for another 10 years if we had 64 bits for a doc id.

Again, not something for now, but something that would be interesting in the Java 10 time frame.

Regards,
Jerven

P.S. Thank you very much for building a great search library and ecosystem.

P.P.S. If you want to see the madness in action, visit uniprot.org.


On 08/18/2016 05:43 PM, Greg Bowyer wrote:
What are you trying to index that has more than 2 billion documents per
shard/index and cannot be split as Adrien suggests?



On Thu, Aug 18, 2016, at 07:35 AM, Cristian Lorenzetto wrote:
Maybe Lucene has a maximum size of 2^31 because result sets are Java arrays, whose length is an int. A suggestion for a possible future change: return documents through an Iterator rather than a Java array. An Iterator is a more scalable abstract data type and does not consume memory to hold the returned documents.
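For what it's worth, Lucene's collector API already has the streaming shape I mean: matches are handed over one at a time rather than materialized in an array, although each document id is still an int. A minimal sketch against the 6.x-era API:

    import java.io.IOException;

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.SimpleCollector;

    // Handles each hit as it matches; nothing is buffered.
    public class StreamingCollector extends SimpleCollector {
        private int docBase;

        @Override
        protected void doSetNextReader(LeafReaderContext context) {
            this.docBase = context.docBase; // segment-relative ids need the base added
        }

        @Override
        public void collect(int doc) throws IOException {
            int globalDoc = docBase + doc;
            // process globalDoc here, one hit at a time
        }

        @Override
        public boolean needsScores() {
            return false; // only doc ids are needed, so scoring can be skipped
        }
    }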


2016-08-18 16:03 GMT+02:00 Glen Newton <glen.new...@gmail.com>:

Or maybe it is time Lucene re-examined this limit.

There are use cases out there where >2^31 documents in a single index does
make sense (huge numbers of tiny docs).

Also, I think the underlying hardware and the JDK have advanced enough to
make this more defensible.

Constructively,
Glen


On Thu, Aug 18, 2016 at 9:55 AM, Adrien Grand <jpou...@gmail.com> wrote:

No, IndexWriter enforces that the number of documents cannot go over
IndexWriter.MAX_DOCS (which is a bit less than 2^31), and
BaseCompositeReader computes the number of documents in a long variable
and ensures it is less than 2^31, so you cannot have indexes that contain
more than 2^31 documents.
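To make the bound concrete, the constant is part of the public API; a tiny sketch:

    import org.apache.lucene.index.IndexWriter;

    public class MaxDocsDemo {
        public static void main(String[] args) {
            // MAX_DOCS is Integer.MAX_VALUE - 128; IndexWriter throws when an
            // add would exceed it, and composite readers enforce the same bound
            // when summing the maxDoc() of their sub-readers.
            System.out.println(IndexWriter.MAX_DOCS);                     // 2147483519
            System.out.println(Integer.MAX_VALUE - IndexWriter.MAX_DOCS); // 128
        }
    }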

Larger collections should be written to multiple shards, and
TopDocs.merge should be used to merge results.
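A minimal sketch of that pattern, with one IndexSearcher per shard:

    import java.io.IOException;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class ShardedSearch {
        // Search each shard independently, then merge the per-shard top hits
        // into a single global top-10.
        static TopDocs searchAllShards(IndexSearcher[] shards, Query query) throws IOException {
            TopDocs[] perShard = new TopDocs[shards.length];
            for (int i = 0; i < shards.length; i++) {
                perShard[i] = shards[i].search(query, 10);
            }
            return TopDocs.merge(10, perShard);
        }
    }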

On Thu, Aug 18, 2016 at 3:38 PM, Cristian Lorenzetto <cristian.lorenze...@gmail.com> wrote:

The docid is a signed int32, so it is not that big, but a docid does not
really seem to be an immutable primary key; it looks like a temporary id
for the view related to a specific search.

So a repository could contain more than 2^31 documents.

Is my deduction correct? Is there a maximum size for a Lucene index?





--
-------------------------------------------------------------------
Jerven Bolleman                        Jerven.Bolleman@sib.swiss
SIB Swiss Institute of Bioinformatics  Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland     www.sib.swiss - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------
