Hi All,
I too would like to have doc ids that are larger than int32. Not today,
but in four years that would be very nice ;) Already we are splitting some
indexes that would be nicer kept together (mostly to allow more Lucene
code to be used instead of our own).
On the other hand, we are not the default Lucene use case. We index
once a month and then have a frozen index. After "freezing" the index we
use the Lucene doc ids to link search results to our document
storage. We could use a stored field value instead, but so far using
the internal Lucene id has been a nice optimization.
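Roughly, the two options look like this (a sketch only; "linkResults"
and the "storageKey" field name are illustrative, not our actual schema):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    static void linkResults(IndexSearcher searcher, Query query) throws IOException {
      TopDocs hits = searcher.search(query, 10);
      for (ScoreDoc sd : hits.scoreDocs) {
        // Option A: use the internal doc id as the external key.
        // Only safe on a frozen index; doc ids shift on merges/deletes.
        int key = sd.doc;

        // Option B: read a stored field instead; stable across merges,
        // but costs a stored-field lookup per hit.
        Document stored = searcher.doc(sd.doc);
        String storedKey = stored.get("storageKey");
      }
    }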
The closest we come to this maximum is in an index of how
our (uniprot.org) database links to other databases. These links are
stored as very small documents, and we have 892,236,174 of them. We can
split this into lots of smaller indexes without too much hassle. On the
other hand, it would be even nicer to merge them all into one larger
index of about 1.5 billion documents, as that would allow us to use
Lucene's document joining logic. For now we have our own cross-index
joining logic, which is optimized but not optimal.
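For what it is worth, the single-index join we would like to use looks
roughly like this (a sketch; the "accession" and "targetDb" field names
and the "joinLinks" helper are made up for the example, and the searcher
is assumed to be over the one merged index):

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.join.JoinUtil;
    import org.apache.lucene.search.join.ScoreMode;

    static TopDocs joinLinks(IndexSearcher searcher) throws IOException {
      // Query-time join inside ONE index, which is what the 2^31 limit
      // keeps us from doing with all the link documents merged together.
      Query fromQuery = new TermQuery(new Term("targetDb", "PDB"));
      Query joinQuery = JoinUtil.createJoinQuery(
          "accession",   // fromField, indexed with doc values
          false,         // multipleValuesPerDocument
          "accession",   // toField
          fromQuery,
          searcher,      // searcher over the single merged index
          ScoreMode.None);
      return searcher.search(joinQuery, 10);
    }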
We get into this problem because we somewhat abuse Lucene to act as more
than just a text retrieval engine: we have a number of custom query
objects that allow users to integrate certain compute results into
a Lucene search.
Now, I understand that splitting indexes into shards is a completely
reasonable direction. On the other hand, we have more than acceptable
search performance on 800-million-document indexes and see no reason why
that would not also be the case on one five times the size, especially
considering this performance is achieved today on machines with 32 GB of
RAM (18 GB heap) and 8 cores. In other words, for us it would be far
cheaper to buy bigger machines than to re-architect. I expect that with
improvements in the JVM and GC it would make sense to run 1 or 2
Solr/Elasticsearch nodes on one large machine instead of the 5 to 10 we
hear about on some deployments.
Some of the decisions behind what we built we would not make today if
starting from scratch. But considering that we started using Lucene 10
years ago and are current with the latest release, the decision to
continue with our madness makes sense, and would remain possible for
another 10 years if we had 64 bits for a doc id.
Again, not something for now, but something that would be interesting in
the Java 10 time frame.
Regards,
Jerven
P.S. Thank you very much for building a great search library and ecosystem.
P.P.S. If you want to see the madness in action, visit uniprot.org.
On 08/18/2016 05:43 PM, Greg Bowyer wrote:
What are you trying to index that has more than 3 billion documents per
shard / index and cannot be split as Adrien suggests?
On Thu, Aug 18, 2016, at 07:35 AM, Cristian Lorenzetto wrote:
Maybe Lucene has a maximum size of 2^31 because result sets are Java
arrays, whose length is an int.
A suggestion for a possible future change is to use an Iterator instead
of a Java array: an Iterator is a more scalable ADT that does not consume
memory just to return documents.
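For example, something like this, where each hit is consumed as it
arrives instead of being collected into an array (a sketch against the
existing SimpleCollector API; note that collect() itself still receives
an int doc id):

    import java.io.IOException;
    import org.apache.lucene.search.SimpleCollector;

    // Streams hits one at a time instead of buffering them; the count
    // is a long, so it could in principle pass 2^31.
    public class StreamingCollector extends SimpleCollector {
      public long count = 0;

      @Override
      public void collect(int doc) throws IOException {
        count++; // handle each hit here; nothing is accumulated
      }

      @Override
      public boolean needsScores() {
        return false;
      }
    }

    // usage: searcher.search(query, new StreamingCollector());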
2016-08-18 16:03 GMT+02:00 Glen Newton <glen.new...@gmail.com>:
Or maybe it is time Lucene re-examined this limit.
There are use cases out there where >2^31 does make sense in a single index
(a huge number of tiny docs).
Also, I think the underlying hardware and the JDK have advanced enough to
make this more defensible.
Constructively,
Glen
On Thu, Aug 18, 2016 at 9:55 AM, Adrien Grand <jpou...@gmail.com> wrote:
No, IndexWriter enforces that the number of documents cannot go over
IndexWriter.MAX_DOCS (which is a bit less than 2^31), and
BaseCompositeReader computes the number of documents in a long variable
and ensures it is less than 2^31, so you cannot have indexes that contain
more than 2^31 documents.
Larger collections should be written to multiple shards and use
TopDocs.merge to merge results.
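For example (a sketch; "searchShards" and "shardSearchers" are
illustrative names, one IndexSearcher per shard):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    static TopDocs searchShards(IndexSearcher[] shardSearchers, Query query) throws IOException {
      // Search each shard separately, then merge the per-shard top hits.
      TopDocs[] shardHits = new TopDocs[shardSearchers.length];
      for (int i = 0; i < shardSearchers.length; i++) {
        shardHits[i] = shardSearchers[i].search(query, 10);
      }
      // The merged scoreDocs hold the global top 10; ScoreDoc.shardIndex
      // says which shard each hit came from.
      return TopDocs.merge(10, shardHits);
    }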
On Thu, Aug 18, 2016 at 15:38, Cristian Lorenzetto <
cristian.lorenze...@gmail.com> wrote:
A docid is a signed int32, so it is not that big, but a docid seems not
to be an unmodifiable primary key, rather a temporary id for the view
related to a specific search.
So a repository could contain more than 2^31 documents.
Is my deduction correct? Is there a maximum size for a Lucene index?
--
-------------------------------------------------------------------
Jerven Bolleman Jerven.Bolleman@sib.swiss
SIB Swiss Institute of Bioinformatics Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1 Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland www.sib.swiss - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org