I ran into this issue before, and after some digging I don't think there is an easy way to accommodate long doc IDs in Lucene. So I decided to shard my documents across multiple indexes. It turned out to be a good decision in my case, because I would have had to shard the index for performance reasons anyway (some of my queries require collecting and scoring a large portion of the index).
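For what it's worth, here is a minimal sketch of the indexing side against the Lucene 4.x API (the ShardedIndexWriter class and the modulo routing are just my choices for illustration, not anything Lucene provides):

import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Routes documents across N physical indexes so that no single
// IndexReader ever has to address more than Integer.MAX_VALUE docs.
public class ShardedIndexWriter implements Closeable {
  private final IndexWriter[] shards;

  public ShardedIndexWriter(File baseDir, int numShards) throws IOException {
    shards = new IndexWriter[numShards];
    for (int i = 0; i < numShards; i++) {
      IndexWriterConfig config = new IndexWriterConfig(
          Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
      shards[i] = new IndexWriter(
          FSDirectory.open(new File(baseDir, "shard" + i)), config);
    }
  }

  // Stable routing: the same external ID always lands in the same shard.
  public void addDocument(long externalId, Document doc) throws IOException {
    // Mask the sign bit so negative IDs still map to a valid shard.
    int shard = (int) ((externalId & Long.MAX_VALUE) % shards.length);
    shards[shard].addDocument(doc);
  }

  public void close() throws IOException {
    for (IndexWriter w : shards) w.close();
  }
}

On the search side, one option is to keep a separate IndexSearcher per shard and merge the top hits by score, rather than wrapping all the shards in a single MultiReader (which would run into the same int ceiling). A single global identifier can then be composed as ((long) shardIndex << 32) | (docId & 0xFFFFFFFFL), which fits comfortably in a long.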
On Mar 21, 2014, at 09:41 AM, Artem Gayardo-Matrosov <ar...@gayardo.com> wrote:
Hi Oli,
Thanks for your reply,
I thought about that, but it feels like building a crude, inefficient
version of what is already in Lucene -- CompositeReader, doesn't it? It
would mean writing my own CompositeCompositeReader to forward the
requests to the underlying CompositeReaders...
Is there a better way?
Thanks,
Artem.
On Fri, Mar 21, 2014 at 6:33 PM, Oliver Christ <ochr...@ebsco.com> wrote:
> Can you split your corpus across multiple Lucene instances?
>
> Cheers, Oli
>
> -----Original Message-----
> From: Artem Gayardo-Matrosov [mailto:ar...@gayardo.com]
> Sent: Friday, March 21, 2014 12:29 PM
> To: java-user@lucene.apache.org
> Subject: maxDoc/numDocs int fields
>
> Hi all,
>
> I am using Lucene to index a large corpus of text, with every word being a
> separate document (this is something I cannot change), and I am hitting the
> CompositeReader limitation of supporting at most Integer.MAX_VALUE
> documents.
>
> Is there any way to work around this limitation? For the moment I have
> implemented my own DirectoryReader and BaseCompositeReader to at least make
> them support docIDs from Integer.MIN_VALUE to -1 (doubling the number of
> supported documents). The problem is that all the APIs are restricted to
> the int type, so once the docID value wraps back to 0 I have no way to
> recover the original docID.
>
> --
> Thanks in advance,
> Artem.
>
--
Artem.