Thanks guys for your replies; I will go with the sharding approach suggested by Oliver & Tri Cao.

In my case, every word occurrence is a document, and the context of the occurrence is stored in document fields. I use that to do n-gram analysis over a large corpus of text, and Lucene seems to be the best, and only, solution to this problem.
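In case it helps anyone else hitting the same limit, roughly what I have in mind is the sketch below: the corpus is split across several independent Lucene indexes, and the application maps a global long id to a (shard, local int docID) pair instead of asking a single CompositeReader to address everything. The ShardedCorpusReader class and its fields are just illustrative names, not existing Lucene APIs; only DirectoryReader.open(), maxDoc() and document() are real Lucene calls.

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

import java.io.IOException;

/**
 * Minimal sketch of manual sharding: each shard stays below
 * Integer.MAX_VALUE documents, and the global (long) id is mapped to
 * (shard, local int docID) outside Lucene. Names are placeholders.
 */
public class ShardedCorpusReader {

    private final IndexReader[] shards;   // one reader per sub-index
    private final long[] docBase;         // global id of the first doc in each shard

    public ShardedCorpusReader(Directory[] shardDirs) throws IOException {
        shards = new IndexReader[shardDirs.length];
        docBase = new long[shardDirs.length];
        long base = 0;
        for (int i = 0; i < shardDirs.length; i++) {
            shards[i] = DirectoryReader.open(shardDirs[i]);
            docBase[i] = base;
            base += shards[i].maxDoc();
        }
    }

    /** Total number of documents across all shards, as a long. */
    public long maxDoc() {
        long total = 0;
        for (IndexReader r : shards) {
            total += r.maxDoc();
        }
        return total;
    }

    /** Fetch a stored document by its global (long) id. */
    public Document document(long globalId) throws IOException {
        // Find the shard whose base is <= globalId; a linear scan for clarity,
        // a binary search over docBase would do when there are many shards.
        int shard = shards.length - 1;
        while (shard > 0 && docBase[shard] > globalId) {
            shard--;
        }
        int localDocId = (int) (globalId - docBase[shard]);  // safe: each shard < 2^31 docs
        return shards[shard].document(localDocId);
    }

    public void close() throws IOException {
        for (IndexReader r : shards) {
            r.close();
        }
    }
}

Searches would then run against each shard's reader (or an IndexSearcher over it) separately, with the per-shard results merged by the application using the same docBase offsets.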
Artem.


On Fri, Mar 21, 2014 at 7:29 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Every word occurrence or every unique word? I mean Integer.MAX_VALUE, like
> 2 billion. Even the OED only has 600,000 words defined. The former doesn't
> sound like a good use case match for Lucene as it exists today. Lucene
> indexes "documents", not "words".
>
> I'm sure some day Lucene will switch from int to long, but not in the very
> near future (maybe Lucene 6.0??), especially since it probably isn't a good
> match for existing hardware. Maybe when Lucene moves a lot more stuff off
> heap, then it might make more sense.
>
> Sure, you could do your own Lucene branch that literally does that switch
> now, but otherwise, that's the limit for now.
>
> -- Jack Krupansky
>
>
> -----Original Message-----
> From: Artem Gayardo-Matrosov
> Sent: Friday, March 21, 2014 12:41 PM
> To: java-user@lucene.apache.org
> Subject: Re: maxDoc/numDocs int fields
>
> Hi Oli,
>
> Thanks for your reply.
>
> I thought about this, but it feels like making a crude, inefficient
> implementation of what's already in Lucene -- CompositeReader, isn't it? It
> would involve writing my own CompositeCompositeReader which would forward
> the requests to the underlying CompositeReaders...
>
> Is there a better way?
>
> Thanks,
> Artem.
>
>
> On Fri, Mar 21, 2014 at 6:33 PM, Oliver Christ <ochr...@ebsco.com> wrote:
>
>> Can you split your corpus across multiple Lucene instances?
>>
>> Cheers, Oli
>>
>> -----Original Message-----
>> From: Artem Gayardo-Matrosov [mailto:ar...@gayardo.com]
>> Sent: Friday, March 21, 2014 12:29 PM
>> To: java-user@lucene.apache.org
>> Subject: maxDoc/numDocs int fields
>>
>> Hi all,
>>
>> I am using Lucene to index a large corpus of text, with every word being a
>> separate document (this is something I cannot change), and I am hitting the
>> limitation of CompositeReader only supporting Integer.MAX_VALUE documents.
>>
>> Is there any way to work around this limitation? For the moment I have
>> implemented my own DirectoryReader and BaseCompositeReader to at least make
>> them support documents from Integer.MIN_VALUE to -1 (for twice as many
>> documents supported); the problem is that all the APIs are restricted to
>> the int type, and after the docID value wraps back to 0, I have no way to
>> restore the original docID.
>>
>> --
>> Thanks in advance,
>> Artem.
>
>
> --
>
> Artem.

--
Artem.