Thanks, guys, for your replies.

I will go for the sharding approach suggested by Oliver & Tri Cao.

In my case, every word occurrence is a document, and the context of the
occurrence is stored in the document's fields. I use this to do n-gram
analysis on a large corpus of text, and Lucene seems to be the best (and,
as far as I can tell, the only) fit for the problem.
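
For the record, here is roughly what the per-occurrence indexing will look
like (a minimal, untested sketch against the Lucene 4.x API; the shard
count and the field names "word", "left" and "right" are made up for
illustration):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ShardedOccurrenceIndexer {
    public static void main(String[] args) throws Exception {
        // One IndexWriter per shard, so no single index ever holds
        // more than Integer.MAX_VALUE documents.
        int numShards = 4; // made-up; size it so each shard stays < 2^31
        IndexWriter[] shards = new IndexWriter[numShards];
        for (int i = 0; i < numShards; i++) {
            shards[i] = new IndexWriter(
                FSDirectory.open(new File("shard-" + i)),
                new IndexWriterConfig(Version.LUCENE_47,
                    new StandardAnalyzer(Version.LUCENE_47)));
        }

        // One tiny document per word occurrence; the surrounding
        // context lives in its own fields.
        long occurrenceId = 0;
        Document doc = new Document();
        doc.add(new StringField("word", "example", Field.Store.YES));
        doc.add(new TextField("left", "tokens to the left", Field.Store.YES));
        doc.add(new TextField("right", "tokens to the right", Field.Store.YES));
        shards[(int) (occurrenceId % numShards)].addDocument(doc);

        for (IndexWriter w : shards) {
            w.close();
        }
    }
}

Each shard then stays below the 2^31 docID limit, and the n-gram counts
can be merged across the shards at query time.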

Artem.


On Fri, Mar 21, 2014 at 7:29 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Every word occurrence, or every unique word? I mean, Integer.MAX_VALUE is
> about 2 billion, and even the OED only has about 600,000 words defined.
> The former doesn't sound like a good use-case match for Lucene as it
> exists today: Lucene indexes "documents", not "words".
>
> I'm sure Lucene will someday switch from int to long, but not in the very
> near future (maybe Lucene 6.0??), especially since it probably isn't a
> good match for existing hardware. Once Lucene moves a lot more data off
> the heap, it might make more sense.
>
> Sure, you could maintain your own Lucene branch that makes that switch
> today, but otherwise, that's the limit for now.
>
> -- Jack Krupansky
>
>
> -----Original Message-----
> From: Artem Gayardo-Matrosov
> Sent: Friday, March 21, 2014 12:41 PM
> To: java-user@lucene.apache.org
> Subject: Re: maxDoc/numDocs int fields
>
>
> Hi Oli,
>
> Thanks for your reply,
>
> I thought about that, but it feels like building a crude, inefficient
> implementation of what Lucene already provides in CompositeReader,
> doesn't it? It would mean writing my own "CompositeCompositeReader" that
> forwards requests to the underlying CompositeReaders...
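>
> To make that concrete, here is roughly what I mean (an untested sketch;
> the class name and the long-ID packing are my own invention, not
> anything Lucene provides):
>
> import java.io.IOException;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.search.IndexSearcher;
>
> // Sketch: keep one IndexSearcher per shard and address documents
> // with a single long instead of one composite int docID space.
> public class ShardedSearcher {
>     private final IndexSearcher[] shards; // one per sub-index
>
>     public ShardedSearcher(IndexSearcher[] shards) {
>         this.shards = shards;
>     }
>
>     // Pack (shard, docID) into one long: high 32 bits = shard,
>     // low 32 bits = the shard-local docID.
>     public static long globalId(int shard, int docId) {
>         return ((long) shard << 32) | (docId & 0xFFFFFFFFL);
>     }
>
>     public Document doc(long globalId) throws IOException {
>         int shard = (int) (globalId >>> 32);
>         int docId = (int) globalId;
>         return shards[shard].doc(docId);
>     }
> }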
>
> Is there a better way?
>
> Thanks,
> Artem.
>
>
>
>
> On Fri, Mar 21, 2014 at 6:33 PM, Oliver Christ <ochr...@ebsco.com> wrote:
>
>> Can you split your corpus across multiple Lucene instances?
>>
>> Cheers, Oli
>>
>> -----Original Message-----
>> From: Artem Gayardo-Matrosov [mailto:ar...@gayardo.com]
>> Sent: Friday, March 21, 2014 12:29 PM
>> To: java-user@lucene.apache.org
>> Subject: maxDoc/numDocs int fields
>>
>> Hi all,
>>
>> I am using Lucene to index a large corpus of text, with every word being
>> a separate document (this is something I cannot change), and I am hitting
>> the limitation that a CompositeReader only supports Integer.MAX_VALUE
>> documents.
>>
>> Is there any way to work around this limitation? For the moment I have
>> implemented my own DirectoryReader and BaseCompositeReader to at least
>> make them support docIDs from Integer.MIN_VALUE to -1 (doubling the
>> number of addressable documents). The problem is that all the APIs are
>> restricted to the int type, so once the docID wraps back to 0, I have no
>> way to restore the original docID.
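>>
>> Concretely, the signed/unsigned trick I mean looks like this (a sketch;
>> the numbers are only for illustration):
>>
>> // A logical ID beyond Integer.MAX_VALUE is stored as a negative int:
>> long logicalId = 3000000000L;        // > Integer.MAX_VALUE
>> int docId = (int) logicalId;         // becomes a negative int
>> long restored = docId & 0xFFFFFFFFL; // == 3000000000 again
>>
>> // This recovers IDs up to 2^32 - 1, but at 2^32 the int wraps back
>> // to 0 and collides with docID 0, so the original ID can no longer
>> // be recovered from the int alone.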
>>
>> --
>> Thanks in advance,
>> Artem.
>>
>>
>
>
> --
>
> Artem.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 

Artem.
