maxDoc/numDocs int fields

2014-03-21 Thread Artem Gayardo-Matrosov
Hi all,

I am using Lucene to index a large corpus of text, with every word being a
separate document (this is something I cannot change), and I am hitting the
limitation that CompositeReader supports only Integer.MAX_VALUE
documents.

Is there any way to work around this limitation? For the moment I have
implemented my own DirectoryReader and BaseCompositeReader so that they at
least also accept docIDs from Integer.MIN_VALUE to -1 (which doubles the
number of documents supported). The problem is that all the APIs are
restricted to the int type, and once the docID value wraps back to 0, I have
no way to restore the original docID.
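
To make the wrap-around concrete: a Java int has 2^32 distinct values, so
treating Integer.MIN_VALUE to -1 as extra docIDs doubles the range, but any
ID beyond that collides with an earlier one. A minimal sketch in plain Java
(no Lucene classes involved; the class and method names are made up for
illustration):

```java
// Sketch: reusing negative ints doubles the docID range, but no more than that.
final class DocIdWrapDemo {
    /** Reads a signed int docID as an unsigned 32-bit number (0 .. 2^32 - 1). */
    static long toUnsigned(int docId) {
        return docId & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        long inRange = 3_000_000_000L;          // above Integer.MAX_VALUE, below 2^32
        long outOfRange = inRange + (1L << 32); // above 2^32

        int a = (int) inRange;    // narrows to a negative int
        int b = (int) outOfRange; // narrows to the same int as a

        System.out.println(a < 0);                    // true
        System.out.println(toUnsigned(a) == inRange); // true: still recoverable
        System.out.println(a == b);                   // true: the two IDs collide
    }
}
```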

-- 
Thanks in advance,
Artem.


RE: maxDoc/numDocs int fields

2014-03-21 Thread Oliver Christ
Can you split your corpus across multiple Lucene instances?

Cheers, Oli



Re: maxDoc/numDocs int fields

2014-03-21 Thread Artem Gayardo-Matrosov
Hi Oli,

Thanks for your reply,

I thought about this, but doesn't it feel like making a crude, inefficient
reimplementation of what's already in Lucene -- CompositeReader? It
would involve writing my own CompositeCompositeReader which would forward the
requests to the underlying CompositeReaders...
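
For reference, much of such a forwarding layer comes down to running the
query against each index separately and merging the per-index top hits.
Below is a rough sketch of only the merge step, in plain Java. It does not
use Lucene's own classes; Hit and ShardHitMerger are hypothetical names, and
the sketch assumes scores from different indexes can be compared directly,
which is only approximately true because term statistics differ per index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: merge per-shard result lists into one global top-N list.
final class ShardHitMerger {
    /** One search hit, addressed by (shard, docID within that shard). Hypothetical type. */
    static final class Hit {
        final int shard;
        final int docId;
        final float score;

        Hit(int shard, int docId, float score) {
            this.shard = shard;
            this.docId = docId;
            this.score = score;
        }
    }

    /** Returns the topN hits by descending score over all per-shard hit lists. */
    static List<Hit> merge(List<List<Hit>> perShardHits, int topN) {
        if (topN < 0) {
            throw new IllegalArgumentException("topN must be >= 0");
        }
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShardHits) {
            all.addAll(hits);
        }
        // Highest score first.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(topN, all.size())));
    }
}
```

Each hit keeps its shard number next to the int docID, so no single ID ever
has to exceed the int range.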

Is there a better way?

Thanks,
Artem.






Re: maxDoc/numDocs int fields

2014-03-21 Thread Tri Cao
I ran into this issue before, and after some digging I don't think there is
an easy way to accommodate long IDs in Lucene. So I decided to shard the
documents across multiple indexes. It turned out to be a good decision in my
case because I would have had to shard the index anyway for performance
reasons (there are queries that require collecting and scoring a large
portion of the index).
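
One way to picture the sharding approach: the application addresses
documents with a long, and each index only ever sees an int. The sketch
below is plain Java with hypothetical names (ShardAddress, docsPerShard) and
assumes a fixed per-shard capacity. Note that Lucene's internal docIDs can
change when segments are merged, so a persistent ID of this kind would
normally be stored as a field rather than taken from the docID itself.

```java
// Sketch: a long "global" ID split into (shard, local int ID).
final class ShardAddress {
    final int shard;      // which index the document lives in
    final int localDocId; // ID inside that index, always within int range

    ShardAddress(int shard, int localDocId) {
        this.shard = shard;
        this.localDocId = localDocId;
    }

    /** Splits a global ID, assuming every shard holds at most docsPerShard documents. */
    static ShardAddress fromGlobal(long globalId, int docsPerShard) {
        if (globalId < 0 || docsPerShard <= 0) {
            throw new IllegalArgumentException("globalId must be >= 0 and docsPerShard > 0");
        }
        long shard = globalId / docsPerShard;
        if (shard > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("too many shards for an int shard number");
        }
        return new ShardAddress((int) shard, (int) (globalId % docsPerShard));
    }

    /** Inverse of fromGlobal for the same docsPerShard. */
    long toGlobal(int docsPerShard) {
        return (long) shard * docsPerShard + localDocId;
    }
}
```

With docsPerShard = 1,000,000,000, global ID 5,000,000,123 maps to shard 5,
local ID 123.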

Re: maxDoc/numDocs int fields

2014-03-21 Thread Jack Krupansky
Every word occurrence or every unique word? I mean, Integer.MAX_VALUE is
about 2 billion. Even the OED only has 600,000 words defined. The former
doesn't sound like a good use case match for Lucene as it exists today.
Lucene indexes "documents", not "words".

I'm sure some day Lucene will switch from int to long, but not in the very
near future (maybe Lucene 6.0??), especially since it probably isn't a good
match for existing hardware. Maybe when Lucene moves a lot more stuff off
heap, then it might make more sense.

Sure, you could do your own Lucene branch that literally does that switch
now, but otherwise, that's the limit for now.


-- Jack Krupansky




Re: maxDoc/numDocs int fields

2014-03-21 Thread Artem Gayardo-Matrosov
Thanks guys for your replies,

I will go for the sharding approach suggested by Oliver & Tri Cao.

In my case, every word occurrence is a document, and the context of the
occurrence is stored in document fields. I use that to do n-gram analysis on
a large corpus of text, and Lucene seems to be the best and only solution to
this problem.
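
As a rough illustration of that layout (not the actual schema, which is not
shown in this thread), the sketch below turns a token stream into one record
per word occurrence, with the neighbouring words as context. The names
Occurrence, prev and next are hypothetical, and the code is plain Java
rather than Lucene's Document API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: one record per word occurrence, with its immediate context.
final class OccurrenceRecords {
    /** A single word occurrence; each instance would become one indexed document. */
    static final class Occurrence {
        final long position; // long, since a corpus can exceed 2^31 occurrences
        final String word;
        final String prev;   // previous word, or "" at the start
        final String next;   // next word, or "" at the end

        Occurrence(long position, String word, String prev, String next) {
            this.position = position;
            this.word = word;
            this.prev = prev;
            this.next = next;
        }
    }

    /** Builds one Occurrence per token, in order. */
    static List<Occurrence> fromTokens(List<String> tokens) {
        List<Occurrence> out = new ArrayList<>(tokens.size());
        for (int i = 0; i < tokens.size(); i++) {
            String prev = i > 0 ? tokens.get(i - 1) : "";
            String next = i + 1 < tokens.size() ? tokens.get(i + 1) : "";
            out.add(new Occurrence(i, tokens.get(i), prev, next));
        }
        return out;
    }
}
```

A corpus of N words yields N such records, which is how the document count
passes Integer.MAX_VALUE on a large corpus.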

Artem.

