Re: Numerical ids for terms?

2011-04-13 Thread Toke Eskildsen
On Tue, 2011-04-12 at 11:41 +0200, Gregor Heinrich wrote:
 Hi -- has there been any effort to create a numerical representation of Lucene
 indices? That is, to use the Lucene Directory backend as a large term-document
 matrix at index level. As this would require a bijective mapping between terms
 (per-field, as customary in Lucene) and a numerical index (integer, monotonic
 from 0 to numTerms()-1), I guess this requires some special modifications to
 the Lucene core.

Maybe you're thinking about something like TermsEnum?
https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/index/TermsEnum.html
It provides ordinal access to terms, with the ordinals represented as longs.
To get access at index level rather than segment level, you will have to
merge the ordinals from the different segments.
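
To make the merge step concrete, here is a minimal sketch in plain Java. It
does not use the Lucene API: each inner list stands in for the sorted term
list of one segment, and the class and method names (OrdinalMerge, merge,
segmentToGlobal) are hypothetical. Terms are compared as Java Strings here,
which is an assumption and need not match the byte-wise order of a real index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper, not part of Lucene: merges per-segment ordinals into global ones. */
final class OrdinalMerge {

    /** The global sorted term list plus, per segment, segment ordinal -> global ordinal. */
    static final class Result {
        final List<String> globalTerms = new ArrayList<>();
        final long[][] segmentToGlobal;

        Result(int segmentCount) {
            segmentToGlobal = new long[segmentCount][];
        }
    }

    /** Each inner list must be sorted and free of duplicates, like one segment's terms. */
    static Result merge(List<List<String>> segments) {
        Result result = new Result(segments.size());
        // Queue entries are {segment index, ordinal within that segment},
        // ordered by the term they currently point at (k-way merge).
        PriorityQueue<int[]> queue = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> segments.get(e[0]).get(e[1])));
        for (int s = 0; s < segments.size(); s++) {
            result.segmentToGlobal[s] = new long[segments.get(s).size()];
            if (!segments.get(s).isEmpty()) {
                queue.add(new int[] {s, 0});
            }
        }
        while (!queue.isEmpty()) {
            int[] entry = queue.poll();
            String term = segments.get(entry[0]).get(entry[1]);
            int last = result.globalTerms.size() - 1;
            // A term shared by several segments gets one global ordinal only.
            if (last < 0 || !result.globalTerms.get(last).equals(term)) {
                result.globalTerms.add(term);
            }
            result.segmentToGlobal[entry[0]][entry[1]] = result.globalTerms.size() - 1;
            if (entry[1] + 1 < segments.get(entry[0]).size()) {
                queue.add(new int[] {entry[0], entry[1] + 1});
            }
        }
        return result;
    }
}
```

With segments [apple, cat] and [bat, cat, dog], the global term list becomes
[apple, bat, cat, dog], and "cat" maps to global ordinal 2 from both segments.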

Unfortunately, support for ordinal-based term access is optional for a codec,
and the default codec does not provide it, so you will have to explicitly
select a codec that does when you build your index.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Numerical ids for terms?

2011-04-13 Thread Gregor Heinrich

Thanks Toke and Kirill -- I guess that's the way to go (at least until v4.0).

Best regards

gregor

On 4/13/11 3:42 PM, Toke Eskildsen wrote:

On Tue, 2011-04-12 at 11:41 +0200, Gregor Heinrich wrote:

Hi -- has there been any effort to create a numerical representation of Lucene
indices? That is, to use the Lucene Directory backend as a large term-document
matrix at index level. As this would require a bijective mapping between terms
(per-field, as customary in Lucene) and a numerical index (integer, monotonic
from 0 to numTerms()-1), I guess this requires some special modifications
to the Lucene core.

Maybe you're thinking about something like TermsEnum?
https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/index/TermsEnum.html
It provides ordinal access to terms, with the ordinals represented as longs.
To get access at index level rather than segment level, you will have to
merge the ordinals from the different segments.

Unfortunately, support for ordinal-based term access is optional for a codec,
and the default codec does not provide it, so you will have to explicitly
select a codec that does when you build your index.





Numerical ids for terms?

2011-04-12 Thread Gregor Heinrich
Hi -- has there been any effort to create a numerical representation of Lucene
indices? That is, to use the Lucene Directory backend as a large term-document
matrix at index level. As this would require a bijective mapping between terms
(per-field, as customary in Lucene) and a numerical index (integer, monotonic
from 0 to numTerms()-1), I guess this requires some special modifications
to the Lucene core.
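
The kind of mapping meant here can be illustrated with a small sketch in plain
Java. It is only an illustration of the bijection, not a Lucene class: the
name TermIdMap and its methods are hypothetical, and the terms are held in a
sorted in-memory array rather than in an index.

```java
import java.util.Arrays;

/** Hypothetical illustration: a bijection between sorted terms and ids 0..numTerms()-1. */
final class TermIdMap {
    private final String[] sortedTerms;

    TermIdMap(String[] terms) {
        // Sort and de-duplicate so that every term gets exactly one id.
        this.sortedTerms = Arrays.stream(terms).distinct().sorted().toArray(String[]::new);
    }

    int numTerms() {
        return sortedTerms.length;
    }

    /** id -> term */
    String term(int id) {
        return sortedTerms[id];
    }

    /** term -> id, or -1 if the term is unknown. */
    int id(String term) {
        int pos = Arrays.binarySearch(sortedTerms, term);
        return pos >= 0 ? pos : -1;
    }
}
```

Because the ids are simply positions in the sorted array, they run from 0 to
numTerms()-1 without gaps and follow the term order.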


Another interesting feature would be to use Lucene's Directory backend for
storage of large dense matrices, for instance for data-mining tasks from within
Lucene.


Any suggestions?

Best regards and thanks

gregor





Re: Numerical ids for terms?

2011-04-12 Thread Earwin Burrfoot
On Tue, Apr 12, 2011 at 13:41, Gregor Heinrich gre...@arbylon.net wrote:
 Hi -- has there been any effort to create a numerical representation of
 Lucene indices? That is, to use the Lucene Directory backend as a large
 term-document matrix at index level. As this would require a bijective mapping
 between terms (per-field, as customary in Lucene) and a numerical index
 (integer, monotonic from 0 to numTerms()-1), I guess this requires some
 special modifications to the Lucene core.
A Lucene index already provides a term-id mapping in some form.

 Another interesting feature would be to use Lucene's Directory backend for
 storage of large dense matrices, for instance for data-mining tasks from
 within Lucene.
Lucene's Directory is a dumb abstraction for random-access named
write-once byte streams.
It doesn't add /any/ value over mmap.
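
For comparison, this is roughly what the plain mmap route looks like from Java
for a dense matrix. It is a sketch under assumptions: the class MappedMatrix
and its methods are hypothetical, the layout is row-major doubles, and a single
mapping is limited to Integer.MAX_VALUE bytes, so larger matrices would need
several mappings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.DoubleBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: a row-major dense matrix of doubles in a memory-mapped file. */
final class MappedMatrix {
    private final DoubleBuffer data;
    private final int cols;

    MappedMatrix(Path file, int rows, int cols) {
        long bytes = (long) rows * cols * Double.BYTES;
        if (bytes > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("matrix too large for a single mapping");
        }
        this.cols = cols;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // A read/write mapping extends the file if needed; the mapping
            // stays valid after the channel is closed.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, bytes);
            this.data = buffer.asDoubleBuffer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience factory backed by a temporary file that is deleted on JVM exit. */
    static MappedMatrix temp(int rows, int cols) {
        try {
            Path file = Files.createTempFile("matrix", ".bin");
            file.toFile().deleteOnExit();
            return new MappedMatrix(file, rows, cols);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    double get(int row, int col) {
        return data.get(row * cols + col);
    }

    void set(int row, int col, double value) {
        data.put(row * cols + col, value);
    }
}
```

The boilerplate is essentially the file opening and the index arithmetic;
everything else is left to the operating system's page cache.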

 Any suggestions?
*troll mode on* Use numpy/scipy? :)

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: Numerical ids for terms?

2011-04-12 Thread Gregor Heinrich
Thanks for the quick response. Could you be a bit more concrete than "some form"
of term-id mapping? Do you mean subclassing SegmentReader with an appropriate
Map implementation, or is there a tested structure in the existing API that
I've overlooked? Regarding a Directory abstraction backed by a memory-mapping
API: my question is about using the Lucene API because, even if it may be
perceived as dumb, it hides a lot of boilerplate code. Are there any efforts
going on in this direction?


Cheers

gregor

On 4/12/11 1:21 PM, Earwin Burrfoot wrote:

On Tue, Apr 12, 2011 at 13:41, Gregor Heinrich gre...@arbylon.net wrote:

Hi -- has there been any effort to create a numerical representation of
Lucene indices? That is, to use the Lucene Directory backend as a large
term-document matrix at index level. As this would require a bijective mapping
between terms (per-field, as customary in Lucene) and a numerical index
(integer, monotonic from 0 to numTerms()-1), I guess this requires some
special modifications to the Lucene core.

A Lucene index already provides a term-id mapping in some form.


Another interesting feature would be to use Lucene's Directory backend for
storage of large dense matrices, for instance for data-mining tasks from
within Lucene.

Lucene's Directory is a dumb abstraction for random-access named
write-once byte streams.
It doesn't add /any/ value over mmap.


Any suggestions?

*troll mode on* Use numpy/scipy? :)


