Re: Numerical ids for terms?
On Tue, 2011-04-12 at 11:41 +0200, Gregor Heinrich wrote:
> Hi -- has there been any effort to create a numerical representation of Lucene indices? That is, to use the Lucene Directory backend as a large term-document matrix at index level.
>
> As this would require a bijective mapping between terms (per-field, as customary in Lucene) and a numerical index (integer, monotonic from 0 to numTerms()-1), I guess this requires some special modifications to the Lucene core.

Maybe you're thinking about something like TermsEnum?
https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/index/TermsEnum.html

It provides ordinal access to terms, represented with longs. To make the access index-level rather than segment-level, you will have to merge the ordinals from the different segments.

Unfortunately, support for ordinal-based term access is optional for a codec, and the default codec does not provide it, so you will have to explicitly select a codec when you build your index.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
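The merge of per-segment ordinals into index-level ordinals mentioned in the reply above could look roughly like the following. This is only a sketch under stated assumptions: it does not use the Lucene API at all, the class name GlobalOrdinals is hypothetical, plain sorted lists of Strings stand in for per-segment term dictionaries, and terms are ordered by Java's String comparison, which may differ from the order Lucene uses internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper (not part of Lucene): maps per-segment term ordinals to index-level ordinals. */
final class GlobalOrdinals {
    /** Sorted union of all segment dictionaries; the list position is the global ordinal. */
    final List<String> terms;
    /** segmentToGlobal[s][ord] is the global ordinal of the term with ordinal ord in segment s. */
    final long[][] segmentToGlobal;

    /** Assumes every segment list is sorted in String order and contains no duplicates. */
    GlobalOrdinals(List<List<String>> segments) {
        TreeSet<String> union = new TreeSet<>();
        for (List<String> segment : segments) {
            union.addAll(segment);
        }
        terms = new ArrayList<>(union);

        segmentToGlobal = new long[segments.size()][];
        for (int s = 0; s < segments.size(); s++) {
            List<String> segment = segments.get(s);
            long[] map = new long[segment.size()];
            int global = 0;
            for (int ord = 0; ord < segment.size(); ord++) {
                // Both lists are sorted, so the global cursor only ever moves forward.
                while (!terms.get(global).equals(segment.get(ord))) {
                    global++;
                }
                map[ord] = global;
            }
            segmentToGlobal[s] = map;
        }
    }

    long globalOrd(int segment, int segmentOrd) {
        return segmentToGlobal[segment][segmentOrd];
    }

    String term(long globalOrd) {
        return terms.get((int) globalOrd);
    }
}
```

With two segments holding ("apple", "cat") and ("bat", "cat", "dog"), the term "cat" has segment ordinal 1 in the first and 1 in the second, and both map to the same global ordinal. The mapping has to be rebuilt whenever the set of segments changes.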
Re: Numerical ids for terms?
Thanks Toke and Kirill -- I guess that's the way to go (at least until v4.0).

Best regards
gregor

On 4/13/11 3:42 PM, Toke Eskildsen wrote:
> On Tue, 2011-04-12 at 11:41 +0200, Gregor Heinrich wrote:
> > Hi -- has there been any effort to create a numerical representation of Lucene indices? That is, to use the Lucene Directory backend as a large term-document matrix at index level.
> >
> > As this would require a bijective mapping between terms (per-field, as customary in Lucene) and a numerical index (integer, monotonic from 0 to numTerms()-1), I guess this requires some special modifications to the Lucene core.
>
> Maybe you're thinking about something like TermsEnum?
> https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/index/TermsEnum.html
>
> It provides ordinal access to terms, represented with longs. To make the access index-level rather than segment-level, you will have to merge the ordinals from the different segments.
>
> Unfortunately, support for ordinal-based term access is optional for a codec, and the default codec does not provide it, so you will have to explicitly select a codec when you build your index.
Numerical ids for terms?
Hi -- has there been any effort to create a numerical representation of Lucene indices? That is, to use the Lucene Directory backend as a large term-document matrix at index level.

As this would require a bijective mapping between terms (per-field, as customary in Lucene) and a numerical index (integer, monotonic from 0 to numTerms()-1), I guess this requires some special modifications to the Lucene core.

Another interesting feature would be to use Lucene's Directory backend for storage of large dense matrices, for instance for data-mining tasks from within Lucene.

Any suggestions?

Best regards and thanks
gregor
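To make the question concrete, here is a small in-memory sketch of the structure being asked for: a bijective mapping between terms and integer ids from 0 to numTerms()-1, plus a sparse term-document count matrix indexed by those ids. It is an illustration only and assumes nothing about Lucene; the class TermDocumentMatrix and its methods are hypothetical, and documents are represented as plain lists of already-tokenized terms for a single field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration: term <-> id bijection plus sparse term-document counts. */
final class TermDocumentMatrix {
    /** id -> term; ids follow the sorted order of the terms. */
    final String[] idToTerm;
    /** term -> id; the inverse of idToTerm. */
    final Map<String, Integer> termToId = new TreeMap<>();
    /** One sparse row per document: term id -> term frequency. */
    final List<Map<Integer, Integer>> rows = new ArrayList<>();

    TermDocumentMatrix(List<List<String>> docs) {
        // Collect the vocabulary in sorted order so ids are stable for a given corpus.
        TreeSet<String> vocabulary = new TreeSet<>();
        for (List<String> doc : docs) {
            vocabulary.addAll(doc);
        }
        idToTerm = vocabulary.toArray(new String[0]);
        for (int id = 0; id < idToTerm.length; id++) {
            termToId.put(idToTerm[id], id);
        }
        // Count term occurrences per document, keyed by term id.
        for (List<String> doc : docs) {
            Map<Integer, Integer> row = new TreeMap<>();
            for (String term : doc) {
                row.merge(termToId.get(term), 1, Integer::sum);
            }
            rows.add(row);
        }
    }

    int numTerms() {
        return idToTerm.length;
    }

    /** Frequency of the term with the given id in the given document, 0 if absent. */
    int count(int termId, int docId) {
        return rows.get(docId).getOrDefault(termId, 0);
    }
}
```

Because ids follow the sorted term order, adding a document with a new term can shift existing ids, which is one reason an index-backed version needs care.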
Re: Numerical ids for terms?
On Tue, Apr 12, 2011 at 13:41, Gregor Heinrich <gre...@arbylon.net> wrote:
> Hi -- has there been any effort to create a numerical representation of Lucene indices? That is, to use the Lucene Directory backend as a large term-document matrix at index level.
>
> As this would require a bijective mapping between terms (per-field, as customary in Lucene) and a numerical index (integer, monotonic from 0 to numTerms()-1), I guess this requires some special modifications to the Lucene core.

Lucene index already provides term - id mapping in some form.

> Another interesting feature would be to use Lucene's Directory backend for storage of large dense matrices, for instance for data-mining tasks from within Lucene.

Lucene's Directory is a dumb abstraction for random-access named write-once byte streams. It doesn't add /any/ value over mmap.

> Any suggestions?

*troll mode on* Use numpy/scipy? :)

--
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785
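For comparison, the plain mmap route mentioned in the reply above might be sketched in Java as follows. This is a minimal example using only java.nio, not Lucene; the class MappedMatrix is hypothetical, the matrix is stored row-major as 8-byte doubles, and a single mapping is assumed to be enough, which limits the matrix to less than 2 GB because FileChannel.map rejects larger regions.

```java
import java.io.IOException;
import java.nio.DoubleBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: a dense row-major matrix of doubles backed by a memory-mapped file. */
final class MappedMatrix {
    final int rows;
    final int cols;
    private final DoubleBuffer data;

    MappedMatrix(Path file, int rows, int cols) throws IOException {
        this.rows = rows;
        this.cols = cols;
        long bytes = (long) rows * cols * Double.BYTES;
        MappedByteBuffer mapped;
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, bytes);
        }
        data = mapped.asDoubleBuffer();
    }

    double get(int row, int col) {
        return data.get(row * cols + col);
    }

    void set(int row, int col, double value) {
        data.put(row * cols + col, value);
    }
}
```

What this leaves out is exactly the boilerplate a higher-level abstraction would hide: file naming, splitting matrices larger than one mapping, and flushing changes to disk at defined points.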
Re: Numerical ids for terms?
Thanks for the quick response.

Please be a bit more concrete than "some form of term - id mapping": do you refer to subclassing SegmentReader with the appropriate Map implementation, or is there a tested structure in the existing API that I've overlooked?

Regarding a Directory abstraction backed by a memory-mapping API, my question refers to using the Lucene API because, even if it may be perceived as dumb, it hides a lot of boilerplate code. Are there any efforts going on regarding this?

Cheers
gregor

On 4/12/11 1:21 PM, Earwin Burrfoot wrote:
> On Tue, Apr 12, 2011 at 13:41, Gregor Heinrich <gre...@arbylon.net> wrote:
> > Hi -- has there been any effort to create a numerical representation of Lucene indices? That is, to use the Lucene Directory backend as a large term-document matrix at index level.
> >
> > As this would require a bijective mapping between terms (per-field, as customary in Lucene) and a numerical index (integer, monotonic from 0 to numTerms()-1), I guess this requires some special modifications to the Lucene core.
>
> Lucene index already provides term - id mapping in some form.
>
> > Another interesting feature would be to use Lucene's Directory backend for storage of large dense matrices, for instance for data-mining tasks from within Lucene.
>
> Lucene's Directory is a dumb abstraction for random-access named write-once byte streams. It doesn't add /any/ value over mmap.
>
> > Any suggestions?
>
> *troll mode on* Use numpy/scipy? :)