Thanks for the quick response. Could you be a bit more concrete than "some form" of term<->id mapping? Do you mean subclassing SegmentReader with an appropriate Map implementation, or is there a tested structure in the existing API that I have overlooked? Regarding a Directory abstraction backed by a memory-mapping API: my question is about using the Lucene API because, even if it may be perceived as "dumb", it hides a lot of boilerplate code. Are there any efforts under way in this direction?

Cheers

gregor

On 4/12/11 1:21 PM, Earwin Burrfoot wrote:
> On Tue, Apr 12, 2011 at 13:41, Gregor Heinrich <[email protected]> wrote:
>> Hi -- has there been any effort to create a numerical representation of
>> Lucene indices? That is, to use the Lucene Directory backend as a large
>> term-document matrix at index level. As this would require a bijective mapping
>> between terms (per-field, as customary in Lucene) and a numerical index
>> (integer, monotonic from 0 to numTerms()-1), I guess this requires
>> some special modifications to the Lucene core.
> Lucene index already provides term <-> id mapping in some form.
>
>> Another interesting feature would be to use Lucene's Directory backend for
>> storage of large dense matrices, for instance for data-mining tasks from
>> within Lucene.
> Lucene's Directory is a dumb abstraction for random-access named
> write-once byte streams.
> It doesn't add /any/ value over mmap.
>
>> Any suggestions?
> *troll mode on* Use numpy/scipy? :)
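
For illustration, the bijective term <-> id mapping asked about in this thread could look roughly like the sketch below. It is plain Python and does not use any Lucene API. The names TermIdMap and term_document_rows are hypothetical. It assumes the terms of one field fit in memory and that ids simply follow sorted term order (0 to numTerms-1).

```python
from bisect import bisect_left


class TermIdMap:
    """Hypothetical bijection between the terms of one field and ids 0..n-1."""

    def __init__(self, terms):
        # Sort and de-duplicate so that ids follow term order.
        self._terms = sorted(set(terms))

    def __len__(self):
        return len(self._terms)

    def term_to_id(self, term):
        # Binary search in the sorted term list.
        i = bisect_left(self._terms, term)
        if i < len(self._terms) and self._terms[i] == term:
            return i
        raise KeyError(term)

    def id_to_term(self, term_id):
        if not 0 <= term_id < len(self._terms):
            raise IndexError(term_id)
        return self._terms[term_id]


def term_document_rows(docs, term_map):
    """Yield one sparse row {term_id: count} per document (a list of tokens)."""
    for tokens in docs:
        row = {}
        for tok in tokens:
            tid = term_map.term_to_id(tok)
            row[tid] = row.get(tid, 0) + 1
        yield row
```

Each yielded row is one row of a sparse term-document matrix. Whether something equivalent can be derived from an existing index without touching the Lucene core is exactly the open question above.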

