On 16.03.2010 22:44, Marvin Humphrey wrote:
> On Tue, Mar 16, 2010 at 04:15:40PM +0100, Nick Wellnhofer wrote:
>> What's the easiest way to get to the term-document matrix either during or
>> after indexing?
>
> I'm not sure what format would be most helpful for you. Here's code to
> iterate over all terms and all postings in all segments for the "content"
> field:
>
> my $poly_reader = KinoSearch::Index::PolyReader->open(
>     index => '/path/to/index',
> );
> my %postings;    # term => array of index-wide doc ids
> my $offset = 0;  # doc id offset of the current segment
> for my $seg_reader ( @{ $poly_reader->seg_readers } ) {
>     my $lex_reader = $seg_reader->obtain("KinoSearch::Index::LexiconReader");
>     my $plist_reader
>         = $seg_reader->obtain("KinoSearch::Index::PostingListReader");
>     my $lexicon = $lex_reader->lexicon( field => 'content' );
>     my $plist   = $plist_reader->posting_list( field => 'content' );
>     while ( $lexicon->next ) {
>         my $term = $lexicon->get_term;
>         warn $term;
>         $postings{$term} ||= [];
>         my $doc_id_array = $postings{$term};
>         $plist->seek($term);
>         while ( my $seg_doc_id = $plist->next ) {
>             # Convert the segment-local doc id to an index-wide one.
>             push @$doc_id_array, $seg_doc_id + $offset;
>         }
>     }
>     $offset += $seg_reader->doc_max;
> }
>
> Does that at least provide a point of departure?

Thanks, that's helpful. I also figured out how to get the term
frequencies. How does KinoSearch compute the final term weights? I had a
look at Search/Similarity.c, and it seems to be Sim_tf * Sim_idf *
Sim_length_norm.
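
To make the question concrete, here is a minimal pure-Perl sketch of the
product I mean. The three formulas are my assumption (the usual Lucene-style
defaults: square root of the term frequency, a log-based idf, and the inverse
square root of the field length), not something I have confirmed against
Similarity.c. The function names and the sample numbers are made up for
illustration, and the sketch leaves out anything else scoring may apply, such
as boosts.

```perl
use strict;
use warnings;

# ASSUMPTION: Lucene-style default formulas. Check Search/Similarity.c
# before relying on any of these.

# Term frequency factor: grows with the square root of the raw count.
sub sim_tf {
    my ($freq) = @_;
    return sqrt($freq);
}

# Inverse document frequency: rarer terms get a higher weight.
sub sim_idf {
    my ( $doc_freq, $total_docs ) = @_;
    return 1 if $total_docs == 0;
    return 1 + log( $total_docs / ( 1 + $doc_freq ) );
}

# Length normalization: longer fields get a lower weight.
sub sim_length_norm {
    my ($num_tokens) = @_;
    return $num_tokens ? 1 / sqrt($num_tokens) : 0;
}

# Weight of one term in one document: tf * idf * length_norm.
sub term_weight {
    my (%args) = @_;
    return sim_tf( $args{freq} )
        * sim_idf( $args{doc_freq}, $args{total_docs} )
        * sim_length_norm( $args{num_tokens} );
}

# Hypothetical numbers: a term occurring 4 times in a 100-token field,
# present in 9 of 10 documents.
my $weight = term_weight(
    freq       => 4,
    doc_freq   => 9,
    total_docs => 10,
    num_tokens => 100,
);
printf "weight = %.2f\n", $weight;    # prints "weight = 0.20"
```

With these numbers, tf is 2, idf is 1 + ln(10/10) = 1, and the length norm
is 0.1, which gives 0.2.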

On a side note, how can I interface with KinoSearch or Lucy directly at
the C level? Is there any documentation yet?

Nick
--
aevum gmbh
rumfordstr. 4
80469 münchen
germany
tel: +49 89 3838 0653
http://aevum.de/