On 16.03.2010 22:44, Marvin Humphrey wrote:
> On Tue, Mar 16, 2010 at 04:15:40PM +0100, Nick Wellnhofer wrote:
>> What's the easiest way to get to the term-document matrix either during or
>> after indexing?
> 
> I'm not sure what format would be most helpful for you.  Here's code to
> iterate over all terms and all postings in all segments for the "content"
> field:
> 
>   my $poly_reader = KinoSearch::Index::PolyReader->open( 
>     index => '/path/to/index',
>   );  
>   my %postings;
>   my $offset = 0;
>   for my $seg_reader ( @{ $poly_reader->seg_readers } ) { 
>     my $lex_reader = $seg_reader->obtain("KinoSearch::Index::LexiconReader");
>     my $plist_reader
>       = $seg_reader->obtain("KinoSearch::Index::PostingListReader");
>     my $lexicon = $lex_reader->lexicon( field => 'content');
>     my $plist = $plist_reader->posting_list( field => 'content' );
>     while ($lexicon->next) {
>       my $term = $lexicon->get_term;
>       warn $term;
>       $postings{$term} ||= []; 
>       my $doc_id_array = $postings{$term};
>       $plist->seek($term);
>       while (my $seg_doc_id = $plist->next) {
>         push @$doc_id_array, $seg_doc_id + $offset;
>       }   
>     }   
>     $offset += $seg_reader->doc_max;
>   }
> 
> Does that at least provide a point of departure?

Thanks, that's helpful. I also figured out how to get the term
frequencies. How does KinoSearch compute the final term weights? I had a
look at Search/Similarity.c and it seems to be Sim_tf * Sim_idf *
Sim_length_norm.

On a side note, how can I interface with KinoSearch or Lucy directly on
the C level? Is there any documentation yet?

Nick

-- 
aevum gmbh
rumfordstr. 4
80469 münchen
germany

tel: +49 89 3838 0653
http://aevum.de/
