(Originally I sent this mail to the KinoSearch mailing list, but since it's temporarily down, Marvin suggested I send it to lucy-dev instead. Please excuse me if it's not quite on topic here.)
Hi all,

I've been running public IRC logs for a few years now, and have decided to replace the crappy search with something decent. So, KinoSearch it is :-)

One page of these logs contains the conversation from one channel on one particular day, and each such page contains many rows consisting of an ID, a timestamp, a nickname, and the line that was said. Example: http://irclog.perlgeek.de/perl6/2011-02-19

(Currently I have about 20 channels, a few years' worth of logs and 4 million rows; I want to be able to scale up to maybe 20 million rows.)

I want my search results to be grouped similarly, so my current schema looks like this:

    use KinoSearch::Plan::Schema;
    use KinoSearch::Plan::FullTextType;
    use KinoSearch::Plan::StringType;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $schema  = KinoSearch::Plan::Schema->new;
    my $poly_an = KinoSearch::Analysis::PolyAnalyzer->new(language => 'en');

    # Analyzed text, searchable but not stored in the index.
    my $full_text = KinoSearch::Plan::FullTextType->new(
        analyzer => $poly_an,
        stored   => 0,
    );
    # Exact-match strings: searchable only / stored and sortable / sortable only.
    my $string      = KinoSearch::Plan::StringType->new(stored => 0);
    my $kept_string = KinoSearch::Plan::StringType->new(stored => 1, sortable => 1);
    my $sort_string = KinoSearch::Plan::StringType->new(stored => 0, sortable => 1);

    $schema->spec_field(name => 'line',      type => $full_text);
    $schema->spec_field(name => 'nick',      type => $string);
    $schema->spec_field(name => 'channel',   type => $kept_string);
    $schema->spec_field(name => 'day',       type => $kept_string);
    $schema->spec_field(name => 'timestamp', type => $sort_string);
    $schema->spec_field(name => 'id',        type => $kept_string);

Having each line as a separate document has three disadvantages:

1) When displaying the results, I have to construct the context manually (so I need to hit the DB to get the rows before and after, which is why I don't store the line in the index).

2) When paging the search results, I rip apart the last page, because the num_wanted option works with rows, not pages.

3) Not sure about this one, but it feels like this solution doesn't scale well. I've waited more than half a minute for a query that was limited to 100 rows. (Maybe my three sort_specs hurt here?)
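To make point 2 concrete, here is a minimal sketch in plain Perl. It does not use KinoSearch at all and is not my real code: the sample rows, the limit of 4, and the helper name group_into_pages are made up for illustration. It only shows what happens when a row-based limit is applied to hits that are later grouped by channel and day.

```perl
use strict;
use warnings;

# Hypothetical hits, already sorted by channel, day and id, as a search
# over one-row-per-document might return them.
my @hits = (
    { channel => 'perl6', day => '2011-02-18', id => 1 },
    { channel => 'perl6', day => '2011-02-18', id => 2 },
    { channel => 'perl6', day => '2011-02-19', id => 3 },
    { channel => 'perl6', day => '2011-02-19', id => 4 },
    { channel => 'perl6', day => '2011-02-19', id => 5 },
);

# Group consecutive rows that share channel and day into one page.
sub group_into_pages {
    my @rows = @_;
    my @pages;
    for my $row (@rows) {
        my $key = "$row->{channel}/$row->{day}";
        if (!@pages || $pages[-1]{key} ne $key) {
            push @pages, { key => $key, ids => [] };
        }
        push @{ $pages[-1]{ids} }, $row->{id};
    }
    return @pages;
}

# Simulate a row-based limit (assumed value): keep only the first 4 rows.
my $num_wanted = 4;
my @limited    = @hits[0 .. $num_wanted - 1];

my @pages = group_into_pages(@limited);   # what the user sees
my @full  = group_into_pages(@hits);      # what the pages really contain

for my $page (@pages) {
    printf "%s: rows %s\n", $page->{key}, join(', ', @{ $page->{ids} });
}
```

With these sample rows, the page for 2011-02-19 is listed with rows 3 and 4 only; row 5 belongs to the same page but falls outside the limit, which is the "ripped apart" last page I mean.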
Is there a way to construct my schema that avoids these problems and still allows searching by field? Something like sub-documents, where pages are the top-level documents and each page can contain multiple rows?

Cheers,
Moritz
