[lucy-dev] Schema for searching IRC logs

Moritz Lenz Sun, 20 Feb 2011 10:02:21 -0800

(Originally I sent this mail to the kinosearch mailing list, but since
it's temporarily down, Marvin suggested I send it to lucy-dev instead.
Please excuse me if it's not quite on topic here.)


Hi all,

I've been running public IRC logs for a few years now, and have decided
to replace the crappy search with something decent. So, KinoSearch it is :-)

One page of these logs contains the conversation from one channel on one
particular day, and each such page contains many rows consisting of an
ID, a timestamp, a nickname, and the line that was uttered.
Example: http://irclog.perlgeek.de/perl6/2011-02-19. (Currently I have
about 20 channels, a few years' worth of logs and 4 million rows; I want
to be able to scale up to maybe 20 million rows.)

I want my search results to be grouped similarly, so my current schema
looks like this:

my $schema      = KinoSearch::Plan::Schema->new;
my $poly_an     = KinoSearch::Analysis::PolyAnalyzer->new(language => 'en');
my $full_text   = KinoSearch::Plan::FullTextType->new(
                    analyzer => $poly_an,
                    stored   => 0,
                  );
my $string      = KinoSearch::Plan::StringType->new( stored => 0 );
my $kept_string = KinoSearch::Plan::StringType->new(
                    stored   => 1,
                    sortable => 1,
                  );
my $sort_string = KinoSearch::Plan::StringType->new(
                    stored   => 0,
                    sortable => 1,
                  );

$schema->spec_field(name => 'line',     type => $full_text);
$schema->spec_field(name => 'nick',     type => $string);
$schema->spec_field(name => 'channel',  type => $kept_string);
$schema->spec_field(name => 'day',      type => $kept_string);
$schema->spec_field(name => 'timestamp',type => $sort_string);
$schema->spec_field(name => 'id',       type => $kept_string);

Having each line as a separate document has three disadvantages:

1) when displaying the results, I have to construct the context manually
(so I need to hit the DB to get the rows before and after, which is why
I don't store the line in the index)

2) when paging the search results, I rip apart the last page, because
the num_wanted option works with rows, not pages.

3) not sure about this one, but it feels like this solution doesn't
scale well. I've waited more than half a minute for a query that was
limited to 100 rows. (Maybe my three sort_specs hurt here?)
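
To make point 2 concrete, here is a minimal sketch of regrouping row-level
hits into (channel, day) pages after the search. It is plain Perl with no
KinoSearch calls; the group_hits_by_page helper and the sample hit hashes
are made up for illustration and only stand in for the stored fields
(channel, day, id) from the schema above.

```perl
use strict;
use warnings;

# Hypothetical helper: group row-level hits into (channel, day) pages,
# keeping the order in which each page first appears in the hit list.
sub group_hits_by_page {
    my ($hits, $max_pages) = @_;
    my (@pages, %page_for);
    for my $hit (@$hits) {
        my $key = join '/', $hit->{channel}, $hit->{day};
        if (!exists $page_for{$key}) {
            # Stop opening new pages once the limit is reached, but keep
            # collecting rows that belong to pages already started.
            next if @pages >= $max_pages;
            $page_for{$key} = {
                channel => $hit->{channel},
                day     => $hit->{day},
                ids     => [],
            };
            push @pages, $page_for{$key};
        }
        push @{ $page_for{$key}{ids} }, $hit->{id};
    }
    return \@pages;
}

# Made-up sample data standing in for the stored fields of search hits.
my @hits = (
    { channel => 'perl6',  day => '2011-02-19', id => 101 },
    { channel => 'perl6',  day => '2011-02-19', id => 105 },
    { channel => 'parrot', day => '2011-02-18', id => 77  },
    { channel => 'perl6',  day => '2011-02-17', id => 42  },
);

my $pages = group_hits_by_page(\@hits, 2);
printf "%s %s: %s\n", $_->{channel}, $_->{day}, join(',', @{ $_->{ids} })
    for @$pages;
```

This only works around the problem: num_wanted still counts rows, so I
would have to over-fetch rows and hope the last page in the list is
complete.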

Is there a way to construct my schema that avoids these problems
(and still allows searching by field)? Something like sub-documents,
where I have pages as top-level documents, and each page can have
multiple rows?

Cheers,
Moritz
