Peter Karman wrote on 1/25/10 9:12 PM:

I'll try and create a test case. I suspect it's going to be because I'm using a lot of fields of various FieldType combinations.


Here's the test case.

First, you need to create a corpus to test with. I use this script:

http://svn.swish-e.org/libswish3/trunk/perl/docmaker.pl

like this:

 perl docmaker.pl \
    --utf_factor=0 \
    --write_files \
    --tmp_dir path/to/my/testdocs/ \
    --max_files 33000 \
    --max_words 3 \
    --tmp_dir_segments 2

You could also make fewer files with more words in them, or use a different corpus altogether. But there appears to be something magical in the *total number* of terms parsed.
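To compare corpora by total term count, a rough counter like the one below might help. It is only a sketch. It assumes terms are whitespace-separated once tags are stripped with the same regex the test script's parse_file() uses, which only approximates what PolyAnalyzer actually emits. The helper name count_terms is made up for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: rough count of whitespace-separated terms in all
# .xml files under $dir, after stripping tags with the same regex that
# parse_file() uses. Returns ( number of files, number of terms ).
sub count_terms {
    my ($dir) = @_;
    my ( $files, $terms ) = ( 0, 0 );
    find(
        {   no_chdir => 1,
            wanted   => sub {
                return unless -f $_ && m/\.xml$/;
                open my $fh, '<', $_ or die "can't read $_: $!";
                my $buf = do { local $/; <$fh> };    # slurp whole file
                close $fh;
                $buf = '' unless defined $buf;
                $buf =~ s/<.+?>//sg;                 # strip tags
                my @words = split ' ', $buf;         # split on whitespace
                $terms += @words;
                $files++;
            },
        },
        $dir
    );
    return ( $files, $terms );
}

if (@ARGV) {
    my ( $files, $terms ) = count_terms( $ARGV[0] );
    print "$files files, $terms terms\n";
}
else {
    warn "usage: $0 path/to/files\n";
}
```

Running it against two corpora that do and do not trigger the failure would give a crude idea of where the threshold sits.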

Second, here's the test script:

--------------------8<------------------------
#!/usr/bin/env perl
use strict;
use warnings;

use File::Find;
use File::Slurp;
use Data::Dump qw( dump );
use KinoSearch::Indexer;
use KinoSearch::Schema;
use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::FieldType::FullTextType;
use KinoSearch::FieldType::StringType;

my $usage = "$0 path/to/files\n";
die $usage unless @ARGV;

my $path_to_index = 'test-ks-utf8';
my $lang          = 'en';
my $schema        = KinoSearch::Schema->new();
my $analyzer  = KinoSearch::Analysis::PolyAnalyzer->new( language => $lang, );
my $fieldtype = KinoSearch::FieldType::FullTextType->new(
    analyzer      => $analyzer,
    highlightable => 1,
    sortable      => 1,
);
my $stringtype = KinoSearch::FieldType::StringType->new( sortable => 1, );
$schema->spec_field(
    name => 'swishtitle',
    type => $fieldtype,
);
$schema->spec_field(
    name => 'swishdefault',
    type => $fieldtype,
);

for my $property_name (
    qw(
    swishdescription
    swishdocpath
    swishdocsize
    swishencoding
    swishlastmodified
    swishmime
    swishparser
    swishwordnum
    )
    )
{
    $schema->spec_field(
        name => $property_name,
        type => $stringtype,
    );
}

my $indexer = KinoSearch::Indexer->new(
    schema => $schema,
    index  => $path_to_index,
    create => 1,
);

my $count = 0;

find( { wanted => \&wanted, no_chdir => 1 }, @ARGV );
print "Crawled $count documents\n";
$indexer->commit();

sub wanted {
    my $filename = $File::Find::name;
    return unless $filename =~ m/\.xml/;
    my $doc = parse_file($filename);

    #warn dump $doc;

    $indexer->add_doc($doc);
    $count++;
}

sub parse_file {
    my $file = shift;
    my $buf  = read_file($file);
    $buf =~ s/<.+?>//sg;
    return {
        swishtitle        => "",  # yes, empty
        swishdescription  => "",  # yes, empty
        swishdefault      => $buf,
        swishlastmodified => ( stat($file) )[9],
        swishdocsize      => ( stat($file) )[7],
        swishparser       => 'XML',
        swishmime         => 'application/xml',
        swishencoding     => 'utf-8',
        swishdocpath      => $file,
        swishwordnum      => 0,   # yes, zero
    };
}
--------------------8<------------------------

Here are some things I notice.

1) if I comment out the swishwordnum and swishdescription in parse_file() it works.

2) if I comment out the swishdescription alone, it fails.

3) if I comment out the swishwordnum alone, it fails.
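One way to switch between these cases without hand-editing parse_file() each time could look like the sketch below. The helper build_doc and the SKIP_FIELDS environment variable are both invented for illustration. The field list is copied from the test script.

```perl
use strict;
use warnings;

# Hypothetical variant of the hash parse_file() returns: same fields as
# the test script, minus any field named in @omit.
sub build_doc {
    my ( $file, $buf, @omit ) = @_;
    my @st  = stat($file);
    my %doc = (
        swishtitle        => "",
        swishdescription  => "",
        swishdefault      => $buf,
        swishlastmodified => $st[9],
        swishdocsize      => $st[7],
        swishparser       => 'XML',
        swishmime         => 'application/xml',
        swishencoding     => 'utf-8',
        swishdocpath      => $file,
        swishwordnum      => 0,
    );
    delete @doc{@omit};    # hash-slice delete; no-op when @omit is empty
    return \%doc;
}

# SKIP_FIELDS is a made-up variable for this sketch, e.g.
#   SKIP_FIELDS=swishwordnum,swishdescription
my @omit = split /,/, ( $ENV{SKIP_FIELDS} // '' );
```

parse_file() would then end with `return build_doc( $file, $buf, @omit );`, so each of the three cases above becomes a different SKIP_FIELDS value on the command line.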

I'm all in for tonight, but hopefully this helps expose what's going on, either in my code or in KS.

cheers,
pek
--
Peter Karman  .  http://peknet.com/  .  [email protected]
