Peter Karman wrote on 1/25/10 9:12 PM:
I'll try and create a test case. I suspect it's going to be because I'm
using a lot of fields of various FieldType combinations.
Here's the test case.
First, you need to create a corpus to test with. I use this script:
http://svn.swish-e.org/libswish3/trunk/perl/docmaker.pl
like this:
perl docmaker.pl \
    --utf_factor=0 \
    --write_files \
    --tmp_dir path/to/my/testdocs/ \
    --max_files 33000 \
    --max_words 3 \
    --tmp_dir_segments 2
You could also make fewer files with more words in them, or use a different
corpus altogether. But there appears to be something magical in the
*total number* of terms parsed.
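To compare corpora by total term count, something like the following could give a rough number. This is only a sketch: count_terms() and count_corpus_terms() are hypothetical helpers, not part of docmaker.pl or KinoSearch, and splitting on whitespace is just a proxy for what PolyAnalyzer actually produces (it also case-folds and stems).

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;

# Approximate term count for one document body: strip tags the same way
# the test script does, then split on whitespace. This is NOT how
# PolyAnalyzer tokenizes; it is only a rough proxy.
sub count_terms {
    my ($buf) = @_;
    $buf = '' unless defined $buf;
    $buf =~ s/<.+?>//sg;
    my @terms = grep { length } split /\s+/, $buf;
    return scalar @terms;
}

# Walk one or more directory trees and sum the approximate term counts
# of every file whose name ends in .xml.
sub count_corpus_terms {
    my (@dirs) = @_;
    my ( $files, $terms ) = ( 0, 0 );
    find(
        {   no_chdir => 1,
            wanted   => sub {
                return unless -f $_ && /\.xml$/;
                open my $fh, '<', $_ or die "can't read $_: $!";
                my $buf = do { local $/; <$fh> };    # slurp whole file
                close $fh;
                $terms += count_terms($buf);
                $files++;
            },
        },
        @dirs
    );
    return ( $files, $terms );
}

# Only crawl when directories are given on the command line.
if (@ARGV) {
    my ( $files, $terms ) = count_corpus_terms(@ARGV);
    print "$files files, ~$terms terms\n";
}
```

Running it against path/to/my/testdocs/ before and after changing --max_files or --max_words would show whether two differently shaped corpora land near the same total.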
Second, here's the test script:
--------------------8<------------------------
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;
use File::Slurp;
use Data::Dump qw( dump );
use KinoSearch::Indexer;
use KinoSearch::Schema;
use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::FieldType::FullTextType;
use KinoSearch::FieldType::StringType;
my $usage = "$0 path/to/files\n";
die $usage unless @ARGV;
my $path_to_index = 'test-ks-utf8';
my $lang = 'en';
my $schema = KinoSearch::Schema->new();
my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => $lang, );
my $fieldtype = KinoSearch::FieldType::FullTextType->new(
    analyzer => $analyzer,
    highlightable => 1,
    sortable => 1,
);
my $stringtype = KinoSearch::FieldType::StringType->new( sortable => 1, );
$schema->spec_field(
    name => 'swishtitle',
    type => $fieldtype,
);
$schema->spec_field(
    name => 'swishdefault',
    type => $fieldtype,
);
for my $property_name (
    qw(
    swishdescription
    swishdocpath
    swishdocsize
    swishencoding
    swishlastmodified
    swishmime
    swishparser
    swishwordnum
    )
    )
{
    $schema->spec_field(
        name => $property_name,
        type => $stringtype,
    );
}
my $indexer = KinoSearch::Indexer->new(
    schema => $schema,
    index => $path_to_index,
    create => 1,
);
my $count = 0;
find( { wanted => \&wanted, no_chdir => 1 }, @ARGV );
print "Crawled $count documents\n";
$indexer->commit();
sub wanted {
    my $filename = $File::Find::name;
    return unless $filename =~ m/\.xml/;
    my $doc = parse_file($filename);
    #warn dump $doc;
    $indexer->add_doc($doc);
    $count++;
}
sub parse_file {
    my $file = shift;
    my $buf = read_file($file);
    $buf =~ s/<.+?>//sg;
    return {
        swishtitle => "", # yes, empty
        swishdescription => "", # yes, empty
        swishdefault => $buf,
        swishlastmodified => ( stat($file) )[9],
        swishdocsize => ( stat($file) )[7],
        swishparser => 'XML',
        swishmime => 'application/xml',
        swishencoding => 'utf-8',
        swishdocpath => $file,
        swishwordnum => 0, # yes, zero
    };
}
--------------------8<------------------------
Here are some things I notice.
1) If I comment out both swishwordnum and swishdescription in parse_file(),
it works.
2) If I comment out swishdescription alone, it fails.
3) If I comment out swishwordnum alone, it fails.
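The comment-out experiments above could be driven from a loop instead of by hand. The following is a minimal sketch: subsets() and without_fields() are hypothetical helpers, and the actual re-indexing step is left as a comment because it depends on the test script. Note that it only drops the fields from the document hash; the schema would still declare them, which matches commenting them out in parse_file().

```perl
use strict;
use warnings;

# The two fields from the observations above.
my @suspects = qw( swishdescription swishwordnum );

# Return every subset of @fields as an array ref (2**n subsets),
# starting with the empty subset.
sub subsets {
    my (@fields) = @_;
    my $n = scalar @fields;
    my @out;
    for my $mask ( 0 .. 2**$n - 1 ) {
        push @out,
            [ map { $fields[$_] } grep { $mask & ( 1 << $_ ) } 0 .. $#fields ];
    }
    return @out;
}

# Return a shallow copy of a document hash with the named fields removed.
sub without_fields {
    my ( $doc, @omit ) = @_;
    my %copy = %$doc;
    delete @copy{@omit};
    return \%copy;
}

for my $omit ( subsets(@suspects) ) {
    my $label = @$omit ? join( ',', @$omit ) : '(none)';
    print "run with omitted fields: $label\n";

    # Assumption: here the test script would rebuild the index from scratch,
    # passing without_fields( parse_file($file), @$omit ) to add_doc(),
    # and record whether the run succeeds or fails.
}
```

With two suspect fields this gives four runs: nothing omitted, each field omitted alone, and both omitted.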
I'm all in for tonight, but hopefully this helps expose what's going on,
either with my code or in KS.
cheers,
pek
--
Peter Karman . http://peknet.com/