I still have a problem when I try to update the index using the custom analyzer.
If I comment out truncate => 1 and rerun, I get the following error:

    'body' assigned conflicting FieldType
        LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
        at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm line 118.
    Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
        '/home/hwu/data/lucy/mitbbs.index', 'schema',
        'Lucy::Plan::Schema=SCALAR(0x211c758)', 'create', 1)
        called at mitbbs_index.pl line 26
    *** Error in `perl': corrupted double-linked list: 0x00000000021113a0 ***

If I switch the analyzer to Lucy::Analysis::StandardTokenizer, it works fine and a new seg_2 is created:

    my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
    my $raw_type  = Lucy::Plan::FullTextType->new(
        analyzer => $tokenizer,
    );

So I guess I must be missing something in the custom Chinese analyzer.

------------------ my script ------------------

    #!/usr/local/bin/perl
    # TODO: update existing docs instead of recreating the index every time
    use strict;
    use warnings;

    use DBI;
    use File::Spec::Functions qw( catfile );
    use Lucy::Plan::Schema;
    use Lucy::Plan::FullTextType;
    use Lucy::Index::Indexer;
    use ChineseAnalyzer;

    my $path_to_index = '/home/hwu/data/lucy/mitbbs.index';

    # Create the schema.
    my $schema   = Lucy::Plan::Schema->new;
    my $chinese  = ChineseAnalyzer->new();
    my $raw_type = Lucy::Plan::FullTextType->new(
        analyzer => $chinese,
    );
    $schema->spec_field( name => 'body', type => $raw_type );

    # Create an Indexer object.
    my $indexer = Lucy::Index::Indexer->new(
        index    => $path_to_index,
        schema   => $schema,
        create   => 1,
        truncate => 1,
    );

    # Pull the documents out of SQLite.
    my $driver   = "SQLite";
    my $database = "/home/hwu/data/mitbbs.db";
    my $dsn      = "DBI:$driver:dbname=$database";
    my $dbh      = DBI->connect( $dsn, "", "", { RaiseError => 1 } )
        or die $DBI::errstr;

    my $stmt = qq(SELECT id, text FROM post WHERE id >= 100 AND id < 200;);
    #my $stmt = qq(SELECT id, text FROM post WHERE id < 100;);
    my $sth = $dbh->prepare($stmt);
    my $rv  = $sth->execute() or die $DBI::errstr;

    while ( my @row = $sth->fetchrow_array() ) {
        print "id = " . $row[0] . "\n";
        print $row[1];
        my $doc = { body => $row[1] };
        $indexer->add_doc($doc);
    }

    $indexer->commit;
    print "Finished.\n";

On Sat, Feb 18, 2017 at 6:46 AM, Nick Wellnhofer <wellnho...@aevum.de> wrote:

> On 18/02/2017 07:22, Hao Wu wrote:
>
>> Thanks. Got it to work.
>
> Lucy's StandardTokenizer breaks up the text at the word boundaries defined
> in Unicode Standard Annex #29. Then we treat every Alphabetic character
> that doesn't have a Word_Break property as a single term. These are
> characters that match \p{Ideographic}, \p{Script: Hiragana}, or
> \p{Line_Break: Complex_Context}. This should work for Chinese, but as
> Peter mentioned, we don't support n-grams.
>
> If you're using QueryParser, you're likely to run into problems, though.
> QueryParser will turn a sequence of Chinese characters into a PhraseQuery,
> which is obviously wrong. A quick hack is to insert a space after every
> Chinese character before passing a query string to QueryParser:
>
>     $query_string =~ s/\p{Ideographic}/$& /g;
>
> Nick
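For what it's worth, Nick's space-insertion hack can be tried standalone in plain Perl before wiring it into QueryParser. This is just a sketch; the query string below is an arbitrary example I made up, and only the substitution itself comes from Nick's reply:

```perl
#!/usr/bin/perl
# Standalone demo of the space-insertion hack from Nick's reply.
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

my $query_string = "我爱你中国";   # hypothetical user query

# Insert a space after every ideographic character so that QueryParser
# would see individual terms instead of building a PhraseQuery.
$query_string =~ s/\p{Ideographic}/$& /g;

print "$query_string\n";   # prints "我 爱 你 中 国 " (note the trailing space)
```

Note that the substitution leaves non-CJK text (ASCII words, digits) untouched, since \p{Ideographic} only matches ideographic characters, so mixed-language queries pass through with just the Chinese runs split up.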