[lucy-user] Chinese support?

2017-02-17 Thread Hao Wu
Hi all,

I am using the StandardTokenizer. Searching for English words works, but
searching for Chinese gives me strange results.

my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type = Lucy::Plan::FullTextType->new(
    analyzer => $tokenizer,
);

I was also going to use the EasyAnalyzer, but Chinese is not supported.

What is the simplest way to use Lucy with Chinese documents? Thanks.


Hao Wu

Re: [lucy-user] Chinese support?

2017-02-17 Thread Hao Wu
Thanks, I got it working.

The code is pasted below in case anyone has a similar question.

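package ChineseAnalyzer;
use Jieba;
use v5.10;
use Encode qw(decode_utf8);
use base qw( Lucy::Analysis::Analyzer );

sub new {
    my $self = shift->SUPER::new;
    return $self;
}

# Pass-through: the real work happens in transform_text.
sub transform {
    my ($self, $inversion) = @_;
    return $inversion;
}

# Segment the text with jieba and wrap each segment in a Lucy token.
sub transform_text {
    my ($self, $text) = @_;
    my $inversion = Lucy::Analysis::Inversion->new;
    my @tokens = Jieba::jieba_tokenize(decode_utf8($text));
    $inversion->append(
        Lucy::Analysis::Token->new(
            text         => $_->[0],
            start_offset => $_->[1],
            # assumed: end offset is the third element of each jieba tuple
            end_offset   => $_->[2],
        )
    ) for @tokens;
    return $inversion;
}

1;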


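package Jieba;
use v5.10;

# Thin Perl wrapper around the Python function defined below.
# (The body was lost in the archive; it is assumed to forward the text.)
sub jieba_tokenize {
    return jieba_tokenize_python(shift);
}

# result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
use Inline Python => <<'END_OF_PYTHON_CODE';
from jieba import tokenize

def jieba_tokenize_python(text):
    # Each item is a (word, start, end) tuple.
    seg_list = tokenize(text, mode='search')
    return list(seg_list)  # assumed: the original return line was lost
END_OF_PYTHON_CODE

1;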


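For anyone adapting this, the key point is that each Lucy token needs its text plus a start and end offset, which is the shape of the (word, start, end) tuples that jieba's tokenize yields. Below is a minimal Python sketch of that shape. It does not call jieba: `with_offsets` is a hypothetical helper, and the word list is assumed to be already segmented. It also only models non-overlapping segments, whereas jieba's search mode may additionally emit overlapping sub-words.

```python
from typing import List, Tuple


def with_offsets(text: str, words: List[str]) -> List[Tuple[str, int, int]]:
    """Attach (start, end) character offsets to already-segmented words.

    Hypothetical helper for illustration only; it is not part of jieba or Lucy.
    """
    triples = []
    cursor = 0
    for word in words:
        # Search from the end of the previous word so repeated words
        # map to successive positions. Raises ValueError if not found.
        start = text.index(word, cursor)
        end = start + len(word)
        triples.append((word, start, end))
        cursor = end
    return triples


if __name__ == "__main__":
    for triple in with_offsets("永和服装饰品", ["永和", "服装", "饰品"]):
        print(triple)
```

Each triple maps directly onto the text, start_offset, and end_offset arguments passed to Lucy::Analysis::Token->new in the analyzer above.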

Re: [lucy-user] Chinese support?

2017-02-20 Thread Hao Wu
I still have a problem when I try to update the index using the custom analyzer.

If I comment out
   truncate => 1

and rerun, I get the following error.

'body' assigned conflicting FieldType
LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
line 118.
Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
'/home/hwu/data/lucy/mitbbs.index', 'schema',
'Lucy::Plan::Schema=SCALAR(0x211c758)', 'create', 1) called at
mitbbs_index.pl line 26
*** Error in `perl': corrupted double-linked list: 0x021113a0 ***

If I switch the analyzer to Lucy::Analysis::StandardTokenizer, it works fine
and a new seg_2 is created.

my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type = Lucy::Plan::FullTextType->new(
    analyzer => $tokenizer,
);

So I guess I must be missing something in the custom Chinese analyzer.

--my script

# TODO: update documents instead of re-creating the index every time
use DBI;
use File::Spec::Functions qw( catfile );

use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;

use ChineseAnalyzer;

my $path_to_index = '/home/hwu/data/lucy/mitbbs.index';

# Create Schema.
my $schema = Lucy::Plan::Schema->new;

my $chinese = ChineseAnalyzer->new();

my $raw_type = Lucy::Plan::FullTextType->new(
    analyzer => $chinese,
);

$schema->spec_field( name => 'body',  type => $raw_type);

# Create an Indexer object.
my $indexer = Lucy::Index::Indexer->new(
index=> $path_to_index,
schema   => $schema,
create   => 1,
truncate => 1,

my $driver   = "SQLite";
my $database = "/home/hwu/data/mitbbs.db";
my $dsn = "DBI:$driver:dbname=$database";
# connect() takes ($dsn, $user, $password, \%attr); SQLite needs no credentials.
my $dbh = DBI->connect($dsn, "", "", { RaiseError => 1 }) or die $DBI::errstr;

my $stmt = qq(SELECT id, text from post where id >= 100 and id < 200;);
#my $stmt = qq(SELECT id, text from post where id < 100;);
my $sth = $dbh->prepare( $stmt );
my $rv = $sth->execute() or die $DBI::errstr;

while ( my @row = $sth->fetchrow_array() ) {
    print "id = " . $row[0] . "\n";
    print $row[1];
    my $doc = { body => $row[1] };
    $indexer->add_doc($doc);
}

# Write the new segment to disk.
$indexer->commit;

print "Finished.\n";

Re: [lucy-user] Chinese support?

2017-02-20 Thread Hao Wu
Hi Peter,

Thanks for the reply.

That could be a problem, but probably not in my case.

I removed the old index.

I ran the program with 'ChineseAnalyzer' and truncate => 0 twice. The second
time, it gave me this error:

'body' assigned conflicting FieldType
LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
line 118.
Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
'/home/hwu/data/lucy/mitbbs.index', 'schema',
'Lucy::Plan::Schema=SCALAR(0x1c56798)', 'create', 1) called at
mitbbs_index.pl line 26

Running the program with 'ChineseAnalyzer' and truncate => 1 twice gives no error,
but I want to update the index.

Running the program with 'StandardTokenizer', with truncate 0 or 1, works both times.

So this makes me think I must be missing something in the 'ChineseAnalyzer'.

Re: [lucy-user] Chinese support?

2017-02-20 Thread Hao Wu
Hi Peter,

Thanks for spending time on the script.

I cleaned it up a bit, so there are no dependencies now.


It still does not work.

Re: [lucy-user] Custom Analyzer [was Chinese support?]

2017-02-23 Thread Hao Wu
Hi Peter,

It works great. The documentation is significantly better now.

Thanks, everyone, for taking care of this issue.



