[lucy-user] Chinese support?

2017-02-17 Thread Hao Wu
Hi all,

I use the StandardTokenizer. Searching for English words works, but Chinese
gives me strange results.

my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type  = Lucy::Plan::FullTextType->new(
    analyzer => $tokenizer,
);

I was also going to use the EasyAnalyzer
(https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod),
but Chinese is not supported.

What is the simplest way to use Lucy with Chinese documents? Thanks.

Best,

Hao Wu


Re: [lucy-user] Chinese support?

2017-02-17 Thread Hao Wu
Thanks. Got it working.

Code pasted below in case anyone has a similar question.

package ChineseAnalyzer;
use Jieba;
use v5.10;
use Encode qw(decode_utf8);
use base qw( Lucy::Analysis::Analyzer );

sub new {
    my $self = shift->SUPER::new;
    return $self;
}

sub transform {
    my ( $self, $inversion ) = @_;
    return $inversion;
}

sub transform_text {
    my ( $self, $text ) = @_;
    my $inversion = Lucy::Analysis::Inversion->new;

    # Jieba returns (word, start_offset, end_offset) tuples.
    my @tokens = Jieba::jieba_tokenize( decode_utf8($text) );
    for my $token (@tokens) {
        $inversion->append(
            Lucy::Analysis::Token->new(
                text         => $token->[0],
                start_offset => $token->[1],
                end_offset   => $token->[2],
            )
        );
    }
    return $inversion;
}

1;



package Jieba;
use v5.10;

sub jieba_tokenize {
    jieba_tokenize_python(shift);
}

# TODO:
# result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
use Inline Python => <<'END_OF_PYTHON_CODE';
from jieba import tokenize

def jieba_tokenize_python(text):
    # Each element of the result is a (word, start, end) tuple.
    seg_list = tokenize(text, mode='search')
    return list(seg_list)

END_OF_PYTHON_CODE

1;
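
As a quick sanity check of the wrapper, something like this should dump the
(word, start, end) tuples for the sample string from the TODO comment above.
This is only an illustrative sketch; it assumes the Jieba package is saved as
its own Jieba.pm next to the script.

#!/usr/local/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8);
use Jieba;    # assumption: the package above lives in Jieba.pm

# Print each token with its character offsets.
for my $t ( Jieba::jieba_tokenize('永和服装饰品有限公司') ) {
    printf "%s\t%d\t%d\n", encode_utf8( $t->[0] ), $t->[1], $t->[2];
}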


On Fri, Feb 17, 2017 at 6:29 PM, Peter Karman  wrote:

> Hao Wu wrote on 2/17/17 4:44 PM:
>
>> Hi all,
>>
>> I use the StandardTokenizer. Searching for English words works, but
>> Chinese gives me strange results.
>>
>> my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
>> my $raw_type  = Lucy::Plan::FullTextType->new(
>>     analyzer => $tokenizer,
>> );
>>
>> I was also going to use the EasyAnalyzer
>> (https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod),
>> but Chinese is not supported.
>>
>> What is the simplest way to use Lucy with Chinese documents? Thanks.
>>
>
> There is currently no equivalent of
> https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKTokenizer.html
> within core Lucy.
>
> Furthermore, there is no automatic language detection in Lucy. You'll note
> in https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
> that the language must be explicitly specified, and that is for the
> stemming analyzer. Also, Chinese is not among the supported languages
> listed.
>
> Maybe something wrapped around https://metacpan.org/pod/Lingua::CJK::Tokenizer
> would work as a custom analyzer.
>
> You can see an example in the documentation here
> https://metacpan.org/pod/Lucy::Analysis::Analyzer#new
>
>
>
> --
> Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman
>


Re: [lucy-user] Chinese support?

2017-02-20 Thread Hao Wu
I still have a problem when I try to update the index using the custom analyzer.

If I comment out the
   truncate => 1

and rerun, I get the following error.


'body' assigned conflicting FieldType
LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
line 118.
Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
'/home/hwu/data/lucy/mitbbs.index', 'schema',
'Lucy::Plan::Schema=SCALAR(0x211c758)', 'create', 1) called at
mitbbs_index.pl line 26
*** Error in `perl': corrupted double-linked list: 0x021113a0 ***

If I switch the analyzer to Lucy::Analysis::StandardTokenizer, it works fine;
a new seg_2 is created.

my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type  = Lucy::Plan::FullTextType->new(
    analyzer => $tokenizer,
);

So I guess I must be missing something in the custom ChineseAnalyzer.



--my script

#!/usr/local/bin/perl
# TODO: update existing documents instead of recreating the index every time
use strict;
use warnings;

use DBI;
use File::Spec::Functions qw( catfile );

use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;

use ChineseAnalyzer;

my $path_to_index = '/home/hwu/data/lucy/mitbbs.index';

# Create Schema.
my $schema = Lucy::Plan::Schema->new;

my $chinese = ChineseAnalyzer->new();

my $raw_type = Lucy::Plan::FullTextType->new(
    analyzer => $chinese,
);

$schema->spec_field( name => 'body',  type => $raw_type);

# Create an Indexer object.
my $indexer = Lucy::Index::Indexer->new(
    index    => $path_to_index,
    schema   => $schema,
    create   => 1,
    truncate => 1,
);

my $driver   = "SQLite";
my $database = "/home/hwu/data/mitbbs.db";
my $dsn = "DBI:$driver:dbname=$database";
# DBI->connect takes ($dsn, $user, $pass, \%attrs); SQLite ignores user/pass.
my $dbh = DBI->connect( $dsn, "", "", { RaiseError => 1 } ) or die $DBI::errstr;


my $stmt = qq(SELECT id, text from post where id >= 100 and id < 200;);
#my $stmt = qq(SELECT id, text from post where id < 100;);
my $sth = $dbh->prepare( $stmt );
my $rv = $sth->execute() or die $DBI::errstr;

while ( my @row = $sth->fetchrow_array() ) {
    print "id = " . $row[0] . "\n";
    print $row[1];
    my $doc = { body => $row[1] };
    $indexer->add_doc($doc);
}

$indexer->commit;

print "Finished.\n";

On Sat, Feb 18, 2017 at 6:46 AM, Nick Wellnhofer wrote:

> On 18/02/2017 07:22, Hao Wu wrote:
>
>> Thanks. Got it working.
>>
>
> Lucy's StandardTokenizer breaks up the text at the word boundaries defined
> in Unicode Standard Annex #29. Then we treat every Alphabetic character
> that doesn't have a Word_Break property as a single term. These are
> characters that match \p{Ideographic}, \p{Script: Hiragana}, or
> \p{Line_Break: Complex_Context}. This should work for Chinese but as Peter
> mentioned, we don't support n-grams.
>
> If you're using QueryParser, you're likely to run into problems, though.
> QueryParser will turn a sequence of Chinese characters into a PhraseQuery
> which is obviously wrong. A quick hack is to insert a space after every
> Chinese character before passing a query string to QueryParser:
>
> $query_string =~ s/\p{Ideographic}/$& /g;
>
> Nick
>
>
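
A minimal sketch of how Nick's workaround might slot in at query time. It
assumes an IndexSearcher opened on the index built by the script above; the
query string here is made up for illustration.

use strict;
use warnings;
use utf8;
use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;

my $searcher = Lucy::Search::IndexSearcher->new(
    index => '/home/hwu/data/lucy/mitbbs.index',
);
my $qparser = Lucy::Search::QueryParser->new(
    schema => $searcher->get_schema,
);

# Insert a space after every ideograph so QueryParser tokenizes each
# character separately instead of building a bogus PhraseQuery out of
# an unsegmented run of Chinese characters.
my $query_string = '永和服装';
$query_string =~ s/\p{Ideographic}/$& /g;

my $query = $qparser->parse($query_string);
my $hits  = $searcher->hits( query => $query, num_wanted => 10 );
while ( my $hit = $hits->next ) {
    print "$hit->{body}\n";
}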


Re: [lucy-user] Chinese support?

2017-02-20 Thread Hao Wu
Hi Peter,

Thanks for the reply.

That could be a problem. But probably not in my case.

I removed the old index.

If I run the program with 'ChineseAnalyzer' and truncate => 0 twice, the
second time gives me the error:

'body' assigned conflicting FieldType
LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
line 118.
Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
'/home/hwu/data/lucy/mitbbs.index', 'schema',
'Lucy::Plan::Schema=SCALAR(0x1c56798)', 'create', 1) called at
mitbbs_index.pl line 26

If I run the program with 'ChineseAnalyzer' and truncate => 1 twice, there is
no error, but I want to update the index.

If I run the program with 'StandardTokenizer', truncate => 0 and
truncate => 1 both work fine.

So this makes me think I must be missing something in the 'ChineseAnalyzer'
I have.




On Mon, Feb 20, 2017 at 6:47 PM, Peter Karman  wrote:

> Hao Wu wrote on 2/20/17 6:12 PM:
>
>> I still have a problem when I try to update the index using the custom
>> analyzer.
>>
>> If I comment out the
>>    truncate => 1
>>
>> and rerun, I get the following error.
>>
>>
>> 'body' assigned conflicting FieldType
>> LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
>> at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
>> line 118.
>> Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
>> '/home/hwu/data/lucy/mitbbs.index', 'schema',
>> 'Lucy::Plan::Schema=SCALAR(0x211c758)', 'create', 1) called at
>> mitbbs_index.pl line 26
>> *** Error in `perl': corrupted double-linked list: 0x021113a0 ***
>>
>> If I switch the analyzer to Lucy::Analysis::StandardTokenizer, it works
>> fine; a new seg_2 is created.
>>
>> my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
>> my $raw_type  = Lucy::Plan::FullTextType->new(
>>     analyzer => $tokenizer,
>> );
>>
>> So I guess I must be missing something in the custom ChineseAnalyzer.
>>
>>
> Since you changed the field definition with a new analyzer, you must
> create a new index. You cannot update an existing index with two different
> field definitions in the same schema.
>
>
>
> --
> Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman
>


Re: [lucy-user] Chinese support?

2017-02-20 Thread Hao Wu
Hi Peter,

Thanks for spending time on the script.

I cleaned it up a bit, so there are no dependencies now.

https://gist.github.com/swuecho/1b960ae17a1f47466be006fd14e3b7ff

It still does not work.




On Mon, Feb 20, 2017 at 9:03 PM, Peter Karman  wrote:

> Hao Wu wrote on 2/20/17 10:18 PM:
>
>> Hi Peter,
>>
>> Thanks for the reply.
>>
>> That could be a problem. But probably not in my case.
>>
>> I removed the old index.
>>
>> If I run the program with 'ChineseAnalyzer' and truncate => 0 twice, the
>> second time gives me the error:
>>
>> 'body' assigned conflicting FieldType
>> LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
>> at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
>> line 118.
>> Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
>> '/home/hwu/data/lucy/mitbbs.index', 'schema',
>> 'Lucy::Plan::Schema=SCALAR(0x1c56798)', 'create', 1) called at
>> mitbbs_index.pl line 26
>>
>> If I run the program with 'ChineseAnalyzer' and truncate => 1 twice, there
>> is no error, but I want to update the index.
>>
>> If I run the program with 'StandardTokenizer', truncate => 0 and
>> truncate => 1 both work fine.
>>
>> So this makes me think I must be missing something in the 'ChineseAnalyzer'
>> I have.
>>
>>
>
> This is not your fault, I don't think. This seems like a bug.
>
> Here's a smaller gist demonstrating the problem:
>
> https://gist.github.com/karpet/d8fe12085246b8419f9e4ab44930c1cc
>
> With the 2 files in the gist, I get this result:
>
> [karpet@pekmac:~/tmp/chinese-analyzer]$ perl indexer.pl test-index
> Building prefix dict from the default dictionary ...
> Loading model from cache /var/folders/r3/yk7hmbb9125fnsdf9bqs6lrmgp/T/jieba.cache
> Loading model cost 0.553 seconds.
> Prefix dict has been built succesfully.
> Finished.
>
> [karpet@pekmac:~/tmp/chinese-analyzer]$ perl indexer.pl test-index
> 'body' assigned conflicting FieldType
> LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
> at /usr/local/perl/5.24.0/lib/site_perl/5.24.0/darwin-2level/Lucy.pm
> line 118.
> Lucy::Index::Indexer::new("Lucy::Index::Indexer", "index",
> "test-index", "schema", Lucy::Plan::Schema=SCALAR(0x7f9b0b004a18),
> "create", 1) called at indexer.pl line 23
> Segmentation fault: 11
>
>
>
> I would expect the code to work as you wrote it, so maybe someone else can
> spot what's going wrong.
>
> Here's what the schema_1.json file looks like after the initial index
> creation:
>
> {
>   "_class": "Lucy::Plan::Schema",
>   "analyzers": [
>     null,
>     {
>       "_class": "ChineseAnalyzer"
>     }
>   ],
>   "fields": {
>     "body": {
>       "analyzer": "1",
>       "type": "fulltext"
>     }
>   }
> }
>
>
> --
> Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman
>


Re: [lucy-user] Custom Analyzer [was Chinese support?]

2017-02-23 Thread Hao Wu
Hi Peter,

Works great. The documentation is significantly better now.

Thanks everyone for taking care of this issue.

Best,

Hao

On Thu, Feb 23, 2017 at 8:23 AM, Peter Karman  wrote:

> Nick Wellnhofer wrote on 2/23/17 6:49 AM:
>
>> On 23/02/2017 04:52, Peter Karman wrote:
>>
>>> package MyAnalyzer {
>>>     use base qw( Lucy::Analysis::Analyzer );
>>>     sub transform { $_[1] }
>>> }
>>>
>>
>> Every Analyzer needs an `equals` method. For simple Analyzers, it can
>> simply check whether the class of the other object matches:
>>
>> package MyAnalyzer {
>>     use base qw( Lucy::Analysis::Analyzer );
>>     sub transform { $_[1] }
>>     sub equals { $_[1]->isa(__PACKAGE__) }
>> }
>>
>> If the Analyzer uses (inside-out) member variables, you'll also need dump
>> and load methods. Unfortunately, we don't have good documentation for
>> writing custom analyzers yet.
>>
>>
>
> Thanks for the quick response and accurate diagnosis, Nick. I see you've
> already committed changes to the POD, so that will be very helpful in the
> future.
>
> Hao, if you add the `sub equals` method to your ChineseAnalyzer, I think
> that should fix your problem. I have confirmed that locally with my own
> tests.
>
>
>
> --
> Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman
>
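
For the record, here is a sketch of the analyzer with Nick's equals fix folded
in. It combines Hao's code from earlier in the thread with the fix; the Jieba
wrapper is assumed unchanged, and Lucy itself is assumed to be loaded by the
indexing script.

package ChineseAnalyzer;
use strict;
use warnings;
use Encode qw(decode_utf8);
use base qw( Lucy::Analysis::Analyzer );
use Jieba;    # the Inline::Python wrapper from earlier in the thread

sub transform {
    my ( $self, $inversion ) = @_;
    return $inversion;
}

sub transform_text {
    my ( $self, $text ) = @_;
    my $inversion = Lucy::Analysis::Inversion->new;
    for my $token ( Jieba::jieba_tokenize( decode_utf8($text) ) ) {
        $inversion->append(
            Lucy::Analysis::Token->new(
                text         => $token->[0],
                start_offset => $token->[1],
                end_offset   => $token->[2],
            )
        );
    }
    return $inversion;
}

# The fix: Lucy compares the analyzer recorded in schema_1.json against the
# one passed in when the Indexer is created, so equals() must be defined.
sub equals { $_[1]->isa(__PACKAGE__) }

1;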