abbreviations)
Regards
Uwe Goetzke
-Original Message-
From: Cedric Ho [mailto:[EMAIL PROTECTED]
Sent: Saturday, November 10, 2007 02:28
To: java-user@lucene.apache.org
Subject: - Re: Chinese Segmentation with Phase Query
On Nov 10, 2007 2:08 AM, Steven A Rowe <[EMAIL PROTECTED]> wrote:
> Hi Cedric,
>
> On 11/08/2007, Cedric Ho wrote:
> > a sentence containing the characters ABC may be segmented into AB, C or A,
> > BC.
> [snip]
> > In this case we would like to index both segmentations:
> >
> > AB o
The CJKAnalyzer is too simple for our needs. But thanks for the suggestion anyway.
Cheers,
Cedric
On Nov 9, 2007 10:43 PM, Open Study <[EMAIL PROTECTED]> wrote:
> Hi Cedric
>
> You may try the CJKAnalyzer within the lucene sandbox. It doesn't give
> a perfect solution for Chinese word segmentation, bu
Hi Cedric,
On 11/08/2007, Cedric Ho wrote:
> a sentence containing the characters ABC may be segmented into AB, C or A, BC.
[snip]
> In this case we would like to index both segmentations:
>
> AB offset (0,1) position 0    A offset (0,0) position 0
> C offset (2,2) position
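The token layout quoted above can be illustrated with a small sketch. It uses plain Java rather than Lucene classes: `SegToken` and `DualSegmentation` are hypothetical names, end offsets are inclusive as in the quoted example, and because the positions of the later tokens are cut off in the quote, the sketch assumes that a token's position is simply its ordinal within its own segmentation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one token; not Lucene's Token class.
final class SegToken {
    final String text;
    final int startOffset; // index of the first character, inclusive
    final int endOffset;   // index of the last character, inclusive (as in the quoted example)
    final int position;    // assumption: ordinal of the token within its own segmentation

    SegToken(String text, int startOffset, int endOffset, int position) {
        this.text = text;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.position = position;
    }

    @Override
    public String toString() {
        return text + " offset (" + startOffset + "," + endOffset + ") position " + position;
    }
}

final class DualSegmentation {
    /** Converts one segmentation (words that together cover the sentence) into tokens. */
    static List<SegToken> tokens(List<String> words) {
        List<SegToken> out = new ArrayList<>();
        int offset = 0;
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            out.add(new SegToken(word, offset, offset + word.length() - 1, i));
            offset += word.length();
        }
        return out;
    }
}
```

Calling `DualSegmentation.tokens` once with the words AB, C and once with A, BC yields the two token streams that would both have to go into the index. How such streams interact with phrase queries is the open question of the thread and is not addressed by this sketch.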
Hi Cedric
You may try the CJKAnalyzer in the Lucene sandbox. It doesn't give
a perfect solution for Chinese word segmentation, but it will solve the
problem in your case.
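As background for this suggestion: CJKAnalyzer is generally described as emitting overlapping two-character tokens for CJK text instead of performing dictionary-based segmentation. The sketch below shows that idea in plain Java. `BigramSketch` is a hypothetical name, the code is not the analyzer's actual implementation, and it ignores details such as non-CJK text and characters outside the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

final class BigramSketch {
    /** Returns every pair of adjacent characters in the input, in order. */
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }
}
```

For the sentence ABC this produces AB and BC, so both two-character words are indexed no matter where a segmenter would have placed the boundary. A single-character input yields no tokens in this sketch.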
On Nov 9, 2007 10:59 AM, Cedric Ho <[EMAIL PROTECTED]> wrote:
> Hi,
>
> We are having an issue while indexing Chinese Documen
Hi,
We are having an issue while indexing Chinese documents in Lucene.
Some background first:
Since CJK languages don't have spaces between words, we first have to
determine the words from sentences, e.g.
a sentence containing the characters ABC may be segmented into AB, C or A, BC.
the proble