ab, bc, cd}. I was therefore
expecting the sigram to tokenize abcd as {a, b, c, d}. What the
StandardTokenizer does though is tokenize abcd as {abcd}.
Note I am using ascii characters above, but the argument is meant for
CJK characters. I'll switch to in the rest of this email to mean
each: abcd is tokenized as {ab, bc, cd}. I was therefore
expecting the sigram to tokenize abcd as {a, b, c, d}. What the
StandardTokenizer does though is tokenize abcd as {abcd}.
Note I am using ascii characters above, but the argument is meant for
CJK characters. I'll switch to in the re
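
For readers following the abcd example, here is a minimal Java sketch of the two behaviours being compared: single-character tokens {a, b, c, d} versus overlapping two-character tokens {ab, bc, cd}. The class name CjkNgramSketch and the method ngrams are hypothetical, made up for illustration. This is not the code in StandardTokenizer or in the patch, only a sketch of the idea under that assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class CjkNgramSketch {

    // Splits text into overlapping n-grams of Unicode code points.
    // n = 1 gives one token per character, n = 2 gives overlapping pairs.
    static List<String> ngrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        int[] codePoints = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + n <= codePoints.length; i++) {
            tokens.add(new String(codePoints, i, n));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("abcd", 1)); // prints [a, b, c, d]
        System.out.println(ngrams("abcd", 2)); // prints [ab, bc, cd]
    }
}
```

Working on code points rather than char values keeps characters outside the Basic Multilingual Plane in one piece, which matters for some CJK extension characters.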
Does that mean that sigram and ideogram are synonymous?
(cf. http://en.wikipedia.org/wiki/Ideogram)
Thanks,
Otis
--- Che Dong <[EMAIL PROTECTED]> wrote:
> means tokenizing Chinese/Japanese words (which by nature have no
> spaces for word segmentation) character by character.
>
> Rega
means tokenizing Chinese/Japanese words (which by nature have no spaces for
word segmentation) character by character.
Regards
Che, Dong
- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene List" <[EMAIL PROTECTED]>
Sent: Tuesday, December
Could someone define "sigram" for me? It is used as a type of token in
StandardTokenizer. I know it relates to the CJK stuff, but I'm curious
about the term "sigram" and what it means, specifically in the context
of the StandardTokenize
gzilla/show_bug.cgi?id=23466
StandardTokenzier with CJK support(sigram)
[EMAIL PROTECTED] changed:
What|Removed |Added
Status|REOPENED|RESOLVED
Reso
gzilla/show_bug.cgi?id=23466
StandardTokenzier with CJK support(sigram)
[EMAIL PROTECTED] changed:
What|Removed |Added
Status|RESOLVED|REOPENED
Reso
gzilla/show_bug.cgi?id=23466
StandardTokenzier with CJK support(sigram)
[EMAIL PROTECTED] changed:
What|Removed |Added
Status|NEW |RESOLVED
Reso
gzilla/show_bug.cgi?id=23466
StandardTokenzier with CJK support(sigram)
--- Additional Comments From [EMAIL PROTECTED] 2003-09-30 16:24 ---
Created an attachment (id=8397)
Patch file for proposed change
-
To unsubscribe,
gzilla/show_bug.cgi?id=23466
StandardTokenzier with CJK support(sigram)
--- Additional Comments From [EMAIL PROTECTED] 2003-09-29 22:58 ---
Ok, maybe I'm just clueless on applying patches, so enlighten me on how to use what
you provided to patch my local version. It doesn'
gzilla/show_bug.cgi?id=23466
StandardTokenzier with CJK support(sigram)
Summary: StandardTokenzier with CJK support(sigram)
Product: Lucene
Version: CVS Nightly - Specify date in submission
Platform: All
URL: http://www.chedong.com/
OS/V
+1
Che Dong wrote:
>>Attached StandardTokenizer.jj with Sigram Based east
>>asia language support:
>>tested under Windows and GNU/Linux
>>
>>Just treat different UnicodeBlock with different word
>>segment method.
>>
>>Hope in the futur
> Attached StandardTokenizer.jj with Sigram Based east
> asia language support:
> tested under Windows and GNU/Linux
>
> Just treat different UnicodeBlock with different word
> segment method.
>
> Hope in future releases we can add more language
> supp
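
The idea of treating different UnicodeBlocks with different word segmentation methods can be sketched in plain Java as below. This is not the attached StandardTokenizer.jj (a JavaCC grammar whose contents are not shown here). The class BlockAwareTokenizerSketch, its methods, and the choice of which blocks count as CJK are all assumptions made for illustration.

```java
import java.lang.Character.UnicodeBlock;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code from the patch.
final class BlockAwareTokenizerSketch {

    // Assumption: these four blocks stand in for "CJK"; the real patch may cover others.
    static boolean isCjk(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint);
        return block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == UnicodeBlock.HIRAGANA
            || block == UnicodeBlock.KATAKANA
            || block == UnicodeBlock.HANGUL_SYLLABLES;
    }

    // CJK characters become one token each; other letters and digits are
    // grouped into runs; everything else acts as a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isCjk(cp)) {
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp);
            } else {
                flush(word, tokens);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    // Emits the pending non-CJK run, if any, and resets the buffer.
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character word for "Chinese (language)".
        System.out.println(tokenize("abcd \u4e2d\u6587 x1")); // four tokens: abcd, U+4E2D, U+6587, x1
    }
}
```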