Hi Chris,
Thank you for the information.
With CJKAnalyzer, the diagnostic output is as follows:
token  posInc  startOfst  endOfst
[Aa]   1       0          2
[aa]   1       1          3
[aB]   1       2          4
[BC]   1       3          5
[Cc]   1       4          6
[cD]   1       5          7
[Dd]   1       6          8
[dE]   1       7          9
[EF]   1       8          10
[FG]   1       9          11
[Gg]   1       10         12
[gH]   1       11         13
[Hh]   1       12         14
[hI]   1       13         15
[Ii]   1       14         16
[iJ]   1       15         17
[JK]   1       16         18
[Kk]   1       17         19
[kL]   1       18         20
[LM]   1       19         21
[Mm]   1       20         22
[mN]   1       21         23
<B>AaaBCcDdEFGgHhIiJKkLMmN</B>
As Mark pointed out, CJKAnalyzer produces a TokenStream in which
all tokens overlap.
But JapaneseAnalyzer produces a stream of tokens that
do not overlap, as I showed in my previous mail.
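To make the overlap concrete, here is a minimal standalone Java sketch. It uses no Lucene classes: Span, bigrams() and group() are hypothetical names for illustration only. It assumes, following Mark's remark, that a highlighter treats a run of overlapping tokens as one unit, so a token joins the current group whenever its start offset lies before the group's end offset.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration only; this is not the Lucene Highlighter code.
class OverlapDemo {

    // Hypothetical stand-in for a token with character offsets.
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Emit overlapping two-character tokens, like the CJKAnalyzer listing above.
    static List<Span> bigrams(String text) {
        List<Span> out = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            out.add(new Span(text.substring(i, i + 2), i, i + 2));
        }
        return out;
    }

    // Assumed grouping rule: a token whose start offset is before the
    // current group's end offset is merged into that group.
    static List<Span> group(String text, List<Span> tokens) {
        List<Span> groups = new ArrayList<>();
        int start = -1;
        int end = -1;
        for (Span t : tokens) {
            if (start >= 0 && t.start < end) {
                end = Math.max(end, t.end);
            } else {
                if (start >= 0) {
                    groups.add(new Span(text.substring(start, end), start, end));
                }
                start = t.start;
                end = t.end;
            }
        }
        if (start >= 0) {
            groups.add(new Span(text.substring(start, end), start, end));
        }
        return groups;
    }

    public static void main(String[] args) {
        String text = "AaaBCcDdEFGgHhIiJKkLMmN";
        for (Span g : group(text, bigrams(text))) {
            System.out.println(g.text + " " + g.start + " " + g.end);
        }
    }
}
```

On the sample text, all 22 bigrams fall into a single group covering offsets 0 to 23, which matches the fully highlighted output above. Under the same assumption, tokens that only touch (end offset equal to the next start offset) stay in separate groups, so this does not explain the JapaneseAnalyzer case.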
BTW, I couldn't find CJKHighlighter or CJKHighlighterAnalyzer in the
sandbox...
Koji
> -----Original Message-----
> From: Chris Lu [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, September 06, 2005 3:53 PM
> To: [email protected]
> Subject: Re: Highlighter apply to Japanese
>
>
> Hi, Koji,
>
> I had the same problem as you. This is because CJK n-gram analysis
> differs from single-character analysis.
>
> My workaround is to use CJKHighlighter and
> CJKHighlightAnalyzer in the sandbox.
>
> --
> Chris Lu
> ------------
> Lucene Search RAD on Any Database
> http://www.dbsight.net
>
>
> On 9/5/05, Koji Sekiguchi <[EMAIL PROTECTED]> wrote:
> > Hi again,
> >
> > I'm using highlighter to highlight terms in Japanese text,
> > but I cannot get the desired output.
> >
> > If I use StandardAnalyzer or SnowballAnalyzer with English text,
> > getBestFragment() returns the desired output:
> >
> > Sample: (SnowballAnalyzer)
> > Text: A meeting will be held in the City Hall
> > TokenStream:
> > [a][meet][will][be][held][in][the][citi][hall]
> > Query Text: meet
> > Output: A <B>meeting</B> will be held in the City Hall
> >
> > But if I use JapaneseAnalyzer, which is the most popular Analyzer
> > in Japan for getting a TokenStream from Japanese text, to highlight
> > Japanese text with Highlighter, the whole text is highlighted:
> >
> > Sample: (JapaneseAnalyzer)
> > Text: AMeetingWillBeHeldInTheCityHall
> > TokenStream:
> > [A][Meeting][Will][Be][Held][In][The][City][Hall]
> > Query Text: Meeting
> > Output: <B>AMeetingWillBeHeldInTheCityHall</B>
> >
> > Please note that I use the alphabet for the Text in the second sample
> > because most users on this mailing list can read it, but in reality
> > I used Japanese characters for the Text. As you can see,
> > JapaneseAnalyzer, which uses a Japanese dictionary in the background
> > to extract tokens from the text stream, can recognize tokens and
> > produce a TokenStream.
> > But highlighter.getBestFragment() highlighted the whole text.
> >
> > Do I need to implement Fragmenter to highlight tokens correctly
> > for Japanese text?
> >
> > Thanks in advance,
> >
> > Koji
> >
> >
> >
> >
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>