Re: CJK Analyzer indexing japanese word document

Chandan Tamrakar Mon, 22 Mar 2004 03:58:24 -0800

hi scott,
 Tnks for ur advise now i am using POI to convert word documents and made
sure that i convert into unicode before I put into lucene for indexing .
and working perfectly fine. Which parser is best for parsing PDF documents i
tried pdfbox but seems it doesnt work well with japanese characters
any suggestion ?


thnks
----- Original Message -----
From: "Scott Smith" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, March 17, 2004 4:27 AM
Subject: RE: CJK Analyzer indexing japanese word document


> I have used this analyzer with Japanese and it works fine.  In fact, I'm
> currently doing English, several western European languages, traditional
> and simplified Chinese and Japanese.  I throw them all in the same index
> and have had no problem other than my users wanted the search limited by
> language.  I solved that problem by simply adding a keyword field to the
> Document which has the 2-letter language code.  I then automatically add
> the term indicating the language as an additional constraint when the
> user specifies the search.
>
> You do need to be sure that the Shift-JIS gets converted to unicode
> before you put it in the Document (and pass it to the analyzer).
> Internally, I believe lucene wants everything in unicode (as any good
> java program would). Originally, I had problems with Asian languages and
> eventually determined my xml parser wasn't translating my Shift-JIS,
> Big5, etc. to unicode.  Once I fixed that, life was good.
>
> -----Original Message-----
> From: Che Dong [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 16, 2004 8:31 AM
> To: Lucene Users List
> Subject: Re: CJK Analyzer indexing japanese word document
>
> some Korean friends tell me they use it successfully for Korean. So I
> think its also work for Japanese. mostly the problem is locale settings
>
> Please check weblucene project for xml indexing samples:
> http://sourceforge.net/projects/weblucene/
>
> Che Dong
> ----- Original Message -----
> From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Tuesday, March 16, 2004 4:31 PM
> Subject: CJK Analyzer indexing japanese word document
>
>
> >
> > I am using a CJKAnalyzer from apache sandbox , I have set the java
> > file.encoding setting to SJIS
> > and  i am able to index and search the japanese html page . I can see
> the
> > index dumps as i expected , However when i index a word document
> containing
> > japanese characters it is not indexing as expected . Do I need to
> change
> > anything with CJKTokenizer and CJKAnalyzer classes?
> > I have been able to index a word document with StandardAnalyzers.
> >
> > thanks in advace
> > chandan
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: CJK Analyzer indexing japanese word document

Reply via email to