Hi Scott,

Thanks for your advice. I am now using POI to convert Word documents, and I make sure the text is converted to Unicode before I put it into Lucene for indexing. It is working perfectly fine.

Which parser is best for parsing PDF documents? I tried PDFBox, but it does not seem to work well with Japanese characters. Any suggestions?
Thanks

----- Original Message -----
From: "Scott Smith" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, March 17, 2004 4:27 AM
Subject: RE: CJK Analyzer indexing japanese word document

> I have used this analyzer with Japanese and it works fine. In fact, I'm
> currently doing English, several western European languages, traditional
> and simplified Chinese and Japanese. I throw them all in the same index
> and have had no problem other than my users wanted the search limited by
> language. I solved that problem by simply adding a keyword field to the
> Document which has the 2-letter language code. I then automatically add
> the term indicating the language as an additional constraint when the
> user specifies the search.
>
> You do need to be sure that the Shift-JIS gets converted to unicode
> before you put it in the Document (and pass it to the analyzer).
> Internally, I believe lucene wants everything in unicode (as any good
> java program would). Originally, I had problems with Asian languages and
> eventually determined my xml parser wasn't translating my Shift-JIS,
> Big5, etc. to unicode. Once I fixed that, life was good.
>
> -----Original Message-----
> From: Che Dong [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 16, 2004 8:31 AM
> To: Lucene Users List
> Subject: Re: CJK Analyzer indexing japanese word document
>
> Some Korean friends tell me they use it successfully for Korean, so I
> think it also works for Japanese. Mostly the problem is locale settings.
>
> Please check the weblucene project for xml indexing samples:
> http://sourceforge.net/projects/weblucene/
>
> Che Dong
> ----- Original Message -----
> From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Tuesday, March 16, 2004 4:31 PM
> Subject: CJK Analyzer indexing japanese word document
>
> > I am using a CJKAnalyzer from the apache sandbox. I have set the java
> > file.encoding setting to SJIS, and I am able to index and search the
> > japanese html page. I can see the index dumps as I expected. However,
> > when I index a word document containing japanese characters it is not
> > indexing as expected. Do I need to change anything with the
> > CJKTokenizer and CJKAnalyzer classes?
> > I have been able to index a word document with StandardAnalyzers.
> >
> > thanks in advance
> > chandan
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
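
For anyone following the thread, Scott's advice to convert Shift-JIS to Unicode before the text reaches the Document and the analyzer can be sketched with the standard Java I/O classes alone. This is only an illustration, not code from anyone in the thread: the class name SjisToUnicode and the helper decode are made up for the example, and it assumes the JDK in use ships the "Shift_JIS" charset. The point is to name the charset explicitly when decoding bytes, rather than rely on the file.encoding setting.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper class, named for this example only.
class SjisToUnicode {

    // Reads the whole stream and decodes its bytes with the named charset.
    // The returned String holds Unicode text, whatever the source encoding was.
    public static String decode(InputStream in, String charsetName) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, Charset.forName(charsetName))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // The bytes of "日本語" as encoded in Shift-JIS.
        byte[] sjis = {
            (byte) 0x93, (byte) 0xFA, (byte) 0x96, (byte) 0x7B, (byte) 0x8C, (byte) 0xEA
        };
        String text = decode(new ByteArrayInputStream(sjis), "Shift_JIS");
        // Compare against the Unicode code points U+65E5 U+672C U+8A9E.
        System.out.println(text.equals("\u65E5\u672C\u8A9E"));
    }
}
```

The String returned by decode is what would then be handed to the Lucene Document. Passing the charset name explicitly keeps the result independent of the JVM's default encoding, which is the locale-dependent behaviour Che Dong warns about.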