Re: RE: CJK Analyzer indexing japanese word document
My experience tells me that CJKAnalyzer still needs some improvement. For example, a single-word wildcard search (X*) works perfectly, but a multiple-word wildcard search (XX*) never works.

----- Original Message -----
From: Scott Smith [EMAIL PROTECTED]
Date: Tuesday, March 16, 2004 5:42 pm
Subject: RE: CJK Analyzer indexing japanese word document

I have used this analyzer with Japanese and it works fine. In fact, I'm currently doing English, several western European languages, traditional and simplified Chinese, and Japanese. I throw them all in the same index and have had no problem other than that my users wanted the search limited by language. I solved that problem by simply adding a keyword field to the Document which has the 2-letter language code. I then automatically add the term indicating the language as an additional constraint when the user specifies the search.

You do need to be sure that the Shift-JIS gets converted to Unicode before you put it in the Document (and pass it to the analyzer). Internally, I believe Lucene wants everything in Unicode (as any good Java program would). Originally, I had problems with Asian languages and eventually determined that my XML parser wasn't translating my Shift-JIS, Big5, etc. to Unicode. Once I fixed that, life was good.

----- Original Message -----
From: Che Dong [EMAIL PROTECTED]
Sent: Tuesday, March 16, 2004 8:31 AM
To: Lucene Users List
Subject: Re: CJK Analyzer indexing japanese word document

Some Korean friends tell me they use it successfully for Korean, so I think it also works for Japanese. Mostly the problem is locale settings. Please check the weblucene project for XML indexing samples: http://sourceforge.net/projects/weblucene/

Che Dong

----- Original Message -----
From: Chandan Tamrakar [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, March 16, 2004 4:31 PM
Subject: CJK Analyzer indexing japanese word document

I am using the CJKAnalyzer from the Apache sandbox. I have set the Java file.encoding setting to SJIS, and I am able to index and search a Japanese HTML page; the index dumps look as I expected. However, when I index a Word document containing Japanese characters, it is not indexed as expected. Do I need to change anything in the CJKTokenizer and CJKAnalyzer classes? I have been able to index a Word document with StandardAnalyzer.

Thanks in advance,
chandan

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
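Two points from Scott's message can be sketched with nothing but the JDK: decoding Shift-JIS bytes into a Unicode String before the text is put into a Document, and appending a language term to the user's query. This is only an illustration under stated assumptions. The class name, the method names and the field name "lang" are hypothetical, and no Lucene classes are called; the query string simply follows Lucene's usual "+required field:term" query syntax.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: class, method and field names are assumptions, not from the thread.
final class CjkIndexingSketch {

    // Assumes the running JDK provides the Shift_JIS charset (standard full JDK builds do).
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    /** Decode raw Shift-JIS bytes into a Unicode String before building the Document. */
    static String decodeShiftJis(byte[] raw) {
        return new String(raw, SHIFT_JIS);
    }

    /** Wrap the user's query and require a match on a hypothetical "lang" keyword field. */
    static String withLanguageConstraint(String userQuery, String langCode) {
        if (langCode == null || !langCode.matches("[a-z]{2}")) {
            throw new IllegalArgumentException("expected a 2-letter language code: " + langCode);
        }
        return "+(" + userQuery + ") +lang:" + langCode;
    }

    public static void main(String[] args) {
        // Three kanji written as Unicode escapes so the source file encoding does not matter.
        String japanese = "\u65E5\u672C\u8A9E";
        byte[] raw = japanese.getBytes(SHIFT_JIS); // stand-in for bytes read from a file

        String decoded = decodeShiftJis(raw);
        // Decoding the same bytes with the wrong charset yields different text.
        String misdecoded = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.println("correct decode matches: " + decoded.equals(japanese));
        System.out.println("wrong-charset decode matches: " + misdecoded.equals(japanese));
        System.out.println(withLanguageConstraint("lucene", "ja"));
    }
}
```

The same idea applies to Big5 or any other legacy encoding: name the charset explicitly at the point where bytes become a String, rather than relying on the platform default.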
RE: Trouble running web demo
Try changing the permissions on the index directory to 777.

= Original Message From Lucene Users List [EMAIL PROTECTED] =

Hi,

When I run the web demo I get an error that says:

ERROR opening the Index - contact sysadmin!
While parsing query: /opt/lucene/index not a directory

I do not have permission to modify /opt, so I have not created an index directory in it, and therefore I do not use the default /opt/lucene/index. I have changed the configuration files and, as far as I can tell, also modified the luceneweb.war file. Is there any other file that I should be modifying? I might also be making an error when redeploying luceneweb.war. How do I redeploy the file, and what other errors could I be making?

Thanks,
Prerak
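Since the error text complains that the path is not a directory, it may help to check what the configured index location actually looks like on disk before changing permissions. The following is a minimal sketch using only the JDK; the class and method names are hypothetical and are not part of the Lucene demo.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: class and method names are assumptions, not part of the Lucene demo.
final class IndexPathCheck {

    /** Describe the state of a candidate index location. */
    static String describe(Path indexDir) {
        if (!Files.exists(indexDir)) {
            return "missing";
        }
        if (!Files.isDirectory(indexDir)) {
            return "not a directory";
        }
        if (!Files.isReadable(indexDir)) {
            return "not readable";
        }
        return "ok";
    }

    public static void main(String[] args) {
        // Defaults to the path from the error message; pass another path as the first argument.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "/opt/lucene/index");
        System.out.println(indexDir + ": " + describe(indexDir));
    }
}
```

Run it as the same user that runs the servlet container, with the index path you put into the configuration files. If the demo still reports /opt/lucene/index after the configuration change, that suggests the deployed web application is still using the old setting.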
RE: [ANN] PDFBox 0.6.0
Ben,

I downloaded PDFBox and installed it, and I can use:

    java org.pdfbox.Main PDF-file output-text-file

to convert a .pdf file to a text file. Then I tried to integrate it with Lucene. I modified the following code in IndexHTML.java:

    // Index PDF files through PDFBox's Lucene integration
    else if (file.getPath().endsWith(".pdf")) {
        Document doc = LucenePDFDocument.getDocument(file);
        System.out.println("adding " + "pdf files");
        writer.addDocument(doc);
    }

It compiled with ant (ant wardemo). However, when I tested:

    java org.apache.lucene.demo.IndexHTML -create -index {index-dir} ..

it seems that it still did not pick up the new IndexHTML.java and still did not index .pdf files. Did I miss something here?

Regards,
George

= Original Message From Lucene Users List [EMAIL PROTECTED] =

I would like to announce the next release of PDFBox. PDFBox allows PDF documents to be indexed using Lucene through a simple interface. Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument, which will extract all text and PDF document summary properties as Lucene fields.

You can obtain the latest release from http://www.pdfbox.org

Please send all bug reports to me and attach the PDF document when possible.

RELEASE 0.6.0
-Massive improvements to memory footprint.
-Must call close() on the COSDocument (LucenePDFDocument does this for you)
-Really fixed the bug where small documents were not being indexed.
-Fixed bug where no whitespace existed between obj and start of object.
 Exception in thread "main" java.io.IOException: expected='obj' actual='obj/Pro
-Fixed issue with spacing where textLineMatrix was not being copied properly
-Fixed 'bug' where parsing would fail with some pdfs with double endobj definitions
-Added PDF document summary fields to the lucene document

Thank you,
Ben Litchfield
http://www.pdfbox.org
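On George's question about the modified IndexHTML not being picked up: one possible cause (an assumption; nothing in the thread confirms it) is that an older compiled copy of the class sits earlier on the classpath than the freshly built one. A minimal JDK-only sketch for checking where a class is actually loaded from follows; the class and method names are hypothetical.

```java
import java.security.CodeSource;

// Illustrative only: class and method names are assumptions, not from PDFBox or Lucene.
final class WhereLoadedFrom {

    static final String NO_SOURCE = "(no code source; likely a JDK class)";

    /** Return the jar or directory a class was loaded from, if the JVM reports one. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return NO_SOURCE;
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name, e.g. org.apache.lucene.demo.IndexHTML.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
    }
}
```

Run it with the same classpath used for the indexing command and pass org.apache.lucene.demo.IndexHTML as the argument. If the printed location is not the output of the latest build, the JVM is running an older copy of the class.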