Re: RE: CJK Analyzer indexing japanese word document

2004-03-16 Thread xx28
My experience tells me that CJKAnalyzer still needs some improvement.

For example, a single-word wildcard search such as X* works perfectly, but a
multi-word wildcard search such as XX* never works.

- Original Message -
From: Scott Smith [EMAIL PROTECTED]
Date: Tuesday, March 16, 2004 5:42 pm
Subject: RE: CJK Analyzer indexing japanese word document

 I have used this analyzer with Japanese and it works fine.  In fact, I'm
 currently doing English, several western European languages, traditional
 and simplified Chinese, and Japanese.  I throw them all into the same index
 and have had no problem, other than my users wanting the search limited by
 language.  I solved that problem by simply adding a keyword field to the
 Document which holds the 2-letter language code.  I then automatically add
 the term indicating the language as an additional constraint when the
 user specifies the search.
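 
 A minimal sketch of that approach, assuming the Lucene 1.3-era API (the
 "lang" field name, the "ja" code, and the analyzer/searcher/userInput
 variables here are illustrative, not the actual code):
 
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.queryParser.QueryParser;
 import org.apache.lucene.search.*;
 
 // Index time: store the 2-letter language code as an untokenized keyword
 // alongside the analyzed body text.
 Document doc = new Document();
 doc.add(Field.Keyword("lang", "ja"));
 doc.add(Field.Text("contents", bodyText));
 
 // Search time: AND a term for the requested language onto the user's query.
 Query userQuery = QueryParser.parse(userInput, "contents", analyzer);
 BooleanQuery constrained = new BooleanQuery();
 constrained.add(userQuery, true, false);                              // required
 constrained.add(new TermQuery(new Term("lang", "ja")), true, false);  // required
 Hits hits = searcher.search(constrained);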
 
 You do need to be sure that the Shift-JIS gets converted to Unicode
 before you put it in the Document (and pass it to the analyzer).
 Internally, I believe Lucene wants everything in Unicode (as any good
 Java program would).  Originally, I had problems with Asian languages and
 eventually determined my XML parser wasn't translating my Shift-JIS,
 Big5, etc. to Unicode.  Once I fixed that, life was good.
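 
 For example, a minimal sketch of pulling a Shift-JIS file into a Unicode
 String before adding it to the Document (the file name and the reuse of
 the "doc"/"contents" names from the sketch above are just placeholders):
 
 import java.io.*;
 
 BufferedReader in = new BufferedReader(
     new InputStreamReader(new FileInputStream("page_sjis.html"), "SJIS"));
 StringBuffer text = new StringBuffer();
 String line;
 while ((line = in.readLine()) != null) {
     text.append(line).append('\n');    // Java Strings hold Unicode internally
 }
 in.close();
 doc.add(Field.Text("contents", text.toString()));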
 
 -Original Message-
 From: Che Dong [EMAIL PROTECTED] 
 Sent: Tuesday, March 16, 2004 8:31 AM
 To: Lucene Users List
 Subject: Re: CJK Analyzer indexing japanese word document
 
 Some Korean friends tell me they use it successfully for Korean, so I
 think it should also work for Japanese.  Mostly the problem is locale
 settings.  Please check the weblucene project for XML indexing samples:
 http://sourceforge.net/projects/weblucene/ 
 
 Che Dong
 - Original Message -
 From: Chandan Tamrakar [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Tuesday, March 16, 2004 4:31 PM
 Subject: CJK Analyzer indexing japanese word document
 
 
  
  I am using the CJKAnalyzer from the Apache sandbox. I have set the Java
  file.encoding setting to SJIS, and I am able to index and search a
  Japanese HTML page; I can see the index dumps as I expected. However,
  when I index a Word document containing Japanese characters, it is not
  indexed as expected. Do I need to change anything in the CJKTokenizer
  and CJKAnalyzer classes?
  I have been able to index a Word document with StandardAnalyzer.
  
  Thanks in advance,
  Chandan
  
  
  



RE: Trouble running web demo

2003-06-06 Thread xx28
Try changing the permissions on the index directory to 777.

= Original Message From Lucene Users List 
[EMAIL PROTECTED] =
Hi,

 When I run the web demo I get an error that says:


 ERROR opening the Index - contact sysadmin!

 While parsing query: /opt/lucene/index not a directory

 I do not have permission to modify /opt, so I have not created an index
 directory in it; thus I do not use the default /opt/lucene/index. I have
 changed the configuration files and, as far as I can tell, also modified
 the luceneweb.war file. Is there any other file that I should be modifying?
 I might also be making an error redeploying luceneweb.war.

 How do I redeploy the file, and what other errors could I be making?

 Thanks,
 Prerak





RE: [ANN] PDFBox 0.6.0

2003-03-06 Thread xx28
Ben,

I downloaded PDFBox and installed it, and I can run:
 java org.pdfbox.Main PDF-file output-text-file
to convert a .pdf file to a plain-text file.

Then I tried to integrate it with Lucene. I modified the following code in IndexHTML.java:

else if (file.getPath().endsWith(".pdf")) {
    // build a Lucene Document from the PDF via PDFBox
    Document doc = LucenePDFDocument.getDocument(file);
    System.out.println("adding " + "pdf files");
    writer.addDocument(doc);
}

It did pass the ant build (ant wardemo). However, when I tested:
java org.apache.lucene.demo.IndexHTML -create -index {index-dir} ..

it seems that it still did not pick up the new IndexHTML.java and still did
not index .pdf files.


Did I miss something here?

Regards,

George

= Original Message From Lucene Users List 
[EMAIL PROTECTED] =
I would like to announce the next release of PDFBox.  PDFBox allows for
PDF documents to be indexed using lucene through a simple interface.
Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument,
which will extract all text and PDF document summary properties as lucene
fields.
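
A minimal sketch of indexing one PDF through this interface (the index path,
analyzer choice, and writer setup here are illustrative assumptions, not
something prescribed by the release):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
Document doc = LucenePDFDocument.getDocument(new File("manual.pdf"));
writer.addDocument(doc);   // full text and summary properties become lucene fields
writer.optimize();
writer.close();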

You can obtain the latest release from http://www.pdfbox.org

Please send all bug reports to me and attach the PDF document when
possible.

RELEASE 0.6.0
- Massive improvements to memory footprint.
- Must call close() on the COSDocument (LucenePDFDocument does this for you).
- Really fixed the bug where small documents were not being indexed.
- Fixed bug where no whitespace existed between "obj" and the start of the object:
  Exception in thread "main" java.io.IOException: expected='obj' actual='obj/Pro
- Fixed issue with spacing where textLineMatrix was not being copied properly.
- Fixed 'bug' where parsing would fail with some PDFs with double endobj definitions.
- Added PDF document summary fields to the lucene document.


Thank you,
Ben Litchfield
http://www.pdfbox.org


