Re: [ANN] PDFBox 0.6.0

2003-03-06 Thread Eric Anderson
Ben, in attempting to use PDFBox-0.6.0, I received the following error when scanning a reasonably sized PDF repository. Any thoughts? caught a class java.io.EOFException with message: Unexpected end of ZLIB input stream Eric Anderson LanRx Network Solutions Quoting Ben
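For context, a minimal sketch of where this message comes from: java.util.zip.InflaterInputStream throws an EOFException with exactly this text when a zlib stream ends before the compressed data is complete. Whether PDFBox 0.6.0 hits it through a truncated stream or a misparsed stream length is not stated in the thread; the class and method names below are hypothetical and are not PDFBox code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class; it only illustrates the JDK behaviour, not PDFBox internals.
class TruncatedZlibDemo {

    /** Compresses data into a complete zlib stream. */
    static byte[] deflate(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(data);
        }
        return buffer.toByteArray();
    }

    /** Inflates a zlib stream fully; throws EOFException if the stream is cut short. */
    static byte[] inflate(byte[] compressed) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                result.write(chunk, 0, n);
            }
        }
        return result.toByteArray();
    }

    /** Returns the exception message for a stream cut in half, or null if nothing is thrown. */
    static String messageForTruncatedStream() throws IOException {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            text.append("some stream content, repeated. ");
        }
        byte[] full = deflate(text.toString().getBytes(StandardCharsets.UTF_8));
        // Simulate a stream whose recorded length is too short: keep only the first half.
        byte[] truncated = Arrays.copyOf(full, full.length / 2);
        try {
            inflate(truncated);
            return null;
        } catch (EOFException e) {
            return e.getMessage();
        }
    }
}
```

When indexing a whole repository, catching IOException per file and logging the file name lets the run continue past a single bad document.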

Advanced Text Indexing with Lucene

2003-03-06 Thread petite_abeille
Another fine article by Otis: http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html PA.

Re: my experiences - Re: Parsing Word Docs

2003-03-06 Thread Ryan Ackley
David, The textmining.org stuff only works on Word97 and above. It should work with no exceptions on any Word 97 doc. If you have any problems then it is from an earlier version (most likely Word 6.0) or it's not a Word document. If this isn't the case you need to email me so I can fix it and make

Re: my experiences - Re: Parsing Word Docs

2003-03-06 Thread Ryan Ackley
Eric, The problem with antiword is that it is a native application. You must write a class that uses JNI to access the native code. If you link your java code with native code you have lost one of the biggest benefits of Java, platform independence. I would suggest you use the library at

Re: my experiences - Re: Parsing Word Docs

2003-03-06 Thread Eric Anderson
I'll go either way, but I still don't know how to implement the word parser, as opposed to the PDF parser or HTM parser. Eric Anderson LanRx Network Solutions Quoting Ryan Ackley [EMAIL PROTECTED]: Eric, The problem with antiword is that it is a native application. You must write a

AW: my experiences - Re: Parsing Word Docs

2003-03-06 Thread Borkenhagen, Michael (ofd-ko zdfin)
Ryan, I tried to use textmining to extract text from Word 97 documents. Some German characters such as ä and ü aren't parsed correctly, so I can't use it, because many German words include these characters. I don't know whether the cause is textmining or HDF from POI (HSSF from POI parses these characters

Re: [ANN] PDFBox 0.6.0

2003-03-06 Thread Ben Litchfield
In this release I have changed how I parse the document, which may have introduced this bug. I have received another report of this and will have it fixed for the next point release. You said you tried with a reasonably sized PDF repository. Did you stop indexing at this error or did you

AW: my experiences - Re: Parsing Word Docs

2003-03-06 Thread Borkenhagen, Michael (ofd-ko zdfin)
thx a lot :) I'll try it -Original Message- From: Mario Ivankovits [mailto:[EMAIL PROTECTED] Sent: Thursday, 6 March 2003 14:00 To: Lucene Users List Subject: Re: my experiences - Re: Parsing Word Docs The problems with German umlauts should be fixed. I have posted them a

RE: [ANN] PDFBox 0.6.0

2003-03-06 Thread xx28
Ben, I downloaded pdfbox and installed it. And I can use: java org.pdfbox.Main PDF-file output-text-file to convert a .pdf file to a text file. Then I tried to integrate it with Lucene. I modified the following code in IndexHTML.java: else if(file.getPath().endsWith(".pdf")) { Document doc

RE: Multi Language support

2003-03-06 Thread Eric Isakson
Hi Günter, I had a similar requirement for my use of Lucene. We have documents with mixed languages, some of the text in the user's native language and some in English. We made the decision to not use any of the stemming analyzers and index with no stop words (I didn't like the no stop words

AW: [ANN] PDFBox 0.6.0

2003-03-06 Thread Borkenhagen, Michael (ofd-ko zdfin)
Ben, using PDFBox-0.5.6 and, alternatively, PDFBox-0.6.0, I receive the following stack trace: java.lang.ClassCastException: org.pdfbox.cos.COSObject at org.pdfbox.encoding.DictionaryEncoding.init(DictionaryEncoding.java:66) at

Re: my experiences - Re: Parsing Word Docs

2003-03-06 Thread David Spencer
Eric Anderson wrote:
> Ok. Thanks for the tip. I downloaded and compiled Antiword, and would like to now add it to my indexing class. However, I'm not sure how the application would be called,
How? You exec passing the file name and it prints the ascii text to stdout. This method takes the file
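The exec-and-read-stdout approach described in this message could look roughly like the sketch below. It is not the method from the original post, which is cut off here. It uses ProcessBuilder, the newer JDK counterpart of Runtime.exec, and it assumes an `antiword` binary is on the PATH and prints the extracted text to stdout. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; names are illustrative only.
class ExternalTextExtractor {

    /** Runs a command, returns everything it wrote to stdout, and fails on a non-zero exit code. */
    static String runAndCapture(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Send the child's stderr to our stderr so an unread pipe cannot block the child.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        try (InputStream stdout = process.getInputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = stdout.read(chunk)) != -1) {
                captured.write(chunk, 0, n);
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command " + command + " exited with code " + exitCode);
        }
        // Assumption: the output is decoded as UTF-8; the tool's real output encoding may differ.
        return new String(captured.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Assumption: `antiword <file>` prints the document text to stdout. */
    static String extractWordText(File wordFile) throws IOException, InterruptedException {
        return runAndCapture(Arrays.asList("antiword", wordFile.getAbsolutePath()));
    }
}
```

The returned string can then be added to a Lucene Document as the contents field, the same way the HTML and PDF text is handled.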

Re: my experiences - Re: Parsing Word Docs

2003-03-06 Thread David Spencer
Ryan Ackley wrote:
> Eric, The problem with antiword is that it is a native application. You must write a class that uses JNI to access the native code.
No you don't. Just use Runtime.exec - no JNI :)
> If you link your java code with native code you have lost one of the biggest benefits of Java,

Re: my experiences - Re: Parsing Word Docs

2003-03-06 Thread David Spencer
Ryan Ackley wrote:
> David, The textmining.org stuff only works on Word97 and above. It should work with
Could be we had pre word97 docs as some date from 1996 when we (Lumos at least) were founded.
> no exceptions on any Word 97 doc. If you have any problems then it is from an earlier version

Re: Regarding Setup Lucene for my site

2003-03-06 Thread Leo Galambos
1. 2 threads per request may improve speed up to 50% Hmm? Could you clarify? During indexing, multithreading may speed things up (splitting docs to index in 2 or more sets, indexing separately, combining indexing). But... isn't that a good thing? Or are you saying that it'd be good to have
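The indexing strategy mentioned in this message (split the documents into two or more sets, index each set separately, then combine) can be sketched as follows. This is a toy illustration using plain Java collections as a stand-in for real indexes; it does not use the Lucene API, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy stand-in for an index: term -> ids of the documents containing it.
class SplitIndexSketch {

    /** Builds a partial inverted index for the documents in [from, to). */
    static Map<String, Set<Integer>> indexRange(List<String> docs, int from, int to) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int docId = from; docId < to; docId++) {
            for (String term : docs.get(docId).toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
                }
            }
        }
        return index;
    }

    /** Splits docs into at most `parts` sets, indexes each on its own thread, then merges. */
    static Map<String, Set<Integer>> indexInParallel(List<String> docs, int parts) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<Map<String, Set<Integer>>>> partials = new ArrayList<>();
            int chunk = (docs.size() + parts - 1) / parts;
            for (int start = 0; start < docs.size(); start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, docs.size());
                partials.add(pool.submit(() -> indexRange(docs, from, to)));
            }
            // Combine step: union the posting sets of every partial index.
            Map<String, Set<Integer>> merged = new TreeMap<>();
            for (Future<Map<String, Set<Integer>>> partial : partials) {
                for (Map.Entry<String, Set<Integer>> e : partial.get().entrySet()) {
                    merged.computeIfAbsent(e.getKey(), t -> new TreeSet<>()).addAll(e.getValue());
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

Any speedup depends on how much of the work is CPU-bound analysis versus disk I/O, and the merge step is itself sequential in this sketch.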

Re: Potential Lucene drawbacks

2003-03-06 Thread Leo Galambos
If I understand you correctly, then maybe you are not aware of RemoteSearchable in Lucene. That class cannot be used in Merger. RemoteSearchable is a class that allows you to pass a query to another node, nothing less and nothing more AFAIK. This is the point that's more clear to me now.

Re: Potential Lucene drawbacks

2003-03-06 Thread Otis Gospodnetic
--- Leo Galambos [EMAIL PROTECTED] wrote: If I understand you correctly, then maybe you are not aware of RemoteSearchable in Lucene. That class cannot be used in Merger. RemoteSearchable is a class that allows you to pass a query to another node, nothing less and nothing more AFAIK.