Ben-
In attempting to use PDFBox-0.6.0, I received the following error when
scanning a reasonably sized PDF repository.
Any thoughts?
caught a class java.io.EOFException
with message: Unexpected end of ZLIB input stream
Eric Anderson
LanRx Network Solutions
Quoting Ben
Another fine article by Otis:
http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html
PA.
David,
The textmining.org stuff only works on Word97 and above. It should work with
no exceptions on any Word 97 doc. If you have any problems then it is from
an earlier version (most likely Word 6.0) or it's not a Word document. If
this isn't the case you need to email me so I can fix it and make
Eric,
The problem with antiword is that it is a native application. You must write
a class that uses JNI to access the native code. If you link your java code
with native code you have lost one of the biggest benefits of Java, platform
independence. I would suggest you use the library at
I'll go either way, but I still don't know how to implement the word parser, as
opposed to the PDF parser or HTM parser.
Eric Anderson
LanRx Network Solutions
Quoting Ryan Ackley [EMAIL PROTECTED]:
Eric,
The problem with antiword is that it is a native application. You must
write a
Ryan,
I tried to use textmining to extract text from Word97 documents. Some German
characters like ä, ü etc. aren't parsed correctly, so I can't use it,
because many German words include these characters. I don't know if the reason
is textmining or HDF from POI (HSSF from POI parses these characters
In this release I have changed how I parsed the document, which may have
introduced this bug. I have received another report of this and will have
it fixed for the next point release.
You said you tried with a reasonably sized PDF repository. Did you stop
indexing at this error or did you
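One common way to keep a long indexing run going when a single PDF fails like this is to catch the exception per file, record it, and move on. Below is a minimal sketch of that idea. The `TextExtractor` interface and the class name are hypothetical stand-ins for whatever parser is in use; they are not part of PDFBox or Lucene.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipBadFiles {
    /** Hypothetical stand-in for a real parser call; not a PDFBox or Lucene API. */
    interface TextExtractor {
        String extract(String path) throws IOException;
    }

    /** Extracts text from each path, recording failures instead of aborting the run. */
    static List<String> extractAll(List<String> paths, TextExtractor extractor,
                                   List<String> failed) {
        List<String> texts = new ArrayList<>();
        for (String path : paths) {
            try {
                texts.add(extractor.extract(path));
            } catch (IOException e) {
                // EOFException is an IOException, so a truncated stream lands here.
                failed.add(path);
                System.err.println("Skipping " + path + ": " + e.getMessage());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Fake extractor that fails on one file, to exercise the skip path.
        TextExtractor fake = path -> {
            if (path.endsWith("bad.pdf")) {
                throw new EOFException("Unexpected end of ZLIB input stream");
            }
            return "text of " + path;
        };
        List<String> failed = new ArrayList<>();
        List<String> texts =
                extractAll(Arrays.asList("a.pdf", "bad.pdf", "c.pdf"), fake, failed);
        System.out.println(texts.size() + " extracted, " + failed.size() + " skipped");
    }
}
```

Collecting the failed paths in a list makes it easy to report the problem files afterwards.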
thx a lot :) I'll try it
-Original Message-
From: Mario Ivankovits [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 6, 2003 14:00
To: Lucene Users List
Subject: Re: my experiences - Re: Parsing Word Docs
The problems with German umlauts should be fixed.
I have posted them a
Ben,
I downloaded PDFBox and installed it, and I can use:
java org.pdfbox.Main PDF-file output-text-file
to convert a .pdf file to a text file.
Then I tried to integrate it with Lucene. I modified the following code in
IndexHTML.java:
else if (file.getPath().endsWith(".pdf")) {
Document doc
Hi Günter,
I had a similar requirement for my use of Lucene. We have documents with mixed
languages, some of the text in the user's native language and some in English. We made
the decision to not use any of the stemming analyzers and index with no stop words (I
didn't like the no stop words
Ben,
Using PDFBox-0.5.6, and alternatively PDFBox-0.6.0, I receive the following
stack trace:
java.lang.ClassCastException: org.pdfbox.cos.COSObject
    at org.pdfbox.encoding.DictionaryEncoding.init(DictionaryEncoding.java:66)
    at
Eric Anderson wrote:
Ok. Thanks for the tip.
I downloaded and compiled Antiword, and would like to now add it to my indexing
class. However, I'm not sure how the application would be called,
How? You exec it, passing the file name, and it prints the ASCII text to stdout.
This method takes the file
Ryan Ackley wrote:
Eric,
The problem with antiword is that it is a native application. You must write
a class that uses JNI to access the native code.
No you don't. Just use Runtime.exec - no JNI :)
If you link your java code
with native code you have lost one of the biggest benefits of Java,
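The `Runtime.exec` route suggested in this reply can be sketched as follows. The sketch uses `ProcessBuilder`, the newer JDK equivalent of `Runtime.exec`, and assumes an `antiword` binary is on the PATH and prints the extracted text to stdout when given a file name, as described in this thread. The class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

public class ExternalTextExtractor {
    /** Runs a command and returns everything it printed to stdout. */
    static String runAndCapture(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Send the child's stderr to our stderr so a full pipe cannot stall it.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode + ": " + command);
        }
        return output.toString();
    }

    /** Assumption: `antiword <file>` prints the document text to stdout. */
    static String extractWordText(String docPath)
            throws IOException, InterruptedException {
        return runAndCapture(Arrays.asList("antiword", docPath));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ExternalTextExtractor <word-file>");
            return;
        }
        System.out.println(extractWordText(args[0]));
    }
}
```

The returned string could then be handed to the indexer like the output of any other parser. The trade-off raised in the thread still applies: the binary has to be installed on every platform where the indexer runs.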
Ryan Ackley wrote:
David,
The textmining.org stuff only works on Word97 and above. It should work with
Could be we had pre-Word97 docs, as some date from 1996 when we (Lumos at
least) were founded.
no exceptions on any Word 97 doc. If you have any problems then it is from
an earlier version
1. 2 threads per request may improve speed up to 50%
Hmm? Could you clarify? During indexing, multithreading may speed things
up (splitting docs to index in 2 or more sets, indexing separately, combining
the indexes). But... isn't that a good thing? Or are you saying that it'd be good
to have
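The indexing-side idea mentioned in this message (split the documents into sets, index each set in its own thread, then combine the results) might look roughly like this. It is only a sketch using a toy in-memory inverted index, not Lucene's own indexing classes, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIndexSketch {
    /** Builds a toy inverted index (term -> doc ids) for docs[from..to). */
    static Map<String, TreeSet<Integer>> indexRange(List<String> docs, int from, int to) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int id = from; id < to; id++) {
            for (String term : docs.get(id).toLowerCase().split("\\s+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
                }
            }
        }
        return index;
    }

    /** Splits the docs into up to `parts` sets, indexes each in a thread, then merges. */
    static Map<String, TreeSet<Integer>> indexInParallel(List<String> docs, int parts)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<Map<String, TreeSet<Integer>>>> partials = new ArrayList<>();
            int chunk = (docs.size() + parts - 1) / parts; // ceiling division
            for (int from = 0; from < docs.size(); from += chunk) {
                final int start = from;
                final int end = Math.min(from + chunk, docs.size());
                partials.add(pool.submit(() -> indexRange(docs, start, end)));
            }
            // Combine the partial indexes once every worker has finished.
            Map<String, TreeSet<Integer>> merged = new TreeMap<>();
            for (Future<Map<String, TreeSet<Integer>>> partial : partials) {
                partial.get().forEach((term, ids) ->
                        merged.computeIfAbsent(term, t -> new TreeSet<>()).addAll(ids));
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = Arrays.asList(
                "lucene index", "pdf index", "word parser", "lucene parser");
        System.out.println(indexInParallel(docs, 2));
    }
}
```

Whether this actually speeds things up depends on how much of the work is CPU-bound parsing versus disk I/O, which the sketch does not model.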
If I understand you correctly, then maybe you are not aware of
RemoteSearchable in Lucene.
That class cannot be used in Merger. RemoteSearchable is a class that
allows you to pass a query to another node, nothing less and nothing more
AFAIK.
This is the point that's more clear to me now.
--- Leo Galambos [EMAIL PROTECTED] wrote:
If I understand you correctly, then maybe you are not aware of
RemoteSearchable in Lucene.
That class cannot be used in Merger. RemoteSearchable is a class that
allows you to pass a query to another node, nothing less and nothing more
AFAIK.