Thanks a lot :) I'll try it.

-----Original Message-----
From: Mario Ivankovits [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 06, 2003 2:00 PM
To: Lucene Users List
Subject: Re: my experiences - Re: Parsing Word Docs


The problems with German umlauts should be fixed.
I posted a patch for them (see
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14735), and it should be
applied by now.
I haven't cross-checked it yet.

I currently use POI to index documents with Lucene, but I do not use the
"standard" way with a lucene-word-document class (like the pdfdocument).
I have certainly had some problems getting the text out of old documents,
but in that case my system falls back to a simple "STRINGS" parser (which
filters any human-readable characters out of the document file).
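
To illustrate the kind of "STRINGS" fallback described above, here is a minimal
sketch. It is not the actual parser from this message, whose details are not
given; the class name StringsFallback, the method extractReadable and the
minRun threshold are all hypothetical. It assumes the text is stored as
single-byte Latin-1 characters, so it would likely miss text that a document
stores as 16-bit Unicode.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene or POI.
final class StringsFallback {

    /** Returns runs of at least minRun readable characters, joined by single spaces. */
    static String extractReadable(byte[] data, int minRun) {
        // Assumption: one byte per character. ISO-8859-1 maps every byte to
        // exactly one char, so single-byte umlauts survive the decoding.
        String raw = new String(data, StandardCharsets.ISO_8859_1);
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i <= raw.length(); i++) {
            if (i < raw.length() && isReadable(raw.charAt(i))) {
                run.append(raw.charAt(i));
                continue;
            }
            // A non-readable char (or the end of input) closes the current run.
            if (run.length() >= minRun) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(run);
            }
            run.setLength(0);
        }
        return out.toString();
    }

    // Printable ASCII plus any letter or digit (covers Latin-1 umlauts).
    private static boolean isReadable(char c) {
        return (c >= 0x20 && c < 0x7F) || Character.isLetterOrDigit(c);
    }
}
```

The returned string could then be put into a Lucene text field in place of the
output of the regular Word parser. A larger minRun drops more binary noise but
also loses short words.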

byebye
Mario

----- Original Message -----
From: "Borkenhagen, Michael (ofd-ko zdfin)"
<[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Thursday, March 06, 2003 1:39 PM
Subject: AW: my experiences - Re: Parsing Word Docs


Ryan,

I tried to use textmining to extract text from Word97 documents. Some German
characters like "ä", "ü" etc. aren't parsed correctly, so I can't use it,
because many German words include these characters. I don't know whether the
cause is textmining or hdf from poi (hssf from poi parses these characters
correctly). Do you have any hints for me?

Michael

-----Original Message-----
From: Ryan Ackley [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 06, 2003 1:13 PM
To: Lucene Users List
Subject: Re: my experiences - Re: Parsing Word Docs


David,

The textmining.org stuff only works on Word97 and above. It should work with
no exceptions on any Word 97 doc. If you have any problems then it is from
an earlier version (most likely Word 6.0) or it's not a Word document. If
this isn't the case you need to email me so I can fix it and make it better
for the benefit of everyone. I plan on adding support for Word 6 in the
future.

Ryan Ackley

----- Original Message -----
From: "David Spencer" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, March 05, 2003 6:24 PM
Subject: my experiences - Re: Parsing Word Docs


> FYI I tried the textmining.org/poi combo on a collection of 350 Word
> docs people have developed here over the years, and it failed on 33% of them
> with exceptions being thrown about the formats being invalid.
>
> I tried "antiword" ( http://www.winfield.demon.nl/ ), a native & free
> *.exe, and
> it worked great ( well it seemed to process all the files fine).
>
> I've had similar experiences with PDF - I tried the 3 or so
> freeware/java PDF
> text extractors and they were not as good as the exe, pdftotext,
> from foolabs (http://www.foolabs.com/xpdf/).
>
> Not satisfying to a java developer but these work better than anything
> else I can find.
>
> You get source and I use them on windows & linux, no prob.
>
>
>
> Eric Anderson wrote:
>
> >I'm interested in using the textmining/textextraction utilities using Apache
> >POI, that Ryan was discussing. However, I'm having some difficulty determining
> >what the insertion point would be to replace the default parser with the word
> >parser.
> >
> >Any assistance would be appreciated.
> >
> >
> >
> >
> >
> >LanRx Network Solutions, Inc.
> >Providing Enterprise Level Solutions...On A Small Business Budget
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
>
>
>

