Re: [CODE4LIB] pdf2txt [foreign documents]

Eric Lease Morgan Sat, 12 Oct 2013 07:02:36 -0700

On Oct 11, 2013, at 6:39 PM, Mark Pernotto <[email protected]> wrote:


> Putting my devil's advocate hat on, it doesn't parse foreign documents well
> (I got it to break!).  I also got inconsistent results feeding it PDF files
> with tables embedded (but haven't been able to figure out what it is about
> them it doesn't like).

Mark, foreign documents. Good point. There is a (Perl) module called… Well, I 
can't find it right now, but it is possible to guess the language of a text by 
looking for and tabulating the stop words of various languages in the document. 
Once the language is determined, the stop word list for that language can be 
applied to the document, and the results ought to be better. 
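
To illustrate the idea, here is a minimal sketch in Python. It is not the Perl 
module I was thinking of. The stop word lists are tiny, made-up samples, and 
the names STOP_WORDS and guess_language are hypothetical. Real lists would be 
much longer, and very short texts will be guessed poorly:

```python
import re
from collections import Counter

# Tiny illustrative stop word lists (assumption: real lists are far longer).
STOP_WORDS = {
    "english": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "french": {"le", "la", "les", "et", "de", "des", "est", "que", "un", "une"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "zu"},
    "spanish": {"el", "la", "los", "las", "y", "de", "es", "que", "un", "una"},
}


def guess_language(text):
    """Return the language whose stop words occur most often, or None."""
    # Split the lowercased text into runs of letters (Unicode-aware).
    words = re.findall(r"[^\W\d_]+", text.lower())

    # Tabulate how many words in the text are stop words of each language.
    counts = Counter()
    for language, stops in STOP_WORDS.items():
        counts[language] = sum(1 for word in words if word in stops)

    language, hits = counts.most_common(1)[0]
    # No stop words found at all means there is nothing to base a guess on.
    return language if hits > 0 else None


if __name__ == "__main__":
    print(guess_language("Le chat est dans la maison et il est content"))
```

Once guess_language returns a language, the matching entry in STOP_WORDS is 
the list to apply to the document.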

Also, please remember, how well a document can be parsed into sentences and 
words depends directly on the quality of the underlying OCR. That is a 
limitation I am not able to overcome.

--
Eric
