There is a great deal of formal activity in this area - see TREC
(http://en.wikipedia.org/wiki/Text_Retrieval_Conference), which runs
competitions and provides metrics.

Formally, a lot of effort is required to produce a precise, reproducible
number. In simple terms, you need a corpus which has already been annotated
with the agreed correct answer (a "Gold Standard"), and you run your
software over it. There are metrics which measure precision and recall (and
these have to be carefully defined).
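As a rough illustration of what such metrics look like, here is a minimal
Python sketch of token-level precision and recall against a gold-standard
text. The whitespace tokenizer, the bag-of-tokens comparison and the function
names are my own simplifying assumptions, not an agreed definition and not
anything PDFBox provides.

```python
from collections import Counter


def tokenize(text):
    """Split on whitespace; a real evaluation needs an agreed tokenizer."""
    return text.split()


def precision_recall(extracted, gold):
    """Token-level precision and recall, treating each text as a bag of tokens."""
    ext_counts = Counter(tokenize(extracted))
    gold_counts = Counter(tokenize(gold))
    # Multiset intersection: tokens found in both texts, with multiplicity.
    overlap = sum((ext_counts & gold_counts).values())
    n_ext = sum(ext_counts.values())
    n_gold = sum(gold_counts.values())
    precision = overlap / n_ext if n_ext else 0.0  # share of output that is right
    recall = overlap / n_gold if n_gold else 0.0   # share of gold that was found
    return precision, recall


# Made-up example: one word is garbled and one is missing in the extraction.
gold = "the quick brown fox jumps"
extracted = "the quick brovvn fox"
p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

A bag of tokens ignores word order, so a serious evaluation would probably
use an alignment or edit-distance measure instead.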

Gold standards are normally produced by humans with a lot of work (it's
very boring - I know!). Volunteers are sometimes paid. You also measure
inter-annotator agreement (do two humans agree on the right answer?).
Problem areas are hyphens, paragraphs, lists, diacritics, dashes, high
codepoints, etc.
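One common way to put a number on inter-annotator agreement is Cohen's
kappa, which corrects the raw agreement rate for agreement expected by
chance. The sketch below is only an illustration: the choice of kappa, the
"ok"/"bad" labels and the sample data are my assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgements on six extracted lines.
annotator_1 = ["ok", "ok", "bad", "ok", "bad", "ok"]
annotator_2 = ["ok", "bad", "bad", "ok", "bad", "ok"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa={kappa:.3f}")
```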

In our case - scientific papers - we often have an independent XML version
of the text, and this is very useful for producing a rapid and nearly
complete Gold Standard. Some publishers willingly make this available;
others forbid us to use and publish it. Getting pseudo-Gold-Standard XML is
a political, not a technical, problem.

When all this has been done, it is formally possible in some cases to
create confidence scores, based on algorithms such as Hidden Markov Models.

Be aware, of course, that no program can give 100% correct answers on
arbitrary input, but it may be possible to find controlled sub-domains
where accuracy is effectively 100%. If the input contains non-Unicode
characters or pixel-based glyphs we have to use heuristics and OCR -
neither of these is perfect either.
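As an example of the kind of crude heuristic I mean, the sketch below
measures the fraction of characters in the extracted text that look
suspicious: the Unicode replacement character, private-use or unassigned
codepoints, and stray control characters. The function name and the choice
of categories are my assumptions. This is not a PDFBox feature and it is
certainly not a calibrated confidence score.

```python
import unicodedata


def suspicious_fraction(text):
    """Fraction of characters that hint at a failed glyph-to-Unicode mapping."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        category = unicodedata.category(ch)
        # U+FFFD is the replacement character; Co = private use, Cn = unassigned.
        if ch == "\ufffd" or category in ("Co", "Cn"):
            suspicious += 1
        # Control characters other than ordinary line breaks and tabs.
        elif category == "Cc" and ch not in "\n\r\t":
            suspicious += 1
    return suspicious / len(text)


# Made-up sample: two of the four characters are suspicious.
sample = "a\ufffdb\ue000"
print(f"suspicious fraction: {suspicious_fraction(sample):.2f}")
```

A high value only flags a file for closer inspection; a low value does not
show that the extraction is correct.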



On Sun, May 11, 2014 at 6:07 AM, Qingchao Kong <[email protected]> wrote:

> Hi, I am using PDFBox to extract text from PDF files.
> As you know, due to some reason, PDFbox might produce errors when
> extracting text from some PDF files, the question I want to ask is
> that: is there a way to automatically evaluate the quality of text
> extraction result? Or can PDFBox offer a confidence score about the
> extracted text result?
>
> Regards,
>



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
