[ https://issues.apache.org/jira/browse/PDFBOX-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-3019:
------------------------------------
    Component/s: Text extraction

> Optimize tolerance settings
> ---------------------------
>
>                 Key: PDFBOX-3019
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3019
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>             Fix For: 2.0.0
>
>         Attachments: jbl-example-com.pdf
>
>
> From testing on my internal dataset I believe there might be some regression
> in the effectiveness of PDFTextStripper.
>
> Here's an [example
> doc|http://rampages.us/rhodesc1/wp-content/uploads/sites/4737/2014/07/Resume-Connor-Rhodes.pdf]
> I found on the web, which converted better in 1.8 than 2.0. Notice that it
> extracts "J e a n e t t e A c o s t a ; S e r v i c e M a n a g e r a t
> M a d F o x B r e w i n g C o m p a n y". It doesn't seem like there's
> very much space between the letters in the pdf, so it's curious to me that it
> didn't do too well.
>
> I realize this is an area where we probably can't strive for perfection. Yet,
> it does seem to me that from 1.8 to 2.0 we may have taken a step backwards. I
> believe there's some sort of regression test for PDFToImage which exports a
> set of pdfs to images at two different commits and looks at what the
> differences are. Do we have the same sort of thing for PDFTextStripper? If
> not, can we build one by pulling docs off the public web? I'd be willing to
> contribute to this endeavor.
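
As a rough illustration of what a text-extraction regression check could measure, here is a minimal sketch in plain Java. It is not part of PDFBox and is not the project's test suite: the class name SpacedTextMetric, both method names, and the tolerance parameter are hypothetical. It assumes the extracted text from the two versions is already available as strings, and it only scores the symptom quoted above (letters split apart by spurious spaces) as the share of whitespace-separated tokens that are a single letter.

```java
import java.util.Arrays;

/** Hypothetical helper for comparing extracted text; not part of PDFBox. */
final class SpacedTextMetric {
    private SpacedTextMetric() {
    }

    /**
     * Returns the fraction of whitespace-separated tokens that consist of
     * exactly one letter. Text such as "M a d F o x" scores close to 1.0,
     * while normally spaced prose scores close to 0.0.
     */
    static double singleLetterTokenRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        long singleLetters = Arrays.stream(tokens)
                .filter(t -> t.length() == 1 && Character.isLetter(t.charAt(0)))
                .count();
        return (double) singleLetters / tokens.length;
    }

    /**
     * Flags a possible regression when the newer output has a noticeably
     * higher single-letter ratio than the older output. The tolerance is an
     * arbitrary threshold chosen by the caller, not a PDFBox setting.
     */
    static boolean looksRegressed(String olderOutput, String newerOutput, double tolerance) {
        return singleLetterTokenRatio(newerOutput)
                > singleLetterTokenRatio(olderOutput) + tolerance;
    }
}
```

A harness along the lines suggested in the report would extract each document with both versions, pass the two strings to looksRegressed, and list the documents that get flagged for manual review. A single heuristic like this would miss other kinds of differences, so a plain text diff of the two outputs would still be useful alongside it.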