[ https://issues.apache.org/jira/browse/PDFBOX-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3248:
------------------------------------
    Labels: ActualText  (was: )

> Unwanted spaces in text extraction (2)
> --------------------------------------
>
>                 Key: PDFBOX-3248
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3248
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.11, 2.0.0
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: ActualText
>         Attachments: PDFBOX-3248-spaces.pdf
>
>
> The attached file provided by Francisco from the user mailing list has spaces
> in text extraction regardless of setting spacingTolerance or
> averageCharTolerance. I was unable to extract "Cada frasco ampolla" which
> looked straightforward in rendering, but it always appeared as "Ca da fras co
> ampo lla". Adobe Reader has no such problem.
> The content stream has this:
> {code}
> 6 0 1.058 6 122.0924 312.51 Tm
> (Ca) Tj
> /Span << /ActualText (\376\377\000\255) >> BDC
> ( ) Tj
> EMC
> [ (da ) -301 (fras) ] TJ
> /Span << /ActualText (\376\377\000\255) >> BDC
> ( ) Tj
> EMC
> [ (co ) -301 (ampo) ] TJ
> /Span << /ActualText (\376\377\000\255) >> BDC
> ( ) Tj
> EMC
> [ (lla ) -301 (con) ] TJ
> {code}
> So there are really spaces there, and we keep them. Adobe is smarter, and
> ignores them because they are overwritten thanks to the "-301" backwards
> positioning.
> Would /ActualText help? However it is always the same here...
> Would it help to ignore spaces and decide based on positions only, maybe as
> an option? I added these two lines below the first existing one:
> {code}
> String characterValue = position.getUnicode();
> if (" ".equals(characterValue))
>     continue;
> {code}
> The output looks promising:
> {quote}
> F ó r m u l a :
> Cronopen® Balsámico Adultos:
> Cada frasco ampolla contiene: ampicilina (como ampicilina sódica)
> 100 mg; ampicilina (como ampicilina benzatínica) 500 mg.
> Cada ampolla solvente de 5 ml contiene: dipirona 1000 mg; guaife
> nesina 100 mg.
> Exc.: bisulfito de sodio; agua destilada.
> {quote}
> A complete test brings many differences, most are harmless or are
> improvements. Only one test case really fails, hello3.pdf. Original extract
> is "Hello محمد World.", new extract is "Hello .Worldمحمد".
> More from Francisco
> {quote}
> As additional information, I've found 2 related posts (about another tools)
> in StackOverflow:
> http://stackoverflow.com/questions/34579824/itext-how-to-tweak-text-extraction
> http://stackoverflow.com/questions/22671974/itext-reading-pdf-1s-as-up-arrows-error/22688775#22688775
> {quote}
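
The idea of skipping space glyphs and inferring word breaks from positions only can be sketched outside PDFBox. The following is a minimal illustration, not the actual PDFTextStripper change: the Glyph class, joinByPosition, the gap threshold, and all coordinates in demoGlyphs are hypothetical and chosen only to mimic the "Ca da fras co ampo lla" case. The decodeActualText helper shows that the /ActualText bytes \376\377\000\255 are a UTF-16BE byte order mark followed by U+00AD (soft hyphen); its fallback branch is a simplification that treats non-BOM strings as Latin-1.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class PositionBasedSpacing {

    /** Hypothetical stand-in for a positioned text chunk; not a PDFBox class. */
    static final class Glyph {
        final String text;   // Unicode text of the chunk
        final double x;      // left edge (made-up units)
        final double width;  // horizontal advance (made-up units)

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        double endX() {
            return x + width;
        }
    }

    /** Decodes a PDF text string that starts with the UTF-16BE BOM (0xFE 0xFF). */
    static String decodeActualText(byte[] raw) {
        if (raw.length >= 2 && (raw[0] & 0xFF) == 0xFE && (raw[1] & 0xFF) == 0xFF) {
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // Simplification: real PDFs use PDFDocEncoding here, approximated as Latin-1.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /**
     * Joins chunks left to right, dropping literal space chunks and inserting
     * a space only where the horizontal gap exceeds gapThreshold.
     * Assumes chunks arrive in visual left-to-right order (no RTL handling).
     */
    static String joinByPosition(List<Glyph> glyphs, double gapThreshold) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (" ".equals(g.text)) {
                continue; // same test as in the patch quoted in the issue
            }
            if (previous != null && g.x - previous.endX() > gapThreshold) {
                out.append(' ');
            }
            out.append(g.text);
            previous = g;
        }
        return out.toString();
    }

    /** Made-up layout: space chunks have zero width, real word gaps are 3 units. */
    static List<Glyph> demoGlyphs() {
        return Arrays.asList(
            new Glyph("Ca", 0, 10), new Glyph(" ", 10, 0),
            new Glyph("da", 10, 10), new Glyph(" ", 20, 0),
            new Glyph("fras", 23, 20), new Glyph(" ", 43, 0),
            new Glyph("co", 43, 10), new Glyph(" ", 53, 0),
            new Glyph("ampo", 56, 22), new Glyph(" ", 78, 0),
            new Glyph("lla", 78, 12));
    }

    public static void main(String[] args) {
        System.out.println(joinByPosition(demoGlyphs(), 1.5));
        byte[] actualText = {(byte) 0xFE, (byte) 0xFF, 0x00, (byte) 0xAD};
        System.out.printf("U+%04X%n", (int) decodeActualText(actualText).charAt(0));
    }
}
```

Because the sketch walks chunks strictly left to right, it says nothing about right-to-left runs, which is the kind of input where the issue reports a regression (hello3.pdf). A real change would probably need to be optional and to leave bidirectional text alone.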