[
https://issues.apache.org/jira/browse/PDFBOX-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168674#comment-15168674
]
Tilman Hausherr commented on PDFBOX-3248:
-----------------------------------------
By "our", do you mean PDFBox or your job?
> Unwanted spaces in text extraction (2)
> --------------------------------------
>
> Key: PDFBOX-3248
> URL: https://issues.apache.org/jira/browse/PDFBOX-3248
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.11, 2.0.0
> Reporter: Tilman Hausherr
> Attachments: PDFBOX-3248-spaces.pdf
>
>
> The attached file, provided by Francisco from the user mailing list, produces
> unwanted spaces in text extraction regardless of the spacingTolerance or
> averageCharTolerance settings. I was unable to extract "Cada frasco ampolla",
> which looks straightforward when rendered; it always came out as "Ca da fras co
> ampo lla". Adobe Reader has no such problem.
> The content stream has this:
> {code}
> 6 0 1.058 6 122.0924 312.51 Tm
> (Ca) Tj
> /Span << /ActualText (\376\377\000\255) >> BDC
> ( ) Tj
> EMC
> [ (da ) -301 (fras) ] TJ
> /Span << /ActualText (\376\377\000\255) >> BDC
> ( ) Tj
> EMC
> [ (co ) -301 (ampo) ] TJ
> /Span << /ActualText (\376\377\000\255) >> BDC
> ( ) Tj
> EMC
> [ (lla ) -301 (con) ] TJ
> {code}
> So there are really spaces there, and we keep them. Adobe is smarter, and
> ignores them because they are overwritten thanks to the "-301" backwards
> positioning.
> Would /ActualText help? However, it is always the same value here (U+00AD, the soft hyphen)...
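For readers who want to check what the /ActualText entries in the stream above contain, here is a minimal sketch that decodes the literal string (\376\377\000\255) by hand. The class and method names are made up for illustration and are not part of PDFBox; only octal escapes are handled, and PDFDocEncoding is approximated by ISO-8859-1.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ActualTextDecoder {

    /** Resolves the \ddd octal escapes of a PDF literal string into raw bytes. */
    static byte[] unescapeOctal(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c == '\\' && i + 1 < literal.length() && isOctal(literal.charAt(i + 1))) {
                int value = 0;
                int digits = 0;
                i++; // skip the backslash
                while (i < literal.length() && digits < 3 && isOctal(literal.charAt(i))) {
                    value = value * 8 + (literal.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF);
            } else {
                // Other escapes (\n, \(, ...) are not handled in this sketch.
                out.write(c);
                i++;
            }
        }
        return out.toByteArray();
    }

    static boolean isOctal(char c) {
        return c >= '0' && c <= '7';
    }

    /** Decodes a text string: UTF-16BE if it starts with the FE FF byte order mark. */
    static String decodeTextString(byte[] bytes) {
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Simplification: PDFDocEncoding is approximated by ISO-8859-1 here.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String decoded = decodeTextString(unescapeOctal("\\376\\377\\000\\255"));
        System.out.printf("U+%04X%n", (int) decoded.charAt(0)); // prints U+00AD
    }
}
```

The bytes are FE FF 00 AD, that is, a UTF-16BE byte order mark followed by U+00AD (SOFT HYPHEN), so each marked ( ) is tagged as a soft hyphen rather than as a real word space.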
> Would it help to ignore spaces and decide based on positions only, maybe as
> an option? I added these two lines below the first existing one:
> {code}
> String characterValue = position.getUnicode();
> if (" ".equals(characterValue))
>     continue;
> {code}
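To illustrate the idea of ignoring space characters and deciding on word breaks from positions only, here is a standalone sketch. It is not the PDFBox implementation: the Chunk class, the tolerance rule, and the coordinates in sample() are all invented for illustration, and the chunks are assumed to arrive in left-to-right order on a single line.

```java
import java.util.Arrays;
import java.util.List;

public class PositionOnlySpacing {

    /** Hypothetical positioned text chunk; a stand-in, not PDFBox's TextPosition. */
    static final class Chunk {
        final String text;
        final float x;      // left edge
        final float width;  // advance width

        Chunk(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Skips explicit space chunks and inserts a word break only where the
     * horizontal gap to the previous chunk exceeds tolerance * previous width.
     */
    static String extract(List<Chunk> chunks, float tolerance) {
        StringBuilder sb = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : chunks) {
            if (" ".equals(c.text)) {
                continue; // ignore spaces found in the stream
            }
            if (previous != null) {
                float gap = c.x - (previous.x + previous.width);
                if (gap > tolerance * previous.width) {
                    sb.append(' ');
                }
            }
            sb.append(c.text);
            previous = c;
        }
        return sb.toString();
    }

    /** Made-up coordinates: the space chunks are overlapped by the following text. */
    static List<Chunk> sample() {
        return Arrays.asList(
                new Chunk("Ca", 0f, 10f),
                new Chunk(" ", 10f, 3f),
                new Chunk("da", 10f, 10f),
                new Chunk("fras", 26f, 20f),
                new Chunk(" ", 46f, 3f),
                new Chunk("co", 46f, 10f));
    }

    public static void main(String[] args) {
        System.out.println(extract(sample(), 0.5f)); // prints Cada frasco
    }
}
```

With these sample values, only the gap before "fras" (6 units against a threshold of 5) produces a space; the overlapped space chunks are dropped. The sketch does not address reordering of right-to-left text.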
> The output looks promising:
> {quote}
> F ó r m u l a :
> Cronopen® Balsámico Adultos:
> Cada frasco ampolla contiene: ampicilina (como ampicilina sódica)
> 100 mg; ampicilina (como ampicilina benzatínica) 500 mg.
> Cada ampolla solvente de 5 ml contiene: dipirona 1000 mg; guaife
> nesina 100 mg. Exc.: bisulfito de sodio; agua destilada.
> {quote}
> A complete test run shows many differences; most are harmless or are
> improvements. Only one test case really fails: hello3.pdf. The original extract
> is "Hello محمد World.", the new extract is "Hello .Worldمحمد".
> More from Francisco:
> {quote}
> As additional information, I've found 2 related posts (about other tools)
> on Stack Overflow:
> http://stackoverflow.com/questions/34579824/itext-how-to-tweak-text-extraction
> http://stackoverflow.com/questions/22671974/itext-reading-pdf-1s-as-up-arrows-error/22688775#22688775
> {quote}