[jira] [Updated] (PDFBOX-3248) Unwanted spaces in text extraction (2)

Tilman Hausherr (Jira) Sat, 17 Aug 2024 02:03:38 -0700


     [ 
https://issues.apache.org/jira/browse/PDFBOX-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tilman Hausherr updated PDFBOX-3248:
------------------------------------
    Labels: ActualText  (was: )

> Unwanted spaces in text extraction (2)
> --------------------------------------
>
>                 Key: PDFBOX-3248
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3248
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.11, 2.0.0
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: ActualText
>         Attachments: PDFBOX-3248-spaces.pdf
>
>
> The attached file, provided by Francisco from the user mailing list, produces 
> unwanted spaces in text extraction regardless of the spacingTolerance or 
> averageCharTolerance settings. I was unable to extract "Cada frasco ampolla", 
> which looks straightforward when rendered; it always came out as "Ca da fras co 
> ampo lla". Adobe Reader has no such problem.
> The content stream has this:
> {code}
>      6 0 1.058 6 122.0924 312.51 Tm
>      (Ca) Tj
>      /Span << /ActualText (\376\377\000\255) >> BDC
>        ( ) Tj
>      EMC
>      [ (da ) -301 (fras) ] TJ
>      /Span << /ActualText (\376\377\000\255) >> BDC
>        ( ) Tj
>      EMC
>      [ (co ) -301 (ampo) ] TJ
>      /Span << /ActualText (\376\377\000\255) >> BDC
>        ( ) Tj
>      EMC
>      [ (lla ) -301 (con) ] TJ
> {code}
> So there really are spaces there, and we keep them. Adobe is smarter and 
> ignores them because they are overwritten thanks to the "-301" backwards 
> positioning.
> Would /ActualText help? However, it is always the same value here (U+00AD, 
> a soft hyphen)...
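To make the /ActualText value in the content stream above easier to read, here is a minimal Java sketch that decodes the literal string (\376\377\000\255) by hand. It does not use PDFBox; the class and method names (ActualTextDecoder, unescape, decode) are made up for illustration, only octal and single-character escapes are handled, and strings without a byte order mark are treated as ISO-8859-1 as a rough stand-in for PDFDocEncoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
class ActualTextDecoder {

    // Turns the body of a PDF literal string into raw bytes.
    // Simplification: \ddd octal escapes are decoded; any other escaped
    // character is copied literally (so \n, \r etc. are NOT interpreted).
    static byte[] unescape(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c != '\\') {
                out.write(c);
                i++;
                continue;
            }
            i++; // skip the backslash
            int value = 0;
            int digits = 0;
            while (digits < 3 && i < literal.length()
                    && literal.charAt(i) >= '0' && literal.charAt(i) <= '7') {
                value = value * 8 + (literal.charAt(i) - '0');
                i++;
                digits++;
            }
            if (digits > 0) {
                out.write(value & 0xFF);
            } else if (i < literal.length()) {
                out.write(literal.charAt(i));
                i++;
            }
        }
        return out.toByteArray();
    }

    // A leading FE FF marks UTF-16BE text; otherwise fall back to
    // ISO-8859-1 (an approximation of PDFDocEncoding, assumed here).
    static String decode(String literal) {
        byte[] bytes = unescape(literal);
        if (bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = decode("\\376\\377\\000\\255");
        for (int i = 0; i < text.length(); i++) {
            System.out.printf("U+%04X%n", (int) text.charAt(i)); // prints U+00AD
        }
    }
}
```

The string decodes to a single soft hyphen (U+00AD), so each marked-content span declares that its visible space glyph should be read as a soft hyphen. Whether text extraction should act on that is the open question of this issue.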
> Would it help to ignore spaces and decide based on positions only, maybe as 
> an option? I added these two lines below the first existing one:
> {code}
>                 String characterValue = position.getUnicode();
>                 if (" ".equals(characterValue))
>                     continue;
> {code}
> The output looks promising:
> {quote}
> F ó r m u l a :
> Cronopen® Balsámico Adultos:
> Cada frasco ampolla contiene: ampicilina (como ampicilina sódica)
> 100 mg; ampicilina (como ampicilina benzatínica) 500 mg.
> Cada ampolla solvente de 5 ml contiene: dipirona 1000 mg; guaife
> nesina 100 mg. Exc.: bisulfito de sodio; agua destilada.
> {quote}
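The idea tried above, skipping space glyphs and deciding word breaks from positions only, can be sketched in isolation. This is not the PDFTextStripper code: Glyph is a hypothetical stand-in for a text position, the coordinates are invented, and the fixed gapThreshold replaces whatever tolerance logic the real stripper applies. It also assumes glyphs arrive in left-to-right order, so it says nothing about right-to-left cases such as hello3.pdf.

```java
import java.util.List;

// Hypothetical illustration, not part of PDFBox.
class PositionOnlySpacing {

    // Stand-in for a text position: the glyph text, its left edge and width.
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Joins glyph runs on one line, ignoring literal space glyphs and
    // inserting a space only where the horizontal gap exceeds gapThreshold.
    static String join(List<Glyph> line, float gapThreshold) {
        StringBuilder sb = new StringBuilder();
        float previousEnd = Float.NaN;
        for (Glyph g : line) {
            if (" ".equals(g.text)) {
                continue; // same effect as the two added lines in the experiment
            }
            if (!Float.isNaN(previousEnd) && g.x - previousEnd > gapThreshold) {
                sb.append(' ');
            }
            sb.append(g.text);
            previousEnd = g.x + g.width;
        }
        return sb.toString();
    }

    // Invented coordinates loosely modelled on "Ca| |da fras| |co".
    static List<Glyph> sampleLine() {
        return List.of(
                new Glyph("Ca", 0f, 10f),
                new Glyph(" ", 10f, 3f),    // space glyph drawn inside the word
                new Glyph("da", 10.2f, 10f),
                new Glyph("fras", 23.5f, 18f),
                new Glyph(" ", 41.5f, 3f),  // another space glyph inside the word
                new Glyph("co", 41.6f, 9f));
    }

    public static void main(String[] args) {
        System.out.println(join(sampleLine(), 1.5f)); // prints "Cada frasco"
    }
}
```

In this sketch the only word break comes from the 3.3-unit gap before "fras"; the two space glyphs contribute nothing, which mirrors the improved extract quoted above.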
> A complete test run produces many differences; most are harmless or are 
> improvements. Only one test case really fails, hello3.pdf. The original 
> extract is "Hello محمد World.", the new extract is "Hello .Worldمحمد".
> More from Francisco
> {quote}
> As additional information, I've found 2 related posts (about other tools)
> on StackOverflow:
> http://stackoverflow.com/questions/34579824/itext-how-to-tweak-text-extraction
> http://stackoverflow.com/questions/22671974/itext-reading-pdf-1s-as-up-arrows-error/22688775#22688775
> {quote}



