[ 
https://issues.apache.org/jira/browse/PDFBOX-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247909#comment-14247909
 ] 

Andreas Lehmkühler commented on PDFBOX-1351:
--------------------------------------------

PDF has no concept of superscript or subscript; such text is just smaller 
text that is placed higher or lower than the other text around it.

One has to develop a more or less intelligent algorithm to detect this kind 
of text. Patches are welcome.
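
To illustrate what such a detection heuristic could look like, here is a 
minimal sketch. It is not PDFBox code and not a proposed patch: the Run 
class, the joinRuns method and all thresholds (0.85, 0.15, 0.5) are 
assumptions made up for this example. The idea is to treat a run as 
superscript/subscript when it is noticeably smaller than the preceding normal 
text and its baseline is shifted up or down, and to keep such a run on the 
current line instead of starting a new line or paragraph.

```java
import java.util.ArrayList;
import java.util.List;

final class ScriptDetector {
    enum Kind { NORMAL, SUPERSCRIPT, SUBSCRIPT }

    /** Hypothetical simplified text run; this is NOT a PDFBox type. */
    static final class Run {
        final String text;
        final double baselineY; // y grows upward, as in PDF user space
        final double fontSize;

        Run(String text, double baselineY, double fontSize) {
            this.text = text;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    /** Classifies a run relative to the last normal run on the line. */
    static Kind classify(Run run, Run reference) {
        // Assumed thresholds; real documents may need tuning.
        boolean smaller = run.fontSize < 0.85 * reference.fontSize;
        double shift = run.baselineY - reference.baselineY;
        double minShift = 0.15 * reference.fontSize;
        if (smaller && shift > minShift) {
            return Kind.SUPERSCRIPT;
        }
        if (smaller && shift < -minShift) {
            return Kind.SUBSCRIPT;
        }
        return Kind.NORMAL;
    }

    /** Joins runs into text, keeping super/subscripts on the current line. */
    static String joinRuns(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        Run reference = null; // last run considered normal text
        for (Run run : runs) {
            if (reference == null) {
                reference = run;
            } else if (classify(run, reference) == Kind.NORMAL) {
                // Only a baseline jump between normal runs starts a new line.
                if (Math.abs(run.baselineY - reference.baselineY)
                        > 0.5 * reference.fontSize) {
                    out.append('\n');
                }
                reference = run;
            }
            // Super/subscript runs fall through and are appended in place.
            out.append(run.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = new ArrayList<>();
        runs.add(new Run("et al.", 700, 12));
        runs.add(new Run("11", 705, 8)); // smaller and raised
        runs.add(new Run(" regarding 1,3-", 700, 12));
        runs.add(new Run("Bis(perfluorophenyl)propane-1,3-dione.", 686, 12));
        System.out.println(joinRuns(runs));
    }
}
```

With the made-up coordinates in main, the raised "11" stays attached to 
"et al." and only the real baseline change before "Bis(...)" produces a line 
break. A real implementation would have to work on the positions reported by 
the text extraction and cope with rotated text, mixed fonts and footnote 
markers at the start of a line, none of which this sketch handles.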

> False paragraph caused by superscript (1.7 regression)
> ------------------------------------------------------
>
>                 Key: PDFBOX-1351
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1351
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.0
>            Reporter: Daniel Bonniot de Ruisselet
>         Attachments: PDFParaTest.java, superscript.pdf
>
>
> On the attached minimal example document, text extraction seems to be 
> confused by the superscript, and generates three paragraphs where there is 
> only one.
> Note that 1.6 is processing this case well:
> {noformat}
> $ java -jar /dev/shm/pdfbox-app-1.6.0.jar ExtractText /tmp/superscript.pdf
> Jun 29, 2012 4:52:24 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
> WARNING: expected='%%EOF' actual='5 0 obj '
> $ cat /tmp/superscript.txt 
>   
> Multiple synthetic routes have been described by R. Filler et al.11 regarding 
> 1,3-
> Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
>  
>  
> $ java -jar /dev/shm/pdfbox-app-1.7.0.jar ExtractText /tmp/superscript.pdf 
> Jun 29, 2012 4:52:39 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
> WARNING: expected='%%EOF' actual='5 0 obj '
> $ cat /tmp/superscript.txt                                                 
>   
> Multiple synthetic routes have been described by R. Filler et al.
> 11
>  regarding 1,3-
> Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
>  
>  
> {noformat}



