[ https://issues.apache.org/jira/browse/PDFBOX-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Leong updated PDFBOX-1860:
--------------------------------

    Attachment: pdftest.pdf

> HTML converter escapes formatting close tags
> --------------------------------------------
>
>                 Key: PDFBOX-1860
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1860
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.3
>            Reporter: Cheng Leong
>            Priority: Minor
>         Attachments: pdftest.pdf
>
>
> This bug was introduced in 1.8.3 by PDFBOX-1213, which added HTML style information.
> Bold tags are opened correctly, but their closing tags are HTML-escaped, so </b> is emitted as &lt;/b&gt;.
> {noformat}
> ~/work/pdfbox ((1.8.3))$ java -jar app/target/pdfbox-app-1.8.3.jar ExtractText -html -nonSeq -console pdftest.pdf
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
> "http://www.w3.org/TR/html4/loose.dtd">
> <html><head><title>1725.PDF</title>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> </head>
> <body>
> <div style="page-break-before:always; page-break-after:always"><div><p>E:\M55\!\1725.fm 2003-01-01 18:15 P Tagg, IPM, University of Liverpool
> </p>
> <p><b>A VERY SMALL PDF FILE
> &lt;/b&gt;</p>
> <p><b>A VERY SMALL PDF FILE
> &lt;/b&gt;</p>
> <p><b>A VERY SMALL PDF FILE
> &lt;/b&gt;</p>
> <p><b>A VERY SMALL PDF FILE
> &lt;/b&gt;</p>
> <p><b>A VERY SMALL PDF FILE
> &lt;/b&gt;</p>
> <p><b>A VERY SMALL PDF FILE&lt;/b&gt;</p>
> </div></div>
> </body></html>
> {noformat}
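
For illustration only, here is a minimal Java sketch of one ordering mistake that would produce output like the above: the closing tag is appended to the text before the text is HTML-escaped. This is an assumption about the class of bug, not the actual PDFBox code; EscapeSketch, escape, buggy and fixed are hypothetical names and do not exist in PDFBox.

```java
final class EscapeSketch {
    // Hypothetical helper: minimal HTML escaping for text content.
    // Not PDFBox's implementation.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Assumed buggy ordering: the close tag is already part of the
    // string when it is escaped, so it is escaped along with the text.
    static String buggy(String text) {
        return "<b>" + escape(text + "</b>");
    }

    // Intended ordering: escape only the text, then wrap it in markup.
    static String fixed(String text) {
        return "<b>" + escape(text) + "</b>";
    }

    public static void main(String[] args) {
        System.out.println(buggy("A VERY SMALL PDF FILE"));
        System.out.println(fixed("A VERY SMALL PDF FILE"));
    }
}
```

In this sketch the open tag survives in both variants because it is concatenated after escaping, which matches the symptom reported here: only the close tag comes out escaped.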



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
