[ https://issues.apache.org/jira/browse/PDFBOX-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheng Leong updated PDFBOX-1860:
--------------------------------
Attachment: pdftest.pdf
> HTML converter escapes formatting close tags
> --------------------------------------------
>
> Key: PDFBOX-1860
> URL: https://issues.apache.org/jira/browse/PDFBOX-1860
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.3
> Reporter: Cheng Leong
> Priority: Minor
> Attachments: pdftest.pdf
>
>
> This bug was introduced in 1.8.3 by PDFBOX-1213, the change for HTML style information.
> Opening bold tags (<b>) are written correctly, but the closing tags are HTML-escaped, so they are emitted as &lt;/b&gt; instead of </b>.
> {noformat}
> ~/work/pdfbox ((1.8.3))$ java -jar app/target/pdfbox-app-1.8.3.jar ExtractText -html -nonSeq -console pdftest.pdf
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
> "http://www.w3.org/TR/html4/loose.dtd">
> <html><head><title>1725.PDF</title>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> </head>
> <body>
> <div style="page-break-before:always;
> page-break-after:always"><div><p>E:\M55\!\1725.fm 2003-01-01 18:15 P Tagg,
> IPM, University of Liverpool
> </p>
> <p><b>A VERY SMALL PDF FILE
> </b></p>
> <p><b>A VERY SMALL PDF FILE
> </b></p>
> <p><b>A VERY SMALL PDF FILE
> </b></p>
> <p><b>A VERY SMALL PDF FILE
> </b></p>
> <p><b>A VERY SMALL PDF FILE
> </b></p>
> <p><b>A VERY SMALL PDF FILE</b></p>
> </div></div>
> </body></html>
> {noformat}
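The symptom described above can be illustrated with a minimal, self-contained sketch. This is not the PDFBox source: the class CloseTagEscapeSketch and its methods escape, boldBuggy and boldFixed are hypothetical names chosen for illustration. It only assumes that the close tag is being routed through the same escaping path as the extracted text.

```java
// Hypothetical illustration only; none of these names come from PDFBox.
class CloseTagEscapeSketch {

    // Escape the characters that are significant in HTML text content.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Buggy variant (assumed cause): the close tag is escaped like text.
    static String boldBuggy(String text) {
        return "<b>" + escape(text) + escape("</b>");
    }

    // Intended variant: only the extracted text is escaped, markup is written raw.
    static String boldFixed(String text) {
        return "<b>" + escape(text) + "</b>";
    }

    public static void main(String[] args) {
        System.out.println(boldBuggy("A VERY SMALL PDF FILE"));
        System.out.println(boldFixed("A VERY SMALL PDF FILE"));
    }
}
```

In this sketch the buggy variant yields an opening <b> followed by an escaped close tag, which matches the behaviour reported in the issue, while the fixed variant keeps both tags as markup. Whether the real defect has the same shape would need to be confirmed against the converter code changed by PDFBOX-1213.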
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)