[ 
https://issues.apache.org/jira/browse/PDFBOX-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin LeFebvre updated PDFBOX-434:
-----------------------------------

    Attachment: html_improvements.diff

> Improve html output
> -------------------
>
>                 Key: PDFBOX-434
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-434
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>            Reporter: Justin LeFebvre
>         Attachments: html_improvements.diff
>
>
> We would like to improve the HTML output of PDF files for Arabic rendering. 
> The attached file has changes that should improve the way the -html option 
> works. Output files now get the .html extension. We also added a DOCTYPE 
> declaration and a <meta> tag that declares the encoding of the file. We 
> removed a lot of unused code from PDFTextStripper and PDFText2HTML. Finally, 
> the <title> tag of the HTML document is now set to the title given in the PDF 
> document information, if one exists; otherwise a title is guessed from the 
> first lines of the file. 
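
The header behaviour described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in html_improvements.diff: the class name HtmlHeaderSketch, the helpers chooseTitle, escape and buildHeader, the HTML 4.01 DOCTYPE, and the "first non-blank line" fallback are all assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and the exact DOCTYPE are assumptions, not taken from the patch.
final class HtmlHeaderSketch {

    // Prefer the title from the PDF document information; otherwise fall back
    // to the first non-blank line of the extracted text (assumed heuristic).
    static String chooseTitle(String pdfTitle, String extractedText) {
        if (pdfTitle != null && !pdfTitle.trim().isEmpty()) {
            return pdfTitle.trim();
        }
        if (extractedText != null) {
            for (String line : extractedText.split("\\R")) {
                if (!line.trim().isEmpty()) {
                    return line.trim();
                }
            }
        }
        return "";
    }

    // Escape characters that would otherwise break the surrounding markup.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Build the document header: DOCTYPE, <meta> with the output encoding, <title>.
    static String buildHeader(String title, Charset encoding) {
        StringBuilder sb = new StringBuilder();
        sb.append("<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" ")
          .append("\"http://www.w3.org/TR/html4/loose.dtd\">\n");
        sb.append("<html><head>\n");
        sb.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=")
          .append(encoding.name()).append("\">\n");
        sb.append("<title>").append(escape(title)).append("</title>\n");
        sb.append("</head>\n<body>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // No title in the document information, so the first non-blank line is used.
        String title = chooseTitle(null, "\n  Annual Report \nSecond line");
        System.out.print(buildHeader(title, StandardCharsets.UTF_8));
    }
}
```

Declaring the encoding in the <meta> tag matters for Arabic text in particular, since a browser that guesses the wrong charset will display the extracted characters incorrectly.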

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
