[ 
https://issues.apache.org/jira/browse/PDFBOX-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin LeFebvre updated PDFBOX-434:
-----------------------------------

    Description: Would like to improve the HTML output of PDF files for Arabic 
rendering. The attached file has changes that should improve the way the -html 
option works. Output files now get the .html extension. We also added a 
<!DOCTYPE> declaration as well as a <meta> tag that declares the encoding of 
the file. Removed a lot of unused code from PDFTextStripper and PDFText2HTML. 
Added the ability to set the <title> tag of the HTML document to the title 
given in the PDF document information, if it exists. Otherwise a title is 
guessed from the first lines of the file.   (was: Would like to improve the 
html output of pdf files for arabic rendering. )

> Improve html output
> -------------------
>
>                 Key: PDFBOX-434
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-434
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>            Reporter: Justin LeFebvre
>         Attachments: html_improvements.diff
>
>
> Would like to improve the HTML output of PDF files for Arabic rendering. The 
> attached file has changes that should improve the way the -html option works. 
> Output files now get the .html extension. We also added a <!DOCTYPE> 
> declaration as well as a <meta> tag that declares the encoding of the file. 
> Removed a lot of unused code from PDFTextStripper and PDFText2HTML. Added the 
> ability to set the <title> tag of the HTML document to the title given in the 
> PDF document information, if it exists. Otherwise a title is guessed from the 
> first lines of the file. 
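
For illustration only, the header logic described above might look roughly like 
the sketch below. It is not taken from html_improvements.diff: the class name 
HtmlHeaderSketch, its methods, the HTML 4.01 Transitional DOCTYPE, and the 
"first non-blank line" title heuristic are all assumptions made for this 
example, and no PDFBox API is used.

```java
// Hypothetical sketch, not the attached patch: builds an HTML header with a
// DOCTYPE, an encoding <meta> tag, and a <title> that falls back to the text.
public class HtmlHeaderSketch {

    // Escape characters that are special in HTML text and attribute values.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Prefer the title from the document information. If it is missing or
    // blank, guess one: here, the first non-blank line of the extracted text
    // (an assumed heuristic).
    public static String chooseTitle(String infoTitle, String extractedText) {
        if (infoTitle != null && !infoTitle.trim().isEmpty()) {
            return infoTitle.trim();
        }
        if (extractedText != null) {
            for (String line : extractedText.split("\\r?\\n")) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    return trimmed;
                }
            }
        }
        return "";
    }

    // Assemble the DOCTYPE, the encoding declaration, and the title.
    public static String buildHeader(String infoTitle, String extractedText,
                                     String encoding) {
        StringBuilder sb = new StringBuilder();
        sb.append("<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" ")
          .append("\"http://www.w3.org/TR/html4/loose.dtd\">\n");
        sb.append("<html><head>\n");
        sb.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=")
          .append(escape(encoding)).append("\">\n");
        sb.append("<title>")
          .append(escape(chooseTitle(infoTitle, extractedText)))
          .append("</title>\n");
        sb.append("</head>\n<body>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // No title in the document information, so the first line is used.
        System.out.print(buildHeader(null, "\n  Annual Report \nsecond line", "UTF-8"));
    }
}
```

Declaring the charset in the <meta> tag matters for Arabic output because a 
browser that guesses the wrong encoding will display the text as garbage.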

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
