[ https://issues.apache.org/jira/browse/PDFBOX-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Justin LeFebvre updated PDFBOX-434:
-----------------------------------
Description: Would like to improve the HTML output of PDF files for Arabic
rendering. The attached file has changes that should improve the way the -html
option works. Output files are now given the .html extension. We also added a
<!DOCTYPE> declaration as well as a <meta> tag that declares the encoding of
the file. Cleaned up a lot of unused code in PDFTextStripper and PDFText2HTML.
Added the ability to set the <title> tag of the HTML document to the title
given in the PDF document information, if it exists. Otherwise a title is
guessed from the first lines of the file. (was: Would like to improve the html
output of pdf files for arabic rendering. )
> Improve html output
> -------------------
>
> Key: PDFBOX-434
> URL: https://issues.apache.org/jira/browse/PDFBOX-434
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Reporter: Justin LeFebvre
> Attachments: html_improvements.diff
>
>
> Would like to improve the HTML output of PDF files for Arabic rendering. The
> attached file has changes that should improve the way the -html option works.
> Output files are now given the .html extension. We also added a <!DOCTYPE>
> declaration as well as a <meta> tag that declares the encoding of the file.
> Cleaned up a lot of unused code in PDFTextStripper and PDFText2HTML. Added
> the ability to set the <title> tag of the HTML document to the title given
> in the PDF document information, if it exists. Otherwise a title is guessed
> from the first lines of the file.
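
For illustration only, the behaviour described above could be sketched roughly
as follows. This is not the code in html_improvements.diff and does not use
the PDFBox API. The class and method names are hypothetical, the HTML 4.01
Transitional doctype is an assumed choice, and "guess a title" is simplified
to "take the first non-blank line of the extracted text".

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of PDFBox or the attached patch. */
final class HtmlHeaderSketch {

    private HtmlHeaderSketch() {
    }

    /** Escapes the characters that are significant in HTML text and attribute values. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Prefers the title from the PDF document information. If it is missing
     * or blank, falls back to the first non-blank line of the extracted text
     * (an assumed, simplified form of the "guess" described in the issue).
     */
    static String chooseTitle(String metadataTitle, String extractedText) {
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();
        }
        if (extractedText != null) {
            for (String line : extractedText.split("\\r?\\n")) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    return trimmed;
                }
            }
        }
        return "";
    }

    /** Builds the document header: doctype, title, and a meta tag naming the encoding. */
    static String buildHeader(String encoding, String title) {
        return "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" "
                + "\"http://www.w3.org/TR/html4/loose.dtd\">\n"
                + "<html><head>\n"
                + "<title>" + escape(title) + "</title>\n"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset="
                + escape(encoding) + "\">\n"
                + "</head>\n<body>\n";
    }

    /** Derives the output name: a trailing ".pdf" (any case) is replaced by ".html". */
    static String htmlFileName(String pdfName) {
        if (pdfName.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            return pdfName.substring(0, pdfName.length() - 4) + ".html";
        }
        return pdfName + ".html";
    }
}
```

With this sketch, HtmlHeaderSketch.buildHeader("UTF-8",
HtmlHeaderSketch.chooseTitle(null, text)) would produce a header whose title
comes from the extracted text. Declaring the encoding in the <meta> tag is the
part that matters for Arabic: without it a browser may decode the bytes with
the wrong charset.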
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.