[ https://issues.apache.org/jira/browse/PDFBOX-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194697#comment-13194697 ]
Timo Boehme commented on PDFBOX-1213:
-------------------------------------
In my opinion the proposed changes to PDFTextStripper are too closely tied to
this one use case. I think we need a more general solution, because more
parameters than these can sometimes be extracted from the font definitions.
I would propose a fontChanged notification, perhaps using the listener pattern,
so that when no listeners are registered we can skip the work of extracting
font information:
interface FontChangedListener {
    void fontChanged( FontInformation fontInfo );
}

// Read-only view of the properties extracted from the font definition.
interface FontInformation {
    boolean isBold();
    boolean isItalic();
    boolean isRoman();
    boolean isSansSerif();
    String getFontName();
    float getFontSizePt();
}

class PDFTextStripper {
    ...
    protected List<FontChangedListener> fontListeners = new LinkedList<FontChangedListener>();
    ...
    public void registerFontListener( FontChangedListener listener ) {
        fontListeners.add( listener );
    }

    protected void writePage() throws IOException {
        ...
        if ( ! fontListeners.isEmpty() ) {
            // test for font changes and notify listeners
        }
        ...
    }
}
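To make the idea concrete, here is a minimal, self-contained sketch of the notification logic, kept separate from the PDFBox classes. All names in it (FontInfo, FontChangeNotifier, onTextRun, sameStyleAs) are hypothetical and only illustrate the pattern; how the font properties are actually obtained from the PDF font definition is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; none of these types exist in PDFBox.
interface FontChangedListener {
    void fontChanged(FontInfo info);
}

// Immutable snapshot of the properties a listener might care about.
final class FontInfo {
    final String name;
    final float sizePt;
    final boolean bold;
    final boolean italic;

    FontInfo(String name, float sizePt, boolean bold, boolean italic) {
        this.name = name;
        this.sizePt = sizePt;
        this.bold = bold;
        this.italic = italic;
    }

    // True when nothing a listener can observe has changed.
    boolean sameStyleAs(FontInfo other) {
        return other != null
                && name.equals(other.name)
                && sizePt == other.sizePt
                && bold == other.bold
                && italic == other.italic;
    }
}

class FontChangeNotifier {
    private final List<FontChangedListener> listeners = new ArrayList<FontChangedListener>();
    private FontInfo current; // last font reported to the listeners

    void register(FontChangedListener listener) {
        listeners.add(listener);
    }

    // Assumed to be called once per text run while a page is written.
    void onTextRun(FontInfo info) {
        if (listeners.isEmpty()) {
            return; // nobody is interested, so skip the comparison
        }
        if (info.sameStyleAs(current)) {
            return; // same font as before, nothing to report
        }
        current = info;
        for (FontChangedListener listener : listeners) {
            listener.fontChanged(info);
        }
    }
}
```

A listener registered with `register( info -> ... )` would then be called only when the name, size or style differs from the previous text run.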
In PDFText2HTML you have to keep track of whether a span was opened with font
style information, and close it before closing any other tags.
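The bookkeeping could look roughly like the sketch below. StyleSpanTracker and its methods are hypothetical and not part of PDFText2HTML; the inline CSS is only one possible way to express the style, and the sketch assumes the caller writes the returned markup to the output itself.

```java
// Hypothetical helper, not part of PDFText2HTML.
class StyleSpanTracker {
    private boolean spanOpen = false;

    // Returns the markup to emit when the font style changes:
    // closes a previously opened span, then opens a new one.
    String onStyleChange(boolean bold, boolean italic, float sizePt) {
        StringBuilder out = new StringBuilder(closeIfOpen());
        out.append("<span style=\"");
        if (bold) {
            out.append("font-weight:bold;");
        }
        if (italic) {
            out.append("font-style:italic;");
        }
        out.append("font-size:").append(sizePt).append("pt;\">");
        spanOpen = true;
        return out.toString();
    }

    // Must be called before any enclosing tag (for example </p>) is closed,
    // so that the tags stay properly nested.
    String closeIfOpen() {
        if (!spanOpen) {
            return "";
        }
        spanOpen = false;
        return "</span>";
    }
}
```

Calling closeIfOpen() at every paragraph and page end is harmless when no span is open, since it then returns an empty string.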
> Adding style information to the PDF to HTML converter
> -----------------------------------------------------
>
> Key: PDFBOX-1213
> URL: https://issues.apache.org/jira/browse/PDFBOX-1213
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 1.6.0
> Reporter: Enrique Pérez
> Attachments: diff.patch
>
>
> This patch modifies the PDF to HTML conversion in order to add style
> information (bold, italic and font size) to the resulting file. Moreover, we
> have deleted the "DOCTYPE" header because some parsers throw the following
> exception:
> [Fatal Error] loose.dtd:31:3: The declaration for the entity "HTML.Version"
> must end with '>'.
> org.xml.sax.SAXParseException: The declaration for the entity "HTML.Version"
> must end with '>'.