[
https://issues.apache.org/jira/browse/TIKA-53?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jukka Zitting updated TIKA-53:
------------------------------
Attachment: TIKA-53.patch
The attached patch (TIKA-53.patch) is my first shot at this.
Most of the parsers just take the String that they used to produce before, and
output the following SAX events:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>...</title>
</head>
<body>
<p>...</p>
</body>
</html>
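For illustration only, here is a minimal sketch of how a parser could emit the event sequence above from an already extracted String. It is not the code in TIKA-53.patch. The class name SimpleXhtmlEmitter and its emit() method are hypothetical, and placing the extracted text in the single <p> element is an assumption. Only standard org.xml.sax classes are used.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of the patch: wraps plain text in a
// minimal XHTML document expressed as SAX events.
public class SimpleXhtmlEmitter {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Emits html/head/title and body/p events around the given strings.
    // Assumption: the extracted text goes into one <p> element.
    public static void emit(ContentHandler handler, String title, String text)
            throws SAXException {
        handler.startDocument();
        handler.startPrefixMapping("", XHTML);
        start(handler, "html");
        start(handler, "head");
        start(handler, "title");
        chars(handler, title);
        end(handler, "title");
        end(handler, "head");
        start(handler, "body");
        start(handler, "p");
        chars(handler, text);
        end(handler, "p");
        end(handler, "body");
        end(handler, "html");
        handler.endPrefixMapping("");
        handler.endDocument();
    }

    private static void start(ContentHandler h, String name) throws SAXException {
        // Elements carry no attributes; the namespace is passed as the URI.
        h.startElement(XHTML, name, name, new AttributesImpl());
    }

    private static void end(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML, name, name);
    }

    private static void chars(ContentHandler h, String s) throws SAXException {
        char[] ch = s.toCharArray();
        h.characters(ch, 0, ch.length);
    }
}
```

A caller would pass any ContentHandler, for example a DefaultHandler subclass that collects the text or a serializer that writes the XHTML out.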
The only exception for now is the HTMLParser (surprise!), which uses the XHTML
output from Tidy.
The TXTParser class is also slightly more advanced, as it avoids reading the
full document into memory (assuming ICU4J doesn't do that). Instead it reads
the character stream in small batches and uses the characters() SAX event to
feed that stream to the given ContentHandler.
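As a rough sketch of the batching idea (again, not the patch itself), the loop below reads a Reader in fixed-size chunks and forwards each chunk through characters(). The class name StreamingTextFeeder, the feed() method and the bufferSize parameter are hypothetical. Charset detection with ICU4J is left out, so the Reader is assumed to be already decoded.

```java
import java.io.IOException;
import java.io.Reader;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the patch: streams character data to a
// ContentHandler without buffering the whole document in memory.
public class StreamingTextFeeder {

    // Reads up to bufferSize characters at a time and emits one
    // characters() event per non-empty batch.
    public static void feed(Reader reader, ContentHandler handler, int bufferSize)
            throws IOException, SAXException {
        char[] buffer = new char[bufferSize];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            if (n > 0) {
                handler.characters(buffer, 0, n);
            }
        }
    }
}
```

Because SAX allows character data to arrive in several consecutive characters() calls, a handler must not assume that one call carries the complete text.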
> XHTML SAX events from parsers
> -----------------------------
>
> Key: TIKA-53
> URL: https://issues.apache.org/jira/browse/TIKA-53
> Project: Tika
> Issue Type: Improvement
> Components: general
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Fix For: 0.1-incubator
>
> Attachments: TIKA-53.patch
>
>
> Tika parsers should produce a sequence of XHTML SAX events instead of a single
> unstructured String as the parsed document content.