[jira] Updated: (TIKA-53) XHTML SAX events from parsers

Jukka Zitting (JIRA) Thu, 11 Oct 2007 02:18:40 -0700

     [ https://issues.apache.org/jira/browse/TIKA-53?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jukka Zitting updated TIKA-53:
------------------------------

    Attachment: TIKA-53.patch

The attached patch (TIKA-53.patch) is my first shot at this.

Most of the parsers just take the String that they used to produce before, and 
output the following SAX events:

    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
            <title>...</title>
        </head>
        <body>
            <p>...</p>
        </body>
    </html>
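
For illustration only, here is a minimal sketch of how a parser could emit
the events above to a SAX ContentHandler. It is not the code from the
attached patch: the class name XhtmlEventsSketch, the emit() method and its
title and text parameters are hypothetical, and placing the whole extracted
String inside a single <p> element is an assumption.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of Tika: wraps an extracted String in the
// fixed XHTML skeleton shown above.
final class XhtmlEventsSketch {

    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Emits html/head/title and body/p events around the given strings.
    static void emit(ContentHandler handler, String title, String text)
            throws SAXException {
        handler.startDocument();
        handler.startPrefixMapping("", XHTML);
        start(handler, "html");
        start(handler, "head");
        start(handler, "title");
        chars(handler, title);
        end(handler, "title");
        end(handler, "head");
        start(handler, "body");
        start(handler, "p");
        chars(handler, text); // assumption: all text goes into one paragraph
        end(handler, "p");
        end(handler, "body");
        end(handler, "html");
        handler.endPrefixMapping("");
        handler.endDocument();
    }

    private static void start(ContentHandler h, String name) throws SAXException {
        h.startElement(XHTML, name, name, new AttributesImpl());
    }

    private static void end(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML, name, name);
    }

    private static void chars(ContentHandler h, String s) throws SAXException {
        char[] c = s.toCharArray();
        h.characters(c, 0, c.length);
    }
}
```

Any ContentHandler can receive these events, for example an
org.xml.sax.helpers.DefaultHandler subclass that collects the character
data, or a serializer that writes the XHTML back out.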

The only exception for now is the HTMLParser (surprise!) that uses the XHTML 
output from Tidy.

The TXTParser class is also slightly more advanced, as it'll avoid reading the 
full document into memory (assuming ICU4J doesn't do that). Instead it'll read 
the character stream in small batches and use the characters() SAX event to 
feed that stream to the given ContentHandler.
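
The batching idea can be sketched roughly as follows. This is an
illustration under assumptions, not the TXTParser code from the patch: the
class name StreamingTextSketch, the stream() method and the batchSize
parameter are hypothetical, and the Reader is assumed to be already decoded
(charset detection, for which the message mentions ICU4J, is left out).

```java
import java.io.IOException;
import java.io.Reader;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Tika: forwards a character stream to a
// ContentHandler in small batches instead of building one large String.
final class StreamingTextSketch {

    static void stream(Reader reader, ContentHandler handler, int batchSize)
            throws IOException, SAXException {
        char[] buffer = new char[batchSize];
        int n;
        // Reader.read returns -1 at end of stream; each non-empty batch
        // becomes one characters() event.
        while ((n = reader.read(buffer)) != -1) {
            if (n > 0) {
                handler.characters(buffer, 0, n);
            }
        }
    }
}
```

Only one buffer of batchSize characters is held at a time, so memory use
does not grow with the document size. A real parser would surround this
loop with the startElement() and endElement() events for the enclosing
XHTML elements.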

> XHTML SAX events from parsers
> -----------------------------
>
>                 Key: TIKA-53
>                 URL: https://issues.apache.org/jira/browse/TIKA-53
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 0.1-incubator
>
>         Attachments: TIKA-53.patch
>
>
> Tika parsers should produce a sequence of XHTML SAX events instead of a single 
> unstructured String as the parsed document content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
