Hi,
On Tue, Dec 9, 2008 at 8:27 AM, Stephane Bastian
<[EMAIL PROTECTED]> wrote:
> So, I wanted to know 1) if other people had trouble extending existing
> Parser? and 2) if this is an issue we should tackle?
We're of course open to contributions on issues like this, but I'm
wondering if your use case would be better served by using the
underlying parser library (NekoHTML) directly. If not, how about an
extension point like the one defined in the patch below?
BR,
Jukka Zitting
Index: src/main/java/org/apache/tika/parser/html/HtmlParser.java
===================================================================
--- src/main/java/org/apache/tika/parser/html/HtmlParser.java (revision 724309)
+++ src/main/java/org/apache/tika/parser/html/HtmlParser.java (working copy)
@@ -84,6 +84,31 @@
     }
+    /**
+     * Extra handler that can be specified by the client application for
+     * additional processing of raw HTML SAX events generated by NekoHTML.
+     */
+    private ContentHandler extension;
+
+    /**
+     * Returns the configured extension handler.
+     *
+     * @return configured extension handler, or <code>null</code>
+     */
+    public ContentHandler getExtension() {
+        return extension;
+    }
+
+    /**
+     * Sets an extension handler for additional processing of the raw HTML
+     * SAX events generated by the underlying HTML parser.
+     *
+     * @param extension extension handler
+     */
+    public void setExtension(ContentHandler extension) {
+        this.extension = extension;
+    }
+
     public void parse(
             InputStream stream, ContentHandler handler, Metadata metadata)
             throws IOException, SAXException, TikaException {
@@ -102,9 +127,17 @@
                 new MatchingContentHandler(getTitleHandler(metadata), title),
                 new MatchingContentHandler(getMetaHandler(metadata), meta));
+        // Simplify the HTML for Tika clients
+        handler = new XHTMLDowngradeHandler(handler);
+
+        // Add the configured extension, if any
+        if (extension != null) {
+            handler = new TeeContentHandler(handler, extension);
+        }
+
         // Parse the HTML document
         SAXParser parser = new SAXParser();
-        parser.setContentHandler(new XHTMLDowngradeHandler(handler));
+        parser.setContentHandler(handler);
         parser.parse(new InputSource(Utils.getUTF8Reader(stream, metadata)));
     }
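
To illustrate the idea behind the patch, here is a rough, self-contained
sketch of how a "tee" handler lets an extra handler see the same SAX events
as the main one. It is only an illustration under stated assumptions: the
Tee, TextCollector and ElementRecorder classes are hypothetical stand-ins
(Tee is not Tika's TeeContentHandler), and the JDK's built-in XML parser is
used instead of NekoHTML so that the example runs without extra libraries.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TeeSketch {

    /** Hypothetical stand-in for a tee handler: forwards events to two handlers. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler main;
        private final DefaultHandler extension;

        Tee(DefaultHandler main, DefaultHandler extension) {
            this.main = main;
            this.extension = extension;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) throws SAXException {
            main.startElement(uri, localName, qName, atts);
            extension.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            main.endElement(uri, localName, qName);
            extension.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            main.characters(ch, start, length);
            extension.characters(ch, start, length);
        }
    }

    /** Plays the role of the regular client handler: collects text content. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Plays the role of the extension: records every element name it sees. */
    static class ElementRecorder extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) {
            names.add(qName);
        }
    }

    /** Parses the given markup with the JDK parser (an assumption; not NekoHTML). */
    static void parse(String xml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }

    public static void main(String[] args) throws Exception {
        TextCollector main = new TextCollector();
        ElementRecorder extension = new ElementRecorder();
        parse("<html><body><p>hi</p></body></html>", new Tee(main, extension));
        System.out.println("text: " + main.text);
        System.out.println("elements: " + extension.names);
    }
}
```

With the patch applied, a client would presumably pass a handler like
ElementRecorder to HtmlParser.setExtension() before calling parse(), and
would then receive the raw events alongside the simplified XHTML output.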