[ 
https://issues.apache.org/jira/browse/LABS-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632000#action_12632000
 ] 

Thorsten Scherler commented on LABS-118:
----------------------------------------

Here are some more links related to the issue:
http://java.sun.com/j2se/1.4.2/docs/api/org/xml/sax/ContentHandler.html
http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html

To get a string representation of the handler, Tika does:

http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/utils/ParseUtils.java
...
return handler.toString();
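
To illustrate the toString() idea without depending on Tika itself, here is a 
minimal sketch using only the JDK SAX classes. TextCollectingHandler and 
TextCollectingDemo are hypothetical names, not Tika classes; the assumption is 
simply that the handler buffers character events and exposes them via toString().

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): buffers character events
// so that toString() returns the text seen so far.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

class TextCollectingDemo {
    // Parses an XML string with the JDK SAX parser and returns the collected text.
    static String extractText(String xml) {
        TextCollectingHandler handler = new TextCollectingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<html><body><p>Hello Tika</p></body></html>"));
    }
}
```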

Generally, there are a lot of specialized handlers that can be used:
http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/sax/

The Tika documentation gives an example that writes to System.out:
ContentHandler handler = new BodyContentHandler(System.out);

The best fit is to use the XHTMLContentHandler with a 
BufferedOutputStream, then convert the output to an InputStream and work with 
that (a valid XML document) in the next stages (linkExtraction/handler).
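
A rough sketch of the "write to a stream, then read it back as an InputStream" 
idea, using only JDK classes. Assumptions: a JAXP TransformerHandler stands in 
for Tika's XHTMLContentHandler, a ByteArrayOutputStream stands in for the 
BufferedOutputStream, and the hand-fired SAX events stand in for a real parser. 
XhtmlBufferDemo and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class XhtmlBufferDemo {
    // Serializes a tiny document from SAX events into an in-memory buffer.
    static byte[] serialize(String bodyText) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            handler.setResult(new StreamResult(buffer));

            // Stand-in for the events a parser would fire.
            AttributesImpl noAttributes = new AttributesImpl();
            char[] chars = bodyText.toCharArray();
            handler.startDocument();
            handler.startElement("", "html", "html", noAttributes);
            handler.startElement("", "body", "body", noAttributes);
            handler.startElement("", "p", "p", noAttributes);
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endElement("", "body", "body");
            handler.endElement("", "html", "html");
            handler.endDocument();
            return buffer.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    // Next stage: reads the buffered document back through an InputStream
    // and returns its text content, which also proves it is well-formed XML.
    static String roundTrip(String bodyText) {
        StringBuilder text = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(serialize(bodyText))) {
            SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("Re-parsing failed", e);
        }
        return text.toString();
    }
}
```

The byte array is the hand-over point between stages: any later stage can wrap 
it in a fresh ByteArrayInputStream without re-running the parser.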

One could also create a linkExtractorHandler that returns the outlinks 
from the document. However, this would happen in the parser stage, meaning 
there would be no separate LinkExtractor.
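
Such a handler could look roughly like the following sketch, which uses only 
the JDK SAX classes. LinkCollectingHandler and LinkCollectingDemo are 
hypothetical names, and the assumption that outlinks are the href attributes of 
<a> elements is mine, not something Droids or Tika prescribes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the href of every <a> element it sees.
class LinkCollectingHandler extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default JDK parser is not namespace-aware, so qName carries the element name.
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    List<String> getLinks() {
        return links;
    }
}

class LinkCollectingDemo {
    // Parses an XML string and returns the outlinks found during parsing.
    static List<String> extractLinks(String xml) {
        LinkCollectingHandler handler = new LinkCollectingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.getLinks();
    }
}
```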

Looking at 
http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/sax/TeeContentHandler.java
we may actually prefer the TeeContentHandler. We would use both handlers and 
pass the XHTML as a stream to the handler.
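
The tee idea, sketched with JDK classes only: one parse pass feeds two 
handlers at once. SimpleTeeHandler is a hypothetical stand-in, not Tika's 
TeeContentHandler, and it forwards only the handful of events this example 
needs. The two anonymous handlers (text and links) are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical minimal tee: forwards a subset of SAX events to every target.
class SimpleTeeHandler extends DefaultHandler {
    private final ContentHandler[] targets;

    SimpleTeeHandler(ContentHandler... targets) {
        this.targets = targets;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        for (ContentHandler t : targets) t.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        for (ContentHandler t : targets) t.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (ContentHandler t : targets) t.characters(ch, start, length);
    }
}

class TeeDemo {
    // Returns { collected text, comma-separated links } from a single parse pass.
    static String[] parse(String xml) {
        StringBuilder text = new StringBuilder();
        List<String> links = new ArrayList<>();

        DefaultHandler textHandler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        DefaultHandler linkHandler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("a".equalsIgnoreCase(qName) && atts.getValue("href") != null) {
                    links.add(atts.getValue("href"));
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new SimpleTeeHandler(textHandler, linkHandler));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return new String[] { text.toString(), String.join(",", links) };
    }
}
```

With this shape the document is parsed once, and neither handler needs to know 
that the other exists.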

> Create tied integration with Apache Tika (for parser and handler)
> -----------------------------------------------------------------
>
>                 Key: LABS-118
>                 URL: https://issues.apache.org/jira/browse/LABS-118
>             Project: Labs
>          Issue Type: New Feature
>          Components: Droids
>            Reporter: Thorsten Scherler
>
> http://incubator.apache.org/tika/
> Apache Tika is a toolkit for detecting and extracting metadata and structured 
> text content from various documents using existing parser libraries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

