[ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871544#action_12871544 ]
Julien Nioche commented on TIKA-433:
------------------------------------

You can do that with [Behemoth|http://code.google.com/p/behemoth-pebble/], as it runs Tika on rich documents stored in a SequenceFile. There is an application in the Behemoth Sandbox which sends the annotated documents to Solr, and I am planning to write one to generate vectors for Mahout. The output format is a very straightforward standoff annotation model, which should fit most applications.

> Tika + Hadoop
> -------------
>
> Key: TIKA-433
> URL: https://issues.apache.org/jira/browse/TIKA-433
> Project: Tika
> Issue Type: New Feature
> Components: general
> Reporter: Grant Ingersoll
> Priority: Minor
>
> Would be great to have a Tika contrib that took in an HDFS location with
> "rich" documents on it and an output format (or output processor) and
> converted the docs to XHTML or Solr or whatever. Seems like it should be
> pretty straightforward to do on the Hadoop side of things. Only tricky part,
> I suppose, is the output format and how to make that pluggable.