Hi Marshall,

There is a description in the README.txt file from the TikaAnnotator repository, which I have slightly rewritten into the text below.

*Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. The TikaAnnotator uses Tika to extract a document's text and metadata and to generate annotations representing its original markup. It consists of three resources:

- FileSystemCollectionReader: similar to the one in the UIMA examples, but uses Tika to extract the text from binary documents and generates annotations to represent the markup
- MarkupAnnotator: takes the original content from a view and generates a new view containing the extracted text with markup annotations
- TikaWrapper: utility class that populates a CAS from a binary document; used by the FileSystemCollectionReader
*

Best,
J.

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/5/22 Marshall Schor <m...@schor.com>

> Hi Julien,
>
> Can you write up a little something and submit a patch to the website?
>
> -Marshall
>
> Julien Nioche wrote:
> > Hi,
> >
> > I contributed an annotator to the sandbox some time ago which uses Tika to
> > convert original markup into UIMA annotations. It does not seem to be listed
> > on the website but it should be in the SVN repository of the sandbox.
> >
> > Tika supports numerous formats such as PDF, XML, HTML etc...
> >
> > Julien
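
To make the "extracted text with markup annotations" idea from the MarkupAnnotator description more concrete, here is a toy sketch in plain Java. It is not the TikaAnnotator code and it deliberately uses neither Tika nor UIMA: the names MarkupSketch, Span, Result and extract are all made up for illustration, and the parsing is a naive tag scan that assumes simple, well-nested tags (no comments, entities or self-closing elements). It only shows the general shape of the output: the text without markup, plus one span per element with begin/end offsets into that text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: NOT the TikaAnnotator implementation,
// and no Tika or UIMA APIs are used here.
final class MarkupSketch {

    /** One markup element mapped onto offsets in the extracted text. */
    static final class Span {
        final String tag;
        final int begin;
        final int end;

        Span(String tag, int begin, int end) {
            this.tag = tag;
            this.begin = begin;
            this.end = end;
        }
    }

    /** The extracted text plus the spans that describe the original markup. */
    static final class Result {
        final String text;
        final List<Span> spans;

        Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    /** Strips tags from the input and records where each element started and ended. */
    static Result extract(String markup) {
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        List<String> openTags = new ArrayList<>();      // names of elements not yet closed
        List<Integer> openOffsets = new ArrayList<>();  // text offset where each one opened

        int i = 0;
        while (i < markup.length()) {
            char c = markup.charAt(i);
            int close = (c == '<') ? markup.indexOf('>', i) : -1;
            if (close < 0) {
                // Ordinary character (or a stray '<'): it belongs to the text.
                text.append(c);
                i++;
                continue;
            }
            String tag = markup.substring(i + 1, close).trim();
            if (tag.startsWith("/")) {
                // Closing tag: pair it with the most recent open tag of the same name.
                String name = tag.substring(1).trim();
                int top = openTags.lastIndexOf(name);
                if (top >= 0) {
                    spans.add(new Span(name, openOffsets.get(top), text.length()));
                    openTags.remove(top);
                    openOffsets.remove(top);
                }
            } else if (!tag.isEmpty() && !tag.endsWith("/")) {
                // Opening tag: keep the element name only, attributes are ignored.
                openTags.add(tag.split("\\s+")[0]);
                openOffsets.add(text.length());
            }
            i = close + 1;
        }
        return new Result(text.toString(), spans);
    }
}
```

Calling MarkupSketch.extract on a small HTML fragment returns the plain text in Result.text and one Span per closed element in Result.spans, in the order the elements were closed. In the real annotator the parsing is done by Tika and the spans are UIMA annotations in a new view, so treat this purely as a picture of the data, not of the API.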