Re: document structure

Julien Nioche Fri, 22 May 2009 01:14:48 -0700

Hi Marshall,

There is a description in the README.txt file from the TikaAnnotator
repository, which I have slightly rewritten into the text below.



*Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries. The TikaAnnotator uses Tika to extract the text and metadata of
a document and to generate annotations representing its original markup.
It consists of three resources:

- FileSystemCollectionReader: similar to the one in the UIMA examples, but
uses Tika to extract the text from binary documents and generates
annotations to represent the markup

- MarkupAnnotator: takes the original content from a view and generates a
new view containing the extracted text with markup annotations

- TikaWrapper: utility class that populates a CAS from a binary document;
used by the FileSystemCollectionReader*
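
To make "annotations representing the original markup" a bit more concrete,
here is a minimal sketch of the general idea. It is not the TikaAnnotator
code. Tika parsers report document content as SAX events to a
ContentHandler, so the sketch uses a plain SAX handler that accumulates the
text and records one span (element name plus begin/end offsets into that
text) per element. The names MarkupSketch, MarkupHandler and Span are
hypothetical, Span is only a stand-in for a UIMA annotation, and the JDK's
own SAX parser is fed well-formed markup here in place of a Tika parser.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MarkupSketch {

    /** Hypothetical stand-in for a UIMA markup annotation: element name plus text offsets. */
    public static final class Span {
        public final String name;
        public final int begin; // inclusive offset into the extracted text
        public final int end;   // exclusive offset into the extracted text

        Span(String name, int begin, int end) {
            this.name = name;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return name + "[" + begin + "," + end + ")";
        }
    }

    /** Collects character data as plain text and records one Span per closed element. */
    public static final class MarkupHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final Deque<Integer> starts = new ArrayDeque<>();
        private final List<Span> spans = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Remember where in the extracted text this element begins.
            starts.push(text.length());
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // The element covers everything appended since its start tag.
            spans.add(new Span(qName, starts.pop(), text.length()));
        }

        public String getText() {
            return text.toString();
        }

        public List<Span> getSpans() {
            return spans;
        }
    }

    /** Parses well-formed markup with the JDK SAX parser (an assumption; Tika is not used here). */
    public static MarkupHandler parse(String xml) {
        MarkupHandler handler = new MarkupHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        MarkupHandler h = parse("<html><body><h1>Title</h1><p>Some text</p></body></html>");
        System.out.println(h.getText());
        for (Span s : h.getSpans()) {
            System.out.println(s);
        }
    }
}
```

Spans are recorded when an element closes, so inner elements appear in the
list before the elements that contain them. In a real annotator each span
would presumably become an annotation added to the CAS view holding the
extracted text.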


Best,

J.

-- 
DigitalPebble Ltd
http://www.digitalpebble.com


2009/5/22 Marshall Schor <m...@schor.com>

> Hi Julien,
>
> Can you write up a little something and submit a patch to the website?
>
> -Marshall
>
> Julien Nioche wrote:
> > Hi,
> >
> > I contributed an annotator to the sandbox some time ago which uses Tika
> > to convert original markup into UIMA annotations. It does not seem to be
> > listed on the website but it should be in the SVN repository of the
> > sandbox.
> >
> > Tika supports numerous formats such as PDF, XML, HTML, etc.
> >
> > Julien
> >
> >
>
