~
 http://projects.apache.org/projects/tika.html
~
 http://tika.apache.org/1.0/formats.html
~
 say that Tika "... easily detect(s) and extract(s) metadata and
content from all major file formats"
~
 I think "all major file formats" should be specified functionally,
through something like
~
 core.tika.formatHandlers.getAll[DefinedFormat]Handlers
~
 which would access some registry and (selectively) return metadata
as XML-based RDF sections or a similar data structure. I also think
that registry should include a CMS-like interface, covering just the
metadata of the files in the repository, that is searchable (ideally
through queries).
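~
 Just to make the idea concrete, here is a rough sketch in plain Java
of what such a registry could look like. None of this exists in Tika:
FormatRegistry, Entry, getAllDefinedFormats, find and countByType are
names I made up for illustration, and the "hasTables" metadata key is
invented too. It only shows the shape of the thing: a list of declared
formats, per-file metadata, a query hook and some simple statistics.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical sketch: none of these types or methods are part of Tika.
public class FormatRegistry {

    // One record per file: its media type plus whatever metadata was extracted.
    public static final class Entry {
        public final String path;
        public final String mediaType;
        public final Map<String, String> metadata;

        Entry(String path, String mediaType, Map<String, String> metadata) {
            this.path = path;
            this.mediaType = mediaType;
            this.metadata = metadata;
        }
    }

    private final Set<String> definedFormats = new TreeSet<>();
    private final List<Entry> entries = new ArrayList<>();

    // Declare a format as handled (stands in for registering a real handler).
    public void registerFormat(String mediaType) {
        definedFormats.add(mediaType);
    }

    // The functional answer to "which formats are supported?"
    public Set<String> getAllDefinedFormats() {
        return Collections.unmodifiableSet(definedFormats);
    }

    // Store the metadata of one file; formats nobody declared are rejected.
    public void add(String path, String mediaType, Map<String, String> metadata) {
        if (!definedFormats.contains(mediaType)) {
            throw new IllegalArgumentException("No handler for " + mediaType);
        }
        entries.add(new Entry(path, mediaType, metadata));
    }

    // Searchable interface: return the paths of all entries matching a query.
    public List<String> find(Predicate<Entry> query) {
        List<String> hits = new ArrayList<>();
        for (Entry e : entries) {
            if (query.test(e)) {
                hits.add(e.path);
            }
        }
        return hits;
    }

    // Simple statistics: number of files per media type.
    public Map<String, Integer> countByType() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Entry e : entries) {
            counts.merge(e.mediaType, 1, Integer::sum);
        }
        return counts;
    }

    // Small made-up corpus used by main().
    public static FormatRegistry demo() {
        FormatRegistry r = new FormatRegistry();
        r.registerFormat("application/msword");
        r.registerFormat("application/vnd.oasis.opendocument.text");
        r.add("report.doc", "application/msword", Map.of("hasTables", "true"));
        r.add("memo.doc", "application/msword", Map.of("hasTables", "false"));
        r.add("notes.odt", "application/vnd.oasis.opendocument.text",
                Map.of("hasTables", "false"));
        return r;
    }

    public static void main(String[] args) {
        FormatRegistry r = demo();
        System.out.println(r.getAllDefinedFormats());
        // Word documents with tables, i.e. candidates for a closer look
        // before converting them to ODT.
        System.out.println(r.find(e -> "application/msword".equals(e.mediaType)
                && "true".equals(e.metadata.get("hasTables"))));
        System.out.println(r.countByType());
    }
}
```

 A real version would fill the registry from whatever the parsers
report and could serialize the entries as RDF; the in-memory lists here
are only a stand-in.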
~
 The thing is that (I think) most people using Tika will most probably
need it for large databanks/corpora, and they would love to have such
an interface to run some statistics or play with the data. Say you
have a large number of MS Word documents you would like to convert to
ODT, but you don't want to lose any formatting and you don't have the
time to eyeball all of the files ...
~
 lbrtchx
