Hi,
I have a requirement to convert hundreds of unstructured documents in
Word/PDF/TXT/email formats
into a structured repository containing XML metadata for each document
and the documents themselves.
I need to parse each of these documents and extract the relevant
information to build an XML metadata
document for each one.
The XML metadata for each underlying document will contain
fields such as Keywords, Category, Doc Name,
and Author.
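To make the target concrete, here is a minimal sketch of the kind of
metadata document I have in mind, built with only the JDK's DOM and
transformer classes. The element names (metadata, docName, author,
category, keywords, keyword), the class name MetadataSketch, and the
sample values are assumptions for illustration only, not a fixed schema.
The extraction step that would fill these fields is not shown.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: builds one XML metadata record for one source document.
class MetadataSketch {

    // Appends <name>text</name> to parent and returns the new element.
    // A null text creates an empty container element.
    static Element child(Document doc, Element parent, String name, String text) {
        Element e = doc.createElement(name);
        if (text != null) {
            e.setTextContent(text);
        }
        parent.appendChild(e);
        return e;
    }

    // Serializes the fields into an XML string (element names are assumed, not a standard).
    static String toXml(String docName, String author, String category,
                        List<String> keywords) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("metadata");
        doc.appendChild(root);

        child(doc, root, "docName", docName);
        child(doc, root, "author", author);
        child(doc, root, "category", category);

        Element kws = child(doc, root, "keywords", null);
        for (String k : keywords) {
            child(doc, kws, "keyword", k);
        }

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample values are made up for the example.
        System.out.println(
            toXml("report.doc", "rajesh", "Finance", List.of("budget", "forecast")));
    }
}
```

Using the DOM API rather than string concatenation means special
characters in names or keywords (such as & or <) are escaped on output.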
Is it possible to use Cocoon and/or POI to do this? If
yes, how would I use Cocoon to do the extraction?
I am new to Cocoon and am still trying to understand the world
of transformers, generators, etc.
Also, could I use Lucene to index the XML documents and build a search
engine around them?
I would like to know about the possible ways to do this.
regards
rajesh.