Rajesh Parekh wrote:

> Hi,
>  
> I have a requirement to convert hundreds of unstructured documents in
> WORD/PDF/TXT/EMAIL formats into a structured repository holding the
> documents themselves plus XML metadata for each one.
>
> I need to parse each of these documents and extract the relevant
> information to build an XML metadata document for each document.
>
> The XML metadata for each underlying document will contain fields
> like Keywords, Category, Doc Name, Author, etc.
>
> *Is it possible to use Cocoon and/or POI to do this? And if so, how
> would I use Cocoon to do the extraction?*
>  

Yes to some pieces of this, no to others.  Right now you'd need to use
POI directly: we haven't written a Cocoon generator for Excel yet, and
our read support for Word is very immature at best, so there is no
Serializer or Generator for it.

To be honest, I'm not sure Cocoon is appropriate for what you want to
do.  But definitely join the POI list if you're interested in helping
develop the Word port.  I'm sure you could make it suit your needs as
far as Excel/Doc go.
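
As a rough illustration of the metadata side only (independent of both
Cocoon and POI), here is a minimal sketch that uses nothing but the
JDK's built-in DOM and Transformer classes.  The class name
MetadataWriter and the element names (document, docName, author,
category, keywords) are made up for this example, and it assumes the
field values have already been extracted from the source document by
some other means.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: the element names are illustrative only,
// not a schema defined by Cocoon or POI.
public class MetadataWriter {

    // Builds one metadata record and returns it as an XML string.
    public static String toXml(String docName, String author,
                               String category, List<String> keywords)
            throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = dom.createElement("document");
        dom.appendChild(root);

        addField(dom, root, "docName", docName);
        addField(dom, root, "author", author);
        addField(dom, root, "category", category);

        // One <keyword> child per extracted keyword.
        Element kws = dom.createElement("keywords");
        root.appendChild(kws);
        for (String k : keywords) {
            addField(dom, kws, "keyword", k);
        }

        // Serialize the DOM tree; the serializer escapes special characters.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    // Appends <name>value</name> under the given parent element.
    private static void addField(Document dom, Element parent,
                                 String name, String value) {
        Element e = dom.createElement(name);
        e.setTextContent(value);
        parent.appendChild(e);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("report.doc", "A. Author", "Finance",
                Arrays.asList("budget", "forecast")));
    }
}
```

Writing one such record per source document keeps the extraction step
(POI or whatever else reads the file) separate from the metadata format,
so either side can change without touching the other.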

-Andy


> I am new to Cocoon, and trying to understand the world of 
> transformers/generators etc.
>  
> Also, could I use Lucene to index the XML documents and build a search
> engine around them?
>  
> I would like to know about the possible ways to do this.
>  
> regards
>  
> rajesh.
>  





---------------------------------------------------------------------
Please check that your question  has not already been answered in the
FAQ before posting.     <http://xml.apache.org/cocoon/faq/index.html>

To unsubscribe, e-mail:     <[EMAIL PROTECTED]>
For additional commands, e-mail:   <[EMAIL PROTECTED]>
