RE: Full Text Search for MS Word and Excel files?

Martin.Wallmer Wed, 25 Feb 2004 01:14:14 -0800

Hi Jim,

that's good news! I've embedded some questions in the text below.


So, as I understand it, for XML documents your extractor takes the 
content of one or more nodes and stores it as WebDAV properties. 
This extraction is done in the WebDAV layer.
That's cool! Did you modify the WebDAV layer?
Do you use this indexing for full text search? 
I do not yet understand what you are doing with binary files and 
web services. Could you please explain in more detail?

Best regards,
Martin


> -----Original Message-----
> From: Jim Myers [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 24 February 2004 21:04
> To: Slide Developers Mailing List
> Subject: Re: Full Text Search for MS Word and Excel files?
> 
> 
> I think I mentioned previously that we implemented this in 
> the webDAV layer
> for our SAM project. It would be nice to move our 
> capabilities down to use
> the events, so I'll pass on some of our requirements/issues:
> 

Sorry I didn't respond to your first posting.

> We've implemented three types of extractors - XSLT, Binary Format
> Description (BFD - a language for describing ascii/binary content and
> pulling it into an XML tagged form), and web services. 

So the result of extraction is XML?

> We also allow
> multiple steps, eg. using BFD to go from binary to XML with 
> XSLT used to
> extract properties. We chose not to support Java at this 
> point because we
> want extractors to be user-registerable and were concerned 
> about what people
> might try to do in Java (some large computation on the 
> resource to produce a
> property).
> 
> We found that it was useful to have a 'pre-processing' stage 
> that could
> override the default mimetype for .xml files - too many 
> people were using
> .xml for different types and, when using a file system driver 
> with webdav,
> all of these files were coming in as text/xml. So we allow 
> one global XSLT
> 'extractor' to run on all text/xml files to change just their 
> mimetype,
> which is then used to determine which extractors to run.
> 

Which mimetype goes into the WebDAV properties, the original 
text/xml or the changed one?
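
To make the question concrete, here is a minimal sketch of how such a
pre-processing step could look. It is not SAM's or Slide's actual code:
the class name, the stylesheet and the `application/x-experiment+xml`
type are invented for illustration, and only the standard
javax.xml.transform API is assumed. The caller still holds the original
text/xml, so either value could be stored as the property.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class MimeTypeOverrideSketch {
    // Hypothetical global stylesheet: looks at the root element and emits
    // a refined mimetype as plain text. The refined type name is made up.
    static final String GLOBAL_XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:choose>"
      + "<xsl:when test='/experiment'>application/x-experiment+xml</xsl:when>"
      + "<xsl:otherwise>text/xml</xsl:otherwise>"
      + "</xsl:choose>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the global stylesheet on a text/xml document and returns the
    // mimetype that should drive the choice of extractors.
    static String refineMimeType(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(GLOBAL_XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("mimetype pre-processing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(refineMimeType("<experiment><id>42</id></experiment>"));
        System.out.println(refineMimeType("<note>hello</note>"));
    }
}
```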

> It wasn't necessary for us to require a discovery mechanism 
> to know what
> properties an extractor is capable of producing and several of the
> extractors we've developed have variable outputs, e.g. some 
> look for all
> tags in a specific namespace (at a specific location in the 
> doc) and turn
> them into properties --> the extractor won't change as the 
> tags defined in
> that namespace evolve.
> 
> We mark any properties produced by extractors as 
> live/read-only to guarantee
> they stay synched with the resource contents.
> 
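
A rough sketch of the live/read-only idea, assuming a simple in-memory
property store. The class and method names are hypothetical and do not
correspond to anything in Slide; a real server would answer a rejected
write with an error status instead of a boolean.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PropertyStoreSketch {
    private final Map<String, String> values = new HashMap<>();
    private final Set<String> live = new HashSet<>();

    // Called only by the extraction step: stores the value and marks the
    // property as live, so it always mirrors the resource content.
    void setExtracted(String name, String value) {
        values.put(name, value);
        live.add(name);
    }

    // Client write path: refuses to touch live properties.
    // Returns true if the write was accepted.
    boolean setFromClient(String name, String value) {
        if (live.contains(name)) {
            return false;
        }
        values.put(name, value);
        return true;
    }

    String get(String name) {
        return values.get(name);
    }
}
```
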
> To boil this down -
> I think we could do what we're doing now using the proposed 
> interface if we
> implement a generic java extractor class that accepts 
> XSLT/BFD/webservice
> address params as part of its configuration - it would be 
> useful to be able
> to pass such params into extractors.
> 
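
As I read it, such a generic extractor might look roughly like the
sketch below. Everything here is an assumption for illustration: the
class name, the "stylesheet" parameter name and the name=value output
convention are invented, and only the XSLT case is shown (BFD and web
service addresses would be further params). It is not the proposed
Slide interface.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ConfigurableXsltExtractor {
    // Demo stylesheet (hypothetical): emits one "name=value" line per property.
    static final String DEMO_STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/doc'>"
      + "title=<xsl:value-of select='title'/><xsl:text>&#10;</xsl:text>"
      + "author=<xsl:value-of select='author'/><xsl:text>&#10;</xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    private final String stylesheet;

    // The extractor gets its behaviour from configuration params,
    // not from user-supplied Java code.
    ConfigurableXsltExtractor(Map<String, String> params) {
        String s = params.get("stylesheet"); // assumed parameter name
        if (s == null) {
            throw new IllegalArgumentException("missing 'stylesheet' param");
        }
        this.stylesheet = s;
    }

    // Applies the configured stylesheet and parses its text output
    // into a property map, splitting each line at the first '='.
    Map<String, String> extract(String xmlContent) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xmlContent)),
                        new StreamResult(out));
            Map<String, String> props = new LinkedHashMap<>();
            for (String line : out.toString().split("\n")) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    props.put(line.substring(0, eq).trim(),
                              line.substring(eq + 1).trim());
                }
            }
            return props;
        } catch (Exception e) {
            throw new IllegalStateException("extraction failed", e);
        }
    }

    public static void main(String[] args) {
        ConfigurableXsltExtractor ex = new ConfigurableXsltExtractor(
            Collections.singletonMap("stylesheet", DEMO_STYLESHEET));
        System.out.println(ex.extract(
            "<doc><title>Notes</title><author>Jim</author></doc>"));
    }
}
```
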
> Mime-type may not be fine enough for determining which 
> extractors to run,
> particularly for .xml text/xml - adding a step to allow 
> programmatic override
> on mime types derived from extensions at the DAV layer, or 
> providing an api
> to determine which extractors to run that can access both mimetype and
> content, would be helpful for this.
> 
> While I can see the usefulness of a mechanism to discover 
> what an extractor
> can produce, it would be useful to have a way to see 'and maybe other
> properties as well'.
> 
> Probably obvious, but extracted properties should not be 
> writable directly.

Correct.

> 
>   Jim
> 


