Re: Full Text Search for MS Word and Excel files?

Jim Myers Wed, 25 Feb 2004 07:37:27 -0800

Martin,

We modified PutMethod - mostly because we started with Slide 1 and wanted to
stay on the edges. For an XSLT extractor, we just run the specified XSLT
against the incoming content stream and expect that the output will look
like


<D:prop xmlns:D="DAV:">
  <prop1><val1/></prop1>
  ...
</D:prop>

which we then parse/store as though a proppatch occurred.
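As a rough illustration of that parse step (not our actual Slide code, which is Java inside the modified PutMethod), here is a minimal Python sketch. The function name, the sample property names and the choice to keep each property's inner XML as a string are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical extractor output in the shape shown above; the property
# names "formula" and "instrument" are made up for this example.
EXTRACTOR_OUTPUT = """
<D:prop xmlns:D="DAV:">
  <formula><value>H2O</value></formula>
  <instrument><value>NMR-500</value></instrument>
</D:prop>
"""

def props_from_extractor_output(xml_text):
    """Map each child of D:prop to its inner content, serialized as a string."""
    root = ET.fromstring(xml_text.strip())
    if root.tag != "{DAV:}prop":
        raise ValueError("expected a D:prop root element, got %r" % root.tag)
    props = {}
    for child in root:
        # Leading text plus any nested elements, kept as a string so it could
        # be stored the same way a PROPPATCH value would be.
        inner = (child.text or "").strip()
        inner += "".join(ET.tostring(c, encoding="unicode").strip() for c in child)
        props[child.tag] = inner
    return props

props = props_from_extractor_output(EXTRACTOR_OUTPUT)
```

Each entry in props would then be written to the property store for the resource that was just PUT.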

For binary files - the BFD language allows you to say in XML that there are,
for example, three little-endian 32 bit ints followed by an array of floats
that is a 3-D matrix with the dimensions you just read. A generic parser
produces XML tagged output from that description which can then be fed to
XSLT to extract properties as above (see
http://collaboratory.emsl.pnl.gov/docs/collab/sam/bfd/ for a full example).
There's work going on to standardize a language like this as the "Data
Format Description Language (DFDL)" within the Global Grid Forum.
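To make the three-ints-plus-matrix example concrete, here is a hand-written Python sketch of what a parser would do for that one layout. It is not BFD syntax and not the generic parser: the layout is hard-coded, the floats are assumed to be 32 bit little-endian (the description above doesn't say), and the element names in the output are invented.

```python
import struct
import xml.etree.ElementTree as ET

def binary_to_xml(data):
    """Turn 'three little-endian int32 dims + float array' into tagged XML."""
    # Three little-endian 32-bit ints give the matrix dimensions.
    nx, ny, nz = struct.unpack_from("<3i", data, 0)
    if min(nx, ny, nz) < 0:
        raise ValueError("negative dimension in header")
    count = nx * ny * nz
    # Assumption: the matrix is stored as 32-bit little-endian floats
    # immediately after the 12-byte header.
    values = struct.unpack_from("<%df" % count, data, 12)

    root = ET.Element("dataset")              # element names are hypothetical
    dims = ET.SubElement(root, "dimensions")
    for name, n in (("x", nx), ("y", ny), ("z", nz)):
        ET.SubElement(dims, name).text = str(n)
    ET.SubElement(root, "matrix").text = " ".join(repr(v) for v in values)
    return root

# A tiny 2 x 1 x 2 matrix packed in the assumed layout.
sample = struct.pack("<3i", 2, 1, 2) + struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
tree = binary_to_xml(sample)
```

The resulting XML tree is what would be handed to the XSLT stage to pull out properties.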

While I think this is powerful for scientific data formats, I'm not sure I'd
want to describe .doc format with it. So - the webservice hook allows you to
send data off to an external service that might wrap an existing
conversion tool. If the webservice knows to return properties, great. If
not, we can, as with BFD, run a final XSLT stage to get the output we
expect. I realize that web services are a heavy operation to do during a
PUT, but the alternative in our primary use case would just be to have
portal software do the workflow instead.

We really think more about scientific data than documents, so we haven't
done much about free text. We've really been looking to give researchers a
low barrier to entry - use a DAV file system driver to upload files from the
programs you use now - XML, binary, whatever - and we'll install an extractor
to get standard information out (what chemical is described in the file), which
is then available for searching through our portal. (And once you find
files, we generate a "hastranslations" property that lists virtual URLs in
the /slide/translatedto/... hierarchy that, when you do a GET, invoke [BFD,
web service, XSLT] to produce the translated files.) So - researchers can
collaborate without all having to agree on one format, one schema up front.

So - apologies for the sales pitch. Just wanted to make clear we're looking
more at community systems than enterprise ones and we expect to have little
control over content types, and we expect many non-office formats, etc.

> Sorry I didn't react on your first posting.

I wasn't complaining, just using the new post as an excuse to extend what
I'd said earlier.

>Which mimetype goes into the webDav properties, the original
>text/xml or the changed one?

We store the changed one.


   Jim

