Re: Full Text Search for MS Word and Excel files?

Jim Myers Tue, 24 Feb 2004 12:04:16 -0800

I think I mentioned previously that we implemented this in the webDAV layer
for our SAM project. It would be nice to move our capabilities down to use
the events, so I'll pass on some of our requirements/issues:


We've implemented three types of extractors - XSLT, Binary Format
Description (BFD - a language for describing ASCII/binary content and
pulling it into an XML tagged form), and web services. We also allow
multiple steps, e.g. using BFD to go from binary to XML, with XSLT then used
to extract properties. We chose not to support Java at this point because we
want extractors to be user-registerable and were concerned about what people
might try to do in Java (some large computation on the resource to produce a
property).

We found that it was useful to have a 'pre-processing' stage that could
override the default mimetype for .xml files - too many people were using
.xml for different types and, when using a file system driver with WebDAV,
all of these files were coming in as text/xml. So we allow one global XSLT
'extractor' to run on all text/xml files to change just their mimetype,
which is then used to determine which extractors to run.
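
A minimal sketch of such a pre-processing step, for illustration only. The
setup described above uses a global XSLT; this sketch swaps in a StAX peek at
the root element's namespace instead. The class name XmlMimeRefiner, the
namespace-to-type map and the example type are assumptions, not SAM or Slide
code.

```java
import java.io.InputStream;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: refines the generic text/xml type before extractors are chosen.
final class XmlMimeRefiner {
    // Assumed configuration: root-element namespace URI -> refined MIME type.
    private final Map<String, String> byNamespace;

    XmlMimeRefiner(Map<String, String> byNamespace) {
        this.byNamespace = byNamespace;
    }

    /** Returns a refined type for text/xml content, or the default type unchanged. */
    String refine(String defaultType, InputStream content) {
        if (!"text/xml".equals(defaultType)) {
            return defaultType; // only .xml files need the override
        }
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // User-supplied files: do not process DTDs or external entities.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
            XMLStreamReader reader = factory.createXMLStreamReader(content);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // The first start element is the document root.
                        String ns = reader.getNamespaceURI();
                        return byNamespace.getOrDefault(ns == null ? "" : ns, defaultType);
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Malformed XML: fall through and keep the default type.
        }
        return defaultType;
    }
}
```

Only the start of the stream is read, so the cost stays small even for large
files; the caller would need to re-open or buffer the content before handing
it to the selected extractors.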

It wasn't necessary for us to require a discovery mechanism to know what
properties an extractor is capable of producing, and several of the
extractors we've developed have variable outputs, e.g. some look for all
tags in a specific namespace (at a specific location in the doc) and turn
them into properties, so the extractor won't change as the tags defined in
that namespace evolve.
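
To illustrate the variable-output case, here is a rough sketch of an
extractor that turns every element in one namespace into a property. It
assumes the "specific location" is the direct children of the root element
and uses plain DOM rather than XSLT or BFD; the class name
NamespacePropertyExtractor is made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical extractor: its set of output properties is not fixed in advance.
final class NamespacePropertyExtractor {
    private final String namespace;

    NamespacePropertyExtractor(String namespace) {
        this.namespace = namespace;
    }

    /** Maps each root-level child element in the namespace to name -> text. */
    Map<String, String> extract(InputStream xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder().parse(xml);

            Map<String, String> props = new LinkedHashMap<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                if (node.getNodeType() == Node.ELEMENT_NODE
                        && namespace.equals(node.getNamespaceURI())) {
                    // New tags added to the namespace show up here without code changes.
                    props.put(node.getLocalName(), node.getTextContent().trim());
                }
            }
            return props;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML content", e);
        }
    }
}
```

Because the property names come from the document itself, a fixed list of
property descriptors cannot describe this extractor completely.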

We mark any properties produced by extractors as live/read-only to guarantee
they stay synched with the resource contents.

To boil this down:
I think we could do what we're doing now using the proposed interface if we
implement a generic Java extractor class that accepts XSLT/BFD/webservice
address params as part of its configuration - it would be useful to be able
to pass such params into extractors.
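
A sketch of what such a generic, parameterised extractor might look like.
The ConfigurableExtractor interface below is a simplified stand-in for the
one proposed in this thread (the configure method and the "xslt" parameter
name are assumptions), and the stylesheet is assumed to emit name=value
lines as text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import java.util.Properties;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical, simplified variant of the proposed Extractor interface.
interface ConfigurableExtractor {
    void configure(Map<String, String> params);
    void extract(InputStream content);
    String getPropertyValue(String propertyName);
}

// Generic extractor whose behaviour comes entirely from its configuration.
final class XsltExtractor implements ConfigurableExtractor {
    private Templates templates;
    private final Properties extracted = new Properties();

    @Override
    public void configure(Map<String, String> params) {
        String xslt = params.get("xslt"); // assumed parameter: stylesheet source text
        if (xslt == null) {
            throw new IllegalArgumentException("missing 'xslt' parameter");
        }
        try {
            templates = TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(new StringReader(xslt)));
        } catch (TransformerConfigurationException e) {
            throw new IllegalArgumentException("invalid stylesheet", e);
        }
    }

    @Override
    public void extract(InputStream content) {
        if (templates == null) {
            throw new IllegalStateException("configure() must be called first");
        }
        try {
            StringWriter out = new StringWriter();
            templates.newTransformer().transform(new StreamSource(content), new StreamResult(out));
            extracted.clear();
            // The transform output is parsed as name=value lines.
            extracted.load(new StringReader(out.toString()));
        } catch (TransformerException | IOException e) {
            throw new IllegalStateException("extraction failed", e);
        }
    }

    @Override
    public String getPropertyValue(String propertyName) {
        return extracted.getProperty(propertyName);
    }
}
```

Since users would register only a stylesheet and never Java code, this keeps
the restriction on arbitrary user computation while still fitting a Java
extractor interface. Parsing the output with java.util.Properties means
backslashes in values are treated as escapes, which a real implementation
would have to handle.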

Mime-type may not be fine enough for determining which extractors to run,
particularly for .xml (text/xml) files - adding a step to allow programmatic
override of mime types derived from extensions at the DAV layer, or providing
an API to determine which extractors to run that can access both mimetype and
content, would be helpful for this.
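
The second option, a selection API that sees both mimetype and content,
could be sketched roughly as follows. ExtractorSelector, the rule signature
and the 512-byte peek size are all assumptions for illustration; nothing
like this exists in the proposal yet.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical registry: each rule sees the MIME type and the first bytes of content.
final class ExtractorSelector {
    private static final int PEEK_BYTES = 512; // assumed peek size

    private final Map<String, BiPredicate<String, byte[]>> rules = new LinkedHashMap<>();

    void register(String extractorName, BiPredicate<String, byte[]> rule) {
        rules.put(extractorName, rule);
    }

    /** Returns the names of all extractors whose rule accepts this resource. */
    List<String> select(String mimeType, BufferedInputStream content) {
        byte[] head = new byte[PEEK_BYTES];
        int read;
        try {
            // Peek at the start of the stream, then rewind so extractors see all of it.
            content.mark(PEEK_BYTES);
            read = content.read(head, 0, PEEK_BYTES);
            content.reset();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] prefix = Arrays.copyOf(head, Math.max(read, 0));

        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, BiPredicate<String, byte[]>> rule : rules.entrySet()) {
            if (rule.getValue().test(mimeType, prefix)) {
                selected.add(rule.getKey());
            }
        }
        return selected;
    }
}
```

A rule for Word files would test only the mimetype, while a rule for one
flavour of .xml file could also look for its namespace URI in the peeked
bytes. A single read may return fewer bytes than requested, so rules should
not assume the prefix is always 512 bytes long.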

While I can see the usefulness of a mechanism to discover what an extractor
can produce, it would be useful for an extractor to have a way to say 'and
maybe other properties as well'.

Probably obvious, but extracted properties should not be writable directly.

  Jim



----- Original Message ----- 
From: "Daniel Florey" <[EMAIL PROTECTED]>
To: "Slide Developers Mailing List" <[EMAIL PROTECTED]>
Sent: Tuesday, February 24, 2004 10:18 AM
Subject: Re: Full Text Search for MS Word and Excel files?


> Hi,
> none of the mentioned interfaces is implemented / checked in yet. I
> will implement these things tomorrow and check in a proposal. I'm too busy
> today to work on it.
> I will check in a sample Domain.xml where you can see how the content-type or
> URL matching is configured.
> Regards, Daniel
>
> On Tuesday, 24 February 2004 15:49, Ryan Rhodes wrote:
> > Hi guys,
> >
> > This all sounds great.  I think I understand the extractor interface, and
> > I've worked with POI in the past so this doesn't sound too hard to
> > implement.  I'm still a little fuzzy on how this fits into the big picture.
> >
> > How is the association made between my extractor and my MIME type (.DOC)?
> >
> > When does the extractor get invoked... at the time the content is stored?
> >
> > How does this integrate with DASL... are these properties automatically a
> > part of the content so that searches return a reference to the original
> > content or does it return a reference to the extracted content and then it's
> > my job to map back to the original content?  (sorry, I'm still learning
> > DASL).
> >
> > By the way, once you submit your proposal, does that mean the code is in
> > the CVS, or at what point is it likely to become a part of the release
> > (2.x) ?
> >
> > thanks,
> >
> > Ryan
> >
> > From: <[EMAIL PROTECTED]>
> >
> > >Reply-To: "Slide Developers Mailing List" <[EMAIL PROTECTED]>
> > >To: <[EMAIL PROTECTED]>
> > >Subject: RE: Full Text Search for MS Word and Excel files?
> > >Date: Tue, 24 Feb 2004 13:43:42 +0100
> > >
> > >Hi Daniel,
> > >
> > > > -----Original Message-----
> > > > From: Daniel Florey [mailto:[EMAIL PROTECTED]
> > > > Sent: Tuesday, 24 February 2004 13:23
> > > > To: Slide Developers Mailing List
> > > > Subject: Re: Full Text Search for MS Word and Excel files?
> > > >
> > > >
> > > > Hi Martin,
> > > > my proposal would look like this:
> > > >
> > > > public interface Extractor {
> > > > /**
> > > > * Will be called from extractor framework before content and
> > > > * properties will be stored
> > > > */
> > > > public void extract(InputStream content) throws ExtractException;
> > >
> > >agreed
> > >
> > > > /**
> > > >  * gets extracted property value from the resource, for example
> > > >  * "author" for a word doc, ...
> > > >  *
> > > > */
> > > > public String getPropertyValue(String propertyName);
> > > >
> > > > /**
> > > > * gets a description of all properties that are provided by this
> > > > * extractor. Can be used by indexing framework to e.g. generate
> > > > * columns in index table
> > >
> > >Of course the store / indexer could do whatever it wants with the
> > >properties, but I think, the normal case should be to write the
> > >properties into DescriptorStore as NodeProperties. So these properties
> > >can be exposed to DASL. So what about following comment:
> > >
> > >* Can be used to be stored as NodeProperty in DescriptorStore
> > >
> > > > */
> > > > public PropertyDescriptor[] getPropertyDescriptors();
> > > > }
> > > >
> > > > I prefer InputStream for content because the whole document doesn't
> > > > have to be loaded into memory.
> > >
> > >agreed.
> > >
> > >
> > >Best regards,
> > >Martin
> > >
> > >---------------------------------------------------------------------
> > >To unsubscribe, e-mail: [EMAIL PROTECTED]
> > >For additional commands, e-mail: [EMAIL PROTECTED]

