Re: Full Text Search for MS Word and Excel files?

Daniel Florey Wed, 25 Feb 2004 03:47:11 -0800

This sounds good.
I'll work on this today and check in as much as I get done,
so we can check the requirements against actual code.
Regards,
Daniel


On Tuesday, 24 February 2004 21:04, Jim Myers wrote:
> I think I mentioned previously that we implemented this in the webDAV layer
> for our SAM project. It would be nice to move our capabilities down to use
> the events, so I'll pass on some of our requirements/issues:
>
> We've implemented three types of extractors - XSLT, Binary Format
> Description (BFD - a language for describing ascii/binary content and
> pulling it into an XML tagged form), and web services. We also allow
> multiple steps, e.g. using BFD to go from binary to XML with XSLT used to
> extract properties. We chose not to support Java at this point because we
> want extractors to be user-registerable and were concerned about what
> people might try to do in Java (some large computation on the resource to
> produce a property).
>
> We found that it was useful to have a 'pre-processing' stage that could
> override the default mimetype for .xml files - too many people were using
> .xml for different types and, when using a file system driver with webdav,
> all of these files were coming in as text/xml. So we allow one global XSLT
> 'extractor' to run on all text/xml files to change just their mimetype,
> which is then used to determine which extractors to run.
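
A minimal sketch of such a pre-processing step, assuming Java and StAX rather than the global XSLT used in SAM. The class name `MimeTypeRefiner`, its `refine` method and the namespace table are hypothetical and not part of Slide or SAM. It looks at the root element's namespace in the first bytes of a text/xml resource and returns a more specific type when one is registered:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: refines a generic text/xml type from the document content.
class MimeTypeRefiner {
    // Root-element namespace -> more specific MIME type (illustrative entries).
    private final Map<String, String> typeByRootNamespace = new HashMap<>();
    private final XMLInputFactory factory = XMLInputFactory.newInstance();

    MimeTypeRefiner() {
        // Do not process DTDs while sniffing untrusted content.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        typeByRootNamespace.put("http://www.w3.org/2000/svg", "image/svg+xml");
        typeByRootNamespace.put("http://www.w3.org/1999/xhtml", "application/xhtml+xml");
    }

    // 'head' holds the first bytes of the resource; the caller keeps the full stream.
    String refine(String declaredType, byte[] head) {
        if (!"text/xml".equals(declaredType)) {
            return declaredType; // only the ambiguous type is refined
        }
        try {
            // The reader wraps an in-memory array, so it holds no OS resources.
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(head));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String refined = typeByRootNamespace.get(reader.getNamespaceURI());
                    return refined != null ? refined : declaredType;
                }
            }
        } catch (XMLStreamException e) {
            // Not well-formed (or truncated before the root tag): keep the declared type.
        }
        return declaredType;
    }
}

class MimeTypeRefinerDemo {
    public static void main(String[] args) {
        MimeTypeRefiner refiner = new MimeTypeRefiner();
        byte[] svg = "<svg xmlns='http://www.w3.org/2000/svg'/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(refiner.refine("text/xml", svg));
    }
}
```

The refined type would then drive the choice of extractors, as described above. Passing only the first bytes keeps the check cheap, at the cost of missing a root element that starts very late in the file.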
>
> It wasn't necessary for us to require a discovery mechanism to know what
> properties an extractor is capable of producing and several of the
> extractors we've developed have variable outputs, e.g. some look for all
> tags in a specific namespace (at a specific location in the doc) and turn
> them into properties --> the extractor won't change as the tags defined in
> that namespace evolve.
>
> We mark any properties produced by extractors as live/read-only to
> guarantee they stay synched with the resource contents.
>
> To boil this down -
> I think we could do what we're doing now using the proposed interface if we
> implement a generic java extractor class that accepts XSLT/BFD/webservice
> address params as part of its configuration - it would be useful to be able
> to pass such params into extractors.
>
> Mime-type may not be fine enough for determining which extractors to run,
> particularly for .xml text/xml - adding a step to allow programmatic override
> on mime types derived from extensions at the DAV layer, or providing an api
> to determine which extractors to run that can access both mimetype and
> content, would be helpful for this.
>
> While I can see the usefulness of a mechanism to discover what an extractor
> can produce, it would be useful to have a way for an extractor to say 'and
> maybe other properties as well'.
>
> Probably obvious, but extracted properties should not be writable directly.
>
>   Jim
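
A rough sketch of the generic, parameter-driven extractor Jim describes, assuming the configuration arrives as a simple string map. The class `XsltExtractor`, the `stylesheet` parameter name and the `name=value` output convention are all hypothetical. It mirrors the `extract` / `getPropertyValue` methods of the proposed interface but does not declare `implements Extractor`, so that it stands alone:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical generic extractor: its behaviour comes from configuration, not code.
class XsltExtractor {
    private final Templates templates; // compiled once, reused per resource
    private final Map<String, String> values = new LinkedHashMap<>();

    // Assumed convention: params.get("stylesheet") holds the XSLT source text.
    XsltExtractor(Map<String, String> params) throws TransformerException {
        String xslt = params.get("stylesheet");
        if (xslt == null) {
            throw new IllegalArgumentException("missing 'stylesheet' parameter");
        }
        templates = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xslt)));
    }

    // Runs the stylesheet; it is expected to emit one "name=value" pair per line.
    void extract(InputStream content) throws TransformerException {
        values.clear();
        StringWriter out = new StringWriter();
        templates.newTransformer().transform(new StreamSource(content), new StreamResult(out));
        for (String line : out.toString().split("\n")) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                values.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
    }

    String getPropertyValue(String propertyName) {
        return values.get(propertyName);
    }
}

class XsltExtractorDemo {
    // Illustrative stylesheet that pulls /doc/author out as the "author" property.
    static final String AUTHOR_XSLT =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='text'/>"
          + "<xsl:template match='/'>"
          + "<xsl:text>author=</xsl:text><xsl:value-of select='/doc/author'/>"
          + "<xsl:text>&#10;</xsl:text>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    public static void main(String[] args) throws Exception {
        Map<String, String> params = new HashMap<>();
        params.put("stylesheet", AUTHOR_XSLT);
        XsltExtractor extractor = new XsltExtractor(params);
        extractor.extract(new ByteArrayInputStream(
                "<doc><author>Jim</author></doc>".getBytes(StandardCharsets.UTF_8)));
        System.out.println(extractor.getPropertyValue("author"));
    }
}
```

Because the set of properties depends on the stylesheet, such an extractor cannot list its outputs up front, which is the 'and maybe other properties as well' case mentioned above.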
>
>
>
> ----- Original Message -----
> From: "Daniel Florey" <[EMAIL PROTECTED]>
> To: "Slide Developers Mailing List" <[EMAIL PROTECTED]>
> Sent: Tuesday, February 24, 2004 10:18 AM
> Subject: Re: Full Text Search for MS Word and Excel files?
>
> > Hi,
> > none of the mentioned interfaces is implemented / checked in yet. I
> > will implement these things tomorrow and check in a proposal. I'm too busy
> > today to work on it.
> > I will check in a sample Domain.xml where you can see how the
> > content-type or URL matching is configured.
> > Regards, Daniel
> >
> > On Tuesday, 24 February 2004 15:49, Ryan Rhodes wrote:
> > > Hi guys,
> > >
> > > This all sounds great.  I think I understand the extractor interface, and
> > > I've worked with POI in the past so this doesn't sound too hard to
> > > implement.  I'm still a little fuzzy on how this fits into the big picture.
> > > How is the association made between my extractor and my MIME type (.DOC)?
> > > When does the extractor get invoked... at the time the content is stored?
> > > How does this integrate with DASL... are these properties automatically a
> > > part of the content so that searches return a reference to the original
> > > content or does it return a reference to the extracted content and then it's
> > > my job to map back to the original content?  (sorry, I'm still learning
> > > DASL).
> > >
> > > By the way, once you submit your proposal, does that mean the code is
> > > in the CVS, or at what point is it likely to become a part of the
> > > release (2.x) ?
> > >
> > > thanks,
> > >
> > > Ryan
> > >
> > > From: <[EMAIL PROTECTED]>
> > >
> > > >Reply-To: "Slide Developers Mailing List" <[EMAIL PROTECTED]>
> > > >To: <[EMAIL PROTECTED]>
> > > >Subject: RE: Full Text Search for MS Word and Excel files?
> > > >Date: Tue, 24 Feb 2004 13:43:42 +0100
> > > >
> > > >Hi Daniel,
> > > >
> > > > > -----Original Message-----
> > > > > From: Daniel Florey [mailto:[EMAIL PROTECTED]
> > > > > Sent: Tuesday, 24 February 2004 13:23
> > > > > To: Slide Developers Mailing List
> > > > > Subject: Re: Full Text Search for MS Word and Excel files?
> > > > >
> > > > >
> > > > > Hi Martin,
> > > > > my proposal would look like this:
> > > > >
> > > > > public interface Extractor {
> > > > > /**
> > > > > * Will be called from the extractor framework before content and
> > > > > * properties are stored
> > > > > */
> > > > > public void extract(InputStream content) throws ExtractException;
> > > >
> > > >agreed
> > > >
> > > > > /**
> > > > >  * gets extracted property value from the resource, for example
> > > > >  * "author" for a Word doc, ...
> > > > >  *
> > > > > */
> > > > > public String getPropertyValue(String propertyName);
> > > > >
> > > > > /**
> > > > > * gets a description of all properties that are provided by this
> > > > > * extractor. Can be used by indexing framework to e.g. generate
> > > > > * columns in index table
> > > >
> > > >Of course the store / indexer could do whatever it wants with the
> > > >properties, but I think the normal case should be to write the
> > > >properties into DescriptorStore as NodeProperties, so that these
> > > >properties can be exposed to DASL. So what about the following comment:
> > > >
> > > >* Can be used to be stored as NodeProperty in DescriptorStore
> > > >
> > > > > */
> > > > > public PropertyDescriptor[] getPropertyDescriptors();
> > > > > }
> > > > >
> > > > > I prefer InputStream for content because the whole document
> > > > > doesn't have to be loaded into memory.
> > > >
> > > >agreed.
> > > >
> > > >
> > > >Best regards,
> > > >Martin
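
To make the proposed interface concrete, here is a minimal sketch of one possible implementation. `ExtractException` and `PropertyDescriptor` are named in the proposal but not defined in this thread, so the versions below are hypothetical stand-ins, and `PlainTextExtractor` with its `title` and `line-count` properties is purely illustrative. A Word or Excel extractor would presumably parse the stream with POI instead of reading lines:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for types the proposal names but does not define.
class ExtractException extends Exception {
    ExtractException(String message, Throwable cause) {
        super(message, cause);
    }
}

class PropertyDescriptor {
    final String name;
    final String description;

    PropertyDescriptor(String name, String description) {
        this.name = name;
        this.description = description;
    }
}

// The interface as proposed (package-private here so the sketch fits in one file).
interface Extractor {
    void extract(InputStream content) throws ExtractException;
    String getPropertyValue(String propertyName);
    PropertyDescriptor[] getPropertyDescriptors();
}

// Illustrative extractor: reads line by line, so the whole document is never
// held in memory at once.
class PlainTextExtractor implements Extractor {
    private final Map<String, String> values = new HashMap<>();

    @Override
    public void extract(InputStream content) throws ExtractException {
        values.clear();
        long lines = 0;
        String title = null;
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(content, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                lines++;
                if (title == null && !line.trim().isEmpty()) {
                    title = line.trim(); // first non-blank line serves as the title
                }
            }
        } catch (IOException e) {
            throw new ExtractException("could not read content", e);
        }
        values.put("line-count", Long.toString(lines));
        if (title != null) {
            values.put("title", title);
        }
    }

    @Override
    public String getPropertyValue(String propertyName) {
        return values.get(propertyName);
    }

    @Override
    public PropertyDescriptor[] getPropertyDescriptors() {
        return new PropertyDescriptor[] {
            new PropertyDescriptor("title", "first non-blank line of the text"),
            new PropertyDescriptor("line-count", "number of lines in the text")
        };
    }
}

class PlainTextExtractorDemo {
    public static void main(String[] args) throws Exception {
        Extractor extractor = new PlainTextExtractor();
        extractor.extract(new ByteArrayInputStream(
                "Release notes\nSecond line\n".getBytes(StandardCharsets.UTF_8)));
        for (PropertyDescriptor d : extractor.getPropertyDescriptors()) {
            System.out.println(d.name + " = " + extractor.getPropertyValue(d.name));
        }
    }
}
```

A caller such as the store could loop over `getPropertyDescriptors()` as in the demo and write each value as a NodeProperty, which is the usage Martin suggests above.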
> > > >
> > > >---------------------------------------------------------------------
> > > >To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > >For additional commands, e-mail: [EMAIL PROTECTED]
> > >

