This sounds good. I'll do some work on this today and check in as much as I get done, so we can check the requirements against the code.
Regards, Daniel
On Tuesday, 24 February 2004 21:04, Jim Myers wrote:
> I think I mentioned previously that we implemented this in the webDAV layer
> for our SAM project. It would be nice to move our capabilities down to use
> the events, so I'll pass on some of our requirements/issues:
>
> We've implemented three types of extractors - XSLT, Binary Format
> Description (BFD - a language for describing ascii/binary content and
> pulling it into an XML tagged form), and web services. We also allow
> multiple steps, e.g. using BFD to go from binary to XML with XSLT used to
> extract properties. We chose not to support Java at this point because we
> want extractors to be user-registerable and were concerned about what
> people might try to do in Java (some large computation on the resource to
> produce a property).
>
> We found that it was useful to have a 'pre-processing' stage that could
> override the default mimetype for .xml files - too many people were using
> .xml for different types and, when using a file system driver with webdav,
> all of these files were coming in as text/xml. So we allow one global XSLT
> 'extractor' to run on all text/xml files to change just their mimetype,
> which is then used to determine which extractors to run.
>
> It wasn't necessary for us to require a discovery mechanism to know what
> properties an extractor is capable of producing, and several of the
> extractors we've developed have variable outputs, e.g. some look for all
> tags in a specific namespace (at a specific location in the doc) and turn
> them into properties --> the extractor won't change as the tags defined in
> that namespace evolve.
>
> We mark any properties produced by extractors as live/read-only to
> guarantee they stay synched with the resource contents.
>
> To boil this down -
> I think we could do what we're doing now using the proposed interface if we
> implement a generic java extractor class that accepts XSLT/BFD/webservice
> address params as part of its configuration - it would be useful to be able
> to pass such params into extractors.
>
> Mime-type may not be fine enough for determining which extractors to run,
> particularly for .xml text/xml - adding a step to allow programmatic override
> on mime types derived from extensions at the DAV layer, or providing an api
> to determine which extractors to run that can access both mimetype and
> content, would be helpful for this.
>
> While I can see the usefulness of a mechanism to discover what an extractor
> can produce, it would be useful to have a way to see 'and maybe other
> properties as well'.
>
> Probably obvious, but extracted properties should not be writable directly.
>
> Jim
>
>
> ----- Original Message -----
> From: "Daniel Florey" <[EMAIL PROTECTED]>
> To: "Slide Developers Mailing List" <[EMAIL PROTECTED]>
> Sent: Tuesday, February 24, 2004 10:18 AM
> Subject: Re: Full Text Search for MS Word and Excel files?
>
>
> > Hi,
> > none of the mentioned interfaces are implemented / checked in yet. I
> > will implement these things tomorrow and check in a proposal. I'm too busy
> > today to work on it.
> > I will check in a sample Domain.xml where you can see how the content-type
> > or URL matching is configured.
> > Regards, Daniel
> >
> > On Tuesday, 24 February 2004 15:49, Ryan Rhodes wrote:
> > > Hi guys,
> > >
> > > This all sounds great. I think I understand the extractor interface, and
> > > I've worked with POI in the past so this doesn't sound too hard to
> > > implement. I'm still a little fuzzy on how this fits into the big picture.
> > >
> > > How is the association made between my extractor and my MIME type (.DOC)?
> > >
> > > When does the extractor get invoked... at the time the content is stored?
> > >
> > > How does this integrate with DASL... are these properties automatically a
> > > part of the content so that searches return a reference to the original
> > > content, or does it return a reference to the extracted content and then
> > > it's my job to map back to the original content? (sorry, I'm still
> > > learning DASL).
> > >
> > > By the way, once you submit your proposal, does that mean the code is
> > > in the CVS, or at what point is it likely to become a part of the
> > > release (2.x) ?
> > >
> > > thanks,
> > >
> > > Ryan
> > >
> > > From: <[EMAIL PROTECTED]>
> > >
> > > >Reply-To: "Slide Developers Mailing List" <[EMAIL PROTECTED]>
> > > >To: <[EMAIL PROTECTED]>
> > > >Subject: RE: Full Text Search for MS Word and Excel files?
> > > >Date: Tue, 24 Feb 2004 13:43:42 +0100
> > > >
> > > >Hi Daniel,
> > > >
> > > > > -----Original Message-----
> > > > > From: Daniel Florey [mailto:[EMAIL PROTECTED]
> > > > > Sent: Tuesday, 24 February 2004 13:23
> > > > > To: Slide Developers Mailing List
> > > > > Subject: Re: Full Text Search for MS Word and Excel files?
> > > > >
> > > > >
> > > > > Hi Martin,
> > > > > my proposal would look like this:
> > > > >
> > > > > public interface Extractor {
> > > > >     /**
> > > > >      * Will be called from extractor framework before content
> > > > >      * and properties will be stored
> > > > >      */
> > > > >     public void extract(InputStream content) throws ExtractException;
> > > >
> > > >agreed
> > > >
> > > > >     /**
> > > > >      * gets extracted property value from the resource, for
> > > > >      * example "author" for a word doc, ...
> > > > >      *
> > > > >      */
> > > > >     public String getPropertyValue(String propertyName);
> > > > >
> > > > >     /**
> > > > >      * gets a description of all properties that are provided
> > > > >      * by this extractor.
> > > > >      * Can be used by indexing framework to e.g. generate
> > > > >      * columns in index table
> > > >
> > > >Of course the store / indexer could do whatever it wants with the
> > > >properties, but I think the normal case should be to write the
> > > >properties into DescriptorStore as NodeProperties. So these properties
> > > >can be exposed to DASL. So what about the following comment:
> > > >
> > > >* Can be used to be stored as NodeProperty in DescriptorStore
> > > >
> > > > >      */
> > > > >     public PropertyDescriptor[] getPropertyDescriptors();
> > > > > }
> > > > >
> > > > > I prefer InputStream for content because the whole document
> > > > > doesn't have to be loaded into memory.
> > > >
> > > >agreed.
> > > >
> > > >
> > > >Best regards,
> > > >Martin
> > > >
> > > >---------------------------------------------------------------------
> > > >To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > >For additional commands, e-mail: [EMAIL PROTECTED]
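
To make the proposed Extractor interface concrete, here is a minimal sketch of one possible implementation. It is only an illustration under stated assumptions: the thread names ExtractException and PropertyDescriptor but never defines them, so the versions below are hypothetical stand-ins, and PlainTextExtractor with its "title" and "lineCount" properties is invented for the example (a real Word or Excel extractor would parse the document format instead). The content is assumed to be UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical: the thread names ExtractException but never defines it.
class ExtractException extends Exception {
    ExtractException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical: a bare-bones descriptor holding a property name and description.
class PropertyDescriptor {
    final String name;
    final String description;

    PropertyDescriptor(String name, String description) {
        this.name = name;
        this.description = description;
    }
}

// The three methods proposed in the thread.
interface Extractor {
    void extract(InputStream content) throws ExtractException;
    String getPropertyValue(String propertyName);
    PropertyDescriptor[] getPropertyDescriptors();
}

// Illustrative extractor for plain text (assumed UTF-8); not part of Slide.
class PlainTextExtractor implements Extractor {
    private final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void extract(InputStream content) throws ExtractException {
        values.clear();
        String title = null;
        int lineCount = 0;
        try {
            // Read line by line so the whole document never sits in memory.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(content, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                lineCount++;
                if (title == null && !line.trim().isEmpty()) {
                    title = line.trim();
                }
            }
        } catch (IOException e) {
            throw new ExtractException("could not read content", e);
        }
        if (title != null) {
            values.put("title", title);
        }
        values.put("lineCount", Integer.toString(lineCount));
    }

    @Override
    public String getPropertyValue(String propertyName) {
        return values.get(propertyName); // null if nothing was extracted
    }

    @Override
    public PropertyDescriptor[] getPropertyDescriptors() {
        return new PropertyDescriptor[] {
            new PropertyDescriptor("title", "first non-blank line"),
            new PropertyDescriptor("lineCount", "number of lines"),
        };
    }
}

class ExtractorSketch {
    public static void main(String[] args) throws ExtractException {
        Extractor extractor = new PlainTextExtractor();
        byte[] doc = "Release notes\nsecond line\n".getBytes(StandardCharsets.UTF_8);
        extractor.extract(new ByteArrayInputStream(doc));
        // A store could write these values out as read-only node properties.
        for (PropertyDescriptor d : extractor.getPropertyDescriptors()) {
            System.out.println(d.name + " = " + extractor.getPropertyValue(d.name));
        }
    }
}
```

Because extract() takes an InputStream, the sketch reads the content incrementally, which is the reason given in the thread for preferring a stream. An extractor with variable outputs, as Jim describes, could not list everything in getPropertyDescriptors(); how the interface should express that is left open here.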

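
Jim's two requests, a programmatic override of the extension-derived mimetype and a way to pass parameters into extractors, can be sketched roughly as follows. Everything here is an assumption for illustration: MimeTypeRefiner, ExtractorConfig, ExtractorSelector and SvgRefiner are hypothetical names, not Slide APIs, and the refiner uses crude string sniffing in place of the global XSLT step Jim describes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hook: may replace the mimetype derived from the file extension.
interface MimeTypeRefiner {
    String refine(String declaredType, byte[] contentHead);
}

// Hypothetical registration record: an extractor name plus its parameters
// (for example an XSLT stylesheet or a web service address).
class ExtractorConfig {
    final String name;
    final Map<String, String> params;

    ExtractorConfig(String name, Map<String, String> params) {
        this.name = name;
        this.params = Collections.unmodifiableMap(new HashMap<>(params));
    }
}

// Chooses extractors from the refined mimetype rather than the declared one.
class ExtractorSelector {
    private final List<MimeTypeRefiner> refiners = new ArrayList<>();
    private final Map<String, List<ExtractorConfig>> byType = new HashMap<>();

    void addRefiner(MimeTypeRefiner refiner) {
        refiners.add(refiner);
    }

    void register(String mimeType, ExtractorConfig config) {
        byType.computeIfAbsent(mimeType, k -> new ArrayList<>()).add(config);
    }

    List<ExtractorConfig> select(String declaredType, byte[] contentHead) {
        String type = declaredType;
        // Pre-processing stage: each refiner may narrow the type using the content.
        for (MimeTypeRefiner refiner : refiners) {
            type = refiner.refine(type, contentHead);
        }
        List<ExtractorConfig> found = byType.get(type);
        return found == null
                ? Collections.<ExtractorConfig>emptyList()
                : Collections.unmodifiableList(found);
    }
}

// Crude stand-in for the XSLT pre-processor: treats text/xml content that
// contains an <svg tag as image/svg+xml and leaves everything else alone.
class SvgRefiner implements MimeTypeRefiner {
    @Override
    public String refine(String declaredType, byte[] contentHead) {
        if (!"text/xml".equals(declaredType)) {
            return declaredType;
        }
        String head = new String(contentHead, StandardCharsets.UTF_8);
        return head.contains("<svg") ? "image/svg+xml" : declaredType;
    }
}

class SelectorSketch {
    public static void main(String[] args) {
        ExtractorSelector selector = new ExtractorSelector();
        selector.addRefiner(new SvgRefiner());
        selector.register("image/svg+xml", new ExtractorConfig(
                "xslt", Collections.singletonMap("stylesheet", "svg-props.xsl")));

        byte[] head = "<?xml version=\"1.0\"?><svg width=\"10\"/>"
                .getBytes(StandardCharsets.UTF_8);
        for (ExtractorConfig c : selector.select("text/xml", head)) {
            System.out.println(c.name + " " + c.params);
        }
    }
}
```

Keeping the parameters in the registration record means one generic extractor class could serve many XSLT, BFD or web service configurations, which is the arrangement Jim suggests for moving the SAM extractors onto the proposed interface.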