This all sounds great. I think I understand the extractor interface, and I've worked with POI in the past so this doesn't sound too hard to implement. I'm still a little fuzzy on how this fits into the big picture.
How is the association made between my extractor and my MIME type (for .DOC files)?
When does the extractor get invoked... at the time the content is stored?
How does this integrate with DASL? Are these properties automatically part of the content, so that searches return a reference to the original content? Or does a search return a reference to the extracted content, leaving it to me to map back to the original content? (Sorry, I'm still learning DASL.)
By the way, once you submit your proposal, does that mean the code is in CVS? If not, at what point is it likely to become part of the release (2.x)?
thanks,
Ryan
From: <[EMAIL PROTECTED]>
Reply-To: "Slide Developers Mailing List" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Subject: RE: Full Text Search for MS Word and Excel files?
Date: Tue, 24 Feb 2004 13:43:42 +0100
Hi Daniel,
> -----Original Message-----
> From: Daniel Florey [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 24 February 2004 13:23
> To: Slide Developers Mailing List
> Subject: Re: Full Text Search for MS Word and Excel files?
>
> Hi Martin,
> my proposal would look like this:
>
> public interface Extractor {
>     /**
>      * Will be called from extractor framework before content and
>      * properties will be stored
>      */
>     public void extract(InputStream content) throws ExtractException;
agreed
>
>     /**
>      * gets extracted property value from the resource, for example "author"
>      * for a word doc, ...
>      *
>      */
>     public String getPropertyValue(String propertyName);
>
>     /**
>      * gets a description of all properties that are provided by this extractor.
>      * Can be used by indexing framework to e.g. generate columns in index table
Of course the store / indexer could do whatever it wants with the properties, but I think the normal case should be to write the properties into the DescriptorStore as NodeProperties, so that they can be exposed to DASL. So what about the following comment:
* Can be used to be stored as NodeProperty in DescriptorStore
>      */
>     public PropertyDescriptor[] getPropertyDescriptors();
> }
>
> I prefer InputStream for content because the whole document doesn't have to be
> loaded into memory.
agreed.
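
To make the proposed interface concrete, here is a minimal sketch of how an implementation might look. It is not Slide code: ExtractException and PropertyDescriptor are only named in the proposal, so the versions below are hypothetical stand-ins, and ByteCountExtractor with its "byteCount" property is invented purely for illustration. A real Word extractor would parse the stream with POI instead of counting bytes. The plain Map in main stands in for wherever the store would keep the properties (for example as NodeProperties).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical exception type; the thread names it but does not define it.
class ExtractException extends Exception {
    ExtractException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical minimal descriptor; the thread only mentions the type name.
class PropertyDescriptor {
    private final String name;

    PropertyDescriptor(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// The interface as proposed in the thread.
interface Extractor {
    void extract(InputStream content) throws ExtractException;

    String getPropertyValue(String propertyName);

    PropertyDescriptor[] getPropertyDescriptors();
}

// Illustrative extractor: reads the stream in chunks, so the whole
// document never has to be held in memory, and records its size.
class ByteCountExtractor implements Extractor {
    private final Map<String, String> values = new LinkedHashMap<>();

    public void extract(InputStream content) throws ExtractException {
        byte[] buffer = new byte[4096];
        long total = 0;
        try {
            int read;
            while ((read = content.read(buffer)) != -1) {
                total += read;
            }
        } catch (IOException e) {
            throw new ExtractException("could not read content", e);
        }
        values.put("byteCount", Long.toString(total));
    }

    public String getPropertyValue(String propertyName) {
        // Returns null for unknown names or before extract() has run.
        return values.get(propertyName);
    }

    public PropertyDescriptor[] getPropertyDescriptors() {
        return new PropertyDescriptor[] { new PropertyDescriptor("byteCount") };
    }
}

class ExtractorSketch {
    public static void main(String[] args) throws ExtractException {
        Extractor extractor = new ByteCountExtractor();
        byte[] bytes = "hello world".getBytes(StandardCharsets.UTF_8);
        extractor.extract(new ByteArrayInputStream(bytes));

        // Stand-in for the store: copy every described property into a map.
        Map<String, String> stored = new LinkedHashMap<>();
        for (PropertyDescriptor d : extractor.getPropertyDescriptors()) {
            stored.put(d.getName(), extractor.getPropertyValue(d.getName()));
        }
        for (Map.Entry<String, String> entry : stored.entrySet()) {
            System.out.println(entry.getKey() + "=" + entry.getValue()); // prints "byteCount=11"
        }
    }
}
```

The caller never needs to know which properties a given extractor produces; it iterates over getPropertyDescriptors() and asks for each value by name, which is what would let a store or indexer handle every extractor the same way.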
Best regards,
Martin
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
