Indexer and extractor bugs

Eirikur Hrafnsson Fri, 07 Jan 2005 08:18:51 -0800

Hi,

I've been trying to get the properties indexer and most of the extractors to work (from HEAD), and I have found some bugs and perhaps a design flaw. Here goes...

I'm using these settings in Domain.xml:

<propertiesindexer classname="org.apache.slide.index.lucene.LucenePropertiesIndexer">
    <parameter name="indexpath">store/index/metadata</parameter>
</propertiesindexer>
...
<extractors>
    <extractor classname="org.apache.slide.extractor.SimpleXmlExtractor" uri="/files/public/xml">
        <configuration>
            <instruction property="title" xpath="/article/title/text()" />
            <instruction property="summary" xpath="/article/summary/text()" />
        </configuration>
    </extractor>
    <extractor classname="org.apache.slide.extractor.PDFExtractor" uri="/files/public/pdf/" />
    <extractor classname="org.apache.slide.extractor.TextContentExtractor" uri="/files/public/text/" />
    <extractor classname="org.apache.slide.extractor.OfficeExtractor" uri="/files/public/office/">
        <configuration>
            <instruction property="author" id="SummaryInformation-0-4" />
            <instruction property="application" id="SummaryInformation-0-18" />
        </configuration>
    </extractor>
</extractors>

First, the LucenePropertiesIndexer stops Slide from loading (DomainConfigurationException) because of a NullPointerException thrown on line 55:

    public void initialize(NamespaceAccessToken token)
            throws ServiceInitializationFailedException
    {
        super.initialize(token);

        try {
            indexConfiguration.initDefaultConfiguration();
nullpointer >>  indexConfiguration.readPropertyConfiguration(this.indexedProperties);

This method call is not in LuceneContentIndexer and I don't know what it is for, so I tried commenting it out, and the indexer then loads "correctly". Why was that method call added?

Secondly, I cannot see how PDFExtractor, or any of the other extractors, has ever worked, because of a flaw in the design. PDFExtractor does not implement/override the method getContentType(), which ExtractorManager uses to decide whether an extractor is suitable for the file it is about to index:

// From ExtractorManager
static boolean matches(Extractor extractor, String namespace, String uri,
                       NodeRevisionDescriptor descriptor) {
    if (descriptor != null
            && !descriptor.getContentType().equals(extractor.getContentType())) {
        return false;
    }

For a PDF file, extractor.getContentType() returns null while the descriptor returns the correct content type, so matches(...) always returns false and the PDF is never indexed. This is easily fixed by implementing getContentType() (or is there a property for setting the supported content type of an extractor?), but here is the design flaw: PDF files, like Office documents, can have MANY content types. It's stupid, I know, but it's a fact. So getContentType(), or rather "getSupportedContentTypes()", should return a list of types, or at least a semicolon-separated list, and the check should do a contains(type) or an indexOf(type) >= 0 rather than an equals. That way the PDF will survive the first check and hopefully be indexed.
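To make the proposal concrete, here is a minimal, self-contained sketch of what I mean. It is not Slide code: the MultiTypeExtractor interface, the PdfExtractorSketch class, both helper methods and the two MIME types listed are hypothetical names and values made up for illustration, and the content types are passed as plain strings instead of going through NodeRevisionDescriptor. Stripping parameters such as "; charset=..." and returning false for a null descriptor type are my own assumptions as well.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ContentTypeMatchSketch {

    /** Hypothetical replacement for a single-valued getContentType(). */
    interface MultiTypeExtractor {
        Set<String> getSupportedContentTypes();
    }

    /** Hypothetical PDF extractor; the MIME types listed are examples only. */
    static class PdfExtractorSketch implements MultiTypeExtractor {
        private static final Set<String> TYPES = new HashSet<String>(
                Arrays.asList("application/pdf", "application/x-pdf"));

        public Set<String> getSupportedContentTypes() {
            return TYPES;
        }
    }

    /** Models the quoted check: equals() against one, possibly null, type. */
    static boolean matchesSingle(String descriptorType, String extractorType) {
        // String.equals(null) is always false, so an extractor that
        // returns null from getContentType() can never match.
        return descriptorType.equals(extractorType);
    }

    /** Proposed check: the descriptor type must be one of the supported types. */
    static boolean matchesAny(String descriptorType, MultiTypeExtractor extractor) {
        if (descriptorType == null) {
            return false; // assumption: no content type means no match
        }
        // Assumption: ignore parameters such as "; charset=UTF-8" and case.
        String base = descriptorType.split(";")[0].trim().toLowerCase(Locale.ROOT);
        return extractor.getSupportedContentTypes().contains(base);
    }

    public static void main(String[] args) {
        MultiTypeExtractor pdf = new PdfExtractorSketch();
        System.out.println(matchesSingle("application/pdf", null));  // false
        System.out.println(matchesAny("application/pdf", pdf));      // true
        System.out.println(matchesAny("application/x-pdf", pdf));    // true
        System.out.println(matchesAny("text/plain", pdf));           // false
    }
}
```

A Set lookup avoids the false positives that an indexOf on a semicolon-separated string could give when one type name is a substring of another, which is why I would prefer a collection over a delimited string.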

Has anyone changed this and not committed the changes?

Best Regards

Eirikur S. Hrafnsson, [EMAIL PROTECTED]
Chief Software Engineer
Idega Software
http://www.idega.com
