Hi List,

I think I may have discovered a bug or two in the subcollection plugin.

I indexed a couple of sites with the subcollection plugin enabled to
evaluate this feature of Nutch, but was unable to search within a
subcollection. I analysed the index with Luke and confirmed that the
subcollection had been indexed, but I was still unable to search for the
specific value. After some time I noticed a superfluous space at the
beginning of the indexed value, and a quick test showed that adding the
space to my query gave me the results I expected.

I've looked into this and I believe the problem originates in the
getSubCollections() method of CollectionManager:

  public String getSubCollections(final String url) {
    String collections = "";
    final Iterator iterator = collectionMap.values().iterator();

    while (iterator.hasNext()) {
      final Subcollection subCol = (Subcollection) iterator.next();
      if (subCol.filter(url) != null) {
        collections += " " + subCol.name;
      }
    }
    if (LOG.isTraceEnabled()) {
      LOG.trace("subcollections:" + collections);
    }

    return collections;
  }

This could be fixed by returning collections.trim() or by rewriting the
while loop so that the separator is only added between names. In addition,
the code (and comment) suggests that a single URL can belong to multiple
subcollections. Since the subcollection field is untokenized, the indexed
value for such a URL would be the whole space-separated list of names.
Would I be correct in concluding that a search restricted to one
particular subcollection would therefore never match a URL that appears
in more than one?
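
To illustrate the second option, here is a minimal, standalone sketch of
building the value without a leading separator. It is not the plugin's
code: the class name SubcollectionNames and the helper join() are
hypothetical, the list of names stands in for the subcollections whose
filter matched, and it assumes Java 8 or later for StringJoiner. The last
line of main() also mimics the exact-match comparison I would expect
against an untokenized field.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

public class SubcollectionNames {

  // Hypothetical helper: joins the names of the matching subcollections
  // with single spaces, so no leading or trailing separator is produced.
  public static String join(final List<String> matchingNames) {
    final StringJoiner joiner = new StringJoiner(" ");
    for (final String name : matchingNames) {
      joiner.add(name);
    }
    return joiner.toString();
  }

  public static void main(final String[] args) {
    final String single = join(Arrays.asList("news"));
    final String multiple = join(Arrays.asList("news", "sports"));

    System.out.println("[" + single + "]");   // prints "[news]"
    System.out.println("[" + multiple + "]"); // prints "[news sports]"

    // An exact comparison of the whole stored value against one name
    // fails when the URL belongs to more than one subcollection.
    System.out.println(multiple.equals("news")); // prints "false"
  }
}
```

With a single match the value is now exactly the subcollection name, but
the multi-valued case still yields one combined string, which is what my
question above is about.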

rgds,

Richard



Richard Grantham
Development

