Hi List,
I think I may have discovered a bug or two in the subcollection plugin.
I indexed a couple of sites with the subcollection plugin enabled to
evaluate this feature of Nutch but was unable to search within a
subcollection. I analysed the index with Luke and confirmed the
subcollection had been indexed, but I was still unable to search for the
specific value. After some time I noticed a superfluous space at the
beginning of the indexed value, and a quick test showed that adding the
space to my query gave me the results I expected.
I've looked into this and I assume the problem lies in the
getSubCollections() method of CollectionManager, which prepends a space
to every subcollection name, including the first:
  public String getSubCollections(final String url) {
    String collections = "";
    final Iterator iterator = collectionMap.values().iterator();
    while (iterator.hasNext()) {
      final Subcollection subCol = (Subcollection) iterator.next();
      if (subCol.filter(url) != null) {
        collections += " " + subCol.name;
      }
    }
    if (LOG.isTraceEnabled()) {
      LOG.trace("subcollections:" + collections);
    }
    return collections;
  }
This could be fixed by returning collections.trim() or by rewriting the
while loop so that the separator is only added between names. In addition,
the code (and its comment) suggests that a single URL can belong to
multiple subcollections. Given that the subcollection value is indexed
untokenized, would I be correct in concluding that a URL belonging to
multiple subcollections cannot be found when the search is restricted to
just one of them, since the indexed term would be the whole
space-separated string?
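
To illustrate both points, here is a minimal, self-contained sketch. It is not the actual Nutch code: the class SubcollectionJoinSketch, the Sub stand-in and its prefix-based filter() are hypothetical, and the only thing borrowed from the method above is the assumption that filter(url) returns null when the URL does not belong to the subcollection. It joins the names with java.util.StringJoiner so no leading space is produced, and then shows why an exact match on the joined value fails once a URL is in more than one subcollection.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

public class SubcollectionJoinSketch {

  /** Hypothetical stand-in for a subcollection: membership is a simple URL prefix test. */
  static final class Sub {
    final String name;
    final String urlPrefix;

    Sub(String name, String urlPrefix) {
      this.name = name;
      this.urlPrefix = urlPrefix;
    }

    /** Assumed contract: returns the URL if it belongs to this subcollection, else null. */
    String filter(String url) {
      return url.startsWith(urlPrefix) ? url : null;
    }
  }

  /** Builds the space-separated list of matching names without a leading separator. */
  static String getSubCollections(List<Sub> subs, String url) {
    StringJoiner joined = new StringJoiner(" ");
    for (Sub sub : subs) {
      if (sub.filter(url) != null) {
        joined.add(sub.name);
      }
    }
    return joined.toString();
  }

  public static void main(String[] args) {
    List<Sub> subs = Arrays.asList(
        new Sub("news", "http://example.com/"),
        new Sub("sport", "http://example.com/sport/"));

    String single = getSubCollections(subs, "http://example.com/index.html");
    String multi = getSubCollections(subs, "http://example.com/sport/a.html");

    System.out.println("[" + single + "]"); // prints "[news]"
    System.out.println("[" + multi + "]");  // prints "[news sport]"

    // An untokenized field is matched as one exact term, so a query for
    // "sport" behaves like a whole-string comparison and misses the document.
    System.out.println(multi.equals("sport")); // prints "false"

    // Splitting on whitespace (what a tokenized field would effectively do)
    // makes each subcollection name individually matchable.
    System.out.println(Arrays.asList(multi.split(" ")).contains("sport")); // prints "true"
  }
}
```

If that reading is right, trimming alone would only fix the single-subcollection case. The multi-subcollection case would also need the field to be tokenized on whitespace, or each name to be indexed as a separate value.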
rgds,
Richard
Richard Grantham
Limehouse Software Ltd