Hi,

I'm using the blacklist|whitelist plug-in with Nutch 1.3, provided by the 
patch at https://issues.apache.org/jira/browse/NUTCH-585. The plug-in strips 
out content from HTML pages, identified by HTML element:class or element:id 
descriptors. The ReadMe instructions note that the following needs to be added 
to the schema.xml file:

<!-- fields for the blacklist/whitelist plugin -->
<field name="strippedContent" type="text" stored="true" indexed="true"/>

I've done this for the schema file in both Solr and Nutch and also added this 
line to solrindex-mapping.xml:

<field  dest="strippedContent" source="strippedContent"/>


A crawl with this config works great: I can see the new field containing the 
stripped content in the index. The problem is that I want to write the contents 
of strippedContent into the content field, but every attempt results in this 
error:

Jul 26, 2012 1:09:04 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ERROR: multiple values 
encountered for non multiValued copy field content: .....

In schema.xml (both Nutch's and Solr's) I have:

    <field name="content" type="text" indexed="true" stored="true" termVectors="true"/>
    ...
    <!-- fields for the blacklist/whitelist plugin -->
    <field name="strippedContent" type="text" stored="true" indexed="true"/>
    ...
    <copyField source="strippedContent" dest="content"/>
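
For what it's worth, here is a minimal Python sketch of how I read the error. 
It is a toy model, not Solr code: build_document, MULTI_VALUED and COPY_FIELDS 
are names I made up for illustration, and it assumes Nutch is still sending its 
own value for content alongside strippedContent, which I have not confirmed.

```python
# Toy model only: this is NOT Solr code, just an illustration of the error.
# Assumption: neither field is declared multiValued, as in the schema above.
MULTI_VALUED = {"content": False, "strippedContent": False}

# (source, dest) pairs, mirroring the copyField line in schema.xml.
COPY_FIELDS = [("strippedContent", "content")]


def build_document(incoming):
    """Apply the copyField rules to a dict of field name -> single value.

    Raises ValueError when a copy would give a single-valued field a
    second value.
    """
    doc = {name: [value] for name, value in incoming.items()}
    for source, dest in COPY_FIELDS:
        if source not in incoming:
            continue
        values = doc.setdefault(dest, [])
        values.append(incoming[source])
        if len(values) > 1 and not MULTI_VALUED.get(dest, False):
            raise ValueError(
                "multiple values encountered for non multiValued copy field "
                + dest
            )
    return doc


if __name__ == "__main__":
    # Only strippedContent arrives, so content ends up with one value.
    print(build_document({"strippedContent": "stripped text"}))

    # content arrives too, so the copy would add a second value to it.
    try:
        build_document(
            {"content": "full page text", "strippedContent": "stripped text"}
        )
    except ValueError as err:
        print(err)
```

If that reading is right, the copy only fails when the document already carries 
a content value of its own.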


In Nutch's solrindex-mapping.xml file I have no directives for either the 
content or strippedContent fields. Can anyone point out where I'm going wrong 
with the config? Ideally, I want to write the strippedContent value into the 
content field and not keep a copy of the strippedContent field in the index at 
all.

Thanks in advance,
Matt








