Hi,
I'm using the blacklist/whitelist plug-in with Nutch 1.3. The plug-in comes
from the patch at https://issues.apache.org/jira/browse/NUTCH-585. It strips
out content from HTML pages identified by element:class or element:id
descriptors. The ReadMe instructions say that the following needs to be added
to the schema.xml file:
<!-- fields for the blacklist/whitelist plugin -->
<field name="strippedContent" type="text" stored="true" indexed="true"/>
I've done this for the schema file in both Solr and Nutch and also added this
line to solrindex-mapping.xml:
<field dest="strippedContent" source="strippedContent"/>
A crawl with this config works fine, and I can see the new field containing
the stripped content in the index. The problem is that I want the contents of
strippedContent to end up in the content field instead, but every attempt
results in this error:
Jul 26, 2012 1:09:04 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ERROR: multiple values
encountered for non multiValued copy field content: .....
In schema.xml (both Nutch's and Solr's) I have:
<field name="content" type="text" indexed="true" stored="true"
termVectors="true"/>
...
<!-- fields for the blacklist/whitelist plugin -->
<field name="strippedContent" type="text" stored="true" indexed="true"/>
...
<copyField source="strippedContent" dest="content"/>
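My reading of the error, which I have not confirmed, is that content is still
being filled from Nutch's own content field during indexing, so the copyField
adds a second value to a field that is not multiValued. A rough sketch of what
Solr would then be dealing with (the field values are made up and other
fields are omitted):

```xml
<!-- Hypothetical document as sent to Solr; values are placeholders -->
<add>
  <doc>
    <!-- assumed: the original page text still arrives in content -->
    <field name="content">full page text</field>
    <field name="strippedContent">stripped page text</field>
  </doc>
</add>
<!-- copyField then copies strippedContent into content as well,
     giving content two values while it is declared single-valued -->
```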
In Nutch's solrindex-mapping.xml file I have no directives for either the
content or strippedContent fields. Can anyone point me to where I'm going wrong
with the config? Ideally I want to write the stripped content into the content
field and not keep a separate strippedContent field in the index at all.
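To make the goal concrete, this is roughly what I would like an indexed
document to look like. It is only an illustration with a placeholder value and
other fields left out, not output from a real crawl:

```xml
<!-- Hypothetical target document; the value is a placeholder -->
<doc>
  <!-- content holds the stripped text rather than the full page text -->
  <field name="content">stripped page text</field>
  <!-- no separate strippedContent field is stored or indexed -->
</doc>
```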
Thanks in advance,
Matt