Matt Zytaruk wrote:
Here you go.
java.lang.ClassCastException: java.util.ArrayList
    at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
    at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
    at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
    at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:168)
    at org.apache.nutch.mapred.MapTask$1.collect(MapTask.java:78)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:229)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:123)
Congratulations! You are the first person to actually use (and suffer
from) the multiple values in ContentProperties... ;-)
It turns out that ParseData.write() uses its own code for writing out
metadata instead of calling ContentProperties.write(). That works as long
as every key has a single value (stored as a String), but multiple values
are stored in an ArrayList, and ParseData sees those ArrayLists directly
because it iterates over metadata.entrySet(). The (String) cast on
e.getValue() then fails with the ClassCastException above.
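To make the failure concrete, here is a minimal standalone sketch. It does not use the Nutch classes: the map below is only a hypothetical stand-in for the storage inside ContentProperties, and the class and method names are made up for illustration. It assumes, as described above, that a single value is kept as a String and multiple values as an ArrayList.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValueCastDemo {

    // Hypothetical stand-in for the map inside ContentProperties:
    // one value is kept as a String, several values as an ArrayList.
    static Map<String, Object> sampleMetadata() {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "text/html");
        List<String> cookies = new ArrayList<>();
        cookies.add("a=1");
        cookies.add("b=2");
        metadata.put("Set-Cookie", cookies);
        return metadata;
    }

    // Applies the same (String) cast as the old write loop to every value.
    // Returns true if all casts succeed, false on a ClassCastException.
    static boolean allValuesAreStrings(Map<String, Object> metadata) {
        int totalLength = 0;
        try {
            for (Map.Entry<String, Object> e : metadata.entrySet()) {
                String value = (String) e.getValue(); // fails for an ArrayList
                totalLength += value.length();
            }
            return true;
        } catch (ClassCastException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The "Set-Cookie" entry holds an ArrayList, so the cast fails.
        System.out.println(allValuesAreStrings(sampleMetadata()));
    }
}
```

With only single-valued keys the loop runs through cleanly, which is presumably why nobody hit this before.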
The fix is easy: please replace the following lines in ParseData.write():
    out.writeInt(metadata.size()); // write metadata
    Iterator i = metadata.entrySet().iterator();
    while (i.hasNext()) {
      Map.Entry e = (Map.Entry)i.next();
      UTF8.writeString(out, (String)e.getKey());
      UTF8.writeString(out, (String)e.getValue());
    }
with this:
    metadata.write(out);
and the same for reading the metadata field; in ParseData.readFields()
replace this:
    int propertyCount = in.readInt(); // read metadata
    metadata = new ContentProperties();
    for (int i = 0; i < propertyCount; i++) {
      metadata.put(UTF8.readString(in), UTF8.readString(in));
    }
with this:
    metadata = new ContentProperties();
    metadata.readFields(in);
Compile, deploy, test, report ... :-) Please note that this changes the
on-disk segment format, so the new code won't be able to read old
segments. You may want to bump ParseData.VERSION and keep the old reading
code around to handle older versions...
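A rough sketch of what such version-aware reading could look like is below. It is standalone and makes several assumptions: that a version byte is the first thing in the record, that version 1 is the old single-value layout and version 2 the new one, and that the new layout is a per-key value count followed by the values. The real ContentProperties.write() layout may well differ, and writeUTF()/readUTF() only stand in for UTF8.writeString()/UTF8.readString(). All class, method, and constant names here are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VersionedMetadataSketch {
    static final byte OLD_VERSION = 1; // assumed: one String value per key
    static final byte CUR_VERSION = 2; // assumed: a list of values per key

    // New layout: version, key count, then key, value count, values.
    static void write(DataOutput out, Map<String, List<String>> metadata)
            throws IOException {
        out.writeByte(CUR_VERSION);
        out.writeInt(metadata.size());
        for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue().size());
            for (String value : e.getValue()) {
                out.writeUTF(value);
            }
        }
    }

    // Reads either layout, choosing the branch from the version byte.
    static Map<String, List<String>> read(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version != OLD_VERSION && version != CUR_VERSION) {
            throw new IOException("Unknown version: " + version);
        }
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        int keyCount = in.readInt();
        for (int i = 0; i < keyCount; i++) {
            String key = in.readUTF();
            // Old records carry exactly one value and no value count.
            int valueCount = (version == OLD_VERSION) ? 1 : in.readInt();
            List<String> values = new ArrayList<>();
            for (int j = 0; j < valueCount; j++) {
                values.add(in.readUTF());
            }
            metadata.put(key, values);
        }
        return metadata;
    }

    // Builds an old-style record with a single key/value pair.
    static byte[] oldFormatBytes(String key, String value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeByte(OLD_VERSION);
            out.writeInt(1);
            out.writeUTF(key);
            out.writeUTF(value);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Serializes metadata in the new layout.
    static byte[] newFormatBytes(Map<String, List<String>> metadata) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer), metadata);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserializes a record in either layout.
    static Map<String, List<String>> readBytes(byte[] bytes) {
        try {
            return read(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the design is that only the reader needs to know about both layouts; the writer always emits the current one, so old segments get upgraded whenever they are rewritten.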
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com