Matt Zytaruk wrote:

Here you go.

java.lang.ClassCastException: java.util.ArrayList
       at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
       at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
       at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
       at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:168)
       at org.apache.nutch.mapred.MapTask$1.collect(MapTask.java:78)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:229)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:123)


Congratulations! You are the first person to actually use (and suffer from) the multiple values in ContentProperties... ;-)

It turns out that ParseData.write() uses its own method for writing out metadata instead of calling ContentProperties.write(). This works as long as every property has a single value (stored as a String), but multiple values are stored in an ArrayList, and ParseData sees that list directly because it iterates over metadata.entrySet() and casts each value to String.
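
To make the failure concrete, here is a minimal standalone sketch of the same cast. It is not Nutch code: the class name, the method name and the sample header values are made up for illustration, and a plain Map stands in for ContentProperties.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Standalone illustration only; all names here are hypothetical.
final class MultiValueCastDemo {

    // Mimics the (String) cast in the old write loop: returns true if the
    // value can be treated as a String, false if the cast throws.
    static boolean castsToString(Object value) {
        try {
            String s = (String) value; // throws for an ArrayList value
            return true;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A plain map stands in for the metadata: single values are
        // Strings, multiple values are kept in an ArrayList.
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "text/html");

        List<String> cookies = new ArrayList<>();
        cookies.add("a=1");
        cookies.add("b=2");
        metadata.put("Set-Cookie", cookies);

        for (Map.Entry<String, Object> e : metadata.entrySet()) {
            System.out.println(e.getKey() + " -> " + castsToString(e.getValue()));
        }
    }
}
```

The single-valued entry passes the cast and the multi-valued one does not, which is the exception in the trace above.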

The fix is easy: please replace the following lines in ParseData.write():

   out.writeInt(metadata.size());                // write metadata
   Iterator i = metadata.entrySet().iterator();
   while (i.hasNext()) {
     Map.Entry e = (Map.Entry)i.next();
     UTF8.writeString(out, (String)e.getKey());
     UTF8.writeString(out, (String)e.getValue());
   }

with this:

   metadata.write(out);

and do the same for reading the metadata field; in ParseData.readFields(), replace this:

   int propertyCount = in.readInt();             // read metadata
   metadata = new ContentProperties();
   for (int i = 0; i < propertyCount; i++) {
     metadata.put(UTF8.readString(in), UTF8.readString(in));
   }

with this:

   metadata = new ContentProperties();
   metadata.readFields(in);

Compile, deploy, test, report ... :-) Please note that this changes the on-disk segment format, so the new code won't be able to read old segments. You may want to bump ParseData.VERSION and keep the old reading code around to handle older versions...
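
If you go the version-bump route, the reader could branch on the version roughly like this. This is only a sketch under assumptions: the version numbers, the class and helper names, and the byte layout (writeUTF strings, a per-key value count) are invented for the example and are not the actual ParseData or ContentProperties format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical layout and version numbers; not the real Nutch format.
final class VersionedMetadataDemo {
    static final byte OLD_VERSION = 1; // one String value per key
    static final byte CUR_VERSION = 2; // a list of values per key

    static void writeOld(DataOutput out, Map<String, String> md) throws IOException {
        out.writeByte(OLD_VERSION);
        out.writeInt(md.size());
        for (Map.Entry<String, String> e : md.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    static void writeCurrent(DataOutput out, Map<String, List<String>> md) throws IOException {
        out.writeByte(CUR_VERSION);
        out.writeInt(md.size());
        for (Map.Entry<String, List<String>> e : md.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue().size());
            for (String v : e.getValue()) out.writeUTF(v);
        }
    }

    // Reads either layout, choosing the branch from the leading version byte.
    static Map<String, List<String>> read(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version != OLD_VERSION && version != CUR_VERSION) {
            throw new IOException("unknown version " + version);
        }
        int count = in.readInt();
        Map<String, List<String>> md = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            String key = in.readUTF();
            List<String> values = new ArrayList<>();
            int n = (version == OLD_VERSION) ? 1 : in.readInt();
            for (int j = 0; j < n; j++) values.add(in.readUTF());
            md.put(key, values);
        }
        return md;
    }

    // Convenience helpers for the demo: serialize to and from byte arrays.
    static byte[] oldBytes(Map<String, String> md) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeOld(new DataOutputStream(buf), md);
            return buf.toByteArray();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    static byte[] currentBytes(Map<String, List<String>> md) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeCurrent(new DataOutputStream(buf), md);
            return buf.toByteArray();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    static Map<String, List<String>> readBytes(byte[] data) {
        try {
            return read(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args) {
        byte[] old = oldBytes(Collections.singletonMap("Content-Type", "text/html"));
        byte[] cur = currentBytes(
            Collections.singletonMap("Set-Cookie", Arrays.asList("a=1", "b=2")));
        System.out.println(readBytes(old));
        System.out.println(readBytes(cur));
    }
}
```

The point is only that the old single-value loop stays reachable behind the version check, so segments written before the change remain readable.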

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

