Hi Lewis,

> Please see http://wiki.apache.org/nutch/Nutch2Tutorial which is an
> update of Julien's (I think) page on GORA_HBase. This will get you
> rocking with HBase. The changes between Cassandra, Accumulo and the
> other data stores are fairly trivial.

I managed to perform a crawl with 2.0 and HBase: it rocks, indeed.
Much simpler than 1.x (no segments!). Below are a couple of problems
I ran into (possible issues to be addressed in 2.1).

Cheers,
Sebastian


% ./bin/nutch readdb -stats
WebTable statistics start
WebTableReader: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:197)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
        at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:89)
        at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:537)
        at org.apache.nutch.crawl.WebTableReader.processStatJob(WebTableReader.java:218)
        at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:479)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)

--> readdb -dump works.

% ./bin/nutch fetch 1339621550-203073321 -threads 1 -parse
Exception in thread "main" java.lang.IllegalArgumentException: arg -parse not recognized

% ./bin/nutch parse -all -force -resume
ParserJob: starting
ParserJob: resuming: false         <<< -resume and
ParserJob: forced reparse: false   <<< -force obviously ignored ?
ParserJob: parsing all

% ./bin/nutch generate
--> generates batchid, but should show help as in 1.x ?
--> is there an option -topN ?

The 2.0 Solr schema and mappings still contain the field "site", which
was removed in 1.x (NUTCH-1232). This should also be done in 2.0: it's
easier to maintain only one Solr installation for all Nutch versions.