Hi Sebastian,

On Wed, Jun 13, 2012 at 11:30 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> I managed to perform a crawl with 2.0 and HBase: it rocks, indeed.
> Much simpler than 1.x (no segments!).

:0)

> % ./bin/nutch readdb -stats
> WebTable statistics start
> WebTableReader: java.io.EOFException
>   at java.io.DataInputStream.readFully(DataInputStream.java:197)
>   at java.io.DataInputStream.readFully(DataInputStream.java:169)
>   at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
>   at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:89)
>   at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:537)
>   at org.apache.nutch.crawl.WebTableReader.processStatJob(WebTableReader.java:218)
>   at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:479)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
> --> readdb -dump works.

Confirmed and ticket opened as NUTCH-1391.

> % ./bin/nutch fetch 1339621550-203073321 -threads 1 -parse
> Exception in thread "main" java.lang.IllegalArgumentException: arg -parse not
> recognized

The -parse argument was removed in Nutch 2.0, so passing it now throws an
IllegalArgumentException. This is the expected behaviour. To enable parsing
during fetching, please set the relevant property in nutch-site.xml. The
reason the obsolete -parse argument is still in the Usage message is that I
was not diligent enough when patching the fetcher CLI aesthetics. I'll
address this within the issue below as well.

> % ./bin/nutch parse -all -force -resume
> ParserJob: starting
> ParserJob: resuming: false         <<< -resume and
> ParserJob: forced reparse: false   <<< -force obviously ignored ?
> ParserJob: parsing all

Yes, confirmed and ticket opened as NUTCH-1392.

> % ./bin/nutch generate
> --> generates batchid, but should show help as in 1.x ?
> --> is there an option -topN ?

Yes, this is opened in NUTCH-1393. Users may not necessarily wish to generate
at all, wishing instead merely to find out the GeneratorJob CLI options... I
will open this just now and fix for 2.1.

> The 2.0 Solr schema and mappings still contain the field "site"
> which has been removed in 1.x (NUTCH-1232).
> Should be done also in 2.0: it's easier to maintain only one Solr installation
> for all Nutch versions.

Logged in NUTCH-1394.

Thanks Seb for your contributions here... this is exactly what we are after.

Does anyone have issues with running another RC and addressing these issues
in 2.1?

--
Lewis
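
P.S. For the parsing-during-fetch setting mentioned above, here is a minimal sketch of the nutch-site.xml override. It assumes the property is named fetcher.parse, as I believe it is declared in nutch-default.xml; please check the nutch-default.xml shipped with your build before relying on it.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: fetcher.parse is the property the fetcher reads to decide
       whether to parse while fetching. Verify the name and its default
       against conf/nutch-default.xml in your build. -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
</configuration>
```

With an override like this in place, the fetch step would be run without the -parse argument, e.g. ./bin/nutch fetch 1339621550-203073321 -threads 1.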