Hi all,
I am trying to follow the Nutch 2 tutorial at
http://wiki.apache.org/nutch/Nutch2Tutorial
but the tutorial ends after the inject step, and I don't know how to
continue from there.
When I try to run
nutch readdb
I get an error:
bin/nutch readdb
Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
                [-crawlId <id>] [-content] [-headers] [-links] [-text]
    -crawlId <id>  - the id to prefix the schemas to operate on,
                     (default: storage.crawl.id)
    -stats [-sort] - print overall statistics to System.out
    [-sort]        - list status sorted by host
    -url <url>     - print information on <url> to System.out
    -dump <out_dir> [-regex regex] - dump the webtable to a text file in <out_dir>
    -content       - dump also raw content
    -headers       - dump protocol headers
    -links         - dump links
    -text          - dump extracted text
    [-regex]       - filter on the URL of the webtable entry
How can I configure Nutch so that it crawls a given page and all of its
child pages?
I see that this is covered in the tutorial at
http://wiki.apache.org/nutch/NutchTutorial
but I am not sure from which point to continue, since in Nutch 2 I am
working against HBase and not against a directory.
Thanks,
David