nutch 2 tutorial

Michael Gang Mon, 07 Jan 2013 08:52:41 -0800

Hi all,

I am trying to follow the Nutch 2 tutorial at
http://wiki.apache.org/nutch/Nutch2Tutorial
but the tutorial ends after the inject step, and I don't know how to continue
from there.


When I try to run

nutch readdb


I get an error

:bin/nutch readdb
Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
                      [-crawlId <id>] [-content] [-headers] [-links] [-text]
    -crawlId <id>  - the id to prefix the schemas to operate on,
                     (default: storage.crawl.id)
    -stats [-sort] - print overall statistics to System.out
    [-sort]        - list status sorted by host
    -url <url>     - print information on <url> to System.out
    -dump <out_dir> [-regex regex] - dump the webtable to a text file in
                     <out_dir>
    -content       - dump also raw content
    -headers       - dump protocol headers
    -links         - dump links
    -text          - dump extracted text
    [-regex]       - filter on the URL of the webtable entry

I would like to know how to configure Nutch so that it crawls a certain
page and all of its child pages.
I see that this is covered in the tutorial at
http://wiki.apache.org/nutch/NutchTutorial
but I am not sure from which point to continue, since in Nutch 2 I am working
against HBase and not against a directory.

Thanks,
David
