That, or you forgot to put 'http://' in the URLs in your seed list (like I
just did).

On 2011-05-19, at 13:14, Jean-Francois Gingras <
[email protected]> wrote:

One more thing: make sure your seed URLs are in UNIX format (line endings).
All my test seeds worked, but the real seed list failed, with 0 records
produced by the generator. It turned out the real seed list was in DOS
format. Running dos2unix and re-uploading the file to HDFS fixed my problem.
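A quick way to check and fix the line endings even without dos2unix (a sketch; the filenames are made up for the demo, and the hadoop fs re-upload is shown only as a comment since the HDFS path depends on your setup):

```shell
# Write a sample seed file with DOS (CRLF) line endings, then strip the CRs.
printf 'http://www.uni-kassel.de\r\nhttp://portal.uni-kassel.de\r\n' > seeds_dos.txt

# tr -d '\r' does the same job as dos2unix and is available everywhere
tr -d '\r' < seeds_dos.txt > seeds_unix.txt

# Verify the conversion before re-uploading
if grep -q "$(printf '\r')" seeds_unix.txt; then
  echo "still has DOS line endings"
else
  echo "converted"
fi

# Then push the clean file back to HDFS, e.g. (path is an assumption):
# hadoop fs -put -f seeds_unix.txt seedUrls/urls
```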

On 2011-05-18, at 05:25, Marek Bachmann <[email protected]> wrote:

Thank you very much for the help.

I checked ALL files in nutch/conf for any further expressions that would
exclude my URLs.

I found nothing like this.

In fact, as I mentioned before, the ./nutch crawl command works fine on
exactly the same input data.


Once again: if I delete all entries in my crawl directory and then run:


./nutch crawl seedUrls/ -dir crawl -threads 30 -depth 10


 crawl started in: crawl

 rootUrlDir = seedUrls

 threads = 30

 depth = 10

 indexer=lucene

 Injector: starting

 Injector: crawlDb: crawl/crawldb

 Injector: urlDir: seedUrls

 Injector: Converting injected urls to crawl db entries.

 Injector: Merging injected urls into crawl db.

 Injector: done

 Generator: Selecting best-scoring urls due for fetch.

 Generator: starting

 Generator: filtering: true

 Generator: normalizing: true

 Generator: jobtracker is 'local', generating exactly one partition.

 Generator: Partitioning selected urls for politeness.

 Generator: segment: crawl/segments/20110518111323

 Generator: done.

 Fetcher: Your 'http.agent.name' value should be listed first in

 'http.robots.agents' property.

 Fetcher: starting

 Fetcher: segment: crawl/segments/20110518111323

 Fetcher: threads: 30

 QueueFeeder finished: total 10 records + hit by time limit :0

 fetching http://portal.uni-kassel.de/

 fetching http://www.studentenwerk-kassel.de/

 fetching http://www.asta-kassel.de/

 fetching http://www.uni-kassel.de/fb16

 fetching http://www.uni-kassel.de/

 fetching http://www.uni-kassel.de/uni/studium/

 fetching http://www.uni-kassel.de/uni/fachbereiche/

 fetching http://www.uni-kassel.de/uni/

 fetching http://www.uni-kassel.de/uni/forschung/

 fetching http://www.cs.uni-kassel.de/


But if I try it manually (after deleting the crawldb once again):


./nutch inject crawl/crawldb seedUrls/


 Injector: starting

 Injector: crawlDb: crawl/crawldb

 Injector: urlDir: seedUrls

 Injector: Converting injected urls to crawl db entries.

 Injector: Merging injected urls into crawl db.

 Injector: done


./nutch generate crawl/crawldb/ crawl/segments


 Generator: Selecting best-scoring urls due for fetch.

 Generator: starting

 Generator: filtering: true

 Generator: normalizing: true

 Generator: jobtracker is 'local', generating exactly one partition.

 Generator: 0 records selected for fetching, exiting ...



So my conclusion is that the crawl command does the URL injecting in some
other way? I just don't get why it works with the crawl command but doesn't
work when injecting manually. Any further suggestions on where I could find
my mistake would be great :-)
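Given the two failure modes reported in this thread (a missing http:// scheme and DOS line endings), a small pre-inject sanity check can rule both out before suspecting the injector itself. This is a sketch run against a demo file; point the same awk at your real seedUrls/urls:

```shell
# Demo seed list containing both failure modes plus a good entry
printf 'www.uni-kassel.de\nhttp://portal.uni-kassel.de\r\nhttp://www.cs.uni-kassel.de\n' > seeds_demo.txt

# Flag lines that inject/generate would silently drop
awk '{
  if ($0 ~ /\r$/)                print NR ": DOS line ending"
  else if ($0 !~ /^https?:\/\//) print NR ": missing scheme"
  else                           print NR ": ok"
}' seeds_demo.txt
```

For the demo file this reports line 1 as missing its scheme, line 2 as having a DOS line ending, and line 3 as ok.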



On 16.05.2011 16:14, Markus Jelsma wrote:

I see that too, and it shouldn't dump an exception if there's nothing in the
CrawlDB.

This is, however, not your problem, it seems. If you inject but there's
nothing in the CrawlDB, then you have some filters running that skip your
seed URLs. Check your domain filter settings or other URL filter settings,
depending on the plugins you defined.
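For context on how those filters decide: the urlfilter-regex plugin evaluates the patterns in conf/regex-urlfilter.txt top to bottom, and the first matching rule wins (+ accepts, - rejects). A rough stand-in in awk can show why a given seed URL gets dropped; the rules below are illustrative, not your actual conf file, and awk's ERE dialect differs from Java's regex, so treat this as a sketch rather than the real filter:

```shell
# First-match-wins evaluation of +/- regex rules, one rule per stdin line
check_url() {
  awk -v url="$1" '
    /^[+-]/ {
      sign = substr($0, 1, 1)
      pat  = substr($0, 2)
      if (url ~ pat) {          # dynamic regex match against the URL
        print (sign == "+" ? "accepted" : "rejected")
        decided = 1
        exit
      }
    }
    END { if (!decided) print "rejected" }  # no rule matched: reject by default
  '
}

# Example rules: skip images, accept uni-kassel.de hosts, reject the rest
printf '%s\n' '-\.(gif|jpg|png)$' '+^http://([a-z0-9]*\.)*uni-kassel.de' '-.' \
  | check_url 'http://www.uni-kassel.de/fb16'
```

With these example rules the seed above is accepted, while something like http://example.com/logo.gif would be rejected by the first rule.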


On Monday 16 May 2011 15:56:26 Marek Bachmann wrote:

Hello people,


I was trying to do a manual crawl as described in the Nutch tutorial
at http://wiki.apache.org/nutch/NutchTutorial


First of all: if I do a crawl with the same seed URLs using the "nutch
crawl" command, everything works fine.


Here's what I was trying to do:


1.) Trying to create a new crawlDB with:


    ./nutch inject crawl/crawldb seedUrls


        The directory crawl was empty, and in the directory seedUrls there
is one file, "urls", with this content:

            http://www.uni-kassel.de

            http://portal.uni-kassel.de

            http://www.asta-kassel.de

            http://www.uni-kassel.de/fb16

            http://www.cs.uni-kassel.de

            http://www.studentenwerk-kassel.de


    The command runs without any error:

    ./nutch inject crawl/crawldb seedUrls

    Injector: starting

    Injector: crawlDb: crawl/crawldb

    Injector: urlDir: seedUrls

    Injector: Converting injected urls to crawl db entries.

    Injector: Merging injected urls into crawl db.

    Injector: done


    After that a new directory with the name crawldb exists in crawl/


2.) Trying to generate new segments:


    ./nutch generate crawl/crawldb/ crawl/segments -noFilter

    Generator: Selecting best-scoring urls due for fetch.

    Generator: starting

    Generator: filtering: false

    Generator: normalizing: true

    Generator: jobtracker is 'local', generating exactly one partition.

    Generator: 0 records selected for fetching, exiting ...


So I am wondering why the generator does not create segments. It says
that it had 0 records selected for fetching. It seems to me that the
injector hadn't injected the URLs into the db.


When I run:

    ./nutch readdb crawl/crawldb/ -stats


It outputs:

    CrawlDb statistics start: crawl/crawldb/

    Statistics for CrawlDb: crawl/crawldb/

    Exception in thread "main" java.lang.NullPointerException

        at

org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:352)

        at

org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:502)


Does anybody have an idea what I am doing wrong?


Is there any possibility to get more verbose output / logging from the

commands?
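Regarding more verbose output: in a standard Nutch 1.x layout, the detailed log (including per-URL filter decisions) goes to logs/hadoop.log, with levels controlled by conf/log4j.properties. A sketch assuming that layout; the grep pattern and the DEBUG line are illustrative, and exact logger names depend on your Nutch version:

```shell
# Look for filter/reject decisions in the job log, if it exists here
if [ -f logs/hadoop.log ]; then
  grep -i -e 'filter' -e 'reject' logs/hadoop.log | tail -n 20
else
  echo "logs/hadoop.log not found here"
fi

# For more detail, raise the Nutch log level in conf/log4j.properties, e.g.:
# log4j.logger.org.apache.nutch=DEBUG
```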
