That, or you forgot to put 'http://' in the URLs in your seed list (like I just did).
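The DOS-line-ending problem described in the next message can be checked and fixed from the shell. A hedged sketch (file names are examples, not taken from the thread); `tr -d '\r'` does the same job as dos2unix where the latter is not installed:

```shell
# A seed file saved on Windows/DOS ends every line with \r\n:
printf 'http://www.uni-kassel.de\r\nhttp://portal.uni-kassel.de\r\n' > urls

# Strip the carriage returns (equivalent to running dos2unix on the file):
tr -d '\r' < urls > urls.unix && mv urls.unix urls

# Verify that no carriage returns remain:
if grep -q "$(printf '\r')" urls; then echo "still DOS format"; else echo "unix format"; fi

# Then re-upload the cleaned list (hypothetical HDFS path):
# hadoop fs -put urls /user/nutch/seedUrls/urls
```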
On 2011-05-19 at 13:14, Jean-Francois Gingras <[email protected]> wrote:

One more thing: make sure your seed URLs are in UNIX format (line endings). All my test seeds worked, but the real seed list failed with 0 records produced by the generator. It turned out the real seed list was in DOS format. Running dos2unix and uploading the file to HDFS again fixed my problem.

On 2011-05-18 at 05:25, Marek Bachmann <[email protected]> wrote:

Thank you very much for the help. I checked ALL files in nutch/conf for any further expressions that would exclude my URLs, and I found nothing of the kind. In fact, as I mentioned before, the ./nutch crawl command works fine on exactly the same input data. Once again, if I delete all entries in my crawl directory and then run:

  ./nutch crawl seedUrls/ -dir crawl -threads 30 -depth 10

  crawl started in: crawl
  rootUrlDir = seedUrls
  threads = 30
  depth = 10
  indexer=lucene
  Injector: starting
  Injector: crawlDb: crawl/crawldb
  Injector: urlDir: seedUrls
  Injector: Converting injected urls to crawl db entries.
  Injector: Merging injected urls into crawl db.
  Injector: done
  Generator: Selecting best-scoring urls due for fetch.
  Generator: starting
  Generator: filtering: true
  Generator: normalizing: true
  Generator: jobtracker is 'local', generating exactly one partition.
  Generator: Partitioning selected urls for politeness.
  Generator: segment: crawl/segments/20110518111323
  Generator: done.
  Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
  Fetcher: starting
  Fetcher: segment: crawl/segments/20110518111323
  Fetcher: threads: 30
  QueueFeeder finished: total 10 records + hit by time limit :0
  fetching http://portal.uni-kassel.de/
  fetching http://www.studentenwerk-kassel.de/
  fetching http://www.asta-kassel.de/
  fetching http://www.uni-kassel.de/fb16
  fetching http://www.uni-kassel.de/
  fetching http://www.uni-kassel.de/uni/studium/
  fetching http://www.uni-kassel.de/uni/fachbereiche/
  fetching http://www.uni-kassel.de/uni/
  fetching http://www.uni-kassel.de/uni/forschung/
  fetching http://www.cs.uni-kassel.de/

But if I try it manually (after deleting the crawldb once again):

  ./nutch inject crawl/crawldb seedUrls/

  Injector: starting
  Injector: crawlDb: crawl/crawldb
  Injector: urlDir: seedUrls
  Injector: Converting injected urls to crawl db entries.
  Injector: Merging injected urls into crawl db.
  Injector: done

  ./nutch generate crawl/crawldb/ crawl/segments

  Generator: Selecting best-scoring urls due for fetch.
  Generator: starting
  Generator: filtering: true
  Generator: normalizing: true
  Generator: jobtracker is 'local', generating exactly one partition.
  Generator: 0 records selected for fetching, exiting ...

So my conclusion is that the crawl command does the URL injecting in some other way? I just don't get why it works with the crawl command but doesn't work when injecting manually. Any further suggestions on where I could find my mistake would be great :-)

On 16.05.2011 16:14, Markus Jelsma wrote:

I see that too, and it shouldn't dump an exception if there's nothing in the CrawlDB. This, however, doesn't seem to be your problem. If you inject but nothing ends up in the CrawlDB, then you have some filters running that skip your seed URLs. Check your domain filter settings or other URL filter settings, depending on the plugins you defined.
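The filter behaviour Markus describes can be illustrated without a running Nutch. A hedged sketch (the awk function below is an emulation of the first-match semantics of regex-urlfilter.txt, not Nutch code, and the rule file is an invented example): rules are applied top-down, '+' keeps a URL, '-' drops it, and a URL matching no rule at all is dropped, which is how a filter file can silently reject a seed.

```shell
# Invented example filter file: accepts only uni-kassel.de hosts.
cat > regex-urlfilter.txt <<'EOF'
-\.(gif|GIF|jpg|JPG)$
+^http://([a-z0-9]*\.)*uni-kassel\.de/
EOF

# Emulate first-match-wins filtering (prints +url if kept, -url if dropped).
check_url() {
  awk -v url="$1" '
    /^[+-]/ {
      sign = substr($0, 1, 1); pat = substr($0, 2)
      if (url ~ pat) { print sign url; matched = 1; exit }
    }
    END { if (!matched) print "-" url }
  ' regex-urlfilter.txt
}

check_url "http://www.uni-kassel.de/"            # -> +http://www.uni-kassel.de/
check_url "http://www.studentenwerk-kassel.de/"  # -> -http://www.studentenwerk-kassel.de/
```

The second seed is dropped by this example file because no '+' rule matches it, exactly the kind of silent rejection that leaves the generator with 0 records.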
On Monday 16 May 2011 15:56:26 Marek Bachmann wrote:

Hello people,

I was trying to do a manual crawl as described in the Nutch tutorial at http://wiki.apache.org/nutch/NutchTutorial. First of all: if I do a crawl with the same seed URLs using the "nutch crawl" command, everything works fine. Here's what I was trying to do:

1.) Trying to create a new crawl DB with:

  ./nutch inject crawl/crawldb seedUrls

The directory crawl was empty, and in the directory seedUrls there is one file "urls" with this content:

  http://www.uni-kassel.de
  http://portal.uni-kassel.de
  http://www.asta-kassel.de
  http://www.uni-kassel.de/fb16
  http://www.cs.uni-kassel.de
  http://www.studentenwerk-kassel.de

The command runs without any error:

  ./nutch inject crawl/crawldb seedUrls

  Injector: starting
  Injector: crawlDb: crawl/crawldb
  Injector: urlDir: seedUrls
  Injector: Converting injected urls to crawl db entries.
  Injector: Merging injected urls into crawl db.
  Injector: done

After that, a new directory with the name crawldb exists in crawl/.

2.) Trying to generate new segments:

  ./nutch generate crawl/crawldb/ crawl/segments -noFilter

  Generator: Selecting best-scoring urls due for fetch.
  Generator: starting
  Generator: filtering: false
  Generator: normalizing: true
  Generator: jobtracker is 'local', generating exactly one partition.
  Generator: 0 records selected for fetching, exiting ...

So I am wondering why the generator does not create segments. It says that 0 records were selected for fetching. It seems to me that the injector hadn't actually injected the URLs into the db. When I run:

  ./nutch readdb crawl/crawldb/ -stats

it outputs:

  CrawlDb statistics start: crawl/crawldb/
  Statistics for CrawlDb: crawl/crawldb/
  Exception in thread "main" java.lang.NullPointerException
          at org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:352)
          at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:502)

Does anybody have an idea what I'm doing wrong?
Is there any possibility to get more verbose output / logging from the commands?
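For reference, the manual cycle the thread is attempting can be written out as a loop. A hedged sketch, not a definitive recipe: it assumes a Nutch 1.x installation (the commands are no-ops elsewhere), and the depth and thread counts are example values.

```shell
# Sketch of the inject -> generate -> fetch -> parse -> updatedb cycle
# that the one-shot `nutch crawl` command chains together (Nutch 1.x).
# Guarded so the script is a no-op outside a Nutch installation.
if [ -x bin/nutch ]; then
  bin/nutch inject crawl/crawldb seedUrls
  for depth in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments
    segment=$(ls -d crawl/segments/* | tail -1)   # newest segment
    bin/nutch fetch "$segment" -threads 30
    bin/nutch parse "$segment"
    bin/nutch updatedb crawl/crawldb "$segment"
  done
fi
# In a stock Nutch 1.x tree each step also writes detailed log output to
# logs/hadoop.log, which is where to look for more verbose diagnostics.
```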

