From the looks of it, you need to parse all segments before
attempting to index them.

As Markus has pointed out, segment 20110712114256 hasn't been parsed. Try
parsing it as described at the following link:

http://wiki.apache.org/nutch/bin/nutch_parse
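
If you are not sure which segments still need parsing, something like the
following might help. It is only a rough sketch: it assumes that a parsed
segment contains a parse_data subdirectory (the usual Nutch 1.x segment
layout), and the function names unparsed_segments and parse_commands are
made up for illustration. It only builds the "bin/nutch parse <segment>"
command lines; it does not run them.

```python
import os


def unparsed_segments(segments_dir):
    """Return paths of segment directories that have no parse output.

    Assumption: a segment counts as parsed if it contains a
    'parse_data' subdirectory (typical Nutch 1.x layout).
    """
    missing = []
    for name in sorted(os.listdir(segments_dir)):
        segment = os.path.join(segments_dir, name)
        if not os.path.isdir(segment):
            continue  # ignore stray files next to the segments
        if not os.path.isdir(os.path.join(segment, "parse_data")):
            missing.append(segment)
    return missing


def parse_commands(segments_dir):
    """Build one 'bin/nutch parse <segment>' line per unparsed segment."""
    return ["bin/nutch parse " + seg for seg in unparsed_segments(segments_dir)]
```

For your setup you would call
parse_commands("/Users/toom/Downloads/nutch-1.3/sites/segments") and run the
printed commands from the Nutch directory before indexing.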

On Tue, Jul 12, 2011 at 1:50 PM, Paul van Hoven <
paul.van.ho...@googlemail.com> wrote:

> Okay, and what does that mean? How can I repair the error?
>
> 2011/7/12 Markus Jelsma <markus.jel...@openindex.io>:
> > I don't see this segment 20110712114256 being parsed.
> >
> > On Tuesday 12 July 2011 13:38:35 Paul van Hoven wrote:
> >> I'm not sure if I understood you correctly. Here is the complete output
> >> of my crawl:
> >>
> >>
> >> tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled
> >> -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50
> >> solrUrl is not set, indexing will be skipped...
> >> crawl started in: /Users/toom/Downloads/nutch-1.3/sites
> >> rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled
> >> threads = 10
> >> depth = 3
> >> solrUrl=null
> >> topN = 50
> >> Injector: starting at 2011-07-12 12:28:49
> >> Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb
> >> Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled
> >> Injector: Converting injected urls to crawl db entries.
> >> Injector: Merging injected urls into crawl db.
> >> Injector: finished at 2011-07-12 12:28:53, elapsed: 00:00:04
> >> Generator: starting at 2011-07-12 12:28:53
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: filtering: true
> >> Generator: normalizing: true
> >> Generator: topN: 50
> >> Generator: jobtracker is 'local', generating exactly one partition.
> >> Generator: Partitioning selected urls for politeness.
> >> Generator: segment:
> >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> >> Generator: finished at 2011-07-12 12:28:57, elapsed: 00:00:04
> >> Fetcher: Your 'http.agent.name' value should be listed first in
> >> 'http.robots.agents' property.
> >> Fetcher: starting at 2011-07-12 12:28:57
> >> Fetcher: segment:
> >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> >> Fetcher: threads: 10
> >> QueueFeeder finished: total 1 records + hit by time limit :0
> >> fetching http://nutch.apache.org/
> >> -finishing thread FetcherThread, activeThreads=1
> >> -finishing thread FetcherThread, activeThreads=1
> >> -finishing thread FetcherThread, activeThreads=1
> >> -finishing thread FetcherThread, activeThreads=1
> >> -finishing thread FetcherThread, activeThreads=1
> >> -finishing thread FetcherThread, activeThreads=1
> >> -finishing thread FetcherThread, activeThreads=1
> >> -finishing thread FetcherThread, activeThreads=1
> >> -finishing thread FetcherThread, activeThreads=1
> >> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> >> -finishing thread FetcherThread, activeThreads=0
> >> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >> -activeThreads=0
> >> Fetcher: finished at 2011-07-12 12:29:01, elapsed: 00:00:03
> >> ParseSegment: starting at 2011-07-12 12:29:01
> >> ParseSegment: segment:
> >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> >> ParseSegment: finished at 2011-07-12 12:29:03, elapsed: 00:00:02
> >> CrawlDb update: starting at 2011-07-12 12:29:03
> >> CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
> >> CrawlDb update: segments:
> >> [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856]
> >> CrawlDb update: additions allowed: true
> >> CrawlDb update: URL normalizing: true
> >> CrawlDb update: URL filtering: true
> >> CrawlDb update: Merging segment data into db.
> >> CrawlDb update: finished at 2011-07-12 12:29:06, elapsed: 00:00:02
> >> Generator: starting at 2011-07-12 12:29:06
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: filtering: true
> >> Generator: normalizing: true
> >> Generator: topN: 50
> >> Generator: jobtracker is 'local', generating exactly one partition.
> >> Generator: Partitioning selected urls for politeness.
> >> Generator: segment:
> >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> >> Generator: finished at 2011-07-12 12:29:10, elapsed: 00:00:03
> >> Fetcher: Your 'http.agent.name' value should be listed first in
> >> 'http.robots.agents' property.
> >> Fetcher: starting at 2011-07-12 12:29:10
> >> Fetcher: segment:
> >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> >> Fetcher: threads: 10
> >> QueueFeeder finished: total 50 records + hit by time limit :0
> >> fetching http://www.cafepress.com/nutch/
> >> fetching http://creativecommons.org/press-releases/entry/5064
> >> fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
> >> fetching http://www.apache.org/dist/nutch/CHANGES-1.0.txt
> >> fetching http://eu.apachecon.com/c/aceu2009/sessions/138
> >> fetching http://www.us.apachecon.com/c/acus2009/
> >> fetching http://issues.apache.org/jira/browse/NUTCH
> >> fetching http://forrest.apache.org/
> >> fetching http://hadoop.apache.org/
> >> fetching http://wiki.apache.org/nutch/
> >> fetching http://nutch.apache.org/credits.html
> >> fetching http://tika.apache.org/
> >> fetching http://lucene.apache.org/solr/
> >> fetching http://osuosl.org/news_folder/nutch
> >> fetching http://www.eu.apachecon.com/c/aceu2009/
> >> -activeThreads=10, spinWaiting=1, fetchQueues.totalSize=35
> >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=35
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=35
> >> fetching http://www.apache.org/
> >> fetching http://eu.apachecon.com/c/aceu2009/sessions/251
> >> fetching http://nutch.apache.org/skin/fontsize.js
> >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=32
> >> fetching http://www.us.apachecon.com/c/acus2009/schedule
> >> fetching http://wiki.apache.org/nutch/NutchTutorial
> >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=30
> >> fetching http://lucene.apache.org/java/
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=29
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=29
> >> fetching http://www.apache.org/dyn/closer.cgi/nutch/
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=28
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=28
> >> fetching http://eu.apachecon.com/c/aceu2009/sessions/197
> >> fetching http://nutch.apache.org/nightly.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26
> >> fetching http://wiki.apache.org/nutch/FAQ
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=25
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=25
> >> fetching http://www.apache.org/licenses/
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=24
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=24
> >> fetching http://eu.apachecon.com/c/aceu2009/sessions/136
> >> fetching http://nutch.apache.org/apidocs-1.3/index.html
> >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=22
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=22
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=22
> >> fetching http://www.apache.org/dist/nutch/CHANGES-1.2.txt
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=21
> >> fetching http://nutch.apache.org/skin/breadcrumbs.js
> >> fetching http://eu.apachecon.com/c/aceu2009/sessions/165
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=19
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=19
> >> fetching http://www.apache.org/dist/nutch/CHANGES-0.9.txt
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=18
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=18
> >> fetching http://eu.apachecon.com/c/aceu2009/sessions/201
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=17
> >> fetching http://nutch.apache.org/skin/getMenu.js
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16
> >> fetching http://www.apache.org/dist/nutch/CHANGES-1.1.txt
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=15
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=15
> >> fetching http://eu.apachecon.com/c/aceu2009/sessions/137
> >> fetching http://nutch.apache.org/index.html
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13
> >> fetching http://www.apache.org/dist/nutch/CHANGES-0.8.1.txt
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
> >> fetching http://www.apache.org/foundation/records/minutes/2010/board_minutes_2010_04_21.txt
> >> fetching http://eu.apachecon.com/c/aceu2009/sessions/250
> >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=10
> >> fetching http://nutch.apache.org/mailing_lists.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> >> fetching http://www.apache.org/dist/nutch/CHANGES-1.3.txt
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=8
> >> fetching http://nutch.apache.org/bot.html
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=7
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> >> fetching http://nutch.apache.org/issue_tracking.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=6
> >> fetching http://nutch.apache.org/about.html
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=5
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> >> fetching http://nutch.apache.org/i18n.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466617719
> >>   now           = 1310466613063
> >>   0. http://nutch.apache.org/version_control.html
> >>   1. http://nutch.apache.org/skin/getBlank.js
> >>   2. http://nutch.apache.org/index.pdf
> >>   3. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466617719
> >>   now           = 1310466614064
> >>   0. http://nutch.apache.org/version_control.html
> >>   1. http://nutch.apache.org/skin/getBlank.js
> >>   2. http://nutch.apache.org/index.pdf
> >>   3. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466617719
> >>   now           = 1310466615066
> >>   0. http://nutch.apache.org/version_control.html
> >>   1. http://nutch.apache.org/skin/getBlank.js
> >>   2. http://nutch.apache.org/index.pdf
> >>   3. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466617719
> >>   now           = 1310466616068
> >>   0. http://nutch.apache.org/version_control.html
> >>   1. http://nutch.apache.org/skin/getBlank.js
> >>   2. http://nutch.apache.org/index.pdf
> >>   3. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=4
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466617719
> >>   now           = 1310466617069
> >>   0. http://nutch.apache.org/version_control.html
> >>   1. http://nutch.apache.org/skin/getBlank.js
> >>   2. http://nutch.apache.org/index.pdf
> >>   3. http://nutch.apache.org/apidocs-1.2/index.html
> >> fetching http://nutch.apache.org/version_control.html
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 1
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466617719
> >>   now           = 1310466618071
> >>   0. http://nutch.apache.org/skin/getBlank.js
> >>   1. http://nutch.apache.org/index.pdf
> >>   2. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466623151
> >>   now           = 1310466619073
> >>   0. http://nutch.apache.org/skin/getBlank.js
> >>   1. http://nutch.apache.org/index.pdf
> >>   2. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466623151
> >>   now           = 1310466620075
> >>   0. http://nutch.apache.org/skin/getBlank.js
> >>   1. http://nutch.apache.org/index.pdf
> >>   2. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466623151
> >>   now           = 1310466621077
> >>   0. http://nutch.apache.org/skin/getBlank.js
> >>   1. http://nutch.apache.org/index.pdf
> >>   2. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466623151
> >>   now           = 1310466622078
> >>   0. http://nutch.apache.org/skin/getBlank.js
> >>   1. http://nutch.apache.org/index.pdf
> >>   2. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466623151
> >>   now           = 1310466623080
> >>   0. http://nutch.apache.org/skin/getBlank.js
> >>   1. http://nutch.apache.org/index.pdf
> >>   2. http://nutch.apache.org/apidocs-1.2/index.html
> >> fetching http://nutch.apache.org/skin/getBlank.js
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466628578
> >>   now           = 1310466624082
> >>   0. http://nutch.apache.org/index.pdf
> >>   1. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466628578
> >>   now           = 1310466625084
> >>   0. http://nutch.apache.org/index.pdf
> >>   1. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466628578
> >>   now           = 1310466626086
> >>   0. http://nutch.apache.org/index.pdf
> >>   1. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466628578
> >>   now           = 1310466627088
> >>   0. http://nutch.apache.org/index.pdf
> >>   1. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466628578
> >>   now           = 1310466628089
> >>   0. http://nutch.apache.org/index.pdf
> >>   1. http://nutch.apache.org/apidocs-1.2/index.html
> >> fetching http://nutch.apache.org/index.pdf
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=1
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 1
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466628578
> >>   now           = 1310466629090
> >>   0. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466634844
> >>   now           = 1310466630092
> >>   0. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466634844
> >>   now           = 1310466631094
> >>   0. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466634844
> >>   now           = 1310466632095
> >>   0. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466634844
> >>   now           = 1310466633097
> >>   0. http://nutch.apache.org/apidocs-1.2/index.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> >> * queue: http://nutch.apache.org
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466634844
> >>   now           = 1310466634099
> >>   0. http://nutch.apache.org/apidocs-1.2/index.html
> >> fetching http://nutch.apache.org/apidocs-1.2/index.html
> >> -finishing thread FetcherThread, activeThreads=9
> >> -finishing thread FetcherThread, activeThreads=8
> >> -finishing thread FetcherThread, activeThreads=7
> >> -finishing thread FetcherThread, activeThreads=6
> >> -finishing thread FetcherThread, activeThreads=5
> >> -activeThreads=5, spinWaiting=4, fetchQueues.totalSize=0
> >> -finishing thread FetcherThread, activeThreads=4
> >> -finishing thread FetcherThread, activeThreads=3
> >> -finishing thread FetcherThread, activeThreads=2
> >> -finishing thread FetcherThread, activeThreads=1
> >> -finishing thread FetcherThread, activeThreads=0
> >> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >> -activeThreads=0
> >> Fetcher: finished at 2011-07-12 12:30:37, elapsed: 00:01:27
> >> ParseSegment: starting at 2011-07-12 12:30:37
> >> ParseSegment: segment:
> >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> >> Error parsing: http://nutch.apache.org/skin/breadcrumbs.js:
> >> failed(2,0): Can't retrieve Tika parser for mime-type
> >> application/javascript
> >> Error parsing: http://nutch.apache.org/skin/fontsize.js: failed(2,0):
> >> Can't retrieve Tika parser for mime-type application/javascript
> >> Error parsing: http://nutch.apache.org/skin/getBlank.js: failed(2,0):
> >> Can't retrieve Tika parser for mime-type application/javascript
> >> Error parsing: http://nutch.apache.org/skin/getMenu.js: failed(2,0):
> >> Can't retrieve Tika parser for mime-type application/javascript
> >> ParseSegment: finished at 2011-07-12 12:30:46, elapsed: 00:00:08
> >> CrawlDb update: starting at 2011-07-12 12:30:46
> >> CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
> >> CrawlDb update: segments:
> >> [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908]
> >> CrawlDb update: additions allowed: true
> >> CrawlDb update: URL normalizing: true
> >> CrawlDb update: URL filtering: true
> >> CrawlDb update: Merging segment data into db.
> >> CrawlDb update: finished at 2011-07-12 12:30:48, elapsed: 00:00:02
> >> Generator: starting at 2011-07-12 12:30:48
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: filtering: true
> >> Generator: normalizing: true
> >> Generator: topN: 50
> >> Generator: jobtracker is 'local', generating exactly one partition.
> >> Generator: Partitioning selected urls for politeness.
> >> Generator: segment:
> >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> >> Generator: finished at 2011-07-12 12:30:52, elapsed: 00:00:03
> >> Fetcher: Your 'http.agent.name' value should be listed first in
> >> 'http.robots.agents' property.
> >> Fetcher: starting at 2011-07-12 12:30:52
> >> Fetcher: segment:
> >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> >> Fetcher: threads: 10
> >> QueueFeeder finished: total 50 records + hit by time limit :0
> >> fetching http://www.onehippo.com/
> >> fetching http://apacheconeu.blogspot.com/
> >> fetching http://www.day.com/
> >> fetching http://www.func.nl/apacheconeu2009
> >> fetching http://www.thawte.com/
> >> fetching http://eu.apachecon.com/c/aceu2009/about
> >> fetching http://www.us.apachecon.com/c/acus2009/sessions/333
> >> fetching http://www.joost.com/
> >> fetching http://developer.yahoo.com/blogs/hadoop/
> >> fetching http://www.springsource.com/
> >> fetching http://www.isi.edu/~koehn/europarl/
> >> fetching http://www.topicus.nl/
> >> fetching http://opensource.hp.com/
> >> fetching http://nutch.apache.org/apidocs-1.3/overview-frame.html
> >> -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=36
> >> fetching http://www.haloworldwide.com/
> >> fetching https://builds.apache.org/job/Nutch-trunk/javadoc/
> >> fetch of https://builds.apache.org/job/Nutch-trunk/javadoc/ failed
> >> with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found
> >> for url=https
> >> fetching http://www.hotwaxmedia.com/
> >> fetching http://lucene.apache.org/hadoop
> >> fetching http://www.cloudera.com/
> >> fetching http://code.google.com/opensource/
> >> fetching http://www.lucidimagination.com/
> >> fetching http://apache.lehtivihrea.org/nutch/
> >> fetching http://www.eu.apachecon.com/c/aceu2009/about/meetups
> >> -activeThreads=10, spinWaiting=4, fetchQueues.totalSize=27
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27
> >> fetching http://www.us.apachecon.com/c/acus2009/sessions/334
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26
> >> fetching http://nutch.apache.org/apidocs-1.2/allclasses-frame.html
> >> fetching http://eu.apachecon.com/c/aceu2009/about/crowdvine
> >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=24
> >> fetching http://www.eu.apachecon.com/c/aceu2009/about/videoStreaming
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=23
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=23
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=23
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=23
> >> fetching http://www.us.apachecon.com/c/acus2009/sessions/335
> >> fetching http://nutch.apache.org/apidocs-1.2/overview-summary.html
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=21
> >> fetching http://eu.apachecon.com/c/aceu2009/speakers
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=20
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=20
> >> fetching http://www.eu.apachecon.com/c/aceu2009/sponsors/sponsor
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
> >> fetching http://www.us.apachecon.com/c/acus2009/sessions/461
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=18
> >> fetching http://nutch.apache.org/apidocs-1.3/allclasses-frame.html
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=17
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=17
> >> fetching http://eu.apachecon.com/c/aceu2009/articles
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=16
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=16
> >> fetching http://www.us.apachecon.com/c/acus2009/sessions/427
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=15
> >> fetching http://nutch.apache.org/apidocs-1.2/overview-frame.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=14
> >> fetching http://eu.apachecon.com/c/aceu2009/sessions/
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13
> >> fetching http://www.us.apachecon.com/c/acus2009/sessions/430
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=12
> >> fetching http://nutch.apache.org/apidocs-1.3/overview-summary.html
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=11
> >> fetching http://eu.apachecon.com/c/aceu2009/sponsors/sponsors
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=10
> >> fetching http://www.us.apachecon.com/c/acus2009/sessions/375
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
> >> fetching http://eu.apachecon.com/c/
> >> fetching http://www.us.apachecon.com/c/acus2009/sessions/462
> >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=7
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
> >> fetching http://www.us.apachecon.com/c/acus2009/sessions/428
> >> fetching http://eu.apachecon.com/c/aceu2009/schedule
> >> -activeThreads=10, spinWaiting=8, fetchQueues.totalSize=5
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=5
> >> fetching http://www.us.apachecon.com/c/acus2009/sessions/331
> >> fetching http://eu.apachecon.com/c/aceu2009/
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3
> >> * queue: http://eu.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 1
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466704235
> >>   now           = 1310466704428
> >>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> >> * queue: http://www.us.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466709214
> >>   now           = 1310466704428
> >>   0. http://www.us.apachecon.com/c/acus2009/sessions/437
> >>   1. http://www.us.apachecon.com/c/acus2009/sessions/332
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=3
> >> * queue: http://eu.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 1
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466704235
> >>   now           = 1310466705429
> >>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> >> * queue: http://www.us.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466709214
> >>   now           = 1310466705430
> >>   0. http://www.us.apachecon.com/c/acus2009/sessions/437
> >>   1. http://www.us.apachecon.com/c/acus2009/sessions/332
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> >> * queue: http://eu.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466710968
> >>   now           = 1310466706431
> >>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> >> * queue: http://www.us.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466709214
> >>   now           = 1310466706431
> >>   0. http://www.us.apachecon.com/c/acus2009/sessions/437
> >>   1. http://www.us.apachecon.com/c/acus2009/sessions/332
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> >> * queue: http://eu.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466710968
> >>   now           = 1310466707433
> >>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> >> * queue: http://www.us.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466709214
> >>   now           = 1310466707433
> >>   0. http://www.us.apachecon.com/c/acus2009/sessions/437
> >>   1. http://www.us.apachecon.com/c/acus2009/sessions/332
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=3
> >> * queue: http://eu.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466710968
> >>   now           = 1310466708435
> >>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> >> * queue: http://www.us.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466709214
> >>   now           = 1310466708435
> >>   0. http://www.us.apachecon.com/c/acus2009/sessions/437
> >>   1. http://www.us.apachecon.com/c/acus2009/sessions/332
> >> fetching http://www.us.apachecon.com/c/acus2009/sessions/437
> >> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=2
> >> * queue: http://eu.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466710968
> >>   now           = 1310466709442
> >>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> >> * queue: http://www.us.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 1
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466709214
> >>   now           = 1310466709442
> >>   0. http://www.us.apachecon.com/c/acus2009/sessions/332
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> >> * queue: http://eu.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466710968
> >>   now           = 1310466710444
> >>   0. http://eu.apachecon.com/js/jquery.akslideshow.js
> >> * queue: http://www.us.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466714813
> >>   now           = 1310466710444
> >>   0. http://www.us.apachecon.com/c/acus2009/sessions/332
> >> fetching http://eu.apachecon.com/js/jquery.akslideshow.js
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> >> * queue: http://www.us.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466714813
> >>   now           = 1310466711446
> >>   0. http://www.us.apachecon.com/c/acus2009/sessions/332
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> >> * queue: http://www.us.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466714813
> >>   now           = 1310466712447
> >>   0. http://www.us.apachecon.com/c/acus2009/sessions/332
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> >> * queue: http://www.us.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466714813
> >>   now           = 1310466713448
> >>   0. http://www.us.apachecon.com/c/acus2009/sessions/332
> >> -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
> >> * queue: http://www.us.apachecon.com
> >>   maxThreads    = 1
> >>   inProgress    = 0
> >>   crawlDelay    = 5000
> >>   minCrawlDelay = 0
> >>   nextFetchTime = 1310466714813
> >>   now           = 1310466714450
> >>   0. http://www.us.apachecon.com/c/acus2009/sessions/332
> >> fetching http://www.us.apachecon.com/c/acus2009/sessions/332
> >> -finishing thread FetcherThread, activeThreads=9
> >> -finishing thread FetcherThread, activeThreads=8
> >> -finishing thread FetcherThread, activeThreads=7
> >> -finishing thread FetcherThread, activeThreads=6
> >> -finishing thread FetcherThread, activeThreads=5
> >> -finishing thread FetcherThread, activeThreads=4
> >> -finishing thread FetcherThread, activeThreads=3
> >> -finishing thread FetcherThread, activeThreads=2
> >> -finishing thread FetcherThread, activeThreads=1
> >> -finishing thread FetcherThread, activeThreads=0
> >> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >> -activeThreads=0
> >> Fetcher: finished at 2011-07-12 12:31:55, elapsed: 00:01:03
> >> ParseSegment: starting at 2011-07-12 12:31:55
> >> ParseSegment: segment:
> >> /Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> >> Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js:
> >> failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
> >> ParseSegment: finished at 2011-07-12 12:31:59, elapsed: 00:00:03
> >> CrawlDb update: starting at 2011-07-12 12:31:59
> >> CrawlDb update: db: /Users/toom/Downloads/nutch-1.3/sites/crawldb
> >> CrawlDb update: segments:
> >> [/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051]
> >> CrawlDb update: additions allowed: true
> >> CrawlDb update: URL normalizing: true
> >> CrawlDb update: URL filtering: true
> >> CrawlDb update: Merging segment data into db.
> >> CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
> >> LinkDb: starting at 2011-07-12 12:32:03
> >> LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb
> >> LinkDb: URL normalize: true
> >> LinkDb: URL filter: true
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> >> Exception in thread "main"
> >> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data
> >> Input path does not exist:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data
> >> Input path does not exist:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data
> >>       at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> >>       at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> >>       at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> >>       at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> >>       at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> >>       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> >>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
> >>       at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
> >>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>       at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
> >>
> >> 2011/7/12 Julien Nioche <lists.digitalpeb...@gmail.com>:
> >> >> Actually, I'm not sure if I'm looking at the right log lines. Please
> >> >> explain in more detail what exactly I should look for. Anyway, I
> >> >> found the following line just before the error:
> >> >>
> >> >> Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js:
> >> >> failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
> >> >>
> >> >> But I can see that parsing errors like this already appeared earlier
> >> >> during the crawl.
> >> >
> >> > This simply means that the JavaScript parser is not enabled in your
> >> > conf (which is the default behaviour) and, as a consequence, the default
> >> > parser (Tika) was used to try and parse it but has no resources for
> >> > doing so.
> >> >
> >> > Note: we should probably add .js to the default URL filters. The
> >> > JavaScript parser has been deactivated by default because it generates
> >> > atrocious URLs, so we might as well prevent such URLs from being
> >> > fetched in the first place.
> >> >
> >> > Anyway, this is not the source of the problem. You seem to have
> >> > unparsed segments among the ones specified. It could be that you
> >> > interrupted a previous crawl or had a problem with it and did not delete
> >> > these segments or the whole crawl directory. Removing the segments and
> >> > calling the last couple of steps manually should do the trick.
> >> >
> >> >> 2011/7/12 Markus Jelsma <markus.jel...@openindex.io>:
> >> >> > Were there errors during parsing of that last segment?
> >> >> >
> >> >> >> I'm starting with nutch and I ran a simple job as described in the
> >> >> >> nutch tutorial. After a while I get the following error:
> >> >> >>
> >> >> >>
> >> >> >> CrawlDb update: URL filtering: true
> >> >> >> CrawlDb update: Merging segment data into db.
> >> >> >> CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
> >> >> >> LinkDb: starting at 2011-07-12 12:32:03
> >> >> >> LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb
> >> >> >> LinkDb: URL normalize: true
> >> >> >> LinkDb: URL filter: true
> >> >> >> LinkDb: adding segment:
> >> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
> >> >> >> LinkDb: adding segment:
> >> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732
> >> >> >> LinkDb: adding segment:
> >> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256
> >> >> >> LinkDb: adding segment:
> >> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> >> >> >> LinkDb: adding segment:
> >> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> >> >> >> LinkDb: adding segment:
> >> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> >> >> >> Exception in thread "main"
> >> >> >> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> >> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data
> >> >> >> Input path does not exist:
> >> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data
> >> >> >> Input path does not exist:
> >> >> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data
> >> >> >>       at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> >> >> >>       at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> >> >> >>       at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> >> >> >>       at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> >> >> >>       at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> >> >> >>       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >> >> >>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >> >> >>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> >> >> >>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
> >> >> >>       at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
> >> >> >>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >> >> >>       at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
> >> >
> >> > --
> >> > *
> >> > *Open Source Solutions for Text Engineering
> >> >
> >> > http://digitalpebble.blogspot.com/
> >> > http://www.digitalpebble.com
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> >
>
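
To see which segments are affected before parsing or removing them, something
like the following might help. It is only a rough sketch, assuming a segment
counts as unparsed when its directory has no parse_data subdirectory (the path
the exception above complains about). The function name unparsed_segments is
made up for illustration and is not part of Nutch.

```python
import os
import tempfile


def unparsed_segments(segments_dir):
    """Return the sorted names of segment directories lacking parse_data.

    Assumption: a parsed segment always contains a parse_data subdirectory,
    as suggested by the "Input path does not exist" messages in the log.
    """
    missing = []
    for name in sorted(os.listdir(segments_dir)):
        segment = os.path.join(segments_dir, name)
        if not os.path.isdir(segment):
            continue  # skip stray files
        if not os.path.isdir(os.path.join(segment, "parse_data")):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Demo on a throwaway directory tree that mimics a segments directory.
    demo = tempfile.mkdtemp()
    os.makedirs(os.path.join(demo, "20110712114256", "crawl_fetch"))
    os.makedirs(os.path.join(demo, "20110712123051", "parse_data"))
    for name in unparsed_segments(demo):
        print("unparsed segment:", name)
```

Pointed at your own segments directory instead of the demo one, each reported
segment can then either be parsed (see the wiki link at the top of this mail)
or deleted before running the remaining steps again.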



-- 
*Lewis*
