That's what I also thought, but

i)  The URLs I use are just separated by /
ii) I do not use the "crawl" command; I use the individual commands and regex-urlfilter.txt

Here it is:
---snip---
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
# -[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
---snip---
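
For what it's worth, here is a rough stand-alone sketch (plain java.util.regex,
not the actual urlfilter-regex plugin) of how I understand these rules to be
applied: top-down, with the sign of the first matching rule deciding whether a
URL is kept. For my seed URL only the final "+." rule matches, so it should be
accepted.

---snip---
// FilterCheck.java -- approximation only, not the Nutch plugin itself:
// rules are tried top-down, the sign of the first matching rule decides.
import java.util.regex.Pattern;

public class FilterCheck {

    // same rules as in the regex-urlfilter.txt above
    static final String[][] RULES = {
        { "-", "^(file|ftp|mailto):" },
        { "-", "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$" },
        { "-", ".*(/[^/]+)/[^/]+\\1/[^/]+\\1/" },
        { "+", "." },
    };

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "http://www.rwth-aachen.de";
        for (String[] rule : RULES) {
            if (Pattern.compile(rule[1]).matcher(url).find()) {
                System.out.println(rule[0] + " " + url + "   (rule: " + rule[1] + ")");
                return;
            }
        }
        System.out.println("- " + url + "   (no rule matched)");
    }
}
---snip---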

So no, I guess it's not the URL matching.

Any other ideas?

On 26.06.2009, at 23:22, MilleBii wrote:

Out of the box, only simple URLs (no special characters like "?", etc.) are
crawled.
So make sure you remove such filters, i.e. comment out the following line in
crawl-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
# -[...@=]

What do your URLs look like?

2009/6/26 Jochen Witte <[email protected]>

Hello,

I am just getting started with Nutch. My problem: I do not understand why URLs
are not fetched. My simple trial with one start URL, without any filters and
with only some adjusted configuration, can be seen below:

fetcher.server.delay: 2.0
fetcher.verbose: true
db.ignore.internal.links: false
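
These are set as overrides, e.g. in conf/nutch-site.xml, along these lines:

---snip---
<!-- sketch: the three overrides above in nutch-site.xml property syntax -->
<property>
  <name>fetcher.server.delay</name>
  <value>2.0</value>
</property>
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
---snip---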

http://www.rwth-aachen.de
depth=6
threads=30
adddays=0
topN=15
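
I run the individual steps roughly like this (simplified; directory names are
just examples):

---snip---
# one generate/fetch/update round, repeated up to depth=6
bin/nutch inject crawl/crawldb urls                      # once, with the seed URL
bin/nutch generate crawl/crawldb crawl/segments -topN 15 -adddays 0
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment -threads 30
# (plus "bin/nutch parse $segment" when fetching without parsing)
bin/nutch updatedb crawl/crawldb $segment
---snip---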

[nu...@d-1 search]$ bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     156
retry 0:        156
min score:      0.0
avg score:      0.03282051
max score:      1.208
status 1 (db_unfetched):        149
status 2 (db_fetched):  5
status 4 (db_redir_temp):       1
status 5 (db_redir_perm):       1
CrawlDb statistics: done

Question: why are 149 of the 156 URLs not fetched at all?
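
(To look at the individual entries I can also dump the crawldb, e.g.

bin/nutch readdb crawl/crawldb -dump crawldb-dump
grep -B1 db_unfetched crawldb-dump/part-* | less

but that only lists the unfetched URLs, it does not tell me why.)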

Thanks in advance
Jochen

--
-MilleBii-
