That's what I also thought, but
i) The URLs I use are just separated by /
ii) I do not use the "crawl" command; I use the single commands and regex-urlfilter.txt.
Here it is:
---snip---
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
---snip---
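(For what it's worth, I could also run sample URLs through the filters directly; assuming the URLFilterChecker class is available in this Nutch version, something like this should print the URL prefixed with + if it is accepted and - if it is rejected:)
---snip---
# untested sketch: run the seed URL through all configured URL filters
echo "http://www.rwth-aachen.de" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
---snip---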
So no, I guess it's not the URL matching.
Any other ideas?
On 26.06.2009, at 23:22, MilleBii wrote:
Out of the box, only simple URLs (no special characters like "?" etc.) are crawled.
So make sure you remove such filters, i.e. comment out this line in crawl-urlfilter.txt:
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
What do your URLs look like?
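For a crawl restricted to one domain, the relevant part of crawl-urlfilter.txt would then look roughly like this (the domain line is just a guess based on your seed URL, adjust as needed):
---snip---
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
# accept hosts in the target domain, e.g.
+^http://([a-z0-9]*\.)*rwth-aachen.de/
---snip---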
2009/6/26 Jochen Witte <[email protected]>
Hello,
I am just starting with Nutch. My problem: I do not understand why URLs are not fetched. My simple trial with one start URL, without any filters and with some adjusted configuration, can be seen below:
fetcher.server.delay: 2.0
fetcher.verbose: true
db.ignore.internal.links: false
http://www.rwth-aachen.de
depth=6
threads=30
adddays=0
topN=15
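(These parameters correspond to one generate/fetch/updatedb round per depth level, roughly the cycle below; the exact option names may differ between Nutch versions:)
---snip---
# rough sketch of the per-depth cycle, not verified against this exact version
bin/nutch inject crawl/crawldb urls                      # once, with the seed URL
bin/nutch generate crawl/crawldb crawl/segments -topN 15 -adddays 0
s=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s -threads 30
bin/nutch updatedb crawl/crawldb $s
---snip---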
[nu...@d-1 search]$ bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 156
retry 0: 156
min score: 0.0
avg score: 0.03282051
max score: 1.208
status 1 (db_unfetched): 149
status 2 (db_fetched): 5
status 4 (db_redir_temp): 1
status 5 (db_redir_perm): 1
CrawlDb statistics: done
Question: why are 149 of the 156 URLs not fetched at all?
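(If it helps, I could also dump the crawldb to list the unfetched URLs themselves; I assume something like this would work:)
---snip---
bin/nutch readdb crawl/crawldb -dump crawldb-dump
grep -B1 db_unfetched crawldb-dump/part-00000
---snip---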
Thanks in advance
Jochen
--
-MilleBii-