Thanks. I have a collection of URLs, and only these four fail: their home
pages are fetched, but none of their sub-pages are.
The URL list and my crawl-urlfilter are attached.
2009/4/1 Alejandro Gonzalez <[email protected]>
> Is your crawl-urlfilter OK? Are you sure it's fetching them properly? Maybe
> it's not getting the content of the pages, and so it cannot extract links
> to fetch at the next level (I suppose you have set the crawl depth just for
> the seed level).
>
> So either your filters are skipping the seeds (I suppose that's not the
> case, because you say the URLs reach the Fetcher), or the fetching is not
> going well (network issues?). Take a look at that.
>
> 2009/4/1 陈琛 <[email protected]>
>
> > Hi all,
> > I have four URLs, like this:
> > http://www.lao-indochina.com
> > http://www.nuol.edu.la
> > http://www.corninc.com.la
> > http://www.vientianecollege.laopdr.com
> >
> > Only the home page is fetched. Why? The sub-pages are not fetched...
> >
>
http://www.na.gov.la
http://www.lnmcmekong.org
http://www.nast.gov.la
http://www.mofa.gov.la
http://www.smpwood.com
http://www.sangkasy-lao.com
http://www.tacdo.com.la
http://www.bigartlao.com
http://www.lanxangshop.com
http://www.exim.com.la
http://www.lao-indochina.com
http://www.ninhomlaotour.com
http://www.talatsaomall.com
http://www.mixsports.net
http://www.laocement.com
http://www.beer-lao.com
http://www.nuol.edu.la
http://www.corninc.com.la
http://www.laodrivetech.com
http://www.vientianecollege.laopdr.com
http://www.thepsymoung.com
http://www.exim.la
http://www.undplao.org
http://www.worldbank.org/la
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(swf|SWF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$*
-\.(gpx|GPX|nb|PDF|pdf|m|java|JAVA|doc|DOC|ps|tex|jpeg|JPEG|bmp|BMP)$*
# skip URLs containing certain characters as probable queries, etc.
# -...@]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# -.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://www.na.gov.la/
+^http://www.lnmcmekong.org/
+^http://www.nast.gov.la/
+^http://www.mofa.gov.la/
+^http://www.smpwood.com/
+^http://www.sangkasy-lao.com/
+^http://www.tacdo.com.la/
+^http://www.bigartlao.com/
+^http://www.lanxangshop.com/
+^http://www.exim.com.la/
+^http://www.lao-indochina.com/
+^http://www.ninhomlaotour.com/
+^http://www.talatsaomall.com/
+^http://www.mixsports.net/
+^http://www.laocement.com/
+^http://www.beer-lao.com/
+^http://www.nuol.edu.la/
+^http://www.corninc.com.la/
+^http://www.laodrivetech.com/
+^http://www.vientianecollege.laopdr.com/
+^http://www.thepsymoung.com/
+^http://www.exim.la/
+^http://www.undplao.org/
+^http://www.worldbank.org/la/
# skip everything else
-.
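
One way to check whether the filter itself is dropping the sub-pages is to replay the rule described in the file's header comment (the first matching pattern decides, and no match means ignored) outside of Nutch. The sketch below is only an approximation in Python, not Nutch's own code: it assumes each pattern is searched anywhere in the URL, it uses a shortened rule set with a plain `$` on the suffix rule, and the sub-page URLs are made up for illustration.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first pattern found in the URL decides; no match means ignored."""
    for include, regex in rules:
        if regex.search(url):
            return include
    return False


# Shortened, simplified copy of the attached filter (assumption: '$', not '$*').
RULES = load_rules(r"""
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|css|js|JS|zip|pdf|PDF)$
+^http://www.lao-indochina.com/
+^http://www.nuol.edu.la/
-.
""")

# Hypothetical URLs; replace them with links copied from the real home pages.
for url in [
    "http://www.lao-indochina.com/",
    "http://www.lao-indochina.com/about.html",
    "http://lao-indochina.com/about.html",
    "http://www.nuol.edu.la/index.php?page=1",
    "http://www.nuol.edu.la/logo.gif",
]:
    print("accept" if accepts(RULES, url) else "ignore", url)
```

If the links on a home page point to a host without the `www.` prefix, or to a different domain altogether, none of the `+^http://www...` lines match and the final `-.` rule drops them, which would leave only the home page fetched.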