I want to crawl deep pages

Yousin Kim Fri, 17 Apr 2015 13:59:02 -0700

Hello, I compiled Nutch 2.3 with Gora 0.6 and MongoDB and tried to crawl
an online shop.

However, I only got the front pages, not the product detail pages.
How can I get the product detail pages?

Thank you :)

I want to get URLs like:
http://www.vanillashu.co.kr/product/detail.html?product_no=20388&cate_no=42&display_group=2

My seed list is: http://www.vanillashu.co.kr/

My regex-urlfilter:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
#+.

+^(http|https)://.* vanillashu.co.kr/
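
To check whether these rules would let the detail URL through, here is a minimal sketch in Python, not Nutch's own code. It assumes the filter tries the rules in file order, that the first pattern found anywhere in the URL decides (+ accepts, - rejects), and that a URL matching no rule is rejected. The helper name accepts and the "no space" variant are hypothetical; the space in the last rule may simply have been introduced by the mail archive.

```python
import re

# Active (uncommented) rules copied from the listing above, in file order.
RULES_AS_POSTED = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
    r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
    r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$",
    r"+^(http|https)://.* vanillashu.co.kr/",
]

# Hypothetical variant: the same rules with the space removed from the last one.
RULES_NO_SPACE = RULES_AS_POSTED[:2] + [r"+^(http|https)://.*vanillashu.co.kr/"]

DETAIL_URL = (
    "http://www.vanillashu.co.kr/product/detail.html"
    "?product_no=20388&cate_no=42&display_group=2"
)


def accepts(url, rules):
    """Return True if the first rule whose pattern is found in url starts with '+'."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is rejected.
    return False


if __name__ == "__main__":
    print("as posted:", accepts(DETAIL_URL, RULES_AS_POSTED))
    print("no space: ", accepts(DETAIL_URL, RULES_NO_SPACE))
```

Under these assumptions the rule as posted rejects the detail URL, because the URL contains no space, while the variant without the space accepts it. If the space is really in the file, that alone could explain the missing pages.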
