I am using Nutch 2.x with HBase as the backend storage.

*-Gajanan*
On Wed, Oct 10, 2018 at 5:17 PM Gajanan Watkar <gajananwat...@gmail.com> wrote:

> Hi all,
>
> *1. Want to filter all URLs like:*
>
> http://14538.diarynote.jp/items/music-jp/B00005FMG1/
> http://12899diarynote.jp/amp/201503160602121325/
> http://15131513marudiarynote.jp/amp/201603181431397340/
> http://11621diarynote.jp/amp/200409061741310000/
> http://14291.diarynote.jp/items/dvd-jp/B00016ZPCQ/
> http://10695diarynote.jp/amp/200908112143487146/
>
> *2. Contents of the regex-urlfilter.txt file:*
>
> # skip diarynote.jp
> *-.*diarynote.jp.**
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> +.
>
> *3. nutch-site.xml* has the *plugin.includes* property with the *urlfilter-regex* plugin included in it.
>
> *4.* When I test with *bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter*, I get the expected results, *but at crawl time all these URLs are included in the fetch list*.
>
> 18/10/10 12:35:23 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/home/user/hadoop/tmp/hadoop-unjar5013141110548091848/regex-urlfilter.txt
> http://11848.diarynote.jp/home/diary/new/
> -http://11848.diarynote.jp/home/diary/new/
> http://23810diarynote.jp/amp/201210031421469096/
> -http://23810diarynote.jp/amp/201210031421469096/
> diarynote.jp/amp/201210031421469096/
> -diarynote.jp/amp/201210031421469096/
> 23810diarynote.jp/amp/201210031421469096/
> -23810diarynote.jp/amp/201210031421469096/
> 11848.diarynote.jp/home/diary/new/
> -11848.diarynote.jp/home/diary/new/
> http://20131110karadiarynote.jp/amp/201604260043253476/
> -http://20131110karadiarynote.jp/amp/201604260043253476/
> 20131110karadiarynote.jp/amp/201604260043253476/
> -20131110karadiarynote.jp/amp/201604260043253476/
>
> *What could be the problem? I need help.*
>
> *-Gajanan*
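For anyone who wants to check the quoted rules outside a crawl, here is a rough Python sketch of how such a rule file is commonly evaluated. It is only an approximation and rests on assumptions: that rules are applied in file order, that the first matching rule decides, that matching is unanchored (find-style), and that the `*` characters around the first rule are bold markers from the mail client rather than part of the pattern. `parse_rules` and `filter_url` are hypothetical helper names for this illustration, not Nutch APIs.

```python
import re

# Rules copied from the regex-urlfilter.txt quoted above, in file order.
# Assumption: the surrounding '*' on the first rule was mail-client bold markup.
RULES_TEXT = r"""
# skip diarynote.jp
-.*diarynote.jp.*
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def filter_url(url, rules):
    """Return the URL if the first matching rule accepts it, otherwise None."""
    for accept, pattern in rules:
        if pattern.search(url):  # unanchored match (assumed semantics)
            return url if accept else None
    return None  # no rule matched: treated as rejected in this sketch


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in [
        "http://14538.diarynote.jp/items/music-jp/B00005FMG1/",
        "http://12899diarynote.jp/amp/201503160602121325/",
        "http://example.org/page.html",  # hypothetical URL, not from the thread
    ]:
        verdict = "+" if filter_url(url, rules) else "-"
        print(verdict + url)
```

The sketch prefixes each URL with `+` or `-`, mirroring the format of the plugin test output quoted above. If the diarynote.jp URLs are rejected here as well, that would suggest the rules themselves are fine and that the difference lies elsewhere, for example in which copy of regex-urlfilter.txt the crawl job actually loads.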