Hey Jiaxin,

My understanding is that the suffix URL filter (the urlfilter-suffix plugin) does not come into play unless it is listed in the plugin.includes property of the Nutch configuration. By default only the regex URL filter (urlfilter-regex) is enabled in Nutch, so the file types to skip or keep (matched by URL suffix) need to be set in regex-urlfilter.txt.
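
To make that concrete, here is a rough sketch of the kind of override I mean, placed inside the <configuration> element of conf/nutch-site.xml. The plugin list is an assumption based on a typical Nutch 1.x default and may differ in your version; the only part that matters is urlfilter-(regex|suffix) in place of urlfilter-regex.

```xml
<!-- Sketch only: copy the plugin.includes value from your own
     nutch-default.xml and change just the urlfilter part. -->
<property>
  <name>plugin.includes</name>
  <!-- Assumed default list; urlfilter-(regex|suffix) enables both URL filters. -->
  <value>protocol-http|urlfilter-(regex|suffix)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

If only urlfilter-regex stays enabled, the rule to edit should be the line in conf/regex-urlfilter.txt that starts with -\.( and lists extensions such as gif, jpg and png. As far as I can tell, removing an extension from that list lets those URLs through.
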
Please correct me if my understanding is wrong.

Thanks and regards,
*Majisha Namath Parambath*
*Graduate Student, M.S in Computer Science*
*Viterbi School of Engineering*
*University of Southern California, Los Angeles*

On Sun, Feb 15, 2015 at 11:34 PM, Jiaxin Ye <jiaxi...@usc.edu> wrote:

> Hi Swati,
>
> I am also the student in Prof Matmann's class. I think the politeness
> depends on the crawl-delay to the same server. Usually in the robots.txt
> the crawl-delay will be set to 5 to 15 seconds. It's true that setting
> fetcher.threads.per.queue to be bigger than 1 will cause the Crawl-Delay
> value from robots.txt to be ignored, but you can set the
> fetcher.server.delay to be 5 to 15 seconds to rebalance the successive
> requests time.
>
> I also think we should change the content in suffix_urlfillter as well, as
> our task is to collect as much data as we can from the three websites.
>
> Jiaxin
>
> On Sun, Feb 15, 2015 at 10:48 PM, Swati Kothari <swati...@usc.edu> wrote:
>
>> Hi,
>> We are working on a project under Professor Chris Mattmann as part of
>> Information Retrieval course.
>> We are trying to edit different properties to change politeness and do
>> url filtering.
>>
>> We are trying more than 1 thread, which makes it impolite, but we are not
>> sure how impolite it should be made for better results.
>> Also, url filtering blocks almost all image, audio, video formats in
>> suffix_urlfilter.xml, should that be tampered with or not?