Re:

Majisha Parambath Sun, 22 Feb 2015 18:09:58 -0800

Hey Jiaxin,

My understanding is that the suffix_urlfilter does not come into play
unless it is listed in the plugin.includes property of the Nutch
configuration. By default only the regex_urlfilter is enabled in Nutch,
so we need to set the mime types to skip or not skip in
regex_urlfilter.txt.
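
For illustration, enabling the suffix filter would look roughly like the
override below in conf/nutch-site.xml. This is a sketch, not a tested
config: the plugin list is assumed to be the Nutch 1.x default with
urlfilter-suffix added, so please compare it against the plugin.includes
value in your own nutch-default.xml before copying it.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: everything except "suffix" mirrors the default Nutch 1.x
       plugin list; check nutch-default.xml in your install for the real value. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(regex|suffix)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```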


Please correct me if my understanding is wrong.

Thanks and regards,
*Majisha Namath Parambath*
*Graduate Student, M.S in Computer Science*
*Viterbi School of Engineering*
*University of Southern California, Los Angeles*

On Sun, Feb 15, 2015 at 11:34 PM, Jiaxin Ye <[email protected]> wrote:

> Hi Swati,
>
> I am also a student in Prof. Mattmann's class. I think politeness
> depends on the crawl delay between requests to the same server. Usually
> the Crawl-Delay in robots.txt is set to 5 to 15 seconds. It's true that
> setting fetcher.threads.per.queue to a value bigger than 1 will cause the
> Crawl-Delay value from robots.txt to be ignored, but you can set
> fetcher.server.delay to 5 to 15 seconds to space out successive
> requests.
>
> I also think we should change the content of suffix_urlfilter as well, as
> our task is to collect as much data as we can from the three websites.
>
> Jiaxin
>
> On Sun, Feb 15, 2015 at 10:48 PM, Swati Kothari <[email protected]> wrote:
>
>> Hi,
>> We are working on a project under Professor Chris Mattmann as part of
>> the Information Retrieval course.
>> We are trying to edit different properties to change politeness and do
>> URL filtering.
>>
>> We are trying more than one thread, which makes the crawl impolite, but
>> we are not sure how impolite it should be made for better results.
>> Also, URL filtering blocks almost all image, audio, and video formats in
>> suffix_urlfilter.xml. Should that be tampered with or not?
>>
>
>
