Thanks, Jiaxin. We are already varying the parameters as you suggested,
but we are still unsure what values would be appropriate for the
properties we are changing.
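
One way to ground the choice might be to read the Crawl-delay that each
site declares in its robots.txt and use that as a starting value. Below is
a minimal sketch using Python's standard urllib.robotparser. The function
name suggested_delay and the 5-second fallback are hypothetical, chosen only
for illustration. Note also that, as far as I can tell from the property
descriptions in nutch-default.xml, once fetcher.threads.per.queue is greater
than 1 the delay that applies is fetcher.server.min.delay rather than
fetcher.server.delay, so that may be the property to set.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical fallback (in seconds) when a site declares no Crawl-delay.
DEFAULT_DELAY = 5.0


def suggested_delay(robots_lines, agent="*", default=DEFAULT_DELAY):
    """Return the Crawl-delay declared for `agent`, or `default` if none.

    `robots_lines` is the content of a robots.txt file as a list of lines.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    delay = parser.crawl_delay(agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    # Inline sample instead of a network fetch, so the sketch runs offline.
    sample = ["User-agent: *", "Crawl-delay: 10", "Disallow: /private/"]
    print(suggested_delay(sample))
```

For a live site, the lines would come from fetching its robots.txt first;
the value returned is only a candidate for the delay property, not a
recommendation.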

On Sun, Feb 15, 2015 at 11:34 PM, Jiaxin Ye <jiaxi...@usc.edu> wrote:

> Hi Swati,
>
> I am also a student in Prof. Mattmann's class. I think politeness
> depends on the crawl delay between requests to the same server. When
> robots.txt sets a Crawl-Delay, it is usually 5 to 15 seconds. It's true
> that setting fetcher.threads.per.queue to a value greater than 1 causes
> the Crawl-Delay value from robots.txt to be ignored, but you can set
> fetcher.server.delay to 5 to 15 seconds to space out successive requests
> to the same server.
>
> I also think we should change the content of suffix_urlfilter as well, as
> our task is to collect as much data as we can from the three websites.
>
> Jiaxin
>
> On Sun, Feb 15, 2015 at 10:48 PM, Swati Kothari <swati...@usc.edu> wrote:
>
>> Hi,
>> We are working on a project under Professor Chris Mattmann as part of
>> the Information Retrieval course.
>> We are trying to edit different properties to change politeness and to
>> do URL filtering.
>>
>> We are trying more than one thread, which makes the crawl impolite, but
>> we are not sure how impolite it should be for better results.
>> Also, the URL filter blocks almost all image, audio, and video formats
>> in suffix_urlfilter.xml. Should that be changed or not?
>>
>
>
