Re: A few questions about crawl-urlfilter.txt

reinhard schwab Thu, 16 Jul 2009 03:05:45 -0700

Hrishikesh Agashe wrote:
> Here are few questions I had about crawl-urlfilter.txt.
>
>
> -          Does Nutch obey crawl-urlfilter.txt properly? By default, it is 
> set to not download css, but when I do the crawl, I do see parse.ParseUtil 
> exceptions in my Hadoop.log (org.apache.nutch.parse.ParseException: parser 
> not found for contentType=text/css)
> Doesn't this mean that Nutch has actually downloaded a css file and is trying 
> to parse it?
>   
crawl-urlfilter.txt defines the rules for regexp filtering of urls.
see conf/crawl-tool.xml
it only filters urls by matching regexps against the url string.


if the url has no css file extension and it is accepted by
one of the filter rules, it will be downloaded and handed to the parser,
even if the server then returns it as text/css.
this is your case.
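
to illustrate, a small sketch (not nutch code). the pattern is a shortened,
hypothetical version of a suffix exclusion rule, and the urls and class name
are made up. it only shows that such a rule looks at the url string, never at
the content type the server returns.

```java
import java.util.regex.Pattern;

// sketch only, not nutch code
final class SuffixRuleSketch {
    // shortened, hypothetical version of a "-\.(...)$" exclusion rule
    static final Pattern EXCLUDED_SUFFIX = Pattern.compile("\\.(gif|jpg|png|css)$");

    // true if the exclusion rule matches the url string
    static boolean isExcluded(String url) {
        return EXCLUDED_SUFFIX.matcher(url).find();
    }

    public static void main(String[] args) {
        // ends in .css, so the rule catches it
        System.out.println(isExcluded("http://www.example.com/site.css"));
        // may well be served as text/css, but the url gives no hint of that
        System.out.println(isExcluded("http://www.example.com/style?id=1"));
    }
}
```

the second url passes the rule, gets fetched, and only then turns out to be css.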
>
> -          Can I put a positive filter in crawl-urlfilter.txt? Like
>
> +\.(html, htm)
>
> Instead of current one which starts with "-"? Will it make Nutch only 
> download files with extension htm and html?
>   
yes.
see the code in

RegexURLFilterBase

it iterates through the rules and the first matching rule is applied.
if no rule matches, the url is filtered out.

btw, your pattern should be
+\.(html|htm|HTML|HTM)$
the pattern has to be a regular expression, so the alternatives are
separated by |, not by commas.
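
a rough sketch of that first-match logic, not the actual RegexURLFilterBase
code. the class name, the Rule helper and the use of find() are my assumptions
for illustration. it also assumes every rule line starts with + or -.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// sketch only: first matching rule decides, no match means filtered out
final class FirstMatchFilterSketch {
    // one rule line: leading sign ('+' accept, '-' reject) plus a regex
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    FirstMatchFilterSketch(String... lines) {
        for (String line : lines) {
            rules.add(new Rule(line));
        }
    }

    // returns the url if it is accepted, null if it is filtered out
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null; // first match wins
            }
        }
        return null; // no rule matched
    }

    public static void main(String[] args) {
        FirstMatchFilterSketch f =
            new FirstMatchFilterSketch("+\\.(html|htm|HTML|HTM)$");
        System.out.println(f.filter("http://www.example.com/index.html"));
        System.out.println(f.filter("http://www.example.com/song.mp3"));
    }
}
```

with only the positive rule in place, the mp3 url matches nothing and is dropped.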

>
>
> -          Are the extensions in crawl-urlfilter.txt case sensitive or not?  
> i.e. do I have to add mp3, MP3, Mp3 to tell Nutch to not to download mp3 
> files?
>   
they are case sensitive.
see RegexURLFilter, where the patterns are compiled with

pattern = Pattern.compile(regex);

if you want them to be case insensitive,
patterns have to be compiled with
pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
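
a small sketch of the difference, using plain java.util.regex (class name and
urls are made up). the third variant uses the embedded flag (?i), which is
standard java regex syntax. whether a rule line like -(?i)\.mp3$ is passed
through unchanged by your nutch version is an assumption i have not verified.

```java
import java.util.regex.Pattern;

// sketch only: compares case sensitive and case insensitive compilation
final class CaseInsensitiveSketch {
    static final Pattern SENSITIVE = Pattern.compile("\\.mp3$");
    static final Pattern INSENSITIVE =
        Pattern.compile("\\.mp3$", Pattern.CASE_INSENSITIVE);
    // same effect as the flag above, but written inside the regex itself
    static final Pattern INLINE_FLAG = Pattern.compile("(?i)\\.mp3$");

    static boolean matches(Pattern pattern, String url) {
        return pattern.matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/song.Mp3";
        System.out.println(matches(SENSITIVE, url));   // false
        System.out.println(matches(INSENSITIVE, url)); // true
        System.out.println(matches(INLINE_FLAG, url)); // true
    }
}
```

without either of these you have to list every spelling, e.g. (mp3|MP3|Mp3).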

>
>
> -          How does Nutch handle URLs which are GET but does not end with 
> extension? i.e. if there is a URL like http://www.mysite.com/images/1 which 
> returns an image, will Nutch be able to identify it and avoid it's download?
>   
download can only be avoided by defining filter rules.
if you have a rule like

-images/1$

it will not be downloaded.
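
note that such a rule is anchored at the end of the url only. a quick check
with plain java.util.regex (sketch, made-up class name, and find() semantics
are assumed here):

```java
import java.util.regex.Pattern;

// sketch only: which urls does the regex "images/1$" hit?
final class ImageRuleSketch {
    static final Pattern RULE = Pattern.compile("images/1$");

    static boolean isHit(String url) {
        return RULE.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(isHit("http://www.mysite.com/images/1"));          // true
        System.out.println(isHit("http://www.mysite.com/images/10"));         // false
        System.out.println(isHit("http://www.mysite.com/images/1?size=big")); // false
    }
}
```

so each such url, or a pattern covering the whole group, needs its own rule.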

reinhard


> TIA,
> --Hrishi
