[Nutch-general] I don't want to crawl internet sites
This is my regex-urlfilter.txt file:

-^http://([a-z0-9]*\.)+
+^http://([0-9]+\.)+
+.

I want to allow only IP addresses and internal sites to be crawled and fetched. That is:

http://www.google.com should be ignored
http://shoppingcenter should be crawled
http://192.168.101.5 should be crawled

But when I look at the logs, I find that http://someone.blogspot.com/ has also been crawled. How is that possible? Is my regex-urlfilter.txt wrong? Are there other URL filters? If so, in what order are the filters called?

___
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
Re: [Nutch-general] I don't want to crawl internet sites
Your regexps are incorrect. Both of them require the address to end with a dot. Try this instead:

+^http://([0-9]{1,3}\.){3}[0-9]{1,3}
-^http://([a-z0-9]+\.)+[a-z0-9]+
+.

Note that I moved the IP pattern first, because the deny pattern would match IP addresses as well. You should try your patterns outside Nutch first.

Cheers,
Marcin
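The rules are applied top to bottom and the first rule whose pattern matches decides whether the URL is accepted, which is why the order matters here. A minimal sketch of that behaviour in Python (note this is an approximation for testing the patterns; Nutch itself uses a Java regexp engine, whose syntax may differ in corner cases):

```python
import re

# Marcin's suggested rules as (sign, pattern) pairs, in order.
rules = [
    ('+', r'^http://([0-9]{1,3}\.){3}[0-9]{1,3}'),  # allow IP addresses
    ('-', r'^http://([a-z0-9]+\.)+[a-z0-9]+'),      # deny dotted hostnames
    ('+', r'.'),                                    # allow everything else
]

def accepts(url):
    """The first rule whose pattern is found in the URL decides."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == '+'
    return False  # no rule matched: the URL is rejected

for url in ('http://www.google.com',
            'http://shoppingcenter',
            'http://192.168.101.5'):
    print(url, '->', 'crawl' if accepts(url) else 'skip')
# http://www.google.com -> skip
# http://shoppingcenter -> crawl
# http://192.168.101.5 -> crawl
```

With the deny rule first, http://192.168.101.5 would be rejected too, because the character class [a-z0-9] also matches digits; putting the IP allow rule first avoids that.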
Re: [Nutch-general] I don't want to crawl internet sites
If you are unsure of your regex, you might want to try this regexp applet: http://jakarta.apache.org/regexp/applet.html

Also, I do all my filtering in crawl-urlfilter.txt. I guess you must as well, unless you have configured crawl-tool.xml to point urlfilter.regex.file at your other file:

<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>

-Ronny
Re: [Nutch-general] I don't want to crawl internet sites
I am not configuring crawl-urlfilter.txt because I am not using the bin/nutch crawl tool. Instead I am calling bin/nutch generate, fetch, update, etc. from a script. In that case I should be configuring regex-urlfilter.txt instead of crawl-urlfilter.txt. Am I right?
Re: [Nutch-general] I don't want to crawl internet sites
Okay. I think the crawl tool is meant for intranet fetching, and I thought that was what you wanted: http://lucene.apache.org/nutch/tutorial8.html

Anyway, I do not have any experience with not using crawl, so I suppose someone else must help you.
Re: [Nutch-general] I don't want to crawl internet sites
On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
> I am not configuring crawl-urlfilter.txt because I am not using the bin/nutch crawl tool. Instead I am calling bin/nutch generate, fetch, update, etc. from a script. In that case I should be configuring regex-urlfilter.txt instead of crawl-urlfilter.txt. Am I right?

Yes, if you are not using the 'crawl' command, you should change regex-urlfilter.txt.

--
Doğacan Güney