[Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Manoharam Reddy
This is my regex-urlfilter.txt file.


-^http://([a-z0-9]*\.)+
+^http://([0-9]+\.)+
+.

I want to allow only IP addresses and internal sites to be crawled and
fetched. This means:

http://www.google.com should be ignored
http://shoppingcenter should be crawled
http://192.168.101.5 should be crawled.

But when I see the logs, I find that http://someone.blogspot.com/ has
also been crawled. How is this possible?

Is my regex-urlfilter.txt wrong? Are there other URL filters? If so,
in what order are the filters called?



Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Marcin Okraszewski
Your regexps are incorrect. Both of them require the address to end
with a dot. Try this instead:

+^http://([0-9]{1,3}\.){3}[0-9]{1,3}
-^http://([a-z0-9]+\.)+[a-z0-9]+
+.

Note that I moved the IP pattern first, because the deny pattern would
match IP addresses as well. You should try your patterns outside Nutch
first.
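
By the way, if you want to test the whole file at once, a few lines of
Java will do it. This is only my quick sketch, not Nutch's actual
RegexURLFilter (I am assuming partial-match find() semantics here), but
the first-match-wins, reject-by-default behaviour is the same idea:

import java.util.regex.Pattern;

// Rough stand-in for the regex URL filter: rules are tried top-down,
// the sign of the first matching rule decides, and a URL that matches
// no rule at all is rejected.
public class FilterCheck {
    public static void main(String[] args) {
        String[] rules = {
            "+^http://([0-9]{1,3}\\.){3}[0-9]{1,3}",
            "-^http://([a-z0-9]+\\.)+[a-z0-9]+",
            "+."
        };
        String[] urls = {
            "http://www.google.com",        // should be rejected
            "http://shoppingcenter",        // should be accepted
            "http://192.168.101.5",         // should be accepted
            "http://someone.blogspot.com/"  // should be rejected
        };
        for (String url : urls) {
            String verdict = "rejected (no rule matched)";
            for (String rule : rules) {
                if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                    verdict = (rule.charAt(0) == '+' ? "accepted" : "rejected")
                            + " by " + rule;
                    break;
                }
            }
            System.out.println(url + " -> " + verdict);
        }
    }
}

If you run the same loop over your original three rules, you should
also see the deny line swallowing http://192.168.101.5, since [a-z0-9]
matches digits too; that is why the IP rule has to come first.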

Cheers,
Marcin





Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Naess, Ronny
If you are unsure of your regexes, you might want to try this regex applet:

http://jakarta.apache.org/regexp/applet.html

Also, I do all my filtering in crawl-urlfilter.txt. I guess you must
too, unless you have configured crawl-tool.xml to use your other file:

<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>
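
If I remember right, that property goes inside the configuration
element of conf/crawl-tool.xml, roughly like this (a sketch from
memory, your copy may carry more properties):

<?xml version="1.0"?>
<!-- conf/crawl-tool.xml (sketch from memory) -->
<configuration>
  <property>
    <name>urlfilter.regex.file</name>
    <value>crawl-urlfilter.txt</value>
  </property>
</configuration>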

-Ronny







Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Manoharam Reddy
I am not configuring crawl-urlfilter.txt because I am not using the
bin/nutch crawl tool. Instead I am calling bin/nutch generate,
fetch, update, etc. from a script.

In that case I should be configuring regex-urlfilter.txt instead of
crawl-urlfilter.txt. Am I right?






Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Naess, Ronny
OK. I think the crawl tool is for intranet fetching, and I thought
that was what you wanted.

http://lucene.apache.org/nutch/tutorial8.html

Anyway, I have no experience with running without the crawl tool, so I
suppose someone else must help you.







Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Doğacan Güney
On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
 I am not configuring crawl-urlfilter.txt because I am not using the
 bin/nutch crawl tool. Instead I am calling bin/nutch generate,
 fetch, update, etc. from a script.

 In that case I should be configuring regex-urlfilter.txt instead of
 crawl-urlfilter.txt. Am I right?

Yes, if you are not using the 'crawl' command, you should change regex-urlfilter.txt.
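
For an intranet-only setup like yours, something along the lines of
Marcin's rules should work in regex-urlfilter.txt (an untested sketch;
the first matching line wins):

# accept bare IP hosts
+^http://([0-9]{1,3}\.){3}[0-9]{1,3}
# reject dotted internet hostnames
-^http://([a-z0-9]+\.)+[a-z0-9]+
# accept whatever is left (e.g. http://shoppingcenter)
+.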





-- 
Doğacan Güney