[Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Manoharam Reddy
This is my regex-urlfilter.txt file:

-^http://([a-z0-9]*\.)+
+^http://([0-9]+\.)+
+.

I want to allow only IP addresses and internal sites to be crawled and fetched. This means: http://www.google.com should be ignored; http://shoppingcenter should be crawled; http://192.168.101.5 should be

[Nutch-general] Nutch on Windows. ssh: command not found

2007-05-30 Thread Ilya Vishnevsky
Hello. I am trying to run the shell scripts that start Nutch. I use Windows XP, so I installed cygwin. When I execute bin/start-all.sh, I get the following messages: localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line 45: ssh: command not found localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line

Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Marcin Okraszewski
Your regexps are incorrect. Both of them require the address to end with a dot. Try these:

+^http://([0-9]{1,3}\.){3}[0-9]{1,3}
-^http://([a-z0-9]+\.)+[a-z0-9]+
+.

Note that I moved the IP pattern first, because the deny pattern matches IP addresses as well. You should try your patterns first, outside Nutch.
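One way to try the patterns outside Nutch, as suggested above, is a small standalone script. The sketch below is only an approximation: it uses Python's re module as a stand-in for the Java regular expressions Nutch uses, and it assumes that rules are checked in file order, that the first matching rule decides, that matching is an unanchored search, and that a URL matching no rule is rejected. The names RULES and accepts are made up for this illustration and are not part of Nutch.

```python
import re

# Rules in the order they would appear in regex-urlfilter.txt.
# '+' means accept, '-' means reject.
RULES = [
    ("+", r"^http://([0-9]{1,3}\.){3}[0-9]{1,3}"),  # IP addresses
    ("-", r"^http://([a-z0-9]+\.)+[a-z0-9]+"),      # dotted host names
    ("+", r"."),                                    # everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches url is an accept rule.

    Assumptions (not taken from the Nutch source): first match wins,
    patterns are searched rather than fully matched, and a URL that
    matches no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in ("http://www.google.com",
                "http://shoppingcenter",
                "http://192.168.101.5"):
        print(url, "->", "crawl" if accepts(url) else "ignore")
```

Swapping the first two rules in RULES shows why the order matters: the deny pattern also matches http://192.168.101.5, so with the deny rule first the IP address would be rejected.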

Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Naess, Ronny
If you are unsure of your regex, you might want to try this regex applet: http://jakarta.apache.org/regexp/applet.html Also, I do all my filtering in crawl-urlfilter.txt. I guess you must too, unless you have configured crawl-tool.xml to use your other file. property

Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Manoharam Reddy
I am not configuring crawl-urlfilter.txt because I am not using the bin/nutch crawl tool. Instead I am calling bin/nutch generate, fetch, update, etc. from a script. In that case I should be configuring regex-urlfilter.txt instead of crawl-urlfilter.txt. Am I right? On 5/30/07, Naess, Ronny

Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Naess, Ronny
Oki. I think the crawl tool is for intranet fetching, and I thought that was what you wanted? http://lucene.apache.org/nutch/tutorial8.html Anyway, I do not have any experience of not using crawl, so I suppose someone else must help you. -Original message- From: Manoharam Reddy

Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Doğacan Güney
On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote: I am not configuring crawl-urlfilter.txt because I am not using the bin/nutch crawl tool. Instead I am calling bin/nutch generate, fetch, update, etc. from a script. In that case I should be configuring regex-urlfilter.txt instead of

[Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-30 Thread Manoharam Reddy
Time and again I get this error and as a result the segment remains incomplete. This wastes one iteration of the for() loop in which I am doing generate, fetch and update. Can someone please tell me what are the measures I can take to avoid this error? And isn't it possible to make some code

Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-30 Thread Doğacan Güney
On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote: Time and again I get this error and as a result the segment remains incomplete. This wastes one iteration of the for() loop in which I am doing generate, fetch and update. Can someone please tell me what are the measures I can take to avoid

Re: [Nutch-general] Nutch on Windows. ssh: command not found

2007-05-30 Thread Briggs
So, when in cygwin, if you type 'ssh' (without the quotes), do you get the same error? If so, then you need to go back into the cygwin setup and install ssh. On 5/30/07, Ilya Vishnevsky [EMAIL PROTECTED] wrote: Hello. I try to run shell scripts starting Nutch. I use Windows XP, so I installed

[Nutch-general] Speed up indexing....

2007-05-30 Thread Briggs
Anyone have any good configuration ideas for indexing/merging with 0.9 using hadoop on a local fs? Our segment merging is taking an extremely long time compared with nutch 0.7. Currently, I am trying to merge 300 segments, which amounts to about 1gig of data. It has taken hours to merge, and

[Nutch-general] Parallelizing URLFiltering

2007-05-30 Thread Enzo Michelangeli
Is there a way of parallelizing URLFiltering over multiple threads? After all, the URLFilters themselves must already be thread-safe, or else they would have problems during fetching. The reason why I'm asking is I have a custom URLFilter that needs to make calls to the DNS resolver, and

Re: [Nutch-general] Nutch on Windows. ssh: command not found

2007-05-30 Thread Enzo Michelangeli
The ssh client is provided by the OpenSSH package, which can be installed through the Cygwin setup (under the net category). Enzo - Original Message - From: Ilya Vishnevsky [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, May 30, 2007 7:56 PM Subject: Nutch on Windows. ssh:

Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-30 Thread Manoharam Reddy
If I run fetcher in non-parsing mode how can I later parse the pages so that ultimately when a user searches in the Nutch search engine, he can see the content of PDF files, etc as summary? Please help or point me to proper articles or wiki where I can learn this. On 5/30/07, Doğacan Güney [EMAIL