This is my regex-urlfilter.txt file.
-^http://([a-z0-9]*\.)+
+^http://([0-9]+\.)+
+.
I want to allow only IP addresses and internal sites to be crawled and
fetched. This means:
http://www.google.com should be ignored
http://shoppingcenter should be crawled
http://192.168.101.5 should be crawled
Hello. I'm trying to run the shell scripts that start Nutch. I use Windows
XP, so I installed Cygwin. When I execute bin/start-all.sh, I get the
following messages:
localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line 45: ssh:
command not found
localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line
Your regexps are incorrect. Both of them require the address to end
with a dot. Try these:
+^http://([0-9]{1,3}\.){3}[0-9]{1,3}
-^http://([a-z0-9]+\.)+[a-z0-9]+
+.
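To sanity-check the rule order outside Nutch, a small script like the one below can help. It assumes first-match-wins semantics (each URL is tested against the rules top to bottom, and the sign of the first matching rule decides), which is how Nutch's RegexURLFilter behaves as far as I know:

```python
import re

# The three rules above, in order. First matching rule decides:
# '+' means crawl, '-' means skip.
RULES = [
    ('+', r'^http://([0-9]{1,3}\.){3}[0-9]{1,3}'),
    ('-', r'^http://([a-z0-9]+\.)+[a-z0-9]+'),
    ('+', r'.'),
]

def accepts(url):
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == '+'
    return False  # no rule matched: reject

for url in ('http://www.google.com',
            'http://shoppingcenter',
            'http://192.168.101.5'):
    print(url, '->', 'crawl' if accepts(url) else 'skip')
# http://www.google.com -> skip
# http://shoppingcenter -> crawl
# http://192.168.101.5 -> crawl
```

Note that if the IP rule came after the deny rule, the deny rule would match the IP address first and skip it, which is exactly why the order was swapped above.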
Note that I moved the IP pattern first, because the deny pattern matches
IP addresses as well. You should try your patterns first, outside Nutch.
If you are unsure of your regex, you might want to try this regex applet:
http://jakarta.apache.org/regexp/applet.html
Also, I do all my filtering in crawl-urlfilter.txt
I guess you do too, unless you have configured crawl-tool.xml to use
your other file.
I am not configuring crawl-urlfilter.txt because I am not using the
bin/nutch crawl tool. Instead I am calling bin/nutch generate,
fetch, update, etc. from a script.
In that case I should be configuring regex-urlfilter.txt instead of
crawl-urlfilter.txt. Am I right?
On 5/30/07, Naess, Ronny
OK. I think the crawl tool is for intranet fetching, and I thought that
was what you wanted?
http://lucene.apache.org/nutch/tutorial8.html
Anyway, I do not have any experience with not using crawl, so I suppose
someone else must help you.
-Original Message-
From: Manoharam Reddy
On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
I am not configuring crawl-urlfilter.txt because I am not using the
bin/nutch crawl tool. Instead I am calling bin/nutch generate,
fetch, update, etc. from a script.
In that case I should be configuring regex-urlfilter.txt instead of
crawl-urlfilter.txt. Am I right?
Time and again I get this error and as a result the segment remains
incomplete. This wastes one iteration of the for() loop in which I am
doing generate, fetch and update.
Can someone please tell me what are the measures I can take to avoid
this error? And isn't it possible to make some code
On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
Time and again I get this error and as a result the segment remains
incomplete. This wastes one iteration of the for() loop in which I am
doing generate, fetch and update.
Can someone please tell me what are the measures I can take to avoid
So, when in Cygwin, if you type 'ssh' (without the quotes), do you get
the same error? If so, then you need to go back into the Cygwin setup
and install ssh.
On 5/30/07, Ilya Vishnevsky [EMAIL PROTECTED] wrote:
Hello. I'm trying to run the shell scripts that start Nutch. I use Windows XP, so I
installed
Anyone have any good configuration ideas for indexing/merging with 0.9
using Hadoop on a local fs? Our segment merging is taking an
extremely long time compared with Nutch 0.7. Currently, I am trying
to merge 300 segments, which amounts to about 1 GB of data. It has
taken hours to merge, and
Is there a way of parallelizing URLFiltering over multiple threads? After
all, the URLFilters themselves must already be thread-safe, or else they
would have problems during fetching.
The reason why I'm asking is I have a custom URLFilter that needs to make
calls to the DNS resolver, and
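The general shape of that idea (fan the filter calls out over a thread pool, with a cache in front of the slow resolver) might look like this. The real URLFilter interface is Java; `resolve` below is a hypothetical stand-in for a DNS lookup, so this is only a sketch of the concurrency pattern:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Hypothetical resolver stand-in: pretend only 192.168.* hosts resolve
# as internal. A real filter would do an actual DNS lookup here, which
# is exactly why the cache and the thread pool pay off.
@lru_cache(maxsize=4096)
def resolve(host):
    return host.startswith('192.168.')

def allowed(url):
    host = url.split('//', 1)[1].split('/', 1)[0]
    return resolve(host)

urls = ['http://192.168.101.5/a', 'http://www.google.com/']
with ThreadPoolExecutor(max_workers=8) as pool:
    # map preserves input order, so verdicts line up with urls.
    verdicts = list(pool.map(allowed, urls))
# verdicts == [True, False]
```

Since `allowed` touches no shared mutable state (the `lru_cache` is internally locked), it is safe to call from many threads at once, which matches the observation that URLFilters must already be thread-safe during fetching.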
The ssh client is provided by the OpenSSH package, which can be installed
through the Cygwin setup (under the net category).
Enzo
- Original Message -
From: Ilya Vishnevsky [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, May 30, 2007 7:56 PM
Subject: Nutch on Windows. ssh:
If I run the fetcher in non-parsing mode, how can I later parse the pages
so that ultimately, when a user searches in the Nutch search engine, he
can see the content of PDF files, etc., as a summary? Please help or point
me to proper articles or a wiki where I can learn this.
On 5/30/07, Doğacan Güney [EMAIL