[Nutch-general] I don't want to crawl internet sites
This is my regex-urlfilter.txt file:

-^http://([a-z0-9]*\.)+
+^http://([0-9]+\.)+
+.

I want to allow only IP addresses and internal sites to be crawled and fetched. That means:

http://www.google.com should be ignored
http://shoppingcenter should be crawled
http://192.168.101.5 should be crawled

But when I look at the logs, I find that http://someone.blogspot.com/ has also been crawled. How is that possible? Is my regex-urlfilter.txt wrong? Are there other URL filters? If so, in what order are the filters applied?

___
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
[Nutch-general] Nutch on Windows. ssh: command not found
Hello. I am trying to run the shell scripts that start Nutch. I use Windows XP, so I installed Cygwin. When I execute bin/start-all.sh, I get the following messages:

localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line 45: ssh: command not found
localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line 45: ssh: command not found

Could you help me with this problem?
Re: [Nutch-general] I don't want to crawl internet sites
Your regexps are incorrect: both of them require the address to end with a dot. Try this instead:

+^http://([0-9]{1,3}\.){3}[0-9]{1,3}
-^http://([a-z0-9]+\.)+[a-z0-9]+
+.

Note that I moved the IP pattern first, because the deny pattern matches IP addresses as well. You should try your patterns outside Nutch first.

Cheers,
Marcin

On 5/30/07, Manoharam Reddy wrote:
> This is my regex-urlfilter.txt file. -^http://([a-z0-9]*\.)+ +^http://([0-9]+\.)+ +. I want to allow only IP addresses and internal sites to be crawled and fetched. [...] Is my regex-urlfilter.txt wrong? Are there other URL filters? If so, in what order are the filters called?
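To follow Marcin's advice about testing the patterns outside Nutch first, here is a minimal sketch using grep -E (which is close enough to Java regex for these patterns). The `filter` function is a hypothetical stand-in for Nutch's first-matching-rule-wins behaviour, not Nutch code:

```shell
#!/bin/sh
# Sanity-check the proposed filter rules outside Nutch.
# Rules are tried in order; the first match decides, as in regex-urlfilter.txt.
filter() {
  url="$1"
  if echo "$url" | grep -Eq '^http://([0-9]{1,3}\.){3}[0-9]{1,3}'; then
    echo "ACCEPT $url"   # +^http://([0-9]{1,3}\.){3}[0-9]{1,3}
  elif echo "$url" | grep -Eq '^http://([a-z0-9]+\.)+[a-z0-9]+'; then
    echo "REJECT $url"   # -^http://([a-z0-9]+\.)+[a-z0-9]+
  else
    echo "ACCEPT $url"   # +.  (everything else, e.g. dotless intranet hosts)
  fi
}
filter http://www.google.com   # REJECT (dotted external hostname)
filter http://shoppingcenter   # ACCEPT (no dot, falls through to +.)
filter http://192.168.101.5    # ACCEPT (IP rule matches first)
```

This also shows why rule order matters: an IP address contains only characters from [a-z0-9] and dots, so the deny rule would match it if tried first.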
Re: [Nutch-general] I don't want to crawl internet sites
If you are unsure of your regex, you might want to try this regex applet: http://jakarta.apache.org/regexp/applet.html

Also, I do all my filtering in crawl-urlfilter.txt. I guess you must too, unless you have configured crawl-tool.xml to point urlfilter.regex.file at a different file:

<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>

-Ronny

-----Original message-----
From: Manoharam Reddy
Sent: 30 May 2007 13:42
Subject: I don't want to crawl internet sites
> This is my regex-urlfilter.txt file. [...] Is my regex-urlfilter.txt wrong? Are there other URL filters? If so, in what order are the filters called?
Re: [Nutch-general] I don't want to crawl internet sites
I am not configuring crawl-urlfilter.txt because I am not using the bin/nutch crawl tool. Instead I am calling bin/nutch generate, fetch, updatedb, etc. from a script. In that case I should be configuring regex-urlfilter.txt instead of crawl-urlfilter.txt. Am I right?

On 5/30/07, Naess, Ronny wrote:
> If you are unsure of your regex you might want to try this regex applet: http://jakarta.apache.org/regexp/applet.html Also, I do all my filtering in crawl-urlfilter.txt. I guess you must too, unless you have configured crawl-tool.xml to use your other file. [...]
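For readers following the thread, the script-driven alternative to bin/nutch crawl (the case where regex-urlfilter.txt, not crawl-urlfilter.txt, applies) is a loop over the individual tools. A dry-run sketch with hypothetical paths; the echo lines only print the commands a real script would execute:

```shell
#!/bin/sh
# Dry-run sketch of a script-driven crawl cycle (Nutch 0.9 era tools).
# Paths are hypothetical; drop the "echo" to run the real commands.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
for i in 1 2 3; do
  echo "bin/nutch generate $CRAWLDB $SEGMENTS"
  SEGMENT=$SEGMENTS/latest   # hypothetical: in practice, the newest dir under $SEGMENTS
  echo "bin/nutch fetch $SEGMENT"
  echo "bin/nutch updatedb $CRAWLDB $SEGMENT"
done
```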
Re: [Nutch-general] I don't want to crawl internet sites
OK. I think the crawl tool is meant for intranet fetching, and I thought that was what you wanted: http://lucene.apache.org/nutch/tutorial8.html

Anyway, I have no experience with not using crawl, so I suppose someone else must help you.

-Ronny

-----Original message-----
From: Manoharam Reddy
Sent: 30 May 2007 14:58
Subject: Re: I don't want to crawl internet sites
> I am not configuring crawl-urlfilter.txt because I am not using the bin/nutch crawl tool. [...] Am I right?
Re: [Nutch-general] I don't want to crawl internet sites
On 5/30/07, Manoharam Reddy wrote:
> I am not configuring crawl-urlfilter.txt because I am not using the bin/nutch crawl tool. Instead I am calling bin/nutch generate, fetch, update, etc. from a script. In that case I should be configuring regex-urlfilter.txt instead of crawl-urlfilter.txt. Am I right?

Yes, if you are not using the 'crawl' command you should change regex-urlfilter.txt.

--
Doğacan Güney
[Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?
Time and again I get this error, and as a result the segment remains incomplete. This wastes one iteration of the for() loop in which I am doing generate, fetch and update. Can someone please tell me what measures I can take to avoid this error?

And isn't it possible to make some code changes so that the whole fetch doesn't have to stop suddenly when this error occurs? Can't we do something in the code so that the fetch still continues, as in the case of a SocketException, where the fetch while(1) loop carries on? If that is not possible, please tell me how I can prevent this error from happening.

ERROR - fetch of http://telephony/register.asp failed with: java.lang.OutOfMemoryError: Java heap space
java.lang.NullPointerException
  at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
  at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
  ...
  at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
(the same NullPointerException trace repeats for a second fetcher thread)
Fetcher: java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
  at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
  at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
  at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
  at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?
On 5/30/07, Manoharam Reddy wrote:
> Time and again I get this error and as a result the segment remains incomplete. [...] If it is not possible, please tell me how can I prevent this error from happening?

Are you also parsing during fetch? If you are, I would suggest running the Fetcher in non-parsing mode.

--
Doğacan Güney
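For readers wondering how to switch modes: in the Nutch of this era, non-parsing fetch is controlled by the fetcher.parse property, which can be overridden in nutch-site.xml. A sketch (values and placement are the usual convention, not taken from this thread):

```
<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content while fetching;
  if false, parsing is done later as a separate step.</description>
</property>
```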
Re: [Nutch-general] Nutch on Windows. ssh: command not found
So, when in Cygwin, if you type 'ssh' (without the quotes), do you get the same error? If so, you need to go back into the Cygwin setup and install ssh.

On 5/30/07, Ilya Vishnevsky wrote:
> Hello. I try to run shell scripts starting Nutch. I use Windows XP, so I installed cygwin. When I execute bin/start-all.sh, I get: ssh: command not found. Could you help me with this problem?
[Nutch-general] Speed up indexing....
Does anyone have any good configuration ideas for indexing/merging with 0.9 using Hadoop on a local fs? Our segment merging is taking an extremely long time compared with Nutch 0.7. Currently I am trying to merge 300 segments, which amounts to about 1 GB of data. It has taken hours to merge, and it's still not done. This box has dual Xeon 2.8 GHz processors with 4 GB of RAM. So I figure there must be a better setup in mapred-default.xml for a single machine. Do I increase the sizes of the I/O buffers, sort buffers, etc.? Do I reduce the number of tasks or increase them? I'm at a loss. Any advice would be greatly appreciated.
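No authoritative answer appears in this thread, but the knobs being asked about live in hadoop-site.xml / mapred-default.xml. A hedged starting point for a single machine; the values here are illustrative assumptions to tune from, not recommendations from the list:

```
<!-- Larger I/O and sort buffers for a single beefy machine (example values) -->
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>
<!-- Fewer tasks than the distributed defaults, since everything is local -->
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>
```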
[Nutch-general] Parallelizing URLFiltering
Is there a way of parallelizing URL filtering over multiple threads? After all, the URLFilters themselves must already be thread-safe, or else they would have problems during fetching. The reason I'm asking is that I have a custom URLFilter that needs to make calls to the DNS resolver, and multi-threading the URL filtering would greatly speed up some filtering procedures that, unlike fetching, appear to be single-threaded: mergedb -filter, inject, generate, updatedb -filter, etc. (The most important is of course generate or, even better, updatedb -filter, to prevent undesired URLs from reaching the crawldb in the first place.)

Enzo
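Outside Nutch, the effect Enzo is after (fanning a slow per-URL check out over parallel workers, useful when each check blocks on DNS) can be sketched with xargs -P. The filter here is a stub (echo), not a DNS-based URLFilter, so the example stays self-contained:

```shell
#!/bin/sh
# Sketch: run a per-URL "filter" over 4 parallel workers with xargs -P.
# In the real use case each worker would do a blocking DNS lookup, which
# is exactly where parallelism pays off; here the filter is just echo.
printf '%s\n' \
  http://192.168.101.5 \
  http://shoppingcenter \
  http://www.google.com \
| xargs -P 4 -n 1 -I{} sh -c 'echo "filtered {}"'
```

Inside Nutch itself the filter chain is invoked sequentially by the map tasks of jobs like inject and generate, so short of patching those jobs, increasing the number of map tasks is the closest built-in lever.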
Re: [Nutch-general] Nutch on Windows. ssh: command not found
The ssh client is provided by the OpenSSH package, which can be installed through the Cygwin setup (under the Net category).

Enzo

----- Original Message -----
From: Ilya Vishnevsky
Sent: Wednesday, May 30, 2007 7:56 PM
Subject: Nutch on Windows. ssh: command not found
> Hello. I try to run shell scripts starting Nutch. [...] ssh: command not found. Could you help me with this problem?
Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?
If I run the fetcher in non-parsing mode, how can I later parse the pages so that, when a user searches in the Nutch search engine, he can see the content of PDF files etc. in the summary? Please help, or point me to articles or a wiki where I can learn this.

On 5/30/07, Doğacan Güney wrote:
> Are you also parsing during fetch? If you are, I would suggest running Fetcher in non-parsing mode.
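No reply appears in this archive, but the usual answer with the 0.9 tool set is that a segment fetched in non-parsing mode is parsed afterwards as its own step, then fed to the indexing chain. A dry-run sketch with a hypothetical segment path; the echo lines only print the commands a script would run:

```shell
#!/bin/sh
# Dry-run: parse a segment fetched with fetcher.parse=false, then index it.
# The segment path is hypothetical; drop "echo" to execute for real.
SEGMENT=crawl/segments/20070530123456
echo "bin/nutch parse $SEGMENT"                                   # run the parsers (PDF, HTML, ...)
echo "bin/nutch updatedb crawl/crawldb $SEGMENT"                  # merge parse results into the crawldb
echo "bin/nutch invertlinks crawl/linkdb $SEGMENT"                # build the link database
echo "bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $SEGMENT"
```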