[Nutch-general] I don't want to crawl internet sites
This is my regex-urlfilter.txt file:

-^http://([a-z0-9]*\.)+
+^http://([0-9]+\.)+
+.

I want to allow only IP addresses and internal sites to be crawled and fetched. That means:

http://www.google.com should be ignored
http://shoppingcenter should be crawled
http://192.168.101.5 should be crawled

But when I look at the logs, I find that http://someone.blogspot.com/ has also been crawled. How is that possible? Is my regex-urlfilter.txt wrong? Are there other URL filters? If so, in what order are the filters applied?

___
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
[Nutch-general] Nutch on Windows. ssh: command not found
Hello. I am trying to run the shell scripts that start Nutch. I use Windows XP, so I installed Cygwin. When I execute bin/start-all.sh, I get the following messages:

localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line 45: ssh: command not found
localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line 45: ssh: command not found

Could you help me with this problem?
Re: [Nutch-general] I don't want to crawl internet sites
Your regexps are incorrect: both of them require the address to end with a dot. Try this instead:

+^http://([0-9]{1,3}\.){3}[0-9]{1,3}
-^http://([a-z0-9]+\.)+[a-z0-9]+
+.

Note that I moved the IP pattern first, because the deny pattern matches IP addresses as well. You should try your patterns outside Nutch first.

Cheers,
Marcin

On 5/30/07, Manoharam Reddy wrote:
> This is my regex-urlfilter.txt file. -^http://([a-z0-9]*\.)+ +^http://([0-9]+\.)+ +. I want to allow only IP addresses and internal sites to be crawled and fetched. [...] Is my regex-urlfilter.txt wrong? Are there other URL filters? If so, in what order are the filters called?
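To follow Marcin's advice about testing the patterns outside Nutch first, here is a minimal sketch using grep -E (which is close enough to Java regex for these patterns). The `filter` function is a hypothetical stand-in for Nutch's first-matching-rule-wins behaviour, not Nutch code:

```shell
#!/bin/sh
# Sanity-check the proposed filter rules outside Nutch.
# Rules are tried in order; the first match decides, as in regex-urlfilter.txt.
filter() {
  url="$1"
  if echo "$url" | grep -Eq '^http://([0-9]{1,3}\.){3}[0-9]{1,3}'; then
    echo "ACCEPT $url"   # +^http://([0-9]{1,3}\.){3}[0-9]{1,3}
  elif echo "$url" | grep -Eq '^http://([a-z0-9]+\.)+[a-z0-9]+'; then
    echo "REJECT $url"   # -^http://([a-z0-9]+\.)+[a-z0-9]+
  else
    echo "ACCEPT $url"   # +.  (everything else, e.g. dotless intranet hosts)
  fi
}
filter http://www.google.com   # REJECT (dotted external hostname)
filter http://shoppingcenter   # ACCEPT (no dot, falls through to +.)
filter http://192.168.101.5    # ACCEPT (IP rule matches first)
```

This also shows why rule order matters: an IP address contains only characters from [a-z0-9] and dots, so the deny rule would match it if tried first.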
Re: [Nutch-general] I don't want to crawl internet sites
If you are unsure of your regex, you might want to try this regex applet: http://jakarta.apache.org/regexp/applet.html

Also, I do all my filtering in crawl-urlfilter.txt. I guess you must too, unless you have configured crawl-tool.xml to point urlfilter.regex.file at a different file:

<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>

-Ronny

-----Original message-----
From: Manoharam Reddy
Sent: 30 May 2007 13:42
Subject: I don't want to crawl internet sites
> This is my regex-urlfilter.txt file. [...] Is my regex-urlfilter.txt wrong? Are there other URL filters? If so, in what order are the filters called?
Re: [Nutch-general] I don't want to crawl internet sites
I am not configuring crawl-urlfilter.txt because I am not using the bin/nutch crawl tool. Instead I am calling bin/nutch generate, fetch, updatedb, etc. from a script. In that case I should be configuring regex-urlfilter.txt instead of crawl-urlfilter.txt. Am I right?

On 5/30/07, Naess, Ronny wrote:
> If you are unsure of your regex you might want to try this regex applet: http://jakarta.apache.org/regexp/applet.html Also, I do all my filtering in crawl-urlfilter.txt. I guess you must too, unless you have configured crawl-tool.xml to use your other file. [...]
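For readers following the thread, the script-driven alternative to bin/nutch crawl (the case where regex-urlfilter.txt, not crawl-urlfilter.txt, applies) is a loop over the individual tools. A dry-run sketch with hypothetical paths; the echo lines only print the commands a real script would execute:

```shell
#!/bin/sh
# Dry-run sketch of a script-driven crawl cycle (Nutch 0.9 era tools).
# Paths are hypothetical; drop the "echo" to run the real commands.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
for i in 1 2 3; do
  echo "bin/nutch generate $CRAWLDB $SEGMENTS"
  SEGMENT=$SEGMENTS/latest   # hypothetical: in practice, the newest dir under $SEGMENTS
  echo "bin/nutch fetch $SEGMENT"
  echo "bin/nutch updatedb $CRAWLDB $SEGMENT"
done
```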
Re: [Nutch-general] I don't want to crawl internet sites
OK. I think the crawl tool is meant for intranet fetching, and I thought that was what you wanted: http://lucene.apache.org/nutch/tutorial8.html

Anyway, I have no experience with not using crawl, so I suppose someone else must help you.

-Ronny

-----Original message-----
From: Manoharam Reddy
Sent: 30 May 2007 14:58
Subject: Re: I don't want to crawl internet sites
> I am not configuring crawl-urlfilter.txt because I am not using the bin/nutch crawl tool. [...] Am I right?
Re: [Nutch-general] I don't want to crawl internet sites
On 5/30/07, Manoharam Reddy wrote:
> I am not configuring crawl-urlfilter.txt because I am not using the bin/nutch crawl tool. Instead I am calling bin/nutch generate, fetch, update, etc. from a script. In that case I should be configuring regex-urlfilter.txt instead of crawl-urlfilter.txt. Am I right?

Yes, if you are not using the 'crawl' command you should change regex-urlfilter.txt.

--
Doğacan Güney
[Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?
Time and again I get this error, and as a result the segment remains incomplete. This wastes one iteration of the for() loop in which I am doing generate, fetch and update. Can someone please tell me what measures I can take to avoid this error?

And isn't it possible to make some code changes so that the whole fetch doesn't have to stop suddenly when this error occurs? Can't we do something in the code so that the fetch still continues, as in the case of a SocketException, where the fetch while(1) loop carries on? If that is not possible, please tell me how I can prevent this error from happening.

ERROR - fetch of http://telephony/register.asp failed with: java.lang.OutOfMemoryError: Java heap space
java.lang.NullPointerException
  at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
  at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
  ...
  at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
(the same NullPointerException trace repeats for a second fetcher thread)
Fetcher: java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
  at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
  at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
  at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
  at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?
On 5/30/07, Manoharam Reddy wrote:
> Time and again I get this error and as a result the segment remains incomplete. [...] If it is not possible, please tell me how can I prevent this error from happening?

Are you also parsing during fetch? If you are, I would suggest running the Fetcher in non-parsing mode.

--
Doğacan Güney
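For readers wondering how to switch modes: in the Nutch of this era, non-parsing fetch is controlled by the fetcher.parse property, which can be overridden in nutch-site.xml. A sketch (values and placement are the usual convention, not taken from this thread):

```
<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content while fetching;
  if false, parsing is done later as a separate step.</description>
</property>
```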
Re: [Nutch-general] Nutch on Windows. ssh: command not found
So, when in Cygwin, if you type 'ssh' (without the quotes), do you get the same error? If so, you need to go back into the Cygwin setup and install ssh.

On 5/30/07, Ilya Vishnevsky wrote:
> Hello. I try to run shell scripts starting Nutch. I use Windows XP, so I installed cygwin. When I execute bin/start-all.sh, I get: ssh: command not found. Could you help me with this problem?
[Nutch-general] Speed up indexing....
Does anyone have any good configuration ideas for indexing/merging with 0.9 using Hadoop on a local fs? Our segment merging is taking an extremely long time compared with Nutch 0.7. Currently I am trying to merge 300 segments, which amounts to about 1 GB of data. It has taken hours to merge, and it's still not done. This box has dual Xeon 2.8 GHz processors with 4 GB of RAM. So I figure there must be a better setup in mapred-default.xml for a single machine. Do I increase the sizes of the I/O buffers, sort buffers, etc.? Do I reduce the number of tasks or increase them? I'm at a loss. Any advice would be greatly appreciated.
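No authoritative answer appears in this thread, but the knobs being asked about live in hadoop-site.xml / mapred-default.xml. A hedged starting point for a single machine; the values here are illustrative assumptions to tune from, not recommendations from the list:

```
<!-- Larger I/O and sort buffers for a single beefy machine (example values) -->
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>
<!-- Fewer tasks than the distributed defaults, since everything is local -->
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>
```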
[Nutch-general] Parallelizing URLFiltering
Is there a way of parallelizing URL filtering over multiple threads? After all, the URLFilters themselves must already be thread-safe, or else they would have problems during fetching. The reason I'm asking is that I have a custom URLFilter that needs to make calls to the DNS resolver, and multi-threading the URL filtering would greatly speed up some filtering procedures that, unlike fetching, appear to be single-threaded: mergedb -filter, inject, generate, updatedb -filter, etc. (The most important is of course generate or, even better, updatedb -filter, to prevent undesired URLs from reaching the crawldb in the first place.)

Enzo
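Outside Nutch, the effect Enzo is after (fanning a slow per-URL check out over parallel workers, useful when each check blocks on DNS) can be sketched with xargs -P. The filter here is a stub (echo), not a DNS-based URLFilter, so the example stays self-contained:

```shell
#!/bin/sh
# Sketch: run a per-URL "filter" over 4 parallel workers with xargs -P.
# In the real use case each worker would do a blocking DNS lookup, which
# is exactly where parallelism pays off; here the filter is just echo.
printf '%s\n' \
  http://192.168.101.5 \
  http://shoppingcenter \
  http://www.google.com \
| xargs -P 4 -n 1 -I{} sh -c 'echo "filtered {}"'
```

Inside Nutch itself the filter chain is invoked sequentially by the map tasks of jobs like inject and generate, so short of patching those jobs, increasing the number of map tasks is the closest built-in lever.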
Re: [Nutch-general] Nutch on Windows. ssh: command not found
The ssh client is provided by the OpenSSH package, which can be installed through the Cygwin setup (under the Net category).

Enzo

----- Original Message -----
From: Ilya Vishnevsky
Sent: Wednesday, May 30, 2007 7:56 PM
Subject: Nutch on Windows. ssh: command not found
> Hello. I try to run shell scripts starting Nutch. [...] ssh: command not found. Could you help me with this problem?
Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?
If I run the fetcher in non-parsing mode, how can I later parse the pages so that, when a user searches in the Nutch search engine, he can see the content of PDF files etc. in the summary? Please help, or point me to articles or a wiki where I can learn this.

On 5/30/07, Doğacan Güney wrote:
> Are you also parsing during fetch? If you are, I would suggest running Fetcher in non-parsing mode.
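No reply appears in this archive, but the usual answer with the 0.9 tool set is that a segment fetched in non-parsing mode is parsed afterwards as its own step, then fed to the indexing chain. A dry-run sketch with a hypothetical segment path; the echo lines only print the commands a script would run:

```shell
#!/bin/sh
# Dry-run: parse a segment fetched with fetcher.parse=false, then index it.
# The segment path is hypothetical; drop "echo" to execute for real.
SEGMENT=crawl/segments/20070530123456
echo "bin/nutch parse $SEGMENT"                                   # run the parsers (PDF, HTML, ...)
echo "bin/nutch updatedb crawl/crawldb $SEGMENT"                  # merge parse results into the crawldb
echo "bin/nutch invertlinks crawl/linkdb $SEGMENT"                # build the link database
echo "bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $SEGMENT"
```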