Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-31 Thread Vishal Shah
Hi Manoharam, You can use the parse command to parse a segment after it has been fetched with the -noParsing option. The result will be equivalent to running fetch without the -noParsing option. In your Nutch installation directory, try the command bin/nutch. It will give you the usage for the parse
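The deferred-parsing sequence described above would look roughly like this (the segment name is illustrative; parse is a separate step only because the fetch was run with -noParsing):

```
bin/nutch fetch crawl/segments/20070531000001 -noParsing
bin/nutch parse crawl/segments/20070531000001
```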

[Nutch-general] How to parse PDF files? Deferred parsing possible?

2007-05-31 Thread Manoharam Reddy
I am crawling pages using the following commands in a loop iterating 10 times: bin/nutch generate crawl/crawldb crawl/segments -topN 1000 seg1=`ls -d crawl/segments/* | tail -1` bin/nutch fetch $seg1 -threads 50 bin/nutch updatedb crawl/crawldb $seg1 I am getting the following
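A runnable sketch of that loop (bin/nutch is stubbed with a shell function here, since the point is the loop structure and the `tail -1` segment selection; segment directory names are made up and the stub runs only 3 of the 10 iterations):

```shell
#!/bin/sh
# Sketch of the generate / fetch / updatedb loop from the post above.
nutch() { echo "nutch $*"; }        # stub; use bin/nutch in a real crawl

mkdir -p crawl/segments
for i in 1 2 3; do                  # the original loop runs 10 iterations
  nutch generate crawl/crawldb crawl/segments -topN 1000
  mkdir -p "crawl/segments/2007053100000$i"   # generate creates a new segment dir
  seg1=$(ls -d crawl/segments/* | tail -1)    # newest segment sorts last
  nutch fetch "$seg1" -threads 50
  nutch updatedb crawl/crawldb "$seg1"
done
echo "last segment: $seg1"
```

The `tail -1` trick works because segment directories are named by timestamp, so lexicographic order matches creation order.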

Re: [Nutch-general] How to parse PDF files? Deferred parsing possible?

2007-05-31 Thread Doğacan Güney
On 5/31/07, Manoharam Reddy [EMAIL PROTECTED] wrote: I am crawling pages using the following commands in a loop iterating 10 times:- bin/nutch generate crawl/crawldb crawl/segments -topN 1000 seg1=`ls -d crawl/segments/* | tail -1` bin/nutch fetch $seg1 -threads 50 bin/nutch

Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-31 Thread Manoharam Reddy
Thanks. I do my crawl using the Intranet Recrawl script available in the wiki. I have put these statements in a loop iterating 10 times. 1. bin/nutch generate crawl/crawldb crawl/segments -topN 1000 2. seg1=`ls -d crawl/segments/* | tail -1` 3. bin/nutch fetch $seg1 -threads 50 4. bin/nutch

Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-31 Thread Doğacan Güney
On 5/31/07, Manoharam Reddy [EMAIL PROTECTED] wrote: Thanks. I do my crawl using the Intranet Recrawl script available in the wiki. I have put these statements in a loop iterating 10 times. 1. bin/nutch generate crawl/crawldb crawl/segments -topN 1000 2. seg1=`ls -d crawl/segments/* |

[Nutch-general] What is parse-oo and why doesn't parsed PDF content show up in cached.jsp ?

2007-05-31 Thread Manoharam Reddy
Some confusion regarding plugin.includes: 1. I find a parse-oo in the plugins folder. What is that for? 2. I have enabled parse-pdf by including it in plugin.includes of nutch-site.xml. The pages now come up in the search results, but when I visit the cached page of a result, it shows a message like
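As an illustration of step 2, a nutch-site.xml fragment enabling parse-pdf might look like this (the surrounding plugin list is an example, not the poster's actual configuration):

```xml
<property>
  <name>plugin.includes</name>
  <!-- parse-pdf added to an otherwise typical list; adjust to your installation -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
```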

[Nutch-general] Any URL filter available for search.jsp?

2007-05-31 Thread Manoharam Reddy
I want to use two filters: one for crawling and another for searching through search.jsp. I am currently using regex-urlfilter.txt for the generate, fetch, update cycle. But when a user searches, I do not want certain crawled sites to show up in the results. How can this be

Re: [Nutch-general] Fetcher2 slowness?

2007-05-31 Thread Doğacan Güney
On 5/18/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: On 5/18/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: Hi everyone, Has anyone tried Fetcher2 from latest trunk? On our tests, Fetcher2 is always slower (by a large margin) than Fetcher.

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Enzo Michelangeli
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Thursday, May 31, 2007 2:25 PM Are you running jobs in the local mode? In distributed mode filtering is naturally parallel, because you have as many concurrent lookups as there are map tasks. I'm just using the

Re: [Nutch-general] Any URL filter available for search.jsp?

2007-05-31 Thread Naess, Ronny
Maybe I didn't or maybe I did :-) If you want to filter out certain sites, what is the problem with doing it in the query? My query is wrong, however, but adding '-site:' in front of every site you want to exclude will do the trick. I am sure there is another, smarter way as well. I do not
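To make the -site: approach concrete (host names are invented for illustration), a query excluding two crawled sites would look like:

```
printer setup -site:internal.example.com -site:staging.example.com
```

search.jsp could append such exclusions to every user query before handing it to the query parser, which would hide those sites from results without refetching anything.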

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Dennis Kubes
We set up an /etc/resolv.conf configuration as shown below. This allows us to check first locally, then two of the major DNS caches on the Internet, before requesting it through a local DNS caching server. The 208 addresses are OpenDNS servers and the 4.x addresses are Verizon DNS servers. All
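An /etc/resolv.conf along those lines might look like this (the OpenDNS addresses are the well-known public ones; the local server address is illustrative):

```
# local caching DNS server, consulted first
nameserver 127.0.0.1
# OpenDNS public resolvers
nameserver 208.67.222.222
nameserver 208.67.220.220
# Verizon resolver
nameserver 4.2.2.1
```

Note that the classic glibc resolver tries servers in listed order and only consults the first three nameserver entries, so a fourth entry like the one above would normally be ignored.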

Re: [Nutch-general] Fetcher2 slowness?

2007-05-31 Thread Andrzej Bialecki
Doğacan Güney wrote: I am still not sure about the source of this bug, but I think I found some unnecessary waits in Fetcher2. Even if a url is blocked by robots.txt (or has a crawl delay larger than max.crawl.delay), Fetcher2 still waits fetcher.server.delay before fetching another url from

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Andrzej Bialecki
Enzo Michelangeli wrote: I'm just using the vanilla (local) configuration. The situation is so bad that lately I'm seeing durations like: generate: 2h 48' (-topN 2) fetch: 1h 40' (200 threads) updatedb: 2h 20' This is because both generate and updatedb perform filtering, and are

Re: [Nutch-general] Fetcher2 slowness?

2007-05-31 Thread Doğacan Güney
On 5/31/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: I am still not sure about the source of this bug, but I think I found some unnecessary waits in Fetcher2. Even if a url is blocked by robots.txt (or has a crawl delay larger than max.crawl.delay), Fetcher2 still

Re: [Nutch-general] What is parse-oo and why doesn't parsed PDF content show up in cached.jsp ?

2007-05-31 Thread Doğacan Güney
Hi, On 5/31/07, Manoharam Reddy [EMAIL PROTECTED] wrote: Some confusion regarding plugin.includes: 1. I find a parse-oo in the plugins folder. What is that for? Plugin parse-oo has something to do with parsing OpenOffice.org documents; I am not sure what exactly. 2. I have enabled

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Enzo Michelangeli
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Thursday, May 31, 2007 11:39 PM Caching seems to be the only solution. Even if you were able to fire DNS requests more rapidly, remote servers wouldn't be able (or wouldn't like to) respond that quickly ... Then why

Re: [Nutch-general] Fetcher2 slowness?

2007-05-31 Thread Andrzej Bialecki
Doğacan Güney wrote: Good catch! The patch looks good, too - please go ahead. One question: why did you remove the call to finishFetchItem() around line 505? Because it seems we already call finishFetchItem in that code path just before the switch statement. I have opened NUTCH-495 for this,

Re: [Nutch-general] Clustered crawl

2007-05-31 Thread Bolle, Jeffrey F.
After some tweaking and playing I've found a couple of things. I moved the mapred.map.tasks and the mapred.reduce.tasks configuration settings into hadoop-site.xml from a mapred.xml file (maybe mapred-default.xml, I forget now). Also, my crawl only seems to work when using the value of 20 for
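A hadoop-site.xml fragment along those lines might look like this (property names are the pre-0.20 Hadoop ones in use at the time; whether 20 is an appropriate value depends on the cluster):

```xml
<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>20</value>
</property>
```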

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Ken Krugler
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Thursday, May 31, 2007 11:39 PM Caching seems to be the only solution. Even if you were able to fire DNS requests more rapidly, remote servers wouldn't be able (or wouldn't like to) respond that quickly ... Then why is

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Dennis Kubes
Enzo Michelangeli wrote: - Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Thursday, May 31, 2007 11:39 PM Caching seems to be the only solution. Even if you were able to fire DNS requests more rapidly, remote servers wouldn't be able (or wouldn't like to)