[Nutch-general] is it possible to set different addDays for different sites?

2007-06-12 Thread Manoharam Reddy
Hi, I am trying to solve a problem but I am unable to find any feature in Nutch that lets me solve this problem. Let's say in my intranet there are 1000 sites. Sites 1 to 100 have pages that are never going to change, i.e. they are static. So I don't need to crawl them again and again. But
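For reference, the re-fetch interval in this era of Nutch is a single global setting rather than a per-site one. A minimal nutch-site.xml sketch, assuming the 0.x property name db.default.fetch.interval (measured in days; later releases rename it db.fetch.interval.default and use seconds):

  <!-- pages become due for re-fetch this many days after they were last fetched -->
  <property>
    <name>db.default.fetch.interval</name>
    <value>30</value>
  </property>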

[Nutch-general] Why Nutch is indexing HTTP 302 pages

2007-06-12 Thread Manoharam Reddy
I find in the search results that lots of HTTP 302 pages have been indexed. This is decreasing the quality of the search results. Is there any way to prevent such pages from being indexed? I want only HTTP 200 OK pages to be indexed.

Re: [Nutch-general] What is parse-oo and why doesn't parsed PDF content show up in cached.jsp ?

2007-06-12 Thread Manoharam Reddy
How can I change it to read from segment/parse_text instead of segment/content? On 5/31/07, Doğacan Güney [EMAIL PROTECTED] wrote: Hi, On 5/31/07, Manoharam Reddy [EMAIL PROTECTED] wrote: Some confusions regarding plugin.includes 1. I find a parse-oo in the plugins folder. What
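One way to check what actually landed in parse_text is the segment reader. A hedged sketch, assuming the readseg/SegmentReader options of this era (flag names vary between versions, and the segment path is an example):

  # dump only the parse_text part of a segment to a local directory for inspection
  bin/nutch readseg -dump crawl/segments/20070531123456 dumpdir \
    -nocontent -nofetch -nogenerate -noparse -noparsedata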

[Nutch-general] meaning of depth value - tutorial wrong?

2007-06-12 Thread Manoharam Reddy
The tutorial says that the depth value is the level of depth of a page from the root of a website. So, as per the tutorial, if I want to fetch a page such as http://www.blabla.com/a/b/c/d/e/a.html, I must set depth = 6. But I find in the source code that depth is simply a for loop. It will
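In the source the depth argument is indeed just the number of generate/fetch/update rounds, not the number of path components in a URL. A sketch of the equivalent loop, using the same commands that appear elsewhere in these threads:

  # "depth 3" in bin/nutch crawl corresponds roughly to three rounds of this loop
  for i in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    seg=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $seg -threads 10
    bin/nutch updatedb crawl/crawldb $seg
  done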

[Nutch-general] Complex problem of recrawling economically

2007-06-04 Thread Manoharam Reddy
Hi, I am trying to solve a problem but I am unable to find any feature in Nutch that lets me solve this problem. Let's say in my intranet there are 1000 sites. Sites 1 to 100 have pages that are never going to change, i.e. they are static. So I don't need to crawl them again and again. But
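For reference, the -adddays option on the generate command (used with the value 5 in a later thread here) shifts the clock forward when deciding which pages are due for re-fetch, so it can pull some pages forward without touching the global interval. A minimal sketch:

  # treat pages as 5 days older than they are when selecting URLs to re-fetch;
  # 0 respects the normal fetch interval
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 5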

[Nutch-general] How to enable followRedirects?

2007-06-03 Thread Manoharam Reddy
I get this error message for many URLs. Is there any property that allows redirects to be followed? 2007-06-04 10:00:42,298 INFO httpclient.HttpMethodDirector - Redirect requested but followRedirects is disabled - This
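The log line comes from commons-httpclient inside the protocol-httpclient plugin; at the Nutch level the relevant knob is usually http.redirect.max. A hedged nutch-site.xml sketch:

  <!-- maximum number of redirects the fetcher follows; 0 means the redirect is only
       recorded and the target is queued for a later fetch -->
  <property>
    <name>http.redirect.max</name>
    <value>3</value>
  </property>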

[Nutch-general] How to parse PDF files? Deferred parsing possible?

2007-05-31 Thread Manoharam Reddy
I am crawling pages using the following commands in a loop iterating 10 times:- bin/nutch generate crawl/crawldb crawl/segments -topN 1000 seg1=`ls -d crawl/segments/* | tail -1` bin/nutch fetch $seg1 -threads 50 bin/nutch updatedb crawl/crawldb $seg1 I am getting the following
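For reference, the fetcher of this era can defer parsing to a separate step. A hedged sketch, assuming the -noParsing fetch flag and the parse (ParseSegment) command exist in this version:

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  seg1=`ls -d crawl/segments/* | tail -1`
  # fetch without parsing, then parse the fetched content as a separate job
  bin/nutch fetch $seg1 -threads 50 -noParsing
  bin/nutch parse $seg1
  bin/nutch updatedb crawl/crawldb $seg1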

Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-31 Thread Manoharam Reddy
command. Regards, -vishal. -Original Message- From: Manoharam Reddy [mailto:[EMAIL PROTECTED] Sent: Thursday, May 31, 2007 11:24 AM To: [EMAIL PROTECTED] Subject: Re: OutOfMemoryError - Why should the while(1) loop stop? If I run the fetcher in non-parsing mode, how can I later parse

[Nutch-general] What is parse-oo and why doesn't parsed PDF content show up in cached.jsp ?

2007-05-31 Thread Manoharam Reddy
Some confusions regarding plugin.includes: 1. I find a parse-oo in the plugins folder. What is that for? 2. I have enabled parse-pdf by including it in plugin.includes of nutch-site.xml. The pages now come up in the search results, but when I visit the cached page of a result, it shows a message like
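For reference, parse-pdf only takes effect once it is listed in the plugin.includes property of nutch-site.xml. A sketch of a typical value (the exact plugin list depends on the installation):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>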

[Nutch-general] Any URL filter available for search.jsp?

2007-05-31 Thread Manoharam Reddy
I want to use two filters: one for crawling and another for searching through search.jsp. I am currently using regex-urlfilter.txt for the generate, fetch, update cycle. But when a user searches the sites, I do not want certain crawled sites to appear in the results. How can this be
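One commonly suggested workaround, assuming the query-site plugin is enabled and that prohibited clauses work for the site field as they do for ordinary terms, is to exclude hosts in the query itself, for example:

  nutch -site:cdserver

A true search-time URL filter would otherwise require a custom change to search.jsp or an index-time filter.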

[Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Manoharam Reddy
This is my regex-urlfilter.txt file:
-^http://([a-z0-9]*\.)+
+^http://([0-9]+\.)+
+.
I want to allow only IP addresses and internal sites to be crawled and fetched. This means: http://www.google.com should be ignored, http://shoppingcenter should be crawled, http://192.168.101.5 should be
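In regex-urlfilter.txt the first matching rule wins, so the allow rules must come before a final catch-all deny. A hedged sketch toward the stated goal (the host patterns are illustrative only):

  # allow hosts that are plain IP addresses
  +^http://([0-9]{1,3}\.){3}[0-9]{1,3}
  # allow dotless intranet hostnames such as http://shoppingcenter
  +^http://[a-zA-Z0-9-]+(:[0-9]+)?(/|$)
  # skip everything else, e.g. http://www.google.com
  -.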

Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Manoharam Reddy
<name>urlfilter.regex.file</name> <value>crawl-urlfilter.txt</value> </property> -Ronny -Original Message- From: Manoharam Reddy [mailto:[EMAIL PROTECTED] Sent: 30 May 2007 13:42 To: [EMAIL PROTECTED] Subject: I don't want to crawl internet sites This is my regex-urlfilter.txt file

[Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-30 Thread Manoharam Reddy
Time and again I get this error and as a result the segment remains incomplete. This wastes one iteration of the for() loop in which I am doing generate, fetch and update. Can someone please tell me what measures I can take to avoid this error? And isn't it possible to make some code
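For reference, the heap available to the JVMs started by the bin/nutch script is set by the NUTCH_HEAPSIZE environment variable (in MB); lowering the thread count or the -topN size also reduces memory pressure during fetching. A minimal sketch:

  # give bin/nutch a larger heap (value in MB) before running the fetch step
  export NUTCH_HEAPSIZE=1500
  bin/nutch fetch $segment -threads 20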

Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-30 Thread Manoharam Reddy
PROTECTED] wrote: On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote: Time and again I get this error and as a result the segment remains incomplete. This wastes one iteration of the for() loop in which I am doing generate, fetch and update. Can someone please tell me what

Re: [Nutch-general] Nutch crawls blocked sites - Why?

2007-05-29 Thread Manoharam Reddy
Thanks! It worked. On 5/28/07, Doğacan Güney [EMAIL PROTECTED] wrote: Hi, On 5/28/07, Manoharam Reddy [EMAIL PROTECTED] wrote: In my crawl-urlfilter.txt I have put a statement like -^http://cdserver Still while running crawl, it fetches this site. I am running the crawl using

[Nutch-general] Optimum number of threads

2007-05-29 Thread Manoharam Reddy
I am running Nutch on a powerful server with 1 GB of RAM and a 3 GHz Intel processor. I want to know what the optimum number of threads would be to crawl an intranet with around 100 sites. If I use too many threads (say -threads 100) while crawling, won't the context-switching overhead hamper the
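Besides the -threads flag, the per-host politeness limit usually matters more on an intranet with only a few hosts. A hedged nutch-site.xml sketch, assuming the nutch-default.xml property names of this era:

  <!-- total fetcher threads; the -threads command-line flag overrides this -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>20</value>
  </property>
  <!-- threads allowed on the same host at once; raising it speeds up a crawl of a few
       intranet servers but increases the load on each of them -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
  </property>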

Re: [Nutch-general] mergesegs is not functioning properly

2007-05-29 Thread Manoharam Reddy
to maintain clarity. As a result, this segment had only crawl_generate and nothing else. Can anyone please explain to me what caused this error? How can I prevent this error from happening? On 5/29/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Manoharam Reddy wrote: My segment merger is not functioning

[Nutch-general] Nutch crawls blocked sites - Why?

2007-05-28 Thread Manoharam Reddy
In my crawl-urlfilter.txt I have put a statement like -^http://cdserver Still, while running the crawl, it fetches this site. I am running the crawl using these commands:- bin/nutch inject crawl/crawldb urls Inside a loop:- bin/nutch generate crawl/crawldb crawl/segments -topN 10 segment=`ls -d
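For reference, crawl-urlfilter.txt is read by the one-shot bin/nutch crawl command, while the step-by-step generate/fetch commands shown here consult regex-urlfilter.txt (or whatever file urlfilter.regex.file points to). Within either file the first matching rule wins. A minimal sketch:

  # regex-urlfilter.txt: the deny rule must appear before any catch-all allow
  -^http://cdserver
  +.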

[Nutch-general] mergesegs is not functioning properly

2007-05-28 Thread Manoharam Reddy
My segment merger is not functioning properly. I am unable to figure out the problem. These are the commands I am using. bin/nutch inject crawl/crawldb seedurls In a loop iterating 10 times:- bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 5 segment=`ls -d
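A hedged sketch of segment merging, assuming the mergesegs (SegmentMerger) syntax of this era, where the output directory comes first and -dir names the directory holding the input segments:

  # merge all segments under crawl/segments into one new segment
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
  # after verifying the result, swap the merged segment in place of the old ones
  rm -rf crawl/segments
  mv crawl/MERGEDsegments crawl/segments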

Re: [Nutch-general] Deleting crawl still gives proper results

2007-05-27 Thread Manoharam Reddy
: Manoharam Reddy [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, May 26, 2007 6:23 PM After I create the crawldb by running bin/nutch crawl, I start my Tomcat server. It gives proper search results. What I am wondering is that even after I delete the 'crawl' folder, the search

Re: [Nutch-general] Deleting crawl still gives proper results

2007-05-27 Thread Manoharam Reddy
Message - From: Manoharam Reddy [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, May 26, 2007 6:23 PM After I create the crawldb by running bin/nutch crawl, I start my Tomcat server. It gives proper search results. What I am wondering is that even after I delete

[Nutch-general] Deleting crawl still gives proper results

2007-05-26 Thread Manoharam Reddy
After I create the crawldb by running bin/nutch crawl, I start my Tomcat server. It gives proper search results. What I am wondering is that even after I delete the 'crawl' folder, the search page still gives proper search results. How is this possible? Only after I restart the Tomcat server,

[Nutch-general] Daily re-crawl possible?

2007-05-23 Thread Manoharam Reddy
I am using Nutch. I want to know how I can do daily crawls with Nutch. Here are the details I want:- 1. A crawl that keeps running all the time and keeps updating the crawldb. 2. Whether it can avoid re-crawling the pages that have been crawled recently. Basically I want it to waste
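In practice a daily re-crawl is a scripted generate/fetch/updatedb cycle run from cron; recently fetched pages are skipped automatically because generate only selects URLs that are due according to the fetch interval. A hedged sketch (the install path is hypothetical, and the index/dedup steps of a real script vary by setup):

  # recrawl.sh - run daily from cron, e.g.  0 2 * * * /opt/nutch/recrawl.sh
  cd /opt/nutch
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  seg=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $seg -threads 20
  bin/nutch updatedb crawl/crawldb $seg
  bin/nutch invertlinks crawl/linkdb $seg
  # a real script would then index the new segment and dedup/merge it into the live index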