Hi,
I am trying to solve a problem, but I cannot find any feature in
Nutch that addresses it.
Let's say in my intranet there are 1000 sites.
Sites 1 to 100 have pages that are never going to change, i.e. they
are static, so I don't need to crawl them again and again. But
I find in the search results that lots of HTTP 302 pages have been
indexed, and this is decreasing the quality of the search results. Is there
any way to disable the indexing of such pages?
I want only HTTP 200 OK pages to be indexed.
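One possible way to do this, as an untested sketch: a custom indexing
filter that drops anything whose fetch status is not success. This assumes
the Nutch 0.9-era IndexingFilter interface; the class name Http200OnlyFilter
is made up for illustration, and the plugin would still need the usual
plugin.xml wiring and an entry in plugin.includes.

  // Hypothetical sketch, not a stock Nutch feature.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.lucene.document.Document;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.parse.Parse;

  public class Http200OnlyFilter implements IndexingFilter {
    private Configuration conf;

    public Document filter(Document doc, Parse parse, Text url,
                           CrawlDatum datum, Inlinks inlinks)
        throws IndexingException {
      // Returning null tells the indexer to skip this document,
      // so redirects (301/302) and errors never reach the index.
      if (datum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
        return null;
      }
      return doc;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }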
-
How can I change it to read from segment/parse_text instead of
segment/content?
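If the goal is just to get the parsed text back out of a segment,
SegmentReader can dump only the parse_text part. The flags vary a little by
version, but something like this should work with the 0.8/0.9-era readseg
(the segment name here is a made-up example):

  bin/nutch readseg -dump crawl/segments/20070531120000 dump_parsetext \
    -nocontent -nofetch -nogenerate -noparse -noparsedata

Everything except ParseText is suppressed, so the dump contains the parsed
plain text rather than the raw fetched content.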
On 5/31/07, Doğacan Güney [EMAIL PROTECTED] wrote:
Hi,
On 5/31/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
Some confusion regarding plugin.includes:
1. I find a parse-oo plugin in the plugins folder. What is that for?
The tutorial says that the depth value is the level of depth of a page
from the root of a website. So, as per the tutorial, if I want to fetch
a page, say http://www.blabla.com/a/b/c/d/e/a.html, I must set the
value of depth = 6.
But I find in the source code that depth is simply a for loop. It will
I get this error message for many URLs. Is there any property that
allows redirect requests to be followed?
2007-06-04 10:00:42,298 INFO httpclient.HttpMethodDirector - Redirect
requested but followRedirects is disabled
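That log line comes from commons-httpclient: protocol-httpclient
deliberately disables the library's automatic redirect handling so that
Nutch itself can record and schedule redirect targets. If your Nutch
version has the http.redirect.max property in nutch-default.xml, the
fetcher can be told to follow redirects immediately; a hedged sketch for
nutch-site.xml:

  <property>
    <name>http.redirect.max</name>
    <value>3</value>
    <description>Follow up to 3 redirects during the fetch; with 0 the
    redirect is only recorded for a later fetch round.</description>
  </property>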
-
I am crawling pages using the following commands in a loop iterating 10 times:-
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
seg1=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $seg1 -threads 50
bin/nutch updatedb crawl/crawldb $seg1
I am getting the following
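For reference, wrapped in an actual shell loop those commands look like
this (same paths and options as above; the inject step, shown once up
front, is taken from the commands quoted later in this thread):

  #!/bin/sh
  bin/nutch inject crawl/crawldb urls
  for i in 1 2 3 4 5 6 7 8 9 10; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    seg1=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $seg1 -threads 50
    bin/nutch updatedb crawl/crawldb $seg1
  done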
command.
Regards,
-vishal.
-Original Message-
From: Manoharam Reddy [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 31, 2007 11:24 AM
To: [EMAIL PROTECTED]
Subject: Re: OutOfMemoryError - Why should the while(1) loop stop?
If I run the fetcher in non-parsing mode, how can I later parse
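(The usual follow-up for a non-parsing fetch is the parse command, run
against the fetched segment, followed by updatedb; the segment name below
is a made-up example:)

  bin/nutch parse crawl/segments/20070531120000
  bin/nutch updatedb crawl/crawldb crawl/segments/20070531120000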
Some confusion regarding plugin.includes:
1. I find a parse-oo plugin in the plugins folder. What is that for?
2. I have enabled parse-pdf by including it in plugin.includes in
nutch-site.xml. The pages now come up in the search results. But when I
visit the cached page of a result, it shows a message like
I want to use two filters: one for crawling and another for searching
through search.jsp.
I am currently using regex-urlfilter.txt for the generate, fetch, update
cycle. But when a user searches the sites, I don't want certain
crawled sites to show up in the results.
How can this be
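One possibility, as an untested sketch: keep crawling with
regex-urlfilter.txt as now, and append prohibited site clauses to the
user's query in search.jsp before it is parsed. This assumes the
query-site plugin is enabled; the host names are hypothetical and the
exact Query.parse signature varies by version.

  // in search.jsp, before the existing Query.parse call
  String queryString = request.getParameter("query");
  queryString = queryString + " -site:cdserver -site:internalwiki";
  Query query = Query.parse(queryString, nutchConf);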
This is my regex-urlfilter.txt file.
-^http://([a-z0-9]*\.)+
+^http://([0-9]+\.)+
+.
I want to allow only IP addresses and internal sites to be crawled and
fetched. This means:-
http://www.google.com should be ignored
http://shoppingcenter should be crawled
http://192.168.101.5 should be
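One thing to watch out for: the rules are applied in order and the first
match wins, and the character class [a-z0-9] also matches digits, so the
minus rule above catches IP addresses before the plus rule can allow them.
A reordered sketch of the same policy:

  # allow dotted-decimal IP addresses first, e.g. http://192.168.101.5
  +^http://([0-9]+\.)+[0-9]+
  # skip dotted hostnames, i.e. internet sites like http://www.google.com
  -^http://([a-z0-9]*\.)+
  # accept everything else, e.g. single-word intranet hosts like http://shoppingcenter
  +.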
<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>
-Ronny
-Original Message-
From: Manoharam Reddy [mailto:[EMAIL PROTECTED]
Sent: 30 May 2007 13:42
To: [EMAIL PROTECTED]
Subject: I don't want to crawl internet sites
This is my regex-urlfilter.txt file
Time and again I get this error, and as a result the segment remains
incomplete. This wastes one iteration of the for() loop in which I am
doing generate, fetch and update.
Can someone please tell me what measures I can take to avoid
this error? And isn't it possible to make some code
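Two measures that are commonly suggested for fetcher OutOfMemoryErrors,
offered as general advice rather than a guaranteed fix: run the fetcher in
non-parsing mode, and give the JVM more heap (the bin/nutch script reads
NUTCH_HEAPSIZE, in MB):

  # nutch-site.xml: don't parse while fetching
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>

  # shell, before running bin/nutch
  export NUTCH_HEAPSIZE=1500

With fetcher.parse off, the segment is parsed afterwards with a separate
bin/nutch parse run, as sketched earlier in this thread.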
Thanks! It worked.
On 5/28/07, Doğacan Güney [EMAIL PROTECTED] wrote:
Hi,
On 5/28/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
In my crawl-urlfilter.txt I have put a statement like
-^http://cdserver
Still while running crawl, it fetches this site. I am running the
crawl using
I am running Nutch on a powerful server with 1 GB RAM and a 3 GHz Intel
processor. I want to know what the optimum number of threads would be
to crawl an intranet with around 100 sites.
If I use too many threads (say -threads 100) while crawling, won't the
context-switching overhead hamper the
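For what it's worth, fetcher threads spend most of their time blocked on
network I/O, so running far more threads than CPU cores is normal; the
per-host limit usually matters more for politeness than the raw thread
count does. Both can be set in nutch-site.xml (the values below are only
examples):

  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
  </property>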
As a result, this segment had only crawl_generate and nothing else. Can
anyone please explain what caused this error? How can I prevent
this error from happening?
On 5/29/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Manoharam Reddy wrote:
My segment merger is not functioning
In my crawl-urlfilter.txt I have put a statement like
-^http://cdserver
Still while running crawl, it fetches this site. I am running the
crawl using these commands:-
bin/nutch inject crawl/crawldb urls
Inside a loop:-
bin/nutch generate crawl/crawldb crawl/segments -topN 10
segment=`ls -d
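This usually comes down to which filter file is actually in effect: the
one-shot bin/nutch crawl command applies crawl-urlfilter.txt (via
crawl-tool.xml), while the step-by-step generate/fetch tools apply
regex-urlfilter.txt unless urlfilter.regex.file is overridden in
nutch-site.xml, as in Ronny's reply above:

  <property>
    <name>urlfilter.regex.file</name>
    <value>crawl-urlfilter.txt</value>
  </property>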
My segment merger is not functioning properly. I am unable to figure
out the problem.
These are the commands I am using.
bin/nutch inject crawl/crawldb seedurls
In a loop iterating 10 times:-
bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 5
segment=`ls -d
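For comparison, the merge step itself is normally invoked along these
lines (untested sketch; the output directory name is arbitrary):

  bin/nutch mergesegs crawl/merged_segments -dir crawl/segments
  # after a successful merge, swap the merged output in:
  rm -rf crawl/segments
  mv crawl/merged_segments crawl/segments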
- Original Message -
From: Manoharam Reddy [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, May 26, 2007 6:23 PM
After I create the crawldb after running bin/nutch crawl, I start my
Tomcat server. It gives proper search results.
What I am wondering is that even after I delete the 'crawl' folder,
the search page still gives proper search results. How is this
possible? Only after I restart the Tomcat server,
I am using Nutch and I want to know how I can do daily crawls with it.
Here are the details I want:-
1. Doing a crawl that keeps running all the time and keeps updating the crawldb.
2. Whether it can avoid re-crawling pages that have been crawled
recently. Basically I don't want it to waste
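A minimal daily re-crawl sketch using the same step-by-step commands seen
elsewhere in this thread (untested; run from cron once a day). Point 2 is
handled by the crawldb itself: pages only become due again after their
fetch interval expires (db.default.fetch.interval in 0.9-era configs, in
days), and -adddays can pull fetch times forward:

  #!/bin/sh
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 1
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment -threads 50
  bin/nutch updatedb crawl/crawldb $segment
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  # the indexer expects a fresh output directory; index to a new dir, then swap
  bin/nutch index crawl/newindexes crawl/crawldb crawl/linkdb $segment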