Hi,
I am trying to solve a problem, but I cannot find any feature in
Nutch that addresses it.
Let's say in my intranet there are 1000 sites.
Sites 1 to 100 have pages that are never going to change, i.e. they
are static, so I don't need to crawl them again and again. But
I find in the search results that lots of HTTP 302 pages have been
indexed, and this is decreasing the quality of the search results. Is there
any way to disable the indexing of such pages?
I want only HTTP 200 OK pages to be indexed.
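One possible way to do this, as an untested sketch: a custom indexing
filter that drops anything whose fetch status is not success. This assumes
the Nutch 0.9-era IndexingFilter interface; the class name Http200OnlyFilter
is made up for illustration, and the plugin would still need the usual
plugin.xml wiring and an entry in plugin.includes.

  // Hypothetical sketch, not a stock Nutch feature.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.lucene.document.Document;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.parse.Parse;

  public class Http200OnlyFilter implements IndexingFilter {
    private Configuration conf;

    public Document filter(Document doc, Parse parse, Text url,
                           CrawlDatum datum, Inlinks inlinks)
        throws IndexingException {
      // Returning null tells the indexer to skip this document,
      // so redirects (301/302) and errors never reach the index.
      if (datum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
        return null;
      }
      return doc;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }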
-
How can I change it to read from segment/parse_text instead of
segment/content?
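If the goal is just to get the parsed text back out of a segment,
SegmentReader can dump only the parse_text part. The flags vary a little by
version, but something like this should work with the 0.8/0.9-era readseg
(the segment name here is a made-up example):

  bin/nutch readseg -dump crawl/segments/20070531120000 dump_parsetext \
    -nocontent -nofetch -nogenerate -noparse -noparsedata

Everything except ParseText is suppressed, so the dump contains the parsed
plain text rather than the raw fetched content.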
On 5/31/07, Doğacan Güney [EMAIL PROTECTED] wrote:
Hi,
On 5/31/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
Some confusion regarding plugin.includes:
1. I find a parse-oo plugin in the plugins folder. What is that for?
The tutorial says that the depth value is the level of depth of a page
from the root of a website. So, as per the tutorial, if I want to fetch
a page, say http://www.blabla.com/a/b/c/d/e/a.html, I must set the
value of depth = 6.
But I find in the source code that depth is simply a for loop. It will
I get this error message for many URLs. Is there any property that
allows redirect requests to be followed?
2007-06-04 10:00:42,298 INFO httpclient.HttpMethodDirector - Redirect
requested but followRedirects is disabled
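That log line comes from commons-httpclient: protocol-httpclient
deliberately disables the library's automatic redirect handling so that
Nutch itself can record and schedule redirect targets. If your Nutch
version has the http.redirect.max property in nutch-default.xml, the
fetcher can be told to follow redirects immediately; a hedged sketch for
nutch-site.xml:

  <property>
    <name>http.redirect.max</name>
    <value>3</value>
    <description>Follow up to 3 redirects during the fetch; with 0 the
    redirect is only recorded for a later fetch round.</description>
  </property>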
-
I am crawling pages using the following commands in a loop iterating 10 times:-
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
seg1=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $seg1 -threads 50
bin/nutch updatedb crawl/crawldb $seg1
I am getting the following
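For reference, wrapped in an actual shell loop those commands look like
this (same paths and options as above; the inject step, shown once up
front, is taken from the commands quoted later in this thread):

  #!/bin/sh
  bin/nutch inject crawl/crawldb urls
  for i in 1 2 3 4 5 6 7 8 9 10; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    seg1=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $seg1 -threads 50
    bin/nutch updatedb crawl/crawldb $seg1
  done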
command.
Regards,
-vishal.
-Original Message-
From: Manoharam Reddy [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 31, 2007 11:24 AM
To: [EMAIL PROTECTED]
Subject: Re: OutOfMemoryError - Why should the while(1) loop stop?
If I run the fetcher in non-parsing mode, how can I later parse
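(The usual follow-up for a non-parsing fetch is the parse command, run
against the fetched segment, followed by updatedb; the segment name below
is a made-up example:)

  bin/nutch parse crawl/segments/20070531120000
  bin/nutch updatedb crawl/crawldb crawl/segments/20070531120000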
Some confusion regarding plugin.includes:
1. I find a parse-oo plugin in the plugins folder. What is that for?
2. I have enabled parse-pdf by including it in plugin.includes in
nutch-site.xml. The pages now come up in the search results. But when I
visit the cached page of a result, it shows a message like
I want to use two filters: one for crawling and another for searching
through search.jsp.
I am currently using regex-urlfilter.txt for the generate, fetch, update
cycle. But when a user searches the sites, I don't want certain
crawled sites to show up in the results.
How can this be
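One possibility, as an untested sketch: keep crawling with
regex-urlfilter.txt as now, and append prohibited site clauses to the
user's query in search.jsp before it is parsed. This assumes the
query-site plugin is enabled; the host names are hypothetical and the
exact Query.parse signature varies by version.

  // in search.jsp, before the existing Query.parse call
  String queryString = request.getParameter("query");
  queryString = queryString + " -site:cdserver -site:internalwiki";
  Query query = Query.parse(queryString, nutchConf);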
This is my regex-urlfilter.txt file.
-^http://([a-z0-9]*\.)+
+^http://([0-9]+\.)+
+.
I want to allow only IP addresses and internal sites to be crawled and
fetched. This means:-
http://www.google.com should be ignored
http://shoppingcenter should be crawled
http://192.168.101.5 should be
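One thing to watch out for: the rules are applied in order and the first
match wins, and the character class [a-z0-9] also matches digits, so the
minus rule above catches IP addresses before the plus rule can allow them.
A reordered sketch of the same policy:

  # allow dotted-decimal IP addresses first, e.g. http://192.168.101.5
  +^http://([0-9]+\.)+[0-9]+
  # skip dotted hostnames, i.e. internet sites like http://www.google.com
  -^http://([a-z0-9]*\.)+
  # accept everything else, e.g. single-word intranet hosts like http://shoppingcenter
  +.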
<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>
-Ronny
-Original Message-
From: Manoharam Reddy [mailto:[EMAIL PROTECTED]
Sent: 30 May 2007 13:42
To: [EMAIL PROTECTED]
Subject: I don't want to crawl internet sites
This is my regex-urlfilter.txt file
Time and again I get this error, and as a result the segment remains
incomplete. This wastes one iteration of the for() loop in which I am
doing generate, fetch and update.
Can someone please tell me what measures I can take to avoid
this error? And isn't it possible to make some code
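Two measures that are commonly suggested for fetcher OutOfMemoryErrors,
offered as general advice rather than a guaranteed fix: run the fetcher in
non-parsing mode, and give the JVM more heap (the bin/nutch script reads
NUTCH_HEAPSIZE, in MB):

  # nutch-site.xml: don't parse while fetching
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>

  # shell, before running bin/nutch
  export NUTCH_HEAPSIZE=1500

With fetcher.parse off, the segment is parsed afterwards with a separate
bin/nutch parse run, as sketched earlier in this thread.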
Thanks! It worked.
On 5/28/07, Doğacan Güney [EMAIL PROTECTED] wrote:
Hi,
On 5/28/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
In my crawl-urlfilter.txt I have put a statement like
-^http://cdserver
Still while running crawl, it fetches this site. I am running the
crawl using
I am running Nutch on a powerful server with 1 GB RAM and a 3 GHz Intel
processor. I want to know what the optimum number of threads would be
to crawl an intranet with around 100 sites.
If I use too many threads (say -threads 100) while crawling, won't the
context-switching overhead hamper the
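For what it's worth, fetcher threads spend most of their time blocked on
network I/O, so running far more threads than CPU cores is normal; the
per-host limit usually matters more for politeness than the raw thread
count does. Both can be set in nutch-site.xml (the values below are only
examples):

  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
  </property>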
As a result, this segment had only crawl_generate and nothing else. Can
anyone please explain what caused this error? How can I prevent
this error from happening?
On 5/29/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Manoharam Reddy wrote:
My segment merger is not functioning
In my crawl-urlfilter.txt I have put a statement like
-^http://cdserver
Still while running crawl, it fetches this site. I am running the
crawl using these commands:-
bin/nutch inject crawl/crawldb urls
Inside a loop:-
bin/nutch generate crawl/crawldb crawl/segments -topN 10
segment=`ls -d
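This usually comes down to which filter file is actually in effect: the
one-shot bin/nutch crawl command applies crawl-urlfilter.txt (via
crawl-tool.xml), while the step-by-step generate/fetch tools apply
regex-urlfilter.txt unless urlfilter.regex.file is overridden in
nutch-site.xml, as in Ronny's reply above:

  <property>
    <name>urlfilter.regex.file</name>
    <value>crawl-urlfilter.txt</value>
  </property>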
My segment merger is not functioning properly. I am unable to figure
out the problem.
These are the commands I am using.
bin/nutch inject crawl/crawldb seedurls
In a loop iterating 10 times:-
bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 5
segment=`ls -d
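For comparison, the merge step itself is normally invoked along these
lines (untested sketch; the output directory name is arbitrary):

  bin/nutch mergesegs crawl/merged_segments -dir crawl/segments
  # after a successful merge, swap the merged output in:
  rm -rf crawl/segments
  mv crawl/merged_segments crawl/segments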
- Original Message -
From: Manoharam Reddy [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, May 26, 2007 6:23 PM
After I create the crawldb after running bin/nutch crawl, I start my
Tomcat server. It gives proper search results.
What I am wondering is that even after I delete the 'crawl' folder,
the search page still gives proper search results. How is this
possible? Only after I restart the Tomcat server,
I am using Nutch and I want to know how I can do daily crawls with it.
Here are the details I want:-
1. Doing a crawl that keeps running all the time and keeps updating the crawldb.
2. Whether it can avoid re-crawling pages that have been crawled
recently. Basically I don't want it to waste
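A minimal daily re-crawl sketch using the same step-by-step commands seen
elsewhere in this thread (untested; run from cron once a day). Point 2 is
handled by the crawldb itself: pages only become due again after their
fetch interval expires (db.default.fetch.interval in 0.9-era configs, in
days), and -adddays can pull fetch times forward:

  #!/bin/sh
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 1
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment -threads 50
  bin/nutch updatedb crawl/crawldb $segment
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  # the indexer expects a fresh output directory; index to a new dir, then swap
  bin/nutch index crawl/newindexes crawl/crawldb crawl/linkdb $segment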