I am using the tutorial below (with Nutch 0.9) to crawl the web. I went through the steps: downloaded the DMOZ file, ran the parser, and so on.
bin/nutch inject crawl/crawldb dmoz
etc
etc.
bin/nutch fetch $s1
Once I get to this step, is there a way to crawl the sites that are in the dmoz/url list?
Yeah, but how do I crawl the actual pages the way an intranet crawl does? For example, let's say that I have 20 URLs in my set from the DmozParser. Let's also say that I want to go three levels deep into those 20 URLs. Is that possible?
For example, with the intranet crawl I would start
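A hedged sketch, not from the thread: with the whole-web tools, a depth-3 crawl over those seeds is just three generate/fetch/updatedb rounds (paths follow the 0.8-era tutorial layout):

  # assumes crawldb at crawl/crawldb and segments under crawl/segments
  for i in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    s1=`ls -d crawl/segments/2* | tail -1`   # the segment generate just created
    bin/nutch fetch $s1
    bin/nutch updatedb crawl/crawldb $s1     # feed newly found links into the next round
  done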
Enzo Michelangeli wrote:
- Original Message - From: Berlin Brown [EMAIL PROTECTED]
Sent: Sunday, June 10, 2007 11:24 AM
Yeah, but how do I crawl the actual pages the way an intranet crawl does? For example, let's say that I have 20 URLs in my set from the DmozParser. Let's also say
OK, try this. As you see, the two filters have the same entry. I don't know exactly why it has to be in both when one would be enough, but this keeps the crawl from walking up into the parent dir as well.
Also check nutch-site.xml: if I put .* there it isn't working in my case, so I have to list explicitly the plugins I really need.
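The thread doesn't show the actual filter entries, but a minimal sketch of what matching entries in conf/crawl-urlfilter.txt and conf/regex-urlfilter.txt might look like (example.com is a hypothetical host; "the same entry" goes in both files):

  # accept only pages under the /docs/ subtree
  +^http://www\.example\.com/docs/
  # reject everything else, including the parent directory
  -.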
As the size of my data keeps growing, and the indexing time grows even faster, I'm trying to switch from a reindex-everything-at-every-crawl model to an incremental indexing one. I intend to keep the segments separate, but I want to index only the segment fetched during the last cycle, and then merge
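A rough sketch of such an incremental cycle with the 0.8-era tools (the basename-derived index directory name is my invention; command usage is from memory, so check bin/nutch index and bin/nutch merge on your install):

  s1=`ls -d crawl/segments/2* | tail -1`             # newest segment only
  bin/nutch index crawl/indexes/`basename $s1` crawl/crawldb crawl/linkdb $s1
  bin/nutch merge crawl/index crawl/indexes/*        # fold per-cycle indexes into one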
Hi all, I have had a problem for some time: I want to crawl only sites of my country or related to it. The problem is that crawling only by domain (in my case I set the regex-urlfilter regex to catch (com|org|..).uy) leaves out a lot of sites which don't end in .uy but in .com or .org. I don't
I have written a custom URLFilter that resolves the hostname into an IP
address and checks the latter against a GeoIP database. Unfortunately the
source code was developed under a commercial contract, and is not freely
available.
Enzo
- Original Message -
From: Cesar Voulgaris [EMAIL
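Since the original code is not available, here is an independent minimal sketch of the same idea, assuming the 0.9 URLFilter interface and the MaxMind GeoIP Java API (class and method names from memory; this is not Enzo's implementation):

  import java.io.IOException;
  import java.net.InetAddress;
  import java.net.URL;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  import com.maxmind.geoip.Country;
  import com.maxmind.geoip.LookupService;

  // Independent sketch: keep only URLs whose host resolves to an IP
  // located in Uruguay according to a GeoIP database.
  public class GeoIpUrlFilter implements URLFilter {

    private Configuration conf;
    private LookupService geoip;

    public GeoIpUrlFilter() throws IOException {
      // "GeoIP.dat" is a placeholder path; a real plugin would read it from conf
      geoip = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE);
    }

    // URLFilter contract: return the URL to keep it, or null to drop it
    public String filter(String urlString) {
      try {
        String host = new URL(urlString).getHost();
        String ip = InetAddress.getByName(host).getHostAddress();
        Country country = geoip.getCountry(ip);
        return "UY".equals(country.getCode()) ? urlString : null;
      } catch (Exception e) {
        return null; // unresolvable hosts and malformed URLs are dropped
      }
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }

Resolving every hostname at filter time is slow, so a real version would cache lookups.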
Hi,
I am trying to solve a problem but I am unable to find any feature in
Nutch that lets me solve this problem.
Let's say in my intranet there are 1000 sites.
Sites 1 to 100 have pages that are never going to change, i.e. they
are static. So I don't need to crawl them again and again. But
I find in the search results that lots of HTTP 302 pages have been
indexed. This is decreasing the quality of search results. Is there
any way to disable indexing such pages?
I want only HTTP 200 OK pages to be indexed.
-
On 6/11/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
I find in the search results that lots of HTTP 302 pages have been
indexed. This is decreasing the quality of search results. Is there
any way to disable indexing such pages?
I want only HTTP 200 OK pages to be indexed.
If you run fetcher
Hi guys,
I've got some trouble getting Hadoop working.
I get the following error when I launch the slaves script:
[EMAIL PROTECTED] search]$ bin/slaves.sh uptime
+ usage='Usage: slaves.sh [--config confdir] command...'
+ '[' 1 -le 0 ']'
++ dirname bin/slaves.sh
+ bin=bin
++ cd bin
++ pwd
+
As far as I know, it is currently not possible. But if I'm correct, a patch has been applied in trunk that adapts the refetch frequency of a page to how often it is actually updated. You could use it from a nightly build, or wait for the next release.
Regards,
Marcin
Hi,
I am trying to solve a problem but I
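In the meantime, the closest global knob may be the default refetch interval; a sketch for nutch-site.xml, with the property name from memory of 0.9's nutch-default.xml (check your copy; the value is in days and applies to every page, not per-site):

  <property>
    <name>db.default.fetch.interval</name>
    <value>30</value>
    <description>Default number of days between re-fetches of a page.</description>
  </property>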
I'm running Nutch 0.8.1 on 3 servers. Everything works fine, but I'm confused about some Fetcher behavior. I'll generate a list of 100k URLs to fetch, and that works fine. However, only one server in the cluster actually fetches a reasonable number; the other two get at most 20 pages. I've gotta
Hi all,
I have a problem with the cache: after crawling, searching works successfully, but the cached page displays with square question marks. Please take a look at http://192.168.71.66:8080/cached.jsp?idx=0&id=1. I have tried some configuration changes but no luck. Do you have any idea?
By the way, anyone
Oops, I am sorry, here is the link: http://203.162.71.66:8080/cached.jsp?idx=0&id=1
I also think this is an issue of encoding too :(
About this config:

<property>
  <name>fetcher.store.content</name>
  <value>false</value>
  <description>If true, fetcher will store content.</description>
</property>

I have
How can I change it to read from segment/parse_text instead of segment/content?
On 5/31/07, Doğacan Güney [EMAIL PROTECTED] wrote:
Hi,
On 5/31/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
Some confusion regarding plugin.includes:
1. I find a parse-oo in the plugins folder. What is that
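For what it's worth, parse-oo appears to be the OpenOffice document parser, and a plugin sitting in the plugins folder is only active if it matches plugin.includes in your config. The 0.9-era default looks roughly like this (quoted from memory, so check your nutch-default.xml):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directories to include.</description>
  </property>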
Dear all,
My client uses HTTrack with GDS (Google Desktop Search). While pages are fetched much quicker using Nutch (kudos to the Nutch engine developers), it doesn't seem to index the entire page like HTTrack/GDS does. As a result, he claims that if he searches on 'hbx' (a web analytics tool that is
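One possible cause, an assumption on my part since the message is cut off: Nutch indexes only the first indexer.max.tokens tokens of each document (10,000 by default in the 0.8/0.9 line), so hits deep inside a long page never make it into the index. Raising it in nutch-site.xml may help:

  <property>
    <name>indexer.max.tokens</name>
    <value>100000</value>
    <description>Index up to this many tokens per document instead of the default 10000.</description>
  </property>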
On 6/12/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
How can I change it to read from segment/parse_text instead of segment/content?
If you are using Nutch's web UI, you have to change this part in cached.jsp:
<% } else { %>
The cached content has mime type <%=contentType%>,
click this <a
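The other half of the change, roughly, assuming NutchBean still exposes getParseText() as in the 0.8/0.9 search API (check your version):

  // serve the extracted text instead of the raw bytes
  ParseText text = bean.getParseText(details);   // reads segment/parse_text
  String content = text.getText();               // plain text, no markup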
It seems I'm having a lot of trouble trying to configure Hadoop on one machine.
I've followed the wiki tutorial and configured everything on a single machine. I tried to start Hadoop using start-all.sh and it works. I get the following output:
starting namenode, logging to
When the generator runs in distributed mode, it partitions URLs into separate map tasks according to their hosts. This way, URLs under the same host end up in the same map task (which is necessary for politeness). So, in your case, you either have very few hosts (of which one has almost 100K urls) or
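If the imbalance comes from one dominant host, one mitigation (an aside, not from the thread) is capping the number of URLs per host at generate time; if I remember correctly the property exists in 0.9:

  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
    <description>Maximum number of urls per host in a fetchlist; -1 means no limit.</description>
  </property>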
Doğacan Güney wrote:
I think you may also run a segment merge. If you run segmerge on a single segment (with the number of reduce tasks set to the desired number of fetchers), segmerge will put an equal number of URLs into every part. Then set fetcher.max.threads.per.host to a value greater than 1
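A sketch of that suggestion, with the command name from memory of the 0.9 tools (check bin/nutch mergesegs usage on your install); the reduce-task count is taken from mapred.reduce.tasks:

  # set mapred.reduce.tasks to the number of fetch slots in conf/hadoop-site.xml first
  bin/nutch mergesegs crawl/segments_merged -dir crawl/segments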
- Original Message -
From: Phạm Hải Thanh [EMAIL PROTECTED]
Sent: Tuesday, June 12, 2007 10:06 AM
Oops, I am sorry, here is the link: http://203.162.71.66:8080/cached.jsp?idx=0&id=1
I also think this is an issue of encoding too :(
It looks fine to me, both with Firefox and MSIE 7
Hi Enzo, hi all,
I fixed it all yesterday, so it looks fine to everyone now ^^
For some reason cached.jsp cannot get the charset from the hit, so I have forced it:
content = new String(bean.getContent(details), "utf-8");
Thanks [EMAIL PROTECTED] for this.
Thank you very much, Enzo.
-Original
The tutorial says that the depth value is the level of depth of a page from the root of a website. So, as per the tutorial, if I want to fetch a page like http://www.blabla.com/a/b/c/d/e/a.html, I would have to set depth = 6.
But I find in the source code that depth is simply a for loop. It will
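That matches my reading of 0.9's Crawl.java; a paraphrase from memory (names and signatures simplified, not verbatim):

  // "depth" is just the number of generate/fetch/update rounds,
  // not the directory depth of a URL
  for (int i = 0; i < depth; i++) {
    Path segment = generator.generate(crawlDb, segments, -1, topN, curTime);
    if (segment == null) break;             // nothing new to fetch: stop early
    fetcher.fetch(segment, threads);        // fetch this round's fetchlist
    crawlDbTool.update(crawlDb, segment);   // new links feed round i+1
  }

So a page like .../a/b/c/d/e/a.html is fetched at depth N only if it is N link hops from your seeds; the URL path depth itself is irrelevant.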