[Nutch-general] Crawling the web and going into depth

2007-06-12 Thread Berlin Brown
I am using the tutorial below (with Nutch 0.9) to crawl the web. I went through the steps: downloaded the DMOZ file, ran the parser, etc., etc. bin/nutch inject crawl/crawldb dmoz, etc., etc.; bin/nutch fetch $s1. Once I get to this step, is there a way to crawl the sites that are in the dmoz/url

Re: [Nutch-general] Crawling the web and going into depth

2007-06-12 Thread Berlin Brown
Yeah, but how do I crawl the actual pages like you would in an intranet crawl? For example, let's say that I have 20 URLs in my set from the DmozParser. Let's also say that I want to go 3 levels deep into those 20 URLs. Is that possible? For example, with the intranet crawl I would start
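For anyone following along: the one-shot crawl command's -depth option is just a loop of generate/fetch/updatedb rounds, so a whole-web crawl gets the same effect by running the rounds by hand. A minimal sketch along the lines of the Nutch 0.9 whole-web tutorial (the directory layout and -topN value are assumptions, not a definitive recipe):

```
# Three rounds == "-depth 3": each round fetches the links
# discovered by the previous round and feeds them back into the db.
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`   # pick the newest segment
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
done
```

With the 20 DmozParser URLs injected first, three rounds of this loop approximate "depth 3" from those seed pages.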

Re: [Nutch-general] Crawling the web and going into depth

2007-06-12 Thread Enzo Michelangeli
- Original Message - From: Berlin Brown [EMAIL PROTECTED] Sent: Sunday, June 10, 2007 11:24 AM Yeah, but how do I crawl the actual pages like you would in an intranet crawl? For example, let's say that I have 20 URLs in my set from the DmozParser. Let's also say that I want to go into the

Re: [Nutch-general] Crawling the web and going into depth

2007-06-12 Thread Andrzej Bialecki
Enzo Michelangeli wrote: - Original Message - From: Berlin Brown [EMAIL PROTECTED] Sent: Sunday, June 10, 2007 11:24 AM Yeah, but how do I crawl the actual pages like you would in an intranet crawl? For example, let's say that I have 20 URLs in my set from the DmozParser. Let's also say

Re: [Nutch-general] WIN XP PRO -Djava.protocol* file:///c:/folder/ Crawling Parents

2007-06-12 Thread Vadim B
OK, try this; as you see, the two filters have the same entry. I don't know exactly why there have to be two where one would be enough, but this keeps the crawler from crawling the parent dir as well. Check nutch-site.xml; if I put .* there it isn't working in my case, so I have to write out the plugins I really need.
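For context, the two filter files with the same entry are presumably conf/crawl-urlfilter.txt (read by the one-shot crawl command) and conf/regex-urlfilter.txt (read by the step-by-step tools), which is why the entry has to exist twice. A hedged sketch of the kind of entry that keeps the crawl inside one local folder (the path is a placeholder):

```
# In both crawl-urlfilter.txt and regex-urlfilter.txt:
# accept URLs under the start folder...
+^file:///c:/folder/
# ...and reject everything else, so the parent dir is never crawled.
-.
```

Filters are applied in order, so the catch-all reject line must come last.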

Re: [Nutch-general] Crawling the web and going into depth

2007-06-12 Thread Enzo Michelangeli
- Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Sunday, June 10, 2007 5:48 PM Enzo Michelangeli wrote: - Original Message - From: Berlin Brown [EMAIL PROTECTED] Sent: Sunday, June 10, 2007 11:24 AM Yeah, but how do I crawl the actual pages like you would a

Re: [Nutch-general] Crawling the web and going into depth

2007-06-12 Thread Andrzej Bialecki
Enzo Michelangeli wrote: - Original Message - From: Andrzej Bialecki [EMAIL PROTECTED] Sent: Sunday, June 10, 2007 5:48 PM Enzo Michelangeli wrote: - Original Message - From: Berlin Brown [EMAIL PROTECTED] Sent: Sunday, June 10, 2007 11:24 AM Yeah, but how do I crawl the

[Nutch-general] Incremental indexing

2007-06-12 Thread Enzo Michelangeli
As the size of my data keeps growing, and the indexing time grows even faster, I'm trying to switch from a "reindex everything at every crawl" model to an incremental indexing one. I intend to keep the segments separate, but I want to index only the segment fetched during the last cycle, and then merge
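One way to sketch this with the 0.8/0.9 command-line tools, assuming segments are kept separate (all paths here are placeholders, and bin/nutch merge is the IndexMerger front-end):

```
# Index only the segment fetched in the last cycle...
s=`ls -d crawl/segments/2* | tail -1`
bin/nutch index crawl/index-new crawl/crawldb crawl/linkdb $s

# ...then merge the fresh index into the existing one.
bin/nutch merge crawl/index-merged crawl/index crawl/index-new
```

If the same URL can appear in more than one segment, running bin/nutch dedup over the indexes before searching may also be needed.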

[Nutch-general] crawling by ip range

2007-06-12 Thread Cesar Voulgaris
Hi all, I have had a problem for some time: I want to crawl only sites from my country or related to it. The problem is that crawling only by domain (in my case I set the regex-urlfilter regex to catch (com|org|..).uy) leaves out a lot of sites which don't end in .uy but in .com or .org. I don't

Re: [Nutch-general] crawling by ip range

2007-06-12 Thread Enzo Michelangeli
I have written a custom URLFilter that resolves the hostname into an IP address and checks the latter against a GeoIP database. Unfortunately the source code was developed under a commercial contract, and is not freely available. Enzo - Original Message - From: Cesar Voulgaris [EMAIL

[Nutch-general] is it possible to set different addDays for different sites?

2007-06-12 Thread Manoharam Reddy
Hi, I am trying to solve a problem but I am unable to find any feature in Nutch that lets me solve this problem. Let's say in my intranet there are 1000 sites. Sites 1 to 100 have pages that are never going to change, i.e. they are static. So I don't need to crawl them again and again. But

[Nutch-general] Why Nutch is indexing HTTP 302 pages

2007-06-12 Thread Manoharam Reddy
I find in the search results that lots of HTTP 302 pages have been indexed. This is decreasing the quality of search results. Is there any way to disable indexing such pages? I want only HTTP 200 OK pages to be indexed. -

Re: [Nutch-general] Why Nutch is indexing HTTP 302 pages

2007-06-12 Thread Doğacan Güney
On 6/11/07, Manoharam Reddy [EMAIL PROTECTED] wrote: I find in the search results that lots of HTTP 302 pages have been indexed. This is decreasing the quality of search results. Is there any way to disable indexing such pages? I want only HTTP 200 OK pages to be indexed. If you run fetcher

[Nutch-general] Hadoop startup...

2007-06-12 Thread Emmanuel JOKE
Hi guys, I've got some trouble making Hadoop work. I get the following error when I launch the slaves script: [EMAIL PROTECTED] search]$ bin/slaves.sh uptime + usage='Usage: slaves.sh [--config confdir] command...' + '[' 1 -le 0 ']' ++ dirname bin/slaves.sh + bin=bin ++ cd bin ++ pwd +

Re: [Nutch-general] is it possible to set different addDays for different sites?

2007-06-12 Thread Marcin Okraszewski
As far as I know, it is currently not possible. But if I'm correct, there is a patch applied in trunk that adapts the re-fetch frequency of a page to how often it is actually updated. You could possibly use it from a nightly build, or wait for the next release. Regards, Marcin Hi, I am trying to solve a problem but I
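If the patch Marcin means is the adaptive fetch schedule in trunk, it would be switched on with something like the following in nutch-site.xml (the class name and its availability in the nightly build are assumptions, so check your build before relying on it):

```xml
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  <description>Lengthens the re-fetch interval for pages that
  don't change and shortens it for pages that do, instead of
  using one fixed interval for every site.</description>
</property>
```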

[Nutch-general] Nutch/Hadoop Fetcher confusion

2007-06-12 Thread patrik
I'm running Nutch 0.8.1 on 3 servers. Everything works fine, but I'm confused about some Fetcher behavior. I'll generate a list of 100k URLs to fetch, and that works fine. However, only one server in the cluster actually fetches a reasonable number; two out of three get at most 20 pages. I've gotta

[Nutch-general] Cache problem,

2007-06-12 Thread Phạm Hải Thanh
Hi all, I have a problem with the cache: after crawling and searching successfully, the cached page is displayed with square question marks. Please take a look at http://192.168.71.66:8080/cached.jsp?idx=0&id=1. I have tried some configuration changes, but no luck. Do you have any idea? By the way, anyone

Re: [Nutch-general] Cache problem,

2007-06-12 Thread Enzo Michelangeli
- Original Message - From: Phạm Hải Thanh [EMAIL PROTECTED] Sent: Tuesday, June 12, 2007 9:29 AM Hi all, I have a problem with the cache: after crawling and searching successfully, the cached page is displayed with square question marks. Please take a look at

Re: [Nutch-general] Cache problem,

2007-06-12 Thread Phạm Hải Thanh
Oops, I am sorry, here is the link: http://203.162.71.66:8080/cached.jsp?idx=0&id=1 I also think this is an encoding issue :( About this config property: <property> <name>fetcher.store.content</name> <value>false</value> <description>If true, fetcher will store content.</description> </property> I have

Re: [Nutch-general] What is parse-oo and why doesn't parsed PDF content show up in cached.jsp ?

2007-06-12 Thread Manoharam Reddy
How can I change it to read from segment/parse_text instead of segment/content? On 5/31/07, Doğacan Güney [EMAIL PROTECTED] wrote: Hi, On 5/31/07, Manoharam Reddy [EMAIL PROTECTED] wrote: Some confusion regarding plugin.includes: 1. I find a parse-oo in the plugins folder. What is that

[Nutch-general] How to index javascript contents

2007-06-12 Thread cyanean
Dear all, my client uses HTTrack with GDS (Google Desktop Search). While pages are fetched much more quickly using Nutch (kudos to the Nutch engine developers), it doesn't seem to index the entire page like HTTrack/GDS does. As a result, he claims that if he searches on 'hbx' (a web analytics tool that is
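If the terms the client searches for occur only inside JavaScript, one thing worth checking is whether the parse-js plugin is enabled in plugin.includes. A hedged nutch-site.xml sketch (the rest of the plugin list is only an example; keep whatever your existing configuration already includes):

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
  <description>Adding parse-js lets Nutch parse JavaScript
  content and extract text and outlinks from it, so terms that
  appear only in scripts can be indexed.</description>
</property>
```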

Re: [Nutch-general] Nutch/Hadoop Fetcher confusion

2007-06-12 Thread Doğacan Güney
Hi, On 6/12/07, patrik [EMAIL PROTECTED] wrote: I'm running Nutch 0.8.1 on 3 servers. Everything works fine, but I'm confused about some Fetcher behavior. I'll generate a list of 100k URLs to fetch, and that works fine. However, only one server in the cluster actually fetches a reasonable number.

Re: [Nutch-general] What is parse-oo and why doesn't parsed PDF content show up in cached.jsp ?

2007-06-12 Thread Doğacan Güney
On 6/12/07, Manoharam Reddy [EMAIL PROTECTED] wrote: How can I change it to read from segment/parse_text instead of segment/content? If you are using Nutch's web UI, you have to change this part in cached.jsp: <% } else { %> The cached content has mime type <%=contentType%>, click this <a

[Nutch-general] Hadoop Log4j ?

2007-06-12 Thread Emmanuel JOKE
It seems I'm having a lot of trouble trying to configure Hadoop on one machine. I've followed the wiki tutorial and configured everything on one machine. I tried to start Hadoop using start-all.sh and it works. I have the following output: starting namenode, logging to

Re: [Nutch-general] Nutch/Hadoop Fetcher confusion

2007-06-12 Thread patrik
When the generator runs in distributed mode, it partitions URLs to separate map tasks according to their hosts. This way, URLs under the same host end up in the same map task (which is necessary for politeness). So, in your case, you either have very few hosts (of which one has almost 100K URLs) or

Re: [Nutch-general] Nutch/Hadoop Fetcher confusion

2007-06-12 Thread Doğacan Güney
On 6/12/07, patrik [EMAIL PROTECTED] wrote: When the generator runs in distributed mode, it partitions URLs to separate map tasks according to their hosts. This way, URLs under the same host end up in the same map task (which is necessary for politeness). So, in your case, you either have very

Re: [Nutch-general] Nutch/Hadoop Fetcher confusion

2007-06-12 Thread Andrzej Bialecki
Doğacan Güney wrote: I think you may also run a segment merge. If you run segmerge on a single segment (where you set the number of reduce tasks to the desired number of fetchers), segmerge will put an equal number of URLs in every part. Then set fetcher.max.threads.per.host to a value greater than 1

[Nutch-general] Can nutch index the javascript code too?

2007-06-12 Thread Joseph Chan
Dear all, my client uses HTTrack with GDS (Google Desktop Search). While pages are fetched much more quickly using Nutch (kudos to the Nutch engine developers), it doesn't seem to index the entire page like HTTrack/GDS does. As a result, he claims that if he searches on 'hbx' (a web analytics tool that

Re: [Nutch-general] Cache problem,

2007-06-12 Thread Enzo Michelangeli
- Original Message - From: Phạm Hải Thanh [EMAIL PROTECTED] Sent: Tuesday, June 12, 2007 10:06 AM Oops, I am sorry, here is the link: http://203.162.71.66:8080/cached.jsp?idx=0&id=1 I also think this is an encoding issue :( It looks fine to me, both with Firefox and MSIE 7

Re: [Nutch-general] Cache problem,

2007-06-12 Thread Phạm Hải Thanh
Hi Enzo, hi all. I fixed it all yesterday, so it looks fine to everyone ^^ For some reason, cached.jsp cannot get the charset from the hit, so I have forced it: content = new String(bean.getContent(details), "utf-8"); Thanks [EMAIL PROTECTED] about this. Thank you very much, Enzo. -Original

[Nutch-general] meaning of depth value - tutorial wrong?

2007-06-12 Thread Manoharam Reddy
The tutorial says that the depth value is the level of depth of a page from the root of a website. So, as per the tutorial, if I want to fetch a page, say http://www.blabla.com/a/b/c/d/e/a.html, I must set depth = 6. But I find in the source code that depth is simply a for loop. It will