dfs filesystem

2006-04-20 Thread Anton Potehin
Which Linux file system is preferred for the DFS name-node and
data-node?

 



[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

2006-04-20 Thread Christophe Noel (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12375300 ] 

Christophe Noel commented on NUTCH-173:
---

There are TENS of Nutch users relying on this precious patch.

Most Nutch users are not building a whole-web search engine (too much hardware 
needed) but want to develop dedicated search engines.

We sometimes crawl 1,000 and sometimes 25,000 web servers, and with 25,000 
entries in prefix-urlfilter the crawl really slows down.

This patch is NEEDED!

Christophe Noël
CETIC
Belgium

> PerHost Crawling Policy ( crawl.ignore.external.links )
> ---
>
>  Key: NUTCH-173
>  URL: http://issues.apache.org/jira/browse/NUTCH-173
>  Project: Nutch
> Type: New Feature

>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.8-dev
> Reporter: Philippe EUGENE
> Priority: Minor
>  Attachments: patch.txt, patch08.txt
>
> There are two major ways to crawl in Nutch:
> Intranet crawl: forbid everything, then allow a few hosts.
> Whole-web crawl: allow everything, then forbid a few things.
> I propose a third type of crawl.
> Directory crawl: the purpose of this crawl is to manage a few thousand 
> hosts without managing rule patterns in UrlFilterRegexp.
> I made two patches: one for 0.7/0.7.1 and one for 0.8-dev.
> I propose a new boolean property in nutch-site.xml: 
> crawl.ignore.external.links, defaulting to false.
> By default this new feature does not modify the behavior of the Nutch crawler.
> When you set this property to true, the crawler does not fetch links external 
> to the host.
> So the crawl is limited to the hosts that you inject at the beginning of the 
> crawl.
> I know there are proposals for new crawl policies using the CrawlDatum in 
> the 0.8-dev branch. 
> This feature could be an easy way to add a new crawl feature to Nutch 
> quickly, while waiting for a better way to improve crawl policy.
> I post two patches.
> Sorry for my very poor English. 
> --
> Philippe
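For readers who want to try the patch, the property described above would presumably be enabled with an entry like the following in conf/nutch-site.xml. This is a sketch based on the issue description only; the description text is illustrative, not taken from the patch:

```xml
<!-- Sketch of the proposed setting; the property name comes from the
     issue, the description wording is illustrative. -->
<property>
  <name>crawl.ignore.external.links</name>
  <value>true</value>
  <description>If true, the fetcher follows only links that stay on the
  host of the injected URLs; links to external hosts are ignored.
  Defaults to false, which leaves the crawler's behavior unchanged.
  </description>
</property>
```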

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Resolved: (NUTCH-250) Generate to log truncation caused by generate.max.per.host

2006-04-20 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-250?page=all ]
 
Doug Cutting resolved NUTCH-250:


Fix Version: 0.8-dev
 Resolution: Fixed
  Assign To: Doug Cutting

I just committed this.  Thanks, Rod.

> Generate to log truncation caused by generate.max.per.host
> --
>
>  Key: NUTCH-250
>  URL: http://issues.apache.org/jira/browse/NUTCH-250
>  Project: Nutch
> Type: Improvement

> Versions: 0.8-dev
> Reporter: Rod Taylor
> Assignee: Doug Cutting
>  Fix For: 0.8-dev
>  Attachments: nutch-generate-truncatelog.patch
>
> LOG.info() hosts which have had their generate lists truncated.
> This can inform admins about potential abusers or excessively large sites 
> that they may wish to block with rules.
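The patch itself is attached as nutch-generate-truncatelog.patch; a minimal sketch of the kind of LOG.info() call the description suggests might look as follows (illustrative only; the class, method names, and message format here are mine, not the patch's):

```java
// Illustrative sketch only: not the code in nutch-generate-truncatelog.patch.
// Shows the kind of LOG.info() call the issue describes, emitted when a
// host's generate list is cut off at generate.max.per.host.
import java.util.logging.Logger;

public class GenerateTruncateLog {
  private static final Logger LOG = Logger.getLogger("Generator");

  // Builds the message; separated out so it is easy to test.
  static String format(String host, int maxPerHost, int skipped) {
    return "Host " + host + " exceeded generate.max.per.host ("
        + maxPerHost + "); skipped " + skipped + " URLs.";
  }

  static void logTruncation(String host, int maxPerHost, int skipped) {
    LOG.info(format(host, maxPerHost, skipped));
  }

  public static void main(String[] args) {
    logTruncation("example.com", 100, 2534);
  }
}
```

An admin scanning the generate log could then grep for these lines to find candidate hosts to block with URL-filter rules.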




[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

2006-04-20 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12375421 ] 

Doug Cutting commented on NUTCH-173:


+1, with a few modifications.

Can you please re-generate this against the current sources?  This patch does 
not apply for me.

Also, the fromHost should only be computed if crawl.ignore.external.links is 
true.

Finally, please add an entry to conf/nutch-default.xml for the new parameter in 
your patch.

Thanks!
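Doug's second point, computing fromHost only when the feature is enabled, could look roughly like this (a sketch; the real class and method names in the patch will differ):

```java
// Sketch of Doug's suggestion: only pay for the host lookup when
// crawl.ignore.external.links is on. Names are illustrative, not
// taken from the actual patch.
import java.net.URI;

public class OutlinkFilter {
  private final boolean ignoreExternal; // crawl.ignore.external.links

  OutlinkFilter(boolean ignoreExternal) {
    this.ignoreExternal = ignoreExternal;
  }

  // Returns true if the outlink should be kept.
  boolean accept(String fromUrl, String toUrl) {
    if (!ignoreExternal) {
      return true; // fromHost is never computed in the default case
    }
    String fromHost = URI.create(fromUrl).getHost();
    String toHost = URI.create(toUrl).getHost();
    return fromHost != null && fromHost.equalsIgnoreCase(toHost);
  }
}
```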




Re: mapred.map.tasks

2006-04-20 Thread Doug Cutting

Anton Potehin wrote:

We have a question on this property. Is it really preferred to set this
parameter several times greater than the number of available hosts? We do
not understand why it should be so.


It should be at least numHosts*mapred.tasktracker.tasks.maximum, so that 
all of the task slots are used.  More tasks make recovery faster when a 
task fails, since less work needs to be redone.



Our spider is distributed among 3 machines. What value is preferred
for this parameter in our case? Which other factors may affect the
preferred value of this parameter?


When fetching, the total number of hosts you're fetching can also be a 
factor, since fetch tasks are hostwise-disjoint.  If you're only 
fetching a few hosts, then a large value for mapred.map.tasks will cause 
there to be a few big fetch tasks and a bunch of empty ones.  This could 
be a problem if the big ones are not allocated evenly among your nodes.


I generally use 5*numHosts*mapred.tasktracker.tasks.maximum.

Doug


Re: mapred.map.tasks

2006-04-20 Thread Doug Cutting
One more thing.  This parameter should be set in mapred-default.xml, not 
hadoop-site.xml or nutch-site.xml.  Parameters in those latter files 
cannot be overridden by application settings, and mapred.map.tasks is 
sometimes overridden.


Doug
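Putting Doug's two emails together, the entry in mapred-default.xml might look like this for the 3-node cluster described above. The value is illustrative and assumes the default mapred.tasktracker.tasks.maximum of 2:

```xml
<!-- Illustrative only: 5 * 3 hosts * 2 task slots per tracker = 30,
     following the 5*numHosts*mapred.tasktracker.tasks.maximum rule
     of thumb. -->
<property>
  <name>mapred.map.tasks</name>
  <value>30</value>
</property>
```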


Re: nutch user meeting in San Francisco: May 18th

2006-04-20 Thread Doug Cutting

Folks can say whether they'll attend at:

http://www.evite.com/app/publicUrl/[EMAIL PROTECTED]/nutch-1

Doug


nutch user meeting in San Francisco: May 18th

2006-04-20 Thread Stefan Groschupf

(with apologies for multiple postings)

Dear Nutch users, Dear Nutch developers, Dear Hadoop developers,

we would love to invite you to the Nutch user meeting in San Francisco.

Date:  Thursday, May 18th, 2006
Time: 7 PM.
Location:  Cafe Du Soleil, 200 Fillmore St, San Francisco, CA 94117.   
(Thanks to Michael Stack for helping to find this location)


http://sanfrancisco.citysearch.com/profile/41734267/san_francisco_ca/cafe_du_soleil.html
http://maps.yahoo.com/beta/#maxp=search&q1bizid=29996598&q1=200+Fillmore+St+San+Francisco&mvt=m&trf=0&lon=-122.430074214935&lat=37.7713763317352&mag=3


Talks or anything like that are not planned; the location is a cafe,  
so the idea is to meet each other, have dinner, and have some drinks.
It would be nice to get an idea of how many people will join this  
meeting, so we can switch the location if necessary.
Please post to the nutch-user list if you plan to join the  
meeting; registration is not required, however.


Looking forward to meeting you.

Stefan


RE: mapred.map.tasks

2006-04-20 Thread anton
Thanks. We changed this parameter in hadoop-default.xml.

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 20, 2006 11:53 PM
To: nutch-dev@lucene.apache.org
Subject: Re: mapred.map.tasks

One more thing.  This parameter should be set in mapred-default.xml, not 
hadoop-site.xml or nutch-site.xml.  Parameters in those latter files 
cannot be overridden by application settings, and mapred.map.tasks is 
sometimes overridden.

Doug