Re: need help to speed up map-reduce

2006-11-06 Thread Uroš Gruber

Doug Cook wrote:

I've been planning to spend some time looking at this, but haven't gotten
round to it yet -- I see the same (serious) performance problems on a single
machine setup -- reduce takes quite a bit longer than the fetch (map)
operation in my case, and this is on a very fast 4-CPU machine with a ton of
memory. It just doesn't seem like it should take this long. I'm using 0.8 +
some patches & local mods.

If you find some things, please let me know. Likewise, when I get round to
it, I will post my findings.

  
I was talking about this slowness months ago, so I'm glad someone else has
the same problems. We also have a single machine, and the reduce task takes
hours to complete. The funny thing is that the CPU is loaded at 100%, yet when
we run searches on this server there is no difference in speed. Still, it
would be great if things went faster.


When fetching I get 20 to 30 pages per second, but then I have to wait for
the reduce task to finish. I tried debug logging, and the only thing I can see
is a gap of about 1 to 3 seconds between reduce log messages. I know that
map/reduce is meant to be used across multiple nodes.
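One thing that might be worth trying on a single box is giving the local
sort/merge more room, since a fair share of reduce time goes into sorting and
merging map output. A rough sketch, assuming the Hadoop 0.x property names
io.sort.factor and io.sort.mb and the NutchJob/NutchConfiguration classes of
that era (the values are placeholders, not recommendations):

import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;

public class LocalTuningSketch {
  public static JobConf tunedJob() {
    JobConf job = new NutchJob(NutchConfiguration.create());
    job.setInt("io.sort.factor", 25); // streams merged at once while sorting
    job.setInt("io.sort.mb", 200);    // in-memory buffer for sorting map output
    return job;
  }
}

The same properties can of course be set in hadoop-site.xml instead of in code.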


regards

Uros

Thanks,

Doug



AJ Chen-2 wrote:
  

Sorry for repeating this question, but I have to find a solution, otherwise
the crawling is too slow to be practical. I'm using nutch 0.9-dev on one
Linux server to crawl millions of pages. The fetching itself is reasonable,
but the map-reduce operations are killing the performance. For example,
fetching takes 10 hours and map-reduce also takes 10 hours, which makes the
overall performance very slow. Can anyone share experience on how to speed
up map-reduce for single-server crawling? The single server uses the local
file system; it should spend very little time doing map and reduce, shouldn't
it?

Thanks,
--
AJ Chen, PhD
http://web2express.org





  




Re: [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-03 Thread Uroš Gruber

Doug Cook wrote:

In this case, the site uses the "right" kind of redirect. Unfortunately, as
you point out, it's not at all clear that we can rely on sites correctly
choosing the type of redirect (I tried a few sites and most were 302s, even
in cases where the redirect was to the permanent, canonical version of the
page). And then there's the problem of what to do with meta refresh tags,
which don't have a "permanent" vs. "temporary" indication.

An alternative is to use the link structure - the page with the most
external links is likely the canonical version of the page. (Although with
permanent redirects, there is a time lag as sites linking to the page stop
using the old name and start using the new name). This won't work well in
small crawls, though, given the relative paucity of links.

  
This could be something, because other sites most certainly don't link to
redirects. But as you point out, there is a problem with permanent redirects;
we have exactly the same situation on our portal. We have a new structure and
some links have changed, so we added permanent redirects from the old URLs to
the new ones. In this case the only solution is to replace the URL with the
permanent one.



In any case, if we have an inexpensive way of aliasing the two to be the
same, we won't lose any anchor text, and we're effectively not "throwing
out" either URL, so it matters less which one we choose. 
  

Do you have an example of what this aliasdb would look like?
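Purely as an illustration of the idea (none of the class or method names
below exist in Nutch), an aliasdb could be little more than a map from every
known alias of a page to one canonical URL, so that anchor text collected
under either name ends up on the same crawldb record. A minimal in-memory
sketch:

import java.util.HashMap;
import java.util.Map;

public class AliasDbSketch {
  private final Map<String, String> aliasToCanonical = new HashMap<String, String>();

  public void addAlias(String alias, String canonical) {
    aliasToCanonical.put(alias, canonical);
  }

  /** Returns the canonical form of a URL, or the URL itself if it has no alias. */
  public String canonicalize(String url) {
    String canonical = aliasToCanonical.get(url);
    return canonical != null ? canonical : url;
  }

  public static void main(String[] args) {
    AliasDbSketch db = new AliasDbSketch();
    // Both names of the developerWorks page collapse onto the one IBM wants seen.
    db.addAlias("http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html",
                "http://www.ibm.com/developerworks/lotus/downloads/toolkits.html");
    System.out.println(db.canonicalize(
        "http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html"));
  }
}

A real implementation would presumably live on disk next to the crawldb, but
the lookup it has to answer is this simple.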

regards

  -Doug


Uroš Gruber-2 wrote:
  

Ken Krugler (JIRA) wrote:


[
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304
] 

Ken Krugler commented on NUTCH-353:

---

+1 that the redirect target is not always the "real" URL that we want to
keep.

For example,
http://www.ibm.com/developerworks/lotus/downloads/toolkits.html =>
http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html. This
holds true for most  (all?) developerWorks pages; they redirect to
www-128.ibm.com/, but IBM would love for the URL everybody sees
to still be www.ibm.com/.

  
  
If you check the status code of the original URL, you get 302 Found. By
definition:



  10.3.3 302 Found

The requested resource resides temporarily under a different URI. Since 
the redirection might be altered on occasion, the client SHOULD continue 
to use the Request-URI for future requests. This response is only 
cacheable if indicated by a Cache-Control or Expires header field.


In this case there is no need to replace the original URL with the redirected one.

I know that a lot of sites use permanent redirects in such cases, but I
don't see any proper solution that handles both.
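For illustration only (the class and method names are made up; this is not
Nutch's actual redirect handling), the distinction the two status codes give
us boils down to something like:

public class RedirectPolicySketch {
  /**
   * 301 Moved Permanently: adopt the redirect target as the page's URL.
   * 302 Found (or 307): keep the original Request-URI, per RFC 2616 10.3.3.
   */
  static String urlToKeep(int httpStatus, String originalUrl, String targetUrl) {
    return httpStatus == 301 ? targetUrl : originalUrl;
  }
}

Meta refresh redirects, as noted earlier in the thread, carry no such signal,
so they would still need a separate policy.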



regards

Uros


pages that serverside forwards will be refetched every time
---

Key: NUTCH-353
URL: http://issues.apache.org/jira/browse/NUTCH-353
Project: Nutch
 Issue Type: Bug
   Affects Versions: 0.8.1, 0.9.0
   Reporter: Stefan Groschupf
Assigned To: Andrzej Bialecki 
   Priority: Blocker

Fix For: 0.9.0

Attachments: doNotRefecthForwarderPagesV1.patch


Pages that do a serverside forward are not written with a status change
back into the crawlDb. Also the nextFetchTime is not changed. 
This causes a refetch of the same page again and again. The result is that
Nutch is not polite and refetches both the forwarding page and the target page
in each segment iteration. It also affects the scoring, since the forwarding
page contributes its score to all outlinks.


  
  





  




Re: [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-02 Thread Uroš Gruber

Ken Krugler (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ] 

Ken Krugler commented on NUTCH-353:

---

+1 that the redirect target is not always the "real" URL that we want to keep.

For example, http://www.ibm.com/developerworks/lotus/downloads/toolkits.html => 
http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html. This holds true for most  
(all?) developerWorks pages; they redirect to www-128.ibm.com/, but IBM would 
love for the URL everybody sees to still be www.ibm.com/.

  
If you check the status code of the original URL, you get 302 Found. By
definition:



 10.3.3 302 Found

The requested resource resides temporarily under a different URI. Since 
the redirection might be altered on occasion, the client SHOULD continue 
to use the Request-URI for future requests. This response is only 
cacheable if indicated by a Cache-Control or Expires header field.


In this case there is no need to replace the original URL with the redirected one.

I know that a lot of sites use permanent redirects in such cases, but I
don't see any proper solution that handles both.



regards

Uros

pages that serverside forwards will be refetched every time
---

Key: NUTCH-353
URL: http://issues.apache.org/jira/browse/NUTCH-353
Project: Nutch
 Issue Type: Bug
   Affects Versions: 0.8.1, 0.9.0
   Reporter: Stefan Groschupf
Assigned To: Andrzej Bialecki 
   Priority: Blocker

Fix For: 0.9.0

Attachments: doNotRefecthForwarderPagesV1.patch


Pages that do a serverside forward are not written with a status change back into the crawlDb. Also the nextFetchTime is not changed. 
This causes a refetch of the same page again and again. The result is that Nutch is not polite and refetches both the forwarding page and the target page in each segment iteration. It also affects the scoring, since the forwarding page contributes its score to all outlinks.



  




Re: [Fwd: Re: get CrawlDatum]

2006-09-06 Thread Uroš Gruber

Andrzej Bialecki wrote:

Uroš Gruber wrote:
I made a draft patch, but there are still some problems I see. I know the
code needs to be cleaned up and tested. Right now I don't know what number to
set for external URLs; for internal links it works great.


(the patch changes CrawlDatum itself, I think it would be better to 
put the hop counter in CrawlDatum.metaData.)



I can try to do it with metaData.


The whole idea of these changes is this:

Injected URLs always get hop 0. While fetching/updating/generating, the hop
value is incremented by 1 (still no idea what to do with external links).
Then I can add a config value such as max_hop to stop the fetcher and
generator from creating more URLs beyond that depth.


This way it's possible to limit crawling vertically (by depth).

Comments are welcome.


Well, it really depends on what you want to do when you encounter an 
external link. Do you want to restart the counter, i.e. crawl the new 
site at full depth up to max_hop? Then set hop=0. Do you want to 
terminate the crawl at that link? Then set hop=max_hop.


I talked this over with a friend, and here is what we came up with. Let's say
the manually injected URLs are good and have been checked by a human, and you
probably want to start from them, so setting hop to 0 at injection is OK.
While crawling we have some sort of filtering by host (regexp etc.), so we
don't need to worry about URLs that are not on our list; their hop can be set
to whatever, maybe to max_hop.


But here is a scenario: we inject foo.com and bar.com. After crawling, we
find on foo.com a link to bar.com/hop/hop/index.html. We can set that URL's
hop to 0 or to max, because we can update it once we find the URL on the
bar.com site itself.


Checking the hop needs to be done while updating, I think, so we don't end
up with a bunch of URLs whose hop is greater than max_hop.


I will try to make a decent patch for this to review, and if anyone else has
ideas, please comment on this.
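To make the intended behaviour concrete, here is a plain-Java sketch of the
logic described above. It uses an ordinary HashMap instead of
CrawlDatum/metaData, so the class and method names are illustrative only:
injected URLs start at hop 0, each discovered outlink gets its parent's hop
+ 1 (keeping the smallest value seen), and generation skips anything deeper
than max_hop.

import java.util.HashMap;
import java.util.Map;

public class HopLimitSketch {
  private final Map<String, Integer> hops = new HashMap<String, Integer>();
  private final int maxHop;

  public HopLimitSketch(int maxHop) { this.maxHop = maxHop; }

  // Injected URLs are trusted, so they start the count.
  public void inject(String url) { hops.put(url, 0); }

  // Called while updating the db: an outlink sits one hop below its parent.
  public void recordOutlink(String parentUrl, String outlinkUrl) {
    Integer parentHop = hops.get(parentUrl);
    if (parentHop == null) return;        // unknown parent, nothing to base the count on
    int candidate = parentHop + 1;
    Integer existing = hops.get(outlinkUrl);
    if (existing == null || candidate < existing) {
      hops.put(outlinkUrl, candidate);    // keep the shortest path, as in the bar.com scenario
    }
  }

  // Generator-side check: only URLs within max_hop make it into a fetchlist.
  public boolean shouldGenerate(String url) {
    Integer hop = hops.get(url);
    return hop != null && hop <= maxHop;
  }
}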


regards

Uros


[Fwd: Re: get CrawlDatum]

2006-09-06 Thread Uroš Gruber
A while ago I posted this on the dev list but got no reply. I wonder whether
this is the right approach and whether I should continue building this
feature. Do you think this idea would help Nutch, or is it maybe a dead end
you've already discussed?


regards

Uros

Andrzej Bialecki wrote:

Uroš Gruber wrote:

ParseData.metadata sounds nice, but I think I'm lost again :)
If I understand the code flow, the best place would be in Fetcher [262],
but I'm not sure that datum holds the info of the URL being fetched.


On the input to the fetcher you get a URL and a CrawlDatum (originally 
coming from the crawldb). Check for example how the segment name is 
passed around in metadata, you can use the same method.



Hi,

I made a draft patch, but there are still some problems I see. I know the
code needs to be cleaned up and tested. Right now I don't know what number to
set for external URLs; for internal links it works great.


The whole idea of these changes is this:

Injected URLs always get hop 0. While fetching/updating/generating, the hop
value is incremented by 1 (still no idea what to do with external links).
Then I can add a config value such as max_hop to stop the fetcher and
generator from creating more URLs beyond that depth.


This way it's possible to limit crawling vertically (by depth).

Comments are welcome.





Re: [jira] Commented: (NUTCH-249) black- white list url filtering

2006-09-05 Thread Uroš Gruber

Zaheed Haque wrote:

Hi

A lot of the patches/plugins in Jira are not updated to reflect changes in
trunk. Probably the way to test it would be to build it against that specific
revision of Nutch.

I'm aware of that. I just put a note because I see that this patch is 
for 0.9.


regards,

Uros

cheers

On 9/5/06, Uros Gruber (JIRA) <[EMAIL PROTECTED]> wrote:
[ 
http://issues.apache.org/jira/browse/NUTCH-249?page=comments#action_12432584 
]


Uros Gruber commented on NUTCH-249:
---

I'm trying to test this patch but I'm having build problems

compile-core:
[javac] Compiling 2 source files to 
/usr/home/uros/nutch-wb/build/classes
[javac] 
/usr/home/uros/nutch-wb/src/java/org/apache/nutch/crawl/bw/BWUpdateDb.java:261: 
createJob(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path) 
in org.apache.nutch.crawl.CrawlDb cannot be applied to 
(org.apache.hadoop.conf.Configuration,java.io.File)
[javac] JobConf updateJob = CrawlDb.createJob(getConf(), 
crawlDb);

[javac]^
[javac] 
/usr/home/uros/nutch-wb/src/java/org/apache/nutch/crawl/bw/BWUpdateDb.java:267: 
install(org.apache.hadoop.mapred.JobConf,org.apache.hadoop.fs.Path) 
in org.apache.nutch.crawl.CrawlDb cannot be applied to 
(org.apache.hadoop.mapred.JobConf,java.io.File)

[javac] CrawlDb.install(updateJob, crawlDb);
[javac]^
[javac] Note: 
/usr/home/uros/nutch-wb/src/java/org/apache/nutch/crawl/bw/BWUpdateDb.java 
uses or overrides a deprecated API.
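The two errors are just a type mismatch: in current trunk,
CrawlDb.createJob() and CrawlDb.install() take an org.apache.hadoop.fs.Path,
while the patch still passes a java.io.File. One way to bridge it while
testing (an assumed workaround, not the patch author's fix) is to wrap the
File before the calls:

import java.io.File;
import org.apache.hadoop.fs.Path;

public class PathBridgeSketch {
  // Convert the File the patch carries around into the Path the trunk API expects.
  static Path toPath(File dir) {
    return new Path(dir.getPath());
  }
}

after which the calls become CrawlDb.createJob(getConf(), toPath(crawlDb))
and CrawlDb.install(updateJob, toPath(crawlDb)).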




> black- white list url filtering
> ---
>
> Key: NUTCH-249
> URL: http://issues.apache.org/jira/browse/NUTCH-249
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 0.8
>Reporter: Stefan Groschupf
>Priority: Trivial
> Fix For: 0.9.0
>
> Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch
>
>
> Existing url filter mechanisms need to process each url against each
> filter pattern. For very large filter sets this may not scale very well.


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the 
administrators: http://issues.apache.org/jira/secure/Administrators.jspa

-
For more information on JIRA, see: 
http://www.atlassian.com/software/jira








Re: [jira] Commented: (NUTCH-361) generator create fetchlist randomly

2006-09-03 Thread Uroš Gruber

Sami Siren (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432322 ] 

Sami Siren commented on NUTCH-361:

--

I started to write (already put some on svn trunk) some simple JUnit tests
for the main tools (inject, generate, fetch). If you can extend some of those
to demonstrate this problem, it would be easier to track down.

  
I ran it, and here is the problem that popped out:

  [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 4.294 sec
  [junit] Test org.apache.nutch.crawl.TestGenerator FAILED

I ran this on the server, but I have problems running the test from Eclipse.
java.lang.ArithmeticException: / by zero
   at 
org.apache.nutch.crawl.PartitionUrlByHost.getPartition(PartitionUrlByHost.java:49)

   at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:152)
   at 
org.apache.nutch.crawl.Generator$SelectorInverseMapper.map(Generator.java:223)

   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:51)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:195)
   at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:106)


Probably some configuration problems.
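The trace points at the partitioner being handed a job with zero reduce
tasks rather than at anything Generator-specific. The failing line presumably
has roughly the shape below (reconstructed from the stack trace, not copied
from the Nutch source); the modulo throws as soon as numReduceTasks is 0:

public class PartitionSketch {
  // Hash-by-host partitioning of the kind PartitionUrlByHost performs:
  // with numReduceTasks == 0 the % operation raises "/ by zero".
  static int partitionByHost(String host, int numReduceTasks) {
    return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}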

regards

Uros

generator create fetchlist randomly
---

Key: NUTCH-361
URL: http://issues.apache.org/jira/browse/NUTCH-361
Project: Nutch
 Issue Type: Bug
 Components: fetcher
   Affects Versions: 0.9.0
Environment: Java 1.5, FreeBSD 6.1
   Reporter: Uros Gruber
   Priority: Critical

I noticed problems during fetchlist generation. I already posted some info on
the users list. Today I checked release 0.8, and I'm certain that the problem
exists only in versions later than that; I've tested only 0.8 and today's svn.
The problem is that the generator builds the fetchlist from the crawldb, but
every time I run it there is a different number of URLs in the fetchlist.
For example, I injected the 6 test URLs we use for testing, and in only 5 of
20 test runs were all of them listed in the fetchlist; sometimes only one was.
The config was always the same, also when testing version 0.8.
I tried to debug what might be going wrong, but I only found that all the URLs
were present in /tmp yet somehow missing from crawl_generate.
I also see some of
2006-09-02 20:14:20,147 DEBUG conf.Configuration - java.io.IOException: config(config)

at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:76)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:87)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:98)
at org.apache.nutch.util.NutchJob.<init>(NutchJob.java:26)
at org.apache.nutch.crawl.Generator.generate(Generator.java:330)
at org.apache.nutch.crawl.Generator.run(Generator.java:405)
at org.apache.nutch.util.ToolBase.doMain(ToolBase.java:145)
at org.apache.nutch.crawl.Generator.main(Generator.java:372)
if I enable DEBUG logging, but I doubt this has anything to do with it.



  




Re: Patch Available status?

2006-08-30 Thread Uroš Gruber

Chris Mattmann wrote:

Hi Doug and Andrzej,

  +1. I think that workflow makes a lot of sense. Currently users in the
nutch-developers group can close and resolve issues. In the Hadoop workflow,
would this continue to be the case?

  

+1

Regards,
Uros

Cheers,
  Chris



On 8/30/06 3:14 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

  

Doug Cutting wrote:


Sami Siren wrote:
  

I am not able to do it either, or then I just don't know how, can
Doug help us here?


This requires a change to the project's workflow.  I'd be happy to
move Nutch to use the workflow we use for Hadoop, which supports
"Patch Available".

This workflow has one other non-default feature, which is that bugs,
once closed, cannot be re-opened.  This works as follows: Only project
administrators are allowed to close issues.  Bugs are resolved as
they're fixed, and only closed when a release is made.  This keeps the
release notes Jira generates from changing after a release is made.

Would you like me to switch Nutch to use this Jira workflow?
  

+1, this would finally make sense with the "resolved" vs. "closed" ...




  




Re: get CrawlDatum

2006-08-30 Thread Uroš Gruber

Andrzej Bialecki wrote:

Uroš Gruber wrote:

ParseData.metadata sounds nice, but I think I'm lost again :)
If I understand the code flow, the best place would be in Fetcher [262],
but I'm not sure that datum holds the info of the URL being fetched.


On the input to the fetcher you get a URL and a CrawlDatum (originally 
coming from the crawldb). Check for example how the segment name is 
passed around in metadata, you can use the same method.



Hi,

I made a draft patch, but there are still some problems I see. I know the
code needs to be cleaned up and tested. Right now I don't know what number to
set for external URLs; for internal links it works great.


The whole idea of these changes is this:

Injected URLs always get hop 0. While fetching/updating/generating, the hop
value is incremented by 1 (still no idea what to do with external links).
Then I can add a config value such as max_hop to stop the fetcher and
generator from creating more URLs beyond that depth.


This way it's possible to limit crawling vertically (by depth).

Comments are welcome.

regards,

Uros
Index: java/org/apache/nutch/crawl/CrawlDatum.java
===
--- java/org/apache/nutch/crawl/CrawlDatum.java (revision 437981)
+++ java/org/apache/nutch/crawl/CrawlDatum.java (working copy)
@@ -57,6 +57,7 @@
   private byte status;
   private long fetchTime = System.currentTimeMillis();
   private byte retries;
+  private int hop;
   private float fetchInterval;
   private float score = 1.0f;
   private byte[] signature = null;
@@ -82,6 +83,8 @@
   public byte getStatus() { return status; }
   public void setStatus(int status) { this.status = (byte)status; }
 
+  public int getHop() { return hop; }
+  public void setHop (int hop) {this.hop = hop; }
   public long getFetchTime() { return fetchTime; }
   public void setFetchTime(long fetchTime) { this.fetchTime = fetchTime; }
 
@@ -151,6 +154,7 @@
 retries = in.readByte();
 fetchInterval = in.readFloat();
 score = in.readFloat();
+hop = in.readInt();
 if (version > 2) {
   modifiedTime = in.readLong();
   int cnt = in.readByte();
@@ -186,6 +190,7 @@
 out.writeByte(retries);
 out.writeFloat(fetchInterval);
 out.writeFloat(score);
+out.writeInt(hop);
 out.writeLong(modifiedTime);
 if (signature == null) {
   out.writeByte(0);
@@ -210,6 +215,7 @@
 this.score = that.score;
 this.modifiedTime = that.modifiedTime;
 this.signature = that.signature;
+this.hop = that.hop;
 this.metaData = new MapWritable(that.metaData); // make a deep copy
   }
 
@@ -290,6 +296,7 @@
 buf.append("Retries since fetch: " + getRetriesSinceFetch() + "\n");
 buf.append("Retry interval: " + getFetchInterval() + " days\n");
 buf.append("Score: " + getScore() + "\n");
+buf.append("Hop: " + getHop() + "\n");
 buf.append("Signature: " + StringUtil.toHexString(getSignature()) + "\n");
 buf.append("Metadata: " + (metaData != null ? metaData.toString() : 
"null") + "\n");
 return buf.toString();
Index: java/org/apache/nutch/crawl/Injector.java
===
--- java/org/apache/nutch/crawl/Injector.java   (revision 437981)
+++ java/org/apache/nutch/crawl/Injector.java   (working copy)
@@ -77,6 +77,7 @@
 value.set(url);   // collect it
 CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, 
interval);
 datum.setScore(scoreInjected);
+datum.setHop(0);
 try {
   scfilters.initialScore(value, datum);
 } catch (ScoringFilterException e) {
Index: java/org/apache/nutch/fetcher/Fetcher.java
===
--- java/org/apache/nutch/fetcher/Fetcher.java  (revision 437981)
+++ java/org/apache/nutch/fetcher/Fetcher.java  (working copy)
@@ -260,6 +260,8 @@
   Metadata metadata = content.getMetadata();
   // add segment to metadata
   metadata.set(SEGMENT_NAME_KEY, segmentName);
+
+  metadata.set("hop", Integer.toString(datum.getHop()));
   // add score to content metadata so that ParseSegment can pick it up.
   try {
 scfilters.passScoreBeforeParsing(key, datum, content);
Index: java/org/apache/nutch/parse/ParseOutputFormat.java
===
--- java/org/apache/nutch/parse/ParseOutputFormat.java  (revision 437981)
+++ java/org/apache/nutch/parse/ParseOutputFormat.java  (working copy)
@@ -85,8 +85,8 @@
   String fromHost = null; 
   String toHost = null;  
   textOut.append(key, new ParseText(parse.getText()));
-  
   ParseData parseData = parse.getData();
+  String pd = parseData.getContentMeta().get("hop");
   // recover th
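The diff is cut off by the archive at this point. Just to sketch where it
appears to be heading (an assumption, not the author's actual continuation):
once the hop travels in the content metadata (the string read into pd above),
ParseOutputFormat could bump it by one when it emits the CrawlDatum for each
outlink, roughly like this hypothetical helper:

import org.apache.nutch.crawl.CrawlDatum;

public class OutlinkHopSketch {
  // Derive an outlink's CrawlDatum from the "hop" string recovered from the
  // content metadata; names and placement are illustrative only.
  static CrawlDatum outlinkDatum(String parentHopMeta, float interval) {
    int parentHop = (parentHopMeta == null) ? 0 : Integer.parseInt(parentHopMeta);
    CrawlDatum link = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
    link.setHop(parentHop + 1);   // outlinks sit one hop below the fetched page
    return link;
  }
}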

Re: get CrawlDatum

2006-08-30 Thread Uroš Gruber

Andrzej Bialecki wrote:

Uroš Gruber wrote:

Hi,

Could someone point me to how to get the CrawlDatum data for the key URL in
ParseOutputFormat.write [83]?
I would like to add data to the link URLs, but this data depends on the data
of the URL being crawled.


You can't, because that instance of CrawlDatum is not available at 
this place. Either you need to provide it on the input to the 
map/reduce job (but then you will have to change input and output 
formats), or you should prepare this information in advance during 
parsing, and put it into ParseData.metadata.

ParseData.metadata sounds nice, but I think I'm lost again :)
If I understand the code flow, the best place would be in Fetcher [262],
but I'm not sure that datum holds the info of the URL being fetched.



I hope I was clear enough about my problem.

I hope so too ;)






get CrawlDatum

2006-08-30 Thread Uroš Gruber

Hi,

Could someone point me to how to get the CrawlDatum data for the key URL in
ParseOutputFormat.write [83]?
I would like to add data to the link URLs, but this data depends on the data
of the URL being crawled.


I hope I was clear enough about my problem.

regards

Uros


Re: [Nutch Wiki] Update of "RunNutchInEclipse" by UrosG

2006-08-29 Thread Uroš Gruber

Stefan Groschupf wrote:

Hi,

+ You may have problems with some imports in the parse-mp3 and parse-rtf
plugins. Because of incompatibility with the Apache license, they were left
out of the sources. You can find them here:

+
+ http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/
+
+ http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/
+
+ You need to copy the jar files into the plugin's "lib" path and refresh
the project.


Isn't the mp3 plugin deactivated? I suggest we remove it and put it in a
kind of sandbox together with the jars. However, I think the sandbox has to
be outside of Apache.


Stefan
I just added the note because I had errors in Eclipse. I also do not use
those plugins, so if you think it's better that way, I can change it.
A sandbox sounds nice. I found that there are a lot of plugins that are not
used often, so moving them out of Nutch in some way would be great.


regards

Uros


Nutch internals

2006-08-29 Thread Uroš Gruber

Hi,

I made some changes in CrawlDatum, but there are some things I don't quite
understand.

My idea is to add an int hop to CrawlDatum and set it to 0 in the Injector.
Then, for the other fetched URLs, it can be calculated as the parent URL's
hop + 1.


I'm trying to find where the addition of new URLs to the WebDB is done.
Could somebody explain this to me?


1. Inject (URLs are read from the url file, filtered through the enabled
filters, and stored in the WebDB).
2. After that, generate is started. Here the WebDB is read to create a list
of URLs to fetch.

3. The fetcher fetches the URLs and stores the results in the segment dirs.

4. updatedb: if I understand correctly, the data from segment/*/crawl_parse
is merged with the current WebDB. If so, creating the per-segment db is done
while fetching.


I think it's possible to get the CrawlDatum info of the URL being fetched
during fetching, then use its hop number to calculate the hop for all the
other URLs found on the current page, and store that.


Maybe I've missed the whole concept here.

After that I can use this hop number to limit fetchlist generation.

regards

Uros


Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-08-09 Thread Uroš Gruber

e w wrote:
What do you now set fetcher.threads.per.host to? Can you tell me what your
generate.max.per.host value is as well?



<property>
  <name>fetcher.server.delay</name>
  <value>0</value>
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>

<property>
  <name>generate.max.per.host</name>
  <value>400</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>10</value>
</property>

<property>
  <name>http.max.delays</name>
  <value>30</value>
</property>



I got big improvements after setting:


<property>
  <name>fetcher.server.delay</name>
  <value>0.5</value>
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server.</description>
</property>


even though I'm only generating 5 urls per host (generate.max.per.host=5).
I don't know whether fetcher.server.delay also affects requests made through
a proxy (anyone?) since I'm using a proxy.

Also, I still can't see any logging output from the fetchers (i.e. which URL
is being requested) in any log file anywhere. I'm not so hot with Java, but
can anyone here tell whether:

log4j.threshhold=ALL


I set this:

log4j.logger.org.apache.nutch=DEBUG
log4j.logger.org.apache.hadoop=DEBUG

so that I can see what is going on.

--
Uros
in conf/log4j.properties should use threshhold with 1 "h", or whether 2
"h"'s are the Java way?

And is there any reason why the lines in the function below are commented
out:

 public void configure(JobConf job) {
   setConf(job);

   this.segmentName = job.get(SEGMENT_NAME_KEY);
   this.storingContent = isStoringContent(job);
   this.parsing = isParsing(job);

//if (job.getBoolean("fetcher.verbose", false)) {
//  LOG.setLevel(Level.FINE);
//}
 }

Is this parameter now read somewhere else?

Any enlightenment always appreciated.

-Ed

On 8/9/06, Uroš Gruber <[EMAIL PROTECTED]> wrote:


Sami Siren wrote:
>
>> I set DEBUG level logging and checked the time during operations; when
>> doing the MapReduce job that runs after every page, it takes 3-4 seconds
>> until the next URL is fetched.
>> I have a local site, and fetching 100 pages takes about 6 minutes.
>
> You are fetching a single site, yes? Then you can get more performance
> by tweaking the configuration of the fetcher.
>
> 
> <property>
>   <name>fetcher.server.delay</name>
>   <value></value>
>   <description>The number of seconds the fetcher will delay between
>   successive requests to the same server.</description>
> </property>
>
> <property>
>   <name>fetcher.threads.per.host</name>
>   <value></value>
>   <description>This number is the maximum number of threads that
>   should be allowed to access a host at one time.</description>
> </property>
> 
>
Hi,

I've managed to test Nutch's speed on several machines with different OSes
as well.
It looks like fetcher.threads.per.host makes the fetcher run faster.

What I still don't understand is this: when the fetcher threads setting was
at its default value, the fetcher ran a mapreduce after every URL, but now
the job runs on about 400 URLs or maybe more.

--
Uros
> --
> Sami Siren








Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-08-09 Thread Uroš Gruber

Sami Siren wrote:


I set DEBUG level logging and checked the time during operations; when doing
the MapReduce job that runs after every page, it takes 3-4 seconds until the
next URL is fetched.

I have a local site, and fetching 100 pages takes about 6 minutes.


You are fetching a single site, yes? Then you can get more performance by
tweaking the configuration of the fetcher.


<property>
  <name>fetcher.server.delay</name>
  <value></value>
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value></value>
  <description>This number is the maximum number of threads that
  should be allowed to access a host at one time.</description>
</property>



Hi,

I've managed to test Nutch's speed on several machines with different OSes
as well.

It looks like fetcher.threads.per.host makes the fetcher run faster.

What I still don't understand is this: when the fetcher threads setting was
at its default value, the fetcher ran a mapreduce after every URL, but now
the job runs on about 400 URLs or maybe more.

--
Uros

--
Sami Siren




Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-08-04 Thread Uroš Gruber

Sami Siren wrote:

Uroš Gruber wrote:


Andrzej Bialecki wrote:


Sami Siren (JIRA) wrote:

I am not sure what you are referring to by this 3-4 sec, but yes, I agree
there are more aspects of the fetcher to optimize. What I was firstly
concerned about was the fetching IO speed, which was getting ridiculously
low (not quite sure when this happened).
  



I set DEBUG level logging and checked the time during operations; when doing
the MapReduce job that runs after every page, it takes 3-4 seconds until the
next URL is fetched.

I have a local site, and fetching 100 pages takes about 6 minutes.


Even I haven't seen it go that slow :)


Lucky me ;)
Depending on the number of map/reduce tasks, there is a framework 
overhead to transfer the job JAR


I would like to help find what causes such slowness. Version 0.7 did not use
MapReduce, and fetching ran at about 20 pages per second on the same server.
With the same site, fetching is now down to 0.3 pages per second.


With the queue-based solution I just did a crawl of about 600k pages, and it
averaged 16 pps (1376 kb/s) with parsing enabled. Perhaps you could try
Andrzej's new Fetcher and see how it performs for you (I haven't yet read the
code or tested it myself).


I'll try it, but first I need to test it on Java 1.4.2. Maybe the problem is
with the OS itself. I'll report back as soon as I have more test results.


regards

Uros


Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-08-04 Thread Uroš Gruber

Andrzej Bialecki wrote:

Sami Siren (JIRA) wrote:
I am not sure what you are referring to by this 3-4 sec, but yes, I agree
there are more aspects of the fetcher to optimize. What I was firstly
concerned about was the fetching IO speed, which was getting ridiculously
low (not quite sure when this happened).
  


I set DEBUG level logging and checked the time during operations; when doing
the MapReduce job that runs after every page, it takes 3-4 seconds until the
next URL is fetched.

I have a local site, and fetching 100 pages takes about 6 minutes.
Depending on the number of map/reduce tasks, there is a framework overhead
to transfer the job JAR file and start the subprocess on each tasktracker.
However, once these are started, the framework's overhead should be
negligible, because a single task is responsible for fetching many URLs.


Naturally, for small jobs, with very few urls, the overhead is 
relatively large.


The symptoms I'm seeing is that eventually most threads end up in 
blockAddr spin-waiting. Another problem I see is that when the number 
of fetching threads is high relative to the available bandwidth, the 
data is trickling in so slowly that the Fetcher.run() decides that 
it's hung, and aborts the task. What happens then is that the task 
gets a SUCCEEDED status in tasktracker, although in reality it may 
have fetched only a small portion of the allotted fetchlist.
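For readers who haven't looked at the fetcher, the hang check being described
works roughly like the loop below (the field names, interface and timeout are
assumptions, not the actual Fetcher.run() code): if no page completes for
longer than some threshold, the run loop gives up, and from the tasktracker's
point of view the task still finished successfully.

public class HangCheckSketch {
  // Simplified shape of the watchdog described above; not Nutch source.
  static void waitForThreads(FetchStatus status, long timeoutMs) throws InterruptedException {
    long lastActivity = System.currentTimeMillis();
    long lastCount = status.pagesFetched();
    while (status.activeThreads() > 0) {
      Thread.sleep(1000);
      if (status.pagesFetched() > lastCount) {   // progress was made, reset the clock
        lastCount = status.pagesFetched();
        lastActivity = System.currentTimeMillis();
      } else if (System.currentTimeMillis() - lastActivity > timeoutMs) {
        break;                                   // assume the threads are hung and give up
      }
    }
  }

  interface FetchStatus {
    int activeThreads();
    long pagesFetched();
  }
}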


I would like to help find what causes such slowness. Version 0.7 did not use
MapReduce, and fetching ran at about 20 pages per second on the same server.
With the same site, fetching is now down to 0.3 pages per second.


Here is a log message:

2006-08-02 10:12:29,162 INFO  mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 
pages/s, 52 kb/s,
2006-08-02 10:12:30,164 INFO  mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 
pages/s, 52 kb/s,
2006-08-02 10:12:31,166 INFO  mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 
pages/s, 51 kb/s,
2006-08-02 10:12:32,168 INFO  mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 
pages/s, 51 kb/s,
2006-08-02 10:12:33,170 INFO  mapred.LocalJobRunner - 37 pages, 0 errors, 0.3 
pages/s, 50 kb/s,


We should open more than one ticket to track these separate aspects. And for
general discussion the mailing lists are perhaps the best place.
  

(I'm moving this to the list then).



regards

Uros


Re: .classpath for Eclipse

2006-08-03 Thread Uroš Gruber

Johannes Zillmann wrote:

Uroš Gruber wrote:

Hi,

I'm trying to debug Nutch 0.8 with Eclipse, but it seems that I have
problems with my .classpath. I would like to write some plugins and, mostly,
gain more knowledge of how Nutch works internally.

Is it possible to get a .classpath and maybe a .project for the Eclipse IDE
from any of you developing Nutch?


regards

Uros


Have a look at

http://find23.net/Web-Site/blog/66A7676A-8C9C-4A93-8B59-A6A100EF8C1B.html

I tried that, but it's not up to date, and I have some problems when
debugging (some classes can't be found).


regards

Uros







.classpath for Eclipse

2006-08-03 Thread Uroš Gruber

Hi,

I'm trying to debug Nutch 0.8 with Eclipse, but it seems that I have
problems with my .classpath. I would like to write some plugins and, mostly,
gain more knowledge of how Nutch works internally.

Is it possible to get a .classpath and maybe a .project for the Eclipse IDE
from any of you developing Nutch?


regards

Uros