[jira] Commented: (NUTCH-558) Need tool to retrieve domain statistics

2007-09-27 Thread Chris Schneider (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530755
 ] 

Chris Schneider commented on NUTCH-558:
---

The reason that DomainStats does not use URLUtils is that (as mentioned above) 
we are currently using a relatively old Nutch source base (last integrated at 
revision 417928). There are probably other tools/resources we could use as well 
if we reworked the code to better fit the current Nutch/Hadoop source 
environment. Sorry for being so out of date.

 Need tool to retrieve domain statistics
 ---

 Key: NUTCH-558
 URL: https://issues.apache.org/jira/browse/NUTCH-558
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Chris Schneider
Assignee: Chris Schneider
 Attachments: DomainStats.patch


 Several developers have expressed interest in a tool to retrieve statistics 
 from a crawl on a domain basis (e.g., how many pages were successfully 
 fetched from www.apache.org vs. apache.org, where the latter total would 
 include the former).




[jira] Commented: (NUTCH-558) Need tool to retrieve domain statistics

2007-09-23 Thread Chris Schneider (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529749
 ] 

Chris Schneider commented on NUTCH-558:
---

I made a comment in the source about this, but thinking about it later, I do 
wonder whether this version truly works correctly when presented with a segment 
directory (in addition to a crawldb). I had to rewrite the InputFormat section 
of the tool to fit the latest Nutch/Hadoop source environment, and in the 
process, I removed the wrapper object necessary for my older source 
environment. I'd certainly welcome it if somebody out there with a more 
up-to-date installation and crawl data could give it a try.

 Need tool to retrieve domain statistics
 ---

 Key: NUTCH-558
 URL: https://issues.apache.org/jira/browse/NUTCH-558
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Chris Schneider
Assignee: Chris Schneider
 Attachments: DomainStats.patch


 Several developers have expressed interest in a tool to retrieve statistics 
 from a crawl on a domain basis (e.g., how many pages were successfully 
 fetched from www.apache.org vs. apache.org, where the latter total would 
 include the former).




[jira] Created: (NUTCH-558) Need tool to retrieve domain statistics

2007-09-19 Thread Chris Schneider (JIRA)
Need tool to retrieve domain statistics
---

 Key: NUTCH-558
 URL: https://issues.apache.org/jira/browse/NUTCH-558
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Chris Schneider
Assignee: Chris Schneider


Several developers have expressed interest in a tool to retrieve statistics 
from a crawl on a domain basis (e.g., how many pages were successfully fetched 
from www.apache.org vs. apache.org, where the latter total would include the 
former).
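
For illustration only, here is a rough sketch of the kind of per-domain roll-up
the description asks for (this is not the attached DomainStats.patch; the class
and method names are invented): a successful fetch from www.apache.org is
counted under that host and under every parent domain, so the apache.org total
includes the www.apache.org total.

import java.util.HashMap;
import java.util.Map;

// Hypothetical roll-up of successful fetches per domain: each host is counted
// under its full name and under every parent domain above it.
public class DomainRollup {
  private final Map<String, Long> counts = new HashMap<String, Long>();

  public void addFetchedHost(String host) {
    String domain = host;
    while (domain != null) {
      Long old = counts.get(domain);
      counts.put(domain, old == null ? 1L : old + 1L);
      int dot = domain.indexOf('.');
      // Stop before only the top-level label (e.g. "org") would remain.
      domain = (dot >= 0 && domain.indexOf('.', dot + 1) >= 0)
          ? domain.substring(dot + 1) : null;
    }
  }

  public Map<String, Long> getCounts() { return counts; }
}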




[jira] Commented: (NUTCH-351) Protocol forward proxy

2006-11-01 Thread Chris Schneider (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-351?page=comments#action_12446424 ] 

Chris Schneider commented on NUTCH-351:
---

I just noticed a bug in the patch above. I believe it's missing a return 
sequence between the "Host: " header and the "Accept-Encoding: x-gzip, gzip" header.

These lines:

reqStr.append(" HTTP/1.0\r\n");
reqStr.append("Host: ");
reqStr.append(host);
reqStr.append(portString);
reqStr.append("Accept-Encoding: x-gzip, gzip\r\n");


Need to look something like:

reqStr.append(" HTTP/1.0\r\n");
reqStr.append("Host: ");
reqStr.append(host);
reqStr.append(portString);
reqStr.append("\r\n");

reqStr.append("Accept-Encoding: x-gzip, gzip\r\n");
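
For clarity, a small self-contained illustration (the URL, host and port values
below are hypothetical) of the request preamble once the missing return
sequence is in place:

public class RequestPreambleExample {
  public static void main(String[] args) {
    // Illustration only -- hypothetical request line, host and port values.
    StringBuffer reqStr = new StringBuffer("GET /index.html");
    String host = "www.example.com";
    String portString = ":8080";

    reqStr.append(" HTTP/1.0\r\n");
    reqStr.append("Host: ");
    reqStr.append(host);
    reqStr.append(portString);
    reqStr.append("\r\n");                              // the CRLF that was missing
    reqStr.append("Accept-Encoding: x-gzip, gzip\r\n");

    // Prints:
    //   GET /index.html HTTP/1.0
    //   Host: www.example.com:8080
    //   Accept-Encoding: x-gzip, gzip
    System.out.print(reqStr);
  }
}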




 Protocol forward proxy
 --

 Key: NUTCH-351
 URL: http://issues.apache.org/jira/browse/NUTCH-351
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8, 0.9.0, 0.8.1
Reporter: Sami Siren
 Assigned To: Sami Siren
Priority: Minor
 Fix For: 0.9.0

 Attachments: protocol-http-proxy-adapter.txt


 The protocol proxy adapter takes advantage of protocols known to an http forward 
 proxy. Usually there's at least http, https and ftp.
 You must configure nutch to use this plugin and to use an http proxy before use.





[jira] Created: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2006-10-11 Thread Chris Schneider (JIRA)
Server delay feature conflicts with maxThreadsPerHost
-

 Key: NUTCH-385
 URL: http://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Chris Schneider


For some time I've been puzzled by the interaction between two parameters that 
control how often the fetcher can access a particular host:

1) The server delay, which comes back from the remote server during our 
processing of the robots.txt file, and which can be limited by 
fetcher.max.crawl.delay.

2) The fetcher.threads.per.host value, particularly when this is greater than 
the default of 1.

According to my (limited) understanding of the code in HttpBase.java:

Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
continuously. In other words, it never tries to point 3 at the host, and it 
always points a second thread at the host before the first thread finishes 
accessing it. Since HttpBase.unblockAddr never gets called with 
(((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. 
Thus, the server delay will never be used at all. The fetcher will be 
continuously retrieving pages from the host, often with 2 fetchers accessing 
the host simultaneously.

Suppose instead that the fetcher finally does allow the last thread to complete 
before it gets around to pointing another thread at the target host. When the 
last fetcher thread calls HttpBase.unblockAddr, it will now put 
System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. 
This, in turn, will prevent any threads from accessing this host until the 
delay is complete, even though zero threads are currently accessing the host.

I see this behavior as inconsistent. More importantly, the current 
implementation certainly doesn't seem to answer my original question about 
appropriate definitions for what appear to be conflicting parameters. 

In a nutshell, how could we possibly honor the server delay if we allow more 
than one fetcher thread to simultaneously access the host?

It would be one thing if whenever (fetcher.threads.per.host > 1), this trumped 
the server delay, causing the latter to be ignored completely. That is 
certainly not the case in the current implementation, as it will wait for 
server delay whenever the number of threads accessing a given host drops to 
zero.
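
A minimal sketch of the blocking logic as described above (this is a paraphrase
for illustration only, not the actual HttpBase source; the class name is
invented):

import java.util.HashMap;
import java.util.Map;

public class BlockingSketch {
  private static final Map<String, Integer> THREADS_PER_HOST_COUNT = new HashMap<String, Integer>();
  private static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<String, Long>();

  static synchronized void unblockAddr(String host, long crawlDelay) {
    int count = THREADS_PER_HOST_COUNT.get(host).intValue();
    if (count == 1) {
      // Only the *last* thread to leave the host records the crawl delay.
      // If a second thread always arrives before the first one finishes,
      // this branch is never taken and the server delay is never honored.
      THREADS_PER_HOST_COUNT.remove(host);
      BLOCKED_ADDR_TO_TIME.put(host, Long.valueOf(System.currentTimeMillis() + crawlDelay));
    } else {
      THREADS_PER_HOST_COUNT.put(host, Integer.valueOf(count - 1));
    }
  }
}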






[jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2006-10-11 Thread Chris Schneider (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12441528 ] 

Chris Schneider commented on NUTCH-385:
---

This comment was actually made by Andrzej in response to an email containing 
the analysis above that I sent him before creating this JIRA issue:

Let's start with defining what is the desired semantics of these two parameters 
together. In my opinion it's the following:

* if only 1 thread per host is allowed, at any given moment at most one thread 
should be accessing the host, and the interval between consecutive requests 
should be at least crawlDelay (whichever way we determine this value - from 
config, from robots.txt or external sources such as partner agreements).

* if two or more (for example N) threads per host are allowed, at any given 
moment at most N threads should be accessing the host, and the interval between 
consecutive requests should be at least crawlDelay - that is, the interval 
between when one of the threads finishes, and another starts requesting.

I.e.: for threads.per.host=2 and crawlDelay=3 seconds, if we start 3 threads 
trying to access the same host we should get something like this (time in [s] 
on the x axis, # - start request, + - request in progress, b - blocked in 
per-host limit, c - obeying crawlDelay):

===0         1         2
===01234567890123456789012345678
1: #+++cccbbccc#cccbb#++
2: #cccbcccbcc#+++cb
3: ccc#+ccc#+ccc#+++

As you can see, at any given time we have at most 2 threads accessing the site, 
and the interval between consecutive requests is at least 3 seconds. Especially 
interesting in the above graph is the period between 17-18 seconds - thread 2 
had to be delayed an additional 2 seconds to satisfy the crawl delay requirement, 
even though the threads.per.host requirement was satisfied.

[snip]

It's a question of priorities - in the model I drafted above the topmost 
priority is the observance of crawlDelay, sometimes at the cost of the number 
of concurrent threads (see seconds 17-18). In this model, the code should 
always put the delay in BLOCKED_ADDR_TO_TIME, in order to wait at least 
crawlDelay after _any_ thread finishes. We could use an alternative model, 
where crawlDelay is measured from the start of the request, and not from the 
end - see the graph below:

===0         1         2         3
===01234567890123456789012345678901234567
1: #+++cbbb##++cc#+++
2: ccc#cc#+++c#c#
3: cc#+ccc#+ccc#+ccbb

but it seems to me that it's more complicated, gives fewer requests/sec, and the 
interpretation of crawlDelay's meaning is stretched ...

[snip]
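
A minimal, hedged sketch of the first model described above (illustrative code,
not an actual patch): the crawl delay is armed whenever *any* thread finishes,
not only when the last one does. The class name is invented; the two maps
mirror the earlier sketch.

import java.util.HashMap;
import java.util.Map;

public class AlwaysDelaySketch {
  private static final Map<String, Integer> THREADS_PER_HOST_COUNT = new HashMap<String, Integer>();
  private static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<String, Long>();

  static synchronized void unblockAddr(String host, long crawlDelay) {
    int count = THREADS_PER_HOST_COUNT.get(host).intValue();
    if (count == 1) {
      THREADS_PER_HOST_COUNT.remove(host);
    } else {
      THREADS_PER_HOST_COUNT.put(host, Integer.valueOf(count - 1));
    }
    // Always push the earliest next-request time forward by crawlDelay, at the
    // possible cost of briefly running fewer than threads.per.host threads.
    BLOCKED_ADDR_TO_TIME.put(host, Long.valueOf(System.currentTimeMillis() + crawlDelay));
  }
}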

 Server delay feature conflicts with maxThreadsPerHost
 -

 Key: NUTCH-385
 URL: http://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Chris Schneider

 For some time I've been puzzled by the interaction between two parameters that 
 control how often the fetcher can access a particular host:
 1) The server delay, which comes back from the remote server during our 
 processing of the robots.txt file, and which can be limited by 
 fetcher.max.crawl.delay.
 2) The fetcher.threads.per.host value, particularly when this is greater than 
 the default of 1.
 According to my (limited) understanding of the code in HttpBase.java:
 Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
 ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
 continuously. In other words, it never tries to point 3 at the host, and it 
 always points a second thread at the host before the first thread finishes 
 accessing it. Since HttpBase.unblockAddr never gets called with 
 (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. Thus, the server delay will never be used at all. The fetcher will be 
 continuously retrieving pages from the host, often with 2 fetchers accessing 
 the host simultaneously.
 Suppose instead that the fetcher finally does allow the last thread to 
 complete before it gets around to pointing another thread at the target host. 
 When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. This, in turn, will prevent any threads from accessing this host until 
 the delay is complete, even though zero threads are currently accessing the 
 host.
 I see this behavior as inconsistent. More importantly, the current 
 implementation certainly doesn't seem to answer my original question about 
 appropriate definitions for what appear to be conflicting parameters. 
 In a nutshell, how could we possibly honor the server delay 

[jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2006-10-11 Thread Chris Schneider (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12441529 ] 

Chris Schneider commented on NUTCH-385:
---

This comment was actually made by Ken Krugler, who was responding to Andrzej's 
comment above:

[with respect to Andrzej's definitions at the beginning of his comment - Ed.:]
I agree that this is one of two possible interpretations. The other is that 
there are N virtual users, and the crawlDelay applies to each of these 
virtual users in isolation.

Using the same type of request data from above, I see a queue of requests with 
the following durations (in seconds):

4, 9, 6, 5, 6, 4, 7, 4

So with the virtual user model (where N = 2, thus A and B users), I get:

===0         1         2
===01234567890123456789012345678
A: 4+++ccc6+ccc6+ccc7++
B: 9ccc5ccc4+++ccc4+++

The numbers mark the start of each new request, and the total duration for the 
request.

This would seem to be less efficient than your approach, but somehow feels more 
in the nature of what threads.per.host really means.

Let's see, for N = 3 this would look like:

===0         1         2
===01234567890123456789012345678
A: 4+++ccc5ccc7++ccc
B: 9ccc4+++ccc
C: 6+ccc6+ccc4+++ccc

[snip]

To implement the virtual users model, each unique domain being actively fetched 
from would need to have N bits of state tracking the time of completion of the 
last request.

Anyway, just an alternative interpretation...
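
A rough sketch of the per-domain state such a virtual-user model might keep
(purely illustrative; the class and field names are invented, not from any
patch): each of the N slots remembers when its last request finished, and a
slot may only be reused once lastFinished + crawlDelay has passed.

public class VirtualUserSlots {
  private final long[] lastFinished;           // one entry per virtual user

  public VirtualUserSlots(int n) {
    lastFinished = new long[n];                // 0 == free immediately
  }

  // Returns a slot index, or -1 if every virtual user is busy or still delayed.
  public synchronized int acquire(long now, long crawlDelay) {
    for (int i = 0; i < lastFinished.length; i++) {
      if (lastFinished[i] != Long.MAX_VALUE && now >= lastFinished[i] + crawlDelay) {
        lastFinished[i] = Long.MAX_VALUE;      // mark the slot as in use
        return i;
      }
    }
    return -1;
  }

  public synchronized void release(int slot, long now) {
    lastFinished[slot] = now;                  // delay is measured from completion
  }
}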


 Server delay feature conflicts with maxThreadsPerHost
 -

 Key: NUTCH-385
 URL: http://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Chris Schneider

 For some time I've been puzzled by the interaction between two parameters that 
 control how often the fetcher can access a particular host:
 1) The server delay, which comes back from the remote server during our 
 processing of the robots.txt file, and which can be limited by 
 fetcher.max.crawl.delay.
 2) The fetcher.threads.per.host value, particularly when this is greater than 
 the default of 1.
 According to my (limited) understanding of the code in HttpBase.java:
 Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
 ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
 continuously. In other words, it never tries to point 3 at the host, and it 
 always points a second thread at the host before the first thread finishes 
 accessing it. Since HttpBase.unblockAddr never gets called with 
 (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. Thus, the server delay will never be used at all. The fetcher will be 
 continuously retrieving pages from the host, often with 2 fetchers accessing 
 the host simultaneously.
 Suppose instead that the fetcher finally does allow the last thread to 
 complete before it gets around to pointing another thread at the target host. 
 When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. This, in turn, will prevent any threads from accessing this host until 
 the delay is complete, even though zero threads are currently accessing the 
 host.
 I see this behavior as inconsistent. More importantly, the current 
 implementation certainly doesn't seem to answer my original question about 
 appropriate definitions for what appear to be conflicting parameters. 
 In a nutshell, how could we possibly honor the server delay if we allow more 
 than one fetcher thread to simultaneously access the host?
 It would be one thing if whenever (fetcher.threads.per.host > 1), this 
 trumped the server delay, causing the latter to be ignored completely. That 
 is certainly not the case in the current implementation, as it will wait for 
 server delay whenever the number of threads accessing a given host drops to 
 zero.





[jira] Commented: (NUTCH-351) Protocol forward proxy

2006-09-26 Thread Chris Schneider (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-351?page=comments#action_12438002 ] 

Chris Schneider commented on NUTCH-351:
---

I would really appreciate it if Sami could explain in a little more detail what 
this patch adds to the proxy support already in Nutch. Although the patch seems 
to generalize the support somewhat, my reading of the current HttpResponse.java 
code suggests that it is already designed to handle URLs using these protocols 
when Nutch lives behind a proxy server.

 Protocol forward proxy
 --

 Key: NUTCH-351
 URL: http://issues.apache.org/jira/browse/NUTCH-351
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
Reporter: Sami Siren
 Assigned To: Sami Siren
Priority: Minor
 Fix For: 0.9.0

 Attachments: protocol-http-proxy-adapter.txt


 The protocol proxy adapter takes advantage of protocols known to an http forward 
 proxy. Usually there's at least http, https and ftp.
 You must configure nutch to use this plugin and to use an http proxy before use.





[jira] Updated: (NUTCH-371) DeleteDuplicates should remove documents with duplicate URLs

2006-09-25 Thread Chris Schneider (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-371?page=all ]

Chris Schneider updated NUTCH-371:
--

Description: 
DeleteDuplicates is supposed to delete documents with duplicate URLs (after 
deleting documents with identical MD5 hashes), but this part is apparently not 
yet implemented. Here's the comment from DeleteDuplicates.java:

// 2. map indexes -> <url, fetchdate, index,doc>
// partition by url
// reduce, deleting all but most recent.
//
// Part 2 is not yet implemented, but the Indexer currently only indexes one
// URL per page, so this is not a critical problem.

It is apparently also known that re-fetching the same URL (e.g., one month 
later) will result in more than one document with the same URL (this is alluded 
to in NUTCH-95), but the comment above suggests that the indexer will solve the 
problem before DeleteDuplicates, because it will only index one document per 
URL.

This is not necessarily the case if the segments are to be divided among search 
servers, as each server will have its own index built from its own portion of 
the segments. Thus, if the URL in question was fetched in different segments, 
and these segments end up assigned to different search servers, then the 
indexer can't be relied on to eliminate the duplicates.

Thus, it seems like the second part of the DeleteDuplicates algorithm (i.e., 
deleting documents with duplicate URLs) needs to be implemented. I agree with 
Byron and Andrzej that the most recently fetched document (rather than the one 
with the highest score) should be preserved.

Finally, it's also possible to get duplicate URLs in the segments without 
re-fetching an expired URL in the crawldb. This can happen if 3 different URLs 
all redirect to the target URL. This is yet another consequence of handling 
redirections immediately, rather than adding the target URL to the crawldb for 
fetching in some subsequent segment (see NUTCH-273).

  was:
DeleteDuplicates is supposed to delete documents with duplicate URLs (after 
deleting documents with identical MD5 hashes), but this part is apparently not 
yet implemented. Here's the comment from DeleteDuplicates.java:

// 2. map indexes -> <url, fetchdate, index,doc>
// partition by url
// reduce, deleting all but most recent.
//
// Part 2 is not yet implemented, but the Indexer currently only indexes one
// URL per page, so this is not a critical problem.

It is apparently also known that re-fetching the same URL (e.g., one month 
later) will result in more than one document with the same URL (this is alluded 
to in NUTCH-95), but the comment above suggests that the indexer will solve the 
problem before DeleteDuplicates, because it will only index one document per 
URL.

This is not necessarily the case if the segments are to be divided among search 
servers, as each server will have its own index built from its own portion of 
the segments. Thus, if the URL in question was fetched in different segments, 
and these segments end up assigned to different search servers, then the 
indexer can't be relied on to eliminate the duplicates.

Thus, it seems like the second part of the DeleteDuplicates algorithm (i.e., 
deleting documents with duplicate URLs) needs to be implemented. I agree with 
Byron and Andrzej that most recently fetched document (rather than the one with 
the highest score) should be preserved.

Finally, it's also possible to get duplicate URLs in the segments without 
re-fetching an expired URL in the crawldb. This can happen if 3 different URLs 
all redirect to the target URL. This is yet another consequence of handling 
redirections immediately, rather than adding the target URL to the crawldb for 
fetching some subsequent segment (see NUTCH-273).
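
A minimal sketch of the "part 2" step described in the description above
(hedged: DocRecord and its fields are invented for illustration; the real tool
would run this per URL inside a MapReduce pass partitioned by URL):

import java.util.Iterator;
import java.util.List;

// Invented value type for illustration: one indexed document for a given URL.
class DocRecord {
  final String indexName;
  final int docId;
  final long fetchDate;
  DocRecord(String indexName, int docId, long fetchDate) {
    this.indexName = indexName; this.docId = docId; this.fetchDate = fetchDate;
  }
}

class UrlDedup {
  // Keep only the most recently fetched document for one URL; every other
  // document with the same URL goes on the delete list.
  static DocRecord keepMostRecent(Iterator<DocRecord> docsForOneUrl, List<DocRecord> toDelete) {
    DocRecord newest = null;
    while (docsForOneUrl.hasNext()) {
      DocRecord d = docsForOneUrl.next();
      if (newest == null || d.fetchDate > newest.fetchDate) {
        if (newest != null) toDelete.add(newest);   // the older duplicate loses
        newest = d;
      } else {
        toDelete.add(d);
      }
    }
    return newest;
  }
}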


 DeleteDuplicates should remove documents with duplicate URLs
 

 Key: NUTCH-371
 URL: http://issues.apache.org/jira/browse/NUTCH-371
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Reporter: Chris Schneider

 DeleteDuplicates is supposed to delete documents with duplicate URLs (after 
 deleting documents with identical MD5 hashes), but this part is apparently 
 not yet implemented. Here's the comment from DeleteDuplicates.java:
 // 2. map indexes -> <url, fetchdate, index,doc>
 // partition by url
 // reduce, deleting all but most recent.
 //
 // Part 2 is not yet implemented, but the Indexer currently only indexes one
 // URL per page, so this is not a critical problem.
 It is apparently also known that re-fetching the same URL (e.g., one month 
 later) will result in more than one document with the same URL (this is 
 alluded to in NUTCH-95), but the comment above suggests that the indexer will 
 solve the problem before DeleteDuplicates, because it will only index one 
 document per URL.
 This is not necessarily the case if the segments are to be divided among 
 search servers, 

[jira] Commented: (NUTCH-273) When a page is redirected, the original url is NOT updated.

2006-08-24 Thread Chris Schneider (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12430117 ] 

Chris Schneider commented on NUTCH-273:
---

Another reason why it would be better to wait until the next segment to process 
the target of the redirect is that this target may already have been fetched. 
In this case, there's no need to refetch it. More importantly, though, 
refetching the page will cause its OPIC score to be distributed a second time 
to its outlinks. In fact, each page that redirects to the target page will 
cause the target page's OPIC score to get redistributed.

I honestly can't see a good reason for doing an immediate redirect, since 
hopefully these cases aren't common enough to make a significant difference to 
crawling performance.

Note that there are several other issues related to this issue, so we should 
take care to satisfy the goals of all with any fix. In particular, I agree that 
we should be saving more information in the metadata about the redirection (as 
well as other protocol cases).

 When a page is redirected, the original url is NOT updated.
 ---

 Key: NUTCH-273
 URL: http://issues.apache.org/jira/browse/NUTCH-273
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
 Environment: n/a
Reporter: Lukas Vlcek

 [Excerpt from maillist, sender: Andrzej Bialecki]
 When a page is redirected, the original url is NOT updated - so, CrawlDB will 
 never know that a redirect occurred; it won't even know that a fetch 
 occurred... This looks like a bug.
 In 0.7 this was recorded in the segment, and then it would affect the Page 
 status during updatedb. It should do so in 0.8, too...





[jira] Created: (NUTCH-348) Generator is building fetch list using *lowest* scoring URLs

2006-08-16 Thread Chris Schneider (JIRA)
Generator is building fetch list using *lowest* scoring URLs


 Key: NUTCH-348
 URL: http://issues.apache.org/jira/browse/NUTCH-348
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Chris Schneider


Ever since revision 391271, when the CrawlDatum key was replaced by a 
FloatWritable key, the Generator.Selector.reduce method has been outputting the 
*lowest* scoring URLs! The CrawlDatum class has a Comparator that essentially 
treats higher scoring CrawlDatum objects as "less than" lower scoring 
CrawlDatum objects, so the higher scoring ones would appear first in a sequence 
file sorted using this as the key.

When a FloatWritable based on the score itself (as returned from 
scfilters.generatorSortValue) became the sort key, it should have been negated 
in Generator.Selector.map to have the same result. Curiously, there is a 
comment to this effect immediately before the FloatWritable is set:

  // sort by decreasing score
  sortValue.set(sort);

It seems like the simplest way to fix this is to just negate the score, and 
this seems to work for me:

  // sort by decreasing score
  // 2006-08-15 CSc REALLY sort by decreasing score
  sortValue.set(-sort);

Unfortunately, this means that any crawls that have been done using 
Generator.java after revision 391271 should be discarded, as they were focused 
on fetching the lowest scoring unfetched URLs in the crawldb, essentially 
pointing the crawler 180 degrees from its intended direction.
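
An alternative to negating the score (a hedged sketch, not necessarily how the
issue was or should be resolved) would be to keep the positive score and
register a comparator that sorts the FloatWritable keys in decreasing order,
e.g. via JobConf.setOutputKeyComparatorClass:

import org.apache.hadoop.io.FloatWritable;

// Sorts FloatWritable keys in decreasing order, so the highest-scoring URLs
// come first without negating the score in Generator.Selector.map.
public class DecreasingFloatComparator extends FloatWritable.Comparator {
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // Swap the operands of the normal ascending comparison.
    return super.compare(b2, s2, l2, b1, s1, l1);
  }
}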






[jira] Commented: (NUTCH-342) Nutch commands log to nutch/logs/hadoop.logs by default

2006-08-06 Thread Chris Schneider (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-342?page=comments#action_12426039 ] 

Chris Schneider commented on NUTCH-342:
---

I apologize for my confusion. I had been thinking that hadoop-env.sh was 
getting sourced when a Nutch command was run; it is not. Thus, $HADOOP_LOG_DIR 
and $HADOOP_LOG_FILE are not set when executing Nutch commands. For now, I 
think it makes most sense for me to set NUTCH_LOG_DIR and NUTCH_LOGFILE to the 
same locations as $HADOOP_LOG_DIR and $HADOOP_LOG_FILE via .bash_profile, etc. 
I consider this awkward, but am unsure about how best to address this design 
problem. I'm beginning to think that NUTCH_LOGFILE should default to something 
like nutch-$USER-$COMMAND-`hostname`.log, which would seem more appropriate 
to find within the $NUTCH_HOME/logs directory.

 Nutch commands log to nutch/logs/hadoop.logs by default
 ---

 Key: NUTCH-342
 URL: http://issues.apache.org/jira/browse/NUTCH-342
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Chris Schneider
Priority: Minor
 Attachments: NUTCH-342.patch


 If (by default) Nutch commands are going to send their output to a file named 
 hadoop.log, then it seems like the default location for this file should be 
 the same location where Hadoop is putting its hadoop.log file (i.e., 
 $HADOOP_LOG_DIR). Currently, if I set HADOOP_LOG_DIR to a special location 
 (via hadoop-env.sh), this has no effect on where Nutch commands send their 
 output.
 Some would probably suggest that I could just set NUTCH_LOG_DIR to 
 $HADOOP_LOG_DIR myself. I still think that it should be defaulted this way in 
 the nutch script. However, I'm unaware of an elegant way to modify such Nutch 
 environment variables anyway. The hadoop-env.sh file provides a convenient 
 place to modify Hadoop environment variables, but doing the same for Nutch 
 environment variables presumably requires you to modify .bash_profile or a 
 similar user script file (which is the way I used to accomplish this kind of 
 thing with Nutch 0.7).





[jira] Created: (NUTCH-342) Nutch commands log to nutch/logs/hadoop.logs by default

2006-08-05 Thread Chris Schneider (JIRA)
Nutch commands log to nutch/logs/hadoop.logs by default
---

 Key: NUTCH-342
 URL: http://issues.apache.org/jira/browse/NUTCH-342
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Chris Schneider
Priority: Minor


If (by default) Nutch commands are going to send their output to a file named 
hadoop.log, then it seems like the default location for this file should be 
the same location where Hadoop is putting its hadoop.log file (i.e., 
$HADOOP_LOG_DIR). Currently, if I set HADOOP_LOG_DIR to a special location (via 
hadoop-env.sh), this has no effect on where Nutch commands send their output.

Some would probably suggest that I could just set NUTCH_LOG_DIR to 
$HADOOP_LOG_DIR myself. I still think that it should be defaulted this way in 
the nutch script. However, I'm unaware of an elegant way to modify such Nutch 
environment variables anyway. The hadoop-env.sh file provides a convenient 
place to modify Hadoop environment variables, but doing the same for Nutch 
environment variables presumably requires you to modify .bash_profile or a 
similar user script file (which is the way I used to accomplish this kind of 
thing with Nutch 0.7).





[jira] Updated: (NUTCH-342) Nutch commands log to nutch/logs/hadoop.logs by default

2006-08-05 Thread Chris Schneider (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-342?page=all ]

Chris Schneider updated NUTCH-342:
--

Attachment: NUTCH-342.patch

Here's a patch that defaults NUTCH_LOG_DIR to $HADOOP_LOG_DIR and NUTCH_LOGFILE 
to $HADOOP_LOG_FILE.

 Nutch commands log to nutch/logs/hadoop.logs by default
 ---

 Key: NUTCH-342
 URL: http://issues.apache.org/jira/browse/NUTCH-342
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Chris Schneider
Priority: Minor
 Attachments: NUTCH-342.patch


 If (by default) Nutch commands are going to send their output to a file named 
 hadoop.log, then it seems like the default location for this file should be 
 the same location where Hadoop is putting its hadoop.log file (i.e., 
 $HADOOP_LOG_DIR). Currently, if I set HADOOP_LOG_DIR to a special location 
 (via hadoop-env.sh), this has no effect on where Nutch commands send their 
 output.
 Some would probably suggest that I could just set NUTCH_LOG_DIR to 
 $HADOOP_LOG_DIR myself. I still think that it should be defaulted this way in 
 the nutch script. However, I'm unaware of an elegant way to modify such Nutch 
 environment variables anyway. The hadoop-env.sh file provides a convenient 
 place to modify Hadoop environment variables, but doing the same for Nutch 
 environment variables presumably requires you to modify .bash_profile or a 
 similar user script file (which is the way I used to accomplish this kind of 
 thing with Nutch 0.7).





[jira] Updated: (NUTCH-336) Harvested links shouldn't get db.score.injected in addition to inbound contributions

2006-08-02 Thread Chris Schneider (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-336?page=all ]

Chris Schneider updated NUTCH-336:
--

Attachment: NUTCH-336.patch.txt

Here's a patch that fixes the problem. It separates a new injectionScore API 
out from the initialScore API.

 Harvested links shouldn't get db.score.injected in addition to inbound 
 contributions
 

 Key: NUTCH-336
 URL: http://issues.apache.org/jira/browse/NUTCH-336
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Chris Schneider
Priority: Minor
 Attachments: NUTCH-336.patch.txt


 Currently (even with Stefan's fix for NUTCH-324), harvested links have their 
 initial scores set to db.score.injected + (sum of inbound contributions * 
 db.score.link.[internal | external]), but this will place (at least external) 
 harvested links even higher than injected URLs on the fetch list. Perhaps 
 more importantly, this effect cascades.
 As a simple example, if each page in A->B->C->D has exactly one external link 
 and only A is injected, then D will receive an initial score of at least 
 (4*db.score.injected) with the default db.score.link.external of 1.0. Higher 
 values of db.score.injected and db.score.link.external obviously exacerbate 
 this problem.





[jira] Created: (NUTCH-336) Harvested links shouldn't get db.score.injected in addition to inbound contributions

2006-08-01 Thread Chris Schneider (JIRA)
Harvested links shouldn't get db.score.injected in addition to inbound 
contributions


 Key: NUTCH-336
 URL: http://issues.apache.org/jira/browse/NUTCH-336
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Chris Schneider
Priority: Minor


Currently (even with Stefan's fix for NUTCH-324), harvested links have their 
initial scores set to db.score.injected + (sum of inbound contributions * 
db.score.link.[internal | external]), but this will place (at least external) 
harvested links even higher than injected URLs on the fetch list. Perhaps more 
importantly, this effect cascades.

As a simple example, if each page in A->B->C->D has exactly one external link 
and only A is injected, then D will receive an initial score of at least 
(4*db.score.injected) with the default db.score.link.external of 1.0. Higher 
values of db.score.injected and db.score.link.external obviously exacerbate 
this problem.
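
A worked example of the cascade, assuming db.score.injected = 1.0 for
concreteness alongside the default db.score.link.external of 1.0 mentioned
above (the class name is invented):

public class HarvestedScoreExample {
  public static void main(String[] args) {
    float injected   = 1.0f;   // db.score.injected (assumed 1.0 here)
    float linkFactor = 1.0f;   // db.score.link.external default, per the report
    float a = injected;                    // A is genuinely injected
    float b = injected + a * linkFactor;   // 2.0 -- but B was only harvested
    float c = injected + b * linkFactor;   // 3.0
    float d = injected + c * linkFactor;   // 4.0 == 4 * db.score.injected
    System.out.println("A=" + a + " B=" + b + " C=" + c + " D=" + d);
  }
}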





[jira] Created: (NUTCH-301) CommonGrams loads analysis.common.terms.file for each query

2006-06-06 Thread Chris Schneider (JIRA)
CommonGrams loads analysis.common.terms.file for each query
---

 Key: NUTCH-301
 URL: http://issues.apache.org/jira/browse/NUTCH-301
 Project: Nutch
Type: Improvement

  Components: searcher  
Versions: 0.8-dev
Reporter: Chris Schneider


The move away from static objects toward instance variables has resulted in 
the CommonGrams constructor parsing its analysis.common.terms.file for each query. 
I'm not certain how large a performance impact this really is, but it seems 
like something you'd want to avoid doing for each query. Perhaps the solution 
is to keep around an instance of the CommonGrams object itself?
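
One possible shape of that suggestion (a hedged sketch: the holder class name
is invented, and it assumes CommonGrams exposes a Configuration-based
constructor as the 0.8 refactor suggests): construct CommonGrams once per
searcher/filter instance so the terms file is parsed a single time rather than
per query.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.analysis.CommonGrams;

// Illustration only: build CommonGrams once and hand the same object to every
// query, instead of constructing it (and re-parsing the terms file) per query.
public class CommonGramsHolder {
  private final CommonGrams commonGrams;   // analysis.common.terms.file parsed here, once

  public CommonGramsHolder(Configuration conf) {
    this.commonGrams = new CommonGrams(conf);
  }

  public CommonGrams get() {
    return commonGrams;
  }
}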




[jira] Created: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-08 Thread Chris Schneider (JIRA)
Indexer doesn't consider linkdb when calculating boost value


 Key: NUTCH-267
 URL: http://issues.apache.org/jira/browse/NUTCH-267
 Project: Nutch
Type: Bug

  Components: indexer  
Versions: 0.8-dev
Reporter: Chris Schneider
Priority: Minor


Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if 
indexer.boost.by.link.count was true, the indexer boost value was scaled based 
on the log of the # of inbound links:

if (boostByLinkCount)
  res *= (float)Math.log(Math.E + linkCount);

This is no longer true (even before Andrzej implemented scoring filters). 
Instead, the boost value is just the square root (or some other scorePower) of 
the page score. Shouldn't the invertlinks command, which creates the linkdb, 
have some effect on the boost value calculated during indexing (either via the 
OPICScoringFilter or some other built-in filter)?
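
A hedged sketch of the kind of combination the question points at (not actual
Nutch code; the method and parameter names are invented): keep the OPIC-derived
boost but let the in-link count from the linkdb contribute the way it did
before 0.8.

// Illustration only: combine score ** scorePower with the old
// log-of-inbound-links factor that invertlinks/linkdb could supply.
static float computeBoost(float pageScore, float scorePower,
                          int linkCount, boolean boostByLinkCount) {
  float res = (float) Math.pow(pageScore, scorePower);
  if (boostByLinkCount) {
    res *= (float) Math.log(Math.E + linkCount);   // same factor the 0.7 indexer used
  }
  return res;
}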




[jira] Commented: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployment

2006-04-12 Thread Chris Schneider (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374253 ] 

Chris Schneider commented on NUTCH-246:
---

As it turns out, this problem was due to a clock synchronization issue between 
the jobtracker and the tasktrackers. When the URLs were injected, their 
fetchTimes were set from System.currentTimeMillis() on the tasktrackers, whose 
clocks were 2 minutes in the future. Soon afterward, during the generation 
phase, these fetchTimes 
were compared to curTime, which came from the (correct) clock on the jobtracker 
(via the crawl.gen.curTime property in job.xml?) Thus, if the injection 
proceeded quickly enough, the generation phase would begin before these URLs 
were ready to be fetched.

It seems like the Injector should load the current time from a job 
configuration property in the same way that the Generator does now, and then 
call setFetchTime(), rather than leaving the value to whatever the CrawlDatum 
constructor sets.
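
A hedged sketch of that suggestion (the property name crawl.inject.curTime is
invented here, by analogy with crawl.gen.curTime, and the helper method is
hypothetical rather than a patch):

// Illustration only: the inject mapper takes its timestamp from the job
// configuration (set once by the submitting client) instead of trusting
// the tasktracker's local clock via the CrawlDatum constructor.
private long injectTime;

public void configure(JobConf job) {
  injectTime = job.getLong("crawl.inject.curTime", System.currentTimeMillis());
}

private CrawlDatum newInjectedDatum(float fetchInterval) {
  CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, fetchInterval);
  datum.setFetchTime(injectTime);   // same clock on every tasktracker
  return datum;
}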

 segment size is never as big as topN or crawlDB size in a distributed 
 deployment
 -

  Key: NUTCH-246
  URL: http://issues.apache.org/jira/browse/NUTCH-246
  Project: Nutch
 Type: Bug

 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Blocker
  Fix For: 0.8-dev


 I didn't reopen NUTCH-136 since it may be related to the hadoop split.
 I tested this on two different deployments (with 10 tasktrackers + 1 jobtracker 
 and with 9 tasktrackers and 1 jobtracker).
 Defining the map and reduce task numbers in a mapred-default.xml (present in 
 nutch/conf on all boxes) does not solve the problem.
 We verified that it is not a problem of the maximum urls per host and also not 
 a problem of the url filter.
 It looks like the first job of the Generator (Selector) already gets too few 
 entries to process. 
 Maybe this is somehow related to split generation or configuration inside 
 the distributed jobtracker, since it runs in a different jvm than the jobclient.
 However, we were not able to find the source of this problem.
 I think that this should be fixed before publishing Nutch 0.8. 




[jira] Updated: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployment

2006-04-12 Thread Chris Schneider (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-246?page=all ]

Chris Schneider updated NUTCH-246:
--

Priority: Minor  (was: Blocker)

 segment size is never as big as topN or crawlDB size in a distributed 
 deployment
 -

  Key: NUTCH-246
  URL: http://issues.apache.org/jira/browse/NUTCH-246
  Project: Nutch
 Type: Bug

 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Minor
  Fix For: 0.8-dev


 I didn't reopen NUTCH-136 since it may be related to the hadoop split.
 I tested this on two different deployments (with 10 tasktrackers + 1 jobtracker 
 and with 9 tasktrackers and 1 jobtracker).
 Defining the map and reduce task numbers in a mapred-default.xml (present in 
 nutch/conf on all boxes) does not solve the problem.
 We verified that it is not a problem of the maximum urls per host and also not 
 a problem of the url filter.
 It looks like the first job of the Generator (Selector) already gets too few 
 entries to process. 
 Maybe this is somehow related to split generation or configuration inside 
 the distributed jobtracker, since it runs in a different jvm than the jobclient.
 However, we were not able to find the source of this problem.
 I think that this should be fixed before publishing Nutch 0.8. 




[jira] Commented: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployment

2006-04-11 Thread Chris Schneider (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374049 ] 

Chris Schneider commented on NUTCH-246:
---

A few more details:

Stefan and I were able to reproduce this problem using either an injection set 
of 4500 URLs or a larger set of DMOZ URLs. With the 4500 URL injection, only 
653 URLs were generated for the first segment, despite the fact that topN was 
set to 500K. I confirmed that nearly all of the 4500 injected URLs passed our 
URL filter and were actually injected into the crawldb.

To eliminate the possibility that the bug had been fixed recently or was due to 
a code modification that we'd made ourselves, we deployed yesterday's sandbox 
version of nutch (2006-04-10), including hadoop-0.1.1.jar (though I believe 
that Stefan had to build it himself because the nutch-0.8-dev.jar didn't match 
the source). We made the absolute minimum changes to nutch-site.xml, 
hadoop-site.xml, and hadoop-env.sh in order to deploy this version properly in 
our cluster (1 jobtracker/namenode machine, 10 tasktracker/datanode machines). 
However, we got the same results (i.e., very few URLs actually generated).

This bug has apparently been present since at least change 382948, but I 
suspect that it may have been present for the entire history of the mapreduce 
implementation of Nutch. It may also be the root cause of NUTCH-136, the 
explanation for which has always left me somewhat dissatisfied. Even if a 
nutch-site.xml containing default properties overrides the desired mapred 
properties (incorrectly) specified in one of the *-default.xml files, and 
therefore sets mapred.map.tasks and mapred.reduce.tasks back to the defaults (2 
and 1, respectively), it's not clear to me exactly how/why you'd get only a 
fraction of topN URLs fetched. As Stefan has suggested, it would actually seem 
more plausible if each tasktracker tried to fetch the entire set of URLs in 
this case.

I would suggest that someone with a good understanding of the hadoop 
implementation investigate the first generation job in fine detail, both for 
the case where the mapred properties are specified in an appropriate manner and 
for the case where nutch-site.xml overrides the desired properties, setting 
them back to the defaults.

 segment size is never as big as topN or crawlDB size in a distributed 
 deployment
 -

  Key: NUTCH-246
  URL: http://issues.apache.org/jira/browse/NUTCH-246
  Project: Nutch
 Type: Bug

 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Blocker
  Fix For: 0.8-dev


 I didn't reopen NUTCH-136 since it may be related to the hadoop split.
 I tested this on two different deployments (with 10 tasktrackers + 1 jobtracker 
 and with 9 tasktrackers and 1 jobtracker).
 Defining the map and reduce task numbers in a mapred-default.xml (present in 
 nutch/conf on all boxes) does not solve the problem.
 We verified that it is not a problem of the maximum urls per host and also not 
 a problem of the url filter.
 It looks like the first job of the Generator (Selector) already gets too few 
 entries to process. 
 Maybe this is somehow related to split generation or configuration inside 
 the distributed jobtracker, since it runs in a different jvm than the jobclient.
 However, we were not able to find the source of this problem.
 I think that this should be fixed before publishing Nutch 0.8. 




[jira] Created: (NUTCH-195) RPC call times out while indexing map task is computing splits

2006-01-31 Thread Chris Schneider (JIRA)
RPC call times out while indexing map task is computing splits
--

 Key: NUTCH-195
 URL: http://issues.apache.org/jira/browse/NUTCH-195
 Project: Nutch
Type: Bug
  Components: indexer  
Versions: 0.8-dev
 Environment: MapReduce multi-computer crawl environment: 11 machines (1 master 
with JobTracker/NameNode, 10 slaves with TaskTrackers/DataNodes)
Reporter: Chris Schneider


We've been using Nutch 0.8 (MapReduce) to perform some internet crawling. 
Things seemed to be going well until...

060129 222409 Lost tracker 'tracker_56288'
060129 222409 Task 'task_m_10gs5f' has been lost.
060129 222409 Task 'task_m_10qhzr' has been lost.
   
   
060129 222409 Task 'task_r_zggbwu' has been lost.
060129 222409 Task 'task_r_zh8dao' has been lost.
060129 222455 Server handler 8 on 8010 caught: java.net.SocketException: Socket 
closed
java.net.SocketException: Socket closed
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.nutch.ipc.Server$Handler.run(Server.java:216)
060129 222455 Adding task 'task_m_cia5po' to set for tracker 'tracker_56288'
060129 223711 Adding task 'task_m_ffv59i' to set for tracker 'tracker_25647'

I'm hoping that someone could explain why task_m_cia5po got added to 
tracker_56288 after this tracker was lost.

The Crawl.main process died with the following output:

060129 221129 Indexer: adding segment: 
/user/crawler/crawl-20060129091444/segments/20060129200246
Exception in thread "main" java.io.IOException: timed out waiting for response
at org.apache.nutch.ipc.Client.call(Client.java:296)
at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
at $Proxy1.submitJob(Unknown Source)
at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:263)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)

However, it definitely seems as if the JobTracker is still waiting for the job 
to finish (no failed jobs).

Doug Cutting's response:
The bug here is that the RPC call times out while the map task is computing 
splits.  The fix is that the job tracker should not compute splits until after 
it has returned from the submitJob RPC.  Please submit a bug in Jira to help 
remind us to fix this.

