Re: Bug in DeleteDuplicates.java ?

2005-12-30 Thread Andrzej Bialecki

Gal Nitzan wrote:


This function throws IOException. Why?

public long getPos() throws IOException {
  return (doc * INDEX_LENGTH) / maxDoc;
}

It should be throwing ArithmeticException.

 



The IOException is required by the API of RecordReader.


What happens when maxDoc is zero?
 



Ka-boom! ;-) You're right, this should be wrapped in an IOException and 
rethrown.
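
Something along these lines should do it (just a sketch, reusing the fields 
from the existing method):

   public long getPos() throws IOException {
     try {
       return (doc * INDEX_LENGTH) / maxDoc;
     } catch (ArithmeticException e) {
       // maxDoc == 0: report it through the exception type the API allows
       throw new IOException("Cannot compute position: " + e.getMessage());
     }
   }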


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Trunk is broken

2005-12-30 Thread Thomas Jaeger
Hi Andrzej,

Gal Nitzan wrote:

 It seems that Trunk is now broken...



DmozParser seems to be broken, too. Its package declaration is still
org.apache.nutch.crawl instead of org.apache.nutch.tools.
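
For reference, the fix should simply be the declaration at the top of
DmozParser.java (a sketch; assuming nothing else still imports the old package):

   package org.apache.nutch.tools;   // was: package org.apache.nutch.crawl;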


TJ



Re: Trunk is broken

2005-12-30 Thread Andrzej Bialecki

Thomas Jaeger wrote:


Hi Andrzej,

Gal Nitzan wrote:

 


It seems that Trunk is now broken...

 




DmozParser seems to be broken, too. Its package declaration is still
org.apache.nutch.crawl instead of org.apache.nutch.tools.
 



Fixed. Thanks!

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Updated: (NUTCH-61) Adaptive re-fetch interval. Detecting unmodified content

2005-12-30 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-61?page=all ]

Andrzej Bialecki  updated NUTCH-61:
---

Attachment: 20051230.txt

Updated version for the latest mapred branch.

 Adaptive re-fetch interval. Detecting unmodified content
 ---

  Key: NUTCH-61
  URL: http://issues.apache.org/jira/browse/NUTCH-61
  Project: Nutch
 Type: New Feature
   Components: fetcher
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 
  Attachments: 20050606.diff, 20051230.txt

 Currently Nutch doesn't automatically adjust its re-fetch period, no matter 
 whether individual pages change seldom or frequently. The goal of these changes 
 is to extend the current codebase to support various possible adjustments to 
 re-fetch times and intervals, and specifically a re-fetch schedule that tries 
 to adapt the period between consecutive fetches to the period of content changes.
 These patches also implement checking whether the content has changed since the 
 last fetch; protocol plugins are changed to make use of this information, so 
 that unmodified content doesn't have to be fetched and processed.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Adaptive fetch interval & unmodified content detection, episode II

2005-12-30 Thread Andrzej Bialecki

Hi,

I've been working on a set of patches to implement this functionality 
for the mapred branch.


I have a workable solution now, but before I decide to commit it I'd 
like to solicit some comments. Please see the latest patch available 
from JIRA NUTCH-61.


Based on past discussions, I decided to implement a maximum limit on the 
fetch interval, after which pages are unconditionally refetched, even if 
they are marked as UNMODIFIED. The reason for this is that pages could 
otherwise be stuck in this state for a very long time, and in the meantime 
the segments that contain copies of such pages could expire (be deleted 
or lost).
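
Roughly, the rule is the following (names such as maxFetchInterval and 
forceRefetch are illustrative only, not the ones used in the patch):

   // refetch unconditionally once the page hasn't been fetched for longer
   // than the configured maximum interval, even if it looks UNMODIFIED
   long elapsed = curFetchTime - prevFetchTime;
   if (elapsed > maxFetchInterval) {
     forceRefetch(datum, curFetchTime);   // hypothetical helper
   }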


All protocol plugins have been changed to check for content modification 
and to return a specific status when a page is unmodified, avoiding the 
fetch of the actual content.
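
For the HTTP plugins this boils down to a conditional request; a rough 
sketch with commons-httpclient (not the exact code from the patch):

   // org.apache.commons.httpclient.methods.GetMethod, org.apache.commons.httpclient.HttpStatus
   GetMethod get = new GetMethod(url);
   // ask the server to skip the body if nothing changed since the last fetch
   get.setRequestHeader("If-Modified-Since", lastModifiedDate);   // RFC 1123 date string
   int code = httpClient.executeMethod(get);
   if (code == HttpStatus.SC_NOT_MODIFIED) {
     return notModifiedStatus();   // placeholder for the patch's actual status value
   }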


Modification is also checked based on a page signature, using the 
recently added pluggable signature implementations.
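
In pseudo-form, the "changed" flag is computed like this (a sketch; the 
names here are indicative rather than copied from the code):

   // compare the stored signature with a freshly computed one
   byte[] oldSig = oldDatum.getSignature();               // assumed accessor
   byte[] newSig = signature.calculate(content, parse);   // pluggable Signature impl
   boolean changed = (oldSig == null) || !java.util.Arrays.equals(oldSig, newSig);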


The main remaining doubt that I have is about the adaptive fetch interval 
functionality. The patch contains a framework for pluggable FetchSchedule 
implementations, which modify the fetch interval and the next fetch time 
based on the following information (a rough interface sketch follows the list):


* previous fetch time
* previous modification time (may be 0 if unknown)
* previous fetch interval
* current fetch time
* current modification time (may be 0 if unknown)
* a boolean "changed" flag, based on comparing the old and new page 
signatures, when the page's content is available
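
A rough shape of such an interface (method and parameter names are only 
illustrative, not necessarily those in the patch):

   public interface FetchSchedule {
     // compute the new fetch interval and the next fetch time for this page
     void setFetchSchedule(CrawlDatum datum,
                           long prevFetchTime, long prevModifiedTime,
                           long curFetchTime, long curModifiedTime,
                           boolean changed);
   }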


For efficiency reasons, most of this information is stored and passed to 
processing jobs inside instances of CrawlDatum; for the key step of the DB 
update, other parts of segments (such as Content, ParseData or ParseText) 
are not used, which prevents easy access to other page metadata. For now, 
I added both the signature and the modifiedTime to CrawlDatum as separate 
attributes, but I'm considering putting them (and any other values that 
users might want to add to CrawlDb) into a Properties attribute.
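
For illustration, the alternative would look roughly like this inside 
CrawlDatum (purely a sketch of the idea, not code from the patch):

   // one extensible container in CrawlDatum instead of dedicated fields
   java.util.Properties meta = new java.util.Properties();
   meta.setProperty("modifiedTime", Long.toString(modifiedTime));
   meta.setProperty("signature", toHexString(signature));   // hypothetical helper
   meta.setProperty("segment", lastSegmentName);             // see the segment note below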


The reason for this is that reality may be more complicated than the 
simple model above. Various sites use additional information to control 
re-fetching, besides the Last-Modified header that we use now, such as:


* Expires header
* ETag header
* Caching headers
* page metadata

Additionally, some schemes for phasing out old segments might want to 
store some segment information inside the CrawlDb, such as the name of the 
last segment where the latest copy of a page can be found.


So, I'll hold off on committing these patches until we can reach some 
agreement on how to proceed. We should keep as little information in 
CrawlDB as possible, but no less than is necessary... ;-)


Please review the patches and play around with them - they work properly 
even now.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: severe error in fetch

2005-12-30 Thread AJ Chen
This problem is recurring. It happens when fetching
https://www.kodak.com:0/something.  I suspect the port number 0 is the cause
of the problem, because there is no problem fetching
https://www.kodak.com/anything.  See the log entries:

051230 105257 fetching
https://www.kodak.com:0/eknec/PageQuerier.jhtml?pq-path=2/782/2608/2610/4074/7058&pq-locale=en_US&_loopback=1
051230 105305 SEVERE Host connection pool not found,
hostConfig=HostConfiguration[host=https://www.kodak.com]
java.lang.RuntimeException: SEVERE error logged.  Exiting fetcher.

Is it right that certain port numbers can cause connection pool problems
in httpclient? If so, I can filter out URLs containing these troublesome
ports until httpclient is fixed.
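
Until then, something along these lines in regex-urlfilter.txt should drop 
such URLs (just a sketch, assuming the regex URL filter plugin is enabled):

   # skip any URL that carries an explicit port 0
   -^https?://[^/]+:0(/|$)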

Thanks,
AJ

On 12/26/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 AJ Chen wrote:

 Stefan,
 Here is the trace in my log.  My SSFetcher (for site-specific fetch) is the
 same as the nutch Fetcher, except that the URLFilters it uses have an
 additional filter based on domain names. Line 363 is
 throw new RuntimeException("SEVERE error logged.  Exiting fetcher.");
 
 
 051224 075950 SEVERE Host connection pool not found,
 hostConfig=HostConfiguration[host=https://www.kodak.com]
 
 

 This error comes from the httpclient library (you won't get a better
 stacktrace; you need to redefine the java.util.logging properties to get
 more info). I'm in the process of upgrading to the latest release, but
 it's trivial, so you can try it yourself. Hopefully this will solve the
 issue.

 --
 Best regards,
 Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com