Re: Bug in DeleteDuplicates.java ?
Gal Nitzan wrote:
> this function throws IOException. Why?
>
>   public long getPos() throws IOException {
>     return (doc*INDEX_LENGTH)/maxDoc;
>   }
>
> It should be throwing ArithmeticException

The IOException is required by the API of RecordReader. What happens when maxDoc is zero? Ka-boom! ;-) You're right, this should be wrapped in an IOException and rethrown.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com  Contact: info at sigram dot com
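A minimal sketch of such a fix might look like this (field names are taken from the snippet quoted above; the guard and the wrapping are illustrative, not the committed code):

    public long getPos() throws IOException {
      try {
        return (doc * INDEX_LENGTH) / maxDoc;
      } catch (ArithmeticException e) {      // maxDoc == 0 -> division by zero
        // rethrow as IOException, as the RecordReader API requires
        throw new IOException(e.toString());
      }
    }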
Re: Trunk is broken
Hi Andrzej,

Gal Nitzan wrote:
> It seems that Trunk is now broken...

DmozParser seems to be broken, too. Its package declaration is still org.apache.nutch.crawl instead of org.apache.nutch.tools.

TJ
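In other words, the first line of DmozParser.java presumably just needs to change to:

    package org.apache.nutch.tools;   // instead of org.apache.nutch.crawl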
Re: Trunk is broken
Thomas Jaeger wrote:
> Hi Andrzej,
>
> Gal Nitzan wrote:
>> It seems that Trunk is now broken...
>
> DmozParser seems to be broken, too. Its package declaration is still org.apache.nutch.crawl instead of org.apache.nutch.tools.

Fixed. Thanks!

--
Best regards,
Andrzej Bialecki
http://www.sigram.com  Contact: info at sigram dot com
[jira] Updated: (NUTCH-61) Adaptive re-fetch interval. Detecting unmodified content
[ http://issues.apache.org/jira/browse/NUTCH-61?page=all ]

Andrzej Bialecki updated NUTCH-61:
----------------------------------
    Attachment: 20051230.txt

Updated version for the latest mapred branch.

> Adaptive re-fetch interval. Detecting unmodified content
> ---------------------------------------------------------
>
>          Key: NUTCH-61
>          URL: http://issues.apache.org/jira/browse/NUTCH-61
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Reporter: Andrzej Bialecki
>     Assignee: Andrzej Bialecki
>  Attachments: 20050606.diff, 20051230.txt
>
> Currently Nutch doesn't adjust its re-fetch period automatically, no matter whether individual pages change seldom or frequently. The goal of these changes is to extend the current codebase to support various possible adjustments to re-fetch times and intervals, and specifically a re-fetch schedule that tries to adapt the period between consecutive fetches to the period of content changes. These patches also implement checking whether the content has changed since the last fetch; protocol plugins are changed to make use of this information, so that unmodified content doesn't have to be fetched and processed again.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators:
  http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
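A rough illustration of the "skip unmodified content" idea described in the issue: send a conditional GET with an If-Modified-Since header and treat HTTP 304 as "not modified". The plain HttpURLConnection usage and class name here are only illustrative; the actual patch changes the Nutch protocol plugins instead.

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConditionalFetchExample {
      /** Returns true if the server reports the page unchanged since lastFetched (epoch ms). */
      public static boolean isUnmodified(String url, long lastFetched) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setIfModifiedSince(lastFetched);   // adds the If-Modified-Since request header
        int code = conn.getResponseCode();
        conn.disconnect();
        return code == HttpURLConnection.HTTP_NOT_MODIFIED;   // 304
      }
    }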
Adaptive fetch interval unmodified content detection, episode II
Hi,

I've been working on a set of patches to implement this functionality for the mapred branch. I have a workable solution now, but before I decide to commit it I'd like to solicit some comments. Please see the latest patch available from JIRA, NUTCH-61.

Based on the past discussions, I decided to implement a maximum limit for the fetch interval, after which pages are unconditionally refetched, even if they are marked as UNMODIFIED. The reason for this is that pages could be stuck in this state for a very long time, and in the meantime the segments that contain copies of such pages could be expired (deleted or lost).

All protocol plugins have been changed to check for content modification and to return a specific status if it's unmodified, avoiding fetching the actual content. Modification is also checked based on a page signature, using the recently added pluggable signature implementations.

The main remaining doubt that I have is about the adaptive fetch interval functionality. The patch contains a framework for pluggable FetchSchedule implementations, which modify the fetch interval and the next fetch time based on the following information (a rough sketch of such a schedule follows at the end of this message):

* previous fetch time
* previous modification time (may be 0 if unknown)
* previous fetch interval
* current fetch time
* current modification time (may be 0 if unknown)
* a boolean value "changed", based on checking the page signatures (old vs. new), if the page's content is available

For efficiency reasons, most of this information is stored and passed to processing jobs inside instances of CrawlDatum - for the key step of DB update, other parts of segments (such as Content, ParseData or ParseText) are not used, which prevents easy access to other page metadata. For now, I added both the signature and the modifiedTime to CrawlDatum as separate attributes, but I'm considering putting them (and any other values that users might want to add to CrawlDB) into a Properties attribute.

The reason for this is that reality may be more complicated than the simple model above. Various sites use additional information to control re-fetching, besides the Last-Modified header that we use now, such as:

* the Expires header
* the ETag header
* caching headers
* page metadata

Additionally, some schemes for phasing out old segments might want to store some segment information inside the CrawlDb, such as the last segment name, where the latest copy can be found.

So, I'll hold off committing these patches until we can reach some agreement on how to proceed. We should keep as little information in CrawlDB as possible, but no less than is necessary... ;-)

Please review the patches and play around with them - they work properly even now.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com  Contact: info at sigram dot com
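To make the FetchSchedule idea above more concrete, here is a hedged sketch of an adaptive schedule. The class name, the scaling factors and the 90-day cap are illustrative assumptions for discussion, not what the attached patch actually does:

    public class AdaptiveFetchScheduleSketch {
      private static final float INC_RATE = 0.2f;   // back off when content is unmodified
      private static final float DEC_RATE = 0.2f;   // fetch sooner when content has changed
      // hard cap on the interval, after which a refetch is forced unconditionally (assumed: 90 days)
      private static final long MAX_INTERVAL = 90L * 24 * 60 * 60 * 1000;

      /** Computes the new fetch interval (ms) from the previous interval and the
       *  signature-based "changed" flag described above. */
      public long adjustInterval(long prevInterval, boolean changed) {
        long interval = changed
            ? (long) (prevInterval * (1.0f - DEC_RATE))   // page changed: shorten the interval
            : (long) (prevInterval * (1.0f + INC_RATE));  // unmodified: lengthen the interval
        return Math.min(interval, MAX_INTERVAL);          // never exceed the hard cap
      }

      /** The next fetch time is the current fetch time plus the adjusted interval. */
      public long nextFetchTime(long curFetchTime, long newInterval) {
        return curFetchTime + newInterval;
      }
    }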
Re: severe error in fetch
This problem is recurring. It happens when fetching https://www.kodak.com:0/something. I guess the port number 0 is the cause of the problem, because there is no problem fetching https://www.kodak.com/anything. See these log entries:

051230 105257 fetching https://www.kodak.com:0/eknec/PageQuerier.jhtml?pq-path=2/782/2608/2610/4074/7058pq-locale=en_US_loopback=1
051230 105305 SEVERE Host connection pool not found, hostConfig=HostConfiguration[host=https://www.kodak.com]
java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.

Is it right that some specific port numbers can cause connection pool problems in httpclient? If yes, I can filter out URLs containing these troublesome ports until httpclient is fixed (a sketch of such a filter follows at the end of this message).

Thanks,
AJ

On 12/26/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
> AJ Chen wrote:
>> Stefan,
>>
>> Here is the trace in my log. My SSFetcher (for site-specific fetch) is the same as the Nutch Fetcher, except that the URLFilters it uses include an additional filter based on domain names. Line 363 is:
>>
>>   throw new RuntimeException("SEVERE error logged. Exiting fetcher.");
>>
>> 051224 075950 SEVERE Host connection pool not found, hostConfig=HostConfiguration[host=https://www.kodak.com]
>
> This error comes from the httpclient library (you won't get a better stacktrace, you need to redefine the java.util.logging properties to get more info). I'm in the process of upgrading to the latest release, but it's trivial, you can try it yourself. Hopefully this should solve the issue.
>
> --
> Best regards,
> Andrzej Bialecki
> http://www.sigram.com  Contact: info at sigram dot com
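As a stopgap, the port-based filtering mentioned above could be as simple as the sketch below; this is only an illustration of the idea (plain java.net.URL parsing, hypothetical class name), not Nutch's actual URLFilter plugin interface:

    import java.net.MalformedURLException;
    import java.net.URL;

    public class ZeroPortFilter {
      /** Returns the URL unchanged if it is acceptable, or null to drop it. */
      public String filter(String urlString) {
        try {
          URL u = new URL(urlString);
          return (u.getPort() == 0) ? null : urlString;   // explicit ":0" -> drop
        } catch (MalformedURLException e) {
          return null;   // unparsable URLs are dropped as well
        }
      }
    }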