Re: [jira] Commented: (NUTCH-266) hadoop bug when doing updatedb
Sami Siren (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-266?page=comments#action_12422929 ] Sami Siren commented on NUTCH-266:

I finally found the time to set up an environment with cygwin and try this out. I can confirm that the hadoop.jar version provided with Nutch gives these errors. I then tested Nutch with the Hadoop nightly jar and everything worked just fine. Can someone try the Hadoop nightly jar with Nutch and see if it works for you? Nightly builds for Hadoop are available from http://people.apache.org/dist/lucene/hadoop/nightly/

Are you planning to update Hadoop to trunk? I'd rather be careful with that - I'm not sure if it's still compatible with Java 1.4, besides being unreleased/unstable ...

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: [jira] Commented: (NUTCH-266) hadoop bug when doing updatedb
Are you planning to update Hadoop to trunk? I'd rather be careful with that - I'm not sure if it's still compatible with Java 1.4, besides being unreleased/unstable ...

Not planning an upgrade, just want to know if it resolves the issues. We can then decide what's the best thing to do.

--
Sami Siren
[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages
[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12422996 ] Enrico Triolo commented on NUTCH-322:

Ok, I can see your point; nevertheless I think we should consider some potential problems that could arise from such modifications:

1. When a redirect occurs, both the redirecting and the redirected pages should be indexed, independently of crawling depth, but I think this is what you meant from the beginning...
2. How should the linkdb be updated? Or better, should the linkdb be updated at all? I mean, if page A has a link to page B, and page B redirects to C, should we set an incoming link to C from A?

Fetcher discards ProtocolStatus, doesn't store redirected pages
---
Key: NUTCH-322
URL: http://issues.apache.org/jira/browse/NUTCH-322
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8-dev
Reporter: Andrzej Bialecki
Fix For: 0.8-dev

Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus contains important information, such as the protocol-level response code, lastModified time, and possibly other messages. I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In addition, if ProtocolStatus contains a valid lastModified time, the CrawlDatum's modified time should also be set to this value.

Additionally, Fetcher doesn't store redirected pages. Content of such pages is silently discarded. When Fetcher translates from protocol-level status to crawldb-level status it should probably store such pages with the following translation of status codes:

* ProtocolStatus.TEMP_MOVED - CrawlDatum.STATUS_DB_RETRY. This code indicates a transient change, so we probably shouldn't mark the initial URL as bad.
* ProtocolStatus.MOVED - CrawlDatum.STATUS_DB_GONE. This code indicates a permanent change, so the initial URL is no longer valid, i.e. it will always result in redirects.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
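The status translation proposed in the issue description can be sketched as follows. This is a minimal illustration, not the actual Fetcher code: the enums below stand in for the int constants that the real ProtocolStatus and CrawlDatum classes define.

```java
// Sketch of the proposed protocol-level -> crawldb-level status mapping.
// The enums are illustrative stand-ins for Nutch's int constants.
enum ProtocolCode { SUCCESS, TEMP_MOVED, MOVED }
enum DbStatus { DB_FETCHED, DB_RETRY, DB_GONE }

public class RedirectStatusTranslation {
    static DbStatus translate(ProtocolCode code) {
        switch (code) {
            case TEMP_MOVED:
                // Transient redirect: don't mark the initial URL as bad, retry later.
                return DbStatus.DB_RETRY;
            case MOVED:
                // Permanent redirect: the initial URL will always redirect, so it's gone.
                return DbStatus.DB_GONE;
            default:
                return DbStatus.DB_FETCHED;
        }
    }

    public static void main(String[] args) {
        System.out.println(translate(ProtocolCode.TEMP_MOVED)); // DB_RETRY
        System.out.println(translate(ProtocolCode.MOVED));      // DB_GONE
    }
}
```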
[jira] Updated: (NUTCH-167) Observation of META NAME=ROBOTS CONTENT=NOARCHIVE directive
[ http://issues.apache.org/jira/browse/NUTCH-167?page=all ] Andrzej Bialecki updated NUTCH-167:

Attachment: patch.txt

This patch implements support for Pragma: no-cache and Robots: noarchive. Three cache policies are supported in this patch:

* CACHE_FORBIDDEN_CONTENT: for pages that specify noarchive, only summaries will be shown; cached content won't be displayed.
* CACHE_FORBIDDEN_ALL: for pages that specify noarchive, neither summaries nor cached content will be shown - although they will still appear in the list of matching results.
* CACHE_FORBIDDEN_NONE: even for sites that specify noarchive, Nutch will still disobey it and show both summaries and cached content. This is the current (broken?) behavior.

Since this patch is important for legal reasons, I'd like to commit it soon, before the 0.8 release.

Observation of META NAME=ROBOTS CONTENT=NOARCHIVE directive
-
Key: NUTCH-167
URL: http://issues.apache.org/jira/browse/NUTCH-167
Project: Nutch
Issue Type: Improvement
Components: indexer, web gui
Affects Versions: 0.7.1
Reporter: Ed Whittaker
Priority: Critical
Attachments: patch.txt

Though not strictly a bug, this issue is potentially serious for users of Nutch who deploy live systems and might be threatened with legal action for caching copies of copyrighted material. The major search engines all observe this directive (even though apparently it's not standard), so there's every reason why Nutch should too.
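The semantics of the three cache policies can be sketched as a small decision function. This is only an illustration of the policy described above, not the attached patch; the class and method names are hypothetical.

```java
// Sketch of the three noarchive cache policies from the NUTCH-167 patch
// comment. The policy names mirror the constants mentioned above; the
// decision logic here is illustrative, not the actual patch.
enum CachePolicy { CACHE_FORBIDDEN_CONTENT, CACHE_FORBIDDEN_ALL, CACHE_FORBIDDEN_NONE }

public class NoArchivePolicy {
    /** May this hit's summary be shown, given the page's noarchive flag? */
    static boolean showSummary(CachePolicy policy, boolean noarchive) {
        if (!noarchive) return true;
        // Only the strictest policy suppresses summaries.
        return policy != CachePolicy.CACHE_FORBIDDEN_ALL;
    }

    /** May the cached content of this hit be shown? */
    static boolean showCachedContent(CachePolicy policy, boolean noarchive) {
        if (!noarchive) return true;
        // Only the "disobey noarchive" policy still shows cached content.
        return policy == CachePolicy.CACHE_FORBIDDEN_NONE;
    }

    public static void main(String[] args) {
        // A noarchive page under CACHE_FORBIDDEN_CONTENT: summary yes, cache no.
        System.out.println(showSummary(CachePolicy.CACHE_FORBIDDEN_CONTENT, true));       // true
        System.out.println(showCachedContent(CachePolicy.CACHE_FORBIDDEN_CONTENT, true)); // false
    }
}
```

Note that under all three policies the page still appears in the result list; the policy only governs summaries and cached content.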
[jira] Closed: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored
[ http://issues.apache.org/jira/browse/NUTCH-324?page=all ] Andrzej Bialecki closed NUTCH-324.

Fix Version/s: 0.8-dev
Resolution: Fixed

Patch applied, with minor whitespace diffs and doc clarifications. Thank you!

db.score.link.internal and db.score.link.external are ignored
-
Key: NUTCH-324
URL: http://issues.apache.org/jira/browse/NUTCH-324
Project: Nutch
Issue Type: Improvement
Components: fetcher
Reporter: Stefan Groschupf
Priority: Critical
Fix For: 0.8-dev
Attachments: InternalAndExternalLinkScoreFactor.patch

Configuration properties db.score.link.external and db.score.link.internal are ignored. For e.g. message board pages, or pages that have large navigation menus on each page, giving internal links a lower impact makes a lot of sense for scoring. This is also a serious problem for web spam, since spammers can set up just one domain with dynamically generated pages and thus highly manipulate the Nutch scores. So I also suggest that we give db.score.link.internal a default value of something like 0.25.
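A minimal sketch of how an internal/external link score factor might be applied when distributing score to outlinks. The same-host comparison and the 0.25 default are taken from the discussion above; the class and method names are hypothetical and this is not the attached patch.

```java
// Sketch of applying db.score.link.internal / db.score.link.external:
// outlinks to the same host get a reduced score contribution.
// Illustrative only; not the InternalAndExternalLinkScoreFactor patch.
import java.net.URI;

public class LinkScoreFactor {
    static final float SCORE_INTERNAL = 0.25f; // suggested default in the issue
    static final float SCORE_EXTERNAL = 1.0f;

    static float outlinkScore(String fromUrl, String toUrl, float baseScore) {
        String fromHost = URI.create(fromUrl).getHost();
        String toHost = URI.create(toUrl).getHost();
        // A link is internal when source and target share the same host.
        boolean internal = fromHost != null && fromHost.equalsIgnoreCase(toHost);
        return baseScore * (internal ? SCORE_INTERNAL : SCORE_EXTERNAL);
    }

    public static void main(String[] args) {
        System.out.println(outlinkScore("http://a.com/p1", "http://a.com/p2", 1.0f)); // 0.25
        System.out.println(outlinkScore("http://a.com/p1", "http://b.com/", 1.0f));   // 1.0
    }
}
```

This is why the issue matters for spam: without the internal factor, one domain full of generated pages contributes full-weight links to itself.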
[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages
[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12423187 ] Andrzej Bialecki commented on NUTCH-322:

Good questions ... ;)

ad 1: Google shows only the final page, and you can access it through both the original (starting) URL and the final redirected URL. You can't view the intermediate pages. To be Google-compatible we should index only the final page, but put it under both URLs. This is relatively easy to implement in Fetcher and index-basic, by appropriately marking the starting and intermediate pages, skipping any non-final pages during indexing, and then adding the original URL to the final URL when indexing the final page. Also, I think that if the redirect refresh time is large (e.g. larger than 20 seconds) we should consider the pages to be separate, and treat them separately.

ad 2: Google shows only inlinks going to the final URL. However, the same inlinks can be obtained by using either the starting or the final URL. OTOH, MSN has separate inlinks in each case. I'm not sure yet how we should implement this...

--
This message is automatically generated by JIRA.
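The "index the final page under both URLs" idea from ad 1 can be sketched as follows. All names here are illustrative, not Nutch APIs: given a map of known redirects, we follow the chain to its final URL and return the URLs the final page should be indexed under, skipping intermediate hops.

```java
// Sketch of resolving a redirect chain and choosing index URLs, per the
// Google-compatible behavior described above. Illustrative names only.
import java.util.*;

public class RedirectResolver {
    /**
     * redirects maps a URL to its redirect target, if any. Returns the
     * URLs to index the final page under: the original and the final URL,
     * but none of the intermediate hops. Guards against redirect cycles.
     */
    static List<String> indexUrls(String startUrl, Map<String, String> redirects) {
        String url = startUrl;
        Set<String> seen = new LinkedHashSet<>();
        while (redirects.containsKey(url) && seen.add(url)) {
            url = redirects.get(url);
        }
        List<String> result = new ArrayList<>();
        result.add(startUrl);
        if (!url.equals(startUrl)) result.add(url);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://a/", "http://b/");
        redirects.put("http://b/", "http://c/");
        // a -> b -> c: index under a (original) and c (final), not b.
        System.out.println(indexUrls("http://a/", redirects)); // [http://a/, http://c/]
    }
}
```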
Re: segread vs. readseg
I like it!

Am 24.07.2006 um 16:10 schrieb Andrzej Bialecki:

Stefan Neufeind wrote: Andrzej Bialecki wrote: Stefan Groschupf wrote: Hi developers, we have commands like readdb and readlinkdb, but segread. Wouldn't it be more consistent to name the command readseg instead of segread? ... just a thought.

Yes, it seems more consistent. However, if we change it then scripts people wrote would break. We could support both aliases in 0.8 and give a deprecation message. What do others think?

Same feeling here.

Agreed. What about the following?

Index: bin/nutch
===
--- bin/nutch (revision 424960)
+++ bin/nutch (working copy)
@@ -40,7 +40,7 @@
 echo generate generate new segments to fetch
 echo fetch fetch a segment's pages
 echo parse parse a segment's pages
- echo segread read / dump segment data
+ echo readseg read / dump segment data
 echo mergesegs merge several segments, with optional filtering and slicing
 echo updatedb update crawl db from segments after fetching
 echo invertlinks create a linkdb from parsed segments
@@ -158,7 +158,10 @@
 CLASS=org.apache.nutch.crawl.CrawlDbMerger
 elif [ $COMMAND = readlinkdb ] ; then
 CLASS=org.apache.nutch.crawl.LinkDbReader
+elif [ $COMMAND = readseg ] ; then
+ CLASS=org.apache.nutch.segment.SegmentReader
 elif [ $COMMAND = segread ] ; then
+ echo [DEPRECATED] Command 'segread' is deprecated, use 'readseg' instead.
 CLASS=org.apache.nutch.segment.SegmentReader
 elif [ $COMMAND = mergesegs ] ; then
 CLASS=org.apache.nutch.segment.SegmentMerger

--
Best regards,
Andrzej Bialecki
Why was prune removed in 0.8?
Hi, I might be bringing up old discussions (sorry if so) - but while discussing segread/readseg I wondered why prune is missing from bin/nutch. It still works when you give the full class name by hand. But could it be (re)added to bin/nutch as well? Regards, Stefan
Re: Why was prune removed in 0.8?
Stefan Neufeind wrote: Hi, I might be bringing up old discussions (sorry if so) - but while discussing segread/readseg I wondered why prune is missing from bin/nutch. It still works when you give the full class name by hand. But could it be (re)added to bin/nutch as well?

I think PruneIndexTool's command-line usage is not fully compatible with the current layout of indexes. I mean that in 0.8 indexes are not created inside each segment directory, and also a single output index consists of as many parts as there were reduce tasks ... so some fiddling with paths and arguments will be necessary to fix it.

--
Best regards,
Andrzej Bialecki