[jira] Commented: (NUTCH-462) Noarchive urls are available via the cache link
[ https://issues.apache.org/jira/browse/NUTCH-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482332 ]

Andrzej Bialecki commented on NUTCH-462:

Is this happening with the latest trunk? See NUTCH-167, which is now in trunk.

Noarchive urls are available via the cache link
Key: NUTCH-462
URL: https://issues.apache.org/jira/browse/NUTCH-462
Project: Nutch
Issue Type: Bug
Components: web gui
Reporter: Steve Severance
Fix For: 0.8.1

If a robots.txt file specifies a Noarchive statement, then URLs that are contained as part of that path should not be available via the cached link. For example, Noarchive:/ means that no pages should be available via the cached link.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
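The rule described in the NUTCH-462 report can be sketched roughly as a path-prefix check. This is only an illustration of the behaviour the reporter asks for, not Nutch's implementation (see NUTCH-167 for the actual fix). The function names parse_noarchive and is_cache_allowed are hypothetical, the Noarchive directive is taken as the reporter describes it, and User-agent grouping is ignored for brevity.

```python
def parse_noarchive(robots_txt):
    """Collect the path prefixes from 'Noarchive:' lines (hypothetical helper)."""
    prefixes = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "noarchive" and value.strip():
            prefixes.append(value.strip())
    return prefixes


def is_cache_allowed(url_path, noarchive_prefixes):
    """Return False when url_path falls under any Noarchive path prefix."""
    return not any(url_path.startswith(p) for p in noarchive_prefixes)


if __name__ == "__main__":
    rules = parse_noarchive("User-agent: *\nNoarchive:/\n")
    # With 'Noarchive:/' every path matches, so no cached link should be shown.
    print(is_cache_allowed("/any/page.html", rules))
```

Under this reading, a search front end would call is_cache_allowed before rendering the cached link for a hit.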
Re: svn commit: r516643 - in /lucene/nutch/trunk/src/plugin/parse-html/src: java/org/apache/nutch/parse/html/DOMContentUtils.java test/org/apache/nutch/parse/html/TestDOMContentUtils.java
[EMAIL PROTECTED] wrote:

[ ... ]
-/**
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
[ ... ]
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with

This kind of thing is very unfortunate, since it makes it very difficult to figure out when particular lines were changed. I recommend always previewing commits with something like 'svn diff | less' before committing so that you can be sure to *only* commit changes that you intend. If your development environment does not permit you to preview the commit then please run subversion from the shell.

Doug
Re: Issues pending before 0.9 release
Andrzej Bialecki wrote:

Hi all, I just committed Hadoop 0.12.1. Let's double-check that it works ok. Here's the list of Critical/Blocker issues I mentioned before, and their current status:

Any other stuff we need to fix before the release?

I am satisfied except the broken bin/nutch.

--
Sami Siren
Re: 0.12.1 release plan
Hadoop 0.12.1 is now available (http://www.apache.org/dyn/closer.cgi/lucene/hadoop/). Release notes are here: http://tinyurl.com/2kynuc.

Cheers,
Tom

On 14/03/07, Nigel Daley [EMAIL PROTECTED] wrote:

[cross posting to nutch-dev since they're waiting for 0.12.1 release]

Still no progress on HADOOP-1093. Postponing Hadoop 0.12.1 release to Monday. Also, HADOOP-1118 and HADOOP-1119 are being investigated.

Cheers,
Nige

On Mar 13, 2007, at 10:49 AM, Nigel Daley wrote:

A number of additional 0.12.1 blockers have come up and have patches available for them. However, a fix for HADOOP-1093 is still pending. In light of this, I'm pushing Hadoop 0.12.1 release out 1 more day to tomorrow, Wednesday, March 14.

Cheers,
Nige

On Mar 12, 2007, at 2:42 PM, Nigel Daley wrote:

A fix for HADOOP-1093 is still pending. The fix for HADOOP-1091 failed and a new fix is pending. In light of these, I've rescheduled Hadoop 0.12.1 to tomorrow, Tuesday, March 13.

Cheers,
Nige

On Mar 9, 2007, at 9:49 AM, Doug Cutting wrote:

I've pushed out the release date for 0.12.1 to Monday. With recent patches, stability is looking a lot better, but there are still a few blocker issues. My hope is 0.12.1 will be a stable release that, e.g., Nutch can confidently integrate into its upcoming 0.9 release.

Trunk currently contains only changes for 0.12.1. No 0.13.0 changes have yet been committed. Let's hold off committing those until 0.12.1 is out the door, okay?

Doug
[jira] Closed: (NUTCH-462) Noarchive urls are available via the cache link
[ https://issues.apache.org/jira/browse/NUTCH-462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Severance closed NUTCH-462.

Resolution: Fixed

Duplicate; see NUTCH-167. Has been fixed.

Noarchive urls are available via the cache link
Key: NUTCH-462
URL: https://issues.apache.org/jira/browse/NUTCH-462
Project: Nutch
Issue Type: Bug
Components: web gui
Reporter: Steve Severance
Fix For: 0.8.1

If a robots.txt file specifies a Noarchive statement, then URLs that are contained as part of that path should not be available via the cached link. For example, Noarchive:/ means that no pages should be available via the cached link.
Multi-pass algorithms
If I want to run an iterative algorithm that passes over the same data multiple times, is there a way to have my MapReduce job use the same directory for both input and output? Or do I need to make a temp directory for each iteration?

Steve
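One way to picture the "temp directory for each iteration" option is a driver loop in which every pass reads the previous pass's output directory and writes to a fresh one. The sketch below is plain Python on the local filesystem, not Hadoop API code. run_pass is a hypothetical stand-in for a single MapReduce job (here it simply halves every number), and the iter-NNN directory naming is an assumption made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def run_pass(input_dir, output_dir):
    """Hypothetical stand-in for one MapReduce job: halves every number."""
    output_dir.mkdir(parents=True)
    for part in sorted(input_dir.iterdir()):
        values = [float(tok) for tok in part.read_text().split()]
        (output_dir / part.name).write_text(
            "\n".join(str(v / 2) for v in values) + "\n"
        )


def iterate(initial_dir, work_dir, passes):
    """Run `passes` jobs, each reading the directory the previous one wrote."""
    current = initial_dir
    for i in range(passes):
        next_dir = work_dir / f"iter-{i:03d}"
        run_pass(current, next_dir)
        # Remove the intermediate directory just consumed, never the original input.
        if current != initial_dir:
            shutil.rmtree(current)
        current = next_dir
    return current


def demo():
    """Build a tiny input, run three passes, and return the final values."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "input"
        data.mkdir()
        (data / "part-00000").write_text("8\n16\n")
        final = iterate(data, root / "work", passes=3)
        return [float(tok) for tok in (final / "part-00000").read_text().split()]


if __name__ == "__main__":
    print(demo())
```

Keeping the original input untouched and deleting only consumed intermediates means at most two intermediate directories exist at any time, and a failed pass can be rerun from the last completed iteration.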
[jira] Created: (NUTCH-463) Nutch powerpoint parser plugin fails to parse ppt with images
Nutch powerpoint parser plugin fails to parse ppt with images
Key: NUTCH-463
URL: https://issues.apache.org/jira/browse/NUTCH-463
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8.1
Environment: Windows
Reporter: Wilson Fong

With PowerPoint presentations that have images, the parser seems to treat the images as if they were text and tries to index them, resulting in maxFieldLength being reached. The lines from the crawl log file for the PowerPoint file in question:

Indexing [http://127.0.0.1/] with analyzer [EMAIL PROTECTED] (null)
Indexing [http://127.0.0.1/wiki/images/0/01/Customer.ppt] with analyzer [EMAIL PROTECTED] (null)
maxFieldLength 1 reached, ignoring following tokens
Re: Issues pending before 0.9 release
Sami Siren wrote:

Andrzej Bialecki wrote:

Hi all, I just committed Hadoop 0.12.1. Let's double-check that it works ok. Here's the list of Critical/Blocker issues I mentioned before, and their current status:

Any other stuff we need to fix before the release?

I am satisfied except the broken bin/nutch.

Fixed now - tested both under Cygwin and Fedora.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: Issues pending before 0.9 release
I am good to go as well.

Dennis Kubes

Andrzej Bialecki wrote:

Sami Siren wrote:

Andrzej Bialecki wrote:

Hi all, I just committed Hadoop 0.12.1. Let's double-check that it works ok. Here's the list of Critical/Blocker issues I mentioned before, and their current status:

Any other stuff we need to fix before the release?

I am satisfied except the broken bin/nutch.

Fixed now - tested both under Cygwin and Fedora.