[jira] Commented: (NUTCH-462) Noarchive urls are available via the cache link

2007-03-20 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482332
 ] 

Andrzej Bialecki  commented on NUTCH-462:
-

Is this happening with the latest trunk? See NUTCH-167, which is now in trunk.

 Noarchive urls are available via the cache link
 ---

 Key: NUTCH-462
 URL: https://issues.apache.org/jira/browse/NUTCH-462
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Reporter: Steve Severance
 Fix For: 0.8.1


 If a robots.txt file specifies a Noarchive statement, then URLs that are 
 contained as part of that path should not be available via the cached link.
 For example, Noarchive:/ means that no pages should be available via the 
 cached link.
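
The rule described above can be sketched as a simple path-prefix check. This is only an illustration in Python, not Nutch's implementation (see NUTCH-167 for the actual fix). The function names `parse_noarchive_rules` and `cache_link_allowed`, and the `example.com` URLs, are hypothetical; the sketch assumes a `Noarchive:` line carries a single path prefix, as in the `Noarchive:/` example.

```python
from urllib.parse import urlsplit


def parse_noarchive_rules(robots_txt):
    """Collect path prefixes from 'Noarchive:' lines (hypothetical helper)."""
    prefixes = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "noarchive":
            value = value.strip()
            if value:
                prefixes.append(value)
    return prefixes


def cache_link_allowed(url, prefixes):
    """Return False when the URL path falls under any Noarchive prefix."""
    path = urlsplit(url).path or "/"
    return not any(path.startswith(prefix) for prefix in prefixes)


rules = parse_noarchive_rules("User-agent: *\nNoarchive: /private/\n")
print(cache_link_allowed("http://example.com/private/a.html", rules))  # False
print(cache_link_allowed("http://example.com/public/a.html", rules))   # True
```

With `Noarchive:/` every path starts with the prefix `/`, so no page would get a cached link, which matches the example in the description.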

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: svn commit: r516643 - in /lucene/nutch/trunk/src/plugin/parse-html/src: java/org/apache/nutch/parse/html/DOMContentUtils.java test/org/apache/nutch/parse/html/TestDOMContentUtils.java

2007-03-20 Thread Doug Cutting

[EMAIL PROTECTED] wrote:
[ ... ]

-/**
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with

[ ... ]

+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with


This kind of thing is very unfortunate, since it makes it very difficult 
to figure out when particular lines were changed.  I recommend always 
previewing commits with something like 'svn diff | less' before 
committing so that you can be sure to *only* commit changes that you 
intend.  If your development environment does not permit you to preview 
the commit then please run subversion from the shell.


Doug


Re: Issues pending before 0.9 release

2007-03-20 Thread Sami Siren

Andrzej Bialecki wrote:
 Hi all,
 
 I just committed Hadoop 0.12.1. Let's double-check that it works ok.
 Here's the list of Critical/Blocker issues I mentioned before, and their
 current status:
 
 Any other stuff we need to fix before the release?

I am satisfied except for the broken bin/nutch.

--
 Sami Siren



Re: 0.12.1 release plan

2007-03-20 Thread Tom White

Hadoop 0.12.1 is now available
(http://www.apache.org/dyn/closer.cgi/lucene/hadoop/). Release notes
are here: http://tinyurl.com/2kynuc.

Cheers,

Tom

On 14/03/07, Nigel Daley [EMAIL PROTECTED] wrote:

[cross posting to nutch-dev since they're waiting for 0.12.1 release]

Still no progress on HADOOP-1093.  Postponing Hadoop 0.12.1 release
to Monday.

Also, HADOOP-1118 and HADOOP-1119 are being investigated.

Cheers,
Nige

On Mar 13, 2007, at 10:49 AM, Nigel Daley wrote:

 A number of additional 0.12.1 blockers have come up and have
 patches available for them.  However, a fix for HADOOP-1093 is
 still pending.  In light of this, I'm pushing Hadoop 0.12.1 release
 out 1 more day to tomorrow, Wednesday, March 14.

 Cheers,
 Nige

 On Mar 12, 2007, at 2:42 PM, Nigel Daley wrote:

 A fix for HADOOP-1093 is still pending.  The fix for HADOOP-1091
 failed and a new fix is pending.  In light of these, I've
 rescheduled Hadoop 0.12.1 to tomorrow, Tuesday, March 13.

 Cheers,
 Nige

 On Mar 9, 2007, at 9:49 AM, Doug Cutting wrote:

 I've pushed out the release date for 0.12.1 to Monday.  With
 recent patches, stability is looking a lot better, but there are
 still a few blocker issues.   My hope is 0.12.1 will be a stable
 release that, e.g., Nutch can confidently integrate into its
 upcoming 0.9 release.

 Trunk currently contains only changes for 0.12.1.  No 0.13.0
 changes have yet been committed.  Let's hold off committing those
 until 0.12.1 is out the door, okay?

 Doug






[jira] Closed: (NUTCH-462) Noarchive urls are available via the cache link

2007-03-20 Thread Steve Severance (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Severance closed NUTCH-462.
-

Resolution: Fixed

Duplicate; see NUTCH-167, which has been fixed.

 Noarchive urls are available via the cache link
 ---

 Key: NUTCH-462
 URL: https://issues.apache.org/jira/browse/NUTCH-462
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Reporter: Steve Severance
 Fix For: 0.8.1


 If a robots.txt file specifies a Noarchive statement, then URLs that are 
 contained as part of that path should not be available via the cached link.
 For example, Noarchive:/ means that no pages should be available via the 
 cached link.




Multi-pass algorithms

2007-03-20 Thread Steve Severance

If I want to have an algorithm that runs over the same data multiple times
(it is an iterative algorithm) is there a way to have my MapReduce job use
the same directory for both input and output? Or do I need to make a temp
directory for each iteration?

Steve
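
One common pattern for the question above is to give each iteration its own output directory and feed that directory to the next pass as input. The sketch below illustrates only the directory chaining, in plain Python with local files; it does not use the Hadoop API. `run_pass` is a hypothetical stand-in for a single MapReduce job, and the `iter-N` directory names are an assumption made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def run_pass(input_dir, output_dir):
    """Hypothetical stand-in for one job: adds 1 to every number in each input file."""
    output_dir.mkdir(parents=True)
    for src in sorted(input_dir.iterdir()):
        values = [int(v) + 1 for v in src.read_text().split()]
        (output_dir / src.name).write_text(" ".join(map(str, values)))


def iterate(initial_dir, work_dir, passes):
    """Chain passes: each reads the previous output and writes a fresh directory."""
    current = initial_dir
    for i in range(passes):
        next_dir = work_dir / f"iter-{i}"
        run_pass(current, next_dir)
        # An intermediate directory can go once the next pass has consumed it;
        # the original input is never deleted.
        if current != initial_dir:
            shutil.rmtree(current)
        current = next_dir
    return current


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    seed = base / "input"
    seed.mkdir()
    (seed / "part-0").write_text("1 2 3")
    final = iterate(seed, base / "work", passes=3)
    print((final / "part-0").read_text())  # 4 5 6
```

Keeping input and output separate means a failed pass never corrupts the data it was reading, and only one intermediate directory exists at a time.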



[jira] Created: (NUTCH-463) Nutch powerpoint parser plugin fails to parse ppt with images

2007-03-20 Thread Wilson Fong (JIRA)

Nutch powerpoint parser plugin fails to parse ppt with images
-

 Key: NUTCH-463
 URL: https://issues.apache.org/jira/browse/NUTCH-463
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8.1
 Environment: Windows
Reporter: Wilson Fong


With PowerPoint presentations that have images, the parser seems to treat 
images as if they were text and tries to index them, resulting in 
maxFieldLength being reached.
The lines from the crawl log file for the powerpoint in question:

 Indexing [http://127.0.0.1/] with analyzer [EMAIL PROTECTED] (null)
 Indexing [http://127.0.0.1/wiki/images/0/01/Customer.ppt] with analyzer [EMAIL PROTECTED] (null)
maxFieldLength 1 reached, ignoring following tokens
 






Re: Issues pending before 0.9 release

2007-03-20 Thread Andrzej Bialecki

Sami Siren wrote:

Andrzej Bialecki wrote:

Hi all,

I just committed Hadoop 0.12.1. Let's double-check that it works ok.
Here's the list of Critical/Blocker issues I mentioned before, and their
current status:

Any other stuff we need to fix before the release?


I am satisfied except for the broken bin/nutch.


Fixed now - tested both under Cygwin and Fedora.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Issues pending before 0.9 release

2007-03-20 Thread Dennis Kubes

I am good to go as well.

Dennis Kubes

Andrzej Bialecki wrote:

Sami Siren wrote:

Andrzej Bialecki wrote:

Hi all,

I just committed Hadoop 0.12.1. Let's double-check that it works ok.
Here's the list of Critical/Blocker issues I mentioned before, and their
current status:

Any other stuff we need to fix before the release?


I am satisfied except for the broken bin/nutch.


Fixed now - tested both under Cygwin and Fedora.