Build failed in Jenkins: Nutch-trunk #1961

2012-09-18 Thread Apache Jenkins Server
See 

--
Started by timer
Building remotely on solaris1 in workspace 

hudson.util.IOException2: remote file operation failed: 
 at 
hudson.remoting.Channel@3781cfaa:solaris1
at hudson.FilePath.act(FilePath.java:838)
at hudson.FilePath.act(FilePath.java:824)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1256)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:589)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:494)
at hudson.model.Run.execute(Run.java:1502)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:236)
Caused by: java.io.IOException: Remote call on solaris1 failed
at hudson.remoting.Channel.call(Channel.java:673)
at hudson.FilePath.act(FilePath.java:831)
... 11 more
Caused by: java.lang.LinkageError: duplicate class definition: hudson/model/Descriptor
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:152)
at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
at java.lang.Class.getDeclaredField(Class.java:1852)
at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:400)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
at hudson.remoting.UserRequest.perform(UserRequest.java:98)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:326)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
at java.util.concurrent.FutureTask.run(FutureTask.java:123)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecu

Build failed in Jenkins: Nutch-nutchgora #352

2012-09-18 Thread Apache Jenkins Server
See 

--
Started by timer
Building remotely on solaris1 in workspace 

hudson.util.IOException2: remote file operation failed: 
 at 
hudson.remoting.Channel@3781cfaa:solaris1
at hudson.FilePath.act(FilePath.java:838)
at hudson.FilePath.act(FilePath.java:824)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1256)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:589)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:494)
at hudson.model.Run.execute(Run.java:1502)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:236)
Caused by: java.io.IOException: Remote call on solaris1 failed
at hudson.remoting.Channel.call(Channel.java:673)
at hudson.FilePath.act(FilePath.java:831)
... 11 more
Caused by: java.lang.LinkageError: duplicate class definition: hudson/model/Descriptor
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:152)
at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
at java.lang.Class.getDeclaredField(Class.java:1852)
at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:400)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
at hudson.remoting.UserRequest.perform(UserRequest.java:98)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:326)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
at java.util.concurrent.FutureTask.run(FutureTask.java:123)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Thre

Re: svn commit: r1387363 - in /nutch/branches/2.1: CHANGES.txt build.xml pom.xml

2012-09-18 Thread Mattmann, Chris A (388J)
Lewis you beat me to it, you ROCK!

Cheers,
Chris

On Sep 18, 2012, at 5:11 PM, 
  wrote:

> Author: lewismc
> Date: Tue Sep 18 21:11:06 2012
> New Revision: 1387363
> 
> URL: http://svn.apache.org/viewvc?rev=1387363&view=rev
> Log:
> forward port of NUTCH-1415
> 
> Modified:
>nutch/branches/2.1/CHANGES.txt
>nutch/branches/2.1/build.xml
>nutch/branches/2.1/pom.xml
> 
> Modified: nutch/branches/2.1/CHANGES.txt
> URL: 
> http://svn.apache.org/viewvc/nutch/branches/2.1/CHANGES.txt?rev=1387363&r1=1387362&r2=1387363&view=diff
> ==
> --- nutch/branches/2.1/CHANGES.txt (original)
> +++ nutch/branches/2.1/CHANGES.txt Tue Sep 18 21:11:06 2012
> @@ -3,6 +3,8 @@ Nutch Change Log
> Release 2.1 (19/09/2012) ddmm
> Full Jira Report - 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12321040
> 
> +* NUTCH-1415 release packages to contain top level folder apache-nutch-x.x 
> (snagel)
> +
> * NUTCH-1432 property storage.schema does not work anymore, should be 
> storage.schema.webpage and storage.schema.host (lewismc)
> 
> * NUTCH-1468 Redirects that are external links not adhering to 
> db.ignore.external.links (Matt MacDonald via ferdy)
> 
> Modified: nutch/branches/2.1/build.xml
> URL: 
> http://svn.apache.org/viewvc/nutch/branches/2.1/build.xml?rev=1387363&r1=1387362&r2=1387363&view=diff
> ==
> --- nutch/branches/2.1/build.xml (original)
> +++ nutch/branches/2.1/build.xml Tue Sep 18 21:11:06 2012
> @@ -700,14 +700,13 @@
>   
>   
>  -  destfile="${src.dist.version.dir}.tar.gz" 
> basedir="${src.dist.version.dir}">
> -  
> - 
> - 
> -
> +  destfile="${src.dist.version.dir}.tar.gz">
> +   prefix="${final.name}">
> +
> +
>   
> -  
> -
> +   prefix="${final.name}">
> +
>   
> 
>   
> @@ -717,13 +716,13 @@
>   
>   
>  -  destfile="${bin.dist.version.dir}.tar.gz" 
> basedir="${bin.dist.version.dir}">
> -  
> - 
> -
> +  destfile="${bin.dist.version.dir}.tar.gz">
> +   prefix="${final.name}">
> +
> +
>   
> -  
> -
> +   prefix="${final.name}">
> +
>   
> 
>   
> @@ -733,14 +732,13 @@
>   
>   
> -   destfile="${src.dist.version.dir}.zip" basedir="${src.dist.version.dir}">
> -   
> -   
> -   
> -   
> + destfile="${src.dist.version.dir}.zip">
> +prefix="${final.name}">
> +   
> +   
>
> -   
> -   
> +prefix="${final.name}">
> +   
>
>
>   
> @@ -750,13 +748,13 @@
>   
>   
> -   destfile="${bin.dist.version.dir}.zip" basedir="${bin.dist.version.dir}">
> -   
> -   
> -   
> + destfile="${bin.dist.version.dir}.zip">
> +prefix="${final.name}">
> +   
> +   
>
> -   
> -   
> +prefix="${final.name}">
> +   
>
>
>   
> 
> Modified: nutch/branches/2.1/pom.xml
> URL: 
> http://svn.apache.org/viewvc/nutch/branches/2.1/pom.xml?rev=1387363&r1=1387362&r2=1387363&view=diff
> ==
> --- nutch/branches/2.1/pom.xml (original)
> +++ nutch/branches/2.1/pom.xml Tue Sep 18 21:11:06 2012
> @@ -22,7 +22,7 @@
>   org.apache.nutch
>   nutch
>   jar
> -  2.0
> +  2.1
>   Apache Nutch
>   http://nutch.apache.org
>   
> @@ -109,6 +109,12 @@
> 
> 
> 
> +org.elasticsearch
> +elasticsearch
> +0.19.4
> +true
> +
> +
> org.apache.solr
> solr-solrj
> 3.4.0
> @@ -165,7 +171,7 @@
> 
> org.apache.gora
> gora-core
> -0.2
> +0.2.1
> true
> 
> 
> 
> 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



[jira] [Commented] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458188#comment-13458188
 ] 

Hudson commented on NUTCH-1415:
---

Integrated in nutch-trunk-maven #426 (See 
[https://builds.apache.org/job/nutch-trunk-maven/426/])
NUTCH-1415 release packages to contain top level folder apache-nutch-x.x 
(Revision 1387357)

 Result = SUCCESS
snagel : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/build.xml


> release packages to contain top level folder apache-nutch-x.x
> -
>
> Key: NUTCH-1415
> URL: https://issues.apache.org/jira/browse/NUTCH-1415
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6, 1.5.1
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.6, 2.1
>
> Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch
>
>
> The release packages should contain a top level folder named apache-nutch-x.x 
> (x replaced by major and minor version) as in previous releases. Unpacking 
> the packages from the command line via tar xvfz package.tar.gz or unzip 
> package.zip should place all files in that folder. Cf. discussions on mailing 
> lists:
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E
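
The packaging requirement described in the issue can be illustrated with a small sketch. This is not the Ant change that was committed; it is a Python approximation using only the standard library, and the file names, the helper names and the folder name apache-nutch-2.1 are made up for the example. It builds a .tar.gz and a .zip in memory, places every entry under one top level folder, and reports which top level names an unpacker would create.

```python
import io
import tarfile
import zipfile


def build_tar_gz(files, prefix):
    """Return .tar.gz bytes with every entry stored under `prefix`/."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=f"{prefix}/{name}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_zip(files, prefix):
    """Return .zip bytes with every entry stored under `prefix`/."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in sorted(files.items()):
            zf.writestr(f"{prefix}/{name}", data)
    return buf.getvalue()


def top_level_names_tar(archive_bytes):
    """Top level names that unpacking the .tar.gz would create."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return {member.name.split("/", 1)[0] for member in tar.getmembers()}


def top_level_names_zip(archive_bytes):
    """Top level names that unpacking the .zip would create."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {name.split("/", 1)[0] for name in zf.namelist()}


if __name__ == "__main__":
    # Hypothetical release contents, for illustration only.
    files = {"bin/nutch": b"#!/bin/sh\n", "conf/nutch-default.xml": b"<configuration/>"}
    print(top_level_names_tar(build_tar_gz(files, "apache-nutch-2.1")))
    print(top_level_names_zip(build_zip(files, "apache-nutch-2.1")))
```

A check like top_level_names_tar could in principle be run against a real release artifact before publishing, to confirm that unpacking does not scatter files into the current directory.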

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1441) AnchorIndexingFilter should use plain HashSet

2012-09-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458189#comment-13458189
 ] 

Hudson commented on NUTCH-1441:
---

Integrated in nutch-trunk-maven #426 (See 
[https://builds.apache.org/job/nutch-trunk-maven/426/])
NUTCH-1441 AnchorIndexingFilter should use plain HashSet (Revision 1387341)

 Result = SUCCESS
lewismc : 
Files : 
* /nutch/trunk/CHANGES.txt
* 
/nutch/trunk/src/plugin/index-anchor/src/java/org/apache/nutch/indexer/anchor/AnchorIndexingFilter.java


> AnchorIndexingFilter should use plain HashSet
> -
>
> Key: NUTCH-1441
> URL: https://issues.apache.org/jira/browse/NUTCH-1441
> Project: Nutch
>  Issue Type: Bug
>Reporter: Ferdy Galema
>Priority: Minor
> Fix For: 1.6, 2.1
>
> Attachments: NUTCH-1441.patch, NUTCH-1441-trunk.patch
>
>
> AnchorIndexingFilter should use a plain HashSet instead of a WeakHashMap. 
> A WeakHashMap is unnecessary and can perhaps even cause bugs. (A WeakHashMap 
> gets its entries removed when the GC notices that the keys are no longer in 
> use elsewhere.)
> This patch also makes the filter a bit faster by lazily instantiating the set. 
> (There is no need to create one every time when deduplication is off.)
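
The lazy instantiation idea in the description can be sketched as follows. This is a Python illustration rather than the Java patch attached to the issue; the function name filter_anchors and the exact-match comparison are assumptions made for the example. The set of seen anchors is a plain set with strong references, and it is only allocated once deduplication is actually needed.

```python
def filter_anchors(anchors, deduplicate):
    """Return anchors in order, dropping repeats only when deduplicate is True."""
    seen = None  # created lazily, so nothing is allocated when deduplication is off
    kept = []
    for anchor in anchors:
        if deduplicate:
            if seen is None:
                seen = set()  # plain set: entries stay until the function returns
            if anchor in seen:
                continue
            seen.add(anchor)
        kept.append(anchor)
    return kept


if __name__ == "__main__":
    print(filter_anchors(["Home", "About", "Home"], deduplicate=True))
    print(filter_anchors(["Home", "About", "Home"], deduplicate=False))
```

Because the set holds ordinary references, an entry cannot disappear between the membership test and the insert, which is the kind of surprise the description attributes to WeakHashMap.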



Re: svn commit: r1387356 - in /nutch/branches/2.x: CHANGES.txt build.xml

2012-09-18 Thread Sebastian Nagel
Great.

On 09/18/2012 10:57 PM, Lewis John Mcgibbney wrote:
> Hi Seb,
> 
> I totally forgot about this. I will forward port to 2.1 branch before
> pushing the release.
> 
> Thanks
> 
> Lewis.
> 
> On Tue, Sep 18, 2012 at 9:52 PM,   wrote:
>> Author: snagel
>> Date: Tue Sep 18 20:52:08 2012
>> New Revision: 1387356
>>
>> URL: http://svn.apache.org/viewvc?rev=1387356&view=rev
>> Log:
>> NUTCH-1415 release packages to contain top level folder apache-nutch-x.x
>>
>> Modified:
>> nutch/branches/2.x/CHANGES.txt
>> nutch/branches/2.x/build.xml
>>
>> Modified: nutch/branches/2.x/CHANGES.txt
>> URL: 
>> http://svn.apache.org/viewvc/nutch/branches/2.x/CHANGES.txt?rev=1387356&r1=1387355&r2=1387356&view=diff
>> ==
>> --- nutch/branches/2.x/CHANGES.txt (original)
>> +++ nutch/branches/2.x/CHANGES.txt Tue Sep 18 20:52:08 2012
>> @@ -2,6 +2,8 @@ Nutch Change Log
>>
>>  Release 2.1 - Current Development
>>
>> +* NUTCH-1415 release packages to contain top level folder apache-nutch-x.x 
>> (snagel)
>> +
>>  * NUTCH-1432 property storage.schema does not work anymore, should be 
>> storage.schema.webpage and storage.schema.host (lewismc)
>>
>>  * NUTCH-1468 Redirects that are external links not adhering to 
>> db.ignore.external.links (Matt MacDonald via ferdy)
>>
>> Modified: nutch/branches/2.x/build.xml
>> URL: 
>> http://svn.apache.org/viewvc/nutch/branches/2.x/build.xml?rev=1387356&r1=1387355&r2=1387356&view=diff
>> ==
>> --- nutch/branches/2.x/build.xml (original)
>> +++ nutch/branches/2.x/build.xml Tue Sep 18 20:52:08 2012
>> @@ -700,14 +700,13 @@
>>
>>
>>  > -  destfile="${src.dist.version.dir}.tar.gz" 
>> basedir="${src.dist.version.dir}">
>> -  
>> -   
>> -   
>> -
>> +  destfile="${src.dist.version.dir}.tar.gz">
>> +  > prefix="${final.name}">
>> +
>> +
>>
>> -  
>> -
>> +  > prefix="${final.name}">
>> +
>>
>>  
>>
>> @@ -717,13 +716,13 @@
>>
>>
>>  > -  destfile="${bin.dist.version.dir}.tar.gz" 
>> basedir="${bin.dist.version.dir}">
>> -  
>> -   
>> -
>> +  destfile="${bin.dist.version.dir}.tar.gz">
>> +  > prefix="${final.name}">
>> +
>> +
>>
>> -  
>> -
>> +  > prefix="${final.name}">
>> +
>>
>>  
>>
>> @@ -733,14 +732,13 @@
>>
>>
>> > -   destfile="${src.dist.version.dir}.zip" basedir="${src.dist.version.dir}">
>> -   
>> -   
>> -   
>> -   
>> + destfile="${src.dist.version.dir}.zip">
>> +   > prefix="${final.name}">
>> +   
>> +   
>> 
>> -   
>> -   
>> +   > prefix="${final.name}">
>> +   
>> 
>> 
>>
>> @@ -750,13 +748,13 @@
>>
>>
>> > -   destfile="${bin.dist.version.dir}.zip" basedir="${bin.dist.version.dir}">
>> -   
>> -   
>> -   
>> + destfile="${bin.dist.version.dir}.zip">
>> +   > prefix="${final.name}">
>> +   
>> +   
>> 
>> -   
>> -   
>> +   > prefix="${final.name}">
>> +   
>> 
>> 
>>
>>
>>
> 
> 
> 



[jira] [Resolved] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1415.


   Resolution: Fixed
Fix Version/s: 2.1
   1.6

committed to trunk (revision 1387357) and 2.x (revision 1387356)

> release packages to contain top level folder apache-nutch-x.x
> -
>
> Key: NUTCH-1415
> URL: https://issues.apache.org/jira/browse/NUTCH-1415
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6, 1.5.1
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.6, 2.1
>
> Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch
>
>
> The release packages should contain a top level folder named apache-nutch-x.x 
> (x replaced by major and minor version) as in previous releases. Unpacking 
> the packages from the command line via tar xvfz package.tar.gz or unzip 
> package.zip should place all files in that folder. Cf. discussions on mailing 
> lists:
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E



Re: svn commit: r1387356 - in /nutch/branches/2.x: CHANGES.txt build.xml

2012-09-18 Thread Lewis John Mcgibbney
Hi Seb,

I totally forgot about this. I will forward port to 2.1 branch before
pushing the release.

Thanks

Lewis.

On Tue, Sep 18, 2012 at 9:52 PM,   wrote:
> Author: snagel
> Date: Tue Sep 18 20:52:08 2012
> New Revision: 1387356
>
> URL: http://svn.apache.org/viewvc?rev=1387356&view=rev
> Log:
> NUTCH-1415 release packages to contain top level folder apache-nutch-x.x
>
> Modified:
> nutch/branches/2.x/CHANGES.txt
> nutch/branches/2.x/build.xml
>
> Modified: nutch/branches/2.x/CHANGES.txt
> URL: 
> http://svn.apache.org/viewvc/nutch/branches/2.x/CHANGES.txt?rev=1387356&r1=1387355&r2=1387356&view=diff
> ==
> --- nutch/branches/2.x/CHANGES.txt (original)
> +++ nutch/branches/2.x/CHANGES.txt Tue Sep 18 20:52:08 2012
> @@ -2,6 +2,8 @@ Nutch Change Log
>
>  Release 2.1 - Current Development
>
> +* NUTCH-1415 release packages to contain top level folder apache-nutch-x.x 
> (snagel)
> +
>  * NUTCH-1432 property storage.schema does not work anymore, should be 
> storage.schema.webpage and storage.schema.host (lewismc)
>
>  * NUTCH-1468 Redirects that are external links not adhering to 
> db.ignore.external.links (Matt MacDonald via ferdy)
>
> Modified: nutch/branches/2.x/build.xml
> URL: 
> http://svn.apache.org/viewvc/nutch/branches/2.x/build.xml?rev=1387356&r1=1387355&r2=1387356&view=diff
> ==
> --- nutch/branches/2.x/build.xml (original)
> +++ nutch/branches/2.x/build.xml Tue Sep 18 20:52:08 2012
> @@ -700,14 +700,13 @@
>
>
>   -  destfile="${src.dist.version.dir}.tar.gz" 
> basedir="${src.dist.version.dir}">
> -  
> -   
> -   
> -
> +  destfile="${src.dist.version.dir}.tar.gz">
> +   prefix="${final.name}">
> +
> +
>
> -  
> -
> +   prefix="${final.name}">
> +
>
>  
>
> @@ -717,13 +716,13 @@
>
>
>   -  destfile="${bin.dist.version.dir}.tar.gz" 
> basedir="${bin.dist.version.dir}">
> -  
> -   
> -
> +  destfile="${bin.dist.version.dir}.tar.gz">
> +   prefix="${final.name}">
> +
> +
>
> -  
> -
> +   prefix="${final.name}">
> +
>
>  
>
> @@ -733,14 +732,13 @@
>
>
>  -   destfile="${src.dist.version.dir}.zip" basedir="${src.dist.version.dir}">
> -   
> -   
> -   
> -   
> + destfile="${src.dist.version.dir}.zip">
> +prefix="${final.name}">
> +   
> +   
> 
> -   
> -   
> +prefix="${final.name}">
> +   
> 
> 
>
> @@ -750,13 +748,13 @@
>
>
>  -   destfile="${bin.dist.version.dir}.zip" basedir="${bin.dist.version.dir}">
> -   
> -   
> -   
> + destfile="${bin.dist.version.dir}.zip">
> +prefix="${final.name}">
> +   
> +   
> 
> -   
> -   
> +prefix="${final.name}">
> +   
> 
> 
>
>
>



-- 
Lewis


[Nutch Wiki] Trivial Update of "Release_HOWTO" by LewisJohnMcgibbney

2012-09-18 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "Release_HOWTO" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/Release_HOWTO?action=diff&rev1=17&rev2=18

1. Run unit tests.
 {{{ant test}}}
1. Do basic test to see if release looks ok - e.g. install it and run 
example from tutorial.
- 1. Get hold of '''maven-ant-tasks-2.X.X.jar''' for 
[[http://search.maven.org/#search|gav|1|g%3A%22org.apache.maven%22%20AND%20a%3A%22maven-ant-tasks%22|here]]
 and put it in the ivy directory
+ 1. Get hold of '''maven-ant-tasks-2.X.X.jar''' from 
[[http://search.maven.org/|here]] and put it in the ivy directory
  1. Execute ant -lib ivy deploy from $NUTCH_HOME; this will sign the 
Maven artifacts (sources, javadoc, .jar) and send them to an Apache Nexus 
staging repository. Details of how to set this up can be found 
[[http://www.apache.org/dev/publishing-maven-artifacts.html|here]].
  1. Remove the maven-ant-tasks jar from the ivy directory 
1. Tag it. 


[Nutch Wiki] Trivial Update of "Release_HOWTO" by LewisJohnMcgibbney

2012-09-18 Thread Apache Wiki
The "Release_HOWTO" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/Release_HOWTO?action=diff&rev1=16&rev2=17

1. Run unit tests.
 {{{ant test}}}
1. Do basic test to see if release looks ok - e.g. install it and run 
example from tutorial.
- 1. Get hold of '''maven-ant-tasks-2.X.X.jar''' and put it in the ivy 
directory
+ 1. Get hold of '''maven-ant-tasks-2.X.X.jar''' for 
[[http://search.maven.org/#search|gav|1|g%3A%22org.apache.maven%22%20AND%20a%3A%22maven-ant-tasks%22|here]]
 and put it in the ivy directory
  1. Execute ant -lib ivy deploy from $NUTCH_HOME; this will sign the 
Maven artifacts (sources, javadoc, .jar) and send them to an Apache Nexus 
staging repository. Details of how to set this up can be found 
[[http://www.apache.org/dev/publishing-maven-artifacts.html|here]].
  1. Remove the maven-ant-tasks jar from the ivy directory 
1. Tag it. 


[Nutch Wiki] Trivial Update of "Release_HOWTO" by LewisJohnMcgibbney

2012-09-18 Thread Apache Wiki
The "Release_HOWTO" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/Release_HOWTO?action=diff&rev1=15&rev2=16

1. Update version numbers (from X.Y-dev to X.Y) for release in:
* nutch-default.xml - http.agent.version property
* default.properties - version property and year property
- * schema.xml - version property
1. Update CHANGES.txt with release date and (if needed) add additional 
changelog entries. It's also good practice to include a link to the Jira 
report. 
1. Check if documentation needs an update. Although this may be a huge 
task at any given time, any minor contribution is better than nothing at all.
1. Update news in 
{{{https://svn.apache.org/repos/asf/nutch/site/forrest/src/documentation/content/xdocs/index.xml}}}
 and for the main nutch.apache.org site stored 
[[https://svn.apache.org/repos/asf/nutch/site/|here]]. There is documentation 
on how to edit, manage and build the site documentation 
[[http://wiki.apache.org/nutch/Website_Update_HOWTO|here]] 


[jira] [Resolved] (NUTCH-1432) property storage.schema does not work anymore, should be storage.schema.webpage and storage.schema.host

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1432.
-

Resolution: Fixed

Committed @revision 1387347 in 2.x branch

> property storage.schema does not work anymore, should be 
> storage.schema.webpage and storage.schema.host
> ---
>
> Key: NUTCH-1432
> URL: https://issues.apache.org/jira/browse/NUTCH-1432
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora
>Reporter: Ferdy Galema
> Fix For: 2.1
>
> Attachments: NUTCH-1432.patch
>
>
> Since the addition of the host table, the property storage.schema in 
> nutch-default.xml does not work anymore. It should be storage.schema.webpage 
> and storage.schema.host. Thanks Tianwei Sheng for reporting.



[jira] [Updated] (NUTCH-1469) Upgrade commons-net dependency

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1469:


Fix Version/s: (was: 2.1)
   2.2

> Upgrade commons-net dependency
> --
>
> Key: NUTCH-1469
> URL: https://issues.apache.org/jira/browse/NUTCH-1469
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: nutchgora, 1.5.1
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.2
>
>
> Currently we are using the commons-net-1.2.0-dev artefact.
> The most recent version on maven central is 3.1 [0]
> [0] http://search.maven.org/#artifactdetails|commons-net|commons-net|3.1|jar



[jira] [Updated] (NUTCH-1301) Index job resume switch to resume a failed job

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1301:


Fix Version/s: (was: 2.1)
   2.2

> Index job resume switch to resume a failed job
> --
>
> Key: NUTCH-1301
> URL: https://issues.apache.org/jira/browse/NUTCH-1301
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Dan Rosher
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1301.patch, NUTCH-1301-v2.patch
>
>
> This is also useful in nutchgora to allow for continuous indexing with -all 
> -resume, as it is for fetching; cron scripts can then be independent, without 
> having to know the batchid.



[jira] [Updated] (NUTCH-1294) IndexClean job with solr implementation.

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1294:


Fix Version/s: (was: 2.1)
   2.2

> IndexClean job with solr implementation.
> 
>
> Key: NUTCH-1294
> URL: https://issues.apache.org/jira/browse/NUTCH-1294
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora
>Reporter: Dan Rosher
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1294.patch, NUTCH-1294-v2.patch
>
>
> I started by copying and altering the trunk version of SolrClean, though it 
> was inadequate for our needs. We needed to mark particular pages as gone even 
> though they might still be visible on the web. This implementation abstracts 
> the index cleaning process, has a Solr implementation, and adds a clean-index 
> plugin extension that allows others to tailor how pages are removed from 
> their store.



[jira] [Updated] (NUTCH-978) A Plugin for extracting certain element of a web page on html page parsing.

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-978:
---

Fix Version/s: (was: 2.1)
   2.2

> A Plugin for extracting certain element of a web page on html page parsing.
> ---
>
> Key: NUTCH-978
> URL: https://issues.apache.org/jira/browse/NUTCH-978
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.2
> Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>Reporter: Ammar Shadiq
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: gsoc2012, mentor
> Fix For: 2.2
>
> Attachments: app_guardian_ivory_coast_news_exmpl.png, 
> app_screenshoot_configuration_result_anchor.png, 
> app_screenshoot_configuration_result.png, app_screenshoot_source_view.png, 
> app_screenshoot_url_regex_filter.png, for_GSoc.zip, 
> [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
> version_alpha2.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>
> Nutch uses the parse-html plugin to parse web pages. It processes the contents 
> of a web page by removing HTML tags and components such as JavaScript and CSS, 
> leaving the extracted text to be stored in the index. By default, Nutch 
> cannot select specific elements of an HTML page, such as certain tags, 
> certain content, or some part of the page.
> An HTML page has a tree-like XML structure, with HTML tags as its branches and 
> text as its nodes. These branches and nodes can be extracted using XPath. 
> XPath allows us to select a particular branch or node of an XML document, and 
> can therefore be used to extract specific information and treat it 
> differently based on its content and the user's requirements. Furthermore, a 
> web domain such as a news website usually uses the same HTML structure for 
> storing information across its pages. Pages with the same structure can be 
> parsed using the same XPath queries to retrieve the same content elements. 
> All of the XPath queries for selecting various content can be stored in an 
> XPath configuration file.
> Nutch is meant to handle many different web sources, and not all pages 
> retrieved from those sources share the same HTML structure, so they have to 
> be treated differently using the correct XPath configuration. The correct 
> XPath configuration can be selected automatically by matching the URL of the 
> web page against a regex of valid URL patterns for that configuration.
> This automatic mechanism allows Nutch users to process various web pages and 
> keep only the information they want, making the index more accurate and its 
> content more flexible.
> The components for this idea have been tested on Nutch 1.2 for selecting 
> certain elements on various news websites for the purpose of document 
> clustering. This includes a Configuration Editor application built using the 
> NetBeans 6.9 application framework, though it needs some debugging.
> http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
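
The mechanism described in NUTCH-978 can be sketched roughly as follows. This is a minimal illustration, not the attached plugin: the class name, the URL pattern, the XPath expression and the sample page are all hypothetical, and the sample is well-formed XHTML so that the JDK's own XML parser can read it. Real-world HTML would first need an HTML-tolerant parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathConfigSketch {
    // Hypothetical configuration: URL regex -> XPath expression for the wanted element.
    static final Map<Pattern, String> CONFIG = new LinkedHashMap<>();
    static {
        CONFIG.put(Pattern.compile("^http://news\\.example\\.com/.*"), "//div[@id='headline']");
    }

    // Returns the text selected by the first configuration whose URL pattern
    // matches the page URL, or null when no configuration applies.
    static String extract(String url, String xhtml) {
        for (Map.Entry<Pattern, String> entry : CONFIG.entrySet()) {
            if (!entry.getKey().matcher(url).matches()) {
                continue;
            }
            try {
                Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .parse(new InputSource(new StringReader(xhtml)));
                XPath xpath = XPathFactory.newInstance().newXPath();
                return xpath.evaluate(entry.getValue(), doc).trim();
            } catch (Exception e) {
                throw new IllegalStateException("could not evaluate XPath for " + url, e);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"headline\">Ivory Coast news</div>"
                + "<div id=\"ads\">ignored</div></body></html>";
        System.out.println(extract("http://news.example.com/story/1", page));
    }
}
```

Keeping the URL pattern and the XPath expression together in one map entry mirrors the idea of an XPath configuration file selected by URL regex.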



[jira] [Updated] (NUTCH-1285) Debian Packaging for Nutch

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1285:


Fix Version/s: (was: 2.1)
   2.2

> Debian Packaging for Nutch
> --
>
> Key: NUTCH-1285
> URL: https://issues.apache.org/jira/browse/NUTCH-1285
> Project: Nutch
>  Issue Type: New Feature
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.2
>
>
> This is a utopian type issue which will not be addressed for some time due to 
> many factors, outwith our control which exist within the Debian policy 
> ecosystem. 
> I've been in touch with Ioan over @ Apache James and they have recently 
> (after a number of years) made some real progress with this. Some links are 
> below
> [0] http://svn.apache.org/repos/asf/james/app
> [1] http://svn.apache.org/viewvc/james/app/trunk/pom.xml?view=markup
> [2] https://issues.apache.org/jira/browse/JAMES-1343
> [3] http://www.mail-archive.com/server-dev@james.apache.org/
> [4] http://www.debian.org/doc/debian-policy/
> [5] http://www.debian.org/doc/manuals/maint-guide/index.en.html



[jira] [Updated] (NUTCH-1397) language-identifier incorrectly handles double-barreled language properties

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1397:


Fix Version/s: (was: 2.1)
   2.2

> language-identifier incorrectly handles double-barreled language properties
> ---
>
> Key: NUTCH-1397
> URL: https://issues.apache.org/jira/browse/NUTCH-1397
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.2
>
>
> Currently, when language-identifier is activated, it parses and identifies 
> language-type=en, but does not identify en-GB or en-US. This issue 
> should correct that. 
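
One possible way to treat double-barreled tags, sketched under the assumption that matching on the primary subtag is acceptable: reduce "en-GB" or "en_US" to "en" before comparison. The class and method names are hypothetical and are not taken from the language-identifier plugin.

```java
import java.util.Locale;

public class LanguageTagSketch {
    // Reduces a tag such as "en-GB" or "en_US" to its primary subtag ("en").
    // Returns null when no primary subtag can be found.
    static String primaryLanguage(String tag) {
        if (tag == null) {
            return null;
        }
        String trimmed = tag.trim();
        int cut = trimmed.length();
        for (int i = 0; i < trimmed.length(); i++) {
            char c = trimmed.charAt(i);
            if (c == '-' || c == '_') {
                cut = i; // stop at the first region/script separator
                break;
            }
        }
        String primary = trimmed.substring(0, cut);
        return primary.isEmpty() ? null : primary.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(primaryLanguage("en-GB"));
    }
}
```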



[jira] [Updated] (NUTCH-1389) parsechecker and indexchecker to report truncated content

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1389:


Fix Version/s: (was: 2.1)
   2.2

> parsechecker and indexchecker to report truncated content
> -
>
> Key: NUTCH-1389
> URL: https://issues.apache.org/jira/browse/NUTCH-1389
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, parser
>Affects Versions: nutchgora, 1.5
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.6, 2.2
>
>
> ParserChecker and IndexingFiltersChecker should report when a document is 
> truncated due to {http,file,ftp}.content.limit.
> Truncated content may cause text and metadata extraction to fail for PDF and 
> other binary document formats.
> A hint that truncation (and not a broken plugin) is the possible reason would 
> be useful.
> See NUTCH-965 and {{ParseSegment.isTruncated(content)}}.
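
A rough sketch of the kind of check such a hint could rest on: compare the length declared by the server with the number of bytes actually stored. This is an assumption for illustration only and is not the implementation of ParseSegment.isTruncated; the class and method names are hypothetical. A declared length may also legitimately differ from the stored size, for example with compressed transfers, so the result is only a hint.

```java
public class TruncationSketch {
    // Returns true when the declared Content-Length is larger than the number
    // of bytes that were stored, which suggests a content limit cut the
    // document short. Missing or unparsable headers yield false.
    static boolean looksTruncated(String contentLengthHeader, int storedBytes) {
        if (contentLengthHeader == null) {
            return false; // nothing to compare against
        }
        try {
            long declared = Long.parseLong(contentLengthHeader.trim());
            return declared > storedBytes;
        } catch (NumberFormatException e) {
            return false; // malformed header, no conclusion possible
        }
    }

    public static void main(String[] args) {
        if (looksTruncated("1048576", 65536)) {
            System.out.println("content may be truncated by a content limit");
        }
    }
}
```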



[jira] [Updated] (NUTCH-710) Support for rel="canonical" attribute

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-710:
---

Fix Version/s: (was: 2.1)
   2.2

> Support for rel="canonical" attribute
> -
>
> Key: NUTCH-710
> URL: https://issues.apache.org/jira/browse/NUTCH-710
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.1
>Reporter: Frank McCown
>Priority: Minor
> Fix For: 1.6, 2.2
>
>
> There is a new rel="canonical" attribute which is
> now being supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of 
> URLs crawled and indexed and reduce duplicate page content.



[jira] [Updated] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1249:


Fix Version/s: (was: 2.1)
   2.2

> Resolve all issues flagged up by adding javac -Xlint arguement
> --
>
> Key: NUTCH-1249
> URL: https://issues.apache.org/jira/browse/NUTCH-1249
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.2
>
>
> There are a heap of issues flagged up by NUTCH-1237, I think over time it 
> would be great to get these addressed and resolved.
> What is interesting is that adding the same arguments to 
> /src/plugin/plugin-build.xml actually breaks my build as tests begin to fail.
> Some of this stuff is documented in the link below
> http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options



[jira] [Updated] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1360:


Fix Version/s: (was: 2.1)
   2.2

> Suport the storing of IP address connected to when web crawling
> ---
>
> Key: NUTCH-1360
> URL: https://issues.apache.org/jira/browse/NUTCH-1360
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1360-nutchgora.patch, 
> NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which 
> we connect to when fetching a page.



[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1370:


Fix Version/s: (was: 2.1)
   2.2

> Expose exact number of urls injected @runtime 
> --
>
> Key: NUTCH-1370
> URL: https://issues.apache.org/jira/browse/NUTCH-1370
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.2
>
>
> Example: When using trunk, currently we see 
> {code}
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
> 2012-05-22 09:04:00
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
> crawl/crawldb
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
> urls to crawl db entries.
> 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
> {code}
> I would like to see
> {code}
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
> 2012-05-22 09:04:00
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
> crawl/crawldb
> 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Injected N urls to 
> crawl/crawldb
> 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
> urls to crawl db entries.
> 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
> {code}
> This would make debugging easier and would help those who end up getting 
> {code}
> 2012-05-22 09:04:04,850 WARN  crawl.Generator - Generator: 0 records selected 
> for fetching, exiting ...
> {code}



[jira] [Resolved] (NUTCH-1441) AnchorIndexingFilter should use plain HashSet

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1441.
-

Resolution: Fixed

Committed @revision 1387341 in trunk
Thank you Ferdy

> AnchorIndexingFilter should use plain HashSet
> -
>
> Key: NUTCH-1441
> URL: https://issues.apache.org/jira/browse/NUTCH-1441
> Project: Nutch
>  Issue Type: Bug
>Reporter: Ferdy Galema
>Priority: Minor
> Fix For: 1.6, 2.1
>
> Attachments: NUTCH-1441.patch, NUTCH-1441-trunk.patch
>
>
> AnchorIndexingFilter should use a plain HashSet instead of WeakHashMap. 
> WeakHashMap is unnecessary and can perhaps even cause bugs. (A WeakHashMap 
> gets its entries removed when the GC notices the keys are no longer in use 
> elsewhere.)
> This patch also makes the filter a bit faster by lazily instantiating the set. 
> (No need to create one every time when deduplication is off.)
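
The two points in NUTCH-1441, a plain HashSet and lazy instantiation, can be illustrated with a small sketch. This is not the committed patch; the class and method names are hypothetical and the anchors are plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnchorDedupSketch {
    // Returns the anchors to index. The set used for deduplication is a plain
    // HashSet and is only created once it is actually needed.
    static List<String> filterAnchors(List<String> anchors, boolean deduplicate) {
        List<String> kept = new ArrayList<>();
        Set<String> seen = null; // lazily instantiated
        for (String anchor : anchors) {
            if (deduplicate) {
                if (seen == null) {
                    seen = new HashSet<>();
                }
                if (!seen.add(anchor)) {
                    continue; // already indexed this anchor text
                }
            }
            kept.add(anchor);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filterAnchors(List.of("home", "news", "home"), true));
    }
}
```

Because the set holds strong references, an entry cannot disappear between two lookups, which is the risk the issue attributes to WeakHashMap.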



[jira] [Updated] (NUTCH-1359) Add raw_headers support

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1359:


Fix Version/s: (was: 2.1)
   2.2

> Add raw_headers support
> ---
>
> Key: NUTCH-1359
> URL: https://issues.apache.org/jira/browse/NUTCH-1359
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.2
>
>
> This should enable us to capture raw headers, however as it may not be 
> required within every type of job, or by every type of user, it should be 
> made configurable.



[jira] [Updated] (NUTCH-1369) Improve ParserChecker in Nutchgora

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1369:


Fix Version/s: (was: 2.1)
   2.2

> Improve ParserChecker in Nutchgora
> --
>
> Key: NUTCH-1369
> URL: https://issues.apache.org/jira/browse/NUTCH-1369
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1369.patch
>
>
> This issue should bring the ParserChecker implementation in Nutchgora into 
> line with trunk. WIP patch coming up.



[jira] [Updated] (NUTCH-1087) Deprecate crawl command and replace with example script

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1087:


Fix Version/s: (was: 2.1)
   2.2

> Deprecate crawl command and replace with example script
> ---
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Julien Nioche
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, 
> NUTCH-1087-2.1.patch, NUTCH-1087.patch
>
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html



[jira] [Updated] (NUTCH-1451) Upgrade automaton jar to 1.11-8

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1451:


Fix Version/s: (was: 2.1)
   2.2

> Upgrade automaton jar to 1.11-8
> ---
>
> Key: NUTCH-1451
> URL: https://issues.apache.org/jira/browse/NUTCH-1451
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6, 2.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.6, 2.2
>
>
> The latest version 1.11-8 was released September 7, 2011.
> This library is significantly faster than the default regex parsing. I 
> haven't got a clue what version we currently use but the license states 2005 
> so I'm guessing its been a long time since it was upgraded.
> I'll get a patch together and for completeness run independent test to 
> compare results pre and post upgrade. It would be nice to see > marginal 
> improvements :0)  



[jira] [Updated] (NUTCH-1454) parsing chm failed

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1454:


Fix Version/s: (was: 2.1)
   2.2

> parsing chm failed
> --
>
> Key: NUTCH-1454
> URL: https://issues.apache.org/jira/browse/NUTCH-1454
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5.1
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.6, 2.2
>
>
> (reported by Jan Riewe, see 
> http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html)
> Nutch fails to parse chm files with
> {quote}
>  ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type 
> application/vnd.ms-htmlhelp
> {quote}
> Tested with chm test files from Tika:
> {code}
>  % bin/nutch parsechecker 
> file:/.../tika/trunk/tika-parsers/src/test/resources/test-documents/testChm.chm
> {code}
> Tika parses this document (but does not extract any content).



[jira] [Updated] (NUTCH-849) different versions of the same library in nutch-2.0-dev.job and local\lib directory

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-849:
---

Fix Version/s: (was: 2.1)
   2.2

> different versions of the same library in nutch-2.0-dev.job and local\lib 
> directory 
> 
>
> Key: NUTCH-849
> URL: https://issues.apache.org/jira/browse/NUTCH-849
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4, nutchgora
> Environment: Window XP SP3, Cygwin
>Reporter: Pham Tuan Minh
>Priority: Minor
> Fix For: 1.6, 2.2
>
>
> Hi,
> I found that after building the runtime, nutch-2.0-dev.job and the local\lib 
> directory contain different versions of the same library:
> ant-1.7.1.jar
> ant-1.6.5.jar
> servlet-api-2.5-20081211.jar
> servlet-api-2.5-6.1.14.jar
> I suspect these libraries come from different dependency branches. Can anyone 
> help me to fix it?
> Thanks,
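
A small sketch for spotting the situation reported in NUTCH-849, assuming jar files follow the common artifact-version.jar naming convention with a version that starts with a digit. The class name is hypothetical; the file names in the usage example are the ones listed in the report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateJarSketch {
    // Assumed naming convention: <artifact>-<version>.jar, version starting with a digit.
    private static final Pattern JAR = Pattern.compile("^(.+?)-(\\d.*)\\.jar$");

    // Maps each artifact that occurs with more than one version to its versions.
    static Map<String, List<String>> duplicates(List<String> jarNames) {
        Map<String, List<String>> versions = new TreeMap<>();
        for (String jar : jarNames) {
            Matcher m = JAR.matcher(jar);
            if (m.matches()) {
                versions.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2));
            }
        }
        versions.values().removeIf(v -> v.size() < 2); // keep only real duplicates
        return versions;
    }

    public static void main(String[] args) {
        System.out.println(duplicates(Arrays.asList(
                "ant-1.7.1.jar", "ant-1.6.5.jar",
                "servlet-api-2.5-20081211.jar", "servlet-api-2.5-6.1.14.jar")));
    }
}
```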



[jira] [Updated] (NUTCH-979) Add support for deleting Solr documents with ProtocolStatusCodes.NOTFOUND

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-979:
---

Fix Version/s: (was: 2.1)
   2.2

> Add support for deleting Solr documents with ProtocolStatusCodes.NOTFOUND
> -
>
> Key: NUTCH-979
> URL: https://issues.apache.org/jira/browse/NUTCH-979
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
>Priority: Minor
> Fix For: 2.2
>
> Attachments: SolrClean.java
>
>
> When issuing recrawls it can happen that certain urls have expired (i.e. URLs 
> that don't exist anymore and return 404).
> This issue creates a new command in the indexer that scans for WebPages with 
> ProtocolStatusCodes.NOTFOUND and issues delete commands to Solr.



[jira] [Updated] (NUTCH-944) Increase the number of elements to look for URLs and add the ability to specify multiple attributes by elements

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-944:
---

Fix Version/s: (was: 2.1)
   2.2

> Increase the number of elements to look for URLs and add the ability to 
> specify multiple attributes by elements
> ---
>
> Key: NUTCH-944
> URL: https://issues.apache.org/jira/browse/NUTCH-944
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
> Environment: GNU/Linux Fedora 12
>Reporter: Jean-Francois Gingras
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: DOMContentUtils.java.path-1.0, 
> DOMContentUtils.java.path-1.3
>
>
> Here is a patch for DOMContentUtils.java that increases the number of 
> elements to look for URLs. It also adds the ability to specify multiple 
> attributes per element, for example:
> linkParams.put("frame", new LinkParams("frame", "longdesc,src", 0));
> linkParams.put("object", new LinkParams("object", 
> "classid,codebase,data,usemap", 0));
> linkParams.put("video", new LinkParams("video", "poster,src", 0)); // HTML 5
> I have a patch for release-1.0 and branch-1.3
> I would love to hear your comments about this.
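
How a comma-separated attribute list such as "longdesc,src" might be consumed can be sketched as follows. This is a standalone illustration with hypothetical names, not the code of the attached patch or of Nutch's LinkParams class; an element's attributes are modelled as a simple map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiAttrLinkSketch {
    // attrNames is a comma-separated list such as "longdesc,src"; attrs holds
    // the attributes actually present on one element. Every listed attribute
    // that is present contributes one link target, in list order.
    static List<String> linkTargets(String attrNames, Map<String, String> attrs) {
        List<String> targets = new ArrayList<>();
        for (String name : attrNames.split(",")) {
            String value = attrs.get(name.trim());
            if (value != null && !value.isEmpty()) {
                targets.add(value);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, String> frame = new HashMap<>();
        frame.put("src", "menu.html");
        frame.put("longdesc", "menu-description.html");
        System.out.println(linkTargets("longdesc,src", frame));
    }
}
```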



[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-797:
---

Fix Version/s: (was: 2.1)
   2.2

> parse-tika is not properly constructing URLs when the target begins with a "?"
> --
>
> Key: NUTCH-797
> URL: https://issues.apache.org/jira/browse/NUTCH-797
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.1, nutchgora
> Environment: Win 7, Java(TM) SE Runtime Environment (build 
> 1.6.0_16-b01)
> Also repro's on RHEL and java 1.4.2
>Reporter: Robert Hohman
>Assignee: Andrzej Bialecki 
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch
>
>
> This is my first bug and patch on nutch, so apologies if I have not provided 
> enough detail.
> In crawling the page at 
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
> links in the page that look like this:
> 2 <a href="?co=0&sk=0&p=3&pi=1">3</a>
> in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
> getOutlinks looks for links, it comes across this link, and constructs a new 
> URL with a base URL class built from 
> "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
> target of "?co=0&sk=0&p=2&pi=1"
> The URL class, per RFC 3986 at 
> http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
> how to merge these two, and per the RFC, the URL class merges these to: 
> http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
> because the RFC explicitly states that the rightmost url segment (the 
> Search.aspx in this case) should be ripped off before combining.
> While this is compliant with the RFC, it means the URLs which are created for 
> the next round of fetching are incorrect.  Modern browsers seem to handle 
> this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
> exception or handling of what is a poorly formed url on accenture's part.
> I have fixed this by modifying DOMContentUtils to look for the case where a ? 
> begins the target, and then pulling the rightmost component out of the base 
> and inserting it into the target before the ?, so the target in this example 
> becomes:
> Search.aspx?co=0&sk=0&p=2&pi=1
> The URL class then properly constructs the new url as:
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
> If it is agreed that this solution works, I believe the other html parsers in 
> nutch would need to be modified in a similar way.
> Can I get feedback on this proposed solution?  Specifically I'm worried about 
> unforeseen side effects.
> Much thanks
> Here is the patch info:
> Index: 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
> ===
> --- 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
>(revision 916362)
> +++ 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
>(working copy)
> @@ -299,6 +299,50 @@
>  return false;
>}
>
> +  private URL fixURL(URL base, String target) throws MalformedURLException
> +  {
> +   // handle params that are embedded into the base url - move them to 
> target
> +   // so URL class constructs the new url class properly
> +   if  (base.toString().indexOf(';') > 0)  
> +  return fixEmbeddedParams(base, target);
> +   
> +   // handle the case that there is a target that is a pure query.
> +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on 
> how to assemble
> +   // URLs but I've seen this in numerous places, for example at
> +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
> +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by 
> default
> +   // URL constructs the base+target combo as 
> +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, 
> incorrectly
> +   // dropping the Search.aspx target
> +   //
> +   // Browsers handle these just fine, they must have an exception 
> similar to this
> +   if (target.startsWith("?"))
> +   {
> +   return fixPureQueryTargets(base, target);
> +   }
> +   
> +   return new URL(base, target);
> +  }
> +  
> +  private URL fixPureQueryTargets(URL base, String target) throws 
> MalformedURLException
> +  {
> + if (!target.startsWith("?"))
> + return new URL(base, target);
> +
> + String basePath = base.getPath();
> + String baseRightMost="";
> + int baseRightMostIdx = basePath.lastIndexOf("/");

[jira] [Updated] (NUTCH-1357) All gora mapreduce functionality should go through StorageUtils

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1357:


Fix Version/s: (was: 2.1)
   2.2

> All gora mapreduce functionality should go through StorageUtils
> ---
>
> Key: NUTCH-1357
> URL: https://issues.apache.org/jira/browse/NUTCH-1357
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora
>Reporter: Ferdy Galema
> Fix For: 2.2
>
>
> I am trying to make the concept of crawlId work for ALL nutch jobs: it seems 
> the biggest problem with it not working as expected is because of the various 
> ways gora mapreduce is used in nutch.
> Some jobs use StorageUtils, some use GoraMapper/GoraReduce, some even use 
> directly GoraInputFormat/GoraOutputFormat. But the only place the translation 
> is made from crawlId into a schema name is in StorageUtils! Currently I am 
> converting all calls to Gora* mapreduce initializing code to StorageUtils 
> calls.



[jira] [Updated] (NUTCH-864) Fetcher generates entries with status 0

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-864:
---

Fix Version/s: (was: 2.1)
   2.2

> Fetcher generates entries with status 0
> ---
>
> Key: NUTCH-864
> URL: https://issues.apache.org/jira/browse/NUTCH-864
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: nutchgora
> Environment: Gora with SQLBackend
> URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
> Last Changed Rev: 980748
> Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
>Reporter: Julien Nioche
>Assignee: Doğacan Güney
> Fix For: 2.2
>
>
> After a round of fetching which got the following protocol status :
> 10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2
> 10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177
> 10/07/30 15:11:39 INFO mapred.JobClient: GONE=3
> 10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138
> 10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93
> 10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521
> 10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62
> I ran : ./nutch org.apache.nutch.crawl.WebTableReader -stats
> 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: 
> 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls:  2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: min score:   0.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score:   0.7587361
> 10/07/30 15:12:37 INFO crawl.WebTableReader: max score:   1.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched):   
> 1177 (SUCCESS=1177)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone):  112 
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry):
> 93 (EXCEPTION=93)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp):
> 138  (TEMP_MOVED=138)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm):
> 521 (MOVED=521)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
> There should not be any entries with status 0 (null)
> I will investigate a bit more...



[jira] [Updated] (NUTCH-1433) Upgrade to Tika 1.2

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1433:


Fix Version/s: (was: 2.1)
   2.2

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1433.branch-2.patch, NUTCH-1433-trunk-2.patch, 
> NUTCH-1433-trunk.patch
>
>




[jira] [Updated] (NUTCH-841) Nutch 2.0 webapp

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-841:
---

Fix Version/s: (was: 2.1)
   2.2

> Nutch 2.0 webapp
> 
>
> Key: NUTCH-841
> URL: https://issues.apache.org/jira/browse/NUTCH-841
> Project: Nutch
>  Issue Type: Improvement
>  Components: web gui
>Affects Versions: nutchgora
> Environment: Nutch 2.0
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 2.2
>
>
> In light of the conversation on NUTCH-837, we are removing the old Nutch 
> webapp and will replace it with a 2.0 one that works with GORA + Solr. 



[jira] [Updated] (NUTCH-1432) property storage.schema does not work anymore, should be storage.schema.webpage and storage.schema.host

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1432:


Attachment: NUTCH-1432.patch

Trivial patch for this issue.
If someone can please check it, I will commit. 

> property storage.schema does not work anymore, should be 
> storage.schema.webpage and storage.schema.host
> ---
>
> Key: NUTCH-1432
> URL: https://issues.apache.org/jira/browse/NUTCH-1432
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora
>Reporter: Ferdy Galema
> Fix For: 2.1
>
> Attachments: NUTCH-1432.patch
>
>
> Since the addition of the host table, the property storage.schema in 
> nutch-default.xml does not work anymore. It should be storage.schema.webpage 
> and storage.schema.host. Thanks Tianwei Sheng for reporting.



[jira] [Resolved] (NUTCH-1283) Radically update all Solr configuration in Nutchgora

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1283.
-

Resolution: Not A Problem

This was originally opened due to my misunderstanding of how schema revisions 
work in Solr. Our Solr schema versions are up-to-date as explained by Markus, so 
I'm closing this one off. 

> Radically update all Solr configuration in Nutchgora
> 
>
> Key: NUTCH-1283
> URL: https://issues.apache.org/jira/browse/NUTCH-1283
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
> Fix For: 2.1
>
>
> We're currently running with a Schema which states it's 1.4 :0| There should 
> be better support for the newer stuff going on over in Solrland. This issue 
> should track those improvements entirely.



[jira] [Updated] (NUTCH-1104) Port issues from trunk NutchGora branch

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1104:


Fix Version/s: (was: 2.1)
   2.2

> Port issues from trunk NutchGora branch
> ---
>
> Key: NUTCH-1104
> URL: https://issues.apache.org/jira/browse/NUTCH-1104
> Project: Nutch
>  Issue Type: Task
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: 2.2
>
>
> Umbrella issue for tracking issues that should be ported from 1.x trunk to 
> the NutchGora branch. Please mark ported issues by modifying this description.
> NOT YET PORTED:
> * NUTCH-809 Parse-metatags plugin
> * NUTCH-987 Support HTTP auth for Solr communication
> * NUTCH-1028 Log parser keys
> * NUTCH-1036 Solr jobs should increment counters in Reporter
> * NUTCH-1057 Make fetcher thread time out configurable
> * NUTCH-1067 Configure minimum throughput for fetcher
> * NUTCH-1101 Options to purge db_gone records in updatedb
> * NUTCH-1102 Fetcher, rely on fetcher.parse directive only
> * NUTCH-1105 MaxContentLength option for index-basic
> * NUTCH-940 Static field plugin
> * NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
> * NUTCH-1207 ParserChecker to output signature
> * NUTCH-1090 InvertLinks should inform when ignoring internal links
> * NUTCH-1174 Outlinks are not properly normalized
> * NUTCH-1203 ParseSegment to show number of milliseconds per parse
> * NUTCH-1173 DomainStats doesn't count db_not_modified
> * NUTCH-1155 Host/domain limit in generator is generate.max.count+1
> * NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
> * NUTCH-1142 Normalization and filtering in WebGraph
> * NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS 
> file
> * NUTCH-1195 Add Solr 4x (trunk) example schema
> * NUTCH-1141 Configurable Fetcher queue depth
> * NUTCH-1214 DomainStats tool should be named for what it's doing
> * NUTCH-1213 Pass additional SolrParams when indexing to Solr
> * NUTCH-1211 URLFilterChecker command line help doesn't inform user of STDIN 
> requirements
> * NUTCH-1231 Upgrade to Tika 1.0
> * NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0
> * NUTCH-1235 Upgrade to new Hadoop 0.20.205.0
> * NUTCH-1184 Fetcher to parse and follow Nth degree outlinks
> PORTED:
> * No issues yet
> NOT GOING TO BE PORTED:
> * No issues, explain why it should not be ported



[jira] [Updated] (NUTCH-1402) Create AbstractScoringFilter

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1402:


Fix Version/s: (was: 2.1)
   2.2

> Create AbstractScoringFilter 
> -
>
> Key: NUTCH-1402
> URL: https://issues.apache.org/jira/browse/NUTCH-1402
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora, 1.5
>Reporter: Julien Nioche
> Fix For: 1.6, 2.2
>
>
> Most scoring filters don't need to implement all the methods defined by the 
> interface. Having an AbstractScoringFilter would make it easier to implement 
> a new scoring filter or understand existing ones.
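
To illustrate the idea in the description above, here is a minimal sketch of an abstract base class with neutral defaults. The interface shown (SimpleScoringFilter) and all method names are hypothetical, simplified stand-ins chosen for this example; they are not the actual Nutch ScoringFilter API, which has more methods and different signatures.

```java
// Hypothetical, simplified scoring interface; NOT the real Nutch API.
interface SimpleScoringFilter {
    float initialScore(String url);
    float generatorSortValue(String url, float initSort);
    float indexerScore(String url, float initScore);
}

// Abstract base with neutral defaults, so subclasses override only what they need.
abstract class AbstractSimpleScoringFilter implements SimpleScoringFilter {
    @Override public float initialScore(String url) { return 1.0f; }
    @Override public float generatorSortValue(String url, float initSort) { return initSort; }
    @Override public float indexerScore(String url, float initScore) { return initScore; }
}

// Example subclass: only the indexer score is customised.
class ShortUrlBoostFilter extends AbstractSimpleScoringFilter {
    @Override public float indexerScore(String url, float initScore) {
        // Arbitrary rule for illustration: double the score of short URLs.
        return url.length() < 30 ? initScore * 2.0f : initScore;
    }
}

class ScoringFilterDemo {
    public static void main(String[] args) {
        SimpleScoringFilter filter = new ShortUrlBoostFilter();
        System.out.println(filter.initialScore("http://example.org/"));
        System.out.println(filter.indexerScore("http://example.org/", 1.5f));
    }
}
```

With a base class like this, a new filter only has to state what it changes, which also makes existing filters easier to read.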



[jira] [Updated] (NUTCH-1277) Fix [fallthrough] javac warnings

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1277:


Fix Version/s: (was: 2.1)
   2.2

> Fix [fallthrough] javac warnings
> 
>
> Key: NUTCH-1277
> URL: https://issues.apache.org/jira/browse/NUTCH-1277
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
> Fix For: 1.6, 2.2
>
>
> This usually occurs when one or more switch statements fall through (that is, 
> one or more break statements are missing).
> We need to determine where a simple
> {code}
> @SuppressWarnings("fallthrough")
> {code}
> is required or whether we need to include the break statements in switch 
> blocks
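
The two options named above can be sketched as follows. This is an illustration only, not code from the Nutch tree: the class and method names are made up for the example, and the status codes 2 and 3 are borrowed from the WebTableReader output quoted elsewhere in this archive. javac reports such cases when run with -Xlint:fallthrough.

```java
class FallthroughDemo {
    // Option 1: the fall-through is intentional (each level also enables the
    // lower ones), so the warning is suppressed instead of adding breaks.
    @SuppressWarnings("fallthrough")
    static int enabledFeatures(int level) {
        int count = 0;
        switch (level) {
            case 3: count++; // falls through on purpose
            case 2: count++; // falls through on purpose
            case 1: count++;
                break;
            default:
                break;
        }
        return count;
    }

    // Option 2: the fall-through would be a bug, so every case ends in a break.
    static String label(int status) {
        String result;
        switch (status) {
            case 2: result = "status_fetched"; break;
            case 3: result = "status_gone"; break;
            default: result = "unknown"; break;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(enabledFeatures(3));
        System.out.println(label(2));
    }
}
```

Suppression is best kept to the narrowest scope possible (a single method rather than a whole class), so that a genuinely missing break elsewhere still produces a warning.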



[jira] [Updated] (NUTCH-1393) Display consistent usage of GeneratorJob with 1.X

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1393:


Fix Version/s: (was: 2.1)
   2.2

> Display consistent usage of GeneratorJob with 1.X
> -
>
> Key: NUTCH-1393
> URL: https://issues.apache.org/jira/browse/NUTCH-1393
> Project: Nutch
>  Issue Type: Bug
>  Components: administration gui, generator
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
> Fix For: 2.2
>
>
> If we pass the generate argument to the nutch script, the Generator 
> auto-springs into action and begins generating fetchlists. This should not be 
> the case, instead it should print traditional usage to stdout. An example is 
> below
> {code}
> lewis@lewis:~/ASF/nutchgora/runtime/local$ ./bin/nutch generate
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: true
> GeneratorJob: done
> GeneratorJob: generated batch id: 1339628223-1694200031
> {code}
> All I wanted to do was get the usage params printed to stdout but instead it 
> generated my batch willy nilly.  
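
The behaviour asked for above could look roughly like this. It is a sketch under stated assumptions, not the GeneratorJob code: the class name, the usage text and the return codes are all hypothetical, and the real job's options are deliberately not reproduced.

```java
import java.util.Arrays;

class UsageCheckDemo {
    // Hypothetical usage text; the real GeneratorJob options are not reproduced here.
    static final String USAGE = "Usage: ExampleJob [-option value] ...";

    // Prints usage and returns -1 when no arguments are given;
    // otherwise pretends to run the job and returns 0.
    static int run(String[] args) {
        if (args.length == 0) {
            System.out.println(USAGE);
            return -1;
        }
        System.out.println("ExampleJob: starting with " + Arrays.toString(args));
        return 0;
    }

    public static void main(String[] args) {
        int exitCode = run(args);
        System.out.println("exit code: " + exitCode);
    }
}
```

Whether an empty argument list should always mean "print usage" is a design choice; a job whose options are all optional would need an explicit flag to run with defaults.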



[jira] [Updated] (NUTCH-1394) backport NUTCH-1232 Remove site field from index-basic

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1394:


Fix Version/s: (was: 2.1)
   2.2

> backport NUTCH-1232 Remove site field from index-basic
> --
>
> Key: NUTCH-1394
> URL: https://issues.apache.org/jira/browse/NUTCH-1394
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
> Fix For: 2.2
>
>
> This is a simple backport. The 2.0 Solr schema and mappings still contain the 
> field "site" which has been removed in 1.x (NUTCH-1232). Should be done also 
> in 2.0: it's easier to maintain only one Solr installation for all Nutch 
> versions.



[jira] [Updated] (NUTCH-1164) Write JUnit tests for protocol-http

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1164:


Fix Version/s: (was: 2.1)
   2.2

> Write JUnit tests for protocol-http
> ---
>
> Key: NUTCH-1164
> URL: https://issues.apache.org/jira/browse/NUTCH-1164
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.2
>
>
> This issue should provide a single JUnit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins.



[jira] [Updated] (NUTCH-1165) Write JUnit tests for protocol-sftp

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1165:


Fix Version/s: (was: 2.1)
   2.2

> Write JUnit tests for protocol-sftp
> ---
>
> Key: NUTCH-1165
> URL: https://issues.apache.org/jira/browse/NUTCH-1165
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.2
>
>
> This issue should provide a single JUnit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins.



[jira] [Updated] (NUTCH-1169) Write JUnit tests for urlfilter-prefix

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1169:


Fix Version/s: (was: 2.1)
   2.2

> Write JUnit tests for urlfilter-prefix
> --
>
> Key: NUTCH-1169
> URL: https://issues.apache.org/jira/browse/NUTCH-1169
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.2
>
>
> This issue should provide a single JUnit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins.



[jira] [Updated] (NUTCH-1168) Write JUnit tests for tld

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1168:


Fix Version/s: (was: 2.1)
   2.2

> Write JUnit tests for tld
> -
>
> Key: NUTCH-1168
> URL: https://issues.apache.org/jira/browse/NUTCH-1168
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.2
>
>
> This issue should provide a single JUnit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins.



[jira] [Updated] (NUTCH-1166) Write JUnit tests for scoring-link

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1166:


Fix Version/s: (was: 2.1)
   2.2

> Write JUnit tests for scoring-link
> --
>
> Key: NUTCH-1166
> URL: https://issues.apache.org/jira/browse/NUTCH-1166
> Project: Nutch
>  Issue Type: Sub-task
>  Components: linkdb
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.2
>
>
> This issue should provide a single JUnit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins.



[jira] [Updated] (NUTCH-1167) Write JUnit tests for scoring-opic

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1167:


Fix Version/s: (was: 2.1)
   2.2

> Write JUnit tests for scoring-opic
> --
>
> Key: NUTCH-1167
> URL: https://issues.apache.org/jira/browse/NUTCH-1167
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.2
>
>
> This issue should provide a single JUnit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins.



[jira] [Updated] (NUTCH-1158) Write JUnit tests for all nutchgora plugins

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1158:


Fix Version/s: (was: 2.1)
   2.2

> Write JUnit tests for all nutchgora plugins
> ---
>
> Key: NUTCH-1158
> URL: https://issues.apache.org/jira/browse/NUTCH-1158
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.2
>
>
> This issue should act as a parent issue to track the development and gradual 
> integration and addition of JUnit tests to accompany all nutchgora plugins. 



[jira] [Updated] (NUTCH-1170) Write JUnit tests for urlfilter-validator

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1170:


Fix Version/s: (was: 2.1)
   2.2

> Write JUnit tests for urlfilter-validator
> -
>
> Key: NUTCH-1170
> URL: https://issues.apache.org/jira/browse/NUTCH-1170
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.2
>
>
> This issue should provide a single JUnit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins.



[jira] [Updated] (NUTCH-842) AutoGenerate WebPage code

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-842:
---

Fix Version/s: (was: 2.1)
   2.2

> AutoGenerate WebPage code
> -
>
> Key: NUTCH-842
> URL: https://issues.apache.org/jira/browse/NUTCH-842
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora
>Reporter: Doğacan Güney
>Assignee: Doğacan Güney
> Fix For: 2.2
>
> Attachments: NUTCH-842.patch
>
>
> This issue will track the addition of an ant task that will automatically 
> generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from 
> src/gora/webpage.avsc.



[jira] [Updated] (NUTCH-1374) Workaround for license headers

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1374:


Fix Version/s: (was: 2.1)
   2.2

> Workaround for license headers
> --
>
> Key: NUTCH-1374
> URL: https://issues.apache.org/jira/browse/NUTCH-1374
> Project: Nutch
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
> Fix For: 1.6, 2.2
>
>
> Currently in both versions of Nutch we have two types of files which DO NOT 
> contain license headers; namely all package.html files and the test files 
> within the language detection plugin. On my initial tests, adding license 
> headers to the language test files breaks the tests so we need to find a 
> workaround (or the correct syntax) to add commented out license headers to 
> these files.



[jira] [Updated] (NUTCH-887) Delegate parsing of feeds to Tika

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-887:
---

Fix Version/s: (was: 2.1)
   2.2

> Delegate parsing of feeds to Tika
> -
>
> Key: NUTCH-887
> URL: https://issues.apache.org/jira/browse/NUTCH-887
> Project: Nutch
>  Issue Type: Wish
>  Components: parser
>Affects Versions: nutchgora
>Reporter: Julien Nioche
> Fix For: 2.2
>
>
> [Starting a new thread from https://issues.apache.org/jira/browse/NUTCH-874]
> One of the plugins which hasn't been ported yet is the feed parser. We could 
> rely on the one we recently added to Tika, knowing that there is a 
> substantial difference in the sense that the Tika feed parser generates a 
> simple XHTML representation of the document where the feeds are simply 
> represented as anchors whereas the Nutch version created new documents for 
> each feed.
> There is also the parse-rss plugin in Nutch which is quite similar - what's 
> the difference with the feed one again? Since the Tika parser would handle 
> all sorts of feed formats why not simply rely on it? 
> Any thoughts on this?



[jira] [Updated] (NUTCH-1038) Port IndexingFiltersChecker to 2.0

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1038:


Fix Version/s: (was: 2.1)
   2.2

> Port IndexingFiltersChecker to 2.0
> --
>
> Key: NUTCH-1038
> URL: https://issues.apache.org/jira/browse/NUTCH-1038
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: 2.2
>
>




[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1465:


Fix Version/s: (was: 2.1)
   2.2

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
> Fix For: 1.6, 2.2
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



[jira] [Updated] (NUTCH-1163) Write JUnit tests for protocol-ftp

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1163:


Fix Version/s: (was: 2.1)
   2.2

> Write JUnit tests for protocol-ftp
> --
>
> Key: NUTCH-1163
> URL: https://issues.apache.org/jira/browse/NUTCH-1163
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: test
> Fix For: 2.2
>
>
> This issue should provide a single JUnit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins.



[jira] [Updated] (NUTCH-1390) readdb -url $url throws NPE with gora-cassandra

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1390:


Fix Version/s: (was: 2.1)
   2.2

> readdb -url $url throws NPE with gora-cassandra
> ---
>
> Key: NUTCH-1390
> URL: https://issues.apache.org/jira/browse/NUTCH-1390
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
> Fix For: 2.2
>
>
> After successfully injecting, generating, fetching (without parsing enabled), 
> parsing and running updatedb, then executing readdb with a particular -url 
> argument, I get a lovely NPE:
> {code}
> lewis@lewis:~/ASF/nutchgora/runtime/local$ ./bin/nutch readdb -url 
> http://www.trancearoundtheworld.com
> WebTableReader: java.lang.NullPointerException
>   at 
> org.apache.gora.cassandra.store.CassandraClient.getFamilyMap(CassandraClient.java:220)
>   at 
> org.apache.gora.cassandra.store.CassandraStore.execute(CassandraStore.java:108)
>   at org.apache.nutch.crawl.WebTableReader.read(WebTableReader.java:234)
>   at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:476)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
> {code} 



[jira] [Updated] (NUTCH-1453) Substantiate tests for IndexingFilters

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1453:


Fix Version/s: (was: 2.1)
   2.2

> Substantiate tests for IndexingFilters
> --
>
> Key: NUTCH-1453
> URL: https://issues.apache.org/jira/browse/NUTCH-1453
> Project: Nutch
>  Issue Type: Test
>  Components: indexer
>Affects Versions: nutchgora, 1.5.1
>Reporter: Lewis John McGibbney
> Fix For: 1.6, 2.2
>
>
> This issue is a follow-up to the issues discussed in NUTCH-1442, where it 
> was agreed that the current test in o.a.n.indexer.TestIndexingFilters is 
> simply not doing us any justice.
> There are some slight differences between trunk and 2.x but they both share 
> the common problem that there needs to be more thorough testing undertaken. 



[jira] [Updated] (NUTCH-970) Injector job crashes with MySQL with table collation set to utf8_general_ci

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-970:
---

Fix Version/s: (was: 2.1)
   2.2

> Injector job crashes with MySQL with table collation set to utf8_general_ci
> ---
>
> Key: NUTCH-970
> URL: https://issues.apache.org/jira/browse/NUTCH-970
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: 2.2
>
>
> Running the injector of trunk with an already existing database where the 
> default collation is utf8_* or ucs2_* the following GoraException is thrown:
> InjectorJob: starting
> InjectorJob: urlDir: urls
> InjectorJob: org.apache.gora.util.GoraException: java.io.IOException: 
> com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Column length too big 
> for column 'text' (max = 21845); use BLOB or TEXT instead
> at 
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
> at 
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
> at 
> org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:43)
> at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:227)
> at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
> at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:266)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:276)
> Caused by: java.io.IOException: 
> com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Column length too big 
> for column 'text' (max = 21845); use BLOB or TEXT instead
> at org.apache.gora.sql.store.SqlStore.createSchema(SqlStore.java:226)
> at org.apache.gora.sql.store.SqlStore.initialize(SqlStore.java:172)
> at 
> org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:81)
> at 
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:104)
> ... 7 more
> Caused by: com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Column length 
> too big for column 'text' (max = 21845); use BLOB or TEXT instead
> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:936)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2985)
> at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1631)
> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1723)
> at com.mysql.jdbc.Connection.execSQL(Connection.java:3283)
> at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1332)
> at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1604)
> at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1519)
> at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1504)
> at org.apache.gora.sql.store.SqlStore.createSchema(SqlStore.java:224)
> ... 10 more
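
The "max = 21845" in the error above is consistent with MySQL's 65,535-byte row-size limit divided by the maximum bytes per character of the column's character set: utf8 can need up to 3 bytes per character, ucs2 needs 2. The arithmetic is sketched below; the class and method names are made up for the example, and how Gora's SqlStore actually chooses the column length is not shown here.

```java
class VarcharLimitDemo {
    // MySQL's maximum row size in bytes, which bounds VARCHAR column lengths.
    static final int MAX_ROW_BYTES = 65535;

    // Largest VARCHAR length (in characters) for a charset that may need
    // up to maxBytesPerChar bytes per character (integer division).
    static int maxVarcharChars(int maxBytesPerChar) {
        return MAX_ROW_BYTES / maxBytesPerChar;
    }

    public static void main(String[] args) {
        System.out.println("latin1 (1 byte/char): " + maxVarcharChars(1));
        System.out.println("ucs2   (2 bytes/char): " + maxVarcharChars(2));
        System.out.println("utf8   (3 bytes/char): " + maxVarcharChars(3)); // 21845, as in the error
    }
}
```

A column length that is valid under a single-byte collation can therefore exceed the limit once the table default is utf8_* or ucs2_*, which matches the failure reported above.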



[jira] [Updated] (NUTCH-875) Port Webgraph to Nutch 2.0

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-875:
---

Fix Version/s: (was: 2.1)
   2.2

> Port Webgraph to Nutch 2.0
> --
>
> Key: NUTCH-875
> URL: https://issues.apache.org/jira/browse/NUTCH-875
> Project: Nutch
>  Issue Type: New Feature
>  Components: linkdb
>Affects Versions: nutchgora
>Reporter: Julien Nioche
> Fix For: 2.2
>
>
> The webgraph has not yet been ported to the GORA-based API.



[jira] [Updated] (NUTCH-879) URL-s getting lost

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-879:
---

Fix Version/s: (was: 2.1)
   2.2

> URL-s getting lost
> --
>
> Key: NUTCH-879
> URL: https://issues.apache.org/jira/browse/NUTCH-879
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora
> Environment: * Ubuntu 10.4 x64, Sun JDK 1.6
> * using 1-node Hadoop + HDFS
> * trunk r983472, using MySQL store
> * branch-1.3
>Reporter: Andrzej Bialecki 
> Fix For: 2.2
>
> Attachments: branch-1.3-bench.txt, trunk-bench.txt
>
>
> I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the 
> same Benchmark parameters and the same plugins, branch-1.3 collects ~1.5 
> million URLs, while trunk collects ~20,000 URLs. Clearly something is wrong.



[jira] [Updated] (NUTCH-1094) create comprehensive documentation for Nutchgora branch

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1094:


Fix Version/s: (was: 2.1)
   2.2

> create comprehensive documentation for Nutchgora branch
> ---
>
> Key: NUTCH-1094
> URL: https://issues.apache.org/jira/browse/NUTCH-1094
> Project: Nutch
>  Issue Type: Sub-task
>  Components: documentation
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
> Fix For: 2.2
>
>
> This should shadow the core documentation for Nutch 1.4 (branch) and 
> mainstream users; however, it should include fundamentals specific to Nutch 
> trunk. Until we release Nutch 2.0, this documentation should be stored in svn 
> under a /docs directory. 



[jira] [Updated] (NUTCH-956) solrindex issues

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-956:
---

Fix Version/s: (was: 2.1)
   2.2

> solrindex issues
> 
>
> Key: NUTCH-956
> URL: https://issues.apache.org/jira/browse/NUTCH-956
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Alexis
> Fix For: 2.2
>
> Attachments: solr.patch, solr.patch2
>
>
> I ran into a few caveats with the solrindex command when trying to index 
> documents. Please refer to 
> http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex, which 
> describes my tests.



[jira] [Updated] (NUTCH-992) SolrDedup is broken in trunk

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-992:
---

Fix Version/s: (was: 2.1)
   2.2

> SolrDedup is broken in trunk
> 
>
> Key: NUTCH-992
> URL: https://issues.apache.org/jira/browse/NUTCH-992
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: 2.2
>
>
> SolrDedup seems to have been broken for at least a few months, perhaps more. 
> It does fetch the documents from Solr but when processing the rows we get the 
> following exception:
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
> at 
> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:899)
> at 
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:350)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:360)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at 
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:370)



[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-840:
---

Fix Version/s: (was: 2.1)
   2.2

> Port tests from parse-html to parse-tika
> 
>
> Key: NUTCH-840
> URL: https://issues.apache.org/jira/browse/NUTCH-840
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.2
>
> Attachments: NUTCH-840.patch, NUTCH-840.patch
>
>
> We don't have tests for HTML in parse-tika, so I'll copy them from the old 
> parse-html plugin.



[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1086:


Fix Version/s: (was: 2.1)
   2.2

> Rewrite protocol-httpclient
> ---
>
> Key: NUTCH-1086
> URL: https://issues.apache.org/jira/browse/NUTCH-1086
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: nutchgora, 1.5
>Reporter: Markus Jelsma
>Priority: Critical
> Fix For: 1.6, 2.2
>
>
> There are several issues about protocol-httpclient and several comments about 
> rewriting the plugin with the new http client libraries. There is, however, 
> not yet an issue for rewriting/reimplementing protocol-httpclient.
> http://hc.apache.org/httpcomponents-client-ga/



[jira] [Updated] (NUTCH-874) Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora

2012-09-18 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-874:
---

Fix Version/s: (was: 2.1)
   2.2

> Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora
> --
>
> Key: NUTCH-874
> URL: https://issues.apache.org/jira/browse/NUTCH-874
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: nutchgora
> Environment: Nutch 2.0
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Critical
> Fix For: 2.2
>
>
> I just noticed while fixing NUTCH-564 that the ExtParser hasn't been brought 
> up to date with Nutch 2.0 trunk. We should review the plugins in src/plugin 
> to make sure they all work with Gora/Nutchbase now.



[jira] [Commented] (NUTCH-1162) Write JUnit tests for parse-js

2012-09-18 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457840#comment-13457840
 ] 

Lewis John McGibbney commented on NUTCH-1162:
-

part 2 Committed @revision 1387173 in 2.x

> Write JUnit tests for parse-js
> --
>
> Key: NUTCH-1162
> URL: https://issues.apache.org/jira/browse/NUTCH-1162
> Project: Nutch
>  Issue Type: Sub-task
>  Components: parser
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: test
> Fix For: 2.1
>
> Attachments: NUTCH-1162.patch
>
>
> This issue should provide a single JUnit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins.



[jira] [Commented] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457821#comment-13457821
 ] 

Lewis John McGibbney commented on NUTCH-1415:
-

+1 from me as well. If you could commit to current 2.x and trunk then we can 
branch once the relative output on the release thread on dev@ has been fully 
agreed.

> release packages to contain top level folder apache-nutch-x.x
> -
>
> Key: NUTCH-1415
> URL: https://issues.apache.org/jira/browse/NUTCH-1415
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6, 1.5.1
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch
>
>
> The release packages should contain a top level folder named apache-nutch-x.x 
> (x replaced by major and minor version) as in previous releases. Unpacking 
> the packages from the command line via tar xvfz package.tar.gz or unzip 
> package.zip should place all files in that folder. Cf. discussions on mailing 
> lists:
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E
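
The requirement described above (every file in a release package sits under one apache-nutch-x.x folder) could be checked along the following lines. This is only a sketch, not part of the attached patches: the function names are made up for illustration, and the pattern assumes a version of the form major.minor with an optional patch number.

```python
import re
import tarfile
import zipfile

# Assumed naming pattern (not taken from the patches): "apache-nutch-"
# followed by major.minor and an optional patch number, e.g. 1.5.1.
TOP_LEVEL = re.compile(r"^apache-nutch-\d+\.\d+(\.\d+)?$")


def top_level_folders(member_names):
    """Return the set of first path components of the archive members."""
    roots = set()
    for name in member_names:
        # Ignore empty components and a leading "./" as written by some tools.
        parts = [p for p in name.split("/") if p not in ("", ".")]
        if parts:
            roots.add(parts[0])
    return roots


def has_single_release_folder(member_names):
    """True if all members share one root that matches the assumed pattern."""
    roots = top_level_folders(member_names)
    return len(roots) == 1 and TOP_LEVEL.match(next(iter(roots))) is not None


def archive_member_names(path):
    """List member names of a .zip or (compressed) .tar package."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return zf.namelist()
    with tarfile.open(path) as tf:  # compression is detected automatically
        return tf.getnames()
```

For a built package, `has_single_release_folder(archive_member_names("package.tar.gz"))` would then report whether unpacking places all files in one versioned folder.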



[jira] [Commented] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457786#comment-13457786
 ] 

Markus Jelsma commented on NUTCH-1415:
--

+1 for having the top level directory in the package.

> release packages to contain top level folder apache-nutch-x.x
> -
>
> Key: NUTCH-1415
> URL: https://issues.apache.org/jira/browse/NUTCH-1415
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6, 1.5.1
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch
>
>
> The release packages should contain a top level folder named apache-nutch-x.x 
> (x replaced by major and minor version) as in previous releases. Unpacking 
> the packages from the command line via tar xvfz package.tar.gz or unzip 
> package.zip should place all files in that folder. Cf. discussions on mailing 
> lists:
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E



[jira] [Commented] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457753#comment-13457753
 ] 

Sebastian Nagel commented on NUTCH-1415:


This has been fixed only for the 1.5.1 and 2.0 branches.
It should be fixed for trunk and 2.x before branching 2.1 and 1.6.
Are there any objections?
Otherwise I will apply the patches tonight and check the resulting
packages (cf. NUTCH-1436).

> release packages to contain top level folder apache-nutch-x.x
> -
>
> Key: NUTCH-1415
> URL: https://issues.apache.org/jira/browse/NUTCH-1415
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6, 1.5.1
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch
>
>
> The release packages should contain a top level folder named apache-nutch-x.x 
> (x replaced by major and minor version) as in previous releases. Unpacking 
> the packages from the command line via tar xvfz package.tar.gz or unzip 
> package.zip should place all files in that folder. Cf. discussions on mailing 
> lists:
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E



[jira] [Assigned] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-09-18 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-1415:
--

Assignee: Sebastian Nagel

> release packages to contain top level folder apache-nutch-x.x
> -
>
> Key: NUTCH-1415
> URL: https://issues.apache.org/jira/browse/NUTCH-1415
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6, 1.5.1
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch
>
>
> The release packages should contain a top level folder named apache-nutch-x.x 
> (x replaced by major and minor version) as in previous releases. Unpacking 
> the packages from the command line via tar xvfz package.tar.gz or unzip 
> package.zip should place all files in that folder. Cf. discussions on mailing 
> lists:
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
> * 
> http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E
