[jira] [Commented] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2014-01-07 Thread Talat UYARER (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864133#comment-13864133
 ] 

Talat UYARER commented on NUTCH-1371:
-

Are there any opinions on this?

> Replace Ivy with Maven Ant tasks
> --------------------------------
>
>                 Key: NUTCH-1371
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1371
>             Project: Nutch
>          Issue Type: Improvement
>          Components: build
>    Affects Versions: 1.7, 2.2.1
>            Reporter: Julien Nioche
>            Assignee: Lewis John McGibbney
>             Fix For: 1.8, 2.4
>
>         Attachments: NUTCH-1371-plugins.trunk.patch, NUTCH-1371-pom.patch,
>                      NUTCH-1371-r1461140.patch, NUTCH-1371.patch
>
>
>
> We might move to Maven altogether but a good intermediate step could be to 
> rely on the maven ant tasks for managing the dependencies. Ivy does a good 
> job but we need to have a pom file anyway for publishing the artefacts which 
> means keeping the pom.xml and ivy.xml contents in sync. Most devs are also 
> more familiar with Maven, and it is well integrated in IDEs. Going the 
> ANT+MVN way also means that we don't have to rewrite the whole building 
> process and can rely on our existing script



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (NUTCH-1675) NutchField to support long

2014-01-07 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1675.
--

Resolution: Fixed

Committed revision 1556194.


> NutchField to support long
> --------------------------
>
>                 Key: NUTCH-1675
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1675
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.8
>
>         Attachments: NUTCH-1675-trunk.patch
>
>
> NutchField has no support for Long in readFields. Usually this is not a 
> problem because in reducers it is only written to the output. But when 
> NutchField is used in mappers, a reducer cannot read a Long.
> {code}
> java.lang.RuntimeException: problem advancing post rec#0
>   at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1217)
>   at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:250)
>   at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:246)
>   at org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1440)
>   at org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1401)
>   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> Caused by: java.io.EOFException
>   at java.io.DataInputStream.readFully(DataInputStream.java:197)
>   at org.apache.hadoop.io.Text.readString(Text.java:402)
>   at org.apache.nutch.indexer.NutchField.readFields(NutchField.java:89)
>   at org.apache.nutch.indexer.NutchDocument.readFields(NutchDocument.java:112)
>   at org.apache.nutch.indexer.NutchIndexAction.readFields(NutchIndexAction.java:81)
>   at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
>   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>   at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1276)
>   at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1214)
>   ... 7 more
> {code}
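
The failure mode described in the issue can be illustrated with a minimal sketch. It assumes a hypothetical FieldValueCodec that writes a one-byte type tag before each value; it is not the actual NutchField code or the committed patch, whose contents are not shown in this thread. The sketch only shows the general idea: the reader has to handle every type the writer can emit, Long included, otherwise the remaining bytes are misread and deserialization can fail, for example with an EOFException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical codec for illustration only; not the real NutchField. */
final class FieldValueCodec {
  // Type tags are an assumption of this sketch, not Nutch's wire format.
  private static final byte TYPE_STRING = 0;
  private static final byte TYPE_LONG = 1;

  /** Writes a type tag followed by the value itself. */
  static void write(DataOutput out, Object value) throws IOException {
    if (value instanceof String) {
      out.writeByte(TYPE_STRING);
      out.writeUTF((String) value);
    } else if (value instanceof Long) {
      out.writeByte(TYPE_LONG);
      out.writeLong((Long) value);
    } else {
      String type = (value == null) ? "null" : value.getClass().getName();
      throw new IOException("Unsupported value type: " + type);
    }
  }

  /** Reads the tag and dispatches; every tag written above has a branch here. */
  static Object read(DataInput in) throws IOException {
    byte type = in.readByte();
    switch (type) {
      case TYPE_STRING:
        return in.readUTF();
      case TYPE_LONG:
        return in.readLong();
      default:
        throw new IOException("Unknown type tag: " + type);
    }
  }

  /** Serializes a value to bytes and reads it back, as a mapper/reducer pair would. */
  static Object roundTrip(Object value) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      write(new DataOutputStream(buffer), value);
      return read(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling FieldValueCodec.roundTrip(42L) returns a Long again because both the write and the read side know the Long tag; dropping the TYPE_LONG branch from read would reproduce the kind of asymmetry the issue describes.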





[jira] [Commented] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2014-01-07 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864168#comment-13864168
 ] 

Lewis John McGibbney commented on NUTCH-1371:
-

Hi [~talat], that sounds good to me; I have no objection.
Are you proposing to follow [~jnioche]'s suggestion regarding the patch he attached?



Nightly builds

2014-01-07 Thread Lewis John Mcgibbney
Hi Folks,
I'm working on getting the Jenkins job configuration stable again.
Something seems to have been reset or is not correct.
I'll update here once we are back to stable builds.
Ta
Lewis

-- 
*Lewis*


[jira] [Commented] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2014-01-07 Thread Talat UYARER (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864182#comment-13864182
 ] 

Talat UYARER commented on NUTCH-1371:
-

Hi [~lewismc],

Unlike [~jnioche]'s suggestion, I would like to understand why we do not 
migrate Nutch completely to a Maven project. If a complete migration would not 
be a problem, I volunteer to do it.



Build failed in Jenkins: Nutch-trunk #2479

2014-01-07 Thread Apache Jenkins Server
See 

Changes:

[markus] NUTCH-1675 NutchField to support long

--
[...truncated 6775 lines...]

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlmeta

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-host

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-pass

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-querystring

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-regex

jar:

deps-test:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:

copy-generated-lib:

compile:

javadoc:
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.nutch.crawl...
  [javadoc] Loading source files for package org.apache.nutch.fetcher...
  [javadoc] Loading source files for package org.apache.nutch.indexer...
  [javadoc] Loading source files for package org.apache.nutch.metadata...
  [javadoc] Loading source files for package org.apache.nutch.net...
  [javadoc] Loading source files for package org.apache.nutch.net.protocols...
  [javadoc] Loading source files for package org.apache.nutch.parse...
  [javadoc] Loading source files for package org.apache.nutch.plugin...
  [javadoc] Loading source files for package org.apache.nutch.protocol...
  [javadoc] Loading source files for package org.apache.nutch.scoring...
  [javadoc] Loading source files for package 
org.apache.nutch.scoring.webgraph...
  [javadoc] Loading source files for package org.apache.nutch.segment...
  [javadoc] Loading source files for package org.apache.nutch.tools...
  [javadoc] Loading source files for package org.apache.nutch.tools.arc...
  [javadoc] Loading source files for package org.apache.nutch.tools.proxy...
  [javadoc] Loading source files for package org.apache.nutch.util...
  [javadoc] 
/home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/src/java/org/apache/nutch/util/StringUtil.java:130:
 error: unmappable character for encoding ASCII
  [javadoc]* Simple character substitution which cleans all ??? chars from 
a given String.
  [javadoc] ^
  [javadoc] 
/home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/src/java/org/apache/nutch/util/StringUtil.java:130:
 error: unmappable character for encoding ASCII
  [javadoc] Loading source files for package org.apache.nutch.util.domain...
  [javadoc]* Simple character substitution which cleans all ??? chars from 
a given String.
  [javadoc] Loading source files for package org.creativecommons.nutch...
  [javadoc]  ^
  [javadoc] Loading source files for package org.apache.nutch.indexer.feed...
  [javadoc] 
/home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/src/java/org/apache/nutch/util/StringUtil.java:130:
 error: unmappable character for encoding ASCII
  [javadoc] Loading source files for package org.apache.nutch.parse.feed...
  [javadoc]* Simple character substitution which cleans all ??? chars from 
a given String.
  [javadoc] Loading source files for package org.apache.nutch.parse.headings...
  [javadoc]   ^
  [javadoc] Loading source files for package org.apache.nutch.indexer.anchor...
  [javadoc] 
/home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/src/java/org/apache/nutch/util/StringUtil.java:133:
 error: unmappable character for encoding

[jira] [Commented] (NUTCH-1675) NutchField to support long

2014-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864188#comment-13864188
 ] 

Hudson commented on NUTCH-1675:
---

FAILURE: Integrated in Nutch-trunk #2479 (See 
[https://builds.apache.org/job/Nutch-trunk/2479/])
NUTCH-1675 NutchField to support long (markus: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1556194)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/NutchField.java




[jira] [Commented] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2014-01-07 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864189#comment-13864189
 ] 

Lewis John McGibbney commented on NUTCH-1371:
-

I think for trunk this would not be a problem. 
For 2.x we would need to write a gora-maven-plugin which could be included in 
pom.xml. This is planned but not implemented yet.
If you are ready to migrate the trunk code to Maven lifecycle management, then I 
would back this effort 100%.



[jira] [Commented] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2014-01-07 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864190#comment-13864190
 ] 

Julien Nioche commented on NUTCH-1371:
--

Talat, 
Moving to Maven altogether is fine as long as all the functionality of the 
ANT+IVY build is preserved. For instance, we'd need an elegant way of dealing 
with the Nutch plugins in Maven.
Not a high-priority task in my view, but if you want to provide a patch, I'll 
be happy to review it.
Thanks
Julien 



[jira] [Created] (NUTCH-1695) NutchDocument.toString()

2014-01-07 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-1695:


 Summary: NutchDocument.toString()
 Key: NUTCH-1695
 URL: https://issues.apache.org/jira/browse/NUTCH-1695
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.8


We need a NutchDocument.toString() for easier debugging.
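
As a rough illustration of what such a method could look like, here is a minimal sketch. DebuggableDocument is a hypothetical stand-in class, not the real NutchDocument, and the output layout is an assumption; the attached patch may format things differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a multi-valued document; not the real NutchDocument. */
final class DebuggableDocument {
  // Insertion-ordered so the debug output lists fields in the order they were added.
  private final Map<String, List<Object>> fields = new LinkedHashMap<>();

  /** Adds a value to the named field, creating the field on first use. */
  void add(String name, Object value) {
    fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
  }

  /** Prints one line per field with all of its values, for debugging. */
  @Override
  public String toString() {
    StringBuilder sb = new StringBuilder("doc {\n");
    for (Map.Entry<String, List<Object>> entry : fields.entrySet()) {
      sb.append("  ").append(entry.getKey()).append(": ")
        .append(entry.getValue()).append('\n');
    }
    return sb.append('}').toString();
  }
}
```

With this in place, a log statement such as LOG.debug(doc.toString()) would show the field names and values instead of the default Object identity string.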





[jira] [Updated] (NUTCH-1695) NutchDocument.toString()

2014-01-07 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1695:
-

Attachment: NUTCH-1695-trunk.patch

Patch for trunk

> NutchDocument.toString()
> ------------------------
>
>                 Key: NUTCH-1695
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1695
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Trivial
>             Fix For: 1.8
>
>         Attachments: NUTCH-1695-trunk.patch
>
>
> We need a NutchDocument.toString() for easier debugging.





[jira] [Commented] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2014-01-07 Thread Talat UYARER (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864215#comment-13864215
 ] 

Talat UYARER commented on NUTCH-1371:
-

Hi [~jnioche],

Actually, I have some problems with the ant+ivy setup. First, whenever I change 
a configuration file I have to run the ant runtime target. In addition, 
whenever I change a build dependency I have to run the ant eclipse target 
before I can work in the IDE. At present, dependencies are managed from the 
build files. I think these problems will be solved when we migrate to Maven. 
How do you deal with this kind of problem?



Re: Inject operation: can't it be done in a single map-reduce job ?

2014-01-07 Thread Lewis John Mcgibbney
Hi Tejas,

On Mon, Jan 6, 2014 at 10:55 PM,  wrote:

> I don't think that it would be a perfect setup to get some stats.
>

+1 ;)


> Does ASF have any cluster which could be used?
>

Don't know, mate. I'm heading over to builds@ tonight to see about
stabilizing our nightly builds again, as it seems there is a problem when
publishing the Javadoc, or something to do with the Jenkins job
configuration. I'll report back here when I get a minute, unless you beat me to
it.
More info on Apache Infra services can be found here:
http://www.apache.org/dev/infrastructure.html


[jira] [Created] (NUTCH-1696) Enable use of (Gora) SNAPSHOT dependencies

2014-01-07 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1696:
---

 Summary: Enable use of (Gora) SNAPSHOT dependencies
 Key: NUTCH-1696
 URL: https://issues.apache.org/jira/browse/NUTCH-1696
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Lewis John McGibbney
 Fix For: 2.3


For some time it has been on my radar to enable the use of SNAPSHOT 
dependencies within Nutch. Specifically, this relates to the gora-* SNAPSHOTs 
available here [0].
I am working on a patch which updates ivy.xml and ivysettings.xml to enable 
this; however, it seems almost like black magic right now.
I'll upload the patch once I get my build working. 

[0] 
https://repository.apache.org/content/repositories/snapshots/org/apache/gora/





[jira] [Updated] (NUTCH-1696) Enable use of (Gora) SNAPSHOT dependencies

2014-01-07 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1696:


Attachment: NUTCH-1696.patch

Patch for 2.x HEAD. This works fine for me and I am able to build perfectly by 
editing ivy/ivy.xml to suit my requirements.
Please comment.

> Enable use of (Gora) SNAPSHOT dependencies
> ------------------------------------------
>
>                 Key: NUTCH-1696
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1696
>             Project: Nutch
>          Issue Type: Task
>          Components: build
>            Reporter: Lewis John McGibbney
>             Fix For: 2.3
>
>         Attachments: NUTCH-1696.patch
>
>
> For some time it has been on my radar to enable the use of SNAPSHOT 
> dependencies within Nutch. Specifically, this relates to the gora-* SNAPSHOTs 
> available here [0].
> I am working on a patch which updates ivy.xml and ivysettings.xml to enable 
> this; however, it seems almost like black magic right now.
> I'll upload the patch once I get my build working. 
> [0] 
> https://repository.apache.org/content/repositories/snapshots/org/apache/gora/





[jira] [Commented] (NUTCH-1695) NutchDocument.toString()

2014-01-07 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864670#comment-13864670
 ] 

Sebastian Nagel commented on NUTCH-1695:


+1



[jira] [Commented] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-07 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864835#comment-13864835
 ] 

Sebastian Nagel commented on NUTCH-1693:


A useful improvement! Thanks!
{quote}
> both 2x and 1x should provide identical signature impl.
{quote}
Definitely. Falling back to MD5 on raw content for empty documents may be 
useful, e.g. for PDFs with scanned images and no readable textual content: two 
binary-identical PDFs are then still deduplicated.
Regarding calculation of MD5 from text:
* [patch for trunk] String.getBytes() depends on the default encoding / locale. 
If it differs, e.g. between development and production environments, this may 
cause some headaches. We could either pass a fixed Charset (UTF-8) as a 
parameter to getBytes(...) or use MD5Hash.digest(String string), which encodes 
the string as UTF-8 before check-summing.
* [patch for 2x] instead of converting a UTF-8-encoded byte array to a Java 
String and back, MD5Hash.digest(page.getText().getBytes()) may be more efficient.
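
The first point can be sketched with plain JDK classes. The class name TextSignature and the method md5Hex are hypothetical and used only for illustration; this is not the code from either patch, and it uses java.security.MessageDigest rather than Hadoop's MD5Hash. What matters is the explicit charset passed to getBytes, which makes the digest independent of the platform default encoding.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for illustration; not part of Nutch. */
final class TextSignature {

  /** Returns the MD5 digest of the text as lowercase hex, always encoding as UTF-8. */
  static String md5Hex(String text) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      // An explicit charset avoids depending on the JVM's default encoding / locale.
      byte[] digest = md5.digest(text.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // MD5 is required to be present in every Java platform implementation.
      throw new IllegalStateException("MD5 not available", e);
    }
  }
}
```

With text.getBytes() instead, two machines with different default encodings could produce different signatures for the same non-ASCII text, which would defeat deduplication across environments.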

> TextMD5Signatue compute on textual content
> ------------------------------------------
>
>                 Key: NUTCH-1693
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1693
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Tien Nguyen Manh
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1693-trunk.patch, NUTCH-1693.patch
>
>
> I created a new MD5Signature that is based on textual content. In our case we 
> use boilerpipe to extract the main text from the content, so this signature 
> is more effective for deduplication.





Build failed in Jenkins: Nutch-nutchgora #876

2014-01-07 Thread Apache Jenkins Server
See 

--
[...truncated 3936 lines...]
deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: microformats-reltag
[junit] Running 
org.apache.nutch.microformats.reltag.TestRelTagIndexingFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.495 sec
[junit] Running org.apache.nutch.microformats.reltag.TestRelTagParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.511 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: scoring-opic

compile-test:
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
[javac] 1 warning

jar:

deps-test:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: scoring-opic
[junit] Running org.apache.nutch.scoring.opic.TestOPICScoringFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.546 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: tld

compile-test:
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
[javac] 1 warning

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: tld
[junit] Running org.apache.nutch.indexer.tld.TestTLDIndexingFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.623 sec

test:

jar:
 [copy] Copying 1 file to 

 [copy] Copying 1 file to 

  [jar] Building jar: 


runtime:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

 [copy] Copying 1 file to 

 [copy] Copying 2 files to 

 [copy] Copying 1 file to 

 [copy] Copying 1 file to 

 [copy] Copying 26 files to 

 [copy] Copying 2 files to 

 [copy] Copying 103 files to 

 [copy] Copying 106 files to 

 [copy] Copying 158 files to 


javadoc:
[mkdir] Created dir: 

  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.nutch.api...
  [javadoc] Loading source files for package org.apache.nutch.api.impl...
  [javadoc] Loading source files for package org.apache.nutch.crawl...
  [javadoc] Loading source files for package org.apache.nutch.fetcher...
  [javadoc] Loading source files for package org.apache.nutch.host...
  [javadoc] Loading source files for package org.apache.nutch.html...
  [javadoc] Loading source files for package org.apache.nutch.indexer...
  [javadoc] Loading source files for package org.apache.nutch.indexer.elastic...
  [javadoc] Loading source files for package org.apache.nutch.indexer.solr...
  [javadoc] Loading source files for package org.apache.nutch.metadata...
  [javadoc] Loading source files for package org.apache.nutch.net...
  [javadoc] Loading source files for package org.apache.nutch.net.protocols...
  [javadoc] Loading source files for package org.apache.nutch.parse...
  [javadoc] Loading source files for package org.apache.nutch.plugin...
  [javadoc] 


Build failed in Jenkins: Nutch-trunk #2480

2014-01-07 Thread Apache Jenkins Server
See 

--
[...truncated 6731 lines...]
deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlmeta

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-basic

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-host

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-pass

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-querystring

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-regex

jar:

deps-test:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:

copy-generated-lib:

compile:

javadoc:
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.nutch.crawl...
  [javadoc] Loading source files for package org.apache.nutch.fetcher...
  [javadoc] Loading source files for package org.apache.nutch.indexer...
  [javadoc] Loading source files for package org.apache.nutch.metadata...
  [javadoc] Loading source files for package org.apache.nutch.net...
  [javadoc] Loading source files for package org.apache.nutch.net.protocols...
  [javadoc] Loading source files for package org.apache.nutch.parse...
  [javadoc] Loading source files for package org.apache.nutch.plugin...
  [javadoc] Loading source files for package org.apache.nutch.protocol...
  [javadoc] Loading source files for package org.apache.nutch.scoring...
  [javadoc] Loading source files for package 
org.apache.nutch.scoring.webgraph...
  [javadoc] Loading source files for package org.apache.nutch.segment...
  [javadoc] 
:130:
 error: unmappable character for encoding ASCII
  [javadoc] Loading source files for package org.apache.nutch.tools...
  [javadoc]* Simple character substitution which cleans all ??? chars from 
a given String.
  [javadoc] Loading source files for package org.apache.nutch.tools.arc...
  [javadoc] ^
  [javadoc] Loading source files for package org.apache.nutch.tools.proxy...
  [javadoc] 
:130:
 error: unmappable character for encoding ASCII
  [javadoc] Loading source files for package org.apache.nutch.util...
  [javadoc]* Simple character substitution which cleans all ??? chars from 
a given String.
  [javadoc] Loading source files for package org.apache.nutch.util.domain...
  [javadoc]  ^
  [javadoc] Loading source files for package org.creativecommons.nutch...
  [javadoc] 
:130:
 error: unmappable character for encoding ASCII
  [javadoc] Loading source files for package org.apache.nutch.indexer.feed...
  [javadoc]* Simple character substitution which cleans all ??? chars from 
a given String.
  [javadoc] Loading source files for package org.apache.nutch.parse.feed...
  [javadoc]   ^
  [javadoc] Loading source files for package org.apache.nutch.parse.headings...
  [javadoc] 
:133:
 error: unmappable character for encoding ASCII
  [javadoc] Loading source files for package org.apache.nutch.indexer.anchor...
  [javadoc] return value.replaceAll("???", "");
  [javadoc] Loadin