Build failed in Jenkins: Nutch-trunk #3112

2015-05-09 Thread Apache Jenkins Server
See 

--
[...truncated 1362 lines...]
A src/java/org/apache/nutch/tools
A src/java/org/apache/nutch/tools/AbstractCommonCrawlFormat.java
A src/java/org/apache/nutch/tools/CommonCrawlFormatFactory.java
A src/java/org/apache/nutch/tools/CommonCrawlFormatJackson.java
A src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java
AU src/java/org/apache/nutch/tools/package-info.java
AU src/java/org/apache/nutch/tools/ResolveUrls.java
A src/java/org/apache/nutch/tools/arc
AU src/java/org/apache/nutch/tools/arc/package-info.java
AU src/java/org/apache/nutch/tools/arc/ArcRecordReader.java
AU src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
AU src/java/org/apache/nutch/tools/arc/ArcInputFormat.java
AU src/java/org/apache/nutch/tools/DmozParser.java
A src/java/org/apache/nutch/tools/CommonCrawlConfig.java
A src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java
A src/java/org/apache/nutch/tools/CommonCrawlFormatJettinson.java
AU src/java/org/apache/nutch/tools/FreeGenerator.java
AU src/java/org/apache/nutch/tools/Benchmark.java
A src/java/org/apache/nutch/tools/CommonCrawlFormat.java
A src/java/org/apache/nutch/tools/FileDumper.java
A src/java/org/apache/nutch/protocol
AU src/java/org/apache/nutch/protocol/ProtocolFactory.java
AU src/java/org/apache/nutch/protocol/Content.java
A src/java/org/apache/nutch/protocol/RobotRulesParser.java
AU src/java/org/apache/nutch/protocol/ProtocolNotFound.java
AU src/java/org/apache/nutch/protocol/ProtocolException.java
AU src/java/org/apache/nutch/protocol/Protocol.java
AU src/java/org/apache/nutch/protocol/ProtocolOutput.java
AU src/java/org/apache/nutch/protocol/package-info.java
AU src/java/org/apache/nutch/protocol/ProtocolStatus.java
A src/java/org/apache/nutch/segment
AU src/java/org/apache/nutch/segment/SegmentMerger.java
AU src/java/org/apache/nutch/segment/package-info.java
AU src/java/org/apache/nutch/segment/SegmentReader.java
AU src/java/org/apache/nutch/segment/SegmentChecker.java
AU src/java/org/apache/nutch/segment/SegmentMergeFilter.java
AU src/java/org/apache/nutch/segment/SegmentPart.java
AU src/java/org/apache/nutch/segment/SegmentMergeFilters.java
AU src/java/org/apache/nutch/segment/ContentAsTextInputFormat.java
A src/java/org/apache/nutch/scoring
AU src/java/org/apache/nutch/scoring/ScoringFilters.java
A src/java/org/apache/nutch/scoring/AbstractScoringFilter.java
A src/java/org/apache/nutch/scoring/webgraph
AU src/java/org/apache/nutch/scoring/webgraph/NodeDumper.java
AU src/java/org/apache/nutch/scoring/webgraph/package-info.java
AU src/java/org/apache/nutch/scoring/webgraph/Node.java
AU src/java/org/apache/nutch/scoring/webgraph/LinkDatum.java
AU src/java/org/apache/nutch/scoring/webgraph/NodeReader.java
AU src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
AU src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
AU src/java/org/apache/nutch/scoring/webgraph/LoopReader.java
AU src/java/org/apache/nutch/scoring/webgraph/Loops.java
AU src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
AU src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java
AU src/java/org/apache/nutch/scoring/package-info.java
AU src/java/org/apache/nutch/scoring/ScoringFilterException.java
AU src/java/org/apache/nutch/scoring/ScoringFilter.java
A src/java/org/apache/nutch/net
AU src/java/org/apache/nutch/net/package-info.java
A src/java/org/apache/nutch/net/protocols
AU src/java/org/apache/nutch/net/protocols/package-info.java
AU src/java/org/apache/nutch/net/protocols/HttpDateFormat.java
AU src/java/org/apache/nutch/net/protocols/Response.java
AU src/java/org/apache/nutch/net/protocols/ProtocolException.java
AU src/java/org/apache/nutch/net/URLNormalizer.java
AU src/java/org/apache/nutch/net/URLFilterException.java
AU src/java/org/apache/nutch/net/URLFilter.java
AU src/java/org/apache/nutch/net/URLNormalizers.java
AU src/java/org/apache/nutch/net/URLNormalizerChecker.java
AU src/java/org/apache/nutch/net/URLFilters.java
AU src/java/org/apache/nutch/net/URLFilterChecker.java
A src/java/org/apache/nutch/crawl
AU src/java/org/apache/nutch/crawl/FetchScheduleFactory.java
AU src/java/org/apache/nutch/crawl/Signature.java
AU src/java/org/apache/nutch/crawl/CrawlDbReader.java
AU src/java/org/apache/nutch/crawl/LinkDb.java
AU src/java/org/apache/nutch/crawl/CrawlDatum.java
AU src/java/org/apache/nutch/crawl/LinkDbMerg

[jira] [Work started] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-05-09 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1995 started by Chris A. Mattmann.

> Add support for wildcard to http.robot.rules.whitelist
> --
>
> Key: NUTCH-1995
> URL: https://issues.apache.org/jira/browse/NUTCH-1995
> Project: Nutch
> Issue Type: Improvement
> Components: robots
> Affects Versions: 1.10
> Reporter: Giuseppe Totaro
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.11
>
>
> The {{http.robot.rules.whitelist}} 
> ([NUTCH-1927|https://issues.apache.org/jira/browse/NUTCH-1927]) configuration 
> parameter specifies a comma-separated list of hostnames or IP addresses 
> for which robot rules parsing is ignored.
> Wildcard support in {{http.robot.rules.whitelist}} would simplify the 
> configuration when many hostnames or addresses have to be listed. 
> Here is an example:
> {noformat}
> <property>
>   <name>http.robot.rules.whitelist</name>
>   <value>*.sample.com</value>
>   <description>Comma separated list of hostnames or IP addresses to ignore 
>   robot rules parsing for. Use with care and only if you are explicitly
>   allowed by the site owner to ignore the site's robots.txt!
>   </description>
> </property>
> {noformat}
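
The wildcard behaviour requested above could look roughly like the following. This is only a minimal sketch, not the patch attached to the issue: the class name `WildcardWhitelist` and its methods are hypothetical, and it assumes that a wildcard is only allowed as a leading `*.` label and that `*.sample.com` should match subdomains but not the bare `sample.com`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: matches hosts against a
// comma-separated whitelist whose entries may start with "*.".
final class WildcardWhitelist {
  private final List<String> exactHosts = new ArrayList<>();
  private final List<String> suffixes = new ArrayList<>();

  WildcardWhitelist(String commaSeparated) {
    for (String raw : commaSeparated.split(",")) {
      String entry = raw.trim().toLowerCase(Locale.ROOT);
      if (entry.isEmpty()) {
        continue;
      }
      if (entry.startsWith("*.")) {
        // "*.sample.com" is stored as the suffix ".sample.com"
        suffixes.add(entry.substring(1));
      } else {
        exactHosts.add(entry);
      }
    }
  }

  boolean isWhitelisted(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    if (exactHosts.contains(h)) {
      return true;
    }
    for (String suffix : suffixes) {
      // Require at least one character in front of the suffix, so the
      // bare domain itself is not matched by the wildcard entry.
      if (h.length() > suffix.length() && h.endsWith(suffix)) {
        return true;
      }
    }
    return false;
  }
}
```

With the value `*.sample.com, 10.0.0.1`, this sketch accepts `www.sample.com` and `10.0.0.1` and rejects `sample.com` and `notsample.com`. Whether the bare domain should also match is a design choice the issue text does not settle.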



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-05-09 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-1995:
-
Fix Version/s: 1.11



[jira] [Updated] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-05-09 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-1995:
-
Labels: memex  (was: )



[jira] [Assigned] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-05-09 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-1995:


Assignee: Chris A. Mattmann



Build failed in Jenkins: Nutch-trunk #3111

2015-05-09 Thread Apache Jenkins Server
See 

Changes:

[mattmann] Commit patch for NUTCH-1988 Add support for user-defined file 
extension to CommonCrawlDataDumper contributed by Giuseppe Totaro.

--
[...truncated 1362 lines...]
A src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium
A src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
A src/plugin/lib-selenium/plugin.xml
A src/plugin/lib-selenium/build.xml
A src/plugin/index-basic
A src/plugin/index-basic/ivy.xml
A src/plugin/index-basic/src
A src/plugin/index-basic/src/test
A src/plugin/index-basic/src/test/org
A src/plugin/index-basic/src/test/org/apache
A src/plugin/index-basic/src/test/org/apache/nutch
A src/plugin/index-basic/src/test/org/apache/nutch/indexer
A src/plugin/index-basic/src/test/org/apache/nutch/indexer/basic
A src/plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java
A src/plugin/index-basic/src/java
A src/plugin/index-basic/src/java/org
A src/plugin/index-basic/src/java/org/apache
A src/plugin/index-basic/src/java/org/apache/nutch
A src/plugin/index-basic/src/java/org/apache/nutch/indexer
A src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic
AU src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java
AU src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/package.html
AU src/plugin/index-basic/plugin.xml
AU src/plugin/index-basic/build.xml
A src/plugin/urlnormalizer-ajax
A src/plugin/urlnormalizer-ajax/ivy.xml
A src/plugin/urlnormalizer-ajax/src
A src/plugin/urlnormalizer-ajax/src/test
A src/plugin/urlnormalizer-ajax/src/test/org
A src/plugin/urlnormalizer-ajax/src/test/org/apache
A src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch
A src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net/urlnormalizer/ajax
A src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net/urlnormalizer/ajax/TestAjaxURLNormalizer.java
A src/plugin/urlnormalizer-ajax/src/java
A src/plugin/urlnormalizer-ajax/src/java/org
A src/plugin/urlnormalizer-ajax/src/java/org/apache
A src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch
A src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net/urlnormalizer/ajax
A src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net/urlnormalizer/ajax/AjaxURLNormalizer.java
A src/plugin/urlnormalizer-ajax/plugin.xml
A src/plugin/urlnormalizer-ajax/build.xml
AU src/plugin/build-plugin.xml
A src/plugin/urlfilter-validator
AU src/plugin/urlfilter-validator/plugin.xml
AU src/plugin/urlfilter-validator/build.xml
A src/plugin/urlfilter-validator/ivy.xml
A src/plugin/urlfilter-validator/src
A src/plugin/urlfilter-validator/src/test
A src/plugin/urlfilter-validator/src/test/org
A src/plugin/urlfilter-validator/src/test/org/apache
A src/plugin/urlfilter-validator/src/test/org/apache/nutch
A src/plugin/urlfilter-validator/src/test/org/apache/nutch/urlfilter
A src/plugin/urlfilter-validator/src/test/org/apache/nutch/urlfilter/validator
A src/plugin/urlfilter-validator/src/test/org/apache/nutch/urlfilter/validator/TestUrlValidator.java
A src/plugin/urlfilter-validator/src/java
A src/plugin/urlfilter-validator/src/java/org
A src/plugin/urlfilter-validator/src/java/org/apache
A src/plugin/urlfilter-validator/src/java/org/apache/nutch
A src/plugin/urlfilter-validator/src/java/org/apache/nutch/urlfilter
A src/plugin/urlfilter-validator/src/java/org/apache/nutch/urlfilter/validator
AU src/plugin/urlfilter-validator/src/java/org/apache/nutch/urlfilter/validator/UrlValidator.java
AU src/plugin/urlfilter-validator/src/java/org/apache/nutch/urlfilter/validator/package.html
A src/plugin/index-static
A src/plugin/index-static/plugin.xml
A src/plugin/index-static/build.xml
A src/plugin/index-static/ivy.xml
A src/plugin/index-static/src
A src/plugin/index-static/src/test
A src/plugin/index-static/src/test/org
A src/plugin/index-static/src/test/org/apache
A src/plugin/index-static/src/test/org/apache/nutch
A src/plugin/index-static/src/test/org/apache/nutch/indexer
A 

[jira] [Commented] (NUTCH-1988) Make nested output directory dump optional

2015-05-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536887#comment-14536887
 ] 

Hudson commented on NUTCH-1988:
---

FAILURE: Integrated in Nutch-trunk #3111 (See 
[https://builds.apache.org/job/Nutch-trunk/3111/])
Commit patch for NUTCH-1988 Add support for user-defined file extension to 
CommonCrawlDataDumper contributed by Giuseppe Totaro. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1678520)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java


> Make nested output directory dump optional
> --
>
> Key: NUTCH-1988
> URL: https://issues.apache.org/jira/browse/NUTCH-1988
> Project: Nutch
> Issue Type: Improvement
> Components: dumpers
> Affects Versions: 1.9
> Reporter: Michael Joyce
> Assignee: Chris A. Mattmann
> Priority: Minor
> Labels: memex
> Fix For: 1.10
>
>
> NUTCH-1957 added nested directories to the bin/nutch dump output to help 
> avoid naming conflicts in output files. It would be nice to be able to 
> specify that you want the older flat directory output as an optional 
> parameter.





[jira] [Resolved] (NUTCH-1998) Add support for user-defined file extension to CommonCrawlDataDumper

2015-05-09 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-1998.
--
Resolution: Fixed

Committed, thanks [~gostep]!

{noformat}
[chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m "Commit patch for 
NUTCH-1988 Add support for user-defined file extension to CommonCrawlDataDumper 
contributed by Giuseppe Totaro."
Sending        CHANGES.txt
Sending        src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java
Transmitting file data ..
Committed revision 1678520.
[chipotle:~/tmp/nutch-trunk] mattmann% 
{noformat}


> Add support for user-defined file extension to CommonCrawlDataDumper
> 
>
> Key: NUTCH-1998
> URL: https://issues.apache.org/jira/browse/NUTCH-1998
> Project: Nutch
> Issue Type: Improvement
> Components: tool
> Reporter: Giuseppe Totaro
> Assignee: Chris A. Mattmann
> Priority: Minor
> Labels: memex, patch
> Fix For: 1.11
>
> Attachments: NUTCH-1998.patch
>
>
> The {{CommonCrawlDataDumper}} tool generates CBOR-encoded files, 
> extracted from Nutch crawl data, using the Common Crawl format. By default, 
> {{CommonCrawlDataDumper}} keeps the original file extension.
> We are going to add a command-line option (e.g., {{-extension}}) 
> that lets the user provide a file extension to use in place of the 
> original one.
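
The extension override described above can be illustrated with a small sketch. It is not taken from NUTCH-1998.patch: the class `ExtensionOverride` and the method `withExtension` are hypothetical names, and the sketch assumes that an empty or missing user extension means "keep the original file name".

```java
// Hypothetical helper, not part of Nutch: swaps the extension of an
// output file name for a user-supplied one.
final class ExtensionOverride {
  private ExtensionOverride() {
  }

  static String withExtension(String filename, String extension) {
    if (extension == null) {
      return filename;
    }
    // Accept both "cbor" and ".cbor"
    String ext = extension.startsWith(".") ? extension.substring(1) : extension;
    if (ext.isEmpty()) {
      return filename;
    }
    int slash = Math.max(filename.lastIndexOf('/'), filename.lastIndexOf('\\'));
    int dot = filename.lastIndexOf('.');
    // Only treat the dot as an extension separator if it is inside the last
    // path segment and is not that segment's first character.
    String base = (dot > slash + 1) ? filename.substring(0, dot) : filename;
    return base + "." + ext;
  }
}
```

For example, `withExtension("dump/page.html", "cbor")` returns `dump/page.cbor`, while a name without an extension such as `dir.v1/page` simply gets `.cbor` appended.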





[jira] [Commented] (NUTCH-2004) ParseChecker does not handle redirects

2015-05-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536851#comment-14536851
 ] 

Hudson commented on NUTCH-2004:
---

SUCCESS: Integrated in Nutch-trunk #3110 (See 
[https://builds.apache.org/job/Nutch-trunk/3110/])
Tickle commit for NUTCH-2004 and this closes #23. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1678519)
* /nutch/trunk/CHANGES.txt


> ParseChecker does not handle redirects
> --
>
> Key: NUTCH-2004
> URL: https://issues.apache.org/jira/browse/NUTCH-2004
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.9
> Reporter: Michael Joyce
> Assignee: Michael Joyce
> Priority: Minor
> Fix For: 1.11
>
>
> At the moment ParseChecker doesn't handle redirects. If it gets anything but 
> a success status it errors out. It would be nice if it handled redirects a 
> bit more gracefully based on the http.redirects config setting.
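
The redirect handling asked for above can be sketched as a bounded loop. This is an illustration only, not the change merged for NUTCH-2004: `RedirectFollower`, `Result` and `resolve` are hypothetical names, the fetch step is passed in as a function instead of using Nutch's protocol plugins, and the sketch assumes the redirect setting is an upper bound on the number of redirects to follow.

```java
import java.util.function.Function;

// Hypothetical sketch, not Nutch code: follows redirects up to a limit.
final class RedirectFollower {
  private RedirectFollower() {
  }

  /** Minimal stand-in for a fetch result: HTTP status plus redirect target. */
  static final class Result {
    final int status;
    final String location;

    Result(int status, String location) {
      this.status = status;
      this.location = location;
    }
  }

  /**
   * Returns the final URL that answered with status 200, or null if the
   * fetch failed or more than maxRedirects redirects would be needed.
   */
  static String resolve(String url, int maxRedirects, Function<String, Result> fetch) {
    String current = url;
    for (int hops = 0; ; hops++) {
      Result r = fetch.apply(current);
      boolean redirect = r.status >= 300 && r.status < 400 && r.location != null;
      if (!redirect) {
        return r.status == 200 ? current : null;
      }
      if (hops >= maxRedirects) {
        return null; // redirect limit reached, give up
      }
      current = r.location;
    }
  }
}
```

With a limit of 0 the sketch behaves as the issue describes ParseChecker before the change: any redirect is treated as a failure.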





[GitHub] nutch pull request: NUTCH-2004: Update parsechecker to handle redi...

2015-05-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/23


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (NUTCH-2004) ParseChecker does not handle redirects

2015-05-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536839#comment-14536839
 ] 

ASF GitHub Bot commented on NUTCH-2004:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/23

