[jira] [Commented] (NUTCH-2111) Set temporary file location for selenium tmp files

2015-09-22 Thread Kim Whitehall (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903993#comment-14903993
 ] 

Kim Whitehall commented on NUTCH-2111:
--

Further investigation showed that changing the temporary path does not get rid 
of the tmp files that eat up space. Further, if a selenium grid is utilized, 
the location chosen on a given node may not be available on all nodes. As such, 
it is best to stay with the default /tmp location and handle deleting the files 
there instead. The patch submitted does this. 
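
The cleanup approach described above could be sketched roughly as follows. This is not the submitted patch (which is not shown in this thread), only an illustration of deleting stale entries from a temporary directory. The function name remove_stale_tmp_entries, the name pattern and the age threshold are all assumptions made for the example.

```python
import os
import shutil
import time
from fnmatch import fnmatch


def remove_stale_tmp_entries(tmp_dir, pattern, max_age_seconds, now=None):
    """Delete entries in tmp_dir whose name matches pattern and whose
    modification time is older than max_age_seconds.
    Returns the names that were removed, in sorted order."""
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(tmp_dir)):
        if not fnmatch(name, pattern):
            continue  # not one of ours, leave it alone
        path = os.path.join(tmp_dir, name)
        try:
            age = now - os.lstat(path).st_mtime
        except FileNotFoundError:
            continue  # already removed by another process
        if age < max_age_seconds:
            continue  # possibly still in use by a running webdriver
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
        removed.append(name)
    return removed
```

Matching on a name pattern and an age threshold keeps the cleanup from touching files that belong to other programs or to a webdriver that is still running; on a selenium grid the same routine would have to run on every node.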


> Set temporary file location for selenium tmp files
> --
>
> Key: NUTCH-2111
> URL: https://issues.apache.org/jira/browse/NUTCH-2111
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>
> When using the selenium plugin (local mode or selenium grid), a large number 
> of tmp files can be generated for each webdriver executed. The default 
> location for selenium is the /tmp directory. Thus very quickly (and 
> inadvertently) the nutch-selenium interaction can lead to filesystem issues. 
> I propose to include a config in nutch-default.xml that allows users to 
> specify where they want the selenium tmp files to be written. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2105) Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1

2015-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903792#comment-14903792
 ] 

Hudson commented on NUTCH-2105:
---

SUCCESS: Integrated in Nutch-nutchgora #1539 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1539/])
NUTCH-2105 Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1 
(lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1704754)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/docker/cassandra/README.md
* /nutch/branches/2.x/docker/cassandra/bin/build.sh
* /nutch/branches/2.x/docker/cassandra/bin/ipof.sh
* /nutch/branches/2.x/docker/cassandra/bin/nodes.sh
* /nutch/branches/2.x/docker/cassandra/bin/restart.sh
* /nutch/branches/2.x/docker/cassandra/bin/start.sh
* /nutch/branches/2.x/docker/cassandra/bin/stop.sh
* /nutch/branches/2.x/docker/cassandra/cassandra/Dockerfile
* /nutch/branches/2.x/docker/cassandra/cassandra/bootstrap.sh
* /nutch/branches/2.x/docker/cassandra/nutch/Dockerfile
* /nutch/branches/2.x/docker/cassandra/nutch/bootstrap.sh
* /nutch/branches/2.x/docker/cassandra/nutch/config/nutch-site.xml
* /nutch/branches/2.x/docker/cassandra/nutch/testUrls/seed.txt


> Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1
> ---
>
> Key: NUTCH-2105
> URL: https://issues.apache.org/jira/browse/NUTCH-2105
> Project: Nutch
>  Issue Type: New Feature
>  Components: docker
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
> Attachments: NUTCH-2105.patch
>
>
> Since we are updating NUTCH-2050 it would be excellent to have the Nutch + 
> Hadoop + Gora + Cassandra stack up-to-date and ready to use as part of the 
> 2.3.1 release. This issue should review the Dockerfile and update it where 
> necessary.





[VOTE] Release Apache Nutch 2.3.1

2015-09-22 Thread Lewis John Mcgibbney
Hi user@ & dev@,

This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.

We addressed 32 issues in all, which can be seen in the release report:
http://s.apache.org/nutch_2.3.1

The release candidate comprises the following components.

 * A staging repository [0] containing various Maven artifacts
 * A branch-2.3.1 of the 2.x code [1]
 * The tagged source upon which we are VOTE'ing [2]
 * Finally, the release artifacts [3], which I would encourage you to
   verify for signatures and test.

You should use the following KEYS [4] file to verify the signatures of all
release artifacts.

Please VOTE as follows:

[ ] +1 Push the release, I am happy :)
[ ] +0 I am not bothered either way
[ ] -1 I am not happy with this release candidate (please state why)

Firstly, thank you to everyone who contributed to Nutch. Secondly, thank you
to everyone who VOTEs. It is appreciated.

Thanks
Lewis
(on behalf of Nutch PMC)

p.s. Here's my +1

[0] https://repository.apache.org/content/repositories/orgapachenutch-1005
[1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
[2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
[3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
[4] http://www.apache.org/dist/nutch/KEYS



-- 
*Lewis*
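
For voters who want to script the signature check requested above, a minimal sketch follows. It assumes that GnuPG is installed as gpg, that the KEYS file has been downloaded locally, and that each artifact has a detached signature next to it named <artifact>.asc. The helper names and any artifact file name passed in are hypothetical; the mail above does not name individual files.

```python
import subprocess


def verification_commands(keys_file, artifact, signature=None):
    """Return the gpg invocations for importing KEYS and checking a
    detached signature. Assumes the signature is '<artifact>.asc'
    unless another path is given."""
    signature = signature or artifact + ".asc"
    return [
        ["gpg", "--import", keys_file],
        ["gpg", "--verify", signature, artifact],
    ]


def verify(keys_file, artifact):
    """Run the commands in order; subprocess raises CalledProcessError
    if gpg exits with a non-zero status."""
    for cmd in verification_commands(keys_file, artifact):
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the execution makes it easy to print the commands and run them by hand instead.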


[jira] [Resolved] (NUTCH-2018) Ensure that the Docker containers for Nutch 2.X are part of the Release Management Documentation

2015-09-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2018.
-
Resolution: Fixed

Added to release management HOWTO

> Ensure that the Docker containers for Nutch 2.X are part of the Release 
> Management Documentation
> 
>
> Key: NUTCH-2018
> URL: https://issues.apache.org/jira/browse/NUTCH-2018
> Project: Nutch
>  Issue Type: Improvement
>  Components: docker, documentation
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 2.3.1
>
>
> We need to ensure that the new docker containers which live within 
> [https://github.com/apache/nutch/tree/2.x/docker|the docker package] are 
> functional and working when making releases. This means documenting how the 
> code should be updated prior to a release. This work is essential to keep 
> them working. 





[Nutch Wiki] Trivial Update of "Release_HOWTO" by LewisJohnMcgibbney

2015-09-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "Release_HOWTO" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/Release_HOWTO?action=diff&rev1=41&rev2=42

1. Run unit tests.
 {{{ant test}}}
1. Do basic test to see if release looks ok - e.g. install it and run 
example from tutorial.
+ 1. Run the docker containers as per the guidance for 
[[https://github.com/apache/nutch/tree/trunk/docker|trunk]] and 
[[https://github.com/apache/nutch/tree/2.x/docker/hbase|2.x HBase]] and 
[[https://github.com/apache/nutch/tree/2.x/docker/cassandra|2.x Cassandra]]
  1. Get hold of '''maven-ant-tasks-2.X.X.jar''' from 
http://search.maven.org/#search|gav|1|g%3A%22org.apache.maven%22%20AND%20a%3A%22maven-ant-tasks%22
 and put it in the ivy directory
  1. Execute '''ant -lib ivy deploy''' from $NUTCH_HOME, this will sign 
the Maven artifacts (sources, javadoc, .jar) and send them to an Apache Nexus 
staging repository. Details of how to set this up can be found 
[[http://www.apache.org/dev/publishing-maven-artifacts.html|here]]. '''N.B.''' 
Ensure that you have an '''apache-release''' profile contained within 
~/.m2/settings.xml
  1. Once you've read, and are happy with the 
[[https://repository.apache.org/|staging repos]], close it.


[jira] [Resolved] (NUTCH-2105) Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1

2015-09-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2105.
-
Resolution: Fixed

Committed @revision 1704754 in 2.X HEAD

> Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1
> ---
>
> Key: NUTCH-2105
> URL: https://issues.apache.org/jira/browse/NUTCH-2105
> Project: Nutch
>  Issue Type: New Feature
>  Components: docker
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
> Attachments: NUTCH-2105.patch
>
>
> Since we are updating NUTCH-2050 it would be excellent to have the Nutch + 
> Hadoop + Gora + Cassandra stack up-to-date and ready to use as part of the 
> 2.3.1 release. This issue should review the Dockerfile and update it where 
> necessary.





[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-09-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903524#comment-14903524
 ] 

Sebastian Nagel commented on NUTCH-2110:


Ok, understood. One point to consider: shall all paginated documents be kept 
under the same URL? As a batch crawler, Nutch uses the URL in many places to 
uniquely identify content, metadata, status information, indexed documents, 
etc. Of course, the outlinks generated for page1 could be modified by adding a 
suffix which makes the URL unique. The suffix would then be removed only inside 
protocol-selenium, in order to fetch the right page.
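
The suffix idea can be illustrated with a small sketch. It is not Nutch code: the marker string and both function names are hypothetical, and a real implementation would have to pick a marker that survives URL normalization and filtering.

```python
import re

# Hypothetical marker; any string that cannot occur in the crawled URLs would do.
SUFFIX_MARKER = "::page="
_SUFFIX_RE = re.compile(re.escape(SUFFIX_MARKER) + r"(\d+)$")


def add_page_suffix(url, page):
    """Make the URL of a paginated view unique by appending the page number."""
    return f"{url}{SUFFIX_MARKER}{page}"


def strip_page_suffix(url):
    """Return (fetchable_url, page); page is None when no suffix is present."""
    match = _SUFFIX_RE.search(url)
    if match is None:
        return url, None
    return url[:match.start()], int(match.group(1))
```

The outlink generator would call add_page_suffix so that each page gets its own key in the crawl data, and only the protocol plugin would call strip_page_suffix before fetching.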

> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter seach terms).selenium" 
> --
>
> Key: NUTCH-2110
> URL: https://issues.apache.org/jira/browse/NUTCH-2110
> Project: Nutch
>  Issue Type: Sub-task
>  Components: fetcher
>Affects Versions: 1.10
>Reporter: Asitang Mishra
>  Labels: memex
>
> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter search terms).selenium" to be used by selenium 
> protocols/plugins as urls/flow to reach a specific ajax based page or save 
> the state of a selenium operation for the next fetching round.
> At least, this should make Nutch capable of distinguishing whether a url 
> should be opened using the basic http, httpclient or selenium protocols, and 
> provide the selenium protocol with basic authentication capabilities based on 
> the above ideas.





[jira] [Updated] (NUTCH-2018) Ensure that the Docker containers for Nutch 2.X are part of the Release Management Documentation

2015-09-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2018:

Issue Type: Improvement  (was: Bug)

> Ensure that the Docker containers for Nutch 2.X are part of the Release 
> Management Documentation
> 
>
> Key: NUTCH-2018
> URL: https://issues.apache.org/jira/browse/NUTCH-2018
> Project: Nutch
>  Issue Type: Improvement
>  Components: docker, documentation
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 2.3.1
>
>
> We need to ensure that the new docker containers which live within 
> [https://github.com/apache/nutch/tree/2.x/docker|the docker package] are 
> functional and working when making releases. This means documenting how the 
> code should be updated prior to a release. This work is essential to keep 
> them working. 





[jira] [Updated] (NUTCH-2105) Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1

2015-09-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2105:

Attachment: NUTCH-2105.patch

Patch for 2.X HEAD
Would like to commit today and get an RC rolled out if no objections.
Thanks
Lewis

> Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1
> ---
>
> Key: NUTCH-2105
> URL: https://issues.apache.org/jira/browse/NUTCH-2105
> Project: Nutch
>  Issue Type: New Feature
>  Components: docker
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
> Attachments: NUTCH-2105.patch
>
>
> Since we are updating NUTCH-2050 it would be excellent to have the Nutch + 
> Hadoop + Gora + Cassandra stack up-to-date and ready to use as part of the 
> 2.3.1 release. This issue should review the Dockerfile and update it where 
> necessary.





Jenkins build is back to normal : Nutch-nutchgora #1538

2015-09-22 Thread Apache Jenkins Server
See 



[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902844#comment-14902844
 ] 

Hudson commented on NUTCH-2095:
---

SUCCESS: Integrated in Nutch-trunk #3279 (See 
[https://builds.apache.org/job/Nutch-trunk/3279/])
Adding NUTCH-2095 to the CHANGES.txt file (jorgelbg: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1704650)
* /nutch/trunk/CHANGES.txt


> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are 
> available:
> {{-warc}}: enables exporting into WARC files; if not specified, the 
> default JACKSON formatter is used.
> {{-warcSize}}: defines a max file size for each WARC file; if not 
> specified, a default of 1GB per file is used, as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes were made to the default {{CommonCrawlDataDumper}}, essentially 
> to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments. 
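
The size cap described for the -warcSize option amounts to rolling over to a new output file once a limit is reached. The sketch below shows that general technique only. It is not the CommonCrawlDataDumper implementation; it writes opaque byte records rather than real WARC records, and the class name and file naming scheme are assumptions.

```python
import os


class RollingWriter:
    """Write byte records to numbered files, starting a new file whenever
    the next record would push the current one past max_bytes."""

    def __init__(self, directory, prefix, max_bytes):
        self.directory = directory
        self.prefix = prefix
        self.max_bytes = max_bytes
        self.index = 0      # number of the next file to open
        self.written = 0    # bytes written to the current file
        self._file = None

    def _open_next(self):
        if self._file is not None:
            self._file.close()
        name = f"{self.prefix}-{self.index:05d}.warc"
        self._file = open(os.path.join(self.directory, name), "wb")
        self.index += 1
        self.written = 0

    def write(self, record):
        # Roll over only if the current file already holds data, so a single
        # oversized record still gets written (into a file of its own).
        too_big = self.written and self.written + len(record) > self.max_bytes
        if self._file is None or too_big:
            self._open_next()
        self._file.write(record)
        self.written += len(record)

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None
```

Records are never split across files here, so a file can exceed the limit only when a single record is larger than max_bytes.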





[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902784#comment-14902784
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2095:
---

Updated the CHANGES.txt file

> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are 
> available:
> {{-warc}}: enables exporting into WARC files; if not specified, the 
> default JACKSON formatter is used.
> {{-warcSize}}: defines a max file size for each WARC file; if not 
> specified, a default of 1GB per file is used, as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes were made to the default {{CommonCrawlDataDumper}}, essentially 
> to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments. 





[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902751#comment-14902751
 ] 

Hudson commented on NUTCH-2095:
---

SUCCESS: Integrated in Nutch-trunk #3278 (See 
[https://builds.apache.org/job/Nutch-trunk/3278/])
bugfix removed Guave dependency see NUTCH-2095 (jnioche: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1704641)
* /nutch/trunk/ivy/ivy.xml


> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are 
> available:
> {{-warc}}: enables exporting into WARC files; if not specified, the 
> default JACKSON formatter is used.
> {{-warcSize}}: defines a max file size for each WARC file; if not 
> specified, a default of 1GB per file is used, as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes were made to the default {{CommonCrawlDataDumper}}, essentially 
> to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments. 





[jira] [Commented] (NUTCH-2102) WARC Exporter

2015-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902752#comment-14902752
 ] 

Hudson commented on NUTCH-2102:
---

SUCCESS: Integrated in Nutch-trunk #3278 (See 
[https://builds.apache.org/job/Nutch-trunk/3278/])
NUTCH-2102 WARC Exporter (jnioche: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1704634)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/bin/nutch
* /nutch/trunk/src/java/org/apache/nutch/tools/warc
* /nutch/trunk/src/java/org/apache/nutch/tools/warc/WARCExporter.java
* /nutch/trunk/src/java/org/apache/nutch/tools/warc/package-info.java


> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/





Jenkins build is back to normal : Nutch-trunk #3278

2015-09-22 Thread Apache Jenkins Server
See 



[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902737#comment-14902737
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2095:
---

[~jnioche] Nice catch! 

Will do! I didn't get the same behavior locally, though; I can't even find 
Guava v17 on the university Maven mirror. 

> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are 
> available:
> {{-warc}}: enables exporting into WARC files; if not specified, the 
> default JACKSON formatter is used.
> {{-warcSize}}: defines a max file size for each WARC file; if not 
> specified, a default of 1GB per file is used, as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes were made to the default {{CommonCrawlDataDumper}}, essentially 
> to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments. 





[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902715#comment-14902715
 ] 

Julien Nioche commented on NUTCH-2095:
--

See [https://issues.apache.org/jira/browse/HADOOP-10961]. This is due to Guava 
17, which is inherited from webarchive-common version 1.1.5.
I've excluded Guava from it in revision 1704641, and that has fixed the problem.

[~jorgelbg] please remember to run 'ant clean test' before committing something.

> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are 
> available:
> {{-warc}}: enables exporting into WARC files; if not specified, the 
> default JACKSON formatter is used.
> {{-warcSize}}: defines a max file size for each WARC file; if not 
> specified, a default of 1GB per file is used, as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes were made to the default {{CommonCrawlDataDumper}}, essentially 
> to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments. 





[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
-
Fix Version/s: 1.11

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/





[jira] [Resolved] (NUTCH-2102) WARC Exporter

2015-09-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2102.
--
Resolution: Fixed

Committed revision 1704634.

Thanks for the reviews

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/





[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread jorgelbg
Github user jorgelbg closed the pull request at:

https://github.com/apache/nutch/pull/55




[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902651#comment-14902651
 ] 

Julien Nioche commented on NUTCH-2095:
--

Thanks [~jorgelbg]. Please add a line to CHANGES.txt to describe what you did 
with this. Could you also edit 
[https://wiki.apache.org/nutch/CommonCrawlDataDumper] and describe what you 
added to the CCDD? Thanks

BTW the basic tests fail on my machine - do you get this too? e.g. for 
TestInjector

{code}
tried to access method com.google.common.base.Stopwatch.<init>()V from class 
org.apache.hadoop.mapred.FileInputFormat
java.lang.IllegalAccessError: tried to access method 
com.google.common.base.Stopwatch.<init>()V from class 
org.apache.hadoop.mapred.FileInputFormat
{code}



> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are 
> available:
> {{-warc}}: enables exporting into WARC files; if not specified, the 
> default JACKSON formatter is used.
> {{-warcSize}}: defines a max file size for each WARC file; if not 
> specified, a default of 1GB per file is used, as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes were made to the default {{CommonCrawlDataDumper}}, essentially 
> to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments. 





[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902650#comment-14902650
 ] 

Hudson commented on NUTCH-2095:
---

FAILURE: Integrated in Nutch-trunk #3277 (See 
[https://builds.apache.org/job/Nutch-trunk/3277/])
NUTCH-2095 WARC exporter for the CommonCrawlDataDumper updating test (jorgelbg: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1704619)
* /nutch/trunk/src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java


> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are 
> available:
> {{-warc}}: enables exporting into WARC files; if not specified, the 
> default JACKSON formatter is used.
> {{-warcSize}}: defines a max file size for each WARC file; if not 
> specified, a default of 1GB per file is used, as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes were made to the default {{CommonCrawlDataDumper}}, essentially 
> to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments. 





Build failed in Jenkins: Nutch-trunk #3277

2015-09-22 Thread Apache Jenkins Server
See 

Changes:

[jorgelbg] NUTCH-2095 WARC exporter for the CommonCrawlDataDumper updating test

--
[...truncated 4624 lines...]
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-pass
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-pass/classes
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-pass/test
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-pass/test/lib
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-pass

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-pass
[javac] Compiling 2 source files to 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-pass/classes
[javac] Creating empty 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-pass/classes/org/apache/nutch/net/urlnormalizer/pass/package-info.class

jar:
  [jar] Building jar: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-pass/urlnormalizer-pass.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-pass

copy-generated-lib:
 [copy] Copying 1 file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-pass

init:
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-querystring
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-querystring/classes
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-querystring/test
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-querystring/test/lib
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-querystring

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-querystring
[javac] Compiling 2 source files to 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-querystring/classes
[javac] Creating empty 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-querystring/classes/org/apache/nutch/net/urlnormalizer/querystring/package-info.class

jar:
  [jar] Building jar: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-querystring/urlnormalizer-querystring.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-querystring

copy-generated-lib:
 [copy] Copying 1 file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-querystring
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-regex/test/data
 [copy] Copying 4 files to 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-regex/test/data

init:
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-regex/classes
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-regex/test/lib
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-regex

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-regex
[javac] Compiling 2 source files to 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-regex/classes
[javac] Creating empty 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-regex/classes/org/apache/nutch/net/urlnormalizer/regex/package-info.class

jar:
  [jar] Building jar: 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-regex/urlnormalizer-regex.jar

deps-test:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
 [copy] Copying 1 file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-regex

copy-generated-lib:
 [copy] Copyi

[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902600#comment-14902600
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2095:
---

Committed the updated test

> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are 
> available:
> {{-warc}}: enables exporting into WARC files; if not 
> specified, the default JACKSON formatter is used.
> {{-warcSize}}: sets a maximum file size for each WARC 
> file; if not specified, a default of 1GB per file is used, as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes were made to the default {{CommonCrawlDataDumper}}, essentially 
> to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments.
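
The last point of the description, avoiding a new format instance for every URL, amounts to constructing the formatter once and reusing it inside the loop. A minimal sketch of that pattern follows; the nested Format class, its counter, and dumpAll are hypothetical stand-ins and do not reflect the real CommonCrawlFormat API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a stand-in formatter, not the Nutch classes.
class FormatReuseSketch {

    // Stand-in for a format object that is assumed to be costly to construct.
    static final class Format {
        static int created = 0;          // counts constructions, for illustration only

        Format() {
            created++;
        }

        String render(String url) {
            return "{\"url\":\"" + url + "\"}";
        }
    }

    // Builds the formatter once, then reuses it for every URL.
    static List<String> dumpAll(List<String> urls) {
        Format format = new Format();
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            out.add(format.render(url));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> records = dumpAll(Arrays.asList("http://a/", "http://b/", "http://c/"));
        System.out.println(records.size() + " records, " + Format.created + " formatter instance(s)");
    }
}
```

Moving the constructor call inside the loop would produce the same records but one instance per URL, which is the overhead the description says the change removes.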





[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902578#comment-14902578
 ] 

Julien Nioche commented on NUTCH-2095:
--

[~jorgelbg] could you please fix the test? See below.

{code}
Index: src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
===
--- src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java  (revision 1704612)
+++ src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java  (working copy)
@@ -101,8 +101,9 @@
 
CommonCrawlDataDumper dumper = new CommonCrawlDataDumper(
new CommonCrawlConfig());
-   dumper.dump(tempDir, sampleSegmentDir, false, null, false, "");
 
+   dumper.dump(tempDir, sampleSegmentDir, false, null, false, "", false);
+
Collection tempFiles = FileUtils.listFiles(tempDir,
FileFilterUtils.fileFileFilter(),
FileFilterUtils.directoryFileFilter());
{code}
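
The patch above fixes the test by passing the extra trailing argument that dump() now expects. Another common way to keep older callers compiling when a parameter is added is to retain the old signature as a thin overload that delegates with a default. The sketch below illustrates that alternative only; the class, the parameter list, and the meaning of the extra flag are hypothetical, and nothing in this thread indicates that Nutch took this route.

```java
// Hypothetical sketch: simplified signatures, not the real CommonCrawlDataDumper.
class DumperOverloadSketch {

    // Newer signature with a hypothetical trailing flag.
    String dump(String outputDir, String segmentDir, boolean gzip, boolean extraFlag) {
        return outputDir + "|" + segmentDir + "|" + gzip + "|" + extraFlag;
    }

    // Older signature kept as a wrapper, so existing callers still compile.
    String dump(String outputDir, String segmentDir, boolean gzip) {
        return dump(outputDir, segmentDir, gzip, false);
    }

    public static void main(String[] args) {
        DumperOverloadSketch dumper = new DumperOverloadSketch();
        System.out.println(dumper.dump("out", "segment", true));
        System.out.println(dumper.dump("out", "segment", true, true));
    }
}
```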

> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are 
> available:
> {{-warc}}: enables exporting into WARC files; if not 
> specified, the default JACKSON formatter is used.
> {{-warcSize}}: sets a maximum file size for each WARC 
> file; if not specified, a default of 1GB per file is used, as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes were made to the default {{CommonCrawlDataDumper}}, essentially 
> to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments.





[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902521#comment-14902521
 ] 

Hudson commented on NUTCH-2095:
---

FAILURE: Integrated in Nutch-trunk #3276 (See 
[https://builds.apache.org/job/Nutch-trunk/3276/])
NUTCH-2095 WARC exporter for the CommonCrawlDataDumper (jorgelbg: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1704594)
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/java/org/apache/nutch/tools/AbstractCommonCrawlFormat.java
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlConfig.java
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlFormat.java
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatFactory.java
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJackson.java
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJettinson.java
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java
* /nutch/trunk/src/java/org/apache/nutch/tools/WARCUtils.java
* /nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java


> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are 
> available:
> {{-warc}}: enables exporting into WARC files; if not 
> specified, the default JACKSON formatter is used.
> {{-warcSize}}: sets a maximum file size for each WARC 
> file; if not specified, a default of 1GB per file is used, as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes were made to the default {{CommonCrawlDataDumper}}, essentially 
> to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments.





Build failed in Jenkins: Nutch-trunk #3276

2015-09-22 Thread Apache Jenkins Server
See 

Changes:

[jorgelbg] NUTCH-2095 WARC exporter for the CommonCrawlDataDumper

--
[...truncated 1587 lines...]
AU src/test/crawl-tests.xml
A src/test/org
A src/test/org/apache
A src/test/org/apache/nutch
A src/test/org/apache/nutch/parse
AU src/test/org/apache/nutch/parse/TestOutlinkExtractor.java
AU src/test/org/apache/nutch/parse/parse-plugin-test.xml
AU src/test/org/apache/nutch/parse/TestParseData.java
AU src/test/org/apache/nutch/parse/TestParserFactory.java
AU src/test/org/apache/nutch/parse/TestParseText.java
A src/test/org/apache/nutch/util
AU src/test/org/apache/nutch/util/TestStringUtil.java
AU src/test/org/apache/nutch/util/TestMimeUtil.java
AU src/test/org/apache/nutch/util/TestPrefixStringMatcher.java
AU src/test/org/apache/nutch/util/TestGZIPUtils.java
AU src/test/org/apache/nutch/util/WritableTestUtils.java
AU src/test/org/apache/nutch/util/TestNodeWalker.java
AU src/test/org/apache/nutch/util/TestSuffixStringMatcher.java
AU src/test/org/apache/nutch/util/TestEncodingDetector.java
AU src/test/org/apache/nutch/util/TestURLUtil.java
A src/test/org/apache/nutch/util/DumpFileUtilTest.java
A src/test/org/apache/nutch/indexer
AU src/test/org/apache/nutch/indexer/TestIndexingFilters.java
A src/test/org/apache/nutch/plugin
AU src/test/org/apache/nutch/plugin/TestPluginSystem.java
AU src/test/org/apache/nutch/plugin/ITestExtension.java
AU src/test/org/apache/nutch/plugin/HelloWorldExtension.java
AU src/test/org/apache/nutch/plugin/SimpleTestPlugin.java
A src/test/org/apache/nutch/fetcher
AU src/test/org/apache/nutch/fetcher/TestFetcher.java
A src/test/org/apache/nutch/metadata
AU src/test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
AU src/test/org/apache/nutch/metadata/TestMetadata.java
A src/test/org/apache/nutch/service
A src/test/org/apache/nutch/service/TestNutchServer.java
A src/test/org/apache/nutch/tools
A src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
A src/test/org/apache/nutch/tools/proxy
A src/test/org/apache/nutch/tools/proxy/SegmentHandler.java
A src/test/org/apache/nutch/tools/proxy/FakeHandler.java
A src/test/org/apache/nutch/tools/proxy/package-info.java
A src/test/org/apache/nutch/tools/proxy/LogDebugHandler.java
A src/test/org/apache/nutch/tools/proxy/NotFoundHandler.java
A src/test/org/apache/nutch/tools/proxy/AbstractTestbedHandler.java
A src/test/org/apache/nutch/tools/proxy/DelayHandler.java
A src/test/org/apache/nutch/tools/proxy/ProxyTestbed.java
A src/test/org/apache/nutch/protocol
AU src/test/org/apache/nutch/protocol/TestProtocolFactory.java
AU src/test/org/apache/nutch/protocol/TestContent.java
A src/test/org/apache/nutch/segment
AU src/test/org/apache/nutch/segment/TestSegmentMerger.java
A src/test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java
A src/test/org/apache/nutch/net
AU src/test/org/apache/nutch/net/TestURLNormalizers.java
AU src/test/org/apache/nutch/net/TestURLFilters.java
A src/test/org/apache/nutch/crawl
AU src/test/org/apache/nutch/crawl/TestGenerator.java
A src/test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
AU src/test/org/apache/nutch/crawl/TestSignatureFactory.java
AU src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
A src/test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java
AU src/test/org/apache/nutch/crawl/TestInjector.java
A src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java
AU src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java
A src/test/org/apache/nutch/crawl/TestCrawlDbStates.java
AU src/test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java
A src/test/org/apache/nutch/crawl/TestCrawlDbFilter.java
AU src/test/org/apache/nutch/crawl/DummyWritable.java
AU src/test/org/apache/nutch/crawl/TestLinkDbMerger.java
A KEYS
AU README.md
AU build.xml
AU NOTICE.txt
AU default.properties
 U .
At revision 1704609
[trunk] $ /home/jenkins/tools/ant/latest/bin/ant -file build.xml 
-Dtest.junit.output.format=xml nightly javadoc test-plugins
Buildfile: 
Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. 
It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. 
It could not be found.

ivy-download-unchecked:

ivy-init-antlib:

ivy-init:

init:

[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902481#comment-14902481
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2095:
---

Committed revision 1704594

> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are 
> available:
> {{-warc}}: enables exporting into WARC files; if not 
> specified, the default JACKSON formatter is used.
> {{-warcSize}}: sets a maximum file size for each WARC 
> file; if not specified, a default of 1GB per file is used, as recommended by 
> the WARC ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes were made to the default {{CommonCrawlDataDumper}}, essentially 
> to the Factory and to the Formats. These changes avoid creating a 
> new instance of a {{CommonCrawlFormat}} on each URL read from the segments.





[GitHub] nutch pull request: Update NutchServer.java

2015-09-22 Thread zhangmianhongni
GitHub user zhangmianhongni opened a pull request:

https://github.com/apache/nutch/pull/63

Update NutchServer.java

On line 201, "CMD_PORT" is incorrect; it should be "CMD_HOST".

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhangmianhongni/nutch patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/63.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #63


commit 98b1b9c80f62b493d6559b3d5b2faf5963a1cbf5
Author: zhangmian 
Date:   2015-09-22T09:56:35Z

Update NutchServer.java



