NUTCH-1946 Upgrade to Gora 0.6.1

2015-09-17 Thread Lewis John McGibbney
Hi user@ and dev@, quick message to ask kindly for a call to arms. I pushed a patch to NUTCH-1946 [0] for Nutch 2.X HEAD [1]. This includes: - Upgrade to Gora 0.6.1 - Upgrade to Hadoop 2.5.1 (which Gora supports fully), see NUTCH-2101

[jira] [Updated] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X Docker

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2050: Flags: Patch,Important Patch Info: Patch Available > Upgrade HBase and

[jira] [Created] (NUTCH-2105) Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1

2015-09-17 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2105: --- Summary: Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1 Key: NUTCH-2105 URL: https://issues.apache.org/jira/browse/NUTCH-2105 Project:

[jira] [Updated] (NUTCH-1286) Refactoring/reimplementing crawling API (NutchApp)

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1286: Fix Version/s: (was: 2.4) 2.3.1 > Refactoring/reimplementing

[jira] [Updated] (NUTCH-1169) Write JUnit tests for urlfilter-prefix

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1169: Assignee: Talat UYARER > Write JUnit tests for urlfilter-prefix >

[jira] [Reopened] (NUTCH-1286) Refactoring/reimplementing crawling API (NutchApp)

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reopened NUTCH-1286: - > Refactoring/reimplementing crawling API (NutchApp) >

[jira] [Updated] (NUTCH-1946) Upgrade to Gora 0.6.1

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1946: Attachment: NUTCH-1946v4.patch Patch for 2.X HEAD This includes * Upgrade to Gora

[jira] [Updated] (NUTCH-1946) Upgrade to Gora 0.6.1

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1946: Flags: Patch,Important Patch Info: Patch Available Priority:

[jira] [Updated] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X Docker

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2050: Attachment: NUTCH-2050.patch Patch for 2.X HEAD blocked by NUTCH-1946. This patch

[jira] [Updated] (NUTCH-1893) Parse-tika fails to parse feed files

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1893: Fix Version/s: (was: 2.4) 2.3.1 > Parse-tika fails to parse

[jira] [Updated] (NUTCH-1886) Review and update default.properties

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1886: Fix Version/s: (was: 2.4) 2.3.1 > Review and update

[jira] [Updated] (NUTCH-1709) Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1709: Fix Version/s: (was: 2.3.1) 2.4 > Generated classes

[jira] [Updated] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X Docker

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2050: Component/s: (was: build) docker > Upgrade HBase and Hadoop

[jira] [Updated] (NUTCH-2018) Ensure that the Docker containers for Nutch 2.X are part of the Release Management Documentation

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2018: Component/s: docker > Ensure that the Docker containers for Nutch 2.X are part of

[jira] [Commented] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791680#comment-14791680 ] Lewis John McGibbney commented on NUTCH-2104: - Hi [~kwhitehall] if you think you can get this

[jira] [Updated] (NUTCH-1920) Upgrade Nutch to use Java 1.7

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1920: Fix Version/s: (was: 2.4) 2.3.1 > Upgrade Nutch to use Java

[jira] [Updated] (NUTCH-1981) Upgrade icu4j

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1981: Fix Version/s: (was: 2.4) 2.3.1 > Upgrade icu4j >

[jira] [Updated] (NUTCH-1941) Optional rolling http.agent.name's

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1941: Fix Version/s: (was: 2.4) 2.3.1 > Optional rolling

[jira] [Updated] (NUTCH-1169) Write JUnit tests for urlfilter-prefix

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1169: Fix Version/s: (was: 2.4) 2.3.1 > Write JUnit tests for

[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6.1

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791632#comment-14791632 ] Lewis John McGibbney commented on NUTCH-1946: - These are intrinsically linked. > Upgrade to

[jira] [Updated] (NUTCH-2101) Upgrade Nutch 2.X to Hadoop 2.5.1

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2101: Summary: Upgrade Nutch 2.X to Hadoop 2.5.1 (was: Upgrade Nutch 2.X to Hadoop

[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6.1

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791633#comment-14791633 ] Lewis John McGibbney commented on NUTCH-1946: - As we've fixed a deal of things over on

[jira] [Updated] (NUTCH-1062) Migrate BasicURLNormalizer from Apache ORO to java.util.regex

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1062: Fix Version/s: (was: 2.4) 2.3.1 > Migrate BasicURLNormalizer

[jira] [Updated] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1990: Fix Version/s: 2.3.1 > Use URI.normalise() in BasicURLNormalizer >
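
For readers unfamiliar with the call named in the NUTCH-1990 title: the JDK method is spelled java.net.URI.normalize(). The sketch below only illustrates what that JDK call does to "." and ".." path segments; the class and method names are hypothetical, and it is not taken from BasicURLNormalizer or from any patch on the issue.

```java
import java.net.URI;

// Hypothetical illustration class; not part of Nutch.
public class UriNormalizeSketch {

    /**
     * Parses the given URL and removes "." and ".." path segments
     * using the JDK's java.net.URI.normalize().
     */
    static String normalizePath(String url) {
        // URI.create throws IllegalArgumentException on malformed input.
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // Prints "http://example.com/a/c"
        System.out.println(normalizePath("http://example.com/a/./b/../c"));
    }
}
```

URI.normalize() only rewrites the path component, so a URL normalizer would presumably still need its own handling for things like host case or default ports.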

[jira] [Closed] (NUTCH-1936) GSoC 2015 - Move Nutch to Hadoop 2.X

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-1936. --- Resolution: Fixed > GSoC 2015 - Move Nutch to Hadoop 2.X >

[jira] [Reopened] (NUTCH-1936) GSoC 2015 - Move Nutch to Hadoop 2.X

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reopened NUTCH-1936: - > GSoC 2015 - Move Nutch to Hadoop 2.X > > >

[jira] [Updated] (NUTCH-1936) GSoC 2015 - Move Nutch to Hadoop 2.X

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1936: Fix Version/s: (was: 2.4) 2.3.1 > GSoC 2015 - Move Nutch to

[jira] [Created] (NUTCH-2106) Runtime to contain Selenium and dependencies only once

2015-09-17 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2106: -- Summary: Runtime to contain Selenium and dependencies only once Key: NUTCH-2106 URL: https://issues.apache.org/jira/browse/NUTCH-2106 Project: Nutch

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Updated patch. CrawlDatum now supports Jexl expressions on Long

[jira] [Updated] (NUTCH-2106) Runtime to contain Selenium and dependencies only once

2015-09-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2106: --- Attachment: NUTCH-2106.patch > Runtime to contain Selenium and dependencies only once >

[jira] [Updated] (NUTCH-2107) plugin.xml to validate against plugin.dtd

2015-09-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2107: --- Attachment: NUTCH-2107.patch Patch for trunk and 2.x. The validation error in lib-selenium's

[jira] [Commented] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-17 Thread Kim Whitehall (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803155#comment-14803155 ] Kim Whitehall commented on NUTCH-2104: -- Yeap [~lewismc], I'll turn it round asap. > Add

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Fixed bad long to int casting. > Automatically remove orphaned
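
As background on the "bad long to int casting" mentioned in this update, the following is a generic Java sketch of the pitfall, not the code from NUTCH-1932.patch. The class and method names are hypothetical, and Math.toIntExact assumes Java 8 or later.

```java
// Hypothetical illustration class; not part of Nutch or the attached patch.
public class LongToIntSketch {

    /** Plain narrowing cast: silently keeps only the low 32 bits. */
    static int uncheckedCast(long value) {
        return (int) value;
    }

    /** Checked conversion (Java 8+): throws ArithmeticException on overflow. */
    static int checkedCast(long value) {
        return Math.toIntExact(value);
    }

    public static void main(String[] args) {
        long big = 3_000_000_000L; // larger than Integer.MAX_VALUE
        // Prints -1294967296: the value wrapped around without any error.
        System.out.println(uncheckedCast(big));
        try {
            checkedCast(big);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```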

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Wrong default in code was used for markOrphanAfter. Config is ok >

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Uh, using long over int for time keeping makes no sense. Relies on

[jira] [Commented] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X Docker

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803311#comment-14803311 ] Lewis John McGibbney commented on NUTCH-2050: - Hi [~stack] I agree, GORA-443 was logged

[jira] [Commented] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X Docker

2015-09-17 Thread stack (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803219#comment-14803219 ] stack commented on NUTCH-2050: -- Why not go to hbase-1.x rather than hbase-0.98.x [~lewismc]? It's been out

[jira] [Commented] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X Docker

2015-09-17 Thread stack (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803346#comment-14803346 ] stack commented on NUTCH-2050: -- Sounds like we need to update Gora then (smile). > Upgrade HBase and Hadoop

[jira] [Commented] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X Docker

2015-09-17 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803379#comment-14803379 ] Lewis John McGibbney commented on NUTCH-2050: - ACK. We are on it and will have updated along

[jira] [Created] (NUTCH-2108) Add a function to the selenium interactive plugin interface to do multiple manipulation of driver and then return the data

2015-09-17 Thread Asitang Mishra (JIRA)
Asitang Mishra created NUTCH-2108: - Summary: Add a function to the selenium interactive plugin interface to do multiple manipulation of driver and then return the data Key: NUTCH-2108 URL:

[jira] [Created] (NUTCH-2109) Create a brute force click-all-ajax-links utility function for selenium interactive plugin

2015-09-17 Thread Asitang Mishra (JIRA)
Asitang Mishra created NUTCH-2109: - Summary: Create a brute force click-all-ajax-links utility function for selenium interactive plugin Key: NUTCH-2109 URL: https://issues.apache.org/jira/browse/NUTCH-2109

[jira] [Created] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter search terms).selenium"

2015-09-17 Thread Asitang Mishra (JIRA)
Asitang Mishra created NUTCH-2110: - Summary: Create the capability to provide seeds in the form of "url+xpath(including option to enter search terms).selenium" Key: NUTCH-2110 URL:

[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-17 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804901#comment-14804901 ] ASF GitHub Bot commented on NUTCH-2099: --- Github user sujen1412 commented on a diff in the pull

[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-17 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804899#comment-14804899 ] ASF GitHub Bot commented on NUTCH-2099: --- Github user sujen1412 commented on a diff in the pull

[GitHub] nutch pull request: Fix for NUTCH-2099 Contributed by Sujen Shah

2015-09-17 Thread sujen1412
Github user sujen1412 commented on a diff in the pull request: https://github.com/apache/nutch/pull/59#discussion_r39821072 --- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java --- @@ -261,30 +262,68 @@ public int run(String[] args) throws Exception { additionsAllowed

[GitHub] nutch pull request: Fix for NUTCH-2099 Contributed by Sujen Shah

2015-09-17 Thread sujen1412
Github user sujen1412 commented on a diff in the pull request: https://github.com/apache/nutch/pull/59#discussion_r39821056 --- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java --- @@ -236,10 +237,10 @@ public int run(String[] args) throws Exception { * Used for Nutch

[jira] [Updated] (NUTCH-2098) Add null SeedUrl constructor

2015-09-17 Thread Aron Ahmadia (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aron Ahmadia updated NUTCH-2098: Labels: memex newbie (was: newbie) > Add null SeedUrl constructor > >

[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-09-17 Thread Aron Ahmadia (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14805002#comment-14805002 ] Aron Ahmadia commented on NUTCH-2011: - What's the status on the implementation of this endpoint? This

[jira] [Assigned] (NUTCH-2098) Add null SeedUrl constructor

2015-09-17 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-2098: Assignee: Chris A. Mattmann > Add null SeedUrl constructor >

[jira] [Work started] (NUTCH-2098) Add null SeedUrl constructor

2015-09-17 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2098 started by Chris A. Mattmann. > Add null SeedUrl constructor > > >

[jira] [Resolved] (NUTCH-2098) Add null SeedUrl constructor

2015-09-17 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-2098. -- Resolution: Fixed Thanks [~ahmadia] fixed in trunk! {noformat}

[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-09-17 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14805010#comment-14805010 ] Chris A. Mattmann commented on NUTCH-2011: -- [~sujenshah] [~asitang] > Endpoint to support