Jenkins build is back to normal : Nutch-trunk #3273

2015-09-17 Thread Apache Jenkins Server
See 



[jira] [Commented] (NUTCH-2098) Add null SeedUrl constructor

2015-09-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14805040#comment-14805040
 ] 

Hudson commented on NUTCH-2098:
---

FAILURE: Integrated in Nutch-trunk #3272 (See 
[https://builds.apache.org/job/Nutch-trunk/3272/])
Fix for NUTCH-2098: Add null SeedUrl constructor contributed by Aron Ahmadia. 
(mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1703745)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/service/model/request/SeedUrl.java


> Add null SeedUrl constructor
> 
>
> Key: NUTCH-2098
> URL: https://issues.apache.org/jira/browse/NUTCH-2098
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.10
> Environment: All
>Reporter: Aron Ahmadia
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex, newbie
> Fix For: 1.11
>
> Attachments: 0001-Default-SeedURL-constructor.patch
>
>
> The SeedUrl class currently doesn't provide a null constructor, and therefore 
> can't correctly implement the Serializable interface to instantiate from JSON 
> objects.
> This patch adds a null constructor for the class.
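
As an illustration of why the no-argument ("null") constructor matters, here is a minimal sketch. It assumes a reflective data binder that first creates an empty instance and then fills it through setters. `SeedUrlSketch`, its single `url` field, and the `bind` helper are hypothetical stand-ins, not the real `SeedUrl` class or the attached patch.

```java
import java.io.Serializable;

// Hypothetical stand-in for org.apache.nutch.service.model.request.SeedUrl;
// the single "url" field is an assumption, not the real class layout.
class SeedUrlSketch implements Serializable {
  private static final long serialVersionUID = 1L;

  private String url;

  // No-argument constructor: lets a binder create an empty instance
  // reflectively and populate it afterwards through the setters.
  SeedUrlSketch() {
  }

  SeedUrlSketch(String url) {
    this.url = url;
  }

  String getUrl() {
    return url;
  }

  void setUrl(String url) {
    this.url = url;
  }

  // Mimics what a reflective JSON binder does: instantiate, then set fields.
  static SeedUrlSketch bind(String url) {
    try {
      SeedUrlSketch seed =
          SeedUrlSketch.class.getDeclaredConstructor().newInstance();
      seed.setUrl(url);
      return seed;
    } catch (ReflectiveOperationException e) {
      // Without a no-argument constructor, getDeclaredConstructor() fails here.
      throw new IllegalStateException("no usable no-arg constructor", e);
    }
  }
}
```

If the no-argument constructor were removed, `getDeclaredConstructor()` would throw `NoSuchMethodException`, which is the kind of failure the issue describes.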



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Build failed in Jenkins: Nutch-trunk #3272

2015-09-17 Thread Apache Jenkins Server
See 

Changes:

[mattmann] Fix for NUTCH-2098: Add null SeedUrl constructor contributed by Aron 
Ahmadia.

--
[...truncated 14996 lines...]
clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-ajax

deps-test-compile:

compile-test:

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-ajax
[junit] Running 
org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.036 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-basic

deps-test-compile:

compile-test:

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-basic
[junit] Running 
org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 3.531 sec
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.346 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-host

deps-test-compile:

compile-test:

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-host

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-pass

deps-test-compile:

compile-test:

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-pass
[junit] Running 
org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer
[junit] Running 
org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.351 sec
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.713 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-querystring

deps-test-compile:

compile-test:

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-querystring

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-regex

deps-test-compile:

compile-test:

jar:

deps-test:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-regex
[junit] Running 
org.apache.nutch.net.urlnormalizer.querystring.TestQuerystringURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.388 sec
[junit] Running 
org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-slash

deps-test-compile:

compile-test:

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-slash
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.508 sec
[junit] Running 
org.apache.nutch.net.urlnormalizer.slash.TestSlashURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.646 sec

BUILD FAILED
:464: The 
following error occurred while executing this line:
:106: 
The following error occurred while executing this line:
:223:
 Tests failed!

Total time: 14 minutes 0 seconds
Build step 'Invoke Ant' marked build as failure
Publishing Javadoc
[xUnit] [INFO] - Starting to record.
[xUnit] [INFO] - Processing JUnit
[xUnit] [INFO] - [JUnit] - 33 test report file(s) were found with the pattern 
'trunk/build/test/TEST-*.xml' relative to 
' for 

[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-09-17 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14805010#comment-14805010
 ] 

Chris A. Mattmann commented on NUTCH-2011:
--

[~sujenshah] [~asitang]

> Endpoint to support realtime JSON output from the fetcher
> -
>
> Key: NUTCH-2011
> URL: https://issues.apache.org/jira/browse/NUTCH-2011
> Project: Nutch
>  Issue Type: Sub-task
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
>
> This fix will create an endpoint to query the Nutch REST service and get a 
> real-time JSON response of the current/past Fetched URLs. 
> This endpoint also includes pagination of the output to reduce data transfer 
> bw in large crawls.
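
The pagination idea mentioned above can be sketched as follows. This is only an illustration under assumptions: `PageSketch`, `page`, and the zero-based `pageIndex` / `pageSize` parameters are hypothetical names, not the endpoint's actual interface.

```java
import java.util.Collections;
import java.util.List;

class PageSketch {
  // Returns one page of the fetched-URL list; pageIndex is zero-based.
  // An index past the end yields an empty list rather than an error.
  static <T> List<T> page(List<T> items, int pageIndex, int pageSize) {
    if (pageIndex < 0 || pageSize <= 0) {
      throw new IllegalArgumentException("pageIndex >= 0 and pageSize > 0 required");
    }
    long from = (long) pageIndex * pageSize; // long avoids int overflow
    if (from >= items.size()) {
      return Collections.emptyList();
    }
    int to = (int) Math.min((long) items.size(), from + pageSize);
    return items.subList((int) from, to);
  }
}
```

Only the requested slice would be serialized to JSON, so the response size stays bounded by `pageSize` however large the crawl is.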





[jira] [Resolved] (NUTCH-2098) Add null SeedUrl constructor

2015-09-17 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2098.
--
Resolution: Fixed

Thanks [~ahmadia] fixed in trunk!

{noformat}
[mattmann-0420740:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2098: 
Add null SeedUrl constructor contributed by Aron Ahmadia."
Sending        CHANGES.txt
Sending        src/java/org/apache/nutch/service/model/request/SeedUrl.java
Transmitting file data ..
Committed revision 1703745.
[mattmann-0420740:~/tmp/nutch1.11] mattmann% 
{noformat}


> Add null SeedUrl constructor
> 
>
> Key: NUTCH-2098
> URL: https://issues.apache.org/jira/browse/NUTCH-2098
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.10
> Environment: All
>Reporter: Aron Ahmadia
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex, newbie
> Fix For: 1.11
>
> Attachments: 0001-Default-SeedURL-constructor.patch
>
>
> The SeedUrl class currently doesn't provide a null constructor, and therefore 
> can't correctly implement the Serializable interface to instantiate from JSON 
> objects.
> This patch adds a null constructor for the class.





[jira] [Assigned] (NUTCH-2098) Add null SeedUrl constructor

2015-09-17 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2098:


Assignee: Chris A. Mattmann

> Add null SeedUrl constructor
> 
>
> Key: NUTCH-2098
> URL: https://issues.apache.org/jira/browse/NUTCH-2098
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.10
> Environment: All
>Reporter: Aron Ahmadia
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex, newbie
> Fix For: 1.11
>
> Attachments: 0001-Default-SeedURL-constructor.patch
>
>
> The SeedUrl class currently doesn't provide a null constructor, and therefore 
> can't correctly implement the Serializable interface to instantiate from JSON 
> objects.
> This patch adds a null constructor for the class.





[jira] [Work started] (NUTCH-2098) Add null SeedUrl constructor

2015-09-17 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2098 started by Chris A. Mattmann.

> Add null SeedUrl constructor
> 
>
> Key: NUTCH-2098
> URL: https://issues.apache.org/jira/browse/NUTCH-2098
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.10
> Environment: All
>Reporter: Aron Ahmadia
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex, newbie
> Fix For: 1.11
>
> Attachments: 0001-Default-SeedURL-constructor.patch
>
>
> The SeedUrl class currently doesn't provide a null constructor, and therefore 
> can't correctly implement the Serializable interface to instantiate from JSON 
> objects.
> This patch adds a null constructor for the class.





[jira] [Updated] (NUTCH-2098) Add null SeedUrl constructor

2015-09-17 Thread Aron Ahmadia (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aron Ahmadia updated NUTCH-2098:

Labels: memex newbie  (was: newbie)

> Add null SeedUrl constructor
> 
>
> Key: NUTCH-2098
> URL: https://issues.apache.org/jira/browse/NUTCH-2098
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.10
> Environment: All
>Reporter: Aron Ahmadia
>Priority: Minor
>  Labels: memex, newbie
> Fix For: 1.11
>
> Attachments: 0001-Default-SeedURL-constructor.patch
>
>
> The SeedUrl class currently doesn't provide a null constructor, and therefore 
> can't correctly implement the Serializable interface to instantiate from JSON 
> objects.
> This patch adds a null constructor for the class.





[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-09-17 Thread Aron Ahmadia (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14805002#comment-14805002
 ] 

Aron Ahmadia commented on NUTCH-2011:
-

What's the status on the implementation of this endpoint?  This is exactly the 
sort of thing we need for useful visualizations of the crawl in progress.

> Endpoint to support realtime JSON output from the fetcher
> -
>
> Key: NUTCH-2011
> URL: https://issues.apache.org/jira/browse/NUTCH-2011
> Project: Nutch
>  Issue Type: Sub-task
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
>
> This fix will create an endpoint to query the Nutch REST service and get a 
> real-time JSON response of the current/past Fetched URLs. 
> This endpoint also includes pagination of the output to reduce data transfer 
> bw in large crawls.





[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804901#comment-14804901
 ] 

ASF GitHub Bot commented on NUTCH-2099:
---

Github user sujen1412 commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39821072
  
--- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java ---
@@ -261,30 +262,68 @@ public int run(String[] args) throws Exception {
   additionsAllowed = false;
 }
 
-String crawldb = crawlId+"/crawldb";
-String segment_dir = crawlId+"/segments";
-File segmentsDir = new File(segment_dir);
-File[] segmentsList = segmentsDir.listFiles();  
-Arrays.sort(segmentsList, new Comparator<File>(){
-  @Override
-  public int compare(File f1, File f2) {
-if(f1.lastModified()>f2.lastModified())
-  return -1;
-else
-  return 0;
-  }  
-});
+Path crawlDb;
+if(args.containsKey(Nutch.ARG_CRAWLDB)) {
+   Object crawldbPath = args.get(Nutch.ARG_CRAWLDB);
+   if(crawldbPath instanceof Path) {
+   crawlDb = (Path) crawldbPath;
--- End diff --

I am using Eclipse and I did set the formatting; I don't know why this 
happened. I will take care of it from now on. 
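
For reference, the comparator in the removed hunk returns only -1 or 0, never a positive value, so it does not define a consistent ordering. A minimal sketch of a newest-first ordering that honors the `Comparator` contract is shown below. `SegmentOrder` and `sortNewestFirst` are hypothetical names and are not part of the pull request.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

class SegmentOrder {
  // Newest first: Long.compare returns negative, zero, or positive,
  // so the ordering is consistent in both directions.
  static final Comparator<File> NEWEST_FIRST =
      (a, b) -> Long.compare(b.lastModified(), a.lastModified());

  // Sorts a copy so the caller's array is left untouched.
  static File[] sortNewestFirst(File[] segments) {
    File[] copy = segments.clone();
    Arrays.sort(copy, NEWEST_FIRST);
    return copy;
  }
}
```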


> Refactoring the REST endpoints for integration with webui
> -
>
> Key: NUTCH-2099
> URL: https://issues.apache.org/jira/browse/NUTCH-2099
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api, web gui
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.11
>
>
> This PR changes the structure of the arguments in the REST endpoints. Earlier 
> the args were accepted in a Map<String, String> form and now it is 
> Map<String, Object>. This is to allow Wicket to create the proper request 
> objects and send them to NutchServer. 
> With the above, I have also added the metadata required for these services in 
> Nutch metadata.





[GitHub] nutch pull request: Fix for NUTCH-2099 Contributed by Sujen Shah

2015-09-17 Thread sujen1412
Github user sujen1412 commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39821072
  
--- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java ---
@@ -261,30 +262,68 @@ public int run(String[] args) throws Exception {
   additionsAllowed = false;
 }
 
-String crawldb = crawlId+"/crawldb";
-String segment_dir = crawlId+"/segments";
-File segmentsDir = new File(segment_dir);
-File[] segmentsList = segmentsDir.listFiles();  
-Arrays.sort(segmentsList, new Comparator<File>(){
-  @Override
-  public int compare(File f1, File f2) {
-if(f1.lastModified()>f2.lastModified())
-  return -1;
-else
-  return 0;
-  }  
-});
+Path crawlDb;
+if(args.containsKey(Nutch.ARG_CRAWLDB)) {
+   Object crawldbPath = args.get(Nutch.ARG_CRAWLDB);
+   if(crawldbPath instanceof Path) {
+   crawlDb = (Path) crawldbPath;
--- End diff --

I am using Eclipse and I did set the formatting; I don't know why this 
happened. I will take care of it from now on. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804899#comment-14804899
 ] 

ASF GitHub Bot commented on NUTCH-2099:
---

Github user sujen1412 commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39821056
  
--- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java ---
@@ -236,10 +237,10 @@ public int run(String[] args) throws Exception {
* Used for Nutch REST service
*/
   @Override
-  public Map<String, Object> run(Map<String, String> args, String crawlId) throws Exception {
+  public Map<String, Object> run(Map<String, Object> args, String crawlId) throws Exception {
--- End diff --

@lewismc, making it Object allows me to pass multiple inputs in the Map 
args (for example, segments in the updatedb job) as an ArrayList instead of 
parsing a string. This change also makes the 1.x code similar to 2.x and makes 
porting the webui easier, as it expects an Object.
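
A minimal sketch of what an Object-valued argument map allows, assuming the caller may send either a list of segments or a single comma-separated string. `ArgsSketch` and the `"segments"` key are hypothetical; the real argument constants are defined in Nutch itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ArgsSketch {
  // Hypothetical key name, used only for this illustration.
  static final String ARG_SEGMENTS = "segments";

  // Accepts either a List of segment paths or one comma-separated String.
  static List<String> segments(Map<String, Object> args) {
    Object value = args.get(ARG_SEGMENTS);
    List<String> result = new ArrayList<>();
    if (value instanceof List<?>) {
      // Structured input: no string parsing needed.
      for (Object item : (List<?>) value) {
        result.add(String.valueOf(item));
      }
    } else if (value instanceof String) {
      // Legacy input: fall back to splitting the string.
      for (String part : ((String) value).split(",")) {
        String trimmed = part.trim();
        if (!trimmed.isEmpty()) {
          result.add(trimmed);
        }
      }
    }
    return result; // empty when the key is absent
  }
}
```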


> Refactoring the REST endpoints for integration with webui
> -
>
> Key: NUTCH-2099
> URL: https://issues.apache.org/jira/browse/NUTCH-2099
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api, web gui
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.11
>
>
> This PR changes the structure of the arguments in the REST endpoints. Earlier 
> the args were accepted in a Map<String, String> form and now it is 
> Map<String, Object>. This is to allow Wicket to create the proper request 
> objects and send them to NutchServer. 
> With the above, I have also added the metadata required for these services in 
> Nutch metadata.





[GitHub] nutch pull request: Fix for NUTCH-2099 Contributed by Sujen Shah

2015-09-17 Thread sujen1412
Github user sujen1412 commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39821056
  
--- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java ---
@@ -236,10 +237,10 @@ public int run(String[] args) throws Exception {
* Used for Nutch REST service
*/
   @Override
-  public Map<String, Object> run(Map<String, String> args, String crawlId) throws Exception {
+  public Map<String, Object> run(Map<String, Object> args, String crawlId) throws Exception {
--- End diff --

@lewismc, making it Object allows me to pass multiple inputs in the Map 
args (for example, segments in the updatedb job) as an ArrayList instead of 
parsing a string. This change also makes the 1.x code similar to 2.x and makes 
porting the webui easier, as it expects an Object.




[jira] [Created] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter search terms).selenium"

2015-09-17 Thread Asitang Mishra (JIRA)
Asitang Mishra created NUTCH-2110:
-

 Summary: Create the capability to provide seeds in the form of 
"url+xpath(including option to enter search terms).selenium" 
 Key: NUTCH-2110
 URL: https://issues.apache.org/jira/browse/NUTCH-2110
 Project: Nutch
  Issue Type: Sub-task
  Components: fetcher
Affects Versions: 1.10
Reporter: Asitang Mishra


Create the capability to provide seeds in the form of "url+xpath(including 
option to enter search terms).selenium" to be used by selenium protocols/plugins 
as urls/flow to reach a specific ajax based page or save the state of a 
selenium operation for the next fetching round.





[jira] [Created] (NUTCH-2109) Create a brute force click-all-ajax-links utility function for selenium interactive plugin

2015-09-17 Thread Asitang Mishra (JIRA)
Asitang Mishra created NUTCH-2109:
-

 Summary: Create a brute force click-all-ajax-links utility 
function for selenium interactive plugin
 Key: NUTCH-2109
 URL: https://issues.apache.org/jira/browse/NUTCH-2109
 Project: Nutch
  Issue Type: Sub-task
  Components: fetcher
Affects Versions: 1.10
Reporter: Asitang Mishra
Priority: Minor


A function that clicks each ajax link on a page and then concatenates the 
changes to a single string and returns it.





[jira] [Created] (NUTCH-2108) Add a function to the selenium interactive plugin interface to do multiple manipulation of driver and then return the data

2015-09-17 Thread Asitang Mishra (JIRA)
Asitang Mishra created NUTCH-2108:
-

 Summary: Add a function to the selenium interactive plugin 
interface to do multiple manipulation of driver and then return the data
 Key: NUTCH-2108
 URL: https://issues.apache.org/jira/browse/NUTCH-2108
 Project: Nutch
  Issue Type: Sub-task
  Components: fetcher
Affects Versions: 1.10
Reporter: Asitang Mishra
Priority: Minor


In the interactive selenium plugin we have to create a handler class for each 
manipulation of a page. Sometimes we need to manipulate a page in many ways and 
keep track of those manipulations: for example, clicking each link in a table 
and then refreshing to get the original page back, since even one click can make 
all the other links go away. This can be done in a single loop, but it would be 
more work and considerably more complicated using multiple handlers. So I am 
proposing a new function "String multiProcessDriver(WebDriver driver)" that 
takes the driver and returns a concatenated String, alongside the already 
present "void processDriver(WebDriver driver)".
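
The proposal can be sketched roughly as below. To keep the example compilable without Selenium on the classpath, `PageDriver` is a hypothetical stand-in for `WebDriver`, and `linkIds`, `click`, `refresh`, `MultiHandlerSketch`, and `ClickEachLink` are invented for the illustration. Only the two method names `processDriver` and `multiProcessDriver` come from the description above.

```java
import java.util.List;

// Hypothetical stand-in for Selenium's WebDriver (assumption for this sketch).
interface PageDriver {
  List<String> linkIds();      // ids of the clickable links on the page
  String click(String linkId); // page content after clicking the link
  void refresh();              // restore the original page
}

// Handler interface with the existing and the proposed method side by side.
interface MultiHandlerSketch {
  void processDriver(PageDriver driver);

  String multiProcessDriver(PageDriver driver);
}

class ClickEachLink implements MultiHandlerSketch {
  @Override
  public void processDriver(PageDriver driver) {
    // Single manipulation; nothing to do in this sketch.
  }

  @Override
  public String multiProcessDriver(PageDriver driver) {
    StringBuilder out = new StringBuilder();
    // Snapshot the links first, because a click may make the others disappear.
    for (String id : driver.linkIds()) {
      out.append(driver.click(id));
      driver.refresh(); // get the original page back before the next click
    }
    return out.toString();
  }
}
```

The whole click-and-refresh cycle lives in one loop inside one handler, which is the simplification the issue is asking for.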





[jira] [Commented] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X Docker

2015-09-17 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803379#comment-14803379
 ] 

Lewis John McGibbney commented on NUTCH-2050:
-

ACK. We are on it and will have it updated along with the next release, 0.7. 
Will make the update here once it has been done. Thanks for dropping in :)

> Upgrade HBase and Hadoop versioning on 2.X Docker 
> --
>
> Key: NUTCH-2050
> URL: https://issues.apache.org/jira/browse/NUTCH-2050
> Project: Nutch
>  Issue Type: Improvement
>  Components: docker
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
> Attachments: NUTCH-2050.patch
>
>
> We are working on old versioning.
> Let's sort this out.
> 2.X works perfectly with Hadoop 2.X.





[jira] [Commented] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X Docker

2015-09-17 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803346#comment-14803346
 ] 

stack commented on NUTCH-2050:
--

Sounds like we need to update Gora then (smile).

> Upgrade HBase and Hadoop versioning on 2.X Docker 
> --
>
> Key: NUTCH-2050
> URL: https://issues.apache.org/jira/browse/NUTCH-2050
> Project: Nutch
>  Issue Type: Improvement
>  Components: docker
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
> Attachments: NUTCH-2050.patch
>
>
> We are working on old versioning.
> Let's sort this out.
> 2.X works perfectly with Hadoop 2.X.





[jira] [Commented] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X Docker

2015-09-17 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803311#comment-14803311
 ] 

Lewis John McGibbney commented on NUTCH-2050:
-

Hi [~stack] I agree, GORA-443 was logged recently and will deal with this. It 
is just that Gora 0.6.1 supports HBase 0.98.89-hadoop2 so I thought best to get 
Dockerfile here to sync. wdyt?

> Upgrade HBase and Hadoop versioning on 2.X Docker 
> --
>
> Key: NUTCH-2050
> URL: https://issues.apache.org/jira/browse/NUTCH-2050
> Project: Nutch
>  Issue Type: Improvement
>  Components: docker
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
> Attachments: NUTCH-2050.patch
>
>
> We are working on old versioning.
> Let's sort this out.
> 2.X works perfectly with Hadoop 2.X.





[jira] [Commented] (NUTCH-2050) Upgrade HBase and Hadoop versioning on 2.X Docker

2015-09-17 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803219#comment-14803219
 ] 

stack commented on NUTCH-2050:
--

Why not go to hbase-1.x rather than hbase-0.98.x [~lewismc]? It's been out for 
a good while now. Thanks.

> Upgrade HBase and Hadoop versioning on 2.X Docker 
> --
>
> Key: NUTCH-2050
> URL: https://issues.apache.org/jira/browse/NUTCH-2050
> Project: Nutch
>  Issue Type: Improvement
>  Components: docker
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
> Attachments: NUTCH-2050.patch
>
>
> We are working on old versioning.
> Let's sort this out.
> 2.X works perfectly with Hadoop 2.X.





[jira] [Commented] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-17 Thread Kim Whitehall (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803155#comment-14803155
 ] 

Kim Whitehall commented on NUTCH-2104:
--

Yeap [~lewismc], I'll turn it round asap. 

> Add documentation to the protocol-selenium plugin Readme file re: selenium 
> grid implementation
> --
>
> Key: NUTCH-2104
> URL: https://issues.apache.org/jira/browse/NUTCH-2104
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>Priority: Trivial
>
> Adding some documentation to the protocol-selenium Readme file with regards 
> to advice on using the selenium grid. Namely:
> (1) parameters to set for optimization of the grid 
> (2) pitfalls to beware of when using the grid 





[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch

Wrong default in code was used for markOrphanAfter. Config is ok


> Automatically remove orphaned pages
> ---
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch
>
>
> Orphan scoring filter that determines whether a page has become orphaned, 
> i.e. it has no other pages linking to it anymore. If a page hasn't been linked 
> to after markGoneAfter seconds, the page is marked as gone and is then 
> removed by an indexer.  If a page hasn't been linked to after markOrphanAfter 
> seconds, the page is removed from the CrawlDB.
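
The two thresholds described above can be sketched as a small classification step. This is an illustration only: `OrphanSketch`, `State`, and `classify` are hypothetical names, the use of `long` seconds is this sketch's choice rather than the patch's, and it assumes `markOrphanAfter` is at least as large as `markGoneAfter`.

```java
class OrphanSketch {
  enum State { LINKED, GONE, ORPHAN }

  // All times are in seconds. lastInlinkSec is when the page was last
  // seen with an inlink; nowSec is the current time.
  static State classify(long nowSec, long lastInlinkSec,
                        long markGoneAfter, long markOrphanAfter) {
    long unlinkedFor = nowSec - lastInlinkSec;
    if (unlinkedFor > markOrphanAfter) {
      return State.ORPHAN; // would be removed from the CrawlDB
    }
    if (unlinkedFor > markGoneAfter) {
      return State.GONE;   // would be marked gone and removed by an indexer
    }
    return State.LINKED;
  }
}
```

The orphan check comes first so that a page past both thresholds is reported as orphaned rather than merely gone.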





[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch

Fixed bad long to int casting.

> Automatically remove orphaned pages
> ---
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch
>
>
> Orphan scoring filter that determines whether a page has become orphaned, 
> i.e. it has no other pages linking to it anymore. If a page hasn't been linked 
> to after markGoneAfter seconds, the page is marked as gone and is then 
> removed by an indexer.  If a page hasn't been linked to after markOrphanAfter 
> seconds, the page is removed from the CrawlDB.





[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch

Uh, using long over int for time keeping makes no sense. Relies on int now.

> Automatically remove orphaned pages
> ---
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch
>
>
> Orphan scoring filter that determines whether a page has become orphaned, 
> i.e. it has no other pages linking to it anymore. If a page hasn't been linked 
> to after markGoneAfter seconds, the page is marked as gone and is then 
> removed by an indexer.  If a page hasn't been linked to after markOrphanAfter 
> seconds, the page is removed from the CrawlDB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2107) plugin.xml to validate against plugin.dtd

2015-09-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2107:
---
Attachment: NUTCH-2107.patch

Patch for trunk and 2.x. The validation error in lib-selenium's plugin.xml is 
fixed in NUTCH-2106.

> plugin.xml to validate against plugin.dtd
> -
>
> Key: NUTCH-2107
> URL: https://issues.apache.org/jira/browse/NUTCH-2107
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 2.3, 1.10, 1.11
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 2.4, 1.12
>
> Attachments: NUTCH-2107.patch
>
>
> Some of the plugin.xml do not validate against the plugin.dtd:
> {noformat}
> % xmllint --noout --dtdvalid ./src/plugin/plugin.dtd 
> src/plugin/urlnormalizer-regex/plugin.xml
> src/plugin/urlnormalizer-regex/plugin.xml:30: element requires: validity 
> error : Element requires content does not follow the DTD, expecting 
> (import)+, got (include )
> src/plugin/urlnormalizer-regex/plugin.xml:31: element include: validity error 
> : No declaration for element include
> src/plugin/urlnormalizer-regex/plugin.xml:31: element include: validity error 
> : No declaration for attribute file of element include
> Document src/plugin/urlnormalizer-regex/plugin.xml does not validate against 
> ./src/plugin/plugin.dtd
> % ...
> src/plugin/subcollection/plugin.xml:22: element plugin: validity error : 
> Element plugin content does not follow the DTD, expecting (runtime? , 
> requires? , extension-point* , extension*), got (requires runtime extension )
> % ...
> src/plugin/lib-selenium/plugin.xml:76: element requires: validity error : 
> Element requires content does not follow the DTD, expecting (import)+, got 
> (library library )
> src/plugin/lib-selenium/plugin.xml:80: element library: validity error : 
> Element library content does not follow the DTD, expecting (export)*, got 
> (export exclude )
> src/plugin/lib-selenium/plugin.xml:82: element exclude: validity error : No 
> declaration for element exclude
> src/plugin/lib-selenium/plugin.xml:82: element exclude: validity error : No 
> declaration for attribute name of element exclude
> {noformat}





[jira] [Updated] (NUTCH-2106) Runtime to contain Selenium and dependencies only once

2015-09-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2106:
---
Attachment: NUTCH-2106.patch

> Runtime to contain Selenium and dependencies only once
> --
>
> Key: NUTCH-2106
> URL: https://issues.apache.org/jira/browse/NUTCH-2106
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.11
>Reporter: Sebastian Nagel
> Fix For: 1.11
>
> Attachments: NUTCH-2106.patch
>
>
> All Selenium-based plugins contain the same dependent jars, which 
> significantly increases the size of the runtime and bin packages:
> {noformat}
> % du -hs runtime/local/plugins/*selenium/ runtime/deploy/*.job
> 25M runtime/local/plugins/lib-selenium/
> 25M runtime/local/plugins/protocol-interactiveselenium/
> 25M runtime/local/plugins/protocol-selenium/
> 182M runtime/deploy/apache-nutch-1.11-SNAPSHOT.job
> {noformat}
> Since all plugins depend on the same Selenium version we could bundle the 
> dependencies in lib-selenium and let the other plugins load it from there:
> - let lib-selenium export all dependent libs, e.g.:
> {code:xml|title=lib-selenium/plugin.xml}
> 
>   ...
>   
> 
>   
> {code}
> - both protocol plugins already import lib-selenium: the dependencies in 
> ivy.xml can be removed
> As expected, these changes make the runtime smaller:
> {noformat}
> 25M runtime/local/plugins/lib-selenium/
> 20K runtime/local/plugins/protocol-interactiveselenium/
> 16K runtime/local/plugins/protocol-selenium/
> 138M runtime/deploy/apache-nutch-1.11-SNAPSHOT.job
> {noformat}
> Open points:
> - I've tested only protocol-selenium using chromedriver. Should also test 
> protocol-interactiveselenium?
> - What about phantomjsdriver-1.2.1.jar? It was contained in lib-selenium and 
> protocol-selenium but not protocol-interactiveselenium. Is there a reason for 
> this?





[jira] [Created] (NUTCH-2107) plugin.xml to validate against plugin.dtd

2015-09-17 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2107:
--

 Summary: plugin.xml to validate against plugin.dtd
 Key: NUTCH-2107
 URL: https://issues.apache.org/jira/browse/NUTCH-2107
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.10, 2.3, 1.11
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 2.4, 1.12


Some of the plugin.xml files do not validate against the plugin.dtd:

{noformat}
% xmllint --noout --dtdvalid ./src/plugin/plugin.dtd 
src/plugin/urlnormalizer-regex/plugin.xml
src/plugin/urlnormalizer-regex/plugin.xml:30: element requires: validity error 
: Element requires content does not follow the DTD, expecting (import)+, got 
(include )
src/plugin/urlnormalizer-regex/plugin.xml:31: element include: validity error : 
No declaration for element include
src/plugin/urlnormalizer-regex/plugin.xml:31: element include: validity error : 
No declaration for attribute file of element include
Document src/plugin/urlnormalizer-regex/plugin.xml does not validate against 
./src/plugin/plugin.dtd

% ...
src/plugin/subcollection/plugin.xml:22: element plugin: validity error : 
Element plugin content does not follow the DTD, expecting (runtime? , requires? 
, extension-point* , extension*), got (requires runtime extension )

% ...
src/plugin/lib-selenium/plugin.xml:76: element requires: validity error : 
Element requires content does not follow the DTD, expecting (import)+, got 
(library library )
src/plugin/lib-selenium/plugin.xml:80: element library: validity error : 
Element library content does not follow the DTD, expecting (export)*, got 
(export exclude )
src/plugin/lib-selenium/plugin.xml:82: element exclude: validity error : No 
declaration for element exclude
src/plugin/lib-selenium/plugin.xml:82: element exclude: validity error : No 
declaration for attribute name of element exclude
{noformat}






[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1932:
-
Attachment: NUTCH-1932.patch

Updated patch. CrawlDatum now supports Jexl expressions on Long types, the type 
used by this scoring filter.

> Automatically remove orphaned pages
> ---
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch
>
>
> Orphan scoring filter that determines whether a page has become orphaned, 
> i.e. no other pages link to it any more. If a page hasn't been linked 
> to after markGoneAfter seconds, the page is marked as gone and is then 
> removed by an indexer.  If a page hasn't been linked to after markOrphanAfter 
> seconds, the page is removed from the CrawlDB.
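The two thresholds described above can be read as a simple three-way classification. The following is only a rough sketch of that reading, not the code in the attached patch: the class name OrphanSketch, the Status enum and the classify method are hypothetical, and it assumes markGoneAfter is not larger than markOrphanAfter and that all times are in seconds.

```java
// Hypothetical sketch; none of these names come from the NUTCH-1932 patch.
final class OrphanSketch {

  enum Status { LINKED, GONE, ORPHAN }

  /**
   * Classifies a page by the time elapsed since it was last linked to.
   * Assumes markGoneAfter <= markOrphanAfter; all values are in seconds.
   */
  static Status classify(long lastInlinkSec, long nowSec,
                         long markGoneAfter, long markOrphanAfter) {
    long age = nowSec - lastInlinkSec;
    if (age > markOrphanAfter) {
      return Status.ORPHAN;   // candidate for removal from the CrawlDB
    }
    if (age > markGoneAfter) {
      return Status.GONE;     // marked as gone, later removed by an indexer
    }
    return Status.LINKED;     // still considered linked
  }

  public static void main(String[] args) {
    long day = 24L * 60 * 60;
    // Thresholds of 30 and 40 days are illustrative values only.
    System.out.println(classify(0, 10 * day, 30 * day, 40 * day));
    System.out.println(classify(0, 35 * day, 30 * day, 40 * day));
    System.out.println(classify(0, 45 * day, 30 * day, 40 * day));
  }
}
```

Checking the larger threshold first keeps the two cases from overlapping when a page has passed both limits.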





[jira] [Created] (NUTCH-2106) Runtime to contain Selenium and dependencies only once

2015-09-17 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2106:
--

 Summary: Runtime to contain Selenium and dependencies only once
 Key: NUTCH-2106
 URL: https://issues.apache.org/jira/browse/NUTCH-2106
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.11
Reporter: Sebastian Nagel
 Fix For: 1.11


All Selenium-based plugins contain the same dependent jars, which 
significantly increases the size of the runtime and bin packages:
{noformat}
% du -hs runtime/local/plugins/*selenium/ runtime/deploy/*.job
25M runtime/local/plugins/lib-selenium/
25M runtime/local/plugins/protocol-interactiveselenium/
25M runtime/local/plugins/protocol-selenium/
182M runtime/deploy/apache-nutch-1.11-SNAPSHOT.job
{noformat}

Since all plugins depend on the same Selenium version we could bundle the 
dependencies in lib-selenium and let the other plugins load it from there:
- let lib-selenium export all dependent libs, e.g.:
{code:xml|title=lib-selenium/plugin.xml}

  ...
  

  
{code}
- both protocol plugins already import lib-selenium: the dependencies in 
ivy.xml can be removed

As expected, these changes make the runtime smaller:
{noformat}
25M runtime/local/plugins/lib-selenium/
20K runtime/local/plugins/protocol-interactiveselenium/
16K runtime/local/plugins/protocol-selenium/
138M runtime/deploy/apache-nutch-1.11-SNAPSHOT.job
{noformat}

Open points:
- I've tested only protocol-selenium using chromedriver. Should also test 
protocol-interactiveselenium?
- What about phantomjsdriver-1.2.1.jar? It was contained in lib-selenium and 
protocol-selenium but not protocol-interactiveselenium. Is there a reason for 
this?






[jira] [Updated] (NUTCH-1169) Write JUnit tests for urlfilter-prefix

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1169:

Fix Version/s: (was: 2.4)
   2.3.1

> Write JUnit tests for urlfilter-prefix
> --
>
> Key: NUTCH-1169
> URL: https://issues.apache.org/jira/browse/NUTCH-1169
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Talat UYARER
>  Labels: test
> Fix For: 2.3.1
>
>
> This issue should provide a single JUnit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Reopened] (NUTCH-1286) Refactoring/reimplementing crawling API (NutchApp)

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reopened NUTCH-1286:
-

> Refactoring/reimplementing crawling API (NutchApp)
> --
>
> Key: NUTCH-1286
> URL: https://issues.apache.org/jira/browse/NUTCH-1286
> Project: Nutch
>  Issue Type: Improvement
>  Components: administration gui, REST_api, web gui
>Reporter: Ferdy Galema
>  Labels: gsoc2014
> Fix For: 2.3.1
>
>
> This issue is to track changes we (Mathijs and I) have planned for the API 
> and webapp in Nutchgora. We have a pretty good idea of how we want to be 
> using the crawl API. It may involve some major refactoring or perhaps a side 
> implementation next to the current NutchApp functionality. It depends on how 
> much we can reuse the existing components. The bottom line is that there will 
> be a strictly defined Java API that provides everything from 
> crawling/indexing to job control (listing jobs, tracking progress and 
> aborting jobs being part of it). There will be no server or service for 
> tracking crawling states; all will be persisted one way or the other and 
> queryable from the API. The REST server shall be a very thin layer on top of 
> the Java implementation. A rich web interface will be a very easy layer too, 
> once we have a cleanly (but extensively) defined API. But we will start by making 
> the API usable from a simple command-line interface.
> More details will be provided later on. Feel free to comment if you have 
> suggestions/questions.





[jira] [Updated] (NUTCH-1286) Refactoring/reimplementing crawling API (NutchApp)

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1286:

Fix Version/s: (was: 2.4)
   2.3.1

> Refactoring/reimplementing crawling API (NutchApp)
> --
>
> Key: NUTCH-1286
> URL: https://issues.apache.org/jira/browse/NUTCH-1286
> Project: Nutch
>  Issue Type: Improvement
>  Components: administration gui, REST_api, web gui
>Reporter: Ferdy Galema
>  Labels: gsoc2014
> Fix For: 2.3.1
>
>
> This issue is to track changes we (Mathijs and I) have planned for the API 
> and webapp in Nutchgora. We have a pretty good idea of how we want to be 
> using the crawl API. It may involve some major refactoring or perhaps a side 
> implementation next to the current NutchApp functionality. It depends on how 
> much we can reuse the existing components. The bottom line is that there will 
> be a strictly defined Java API that provides everything from 
> crawling/indexing to job control (listing jobs, tracking progress and 
> aborting jobs being part of it). There will be no server or service for 
> tracking crawling states; all will be persisted one way or the other and 
> queryable from the API. The REST server shall be a very thin layer on top of 
> the Java implementation. A rich web interface will be a very easy layer too, 
> once we have a cleanly (but extensively) defined API. But we will start by making 
> the API usable from a simple command-line interface.
> More details will be provided later on. Feel free to comment if you have 
> suggestions/questions.





[jira] [Updated] (NUTCH-1169) Write JUnit tests for urlfilter-prefix

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1169:

Assignee: Talat UYARER

> Write JUnit tests for urlfilter-prefix
> --
>
> Key: NUTCH-1169
> URL: https://issues.apache.org/jira/browse/NUTCH-1169
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Talat UYARER
>  Labels: test
> Fix For: 2.4
>
>
> This issue should provide a single JUnit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Reopened] (NUTCH-1936) GSoC 2015 - Move Nutch to Hadoop 2.X

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reopened NUTCH-1936:
-

> GSoC 2015 - Move Nutch to Hadoop 2.X
> 
>
> Key: NUTCH-1936
> URL: https://issues.apache.org/jira/browse/NUTCH-1936
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: gsoc2015
> Fix For: 1.11, 2.3.1
>
> Attachments: NUTCH-1939.patch
>
>
> The Nutch PMC 
> [discussed|http://www.mail-archive.com/dev%40nutch.apache.org/msg16250.html] 
> ideas for a good 2015 GSoC project. It appears that porting the (trunk) 
> codebase to [Hadoop 2.X|http://hadoop.apache.org/docs/stable/] is an 
> attractive option and one which would present an excellent learning 
> experience for a summer student.
> A more comprehensive description of this issue should be included within 
> either a mentor-defined project description or a successful student 
> application.





[jira] [Updated] (NUTCH-1936) GSoC 2015 - Move Nutch to Hadoop 2.X

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1936:

Fix Version/s: (was: 2.4)
   2.3.1

> GSoC 2015 - Move Nutch to Hadoop 2.X
> 
>
> Key: NUTCH-1936
> URL: https://issues.apache.org/jira/browse/NUTCH-1936
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: gsoc2015
> Fix For: 1.11, 2.3.1
>
> Attachments: NUTCH-1939.patch
>
>
> The Nutch PMC 
> [discussed|http://www.mail-archive.com/dev%40nutch.apache.org/msg16250.html] 
> ideas for a good 2015 GSoC project. It appears that porting the (trunk) 
> codebase to [Hadoop 2.X|http://hadoop.apache.org/docs/stable/] is an 
> attractive option and one which would present an excellent learning 
> experience for a summer student.
> A more comprehensive description of this issue should be included within 
> either a mentor-defined project description or a successful student 
> application.





[jira] [Closed] (NUTCH-1936) GSoC 2015 - Move Nutch to Hadoop 2.X

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-1936.
---
Resolution: Fixed

> GSoC 2015 - Move Nutch to Hadoop 2.X
> 
>
> Key: NUTCH-1936
> URL: https://issues.apache.org/jira/browse/NUTCH-1936
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: gsoc2015
> Fix For: 1.11, 2.3.1
>
> Attachments: NUTCH-1939.patch
>
>
> The Nutch PMC 
> [discussed|http://www.mail-archive.com/dev%40nutch.apache.org/msg16250.html] 
> ideas for a good 2015 GSoC project. It appears that porting the (trunk) 
> codebase to [Hadoop 2.X|http://hadoop.apache.org/docs/stable/] is an 
> attractive option and one which would present an excellent learning 
> experience for a summer student.
> A more comprehensive description of this issue should be included within 
> either a mentor-defined project description or a successful student 
> application.





[jira] [Updated] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1990:

Fix Version/s: 2.3.1

> Use URI.normalise() in BasicURLNormalizer
> -
>
> Key: NUTCH-1990
> URL: https://issues.apache.org/jira/browse/NUTCH-1990
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.9
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.10, 2.3.1
>
> Attachments: NUTCH-1990-trial1.patch, NUTCH-1990-v1.patch
>
>
> One of the things that 
> [BasicURLNormalizer|https://github.com/apache/nutch/blob/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java]
>  does is to remove unnecessary dot segments in the path.
> Instead of implementing the logic ourselves with some antiquated regex 
> library, we should simply use 
> [http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()] 
> which does the same and is probably more efficient.
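As a quick illustration of the proposal, the snippet below shows java.net.URI.normalize() removing "." and ".." segments from a path. It is a minimal sketch and not the code in the attached patches: the class DotSegmentSketch, the method removeDotSegments and the example URL are made up for illustration.

```java
import java.net.URI;

// Illustrative only; this is not the BasicURLNormalizer implementation.
final class DotSegmentSketch {

  /** Removes "." and ".." path segments using the JDK's own URI normalization. */
  static String removeDotSegments(String url) {
    // URI.create throws IllegalArgumentException for a malformed URL.
    return URI.create(url).normalize().toString();
  }

  public static void main(String[] args) {
    System.out.println(removeDotSegments("http://example.com/a/./b/../c.html"));
  }
}
```

Note that URI.create rejects strings that are not valid URIs, so a real normalizer would need to decide how to handle URLs that fail to parse.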





[jira] [Updated] (NUTCH-1062) Migrate BasicURLNormalizer from Apache ORO to java.util.regex

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1062:

Fix Version/s: (was: 2.4)
   2.3.1

> Migrate BasicURLNormalizer from Apache ORO to java.util.regex
> -
>
> Key: NUTCH-1062
> URL: https://issues.apache.org/jira/browse/NUTCH-1062
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.10, 2.3.1
>
>
> Issue for migration from ORO to j.u.regex. There is a small problem here. I 
> began the migration mostly because of the double slash issue, using lookbehind, 
> which is not supported in ORO. This was to prevent the URL scheme's "//" from being 
> reduced to one slash. The current Basic URL Normalizer has this problem 
> built-in!
> {code}
> // this pattern tries to find spots like "xx//yy" in the url,
> // which could be replaced by a "/"
> adjacentSlashRule = new Rule();
> adjacentSlashRule.pattern = (Perl5Pattern)  
>   compiler.compile("/{2,}", Perl5Compiler.READ_ONLY_MASK); 
> adjacentSlashRule.substitution = new Perl5Substitution("/");
> {code}
> But this provides the wrong solution, as it touches the scheme as well. What to do? 
> Migrate to j.u.regex and keep this `feature` intact? 
> edit: reading more, it looks like it is being fixed at a later stage. A slash 
> is added for the URI schemes http & ftp.
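For comparison, here is one way the adjacent-slash rule could look with java.util.regex and a negative lookbehind, so that the "//" following the scheme is left alone. This is a sketch under assumptions, not the migrated normalizer: the class AdjacentSlashSketch, the method collapse and the exact pattern are illustrative choices, and the pattern only guards slash runs that directly follow a ':' character.

```java
import java.util.regex.Pattern;

// Illustrative only; the real rule set in BasicURLNormalizer may differ.
final class AdjacentSlashSketch {

  // "(?<!:)" is a negative lookbehind: a run of slashes is skipped when the
  // character directly before it is ':', which protects "http://" and "ftp://".
  private static final Pattern ADJACENT_SLASHES = Pattern.compile("(?<!:)/{2,}");

  /** Replaces runs of two or more slashes with one, except right after ':'. */
  static String collapse(String url) {
    return ADJACENT_SLASHES.matcher(url).replaceAll("/");
  }

  public static void main(String[] args) {
    System.out.println(collapse("http://example.com//a///b.html"));
  }
}
```

With this pattern no separate step is needed to restore the slash after the scheme, although unusual inputs (for example more than two slashes after the scheme) would still need their own handling.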





[jira] [Updated] (NUTCH-1893) Parse-tika fails to parse feed files

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1893:

Fix Version/s: (was: 2.4)
   2.3.1

> Parse-tika fails to parse feed files
> 
>
> Key: NUTCH-1893
> URL: https://issues.apache.org/jira/browse/NUTCH-1893
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3, 1.9
> Environment: Windows 7 + Cygwin + JDK 7
>Reporter: Mengying Wang
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.10, 2.3.1
>
> Attachments: NUTCH-1893-v1.patch, NUTCH-1893.mywang.141209.txt
>
>
> In the Nutch parse step, I received the following error. It seems the 
> parse-tika plugin is broken. 
> $ /cygdrive/d/nutch_trunk/runtime/local/bin/nutch parse -D 
> mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
> mapred.reduce.tasks.speculative.execution=false -D 
> mapred.map.tasks.speculative.execution=false -D 
> mapred.compress.map.output=true -D mapred.skip.attempts.to.start.skipping=2 
> -D mapred.skip.map.max.skip.records=1 crawlId/segments/20141118235323
> java.lang.ExceptionInInitializerError
>   at com.sun.syndication.io.SyndFeedInput.build(SyndFeedInput.java:136)
>   at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:70)
>   at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:103)
>   at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
>   at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:101)
>   at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Caused by: java.lang.NullPointerException
>   at java.util.Properties$LineReader.readLine(Properties.java:434)
>   at java.util.Properties.load0(Properties.java:353)
>   at java.util.Properties.load(Properties.java:341)
>   at 
> com.sun.syndication.io.impl.PropertiesLoader.(PropertiesLoader.java:74)
>   at 
> com.sun.syndication.io.impl.PropertiesLoader.getPropertiesLoader(PropertiesLoader.java:46)
>   at 
> com.sun.syndication.io.impl.PluginManager.(PluginManager.java:54)
>   at 
> com.sun.syndication.io.impl.PluginManager.(PluginManager.java:46)
>   at 
> com.sun.syndication.feed.synd.impl.Converters.(Converters.java:40)
>   at 
> com.sun.syndication.feed.synd.SyndFeedImpl.(SyndFeedImpl.java:59)
>   ... 10 more 





[jira] [Updated] (NUTCH-1886) Review and update default.properties

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1886:

Fix Version/s: (was: 2.4)
   2.3.1

> Review and update default.properties
> 
>
> Key: NUTCH-1886
> URL: https://issues.apache.org/jira/browse/NUTCH-1886
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.10, 2.3.1
>
>
> Right now default.properties contains all sorts of outdated garbage. We need 
> to review and update where necessary. I don't think that this is a major 
> issue and I doubt the patches will be significant.





[jira] [Updated] (NUTCH-1941) Optional rolling http.agent.name's

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1941:

Fix Version/s: (was: 2.4)
   2.3.1

> Optional rolling http.agent.name's
> --
>
> Key: NUTCH-1941
> URL: https://issues.apache.org/jira/browse/NUTCH-1941
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, protocol
>Affects Versions: 2.3, 1.9
>Reporter: Lewis John McGibbney
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.10, 2.3.1
>
> Attachments: NUTCH-1941-2x-v6.patch, NUTCH-1941-ITR2.patch, 
> NUTCH-1941-itr3.patch, NUTCH-1941-itr4.patch, NUTCH-1941-v5.patch, 
> NUTCH-1941-ver1.patch, NUTCH-1941-ver6.patch, agent.names.txt, nutch.patch
>
>
> In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins 
> can block your fetcher based merely on your crawler name. 
> I propose the ability to implement rolling http.agent.name's which could be 
> substituted every 5 seconds for example. This would mean that successive 
> requests to the same domain would be sent with different http.agent.name. 
> This behavior should be off by default.
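One possible shape for such a rotation is to pick the agent name from a list based on the current time window. The sketch below is purely hypothetical and does not reflect the attached patches: the class RollingAgentSketch, its constructor arguments and the agentAt method are invented for illustration, and it assumes a non-negative clock value in milliseconds.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the NUTCH-1941 implementation.
final class RollingAgentSketch {

  private final List<String> names;
  private final long rotateMillis;

  RollingAgentSketch(List<String> names, long rotateMillis) {
    if (names.isEmpty() || rotateMillis <= 0) {
      throw new IllegalArgumentException("need at least one name and a positive interval");
    }
    this.names = names;
    this.rotateMillis = rotateMillis;
  }

  /** Returns the agent name for the time window containing nowMillis (assumed >= 0). */
  String agentAt(long nowMillis) {
    int index = (int) ((nowMillis / rotateMillis) % names.size());
    return names.get(index);
  }

  public static void main(String[] args) {
    // Agent names and the 5 second interval are example values only.
    RollingAgentSketch rolling =
        new RollingAgentSketch(Arrays.asList("agent-a", "agent-b", "agent-c"), 5000L);
    System.out.println(rolling.agentAt(System.currentTimeMillis()));
  }
}
```

Deriving the name from the clock keeps the selection stateless, so every fetcher thread sees the same name within a window without any shared counter.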





[jira] [Updated] (NUTCH-1920) Upgrade Nutch to use Java 1.7

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1920:

Fix Version/s: (was: 2.4)
   2.3.1

> Upgrade Nutch to use Java 1.7
> -
>
> Key: NUTCH-1920
> URL: https://issues.apache.org/jira/browse/NUTCH-1920
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 2.3, 1.10
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.10, 2.3.1
>
> Attachments: NUTCH-1920-trunk.patch
>
>
> In order to build the Nutch Javadoc securely, we need Java version 7u25 
> or later. See NUTCH-1590.
> indexer-elastic also requires JDK 1.7 in order to compile.
> We should make the upgrade and state support for Java 1.7 based on the 
> following announcement from Oracle:
> {code}
> End of Public Updates for Oracle JDK 7
> The April 2015 CPU release will be the last Oracle JDK 7 publicly available 
> update. For more information, and details on how to receive longer term 
> support for Oracle JDK 7, please see the Oracle Java SE Support Roadmap. 
> {code}





[jira] [Updated] (NUTCH-1981) Upgrade icu4j

2015-09-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1981:

Fix Version/s: (was: 2.4)
   2.3.1

> Upgrade icu4j
> -
>
> Key: NUTCH-1981
> URL: https://issues.apache.org/jira/browse/NUTCH-1981
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 2.3, 1.9
>Reporter: Marko Asplund
> Fix For: 1.10, 2.3.1
>
> Attachments: NUTCH-1981-2.x.patch, NUTCH-1981-trunk.patch
>
>
> The icu4j version from 2009 is causing some compatibility issues with custom 
> plugins we're developing. Please upgrade to a more recent version.
> I'm attaching a patch to this issue. Nutch builds and all tests pass without 
> source code changes.





[jira] [Commented] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-17 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791680#comment-14791680
 ] 

Lewis John McGibbney commented on NUTCH-2104:
-

Hi [~kwhitehall] if you think you can get this in pretty soon then we could 
include it within the 1.10 release.
wdyt?

> Add documentation to the protocol-selenium plugin Readme file re: selenium 
> grid implementation
> --
>
> Key: NUTCH-2104
> URL: https://issues.apache.org/jira/browse/NUTCH-2104
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>Priority: Trivial
>
> Adding some documentation to the protocol-selenium Readme file with regards 
> to advice on using the selenium grid. Namely:
> (1) parameters to set for optimization of the grid 
> (2) pitfalls to beware of when using the grid 


