date:20150918

[jira] [Commented] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876908#comment-14876908
 ] 

Hudson commented on NUTCH-2104:
---

SUCCESS: Integrated in Nutch-trunk #3274 (See 
[https://builds.apache.org/job/Nutch-trunk/3274/])
Fix for NUTCH-2104: Add documentation to the protocol-selenium plugin Readme 
file re: selenium grid implementation contributed by Kim Whitehall 
 this closes #60. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1703944)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/protocol-selenium/README.md


> Add documentation to the protocol-selenium plugin Readme file re: selenium 
> grid implementation
> --
>
> Key: NUTCH-2104
> URL: https://issues.apache.org/jira/browse/NUTCH-2104
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin, protocol
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: memex
> Fix For: 1.11
>
>
> Adding some documentation to the protocol-selenium Readme file with regards 
> to advice on using the selenium grid. Namely:
> (1) parameters to set for optimization of the grid 
> (2) pitfalls to beware of when using the grid 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876907#comment-14876907
 ] 

Hudson commented on NUTCH-2099:
---

SUCCESS: Integrated in Nutch-trunk #3274 (See 
[https://builds.apache.org/job/Nutch-trunk/3274/])
Fix for NUTCH-2099: Refactoring the REST endpoints for integration with webui 
contributed by Sujen Shah  this closes #59. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1703941)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/DeduplicationJob.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingJob.java
* /nutch/trunk/src/java/org/apache/nutch/metadata/Nutch.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java
* /nutch/trunk/src/java/org/apache/nutch/service/model/request/JobConfig.java
* /nutch/trunk/src/java/org/apache/nutch/service/model/response/JobInfo.java
* /nutch/trunk/src/java/org/apache/nutch/util/NutchTool.java


> Refactoring the REST endpoints for integration with webui
> -
>
> Key: NUTCH-2099
> URL: https://issues.apache.org/jira/browse/NUTCH-2099
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api, web gui
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> This PR changes the structure of the arguments in the REST endpoints. Earlier 
> the args were accepted in a Map form and now it is 
> Map. This is to allow Wicket to create the proper requests 
> objects and send it to NutchServer. 
> With the above, I have also added the metadata required for these services in 
> Nutch metadata.
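
For illustration, the change of argument structure described above could look roughly like the following sketch. The archived message has lost the generic type parameters, so the string-valued "before" form and Object-valued "after" form used here are assumptions, and `ArgsSketch`, `widen`, `intArg` and the keys `topN` and `depth` are hypothetical names, not Nutch code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: the type parameters are assumptions, because the archived
// message lost them. ArgsSketch is a hypothetical helper, not part of Nutch.
final class ArgsSketch {

    // Widen string-valued arguments into an Object-valued map so that a caller
    // (for example a web UI) can later add non-string values.
    static Map<String, Object> widen(Map<String, String> legacyArgs) {
        Map<String, Object> args = new HashMap<>();
        if (legacyArgs != null) {
            args.putAll(legacyArgs);
        }
        return args;
    }

    // Read an integer argument that may arrive either as a Number or as text.
    static int intArg(Map<String, Object> args, String key, int defaultValue) {
        Object value = args.get(key);
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        if (value instanceof String) {
            return Integer.parseInt(((String) value).trim());
        }
        return defaultValue;
    }
}
```

A tool that receives such a map would need to tolerate both representations of a value, which is why `intArg` accepts a `Number` as well as a `String`.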




[jira] [Commented] (NUTCH-2094) Stopping and Restarting a crawl has issues in the Web UI

2015-09-18 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876902#comment-14876902
 ] 

Lewis John McGibbney commented on NUTCH-2094:
-

Grand, thanks Chris for committing to the 2.x branch. We are really close to
the release now and this patch is good.




-- 
*Lewis*


> Stopping and Restarting a crawl has issues in the Web UI
> 
>
> Key: NUTCH-2094
> URL: https://issues.apache.org/jira/browse/NUTCH-2094
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Reporter: Prerna Satija
>Assignee: Chris A. Mattmann
> Fix For: 2.4
>
>
> I have created a stop button in Nutch webapp to stop a running crawl from the 
> UI on click of a "stop" button. While testing, I found that I am able to stop 
> a crawl successfully but when I restart a stopped crawl and try to stop it, 
> it doesn't stop. 




[jira] [Commented] (NUTCH-2094) Stopping and Restarting a crawl has issues in the Web UI

2015-09-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876901#comment-14876901
 ] 

Hudson commented on NUTCH-2094:
---

SUCCESS: Integrated in Nutch-nutchgora #1536 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1536/])
Fix for NUTCH-2094: Stopping and Restarting a crawl has issues in the Web UI 
contributed by Prerna Satija  this closes #58. (mattmann: 
http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1703942)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/api/impl/NutchServerPoolExecutor.java


> Stopping and Restarting a crawl has issues in the Web UI
> 
>
> Key: NUTCH-2094
> URL: https://issues.apache.org/jira/browse/NUTCH-2094
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Reporter: Prerna Satija
>Assignee: Chris A. Mattmann
> Fix For: 2.4
>
>
> I have created a stop button in Nutch webapp to stop a running crawl from the 
> UI on click of a "stop" button. While testing, I found that I am able to stop 
> a crawl successfully but when I restart a stopped crawl and try to stop it, 
> it doesn't stop. 




[jira] [Commented] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876896#comment-14876896
 ] 

ASF GitHub Bot commented on NUTCH-2104:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/60


> Add documentation to the protocol-selenium plugin Readme file re: selenium 
> grid implementation
> --
>
> Key: NUTCH-2104
> URL: https://issues.apache.org/jira/browse/NUTCH-2104
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin, protocol
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: memex
> Fix For: 1.11
>
>
> Adding some documentation to the protocol-selenium Readme file with regards 
> to advice on using the selenium grid. Namely:
> (1) parameters to set for optimization of the grid 
> (2) pitfalls to beware of when using the grid 




[jira] [Resolved] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2104.
--
Resolution: Fixed

Looks great, thanks Kim!

{noformat}
[chipotle:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2104: Add 
documentation to the protocol-selenium plugin Readme file re: selenium grid 
implementation contributed by Kim Whitehall  this 
closes #60."
Sending        CHANGES.txt
Sending        src/plugin/protocol-selenium/README.md
Transmitting file data ..
Committed revision 1703944.
{noformat}


> Add documentation to the protocol-selenium plugin Readme file re: selenium 
> grid implementation
> --
>
> Key: NUTCH-2104
> URL: https://issues.apache.org/jira/browse/NUTCH-2104
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin, protocol
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: memex
> Fix For: 1.11
>
>
> Adding some documentation to the protocol-selenium Readme file with regards 
> to advice on using the selenium grid. Namely:
> (1) parameters to set for optimization of the grid 
> (2) pitfalls to beware of when using the grid 




[GitHub] nutch pull request: fix for NUTCH-2104 contributed by kwhitehall

2015-09-18 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/60


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Updated] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2104:
-
Component/s: protocol

> Add documentation to the protocol-selenium plugin Readme file re: selenium 
> grid implementation
> --
>
> Key: NUTCH-2104
> URL: https://issues.apache.org/jira/browse/NUTCH-2104
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin, protocol
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: memex
> Fix For: 1.11
>
>
> Adding some documentation to the protocol-selenium Readme file with regards 
> to advice on using the selenium grid. Namely:
> (1) parameters to set for optimization of the grid 
> (2) pitfalls to beware of when using the grid 




[jira] [Assigned] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2104:


Assignee: Chris A. Mattmann

> Add documentation to the protocol-selenium plugin Readme file re: selenium 
> grid implementation
> --
>
> Key: NUTCH-2104
> URL: https://issues.apache.org/jira/browse/NUTCH-2104
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin, protocol
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: memex
> Fix For: 1.11
>
>
> Adding some documentation to the protocol-selenium Readme file with regards 
> to advice on using the selenium grid. Namely:
> (1) parameters to set for optimization of the grid 
> (2) pitfalls to beware of when using the grid 




[jira] [Updated] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2104:
-
Component/s: documentation

> Add documentation to the protocol-selenium plugin Readme file re: selenium 
> grid implementation
> --
>
> Key: NUTCH-2104
> URL: https://issues.apache.org/jira/browse/NUTCH-2104
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin, protocol
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: memex
> Fix For: 1.11
>
>
> Adding some documentation to the protocol-selenium Readme file with regards 
> to advice on using the selenium grid. Namely:
> (1) parameters to set for optimization of the grid 
> (2) pitfalls to beware of when using the grid 




[jira] [Updated] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2104:
-
Fix Version/s: 1.11

> Add documentation to the protocol-selenium plugin Readme file re: selenium 
> grid implementation
> --
>
> Key: NUTCH-2104
> URL: https://issues.apache.org/jira/browse/NUTCH-2104
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin, protocol
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: memex
> Fix For: 1.11
>
>
> Adding some documentation to the protocol-selenium Readme file with regards 
> to advice on using the selenium grid. Namely:
> (1) parameters to set for optimization of the grid 
> (2) pitfalls to beware of when using the grid 




[jira] [Work started] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2104 started by Chris A. Mattmann.

> Add documentation to the protocol-selenium plugin Readme file re: selenium 
> grid implementation
> --
>
> Key: NUTCH-2104
> URL: https://issues.apache.org/jira/browse/NUTCH-2104
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin, protocol
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: memex
> Fix For: 1.11
>
>
> Adding some documentation to the protocol-selenium Readme file with regards 
> to advice on using the selenium grid. Namely:
> (1) parameters to set for optimization of the grid 
> (2) pitfalls to beware of when using the grid 




[jira] [Updated] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2104:
-
Labels: memex  (was: )

> Add documentation to the protocol-selenium plugin Readme file re: selenium 
> grid implementation
> --
>
> Key: NUTCH-2104
> URL: https://issues.apache.org/jira/browse/NUTCH-2104
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin, protocol
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: memex
> Fix For: 1.11
>
>
> Adding some documentation to the protocol-selenium Readme file with regards 
> to advice on using the selenium grid. Namely:
> (1) parameters to set for optimization of the grid 
> (2) pitfalls to beware of when using the grid 




[GitHub] nutch pull request: Webui integration

2015-09-18 Thread chrismattmann

Github user chrismattmann closed the pull request at:

https://github.com/apache/nutch/pull/51



[GitHub] nutch pull request: fix for NUTCH-2094 contributed by prernasatija

2015-09-18 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/58



[jira] [Commented] (NUTCH-2094) Stopping and Restarting a crawl has issues in the Web UI

2015-09-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876892#comment-14876892
 ] 

ASF GitHub Bot commented on NUTCH-2094:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/58


> Stopping and Restarting a crawl has issues in the Web UI
> 
>
> Key: NUTCH-2094
> URL: https://issues.apache.org/jira/browse/NUTCH-2094
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Reporter: Prerna Satija
>Assignee: Chris A. Mattmann
> Fix For: 2.4
>
>
> I have created a stop button in Nutch webapp to stop a running crawl from the 
> UI on click of a "stop" button. While testing, I found that I am able to stop 
> a crawl successfully but when I restart a stopped crawl and try to stop it, 
> it doesn't stop. 




[jira] [Resolved] (NUTCH-2094) Stopping and Restarting a crawl has issues in the Web UI

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2094.
--
Resolution: Fixed

I committed this to 2.x branch but Github auto closing integration isn't 
enabled on 2.x branch. So I filed this issue 
https://issues.apache.org/jira/browse/INFRA-10464 to fix that. Thanks!

{noformat}
[chipotle:~/tmp/nutch2.x] mattmann% svn commit -m "Fix for NUTCH-2094: Stopping 
and Restarting a crawl has issues in the Web UI contributed by Prerna Satija 
 this closes #58."
Sending        CHANGES.txt
Sending        src/java/org/apache/nutch/api/impl/NutchServerPoolExecutor.java
Transmitting file data ..
Committed revision 1703942.
[chipotle:~/tmp/nutch2.x] mattmann% 
{noformat}


> Stopping and Restarting a crawl has issues in the Web UI
> 
>
> Key: NUTCH-2094
> URL: https://issues.apache.org/jira/browse/NUTCH-2094
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Reporter: Prerna Satija
>Assignee: Chris A. Mattmann
> Fix For: 2.4
>
>
> I have created a stop button in Nutch webapp to stop a running crawl from the 
> UI on click of a "stop" button. While testing, I found that I am able to stop 
> a crawl successfully but when I restart a stopped crawl and try to stop it, 
> it doesn't stop. 




[jira] [Reopened] (NUTCH-2094) Stopping and Restarting a crawl has issues in the Web UI

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reopened NUTCH-2094:
--

> Stopping and Restarting a crawl has issues in the Web UI
> 
>
> Key: NUTCH-2094
> URL: https://issues.apache.org/jira/browse/NUTCH-2094
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Reporter: Prerna Satija
>Assignee: Chris A. Mattmann
> Fix For: 2.4
>
>
> I have created a stop button in Nutch webapp to stop a running crawl from the 
> UI on click of a "stop" button. While testing, I found that I am able to stop 
> a crawl successfully but when I restart a stopped crawl and try to stop it, 
> it doesn't stop. 




[jira] [Updated] (NUTCH-2094) Stopping and Restarting a crawl has issues in the Web UI

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2094:
-
Fix Version/s: 2.4

> Stopping and Restarting a crawl has issues in the Web UI
> 
>
> Key: NUTCH-2094
> URL: https://issues.apache.org/jira/browse/NUTCH-2094
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Reporter: Prerna Satija
>Assignee: Chris A. Mattmann
> Fix For: 2.4
>
>
> I have created a stop button in Nutch webapp to stop a running crawl from the 
> UI on click of a "stop" button. While testing, I found that I am able to stop 
> a crawl successfully but when I restart a stopped crawl and try to stop it, 
> it doesn't stop. 




[jira] [Work started] (NUTCH-2094) Stopping and Restarting a crawl has issues in the Web UI

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2094 started by Chris A. Mattmann.

> Stopping and Restarting a crawl has issues in the Web UI
> 
>
> Key: NUTCH-2094
> URL: https://issues.apache.org/jira/browse/NUTCH-2094
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Reporter: Prerna Satija
>Assignee: Chris A. Mattmann
> Fix For: 2.4
>
>
> I have created a stop button in Nutch webapp to stop a running crawl from the 
> UI on click of a "stop" button. While testing, I found that I am able to stop 
> a crawl successfully but when I restart a stopped crawl and try to stop it, 
> it doesn't stop. 




[jira] [Updated] (NUTCH-2094) Stopping and Restarting a crawl has issues in the Web UI

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2094:
-
Component/s: web gui

> Stopping and Restarting a crawl has issues in the Web UI
> 
>
> Key: NUTCH-2094
> URL: https://issues.apache.org/jira/browse/NUTCH-2094
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Reporter: Prerna Satija
>Assignee: Chris A. Mattmann
> Fix For: 2.4
>
>
> I have created a stop button in Nutch webapp to stop a running crawl from the 
> UI on click of a "stop" button. While testing, I found that I am able to stop 
> a crawl successfully but when I restart a stopped crawl and try to stop it, 
> it doesn't stop. 




[jira] [Updated] (NUTCH-2094) Stopping and Restarting a crawl has issues in the Web UI

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2094:
-
Summary: Stopping and Restarting a crawl has issues in the Web UI  (was: 
When stopping a crawl in Nutch 2.3, I was having trouble when I start an 
already stopped crawl and then stop it again. )

> Stopping and Restarting a crawl has issues in the Web UI
> 
>
> Key: NUTCH-2094
> URL: https://issues.apache.org/jira/browse/NUTCH-2094
> Project: Nutch
>  Issue Type: Bug
>Reporter: Prerna Satija
>Assignee: Chris A. Mattmann
>
> I have created a stop button in Nutch webapp to stop a running crawl from the 
> UI on click of a "stop" button. While testing, I found that I am able to stop 
> a crawl successfully but when I restart a stopped crawl and try to stop it, 
> it doesn't stop. 




[jira] [Commented] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.

2015-09-18 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876885#comment-14876885
 ] 

Chris A. Mattmann commented on NUTCH-2094:
--

Lewis, doesn't look like it. prernasatija's patch removes the call to #getInfo?

> When stopping a crawl in Nutch 2.3, I was having trouble when I start an 
> already stopped crawl and then stop it again. 
> ---
>
> Key: NUTCH-2094
> URL: https://issues.apache.org/jira/browse/NUTCH-2094
> Project: Nutch
>  Issue Type: Bug
>Reporter: Prerna Satija
>Assignee: Chris A. Mattmann
>
> I have created a stop button in Nutch webapp to stop a running crawl from the 
> UI on click of a "stop" button. While testing, I found that I am able to stop 
> a crawl successfully but when I restart a stopped crawl and try to stop it, 
> it doesn't stop. 




[jira] [Resolved] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2099.
--
Resolution: Fixed

Thanks Sujen and Lewis!

{noformat}
[chipotle:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2099: 
Refactoring the REST endpoints for integration with webui contributed by Sujen 
Shah  this closes #59."
Sending        CHANGES.txt
Sending        src/java/org/apache/nutch/crawl/CrawlDb.java
Sending        src/java/org/apache/nutch/crawl/DeduplicationJob.java
Sending        src/java/org/apache/nutch/crawl/Generator.java
Sending        src/java/org/apache/nutch/crawl/Injector.java
Sending        src/java/org/apache/nutch/crawl/LinkDb.java
Sending        src/java/org/apache/nutch/fetcher/Fetcher.java
Sending        src/java/org/apache/nutch/indexer/IndexingJob.java
Sending        src/java/org/apache/nutch/metadata/Nutch.java
Sending        src/java/org/apache/nutch/parse/ParseSegment.java
Sending        src/java/org/apache/nutch/service/model/request/JobConfig.java
Sending        src/java/org/apache/nutch/service/model/response/JobInfo.java
Sending        src/java/org/apache/nutch/util/NutchTool.java
Transmitting file data .
Committed revision 1703941.
{noformat}


> Refactoring the REST endpoints for integration with webui
> -
>
> Key: NUTCH-2099
> URL: https://issues.apache.org/jira/browse/NUTCH-2099
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api, web gui
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> This PR changes the structure of the arguments in the REST endpoints. Earlier 
> the args were accepted in a Map form and now it is 
> Map. This is to allow Wicket to create the proper requests 
> objects and send it to NutchServer. 
> With the above, I have also added the metadata required for these services in 
> Nutch metadata.




[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876882#comment-14876882
 ] 

ASF GitHub Bot commented on NUTCH-2099:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/59


> Refactoring the REST endpoints for integration with webui
> -
>
> Key: NUTCH-2099
> URL: https://issues.apache.org/jira/browse/NUTCH-2099
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api, web gui
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> This PR changes the structure of the arguments in the REST endpoints. Earlier 
> the args were accepted in a Map form and now it is 
> Map. This is to allow Wicket to create the proper requests 
> objects and send it to NutchServer. 
> With the above, I have also added the metadata required for these services in 
> Nutch metadata.




[GitHub] nutch pull request: Fix for NUTCH-2099 Contributed by Sujen Shah

2015-09-18 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/59



[jira] [Work started] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2099 started by Chris A. Mattmann.

> Refactoring the REST endpoints for integration with webui
> -
>
> Key: NUTCH-2099
> URL: https://issues.apache.org/jira/browse/NUTCH-2099
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api, web gui
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> This PR changes the structure of the arguments in the REST endpoints. Earlier 
> the args were accepted in a Map form and now it is 
> Map. This is to allow Wicket to create the proper requests 
> objects and send it to NutchServer. 
> With the above, I have also added the metadata required for these services in 
> Nutch metadata.




[jira] [Assigned] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2099:


Assignee: Chris A. Mattmann

> Refactoring the REST endpoints for integration with webui
> -
>
> Key: NUTCH-2099
> URL: https://issues.apache.org/jira/browse/NUTCH-2099
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api, web gui
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> This PR changes the structure of the arguments in the REST endpoints. Earlier 
> the args were accepted in a Map form and now it is 
> Map. This is to allow Wicket to create the proper requests 
> objects and send it to NutchServer. 
> With the above, I have also added the metadata required for these services in 
> Nutch metadata.




[jira] [Commented] (NUTCH-2111) Set temporary file location for selenium tmp files

2015-09-18 Thread Kim Whitehall (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876832#comment-14876832
 ] 

Kim Whitehall commented on NUTCH-2111:
--

[~lewismc] and [~mjoyce] I think you'll like this patch. 

> Set temporary file location for selenium tmp files
> --
>
> Key: NUTCH-2111
> URL: https://issues.apache.org/jira/browse/NUTCH-2111
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>
> When using the selenium plugin (local mode or selenium grid), a large number 
> of tmp files can be generated for each webdriver executed. The default 
> location for selenium is the /tmp directory. Thus very quickly (and 
> inadvertently) the nutch-selenium interaction can lead to filesystem issues. 
> I propose to include a config in nutch-default.xml that allows users to 
> specify where they want the selenium tmp files to be written. 
>  




[jira] [Created] (NUTCH-2111) Set temporary file location for selenium tmp files

2015-09-18 Thread Kim Whitehall (JIRA)

Kim Whitehall created NUTCH-2111:


 Summary: Set temporary file location for selenium tmp files
 Key: NUTCH-2111
 URL: https://issues.apache.org/jira/browse/NUTCH-2111
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.10
Reporter: Kim Whitehall


When using the selenium plugin (local mode or selenium grid), a large number 
of tmp files can be generated for each webdriver executed. The default 
location for selenium is the /tmp directory. Thus very quickly (and inadvertently) the 
nutch-selenium interaction can lead to filesystem issues. 
I propose to include a config in nutch-default.xml that allows users to specify 
where they want the selenium tmp files to be written. 
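
A minimal sketch of how such a setting might be resolved, assuming a hypothetical property name `selenium.tmp.dir` (the issue does not name one) and using plain `java.util.Properties` as a stand-in for the Nutch configuration. How the resolved directory would then be handed to Selenium is not specified in the issue and is left out here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

final class SeleniumTmpDirSketch {
  // Hypothetical property name, chosen for illustration only.
  static final String TMP_DIR_KEY = "selenium.tmp.dir";

  /**
   * Returns the configured tmp directory, falling back to java.io.tmpdir
   * when the property is unset, and creates the directory if it is missing.
   */
  static Path resolveTmpDir(Properties conf) {
    String fallback = System.getProperty("java.io.tmpdir");
    Path dir = Paths.get(conf.getProperty(TMP_DIR_KEY, fallback));
    try {
      if (!Files.isDirectory(dir)) {
        Files.createDirectories(dir);
      }
    } catch (IOException e) {
      throw new UncheckedIOException("cannot create " + dir, e);
    }
    return dir;
  }
}
```

Falling back to `java.io.tmpdir` keeps the current behaviour for users who do not set the property.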
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-09-18 Thread Sujen Shah (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876639#comment-14876639
 ] 

Sujen Shah commented on NUTCH-2011:
---

Hi [~ahmadia], 
There is an implementation of this in the org.apache.nutch.fetcher package and 
also a corresponding endpoint in the REST API. But the current issue with the 
implementation is that the entire data (ref class FetchNode) is stored in 
memory (ref class FetchNodeDb), which gets very large with large crawls. 

We could discuss a few options for implementing this and come up with an 
efficient solution. Any suggestions?
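
One possible option, not proposed in the thread itself, is to cap the number of entries kept in memory and evict the oldest ones. The sketch below uses `LinkedHashMap.removeEldestEntry` for this; `BoundedFetchLog` is a hypothetical name and is not related to the actual `FetchNodeDb` class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Insertion-ordered map that drops its oldest entry once maxEntries is exceeded. */
final class BoundedFetchLog<K, V> extends LinkedHashMap<K, V> {
  private static final long serialVersionUID = 1L;
  private final int maxEntries;

  BoundedFetchLog(int maxEntries) {
    // accessOrder = false keeps insertion order, so the first-inserted entry is evicted first
    super(16, 0.75f, false);
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries;
  }
}
```

The trade-off is that older fetch records are no longer available to the endpoint once evicted; spilling them to disk would be a separate design.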

> Endpoint to support realtime JSON output from the fetcher
> -
>
> Key: NUTCH-2011
> URL: https://issues.apache.org/jira/browse/NUTCH-2011
> Project: Nutch
>  Issue Type: Sub-task
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
>
> This fix will create an endpoint to query the Nutch REST service and get a 
> real-time JSON response of the current/past Fetched URLs. 
> This endpoint also includes pagination of the output to reduce data transfer 
> bw in large crawls.
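
The pagination mentioned above can be illustrated with a small helper. This is a sketch under assumptions: the parameter names `start` and `size` are hypothetical and do not come from the Nutch REST API.

```java
import java.util.Collections;
import java.util.List;

final class PageSketch {
  /** Returns at most `size` items beginning at index `start`, or an empty list when out of range. */
  static <T> List<T> page(List<T> items, int start, int size) {
    if (start < 0 || size <= 0 || start >= items.size()) {
      return Collections.emptyList();
    }
    // long arithmetic avoids int overflow for very large size values
    int end = (int) Math.min((long) items.size(), (long) start + size);
    return items.subList(start, end);
  }
}
```

Returning only one slice per request keeps each response small regardless of how many URLs have been fetched.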



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: NUTCH-1946 Upgrade to Gora 0.6.1

2015-09-18 Thread Renato Marroquín Mogrovejo

Hey Lewis,

I checked the patch and all changes seem reasonable.
Looking forward to this release!


Renato M.

2015-09-17 8:29 GMT+02:00 Lewis John Mcgibbney :

> Hi user@ and dev@,
>
> Quick message to ask kindly for a call to arms. I pushed a patch to
> NUTCH-1946 [0] for Nutch 2.X HEAD [1]
>
> This includes
>
>- Upgrade to Gora 0.6.1
>- Upgrade to Hadoop 2.5.1 (which Gora supports fully) see NUTCH-2101
>- Introduction of @Deprecated within NutchJob constructors and
>introduction of static method invocations to shadow that of Hadoop 2.5.1
>API.
>- Utilization of Gora's org.apache.gora.memory.MemStore
>which finally gets us testing more classes again. In particular
>TestInjector, TestFetcher, TestGoraStorage.
>- Removal of unused imports throughout the Nutch 2.X HEAD codebase.
>
> Please review if you can, really looking forward to having a 2.X release
> soon and any code review from the Nutch  and Gora communities would be
> sterling.
>
> Ta
> Lewis
>
> [0] https://issues.apache.org/jira/browse/NUTCH-1946
> [1] http://svn.apache.org/repos/asf/nutch/branches/2.x/
>
> --
> *Lewis*
>

[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876337#comment-14876337
 ] 

ASF GitHub Bot commented on NUTCH-2099:
---

Github user sujen1412 commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39897798
  
--- Diff: src/java/org/apache/nutch/metadata/Nutch.java ---
@@ -80,4 +80,11 @@
public static final String STAT_PROGRESS = "progress";
/**Used by Nutch REST service */
public static final String CRAWL_ID_KEY = "storage.crawl.id";
+   
+   public static final String ARG_SEEDDIR = "url_dir";
+   public static final String ARG_CRAWLDB = "crawldb";
--- End diff --

Hi @lewismc, I have added some documentation to the newly introduced metadata. 
Please let me know if it's proper. Thanks :)  


> Refactoring the REST endpoints for integration with webui
> -
>
> Key: NUTCH-2099
> URL: https://issues.apache.org/jira/browse/NUTCH-2099
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api, web gui
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.11
>
>
> This PR changes the structure of the arguments in the REST endpoints. Earlier 
> the args were accepted in a Map<String,String> form and now it is 
> Map<String,Object>. This is to allow Wicket to create the proper request 
> objects and send them to NutchServer. 
> With the above, I have also added the metadata required for these services in 
> Nutch metadata.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] nutch pull request: Fix for NUTCH-2099 Contributed by Sujen Shah

2015-09-18 Thread sujen1412

Github user sujen1412 commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39897798
  
--- Diff: src/java/org/apache/nutch/metadata/Nutch.java ---
@@ -80,4 +80,11 @@
public static final String STAT_PROGRESS = "progress";
/**Used by Nutch REST service */
public static final String CRAWL_ID_KEY = "storage.crawl.id";
+   
+   public static final String ARG_SEEDDIR = "url_dir";
+   public static final String ARG_CRAWLDB = "crawldb";
--- End diff --

Hi @lewismc, I have added some documentation to the newly introduced metadata. 
Please let me know if it's proper. Thanks :)  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] nutch pull request: Fix for NUTCH-2099 Contributed by Sujen Shah

2015-09-18 Thread sujen1412

Github user sujen1412 commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39897636
  
--- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java ---
@@ -261,30 +262,68 @@ public int run(String[] args) throws Exception {
   additionsAllowed = false;
 }
 
-String crawldb = crawlId+"/crawldb";
-String segment_dir = crawlId+"/segments";
-File segmentsDir = new File(segment_dir);
-File[] segmentsList = segmentsDir.listFiles();  
-Arrays.sort(segmentsList, new Comparator(){
-  @Override
-  public int compare(File f1, File f2) {
-if(f1.lastModified()>f2.lastModified())
-  return -1;
-else
-  return 0;
-  }  
-});
+Path crawlDb;
+if(args.containsKey(Nutch.ARG_CRAWLDB)) {
+   Object crawldbPath = args.get(Nutch.ARG_CRAWLDB);
+   if(crawldbPath instanceof Path) {
+   crawlDb = (Path) crawldbPath;
--- End diff --

@lewismc, I have corrected the formatting in the new commit. Thanks :) 
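
The argument handling in the diff above can be sketched as follows. This is an illustration only: `java.nio.file.Path` stands in for Hadoop's `Path` so the block is self-contained, and the fallback to `crawlId + "/crawldb"` is an assumption based on the removed lines, not something the visible part of the diff confirms.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

final class ArgsSketch {
  // Key value taken from the Nutch.java diff in this thread.
  static final String ARG_CRAWLDB = "crawldb";

  /** Accepts either a Path or any other object (e.g. a String) under ARG_CRAWLDB. */
  static Path crawlDbFrom(Map<String, Object> args, String crawlId) {
    Object value = args.get(ARG_CRAWLDB);
    if (value instanceof Path) {
      return (Path) value;
    }
    if (value != null) {
      return Paths.get(value.toString());
    }
    // Assumed fallback, mirroring the removed crawlId + "/crawldb" line.
    return Paths.get(crawlId, "crawldb");
  }
}
```

Accepting `Object` values is what lets a caller pass either an already constructed path or a plain string from a JSON request.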



[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876336#comment-14876336
 ] 

ASF GitHub Bot commented on NUTCH-2099:
---

Github user sujen1412 commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39897636
  
--- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java ---
@@ -261,30 +262,68 @@ public int run(String[] args) throws Exception {
   additionsAllowed = false;
 }
 
-String crawldb = crawlId+"/crawldb";
-String segment_dir = crawlId+"/segments";
-File segmentsDir = new File(segment_dir);
-File[] segmentsList = segmentsDir.listFiles();  
-Arrays.sort(segmentsList, new Comparator(){
-  @Override
-  public int compare(File f1, File f2) {
-if(f1.lastModified()>f2.lastModified())
-  return -1;
-else
-  return 0;
-  }  
-});
+Path crawlDb;
+if(args.containsKey(Nutch.ARG_CRAWLDB)) {
+   Object crawldbPath = args.get(Nutch.ARG_CRAWLDB);
+   if(crawldbPath instanceof Path) {
+   crawlDb = (Path) crawldbPath;
--- End diff --

@lewismc, I have corrected the formatting in the new commit. Thanks :) 


> Refactoring the REST endpoints for integration with webui
> -
>
> Key: NUTCH-2099
> URL: https://issues.apache.org/jira/browse/NUTCH-2099
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api, web gui
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.11
>
>
> This PR changes the structure of the arguments in the REST endpoints. Earlier 
> the args were accepted in a Map<String,String> form and now it is 
> Map<String,Object>. This is to allow Wicket to create the proper request 
> objects and send them to NutchServer. 
> With the above, I have also added the metadata required for these services in 
> Nutch metadata.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] nutch pull request: Fix for NUTCH-2099 Contributed by Sujen Shah

2015-09-18 Thread chrismattmann

Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39874200
  
--- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java ---
@@ -236,10 +237,10 @@ public int run(String[] args) throws Exception {
* Used for Nutch REST service
*/
   @Override
-  public Map run(Map args, String crawlId) 
throws Exception {
+  public Map run(Map args, String crawlId) 
throws Exception {
--- End diff --

Thanks @sujen1412 for considering 
[nutch-python](https://github.com/chrismattmann/nutch-python) with this.



[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875904#comment-14875904
 ] 

ASF GitHub Bot commented on NUTCH-2099:
---

Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39874200
  
--- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java ---
@@ -236,10 +237,10 @@ public int run(String[] args) throws Exception {
* Used for Nutch REST service
*/
   @Override
-  public Map run(Map args, String crawlId) 
throws Exception {
+  public Map run(Map args, String crawlId) 
throws Exception {
--- End diff --

Thanks @sujen1412 for considering 
[nutch-python](https://github.com/chrismattmann/nutch-python) with this.


> Refactoring the REST endpoints for integration with webui
> -
>
> Key: NUTCH-2099
> URL: https://issues.apache.org/jira/browse/NUTCH-2099
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api, web gui
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.11
>
>
> This PR changes the structure of the arguments in the REST endpoints. Earlier 
> the args were accepted in a Map<String,String> form and now it is 
> Map<String,Object>. This is to allow Wicket to create the proper request 
> objects and send them to NutchServer. 
> With the above, I have also added the metadata required for these services in 
> Nutch metadata.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (NUTCH-2091) Increase robustness and crawling versatility of Nutch for the Deep Web

2015-09-18 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2091:
-
Summary: Increase robustness and crawling versatility of Nutch for the Deep 
Web  (was: Make Nutch more robust and smart)

> Increase robustness and crawling versatility of Nutch for the Deep Web
> --
>
> Key: NUTCH-2091
> URL: https://issues.apache.org/jira/browse/NUTCH-2091
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.10
>Reporter: Asitang Mishra
>Priority: Minor
>  Labels: memex, nutch
>
> Nutch fails to grab a page or crawl in a manner that is more productive in 
> certain cases. This issue is to discuss those specific cases and try to 
> generalize them into Nutch to make it even more robust and productive.
> I came across three websites and got many issues. I have toned down those 
> issues into fine points.
> 1. Some websites detect that the crawler is not a browser (marketwired) 
> (cookie validations) and send it to the first page again and again.
> 2. Some data behind a click (detect which clicks: javascript void) of 'a tag' 
> that is not a link exactly (an improvement for the selenium plugin)
> 3. When clicked something on a page and the page changed, how to get back the 
> page before clicking further (can’t obviously look for a back button or cross 
> button. Can save the old state juxtapose with new info and only take the 
> extra info)
> 4. Differentiate between a navigation link and a common link in a forum page 
> so that both links can be used differently to decide the progress of the 
> crawler (nav links decide the rounds and other links we can go one round)
> 5. Bring the capability of changing # to ? (pataxia.com). Right now url 
> normalization completely removes the part after # thinking that it's a simple 
> anchor tag.
> 6. Easy route-decision in property file to decide how the fetcher will behave 
> (instead of going all BFS or DFS, there should be a way to make it do a 
> DEPTH-LIMITED search. Especially good for forums and the like. And users can 
> give some known inputs like depth etc. to direct the crawler if they know 
> something specific about the site)
> 7. A forum can be roughly generalized into: a meta topic page (no nav links) 
> -> post list (with nav links) -> post page (with nav links): How to make 
> Nutch aware of this structure/hierarchy, if need be by manually giving simple 
> clues as well. Can be seen as an extension of the last point.
> 8. Sometimes even nav links are not actual links but ajax requests.
> NOTE: Nav links (definition here): the structure on a web page (like a forum) 
> which gives us an option to go to various pages by numbers or next, previous, 
> first and or last pages.
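
The depth-limited search from point 6 can be sketched as a breadth-first traversal that stops after a fixed number of link hops. This is an illustration over an in-memory link graph, not Nutch code; the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DepthLimitedCrawlSketch {
  /** Returns the pages reachable from seed within maxDepth link hops, in visit order. */
  static Set<String> crawl(Map<String, List<String>> links, String seed, int maxDepth) {
    Set<String> visited = new LinkedHashSet<>();
    Deque<String> frontier = new ArrayDeque<>();
    visited.add(seed);
    frontier.add(seed);
    // Each loop iteration expands one level; the loop ends at the depth limit.
    for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
      Deque<String> next = new ArrayDeque<>();
      for (String page : frontier) {
        for (String out : links.getOrDefault(page, Collections.emptyList())) {
          if (visited.add(out)) {
            next.add(out);
          }
        }
      }
      frontier = next;
    }
    return visited;
  }
}
```

With `maxDepth` supplied by the user, a forum crawl could stop at the post-page level instead of following every outgoing link.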



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-2091) Make Nutch more robust and smart

2015-09-18 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875894#comment-14875894
 ] 

Chris A. Mattmann commented on NUTCH-2091:
--

This is a fantastic summary of many of the issues we are seeing with Memex 
Asitang. Bravo. Let's keep working the issues here and making Nutch more robust 
to handle these types of situations.

> Make Nutch more robust and smart
> 
>
> Key: NUTCH-2091
> URL: https://issues.apache.org/jira/browse/NUTCH-2091
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.10
>Reporter: Asitang Mishra
>Priority: Minor
>  Labels: memex, nutch
>
> Nutch fails to grab a page or crawl in a manner that is more productive in 
> certain cases. This issue is to discuss those specific cases and try to 
> generalize them into Nutch to make it even more robust and productive.
> I came across three websites and got many issues. I have toned down those 
> issues into fine points.
> 1. Some websites detect that the crawler is not a browser (marketwired) 
> (cookie validations) and send it to the first page again and again.
> 2. Some data behind a click (detect which clicks: javascript void) of 'a tag' 
> that is not a link exactly (an improvement for the selenium plugin)
> 3. When clicked something on a page and the page changed, how to get back the 
> page before clicking further (can’t obviously look for a back button or cross 
> button. Can save the old state juxtapose with new info and only take the 
> extra info)
> 4. Differentiate between a navigation link and a common link in a forum page 
> so that both links can be used differently to decide the progress of the 
> crawler (nav links decide the rounds and other links we can go one round)
> 5. Bring the capability of changing # to ? (pataxia.com). Right now url 
> normalization completely removes the part after # thinking that it's a simple 
> anchor tag.
> 6. Easy route-decision in property file to decide how the fetcher will behave 
> (instead of going all BFS or DFS, there should be a way to make it do a 
> DEPTH-LIMITED search. Especially good for forums and the like. And users can 
> give some known inputs like depth etc. to direct the crawler if they know 
> something specific about the site)
> 7. A forum can be roughly generalized into: a meta topic page (no nav links) 
> -> post list (with nav links) -> post page (with nav links): How to make 
> Nutch aware of this structure/hierarchy, if need be by manually giving simple 
> clues as well. Can be seen as an extension of the last point.
> 8. Sometimes even nav links are not actual links but ajax requests.
> NOTE: Nav links (definition here): the structure on a web page (like a forum) 
> which gives us an option to go to various pages by numbers or next, previous, 
> first and or last pages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-18 Thread jorgelbg

Github user jorgelbg commented on a diff in the pull request:

https://github.com/apache/nutch/pull/55#discussion_r39866086
  
--- Diff: 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
 ---
@@ -193,34 +197,54 @@ public HttpResponse(HttpBase http, URL url, 
CrawlDatum datum)
   reqStr.append("\r\n");
 
   if (http.isIfModifiedSinceEnabled() && datum.getModifiedTime() > 0) {
-reqStr.append("If-Modified-Since: "
-+ HttpDateFormat.toString(datum.getModifiedTime()));
+reqStr.append("If-Modified-Since: " + HttpDateFormat
+.toString(datum.getModifiedTime()));
 reqStr.append("\r\n");
   }
   reqStr.append("\r\n");
 
+  // store the request in the metadata?
+  if (conf.getBoolean("store.http.request", false) == true) {
+headers.add("_request_", reqStr.toString());
+  }
+
   byte[] reqBytes = reqStr.toString().getBytes();
 
   req.write(reqBytes);
   req.flush();
 
   PushbackInputStream in = // process response
-  new PushbackInputStream(new 
BufferedInputStream(socket.getInputStream(),
-  Http.BUFFER_SIZE), Http.BUFFER_SIZE);
+  new PushbackInputStream(
+  new BufferedInputStream(socket.getInputStream(),
+  Http.BUFFER_SIZE), Http.BUFFER_SIZE);
 
   StringBuffer line = new StringBuffer();
 
+  // store the http headers verbatim
+  if (conf.getBoolean("store.http.headers", false) == true) {
+httpHeaders = new StringBuffer();
+  }
+
+  headers.add("nutch.fetch.time", Long.toString(datum.getFetchTime()));
--- End diff --

I understand, fixed and pushed! The misleading part of the comment for 
me was that it should hold the right value unless the `CrawlDbReducer` gets 
called. The `fetchTime` field is initialized to 
`System.currentTimeMillis()` upon creation. But this way we have no doubts about 
the value. 



[jira] [Commented] (NUTCH-2102) WARC Exporter

2015-09-18 Thread Jorge Luis Betancourt Gonzalez (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875754#comment-14875754
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2102:
---

+1 It looks good, the nutch entry will definitely make it easier to use 
:)

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-18 Thread jnioche

Github user jnioche commented on a diff in the pull request:

https://github.com/apache/nutch/pull/55#discussion_r39863257
  
--- Diff: 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
 ---
@@ -193,34 +197,54 @@ public HttpResponse(HttpBase http, URL url, 
CrawlDatum datum)
   reqStr.append("\r\n");
 
   if (http.isIfModifiedSinceEnabled() && datum.getModifiedTime() > 0) {
-reqStr.append("If-Modified-Since: "
-+ HttpDateFormat.toString(datum.getModifiedTime()));
+reqStr.append("If-Modified-Since: " + HttpDateFormat
+.toString(datum.getModifiedTime()));
 reqStr.append("\r\n");
   }
   reqStr.append("\r\n");
 
+  // store the request in the metadata?
+  if (conf.getBoolean("store.http.request", false) == true) {
+headers.add("_request_", reqStr.toString());
+  }
+
   byte[] reqBytes = reqStr.toString().getBytes();
 
   req.write(reqBytes);
   req.flush();
 
   PushbackInputStream in = // process response
-  new PushbackInputStream(new 
BufferedInputStream(socket.getInputStream(),
-  Http.BUFFER_SIZE), Http.BUFFER_SIZE);
+  new PushbackInputStream(
+  new BufferedInputStream(socket.getInputStream(),
+  Http.BUFFER_SIZE), Http.BUFFER_SIZE);
 
   StringBuffer line = new StringBuffer();
 
+  // store the http headers verbatim
+  if (conf.getBoolean("store.http.headers", false) == true) {
+httpHeaders = new StringBuffer();
+  }
+
+  headers.add("nutch.fetch.time", Long.toString(datum.getFetchTime()));
--- End diff --

It is correct in the output of the fetcher step when accessing the fetch 
datum, but I don't think that is the case at this point in the code.

> it is executed in the HttpResponse class (right after the fetcher gets executed)

Not after, but right in the middle of the fetcher's work. 

It is set to the right value in the output method of the FetcherThread 
[https://github.com/apache/nutch/blob/8397611b49de4aac408806765191fc796ba4b15f/src/java/org/apache/nutch/fetcher/FetcherThread.java#L528]
 but that's AFTER the protocol implementation fetched the content.

In short, it is not clear what the value is at this point in the code, but it's 
unlikely to be correct. Just use System.currentTimeMillis().
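
A minimal sketch of that suggestion, assuming a plain `Map<String, String>` as a stand-in for the Nutch headers metadata object; the key `nutch.fetch.time` is the one used in the diff, while the class and method names are hypothetical.

```java
import java.util.Map;

final class FetchTimeSketch {
  // Header key taken from the diff under review.
  static final String FETCH_TIME_KEY = "nutch.fetch.time";

  /** Records the given wall-clock time (epoch millis) under the fetch-time key. */
  static void stampFetchTime(Map<String, String> headers, long nowMillis) {
    headers.put(FETCH_TIME_KEY, Long.toString(nowMillis));
  }
}
```

A caller would pass `System.currentTimeMillis()` at the moment the response is read, so the stored value does not depend on what the datum holds at that point.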




[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-18 Thread jorgelbg

Github user jorgelbg commented on a diff in the pull request:

https://github.com/apache/nutch/pull/55#discussion_r39860853
  
--- Diff: 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
 ---
@@ -193,34 +197,54 @@ public HttpResponse(HttpBase http, URL url, 
CrawlDatum datum)
   reqStr.append("\r\n");
 
   if (http.isIfModifiedSinceEnabled() && datum.getModifiedTime() > 0) {
-reqStr.append("If-Modified-Since: "
-+ HttpDateFormat.toString(datum.getModifiedTime()));
+reqStr.append("If-Modified-Since: " + HttpDateFormat
+.toString(datum.getModifiedTime()));
 reqStr.append("\r\n");
   }
   reqStr.append("\r\n");
 
+  // store the request in the metadata?
+  if (conf.getBoolean("store.http.request", false) == true) {
+headers.add("_request_", reqStr.toString());
+  }
+
   byte[] reqBytes = reqStr.toString().getBytes();
 
   req.write(reqBytes);
   req.flush();
 
   PushbackInputStream in = // process response
-  new PushbackInputStream(new 
BufferedInputStream(socket.getInputStream(),
-  Http.BUFFER_SIZE), Http.BUFFER_SIZE);
+  new PushbackInputStream(
+  new BufferedInputStream(socket.getInputStream(),
+  Http.BUFFER_SIZE), Http.BUFFER_SIZE);
 
   StringBuffer line = new StringBuffer();
 
+  // store the http headers verbatim
+  if (conf.getBoolean("store.http.headers", false) == true) {
+httpHeaders = new StringBuffer();
+  }
+
+  headers.add("nutch.fetch.time", Long.toString(datum.getFetchTime()));
--- End diff --

I thought of using the `System.currentTimeMillis()` method, but after reviewing this 
comment https://github.com/apache/nutch/pull/55#issuecomment-140663159 I thought 
that it would have the right value. 

The comment on the `getFetchTime()` method says:
 
  > Returns either the time of the last fetch, or the next fetch time,
  > depending on whether Fetcher or CrawlDbReducer set the time.

Since this is executed in the HttpResponse class (right after the fetcher 
gets executed) I thought it would be safe to assume that the date would be 
accurate. If this is wrong the fix is easy enough. 




[jira] [Commented] (NUTCH-2104) Add documentation to the protocol-selenium plugin Readme file re: selenium grid implementation

2015-09-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875686#comment-14875686
 ] 

ASF GitHub Bot commented on NUTCH-2104:
---

GitHub user kwhitehall opened a pull request:

https://github.com/apache/nutch/pull/60

fix for NUTCH-2104 contributed by kwhitehall

- Updated the documentation for protocol-selenium

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kwhitehall/nutch NUTCH-2104

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/60.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #60


commit ccf9a8e1029d359721df714891c4de93f665fa03
Author: Kim Whitehall 
Date:   2015-09-18T14:31:18Z

fix for NUTCH-2104 contributed by kwhitehall




> Add documentation to the protocol-selenium plugin Readme file re: selenium 
> grid implementation
> --
>
> Key: NUTCH-2104
> URL: https://issues.apache.org/jira/browse/NUTCH-2104
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Kim Whitehall
>Priority: Trivial
>
> Adding some documentation to the protocol-selenium Readme file with regards 
> to advice on using the selenium grid. Namely:
> (1) parameters to set for optimization of the grid 
> (2) pitfalls to beware of when using the grid 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] nutch pull request: fix for NUTCH-2104 contributed by kwhitehall

2015-09-18 Thread kwhitehall

GitHub user kwhitehall opened a pull request:

https://github.com/apache/nutch/pull/60

fix for NUTCH-2104 contributed by kwhitehall

- Updated the documentation for protocol-selenium

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kwhitehall/nutch NUTCH-2104

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/60.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #60


commit ccf9a8e1029d359721df714891c4de93f665fa03
Author: Kim Whitehall 
Date:   2015-09-18T14:31:18Z

fix for NUTCH-2104 contributed by kwhitehall





Re: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

2015-09-18 Thread Mattmann, Chris A (3980)

awesome thanks for sharing!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: Julien Nioche 
Reply-To: "dev@nutch.apache.org" 
Date: Friday, September 18, 2015 at 2:54 AM
To: "dev@nutch.apache.org" , "u...@nutch.apache.org"

Cc: "s...@commoncrawl.org" 
Subject: Fwd: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

>Nutch people, 
>
>
>Just in case you missed the announcement below. As you probably know CC
>use Nutch for their crawls, this is a fantastic opportunity to put your
>Nutch skills to great use!
>
>
>Julien
>
>-- Forwarded message --
>From: Sara Crouse 
>Date: 17 September 2015 at 22:51
>Subject: Job Opening at Common Crawl - Crawl Engineer / Data Scientist
>To: Common Crawl 
>
>
>Hello again CC community,
>
>In addition to my appointment, another staff transition is on the
>horizon, and I would like to ask for your help finding candidates to fill
>a critical role. At the end of this month, Stephen Merity (data
>scientist, crawl engineer, and much more!) will leave
>Common Crawl to work on image recognition and language understanding
>using deep learning at MetaMind, a new startup. Stephen has been a great
>asset to Common Crawl, and we are grateful that he wishes to remain
>engaged with us in a volunteer capacity going
>forward.
>
>This week, we therefore launch a search to fill the role of Crawl
>Engineer/Data Scientist. Below and posted here
>https://commoncrawl.org/jobs/ is the job description. We appreciate any
>help you can provide in spreading the word about this unique opportunity.
>If you have specific referrals, or wish to apply, please
> contact j...@commoncrawl.org.
>
>Many thanks,
>
>Sara
>---
>
>_CRAWL ENGINEER / DATA SCIENTIST at THE COMMON CRAWL FOUNDATION_
>
>*Location* 
>San Francisco or Remote
>
>
>*Job Summary*
>Common Crawl (CC) is the non-profit organization that builds and
>maintains the single largest publicly accessible dataset of the world’s
>knowledge, encompassing petabytes of web crawl data.
>
>If democratizing access to web information and tackling the engineering
>challenges of working with data at the scale of the web sounds exciting
>to you, we would love to hear from you. If you have worked on open source
>projects before or can share code samples with us, please don't hesitate
>to send relevant links along with your
>application.
>
>
>
>*Description*
>
>
>/Primary Responsibilities/
>_Running the crawl_
>* Spinning up and managing Hadoop clusters on Amazon EC2
>* Running regular comprehensive crawls of the web using Nutch
>* Preparing and publishing crawl data to data hosting partner, Amazon Web
>Services
>* Incident response and diagnosis of crawl issues as they occur, e.g.
>** Replacing lost instances due to EC2 problems / spot instance losses
>** Responding to and remedying webmaster queries and issues
>
>_Crawl engineering_
>* Maintaining, developing, and deploying new features as required by
>running the Nutch crawler, e.g.:
>** Providing netiquette features, such as following robots.txt, as
>required, and load balancing a crawl across millions of domains
>** Implementing and improving ranking algorithms to prioritize the
>crawling of popular pages
>* Extending existing tools to work efficiently with large datasets
>* Working with the Nutch community to push improvements to the crawler to
>the public
>
>/Other Responsibilities/
>* Building support tools and artifacts, including documentation,
>tutorials, and example code or supporting frameworks for processing CC
>data using different tools.
>* Identifying and reporting on research and innovations that result from
>analysis and derivative use of CC data.
>* Community evangelism:
>** Collaborating with partners in academia and industry
>** Engaging regularly with user discussion group and responding to
>frequent inquiries about how to use CC data
>** Writing technical blog posts
>** Presenting on or representing CC at conferences, meetups, etc.
>
>
>*Qualifications*
>/Minimum qualifications/
>* Fluent in Java (Nutch and Hadoop are core to our mission)
>* Familiarity with the JVM big data ecosystem (Hadoop, HDFS, ...)
>* Knowledge of the Amazon Web Services (AWS) ecosystem
>* Experience with Python
>* Basic command line Unix knowledge
>* BS Computer Science or equivalent work experience
>
>/Preferred qualifications/
>* Experience with running web crawlers
>* Cluster comput

[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-18 Thread jnioche

Github user jnioche commented on a diff in the pull request:

https://github.com/apache/nutch/pull/55#discussion_r39856699
  
--- Diff: 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
 ---
@@ -193,34 +197,54 @@ public HttpResponse(HttpBase http, URL url, 
CrawlDatum datum)
   reqStr.append("\r\n");
 
   if (http.isIfModifiedSinceEnabled() && datum.getModifiedTime() > 0) {
-reqStr.append("If-Modified-Since: "
-+ HttpDateFormat.toString(datum.getModifiedTime()));
+reqStr.append("If-Modified-Since: " + HttpDateFormat
+.toString(datum.getModifiedTime()));
 reqStr.append("\r\n");
   }
   reqStr.append("\r\n");
 
+  // store the request in the metadata?
+  if (conf.getBoolean("store.http.request", false) == true) {
+headers.add("_request_", reqStr.toString());
+  }
+
   byte[] reqBytes = reqStr.toString().getBytes();
 
   req.write(reqBytes);
   req.flush();
 
   PushbackInputStream in = // process response
-  new PushbackInputStream(new 
BufferedInputStream(socket.getInputStream(),
-  Http.BUFFER_SIZE), Http.BUFFER_SIZE);
+  new PushbackInputStream(
+  new BufferedInputStream(socket.getInputStream(),
+  Http.BUFFER_SIZE), Http.BUFFER_SIZE);
 
   StringBuffer line = new StringBuffer();
 
+  // store the http headers verbatim
+  if (conf.getBoolean("store.http.headers", false) == true) {
+httpHeaders = new StringBuffer();
+  }
+
+  headers.add("nutch.fetch.time", Long.toString(datum.getFetchTime()));
--- End diff --

Why take `datum.getFetchTime()`, and what does its value correspond to? 
Isn't that the time it was due for fetching?
I'd have used `System.currentTimeMillis()`, as this is exactly the moment we 
fetch the content and would definitely be accurate.
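
To make the distinction concrete, here is a minimal standalone sketch. It
assumes, as the question above suggests, that `datum.getFetchTime()` holds the
time the fetch was due rather than the time it happened. The class name, the
`scheduledFetchTime` parameter and the `example.scheduled.time` key are
hypothetical and not part of Nutch; only the `nutch.fetch.time` key comes from
the diff.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FetchTimeSketch {

  /**
   * Builds the metadata entries for one fetch. scheduledFetchTime stands in
   * for datum.getFetchTime() (hypothetical parameter, assumed to be the time
   * the fetch was due).
   */
  static Map<String, String> fetchMetadata(long scheduledFetchTime) {
    Map<String, String> headers = new LinkedHashMap<>();
    // Wall-clock time at the moment the request is actually sent.
    long actualFetchTime = System.currentTimeMillis();
    headers.put("nutch.fetch.time", Long.toString(actualFetchTime));
    // Hypothetical key, not part of Nutch: when the fetch was due.
    headers.put("example.scheduled.time", Long.toString(scheduledFetchTime));
    return headers;
  }

  public static void main(String[] args) {
    // Pretend the page was due for fetching an hour ago.
    long dueAnHourAgo = System.currentTimeMillis() - 3_600_000L;
    System.out.println(fetchMetadata(dueAnHourAgo));
  }
}
```

Keeping the two values under separate keys would let a consumer of the
metadata see how far the actual fetch lagged behind the schedule.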


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (NUTCH-2106) Runtime to contain Selenium and dependencies only once

2015-09-18 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14847281#comment-14847281
 ] 

Sebastian Nagel commented on NUTCH-2106:


Avoiding conflicting dependencies is the reason for the Nutch plugin system 
[[1|https://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading]]. 
However, if a plugin depends on another plugin and both depend on a library, 
there is no way around it: both plugins must rely on the same version (or on 
two versions with a compatible API).
- protocol-selenium depends on lib-selenium
- both depend on selenium-java (currently the same version)
- when the plugin protocol-selenium is loaded the lib-selenium.jar is just 
added to the classpath of protocol-selenium's own class loader. The classes 
from lib-selenium.jar do not live in its own class loader! They are used 
directly (and not via the lib-selenium plugin instance) from classes in 
protocol-selenium.
- the same situation for protocol-interactiveselenium

As a consequence, the Selenium version used by lib-selenium dictates the 
version to be used by the two protocol plugins. So, why not bundle Selenium 
jars and dependencies in lib-selenium?
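
The constraint described above (a plugin and the library plugin it imports
must agree on the version of any shared dependency) can be illustrated with a
toy check. This is only a sketch: the class and method names are made up, the
version numbers are invented, and it does not use or model the real Nutch
plugin API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SharedDependencyCheck {

  /**
   * Returns one line per library that both plugins declare with different
   * versions. Each map goes from library name to declared version.
   */
  static List<String> conflicts(Map<String, String> libPlugin,
                                Map<String, String> protocolPlugin) {
    List<String> result = new ArrayList<>();
    // TreeMap gives a stable, sorted order for the report.
    for (Map.Entry<String, String> e : new TreeMap<>(protocolPlugin).entrySet()) {
      String other = libPlugin.get(e.getKey());
      if (other != null && !other.equals(e.getValue())) {
        result.add(e.getKey() + ": " + other + " vs " + e.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Version numbers below are invented for illustration.
    Map<String, String> libSelenium =
        Collections.singletonMap("selenium-java", "2.0");
    Map<String, String> protocolSelenium =
        Collections.singletonMap("selenium-java", "2.1");
    System.out.println(conflicts(libSelenium, protocolSelenium));
  }
}
```

If only lib-selenium declared the dependency, the second map would be empty
and no conflict could arise, which is the point of bundling the jars in one
place.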

> Runtime to contain Selenium and dependencies only once
> --
>
> Key: NUTCH-2106
> URL: https://issues.apache.org/jira/browse/NUTCH-2106
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.11
>Reporter: Sebastian Nagel
> Fix For: 1.11
>
> Attachments: NUTCH-2106.patch
>
>
> All Selenium-based plugins contain the same dependent jars, which 
> significantly affects the size of runtime and bin package:
> {noformat}
> % du -hs runtime/local/plugins/*selenium/ runtime/deploy/*.job
> 25M runtime/local/plugins/lib-selenium/
> 25M runtime/local/plugins/protocol-interactiveselenium/
> 25M runtime/local/plugins/protocol-selenium/
> 182M runtime/deploy/apache-nutch-1.11-SNAPSHOT.job
> {noformat}
> Since all plugins depend on the same Selenium version we could bundle the 
> dependencies in lib-selenium and let the other plugins load it from there:
> - let lib-selenium export all dependent libs, e.g.:
> {code:xml|title=lib-selenium/plugin.xml}
> 
>   ...
>   
> 
>   
> {code}
> - both protocol plugins already import lib-selenium: the dependencies in 
> ivy.xml can be removed
> As expected, these changes make the runtime smaller:
> {noformat}
> 25M runtime/local/plugins/lib-selenium/
> 20K runtime/local/plugins/protocol-interactiveselenium/
> 16K runtime/local/plugins/protocol-selenium/
> 138M runtime/deploy/apache-nutch-1.11-SNAPSHOT.job
> {noformat}
> Open points:
> - I've tested only protocol-selenium using chromedriver. Should also test 
> protocol-interactiveselenium?
> - What about phantomjsdriver-1.2.1.jar? It was contained in lib-selenium and 
> protocol-selenium but not protocol-interactiveselenium. Is there a reason for 
> this?




[jira] [Commented] (NUTCH-2106) Runtime to contain Selenium and dependencies only once

2015-09-18 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14805366#comment-14805366
 ] 

Lewis John McGibbney commented on NUTCH-2106:
-

[~kwhitehall] let's touch base on this and try to include  within the 
selenium definition. This is Maven magic, so maybe we can print out 

{code}
ant report
{code}
... that way we can see how many transitive dependencies come from selenium.

[~wastl-nagel], tbh this was (and still is) an underlying concern for plugin 
dependencies, e.g. we recently introduced Apache Mahout. These libraries are 
non-trivial by any means. We have the same issue.

I would encourage all additions to evaluate existing compatibility and where 
new functionality fits in. We do not want to break new features as old folks. :)






Fwd: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

2015-09-18 Thread Julien Nioche

Nutch people,

Just in case you missed the announcement below. As you probably know, CC uses
Nutch for their crawls; this is a fantastic opportunity to put your Nutch
skills to great use!

Julien

-- Forwarded message --
From: Sara Crouse 
Date: 17 September 2015 at 22:51
Subject: Job Opening at Common Crawl - Crawl Engineer / Data Scientist
To: Common Crawl 


Hello again CC community,

In addition to my appointment, another staff transition is on the horizon,
and I would like to ask for your help finding candidates to fill a critical
role. At the end of this month, Stephen Merity (data scientist, crawl
engineer, and much more!) will leave Common Crawl to work on image
recognition and language understanding using deep learning at MetaMind, a
new startup. Stephen has been a great asset to Common Crawl, and we are
grateful that he wishes to remain engaged with us in a volunteer capacity
going forward.

This week, we therefore launch a search to fill the role of Crawl
Engineer/Data Scientist. Below and posted here https://commoncrawl.org/jobs/
is the job description. We appreciate any help you can provide in spreading
the word about this unique opportunity. If you have specific referrals, or
wish to apply, please contact j...@commoncrawl.org.

Many thanks,

Sara

---

_CRAWL ENGINEER / DATA SCIENTIST at THE COMMON CRAWL FOUNDATION_

*Location*
San Francisco or Remote


*Job Summary*
Common Crawl (CC) is the non-profit organization that builds and maintains
the single largest publicly accessible dataset of the world’s knowledge,
encompassing petabytes of web crawl data.

If democratizing access to web information and tackling the engineering
challenges of working with data at the scale of the web sounds exciting to
you, we would love to hear from you. If you have worked on open source
projects before or can share code samples with us, please don't hesitate to
send relevant links along with your application.


*Description*

/Primary Responsibilities/
_Running the crawl_
* Spinning up and managing Hadoop clusters on Amazon EC2
* Running regular comprehensive crawls of the web using Nutch
* Preparing and publishing crawl data to data hosting partner, Amazon Web
Services
* Incident response and diagnosis of crawl issues as they occur, e.g.
** Replacing lost instances due to EC2 problems / spot instance losses
** Responding to and remedying webmaster queries and issues

_Crawl engineering_
* Maintaining, developing, and deploying new features as required by
running the Nutch crawler, e.g.:
** Providing netiquette features, such as following robots.txt, as
required, and load balancing a crawl across millions of domains
** Implementing and improving ranking algorithms to prioritize the crawling
of popular pages
* Extending existing tools to work efficiently with large datasets
* Working with the Nutch community to push improvements to the crawler to
the public

/Other Responsibilities/
* Building support tools and artifacts, including documentation, tutorials,
and example code or supporting frameworks for processing CC data using
different tools.
* Identifying and reporting on research and innovations that result from
analysis and derivative use of CC data.
* Community evangelism:
** Collaborating with partners in academia and industry
** Engaging regularly with user discussion group and responding to frequent
inquiries about how to use CC data
** Writing technical blog posts
** Presenting on or representing CC at conferences, meetups, etc.


*Qualifications*
/Minimum qualifications/
* Fluent in Java (Nutch and Hadoop are core to our mission)
* Familiarity with the JVM big data ecosystem (Hadoop, HDFS, ...)
* Knowledge of the Amazon Web Services (AWS) ecosystem
* Experience with Python
* Basic command line Unix knowledge
* BS Computer Science or equivalent work experience

/Preferred qualifications/
* Experience with running web crawlers
* Cluster computing experience (Hadoop preferred)
* Running parallel jobs over dozens of terabytes of data
* Experience committing to open source projects and participating in open
source forums


*About Common Crawl*
The Common Crawl Foundation is a California 501(c)(3) registered non-profit
with the goal of democratizing access to web information by producing and
maintaining an open repository of web crawl data that is universally
accessible and analyzable.

Our vision is of a truly open web that allows open access to information
and enables greater innovation in research, business and education. We
level the playing field by making wholesale extraction, transformation and
analysis of web data cheap and easy.

The Common Crawl Foundation is an Equal Opportunity Employer.


*To Apply*
Please send your cover letter and resumé to j...@commoncrawl.org.


[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14805313#comment-14805313
 ] 

ASF GitHub Bot commented on NUTCH-2099:
---

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39839080
  
--- Diff: src/java/org/apache/nutch/metadata/Nutch.java ---
@@ -80,4 +80,11 @@
public static final String STAT_PROGRESS = "progress";
/**Used by Nutch REST service */
public static final String CRAWL_ID_KEY = "storage.crawl.id";
+   
+   public static final String ARG_SEEDDIR = "url_dir";
+   public static final String ARG_CRAWLDB = "crawldb";
--- End diff --

@sujen1412 any comments on augmenting trivial Javadoc?
It makes a huge difference when documented.


> Refactoring the REST endpoints for integration with webui
> -
>
> Key: NUTCH-2099
> URL: https://issues.apache.org/jira/browse/NUTCH-2099
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api, web gui
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.11
>
>
> This PR changes the structure of the arguments in the REST endpoints. Earlier 
> the args were accepted in a Map form and now it is 
> Map. This is to allow Wicket to create the proper request 
> objects and send them to NutchServer. 
> With the above, I have also added the metadata required for these services in 
> Nutch metadata.
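
The type parameters of the two Map forms are missing from the archived text
above. The following standalone sketch assumes the new form maps String keys
to Object values, which is what the `instanceof Path` check in the CrawlDb
diff of this pull request suggests. `java.nio.file.Path` stands in for
Hadoop's `Path` to keep the example dependency-free, and the fallback to
`<crawlId>/crawldb` mirrors the default that the diff removes; both are
assumptions rather than the actual implementation.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class ArgsMapSketch {

  // Same key string as Nutch.ARG_CRAWLDB in the diff above.
  static final String ARG_CRAWLDB = "crawldb";

  /**
   * Resolves the crawldb location from loosely typed arguments.
   * java.nio.file.Path is used here instead of Hadoop's Path (assumption).
   */
  static Path resolveCrawlDb(Map<String, Object> args, String crawlId) {
    Object value = args.get(ARG_CRAWLDB);
    if (value instanceof Path) {
      return (Path) value;                 // caller passed a ready-made path
    }
    if (value != null) {
      return Paths.get(value.toString());  // e.g. a String sent by a web form
    }
    return Paths.get(crawlId, "crawldb");  // assumed default: <crawlId>/crawldb
  }

  public static void main(String[] argv) {
    Map<String, Object> args = new HashMap<>();
    System.out.println(resolveCrawlDb(args, "crawl-01"));
    args.put(ARG_CRAWLDB, "data/crawldb");
    System.out.println(resolveCrawlDb(args, "crawl-01"));
  }
}
```

Accepting Object values lets a client such as the web UI pass either a typed
object or a plain string, at the cost of runtime type checks like the one
shown.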




[GitHub] nutch pull request: Fix for NUTCH-2099 Contributed by Sujen Shah

2015-09-18 Thread lewismc

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39839080
  
--- Diff: src/java/org/apache/nutch/metadata/Nutch.java ---
@@ -80,4 +80,11 @@
public static final String STAT_PROGRESS = "progress";
/**Used by Nutch REST service */
public static final String CRAWL_ID_KEY = "storage.crawl.id";
+   
+   public static final String ARG_SEEDDIR = "url_dir";
+   public static final String ARG_CRAWLDB = "crawldb";
--- End diff --

@sujen1412 any comments on augmenting trivial Javadoc?
It makes a huge difference when documented.



[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14805311#comment-14805311
 ] 

ASF GitHub Bot commented on NUTCH-2099:
---

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39839001
  
--- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java ---
@@ -261,30 +262,68 @@ public int run(String[] args) throws Exception {
   additionsAllowed = false;
 }
 
-String crawldb = crawlId+"/crawldb";
-String segment_dir = crawlId+"/segments";
-File segmentsDir = new File(segment_dir);
-File[] segmentsList = segmentsDir.listFiles();  
-Arrays.sort(segmentsList, new Comparator(){
-  @Override
-  public int compare(File f1, File f2) {
-if(f1.lastModified()>f2.lastModified())
-  return -1;
-else
-  return 0;
-  }  
-});
+Path crawlDb;
+if(args.containsKey(Nutch.ARG_CRAWLDB)) {
+   Object crawldbPath = args.get(Nutch.ARG_CRAWLDB);
+   if(crawldbPath instanceof Path) {
+   crawlDb = (Path) crawldbPath;
--- End diff --

no probs.
We can always sort it out pre-release. It can be a final ticket to format 
code.
Thank you for attention to detail.






[GitHub] nutch pull request: Fix for NUTCH-2099 Contributed by Sujen Shah

2015-09-18 Thread lewismc

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39839001
  
--- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java ---
@@ -261,30 +262,68 @@ public int run(String[] args) throws Exception {
   additionsAllowed = false;
 }
 
-String crawldb = crawlId+"/crawldb";
-String segment_dir = crawlId+"/segments";
-File segmentsDir = new File(segment_dir);
-File[] segmentsList = segmentsDir.listFiles();  
-Arrays.sort(segmentsList, new Comparator(){
-  @Override
-  public int compare(File f1, File f2) {
-if(f1.lastModified()>f2.lastModified())
-  return -1;
-else
-  return 0;
-  }  
-});
+Path crawlDb;
+if(args.containsKey(Nutch.ARG_CRAWLDB)) {
+   Object crawldbPath = args.get(Nutch.ARG_CRAWLDB);
+   if(crawldbPath instanceof Path) {
+   crawlDb = (Path) crawldbPath;
--- End diff --

no probs.
We can always sort it out pre-release. It can be a final ticket to format 
code.
Thank you for attention to detail.



[GitHub] nutch pull request: Fix for NUTCH-2099 Contributed by Sujen Shah

2015-09-18 Thread lewismc

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39838934
  
--- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java ---
@@ -236,10 +237,10 @@ public int run(String[] args) throws Exception {
* Used for Nutch REST service
*/
   @Override
-  public Map run(Map args, String crawlId) 
throws Exception {
+  public Map run(Map args, String crawlId) 
throws Exception {
--- End diff --

ack



[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14805308#comment-14805308
 ] 

ASF GitHub Bot commented on NUTCH-2099:
---

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/59#discussion_r39838934
  
--- Diff: src/java/org/apache/nutch/crawl/CrawlDb.java ---
@@ -236,10 +237,10 @@ public int run(String[] args) throws Exception {
* Used for Nutch REST service
*/
   @Override
-  public Map run(Map args, String crawlId) 
throws Exception {
+  public Map run(Map args, String crawlId) 
throws Exception {
--- End diff --

ack





