Re: wiki editing permission

2016-02-18 Thread Sebastian Nagel
Hi Joe,

I've added user JosephNaegele to https://wiki.apache.org/nutch/ContributorsGroup

You should now be able to edit the Nutch wiki.

Cheers and thanks in advance for your contributions,
Sebastian

On 02/18/2016 04:55 PM, Joseph Naegele wrote:
> Hi,
> 
> I would like permission to freely contribute to the wiki. My wiki username
> is JosephNaegele.
> Examples of things I'd add immediately:
> 
> - Fix Javadoc links to point to valid URLs
> - Update PluginCentral so users don't have to sift through OldPluginCentral.
> Example: It's unclear that the Parser extension point requires a
> "contentType" parameter.
> 
> Thanks,
> Joe
> 



[Nutch Wiki] Update of "ContributorsGroup" by SebastianNagel

2016-02-18 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "ContributorsGroup" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/ContributorsGroup?action=diff=37=38

Comment:
add Joseph Naegele to contributors group

   * ayeshahasan
   * Kshamak
   * AmmarShadiq
+  * JosephNaegele
  


[jira] [Commented] (NUTCH-2218) Switch CrawlCompletion arg parsing to Commons CLI

2016-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152853#comment-15152853
 ] 

Hudson commented on NUTCH-2218:
---

SUCCESS: Integrated in Nutch-trunk #3349 (See 
[https://builds.apache.org/job/Nutch-trunk/3349/])
NUTCH-2218 - Update CHANGES.txt. Merge PR #91 (joyce: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1731103])
* trunk/CHANGES.txt
NUTCH-2218 - Update CrawlComplete util to use Commons CLI (joyce: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1731102])
* trunk/src/java/org/apache/nutch/util/CrawlCompletionStats.java


> Switch CrawlCompletion arg parsing to Commons CLI
> -------------------------------------------------
>
>                 Key: NUTCH-2218
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2218
>             Project: Nutch
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 1.11
>            Reporter: Michael Joyce
>            Assignee: Michael Joyce
>            Priority: Minor
>             Fix For: 1.12
>
>
> The current CrawlCompletion utility should be updated to use commons CLI 
> instead of doing manual arg parsing and checking.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Jenkins build is back to normal : Nutch-trunk #3349

2016-02-18 Thread Apache Jenkins Server
See 



[jira] [Commented] (NUTCH-2218) Switch CrawlCompletion arg parsing to Commons CLI

2016-02-18 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152751#comment-15152751
 ] 

Lewis John McGibbney commented on NUTCH-2218:
-

Nice Mike thanks



[jira] [Resolved] (NUTCH-2218) Switch CrawlCompletion arg parsing to Commons CLI

2016-02-18 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2218.
--
Resolution: Fixed

[~lewismc], This got merged. I added an example to the option you raised as 
well. If that doesn't address your concerns let me know and I'll update in 
another ticket.

{code}
| -> ./bin/nutch crawlcomplete
usage: CrawlCompletionStats [-h] -inputDirs  -mode 
       [-numReducers ] -outputDir 
 -h,--help       Show this message
 -inputDirs      Comma separated list of crawl directories
                 (e.g., "./crawl1,./crawl2")
 -mode           Set statistics gathering mode (by 'host' or
                 by 'domain')
 -numReducers    Optional number of reduce jobs to use.
                 Defaults to 1
 -outputDir      Output directory where results should be
                 dumped
{code}
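
As an illustration of the rules the usage text above implies (three required options, one optional option with a default of 1, and a mode restricted to 'host' or 'domain'), here is a minimal sketch in plain, dependency-free Java. It is not the merged code, which per this ticket uses Commons CLI; the class name CrawlCompletionArgsSketch and its fields are hypothetical, and -h handling is left out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: hand-rolled parsing of the options
// listed in the usage text. The real utility uses Commons CLI instead.
class CrawlCompletionArgsSketch {
    final List<String> inputDirs;
    final String mode;
    final int numReducers;
    final String outputDir;

    private CrawlCompletionArgsSketch(List<String> inputDirs, String mode,
                                      int numReducers, String outputDir) {
        this.inputDirs = inputDirs;
        this.mode = mode;
        this.numReducers = numReducers;
        this.outputDir = outputDir;
    }

    static CrawlCompletionArgsSketch parse(String[] args) {
        String inputDirs = null;
        String mode = null;
        String outputDir = null;
        int numReducers = 1; // the usage text says this defaults to 1

        // Every option handled here takes exactly one value.
        for (int i = 0; i < args.length; i++) {
            String flag = args[i];
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("Missing value for " + flag);
            }
            String value = args[++i];
            switch (flag) {
                case "-inputDirs":   inputDirs = value; break;
                case "-mode":        mode = value; break;
                case "-numReducers": numReducers = Integer.parseInt(value); break;
                case "-outputDir":   outputDir = value; break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + flag);
            }
        }

        // Required-option and allowed-value checks.
        if (inputDirs == null || mode == null || outputDir == null) {
            throw new IllegalArgumentException(
                "-inputDirs, -mode and -outputDir are required");
        }
        if (!mode.equals("host") && !mode.equals("domain")) {
            throw new IllegalArgumentException("-mode must be 'host' or 'domain'");
        }
        // -inputDirs is a comma separated list of crawl directories.
        return new CrawlCompletionArgsSketch(
            Arrays.asList(inputDirs.split(",")), mode, numReducers, outputDir);
    }

    public static void main(String[] argv) {
        CrawlCompletionArgsSketch parsed = parse(new String[] {
            "-inputDirs", "./crawl1,./crawl2", "-mode", "host", "-outputDir", "stats"});
        System.out.println(parsed.inputDirs + " " + parsed.mode + " "
            + parsed.numReducers + " " + parsed.outputDir);
    }
}
```

Checks like these are the "manual arg parsing and checking" the ticket moves out of the utility; with Commons CLI the required flags and help output are declared once instead of being coded by hand.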



[GitHub] nutch pull request: NUTCH-2218 - Update CrawlComplete util with Co...

2016-02-18 Thread MJJoyce
Github user MJJoyce closed the pull request at:

https://github.com/apache/nutch/pull/91


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (NUTCH-2218) Switch CrawlCompletion arg parsing to Commons CLI

2016-02-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152736#comment-15152736
 ] 

ASF GitHub Bot commented on NUTCH-2218:
---

Github user MJJoyce closed the pull request at:

https://github.com/apache/nutch/pull/91




[jira] [Commented] (NUTCH-2218) Switch CrawlCompletion arg parsing to Commons CLI

2016-02-18 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152731#comment-15152731
 ] 

Michael Joyce commented on NUTCH-2218:
--

Sorry for any confusion here, folks. Changes were merged in r1731102. 
CHANGES.txt was updated in r1731103 since I forgot to update it. The PR should 
be closed by r1731103.



wiki editing permission

2016-02-18 Thread Joseph Naegele
Hi,

I would like permission to freely contribute to the wiki. My wiki username
is JosephNaegele.
Examples of things I'd add immediately:

- Fix Javadoc links to point to valid URLs
- Update PluginCentral so users don't have to sift through OldPluginCentral.
Example: It's unclear that the Parser extension point requires a
"contentType" parameter.

Thanks,
Joe



[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152420#comment-15152420
 ] 

Chris A. Mattmann commented on NUTCH-2191:
--

Markus, we don't need to fix the broader plugin dependency issue here. We 
should just focus on NUTCH-2191, and for that matter I agree with Karanjeet's 
solution for part #1, i.e. creating a new lib-htmlunit library and changing the 
dependency to it. For #2, please try again after rebuilding; it should work.

> Add protocol-htmlunit
> ---------------------
>
>                 Key: NUTCH-2191
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2191
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2191.patch, NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152221#comment-15152221
 ] 

Markus Jelsma commented on NUTCH-2191:
--

1. Although that could work, it does not truly resolve the dependency problem 
Nutch plugins are facing, as I described in 
https://issues.apache.org/jira/browse/NUTCH-2191?focusedCommentId=15070952=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15070952
 so it would still be a work-around :(

2. No, new WebClient() also opened a window here. If that shouldn't happen, I 
probably had a stale build.





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Karanjeet Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152191#comment-15152191
 ] 

Karanjeet Singh commented on NUTCH-2191:


1. Yes, and I have a plan for that. I have tested it with the Selenium 2.44.0 
configuration that was in Nutch before Nov 25, 2015. How about we create 
another plugin, lib-htmlunit, with the Selenium 2.44.0 configuration and use 
that in protocol-htmlunit instead of lib-selenium? I hope that will work 
fine.

2. I hope you mean it does seem to work. :)  Please let me know if you are 
still facing any problem.



[jira] [Comment Edited] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152184#comment-15152184
 ] 

Markus Jelsma edited comment on NUTCH-2191 at 2/18/16 11:34 AM:


1. Ah yes, we still need to fix this crazy plugin dependency problem somehow
2. Thanks! Hmm, apparently that doesn't seem to work :)


was (Author: markus17):
1. Ah yes, we still need to fix this crazy plugin dependency problem somehow
2. Thanks!



[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152184#comment-15152184
 ] 

Markus Jelsma commented on NUTCH-2191:
--

1. Ah yes, we still need to fix this crazy plugin dependency problem somehow
2. Thanks!



[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Karanjeet Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152152#comment-15152152
 ] 

Karanjeet Singh commented on NUTCH-2191:


1. For SSL, if you try to use it with Selenium 2.44.0, it will work fine. I 
suspect HtmlUnit is not compatible with the new version of HttpClient.jar.

2. You can do headless browsing if you make the following change:
{code}
//private WebClient client = new WebClient(BrowserVersion.FIREFOX_38);
private WebClient client = new WebClient();
{code}
This will not open a Firefox browser.



[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152141#comment-15152141
 ] 

Markus Jelsma commented on NUTCH-2191:
--

Hello Kshitij - well no, certainly not at this time. The plugin hardly works 
at all; this patch is for 1.x and is just a work that is sometimes in progress.



[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152140#comment-15152140
 ] 

Markus Jelsma commented on NUTCH-2191:
--

Hi - it works indeed. But new problems appear, as usual!

1. SSL does not work due to
{code}
2016-02-18 11:53:21,130 ERROR htmlunit.Http - Failed to get protocol output
java.lang.IllegalArgumentException: Cannot locate declared field org.apache.http.impl.client.HttpClientBuilder.sslContext
        at org.apache.commons.lang3.reflect.FieldUtils.readDeclaredField(FieldUtils.java:382)
        at com.gargoylesoftware.htmlunit.HttpWebConnection.createConnectionManager(HttpWebConnection.java:944)
        at com.gargoylesoftware.htmlunit.HttpWebConnection.getResponse(HttpWebConnection.java:161)
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1321)
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponse(WebClient.java:1238)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:346)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:415)
        at org.apache.nutch.protocol.htmlunit.HttpResponse.<init>(HttpResponse.java:103)
{code}

2. I don't know how yet, but since it uses Selenium, every time I try a file a 
browser opens! This is crazy, I didn't know this was even possible.
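
The exception in item 1 is the kind of failure that occurs when code reads a private field by name through reflection and the class found on the classpath does not declare a field with that name. A minimal, dependency-free sketch of that failure mode follows. BuilderV1, BuilderV2 and tlsContext are hypothetical stand-ins, not HttpClient or HtmlUnit classes, and the helper only imitates the error message seen in the trace.

```java
import java.lang.reflect.Field;

// Hypothetical illustration of a reflective field lookup that breaks
// when the target class no longer declares the expected field name.
class ReflectiveFieldDemo {
    // Stand-in for a library version that declares the expected field.
    static class BuilderV1 {
        private Object sslContext = "ctx";
    }

    // Stand-in for another version where the field has a different name.
    static class BuilderV2 {
        private Object tlsContext = "ctx";
    }

    // Reads a declared (possibly private) field by name, turning a missing
    // field into an IllegalArgumentException like the one in the trace.
    static Object readDeclaredField(Object target, String name) {
        try {
            Field field = target.getClass().getDeclaredField(name);
            field.setAccessible(true);
            return field.get(target);
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("Cannot locate declared field "
                + target.getClass().getName() + "." + name, e);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Works: the field exists under the expected name.
        System.out.println(readDeclaredField(new BuilderV1(), "sslContext"));
        try {
            // Fails: same lookup against the other "version".
            readDeclaredField(new BuilderV2(), "sslContext");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the lookup is by string name, a mismatch of this kind only shows up at runtime, which would be consistent with a library being paired with a version of a dependency it was not built against.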

Markus



[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Kshitij Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152005#comment-15152005
 ] 

Kshitij Shukla commented on NUTCH-2191:
---

Hello [~markus17], is it possible to get a screenshot of the webpage being 
crawled using this plugin? Also, is it compatible with apache-nutch-2.3.1?
I am a newbie to the Nutch system. I looked for the plugin in 
"$NUTCH_HOME/src/plugin" but didn't find it. Can you tell me where I could 
get this plugin?
Thank you.
