[jira] [Updated] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-07-22 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2064:
---
Attachment: NUTCH-2064-v3.patch

Only the path/file segment of the URL should be subject to percent-encoding. 
IDNs need different treatment (NUTCH-1321), and the query part also needs 
different rules. Attached an updated patch; it has not been fully tested yet.
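
As a rough illustration of the path-only rule (this is not the attached patch), the sketch below percent-encodes non-ASCII characters as UTF-8 octets in the path segment and leaves the host and query untouched. The class and method names (PathEncoder, encodePathOnly, encodeNonAscii) are hypothetical, and the URL splitting is deliberately simplistic.

```java
import java.nio.charset.StandardCharsets;

final class PathEncoder {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    /** Percent-encodes every non-ASCII character as its UTF-8 octets; ASCII is copied as-is. */
    static String encodeNonAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80) {
                out.append((char) v);   // plain ASCII byte
            } else {
                out.append('%').append(HEX[v >> 4]).append(HEX[v & 0xF]);
            }
        }
        return out.toString();
    }

    /** Applies encodeNonAscii to the path only; host, query and fragment are returned unchanged. */
    static String encodePathOnly(String url) {
        int schemeEnd = url.indexOf("://");
        int start = schemeEnd < 0 ? 0 : schemeEnd + 3;
        int pathStart = -1;
        int pathEnd = url.length();
        for (int i = start; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '?' || c == '#') {   // the path ends where query or fragment begins
                pathEnd = i;
                break;
            }
            if (c == '/' && pathStart < 0) {
                pathStart = i;            // first slash after the authority starts the path
            }
        }
        if (pathStart < 0) {
            return url;                   // no path segment at all
        }
        return url.substring(0, pathStart)
                + encodeNonAscii(url.substring(pathStart, pathEnd))
                + url.substring(pathEnd);
    }
}
```

With this sketch, a path such as /caf\u00e9 becomes /caf%C3%A9, while a non-ASCII host name or query string passes through unmodified so that separate rules can be applied to them.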

> URLNormalizer basic to properly encode non-ASCII characters
> ---
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch, NUTCH-2064-v3.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2021) Use protocol-selenium to Capture Screenshots of the Page as it is Fetched

2015-07-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637586#comment-14637586
 ] 

Chris A. Mattmann commented on NUTCH-2021:
--

+1 great work Lewis.

> Use protocol-selenium to Capture Screenshots of the Page as it is Fetched
> -
>
> Key: NUTCH-2021
> URL: https://issues.apache.org/jira/browse/NUTCH-2021
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin, protocol
>Affects Versions: 1.10
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-2021.patch, NUTCH-2021.patch, NUTCH-2021.v3.patch, 
> NUTCH-2021v2.patch, speakers-bureau.php.png
>
>
> This should be a piece of cake. It can be done as follows:
> {code}
> // Requires Selenium WebDriver (org.openqa.selenium.*) and Commons IO (FileUtils).
> WebDriver driver = new FirefoxDriver();
> driver.get("http://www.google.com/");
> // Capture the rendered page as a temporary PNG file.
> File scrFile = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
> // Now you can do whatever you need to do with it, for example copy it somewhere.
> FileUtils.copyFile(scrFile, new File("/usr/local/pics/screenshot.png"));
> driver.quit();
> {code}





[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-07-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637584#comment-14637584
 ] 

Chris A. Mattmann commented on NUTCH-2062:
--

+1 from me. Commit!

> Add Plugin for interacting with Selenium WebDriver
> --
>
> Key: NUTCH-2062
> URL: https://issues.apache.org/jira/browse/NUTCH-2062
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2062v2.patch
>
>
> The protocol-selenium plugin is great for pulling webpages that dynamically 
> load content. However, I've run into use cases where I need to actively 
> interact with a page in Selenium before it becomes useful. For instance, I 
> may need to paginate through a table to get all results that I'm interested 
> in. This plugin will handle that use case.





[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-07-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637563#comment-14637563
 ] 

Sebastian Nagel commented on NUTCH-2064:


Hi Markus, why not define the range(s) of characters that can safely be 
unescaped with a positive statement, as in 
[RFC3986|https://tools.ietf.org/html/rfc3986#section-2.2]:
{quote}
For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and 
%61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or 
tilde (%7E) should not be created by URI producers and, when found in a URI, 
should be decoded to their corresponding unreserved characters by URI 
normalizers.
{quote}
More than just & and / must stay escaped, e.g. a plus sign as in 
[http://google.com/search?q=c%2B%2B]. See also 
[Percent-encoding|https://en.wikipedia.org/wiki/Percent-encoding#Percent-encoding_in_a_URI] 
on Wikipedia.
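
The positive rule quoted above could be sketched roughly as follows. This is only an illustration, not the code in the patch: the class and method names (UnreservedDecoder, normalizeEscapes) are hypothetical. It decodes a percent-encoded octet only when it maps to an unreserved character and otherwise keeps the escape, upper-casing its hex digits as RFC 3986 recommends for normalization.

```java
final class UnreservedDecoder {

    /** True for RFC 3986 unreserved characters: ALPHA, DIGIT, '-', '.', '_', '~'. */
    static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    /** Value of an ASCII hex digit, or -1 if the character is not one. */
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    /** Decodes escapes of unreserved characters; all other escapes stay encoded. */
    static String normalizeEscapes(String s) {
        final String hex = "0123456789ABCDEF";
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (ch == '%' && i + 2 < s.length()) {
                int hi = hexValue(s.charAt(i + 1));
                int lo = hexValue(s.charAt(i + 2));
                if (hi >= 0 && lo >= 0) {
                    int octet = (hi << 4) | lo;
                    if (isUnreserved(octet)) {
                        out.append((char) octet);   // safe to unescape
                    } else {
                        // keep the escape, with upper-case hex digits
                        out.append('%').append(hex.charAt(hi)).append(hex.charAt(lo));
                    }
                    i += 3;
                    continue;
                }
            }
            out.append(ch);   // ordinary character or malformed escape: copy unchanged
            i++;
        }
        return out.toString();
    }
}
```

Under this sketch %7E and %2d are unescaped to ~ and -, whereas %2b (the plus sign in the example URL) and %2f stay escaped and are only rewritten to %2B and %2F.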

> URLNormalizer basic to properly encode non-ASCII characters
> ---
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.





[jira] [Updated] (NUTCH-2004) ParseChecker does not handle redirects

2015-07-22 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2004:
-
Labels: memex  (was: )

> ParseChecker does not handle redirects
> --
>
> Key: NUTCH-2004
> URL: https://issues.apache.org/jira/browse/NUTCH-2004
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.9
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>
> At the moment ParseChecker doesn't handle redirects. If it gets anything but 
> a success status it errors out. It would be nice if it handled redirects a 
> bit more gracefully based on the http.redirects config setting.
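
One way to picture the requested behaviour is a bounded redirect loop. The sketch below is not ParseChecker code: the Response type, the fetcher function and the maxRedirects parameter (standing in for the configured redirect limit) are all hypothetical names introduced for illustration.

```java
import java.util.function.Function;

final class RedirectFollower {

    /** Minimal stand-in for a fetch result: HTTP status plus redirect target, if any. */
    static final class Response {
        final int status;
        final String location;

        Response(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    /**
     * Fetches url and follows at most maxRedirects 3xx responses.
     * Returns the last response seen, which may still be a redirect
     * when the limit is reached.
     */
    static Response fetch(String url, int maxRedirects, Function<String, Response> fetcher) {
        Response response = fetcher.apply(url);
        int followed = 0;
        while (response.status >= 300 && response.status < 400
                && response.location != null
                && followed < maxRedirects) {
            response = fetcher.apply(response.location);
            followed++;
        }
        return response;
    }
}
```

With a limit of 0 the first response is returned as-is, so the caller can still report the redirect instead of erroring out.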





[jira] [Updated] (NUTCH-2063) Add -mimeStats flag to FileDumper tool

2015-07-22 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2063:
-
Labels: memex  (was: )

> Add -mimeStats flag to FileDumper tool
> --
>
> Key: NUTCH-2063
> URL: https://issues.apache.org/jira/browse/NUTCH-2063
> Project: Nutch
>  Issue Type: Bug
>  Components: dumpers
>Affects Versions: 1.10
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
>  Labels: memex
> Fix For: 1.11
>
> Attachments: nutch-2063-joyce-21July2015.patch
>
>
> Right now in order to get a MimeType distribution for any given number of 
> segments, one is required to dump some data. This is a waste if one just 
> wishes to see the mime type distribution across a number of segments.
> An improvement to the FileDumper tool would be the addition of a -mimeStats 
> flag which would not attempt to dump any data but instead merely provide the 
> total stats message providing insight into how the FileDumper should be best 
> used.
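
The idea of reporting the distribution without dumping anything can be sketched as a simple counting pass. This is not the FileDumper implementation: MimeStats, count and report are hypothetical names, and the input is assumed to be the MIME type of each record as a plain string.

```java
import java.util.Map;
import java.util.TreeMap;

final class MimeStats {

    /** Counts how often each MIME type occurs; nothing is written to disk. */
    static Map<String, Integer> count(Iterable<String> mimeTypes) {
        Map<String, Integer> counts = new TreeMap<>();   // sorted by type for stable output
        for (String type : mimeTypes) {
            counts.merge(type, 1, Integer::sum);
        }
        return counts;
    }

    /** Formats the counts as one "type: n" line per MIME type. */
    static String report(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```

A stats-only flag would then run the counting pass and print the report, skipping the step that writes file contents.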





[jira] [Updated] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-07-22 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2062:
-
Labels: memex  (was: )

> Add Plugin for interacting with Selenium WebDriver
> --
>
> Key: NUTCH-2062
> URL: https://issues.apache.org/jira/browse/NUTCH-2062
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2062v2.patch
>
>
> The protocol-selenium plugin is great for pulling webpages that dynamically 
> load content. However, I've run into use cases where I need to actively 
> interact with a page in Selenium before it becomes useful. For instance, I 
> may need to paginate through a table to get all results that I'm interested 
> in. This plugin will handle that use case.





[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-07-22 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637063#comment-14637063
 ] 

Michael Joyce commented on NUTCH-2062:
--

[~lewismc], I've updated the PR with the changes. Let me know what you think.

> Add Plugin for interacting with Selenium WebDriver
> --
>
> Key: NUTCH-2062
> URL: https://issues.apache.org/jira/browse/NUTCH-2062
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
> Attachments: NUTCH-2062v2.patch
>
>
> The protocol-selenium plugin is great for pulling webpages that dynamically 
> load content. However, I've run into use cases where I need to actively 
> interact with a page in Selenium before it becomes useful. For instance, I 
> may need to paginate through a table to get all results that I'm interested 
> in. This plugin will handle that use case.





[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-07-22 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636958#comment-14636958
 ] 

Michael Joyce commented on NUTCH-2062:
--

Cheers [~lewismc], let me see what I can do about updating the PR with these 
changes.

> Add Plugin for interacting with Selenium WebDriver
> --
>
> Key: NUTCH-2062
> URL: https://issues.apache.org/jira/browse/NUTCH-2062
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
> Attachments: NUTCH-2062v2.patch
>
>
> The protocol-selenium plugin is great for pulling webpages that dynamically 
> load content. However, I've run into use cases where I need to actively 
> interact with a page in Selenium before it becomes useful. For instance, I 
> may need to paginate through a table to get all results that I'm interested 
> in. This plugin will handle that use case.





[jira] [Updated] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-07-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2062:

Attachment: NUTCH-2062v2.patch

[~mjoyce] can you please try this patch out? I've
 * renamed all relevant properties within nutch-default.xml to your new 
convention of libselenium.blah
 * included the new package within default.properties
 * added license headers and corrected package naming within the new handlers 
interfaces
 * modularized 
src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
 such that we can now getDriver based upon the new adaptive driver 
configuration which defaults to FirefoxDriver.

One thing to consider: the dependency inclusions within the new plugin.xml may 
conflict with what already exists in lib-selenium and protocol-selenium. I 
think we may have to ensure that these are in sync.
Excellent job on this one, Jimmy.
Please let me know how this tests out. Thanks

> Add Plugin for interacting with Selenium WebDriver
> --
>
> Key: NUTCH-2062
> URL: https://issues.apache.org/jira/browse/NUTCH-2062
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
> Attachments: NUTCH-2062v2.patch
>
>
> The protocol-selenium plugin is great for pulling webpages that dynamically 
> load content. However, I've run into use cases where I need to actively 
> interact with a page in Selenium before it becomes useful. For instance, I 
> may need to paginate through a table to get all results that I'm interested 
> in. This plugin will handle that use case.





[jira] [Updated] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-07-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2062:

Assignee: Michael Joyce

> Add Plugin for interacting with Selenium WebDriver
> --
>
> Key: NUTCH-2062
> URL: https://issues.apache.org/jira/browse/NUTCH-2062
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
>
> The protocol-selenium plugin is great for pulling webpages that dynamically 
> load content. However, I've run into use cases where I need to actively 
> interact with a page in Selenium before it becomes useful. For instance, I 
> may need to paginate through a table to get all results that I'm interested 
> in. This plugin will handle that use case.





[jira] [Commented] (NUTCH-2063) Add -mimeStats flag to FileDumper tool

2015-07-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636935#comment-14636935
 ] 

Hudson commented on NUTCH-2063:
---

SUCCESS: Integrated in Nutch-trunk #3224 (See 
[https://builds.apache.org/job/Nutch-trunk/3224/])
NUTCH-2063 Add -mimeStats flag to FileDumper tool (lewismc: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1692268)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/tools/FileDumper.java


> Add -mimeStats flag to FileDumper tool
> --
>
> Key: NUTCH-2063
> URL: https://issues.apache.org/jira/browse/NUTCH-2063
> Project: Nutch
>  Issue Type: Bug
>  Components: dumpers
>Affects Versions: 1.10
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
> Fix For: 1.11
>
> Attachments: nutch-2063-joyce-21July2015.patch
>
>
> Right now in order to get a MimeType distribution for any given number of 
> segments, one is required to dump some data. This is a waste if one just 
> wishes to see the mime type distribution across a number of segments.
> An improvement to the FileDumper tool would be the addition of a -mimeStats 
> flag which would not attempt to dump any data but instead merely provide the 
> total stats message providing insight into how the FileDumper should be best 
> used.





[jira] [Resolved] (NUTCH-2063) Add -mimeStats flag to FileDumper tool

2015-07-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2063.
-
Resolution: Fixed

Committed revision 1692268.
Nice work [~mjoyce]

> Add -mimeStats flag to FileDumper tool
> --
>
> Key: NUTCH-2063
> URL: https://issues.apache.org/jira/browse/NUTCH-2063
> Project: Nutch
>  Issue Type: Bug
>  Components: dumpers
>Affects Versions: 1.10
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
> Fix For: 1.11
>
> Attachments: nutch-2063-joyce-21July2015.patch
>
>
> Right now in order to get a MimeType distribution for any given number of 
> segments, one is required to dump some data. This is a waste if one just 
> wishes to see the mime type distribution across a number of segments.
> An improvement to the FileDumper tool would be the addition of a -mimeStats 
> flag which would not attempt to dump any data but instead merely provide the 
> total stats message providing insight into how the FileDumper should be best 
> used.





[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-07-22 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636773#comment-14636773
 ] 

Lewis John McGibbney commented on NUTCH-2064:
-

+1

> URLNormalizer basic to properly encode non-ASCII characters
> ---
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.





[jira] [Updated] (NUTCH-2063) Add -mimeStats flag to FileDumper tool

2015-07-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2063:

Assignee: Michael Joyce  (was: Lewis John McGibbney)

> Add -mimeStats flag to FileDumper tool
> --
>
> Key: NUTCH-2063
> URL: https://issues.apache.org/jira/browse/NUTCH-2063
> Project: Nutch
>  Issue Type: Bug
>  Components: dumpers
>Affects Versions: 1.10
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
> Fix For: 1.11
>
> Attachments: nutch-2063-joyce-21July2015.patch
>
>
> Right now in order to get a MimeType distribution for any given number of 
> segments, one is required to dump some data. This is a waste if one just 
> wishes to see the mime type distribution across a number of segments.
> An improvement to the FileDumper tool would be the addition of a -mimeStats 
> flag which would not attempt to dump any data but instead merely provide the 
> total stats message providing insight into how the FileDumper should be best 
> used.





[jira] [Updated] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-07-22 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2064:
-
Attachment: NUTCH-1098.patch

Excellent! I have added both characters as a new test and it passes.

> URLNormalizer basic to properly encode non-ASCII characters
> ---
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.


