[jira] [Updated] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2064: --- Attachment: NUTCH-2064-v3.patch Only the path/file segment of the URL should be subject to percent encoding; IDNs need different treatment (NUTCH-1321), and the query part also needs different rules. Attached an updated patch, not yet fully tested. > URLNormalizer basic to properly encode non-ASCII characters > --- > > Key: NUTCH-2064 > URL: https://issues.apache.org/jira/browse/NUTCH-2064 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.10 >Reporter: Markus Jelsma > Fix For: 1.11 > > Attachments: NUTCH-1098.patch, NUTCH-1098.patch, NUTCH-2064-v3.patch > > > NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
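As a rough illustration of the point above (only the path is percent-encoded, while host and query are left for separate handling), here is a minimal Java sketch. It is not taken from NUTCH-2064-v3.patch; the class `PathEncoder` and its methods are hypothetical names, and the URL splitting is deliberately simplistic (it assumes a `scheme://authority/path?query#fragment` shape).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: encodes non-ASCII characters in the path only.
class PathEncoder {
  private static final char[] HEX = "0123456789ABCDEF".toCharArray();

  /** Percent-encodes every non-ASCII code point of a path as its UTF-8 octets. */
  static String encodePath(String path) {
    StringBuilder out = new StringBuilder(path.length());
    for (int i = 0; i < path.length(); ) {
      int cp = path.codePointAt(i);
      int len = Character.charCount(cp);
      if (cp < 0x80) {
        out.append((char) cp); // ASCII is copied through unchanged
      } else {
        byte[] utf8 = path.substring(i, i + len).getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
          out.append('%').append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
        }
      }
      i += len;
    }
    return out.toString();
  }

  /** Applies encodePath to the path of a URL; host, query and fragment stay untouched. */
  static String encodeUrlPath(String url) {
    int schemeEnd = url.indexOf("://");
    if (schemeEnd < 0) {
      return url; // assumption: only scheme://authority URLs are handled
    }
    // The authority ends at the first '/', '?' or '#'.
    int i = schemeEnd + 3;
    while (i < url.length() && "/?#".indexOf(url.charAt(i)) < 0) {
      i++;
    }
    if (i == url.length() || url.charAt(i) != '/') {
      return url; // no path component present
    }
    int pathStart = i;
    int pathEnd = pathStart;
    while (pathEnd < url.length()
        && url.charAt(pathEnd) != '?' && url.charAt(pathEnd) != '#') {
      pathEnd++;
    }
    return url.substring(0, pathStart)
        + encodePath(url.substring(pathStart, pathEnd))
        + url.substring(pathEnd);
  }
}
```

With this sketch, a non-ASCII host name (an IDN) and the query string pass through unchanged, which leaves room for the separate rules mentioned in the comment.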
[jira] [Commented] (NUTCH-2021) Use protocol-selenium to Capture Screenshots of the Page as it is Fetched
[ https://issues.apache.org/jira/browse/NUTCH-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637586#comment-14637586 ]

Chris A. Mattmann commented on NUTCH-2021:
------------------------------------------

+1 great work Lewis.

> Use protocol-selenium to Capture Screenshots of the Page as it is Fetched
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-2021
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2021
>             Project: Nutch
>          Issue Type: New Feature
>          Components: plugin, protocol
>    Affects Versions: 1.10
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>         Attachments: NUTCH-2021.patch, NUTCH-2021.patch, NUTCH-2021.v3.patch,
> NUTCH-2021v2.patch, speakers-bureau.php.png
>
>
> This should be a piece of cake. It can be done as follows
> {code}
> WebDriver driver = new FirefoxDriver();
> driver.get("http://www.google.com/");
> File scrFile = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
> // Now you can do whatever you need to do with it, for example copy somewhere
> FileUtils.copyFile(scrFile, new File("/usr/local/pics/screenshot.png"));
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637584#comment-14637584 ] Chris A. Mattmann commented on NUTCH-2062: -- +1 from me. Commit! > Add Plugin for interacting with Selenium WebDriver > -- > > Key: NUTCH-2062 > URL: https://issues.apache.org/jira/browse/NUTCH-2062 > Project: Nutch > Issue Type: Improvement > Components: plugin >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Labels: memex > Fix For: 1.11 > > Attachments: NUTCH-2062v2.patch > > > The protocol-selenium plugin is great for pulling webpages that dynamically > load content. However, I've run into use cases where I need to actively > interact with a page in Selenium before it becomes useful. For instance, I > may need to paginate through a table to get all results that I'm interested > in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637563#comment-14637563 ]

Sebastian Nagel commented on NUTCH-2064:
----------------------------------------

Hi Markus,

why not define the range(s) of characters which can be safely unescaped by a positive statement as in the [RFC3986|https://tools.ietf.org/html/rfc3986#section-2.2]:
{quote}
For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.
{quote}
It's more than & and /, also, e.g. a plus sign as in [http://google.com/search?q=c%2B%2B].
See also [Percent-encoding|https://en.wikipedia.org/wiki/Percent-encoding#Percent-encoding_in_a_URI] in Wikipedia.

> URLNormalizer basic to properly encode non-ASCII characters
> -----------------------------------------------------------
>
>                 Key: NUTCH-2064
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2064
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: NUTCH-1098.patch, NUTCH-1098.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
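The "positive statement" suggested above can be sketched in Java as follows: decode a percent escape only when the octet falls in the unreserved set named in the quoted RFC 3986 passage (ALPHA, DIGIT, hyphen, period, underscore, tilde), and leave every other escape alone. This is only an illustration, not the code from any attached patch; `UnreservedDecoder` and its method names are made up for the example.

```java
// Hypothetical helper, not part of Nutch: decodes only escapes of unreserved characters.
class UnreservedDecoder {

  /** True for ALPHA / DIGIT / "-" / "." / "_" / "~", the set listed in the RFC quote. */
  static boolean isUnreserved(int c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9')
        || c == '-' || c == '.' || c == '_' || c == '~';
  }

  /** Value of an ASCII hex digit, or -1 if the character is not one. */
  static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
  }

  /** Replaces %XX by its character only when that character is unreserved. */
  static String decodeUnreserved(String s) {
    StringBuilder out = new StringBuilder(s.length());
    int i = 0;
    while (i < s.length()) {
      char ch = s.charAt(i);
      if (ch == '%' && i + 2 < s.length()) {
        int hi = hexValue(s.charAt(i + 1));
        int lo = hexValue(s.charAt(i + 2));
        if (hi >= 0 && lo >= 0) {
          int value = hi * 16 + lo;
          if (isUnreserved(value)) {
            out.append((char) value);       // safe to unescape
          } else {
            out.append(s, i, i + 3);        // keep the escape exactly as found
          }
          i += 3;
          continue;
        }
      }
      out.append(ch); // ordinary character or malformed escape: copy as-is
      i++;
    }
    return out.toString();
  }
}
```

Under this rule the plus signs in `q=c%2B%2B` stay escaped, because `+` is not in the unreserved set, while `%7E` becomes `~`.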
[jira] [Updated] (NUTCH-2004) ParseChecker does not handle redirects
[ https://issues.apache.org/jira/browse/NUTCH-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2004: - Labels: memex (was: ) > ParseChecker does not handle redirects > -- > > Key: NUTCH-2004 > URL: https://issues.apache.org/jira/browse/NUTCH-2004 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.9 >Reporter: Michael Joyce >Assignee: Michael Joyce >Priority: Minor > Labels: memex > Fix For: 1.11 > > > At the moment ParseChecker doesn't handle redirects. If it gets anything but > a success status it errors out. It would be nice if it handled redirects a > bit more gracefully based on the http.redirects config setting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
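The redirect handling requested above could look roughly like the loop below. This is a sketch under stated assumptions, not ParseChecker code: `RedirectFollower`, `Response`, and the in-memory response table are hypothetical stand-ins for a real fetch, and `maxRedirects` plays the role of the redirect setting mentioned in the issue.

```java
import java.util.Map;

// Hypothetical sketch, not part of Nutch: follows redirects up to a configured limit.
class RedirectFollower {

  /** Hypothetical fetch result: an HTTP status code plus an optional redirect target. */
  static final class Response {
    final int status;
    final String location; // null unless the status is a redirect

    Response(int status, String location) {
      this.status = status;
      this.location = location;
    }
  }

  /**
   * Returns the URL that finally answers with a non-redirect status.
   * The map stands in for a real fetcher (assumption made for illustration).
   */
  static String resolve(String url, Map<String, Response> responses, int maxRedirects) {
    String current = url;
    for (int hops = 0; ; hops++) {
      Response r = responses.get(current);
      if (r == null) {
        throw new IllegalStateException("no response for " + current);
      }
      boolean redirect = r.status >= 300 && r.status < 400 && r.location != null;
      if (!redirect) {
        return current; // success or error status: stop here
      }
      if (hops >= maxRedirects) {
        throw new IllegalStateException("more than " + maxRedirects + " redirects");
      }
      current = r.location; // follow the redirect and try again
    }
  }
}
```

With a limit of 0 the first redirect is reported as an error, which corresponds to the current behaviour described in the issue.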
[jira] [Updated] (NUTCH-2063) Add -mimeStats flag to FileDumper tool
[ https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2063: - Labels: memex (was: ) > Add -mimeStats flag to FileDumper tool > -- > > Key: NUTCH-2063 > URL: https://issues.apache.org/jira/browse/NUTCH-2063 > Project: Nutch > Issue Type: Bug > Components: dumpers >Affects Versions: 1.10 >Reporter: Lewis John McGibbney >Assignee: Michael Joyce > Labels: memex > Fix For: 1.11 > > Attachments: nutch-2063-joyce-21July2015.patch > > > Right now in order to get a MimeType distribution for any given number of > segments, one is required to dump some data. This is a waste if one just > wishes to see the mime type distribution across a number of segments. > An improvement to the FileDumper tool would be the addition of a -mimeStats > flag which would not attempt to dump any data but instead merely provide the > total stats message providing insight into how the FileDumper should be best > used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
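To make the idea concrete, a stats-only pass amounts to tallying MIME types without writing any files. The sketch below is not the FileDumper implementation; `MimeStats` is a hypothetical class, and the list of type strings stands in for whatever the tool reads from the segments.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not part of Nutch: counts MIME types without dumping any data.
class MimeStats {

  /** Returns a sorted map from MIME type to the number of times it was seen. */
  static Map<String, Integer> count(List<String> mimeTypes) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String type : mimeTypes) {
      counts.merge(type, 1, Integer::sum); // start at 1, otherwise add 1
    }
    return counts;
  }
}
```

A `TreeMap` is used here only so that the resulting report is printed in a stable, alphabetical order.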
[jira] [Updated] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2062: - Labels: memex (was: ) > Add Plugin for interacting with Selenium WebDriver > -- > > Key: NUTCH-2062 > URL: https://issues.apache.org/jira/browse/NUTCH-2062 > Project: Nutch > Issue Type: Improvement > Components: plugin >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Labels: memex > Fix For: 1.11 > > Attachments: NUTCH-2062v2.patch > > > The protocol-selenium plugin is great for pulling webpages that dynamically > load content. However, I've run into use cases where I need to actively > interact with a page in Selenium before it becomes useful. For instance, I > may need to paginate through a table to get all results that I'm interested > in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637063#comment-14637063 ] Michael Joyce commented on NUTCH-2062: -- [~lewismc], I've updated the PR with the changes. Let me know what you think. > Add Plugin for interacting with Selenium WebDriver > -- > > Key: NUTCH-2062 > URL: https://issues.apache.org/jira/browse/NUTCH-2062 > Project: Nutch > Issue Type: Improvement > Components: plugin >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.11 > > Attachments: NUTCH-2062v2.patch > > > The protocol-selenium plugin is great for pulling webpages that dynamically > load content. However, I've run into use cases where I need to actively > interact with a page in Selenium before it becomes useful. For instance, I > may need to paginate through a table to get all results that I'm interested > in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636958#comment-14636958 ] Michael Joyce commented on NUTCH-2062: -- Cheers [~lewismc], let me see what I can do with regards to updating the PR with these updates. > Add Plugin for interacting with Selenium WebDriver > -- > > Key: NUTCH-2062 > URL: https://issues.apache.org/jira/browse/NUTCH-2062 > Project: Nutch > Issue Type: Improvement > Components: plugin >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.11 > > Attachments: NUTCH-2062v2.patch > > > The protocol-selenium plugin is great for pulling webpages that dynamically > load content. However, I've run into use cases where I need to actively > interact with a page in Selenium before it becomes useful. For instance, I > may need to paginate through a table to get all results that I'm interested > in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-2062:
----------------------------------------
    Attachment: NUTCH-2062v2.patch

[~mjoyce] can you please try this patch out? I've
* renamed all relevant properties within nutch-default.xml to your new convention of libselenium.blah
* included the new package within default.properties
* added license headers and corrected package naming within the new handlers interfaces
* modularized src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java such that we can now getDriver based upon the new adaptive driver configuration which defaults to FirefoxDriver.

One thing to possibly consider: the dependency inclusions within the new plugin.xml may conflict with what's existing in lib-selenium and protocol-selenium. I think we may have to ensure that these are in sync.
Excellent job on this one Jimmy. Please let me know how this tests out.
Thanks

> Add Plugin for interacting with Selenium WebDriver
> --------------------------------------------------
>
>                 Key: NUTCH-2062
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2062
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.10
>            Reporter: Michael Joyce
>            Assignee: Michael Joyce
>             Fix For: 1.11
>
>         Attachments: NUTCH-2062v2.patch
>
>
> The protocol-selenium plugin is great for pulling webpages that dynamically
> load content. However, I've run into use cases where I need to actively
> interact with a page in Selenium before it becomes useful. For instance, I
> may need to paginate through a table to get all results that I'm interested
> in. This plugin will handle that use case.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2062: Assignee: Michael Joyce > Add Plugin for interacting with Selenium WebDriver > -- > > Key: NUTCH-2062 > URL: https://issues.apache.org/jira/browse/NUTCH-2062 > Project: Nutch > Issue Type: Improvement > Components: plugin >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.11 > > > The protocol-selenium plugin is great for pulling webpages that dynamically > load content. However, I've run into use cases where I need to actively > interact with a page in Selenium before it becomes useful. For instance, I > may need to paginate through a table to get all results that I'm interested > in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2063) Add -mimeStats flag to FileDumper tool
[ https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636935#comment-14636935 ] Hudson commented on NUTCH-2063: --- SUCCESS: Integrated in Nutch-trunk #3224 (See [https://builds.apache.org/job/Nutch-trunk/3224/]) NUTCH-2063 Add -mimeStats flag to FileDumper tool (lewismc: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1692268) * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/tools/FileDumper.java > Add -mimeStats flag to FileDumper tool > -- > > Key: NUTCH-2063 > URL: https://issues.apache.org/jira/browse/NUTCH-2063 > Project: Nutch > Issue Type: Bug > Components: dumpers >Affects Versions: 1.10 >Reporter: Lewis John McGibbney >Assignee: Michael Joyce > Fix For: 1.11 > > Attachments: nutch-2063-joyce-21July2015.patch > > > Right now in order to get a MimeType distribution for any given number of > segments, one is required to dump some data. This is a waste if one just > wishes to see the mime type distribution across a number of segments. > An improvement to the FileDumper tool would be the addition of a -mimeStats > flag which would not attempt to dump any data but instead merely provide the > total stats message providing insight into how the FileDumper should be best > used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-2063) Add -mimeStats flag to FileDumper tool
[ https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2063. - Resolution: Fixed Committed revision 1692268. Nice work [~mjoyce] > Add -mimeStats flag to FileDumper tool > -- > > Key: NUTCH-2063 > URL: https://issues.apache.org/jira/browse/NUTCH-2063 > Project: Nutch > Issue Type: Bug > Components: dumpers >Affects Versions: 1.10 >Reporter: Lewis John McGibbney >Assignee: Michael Joyce > Fix For: 1.11 > > Attachments: nutch-2063-joyce-21July2015.patch > > > Right now in order to get a MimeType distribution for any given number of > segments, one is required to dump some data. This is a waste if one just > wishes to see the mime type distribution across a number of segments. > An improvement to the FileDumper tool would be the addition of a -mimeStats > flag which would not attempt to dump any data but instead merely provide the > total stats message providing insight into how the FileDumper should be best > used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636773#comment-14636773 ] Lewis John McGibbney commented on NUTCH-2064: - +1 > URLNormalizer basic to properly encode non-ASCII characters > --- > > Key: NUTCH-2064 > URL: https://issues.apache.org/jira/browse/NUTCH-2064 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.10 >Reporter: Markus Jelsma > Fix For: 1.11 > > Attachments: NUTCH-1098.patch, NUTCH-1098.patch > > > NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2063) Add -mimeStats flag to FileDumper tool
[ https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2063: Assignee: Michael Joyce (was: Lewis John McGibbney) > Add -mimeStats flag to FileDumper tool > -- > > Key: NUTCH-2063 > URL: https://issues.apache.org/jira/browse/NUTCH-2063 > Project: Nutch > Issue Type: Bug > Components: dumpers >Affects Versions: 1.10 >Reporter: Lewis John McGibbney >Assignee: Michael Joyce > Fix For: 1.11 > > Attachments: nutch-2063-joyce-21July2015.patch > > > Right now in order to get a MimeType distribution for any given number of > segments, one is required to dump some data. This is a waste if one just > wishes to see the mime type distribution across a number of segments. > An improvement to the FileDumper tool would be the addition of a -mimeStats > flag which would not attempt to dump any data but instead merely provide the > total stats message providing insight into how the FileDumper should be best > used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2064: - Attachment: NUTCH-1098.patch Excellent! I have added both characters as a new test and it passes. > URLNormalizer basic to properly encode non-ASCII characters > --- > > Key: NUTCH-2064 > URL: https://issues.apache.org/jira/browse/NUTCH-2064 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.10 >Reporter: Markus Jelsma > Fix For: 1.11 > > Attachments: NUTCH-1098.patch, NUTCH-1098.patch > > > NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098. -- This message was sent by Atlassian JIRA (v6.3.4#6332)