[jira] [Commented] (NUTCH-2460) use the headless option of firefox and chrome in protocol-selenium
[ https://issues.apache.org/jira/browse/NUTCH-2460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264536#comment-16264536 ]

ASF GitHub Bot commented on NUTCH-2460:
---

hussein-alahmad opened a new pull request #245: fix for NUTCH-2460 contributed by Hussein Alahmad
URL: https://github.com/apache/nutch/pull/245

Use the headless option of Firefox and Chrome in protocol-selenium.

The --headless option is available in Firefox 55 or later, and in Chrome 59 or later. This is much better than relying on xvfb and its associates. We can add it as a property in the config file.

I'm trying it on my local machine, and will create a pull request when I finish testing it.

I've tested it using Firefox 57.0, geckodriver 0.19.1 and Selenium 3.7.1.

Important note: you need to add the following property to nutch-default.xml or nutch-site.xml for the headless option to work:

    <property>
      <name>selenium.firefox.headless</name>
      <value>true</value>
      <description>A Boolean value representing whether Firefox should run headless.
      Make sure that the Firefox version is 55 or later, and the Selenium WebDriver
      version is 3.6.0 or later. The default value is false.
      Currently this option exists for - 'firefox'</description>
    </property>

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> use the headless option of firefox and chrome in protocol-selenium
> --
>
>                 Key: NUTCH-2460
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2460
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin, protocol
>            Reporter: hussein Al_Ahmad
>            Priority: Minor
>
> the --headless option is added to firefox in version 55 or later, and in
> chrome in version 59 or later ...
> this is much better than relying on xvfb and its associates.
> we can add it as a property in the config file.
> I'm trying it on my local machine, and will create a pull request when I
> finish testing it.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
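To make the effect of the selenium.firefox.headless property concrete, here is a minimal Java sketch of turning such a boolean property into a browser argument. It is not the code from pull request #245: the class name HeadlessFlagSketch and the method firefoxArguments are hypothetical, and java.util.Properties stands in for the Nutch configuration object so that the example runs with the JDK alone. Only the property name, its default of false, and the --headless flag come from the message above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical illustration; not part of Nutch or of pull request #245.
public class HeadlessFlagSketch {

    // Property name as given in the pull request description.
    static final String HEADLESS_KEY = "selenium.firefox.headless";

    /**
     * Builds the extra command-line arguments for Firefox.
     * Returns a list containing "--headless" when the property is "true",
     * and an empty list otherwise (the documented default is false).
     */
    static List<String> firefoxArguments(Properties conf) {
        List<String> arguments = new ArrayList<>();
        boolean headless = Boolean.parseBoolean(conf.getProperty(HEADLESS_KEY, "false"));
        if (headless) {
            arguments.add("--headless");
        }
        return arguments;
    }

    public static void main(String[] args) {
        // Properties is an assumption standing in for the real configuration source.
        Properties conf = new Properties();
        conf.setProperty(HEADLESS_KEY, "true");
        System.out.println(firefoxArguments(conf));
    }
}
```

Leaving the property unset yields an empty argument list, so existing setups that rely on xvfb would keep their current behaviour under this sketch.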
[jira] [Updated] (NUTCH-2464) Headers That Contain HTML Elements Are Not Parsed
[ https://issues.apache.org/jira/browse/NUTCH-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2464:
---
    Affects Version/s:     (was: 2.3)
                       1.13

> Headers That Contain HTML Elements Are Not Parsed
> -
>
>                 Key: NUTCH-2464
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2464
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin
>    Affects Versions: 1.13
>         Environment: Internal development/test environments.
>            Reporter: Cass Pallansch
>         Attachments: NUTCH-2464-complex-header.html
>
>
> Nutch does not appear to traverse the HTML elements that may be contained
> within header elements (e.g., H1, H2, H3, etc. tags). Many times there are
> anchors and/or tags within these elements that contain the actual text
> nodes that should be picked up as the header value for indexing purposes.
[jira] [Commented] (NUTCH-2464) Headers That Contain HTML Elements Are Not Parsed
[ https://issues.apache.org/jira/browse/NUTCH-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264424#comment-16264424 ]

ASF GitHub Bot commented on NUTCH-2464:
---

jorgelbg opened a new pull request #244: Fix for NUTCH-2464 get textual content from nested heading nodes
URL: https://github.com/apache/nutch/pull/244

As suggested by @sebastian-nagel, refactored the `getNodeValue` method to use `NodeWalker` iterators. This allows traversing the entire DOM tree in the case of nested nodes, as explained in the issue. Added a test case for this issue as well.

> Headers That Contain HTML Elements Are Not Parsed
> -
>
>                 Key: NUTCH-2464
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2464
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin
>    Affects Versions: 2.3
>         Environment: Internal development/test environments.
>            Reporter: Cass Pallansch
>         Attachments: NUTCH-2464-complex-header.html
>
>
> Nutch does not appear to traverse the HTML elements that may be contained
> within header elements (e.g., H1, H2, H3, etc. tags). Many times there are
> anchors and/or tags within these elements that contain the actual text
> nodes that should be picked up as the header value for indexing purposes.
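The traversal described in this pull request can be illustrated with a small Java sketch that collects every text node below a heading, however deeply it is nested. This is not the Nutch code: a plain stack over the JDK's org.w3c.dom API stands in for Nutch's `NodeWalker`, the class and method names (NestedHeadingTextSketch, textOf, firstH1Text) are hypothetical, and the input must be well-formed XHTML because the JDK XML parser is used instead of an HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; a stack-based walk stands in for Nutch's NodeWalker.
public class NestedHeadingTextSketch {

    /** Collects all text nodes below the given node, in document order. */
    static String textOf(Node root) {
        StringBuilder text = new StringBuilder();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node.getNodeType() == Node.TEXT_NODE) {
                text.append(node.getNodeValue());
            }
            // Push children in reverse so the leftmost child is visited first.
            NodeList children = node.getChildNodes();
            for (int i = children.getLength() - 1; i >= 0; i--) {
                stack.push(children.item(i));
            }
        }
        // Collapse runs of whitespace left over from markup formatting.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    /** Parses a well-formed XHTML string and returns the text of its first h1. */
    static String firstH1Text(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return textOf(doc.getElementsByTagName("h1").item(0));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h1><a href=\"/x\"><span>Nested</span> heading</a></h1></body></html>";
        System.out.println(firstH1Text(page));
    }
}
```

Reading only the direct child of the h1 element would find the anchor element and no text; walking the whole subtree picks up the text inside the anchor and the span as well.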
Build path errors(Eclipse) in the latest nutch develop
Hello all,

I have tried to run the latest git (git clone http://github.com/apache/nutch.git) version of Nutch in Eclipse, but I got several build path errors:

Description | Resource | Path | Location | Type
Build path contains duplicate entry: 'src/plugin/protocol-htmlunit/src/java/' for project 'nutch' | nutch | | Build path | Build Path Problem
Project 'nutch' is missing required source folder: 'src/plugin/parse-replace/src/java/' | nutch | | Build path | Build Path Problem
Project 'nutch' is missing required source folder: 'src/plugin/parse-replace/src/test/' | nutch | | Build path | Build Path Problem

Any ideas?

Thanks.
Semyon.