[jira] [Commented] (NUTCH-2460) use the headless option of firefox and chrome in protocol-selenium

2017-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264536#comment-16264536
 ] 

ASF GitHub Bot commented on NUTCH-2460:
---

hussein-alahmad opened a new pull request #245: fix for NUTCH-2460 contributed 
by Hussein Alahmad
URL: https://github.com/apache/nutch/pull/245
 
 
   use the headless option of firefox and chrome in protocol-selenium
   
   The --headless option was added to Firefox in version 55 and to Chrome in 
version 59.
   This is much better than relying on xvfb and its associates.
   We can add it as a property in the config file.
   I'm trying it on my local machine, and will create a pull request when I 
finish testing it.
   
   I've tested it using Firefox 57.0, geckodriver 0.19.1 and Selenium 3.7.1.
   
   Important note: you need to add the following property to nutch-default.xml 
or nutch-site.xml for the headless option to work:
   
   
   <property>
     <name>selenium.firefox.headless</name>
     <value>true</value>
     <description>A Boolean value representing whether Firefox should
     run headless. Make sure that the Firefox version is 55 or later
     and the Selenium WebDriver version is 3.6.0 or later. The default
     value is false. Currently this option exists for 'firefox'.
     </description>
   </property>
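
   As a rough illustration of how such a Boolean property could gate the 
`--headless` argument, here is a minimal sketch. It is not the code from the 
pull request: `HeadlessArgs` and `firefoxArgs` are hypothetical names, and 
`java.util.Properties` stands in for the Nutch configuration object so that 
the example runs without Nutch or Selenium on the classpath. In the plugin, 
the resulting arguments would presumably be passed on to Selenium's Firefox 
options.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, for illustration only; not part of Nutch.
final class HeadlessArgs {

  // Property name taken from the note above.
  static final String HEADLESS_KEY = "selenium.firefox.headless";

  /** Builds the extra Firefox command-line arguments for the given config. */
  static List<String> firefoxArgs(Properties conf) {
    List<String> args = new ArrayList<>();
    // A missing property is treated as "false", matching the documented default.
    boolean headless =
        Boolean.parseBoolean(conf.getProperty(HEADLESS_KEY, "false"));
    if (headless) {
      args.add("--headless");
    }
    return args;
  }

  public static void main(String[] unused) {
    Properties conf = new Properties();
    conf.setProperty(HEADLESS_KEY, "true");
    System.out.println(firefoxArgs(conf)); // prints [--headless]
  }
}
```

   Defaulting to false keeps existing setups, such as those that rely on 
xvfb, unchanged unless the property is set explicitly.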
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> use the headless option of firefox and chrome in protocol-selenium
> --
>
> Key: NUTCH-2460
> URL: https://issues.apache.org/jira/browse/NUTCH-2460
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Reporter: hussein Al_Ahmad
>Priority: Minor
>
> The --headless option was added to Firefox in version 55 and to Chrome in 
> version 59.
> This is much better than relying on xvfb and its associates.
> We can add it as a property in the config file.
> I'm trying it on my local machine, and will create a pull request when I 
> finish testing it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2464) Headers That Contain HTML Elements Are Not Parsed

2017-11-23 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2464:
--
Affects Version/s: (was: 2.3)
   1.13

> Headers That Contain HTML Elements Are Not Parsed
> -
>
> Key: NUTCH-2464
> URL: https://issues.apache.org/jira/browse/NUTCH-2464
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 1.13
> Environment: Internal development/test environments.
>Reporter: Cass Pallansch
> Attachments: NUTCH-2464-complex-header.html
>
>
> Nutch does not appear to traverse the HTML elements that may be contained 
> within header elements (e.g., H1, H2, H3, etc. tags).  Many times there are 
> anchors and/or  tags within these elements that contain the actual text 
> nodes that should be picked up as the header value for indexing purposes.





[jira] [Commented] (NUTCH-2464) Headers That Contain HTML Elements Are Not Parsed

2017-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264424#comment-16264424
 ] 

ASF GitHub Bot commented on NUTCH-2464:
---

jorgelbg opened a new pull request #244: Fix for NUTCH-2464 get textual content 
from nested heading nodes
URL: https://github.com/apache/nutch/pull/244
 
 
   As suggested by @sebastian-nagel, I refactored the `getNodeValue` method to 
use `NodeWalker` iterators. This allows traversing the entire DOM tree in the 
case of nested nodes, as explained in the issue. I also added a test case for 
this issue.
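
   To illustrate the idea of walking every node nested under a heading, here 
is a minimal, self-contained sketch. It is not the code from this pull request 
and does not use Nutch's `NodeWalker`; it uses a plain stack over the standard 
`org.w3c.dom` API instead, and the names `HeadingText`, `textOf` and `parse` 
are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only; not part of Nutch.
final class HeadingText {

  /** Collects all text beneath a node, in document order, without recursion. */
  static String textOf(Node root) {
    StringBuilder sb = new StringBuilder();
    Deque<Node> stack = new ArrayDeque<>();
    stack.push(root);
    while (!stack.isEmpty()) {
      Node node = stack.pop();
      if (node.getNodeType() == Node.TEXT_NODE) {
        sb.append(node.getNodeValue());
      }
      // Push children in reverse so that the first child is visited first.
      NodeList children = node.getChildNodes();
      for (int i = children.getLength() - 1; i >= 0; i--) {
        stack.push(children.item(i));
      }
    }
    // Collapse runs of whitespace left over from the markup layout.
    return sb.toString().replaceAll("\\s+", " ").trim();
  }

  /** Parses a well-formed XHTML fragment; checked exceptions are wrapped. */
  static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalStateException("could not parse fragment", e);
    }
  }

  public static void main(String[] unused) {
    Document doc =
        parse("<h1><a href=\"#top\"><span>Nested</span> heading</a></h1>");
    System.out.println(textOf(doc.getDocumentElement())); // prints Nested heading
  }
}
```

   Reading only the direct child of the heading would miss the text here, 
because the text nodes sit inside the anchor and span elements.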




> Headers That Contain HTML Elements Are Not Parsed
> -
>
> Key: NUTCH-2464
> URL: https://issues.apache.org/jira/browse/NUTCH-2464
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 2.3
> Environment: Internal development/test environments.
>Reporter: Cass Pallansch
> Attachments: NUTCH-2464-complex-header.html
>
>
> Nutch does not appear to traverse the HTML elements that may be contained 
> within header elements (e.g., H1, H2, H3, etc. tags).  Many times there are 
> anchors and/or  tags within these elements that contain the actual text 
> nodes that should be picked up as the header value for indexing purposes.





Build path errors(Eclipse) in the latest nutch develop

2017-11-23 Thread Semyon Semyonov
Hello all,

I tried to build the latest Git version of Nutch (git clone 
http://github.com/apache/nutch.git) in Eclipse, but I got several build path 
errors:

Description    Resource    Path    Location    Type
Build path contains duplicate entry: 'src/plugin/protocol-htmlunit/src/java/' for project 'nutch'    nutch        Build path    Build Path Problem
Project 'nutch' is missing required source folder: 'src/plugin/parse-replace/src/java/'    nutch        Build path    Build Path Problem
Project 'nutch' is missing required source folder: 'src/plugin/parse-replace/src/test/'    nutch        Build path    Build Path Problem

Any ideas?

Thanks.
Semyon.