[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764803#comment-17764803 ]
Hudson commented on NUTCH-3000:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #110 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/110/])
NUTCH-3000 - the selenium protocol should return the full html, not just the inner body element. (tallison: [https://github.com/apache/nutch/commit/820d129a8adff9a34eed2ed3c04cfee377b56b63])
* (edit) src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java

> protocol-selenium returns only the body, strips off the <head/> element
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-3000
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3000
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>            Reporter: Tim Allison
>            Priority: Major
>
> The selenium protocol returns only the body portion of the html, which means
> that neither the title nor the other page metadata in the <head/> section
> gets extracted.
> {noformat}
> String innerHtml = driver.findElement(By.tagName("body"))
>         .getAttribute("innerHTML");
> {noformat}
> We should return the full html, no?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
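
For illustration only, here is a minimal stand-alone sketch of why the body-only content described in the issue loses the title. It does not use Selenium or any Nutch code: the class `HeadLossDemo`, its helper methods, and the sample page are all hypothetical, and the regex-based extraction is a deliberate simplification rather than how Nutch parses HTML. In Selenium itself, `WebDriver.getPageSource()` returns the source of the whole page; whether the linked commit uses that call is not stated in this message.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch or Selenium.
public class HeadLossDemo {

    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of the first <title> element, if the markup contains one.
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    // Stand-in for reading the body element's innerHTML:
    // keeps only what lies between <body> and </body>.
    static String bodyInnerHtml(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String full = "<html><head><title>Example page</title></head>"
                + "<body><p>Hello</p></body></html>";

        // Full document: the title is still available.
        System.out.println(extractTitle(full));                // Optional[Example page]

        // Body-only content: the <head/> section, and the title with it, is gone.
        System.out.println(extractTitle(bodyInnerHtml(full))); // Optional.empty
    }
}
```

Any downstream parser that receives only the body fragment is in the position of the second call: nothing from the <head/> section is left to extract.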