[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764697#comment-17764697 ]

Markus Jelsma commented on NUTCH-3000:
--------------------------------------

Yes, this is a bit odd indeed. +1

> protocol-selenium returns only the body,strips off the <head/> element
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-3000
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3000
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>            Reporter: Tim Allison
>            Priority: Major
>
> The selenium protocol returns only the body portion of the html, which means
> that neither the title nor the other page metadata in the <head/> section
> gets extracted.
> {noformat}
> String innerHtml = driver.findElement(By.tagName("body"))
>         .getAttribute("innerHTML");
> {noformat}
> We should return the full html, no?

--
This message was sent by Atlassian Jira (v8.20.10#820010)