[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved NUTCH-3000.
--------------------------------
    Fix Version/s: 1.20
       Resolution: Fixed

> protocol-selenium returns only the body,strips off the <head/> element
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-3000
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3000
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>            Reporter: Tim Allison
>            Priority: Major
>             Fix For: 1.20
>
>
> The selenium protocol returns only the body portion of the html, which means
> that neither the title nor the other page metadata in the <head/> section
> gets extracted.
> {noformat}
> String innerHtml = driver.findElement(By.tagName("body"))
>         .getAttribute("innerHTML");
> {noformat}
> We should return the full html, no?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
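
To illustrate the symptom reported in NUTCH-3000, here is a minimal standalone sketch in plain Java with no Selenium dependency. The class HeadLossSketch and its helpers extractTitle and bodyInnerHtml are hypothetical names invented for this example, and the regex-based parsing is only a stand-in for a real HTML parser. It shows that once only the inner HTML of <body> is kept, the <title> can no longer be extracted.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not code from Nutch or Selenium.
final class HeadLossSketch {

    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of the first <title> element, if the markup contains one.
    public static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    // Stand-in for reading the "innerHTML" of <body>: keeps only the markup
    // between <body> and </body>, so everything in <head> is discarded.
    public static String bodyInnerHtml(String fullHtml) {
        Matcher m = BODY.matcher(fullHtml);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Example page</title></head>"
                + "<body><p>Hello</p></body></html>";

        // Full document: the title is still present.
        System.out.println(extractTitle(page).isPresent());
        // Body-only content: the title has been stripped along with <head>.
        System.out.println(extractTitle(bodyInnerHtml(page)).isPresent());
    }
}
```

In Selenium's Java API, WebDriver.getPageSource() returns the source of the current page and is one candidate for obtaining the full document. This message does not say whether the fix in 1.20 uses it.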