[jira] [Updated] (NUTCH-1375) extract main content of a html file
[ https://issues.apache.org/jira/browse/NUTCH-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1375:
    Patch Info: Patch Available
    Fix Version/s: 1.7

extract main content of a html file
---
    Key: NUTCH-1375
    URL: https://issues.apache.org/jira/browse/NUTCH-1375
    Project: Nutch
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.4
    Reporter: behnam nikbakht
    Fix For: 1.7
    Attachments: NUTCH-1375.patch

I wrote code that can extract the main content of an HTML page (usually a weblog). This content usually appears in a bodyp tag, but there is no guarantee of that. There may also be multiple tags of the form bodyp, of which only one holds the main content.

The code first finds the body node and then computes a weight for each of its child nodes, based on text volume and height. In this way it finds the lowest node that has the maximum text volume.

I hope that improving this code will lead to ways of detecting fake or duplicated pages.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1375) extract main content of a html file
[ https://issues.apache.org/jira/browse/NUTCH-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

behnam nikbakht updated NUTCH-1375:
---
    Attachment: NUTCH-1375.patch

extract main content of a html file
---
    Key: NUTCH-1375
    URL: https://issues.apache.org/jira/browse/NUTCH-1375
    Project: Nutch
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.4
    Reporter: behnam nikbakht
    Attachments: NUTCH-1375.patch

I wrote code that can extract the main content of an HTML page (usually a weblog). This content usually appears in a bodyp tag, but there is no guarantee of that. There may also be multiple tags of the form bodyp, of which only one holds the main content.

The code first finds the body node and then computes a weight for each of its child nodes, based on text volume and height. In this way it finds the lowest node that has the maximum text volume.

I hope that improving this code will lead to ways of detecting fake or duplicated pages.
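
The approach described in the issue (find the body node, weigh its child nodes by text volume, and descend to the lowest node that still holds most of the text) might look roughly like the sketch below. This is not the code in NUTCH-1375.patch. The class name, the method names, and the 0.6 dominance threshold are assumptions made for illustration, and height is handled only implicitly, by walking down one level at a time. The sketch uses the JDK's XML parser on well-formed XHTML rather than the HTML parser Nutch uses.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class MainContentSketch {
    // Hypothetical threshold: keep descending only while one child holds
    // at least this share of the current node's text.
    static final double DOMINANCE = 0.6;

    /** Text volume: count of non-whitespace characters under a node. */
    static int textVolume(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            return node.getNodeValue().replaceAll("\\s+", "").length();
        }
        int total = 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            total += textVolume(children.item(i));
        }
        return total;
    }

    /** Walks down from body to the lowest node that still dominates the text. */
    static Node findMainContent(Node body) {
        Node current = body;
        while (true) {
            int parentVolume = textVolume(current);
            Node best = null;
            int bestVolume = 0;
            NodeList children = current.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue;
                }
                int volume = textVolume(child);
                if (volume > bestVolume) {
                    best = child;
                    bestVolume = volume;
                }
            }
            // Stop when no element child carries most of the text: the text
            // is spread over siblings (e.g. paragraphs), so this node is it.
            if (best == null || bestVolume < DOMINANCE * parentVolume) {
                return current;
            }
            current = best;
        }
    }

    /** Parses well-formed XHTML and returns the text of the main content node. */
    static String mainContentText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        Node body = doc.getElementsByTagName("body").item(0);
        return findMainContent(body).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id=\"nav\"><a>Home</a><a>About</a></div>"
                + "<div id=\"post\"><p>First paragraph of the post with plenty of text.</p>"
                + "<p>Second paragraph with even more text in it.</p></div>"
                + "<div id=\"footer\">Copyright</div>"
                + "</body></html>";
        System.out.println(mainContentText(page));
    }
}
```

On the sample page in main, the walk descends from body into the "post" div and stops there, because neither paragraph alone reaches the threshold. A fixed threshold is a crude stand-in for the weighting the issue describes, so pages with very short posts or very long sidebars would likely need a different rule.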