[jira] [Updated] (NUTCH-1375) extract main content of a html file
     [ https://issues.apache.org/jira/browse/NUTCH-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1375:
---------------------------------------
       Patch Info: Patch Available
    Fix Version/s: 1.7

> extract main content of a html file
> -----------------------------------
>
>                 Key: NUTCH-1375
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1375
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: behnam nikbakht
>             Fix For: 1.7
>
>         Attachments: NUTCH-1375.patch
>
>
> I wrote code that extracts the main content of an HTML page (usually a
> weblog). This content usually appears in a tag, but there is no guarantee.
> There might also be multiple tags of that form, but only one of them holds
> the main content. The code first finds the body node, then computes a weight
> for each child node based on text volume and height, so the code finds the
> lowest node that has the maximum text volume.
> I hope that improving this code leads to ways of detecting fake or
> duplicated pages.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
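
The approach in the description above (start at the body node, weigh child
nodes by text volume, keep the lowest node that still holds most of the text)
could be sketched roughly as follows. This is not the code from
NUTCH-1375.patch: the class name MainContentSketch, the helpers textVolume,
findMainContent and parseBody, and the DOMINANCE threshold of 0.6 are all
assumptions made for illustration. The sketch also uses only text volume and
treats "height" implicitly by descending one level at a time, and it parses
well-formed XHTML with the JDK parser instead of going through Nutch's HTML
parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MainContentSketch {

    // Assumed threshold (not from the patch): a child must hold at least this
    // share of its parent's text before the search descends into it.
    static final double DOMINANCE = 0.6;

    // Text volume of a subtree: total length of its trimmed text nodes.
    static int textVolume(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            return node.getNodeValue().trim().length();
        }
        int total = 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            total += textVolume(children.item(i));
        }
        return total;
    }

    // Walk down from root, following the child element with the most text for
    // as long as that child dominates its parent's text volume.
    static Node findMainContent(Node root) {
        Node current = root;
        while (true) {
            int total = textVolume(current);
            Node best = null;
            int bestVolume = 0;
            NodeList children = current.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue;
                }
                int volume = textVolume(child);
                if (volume > bestVolume) {
                    best = child;
                    bestVolume = volume;
                }
            }
            // Stop when no child element carries text, or when the text is
            // spread over several children (for example, sibling paragraphs).
            if (best == null || bestVolume < DOMINANCE * total) {
                return current;
            }
            current = best;
        }
    }

    // Parse a well-formed XHTML string and return its body element.
    static Element parseBody(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        return (Element) doc.getElementsByTagName("body").item(0);
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
                + "<div id=\"nav\"><a>Home</a><a>About</a></div>"
                + "<div id=\"wrap\"><div id=\"post\">"
                + "<p>First paragraph of the post with plenty of text.</p>"
                + "<p>Second paragraph of the post with more text.</p>"
                + "</div><div id=\"side\">Links</div></div>"
                + "<div id=\"footer\">Copyright</div>"
                + "</body></html>";
        Element main = (Element) findMainContent(parseBody(html));
        System.out.println(main.getAttribute("id")); // prints "post"
    }
}
```

The search stops at the "post" element because neither of its two paragraphs
holds 60% of its text, while each ancestor on the way down does concentrate
most of its text in a single child. A different threshold, or a weight that
also penalizes depth explicitly, would change where the search stops.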
[jira] [Updated] (NUTCH-1375) extract main content of a html file
     [ https://issues.apache.org/jira/browse/NUTCH-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

behnam nikbakht updated NUTCH-1375:
----------------------------------
    Attachment: NUTCH-1375.patch

> extract main content of a html file
> -----------------------------------
>
>                 Key: NUTCH-1375
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1375
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: behnam nikbakht
>         Attachments: NUTCH-1375.patch
>
>
> I wrote code that extracts the main content of an HTML page (usually a
> weblog). This content usually appears in a tag, but there is no guarantee.
> There might also be multiple tags of that form, but only one of them holds
> the main content. The code first finds the body node, then computes a weight
> for each child node based on text volume and height, so the code finds the
> lowest node that has the maximum text volume.
> I hope that improving this code leads to ways of detecting fake or
> duplicated pages.