[ https://issues.apache.org/jira/browse/NUTCH-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281024#comment-13281024 ]
Julien Nioche commented on NUTCH-1375:
--------------------------------------

Your patch generates noise (a + myBeautifulPatch.patch), and there is no documentation or indication of how it should be used. A neater approach would be to integrate your implementation into Boilerpipe instead (see related issue).

> extract main content of a html file
> -----------------------------------
>
>                 Key: NUTCH-1375
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1375
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: behnam nikbakht
>         Attachments: NUTCH-1375.patch
>
>
> I wrote code that extracts the main content of an HTML page (usually a weblog).
> This content usually appears in a <body><p> tag, but there is no guarantee. There
> may also be multiple tags of the form <body><p>, of which only one is the main
> content. The code first finds the body node, then computes a weight for each child
> node based on text volume and height, so that it finds the lowest node that
> has the maximum text volume.
> I hope that improving this code will lead to ways of detecting fake or
> duplicated pages.
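
The heuristic described in the quoted issue (start at the body node, weigh child nodes by text volume, and pick the lowest node that still holds most of the text) can be sketched roughly as follows. This is not the code from NUTCH-1375.patch, whose exact weighting is not given here. The `Node` class, the `minShare` threshold, and the rule "descend while one child holds at least `minShare` of the current node's text" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class MainContentSketch {

    /** Hypothetical minimal DOM node; a real parser would supply its own node type. */
    public static final class Node {
        final String tag;
        final String text; // text directly inside this element
        final List<Node> children = new ArrayList<>();

        public Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }

        /** Appends a child and returns this node so calls can be chained. */
        public Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Text volume: number of trimmed text characters in the whole subtree. */
    public static int textVolume(Node n) {
        int total = n.text.trim().length();
        for (Node c : n.children) {
            total += textVolume(c);
        }
        return total;
    }

    /**
     * Walks down from body, following the child with the largest text volume
     * for as long as that child holds at least minShare of its parent's text.
     * The node where the walk stops is taken as the main content (assumed rule).
     */
    public static Node findMainContent(Node body, double minShare) {
        Node current = body;
        while (true) {
            Node best = null;
            int bestVolume = 0;
            for (Node c : current.children) {
                int v = textVolume(c);
                if (v > bestVolume) {
                    best = c;
                    bestVolume = v;
                }
            }
            // Stop when there is no child with text, or the text is spread
            // over several children rather than concentrated in one.
            if (best == null || bestVolume < minShare * textVolume(current)) {
                return current;
            }
            current = best;
        }
    }

    public static void main(String[] args) {
        Node article = new Node("article", "")
                .add(new Node("p", "x".repeat(60)))
                .add(new Node("p", "y".repeat(50)));
        Node body = new Node("body", "")
                .add(new Node("nav", "Home About Contact"))
                .add(article)
                .add(new Node("footer", "Copyright"));

        Node main = findMainContent(body, 0.6);
        System.out.println(main.tag);
    }
}
```

In this example the walk descends from `body` into `article`, which holds most of the page text, and stops there because neither paragraph alone reaches the assumed 60% share. A threshold like this is only one possible way to combine text volume and depth; the patch may weigh them differently.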