[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546005#comment-15546005 ]
Kris commented on NUTCH-978:
----------------------------

Here is a solution that I am currently experimenting with.

> A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2012, mentor
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf,
> app_guardian_ivory_coast_news_exmpl.png,
> app_screenshoot_configuration_result.png,
> app_screenshoot_configuration_result_anchor.png,
> app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png,
> for_GSoc.zip, version_alpha2.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>
> Nutch uses the parse-html plugin to parse web pages. It processes the
> contents of a web page by removing HTML tags and components such as
> JavaScript and CSS, leaving the extracted text to be stored in the index.
> By default, Nutch cannot select specific elements of an HTML page, such as
> certain tags, certain content, or some part of the page.
> An HTML page has a tree-like XML structure, with HTML tags as its branches
> and text as its nodes. These branches and nodes can be extracted using
> XPath. XPath allows us to select a particular branch or node of an XML
> document, and can therefore be used to extract certain information and treat
> it differently based on its content and the user's requirements. Furthermore,
> a web domain such as a news website usually uses the same HTML structure for
> the information on its web pages. Pages sharing that structure can be parsed
> with the same XPath query to retrieve the same content element.
> All of the XPath queries for selecting various content can be stored in an
> XPath configuration file.
> Nutch is meant to handle various web sources, and not all of the pages
> retrieved from those sources have the same HTML structure, so they have to
> be treated differently using the correct XPath configuration. The correct
> XPath configuration can be selected automatically by using a regex to match
> the URL of the web page against the valid URL pattern for that XPath
> configuration.
> This automatic mechanism allows Nutch users to process various web pages and
> keep only the information they want, making the index more accurate and its
> content more flexible.
> The components for this idea have been tested on Nutch 1.2 for selecting
> certain elements on various news websites for the purpose of document
> clustering. This includes a Configuration Editor application built using the
> NetBeans 6.9 Application Framework, though it still needs some debugging.
> http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
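
The mechanism described in the issue (choose an XPath configuration by matching the page URL against a regex, then run that configuration's XPath queries on the page) could be sketched roughly as below. This is only an illustration, not code from the attached plugin: the class `XPathConfigSketch`, the `SiteConfig` type, the field names, and the example URL pattern are all hypothetical, and the sketch assumes the page has already been turned into well-formed XHTML (real-world HTML would first need a lenient HTML parser). It uses only the standard `javax.xml` and `java.util.regex` APIs and does not touch any Nutch plugin interface.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: every name here is illustrative, not part of Nutch.
class XPathConfigSketch {

    // One "XPath configuration": a URL regex plus named XPath queries.
    static final class SiteConfig {
        final Pattern urlPattern;
        final Map<String, String> fieldQueries;

        SiteConfig(String urlRegex, Map<String, String> fieldQueries) {
            this.urlPattern = Pattern.compile(urlRegex);
            this.fieldQueries = fieldQueries;
        }
    }

    // Returns the first configuration whose regex matches the whole URL,
    // or null when no configuration applies.
    static SiteConfig selectConfig(Iterable<SiteConfig> configs, String url) {
        for (SiteConfig config : configs) {
            if (config.urlPattern.matcher(url).matches()) {
                return config;
            }
        }
        return null;
    }

    // Runs every query of the configuration and returns field name -> text.
    // Assumption: the input is well-formed XHTML without namespaces.
    static Map<String, String> extract(SiteConfig config, String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> query : config.fieldQueries.entrySet()) {
            // evaluate(String, Object) yields the string value of the first match.
            fields.put(query.getKey(), xpath.evaluate(query.getValue(), doc).trim());
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> queries = new LinkedHashMap<>();
        queries.put("title", "//h1[@class='headline']");
        queries.put("lead", "//div[@id='article']/p[1]");
        List<SiteConfig> configs =
                List.of(new SiteConfig("https?://news\\.example\\.com/.*", queries));

        String page = "<html><body><h1 class=\"headline\">Example headline</h1>"
                + "<div id=\"article\"><p>First paragraph.</p></div></body></html>";

        SiteConfig config = selectConfig(configs, "http://news.example.com/world/1.html");
        if (config != null) {
            System.out.println(extract(config, page));
        }
    }
}
```

In a real plugin the configurations would presumably be loaded from the XPath configuration file mentioned in the issue, and the extracted fields would be attached to the parse result rather than printed.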