[jira] [Commented] (NUTCH-978) A Plugin for extracting certain element of a web page on html page parsing.

Kris (JIRA) Tue, 04 Oct 2016 10:13:32 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546005#comment-15546005
 ]


Kris commented on NUTCH-978:
----------------------------

Here is a solution that I am currently experimenting with.

> A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2012, mentor
>         Attachments: 
> [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
> app_guardian_ivory_coast_news_exmpl.png, 
> app_screenshoot_configuration_result.png, 
> app_screenshoot_configuration_result_anchor.png, 
> app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, 
> for_GSoc.zip, version_alpha2.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>
> Nutch uses the parse-html plugin to parse web pages. It processes the 
> contents of a web page by removing HTML tags and components such as 
> JavaScript and CSS, leaving the extracted text to be stored in the index. By 
> default, Nutch cannot select specific elements of an HTML page, such as 
> particular tags, particular content, or some part of the page.
> An HTML page has a tree-like XML structure, with HTML tags as its branches 
> and text as its nodes. These branches and nodes can be extracted using XPath. 
> XPath lets us select a particular branch or node of an XML document, and can 
> therefore be used to extract specific information and treat it differently 
> based on its content and the user's requirements. Furthermore, a web domain 
> such as a news website usually uses the same HTML structure for the 
> information on its pages. Pages sharing that structure can be parsed with the 
> same XPath queries to retrieve the same content elements. All of the XPath 
> queries for selecting the various content elements can be stored in an XPath 
> configuration file.
> Nutch is meant to handle a variety of web sources. Not all of the pages 
> retrieved from those sources share the same HTML structure, so they have to 
> be treated differently, each with the correct XPath configuration. The 
> correct XPath configuration can be selected automatically with a regex, by 
> matching the URL of the web page against the valid URL pattern for that 
> XPath configuration.
> This automatic mechanism allows Nutch users to process a variety of web pages 
> and keep only the information they want, making the index more accurate and 
> its content more flexible.
> The components for this idea have been tested on Nutch 1.2 for selecting 
> certain elements on various news websites for the purpose of document 
> clustering. This includes a Configuration Editor application built using the 
> NetBeans 6.9 Application Framework, though it still needs some debugging.
> http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
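
The mechanism described in the quoted issue (per-site XPath queries, with the right set chosen by matching the page URL against a regex) can be illustrated with a small sketch. This is not the code from the attachments, and it is written in Python for brevity rather than as a Nutch plugin, which would be Java. The URL patterns, field names, XPath queries and sample page are all hypothetical, and the sketch assumes the page is already well-formed XHTML; real-world HTML would first need a tolerant HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical configuration: each entry pairs a URL regex with named XPath
# queries. None of these patterns or field names come from NUTCH-978.
XPATH_CONFIGS = [
    {
        "url_pattern": re.compile(r"^https?://news\.example\.com/articles/.*$"),
        "fields": {
            "title": ".//h1[@class='headline']",
            "body": ".//div[@id='article-body']/p",
        },
    },
    {
        "url_pattern": re.compile(r"^https?://blog\.example\.org/.*$"),
        "fields": {"title": ".//h2", "body": ".//div[@class='post']/p"},
    },
]


def select_config(url):
    """Return the first configuration whose URL regex matches, else None."""
    for config in XPATH_CONFIGS:
        if config["url_pattern"].match(url):
            return config
    return None


def extract_fields(url, xhtml):
    """Apply the XPath queries chosen for `url` to a well-formed XHTML string."""
    config = select_config(url)
    if config is None:
        return {}  # no matching configuration: caller can fall back to default parsing
    root = ET.fromstring(xhtml)
    result = {}
    for name, query in config["fields"].items():
        # itertext() also collects text inside nested tags such as <b> or <a>
        texts = ["".join(node.itertext()).strip() for node in root.findall(query)]
        result[name] = " ".join(t for t in texts if t)
    return result


# Hypothetical sample page: the navigation bar is not selected by any query.
PAGE = """<html><body>
  <div class="nav">Home | World | Sport</div>
  <h1 class="headline">Sample headline</h1>
  <div id="article-body"><p>First <b>paragraph</b>.</p><p>Second paragraph.</p></div>
</body></html>"""

fields = extract_fields("http://news.example.com/articles/42", PAGE)
print(fields)
```

Only the headline and article paragraphs end up in `fields`; the navigation text is left out because no query selects it. A URL that matches no pattern yields an empty result, which is where a fallback to ordinary full-text parsing would fit.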



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
