[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2012-03-19 Thread Ammar Shadiq (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ammar Shadiq updated NUTCH-978:
---

Attachment: version_alpha2.zip

Uploaded the latest version; works on 1.2.

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: gsoc2011, mentor
 Fix For: nutchgora

 Attachments: 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 app_guardian_ivory_coast_news_exmpl.png, 
 app_screenshoot_configuration_result.png, 
 app_screenshoot_configuration_result_anchor.png, 
 app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, 
 for_GSoc.zip, version_alpha2.zip

   Original Estimate: 1,680h
  Remaining Estimate: 1,680h

 Nutch uses the parse-html plugin to parse web pages. It processes the 
 contents of a web page by removing HTML tags and components such as 
 JavaScript and CSS, leaving the extracted text to be stored in the index. By 
 default, Nutch cannot select specific elements of an HTML page, such as 
 certain tags, certain content, or a particular part of the page.
 An HTML page has a tree-like XML structure, with HTML tags as its branches 
 and text as its nodes. These branches and nodes can be extracted using XPath. 
 XPath allows us to select a particular branch or node of an XML document, so 
 it can be used to extract specific information and treat it differently based 
 on its content and the user's requirements. Furthermore, the pages of a 
 single web domain, such as a news website, usually share the same HTML 
 structure. Pages with the same structure can be parsed with the same XPath 
 queries to retrieve the same content elements. All of the XPath queries for 
 selecting the various content elements can be stored in an XPath 
 configuration file.
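
As a rough illustration of the extraction idea described above (not the plugin's actual code, which targets Nutch and is written in Java), the sketch below applies a small set of XPath queries to a page. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; the sample page, the field names, and the `extract` helper are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. Real-world HTML is often not
# well-formed and would need a tolerant parser before XPath can be applied.
PAGE = """<html>
  <body>
    <div id="nav"><a href="/">Home</a></div>
    <div class="article">
      <h1>Sample headline</h1>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </body>
</html>"""

# Hypothetical "XPath configuration": one query per field to extract.
QUERIES = {
    "title": ".//div[@class='article']/h1",
    "body": ".//div[@class='article']/p",
}


def extract(page, queries):
    """Return {field: [text, ...]} with the text of every node each query selects."""
    root = ET.fromstring(page)
    result = {}
    for field, xpath in queries.items():
        # itertext() gathers the text of the node and all of its descendants.
        result[field] = [
            "".join(node.itertext()).strip() for node in root.findall(xpath)
        ]
    return result


fields = extract(PAGE, QUERIES)
print(fields["title"])  # ['Sample headline']
print(fields["body"])   # ['First paragraph.', 'Second paragraph.']
```

Only the article block is selected here; the navigation link is left out because no query matches it, which is the kind of selective indexing the description asks for.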
 Nutch is meant to crawl many different web sources, and not all pages 
 retrieved from those sources share the same HTML structure, so each must be 
 treated differently using the correct XPath configuration. The correct XPath 
 configuration can be selected automatically by matching the URL of the web 
 page against a regex describing the valid URL pattern for that 
 configuration.
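
The URL-based selection described above could look roughly like the following sketch. The URL patterns, the example hosts, the queries, and the `select_config` helper are hypothetical and only illustrate the matching step; they are not taken from the attached plugin.

```python
import re

# Hypothetical configuration table: each entry pairs a URL regex with the
# XPath queries to apply to pages whose URL matches it.
CONFIGS = [
    (
        re.compile(r"^https?://news\.example\.com/\d{4}/.+"),
        {"title": ".//h1", "body": ".//div[@class='article']/p"},
    ),
    (
        re.compile(r"^https?://blog\.example\.org/.+"),
        {"title": ".//h2", "body": ".//div[@class='post']/p"},
    ),
]


def select_config(url):
    """Return the queries of the first configuration whose pattern matches url."""
    for pattern, queries in CONFIGS:
        if pattern.match(url):
            return queries
    # No configuration applies; a caller could fall back to default parsing.
    return None


print(select_config("http://news.example.com/2011/04/story.html"))
print(select_config("http://other.example.net/page.html"))  # None
```

Taking the first match keeps the behaviour predictable when patterns overlap; ordering the table from most to least specific is one possible convention.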
 This automatic mechanism allows Nutch users to process a variety of web 
 pages and extract only the information they want, making the index more 
 accurate and its content more flexible.
 The components for this idea have been tested on Nutch 1.2 for selecting 
 certain elements on various news websites for the purpose of document 
 clustering. This includes a Configuration Editor application built using the 
 NetBeans 6.9 Application Framework, though it still needs some debugging.
 http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2012-02-19 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-978:
---

Attachment: for_GSoc.zip

In its present form this is quite literally all over the place and is merely 
for safekeeping.

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: gsoc2011, mentor
 Fix For: nutchgora

 Attachments: 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 app_guardian_ivory_coast_news_exmpl.png, 
 app_screenshoot_configuration_result.png, 
 app_screenshoot_configuration_result_anchor.png, 
 app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, 
 for_GSoc.zip

   Original Estimate: 1,680h
  Remaining Estimate: 1,680h





[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2011-04-08 Thread Ammar Shadiq (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ammar Shadiq updated NUTCH-978:
---

Attachment: Screenshot.png

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: gsoc2011, mentor
 Fix For: 2.0

 Attachments: Screenshot.png, 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 app_screenshoot_configuration_result.png, 
 app_screenshoot_configuration_result_anchor.png, 
 app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png

   Original Estimate: 1680h
  Remaining Estimate: 1680h



[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2011-04-08 Thread Ammar Shadiq (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ammar Shadiq updated NUTCH-978:
---

Attachment: (was: Screenshot.png)

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: gsoc2011, mentor
 Fix For: 2.0

 Attachments: 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 app_guardian_ivory_coast_news_exmpl.png, 
 app_screenshoot_configuration_result.png, 
 app_screenshoot_configuration_result_anchor.png, 
 app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png

   Original Estimate: 1680h
  Remaining Estimate: 1680h



[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2011-04-07 Thread Ammar Shadiq (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ammar Shadiq updated NUTCH-978:
---

Attachment: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf

Proposal Updated

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: gsoc2011, mentor
 Fix For: 2.0

 Attachments: 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 app_screenshoot_configuration_result.png, 
 app_screenshoot_configuration_result_anchor.png, 
 app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png

   Original Estimate: 1680h
  Remaining Estimate: 1680h



[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2011-04-07 Thread Ammar Shadiq (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ammar Shadiq updated NUTCH-978:
---

Attachment: (was: 
[Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf)

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: gsoc2011, mentor
 Fix For: 2.0

 Attachments: 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 app_screenshoot_configuration_result.png, 
 app_screenshoot_configuration_result_anchor.png, 
 app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png

   Original Estimate: 1680h
  Remaining Estimate: 1680h



[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2011-04-06 Thread Ammar Shadiq (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ammar Shadiq updated NUTCH-978:
---

Attachment: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf

Proposal for Google Summer of Code 2011
http://www.google-melange.com/gsoc/homepage/google/gsoc2011

haven't found any mentor yet :-(

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
  Labels: gsoc
 Fix For: 2.0

 Attachments: 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf

   Original Estimate: 1680h
  Remaining Estimate: 1680h



[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2011-04-06 Thread Ammar Shadiq (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ammar Shadiq updated NUTCH-978:
---

Priority: Minor  (was: Major)

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Priority: Minor
  Labels: gsoc
 Fix For: 2.0

 Attachments: 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf

   Original Estimate: 1680h
  Remaining Estimate: 1680h



[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2011-04-06 Thread Ammar Shadiq (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ammar Shadiq updated NUTCH-978:
---

Attachment: app_screenshoot_url_regex_filter.png
app_screenshoot_source_view.png
app_screenshoot_configuration_result_anchor.png
app_screenshoot_configuration_result.png

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Priority: Minor
  Labels: gsoc2011, mentor
 Fix For: 2.0

 Attachments: 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 app_screenshoot_configuration_result.png, 
 app_screenshoot_configuration_result_anchor.png, 
 app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png

   Original Estimate: 1680h
  Remaining Estimate: 1680h
