Problem Statement:
My requirement is to access a web page over the internet and parse the
complete HTML content returned by the web site.

Current Approach:
Does .Net provide any re-usable components or classes for achieving this?
I have currently used the System.Net classes provided by .Net, namely
HttpWebRequest, HttpWebResponse and WebClient, to retrieve the HTML
content from the URL accessed. The content returned is stored in a stream
object and then converted to a string so that each HTML element and
attribute can be parsed. The parsing also identifies all links on the page
(and converts them to absolute paths), all images (which are downloaded),
any javascript code, and any includes such as ".js", ".jpeg" and ".css"
files. The end result is a replica of the web page accessed. All of this
has been done programmatically.
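For reference, the retrieval step described above can be sketched roughly as
follows (the URL is a placeholder, and error handling is omitted for
brevity):

```csharp
using System;
using System.IO;
using System.Net;

class PageFetcher
{
    static void Main()
    {
        // Placeholder URL; substitute the page you actually need to fetch.
        string url = "http://example.com/";

        // Issue the request and read the response stream into a string,
        // as described in the approach above.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Console.WriteLine("Retrieved {0} characters of HTML.", html.Length);
        }
    }
}
```

The string in `html` is then what the hand-written parsing logic operates on.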


Drawback of the approach:
It is a very tedious approach, since the string containing the HTML code
needs to be parsed and stripped into elements, attributes, links, includes,
images, etc.
It is also inefficient in terms of performance, since the logic involves
crawling the content and downloading the referenced files.
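One small part of this hand-written work, converting relative links to
absolute paths, can at least be delegated to the framework's System.Uri
class rather than done with string manipulation. A minimal sketch, with
placeholder URLs:

```csharp
using System;

class LinkResolver
{
    static void Main()
    {
        // Base address of the fetched page (placeholder).
        Uri baseUri = new Uri("http://example.com/articles/index.html");

        // A relative href as it might appear in the parsed HTML.
        string relativeHref = "../images/logo.jpeg";

        // The Uri(Uri, string) constructor resolves the relative
        // reference against the base, avoiding hand-rolled path logic.
        Uri absolute = new Uri(baseUri, relativeHref);
        Console.WriteLine(absolute); // http://example.com/images/logo.jpeg
    }
}
```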

My Question:
Are there any better approaches to meeting this requirement, such as
re-usable HTML parsers or web page content parsers that are already
available as pre-defined classes/libraries in .Net?

You can read messages from the Advanced DOTNET archive, unsubscribe from
Advanced DOTNET, or subscribe to other DevelopMentor lists at
http://discuss.develop.com.
