Problem Statement: I need to access a web page over the internet and parse the complete HTML content returned by the web site.
Current Approach: Does .NET provide any reusable components or classes for achieving this? I have currently used the System.Net classes, namely HttpWebRequest, HttpWebResponse, and WebClient, to retrieve the HTML content from the URL accessed. The content returned is stored in a stream object and then converted to a string so that each HTML element and attribute can be parsed. The parsing also identifies all links on the page (and converts them to absolute paths), all images (which are downloaded), any JavaScript code, and any includes such as ".js", ".jpeg", and ".css" files. The end result is a replica of the web page accessed. All of this has been done programmatically.

Drawback of the approach: It is very tedious, since the string containing the HTML must be parsed and split into elements, attributes, links, includes, images, etc. It is also not effective in terms of performance, since the logic involves crawling the content and downloading resources.

My Question: Are there better approaches to meet this requirement - perhaps reusable HTML parsers or web page content parsers, or predefined classes/libraries already available in .NET?

You can read messages from the Advanced DOTNET archive, unsubscribe from Advanced DOTNET, or subscribe to other DevelopMentor lists at http://discuss.develop.com.
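For reference, a minimal sketch of the approach described above - fetching the page with HttpWebRequest and resolving relative links to absolute paths. All names here are illustrative, not the poster's actual code; the regex-based href extraction is exactly the kind of hand-rolled string parsing the post calls tedious, and a real crawler would need to handle src attributes, unquoted values, and malformed HTML as well. The Uri class does the relative-to-absolute conversion. DownloadHtml is defined but not invoked, so the sample runs offline.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class PageFetchSketch
{
    // Fetch the raw HTML for a URL: the HttpWebRequest/HttpWebResponse
    // pattern described in the post. Not called in Main (needs a network).
    static string DownloadHtml(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Stand-in for content returned by DownloadHtml(...).
        string html = "<a href=\"about.html\">About</a> <img src=\"x\">"
                    + "<a href=\"/img/logo.png\">Logo</a>";
        var baseUri = new Uri("http://example.com/site/index.html");

        // Crude href extraction; a proper HTML parser is what the post asks for.
        foreach (Match m in Regex.Matches(html, "href=\"([^\"]+)\""))
        {
            // Uri's (baseUri, relative) constructor resolves relative paths.
            var absolute = new Uri(baseUri, m.Groups[1].Value);
            Console.WriteLine(absolute);
        }
        // Prints:
        // http://example.com/site/about.html
        // http://example.com/img/logo.png
    }
}
```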
