This response may be a little late, but there is a C# component, HTML Document [1], that does exactly this. It provides a DOM-like structure and can return an XmlDocument representing the HTML. I've been using it for a while and it works quite well. The developers are very responsive to any problems you have.

My only complaint is that in difficult situations it tends to drop information. For example, if there is a tag like
<a href=doc.html title= forgot the quotes onclick="or left one off>

it takes a best guess and chops some info (maybe the onclick attribute value). Aside from that issue, it works quite well, and if you run into a problem, they usually get a new version out within a day.

Erick

[1] http://www.devcomponents.com/htmldoc/download.html

----- Original Message -----
From: "Craig Andera" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, May 31, 2002 6:37 AM
Subject: Re: [ADVANCED-DOTNET] Web Page Content Parsing

> > Are there any better approaches to meet the requirement - may
> > be something like re-usable HTML parsers or Web page content
> > parsers already available or pre-defined classes/libraries in .Net?
>
> Well, there's nothing built-in, so you have a few choices. This is the
> order I would do them in.
>
> 1) If you can guarantee that the resulting HTML is XHTML compliant, you
> can use an XML parser to process it.
> 2) You can try to find some reusable code that someone else already
> wrote. I believe someone was going to try to do this as part of Genghis:
> http://www.sellsbrothers.com/genghis.
> 3) You can use COM interop to muck around with getting Internet Explorer
> to do the parsing for you.
>
> You can read messages from the Advanced DOTNET archive, unsubscribe from Advanced DOTNET, or
> subscribe to other DevelopMentor lists at http://discuss.develop.com.
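
For reference, option 1 in the quoted message (treating well-formed XHTML as XML) could look roughly like the sketch below. It uses only the standard System.Xml classes, not the HTML Document component mentioned above. The input string and the class name XhtmlLinks are made up for illustration, and the sketch assumes the markup is well-formed and declares no default XML namespace.

```csharp
using System;
using System.Xml;

class XhtmlLinks
{
    static void Main()
    {
        // Hypothetical input: well-formed XHTML with quoted attributes,
        // no DOCTYPE and no default namespace.
        string xhtml =
            "<html><body>" +
            "<a href=\"doc.html\" title=\"Docs\">Docs</a>" +
            "<a href=\"faq.html\">FAQ</a>" +
            "</body></html>";

        XmlDocument doc = new XmlDocument();
        // LoadXml throws XmlException if the markup is not well-formed XML.
        doc.LoadXml(xhtml);

        // Select every anchor element that carries an href attribute.
        foreach (XmlElement a in doc.SelectNodes("//a[@href]"))
        {
            Console.WriteLine("{0} -> {1}", a.InnerText, a.GetAttribute("href"));
        }
    }
}
```

A tag like the broken <a href=doc.html ...> example earlier in this message would make LoadXml throw, which is why this route only works when XHTML compliance can be guaranteed. If the page declares the XHTML namespace on the html element, the XPath query would also need an XmlNamespaceManager and a prefixed element name.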
