"Adam D. Ruppe" <destructiona...@gmail.com> wrote in message news:nlccexskkftzaapfd...@dfeed.kimsufi.thecybershadow.net...
> On Thursday, 15 December 2011 at 09:55:22 UTC, breezes wrote:
>> Is there a class that can fetch a web page from the internet? And is
>> std.xml the right module for parsing it into a DOM tree?
>
> You might want to use my dom.d
>
> https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff
>
> Grab dom.d, characterencodings.d, and curl.d.
>
> Here's an example program:
>
> ====
> import arsd.dom;
> import arsd.curl;
>
> import std.stdio;
>
> void main() {
>     auto document = new Document();
>     document.parseGarbage(curl("http://digitalmars.com/"));
>
>     writeln(document.querySelector("p"));
> }
> ====
>
> Compile like this:
>
> dmd yourfile.d dom.d characterencodings.d curl.d
>
> You'll need the curl C library from an outside source. If you're
> on Linux, it is probably already installed. If you're on Windows,
> check the Internet.
>
> // this downloads a file from the web and returns a string
> curl(site url);
>
> // this builds a DOM tree out of html. It's called parseGarbage because
> // it tries to figure out really bad html - so it works on a lot of web
> // sites.
> document.parseGarbage(string);
>
> // My dom.d includes a lot of functions you might know from
> // javascript, like getElementById, getElementsByTagName, and the
> // get-element-by-CSS-selector functions.
> document.querySelector("p") // get the first paragraph
>
> And then, finally, the writeln puts out the html of an element.
Yup, I can confirm Adam's tools are great for this. std.xml, by contrast, is known to have problems and is currently being rewritten, so I wouldn't build on it for now.
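For anyone following along, here's a slightly longer sketch built on the same dom.d/curl.d combination from Adam's post, pulling all the links out of a page instead of just the first paragraph. Treat the element methods I use here (querySelectorAll, getAttribute, innerText) as my assumptions about dom.d's JavaScript-style API - check dom.d itself if they've moved:

====
import arsd.dom;
import arsd.curl;

import std.stdio;

void main() {
    // fetch the page as a string; curl() comes from Adam's curl.d
    auto document = new Document();
    document.parseGarbage(curl("http://digitalmars.com/"));

    // walk every <a> element; querySelectorAll and getAttribute are
    // assumed to behave like their JavaScript namesakes
    foreach(link; document.querySelectorAll("a")) {
        writeln(link.getAttribute("href"), " -> ", link.innerText);
    }
}
====

Compiles the same way as Adam's example: dmd yourfile.d dom.d characterencodings.d curl.d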