On 26 Aug, 23:22, "bruce" <[EMAIL PROTECTED]> wrote:
>
> ok, i can somehow live with this, i can accommodate it. but tell me, when
> the parse module/class for libxml2dom does its thing, why does it not go
> forward on the tree when it comes to a </html>, if there's more text in the
> string to process???
I imagine that libxml2, which actually does the parsing, stops doing its
work when it has successfully closed all open elements. Perhaps there's a
way of making it go on and potentially complain about trailing input.

> oh, also, regarding screen parsing/crawling, i've seen a number of sites
> that have discussed using a web testing app, like selinium, and driving a
> browser process, in order to really capture all the required data. any
> thoughts on the pros/cons of this kind of approach to scraping data...

Once upon a time I used the KPartPlugins to automate Konqueror, combining
them with a DOM implementation, qtxmldom, which let me read the contents of
Web pages from a real browser. Unfortunately, that technology doesn't work
with recent versions of KDE (or PyKDE), and attempts to use Mozilla via
PyXPCOM weren't successful. If you wanted to pursue this route, my advice
would be to ask the Mozilla people, particularly those who work with
PyXPCOM. An alternative might be to look into the state of bindings for the
WebKit browser technologies.

Paul
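
P.S. On the trailing content question: one workaround, if splitting the
input yourself is acceptable, is to cut the string at each closing </html>
and parse the chunks separately. A rough, untested sketch (assuming your
libxml2dom version offers parseString with an html flag, as its examples
show):

import libxml2dom

def parse_documents(s):
    # Split a string holding several concatenated HTML documents at each
    # closing </html> tag and parse each chunk on its own. Text after the
    # last </html> is simply left unparsed; a real version might also want
    # to handle </HTML> and surrounding whitespace.
    documents = []
    tag = "</html>"
    while 1:
        index = s.find(tag)
        if index == -1:
            break
        chunk, s = s[:index + len(tag)], s[index + len(tag):]
        # html=1 asks libxml2dom for HTML parsing; adjust if your version
        # exposes this differently.
        documents.append(libxml2dom.parseString(chunk, html=1))
    return documents

Each chunk then behaves like an ordinary document, and you can decide for
yourself what to do with any leftover text.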