On Mar 7, 9:56 pm, "bruce" <bedoug...@earthlink.net> wrote:
> ....
>
> and this solution will somehow allow a user to create a web parsing/scraping
> app for parsing links, and javascript from a web page?

not just parsing the links and the "static" javascript, but:

* actually executing the javascript, giving the "page" a chance to
  actually _look_ like it would if it was being viewed in a "real" web
  browser.

  so any XMLHttpRequests will _actually_ get executed, and _actually_
  result in the content of the web page being _properly_ modified.

  so, e.g. instead of seeing a "Loader" page on gmail you would
  _actually_ see the user's email and the adverts (assuming you went
  to the trouble of putting in the username/password) because the AJAX
  would _actually_ get executed by the WebKit engine, and the DOM
  model accessed thereafter.

* giving the user the opportunity to call DOM methods such as
  getElementsByTagName and the opportunity to access properties such
  as document.anchors.

  in webkit-glib "gdom" bindings, that would be:

  * anchor_list = gdom_document_get_elements_by_tag_name(doc, "a");

  or

  * g_object_get(doc, "anchors", &anchor_list, NULL);

  which in pywebkitgtk (thanks to python-pygobject auto-generation of
  python bindings from gobject bindings) translates into:

  * doc.get_elements_by_tag_name("a")

  or

  * doc.props.anchors

  which in pyjamas-desktop, a high-level abstraction on top of _that_,
  turns into:

  * from pyjamas import DOM
    anchor_list = DOM.getElementsByTagName(doc, "a")

  or

  * from pyjamas import DOM
    anchor_list = DOM.getAttribute(doc, "anchors")

answer: yes.

l.

> -----Original Message-----
> From: python-list-bounces+bedouglas=earthlink....@python.org
> [mailto:python-list-bounces+bedouglas=earthlink....@python.org]On Behalf
> Of lkcl
> Sent: Saturday, March 07, 2009 2:34 AM
> To: python-l...@python.org
> Subject: Re: Parsing/Crawler Questions - solution
>
> On Mar 7, 12:19 am, rounderwe...@gmail.com wrote:
> > So, it sounds like your update means that it is related to a specific
> > url.
>
> > I'm curious about this issue myself. I've often wondered how one
> > could properly crawl an AJAX-ish site when you're not sure how quickly
> > the data will be returned after the page has been.
>
> you want to look at the webkit engine - no not the graphical browser
> - the ParseTree example - and combine it with pywebkitgtk - no not the
> "original" version, the one which has DOM-manipulation bindings
> through webkit-glib.
>
> the webkit parse tree example, despite being based on the GTK
> "port" as they like to call it in webkit (which just means that it
> links with GTK not QT4 or wxWidgets), is a console-based application.
>
> in other words, despite it being GTK, it still does NOT output
> graphical crap to the screen, yet it still *executes* the javascript
> on the page.
>
> dummy functions for "mouse", "keyboard", "console errors" are given as
> examples and are left as an exercise for the application writer to
> fill-in-the-blanks.
>
> combining this parse tree example with pywebkitgtk (see
> demobrowser.py) would provide a means by which web pages can be
> executed AT THE CONSOLE NOT AS A GUI APP, then, thanks to the glib /
> gobject bindings, a python app will be able to walk the DOM tree as
> expected.
>
> i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module
> for someone, on the pyjamas-dev mailing list.
>
> http://github.com/lkcl/pyjamas-desktop/tree/8ed365b89efe5d1d3451c3e3c...
> dd014540
>
> so, actually, you may be better off starting from pyjamas-desktop and
> then cutting out the "fire up the GTK window" bit, from pyjd.py.
>
> pyjd.py is based on pywebkitgtk's demobrowser.py
>
> the alternative to webkit is to use python-hulahop - it will do the
> same thing, but just using python bindings to gecko instead of
> python-bindings-to-glib-bindings-to-webkit.
>
> l.
> --
> http://mail.python.org/mailman/listinfo/python-list
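
for reference, a minimal sketch of what the getElementsByTagName call named in the reply looks like, using only the python standard library's xml.dom.minidom. this is an assumption-laden stand-in and not the webkit-glib / pywebkitgtk route described in this thread: minidom parses a static, well-formed document and does NOT execute any javascript, so it only illustrates the DOM method itself. the PAGE snippet and the anchor_hrefs helper are hypothetical names invented for the example.

```python
from xml.dom import minidom

# Hypothetical, well-formed XHTML snippet invented for illustration.
# A real AJAX page would need a browser engine to run its javascript
# first; minidom only sees the markup exactly as given here.
PAGE = """<html><body>
<a href="http://example.com/one">one</a>
<p>text <a href="http://example.com/two">two</a></p>
</body></html>"""


def anchor_hrefs(markup):
    """Return the href of every <a> element, in document order."""
    doc = minidom.parseString(markup)
    # Same W3C DOM method name as in the reply; here it is minidom's
    # implementation, not the gobject-generated binding.
    anchors = doc.getElementsByTagName("a")
    return [a.getAttribute("href") for a in anchors]


if __name__ == "__main__":
    for href in anchor_hrefs(PAGE):
        print(href)
```

with an engine-backed document object, the walk over the returned node list would presumably look much the same, the difference being that the nodes would reflect the page after its scripts had run.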