Brian van den Broek wrote:
> Hi all,
>
> I have a Palm handheld, and use the excellent (and written in Python)
> Plucker <http://www.plkr.org/> to spider webpages and format the
> results for viewing on the Palm.
>
> One site I 'pluck' is the Daily Python URL
> <http://www.pythonware.com/daily/>. From the point of view of a daily
> custom 'newspaper' everything but the last day or two of URLs is so
> much cruft. (The cruft would be the total history of the last
> seven-ish days, the navigation links for www.pythonware.com, etc.)
>
> Today, I wrote a script to parse the Daily URL, and create a minimal
> local html page including nothing but the last n items, n links, or
> last n days' worth of links. (Which is employed is a user option.)
> Then, I pluck that, rather than the actual Daily URL site. Works
> great. :-) (If anyone on the list is a fellow Plucker user and would be
> interested in my script, I'm happy to share.)
>
> In anticipation of wanting to do the same thing to other sites, I've
> spent a bit of time abstracting it. I've made some real progress. But,
> before I finish up, I've a voice in the back of my head asking if
> maybe I'm re-inventing the wheel.
>
> To my shame, I've not spent very much time at all exploring available
> frameworks and modules for any domain, and almost none for web-related
> tasks. So, does anyone know of any modules or frameworks which would
> make the sort of task I am describing easier?
>
> The difficulty in making my routine general is that pretty much each
> site will need its own code for identifying what counts as a distinct
> item (such as a URL and its description in the Daily URL) and what
> counts as a distinct block of items (such as a day's worth of Daily URL
> items). I can't imagine there's a way around that, but if someone else
> has done much of the work in setting up the general structure to be
> tweaked for each site, that'd be good to know. (Doesn't feel like one
> that would be googleable.)

Beautiful Soup can help with parsing and accessing the web page. You
could certainly write your plucker on top of it.
http://www.crummy.com/software/BeautifulSoup/

Alternately ElementTidy might help. It can parse web pages and it has
limited XPath support. XPath might be a good language for expressing
your plucking rules.
http://effbot.org/zone/element-tidylib.htm

An ideal package would be one that parses real-world HTML and has full
XPath support, but I don't know of such a thing...maybe amara or lxml?

Kent
_______________________________________________
Tutor maillist - [email protected]
http://mail.python.org/mailman/listinfo/tutor
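
To make the "XPath as plucking rules" idea concrete, here is a minimal
sketch, not anyone's actual script. It assumes each site can be described
by two path expressions, one for a block (a day) and one for an item (a
link), which is the per-site tweak the question describes. It uses the
standard library's xml.etree.ElementTree as a stand-in for ElementTidy, so
it only copes with well-formed markup; real-world HTML would first need
tidying or a tolerant parser such as Beautiful Soup. The RULES dictionary,
the pluck() function, the class name 'day', and the sample page are all
made up for illustration, and the sample assumes the newest day comes first.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: one path that selects each block (a day),
# and one path, relative to a block, that selects each item (a link).
RULES = {
    "block": ".//div[@class='day']",
    "item": ".//a",
}

# Made-up, well-formed sample page; the newest day is assumed to come first.
SAMPLE = """<html><body>
<div class="nav"><a href="/">Home</a></div>
<div class="day"><a href="http://example.com/a">Item A</a></div>
<div class="day"><a href="http://example.com/b">Item B</a>
<a href="http://example.com/c">Item C</a></div>
<div class="day"><a href="http://example.com/d">Item D</a></div>
</body></html>"""


def pluck(page, rules, n_blocks):
    """Return a minimal HTML page with the links from the first n_blocks blocks."""
    root = ET.fromstring(page)  # raises ParseError on markup that is not well formed
    out = ET.Element("html")
    body = ET.SubElement(out, "body")
    # findall() returns matches in document order, so slicing keeps the first blocks.
    for block in root.findall(rules["block"])[:n_blocks]:
        for link in block.findall(rules["item"]):
            para = ET.SubElement(body, "p")
            anchor = ET.SubElement(para, "a", href=link.get("href", ""))
            anchor.text = "".join(link.itertext())
    return ET.tostring(out, encoding="unicode")


result = pluck(SAMPLE, RULES, n_blocks=2)
print(result)
```

Only the RULES dictionary would change from site to site in this layout;
the navigation links drop out because they match neither rule. Limiting by
item count rather than by day would mean slicing the collected links
instead of the blocks.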
