Re: HTML parsing/scraping python
Take a look at SW Explorer Automation (http://home.comcast.net/~furmana/SWIEAutomation.htm) (SWEA). SWEA creates an object model (automation interface) for any web application running in Internet Explorer. It supports all IE functionality: frames, JavaScript, dialogs, downloads. The runtime can also work under non-interactive user accounts (ASP.NET or service applications) on Windows 2000/2003 Server or Windows XP.

-- http://mail.python.org/mailman/listinfo/python-list
Re: HTML parsing/scraping python
Sanjay Arora [EMAIL PROTECTED] writes:

We are looking to select the language and toolset most suitable for a project that requires getting data from several web sites in real time: HTML parsing/scraping. It would require full emulation of the browser, including handling cookies and automated logins following multiple web-link paths. Multiple threading would be a plus but not a requirement. [...]

What's the application?

John
Re: HTML parsing/scraping python
John J. Lee wrote:

Sanjay Arora [EMAIL PROTECTED] writes:

We are looking to select the language and toolset most suitable for a project that requires getting data from several web sites in real time: HTML parsing/scraping. It would require full emulation of the browser, including handling cookies and automated logins following multiple web-link paths. Multiple threading would be a plus but not a requirement. [...]

What's the application?

John

I'll do your googling for you ;-p (The topic guide needs to be updated for mechanize, PAMIE, Beautiful Soup, ClientTable, pullparser, etc.)

http://www.python.org/topics/web/HTML.html
http://blog.ianbicking.org/best-of-the-web-app-test-frameworks.html
Re: HTML parsing/scraping python
The standard library module for fetching HTML is urllib2. The best module for scraping the HTML is BeautifulSoup. There is a project called mechanize, built by John Lee on top of urllib2 and other standard modules. It will emulate a browser's behaviour, including history, cookies, basic authentication, etc. There are several modules for automated form filling, FormEncode being one.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
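The fetch-then-parse split Fuzzyman describes can be sketched with the standard library alone. This is a hedged sketch, not the poster's code: Python 2's urllib2 machinery lives in urllib.request on modern Python, and the stdlib html.parser stands in for BeautifulSoup here so the example has no third-party dependency. The sample page is invented for illustration.

```python
from html.parser import HTMLParser  # stdlib parser; BeautifulSoup is more forgiving
from urllib.request import urlopen  # Python 3 home of the urllib2 machinery

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen in the document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html_text):
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links

# Fetching would look like this (not run here, to avoid a network dependency):
#   html_text = urlopen("http://example.com/").read().decode("utf-8", "replace")
page = '<html><body><a href="/docs">docs</a> <a href="/faq">faq</a></body></html>'
print(extract_links(page))  # ['/docs', '/faq']
```

In practice BeautifulSoup replaces the HTMLParser subclass and tolerates far messier markup; the urlopen/parse division of labour stays the same.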
Re: HTML parsing/scraping python
Fuzzyman [EMAIL PROTECTED] writes:

The standard library module for fetching HTML is urllib2.

Does urllib2 replace everything in urllib? I thought there was some urllib functionality that urllib2 didn't provide.

There is a project called mechanize, built by John Lee on top of urllib2 and other standard modules. It will emulate a browser's behaviour, including history, cookies, basic authentication, etc.

urllib2 handles cookies and authentication; I use those features daily. I'm not sure history would apply, unless you're also handling JavaScript. Is there some other way to ask the browser to go back in history?

mike
--
Mike Meyer [EMAIL PROTECTED] http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant; email for more information.
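Mike's point, that cookies and basic auth need no extra toolkit, can be sketched as follows. Under Python 3 the same pieces live in urllib.request and http.cookiejar (cookielib in the Python 2 of this thread); the site and credentials are placeholders, and no request is actually sent.

```python
from http.cookiejar import CookieJar  # cookielib in Python 2
from urllib.request import (HTTPBasicAuthHandler, HTTPCookieProcessor,
                            HTTPPasswordMgrWithDefaultRealm, build_opener)

def make_session_opener(base_url, username, password):
    """Build an opener that keeps cookies across requests and answers
    HTTP basic-auth challenges -- no mechanize required."""
    jar = CookieJar()
    password_mgr = HTTPPasswordMgrWithDefaultRealm()
    # None = apply these credentials to any realm the server names
    password_mgr.add_password(None, base_url, username, password)
    opener = build_opener(HTTPCookieProcessor(jar),
                          HTTPBasicAuthHandler(password_mgr))
    return opener, jar

# Hypothetical site and credentials, for illustration only:
opener, jar = make_session_opener("http://example.com/", "alice", "secret")
# opener.open("http://example.com/login") would now send and store cookies.
print(len(jar))  # 0 -- no requests made yet
```

What mechanize adds on top of this is the browser-like state Mike mentions: form handling, link following, and a navigable history.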
HTML parsing/scraping python
We are looking to select the language and toolset most suitable for a project that requires getting data from several web sites in real time: HTML parsing/scraping. It would require full emulation of the browser, including handling cookies and automated logins following multiple web-link paths. Multiple threading would be a plus but not a requirement.

Some solutions were suggested:

Perl: LWP::Simple, WWW::Mechanize, HTML::Parser
Curl: libcurl

Can you suggest solutions for Python? Pros and cons of using Perl vs. Python? Why Python? Pointers to various other tools and their comparisons with the Python solutions will be most appreciated. Anyone who is knowledgeable about the application subject, please do share your knowledge to help us do this right.

With best regards,
Sanjay.
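Since the poster flags multithreading as a nice-to-have: the stdlib threading module can fan fetches out over worker threads with no extra toolkit. In this sketch the fetch step is a stub (a real version would call urllib2/urlopen per URL) so the example stays self-contained and off the network.

```python
import queue
import threading

def scrape(url):
    """Stand-in for a real fetch-and-parse step (urllib2 + a parser in practice)."""
    return (url, len(url))  # pretend result: the URL and a fake "size"

def scrape_all(urls, num_workers=4):
    """Fan a list of URLs out over a pool of threads; collect all results."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()  # guard the shared results list

    for url in urls:
        tasks.put(url)

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left; let this thread exit
            result = scrape(url)
            with lock:
                results.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

urls = ["http://example.com/a", "http://example.com/bb"]
print(sorted(scrape_all(urls)))
```

Because the workers are I/O-bound in the real case, plain threads are a reasonable fit despite the interpreter lock; results come back in completion order, hence the sort.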
Re: HTML parsing/scraping python
Sanjay Arora [EMAIL PROTECTED] writes:

We are looking to select the language and toolset most suitable for a project that requires getting data from several web sites in real time: HTML parsing/scraping. It would require full emulation of the browser, including handling cookies and automated logins following multiple web-link paths. Multiple threading would be a plus but not a requirement.

Believe it or not, everything you ask for can be done by Python out of the box. But there are limitations. For one, the HTML parsing module that comes with Python doesn't handle invalid HTML very well. Thanks to Netscape, invalid HTML is the rule rather than the exception on the web, so you probably want to use a third-party module for that. I use BeautifulSoup, which handles XML and HTML, has a *lovely* API (going from BeautifulSoup to DOM is always a major disappointment), and works well with broken X/HTML. That's sufficient for my needs, but I haven't been asked to do a lot of automated form filling, so the facilities in the standard library work for me. There are third-party tools to help with that; I'm sure someone will suggest them.

Can you suggest solutions for Python? Pros and cons of using Perl vs. Python? Why Python?

Because it's beautiful. Seriously, Python code is very readable, by design. Of course, some of the features that make that happen drive some people crazy. If you're one of them, then Python isn't the language for you.

mike
--
Mike Meyer [EMAIL PROTECTED] http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant; email for more information.
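Mike's complaint about invalid HTML can be illustrated with the stdlib parser: it survives tag soup without crashing, but it does not rebuild a clean tree for you, which is exactly the repair work BeautifulSoup adds. A minimal sketch, using the modern html.parser (successor to the parsing modules of this era), with an invented tag-soup sample:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Accumulate the visible text of a document, tag soup or not."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only runs between tags
        if data.strip():
            self.chunks.append(data.strip())

def visible_text(html_text):
    parser = TextCollector()
    parser.feed(html_text)
    return parser.chunks

# Unclosed <p> and <b> tags -- the rule rather than the exception on the web:
soup = "<p>first para<p>second <b>bold text"
print(visible_text(soup))  # ['first para', 'second', 'bold text']
```

The parser emits events for what it sees and never raises here, but nothing closes the dangling tags; a soup-fixing library is what turns this into a well-formed tree you can navigate.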