[web2py] Re: scraping utils in controller

2013-05-04 Thread villas
> Actually, at the moment, the above will generate an error because apparently there is an unbalanced tag somewhere on the web2py.com page. Yes, scraping real-life websites is frustrating because so much HTML is broken. In the end I achieved reasonable success with BeautifulSoup. However, I …
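The poster's working solution was BeautifulSoup, which is a third-party library. As a minimal sketch of the same idea (a lenient parser that survives the unbalanced tags that break strict parsers), Python's standard-library HTMLParser can be used; the tag names and sample markup below are illustrative, not from the thread:

```python
from html.parser import HTMLParser

# Sketch: stdlib HTMLParser tolerates unbalanced tags, so it can still
# pull data out of broken markup where a strict XML parser would error.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs).get("href"))

# Note the unclosed <a> tags: strict parsing would reject this markup.
broken_html = '<div><a href="/docs">Docs<a href="/about">About</div>'
collector = LinkCollector()
collector.feed(broken_html)
print(collector.links)  # ['/docs', '/about']
```

BeautifulSoup offers a much richer API on top of the same tolerance, which is why it is the usual choice for real-world pages.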

[web2py] Re: scraping utils in controller

2013-05-04 Thread Leonel Câmara
I think you should put the scraping code in a module. It's a simple separation-of-concerns thing to me. Think about it: you may want this data scraped and parsed in other projects. Then in the controller/scheduler function you could import the module and ask it to give you the new rows of the t…
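As a sketch of that separation, here is a hypothetical `scraper` module whose one public function returns parsed rows, so a web2py controller or a scheduler task only imports and calls it. The module name, the `<li>`-based row format, and `new_rows` are all illustrative assumptions, not web2py APIs:

```python
# scraper.py -- hypothetical module holding all scraping logic, kept out
# of the controller so other projects (or a scheduler task) can reuse it.
from html.parser import HTMLParser

class RowParser(HTMLParser):
    """Collect the text of every <li> as one 'row' of scraped data."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li and data.strip():
            self.rows.append(data.strip())

def new_rows(html):
    """The one entry point a controller or scheduler task would call."""
    parser = RowParser()
    parser.feed(html)
    return parser.rows

# In web2py this file would live in the app's modules/ folder and the
# controller would do `from scraper import new_rows`; here we call it
# directly on sample markup.
rows = new_rows("<ul><li>alpha</li><li>beta</li></ul>")
print(rows)  # ['alpha', 'beta']
```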

[web2py] Re: scraping utils in controller

2013-05-03 Thread Anthony
I think that sounds reasonable.

On Friday, May 3, 2013 3:50:29 PM UTC-4, Timmie wrote:
> Thank you very much. This is helpful!
> Actually, what I want to do is
> 1) read *DATA* from an external page and insert it into the database.
> 2) use a cron/scheduler task to update the database table periodically…

[web2py] Re: scraping utils in controller

2013-05-03 Thread Timmie
Thank you very much. This is helpful! Actually, what I want to do is:
1) read *DATA* from an external page and insert it into the database;
2) use a cron/scheduler task to update the database table periodically if the source web page has changed.
Could you give me an idea how to get going with 1)? S…
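One common way to implement step 2's "only if the page changed" is to store a digest of the last fetched content and compare on each scheduled run. The sketch below is a generic assumption, not from the thread: `fetch_page` stands in for `gluon.tools.fetch`, and `last_hash` would normally be persisted in a small database table between scheduler runs.

```python
import hashlib

def fetch_page():
    # Stand-in for fetching the real source page over HTTP.
    return "<html><body>price: 42</body></html>"

def page_changed(html, last_hash):
    """Return (changed, new_hash) by comparing content digests."""
    new_hash = hashlib.sha256(html.encode("utf-8")).hexdigest()
    return new_hash != last_hash, new_hash

html = fetch_page()
first_changed, h = page_changed(html, last_hash=None)
print(first_changed)   # True: no stored hash on the first run
second_changed, h = page_changed(html, last_hash=h)
print(second_changed)  # False: content unchanged, skip the DB update
```

The scheduler task would only re-parse and insert rows when `changed` is true, saving database writes on every run where the source is unchanged.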

[web2py] Re: scraping utils in controller

2013-05-03 Thread Anthony
That same code works in a controller -- it was merely being demonstrated in a shell. Instead of urllib.urlopen, you can now use fetch (which also works on GAE):

    from gluon.tools import fetch
    page = TAG(fetch('http://www.web2py.com'))
    page.elements('div')  # gives you a list of all DIV elements in…
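Outside web2py, where `gluon` and its `TAG` helper are not available, the same idea (fetch a page, then list its DIV elements) can be sketched with the standard library alone. The sample markup stands in for a downloaded page (`urllib.request.urlopen` would do the actual fetch); this is an illustrative equivalent, not the web2py API:

```python
from html.parser import HTMLParser

class DivCollector(HTMLParser):
    """Record the attributes of every <div> start tag encountered."""
    def __init__(self):
        super().__init__()
        self.divs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.divs.append(dict(attrs))

# Literal string standing in for the fetched page body.
page = '<div id="main"><p>hi</p><div class="inner">x</div></div>'
collector = DivCollector()
collector.feed(page)
print(len(collector.divs))  # 2, analogous to len(page.elements('div'))
```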