> Actually, at the moment, the above will generate an error because
> apparently there is an unbalanced tag somewhere on the web2py.com page.
Yes, scraping real-life websites is frustrating because so much HTML is
broken. In the end I achieved reasonable success with BeautifulSoup.
However, I
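For what it's worth, a minimal sketch of why BeautifulSoup (the `bs4` package) copes with such broken markup — the sample HTML and tag counts here are made-up for illustration; BeautifulSoup's lenient parsing repairs the tree so queries still succeed:

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: the <p> is never closed.
broken = "<div><p>unclosed paragraph<div>inner</div></div>"
soup = BeautifulSoup(broken, "html.parser")

# BeautifulSoup repairs the tree, so queries still work:
divs = soup.find_all("div")
print(len(divs))  # 2 <div> tags found despite the unclosed <p>
```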
I think you should put the scraping code in a module. It's a simple
separation-of-concerns thing to me. Think about it: you may want this data
scraped and parsed in other projects.
Then, in the controller/scheduler function, you could import the module and ask
it to give you the new rows of the table.
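A sketch of that layout, under stated assumptions: `scraper.py`, `parse_rows`, and `fetch_rows` are illustrative names (not web2py APIs), and the `row:` line format is made up. The point is that the module knows nothing about the framework, so any caller can ask it for rows:

```python
# modules/scraper.py -- illustrative sketch of the suggested split.
# All scraping/parsing logic lives here; callers just ask for rows.

def parse_rows(html):
    """Parse the raw page into a list of dicts, one per table row.
    Real code would use BeautifulSoup here; this stub only shows the
    interface shape (lines like 'row: value' become dicts)."""
    rows = []
    for line in html.splitlines():
        if line.startswith("row:"):
            rows.append({"value": line[len("row:"):].strip()})
    return rows

def fetch_rows(fetch, url):
    """Download the page with the caller-supplied fetch function
    and return the parsed rows."""
    return parse_rows(fetch(url))

# In a web2py controller or scheduler task you might then write
# (hypothetical table name):
#   from scraper import fetch_rows
#   from gluon.tools import fetch
#   for row in fetch_rows(fetch, 'http://example.com/data'):
#       db.mytable.update_or_insert(**row)
```

Passing `fetch` in as an argument keeps the module importable and testable outside web2py.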
I think that sounds reasonable.
On Friday, May 3, 2013 3:50:29 PM UTC-4, Timmie wrote:
Thank you very much. This is helpful!
Actually, what I want to do is:
1) read *DATA* from an external page and insert it into the
database.
2) use a cron job or the scheduler to update the database table periodically if the
source web page has changed.
Could you give me an idea of how to get going with 1)?
S
That same code works in a controller -- it was merely being demonstrated in
a shell. Instead of urllib.urlopen, you can now use fetch (which also works
on GAE):
from gluon.tools import fetch
page = TAG(fetch('http://www.web2py.com'))
page.elements('div') # gives you a list of all DIV elements in the page
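If you want the same "list all DIVs" query outside web2py (no `gluon.TAG` available), the stdlib `html.parser` can serve as a stand-in; a minimal sketch, with a made-up input string:

```python
from html.parser import HTMLParser

class DivCollector(HTMLParser):
    """Collect the text of top-level <div> elements and count all <div> tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <div> elements we are currently inside
        self.count = 0   # total number of <div> tags seen
        self.divs = []   # text gathered per top-level <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.count += 1
            if self.depth == 0:
                self.divs.append("")  # start a new top-level div
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.divs[-1] += data  # text inside any div goes to the open top-level one

collector = DivCollector()
collector.feed("<body><div>one</div><div>two<div>three</div></div></body>")
print(collector.count)  # 3 <div> tags in total
print(collector.divs)   # ['one', 'twothree'] -- two top-level divs
```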