On Wednesday, January 28, 2015 at 8:36:59 AM UTC-8, peter.n...@gmail.com wrote:
> I am totally new to Python and please accept my apologies upfront for 
> potential newbie errors. I am trying to parse a 'simple' web page: 
> http://flow.gassco.no/
> 
> When opening the page first time in my browser I need to confirm T&C with an 
> accept button. After accepting T&C I would like to scrape some data from that 
> follow up page. It appears that when opening in a browser directly 
> http://flow.gassco.no/acceptDisclaimer I would get around that T&C.
> But not when I open the URL via BeautifulSoup.
> 
> My parsing/scraping tool is implemented in bs, but I fail to parse the 
> content as I am not getting around T&C. When printing "response.text" from 
> BS, I get below code. How do I get around this form for accepting terms & 
> conditions so that I can parse/scrape data from that page?
> 
> Here is what I am doing:
> 
> #!/usr/bin/env python 
> import requests 
> import bs4 
> index_url='http://flow.gassco.no/acceptDisclaimer'
> 
> def get_video_page_urls():
>     response = requests.get(index_url)
>     soup = bs4.BeautifulSoup(response.text)
>     return soup
> print(get_video_page_urls())
> 
> ++++
> 
> PRINTOUT from response.text:
> 
>    <form action="acceptDisclaimer" method="get">
>      <input class="accept" type="submit" value="Accept"/>
>      <input class="decline" name="decline" onclick="window.location ='http://www.gassco.no'" type="button" value="Decline"/>
>      </form></div></div></div></div></div>
> 
>     <script type="text/javascript">
>     var _gaq = _gaq || [];
>     _gaq.push(['_setAccount', 'UA-30727768-1']);
>     _gaq.push(['_trackPageview']);
> 
>     (function() {
>         var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
>         ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
>         var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
>     })();
> 
> </script>

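Try clearing your browser cookies and then reopening the page; it should send 
you back to the T&C screen. That suggests the site uses a cookie to remember 
that you accepted the terms, and a bare requests.get() call starts without one.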

You can use the Session class to keep track of your cookies between requests:

import requests

with requests.Session() as s:

    # The first request obtains the sessionid cookie and stores it in the session
    response = s.get('http://flow.gassco.no')

    # Subsequent gets made through s will now include the session cookie
    response = s.get('http://flow.gassco.no/acceptDisclaimer')
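
Combined with the BeautifulSoup step from the original post, the whole thing 
might look roughly like the sketch below. The function names are made up for 
illustration. It assumes that a GET to acceptDisclaimer with the session cookie 
attached is all the site needs (which is what the form in the printout 
suggests), and it passes "html.parser" explicitly so bs4 does not have to guess 
a parser. I have not run it against the live site.

```python
from urllib.parse import urljoin

import bs4
import requests

BASE_URL = "http://flow.gassco.no/"


def fetch_page_behind_disclaimer(session, base_url=BASE_URL):
    """Accept the disclaimer within one session and return the parsed page.

    Hypothetical helper: assumes the server ties acceptance to a cookie
    that it hands out on the first request.
    """
    # First request: lets the server set its cookie on the session.
    session.get(base_url, timeout=10).raise_for_status()

    # Second request: same session, so the cookie is sent back.
    response = session.get(urljoin(base_url, "acceptDisclaimer"), timeout=10)
    response.raise_for_status()
    return bs4.BeautifulSoup(response.text, "html.parser")


def is_disclaimer_page(soup):
    """Return True if the page still contains the accept/decline form."""
    return soup.find("form", attrs={"action": "acceptDisclaimer"}) is not None


def main():
    with requests.Session() as s:
        soup = fetch_page_behind_disclaimer(s)
        if is_disclaimer_page(soup):
            print("Still on the T&C page; the cookie was probably not accepted.")
        else:
            print(soup.title)
```

Calling main() runs both requests. The is_disclaimer_page() check is only a 
quick way to tell whether you got past the form before you start scraping.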

A good place to start when debugging something like this is to open the 
developer tools in your browser (F12 in Chrome/Firefox) and watch the Network 
tab for the GET requests that are sent as you click on different buttons.
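
You can get a similar view from the Python side and compare it with what the 
browser shows. The helper below is only a sketch (summarize_exchange is a 
made-up name): it collects the redirect chain, the final status and URL, any 
Set-Cookie header on the final response, and the names of the cookies the 
session currently holds.

```python
import requests


def summarize_exchange(session, response):
    """Describe one response and the session's cookie state as a dict."""
    return {
        # Each entry in response.history is an earlier redirect response.
        "redirects": [(r.status_code, r.url) for r in response.history],
        "final": (response.status_code, response.url),
        # None if the final response did not set a cookie.
        "set_cookie": response.headers.get("Set-Cookie"),
        # Names of the cookies the session will send on later requests.
        "session_cookies": sorted(session.cookies.get_dict()),
    }
```

If "session_cookies" is empty after the first request, the site is probably 
tracking acceptance some other way, and the browser's request list is the 
place to look for what differs.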
-- 
https://mail.python.org/mailman/listinfo/python-list
