Re: Extracting data from a table with multiple pages

Travis Leleu Thu, 02 Oct 2014 20:36:07 -0700

My understanding is that they instantiate a headless browser.  In my
experience, headless browsers are much more resource intensive than a
vanilla HTTP request from Python, because they have to load the entire
browser stack, a JavaScript interpreter, etc.

That said, you only have to instantiate the headless browser once, not once
per page you want to scrape.  It gives you a browser that you can control
programmatically from your Python code through an API such as Selenium's.
So you launch the browser once, then tell it to load pages, interact with
elements, etc.
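
To make the "launch once, reuse for every page" point concrete, here is a
minimal sketch.  It assumes Selenium 4+ with headless Chrome installed
(swapped in for illustration; the thread is about PhantomJS), and the helper
names make_driver, fetch_pages and main are hypothetical, not part of any
library:

```python
import time


def make_driver():
    """Start ONE headless browser (assumes Selenium 4+ and Chrome are installed)."""
    # Imported lazily so fetch_pages() below can be used without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # no GUI window
    return webdriver.Chrome(options=options)


def fetch_pages(driver, urls, delay=1.0, sleep=time.sleep):
    """Load each URL in the same browser session; return {url: rendered HTML}."""
    pages = {}
    for i, url in enumerate(urls):
        if i:
            sleep(delay)  # pause between page loads, not before the first one
        driver.get(url)  # the browser runs the page's JavaScript
        pages[url] = driver.page_source
    return pages


def main():
    driver = make_driver()  # the expensive step happens once
    try:
        urls = ["http://steamcommunity.com/workshop/discussions/?appid=570"]
        for url, html in fetch_pages(driver, urls).items():
            print(url, len(html))
    finally:
        driver.quit()  # always shut the browser down
```

Calling main() starts the browser a single time and reuses it for every URL
in the list; the startup cost is paid once, not once per page.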

Any headless browser is going to consume more resources than Scrapy, but
it's not an obscene amount.  It's just a full browser, so it's like running
(most of) Chrome or Safari, just without the GUI presentation.

Honestly, if you're scraping just one site, you shouldn't hit it with more
than a couple of requests per second anyhow (at MOST).  So I wouldn't worry
too much about the browser consuming too many resources -- if running the
browsers is straining your resources, you may also be negatively impacting
the site you're scraping.  That is to say, don't have more than a few
browsers running at a time anyhow.
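
One way to stay under a couple of requests per second is a small throttle
that sleeps between calls.  This is only a sketch: the Throttle class and
its rate parameter are hypothetical names, and the default of 2 requests
per second is an assumption based on the "couple of requests" figure above:

```python
import time


class Throttle:
    """Allow at most `rate` calls per second by sleeping between them."""

    def __init__(self, rate=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate  # seconds required between requests
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

Call throttle.wait() right before each page load or HTTP request; the first
call returns immediately and later calls sleep only as long as needed.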

On Thu, Oct 2, 2014 at 8:05 PM, Chetan Motamarri <[email protected]> wrote:

> Hi Travis,
>
> Thanks for your reply.
>
> Using PhantomJS, can I crawl the data without opening the webpage? What I
> mean is, if I write an automated script in Selenium to pull data, each page
> will open in a browser and then the data gets pulled, right?
>
> This method will take a lot of time to pull data (as every page has to be
> opened in a browser and then crawled). I need to pull all games' discussions
> like this, so I am worried about time too.
>
> Is PhantomJS also like this (Selenium), or is it different?
>
> Thanks again.
>
> On Thursday, October 2, 2014 7:38:56 PM UTC-7, Chetan Motamarri wrote:
>>
>> Hi,
>>
>> I need to extract the start date & time of all discussions (on all pages)
>> from this URL: "http://steamcommunity.com/workshop/discussions/?appid=570".
>> I have tried a lot of ways but couldn't do it.
>>
>> Here the discussions change dynamically (i.e. when page 2 is clicked, I
>> was not able to find the 2nd page's discussions in the source code).
>>
>> My idea is that if there is a source file for the discussions (like
>> .xml/.json), then we can pull the data directly from that source. But I
>> was not able to find the location of such a file.
>>
>> How do I use Scrapy here?
>>
>  --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
>
