Hey Jim,

it still seems unintuitive that you need to go through HTTP requests when you have full access to everything. Have you looked at Drupal's static generator?
However, if making an HTTP request is your only (simple) way of generating the page you want in the failover, Scrapy might indeed be an option. If you know (i.e. can generate a list of) all your URLs, you can simply put them in a Spider's `start_urls` or `start_requests()`; I would prefer that over the requests library because you get Scrapy's throttling, error handling, etc. If the URLs are unknown, you can make use of CrawlSpider and its rules. Rough sketches of both variants follow below the quoted thread.

Cheers,
-Jakob

On 11/02/2015 11:06 PM, Jim Priest wrote:
> We would like to implement something like that moving forward.
>
> In the meantime we have a lot of pages currently cached we'd like to
> check (these may never get updated, so they would never see the on_save
> hook), and we also have a lot of static resources to check that have no
> 'save now' hook available.
>
> Ideally we'd have something that ran on a schedule for a broad update
> (once a week?) and then, by implementing hooks where we can, cover
> everything else.
>
> Jim
>
> On Mon, Nov 2, 2015 at 4:55 PM, Travis Leleu <[email protected]> wrote:
>
>     Jim, I'd probably add a hook to the on_save event in your blogs that
>     pushes the URL into a queue, and have a simple script that saves the
>     content to your static failover. There's no need for a spider/crawler
>     when you just want to grab one page's content on an event trigger.
>
>     Perhaps I'm not understanding why you'd need something heavy like
>     Scrapy; you could write a 30-line Python program to monitor the
>     queue, requests.get() the page, then save it to the static location.
>
>     On Mon, Nov 2, 2015 at 5:16 PM, Jim Priest <[email protected]> wrote:
>
>         I should have provided a bit more info on our use case :)
>
>         We have a lot of dynamic content in Drupal, blogs, etc. The
>         failover content is static versions of this dynamic content.
>         Currently this is done via a rather clunky Akamai tool which
>         we're hoping to replace.
>
>         Another goal is to update this content more immediately, i.e.
>         someone updates a Drupal page, it is immediately spidered (via
>         an API call or something), and that content is then saved to
>         the failover.
>
>         I could probably cobble something together with wget or some
>         other tool, but I'm trying not to reinvent the wheel here as
>         much as possible.
>
>         Thanks!
>         Jim
>
>         On Monday, November 2, 2015 at 7:28:39 AM UTC-5, Jakob de Maeyer wrote:
>
>             Hey Jim,
>
>             Scrapy is great at two things:
>             1. downloading web pages, and
>             2. extracting structured data from them.
>
>             In your case, you should already have access to the raw
>             files (via FTP, etc.), as well as to the data in a
>             structured format. It would be possible to do what you're
>             aiming at with Scrapy, but it doesn't seem to be the most
>             elegant solution. What speaks against setting up an rsync
>             cronjob or similar to keep the failover in sync?
>
>             Cheers,
>             -Jakob
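Here is a minimal sketch of both spider variants. Everything concrete in it is an assumption made up for illustration: the example.com URLs and domain, the `failover` output directory, the URL-path-to-filename mapping, and the `save_response()` helper. In practice you'd generate the real URL list from your CMS.

```python
# Sketch only: URLs, domain, and output directory below are invented.
import os
from urllib.parse import urlparse

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def save_response(response, output_dir="failover"):
    """Mirror one response to disk, mapping the URL path to a file path."""
    path = urlparse(response.url).path.strip("/") or "index"
    filename = os.path.join(output_dir, path + ".html")
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, "wb") as f:
        f.write(response.body)


class KnownUrlsSpider(scrapy.Spider):
    """Use this when you can generate the full list of URLs up front."""
    name = "failover_known"
    start_urls = [
        "http://www.example.com/blog/post-1",  # hypothetical
        "http://www.example.com/blog/post-2",
    ]

    def parse(self, response):
        save_response(response)


class DiscoverUrlsSpider(CrawlSpider):
    """Use this when the URLs have to be discovered by following links."""
    name = "failover_crawl"
    allowed_domains = ["www.example.com"]  # hypothetical
    start_urls = ["http://www.example.com/"]
    rules = (
        # Follow every internal link; save each fetched page.
        Rule(LinkExtractor(), callback="save_page", follow=True),
    )

    def save_page(self, response):
        save_response(response)
```

Run either one with `scrapy runspider <file>.py` (or `scrapy crawl failover_known` inside a project). Settings such as AUTOTHROTTLE_ENABLED and the RETRY_* options are what buy you the throttling and error handling mentioned above.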
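And for the lighter-weight route Travis describes in the quoted thread, a sketch of the queue monitor. It assumes the on_save hook pushes URLs onto a Redis list; the Redis backend, the `failover:urls` key, and the output directory are all assumptions, and any queue your hook can write to works the same way:

```python
# Sketch of the ~30-line monitor Travis describes; queue backend and
# key name are assumptions.
import os
from urllib.parse import urlparse

import redis
import requests

QUEUE_KEY = "failover:urls"   # hypothetical key the Drupal hook pushes to
OUTPUT_DIR = "failover"       # hypothetical static failover location

r = redis.Redis()

while True:
    # Block until the on_save hook pushes a URL onto the queue.
    _, url = r.blpop(QUEUE_KEY)
    url = url.decode("utf-8")
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print("Failed to fetch %s: %s" % (url, exc))
        continue
    path = urlparse(url).path.strip("/") or "index"
    filename = os.path.join(OUTPUT_DIR, path + ".html")
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, "wb") as f:
        f.write(resp.content)
    print("Saved %s -> %s" % (url, filename))
```

Blocking on the queue (rather than polling) keeps the script idle until the Drupal hook actually fires.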
