Scrapy is built on Twisted, so you don't need to enable anything.  In fact,
you don't really need to know Twisted at all, unless you want to write
middlewares.

For automatically downloading PDF files, I'd look into the downloader
middleware component of Scrapy.  There is an example in the documentation of
using a downloader middleware to save images, which you should be able to
adapt for PDFs fairly easily.
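
A rough, untested sketch of what such a middleware could look like (the class
name, the PDF_STORE setting, and the filename handling are all invented for
this example, not anything Scrapy ships with):

    import os

    class PdfSaverMiddleware(object):
        # Write any PDF response to disk, then hand it back to the engine.

        def __init__(self, store_dir):
            self.store_dir = store_dir
            if not os.path.isdir(store_dir):
                os.makedirs(store_dir)

        @classmethod
        def from_crawler(cls, crawler):
            # PDF_STORE is an invented setting name, defaulting to ./pdfs
            return cls(crawler.settings.get('PDF_STORE', 'pdfs'))

        def process_response(self, request, response, spider):
            content_type = response.headers.get('Content-Type', b'')
            if b'pdf' in content_type.lower():
                name = response.url.rstrip('/').rsplit('/', 1)[-1] or 'file.pdf'
                with open(os.path.join(self.store_dir, name), 'wb') as f:
                    f.write(response.body)
            return response

You would then point the DOWNLOADER_MIDDLEWARES setting at it so it runs for
every response.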

I'm not sure whether Scrapy will invoke the callback on non-HTML responses.  If
it does, you simply need to save the file to disk in standard Python code
-- yielding items is more about data extraction, and AFAIK the item pipelines
don't support files.  (Theoretically you could serialize the file, send it
through the item pipeline, deserialize it, and save it to disk... but why not
just put an f.write() call into your parse function whenever you match a PDF?)
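
For instance, a minimal sketch of that last idea, reusing your get_file
callback name (the filename handling is just for illustration):

    def get_file(self, response):
        # If the response looks like a PDF, write the raw bytes to disk.
        if response.url.lower().endswith('.pdf'):
            filename = response.url.rstrip('/').rsplit('/', 1)[-1]
            with open(filename, 'wb') as f:
                f.write(response.body)
            return
        # otherwise continue with the normal item extraction here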

On Mon, Oct 20, 2014 at 6:52 AM, Szymon Roziewski <
[email protected]> wrote:

> With such a rule:
>
>     Rule(LxmlLinkExtractor(allow=("ecolex.org/server2.php/libcat/docs", )),
>          callback='get_file'),
>
> I would like to grab all the files that match this pattern, i.e. doc, pdf,
> txt, and csv files.
> But all I actually manage to get are txt files.
> I have a callback method here:
>
>     def get_file(self, response):
>         item = FiledownloadItem()
>         item["file_urls"] = [response.url]
>         yield item
>
>
>
> On Friday, 17 October 2014 14:45:32 UTC+2, Szymon Roziewski wrote:
>
>> Hi scrapy people,
>>
>> I am quite new to Scrapy. I have written one script which works, and I am
>> still developing it.
>>
>> Could you explain one thing to me, please?
>>
>> If I have code such as
>>
>>     rules = [
>>         Rule(LxmlLinkExtractor(allow=("ecolex/ledge/view/SearchResults", )),
>>              follow=True),
>>         Rule(LxmlLinkExtractor(allow=("ecolex/ledge/view/RecordDetails", )),
>>              callback='found_items'),
>>     ]
>>
>> what actually happens?
>>
>> For each pattern all matching links are extracted, and for SearchResults
>> the spider only follows those links until it has reached them all.
>>
>> If a link matching the RecordDetails pattern is found on the site, the
>> spider applies the method 'found_items' to it for further processing.
>>
>> My question is about the task scheduling here.
>>
>> Does it happen sequentially or in parallel?
>>
>> I mean, does the spider scrape some data from a site matching
>> RecordDetails and, once all items are scraped, switch to following
>> another link and scrape that one?
>>
>> This seems somewhat automagical. How does Scrapy know what to do first,
>> to scrape or to follow?
>>
>> Is it a sequential job:
>>
>> following one site -> scraping all content
>> following a second site -> scraping all content
>>
>> Or is there some parallelization, like:
>> following one site -> scraping all content & following a second site ->
>> scraping all content
>>
>> I would like it to work the latter way if it does not already.
>>
>> The question is: how could I do it?
>>
>> Regards,
>> Szymon Roziewski
>>
>>
