With a rule like this:

    Rule(LxmlLinkExtractor(allow=("ecolex.org/server2.php/libcat/docs", )),
         callback='get_file'),

I would like to download every file whose URL matches this pattern, i.e. doc,
pdf, txt and csv files.
However, only the txt files are actually downloaded.

My callback method is:

    def get_file(self, response):
        # Hand the matched URL to the files pipeline via "file_urls".
        item = FiledownloadItem()
        item["file_urls"] = [response.url]
        yield item

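One possible explanation, stated as an assumption and not a confirmed diagnosis: Scrapy's link extractors accept a deny_extensions argument, and its default list of ignored extensions includes pdf and doc. A link can therefore match the allow pattern and still be dropped before the callback ever sees it. The sketch below is a simplified stand-in for that two-step filtering in plain Python 3, not Scrapy's actual code. The DENIED_EXTENSIONS subset, the is_extracted helper and the example URLs are hypothetical.

```python
import re
from posixpath import splitext
from urllib.parse import urlparse

# The allow pattern from the rule above, used as a regular expression.
ALLOW = re.compile(r"ecolex.org/server2.php/libcat/docs")

# Hypothetical subset of a deny-by-extension list (assumption: the real
# default list in Scrapy is much longer).
DENIED_EXTENSIONS = frozenset({".pdf", ".doc"})


def is_extracted(url, denied=DENIED_EXTENSIONS):
    """Return True if url matches ALLOW and its extension is not denied."""
    if not ALLOW.search(url):
        return False
    extension = splitext(urlparse(url).path)[1].lower()
    return extension not in denied


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the filter.
    base = "http://www.ecolex.org/server2.php/libcat/docs/"
    for name in ("report.txt", "report.pdf", "report.doc"):
        print(name, is_extracted(base + name))
```

If this is indeed the cause, passing deny_extensions=[] to LxmlLinkExtractor in the rule should let such links through. This is untested against the site in question.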
On Friday, 17 October 2014 14:45:32 UTC+2, Szymon Roziewski wrote:
>
> Hi scrapy people,
>
> I am quite new to Scrapy. I have written one script that works, and I am
> still developing it.
>
> Could you please explain one thing to me?
>
> If I have code like this:
>
> rules = [
>     Rule(LxmlLinkExtractor(allow=("ecolex/ledge/view/SearchResults", )),
>          follow=True),
>     Rule(LxmlLinkExtractor(allow=("ecolex/ledge/view/RecordDetails", )),
>          callback='found_items'),
> ]
>
> what actually happens?
>
> Links matching each pattern are extracted. For the SearchResults pattern,
> the spider only follows the matching links until it has visited all of them.
>
> If a link matching the RecordDetails pattern is found on a page, the spider
> calls the 'found_items' method for further processing.
>
> My question is about task scheduling.
>
> Does it happen sequentially or in parallel?
>
> I mean, does the spider scrape data from a page matching RecordDetails and,
> only after all items are scraped, switch to following another link and
> scrape again?
>
> This is somewhat automagical. How does Scrapy know what to do first: scrape
> or follow?
>
> Is it a sequential job:
>
> following one site -> scraping all content
> following second site -> scraping all content
>
> Or is there some parallelization, like:
> following one site -> scraping all content & following second site ->
> scraping all content
>
> I would like it to work the latter way, if it does not already.
>
> How could I do that?
>
> Regards,
> Szymon Roziewski