Hi Matt,
one way to do that is to play with the FilesPipeline's *get_media_requests()*,
passing additional data through the meta dict,
and then using a custom *file_path()* method.
Below, I use a dict in *file_urls* and not a list, so that I can pass both a
URL and a custom *file_name*.
Using the same IETF example I used above in the thread:
A simple spider downloading some files from IETF.org:

from scrapy.spider import Spider
from scrapy.item import Item, Field


class IetfItem(Item):
    # the two special fields used by the FilesPipeline
    files = Field()
    file_urls = Field()


class IETFSpider(Spider):
    name = 'ietfpipe'
    allowed_domains = ['ietf.org']
    start_urls = ['http://www.ietf.org']
    file_urls = [
        'http://www.ietf.org/images/ietflogotrans.gif',
        'http://www.ietf.org/rfc/rfc2616.txt',
        'http://www.rfc-editor.org/rfc/rfc2616.ps',
        'http://www.rfc-editor.org/rfc/rfc2616.pdf',
        'http://tools.ietf.org/html/rfc2616.html',
    ]

    def parse(self, response):
        # one item per file, carrying both the URL and a custom name
        for cnt, furl in enumerate(self.file_urls, start=1):
            yield IetfItem(file_urls=[
                {"file_url": furl, "file_name": "file_%03d" % cnt}])
Custom FilesPipeline:

from scrapy.contrib.pipeline.files import FilesPipeline
from scrapy.http import Request


class MyFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        # pass each file spec along with its request through the meta dict
        for file_spec in item['file_urls']:
            yield Request(url=file_spec["file_url"],
                          meta={"file_spec": file_spec})

    def file_path(self, request, response=None, info=None):
        # save under the custom name instead of the default SHA1-based path
        return request.meta["file_spec"]["file_name"]
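
To enable it, point ITEM_PIPELINES at the custom class instead of the stock
one (a minimal sketch, assuming the class above lives in
yourproject/pipelines.py):

ITEM_PIPELINES = [
    'yourproject.pipelines.MyFilesPipeline',
]
FILES_STORE = '/path/to/yourproject/downloads'

With this, the five files end up as file_001 through file_005 under
FILES_STORE, instead of under SHA1-based names in full/.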
Hope this helps
/Paul.
On Friday, February 21, 2014 6:44:20 AM UTC+1, Matt Cialini wrote:
>
> Hello Paul!
>
> I'm Matt. I know this is a somewhat old thread now, but I found your
> advice about the FilesPipeline and it works great. I had one question,
> though. Do you know of an easy way to pass in a file_name field for each
> URL, so that the FilesPipeline will save each file with the correct name?
>
> Thanks!
>
> On Saturday, September 21, 2013 1:03:09 PM UTC-4, Paul Tremberth wrote:
>>
>> Hi Ana,
>>
>> if you want to use the FilesPipeline, before it's in an official Scrapy
>> release,
>> here's one way to do it:
>>
>> 1) download
>> https://raw.github.com/scrapy/scrapy/master/scrapy/contrib/pipeline/files.py
>> and save it somewhere in your Scrapy project,
>> let's say at the root of your project (but that's not the best
>> location...)
>> yourproject/files.py
>>
>> 2) then, enable this pipeline by adding this to your settings.py
>>
>> ITEM_PIPELINES = [
>>     'yourproject.files.FilesPipeline',
>> ]
>> FILES_STORE = '/path/to/yourproject/downloads'
>>
>> FILES_STORE needs to point to a location where Scrapy can write (create
>> it beforehand)
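>>
>> For example, on a Unix-like system:
>>
>> mkdir -p /path/to/yourproject/downloads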
>>
>> 3) add 2 special fields to your item definition
>>     file_urls = Field()
>>     files = Field()
>>
>> 4) in your spider, when you have a URL for a file to download,
>> add it to your Item instance before returning it
>>
>> ...
>> myitem = YourProjectItem()
>> ...
>> myitem["file_urls"] = ["http://www.example.com/somefileiwant.csv"]
>> yield myitem
>>
>> 5) run your spider and you should see files in the FILES_STORE folder
>>
>> Here's an example that downloads a few files from the IETF website
>>
>> the scrapy project is called "filedownload"
>>
>> items.py looks like this:
>>
>> from scrapy.item import Item, Field
>>
>> class FiledownloadItem(Item):
>>     file_urls = Field()
>>     files = Field()
>>
>>
>> this is the code for the spider:
>>
>> from scrapy.spider import BaseSpider
>> from filedownload.items import FiledownloadItem
>>
>> class IetfSpider(BaseSpider):
>>     name = "ietf"
>>     allowed_domains = ["ietf.org"]
>>     start_urls = (
>>         'http://www.ietf.org/',
>>     )
>>
>>     def parse(self, response):
>>         yield FiledownloadItem(
>>             file_urls=[
>>                 'http://www.ietf.org/images/ietflogotrans.gif',
>>                 'http://www.ietf.org/rfc/rfc2616.txt',
>>                 'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>                 'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>                 'http://tools.ietf.org/html/rfc2616.html',
>>             ]
>>         )
>>
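>> To run it (assuming the standard project layout, from the project
>> directory):
>>
>> scrapy crawl ietf
>>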
>> When you run the spider, at the end, you should see in the console
>> something like this:
>>
>> 2013-09-21 18:30:42+0200 [ietf] DEBUG: Scraped from <200 http://www.ietf.org/>
>>     {'file_urls': ['http://www.ietf.org/images/ietflogotrans.gif',
>>                    'http://www.ietf.org/rfc/rfc2616.txt',
>>                    'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>                    'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>                    'http://tools.ietf.org/html/rfc2616.html'],
>>      'files': [{'checksum': 'e4b6ca0dd271ce887e70a1a2a5d681df',
>>                 'path': 'full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif',
>>                 'url': 'http://www.ietf.org/images/ietflogotrans.gif'},
>>                {'checksum': '9fa63f5083e4d2112d2e71b008e387e8',
>>                 'path': 'full/454ea89fbeaf00219fbcae49960d8bd1016994b0.txt',
>>                 'url': 'http://www.ietf.org/rfc/rfc2616.txt'},
>>                {'checksum': '5f0dc88aced3b0678d702fb26454e851',
>>                 'path': 'full/f76736e9f1f22d7d5563208d97d13e7cc7a3a633.ps',
>>                 'url': 'http://www.rfc-editor.org/rfc/rfc2616.ps'},
>>                {'checksum': '2d555310626966c3521cda04ae2fe76f',
>>                 'path': 'full/6ff52709da9514feb13211b6eb050458f353b49a.pdf',
>>                 'url': 'http://www.rfc-editor.org/rfc/rfc2616.pdf'},
>>                {'checksum': '735820b4f0f4df7048b288ba36612295',
>>                 'path': 'full/7192dd9a00a8567bf3dc4c21ababdcec6c69ce7f.html',
>>                 'url': 'http://tools.ietf.org/html/rfc2616.html'}]}
>> 2013-09-21 18:30:42+0200 [ietf] INFO: Closing spider (finished)
>>
>> which tells you what files were downloaded, and where they were stored.
>>
>> Hope this helps.
>>
>> On Tuesday, September 17, 2013 1:46:15 PM UTC+2, Ana Carolina Assis Jesus
>> wrote:
>>>
>>> Hi Paul,
>>>
>>> Could you give me an example on how to use the pipeline, please?
>>>
>>> Thanks,
>>> Ana
>>>
>>> On Tue, Sep 17, 2013 at 12:19 PM, Ana Carolina Assis Jesus
>>> <[email protected]> wrote:
>>> > Well, I installed about two weeks ago, but a tagged version... so
>>> > maybe I don't have it...
>>> > But do I really need the pipeline? Clicking the button should, in
>>> > principle at least, just download a file! I mean, that's what it does
>>> > manually... ???
>>> >
>>> > Thanks!
>>> >
>>> > On Tue, Sep 17, 2013 at 12:14 PM, Paul Tremberth
>>> > <[email protected]> wrote:
>>> >> Well, the FilesPipeline is a module inside scrapy.contrib.pipeline.
>>> >> It was committed less than 2 weeks ago. (Scrapy is being improved all
>>> >> the time by the community.)
>>> >>
>>> >> It depends on when and how you installed Scrapy:
>>> >> - if you installed a tagged version using pip or easy_install (as is
>>> >> recommended:
>>> >> http://doc.scrapy.org/en/latest/intro/install.html#installing-scrapy),
>>> >> you won't have the pipeline and you'll have to add it yourself
>>> >>
>>> >> - if you installed from source less than 2 weeks ago (git clone
>>> >> git@github.com:scrapy/scrapy.git; cd scrapy; sudo python setup.py
>>> >> install), you should be good (but Scrapy from the latest source code
>>> >> might be unstable and not fully tested)
>>> >>
>>> >>
>>> >> On Tuesday, September 17, 2013 12:04:31 PM UTC+2, Ana Carolina Assis
>>> >> Jesus wrote:
>>> >>>
>>> >>> Hi Paul.
>>> >>>
>>> >>> What do you mean by installing scrapy from source?
>>> >>> I need a new version from it?
>>> >>>
>>> >>> On Tue, Sep 17, 2013 at 12:01 PM, Paul Tremberth
>>> >>> <[email protected]> wrote:
>>> >>> > Hi Ana,
>>> >>> > to download files, you should have a look at the new FilesPipeline
>>> >>> > https://github.com/scrapy/scrapy/pull/370
>>> >>> >
>>> >>> > It's in the master branch though, not in a tagged version of
>>> >>> > Scrapy, so you'll have to install Scrapy from source.
>>> >>> >
>>> >>> > Paul.
>>> >>> >
>>> >>> >
>>> >>> > On Tuesday, September 17, 2013 11:50:05 AM UTC+2, Ana Carolina
>>> >>> > Assis Jesus wrote:
>>> >>> >>
>>> >>> >> Hi!
>>> >>> >>
>>> >>> >> I am trying to download a CSV file with Scrapy.
>>> >>> >> I could crawl inside the site and get to the form I need, and
>>> >>> >> there I find two buttons to click.
>>> >>> >> One will list the transactions, while the second one will
>>> >>> >> download a XXX.csv file.
>>> >>> >>
>>> >>> >> How do I save this file within scrapy?
>>> >>> >>
>>> >>> >> I mean, if I choose to list the transactions, I will get another
>>> >>> >> webpage, and this I can see.
>>> >>> >> But what if I choose the action to download? I guess I should not
>>> >>> >> use return self.parse_dosomething, but something else to save the
>>> >>> >> file it should give me (???)
>>> >>> >>
>>> >>> >> Or should the download start by itself?
>>> >>> >>
>>> >>> >> Thanks,
>>> >>> >> Ana