Hello Paul! I'm Matt. I know this is a somewhat old thread now, but I found your advice about the FilesPipeline and it works great. One question, though: do you know of an easy way to pass in a file_name field for each URL, so that the FilesPipeline saves each download under the correct name?

Thanks!
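For anyone landing here with the same question, here is a minimal sketch of one way to do it: subclass the FilesPipeline, pass the desired name along in the request meta, and override file_path() so it is used instead of the default hash-based name. The class name and the item's "file_name" field below are made up for illustration, and older copies of files.py name the override file_key() and pass it only the URL, so check the signature in the files.py you are actually running.

import os
from scrapy.contrib.pipeline.files import FilesPipeline
from scrapy.http import Request

class RenamingFilesPipeline(FilesPipeline):
    # NOTE: the class name, the item's "file_name" field, and the
    # file_path() override are assumptions for illustration, not
    # something guaranteed by this thread.

    def get_media_requests(self, item, info):
        # carry the desired file name along with each download request
        for url in item.get('file_urls', []):
            yield Request(url, meta={'file_name': item.get('file_name')})

    def file_path(self, request, response=None, info=None):
        # use the supplied name if present, otherwise fall back to the
        # default hash-based path
        name = request.meta.get('file_name')
        if name:
            return os.path.join('full', name)
        return super(RenamingFilesPipeline, self).file_path(request, response, info)

Enable it by pointing the ITEM_PIPELINES entry at 'yourproject.files.RenamingFilesPipeline' instead of the stock class. One caveat: if a single item carries several file_urls, give each URL its own name (a parallel list of names, say), otherwise they will all collide on the same path.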
On Saturday, September 21, 2013 1:03:09 PM UTC-4, Paul Tremberth wrote:

Hi Ana,

if you want to use the FilesPipeline before it's in an official Scrapy release, here's one way to do it:

1) Download https://raw.github.com/scrapy/scrapy/master/scrapy/contrib/pipeline/files.py and save it somewhere in your Scrapy project, let's say at the root of your project (but that's not the best location...): yourproject/files.py

2) Then enable this pipeline by adding this to your settings.py:

ITEM_PIPELINES = [
    'yourproject.files.FilesPipeline',
]
FILES_STORE = '/path/to/yourproject/downloads'

FILES_STORE needs to point to a location where Scrapy can write (create it beforehand).

3) Add two special fields to your item definition:

file_urls = Field()
files = Field()

4) In your spider, when you have a URL for a file to download, add it to your Item instance before returning it:

...
myitem = YourProjectItem()
...
myitem["file_urls"] = ["http://www.example.com/somefileiwant.csv"]
yield myitem

5) Run your spider and you should see files in the FILES_STORE folder.

Here's an example that downloads a few files from the IETF website. The Scrapy project is called "filedownload".

items.py looks like this:

from scrapy.item import Item, Field

class FiledownloadItem(Item):
    file_urls = Field()
    files = Field()

This is the code for the spider:

from scrapy.spider import BaseSpider
from filedownload.items import FiledownloadItem

class IetfSpider(BaseSpider):
    name = "ietf"
    allowed_domains = ["ietf.org"]
    start_urls = (
        'http://www.ietf.org/',
    )

    def parse(self, response):
        yield FiledownloadItem(
            file_urls=[
                'http://www.ietf.org/images/ietflogotrans.gif',
                'http://www.ietf.org/rfc/rfc2616.txt',
                'http://www.rfc-editor.org/rfc/rfc2616.ps',
                'http://www.rfc-editor.org/rfc/rfc2616.pdf',
                'http://tools.ietf.org/html/rfc2616.html',
            ]
        )

When you run the spider, at the end you should see something like this in the console:

2013-09-21 18:30:42+0200 [ietf] DEBUG: Scraped from <200 http://www.ietf.org/>
    {'file_urls': ['http://www.ietf.org/images/ietflogotrans.gif',
                   'http://www.ietf.org/rfc/rfc2616.txt',
                   'http://www.rfc-editor.org/rfc/rfc2616.ps',
                   'http://www.rfc-editor.org/rfc/rfc2616.pdf',
                   'http://tools.ietf.org/html/rfc2616.html'],
     'files': [{'checksum': 'e4b6ca0dd271ce887e70a1a2a5d681df',
                'path': 'full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif',
                'url': 'http://www.ietf.org/images/ietflogotrans.gif'},
               {'checksum': '9fa63f5083e4d2112d2e71b008e387e8',
                'path': 'full/454ea89fbeaf00219fbcae49960d8bd1016994b0.txt',
                'url': 'http://www.ietf.org/rfc/rfc2616.txt'},
               {'checksum': '5f0dc88aced3b0678d702fb26454e851',
                'path': 'full/f76736e9f1f22d7d5563208d97d13e7cc7a3a633.ps',
                'url': 'http://www.rfc-editor.org/rfc/rfc2616.ps'},
               {'checksum': '2d555310626966c3521cda04ae2fe76f',
                'path': 'full/6ff52709da9514feb13211b6eb050458f353b49a.pdf',
                'url': 'http://www.rfc-editor.org/rfc/rfc2616.pdf'},
               {'checksum': '735820b4f0f4df7048b288ba36612295',
                'path': 'full/7192dd9a00a8567bf3dc4c21ababdcec6c69ce7f.html',
                'url': 'http://tools.ietf.org/html/rfc2616.html'}]}
2013-09-21 18:30:42+0200 [ietf] INFO: Closing spider (finished)

which tells you what files were downloaded and where they were stored.

Hope this helps.
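An aside on the 'path' values in that output: as far as I can tell from files.py, the stored file name is simply the SHA1 hex digest of the download URL plus the original extension, so the same URL always maps to the same path. A two-line sketch to check that against the log above:

import hashlib

# reproduce the first 'path' reported in the log above (assuming the
# name really is the SHA1 of the URL; check files.py to confirm)
url = 'http://www.ietf.org/images/ietflogotrans.gif'
print('full/%s.gif' % hashlib.sha1(url.encode('utf-8')).hexdigest())
# expected: full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif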
On Tuesday, September 17, 2013 1:46:15 PM UTC+2, Ana Carolina Assis Jesus wrote:

Hi Paul,

Could you give me an example of how to use the pipeline, please?

Thanks,
Ana

On Tue, Sep 17, 2013 at 12:19 PM, Ana Carolina Assis Jesus <[email protected]> wrote:

Well, I installed about two weeks ago, but a tagged version... so maybe I don't have it... But I really need the pipeline, even though the "get" button should, in principle at least, just download a file! I mean, that is what it does manually... ???

Thanks!

On Tue, Sep 17, 2013 at 12:14 PM, Paul Tremberth <[email protected]> wrote:

Well, the FilesPipeline is a module inside scrapy.contrib.pipeline. It was committed less than 2 weeks ago. (Scrapy is being improved all the time by the community.)

It depends when and how you installed Scrapy (a quick way to check which case you are in is sketched at the end of this thread):
- if you installed a tagged version using pip or easy_install (as recommended: http://doc.scrapy.org/en/latest/intro/install.html#installing-scrapy), you won't have the pipeline and will have to add it yourself
- if you installed from source less than 2 weeks ago (git clone [email protected]:scrapy/scrapy.git; cd scrapy; sudo python setup.py install), you should be good (but Scrapy from the latest source code might be unstable and not fully tested)

On Tuesday, September 17, 2013 12:04:31 PM UTC+2, Ana Carolina Assis Jesus wrote:

Hi Paul.

What do you mean by installing Scrapy from source? Do I need a new version of it?

On Tue, Sep 17, 2013 at 12:01 PM, Paul Tremberth <[email protected]> wrote:

Hi Ana,
to download files, you should have a look at the new FilesPipeline: https://github.com/scrapy/scrapy/pull/370

It's in the master branch though, not in a tagged version of Scrapy, so you'll have to install Scrapy from source.

Paul.

On Tuesday, September 17, 2013 11:50:05 AM UTC+2, Ana Carolina Assis Jesus wrote:

Hi!

I am trying to download a CSV file with Scrapy. I can crawl inside the site and get to the form I need, and then I find two buttons to click. One will list the transactions, while the second one will download an XXX.csv file.

How do I save this file within Scrapy? I mean, if I choose to list the transactions, I get another webpage, and that one I can see. But what if I choose the download action? I guess I should not use return self.parse_dosomething but something else to save the file it should give me (???)

Or should the download start by itself?

Thanks,
Ana
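Regarding the "it depends when and how you installed Scrapy" point above: rather than guessing from the install date, a quick check (a sketch; the import path is the one used elsewhere in this thread and may differ in other Scrapy versions):

# check whether the installed Scrapy already ships the FilesPipeline
try:
    from scrapy.contrib.pipeline.files import FilesPipeline
    print('FilesPipeline is available; no need to copy files.py')
except ImportError:
    print('FilesPipeline not found; copy files.py into your project as described above')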
