Hi Matt,
I was wondering if you ever figured out your problem. I am also looking to
use the FilesPipeline with custom file names. I was able to edit
FilesPipeline itself to achieve this, but it would obviously be better
practice to extend FilesPipeline and override the necessary methods
instead. When I use a solution similar to Paul's, my files are not
downloaded to my hard drive.
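
In case it is useful to anyone debugging the same thing, here are the
settings I am double-checking; as far as I understand, nothing reaches disk
if the custom pipeline is not enabled or if FILES_STORE is missing or
unwritable. The module path below is only a placeholder for wherever the
subclass actually lives:

ITEM_PIPELINES = [
    'yourproject.pipelines.MyFilesPipeline',  # the subclass, not the stock FilesPipeline
]
FILES_STORE = '/path/to/writable/downloads'  # must exist and be writable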
Thank you!
On Tuesday, February 25, 2014 9:03:20 AM UTC-6, Matt Cialini wrote:
>
> Hi Paul,
>
> Thanks for the suggestion. I'm trying to implement it now, but the files
> aren't being written to disk correctly. Which function in files.py handles
> the actual saving of the file?
>
> Every item I pass into files.py eventually ends up as a FileDownloadItem
> of the form {'file_urls': [several dict objects like {file_url: url,
> file_name: name}]}
>
> I'll attach my code to this if you have time to look it over. Basically I
> think something is not being passed in correctly in files.py, but it's hard
> to search through and determine where.
>
> Thanks so much Paul!
>
> - Matt C
>
>
> On Tue, Feb 25, 2014 at 4:28 AM, Paul Tremberth
> <[email protected]> wrote:
>
>
>> Hi Matt,
>>
>> one way to do that is to override the FilesPipeline's
>> *get_media_requests()*, passing additional data through the meta dict,
>> and then use a custom *file_path()* method.
>>
>> Below, I put dicts in *file_urls* instead of plain URL strings, so that
>> each entry can carry both a URL and a custom *file_name*.
>>
>> Using the same IETF example I used above in the thread:
>>
>> A simple spider downloading some files from IETF.org
>>
>> from scrapy.spider import Spider
>> from scrapy.item import Item, Field
>>
>>
>> class IetfItem(Item):
>>     files = Field()
>>     file_urls = Field()
>>
>>
>> class IETFSpider(Spider):
>>     name = 'ietfpipe'
>>     allowed_domains = ['ietf.org']
>>     start_urls = ['http://www.ietf.org']
>>     file_urls = [
>>         'http://www.ietf.org/images/ietflogotrans.gif',
>>         'http://www.ietf.org/rfc/rfc2616.txt',
>>         'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>         'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>         'http://tools.ietf.org/html/rfc2616.html',
>>     ]
>>
>>     def parse(self, response):
>>         # one item per file, carrying both the URL and the wanted name
>>         for cnt, furl in enumerate(self.file_urls, start=1):
>>             yield IetfItem(file_urls=[
>>                 {"file_url": furl, "file_name": "file_%03d" % cnt}])
>>
>>
>>
>> Custom FilesPipeline
>>
>> from scrapy.contrib.pipeline.files import FilesPipeline
>> from scrapy.http import Request
>>
>> class MyFilesPipeline(FilesPipeline):
>>
>>     def get_media_requests(self, item, info):
>>         # pass the whole file spec along in the request meta
>>         for file_spec in item['file_urls']:
>>             yield Request(url=file_spec["file_url"],
>>                           meta={"file_spec": file_spec})
>>
>>     def file_path(self, request, response=None, info=None):
>>         # store under the requested name instead of the SHA1 default
>>         return request.meta["file_spec"]["file_name"]
>>
>>
>>
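>> If you also want to keep the original file extension (the sketch above
>> stores bare names like file_001), a variant along these lines might
>> work; untested, and splitting the extension off the request URL is just
>> the simplest approach I can think of:
>>
>> import os
>>
>> from scrapy.contrib.pipeline.files import FilesPipeline
>> from scrapy.http import Request
>>
>> class MyNamedFilesPipeline(FilesPipeline):
>>
>>     def get_media_requests(self, item, info):
>>         for file_spec in item['file_urls']:
>>             yield Request(url=file_spec["file_url"],
>>                           meta={"file_spec": file_spec})
>>
>>     def file_path(self, request, response=None, info=None):
>>         # keep the URL's extension: file_001.gif, file_002.txt, ...
>>         ext = os.path.splitext(request.url)[1]
>>         return request.meta["file_spec"]["file_name"] + ext
>>
>> Either way, remember to point ITEM_PIPELINES at the subclass rather
>> than the stock FilesPipeline.
>>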
>> Hope this helps
>>
>> /Paul.
>>
>> On Friday, February 21, 2014 6:44:20 AM UTC+1, Matt Cialini wrote:
>>>
>>> Hello Paul!
>>>
>>> I'm Matt. I know this is a somewhat old thread now, but I found your
>>> advice about FilesPipeline and it works great. I had one question,
>>> though: do you know of an easy way to pass in a file_name field for each
>>> URL so that the FilesPipeline will save each file with the correct name?
>>>
>>> Thanks!
>>>
>>> On Saturday, September 21, 2013 1:03:09 PM UTC-4, Paul Tremberth wrote:
>>>>
>>>> Hi Ana,
>>>>
>>>> if you want to use the FilesPipeline, before it's in an official
>>>> Scrapy release,
>>>> here's one way to do it:
>>>>
>>>> 1) download https://raw.github.com/scrapy/scrapy/master/scrapy/contrib/pipeline/files.py
>>>> and save it somewhere in your Scrapy project,
>>>> let's say at the root of your project (but that's not the best
>>>> location...)
>>>> yourproject/files.py
>>>>
>>>> 2) then, enable this pipeline by adding this to your settings.py
>>>>
>>>> ITEM_PIPELINES = [
>>>>     'yourproject.files.FilesPipeline',
>>>> ]
>>>> FILES_STORE = '/path/to/yourproject/downloads'
>>>>
>>>> FILES_STORE needs to point to a location where Scrapy can write (create
>>>> it beforehand)
>>>>
>>>> 3) add 2 special fields to your item definition:
>>>>
>>>> file_urls = Field()
>>>> files = Field()
>>>>
>>>> 4) in your spider, when you have a URL for a file to download,
>>>> add it to your Item instance before returning it
>>>>
>>>> ...
>>>> myitem = YourProjectItem()
>>>> ...
>>>> myitem["file_urls"] = ["http://www.example.com/somefileiwant.csv"]
>>>> yield myitem
>>>>
>>>> 5) run your spider and you should see files in the FILES_STORE folder
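>>>>
>>>> (with the example project below, that would be something like
>>>>
>>>> scrapy crawl ietf
>>>> ls /path/to/yourproject/downloads/full
>>>>
>>>> since the pipeline stores the files under a 'full/' subfolder of
>>>> FILES_STORE, as the log output further down shows)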
>>>>
>>>> Here's an example that downloads a few files from the IETF website
>>>>
>>>> the scrapy project is called "filedownload"
>>>>
>>>> items.py looks like this:
>>>>
>>>> from scrapy.item import Item, Field
>>>>
>>>> class FiledownloadItem(Item):
>>>>     file_urls = Field()
>>>>     files = Field()
>>>>
>>>>
>>>> this is the code for the spider:
>>>>
>>>> from scrapy.spider import BaseSpider
>>>> from filedownload.items import FiledownloadItem
>>>>
>>>> class IetfSpider(BaseSpider):
>>>>     name = "ietf"
>>>>     allowed_domains = ["ietf.org"]
>>>>     start_urls = (
>>>>         'http://www.ietf.org/',
>>>>     )
>>>>
>>>>     def parse(self, response):
>>>>         yield FiledownloadItem(
>>>>             file_urls=[
>>>>                 'http://www.ietf.org/images/ietflogotrans.gif',
>>>>                 'http://www.ietf.org/rfc/rfc2616.txt',
>>>>                 'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>>>                 'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>>>                 'http://tools.ietf.org/html/rfc2616.html',
>>>>             ]
>>>>         )
>>>>
>>>> When you run the spider, you should see something like this in the
>>>> console at the end:
>>>>
>>>> 2013-09-21 18:30:42+0200 [ietf] DEBUG: Scraped from <200 http://www.ietf.org/>
>>>> {'file_urls': ['http://www.ietf.org/images/ietflogotrans.gif',
>>>>                'http://www.ietf.org/rfc/rfc2616.txt',
>>>>                'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>>>                'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>>>                'http://tools.ietf.org/html/rfc2616.html'],
>>>>  'files': [{'checksum': 'e4b6ca0dd271ce887e70a1a2a5d681df',
>>>>             'path': 'full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif',
>>>>             'url': 'http://www.ietf.org/images/ietflogotrans.gif'},
>>>>            {'checksum': '9fa63f5083e4d2112d2e71b008e387e8',
>>>>             'path': 'full/454ea89fbeaf00219fbcae49960d8bd1016994b0.txt',
>>>>             'url': 'http://www.ietf.org/rfc/rfc2616.txt'},
>>>>            {'checksum': '5f0dc88aced3b0678d702fb26454e851',
>>>>             'path': 'full/f76736e9f1f22d7d5563208d97d13e7cc7a3a633.ps',
>>>>             'url': 'http://www.rfc-editor.org/rfc/rfc2616.ps'},
>>>>            {'checksum': '2d555310626966c3521cda04ae2fe76f',
>>>>             'path': 'full/6ff52709da9514feb13211b6eb050458f353b49a.pdf',
>>>>             'url': 'http://www.rfc-editor.org/rfc/rfc2616.pdf'},
>>>>            {'checksum': '735820b4f0f4df7048b288ba36612295',
>>>>             'path': 'full/7192dd9a00a8567bf3dc4c21ababdcec6c69ce7f.html',
>>>>             'url': 'http://tools.ietf.org/html/rfc2616.html'}]}
>>>> 2013-09-21 18:30:42+0200 [ietf] INFO: Closing spider (finished)
>>>>
>>>> which tells you what files were downloaded, and where they were stored.
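>>>>
>>>> If you need to map each original URL back to where it was stored, the
>>>> 'files' field carries exactly that. A minimal sketch of a follow-up
>>>> pipeline (the class name and the logging are just my own choices):
>>>>
>>>> class FileLocationPipeline(object):
>>>>
>>>>     def process_item(self, item, spider):
>>>>         # each entry has 'url', 'path' (relative to FILES_STORE)
>>>>         # and 'checksum'
>>>>         for f in item.get('files', []):
>>>>             spider.log("%s stored at %s" % (f['url'], f['path']))
>>>>         return item
>>>>
>>>> (enable it in ITEM_PIPELINES after the FilesPipeline, so it runs once
>>>> the 'files' field is populated)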
>>>>
>>>> Hope this helps.
>>>>
>>>> On Tuesday, September 17, 2013 1:46:15 PM UTC+2, Ana Carolina Assis
>>>> Jesus wrote:
>>>>>
>>>>> Hi Paul,
>>>>>
>>>>> Could you give me an example on how to use the pipeline, please?
>>>>>
>>>>> Thanks,
>>>>> Ana
>>>>>
>>>>> On Tue, Sep 17, 2013 at 12:19 PM, Ana Carolina Assis Jesus
>>>>> <[email protected]> wrote:
>>>>> > well, I installed about two weeks ago, but a tagged version... so
>>>>> > maybe I don't have it...
>>>>> > But do I really need the pipeline? Even just pressing the button
>>>>> > should, in principle, at least download a file! I mean, that is
>>>>> > what it does manually... ???
>>>>> >
>>>>> > Thanks!
>>>>> >
>>>>> > On Tue, Sep 17, 2013 at 12:14 PM, Paul Tremberth
>>>>> > <[email protected]> wrote:
>>>>> >> Well, the FilesPipeline is a module inside scrapy.contrib.pipeline.
>>>>> >> It was committed less than 2 weeks ago. (Scrapy is being improved
>>>>> >> all the time by the community.)
>>>>> >>
>>>>> >> It depends when and how you installed scrapy:
>>>>> >> - if you installed a tagged version using pip or easy_install (as
>>>>> >> recommended;
>>>>> >> http://doc.scrapy.org/en/latest/intro/install.html#installing-scrapy)
>>>>> >> you won't have the pipeline and you have to add it yourself
>>>>> >>
>>>>> >> - if you installed from source less than 2 weeks ago (git clone
>>>>> >> git@github.com:scrapy/scrapy.git; cd scrapy; sudo python setup.py
>>>>> >> install) you should be good (but Scrapy from latest source code
>>>>> >> might be unstable and not fully tested)
>>>>> >>
>>>>> >>
>>>>> >> On Tuesday, September 17, 2013 12:04:31 PM UTC+2, Ana Carolina
>>>>> >> Assis Jesus wrote:
>>>>> >>>
>>>>> >>> Hi Paul.
>>>>> >>>
>>>>> >>> What do you mean by installing scrapy from source?
>>>>> >>> Do I need a new version of it?
>>>>> >>>
>>>>> >>> On Tue, Sep 17, 2013 at 12:01 PM, Paul Tremberth
>>>>> >>> <[email protected]> wrote:
>>>>> >>> > Hi Ana,
>>>>> >>> > to download files, you should have a look at the new
>>>>> >>> > FilesPipeline:
>>>>> >>> > https://github.com/scrapy/scrapy/pull/370
>>>>> >>> >
>>>>> >>> > It's in the master branch though, not in a tagged version of
>>>>> >>> > Scrapy, so you'll have to install scrapy from source.
>>>>> >>> >
>>>>> >>> > Paul.
>>>>> >>> >
>>>>> >>> >
>>>>> >>> > On Tuesday, September 17, 2013 11:50:05 AM UTC+2, Ana Carolina
>>>>> >>> > Assis Jesus wrote:
>>>>> >>> >>
>>>>> >>> >> Hi!
>>>>> >>> >>
>>>>> >>> >> I am trying to download a csv file with scrapy.
>>>>> >>> >> I can crawl the site and get to the form I need, where I find
>>>>> >>> >> two buttons to click.
>>>>> >>> >> One will list the transactions while the second will download
>>>>> >>> >> a XXX.csv file.
>>>>> >>> >>
>>>>> >>> >> How do I save this file within scrapy?
>>>>> >>> >>
>>>>> >>> >> I mean, if I choose to list the transactions, I will get
>>>>> >>> >> another web page, and that one I can see.
>>>>> >>> >> But what if I choose the download action? I guess I should not
>>>>> >>> >> use return self.parse_dosomething but something else to save
>>>>> >>> >> the file it should give me (???)
>>>>> >>> >>
>>>>> >>> >> Or should the download start by itself?
>>>>> >>> >>
>>>>> >>> >> Thanks,
>>>>> >>> >> Ana
>>>>> >>> >
>>
>
>