I actually sent you the old code I had. The new version also edits these functions, but instead of path[0] and ret[0] I just use path and ret.
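With plain strings, the overrides collapse to something like the sketch below. This is only an illustration, not the attached code: it assumes the same "file_spec" meta dict described in the quoted messages, and it is written as a FilesPipeline subclass rather than a direct edit to files.py.

    import os

    from scrapy.contrib.pipeline.files import FilesPipeline
    from scrapy.http import Request


    class MyFilesPipeline(FilesPipeline):

        def get_media_requests(self, item, info):
            # one Request per {file_url, file_name} dict, carrying the dict along in meta
            for file_spec in item['file_urls']:
                yield Request(url=file_spec["file_url"], meta={"file_spec": file_spec})

        def file_path(self, request, response=None, info=None):
            # keep the original extension, but take the file name from the meta dict
            media_ext = os.path.splitext(request.url)[1]
            return request.meta["file_spec"]["file_name"] + media_ext

With file_path() returning a plain relative string like this, the bundled filesystem store should already join it onto FILES_STORE, so the basedir edit may not be needed at all.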
On Tue, Mar 25, 2014 at 1:34 AM, Matt Cialini <[email protected]> wrote:

> Hi Casey,
>
> I ended up using Paul's suggestion and expanded on it to fit my needs.
> Basically my spider creates a single instance of FileDownloadItem:
> {'file_urls': [list of dict objects {file_url: url, file_name: name}]}.
> Each dict's file_url is the web URL, and its file_name is the title to save
> under. The spider yields the item to the FilesPipeline, in which I just
> edited a few functions to better match the item structure I pass in:
>
> def _get_filesystem_path(self, path):
>     str = self.basedir + path[0]
>     return str
>
> def file_path(self, request, response=None, info=None):
>     def _warn():
>         #print "_warn"
>         from scrapy.exceptions import ScrapyDeprecationWarning
>         import warnings
>         warnings.warn('FilesPipeline.file_key(url) method is deprecated, please use '
>                       'file_path(request, response=None, info=None) instead',
>                       category=ScrapyDeprecationWarning, stacklevel=1)
>
>     # check if called from file_key with url as first argument
>     if not isinstance(request, Request):
>         _warn()
>         url = request
>     else:
>         url = request.url
>
>     # detect if file_key() method has been overridden
>     if not hasattr(self.file_key, '_base'):
>         _warn()
>         return self.file_key(url)
>
>     media_ext = os.path.splitext(url)[1]  # change to request.url after deprecation
>     ret = request.meta["file_spec"]["file_name"]
>     return ret[0] + media_ext
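Either way, the intent of that naming scheme is FILES_STORE plus the custom name plus the original extension. A rough worked example, with both the store location and the item entry as placeholders:

    import os

    FILES_STORE = '/path/to/downloads'   # placeholder for the real setting
    file_spec = {'file_url': 'http://www.example.com/report.csv',
                 'file_name': 'report_2014'}   # hypothetical entry from file_urls

    media_ext = os.path.splitext(file_spec['file_url'])[1]   # '.csv'
    relative_path = file_spec['file_name'] + media_ext       # 'report_2014.csv'
    print(os.path.join(FILES_STORE, relative_path))          # /path/to/downloads/report_2014.csv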
> On Sun, Mar 23, 2014 at 10:00 PM, Casey Klimkowsky <[email protected]> wrote:
>
>> Hi Matt,
>>
>> I was wondering if you ever figured out your problem? I am also looking
>> to use the FilesPipeline with custom file names. I was able to edit
>> FilesPipeline itself to achieve this result, but obviously it would be
>> better practice to extend FilesPipeline and override the necessary
>> methods instead. When I use a solution similar to Paul's, my files are
>> not downloaded to my hard drive.
>>
>> Thank you!
>>
>> On Tuesday, February 25, 2014 9:03:20 AM UTC-6, Matt Cialini wrote:
>>
>>> Hi Paul,
>>>
>>> Thanks for the suggestion. I'm trying to implement it now, but the files
>>> aren't being written to disk correctly. What function in files.py handles
>>> the actual saving of the file?
>>>
>>> Every item I pass into files.py is eventually a FileDownloadItem:
>>> {'file_urls': [list of dict objects {file_url: url, file_name: name}]}
>>>
>>> I'll attach my code to this if you have time to look it over. Basically
>>> I think something is not being passed in correctly in files.py, but it's
>>> hard to search through and determine where.
>>>
>>> Thanks so much Paul!
>>>
>>> - Matt C
>>>
>>> On Tue, Feb 25, 2014 at 4:28 AM, Paul Tremberth <[email protected]> wrote:
>>>
>>>> Hi Matt,
>>>>
>>>> One way to do that is to play with the FilesPipeline *get_media_requests()*,
>>>> passing additional data through the meta dict, and then using a custom
>>>> *file_path()* method.
>>>>
>>>> Below, I use a dict in *file_urls* and not a list, so that I can pass
>>>> a URL and a custom *file_name*.
>>>>
>>>> Using the same IETF example I used above in the thread:
>>>>
>>>> A simple spider downloading some files from IETF.org:
>>>>
>>>> from scrapy.spider import Spider
>>>> from scrapy.http import Request
>>>> from scrapy.item import Item, Field
>>>>
>>>>
>>>> class IetfItem(Item):
>>>>     files = Field()
>>>>     file_urls = Field()
>>>>
>>>>
>>>> class IETFSpider(Spider):
>>>>     name = 'ietfpipe'
>>>>     allowed_domains = ['ietf.org']
>>>>     start_urls = ['http://www.ietf.org']
>>>>     file_urls = [
>>>>         'http://www.ietf.org/images/ietflogotrans.gif',
>>>>         'http://www.ietf.org/rfc/rfc2616.txt',
>>>>         'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>>>         'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>>>         'http://tools.ietf.org/html/rfc2616.html',
>>>>     ]
>>>>
>>>>     def parse(self, response):
>>>>         for cnt, furl in enumerate(self.file_urls, start=1):
>>>>             yield IetfItem(file_urls=[{"file_url": furl, "file_name": "file_%03d" % cnt}])
>>>>
>>>> Custom FilesPipeline:
>>>>
>>>> from scrapy.contrib.pipeline.files import FilesPipeline
>>>> from scrapy.http import Request
>>>>
>>>>
>>>> class MyFilesPipeline(FilesPipeline):
>>>>
>>>>     def get_media_requests(self, item, info):
>>>>         for file_spec in item['file_urls']:
>>>>             yield Request(url=file_spec["file_url"], meta={"file_spec": file_spec})
>>>>
>>>>     def file_path(self, request, response=None, info=None):
>>>>         return request.meta["file_spec"]["file_name"]
>>>>
>>>> Hope this helps
>>>>
>>>> /Paul.
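One detail the thread never spells out for the subclass route: the custom pipeline still has to be enabled in settings.py, together with FILES_STORE. A minimal sketch, where the module path and the store directory are placeholders:

    # settings.py
    ITEM_PIPELINES = [
        'yourproject.pipelines.MyFilesPipeline',    # wherever MyFilesPipeline actually lives
    ]
    FILES_STORE = '/path/to/yourproject/downloads'  # must exist and be writable by Scrapy

If either of these is missing, the spider runs but files never reach the disk, which may be what Casey is running into above.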
>>>> On Friday, February 21, 2014 6:44:20 AM UTC+1, Matt Cialini wrote:
>>>>
>>>>> Hello Paul!
>>>>>
>>>>> I'm Matt. I know this is a somewhat old group now, but I found your
>>>>> advice about the FilesPipeline and it works great. I had one question
>>>>> though. Do you know of an easy way to pass in a file_name field for
>>>>> each URL so that the FilesPipeline will save each URL under the
>>>>> correct name?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Saturday, September 21, 2013 1:03:09 PM UTC-4, Paul Tremberth wrote:
>>>>>
>>>>>> Hi Ana,
>>>>>>
>>>>>> If you want to use the FilesPipeline before it's in an official Scrapy
>>>>>> release, here's one way to do it:
>>>>>>
>>>>>> 1) Download https://raw.github.com/scrapy/scrapy/master/scrapy/contrib/pipeline/files.py
>>>>>> and save it somewhere in your Scrapy project, let's say at the root of
>>>>>> your project (but that's not the best location...):
>>>>>> yourproject/files.py
>>>>>>
>>>>>> 2) Then enable this pipeline by adding this to your settings.py:
>>>>>>
>>>>>> ITEM_PIPELINES = [
>>>>>>     'yourproject.files.FilesPipeline',
>>>>>> ]
>>>>>> FILES_STORE = '/path/to/yourproject/downloads'
>>>>>>
>>>>>> FILES_STORE needs to point to a location where Scrapy can write
>>>>>> (create it beforehand).
>>>>>>
>>>>>> 3) Add 2 special fields to your item definition:
>>>>>>
>>>>>> file_urls = Field()
>>>>>> files = Field()
>>>>>>
>>>>>> 4) In your spider, when you have a URL for a file to download, add it
>>>>>> to your Item instance before returning it:
>>>>>>
>>>>>> ...
>>>>>> myitem = YourProjectItem()
>>>>>> ...
>>>>>> myitem["file_urls"] = ["http://www.example.com/somefileiwant.csv"]
>>>>>> yield myitem
>>>>>>
>>>>>> 5) Run your spider and you should see files in the FILES_STORE folder.
>>>>>>
>>>>>> Here's an example that downloads a few files from the IETF website.
>>>>>> The Scrapy project is called "filedownload".
>>>>>>
>>>>>> items.py looks like this:
>>>>>>
>>>>>> from scrapy.item import Item, Field
>>>>>>
>>>>>> class FiledownloadItem(Item):
>>>>>>     file_urls = Field()
>>>>>>     files = Field()
>>>>>>
>>>>>> This is the code for the spider:
>>>>>>
>>>>>> from scrapy.spider import BaseSpider
>>>>>> from filedownload.items import FiledownloadItem
>>>>>>
>>>>>> class IetfSpider(BaseSpider):
>>>>>>     name = "ietf"
>>>>>>     allowed_domains = ["ietf.org"]
>>>>>>     start_urls = (
>>>>>>         'http://www.ietf.org/',
>>>>>>     )
>>>>>>
>>>>>>     def parse(self, response):
>>>>>>         yield FiledownloadItem(
>>>>>>             file_urls=[
>>>>>>                 'http://www.ietf.org/images/ietflogotrans.gif',
>>>>>>                 'http://www.ietf.org/rfc/rfc2616.txt',
>>>>>>                 'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>>>>>                 'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>>>>>                 'http://tools.ietf.org/html/rfc2616.html',
>>>>>>             ]
>>>>>>         )
>>>>>>
>>>>>> When you run the spider, at the end you should see something like this
>>>>>> in the console:
>>>>>>
>>>>>> 2013-09-21 18:30:42+0200 [ietf] DEBUG: Scraped from <200 http://www.ietf.org/>
>>>>>>     {'file_urls': ['http://www.ietf.org/images/ietflogotrans.gif',
>>>>>>                    'http://www.ietf.org/rfc/rfc2616.txt',
>>>>>>                    'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>>>>>                    'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>>>>>                    'http://tools.ietf.org/html/rfc2616.html'],
>>>>>>      'files': [{'checksum': 'e4b6ca0dd271ce887e70a1a2a5d681df',
>>>>>>                 'path': 'full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif',
>>>>>>                 'url': 'http://www.ietf.org/images/ietflogotrans.gif'},
>>>>>>                {'checksum': '9fa63f5083e4d2112d2e71b008e387e8',
>>>>>>                 'path': 'full/454ea89fbeaf00219fbcae49960d8bd1016994b0.txt',
>>>>>>                 'url': 'http://www.ietf.org/rfc/rfc2616.txt'},
>>>>>>                {'checksum': '5f0dc88aced3b0678d702fb26454e851',
>>>>>>                 'path': 'full/f76736e9f1f22d7d5563208d97d13e7cc7a3a633.ps',
>>>>>>                 'url': 'http://www.rfc-editor.org/rfc/rfc2616.ps'},
>>>>>>                {'checksum': '2d555310626966c3521cda04ae2fe76f',
>>>>>>                 'path': 'full/6ff52709da9514feb13211b6eb050458f353b49a.pdf',
>>>>>>                 'url': 'http://www.rfc-editor.org/rfc/rfc2616.pdf'},
>>>>>>                {'checksum': '735820b4f0f4df7048b288ba36612295',
>>>>>>                 'path': 'full/7192dd9a00a8567bf3dc4c21ababdcec6c69ce7f.html',
>>>>>>                 'url': 'http://tools.ietf.org/html/rfc2616.html'}]}
>>>>>> 2013-09-21 18:30:42+0200 [ietf] INFO: Closing spider (finished)
>>>>>>
>>>>>> This tells you what files were downloaded, and where they were stored.
>>>>>>
>>>>>> Hope this helps.
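Each 'path' in the resulting files field is relative to FILES_STORE, so the two combined give the file's location on disk. A small sketch of reading one of the entries above back, with the store directory as a placeholder:

    import os

    FILES_STORE = '/path/to/yourproject/downloads'   # whatever settings.py points at
    file_entry = {'checksum': 'e4b6ca0dd271ce887e70a1a2a5d681df',
                  'path': 'full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif',
                  'url': 'http://www.ietf.org/images/ietflogotrans.gif'}

    local_path = os.path.join(FILES_STORE, file_entry['path'])
    with open(local_path, 'rb') as f:
        data = f.read()   # raw bytes of the downloaded GIF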
>>>>>> On Tuesday, September 17, 2013 1:46:15 PM UTC+2, Ana Carolina Assis Jesus wrote:
>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> Could you give me an example of how to use the pipeline, please?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ana
>>>>>>>
>>>>>>> On Tue, Sep 17, 2013 at 12:19 PM, Ana Carolina Assis Jesus <[email protected]> wrote:
>>>>>>> > Well, I installed about two weeks ago, but a tagged version... so
>>>>>>> > maybe I don't have it...
>>>>>>> > But do I really need the pipeline, if the "get" button, in principle
>>>>>>> > at least, should just download a file? I mean, that is what it does
>>>>>>> > manually... ???
>>>>>>> >
>>>>>>> > Thanks!
>>>>>>> >
>>>>>>> > On Tue, Sep 17, 2013 at 12:14 PM, Paul Tremberth <[email protected]> wrote:
>>>>>>> >> Well, the FilesPipeline is a module inside scrapy.contrib.pipeline.
>>>>>>> >> It was committed less than 2 weeks ago. (Scrapy is being improved
>>>>>>> >> all the time by the community.)
>>>>>>> >>
>>>>>>> >> It depends when and how you installed Scrapy:
>>>>>>> >> - if you installed a tagged version using pip or easy_install (as
>>>>>>> >> recommended; http://doc.scrapy.org/en/latest/intro/install.html#installing-scrapy)
>>>>>>> >> you won't have the pipeline and you have to add it yourself
>>>>>>> >> - if you installed from source less than 2 weeks ago (git clone
>>>>>>> >> [email protected]:scrapy/scrapy.git; cd scrapy; sudo python setup.py install)
>>>>>>> >> you should be good (but Scrapy from the latest source code might be
>>>>>>> >> unstable and not fully tested)
>>>>>>> >>
>>>>>>> >> On Tuesday, September 17, 2013 12:04:31 PM UTC+2, Ana Carolina Assis Jesus wrote:
>>>>>>> >>>
>>>>>>> >>> Hi Paul.
>>>>>>> >>>
>>>>>>> >>> What do you mean by installing Scrapy from source?
>>>>>>> >>> Do I need a new version of it?
>>>>>>> >>>
>>>>>>> >>> On Tue, Sep 17, 2013 at 12:01 PM, Paul Tremberth <[email protected]> wrote:
>>>>>>> >>> > Hi Ana,
>>>>>>> >>> >
>>>>>>> >>> > To download files, you should have a look at the new FilesPipeline:
>>>>>>> >>> > https://github.com/scrapy/scrapy/pull/370
>>>>>>> >>> >
>>>>>>> >>> > It's in the master branch though, not in a tagged version of Scrapy,
>>>>>>> >>> > so you'll have to install Scrapy from source.
>>>>>>> >>> >
>>>>>>> >>> > Paul.
>>>>>>> >>> >
>>>>>>> >>> > On Tuesday, September 17, 2013 11:50:05 AM UTC+2, Ana Carolina Assis Jesus wrote:
>>>>>>> >>> >>
>>>>>>> >>> >> Hi!
>>>>>>> >>> >>
>>>>>>> >>> >> I am trying to download a CSV file with Scrapy.
>>>>>>> >>> >> I could crawl inside the site and get to the form I need, and
>>>>>>> >>> >> there I find two buttons to click.
>>>>>>> >>> >> One will list the transactions, while the second one will
>>>>>>> >>> >> download an XXX.csv file.
>>>>>>> >>> >>
>>>>>>> >>> >> How do I save this file within Scrapy?
>>>>>>> >>> >>
>>>>>>> >>> >> I mean, if I choose to list the transactions, I get another
>>>>>>> >>> >> webpage, and this I can see.
>>>>>>> >>> >> But what if I choose the download action? I guess I should not
>>>>>>> >>> >> use return self.parse_dosomething, but something else to save
>>>>>>> >>> >> the file it should give me (???)
>>>>>>> >>> >>
>>>>>>> >>> >> Or should the download start by itself?
>>>>>>> >>> >>
>>>>>>> >>> >> Thanks,
>>>>>>> >>> >> Ana
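For the case Ana describes, if the download button resolves to a plain URL, the "something else to save the file" can also be a callback that simply writes the response body to disk, without any pipeline. A rough sketch where the spider name, URLs, and output file name are all placeholders:

    from scrapy.spider import Spider
    from scrapy.http import Request


    class CsvDownloadSpider(Spider):
        # hypothetical spider: names and URLs are illustrative only
        name = 'csvdownload'
        start_urls = ['http://www.example.com/form-page']

        def parse(self, response):
            # assume the download button is a plain link whose URL we already know
            csv_url = 'http://www.example.com/export/transactions.csv'
            yield Request(csv_url, callback=self.save_csv)

        def save_csv(self, response):
            # response.body holds the raw bytes of the downloaded file
            with open('transactions.csv', 'wb') as f:
                f.write(response.body)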
--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
