Hi Matt,
one way to do that is to play with the FilesPipeline's *get_media_requests()*,
passing additional data through the meta dict,
and then using a custom *file_path()* method.
Below, I use a dict in *file_urls* and not a list, so that I can pass both a
URL and a custom *file_name*.
Using the same IETF example I used above in the thread:
A simple spider downloading some files from IETF.org:

from scrapy.spider import Spider
from scrapy.item import Item, Field


class IetfItem(Item):
    # the two special fields used by the FilesPipeline
    files = Field()
    file_urls = Field()


class IETFSpider(Spider):
    name = 'ietfpipe'
    allowed_domains = ['ietf.org']
    start_urls = ['http://www.ietf.org']
    file_urls = [
        'http://www.ietf.org/images/ietflogotrans.gif',
        'http://www.ietf.org/rfc/rfc2616.txt',
        'http://www.rfc-editor.org/rfc/rfc2616.ps',
        'http://www.rfc-editor.org/rfc/rfc2616.pdf',
        'http://tools.ietf.org/html/rfc2616.html',
    ]

    def parse(self, response):
        # one item per file, carrying both the URL and a custom name
        for cnt, furl in enumerate(self.file_urls, start=1):
            yield IetfItem(file_urls=[
                {"file_url": furl, "file_name": "file_%03d" % cnt}])
Custom FilesPipeline:

from scrapy.contrib.pipeline.files import FilesPipeline
from scrapy.http import Request


class MyFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        # pass each file spec along with its request through the meta dict
        for file_spec in item['file_urls']:
            yield Request(url=file_spec["file_url"],
                          meta={"file_spec": file_spec})

    def file_path(self, request, response=None, info=None):
        # save under the custom name instead of the default SHA1-based path
        return request.meta["file_spec"]["file_name"]
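
To enable it, point ITEM_PIPELINES at the custom class instead of the stock
one (a minimal sketch, assuming the class above lives in
yourproject/pipelines.py):

ITEM_PIPELINES = [
    'yourproject.pipelines.MyFilesPipeline',
]
FILES_STORE = '/path/to/yourproject/downloads'

With this, the five files end up as file_001 through file_005 under
FILES_STORE, instead of under SHA1-based names in full/.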
Hope this helps
/Paul.
On Friday, February 21, 2014 6:44:20 AM UTC+1, Matt Cialini wrote:
>
> Hello Paul!
>
> I'm Matt. I know this is a somewhat old thread now, but I found your
> advice about the FilesPipeline and it works great. I had one question,
> though. Do you know of an easy way to pass in a file_name field for each
> URL, so that the FilesPipeline will save each file with the correct name?
>
> Thanks!
>
> On Saturday, September 21, 2013 1:03:09 PM UTC-4, Paul Tremberth wrote:
>>
>> Hi Ana,
>>
>> if you want to use the FilesPipeline, before it's in an official Scrapy
>> release,
>> here's one way to do it:
>>
>> 1) download
>> https://raw.github.com/scrapy/scrapy/master/scrapy/contrib/pipeline/files.py
>> and save it somewhere in your Scrapy project,
>> let's say at the root of your project (but that's not the best
>> location...)
>> yourproject/files.py
>>
>> 2) then, enable this pipeline by adding this to your settings.py
>>
>> ITEM_PIPELINES = [
>>     'yourproject.files.FilesPipeline',
>> ]
>> FILES_STORE = '/path/to/yourproject/downloads'
>>
>> FILES_STORE needs to point to a location where Scrapy can write (create
>> it beforehand)
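>>
>> For example, on a Unix-like system:
>>
>> mkdir -p /path/to/yourproject/downloads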
>>
>> 3) add 2 special fields to your item definition
>>     file_urls = Field()
>>     files = Field()
>>
>> 4) in your spider, when you have a URL for a file to download,
>> add it to your Item instance before returning it
>>
>> ...
>> myitem = YourProjectItem()
>> ...
>> myitem["file_urls"] = ["http://www.example.com/somefileiwant.csv"]
>> yield myitem
>>
>> 5) run your spider and you should see files in the FILES_STORE folder
>>
>> Here's an example that downloads a few files from the IETF website
>>
>> the scrapy project is called "filedownload"
>>
>> items.py looks like this:
>>
>> from scrapy.item import Item, Field
>>
>> class FiledownloadItem(Item):
>>     file_urls = Field()
>>     files = Field()
>>
>>
>> this is the code for the spider:
>>
>> from scrapy.spider import BaseSpider
>> from filedownload.items import FiledownloadItem
>>
>> class IetfSpider(BaseSpider):
>>     name = "ietf"
>>     allowed_domains = ["ietf.org"]
>>     start_urls = (
>>         'http://www.ietf.org/',
>>     )
>>
>>     def parse(self, response):
>>         yield FiledownloadItem(
>>             file_urls=[
>>                 'http://www.ietf.org/images/ietflogotrans.gif',
>>                 'http://www.ietf.org/rfc/rfc2616.txt',
>>                 'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>                 'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>                 'http://tools.ietf.org/html/rfc2616.html',
>>             ]
>>         )
>>
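>> To run it (assuming the standard project layout, from the project
>> directory):
>>
>> scrapy crawl ietf
>>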
>> When you run the spider, at the end, you should see in the console
>> something like this:
>>
>> 2013-09-21 18:30:42+0200 [ietf] DEBUG: Scraped from <200 http://www.ietf.org/>
>>     {'file_urls': ['http://www.ietf.org/images/ietflogotrans.gif',
>>                    'http://www.ietf.org/rfc/rfc2616.txt',
>>                    'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>                    'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>                    'http://tools.ietf.org/html/rfc2616.html'],
>>      'files': [{'checksum': 'e4b6ca0dd271ce887e70a1a2a5d681df',
>>                 'path': 'full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif',
>>                 'url': 'http://www.ietf.org/images/ietflogotrans.gif'},
>>                {'checksum': '9fa63f5083e4d2112d2e71b008e387e8',
>>                 'path': 'full/454ea89fbeaf00219fbcae49960d8bd1016994b0.txt',
>>                 'url': 'http://www.ietf.org/rfc/rfc2616.txt'},
>>                {'checksum': '5f0dc88aced3b0678d702fb26454e851',
>>                 'path': 'full/f76736e9f1f22d7d5563208d97d13e7cc7a3a633.ps',
>>                 'url': 'http://www.rfc-editor.org/rfc/rfc2616.ps'},
>>                {'checksum': '2d555310626966c3521cda04ae2fe76f',
>>                 'path': 'full/6ff52709da9514feb13211b6eb050458f353b49a.pdf',
>>                 'url': 'http://www.rfc-editor.org/rfc/rfc2616.pdf'},
>>                {'checksum': '735820b4f0f4df7048b288ba36612295',
>>                 'path': 'full/7192dd9a00a8567bf3dc4c21ababdcec6c69ce7f.html',
>>                 'url': 'http://tools.ietf.org/html/rfc2616.html'}]}
>> 2013-09-21 18:30:42+0200 [ietf] INFO: Closing spider (finished)
>>
>> which tells you what files were downloaded, and where they were stored.
>>
>> Hope this helps.
>>
>> On Tuesday, September 17, 2013 1:46:15 PM UTC+2, Ana Carolina Assis Jesus
>> wrote:
>>>
>>> Hi Paul,
>>>
>>> Could you give me an example on how to use the pipeline, please?
>>>
>>> Thanks,
>>> Ana
>>>
>>> On Tue, Sep 17, 2013 at 12:19 PM, Ana Carolina Assis Jesus
>>> <[email protected]> wrote:
>>> > Well, I installed about two weeks ago, but a tagged version... so
>>> > maybe I don't have it...
>>> > But do I really need the pipeline? Clicking the button should, in
>>> > principle at least, just download a file! I mean, that's what it does
>>> > manually... ???
>>> >
>>> > Thanks!
>>> >
>>> > On Tue, Sep 17, 2013 at 12:14 PM, Paul Tremberth
>>> > <[email protected]> wrote:
>>> >> Well, the FilesPipeline is a module inside scrapy.contrib.pipeline.
>>> >> It was committed less than 2 weeks ago. (Scrapy is being improved all
>>> >> the time by the community.)
>>> >>
>>> >> It depends on when and how you installed Scrapy:
>>> >> - if you installed a tagged version using pip or easy_install (as is
>>> >> recommended:
>>> >> http://doc.scrapy.org/en/latest/intro/install.html#installing-scrapy),
>>> >> you won't have the pipeline and you'll have to add it yourself
>>> >>
>>> >> - if you installed from source less than 2 weeks ago (git clone
>>> >> git@github.com:scrapy/scrapy.git; cd scrapy; sudo python setup.py
>>> >> install), you should be good (but Scrapy from the latest source code
>>> >> might be unstable and not fully tested)
>>> >>
>>> >>
>>> >> On Tuesday, September 17, 2013 12:04:31 PM UTC+2, Ana Carolina Assis
>>> >> Jesus wrote:
>>> >>>
>>> >>> Hi Paul.
>>> >>>
>>> >>> What do you mean by installing scrapy from source?
>>> >>> I need a new version from it?
>>> >>>
>>> >>> On Tue, Sep 17, 2013 at 12:01 PM, Paul Tremberth
>>> >>> <[email protected]> wrote:
>>> >>> > Hi Ana,
>>> >>> > to download files, you should have a look at the new FilesPipeline
>>> >>> > https://github.com/scrapy/scrapy/pull/370
>>> >>> >
>>> >>> > It's in the master branch though, not in a tagged version of
>>> >>> > Scrapy, so you'll have to install Scrapy from source.
>>> >>> >
>>> >>> > Paul.
>>> >>> >
>>> >>> >
>>> >>> > On Tuesday, September 17, 2013 11:50:05 AM UTC+2, Ana Carolina
>>> >>> > Assis Jesus wrote:
>>> >>> >>
>>> >>> >> Hi!
>>> >>> >>
>>> >>> >> I am trying to download a CSV file with Scrapy.
>>> >>> >> I could crawl inside the site and get to the form I need, and
>>> >>> >> there I find two buttons to click.
>>> >>> >> One will list the transactions, while the second one will
>>> >>> >> download a XXX.csv file.
>>> >>> >>
>>> >>> >> How do I save this file within scrapy?
>>> >>> >>
>>> >>> >> I mean, if I choose to list the transactions, I will get another
>>> >>> >> webpage, and this I can see.
>>> >>> >> But what if I choose the action to download? I guess I should not
>>> >>> >> use return self.parse_dosomething, but something else to save the
>>> >>> >> file it should give me (???)
>>> >>> >>
>>> >>> >> Or should the download start by itself?
>>> >>> >>
>>> >>> >> Thanks,
>>> >>> >> Ana