Hi Matt,
I was wondering if you ever figured out your problem. I am also looking to
use the FilesPipeline with custom file names. I was able to edit
FilesPipeline itself to achieve this, but it would obviously be better
practice to extend FilesPipeline and override the necessary methods
instead. When I use a solution similar to Paul's, my files are not
downloaded to my hard drive.
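
In case it is useful to anyone debugging the same thing, here are the
settings I am double-checking; as far as I understand, nothing reaches disk
if the custom pipeline is not enabled or if FILES_STORE is missing or
unwritable. The module path below is only a placeholder for wherever the
subclass actually lives:

ITEM_PIPELINES = [
    'yourproject.pipelines.MyFilesPipeline',  # the subclass, not the stock FilesPipeline
]
FILES_STORE = '/path/to/writable/downloads'  # must exist and be writable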
Thank you!
On Tuesday, February 25, 2014 9:03:20 AM UTC-6, Matt Cialini wrote:
>
> Hi Paul,
>
> Thanks for the suggestion. I'm trying to implement it now, but the files
> aren't being written to disk correctly. Which function in files.py handles
> the actual saving of the file?
>
> Every item I pass into files.py eventually ends up as a FileDownloadItem
> of the form {'file_urls': [several dict objects like {file_url: url,
> file_name: name}]}
>
> I'll attach my code to this if you have time to look it over. Basically I
> think something is not being passed in correctly in files.py, but it's hard
> to search through and determine where.
>
> Thanks so much Paul!
>
> - Matt C
>
>
> On Tue, Feb 25, 2014 at 4:28 AM, Paul Tremberth
> <[email protected]> wrote:
>
>
>> Hi Matt,
>>
>> one way to do that is to override the FilesPipeline's
>> *get_media_requests()*, passing additional data through the meta dict,
>> and then use a custom *file_path()* method.
>>
>> Below, I put dicts in *file_urls* instead of plain URL strings, so that
>> each entry can carry both a URL and a custom *file_name*.
>>
>> Using the same IETF example I used above in the thread:
>>
>> A simple spider downloading some files from IETF.org
>>
>> from scrapy.spider import Spider
>> from scrapy.item import Item, Field
>>
>>
>> class IetfItem(Item):
>>     files = Field()
>>     file_urls = Field()
>>
>>
>> class IETFSpider(Spider):
>>     name = 'ietfpipe'
>>     allowed_domains = ['ietf.org']
>>     start_urls = ['http://www.ietf.org']
>>     file_urls = [
>>         'http://www.ietf.org/images/ietflogotrans.gif',
>>         'http://www.ietf.org/rfc/rfc2616.txt',
>>         'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>         'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>         'http://tools.ietf.org/html/rfc2616.html',
>>     ]
>>
>>     def parse(self, response):
>>         # one item per file, carrying both the URL and the wanted name
>>         for cnt, furl in enumerate(self.file_urls, start=1):
>>             yield IetfItem(file_urls=[
>>                 {"file_url": furl, "file_name": "file_%03d" % cnt}])
>>
>>
>>
>> Custom FilesPipeline
>>
>> from scrapy.contrib.pipeline.files import FilesPipeline
>> from scrapy.http import Request
>>
>> class MyFilesPipeline(FilesPipeline):
>>
>>     def get_media_requests(self, item, info):
>>         # pass the whole file spec along in the request meta
>>         for file_spec in item['file_urls']:
>>             yield Request(url=file_spec["file_url"],
>>                           meta={"file_spec": file_spec})
>>
>>     def file_path(self, request, response=None, info=None):
>>         # store under the requested name instead of the SHA1 default
>>         return request.meta["file_spec"]["file_name"]
>>
>>
>>
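>> If you also want to keep the original file extension (the sketch above
>> stores bare names like file_001), a variant along these lines might
>> work; untested, and splitting the extension off the request URL is just
>> the simplest approach I can think of:
>>
>> import os
>>
>> from scrapy.contrib.pipeline.files import FilesPipeline
>> from scrapy.http import Request
>>
>> class MyNamedFilesPipeline(FilesPipeline):
>>
>>     def get_media_requests(self, item, info):
>>         for file_spec in item['file_urls']:
>>             yield Request(url=file_spec["file_url"],
>>                           meta={"file_spec": file_spec})
>>
>>     def file_path(self, request, response=None, info=None):
>>         # keep the URL's extension: file_001.gif, file_002.txt, ...
>>         ext = os.path.splitext(request.url)[1]
>>         return request.meta["file_spec"]["file_name"] + ext
>>
>> Either way, remember to point ITEM_PIPELINES at the subclass rather
>> than the stock FilesPipeline.
>>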
>> Hope this helps
>>
>> /Paul.
>>
>> On Friday, February 21, 2014 6:44:20 AM UTC+1, Matt Cialini wrote:
>>>
>>> Hello Paul!
>>>
>>> I'm Matt. I know this is a somewhat old thread now, but I found your
>>> advice about FilesPipeline and it works great. I had one question,
>>> though: do you know of an easy way to pass in a file_name field for each
>>> URL so that the FilesPipeline will save each file with the correct name?
>>>
>>> Thanks!
>>>
>>> On Saturday, September 21, 2013 1:03:09 PM UTC-4, Paul Tremberth wrote:
>>>>
>>>> Hi Ana,
>>>>
>>>> if you want to use the FilesPipeline, before it's in an official
>>>> Scrapy release,
>>>> here's one way to do it:
>>>>
>>>> 1) download https://raw.github.com/scrapy/scrapy/master/scrapy/contrib/pipeline/files.py
>>>> and save it somewhere in your Scrapy project,
>>>> let's say at the root of your project (but that's not the best
>>>> location...)
>>>> yourproject/files.py
>>>>
>>>> 2) then, enable this pipeline by adding this to your settings.py
>>>>
>>>> ITEM_PIPELINES = [
>>>>     'yourproject.files.FilesPipeline',
>>>> ]
>>>> FILES_STORE = '/path/to/yourproject/downloads'
>>>>
>>>> FILES_STORE needs to point to a location where Scrapy can write (create
>>>> it beforehand)
>>>>
>>>> 3) add 2 special fields to your item definition:
>>>>
>>>> file_urls = Field()
>>>> files = Field()
>>>>
>>>> 4) in your spider, when you have a URL for a file to download,
>>>> add it to your Item instance before returning it
>>>>
>>>> ...
>>>> myitem = YourProjectItem()
>>>> ...
>>>> myitem["file_urls"] = ["http://www.example.com/somefileiwant.csv"]
>>>> yield myitem
>>>>
>>>> 5) run your spider and you should see files in the FILES_STORE folder
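>>>>
>>>> (with the example project below, that would be something like
>>>>
>>>> scrapy crawl ietf
>>>> ls /path/to/yourproject/downloads/full
>>>>
>>>> since the pipeline stores the files under a 'full/' subfolder of
>>>> FILES_STORE, as the log output further down shows)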
>>>>
>>>> Here's an example that downloads a few files from the IETF website
>>>>
>>>> the scrapy project is called "filedownload"
>>>>
>>>> items.py looks like this:
>>>>
>>>> from scrapy.item import Item, Field
>>>>
>>>> class FiledownloadItem(Item):
>>>>     file_urls = Field()
>>>>     files = Field()
>>>>
>>>>
>>>> this is the code for the spider:
>>>>
>>>> from scrapy.spider import BaseSpider
>>>> from filedownload.items import FiledownloadItem
>>>>
>>>> class IetfSpider(BaseSpider):
>>>>     name = "ietf"
>>>>     allowed_domains = ["ietf.org"]
>>>>     start_urls = (
>>>>         'http://www.ietf.org/',
>>>>     )
>>>>
>>>>     def parse(self, response):
>>>>         yield FiledownloadItem(
>>>>             file_urls=[
>>>>                 'http://www.ietf.org/images/ietflogotrans.gif',
>>>>                 'http://www.ietf.org/rfc/rfc2616.txt',
>>>>                 'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>>>                 'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>>>                 'http://tools.ietf.org/html/rfc2616.html',
>>>>             ]
>>>>         )
>>>>
>>>> When you run the spider, you should see something like this in the
>>>> console at the end:
>>>>
>>>> 2013-09-21 18:30:42+0200 [ietf] DEBUG: Scraped from <200 http://www.ietf.org/>
>>>> {'file_urls': ['http://www.ietf.org/images/ietflogotrans.gif',
>>>>                'http://www.ietf.org/rfc/rfc2616.txt',
>>>>                'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>>>                'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>>>                'http://tools.ietf.org/html/rfc2616.html'],
>>>>  'files': [{'checksum': 'e4b6ca0dd271ce887e70a1a2a5d681df',
>>>>             'path': 'full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif',
>>>>             'url': 'http://www.ietf.org/images/ietflogotrans.gif'},
>>>>            {'checksum': '9fa63f5083e4d2112d2e71b008e387e8',
>>>>             'path': 'full/454ea89fbeaf00219fbcae49960d8bd1016994b0.txt',
>>>>             'url': 'http://www.ietf.org/rfc/rfc2616.txt'},
>>>>            {'checksum': '5f0dc88aced3b0678d702fb26454e851',
>>>>             'path': 'full/f76736e9f1f22d7d5563208d97d13e7cc7a3a633.ps',
>>>>             'url': 'http://www.rfc-editor.org/rfc/rfc2616.ps'},
>>>>            {'checksum': '2d555310626966c3521cda04ae2fe76f',
>>>>             'path': 'full/6ff52709da9514feb13211b6eb050458f353b49a.pdf',
>>>>             'url': 'http://www.rfc-editor.org/rfc/rfc2616.pdf'},
>>>>            {'checksum': '735820b4f0f4df7048b288ba36612295',
>>>>             'path': 'full/7192dd9a00a8567bf3dc4c21ababdcec6c69ce7f.html',
>>>>             'url': 'http://tools.ietf.org/html/rfc2616.html'}]}
>>>> 2013-09-21 18:30:42+0200 [ietf] INFO: Closing spider (finished)
>>>>
>>>> which tells you what files were downloaded, and where they were stored.
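>>>>
>>>> If you need to map each original URL back to where it was stored, the
>>>> 'files' field carries exactly that. A minimal sketch of a follow-up
>>>> pipeline (the class name and the logging are just my own choices):
>>>>
>>>> class FileLocationPipeline(object):
>>>>
>>>>     def process_item(self, item, spider):
>>>>         # each entry has 'url', 'path' (relative to FILES_STORE)
>>>>         # and 'checksum'
>>>>         for f in item.get('files', []):
>>>>             spider.log("%s stored at %s" % (f['url'], f['path']))
>>>>         return item
>>>>
>>>> (enable it in ITEM_PIPELINES after the FilesPipeline, so it runs once
>>>> the 'files' field is populated)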
>>>>
>>>> Hope this helps.
>>>>
>>>> On Tuesday, September 17, 2013 1:46:15 PM UTC+2, Ana Carolina Assis
>>>> Jesus wrote:
>>>>>
>>>>> Hi Paul,
>>>>>
>>>>> Could you give me an example on how to use the pipeline, please?
>>>>>
>>>>> Thanks,
>>>>> Ana
>>>>>
>>>>> On Tue, Sep 17, 2013 at 12:19 PM, Ana Carolina Assis Jesus
>>>>> <[email protected]> wrote:
>>>>> > well, I installed about two weeks ago, but a tagged version... so
>>>>> > maybe I don't have it...
>>>>> > But do I really need the pipeline? Even just pressing the button
>>>>> > should, in principle, at least download a file! I mean, that is
>>>>> > what it does manually... ???
>>>>> >
>>>>> > Thanks!
>>>>> >
>>>>> > On Tue, Sep 17, 2013 at 12:14 PM, Paul Tremberth
>>>>> > <[email protected]> wrote:
>>>>> >> Well, the FilesPipeline is a module inside scrapy.contrib.pipeline.
>>>>> >> It was committed less than 2 weeks ago. (Scrapy is being improved
>>>>> >> all the time by the community.)
>>>>> >>
>>>>> >> It depends when and how you installed scrapy:
>>>>> >> - if you installed a tagged version using pip or easy_install (as
>>>>> >> recommended;
>>>>> >> http://doc.scrapy.org/en/latest/intro/install.html#installing-scrapy)
>>>>> >> you won't have the pipeline and you have to add it yourself
>>>>> >>
>>>>> >> - if you installed from source less than 2 weeks ago (git clone
>>>>> >> git@github.com:scrapy/scrapy.git; cd scrapy; sudo python setup.py
>>>>> >> install) you should be good (but Scrapy from latest source code
>>>>> >> might be unstable and not fully tested)
>>>>> >>
>>>>> >>
>>>>> >> On Tuesday, September 17, 2013 12:04:31 PM UTC+2, Ana Carolina
>>>>> >> Assis Jesus wrote:
>>>>> >>>
>>>>> >>> Hi Paul.
>>>>> >>>
>>>>> >>> What do you mean by installing scrapy from source?
>>>>> >>> Do I need a new version of it?
>>>>> >>>
>>>>> >>> On Tue, Sep 17, 2013 at 12:01 PM, Paul Tremberth
>>>>> >>> <[email protected]> wrote:
>>>>> >>> > Hi Ana,
>>>>> >>> > to download files, you should have a look at the new
>>>>> >>> > FilesPipeline:
>>>>> >>> > https://github.com/scrapy/scrapy/pull/370
>>>>> >>> >
>>>>> >>> > It's in the master branch though, not in a tagged version of
>>>>> >>> > Scrapy, so you'll have to install scrapy from source.
>>>>> >>> >
>>>>> >>> > Paul.
>>>>> >>> >
>>>>> >>> >
>>>>> >>> > On Tuesday, September 17, 2013 11:50:05 AM UTC+2, Ana Carolina
>>>>> >>> > Assis Jesus wrote:
>>>>> >>> >>
>>>>> >>> >> Hi!
>>>>> >>> >>
>>>>> >>> >> I am trying to download a csv file with scrapy.
>>>>> >>> >> I can crawl the site and get to the form I need, where I find
>>>>> >>> >> two buttons to click.
>>>>> >>> >> One will list the transactions while the second will download
>>>>> >>> >> a XXX.csv file.
>>>>> >>> >>
>>>>> >>> >> How do I save this file within scrapy?
>>>>> >>> >>
>>>>> >>> >> I mean, if I choose to list the transactions, I will get
>>>>> >>> >> another web page, and that one I can see.
>>>>> >>> >> But what if I choose the download action? I guess I should not
>>>>> >>> >> use return self.parse_dosomething but something else to save
>>>>> >>> >> the file it should give me (???)
>>>>> >>> >>
>>>>> >>> >> Or should the download start by itself?
>>>>> >>> >>
>>>>> >>> >> Thanks,
>>>>> >>> >> Ana
>>>>> >>> >
>>
>
>