Hello Paul! I'm Matt. I know this is a somewhat old thread now, but I found your advice about the FilesPipeline and it works great. One question, though: do you know of an easy way to pass in a file_name field for each URL, so that the FilesPipeline saves each download under the correct name?

Thanks!
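For anyone landing here with the same question, here is a minimal sketch of one way to do it: subclass the FilesPipeline, pass the desired name along in the request meta, and override file_path() so it is used instead of the default hash-based name. The class name and the item's "file_name" field below are made up for illustration, and older copies of files.py name the override file_key() and pass it only the URL, so check the signature in the files.py you are actually running.

import os
from scrapy.contrib.pipeline.files import FilesPipeline
from scrapy.http import Request

class RenamingFilesPipeline(FilesPipeline):
    # NOTE: the class name, the item's "file_name" field, and the
    # file_path() override are assumptions for illustration, not
    # something guaranteed by this thread.

    def get_media_requests(self, item, info):
        # carry the desired file name along with each download request
        for url in item.get('file_urls', []):
            yield Request(url, meta={'file_name': item.get('file_name')})

    def file_path(self, request, response=None, info=None):
        # use the supplied name if present, otherwise fall back to the
        # default hash-based path
        name = request.meta.get('file_name')
        if name:
            return os.path.join('full', name)
        return super(RenamingFilesPipeline, self).file_path(request, response, info)

Enable it by pointing the ITEM_PIPELINES entry at 'yourproject.files.RenamingFilesPipeline' instead of the stock class. One caveat: if a single item carries several file_urls, give each URL its own name (a parallel list of names, say), otherwise they will all collide on the same path.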
On Saturday, September 21, 2013 1:03:09 PM UTC-4, Paul Tremberth wrote:

Hi Ana,

if you want to use the FilesPipeline before it's in an official Scrapy release, here's one way to do it:

1) Download https://raw.github.com/scrapy/scrapy/master/scrapy/contrib/pipeline/files.py and save it somewhere in your Scrapy project, let's say at the root of your project (but that's not the best location...): yourproject/files.py

2) Then enable this pipeline by adding this to your settings.py:

ITEM_PIPELINES = [
    'yourproject.files.FilesPipeline',
]
FILES_STORE = '/path/to/yourproject/downloads'

FILES_STORE needs to point to a location where Scrapy can write (create it beforehand).

3) Add two special fields to your item definition:

file_urls = Field()
files = Field()

4) In your spider, when you have a URL for a file to download, add it to your Item instance before returning it:

...
myitem = YourProjectItem()
...
myitem["file_urls"] = ["http://www.example.com/somefileiwant.csv"]
yield myitem

5) Run your spider and you should see files in the FILES_STORE folder.

Here's an example that downloads a few files from the IETF website. The Scrapy project is called "filedownload".

items.py looks like this:

from scrapy.item import Item, Field

class FiledownloadItem(Item):
    file_urls = Field()
    files = Field()

This is the code for the spider:

from scrapy.spider import BaseSpider
from filedownload.items import FiledownloadItem

class IetfSpider(BaseSpider):
    name = "ietf"
    allowed_domains = ["ietf.org"]
    start_urls = (
        'http://www.ietf.org/',
    )

    def parse(self, response):
        yield FiledownloadItem(
            file_urls=[
                'http://www.ietf.org/images/ietflogotrans.gif',
                'http://www.ietf.org/rfc/rfc2616.txt',
                'http://www.rfc-editor.org/rfc/rfc2616.ps',
                'http://www.rfc-editor.org/rfc/rfc2616.pdf',
                'http://tools.ietf.org/html/rfc2616.html',
            ]
        )

When you run the spider, at the end you should see something like this in the console:

2013-09-21 18:30:42+0200 [ietf] DEBUG: Scraped from <200 http://www.ietf.org/>
    {'file_urls': ['http://www.ietf.org/images/ietflogotrans.gif',
                   'http://www.ietf.org/rfc/rfc2616.txt',
                   'http://www.rfc-editor.org/rfc/rfc2616.ps',
                   'http://www.rfc-editor.org/rfc/rfc2616.pdf',
                   'http://tools.ietf.org/html/rfc2616.html'],
     'files': [{'checksum': 'e4b6ca0dd271ce887e70a1a2a5d681df',
                'path': 'full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif',
                'url': 'http://www.ietf.org/images/ietflogotrans.gif'},
               {'checksum': '9fa63f5083e4d2112d2e71b008e387e8',
                'path': 'full/454ea89fbeaf00219fbcae49960d8bd1016994b0.txt',
                'url': 'http://www.ietf.org/rfc/rfc2616.txt'},
               {'checksum': '5f0dc88aced3b0678d702fb26454e851',
                'path': 'full/f76736e9f1f22d7d5563208d97d13e7cc7a3a633.ps',
                'url': 'http://www.rfc-editor.org/rfc/rfc2616.ps'},
               {'checksum': '2d555310626966c3521cda04ae2fe76f',
                'path': 'full/6ff52709da9514feb13211b6eb050458f353b49a.pdf',
                'url': 'http://www.rfc-editor.org/rfc/rfc2616.pdf'},
               {'checksum': '735820b4f0f4df7048b288ba36612295',
                'path': 'full/7192dd9a00a8567bf3dc4c21ababdcec6c69ce7f.html',
                'url': 'http://tools.ietf.org/html/rfc2616.html'}]}
2013-09-21 18:30:42+0200 [ietf] INFO: Closing spider (finished)

which tells you what files were downloaded and where they were stored.

Hope this helps.
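An aside on the 'path' values in that output: as far as I can tell from files.py, the stored file name is simply the SHA1 hex digest of the download URL plus the original extension, so the same URL always maps to the same path. A two-line sketch to check that against the log above:

import hashlib

# reproduce the first 'path' reported in the log above (assuming the
# name really is the SHA1 of the URL; check files.py to confirm)
url = 'http://www.ietf.org/images/ietflogotrans.gif'
print('full/%s.gif' % hashlib.sha1(url.encode('utf-8')).hexdigest())
# expected: full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif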
On Tuesday, September 17, 2013 1:46:15 PM UTC+2, Ana Carolina Assis Jesus wrote:

Hi Paul,

Could you give me an example of how to use the pipeline, please?

Thanks,
Ana

On Tue, Sep 17, 2013 at 12:19 PM, Ana Carolina Assis Jesus <[email protected]> wrote:

Well, I installed about two weeks ago, but a tagged version... so maybe I don't have it... But I really need the pipeline, even though the "get" button should, in principle at least, just download a file! I mean, that is what it does manually... ???

Thanks!

On Tue, Sep 17, 2013 at 12:14 PM, Paul Tremberth <[email protected]> wrote:

Well, the FilesPipeline is a module inside scrapy.contrib.pipeline. It was committed less than 2 weeks ago. (Scrapy is being improved all the time by the community.)

It depends when and how you installed Scrapy (a quick way to check which case you are in is sketched at the end of this thread):
- if you installed a tagged version using pip or easy_install (as recommended: http://doc.scrapy.org/en/latest/intro/install.html#installing-scrapy), you won't have the pipeline and will have to add it yourself
- if you installed from source less than 2 weeks ago (git clone [email protected]:scrapy/scrapy.git; cd scrapy; sudo python setup.py install), you should be good (but Scrapy from the latest source code might be unstable and not fully tested)

On Tuesday, September 17, 2013 12:04:31 PM UTC+2, Ana Carolina Assis Jesus wrote:

Hi Paul.

What do you mean by installing Scrapy from source? Do I need a new version of it?

On Tue, Sep 17, 2013 at 12:01 PM, Paul Tremberth <[email protected]> wrote:

Hi Ana,
to download files, you should have a look at the new FilesPipeline: https://github.com/scrapy/scrapy/pull/370

It's in the master branch though, not in a tagged version of Scrapy, so you'll have to install Scrapy from source.

Paul.

On Tuesday, September 17, 2013 11:50:05 AM UTC+2, Ana Carolina Assis Jesus wrote:

Hi!

I am trying to download a CSV file with Scrapy. I can crawl inside the site and get to the form I need, and then I find two buttons to click. One will list the transactions, while the second one will download an XXX.csv file.

How do I save this file within Scrapy? I mean, if I choose to list the transactions, I get another webpage, and that one I can see. But what if I choose the download action? I guess I should not use return self.parse_dosomething but something else to save the file it should give me (???)

Or should the download start by itself?

Thanks,
Ana
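Regarding the "it depends when and how you installed Scrapy" point above: rather than guessing from the install date, a quick check (a sketch; the import path is the one used elsewhere in this thread and may differ in other Scrapy versions):

# check whether the installed Scrapy already ships the FilesPipeline
try:
    from scrapy.contrib.pipeline.files import FilesPipeline
    print('FilesPipeline is available; no need to copy files.py')
except ImportError:
    print('FilesPipeline not found; copy files.py into your project as described above')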
