I actually sent you the old code I had. The new version also edits these functions, but instead of path[0] and ret[0] I just use path and ret.
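With plain strings, the overrides collapse to something like the sketch below. This is only an illustration, not the attached code: it assumes the same "file_spec" meta dict described in the quoted messages, and it is written as a FilesPipeline subclass rather than a direct edit to files.py.

    import os

    from scrapy.contrib.pipeline.files import FilesPipeline
    from scrapy.http import Request


    class MyFilesPipeline(FilesPipeline):

        def get_media_requests(self, item, info):
            # one Request per {file_url, file_name} dict, carrying the dict along in meta
            for file_spec in item['file_urls']:
                yield Request(url=file_spec["file_url"], meta={"file_spec": file_spec})

        def file_path(self, request, response=None, info=None):
            # keep the original extension, but take the file name from the meta dict
            media_ext = os.path.splitext(request.url)[1]
            return request.meta["file_spec"]["file_name"] + media_ext

With file_path() returning a plain relative string like this, the bundled filesystem store should already join it onto FILES_STORE, so the basedir edit may not be needed at all.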
On Tue, Mar 25, 2014 at 1:34 AM, Matt Cialini <[email protected]> wrote:

> Hi Casey,
>
> I ended up using Paul's suggestion and expanded on it to fit my needs.
> Basically my spider creates a single instance of FileDownloadItem:
> {'file_urls': [list of dict objects {file_url: url, file_name: name}]}.
> Each dict's file_url is the web URL, and its file_name is the title to save
> under. The spider yields the item to the FilesPipeline, in which I just
> edited a few functions to better match the item structure I pass in:
>
> def _get_filesystem_path(self, path):
>     str = self.basedir + path[0]
>     return str
>
> def file_path(self, request, response=None, info=None):
>     def _warn():
>         #print "_warn"
>         from scrapy.exceptions import ScrapyDeprecationWarning
>         import warnings
>         warnings.warn('FilesPipeline.file_key(url) method is deprecated, please use '
>                       'file_path(request, response=None, info=None) instead',
>                       category=ScrapyDeprecationWarning, stacklevel=1)
>
>     # check if called from file_key with url as first argument
>     if not isinstance(request, Request):
>         _warn()
>         url = request
>     else:
>         url = request.url
>
>     # detect if file_key() method has been overridden
>     if not hasattr(self.file_key, '_base'):
>         _warn()
>         return self.file_key(url)
>
>     media_ext = os.path.splitext(url)[1]  # change to request.url after deprecation
>     ret = request.meta["file_spec"]["file_name"]
>     return ret[0] + media_ext
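Either way, the intent of that naming scheme is FILES_STORE plus the custom name plus the original extension. A rough worked example, with both the store location and the item entry as placeholders:

    import os

    FILES_STORE = '/path/to/downloads'   # placeholder for the real setting
    file_spec = {'file_url': 'http://www.example.com/report.csv',
                 'file_name': 'report_2014'}   # hypothetical entry from file_urls

    media_ext = os.path.splitext(file_spec['file_url'])[1]   # '.csv'
    relative_path = file_spec['file_name'] + media_ext       # 'report_2014.csv'
    print(os.path.join(FILES_STORE, relative_path))          # /path/to/downloads/report_2014.csv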
> On Sun, Mar 23, 2014 at 10:00 PM, Casey Klimkowsky <[email protected]> wrote:
>
>> Hi Matt,
>>
>> I was wondering if you ever figured out your problem? I am also looking
>> to use the FilesPipeline with custom file names. I was able to edit
>> FilesPipeline itself to achieve this result, but obviously it would be
>> better practice to extend FilesPipeline and override the necessary
>> methods instead. When I use a solution similar to Paul's, my files are
>> not downloaded to my hard drive.
>>
>> Thank you!
>>
>> On Tuesday, February 25, 2014 9:03:20 AM UTC-6, Matt Cialini wrote:
>>
>>> Hi Paul,
>>>
>>> Thanks for the suggestion. I'm trying to implement it now, but the files
>>> aren't being written to disk correctly. What function in files.py handles
>>> the actual saving of the file?
>>>
>>> Every item I pass into files.py is eventually a FileDownloadItem:
>>> {'file_urls': [list of dict objects {file_url: url, file_name: name}]}
>>>
>>> I'll attach my code to this if you have time to look it over. Basically
>>> I think something is not being passed in correctly in files.py, but it's
>>> hard to search through and determine where.
>>>
>>> Thanks so much Paul!
>>>
>>> - Matt C
>>>
>>> On Tue, Feb 25, 2014 at 4:28 AM, Paul Tremberth <[email protected]> wrote:
>>>
>>>> Hi Matt,
>>>>
>>>> One way to do that is to play with the FilesPipeline *get_media_requests()*,
>>>> passing additional data through the meta dict, and then using a custom
>>>> *file_path()* method.
>>>>
>>>> Below, I use a dict in *file_urls* and not a list, so that I can pass
>>>> a URL and a custom *file_name*.
>>>>
>>>> Using the same IETF example I used above in the thread:
>>>>
>>>> A simple spider downloading some files from IETF.org:
>>>>
>>>> from scrapy.spider import Spider
>>>> from scrapy.http import Request
>>>> from scrapy.item import Item, Field
>>>>
>>>>
>>>> class IetfItem(Item):
>>>>     files = Field()
>>>>     file_urls = Field()
>>>>
>>>>
>>>> class IETFSpider(Spider):
>>>>     name = 'ietfpipe'
>>>>     allowed_domains = ['ietf.org']
>>>>     start_urls = ['http://www.ietf.org']
>>>>     file_urls = [
>>>>         'http://www.ietf.org/images/ietflogotrans.gif',
>>>>         'http://www.ietf.org/rfc/rfc2616.txt',
>>>>         'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>>>         'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>>>         'http://tools.ietf.org/html/rfc2616.html',
>>>>     ]
>>>>
>>>>     def parse(self, response):
>>>>         for cnt, furl in enumerate(self.file_urls, start=1):
>>>>             yield IetfItem(file_urls=[{"file_url": furl, "file_name": "file_%03d" % cnt}])
>>>>
>>>> Custom FilesPipeline:
>>>>
>>>> from scrapy.contrib.pipeline.files import FilesPipeline
>>>> from scrapy.http import Request
>>>>
>>>>
>>>> class MyFilesPipeline(FilesPipeline):
>>>>
>>>>     def get_media_requests(self, item, info):
>>>>         for file_spec in item['file_urls']:
>>>>             yield Request(url=file_spec["file_url"], meta={"file_spec": file_spec})
>>>>
>>>>     def file_path(self, request, response=None, info=None):
>>>>         return request.meta["file_spec"]["file_name"]
>>>>
>>>> Hope this helps
>>>>
>>>> /Paul.
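One detail the thread never spells out for the subclass route: the custom pipeline still has to be enabled in settings.py, together with FILES_STORE. A minimal sketch, where the module path and the store directory are placeholders:

    # settings.py
    ITEM_PIPELINES = [
        'yourproject.pipelines.MyFilesPipeline',    # wherever MyFilesPipeline actually lives
    ]
    FILES_STORE = '/path/to/yourproject/downloads'  # must exist and be writable by Scrapy

If either of these is missing, the spider runs but files never reach the disk, which may be what Casey is running into above.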
>>>> On Friday, February 21, 2014 6:44:20 AM UTC+1, Matt Cialini wrote:
>>>>
>>>>> Hello Paul!
>>>>>
>>>>> I'm Matt. I know this is a somewhat old group now, but I found your
>>>>> advice about the FilesPipeline and it works great. I had one question
>>>>> though. Do you know of an easy way to pass in a file_name field for
>>>>> each URL so that the FilesPipeline will save each URL under the
>>>>> correct name?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Saturday, September 21, 2013 1:03:09 PM UTC-4, Paul Tremberth wrote:
>>>>>
>>>>>> Hi Ana,
>>>>>>
>>>>>> If you want to use the FilesPipeline before it's in an official Scrapy
>>>>>> release, here's one way to do it:
>>>>>>
>>>>>> 1) Download https://raw.github.com/scrapy/scrapy/master/scrapy/contrib/pipeline/files.py
>>>>>> and save it somewhere in your Scrapy project, let's say at the root of
>>>>>> your project (but that's not the best location...):
>>>>>> yourproject/files.py
>>>>>>
>>>>>> 2) Then enable this pipeline by adding this to your settings.py:
>>>>>>
>>>>>> ITEM_PIPELINES = [
>>>>>>     'yourproject.files.FilesPipeline',
>>>>>> ]
>>>>>> FILES_STORE = '/path/to/yourproject/downloads'
>>>>>>
>>>>>> FILES_STORE needs to point to a location where Scrapy can write
>>>>>> (create it beforehand).
>>>>>>
>>>>>> 3) Add 2 special fields to your item definition:
>>>>>>
>>>>>> file_urls = Field()
>>>>>> files = Field()
>>>>>>
>>>>>> 4) In your spider, when you have a URL for a file to download, add it
>>>>>> to your Item instance before returning it:
>>>>>>
>>>>>> ...
>>>>>> myitem = YourProjectItem()
>>>>>> ...
>>>>>> myitem["file_urls"] = ["http://www.example.com/somefileiwant.csv"]
>>>>>> yield myitem
>>>>>>
>>>>>> 5) Run your spider and you should see files in the FILES_STORE folder.
>>>>>>
>>>>>> Here's an example that downloads a few files from the IETF website.
>>>>>> The Scrapy project is called "filedownload".
>>>>>>
>>>>>> items.py looks like this:
>>>>>>
>>>>>> from scrapy.item import Item, Field
>>>>>>
>>>>>> class FiledownloadItem(Item):
>>>>>>     file_urls = Field()
>>>>>>     files = Field()
>>>>>>
>>>>>> This is the code for the spider:
>>>>>>
>>>>>> from scrapy.spider import BaseSpider
>>>>>> from filedownload.items import FiledownloadItem
>>>>>>
>>>>>> class IetfSpider(BaseSpider):
>>>>>>     name = "ietf"
>>>>>>     allowed_domains = ["ietf.org"]
>>>>>>     start_urls = (
>>>>>>         'http://www.ietf.org/',
>>>>>>     )
>>>>>>
>>>>>>     def parse(self, response):
>>>>>>         yield FiledownloadItem(
>>>>>>             file_urls=[
>>>>>>                 'http://www.ietf.org/images/ietflogotrans.gif',
>>>>>>                 'http://www.ietf.org/rfc/rfc2616.txt',
>>>>>>                 'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>>>>>                 'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>>>>>                 'http://tools.ietf.org/html/rfc2616.html',
>>>>>>             ]
>>>>>>         )
>>>>>>
>>>>>> When you run the spider, at the end you should see something like this
>>>>>> in the console:
>>>>>>
>>>>>> 2013-09-21 18:30:42+0200 [ietf] DEBUG: Scraped from <200 http://www.ietf.org/>
>>>>>>     {'file_urls': ['http://www.ietf.org/images/ietflogotrans.gif',
>>>>>>                    'http://www.ietf.org/rfc/rfc2616.txt',
>>>>>>                    'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>>>>>                    'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>>>>>                    'http://tools.ietf.org/html/rfc2616.html'],
>>>>>>      'files': [{'checksum': 'e4b6ca0dd271ce887e70a1a2a5d681df',
>>>>>>                 'path': 'full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif',
>>>>>>                 'url': 'http://www.ietf.org/images/ietflogotrans.gif'},
>>>>>>                {'checksum': '9fa63f5083e4d2112d2e71b008e387e8',
>>>>>>                 'path': 'full/454ea89fbeaf00219fbcae49960d8bd1016994b0.txt',
>>>>>>                 'url': 'http://www.ietf.org/rfc/rfc2616.txt'},
>>>>>>                {'checksum': '5f0dc88aced3b0678d702fb26454e851',
>>>>>>                 'path': 'full/f76736e9f1f22d7d5563208d97d13e7cc7a3a633.ps',
>>>>>>                 'url': 'http://www.rfc-editor.org/rfc/rfc2616.ps'},
>>>>>>                {'checksum': '2d555310626966c3521cda04ae2fe76f',
>>>>>>                 'path': 'full/6ff52709da9514feb13211b6eb050458f353b49a.pdf',
>>>>>>                 'url': 'http://www.rfc-editor.org/rfc/rfc2616.pdf'},
>>>>>>                {'checksum': '735820b4f0f4df7048b288ba36612295',
>>>>>>                 'path': 'full/7192dd9a00a8567bf3dc4c21ababdcec6c69ce7f.html',
>>>>>>                 'url': 'http://tools.ietf.org/html/rfc2616.html'}]}
>>>>>> 2013-09-21 18:30:42+0200 [ietf] INFO: Closing spider (finished)
>>>>>>
>>>>>> This tells you what files were downloaded, and where they were stored.
>>>>>>
>>>>>> Hope this helps.
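Each 'path' in the resulting files field is relative to FILES_STORE, so the two combined give the file's location on disk. A small sketch of reading one of the entries above back, with the store directory as a placeholder:

    import os

    FILES_STORE = '/path/to/yourproject/downloads'   # whatever settings.py points at
    file_entry = {'checksum': 'e4b6ca0dd271ce887e70a1a2a5d681df',
                  'path': 'full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif',
                  'url': 'http://www.ietf.org/images/ietflogotrans.gif'}

    local_path = os.path.join(FILES_STORE, file_entry['path'])
    with open(local_path, 'rb') as f:
        data = f.read()   # raw bytes of the downloaded GIF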
>>>>>> On Tuesday, September 17, 2013 1:46:15 PM UTC+2, Ana Carolina Assis Jesus wrote:
>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> Could you give me an example of how to use the pipeline, please?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ana
>>>>>>>
>>>>>>> On Tue, Sep 17, 2013 at 12:19 PM, Ana Carolina Assis Jesus <[email protected]> wrote:
>>>>>>> > Well, I installed about two weeks ago, but a tagged version... so
>>>>>>> > maybe I don't have it...
>>>>>>> > But do I really need the pipeline, if the "get" button, in principle
>>>>>>> > at least, should just download a file? I mean, that is what it does
>>>>>>> > manually... ???
>>>>>>> >
>>>>>>> > Thanks!
>>>>>>> >
>>>>>>> > On Tue, Sep 17, 2013 at 12:14 PM, Paul Tremberth <[email protected]> wrote:
>>>>>>> >> Well, the FilesPipeline is a module inside scrapy.contrib.pipeline.
>>>>>>> >> It was committed less than 2 weeks ago. (Scrapy is being improved
>>>>>>> >> all the time by the community.)
>>>>>>> >>
>>>>>>> >> It depends when and how you installed Scrapy:
>>>>>>> >> - if you installed a tagged version using pip or easy_install (as
>>>>>>> >> recommended; http://doc.scrapy.org/en/latest/intro/install.html#installing-scrapy)
>>>>>>> >> you won't have the pipeline and you have to add it yourself
>>>>>>> >> - if you installed from source less than 2 weeks ago (git clone
>>>>>>> >> [email protected]:scrapy/scrapy.git; cd scrapy; sudo python setup.py install)
>>>>>>> >> you should be good (but Scrapy from the latest source code might be
>>>>>>> >> unstable and not fully tested)
>>>>>>> >>
>>>>>>> >> On Tuesday, September 17, 2013 12:04:31 PM UTC+2, Ana Carolina Assis Jesus wrote:
>>>>>>> >>>
>>>>>>> >>> Hi Paul.
>>>>>>> >>>
>>>>>>> >>> What do you mean by installing Scrapy from source?
>>>>>>> >>> Do I need a new version of it?
>>>>>>> >>>
>>>>>>> >>> On Tue, Sep 17, 2013 at 12:01 PM, Paul Tremberth <[email protected]> wrote:
>>>>>>> >>> > Hi Ana,
>>>>>>> >>> >
>>>>>>> >>> > To download files, you should have a look at the new FilesPipeline:
>>>>>>> >>> > https://github.com/scrapy/scrapy/pull/370
>>>>>>> >>> >
>>>>>>> >>> > It's in the master branch though, not in a tagged version of Scrapy,
>>>>>>> >>> > so you'll have to install Scrapy from source.
>>>>>>> >>> >
>>>>>>> >>> > Paul.
>>>>>>> >>> >
>>>>>>> >>> > On Tuesday, September 17, 2013 11:50:05 AM UTC+2, Ana Carolina Assis Jesus wrote:
>>>>>>> >>> >>
>>>>>>> >>> >> Hi!
>>>>>>> >>> >>
>>>>>>> >>> >> I am trying to download a CSV file with Scrapy.
>>>>>>> >>> >> I could crawl inside the site and get to the form I need, and
>>>>>>> >>> >> there I find two buttons to click.
>>>>>>> >>> >> One will list the transactions, while the second one will
>>>>>>> >>> >> download an XXX.csv file.
>>>>>>> >>> >>
>>>>>>> >>> >> How do I save this file within Scrapy?
>>>>>>> >>> >>
>>>>>>> >>> >> I mean, if I choose to list the transactions, I get another
>>>>>>> >>> >> webpage, and this I can see.
>>>>>>> >>> >> But what if I choose the download action? I guess I should not
>>>>>>> >>> >> use return self.parse_dosomething, but something else to save
>>>>>>> >>> >> the file it should give me (???)
>>>>>>> >>> >>
>>>>>>> >>> >> Or should the download start by itself?
>>>>>>> >>> >>
>>>>>>> >>> >> Thanks,
>>>>>>> >>> >> Ana
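For the case Ana describes, if the download button resolves to a plain URL, the "something else to save the file" can also be a callback that simply writes the response body to disk, without any pipeline. A rough sketch where the spider name, URLs, and output file name are all placeholders:

    from scrapy.spider import Spider
    from scrapy.http import Request


    class CsvDownloadSpider(Spider):
        # hypothetical spider: names and URLs are illustrative only
        name = 'csvdownload'
        start_urls = ['http://www.example.com/form-page']

        def parse(self, response):
            # assume the download button is a plain link whose URL we already know
            csv_url = 'http://www.example.com/export/transactions.csv'
            yield Request(csv_url, callback=self.save_csv)

        def save_csv(self, response):
            # response.body holds the raw bytes of the downloaded file
            with open('transactions.csv', 'wb') as f:
                f.write(response.body)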
--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
