Hi Paul,

I haven't tried it, but it's possible. There are a few approaches to this:

- Use Scrapy signals to compress a text file when an event occurs, such as
  spider_closed (see the sketch below).
- Or, in a pipeline, compress the data and then save it directly to the file.
- Or use a NoSQL database such as MongoDB, which already supports compression:
  http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-mongodb
  https://github.com/sebdah/scrapy-mongodb
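For example, here is a minimal sketch of the first approach as a Scrapy
extension. The COMPRESS_FILE setting, the default file name, and the module
path in EXTENSIONS below are made up for illustration, and it assumes your
spider writes its output to a single file:

import gzip
import shutil

from scrapy import signals


class GzipOnCloseExtension(object):
    """Gzip the spider's output file once the spider finishes."""

    def __init__(self, path):
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        # COMPRESS_FILE is a hypothetical setting naming the file to compress
        ext = cls(crawler.settings.get('COMPRESS_FILE', 'items.jl'))
        crawler.signals.connect(ext.spider_closed,
                                signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        # Write <path>.gz next to the original file when the spider closes
        with open(self.path, 'rb') as src, \
                gzip.open(self.path + '.gz', 'wb') as dst:
            shutil.copyfileobj(src, dst)

Enable it in settings.py with something like:

EXTENSIONS = {'myproject.extensions.GzipOnCloseExtension': 0}

The same gzip logic could just as well live in an item pipeline's
close_spider method if you prefer the second approach.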
Best Regards.
Lhassan.

2016-08-19 10:57 GMT+01:00 Paul Tremberth <[email protected]>:

> Hi Kasper,
>
> Files come in all kinds, and zipping them may not be the best option
> depending on the use-case (for already-compressed ones, for example).
> Maybe you are referring to text files? In that case, zipping them would
> indeed save quite a bit of disk space.
>
> > The FilesPipeline does not allow me to provide my own files store,
> > unless I hack away on the scrapy code itself
>
> > Are you interested in a patch that allows the FilesPipeline to accept
> > custom store schemes?
>
> As far as I know, this is already possible. I would second Lhassan here
> on using a subclassed FilesPipeline with a custom STORE_SCHEME
> referencing your store. Have you tried this? Or are you having trouble
> getting it to work? Maybe the documentation is a bit daunting? In that
> case we can improve it.
>
> Regards,
> Paul.
>
> On Thursday, August 18, 2016 at 3:28:49 PM UTC+2, Kasper Marstal wrote:
>>
>> Hi Lhassan,
>>
>> Okay, thank you for your reply.
>>
>> Kasper
>>
>> On Thursday, August 18, 2016 at 2:50:18 PM UTC+2, Lhassan Baazzi wrote:
>>>
>>> Hi,
>>>
>>> If you are going with the zip option, just create your own pipeline
>>> that extends the base files pipeline, and publish it as a package on
>>> GitHub in case someone else needs it too.
>>>
>>> Best Regards.
>>> Lhassan
>>> On 18 August 2016 at 13:16, "Kasper Marstal" <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am scraping a couple of million documents and need to save disk
>>>> space when storing the data. An attractive option is to save the
>>>> files directly to a ZIP file, since the compression ratio is really
>>>> good with this kind of data (~18). However, the FilesPipeline does
>>>> not allow me to provide my own files store unless I hack away on the
>>>> scrapy code itself, which I would like to avoid. So, a couple of
>>>> questions for the scrapy developers:
>>>>
>>>> - Are you interested in a patch that allows the FilesPipeline to
>>>> accept custom store schemes? OR
>>>> - Are you interested in a patch with a ZipFilesStore? In addition,
>>>> - Is this ZIP-file approach a common way of dealing with large
>>>> amounts of data, or do you have best practices on this subject that
>>>> I am not aware of?
>>>>
>>>> Kind Regards,
>>>> Kasper Marstal
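For reference, here is a rough sketch of what Paul describes above:
subclassing FilesPipeline and registering a custom files store under a new
URI scheme (the pipeline picks the store class from its STORE_SCHEMES dict
based on the scheme of FILES_STORE). The ZipFilesStore class, the 'zip'
scheme, and the settings values below are hypothetical; the
persist_file/stat_file methods mirror the interface of Scrapy's built-in
stores:

import zipfile

from scrapy.pipelines.files import FilesPipeline


class ZipFilesStore(object):
    """Hypothetical store appending every downloaded file to one ZIP."""

    def __init__(self, uri):
        # e.g. FILES_STORE = 'zip:///data/files.zip' -> '/data/files.zip'
        self.zip_path = uri.split('://', 1)[1]

    def persist_file(self, path, buf, info, meta=None, headers=None):
        # Append mode ('a') creates the archive on first use. Note that
        # persist_file runs in a thread pool, so a production version
        # would need a lock around the ZIP writes.
        with zipfile.ZipFile(self.zip_path, 'a', zipfile.ZIP_DEFLATED) as zf:
            zf.writestr(path, buf.getvalue())

    def stat_file(self, path, info):
        # Returning an empty dict makes the pipeline re-download files;
        # a real store would return checksum/last_modified here.
        return {}


class ZipFilesPipeline(FilesPipeline):
    STORE_SCHEMES = dict(FilesPipeline.STORE_SCHEMES, zip=ZipFilesStore)

Then point the pipeline at the store in settings.py:

FILES_STORE = 'zip:///data/files.zip'
ITEM_PIPELINES = {'myproject.pipelines.ZipFilesPipeline': 1}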
