Hi Paul,

I haven't tried it, but it's possible. There are a few approaches to this:

- Use Scrapy signals to compress a text file when an event occurs, such as
  spider_closed (see the sketch below).
- Or, in a pipeline, compress the data and then save it directly to the file.
- Or use a NoSQL database such as MongoDB, which already supports compression:
  http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-mongodb
  https://github.com/sebdah/scrapy-mongodb
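For example, here is a minimal sketch of the first approach as a Scrapy
extension. The COMPRESS_FILE setting, the default file name, and the module
path in EXTENSIONS below are made up for illustration, and it assumes your
spider writes its output to a single file:

import gzip
import shutil

from scrapy import signals


class GzipOnCloseExtension(object):
    """Gzip the spider's output file once the spider finishes."""

    def __init__(self, path):
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        # COMPRESS_FILE is a hypothetical setting naming the file to compress
        ext = cls(crawler.settings.get('COMPRESS_FILE', 'items.jl'))
        crawler.signals.connect(ext.spider_closed,
                                signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        # Write <path>.gz next to the original file when the spider closes
        with open(self.path, 'rb') as src, \
                gzip.open(self.path + '.gz', 'wb') as dst:
            shutil.copyfileobj(src, dst)

Enable it in settings.py with something like:

EXTENSIONS = {'myproject.extensions.GzipOnCloseExtension': 0}

The same gzip logic could just as well live in an item pipeline's
close_spider method if you prefer the second approach.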
Best Regards.
Lhassan.

2016-08-19 10:57 GMT+01:00 Paul Tremberth <[email protected]>:

> Hi Kasper,
>
> Files come in all kinds, and zipping them may not be the best option
> depending on the use-case (for already-compressed ones, for example).
> Maybe you are referring to text files? In that case, zipping them would
> indeed save quite a bit of disk space.
>
> > The FilesPipeline does not allow me to provide my own files store,
> > unless I hack away on the scrapy code itself
>
> > Are you interested in a patch that allows the FilesPipeline to accept
> > custom store schemes?
>
> As far as I know, this is already possible. I would second Lhassan here
> on using a subclassed FilesPipeline with a custom STORE_SCHEME
> referencing your store. Have you tried this? Or are you having trouble
> getting it to work? Maybe the documentation is a bit daunting? In that
> case we can improve it.
>
> Regards,
> Paul.
>
> On Thursday, August 18, 2016 at 3:28:49 PM UTC+2, Kasper Marstal wrote:
>>
>> Hi Lhassan,
>>
>> Okay, thank you for your reply.
>>
>> Kasper
>>
>> On Thursday, August 18, 2016 at 2:50:18 PM UTC+2, Lhassan Baazzi wrote:
>>>
>>> Hi,
>>>
>>> If you are going with the zip option, just create your own pipeline
>>> that extends the base files pipeline, and publish it as a package on
>>> GitHub in case someone else needs it too.
>>>
>>> Best Regards.
>>> Lhassan
>>> On 18 August 2016 at 13:16, "Kasper Marstal" <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am scraping a couple of million documents and need to save disk
>>>> space when storing the data. An attractive option is to save the
>>>> files directly to a ZIP file, since the compression ratio is really
>>>> good with this kind of data (~18). However, the FilesPipeline does
>>>> not allow me to provide my own files store unless I hack away on the
>>>> scrapy code itself, which I would like to avoid. So, a couple of
>>>> questions for the scrapy developers:
>>>>
>>>> - Are you interested in a patch that allows the FilesPipeline to
>>>> accept custom store schemes? OR
>>>> - Are you interested in a patch with a ZipFilesStore? In addition,
>>>> - Is this ZIP-file approach a common way of dealing with large
>>>> amounts of data, or do you have best practices on this subject that
>>>> I am not aware of?
>>>>
>>>> Kind Regards,
>>>> Kasper Marstal
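For reference, here is a rough sketch of what Paul describes above:
subclassing FilesPipeline and registering a custom files store under a new
URI scheme (the pipeline picks the store class from its STORE_SCHEMES dict
based on the scheme of FILES_STORE). The ZipFilesStore class, the 'zip'
scheme, and the settings values below are hypothetical; the
persist_file/stat_file methods mirror the interface of Scrapy's built-in
stores:

import zipfile

from scrapy.pipelines.files import FilesPipeline


class ZipFilesStore(object):
    """Hypothetical store appending every downloaded file to one ZIP."""

    def __init__(self, uri):
        # e.g. FILES_STORE = 'zip:///data/files.zip' -> '/data/files.zip'
        self.zip_path = uri.split('://', 1)[1]

    def persist_file(self, path, buf, info, meta=None, headers=None):
        # Append mode ('a') creates the archive on first use. Note that
        # persist_file runs in a thread pool, so a production version
        # would need a lock around the ZIP writes.
        with zipfile.ZipFile(self.zip_path, 'a', zipfile.ZIP_DEFLATED) as zf:
            zf.writestr(path, buf.getvalue())

    def stat_file(self, path, info):
        # Returning an empty dict makes the pipeline re-download files;
        # a real store would return checksum/last_modified here.
        return {}


class ZipFilesPipeline(FilesPipeline):
    STORE_SCHEMES = dict(FilesPipeline.STORE_SCHEMES, zip=ZipFilesStore)

Then point the pipeline at the store in settings.py:

FILES_STORE = 'zip:///data/files.zip'
ITEM_PIPELINES = {'myproject.pipelines.ZipFilesPipeline': 1}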
