Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-10 Thread Stephen Burrows
Thanks, Florian. That clarifies things. I didn't realize collectstatic was 
using two backends, but it makes sense. And thanks for putting up with me 
not being up to speed on the IRC discussion.

On Tuesday, October 9, 2012 1:20:39 AM UTC-7, Florian Apolloner wrote:
>
> Hi Stephen,
>
> On Tuesday, October 9, 2012 7:28:43 AM UTC+2, Stephen Burrows wrote:
>>
>> I'm a little confused by the track the discussion took recently... my 
>> impression was that the solution would *not* be to change from 
>> last_modified to a checksum, or to add an additional checksum method. 
>> Instead, storage backends could support a has_changed method, which could be 
>> overridden to provide last_modified checking *or* checksum comparisons - or 
>> any other method of checking whether a file has changed. This seems like a 
>> useful, easy-to-explain, very generic way to handle checking whether a file 
>> has changed.
>
>
> This would be one way to do it (we also discussed it in IRC), but we 
> couldn't figure out any implementation that would actually be usable. To 
> make this work you'd need a function signature like "def 
> has_changed(file_name, other_thing)", where other_thing would be some 
> hash/modified time/whatever provided by the source storage. Now we are back 
> to the situation I described in my answer to Jeremy…
>
>  
>
>> And since what staticfiles actually cares about is whether the file has 
>> changed, it seems like it would make more 
>> sense to use a method that checks whether the file has changed, rather 
>> than just checking the last modification date.
>
>
> Well, staticfiles doesn't care; it's only collectstatic which cares to some 
> extent. So in my opinion the cleanest way (which means without coupling the 
> two needed storage backends together too strongly) is to provide your own 
> collectstatic command where you can do what you want with the checks (if 
> you have problems implementing that we'd happily move some code to extra 
> methods to make reusing collectstatic easier).
>  
>
>> Would I be correct in thinking that the main argument against this is API 
>> bloat, or is there something else that I'm not seeing?
>>
>
> I'd rather say: We don't see __any__ nice way to actually support 
> something which is generic enough to be useful. 
>
> Cheers,
> Florian
>

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/django-developers/-/A7G6vLq58aYJ.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-09 Thread Florian Apolloner
Hi Stephen,

On Tuesday, October 9, 2012 7:28:43 AM UTC+2, Stephen Burrows wrote:
>
> I'm a little confused by the track the discussion took recently... my 
> impression was that the solution would *not* be to change from 
> last_modified to a checksum, or to add an additional checksum method. 
> Instead, storage backends could support a has_changed method, which could be 
> overridden to provide last_modified checking *or* checksum comparisons - or 
> any other method of checking whether a file has changed. This seems like a 
> useful, easy-to-explain, very generic way to handle checking whether a file 
> has changed.


This would be one way to do it (we also discussed it in IRC), but we 
couldn't figure out any implementation that would actually be usable. To 
make this work you'd need a function signature like "def 
has_changed(file_name, other_thing)", where other_thing would be some 
hash/modified time/whatever provided by the source storage. Now we are back 
to the situation I described in my answer to Jeremy…

 

> And since what staticfiles actually cares about is whether the file has 
> changed, it seems like it would make more 
> sense to use a method that checks whether the file has changed, rather than 
> just checking the last modification date.


Well, staticfiles doesn't care; it's only collectstatic which cares to some 
extent. So in my opinion the cleanest way (which means without coupling the 
two needed storage backends together too strongly) is to provide your own 
collectstatic command where you can do what you want with the checks (if 
you have problems implementing that we'd happily move some code to extra 
methods to make reusing collectstatic easier).
 

> Would I be correct in thinking that the main argument against this is API 
> bloat, or is there something else that I'm not seeing?
>

I'd rather say: We don't see __any__ nice way to actually support something 
which is generic enough to be useful. 

Cheers,
Florian




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-09 Thread Florian Apolloner
Hi Jeremy,

On Tuesday, October 9, 2012 5:15:04 AM UTC+2, jdunck wrote:
>
> Would it be reasonable to have a backend-specific hook to determine a 
> fingerprint, where that could be mtime or md5 or whathaveyou as long as 
> equality (or maybe ordering) works?
>

Given our discussion in IRC yesterday the answer seems to be no. E.g. 
consider this: collectstatic actually uses two backends, a "source" 
storage backend and an "upload" storage backend. Now if the 
storage backends are allowed to return any fingerprint, how would you make 
sure that both of these backends actually use the same fingerprint? So 
essentially, to have an S3 backend which (let's just abuse modified_time for 
now) returns an md5 hash from modified_time, you'd have to also change the 
other backend you want to use (usually just the local filesystem backend) 
to return md5 instead…

Cheers,
Florian
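[Editor's note: the mismatch Florian describes can be made concrete with a short sketch. `fingerprint` is a hypothetical method, not part of Django's storage API; the point is that a generic comparison is only meaningful if both backends emit the same kind of fingerprint.]

```python
def needs_copy(source_storage, dest_storage, name):
    # Hypothetical generic check: only meaningful if BOTH backends
    # return the same kind of fingerprint (both mtimes, or both md5s).
    # An mtime from the source compared against an md5 from the
    # destination would never match, so the two backends must be
    # changed in lockstep -- the coupling Florian objects to.
    return source_storage.fingerprint(name) != dest_storage.fingerprint(name)
```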




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-08 Thread Stephen Burrows
I'm a little confused by the track the discussion took recently... my 
impression was that the solution would *not* be to change from 
last_modified to a checksum, or to add an additional checksum method. 
Instead, storage backends could support a has_changed method, which could be 
overridden to provide last_modified checking *or* checksum comparisons - or 
any other method of checking whether a file has changed. This seems like a 
useful, easy-to-explain, very generic way to handle checking whether a file 
has changed. And since what staticfiles actually cares about is whether the 
file has changed, it seems like it would make more sense to use a method 
that checks whether the file has changed, rather than just checking the 
last modification date.

Would I be correct in thinking that the main argument against this is API 
bloat, or is there something else that I'm not seeing?

On Thursday, September 27, 2012 9:51:52 AM UTC-7, Dan Loewenherz wrote:
>
> Hey all!
>
> This is a feature request / proposal (one which I'm willing to build out, 
> given that I've already developed a solution for my own uploader).
>
> I run a consulting business that helps small startups build initial MVPs. 
> When the time ultimately comes to deciding how to store static assets, my 
> preference (as is that of many others) is to use Amazon S3, with Amazon 
> CloudFront acting as a CDN to the appropriate buckets. For the purposes of 
> this ticket, s/S3/your object storage of choice/g.
>
> Now, Django has an awesome mechanism for making sure static assets are up 
> on S3. With the appropriate static storage backend, running ./manage.py 
> collectstatic just searches through your static media folders and copies 
> the files.
>
> The problem I've run into is that collectstatic copies all files, 
> regardless of whether they already exist on the destination. Uploading 
> 5-10MB of files is pretty wasteful (and time consuming) when none of the 
> files have changed and no new files have been added.
>
> As I wrote in the trac ticket (https://code.djangoproject.com/ticket/19021), 
> my current solution was to write a management command that does essentially 
> the same thing that collectstatic does. But, there are a few differences. 
> Here's a rundown (copied straight from the trac ticket).
>
> I currently solve this problem by creating a file containing metadata of 
>> all the static media at the root of the destination. This file is a JSON 
>> object that contains file paths as keys and checksums as values. When an 
>> upload is started, the uploader checks to see if the file path exists as a 
>> key in the dictionary. If it does, it checks to see if the checksums have 
>> changed. If they haven't changed, the uploader skips the file. At the end 
>> of the upload, the checksum file is updated on the destination.
>
>
> I'll contribute the patch. I know there is not a lot of time before the 
> feature freeze, but I'm willing to make this happen if there's interest.
>
> If we don't want to change behavior, perhaps adding a flag such as 
> --skip-unchanged-files to the collectstatic command is the way to go?
>
> All the best,
> Dan
>
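[Editor's note: the manifest scheme Dan quotes above can be sketched in a few lines. This is a hypothetical illustration, not Dan's actual uploader; `manifest` stands for the parsed JSON object (path -> md5 hex digest) stored at the destination root.]

```python
import hashlib

def file_md5(path, chunk_size=8192):
    """Hex md5 of a file, read in chunks so large assets
    don't need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def files_to_upload(local_paths, manifest):
    """Return the paths whose checksum is missing from or differs in
    the manifest (a dict of path -> md5 hex digest), updating the
    manifest so it can be re-serialized to the destination afterwards."""
    changed = []
    for path in local_paths:
        checksum = file_md5(path)
        if manifest.get(path) != checksum:
            changed.append(path)
            manifest[path] = checksum
    return changed
```

After the upload pass, the updated manifest would be dumped back to JSON and written to the destination, as described in the quoted text.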




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-08 Thread Jeremy Dunck
Would it be reasonable to have a backend-specific hook to determine a 
fingerprint, where that could be mtime or md5 or whathaveyou as long as 
equality (or maybe ordering) works?



On Oct 8, 2012, at 10:23 AM, Alex Ogier  wrote:

> On Mon, Oct 8, 2012 at 1:06 PM, ptone  wrote:
> While git may be common, and your problem not unique - this is still a 
> condition of your dev environment rendering modification dates invalid. There 
> might be other situations where this is the case (I've run into scripts that 
> muck with modification dates based on camera/jpeg metadata).
> 
> So after some further discussion on IRC - it was determined that md5, while 
> somewhat common, was far from a standard, and was likely not to be available 
> as a remote call for network-based storage backends. And so the final 
> resolution is to wontfix the ticket.
> 
> In the end - this lack of a universal fingerprint is just a limitation of our 
> storage tools.
> 
> -Preston
> 
> Is there a reason this fingerprint must be universal? If you're dealing with 
> a backend like S3, where network latency and expensive writes are a problem, 
> but md5 is a builtin remote call (available on any GET), why not just do an 
> md5 sum in the _save() method? Basically, just buffer the File object you 
> receive, take an md5 in Python, and then make a decision whether to upload or 
> not. In the common case of reading from local disk and writing to S3, this is 
> a big win, and doesn't require cooperation from any other backends, or 
> standardizing on md5 as a fingerprint method.
> 
> Best,
> Alex Ogier 




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-08 Thread Alex Ogier
On Mon, Oct 8, 2012 at 1:06 PM, ptone  wrote:

> While git may be common, and your problem not unique - this is still a
> condition of your dev environment rendering modification dates invalid.
> There might be other situations where this is the case (I've run into
> scripts that muck with modification dates based on camera/jpeg metadata).
>
> So after some further discussion on IRC - it was determined that md5,
> while somewhat common, was far from a standard, and was likely not to be
> available as a remote call for network-based storage backends. And so the
> final resolution is to wontfix the ticket.
>
> In the end - this lack of a universal fingerprint is just a limitation of
> our storage tools.
>
> -Preston
>

Is there a reason this fingerprint must be universal? If you're dealing
with a backend like S3, where network latency and expensive writes are a
problem, but md5 is a builtin remote call (available on any GET), why not
just do an md5 sum in the _save() method? Basically, just buffer the File
object you receive, take an md5 in Python, and then make a decision whether
to upload or not. In the common case of reading from local disk and writing
to S3, this is a big win, and doesn't require cooperation from any other
backends, or standardizing on md5 as a fingerprint method.

Best,
Alex Ogier
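[Editor's note: Alex's suggestion can be sketched as follows. `remote_md5` and `really_save` are hypothetical hooks a concrete backend would implement; for S3, the ETag of a non-multipart object is its md5, so `remote_md5` could come from a HEAD request.]

```python
import hashlib

class ChecksumSkippingStorage:
    """Sketch of a backend whose _save buffers the incoming File,
    hashes it locally, and skips the network write when the remote
    copy already matches. Hypothetical, not django-storages code."""

    def _save(self, name, content):
        data = content.read()                   # buffer the File object
        local_md5 = hashlib.md5(data).hexdigest()
        if self.remote_md5(name) == local_md5:  # e.g. S3 ETag via HEAD
            return name                         # unchanged: skip upload
        return self.really_save(name, data)     # actual network write
```

As Alex notes, this needs no cooperation from the source backend and no shared fingerprint standard; the decision is made entirely inside the destination backend.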




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-08 Thread ptone


On Monday, October 8, 2012 8:49:58 AM UTC-7, Dan Loewenherz wrote:
>
>
>> On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel wrote:
>> Then, frankly, this is a problem of the storage backends, not Django's. 
>> The S3BotoStorage backend *does* have a modified_time method:
>>
>>   
>> https://bitbucket.org/david/django-storages/src/1574890d87be/storages/backends/s3boto.py#cl-298
>>
>> What storage backend do you use that doesn't have a modified_time method?
>>
>
> I don't think you're seeing the problem I'm having. I'm working with a 
> distributed team using git. This means when we check out files, the local 
> modified time is the time at which I checked the files out, not the time 
> at which the files were actually last modified.
>
> As a result, it's a questionable metric for figuring out if a file is the 
> same or not, since every team member's local machine thinks they were all 
> just created! We end up re-uploading the file every time.
>

While git may be common, and your problem not unique - this is still a 
condition of your dev environment rendering modification dates invalid. 
There might be other situations where this is the case (I've run into 
scripts that muck with modification dates based on camera/jpeg metadata).

So after some further discussion on IRC - it was determined that md5, while 
somewhat common, was far from a standard, and was likely not to be 
available as a remote call for network-based storage backends. And so the 
final resolution is to wontfix the ticket.

In the end - this lack of a universal fingerprint is just a limitation of 
our storage tools.

-Preston


>> This is a bit confusing... why call it last_modified when that doesn't 
>> necessarily reflect what it's doing? It would be more flexible to create 
>> two methods:
>>
>> It's called modified_time, not last_modified.
>>
>
> Sorry, typo.
>  
>
>>
>> > def modification_identifier(self):
>> >
>> > def has_changed(self):
>> >
>> > Then, any backend could implement these however they might like, and 
>> collectstatic would have no excuse in uploading the same file more than 
>> once. Overloading last_modified to also do things like calculate md5's 
>> seems a bit hacky to me, and confusing for any developer maintaining a 
>> custom storage backend that doesn't support last modified.
>>
>> I disagree, modified_time is perfectly capable of handling your use case.
>>
>
> No it does not address my needs, as I described above.
>
> Dan
>




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-08 Thread Dan Loewenherz
On Mon, Oct 8, 2012 at 3:50 AM, Jannis Leidel  wrote:

> You're able to ask S3 for the date of last modification, I don't see why a
> comparison by hashing the file content is needed additionally. It'd have to
> download the full file to do that on Django's side and I'm not aware of an
> API for getting a hash from cloudfiles, S3 etc.
>

S3 stores the md5 in the ETag header.

Regarding Cloudfiles, this is what Rackspace has to say:

You can ensure end-to-end data integrity by including an MD5 checksum of
> your object's data in the ETag header. You are not required to include
> the ETag header, but it is recommended to ensure that the storage system
> successfully stored your object's content.


Dan
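[Editor's note: the reason this comparison is cheap is that for objects uploaded in a single PUT (not multipart), S3's ETag is the quoted md5 hex digest of the body. A sketch of the check; multipart uploads use a different `<md5>-<parts>` ETag format and would not match.]

```python
import hashlib

def matches_s3_etag(data, etag):
    """True if `data`'s md5 equals the (quote-wrapped) ETag.
    Only valid for single-PUT objects; multipart uploads need
    per-part handling."""
    return hashlib.md5(data).hexdigest() == etag.strip('"')
```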




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-08 Thread Dan Loewenherz
> > On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel  wrote:
> Then, frankly, this is a problem of the storage backends, not Django's.
> The S3BotoStorage backend *does* have a modified_time method:
>
>
> https://bitbucket.org/david/django-storages/src/1574890d87be/storages/backends/s3boto.py#cl-298
>
> What storage backend do you use that doesn't have a modified_time method?
>

I don't think you're seeing the problem I'm having. I'm working with a
distributed team using git. This means when we check out files, the local
modified time is the time at which I checked the files out, not the time
at which the files were actually last modified.

As a result, it's a questionable metric for figuring out if a file is the
same or not, since every team member's local machine thinks they were all
just created! We end up re-uploading the file every time.

> This is a bit confusing... why call it last_modified when that doesn't
> necessarily reflect what it's doing? It would be more flexible to create
> two methods:
>
> It's called modified_time, not last_modified.
>

Sorry, typo.


>
> > def modification_identifier(self):
> >
> > def has_changed(self):
> >
> > Then, any backend could implement these however they might like, and
> collectstatic would have no excuse in uploading the same file more than
> once. Overloading last_modified to also do things like calculate md5's
> seems a bit hacky to me, and confusing for any developer maintaining a
> custom storage backend that doesn't support last modified.
>
> I disagree, modified_time is perfectly capable of handling your use case.
>

No it does not address my needs, as I described above.

Dan




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-08 Thread Jannis Leidel

On 08.10.2012, at 07:38, ptone  wrote:

> so after scanning this thread and the ticket again - it is still unclear 
> whether there could be a completely universal solution.
> 
> While it would be nice if the storage API had a checksum(name) or md5(name) 
> method - not all custom storage backends are going to support a single 
> checksum standard.  S3 doesn't explicitly support MD5 (apparently it 
> unofficially does through ETags).  Without a universal checksum - you can't 
> use it to compare files across arbitrary backends.

You're able to ask S3 for the date of last modification, I don't see why a 
comparison by hashing the file content is needed additionally. It'd have to 
download the full file to do that on Django's side and I'm not aware of an API 
for getting a hash from cloudfiles, S3 etc.

> I do agree that hacking modified_time return value is a little ugly - the API 
> is clearly documented as "returns a datetime..." - so returning an MD5 
> checksum there is, well, hacky.

I beg to differ, returning a datetime object makes absolute sense for comparing 
it to another datetime object. What I meant before is that the modified_time 
method can be written however the user wants as long as it returns a datetime 
object, even a date that is known to be older than the file on disk.

> If you are passionate about moving this forward, here is what I'd suggest.
> 
> Implement, document, and test .md5(name) as a standard method on storage 
> backends - like modified_time this would raise NotImplementedError if not 
> available - this could easily be its own ticket. md5 is probably the closest 
> you'll get to a checksum standard.

-1

Jannis 


> On Sunday, October 7, 2012 8:59:16 PM UTC-7, Dan Loewenherz wrote:
> This issue just got me again tonight, so I'll try to push once more on this 
> issue. It seems right now most people don't care that this is broken, which 
> is a bummer, but in which case I'll just continue using my working solution.
> 
> Dan
> 
> On Sat, Oct 6, 2012 at 10:48 AM, Dan Loewenherz  wrote:
> Hey Jannis,
> 
> On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel  wrote:
> 
> On 30.09.2012, at 23:41, Dan Loewenherz  wrote:
> 
> > Many backends don't support last modified times, and even if they all did, 
> > it's incorrect to assume that last modified time is an accurate heuristic 
> > for whether a file has already been uploaded or not.
> 
> Well but it's an accurate way to decide whether a file has been changed on 
> the filesystem, and that's what collectstatic cares about. The storage 
> backend *is* the API to extend that when needed, so feel free to use it.
> 
> It's accurate *only* in certain situations. And on a distributed development 
> team, I've run into a lot of issues with developers re-uploading files that have 
> already been uploaded because they just recently updated their repo.
> 
> A checksum is the only truly accurate method to determine if a file has 
> changed.
> 
> Additionally, you didn't address my point that I quoted from. Storage 
> backends don't just reflect filesystems--they could reflect files stored in a 
> database, S3, etc. And some of these filesystems don't support last modified 
> times.
> 
> > It might be a better idea to let the backends decide when a file has been 
> > changed (instead of just calling the backend's last modified method).
> 
> I don't understand, you can easily implement exactly that in the 
> last_modified method if you'd like.
> 
> This is a bit confusing... why call it last_modified when that doesn't 
> necessarily reflect what it's doing? It would be more flexible to create two 
> methods:
> 
> def modification_identifier(self):
> 
> def has_changed(self):
> 
> Then, any backend could implement these however they might like, and 
> collectstatic would have no excuse in uploading the same file more than once. 
> Overloading last_modified to also do things like calculate md5's seems a bit 
> hacky to me, and confusing for any developer maintaining a custom storage 
> backend that doesn't support last modified.
> 
> Dan
> 
> 




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-08 Thread Jannis Leidel

On 06.10.2012, at 19:48, Dan Loewenherz  wrote:

> Hey Jannis,
> 
> On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel  wrote:
> 
> On 30.09.2012, at 23:41, Dan Loewenherz  wrote:
> 
> > Many backends don't support last modified times, and even if they all did, 
> > it's incorrect to assume that last modified time is an accurate heuristic 
> > for whether a file has already been uploaded or not.
> 
> Well but it's an accurate way to decide whether a file has been changed on 
> the filesystem, and that's what collectstatic cares about. The storage 
> backend *is* the API to extend that when needed, so feel free to use it.
> 
> It's accurate *only* in certain situations. And on a distributed development 
> team, I've run into a lot of issues with developers re-uploading files that have 
> already been uploaded because they just recently updated their repo.
> 
> A checksum is the only truly accurate method to determine if a file has 
> changed.
> 
> Additionally, you didn't address my point that I quoted from. Storage 
> backends don't just reflect filesystems--they could reflect files stored in a 
> database, S3, etc. And some of these filesystems don't support last modified 
> times.

Then, frankly, this is a problem of the storage backends, not Django's. The 
S3BotoStorage backend *does* have a modified_time method:

  
https://bitbucket.org/david/django-storages/src/1574890d87be/storages/backends/s3boto.py#cl-298

What storage backend do you use that doesn't have a modified_time method?

> This is a bit confusing... why call it last_modified when that doesn't 
> necessarily reflect what it's doing? It would be more flexible to create two 
> methods:

It's called modified_time, not last_modified. 

> def modification_identifier(self):
> 
> def has_changed(self):
> 
> Then, any backend could implement these however they might like, and 
> collectstatic would have no excuse in uploading the same file more than once. 
> Overloading last_modified to also do things like calculate md5's seems a bit 
> hacky to me, and confusing for any developer maintaining a custom storage 
> backend that doesn't support last modified.

I disagree, modified_time is perfectly capable of handling your use case.

Jannis




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-07 Thread ptone
so after scanning this thread and the ticket again - it is still unclear 
whether there could be a completely universal solution.

While it would be nice if the storage API had a checksum(name) or md5(name) 
method - not all custom storage backends are going to support a single 
checksum standard.  S3 doesn't explicitly support MD5 (apparently it 
unofficially does through ETags).  Without a universal checksum - you can't 
use it to compare files across arbitrary backends.

I do agree that hacking modified_time return value is a little ugly - the 
API is clearly documented as "returns a datetime..." - so returning an MD5 
checksum there is, well, hacky.

If you are passionate about moving this forward, here is what I'd suggest.

Implement, document, and test .md5(name) as a standard method on storage 
backends - like modified_time this would raise NotImplementedError if not 
available - this could easily be its own ticket. md5 is probably the 
closest you'll get to a checksum standard.

Once you have an md5 method defined for backends - you could support a 
--md5 option to collectstatic that would use that as the target/source 
comparison.
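
[Editor's note: a sketch of what the proposed API might look like, with hypothetical classes mirroring how modified_time signals lack of support by raising NotImplementedError.]

```python
import hashlib
import os

class Storage:
    """Base sketch: md5(name) raises NotImplementedError when the
    backend can't provide a checksum, like modified_time does."""

    def md5(self, name):
        raise NotImplementedError("This backend doesn't provide md5 checksums.")

class FileSystemStorage(Storage):
    """Hypothetical local implementation hashing the file on disk."""

    def __init__(self, location="."):
        self.location = location

    def md5(self, name):
        digest = hashlib.md5()
        with open(os.path.join(self.location, name), "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()
```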

Another workaround is to just use collectstatic locally - and rsync 
--checksum to your remote if it supports rsync.

-Preston


On Sunday, October 7, 2012 8:59:16 PM UTC-7, Dan Loewenherz wrote:
>
> This issue just got me again tonight, so I'll try to push once more on 
> this issue. It seems right now most people don't care that this is broken, 
> which is a bummer, but in which case I'll just continue using my working 
> solution.
>
> Dan
>
> On Sat, Oct 6, 2012 at 10:48 AM, Dan Loewenherz 
> > wrote:
>
>> Hey Jannis,
>>
>> On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel 
>> > wrote:
>>
>>>
>>> On 30.09.2012, at 23:41, Dan Loewenherz > 
>>> wrote:
>>>
>>> > Many backends don't support last modified times, and even if they all 
>>> did, it's incorrect to assume that last modified time is an accurate 
>>> heuristic for whether a file has already been uploaded or not.
>>>
>>> Well but it's an accurate way to decide whether a file has been changed 
>>> on the filesystem, and that's what collectstatic cares about. The storage 
>>> backend *is* the API to extend that when needed, so feel free to use it.
>>>
>>
>> It's accurate *only* in certain situations. And on a distributed 
>> development team, I've run into a lot of issues with developers re-uploading 
>> files that have already been uploaded because they just recently updated 
>> their repo.
>>
>> A checksum is the only truly accurate method to determine if a file has 
>> changed.
>>
>> Additionally, you didn't address my point that I quoted from. Storage 
>> backends don't just reflect filesystems--they could reflect files stored in 
>> a database, S3, etc. And some of these filesystems don't support last 
>> modified times.
>>  
>> > It might be a better idea to let the backends decide when a file has 
>>> been changed (instead of just calling the backend's last modified method).
>>>
>>> I don't understand, you can easily implement exactly that in the 
>>> last_modified method if you'd like.
>>>
>>
>> This is a bit confusing...why call it last_modified when that doesn't 
>> necessarily reflect what it's doing? It would be more flexible to create 
>> two methods:
>>
>> def modification_identifier(self):
>>
>> def has_changed(self):
>>
>> Then, any backend could implement these however they might like, and 
>> collectstatic would have no excuse in uploading the same file more than 
>> once. Overloading last_modified to also do things like calculate md5's 
>> seems a bit hacky to me, and confusing for any developer maintaining a 
>> custom storage backend that doesn't support last modified.
>>  
>> Dan
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/django-developers/-/weKD2x1XY4oJ.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-07 Thread Dan Loewenherz
This issue just got me again tonight, so I'll try to push on it once more.
It seems that right now most people don't care that this is broken, which
is a bummer; in that case I'll just continue using my working solution.

Dan

On Sat, Oct 6, 2012 at 10:48 AM, Dan Loewenherz  wrote:

> Hey Jannis,
>
> On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel  wrote:
>
>>
>> On 30.09.2012, at 23:41, Dan Loewenherz  wrote:
>>
>> > Many backends don't support last modified times, and even if they all
>> did, it's incorrect to assume that last modified time is an accurate
>> heuristic for whether a file has already been uploaded or not.
>>
>> Well but it's an accurate way to decide whether a file has been changed
>> on the filesystem, and that's what collectstatic cares about. The storage
>> backend *is* the API to extend that when needed, so feel free to use it.
>>
>
> It's accurate *only* in certain situations. And on a distributed
> development team, I've run into a lot of issues with developers
> re-uploading files that have already been uploaded because they just
> recently updated their repo.
>
> A checksum is the only truly accurate method to determine whether a file
> has changed.
>
> Additionally, you didn't address my point that I quoted from. Storage
> backends don't just reflect filesystems--they could reflect files stored in
> a database, S3, etc. And some of these filesystems don't support last
> modified times.
>
> > It might be a better idea to let the backends decide when a file has
>> been changed (instead of just calling the backend's last modified method).
>>
>> I don't understand, you can easily implement exactly that in the
>> last_modified method if you'd like.
>>
>
> This is a bit confusing...why call it last_modified when that doesn't
> necessarily reflect what it's doing? It would be more flexible to create
> two methods:
>
> def modification_identifier(self):
>
> def has_changed(self):
>
> Then, any backend could implement these however they might like, and
> collectstatic would have no excuse in uploading the same file more than
> once. Overloading last_modified to also do things like calculate md5's
> seems a bit hacky to me, and confusing for any developer maintaining a
> custom storage backend that doesn't support last modified.
>
> Dan
>




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-06 Thread Dan Loewenherz
Hey Jannis,

On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel  wrote:

>
> On 30.09.2012, at 23:41, Dan Loewenherz  wrote:
>
> > Many backends don't support last modified times, and even if they all
> did, it's incorrect to assume that last modified time is an accurate
> heuristic for whether a file has already been uploaded or not.
>
> Well but it's an accurate way to decide whether a file has been changed on
> the filesystem, and that's what collectstatic cares about. The storage
> backend *is* the API to extend that when needed, so feel free to use it.
>

It's accurate *only* in certain situations. And on a distributed
development team, I've run into a lot of issues with developers
re-uploading files that have already been uploaded because they just
recently updated their repo.

A checksum is the only truly accurate method to determine whether a file
has changed.

Additionally, you didn't address my point that I quoted from. Storage
backends don't just reflect filesystems--they could reflect files stored in
a database, S3, etc. And some of these filesystems don't support last
modified times.

> It might be a better idea to let the backends decide when a file has been
> changed (instead of just calling the backend's last modified method).
>
> I don't understand, you can easily implement exactly that in the
> last_modified method if you'd like.
>

This is a bit confusing...why call it last_modified when that doesn't
necessarily reflect what it's doing? It would be more flexible to create
two methods:

def modification_identifier(self):

def has_changed(self):

Then, any backend could implement these however they might like, and
collectstatic would have no excuse in uploading the same file more than
once. Overloading last_modified to also do things like calculate md5's
seems a bit hacky to me, and confusing for any developer maintaining a
custom storage backend that doesn't support last modified.

Dan
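To make the proposal concrete, here is one hedged sketch of how the two methods could relate. The method names come from the message above; the token semantics and the fallback behavior are assumptions for illustration.

```python
class Storage:
    """Sketch of the two proposed backend hooks."""

    def modification_identifier(self, name):
        # Opaque per-file token: an mtime, an ETag, an MD5 -- whatever
        # the backend can produce cheaply. Raises when unsupported.
        raise NotImplementedError

    def has_changed(self, name, source_identifier):
        # Backends may override this outright; this default compares
        # tokens and errs on the side of copying when it cannot tell.
        try:
            return self.modification_identifier(name) != source_identifier
        except NotImplementedError:
            return True
```

collectstatic would then ask the target storage `has_changed(path, token)` with a token produced by the source storage, and skip the copy when it returns False.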




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-01 Thread Jannis Leidel

On 30.09.2012, at 23:41, Dan Loewenherz  wrote:

> Many backends don't support last modified times, and even if they all did, 
> it's incorrect to assume that last modified time is an accurate heuristic for 
> whether a file has already been uploaded or not.

Well but it's an accurate way to decide whether a file has been changed on the 
filesystem, and that's what collectstatic cares about. The storage backend *is* 
the API to extend that when needed, so feel free to use it.

> It might be a better idea to let the backends decide when a file has been 
> changed (instead of just calling the backend's last modified method).

I don't understand, you can easily implement exactly that in the last_modified 
method if you'd like.

Jannis




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-10-01 Thread Jannis Leidel

On 28.09.2012, at 05:38, Dan Loewenherz  wrote:

> On Thu, Sep 27, 2012 at 4:13 PM, Carl Meyer  wrote:
> Hi Dan,
> 
> On 09/27/2012 04:47 PM, Dan Loewenherz wrote:
> > Just updated the ticket.
> >
> > As I commented, the heuristic for checking if a file has been modified
> > lies in line 282 of collectstatic.py:
> >
> > if not prefixed_path in self.copied_files:
> >     return self.log("Skipping '%s' (already copied earlier)" % path)
> > https://github.com/django/django/blob/master/django/contrib/staticfiles/management/commands/collectstatic.py#L282
> >
> > This seems off, since a path may stay the same but a file's contents may
> > change.
> 
> That's not checking whether the file has been modified, that's checking
> whether the same source file was previously copied in the same
> collectstatic run (due to overlapping/duplicate file sources of some kind).
> 
> The check for modification date is up on line 234 in the delete_file
> method, which is called by both link_file and copy_file.
> 
> Thanks, I missed that.
> 
> I still see an issue here. In any sort of source control, when a user updates 
> their repo, local files that were updated remotely show up as modified at the 
> time the repo is cloned or updated, not when the file was actually last saved 
> by the last author. You then have the same scenario I pointed to earlier: 
> when multiple people work on a project, they will re-upload the same files 
> multiple times.
> 
> Don't get me wrong here--I'm happy to see there is some sort of check here to 
> avoid collectstatic'ing the same file, but I think it's warranted to push 
> back on the use of last modified as the heuristic. I think using a checksum 
> would work much better, and solves the problem this is trying to solve in a 
> much more foolproof way. With all this said, I don't think this logic belongs 
> in a "delete_file" method. IMO it'd be better to separate out the logic of 
> "does this file already exist, if so, skip it" from "delete this file if it 
> exists on the target" (delete_file).
> 
> @Karen--thanks for digging up that SO post. It actually is super relevant 
> here. As mentioned, when uploading files to S3, it's quite time consuming to 
> perform a network round trip to pick up file metadata (such as last modified 
> time) for each file you're uploading. This was the initial reason I chose to 
> store this data in a single location that I'd only need to grab once.

Dan,

I added prefetching the metadata to the Boto based storage backend in 
django-storages a while ago to reduce the time spent on querying S3.

See 
https://bitbucket.org/david/django-storages/src/1574890d87be/storages/backends/s3boto.py#cl-42

You can either set the setting AWS_PRELOAD_METADATA to True or subclass 
S3BotoStorage and pass preload_metadata=True into the constructor.
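In settings terms, the first option is simply the following fragment (the setting name is as given above; where it lives in your settings module is up to you):

```python
# settings.py -- have the Boto-based S3 backend fetch all key metadata
# in one bucket listing up front, instead of issuing one round trip
# per file during collectstatic.
AWS_PRELOAD_METADATA = True
```

The subclass route is equivalent: pass preload_metadata=True to the S3BotoStorage constructor.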

Hope that helps,

Jannis




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-09-30 Thread Dan Loewenherz
I think it's an API limitation. Many backends don't support last modified
times, and even if they all did, it's incorrect to assume that last
modified time is an accurate heuristic for whether a file has already been
uploaded or not. It might be a better idea to let the backends decide when
a file has been changed (instead of just calling the backend's last
modified method).

Dan

On Sun, Sep 30, 2012 at 2:28 PM, Stephen Burrows <
stephen.r.burr...@gmail.com> wrote:

> I use S3 as well, and I have seen cases where files get copied that I know
> don't need to be. That being said, it was never so slow that it was an
> issue for me. Is there clear evidence that this is something which can't be
> handled by the S3 backend due to an inadequate API on the django side?
>




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-09-30 Thread Stephen Burrows
I use S3 as well, and I have seen cases where files get copied that I know 
don't need to be. That being said, it was never so slow that it was an 
issue for me. Is there clear evidence that this is something which can't be 
handled by the S3 backend due to an inadequate API on the django side?




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-09-27 Thread James Rivett-Carnac
+1
I like this implementation. 




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-09-27 Thread Dan Loewenherz
On Thu, Sep 27, 2012 at 4:13 PM, Carl Meyer  wrote:

> Hi Dan,
>
> On 09/27/2012 04:47 PM, Dan Loewenherz wrote:
> > Just updated the ticket.
> >
> > As I commented, the heuristic for checking if a file has been modified
> > lies in line 282 of collectstatic.py:
> >
> > if not prefixed_path in self.copied_files:
> >     return self.log("Skipping '%s' (already copied earlier)" % path)
> >
> https://github.com/django/django/blob/master/django/contrib/staticfiles/management/commands/collectstatic.py#L282
> >
> > This seems off, since a path may stay the same but a file's contents may
> > change.
>
> That's not checking whether the file has been modified, that's checking
> whether the same source file was previously copied in the same
> collectstatic run (due to overlapping/duplicate file sources of some kind).
>
> The check for modification date is up on line 234 in the delete_file
> method, which is called by both link_file and copy_file.
>

Thanks, I missed that.

I still see an issue here. In any sort of source control, when a user
updates their repo, local files that were updated remotely show up as
modified at the time the repo is cloned or updated, not when the file was
actually last saved by the last author. You then have the same scenario I
pointed to earlier: when multiple people work on a project, they will
re-upload the same files multiple times.

Don't get me wrong here--I'm happy to see there is some sort of check here
to avoid collectstatic'ing the same file, but I think it's warranted to
push back on the use of last modified as the heuristic. I think using a
checksum would work much better, and solves the problem this is trying to
solve in a much more foolproof way. With all this said, I don't think this
logic belongs in a "delete_file" method. IMO it'd be better to separate out
the logic of "does this file already exist, if so, skip it" from "delete
this file if it exists on the target" (delete_file).

@Karen--thanks for digging up that SO post. It actually is super relevant
here. As mentioned, when uploading files to S3, it's quite time consuming
to perform a network round trip to pick up file metadata (such as last
modified time) for each file you're uploading. This was the initial reason I
chose to store this data in a single location that I'd only need to grab
once.

Dan




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-09-27 Thread Carl Meyer
Hi Dan,

On 09/27/2012 04:47 PM, Dan Loewenherz wrote:
> Just updated the ticket.
> 
> As I commented, the heuristic for checking if a file has been modified
> lies in line 282 of collectstatic.py:
> 
> if not prefixed_path in self.copied_files:
>     return self.log("Skipping '%s' (already copied earlier)" % path)
> https://github.com/django/django/blob/master/django/contrib/staticfiles/management/commands/collectstatic.py#L282
> 
> This seems off, since a path may stay the same but a file's contents may
> change.

That's not checking whether the file has been modified, that's checking
whether the same source file was previously copied in the same
collectstatic run (due to overlapping/duplicate file sources of some kind).

The check for modification date is up on line 234 in the delete_file
method, which is called by both link_file and copy_file.

Carl




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-09-27 Thread Dan Loewenherz
Just updated the ticket.

As I commented, the heuristic for checking if a file has been modified lies
in line 282 of collectstatic.py:

if not prefixed_path in self.copied_files:
    return self.log("Skipping '%s' (already copied earlier)" % path)
https://github.com/django/django/blob/master/django/contrib/staticfiles/management/commands/collectstatic.py#L282

This seems off, since a path may stay the same but a file's contents may
change.

Also, the existing functionality doesn't work when multiple people are
running collectstatic (or the same person is running it on multiple
computers).

Dan

On Thu, Sep 27, 2012 at 3:12 PM, Karen Tracey  wrote:

> On Thu, Sep 27, 2012 at 12:51 PM, Dan Loewenherz wrote:
>
>> The problem I've run into is that collectstatic copies all files,
>> regardless of whether they already exist on the destination.
>
>
> No, as noted in the ticket, which has been closed needsinfo, staticfiles
> already only copies modified files. And I don't think that's a
> recently-added feature, so I suspect the behavior you are seeing has more
> to do with the S3 side than the staticfiles code. This post:
>
>
> http://stackoverflow.com/questions/6618013/django-staticfiles-and-amazon-s3-how-to-detect-modified-files
>
> looks enlightening, though even that is not all that new. It sounds like
> the problem here is efficiently getting the last-modified time for the
> files out in S3, not with staticfiles.
>
> Karen
>




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-09-27 Thread Karen Tracey
On Thu, Sep 27, 2012 at 12:51 PM, Dan Loewenherz wrote:

> The problem I've run into is that collectstatic copies all files,
> regardless of whether they already exist on the destination.


No, as noted in the ticket, which has been closed needsinfo, staticfiles
already only copies modified files. And I don't think that's a
recently-added feature, so I suspect the behavior you are seeing has more
to do with the S3 side than the staticfiles code. This post:

http://stackoverflow.com/questions/6618013/django-staticfiles-and-amazon-s3-how-to-detect-modified-files

looks enlightening, though even that is not all that new. It sounds like the
problem here is efficiently getting the last-modified time for the files
out in S3, not with staticfiles.

Karen




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-09-27 Thread mjl Martin J. Laubach
Good idea, but shouldn't it be a per-storage thing? Perhaps this could be 
done with a couple of callbacks in the collectstatic run:

   - Before collectstatic starts, so the storage backend can pick up its 
   inventory from the remote
   - One called for each file that would be copied, and that can veto a 
   copy operation
   - At the end of the collectstatic run, to update the inventory

That would make things customisable even for strange setups like storing 
files in a database or whatnot?

   Cheers,

mjl
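As an editor's sketch, the three callbacks suggested above might hang off the storage backend roughly like this. All class, method, and field names here are invented, and the path-to-checksum inventory format is an assumption.

```python
class InventoryStorage:
    """Hypothetical hooks for the three collectstatic callbacks."""

    def __init__(self, inventory=None):
        self.inventory = dict(inventory or {})  # path -> checksum

    # 1. Before collectstatic starts: pull the remote inventory once.
    def collect_started(self):
        self.inventory = self.load_remote_inventory()

    # 2. Per file: veto the copy when nothing changed.
    def should_copy(self, path, checksum):
        return self.inventory.get(path) != checksum

    # 3. After the run: push the updated inventory back.
    def collect_finished(self, copied):
        self.inventory.update(copied)
        self.save_remote_inventory(self.inventory)

    # Persistence is backend-specific (an S3 object, a database row...).
    def load_remote_inventory(self):
        raise NotImplementedError

    def save_remote_inventory(self, inventory):
        raise NotImplementedError
```

Because only the two persistence methods touch the backend, the same hooks would work for a database-backed storage just as well as for S3.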




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-09-27 Thread Anders Steinlein
On Thu, Sep 27, 2012 at 6:51 PM, Dan Loewenherz wrote:

> Hey all!
>
> This is a feature request / proposal (one which I'm willing to build out,
> given that I've already developed a solution for my own uploader).
>
> [...]
>
> I'll contribute the patch. I know there is not a lot of time before the
> feature freeze, but I'm willing to make this happen if there's interest.
>

Yes, please. I've been wanting this myself. This should IMHO be the default
for collectstatic, with a flag to *force* copying all files.

-- 
*Anders Steinlein*
*Eliksir AS*
http://e5r.no

E-post: and...@e5r.no
Mobil: +47 926 13 069
Twitter: @asteinlein




Re: Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-09-27 Thread Babatunde Akinyanmi
I like this feature and have recently been thinking of implementing
something like it myself.
+1 for the feature request.

On 9/27/12, Dan Loewenherz  wrote:
> Hey all!
>
> This is a feature request / proposal (one which I'm willing to build out,
> given that I've already developed a solution for my own uploader).
>
> I run a consulting business that helps small startups build initial MVPs.
> When the time ultimately comes to decide how to store static assets, my
> preference (as is that of many others) is to use Amazon S3, with Amazon
> CloudFront acting as a CDN to the appropriate buckets. For the purposes of
> this ticket, s/S3/your object storage of choice/g.
>
> Now, Django has an awesome mechanism for making sure static assets are up
> on S3. With the appropriate static storage backend, running ./manage.py
> collectstatic just searches through your static media folders and copies
> the files.
>
> The problem I've run into is that collectstatic copies all files,
> regardless of whether they already exist on the destination. Uploading
> 5-10MB of files is pretty wasteful (and time consuming) when none of the
> files have changed and no new files have been added.
>
> As I wrote in the trac ticket
> (https://code.djangoproject.com/ticket/19021),
> my current solution was to write a management command that does essentially
> the same thing that collectstatic does. But there are a few differences.
> Here's a rundown (copied straight from the trac ticket).
>
> I currently solve this problem by creating a file containing metadata of
>> all the static media at the root of the destination. This file is a JSON
>> object that contains file paths as keys and checksum as values. When an
>> upload is started, the uploader checks to see if the file path exists as
>> a
>> key in the dictionary. If it does, it checks to see if the checksums have
>> changed. If they haven't changed, the uploader skips the file. At the end
>> of the upload, the checksum file is updated on the destination.
>
>
> I'll contribute the patch. I know there is not a lot of time before the
> feature freeze, but I'm willing to make this happen if there's interest.
>
> If we don't want to change behavior, perhaps adding a flag such as
> --skip-unchanged-files to the collectstatic command is the way to go?
>
> All the best,
> Dan
>

-- 
Sent from my mobile device




Feature request: collectstatic shouldn't recopy files that already exist in destination

2012-09-27 Thread Dan Loewenherz
Hey all!

This is a feature request / proposal (one which I'm willing to build out,
given that I've already developed a solution for my own uploader).

I run a consulting business that helps small startups build initial MVPs.
When the time ultimately comes to decide how to store static assets, my
preference (as is that of many others) is to use Amazon S3, with Amazon
CloudFront acting as a CDN to the appropriate buckets. For the purposes of
this ticket, s/S3/your object storage of choice/g.

Now, Django has an awesome mechanism for making sure static assets are up
on S3. With the appropriate static storage backend, running ./manage.py
collectstatic just searches through your static media folders and copies
the files.

The problem I've run into is that collectstatic copies all files,
regardless of whether they already exist on the destination. Uploading
5-10MB of files is pretty wasteful (and time consuming) when none of the
files have changed and no new files have been added.

As I wrote in the trac ticket (https://code.djangoproject.com/ticket/19021),
my current solution was to write a management command that does essentially
the same thing that collectstatic does. But there are a few differences.
Here's a rundown (copied straight from the trac ticket).

I currently solve this problem by creating a file containing metadata of
> all the static media at the root of the destination. This file is a JSON
> object that contains file paths as keys and checksum as values. When an
> upload is started, the uploader checks to see if the file path exists as a
> key in the dictionary. If it does, it checks to see if the checksums have
> changed. If they haven't changed, the uploader skips the file. At the end
> of the upload, the checksum file is updated on the destination.


I'll contribute the patch. I know there is not a lot of time before the
feature freeze, but I'm willing to make this happen if there's interest.

If we don't want to change behavior, perhaps adding a flag such as
--skip-unchanged-files to the collectstatic command is the way to go?

All the best,
Dan
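The manifest scheme described above can be sketched in a few lines. The manifest is the JSON object of path-to-checksum pairs stored at the destination root; the function and parameter names below are illustrative, not from the actual uploader.

```python
import hashlib


def file_checksum(path, chunk_size=64 * 1024):
    """MD5 hex digest of a local file, read in chunks."""
    hasher = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def files_to_upload(local_paths, manifest):
    """Given the stored manifest (path -> checksum), return the files
    that actually need uploading, plus the refreshed manifest to write
    back to the destination at the end of the run."""
    updated = dict(manifest)
    pending = []
    for path in local_paths:
        digest = file_checksum(path)
        if manifest.get(path) != digest:
            pending.append(path)
            updated[path] = digest
    return pending, updated
```

With the manifest fetched once per run, no per-file round trips to the destination are needed at all.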
