Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
Thanks, Florian. That clarifies things. I didn't realize collectstatic was using two backends, but it makes sense. And thanks for putting up with me not being up to speed on the IRC discussion.

On Tuesday, October 9, 2012 1:20:39 AM UTC-7, Florian Apolloner wrote:
> Hi Stephen,
>
> On Tuesday, October 9, 2012 7:28:43 AM UTC+2, Stephen Burrows wrote:
>> I'm a little confused by the track the discussion took recently... my impression was that the solution would *not* be to change from last_modified to a checksum, or to add an additional checksum method. Instead, storage backends could support a has_changed method, which could be overridden to provide last_modified checking *or* checksum comparisons - or any other method of checking whether a file has changed. This seems like a useful, easy-to-explain, very generic way to handle checking whether a file has changed.
>
> This would be one way to do it (we've also discussed that in IRC), but we couldn't figure out any implementation that would actually be usable. To make this work you'd need to have a function signature like "def has_changed(file_name, other_thing)" where other_thing would be some hash/modified_time/whatever provided by the source storage. Now we are back to the situation I described in my answer to Jeremy…
>
>> And since what staticfiles actually cares about is whether the file has changed, it seems like it would make more sense to use a method that checks whether the file has changed, rather than just checking the last modification date.
>
> Well, staticfiles doesn't care; it's only collectstatic which cares, to some extent.
> So in my opinion the cleanest way (which means without coupling the two needed storage backends together too strongly) is to provide your own collectstatic command where you can do what you want with the checks (if you have problems implementing that we'd happily move some code to extra methods to make reusing collectstatic easier).
>
>> Would I be correct in thinking that the main argument against this is API bloat, or is there something else that I'm not seeing?
>
> I'd rather say: we don't see __any__ nice way to actually support something which is generic enough to be useful.
>
> Cheers,
> Florian

-- You received this message because you are subscribed to the Google Groups "Django developers" group. To view this discussion on the web visit https://groups.google.com/d/msg/django-developers/-/A7G6vLq58aYJ. To post to this group, send email to django-developers@googlegroups.com. To unsubscribe from this group, send email to django-developers+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/django-developers?hl=en.
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
Hi Stephen,

On Tuesday, October 9, 2012 7:28:43 AM UTC+2, Stephen Burrows wrote:
> I'm a little confused by the track the discussion took recently... my impression was that the solution would *not* be to change from last_modified to a checksum, or to add an additional checksum method. Instead, storage backends could support a has_changed method, which could be overridden to provide last_modified checking *or* checksum comparisons - or any other method of checking whether a file has changed. This seems like a useful, easy-to-explain, very generic way to handle checking whether a file has changed.

This would be one way to do it (we've also discussed that in IRC), but we couldn't figure out any implementation that would actually be usable. To make this work you'd need to have a function signature like "def has_changed(file_name, other_thing)" where other_thing would be some hash/modified_time/whatever provided by the source storage. Now we are back to the situation I described in my answer to Jeremy…

> And since what staticfiles actually cares about is whether the file has changed, it seems like it would make more sense to use a method that checks whether the file has changed, rather than just checking the last modification date.

Well, staticfiles doesn't care; it's only collectstatic which cares, to some extent. So in my opinion the cleanest way (which means without coupling the two needed storage backends together too strongly) is to provide your own collectstatic command where you can do what you want with the checks (if you have problems implementing that we'd happily move some code to extra methods to make reusing collectstatic easier).

> Would I be correct in thinking that the main argument against this is API bloat, or is there something else that I'm not seeing?

I'd rather say: we don't see __any__ nice way to actually support something which is generic enough to be useful.
Cheers,
Florian
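[A minimal sketch of the has_changed idea being weighed above — hypothetical names, not Django's API. It shows why the hook is hard to make generic: the `fingerprint` argument is produced by the *source* storage, so both backends must agree on what it is (an md5 hex digest here; a backend handing over an mtime datetime would never match):

```python
import hashlib

class DestinationStorage:
    """Hypothetical upload backend implementing the proposed hook."""

    def __init__(self, stored):
        self.stored = stored  # name -> bytes already at the destination

    def has_changed(self, name, fingerprint):
        # `fingerprint` comes from the source storage; the comparison is
        # only meaningful if both sides fingerprint the same way.
        if name not in self.stored:
            return True
        return hashlib.md5(self.stored[name]).hexdigest() != fingerprint

dest = DestinationStorage({"css/app.css": b"body { margin: 0 }"})
fp = hashlib.md5(b"body { margin: 0 }").hexdigest()  # source-side fingerprint
print(dest.has_changed("css/app.css", fp))   # False: skip the copy
print(dest.has_changed("js/new.js", fp))     # True: new file, copy it
```

]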
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
Hi Jeremy,

On Tuesday, October 9, 2012 5:15:04 AM UTC+2, jdunck wrote:
> Would it be reasonable to have a backend-specific hook to determine a fingerprint, where that could be mtime or md5 or whathaveyou, as long as equality (or maybe ordering) works?

Given our discussion in IRC yesterday the answer seems to be no. E.g. consider this: collectstatic actually uses two backends, a "source" storage backend and an "upload" storage backend. Now if the storage backends are allowed to return any fingerprint, how would you make sure that both of these backends actually use the same fingerprint? So essentially, to have an S3 backend which (let's just abuse modified_time for now) returns an md5 hash from modified_time, you'd have to also change the other backend you want to use (usually just the local filesystem backend) to return md5 instead…

Cheers,
Florian
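[Florian's two-backend point, made concrete with illustrative values only: if the source storage fingerprints by mtime while the upload storage fingerprints by md5, the comparison can never report "unchanged", so every file is re-copied:

```python
import hashlib
from datetime import datetime

# Fingerprint from the "source" storage: a filesystem mtime.
source_fingerprint = datetime(2012, 10, 9, 7, 28, 43)

# Fingerprint from the "upload" storage: an md5 digest (e.g. an S3 ETag).
upload_fingerprint = hashlib.md5(b"body { margin: 0 }").hexdigest()

# A datetime never compares equal to a hex string, so "has the file
# changed?" degenerates to "yes, always" - the mismatch described above.
print(source_fingerprint == upload_fingerprint)  # False for every file
```

]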
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
I'm a little confused by the track the discussion took recently... my impression was that the solution would *not* be to change from last_modified to a checksum, or to add an additional checksum method. Instead, storage backends could support a has_changed method, which could be overridden to provide last_modified checking *or* checksum comparisons - or any other method of checking whether a file has changed. This seems like a useful, easy-to-explain, very generic way to handle checking whether a file has changed. And since what staticfiles actually cares about is whether the file has changed, it seems like it would make more sense to use a method that checks whether the file has changed, rather than just checking the last modification date. Would I be correct in thinking that the main argument against this is API bloat, or is there something else that I'm not seeing?

On Thursday, September 27, 2012 9:51:52 AM UTC-7, Dan Loewenherz wrote:
> Hey all!
>
> This is a feature request / proposal (one which I'm willing to build out, given that I've already developed a solution for my own uploader).
>
> I run a consulting business that helps small startups build initial MVPs. When the time ultimately comes to deciding how to store static assets, my preference (as is that of many others) is to use Amazon S3, with Amazon CloudFront acting as a CDN to the appropriate buckets. For the purposes of this ticket, s/S3/your object storage of choice/g.
>
> Now, Django has an awesome mechanism for making sure static assets are up on S3. With the appropriate static storage backend, running ./manage.py collectstatic just searches through your static media folders and copies the files.
>
> The problem I've run into is that collectstatic copies all files, regardless of whether they already exist on the destination. Uploading 5-10MB of files is pretty wasteful (and time consuming) when none of the files have changed and no new files have been added.
> As I wrote in the trac ticket (https://code.djangoproject.com/ticket/19021), my current solution was to write a management command that does essentially the same thing that collectstatic does. But, there are a few differences. Here's a rundown (copied straight from the trac ticket).
>
>> I currently solve this problem by creating a file containing metadata of all the static media at the root of the destination. This file is a JSON object that contains file paths as keys and checksums as values. When an upload is started, the uploader checks to see if the file path exists as a key in the dictionary. If it does, it checks to see if the checksums have changed. If they haven't changed, the uploader skips the file. At the end of the upload, the checksum file is updated on the destination.
>
> I'll contribute the patch. I know there is not a lot of time before the feature freeze, but I'm willing to make this happen if there's interest.
>
> If we don't want to change behavior, perhaps adding a flag such as --skip-unchanged-files to the collectstatic command is the way to go?
>
> All the best,
> Dan
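[The manifest scheme Dan describes is easy to sketch outside Django. This is not his actual management command, just an illustration of the bookkeeping: a JSON map of path -> checksum kept at the destination, consulted before each upload and rewritten afterwards:

```python
import hashlib
import json

def sync(files, manifest_json=None):
    """Return (paths to upload, updated manifest JSON).

    `files` maps path -> content bytes; `manifest_json` is the checksum
    manifest previously stored at the destination, if any.
    """
    manifest = json.loads(manifest_json) if manifest_json else {}
    to_upload = []
    for path, data in sorted(files.items()):
        digest = hashlib.md5(data).hexdigest()
        if manifest.get(path) != digest:
            to_upload.append(path)   # a real command would copy here
            manifest[path] = digest
    return to_upload, json.dumps(manifest)

files = {"css/app.css": b"body {}", "js/app.js": b"console.log('hi')"}
first, manifest = sync(files)        # cold start: everything uploads
second, _ = sync(files, manifest)    # nothing changed: all skipped
print(first)   # ['css/app.css', 'js/app.js']
print(second)  # []
```

]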
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
Would it be reasonable to have a backend-specific hook to determine a fingerprint, where that could be mtime or md5 or whathaveyou, as long as equality (or maybe ordering) works?

On Oct 8, 2012, at 10:23 AM, Alex Ogier wrote:
> On Mon, Oct 8, 2012 at 1:06 PM, ptone wrote:
>> While git may be common, and your problem not unique - this is still a condition of your dev environment rendering modification dates invalid. There might be other situations where this is the case (I've run into scripts that muck with modification dates based on camera/jpeg metadata).
>>
>> So after some further discussion on IRC - it was determined that md5, while somewhat common, was far from a standard, and was likely not to be available as a remote call for network-based storage backends. And so the final resolution is to wontfix the ticket.
>>
>> In the end - this lack of a universal fingerprint is just a limitation of our storage tools.
>>
>> -Preston
>
> Is there a reason this fingerprint must be universal? If you're dealing with a backend like S3, where network latency and expensive writes are a problem, but md5 is a builtin remote call (available on any GET), why not just do an md5 sum in the _save() method? Basically, just buffer the File object you receive, take an md5 in python, and then make a decision whether to upload or not. In the common case of reading from local disk and writing to S3, this is a big win, and doesn't require cooperation from any other backends, or standardizing on md5 as a fingerprint method.
>
> Best,
> Alex Ogier
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
On Mon, Oct 8, 2012 at 1:06 PM, ptone wrote:
> While git may be common, and your problem not unique - this is still a condition of your dev environment rendering modification dates invalid. There might be other situations where this is the case (I've run into scripts that muck with modification dates based on camera/jpeg metadata).
>
> So after some further discussion on IRC - it was determined that md5, while somewhat common, was far from a standard, and was likely not to be available as a remote call for network-based storage backends. And so the final resolution is to wontfix the ticket.
>
> In the end - this lack of a universal fingerprint is just a limitation of our storage tools.
>
> -Preston

Is there a reason this fingerprint must be universal? If you're dealing with a backend like S3, where network latency and expensive writes are a problem, but md5 is a builtin remote call (available on any GET), why not just do an md5 sum in the _save() method? Basically, just buffer the File object you receive, take an md5 in python, and then make a decision whether to upload or not. In the common case of reading from local disk and writing to S3, this is a big win, and doesn't require cooperation from any other backends, or standardizing on md5 as a fingerprint method.

Best,
Alex Ogier
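[Alex's suggestion needs no new storage API, only a backend that checks before it writes. A rough sketch under stated assumptions: the in-memory `remote` dict stands in for S3, and a real backend would compare against the object's ETag fetched via a HEAD request instead:

```python
import hashlib

class ChecksumSkippingStorage:
    """Sketch of a backend whose _save() skips unchanged uploads."""

    def __init__(self):
        self.remote = {}   # name -> (md5 hexdigest, bytes); stands in for S3
        self.puts = 0      # count of actual uploads performed

    def _remote_md5(self, name):
        entry = self.remote.get(name)
        return entry[0] if entry else None

    def _save(self, name, content):
        # Buffer the content and fingerprint it locally.
        data = content if isinstance(content, bytes) else content.read()
        digest = hashlib.md5(data).hexdigest()
        if self._remote_md5(name) == digest:
            return name                 # identical content: skip the PUT
        self.remote[name] = (digest, data)
        self.puts += 1                  # the expensive network write
        return name

storage = ChecksumSkippingStorage()
storage._save("css/app.css", b"body {}")
storage._save("css/app.css", b"body {}")   # second save is a no-op
print(storage.puts)  # 1
```

As Alex notes, this works without cooperation from the source backend, because the decision is made entirely on the destination side.]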
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
On Monday, October 8, 2012 8:49:58 AM UTC-7, Dan Loewenherz wrote:
> On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel wrote:
>> Then, frankly, this is a problem of the storage backends, not Django's. The S3BotoStorage backend *does* have a modified_time method:
>>
>> https://bitbucket.org/david/django-storages/src/1574890d87be/storages/backends/s3boto.py#cl-298
>>
>> What storage backend do you use that doesn't have a modified_time method?
>
> I don't think you're seeing the problem I'm having. I'm working with a distributed team using git. This means when we check out files, the local modified time is the time at which I checked the files out, not the time at which the files were actually last modified.
>
> As a result, it's a questionable metric for figuring out if a file is the same or not, since every team member's local machine thinks they were all just created! We end up re-uploading the file every time.

While git may be common, and your problem not unique - this is still a condition of your dev environment rendering modification dates invalid. There might be other situations where this is the case (I've run into scripts that muck with modification dates based on camera/jpeg metadata).

So after some further discussion on IRC - it was determined that md5, while somewhat common, was far from a standard, and was likely not to be available as a remote call for network-based storage backends. And so the final resolution is to wontfix the ticket.

In the end - this lack of a universal fingerprint is just a limitation of our storage tools.

-Preston

> This is a bit confusing... why call it last_modified when that doesn't necessarily reflect what it's doing? It would be more flexible to create two methods:
>
>> It's called modified_time, not last_modified.
>
> Sorry, typo.
> def modification_identifier(self):
>
> def has_changed(self):
>
> Then, any backend could implement these however they might like, and collectstatic would have no excuse in uploading the same file more than once. Overloading last_modified to also do things like calculate md5's seems a bit hacky to me, and confusing for any developer maintaining a custom storage backend that doesn't support last modified.
>
>> I disagree, modified_time is perfectly capable of handling your use case.
>
> No, it does not address my needs, as I described above.
>
> Dan
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
On Mon, Oct 8, 2012 at 3:50 AM, Jannis Leidel wrote:
> You're able to ask S3 for the date of last modification, I don't see why a comparison by hashing the file content is needed additionally. It'd have to download the full file to do that on Django's side and I'm not aware of an API for getting a hash from cloudfiles, S3 etc.

S3 stores the md5 info in an ETag header. Regarding Cloudfiles, this is what Rackspace has to say:

> You can ensure end-to-end data integrity by including an MD5 checksum of your object's data in the ETag header. You are not required to include the ETag header, but it is recommended to ensure that the storage system successfully stored your object's content.

Dan
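[For single-part uploads, the ETag S3 returns is the object's md5 hex digest wrapped in quotes, which is what makes the comparison Dan describes possible without downloading anything (multipart uploads are the caveat: their ETags are not a plain md5). A sketch:

```python
import hashlib

def matches_etag(local_bytes, etag):
    """True if local content matches a single-part S3 ETag."""
    return etag.strip('"') == hashlib.md5(local_bytes).hexdigest()

data = b"body { margin: 0 }"
etag = '"%s"' % hashlib.md5(data).hexdigest()  # what a HEAD request returns
print(matches_etag(data, etag))          # True: skip the upload
print(matches_etag(b"changed", etag))    # False: re-upload
```

]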
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel wrote:
> Then, frankly, this is a problem of the storage backends, not Django's. The S3BotoStorage backend *does* have a modified_time method:
>
> https://bitbucket.org/david/django-storages/src/1574890d87be/storages/backends/s3boto.py#cl-298
>
> What storage backend do you use that doesn't have a modified_time method?

I don't think you're seeing the problem I'm having. I'm working with a distributed team using git. This means when we check out files, the local modified time is the time at which I checked the files out, not the time at which the files were actually last modified.

As a result, it's a questionable metric for figuring out if a file is the same or not, since every team member's local machine thinks they were all just created! We end up re-uploading the file every time.

>> This is a bit confusing... why call it last_modified when that doesn't necessarily reflect what it's doing? It would be more flexible to create two methods:
>
> It's called modified_time, not last_modified.

Sorry, typo.

>> def modification_identifier(self):
>>
>> def has_changed(self):
>>
>> Then, any backend could implement these however they might like, and collectstatic would have no excuse in uploading the same file more than once. Overloading last_modified to also do things like calculate md5's seems a bit hacky to me, and confusing for any developer maintaining a custom storage backend that doesn't support last modified.
>
> I disagree, modified_time is perfectly capable of handling your use case.

No, it does not address my needs, as I described above.

Dan
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
On 08.10.2012, at 07:38, ptone wrote:
> So after scanning this thread and the ticket again - it is still unclear that there could be a completely universal solution.
>
> While it would be nice if the storage API had a checksum(name) or md5(name) method - not all custom storage backends are going to support a single checksum standard. S3 doesn't explicitly support MD5 (apparently it unofficially does through ETags). Without a universal checksum - you can't use it to compare files across arbitrary backends.

You're able to ask S3 for the date of last modification, I don't see why a comparison by hashing the file content is needed additionally. It'd have to download the full file to do that on Django's side and I'm not aware of an API for getting a hash from cloudfiles, S3 etc.

> I do agree that hacking the modified_time return value is a little ugly - the API is clearly documented as "returns a datetime..." - so returning an MD5 checksum there is, well, hacky.

I beg to differ, returning a datetime object makes absolute sense for comparing it to another datetime object. What I meant before is that the modified_time method can be written however the user wants as long as it returns a datetime object, even a date that is known to be older than the file on disk.

> If you are passionate about moving this forward, here is what I'd suggest.
>
> Implement, document, and test .md5(name) as a standard method on storage backends - like modified_time this would raise NotImplementedError if not available - this could easily be its own ticket. md5 is probably the closest you'll get to a checksum standard.

-1

Jannis

> On Sunday, October 7, 2012 8:59:16 PM UTC-7, Dan Loewenherz wrote:
>> This issue just got me again tonight, so I'll try to push once more on this issue. It seems right now most people don't care that this is broken, which is a bummer, but in which case I'll just continue using my working solution.
>> Dan
>>
>> On Sat, Oct 6, 2012 at 10:48 AM, Dan Loewenherz wrote:
>> Hey Jannis,
>>
>> On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel wrote:
>> > On 30.09.2012, at 23:41, Dan Loewenherz wrote:
>> > > Many backends don't support last modified times, and even if they all did, it's incorrect to assume that last modified time is an accurate heuristic for whether a file has already been uploaded or not.
>> >
>> > Well, but it's an accurate way to decide whether a file has been changed on the filesystem, and that's what collectstatic cares about. The storage backend *is* the API to extend that when needed, so feel free to use it.
>>
>> It's accurate *only* in certain situations. And on a distributed development team, I've run into a lot of issues with developers re-uploading files that have already been uploaded because they just recently updated their repo.
>>
>> A checksum is the only truly accurate method to determine if a file has changed.
>>
>> Additionally, you didn't address my point that I quoted from. Storage backends don't just reflect filesystems--they could reflect files stored in a database, S3, etc. And some of these filesystems don't support last modified times.
>>
>> > It might be a better idea to let the backends decide when a file has been changed (instead of just calling the backend's last modified method).
>> >
>> > I don't understand, you can easily implement exactly that in the last_modified method if you'd like.
>>
>> This is a bit confusing... why call it last_modified when that doesn't necessarily reflect what it's doing? It would be more flexible to create two methods:
>>
>> def modification_identifier(self):
>>
>> def has_changed(self):
>>
>> Then, any backend could implement these however they might like, and collectstatic would have no excuse in uploading the same file more than once.
>> Overloading last_modified to also do things like calculate md5's seems a bit hacky to me, and confusing for any developer maintaining a custom storage backend that doesn't support last modified.
>>
>> Dan
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
On 06.10.2012, at 19:48, Dan Loewenherz wrote:
> Hey Jannis,
>
> On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel wrote:
>> On 30.09.2012, at 23:41, Dan Loewenherz wrote:
>>> Many backends don't support last modified times, and even if they all did, it's incorrect to assume that last modified time is an accurate heuristic for whether a file has already been uploaded or not.
>>
>> Well, but it's an accurate way to decide whether a file has been changed on the filesystem, and that's what collectstatic cares about. The storage backend *is* the API to extend that when needed, so feel free to use it.
>
> It's accurate *only* in certain situations. And on a distributed development team, I've run into a lot of issues with developers re-uploading files that have already been uploaded because they just recently updated their repo.
>
> A checksum is the only truly accurate method to determine if a file has changed.
>
> Additionally, you didn't address my point that I quoted from. Storage backends don't just reflect filesystems--they could reflect files stored in a database, S3, etc. And some of these filesystems don't support last modified times.

Then, frankly, this is a problem of the storage backends, not Django's. The S3BotoStorage backend *does* have a modified_time method:

https://bitbucket.org/david/django-storages/src/1574890d87be/storages/backends/s3boto.py#cl-298

What storage backend do you use that doesn't have a modified_time method?

> This is a bit confusing... why call it last_modified when that doesn't necessarily reflect what it's doing? It would be more flexible to create two methods:

It's called modified_time, not last_modified.

> def modification_identifier(self):
>
> def has_changed(self):
>
> Then, any backend could implement these however they might like, and collectstatic would have no excuse in uploading the same file more than once.
> Overloading last_modified to also do things like calculate md5's seems a bit hacky to me, and confusing for any developer maintaining a custom storage backend that doesn't support last modified.

I disagree, modified_time is perfectly capable of handling your use case.

Jannis
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
So after scanning this thread and the ticket again - it is still unclear that there could be a completely universal solution.

While it would be nice if the storage API had a checksum(name) or md5(name) method - not all custom storage backends are going to support a single checksum standard. S3 doesn't explicitly support MD5 (apparently it unofficially does through ETags). Without a universal checksum - you can't use it to compare files across arbitrary backends.

I do agree that hacking the modified_time return value is a little ugly - the API is clearly documented as "returns a datetime..." - so returning an MD5 checksum there is, well, hacky.

If you are passionate about moving this forward, here is what I'd suggest.

Implement, document, and test .md5(name) as a standard method on storage backends - like modified_time this would raise NotImplementedError if not available - this could easily be its own ticket. md5 is probably the closest you'll get to a checksum standard.

Once you have an md5 method defined for backends - you could support a --md5 option to collectstatic that would use that as the target/source comparison.

Another workaround is to just use collectstatic locally - and rsync --checksum to your remote if it supports rsync.

-Preston

On Sunday, October 7, 2012 8:59:16 PM UTC-7, Dan Loewenherz wrote:
> This issue just got me again tonight, so I'll try to push once more on this issue. It seems right now most people don't care that this is broken, which is a bummer, but in which case I'll just continue using my working solution.
> Dan
>
> On Sat, Oct 6, 2012 at 10:48 AM, Dan Loewenherz wrote:
>> Hey Jannis,
>>
>> On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel wrote:
>>> On 30.09.2012, at 23:41, Dan Loewenherz wrote:
>>> > Many backends don't support last modified times, and even if they all did, it's incorrect to assume that last modified time is an accurate heuristic for whether a file has already been uploaded or not.
>>>
>>> Well, but it's an accurate way to decide whether a file has been changed on the filesystem, and that's what collectstatic cares about. The storage backend *is* the API to extend that when needed, so feel free to use it.
>>
>> It's accurate *only* in certain situations. And on a distributed development team, I've run into a lot of issues with developers re-uploading files that have already been uploaded because they just recently updated their repo.
>>
>> A checksum is the only truly accurate method to determine if a file has changed.
>>
>> Additionally, you didn't address my point that I quoted from. Storage backends don't just reflect filesystems--they could reflect files stored in a database, S3, etc. And some of these filesystems don't support last modified times.
>>
>>> It might be a better idea to let the backends decide when a file has been changed (instead of just calling the backend's last modified method).
>>>
>>> I don't understand, you can easily implement exactly that in the last_modified method if you'd like.
>>
>> This is a bit confusing... why call it last_modified when that doesn't necessarily reflect what it's doing? It would be more flexible to create two methods:
>>
>> def modification_identifier(self):
>>
>> def has_changed(self):
>>
>> Then, any backend could implement these however they might like, and collectstatic would have no excuse in uploading the same file more than once.
Overloading last_modified to also do things like calculate md5's >> seems a bit hacky to me, and confusing for any developer maintaining a >> custom storage backend that doesn't support last modified. >> >> Dan >> > > -- You received this message because you are subscribed to the Google Groups "Django developers" group. To view this discussion on the web visit https://groups.google.com/d/msg/django-developers/-/weKD2x1XY4oJ. To post to this group, send email to django-developers@googlegroups.com. To unsubscribe from this group, send email to django-developers+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/django-developers?hl=en.
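To make Preston's .md5(name) suggestion concrete, here is a minimal sketch. The Storage base class below is a simplified stand-in, not Django's actual storage API, and the method is the proposed addition rather than anything that exists today:

```python
import hashlib
import os

class Storage:
    """Simplified stand-in for a storage backend base class."""
    def open(self, name):
        raise NotImplementedError

    def md5(self, name):
        # Proposed standard method: hex MD5 digest of the file's contents.
        # Like modified_time, backends that can't provide one (or that only
        # expose a provider-side checksum such as an S3 ETag) would raise.
        raise NotImplementedError("This backend does not support md5().")

class FileSystemStorage(Storage):
    def __init__(self, location):
        self.location = location

    def open(self, name):
        return open(os.path.join(self.location, name), "rb")

    def md5(self, name):
        digest = hashlib.md5()
        with self.open(name) as f:
            # Hash in chunks so large assets aren't read into memory at once.
            for chunk in iter(lambda: f.read(64 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()
```

A --md5 option on collectstatic could then compare source.md5(path) against target.md5(path) and skip the copy on a match, falling back to the modified-time check whenever either side raises NotImplementedError.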
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
This issue just got me again tonight, so I'll try to push once more on this issue. It seems right now most people don't care that this is broken, which is a bummer, in which case I'll just continue using my working solution. Dan On Sat, Oct 6, 2012 at 10:48 AM, Dan Loewenherz wrote: > Hey Jannis, > > On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel wrote: > >> >> On 30.09.2012, at 23:41, Dan Loewenherz wrote: >> >> > Many backends don't support last modified times, and even if they all >> did, it's incorrect to assume that last modified time is an accurate >> heuristic for whether a file has already been uploaded or not. >> >> Well but it's an accurate way to decide whether a file has been changed >> on the filesystem, and that's what collectstatic cares about. The storage >> backend *is* the API to extend that when needed, so feel free to use it. >> > > It's accurate *only* in certain situations. And on a distributed > development team, I've run into a lot of issues with developers re-uploading > files that have already been uploaded because they just recently updated > their repo. > > A checksum is the only true accurate method to determine if a file has > changed. > > Additionally, you didn't address my point that I quoted from. Storage > backends don't just reflect filesystems--they could reflect files stored in > a database, S3, etc. And some of these filesystems don't support last > modified times. > > > It might be a better idea to let the backends decide when a file has >> been changed (instead of just calling the backend's last modified method). >> >> I don't understand, you can easily implement exactly that in the >> last_modified method if you'd like. >> > > This is a bit confusing...why call it last_modified when that doesn't > necessarily reflect what it's doing? 
It would be more flexible to create > two methods: > > def modification_identifier(self): > > def has_changed(self): > > Then, any backend could implement these however they might like, and > collectstatic would have no excuse in uploading the same file more than > once. Overloading last_modified to also do things like calculate md5's > seems a bit hacky to me, and confusing for any developer maintaining a > custom storage backend that doesn't support last modified. > > Dan
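Dan's two-method proposal could be sketched roughly as follows. The method names follow his message (extended with the name/identifier arguments a real backend would need, much like the has_changed(file_name, other_thing) signature Florian mentions); the dict-backed storage is a toy stand-in for any real backend:

```python
import hashlib

class DictStorage:
    """Toy storage backed by a dict, standing in for any backend
    (filesystem, S3, database, ...)."""

    def __init__(self, files=None):
        self.files = dict(files or {})  # name -> file contents (bytes)

    def save(self, name, content):
        self.files[name] = content

    def modification_identifier(self, name):
        # Any stable token that changes when the content changes: a
        # checksum here, but a backend could return an mtime or an ETag.
        return hashlib.md5(self.files[name]).hexdigest()

    def has_changed(self, name, identifier):
        # True when the stored copy is missing or its identifier differs
        # from the one supplied by the source storage.
        if name not in self.files:
            return True
        return self.modification_identifier(name) != identifier
```

collectstatic would then skip a file whenever target.has_changed(path, source.modification_identifier(path)) is False, and each backend could decide for itself what "changed" means.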
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
Hey Jannis, On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel wrote: > > On 30.09.2012, at 23:41, Dan Loewenherz wrote: > > > Many backends don't support last modified times, and even if they all > did, it's incorrect to assume that last modified time is an accurate > heuristic for whether a file has already been uploaded or not. > > Well but it's an accurate way to decide whether a file has been changed on > the filesystem, and that's what collectstatic cares about. The storage > backend *is* the API to extend that when needed, so feel free to use it. > It's accurate *only* in certain situations. And on a distributed development team, I've run into a lot of issues with developers re-uploading files that have already been uploaded because they just recently updated their repo. A checksum is the only true accurate method to determine if a file has changed. Additionally, you didn't address my point that I quoted from. Storage backends don't just reflect filesystems--they could reflect files stored in a database, S3, etc. And some of these filesystems don't support last modified times. > It might be a better idea to let the backends decide when a file has been > changed (instead of just calling the backend's last modified method). > > I don't understand, you can easily implement exactly that in the > last_modified method if you'd like. > This is a bit confusing...why call it last_modified when that doesn't necessarily reflect what it's doing? It would be more flexible to create two methods: def modification_identifier(self): def has_changed(self): Then, any backend could implement these however they might like, and collectstatic would have no excuse in uploading the same file more than once. Overloading last_modified to also do things like calculate md5's seems a bit hacky to me, and confusing for any developer maintaining a custom storage backend that doesn't support last modified. 
Dan
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
On 30.09.2012, at 23:41, Dan Loewenherz wrote: > Many backends don't support last modified times, and even if they all did, > it's incorrect to assume that last modified time is an accurate heuristic for > whether a file has already been uploaded or not. Well but it's an accurate way to decide whether a file has been changed on the filesystem, and that's what collectstatic cares about. The storage backend *is* the API to extend that when needed, so feel free to use it. > It might be a better idea to let the backends decide when a file has been > changed (instead of just calling the backend's last modified method). I don't understand, you can easily implement exactly that in the last_modified method if you'd like. Jannis
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
On 28.09.2012, at 05:38, Dan Loewenherz wrote: > On Thu, Sep 27, 2012 at 4:13 PM, Carl Meyer wrote: > Hi Dan, > > On 09/27/2012 04:47 PM, Dan Loewenherz wrote: > > Just updated the ticket. > > > > As I commented, the heuristic for checking if a file has been modified > > lies in line 282 of collectstatic.py: > > > > *if not prefixed_path in self.copied_files:* > > * > > return self.log("Skipping '%s' (already copied earlier)" % path) > > > > * > > https://github.com/django/django/blob/master/django/contrib/staticfiles/management/commands/collectstatic.py#L282 > > > > This seems off, since a path may stay the same but a file's contents may > > change. > > That's not checking whether the file has been modified, that's checking > whether the same source file was previously copied in the same > collectstatic run (due to overlapping/duplicate file sources of some kind). > > The check for modification date is up on line 234 in the delete_file > method, which is called by both link_file and copy_file. > > Thanks, I missed that. > > I still see an issue here. In any sort of source control, when a user updates > their repo, local files that were updated remotely show up as modified at the > time the repo is cloned or updated, not when the file was actually last saved > by the last author. You then have the same scenario I pointed to earlier: > when multiple people work on a project, they will re-upload the same files > multiple times. > > Don't get me wrong here--I'm happy to see there is some sort of check here to > avoid collectstatic'ing the same file, but I think it's warranted to push > back on the use of last modified as the heuristic. I think using a checksum > would work much better, and solves the problem this is trying to solve in a > much more foolproof way. With all this said, I don't think this logic belongs > in a "delete_file" method. 
IMO it'd be better to separate out the logic of > "does this file already exist, if so, skip it" from "delete this file if it > exists on the target" (delete_file). > > @Karen--thanks for digging up that SO post. It actually is super relevant > here. As mentioned, when uploading files to S3, it's quite time consuming to > perform a network round trip to pick up file metadata (such as last modified > time) for each file you're uploading. This was the initial reason I chose to > store this data in a single location that I'd only need to grab once. Dan, I added prefetching the metadata to the Boto based storage backend in django-storages a while ago to reduce the time spent on querying S3. See https://bitbucket.org/david/django-storages/src/1574890d87be/storages/backends/s3boto.py#cl-42 You can either set the setting AWS_PRELOAD_METADATA to True or subclass S3BotoStorage and pass preload_metadata=True into the constructor. Hope that helps, Jannis
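For reference, the two options Jannis describes would look roughly like this. This is a config fragment assuming the 2012-era django-storages Boto backend he links, so treat the exact kwargs as an assumption to verify against your installed version:

```python
# settings.py -- option 1: preload S3 object metadata in one listing,
# so collectstatic doesn't do a round trip per file
AWS_PRELOAD_METADATA = True

# option 2: a storage subclass that always preloads, regardless of settings
from storages.backends.s3boto import S3BotoStorage

class PreloadedS3BotoStorage(S3BotoStorage):
    def __init__(self, *args, **kwargs):
        kwargs.setdefault("preload_metadata", True)
        super(PreloadedS3BotoStorage, self).__init__(*args, **kwargs)
```

Either way, the backend fetches the bucket inventory once up front instead of querying the last-modified time of every file individually.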
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
I think it's an API limitation. Many backends don't support last modified times, and even if they all did, it's incorrect to assume that last modified time is an accurate heuristic for whether a file has already been uploaded or not. It might be a better idea to let the backends decide when a file has been changed (instead of just calling the backend's last modified method). Dan On Sun, Sep 30, 2012 at 2:28 PM, Stephen Burrows < stephen.r.burr...@gmail.com> wrote: > I use S3 as well, and I have seen cases where files get copied that I know > don't need to be. That being said, it was never so slow that it was an > issue for me. Is there clear evidence that this is something which can't be > handled by the S3 backend due to an inadequate API on the django side?
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
I use S3 as well, and I have seen cases where files get copied that I know don't need to be. That being said, it was never so slow that it was an issue for me. Is there clear evidence that this is something which can't be handled by the S3 backend due to an inadequate API on the django side?
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
+1 I like this implementation.
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
On Thu, Sep 27, 2012 at 4:13 PM, Carl Meyer wrote: > Hi Dan, > > On 09/27/2012 04:47 PM, Dan Loewenherz wrote: > > Just updated the ticket. > > > > As I commented, the heuristic for checking if a file has been modified > > lies in line 282 of collectstatic.py: > > > > *if not prefixed_path in self.copied_files:* > > * > > return self.log("Skipping '%s' (already copied earlier)" % path) > > > > * > > > https://github.com/django/django/blob/master/django/contrib/staticfiles/management/commands/collectstatic.py#L282 > > > > This seems off, since a path may stay the same but a file's contents may > > change. > > That's not checking whether the file has been modified, that's checking > whether the same source file was previously copied in the same > collectstatic run (due to overlapping/duplicate file sources of some kind). > > The check for modification date is up on line 234 in the delete_file > method, which is called by both link_file and copy_file. > Thanks, I missed that. I still see an issue here. In any sort of source control, when a user updates their repo, local files that were updated remotely show up as modified at the time the repo is cloned or updated, not when the file was actually last saved by the last author. You then have the same scenario I pointed to earlier: when multiple people work on a project, they will re-upload the same files multiple times. Don't get me wrong here--I'm happy to see there is some sort of check here to avoid collectstatic'ing the same file, but I think it's warranted to push back on the use of last modified as the heuristic. I think using a checksum would work much better, and solves the problem this is trying to solve in a much more foolproof way. With all this said, I don't think this logic belongs in a "delete_file" method. IMO it'd be better to separate out the logic of "does this file already exist, if so, skip it" from "delete this file if it exists on the target" (delete_file). 
@Karen--thanks for digging up that SO post. It actually is super relevant here. As mentioned, when uploading files to S3, it's quite time consuming to perform a network round trip to pick up file metadata (such as last modified time) for each file you're uploading. This was the initial reason I chose to store this data in a single location that I'd only need to grab once. Dan
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
Hi Dan, On 09/27/2012 04:47 PM, Dan Loewenherz wrote: > Just updated the ticket. > > As I commented, the heuristic for checking if a file has been modified > lies in line 282 of collectstatic.py: > > *if not prefixed_path in self.copied_files:* > * > return self.log("Skipping '%s' (already copied earlier)" % path) > > * > https://github.com/django/django/blob/master/django/contrib/staticfiles/management/commands/collectstatic.py#L282 > > This seems off, since a path may stay the same but a file's contents may > change. That's not checking whether the file has been modified, that's checking whether the same source file was previously copied in the same collectstatic run (due to overlapping/duplicate file sources of some kind). The check for modification date is up on line 234 in the delete_file method, which is called by both link_file and copy_file. Carl
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
Just updated the ticket. As I commented, the heuristic for checking if a file has been modified lies in line 282 of collectstatic.py: *if not prefixed_path in self.copied_files:* * return self.log("Skipping '%s' (already copied earlier)" % path) * https://github.com/django/django/blob/master/django/contrib/staticfiles/management/commands/collectstatic.py#L282 This seems off, since a path may stay the same but a file's contents may change. Also, the existing functionality doesn't work when multiple people are running collectstatic (or the same person is running it on multiple computers). Dan On Thu, Sep 27, 2012 at 3:12 PM, Karen Tracey wrote: > On Thu, Sep 27, 2012 at 12:51 PM, Dan Loewenherz wrote: > >> The problem I've run into is that collectstatic copies all files, >> regardless of whether they already exist on the destination. > > > No, as noted in the ticket, which has been closed needsinfo, staticfiles > already only copies modified files. And I don't think that's a > recently-added feature, so I suspect the behavior you are seeing has more > to do with the S3 side than the staticfiles code. This post: > > > http://stackoverflow.com/questions/6618013/django-staticfiles-and-amazon-s3-how-to-detect-modified-files > > looks enlightening though even that is not all that new. It sounds like > the problem here is efficiently getting the last-modified time for the > files out in S3, not with staticfiles. > > Karen
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
On Thu, Sep 27, 2012 at 12:51 PM, Dan Loewenherz wrote: > The problem I've run into is that collectstatic copies all files, > regardless of whether they already exist on the destination. No, as noted in the ticket, which has been closed needsinfo, staticfiles already only copies modified files. And I don't think that's a recently-added feature, so I suspect the behavior you are seeing has more to do with the S3 side than the staticfiles code. This post: http://stackoverflow.com/questions/6618013/django-staticfiles-and-amazon-s3-how-to-detect-modified-files looks enlightening though even that is not all that new. It sounds like the problem here is efficiently getting the last-modified time for the files out in S3, not with staticfiles. Karen
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
Good idea, but shouldn't it be a per-storage thing? Perhaps this could be done with a couple of callbacks in the collectstatic run: - Before collectstatic starts, so the storage backend can pick up its inventory from the remote - One called for each file that would be copied, and that can veto a copy operation - At the end of the collectstatic run, to update the inventory That would make things customisable even for strange setups like storing files in a database or whatnot? Cheers, mjl
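The three callbacks described above could be sketched like this. The hook and class names are invented for illustration, and a dict stands in for the remote store:

```python
class InventoryStorage:
    """Toy remote storage illustrating the three proposed hooks."""

    def __init__(self, remote_checksums):
        self.remote = dict(remote_checksums)  # name -> checksum "on the remote"
        self.inventory = {}

    def collect_started(self):
        # Hook 1: fetch the inventory from the remote in one round trip.
        self.inventory = dict(self.remote)

    def should_copy(self, name, checksum):
        # Hook 2: veto the copy when the inventory says it's unchanged.
        return self.inventory.get(name) != checksum

    def collect_finished(self):
        # Hook 3: push the updated inventory back to the remote.
        self.remote = dict(self.inventory)

def collectstatic(source_checksums, storage):
    """Driver loop: copies only the files the backend doesn't veto."""
    copied = []
    storage.collect_started()
    for name, checksum in source_checksums.items():
        if storage.should_copy(name, checksum):
            copied.append(name)
            storage.inventory[name] = checksum
    storage.collect_finished()
    return copied
```

Because all three hooks live on the storage backend, a database-backed or otherwise unusual storage could implement each one however it likes.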
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
On Thu, Sep 27, 2012 at 6:51 PM, Dan Loewenherz wrote: > Hey all! > > This is a feature request / proposal (one which I'm willing to build out, > given that I've already developed a solution for my own uploader). > > [...] > > I'll contribute the patch. I know there is not a lot of time before the > feature freeze, but I'm willing to make this happen if there's interest. > Yes, please. I've been wanting this myself. This should IMHO be the default for collectstatic, with a flag to *force* copying all files. -- *Anders Steinlein* *Eliksir AS* http://e5r.no E-post: and...@e5r.no Mobil: +47 926 13 069 Twitter: @asteinlein
Re: Feature request: collectstatic shouldn't recopy files that already exist in destination
I like this feature and have recently been thinking of implementing something like this myself. +1 for the feature request. On 9/27/12, Dan Loewenherz wrote: > Hey all! > > This is a feature request / proposal (one which I'm willing to build out, > given that I've already developed a solution for my own uploader). > > I run a consulting business that helps small startups build initial MVPs. > When the time ultimately comes to deciding how to store static assets, my > preference (as is that of many others) is to use Amazon S3, with Amazon > CloudFront acting as a CDN to the appropriate buckets. For the purposes of > this ticket, s/S3/your object storage of choice/g. > > Now, Django has an awesome mechanism for making sure static assets are up > on S3. With the appropriate static storage backend, running ./manage.py > collectstatic just searches through your static media folders and copies > the files. > > The problem I've run into is that collectstatic copies all files, > regardless of whether they already exist on the destination. Uploading > 5-10MB of files is pretty wasteful (and time consuming) when none of the > files have changed and no new files have been added. > > As I wrote in the trac ticket > (https://code.djangoproject.com/ticket/19021), > my current solution was to write a management command that does essentially > the same thing that collectstatic does. But there are a few differences. > Here's a rundown (copied straight from the trac ticket). > > I currently solve this problem by creating a file containing metadata of >> all the static media at the root of the destination. This file is a JSON >> object that contains file paths as keys and checksums as values. When an >> upload is started, the uploader checks to see if the file path exists as >> a >> key in the dictionary. If it does, it checks to see if the checksums have >> changed. If they haven't changed, the uploader skips the file. 
At the end >> of the upload, the checksum file is updated on the destination. > > > I'll contribute the patch. I know there is not a lot of time before the > feature freeze, but I'm willing to make this happen if there's interest. > > If we don't want to change behavior, perhaps adding a flag such as > --skip-unchanged-files to the collectstatic command is the way to go? > > All the best, > Dan
Feature request: collectstatic shouldn't recopy files that already exist in destination
Hey all! This is a feature request / proposal (one which I'm willing to build out, given that I've already developed a solution for my own uploader). I run a consulting business that helps small startups build initial MVPs. When the time ultimately comes to deciding how to store static assets, my preference (as is that of many others) is to use Amazon S3, with Amazon CloudFront acting as a CDN to the appropriate buckets. For the purposes of this ticket, s/S3/your object storage of choice/g. Now, Django has an awesome mechanism for making sure static assets are up on S3. With the appropriate static storage backend, running ./manage.py collectstatic just searches through your static media folders and copies the files. The problem I've run into is that collectstatic copies all files, regardless of whether they already exist on the destination. Uploading 5-10MB of files is pretty wasteful (and time consuming) when none of the files have changed and no new files have been added. As I wrote in the trac ticket (https://code.djangoproject.com/ticket/19021), my current solution was to write a management command that does essentially the same thing that collectstatic does. But there are a few differences. Here's a rundown (copied straight from the trac ticket). I currently solve this problem by creating a file containing metadata of > all the static media at the root of the destination. This file is a JSON > object that contains file paths as keys and checksums as values. When an > upload is started, the uploader checks to see if the file path exists as a > key in the dictionary. If it does, it checks to see if the checksums have > changed. If they haven't changed, the uploader skips the file. At the end > of the upload, the checksum file is updated on the destination. I'll contribute the patch. I know there is not a lot of time before the feature freeze, but I'm willing to make this happen if there's interest. 
If we don't want to change behavior, perhaps adding a flag such as --skip-unchanged-files to the collectstatic command is the way to go? All the best, Dan
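The manifest scheme described in the original message can be sketched in a few lines. The function names are mine, mirroring the trac-ticket description rather than Dan's actual code:

```python
import hashlib
import json
import os

def file_md5(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def files_to_upload(static_root, manifest_json):
    """Compare local files against the remote manifest (a JSON object
    mapping relative paths to checksums). Returns the paths that are new
    or changed, plus the updated manifest JSON to store on the remote
    once the upload finishes."""
    manifest = json.loads(manifest_json)
    changed = []
    for dirpath, _, filenames in os.walk(static_root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            rel = os.path.relpath(full, static_root)
            checksum = file_md5(full)
            if manifest.get(rel) != checksum:
                changed.append(rel)
            manifest[rel] = checksum
    return changed, json.dumps(manifest)
```

The key property is that only one remote fetch (the manifest) and one remote write (the updated manifest) are needed per run, regardless of how many files are unchanged.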