Re: Compressed Pristines (Design Doc)
2012/3/25 Branko Čibej br...@e-reka.si:
> On 22.03.2012 17:01, Branko Čibej wrote:
>> On 22.03.2012 16:50, Daniel Shahaf wrote:
>>> Branko Čibej wrote on Thu, Mar 22, 2012 at 16:37:24 +0100:
> [..]
> Based on these observations, it's clear that the implementation should proceed as follows:
>
> Step 1: Just compress the pristine files, do not use any packing. This gives a 60% decrease in disk usage in the HTTPD case, but even if the decrease is only 30%, it's still worth the effort.
>
> Step 2: Store small (for some definition of small) compressed pristine files in a SQLite database. In the case of HTTPD, this gives extra savings of up to 90% in disk usage, but this is a very specific test case and it's hard to guess what kind of gain we'd get on average.

Makes sense to me. In that case we also benefit on performance (provided the SQLite blob API performs acceptably).

And IMHO "small" should be really small (up to 4k) to prevent wc.db from growing in size.

--
Ivan Zhakov
Re: Compressed Pristines (Design Doc)
On 26 March 2012 11:53, Ivan Zhakov i...@visualsvn.com wrote:
> 2012/3/25 Branko Čibej br...@e-reka.si:
>> On 22.03.2012 17:01, Branko Čibej wrote:
>>> On 22.03.2012 16:50, Daniel Shahaf wrote:
>>>> Branko Čibej wrote on Thu, Mar 22, 2012 at 16:37:24 +0100:
>> [..]
>> Based on these observations, it's clear that the implementation should proceed as follows:
>>
>> Step 1: Just compress the pristine files, do not use any packing. This gives a 60% decrease in disk usage in the HTTPD case, but even if the decrease is only 30%, it's still worth the effort.
>>
>> Step 2: Store small (for some definition of small) compressed pristine files in a SQLite database. In the case of HTTPD, this gives extra savings of up to 90% in disk usage, but this is a very specific test case and it's hard to guess what kind of gain we'd get on average.
>
> Makes sense for me. In that case we also benefit on performance (in case sqlite blob API has acceptable performance)
>
> And IMHO small should be really small (up to 4k) to prevent wc.db growing in size.

There's no requirement for putting pristines in the wc.db; it can easily be a different database that's part of the same connection.

More to the point, in order to make using a database worthwhile, the size limit shouldn't be /too/ low. With a 4k filesystem block size, files up to 4k in size will have 50% wasted on average; 8k files will waste 25%; and so on. My test compared using 8k and 32k limits, and just increasing that limit added more than 50% extra space savings (on top of the already huge savings of storing up-to-8k files in blobs) with no significant difference in insertion times.

(This last makes sense, as SQLite will flush in multiples of the page size, so the insertion times are really proportional to the overall amount of data written. On average; YMMV.)

-- Brane
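The block-size arithmetic in the message above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and it assumes a filesystem that allocates space in whole blocks and file sizes spread evenly across each size bucket.

```python
def allocated_size(file_size, block_size=4096):
    """Bytes occupied on disk when space is handed out in whole blocks."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def average_waste(blocks, block_size=4096):
    """Average wasted fraction of allocated space for files that need
    exactly `blocks` blocks, assuming sizes are evenly distributed."""
    lo = (blocks - 1) * block_size + 1
    hi = blocks * block_size
    wasted = sum(allocated_size(s, block_size) - s for s in range(lo, hi + 1))
    allocated = blocks * block_size * (hi - lo + 1)
    return wasted / allocated


if __name__ == "__main__":
    for blocks in (1, 2, 4, 8):
        print("files needing %d block(s): %.1f%% wasted on average"
              % (blocks, 100 * average_waste(blocks)))
```

For one-block files the result is just under 50%, and for two-block files just under 25%, which matches the figures quoted in the message; the waste keeps halving as files grow, which is why only small files gain much from being moved into a database.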
Re: Compressed Pristines (Design Doc)
Hyrum K Wright wrote on Fri, Mar 23, 2012 at 13:54:25 -0500:
> As mentioned elsewhere, I too was surprised by the choice of a custom container, though I think you make a good argument for it. One simplification I was thinking about is this: what if the container only needed to support add and batch-delete operations? These are the current constraints of the existing pristine store; would they introduce additional simplicity into your design?
>
> In some respects, it looks like you're solving *two* problems: compression and the internal fragmentation due to large FS block sizes. How orthogonal are the problems? Could they be solved independently of each other in some way? I know that compression exposes the internal fragmentation issue, but used alone it certainly doesn't make things *worse* does it?

Personally I've also been wondering, while reading the design doc, how applicable the solutions are to libsvn_fs -- or whether they could be modularized in a way that lets libsvn_fs re-use parts of them, etc. I haven't found much so far, but this is another angle to look at things from.
Re: Compressed Pristines (Design Doc)
Daniel Shahaf wrote on Mon, Mar 26, 2012 at 14:30:34 +0200:
> I haven't found much so far

(Provided as an observation; not implying it's a problem.)
Re: Compressed Pristines (Design Doc)
----- Original Message -----
From: Daniel Shahaf danie...@elego.de
To: Hyrum K Wright hyrum.wri...@wandisco.com
Cc: Ashod Nakashian ashodnakash...@yahoo.com; dev@subversion.apache.org dev@subversion.apache.org; Philip Martin philip.mar...@wandisco.com; Greg Stein gst...@gmail.com
Sent: Monday, March 26, 2012 5:30 PM
Subject: Re: Compressed Pristines (Design Doc)

> Hyrum K Wright wrote on Fri, Mar 23, 2012 at 13:54:25 -0500:
>> As mentioned elsewhere, I too was surprised by the choice of a custom container, though I think you make a good argument for it. One simplification I was thinking about is this: what if the container only needed to support add and batch-delete operations? These are the current constraints of the existing pristine store; would they introduce additional simplicity into your design?
>>
>> In some respects, it looks like you're solving *two* problems: compression and the internal fragmentation due to large FS block sizes. How orthogonal are the problems? Could they be solved independently of each other in some way? I know that compression exposes the internal fragmentation issue, but used alone it certainly doesn't make things *worse* does it?
>
> Personally I've also been wondering, while reading the design doc, how applicable are the solutions to libsvn_fs -- or if they could be modularized in a way that lets libsvn_fs re-use parts of them, etc. I haven't found much so far, but this is another angle to look at things from.

This is certainly something to plan for. I didn't include such info to avoid widening the scope and because we haven't agreed on the design yet. I'll probably get to that when we have consensus on the design, which will hopefully be soon.

-Ash
Re: Compressed Pristines (Design Doc)
Hi Ash,

I noticed that "Remove pristine store or render optional" is considered a Non-Goal. If changes are made to wc-db in order to manage compressed pristines, it might make sense to ensure that the design can also handle optional pristines in the future.

The typical Subversion use case (code/text) will obviously benefit from compressed pristines. However, when storing binary files (e.g. graphics), which tend to be larger and less frequently modified files, optional pristines will likely be more beneficial.

Thanks,
Thomas Å.

On 22 mar 2012, at 08:15, Ashod Nakashian wrote:
> From: Daniel Shahaf danie...@elego.de
> To: Greg Stein gst...@gmail.com
> Cc: Ashod Nakashian ashodnakash...@yahoo.com; dev@subversion.apache.org
> Sent: Wednesday, March 21, 2012 2:08 PM
> Subject: Re: Compressed Pristines (Design Doc)
>
>> Greg Stein wrote on Wed, Mar 21, 2012 at 16:51:47 -0400:
>>> On Wed, Mar 21, 2012 at 16:11, Daniel Shahaf d...@daniel.shahaf.name wrote:
>>>> Ashod Nakashian wrote on Wed, Mar 21, 2012 at 12:19:02 -0700:
>>>>> All, I'm happy to share[1] with you the design document for the Compressed Pristines feature. The document is public and anyone can comment on any part
>>>> I can't. Can you please move the document to our wiki, or dump it in an email to dev@, or on a pastebin, somewhere everyone can read it.
>>> I just opened it in an incognito window in Chrome. You should be able to access the thing.
>> Tried, I get as far as the doc title. I don't see its contents.
>
> Daniel (and all who can't access the doc), I'm attaching the PDF and ODT versions with updates based on Greg's comments. I'd like to hear all opinions and comments.
>
> Google docs is a fairly ideal environment for live commenting and editing, so it's too bad that you can't access the file. Please let me know if you have any notes/comments on the design.
>
> If you'd like to use the ODT file for comments and edits, please mark your input clearly and I'll update the Google doc with your notes.
>
> Thanks,
> Ash
>
> SubversionCompressedPristinesDesign.pdf
> SubversionCompressedPristinesDesign.odt
Re: Compressed Pristines (Design Doc)
Yeah... optional pristines is orthogonal, and should be considered separately. It is also a very difficult problem because users of the various APIs expect the pristine to always be present.

Cheers,
-g

On Mar 25, 2012 7:41 PM, Thomas Åkesson tho...@akesson.cc wrote:
> Hi Ash,
>
> I noticed that "Remove pristine store or render optional" is considered a Non-Goal. If changes are made to wc-db in order to manage compressed pristines, it might make sense to ensure that the design can also handle optional pristines in the future.
>
> The typical Subversion use case (code/text) will obviously benefit from compressed pristines. However, when storing binary files (e.g. graphics), which tend to be larger and less frequently modified files, optional pristines will likely be more beneficial.
>
> Thanks,
> Thomas Å.
>
> On 22 mar 2012, at 08:15, Ashod Nakashian wrote:
>> From: Daniel Shahaf danie...@elego.de
>> To: Greg Stein gst...@gmail.com
>> Cc: Ashod Nakashian ashodnakash...@yahoo.com; dev@subversion.apache.org
>> Sent: Wednesday, March 21, 2012 2:08 PM
>> Subject: Re: Compressed Pristines (Design Doc)
>>
>>> Greg Stein wrote on Wed, Mar 21, 2012 at 16:51:47 -0400:
>>>> On Wed, Mar 21, 2012 at 16:11, Daniel Shahaf d...@daniel.shahaf.name wrote:
>>>>> Ashod Nakashian wrote on Wed, Mar 21, 2012 at 12:19:02 -0700:
>>>>>> All, I'm happy to share[1] with you the design document for the Compressed Pristines feature. The document is public and anyone can comment on any part
>>>>> I can't. Can you please move the document to our wiki, or dump it in an email to dev@, or on a pastebin, somewhere everyone can read it.
>>>> I just opened it in an incognito window in Chrome. You should be able to access the thing.
>>> Tried, I get as far as the doc title. I don't see its contents.
>>
>> Daniel (and all who can't access the doc), I'm attaching the PDF and ODT versions with updates based on Greg's comments. I'd like to hear all opinions and comments.
>>
>> Google docs is a fairly ideal environment for live commenting and editing, so it's too bad that you can't access the file. Please let me know if you have any notes/comments on the design.
>>
>> If you'd like to use the ODT file for comments and edits, please mark your input clearly and I'll update the Google doc with your notes.
>>
>> Thanks,
>> Ash
>>
>> SubversionCompressedPristinesDesign.pdf
>> SubversionCompressedPristinesDesign.odt
Re: Compressed Pristines (Design Doc)
From: Greg Stein gst...@gmail.com
To: Thomas Åkesson tho...@akesson.cc
Cc: Ashod Nakashian ashodnakash...@yahoo.com; Subversion Development dev@subversion.apache.org
Sent: Monday, March 26, 2012 7:42 AM
Subject: Re: Compressed Pristines (Design Doc)

> Yeah... optional pristines is orthogonal, and should be considered separately. It is also a very difficult problem because users of the various APIs expect the pristine to always be present.
>
> Cheers,
> -g
>
> On Mar 25, 2012 7:41 PM, Thomas Åkesson tho...@akesson.cc wrote:
>> Hi Ash,
>>
>> I noticed that "Remove pristine store or render optional" is considered a Non-Goal. If changes are made to wc-db in order to manage compressed pristines, it might make sense to ensure that the design can also handle optional pristines in the future.

I do not mind in the least making provisions for such a future possibility. But as Greg said, it's really orthogonal to the feature at hand, and it carries too much complexity of its own to be included within the current scope, which is already a mouthful.

-Ash

>> The typical Subversion use case (code/text) will obviously benefit from compressed pristines. However, when storing binary files (e.g. graphics), which tend to be larger and less frequently modified files, optional pristines will likely be more beneficial.
>>
>> Thanks,
>> Thomas Å.
>>
>> On 22 mar 2012, at 08:15, Ashod Nakashian wrote:
>>> From: Daniel Shahaf danie...@elego.de
>>> To: Greg Stein gst...@gmail.com
>>> Cc: Ashod Nakashian ashodnakash...@yahoo.com; dev@subversion.apache.org
>>> Sent: Wednesday, March 21, 2012 2:08 PM
>>> Subject: Re: Compressed Pristines (Design Doc)
>>>
>>>> Greg Stein wrote on Wed, Mar 21, 2012 at 16:51:47 -0400:
>>>>> On Wed, Mar 21, 2012 at 16:11, Daniel Shahaf d...@daniel.shahaf.name wrote:
>>>>>> Ashod Nakashian wrote on Wed, Mar 21, 2012 at 12:19:02 -0700:
>>>>>>> All, I'm happy to share[1] with you the design document for the Compressed Pristines feature. The document is public and anyone can comment on any part
>>>>>> I can't. Can you please move the document to our wiki, or dump it in an email to dev@, or on a pastebin, somewhere everyone can read it.
>>>>> I just opened it in an incognito window in Chrome. You should be able to access the thing.
>>>> Tried, I get as far as the doc title. I don't see its contents.
>>>
>>> Daniel (and all who can't access the doc), I'm attaching the PDF and ODT versions with updates based on Greg's comments. I'd like to hear all opinions and comments.
>>>
>>> Google docs is a fairly ideal environment for live commenting and editing, so it's too bad that you can't access the file. Please let me know if you have any notes/comments on the design.
>>>
>>> If you'd like to use the ODT file for comments and edits, please mark your input clearly and I'll update the Google doc with your notes.
>>>
>>> Thanks,
>>> Ash
>>>
>>> SubversionCompressedPristinesDesign.pdf
>>> SubversionCompressedPristinesDesign.odt
Re: Compressed Pristines (Design Doc)
On 22.03.2012 17:01, Branko Čibej wrote:
> On 22.03.2012 16:50, Daniel Shahaf wrote:
>> Branko Čibej wrote on Thu, Mar 22, 2012 at 16:37:24 +0100:
>>> It's called SQLite.
>> Heh. I wondered whether I should mention that the server uses BDB to store pristine files. (yes, the situation there is different in several relevant ways)
>
> To clarify: I'm /not/ advocating that we store each and every file into an SQLite BLOB. Files larger than several block sizes would be better off on disk as real files (the compressor can, e.g., buffer compressed contents up to, say, 32k, and if they become larger, spill directly into a file; otherwise, dump into a BLOB). If we don't care about shared pristine store, we don't even need a separate database, these blobs can go into wc.db (which, as Greg points out, also serves as an index).

Since we need a few datapoints, I made a quick test to see what kind of space savings we can get with SQLite. Note that I've not tried any auto-vacuum settings, because my test only does insertions.

I used a checkout of the current HTTPD trunk for my data set, and compressed all pristines, then moved them into a SQLite database depending on size; first, all compressed files 8k or smaller, next, all compressed files 32k or smaller. Note that my script does not prune empty directories from the pristine fanout. Here's the log:

  brane@zulu:~/src/httpd$ svn co http://svn.apache.org/repos/asf/httpd/httpd/trunk
  [...]
   U   trunk
  Checked out revision 1305001.

  brane@zulu:~/src/httpd$ find trunk/.svn/pristine -type f | wc -l
      3114
  brane@zulu:~/src/httpd$ du -sh trunk/.svn/pristine/
   42M    trunk/.svn/pristine/

  time gzip `find ./trunk/.svn/pristine -name '*.svn-base'`
  real    0m14.569s
  user    0m1.282s
  sys     0m0.747s
  brane@zulu:~/src/httpd$ du -sh trunk/.svn/pristine/
   17M    trunk/.svn/pristine/

  brane@zulu:~/src/httpd$ find trunk/.svn/pristine -size -8k -type f | wc -l
      2856

  #
  # N.B.: 8k max size per blob
  #
  brane@zulu:~/src/httpd$ time python pristine.py trunk/.svn/pristine/
  real    0m29.683s
  user    0m0.533s
  sys     0m1.641s
  brane@zulu:~/src/httpd$ du -sh trunk/.svn/pristine/
  4.7M    trunk/.svn/pristine/
  brane@zulu:~/src/httpd$ ll trunk/.svn/pristine//pristine.db
  -rw-r--r--  1 brane  staff  322560 Mar 25 12:43 ps/pristine.db

  #
  # N.B.: 32k max size per blob
  #
  brane@zulu:~/src/httpd$ time python pristine.py trunk/.svn/pristine/
  real    0m23.831s
  user    0m0.529s
  sys     0m1.616s
  brane@zulu:~/src/httpd$ du -sh trunk/.svn/pristine/
  1.2M    trunk/.svn/pristine/

The pristine.py script is attached.

Based on these observations, it's clear that the implementation should proceed as follows:

Step 1: Just compress the pristine files, do not use any packing. This gives a 60% decrease in disk usage in the HTTPD case, but even if the decrease is only 30%, it's still worth the effort.

Step 2: Store small (for some definition of small) compressed pristine files in a SQLite database. In the case of HTTPD, this gives extra savings of up to 90% in disk usage, but this is a very specific test case and it's hard to guess what kind of gain we'd get on average.

All in all, looking at these numbers, there's a /looong/ way to go before we start playing with custom pack formats and compression of packed similar files. I'm not at all sure we'll ever really need the potential space savings of these methods, especially compared to the obvious risk to WC stability that writing and testing such code brings.

Anyway, it's certain that creating this packed format is /not/ the first step to take.

-- Brane

  import os, sys
  import sqlite3

  # Compressed pristines larger than this many bytes stay on disk.
  MAX_BLOB = 32768

  class Pristine(object):
      def __init__(self, database):
          self.conn = sqlite3.connect(database, isolation_level='IMMEDIATE')
          self.conn.text_factory = str
          self.cursor = self.conn.cursor()
          self.cursor.execute("PRAGMA page_size = 1024")
          self.cursor.execute("PRAGMA encoding = 'UTF-8'")

      @classmethod
      def create(cls, database):
          # Start from an empty database on every run.
          if os.path.exists(database):
              os.unlink(database)
          db = cls(database)
          db.cursor.execute("""
              CREATE TABLE pristine (
                digest CHAR(40) PRIMARY KEY,
                contents BLOB)""")
          db.conn.commit()
          return db

      def insert(self, filename):
          # The pristine file name (minus extensions) is its SHA-1 digest.
          digest = os.path.basename(filename).partition('.')[0]
          with open(filename, 'rb') as f:
              contents = f.read()
          self.cursor.execute(
              "INSERT INTO pristine (digest, contents) VALUES (?, ?)",
              [digest, contents])
          self.conn.commit()
          # Remove the on-disk copy once it is safely in the database.
          os.remove(filename)

  if __name__ == '__main__':
      db = Pristine.create(os.path.join(sys.argv[1], 'pristine.db'))
      for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
          for name in filenames:
              if not name.endswith('.svn-base.gz'):
                  continue
              filename = os.path.join(dirpath, name)
              if os.stat(filename).st_size > MAX_BLOB:
                  continue
              # Presumed final step: the archived copy of the script
              # stops at the size check above.
              db.insert(filename)
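The "small results go into a BLOB, large ones spill to a file" idea from the message above can be sketched as follows. This is not Subversion code: the function names, the `.svn-base.z` suffix, the table layout and the 32k threshold are assumptions made for illustration, and for simplicity the sketch compresses the whole file in memory rather than streaming through a bounded buffer as the message suggests.

```python
import os
import sqlite3
import zlib

MAX_BLOB = 32 * 1024  # assumed threshold, mirroring the 32k test above


def open_store(pristine_dir):
    """Open (or create) a hypothetical pristine.db inside pristine_dir."""
    conn = sqlite3.connect(os.path.join(pristine_dir, "pristine.db"))
    conn.execute("CREATE TABLE IF NOT EXISTS pristine ("
                 "digest CHAR(40) PRIMARY KEY, contents BLOB)")
    return conn


def store_pristine(conn, pristine_dir, digest, data, max_blob=MAX_BLOB):
    """Compress data; keep small results in SQLite, spill large ones to a
    plain file. Returns "blob" or "file" to say where the contents went."""
    compressed = zlib.compress(data)
    if len(compressed) <= max_blob:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT OR REPLACE INTO pristine (digest, contents) "
                "VALUES (?, ?)", (digest, compressed))
        return "blob"
    path = os.path.join(pristine_dir, digest + ".svn-base.z")
    with open(path, "wb") as f:
        f.write(compressed)
    return "file"
```

A caller would still need a lookup that checks the database first and falls back to the file; that part is omitted here.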
AW: Compressed Pristines (Design Doc)
Hi, Erik,

From: Erik Huelsmann [mailto:ehu...@gmail.com]
> To substantiate that claim, I took the pristines directory from my Subversion working copy and did some experimenting. See results below:
>
> $ ls -ls uncompressed-pristines/*/*.svn-base | awk '{ tot += $1; } END { print "total size: " tot; }'
> total size: 188724
> $ cp -Rp uncompressed-pristines/ compressed-pristines
> $ gzip compressed-pristines/*/*.svn-base
> $ ls -ls compressed-pristines/*/*.svn-base.gz | awk '{ tot += $1; } END { print "total size: " tot; }'
> total size: 52320
> $ cat compressed-pristines/*/*.svn-base.gz > combined-compressed-file

Are you sure you should not combine the uncompressed pristines, and compress them afterwards? AFAICS, one of the points of the proposal is to profit from the inter-file redundancies.

Best regards,
Markus Schaber

--
We software Automation.

3S-Smart Software Solutions GmbH
Markus Schaber | Development
Memminger Str. 151 | 87439 Kempten | Tel. +49-831-54031-0 | Fax +49-831-54031-50
Email: m.scha...@3s-software.com | Web: http://www.3s-software.com
CoDeSys Internet-Forum: http://forum.3s-software.com
Managing directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Commercial register: Kempten HRB 6186 | VAT ID: DE 167014915
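The difference between "compress each file, then concatenate" and "concatenate, then compress" can be illustrated with a small sketch. The data here is synthetic (a made-up shared header plus a short unique body per file), so the numbers say nothing about real working copies; the point is only that a compressor can exploit redundancy between files when it sees them in one stream.

```python
import zlib


def compare_compression(files):
    """files: list of bytes objects.
    Returns (total size when compressed separately,
             size when concatenated first and compressed once)."""
    separate = sum(len(zlib.compress(data)) for data in files)
    combined = len(zlib.compress(b"".join(files)))
    return separate, combined


# Synthetic stand-ins for pristines that share boilerplate text.
HEADER = b"/* Hypothetical shared license header line, repeated. */\n" * 20
FILES = [HEADER + ("int f%d(void) { return %d; }\n" % (i, i)).encode("ascii")
         for i in range(50)]

if __name__ == "__main__":
    separate, combined = compare_compression(FILES)
    print("compressed separately:", separate, "bytes")
    print("combined, then compressed:", combined, "bytes")
```

Note that deflate (as used by gzip and zlib) only looks back over a 32 KB window, so with this compressor the benefit is limited to redundancy between files that end up close together in the combined stream.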
Re: Compressed Pristines (Design Doc)
On Wed, Mar 21, 2012 at 2:19 PM, Ashod Nakashian ashodnakash...@yahoo.com wrote:
> All,
>
> I'm happy to share[1] with you the design document for the Compressed Pristines feature. The document is public and anyone can comment on any part (select, right-click and comment away). If you'd like to get *editing* permission, please email me and I'll add you to the list of editors.
>
> I'm sure there will be much to criticize and debate, I'd love to hear all input, but being pragmatic, I also would like to a) experiment and figure out the best approach in practice, backed with real data and consensus and b) to finish this feature rather than debate forever (it's been debated for almost a decade this December!). As such, what's not clear, I've left out or written TBD notes and at the same time I've already made experimental changes locally to have a more learned information rather than an academic design (this, not to mention reading 100s of dev-list mails).
>
> I made a serious attempt at specifying as much of the hard facts/reqs/goals as possible to narrow the scope and avoid feature-creep. I'd like to take this feature on a lightweight branch and start committing code and getting reviews (and contributions!!) while we finalize the design and decide on the details (those who can create branches and grant commit rights please let me know when is the right time to do this - I'm ready and have code to commit and develop further).
>
> I thank everyone who will help us get this finally done in advance and look forward to hearing from you all.
>
> -Ash
>
> [1] https://docs.google.com/document/d/1ktIsewfMBMVBxbn-Ng8NwkNwAS_QJ6eC7GOygsbBeEc/edit

So, I've read through the design document, and the various threads, and have a couple of comments / questions which I don't think have been addressed. My first impression, though, is to give you major kudos for going through the effort to research and think about this complex and subtle problem. Now my thoughts...

As mentioned elsewhere, I too was surprised by the choice of a custom container, though I think you make a good argument for it. One simplification I was thinking about is this: what if the container only needed to support add and batch-delete operations? These are the current constraints of the existing pristine store; would they introduce additional simplicity into your design?

In some respects, it looks like you're solving *two* problems: compression and the internal fragmentation due to large FS block sizes. How orthogonal are the problems? Could they be solved independently of each other in some way? I know that compression exposes the internal fragmentation issue, but used alone it certainly doesn't make things *worse* does it?

Finally, in all the above let's not let the perfect be the enemy of the good. If something *simple* will give us demonstrable performance improvements now, can we do so without limiting our ability to do a more complex and complete solution later?

Anyway, good work, and here's hoping it yields fruit.

-Hyrum

--
uberSVN: Apache Subversion Made Easy
http://www.uberSVN.com/
Re: Compressed Pristines (Design Doc)
Thanks Ash! I'm in the middle of something right now, but I'll read it once I'm done.

Ashod Nakashian wrote on Thu, Mar 22, 2012 at 00:15:21 -0700:
> From: Daniel Shahaf danie...@elego.de
> To: Greg Stein gst...@gmail.com
> Cc: Ashod Nakashian ashodnakash...@yahoo.com; dev@subversion.apache.org
> Sent: Wednesday, March 21, 2012 2:08 PM
> Subject: Re: Compressed Pristines (Design Doc)
>
>> Greg Stein wrote on Wed, Mar 21, 2012 at 16:51:47 -0400:
>>> On Wed, Mar 21, 2012 at 16:11, Daniel Shahaf d...@daniel.shahaf.name wrote:
>>>> Ashod Nakashian wrote on Wed, Mar 21, 2012 at 12:19:02 -0700:
>>>>> All, I'm happy to share[1] with you the design document for the Compressed Pristines feature. The document is public and anyone can comment on any part
>>>> I can't. Can you please move the document to our wiki, or dump it in an email to dev@, or on a pastebin, somewhere everyone can read it.
>>> I just opened it in an incognito window in Chrome. You should be able to access the thing.
>> Tried, I get as far as the doc title. I don't see its contents.
>
> Daniel (and all who can't access the doc), I'm attaching the PDF and ODT versions with updates based on Greg's comments. I'd like to hear all opinions and comments.
>
> Google docs is a fairly ideal environment for live commenting and editing, so it's too bad that you can't access the file. Please let me know if you have any notes/comments on the design.
>
> If you'd like to use the ODT file for comments and edits, please mark your input clearly and I'll update the Google doc with your notes.
>
> Thanks,
> Ash
Re: Compressed Pristines (Design Doc)
OK, I've had a cruise through now.

First of all I have to say it's an order of magnitude larger than what I'd imagined it would be. That makes the "move it elsewhere" idea I'd had less practical than I'd predicted. I'm also not intending to take you up on your offer to proxy me to the doc, though thanks for making it.

Design-wise I'm a bit surprised that the choice ended up being rolling a custom file format.

Thanks for your work.

Cheers,
Daniel

Ashod Nakashian wrote on Thu, Mar 22, 2012 at 00:15:21 -0700:
> From: Daniel Shahaf danie...@elego.de
> To: Greg Stein gst...@gmail.com
> Cc: Ashod Nakashian ashodnakash...@yahoo.com; dev@subversion.apache.org
> Sent: Wednesday, March 21, 2012 2:08 PM
> Subject: Re: Compressed Pristines (Design Doc)
>
>> Greg Stein wrote on Wed, Mar 21, 2012 at 16:51:47 -0400:
>>> On Wed, Mar 21, 2012 at 16:11, Daniel Shahaf d...@daniel.shahaf.name wrote:
>>>> Ashod Nakashian wrote on Wed, Mar 21, 2012 at 12:19:02 -0700:
>>>>> All, I'm happy to share[1] with you the design document for the Compressed Pristines feature. The document is public and anyone can comment on any part
>>>> I can't. Can you please move the document to our wiki, or dump it in an email to dev@, or on a pastebin, somewhere everyone can read it.
>>> I just opened it in an incognito window in Chrome. You should be able to access the thing.
>> Tried, I get as far as the doc title. I don't see its contents.
>
> Daniel (and all who can't access the doc), I'm attaching the PDF and ODT versions with updates based on Greg's comments. I'd like to hear all opinions and comments.
>
> Google docs is a fairly ideal environment for live commenting and editing, so it's too bad that you can't access the file. Please let me know if you have any notes/comments on the design.
>
> If you'd like to use the ODT file for comments and edits, please mark your input clearly and I'll update the Google doc with your notes.
>
> Thanks,
> Ash
Re: Compressed Pristines (Design Doc)
From: Daniel Shahaf danie...@elego.de
To: Ashod Nakashian ashodnakash...@yahoo.com
Cc: dev@subversion.apache.org dev@subversion.apache.org
Sent: Thursday, March 22, 2012 7:30 AM
Subject: Re: Compressed Pristines (Design Doc)

> OK, I've had a cruise through now.
>
> First of all I have to say it's an order of magnitude larger than what I'd imagined it would be. That makes the "move it elsewhere" idea I'd had less practical than I'd predicted. I'm also not intending to take you up on your offer to proxy me to the doc, though thanks for making it.

If there are any ideas for simplifying things, I think it's well worth the effort. I for one am not for unnecessary complexity. This is why I took the time to outline a set of requirements. If the requirements are excessive, let's simplify them first. Only on the basis of the requirements can one justify the design.

> Design-wise I'm a bit surprised that the choice ended up being rolling a custom file format.

Personally I know of no library that can deliver the requirements that we need (outlined in the doc). Again, if the requirements are in question, let's simplify them. If there is such a library, suggesting it will save us a lot of time and effort. Otherwise, using a Tar-like container will just not cut it. On the other hand, the proposed custom format is rather simple and its code shouldn't be complex. In fact, I suspect Tar is more complex (considering it must store more information than we do).

-Ash

> Thanks for your work.
>
> Cheers,
> Daniel
>
> Ashod Nakashian wrote on Thu, Mar 22, 2012 at 00:15:21 -0700:
>> From: Daniel Shahaf danie...@elego.de
>> To: Greg Stein gst...@gmail.com
>> Cc: Ashod Nakashian ashodnakash...@yahoo.com; dev@subversion.apache.org
>> Sent: Wednesday, March 21, 2012 2:08 PM
>> Subject: Re: Compressed Pristines (Design Doc)
>>
>>> Greg Stein wrote on Wed, Mar 21, 2012 at 16:51:47 -0400:
>>>> On Wed, Mar 21, 2012 at 16:11, Daniel Shahaf d...@daniel.shahaf.name wrote:
>>>>> Ashod Nakashian wrote on Wed, Mar 21, 2012 at 12:19:02 -0700:
>>>>>> All, I'm happy to share[1] with you the design document for the Compressed Pristines feature. The document is public and anyone can comment on any part
>>>>> I can't. Can you please move the document to our wiki, or dump it in an email to dev@, or on a pastebin, somewhere everyone can read it.
>>>> I just opened it in an incognito window in Chrome. You should be able to access the thing.
>>> Tried, I get as far as the doc title. I don't see its contents.
>>
>> Daniel (and all who can't access the doc), I'm attaching the PDF and ODT versions with updates based on Greg's comments. I'd like to hear all opinions and comments.
>>
>> Google docs is a fairly ideal environment for live commenting and editing, so it's too bad that you can't access the file. Please let me know if you have any notes/comments on the design.
>>
>> If you'd like to use the ODT file for comments and edits, please mark your input clearly and I'll update the Google doc with your notes.
>>
>> Thanks,
>> Ash
Re: Compressed Pristines (Design Doc)
On Thu, Mar 22, 2012 at 11:18 AM, Ashod Nakashian ashodnakash...@yahoo.com wrote:
>> Design-wise I'm a bit surprised that the choice ended up being rolling a custom file format.
>
> Personally I know not of any library that can deliver the requirements that we need (outlined in the doc). Again, if the requirements are in question, let's simplify them. If there is such a library, suggesting it will save us a lot of time and effort. Otherwise, using a Tar-like container will just not cut it. On the other hand, the proposed custom format is rather simple and its code shouldn't be complex. In fact, I suspect Tar is more complex (considering it must store more information than we do).

I am not sure what Daniel meant, but I had always just assumed we would simply compress the files in the existing pristines. I think your document does a nice job explaining why that is not good enough. In that sense, I would also say that I was surprised by the choice of a custom file format, but that does not mean I would question it.

I think your document does a nice job in revealing some of the subtle complexities of this feature. That gives me more hope on progress towards a solution.

--
Thanks

Mark Phippard
http://markphip.blogspot.com/
Re: Compressed Pristines (Design Doc)
On 22.03.2012 16:23, Mark Phippard wrote:
> On Thu, Mar 22, 2012 at 11:18 AM, Ashod Nakashian ashodnakash...@yahoo.com wrote:
>>> Design-wise I'm a bit surprised that the choice ended up being rolling a custom file format.
>>
>> Personally I know not of any library that can deliver the requirements that we need (outlined in the doc). Again, if the requirements are in question, let's simplify them. If there is such a library, suggesting it will save us a lot of time and effort. Otherwise, using a Tar-like container will just not cut it. On the other hand, the proposed custom format is rather simple and its code shouldn't be complex. In fact, I suspect Tar is more complex (considering it must store more information than we do).
>
> I am not sure what Daniel meant, but I had always just assumed we would simply compress the files in the existing pristines. I think your document does a nice job explaining why that is not good enough. In that sense, I would also say that I was surprised by the choice of a custom file format, but that does not mean I would question it.
>
> I think your document does a nice job in revealing some of the subtle complexities of this feature. That gives me more hope on progress towards a solution.

I'd like to point out that there /is/ a library that handles storage, lookup, access and deletion of many small files in a single large one quite efficiently. Well tested, too, widely used, and configurable with regard to space reclamation. Moreover, we're already using that library.

It's called SQLite.

-- Brane
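The four operations named in the message above (storage, lookup, access, deletion) map onto plain SQL quite directly. The following is a minimal sketch using Python's standard sqlite3 module; the table layout and function names are invented for this example and are not Subversion's schema. `PRAGMA auto_vacuum` is one of SQLite's space-reclamation settings and has to be set before the first table is created.

```python
import sqlite3


def open_container(path=":memory:"):
    """Open a database that acts as a single-file container of small blobs."""
    conn = sqlite3.connect(path)
    # Space reclamation policy; must be chosen before any table exists.
    conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
    conn.execute("CREATE TABLE IF NOT EXISTS pristine ("
                 "digest TEXT PRIMARY KEY, contents BLOB)")
    return conn


def add(conn, digest, contents):
    with conn:  # one transaction per add
        conn.execute("INSERT INTO pristine (digest, contents) VALUES (?, ?)",
                     (digest, contents))


def lookup(conn, digest):
    row = conn.execute("SELECT contents FROM pristine WHERE digest = ?",
                       (digest,)).fetchone()
    return None if row is None else row[0]


def delete_many(conn, digests):
    """Batch delete: all removals happen in a single transaction."""
    with conn:
        conn.executemany("DELETE FROM pristine WHERE digest = ?",
                         [(d,) for d in digests])
```

Only add and batch-delete modify the container here, which lines up with the add/batch-delete constraint raised elsewhere in the thread.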
Re: Compressed Pristines (Design Doc)
Ashod Nakashian wrote on Thu, Mar 22, 2012 at 08:18:40 -0700: From: Daniel Shahaf danie...@elego.de To: Ashod Nakashian ashodnakash...@yahoo.com Cc: dev@subversion.apache.org dev@subversion.apache.org Sent: Thursday, March 22, 2012 7:30 AM Subject: Re: Compressed Pristines (Design Doc) OK, I've had a cruise through now. First of all I have to say it's an order of magnitude larger than what I'd imagined it would be. That makes the move it elsewhere idea I'd had less practical than I'd predicted. I'm also not intending to take you up on your offer to proxy me to the doc, though thanks for making it. If there are any ideas for simplifying things, I think it's well worth the effort. I for one am not for unecessary complexity. This is why I took the time to outline a set of requirements. If the requirements are excessive, let's simply them first. And based on the requirements alone can one justify the design. Fair enough. One requirement is extensibility (features in 1.9 timeframe, assuming your design is released in 1.8). I see you included a format number, but --- for example --- perhaps the index entries should contain a few RESERVED bytes too? (It would have help a lot in manually fixing FSFS corruptions if we'd left a few unused bytes here and there in revision files...) Another requirement is concurrency. ra_serf downloads files concurrently, and the editor (svn_delta_editor_t, 1.8's svn_editor_t) allows retrieving the text of multiple files concurrently. Does your design allow for adding two new pristines with their contents arriving interleaved? (There is one thread in the client process, but several TCP sockets.) Design-wise I'm a bit surprised that the choice ended up being rolling a custom file format. Personally I know not of any library that can deliver the requirements that we need (outlined in the doc). Again, if the requirements are in I'm not familiar offhand with such a library either, but perhaps someone else on list is. question, let's simplify them. 
If there is such a library, suggesting it will save us a lot of time and effort. Otherwise, using a Tar-like container will just not cut it. On the other hand, the proposed custom format is rather simple and its code shouldn't be complex. In fact, I suspect Tar is more complex (considering it must store more information than we do).

Let's see how far we can get with the custom format. If the "someone invented that wheel already" factor pops up too often I'm sure we'll notice.

Cheers, Daniel

-Ash

Thanks for your work. Cheers, Daniel

Ashod Nakashian wrote on Thu, Mar 22, 2012 at 00:15:21 -0700:

From: Daniel Shahaf danie...@elego.de
To: Greg Stein gst...@gmail.com
Cc: Ashod Nakashian ashodnakash...@yahoo.com; dev@subversion.apache.org
Sent: Wednesday, March 21, 2012 2:08 PM
Subject: Re: Compressed Pristines (Design Doc)

Greg Stein wrote on Wed, Mar 21, 2012 at 16:51:47 -0400:
On Wed, Mar 21, 2012 at 16:11, Daniel Shahaf d...@daniel.shahaf.name wrote:
Ashod Nakashian wrote on Wed, Mar 21, 2012 at 12:19:02 -0700:

All, I'm happy to share[1] with you the design document for the Compressed Pristines feature. The document is public and anyone can comment on any part

I can't. Can you please move the document to our wiki, or dump it in an email to dev@, or on a pastebin, somewhere everyone can read it.

I just opened it in an incognito window in Chrome. You should be able to access the thing.

Tried, I get as far as the doc title. I don't see its contents.

Daniel (and all who can't access the doc), I'm attaching the PDF and ODT versions with updates based on Greg's comments. I'd like to hear all opinions and comments. Google docs is a fairly ideal environment for live commenting and editing, so it's too bad that you can't access the file. Please let me know if you have any notes/comments on the design. If you'd like to use the ODT file for comments and edits, please mark your input clearly and I'll update the Google doc with your notes.

Thanks, Ash
Re: Compressed Pristines (Design Doc)
Hi,

I just want to shed light on three arguments against a new custom archive format.

First, compressing the files file-by-file using a standard format (like gz or xz) has the advantage of better debuggability. Developers can easily (de)compress or otherwise inspect those files using their standard utilities when trying to debug problems. Using a custom format always makes that process more difficult.

Second, more and more file systems support features like block suballocation, tail packing or tail merging[1]. This drastically reduces the space lost to files being smaller than the block size.

The third argument is simplicity of implementation. Just checking .svn/pristines/ab/abcd.gz in addition to .svn/pristines/ab/abcd when searching for a pristine file is much easier to implement.

I'm not opposed in general to storing pristines in an archive, but the disadvantages should be weighed when making the decision.

A different, somewhat related idea is a common pristine store somewhere in the user's directory, shared by several working copies. Especially when checking out several working copies of the same project (or similar branches), this could save a lot of network traffic.

Best regards
Markus Schaber

[1]: http://msdn.microsoft.com/en-us/library/windows/desktop/ee681827%28v=vs.85%29.aspx claims tail packing support for NTFS. http://en.wikipedia.org/wiki/Block_suballocation claims support for Btrfs, ReiserFS, Reiser4, FreeBSD UFS2. And AFAIR, XFS has a similar feature.

--
Markus Schaber | Developer, 3S-Smart Software Solutions GmbH

-----Original Message-----
From: Mark Phippard [mailto:markp...@gmail.com]
Sent: Thursday, March 22, 2012 16:23
To: Ashod Nakashian
Cc: Daniel Shahaf; dev@subversion.apache.org
Subject: Re: Compressed Pristines (Design Doc)

On Thu, Mar 22, 2012 at 11:18 AM, Ashod Nakashian ashodnakash...@yahoo.com wrote:

Design-wise I'm a bit surprised that the choice ended up being rolling a custom file format.

Personally I know not of any library that can deliver the requirements that we need (outlined in the doc). Again, if the requirements are in question, let's simplify them. If there is such a library, suggesting it will save us a lot of time and effort. Otherwise, using a Tar-like container will just not cut it. On the other hand, the proposed custom format is rather simple and its code shouldn't be complex. In fact, I suspect Tar is more complex (considering it must store more information than we do).

I am not sure what Daniel meant, but I had always just assumed we would simply compress the files in the existing pristines. I think your document does a nice job explaining why that is not good enough. In that sense, I would also say that I was surprised by the choice of a custom file format, but that does not mean I would question it. I think your document does a nice job in revealing some of the subtle complexities of this feature. That gives me more hope on progress towards a solution.

-- Thanks Mark Phippard http://markphip.blogspot.com/
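Editorial illustration, not part of the thread: Markus's third argument (look for abcd.gz next to abcd) can be sketched in a few lines of Python. The function name `open_pristine` and the two-character fan-out directory are assumptions modelled on the .svn/pristines/ab/abcd example in his mail; none of this is Subversion's actual API.

```python
import gzip
import os


def open_pristine(pristine_dir, checksum):
    """Return a binary file object that yields the uncompressed pristine text.

    Hypothetical helper: the layout <dir>/<first two characters>/<checksum>
    follows the example path in the mail, and ".gz" is the fallback name
    the mail proposes.
    """
    base = os.path.join(pristine_dir, checksum[:2], checksum)
    if os.path.exists(base):
        # Uncompressed pristine: hand it out unchanged.
        return open(base, "rb")
    if os.path.exists(base + ".gz"):
        # Compressed pristine: gzip.open decompresses transparently on read.
        return gzip.open(base + ".gz", "rb")
    raise FileNotFoundError(base)
```

Callers read from the returned object in the same way in both cases, which is the simplicity the mail is arguing for.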
Re: Compressed Pristines (Design Doc)
Branko Čibej wrote on Thu, Mar 22, 2012 at 16:37:24 +0100:

It's called SQLite.

Heh. I wondered whether I should mention that the server uses BDB to store pristine files. (yes, the situation there is different in several relevant ways)
Re: Compressed Pristines (Design Doc)
On 22.03.2012 16:50, Daniel Shahaf wrote:
Branko Čibej wrote on Thu, Mar 22, 2012 at 16:37:24 +0100:

It's called SQLite.

Heh. I wondered whether I should mention that the server uses BDB to store pristine files. (yes, the situation there is different in several relevant ways)

To clarify: I'm /not/ advocating that we store each and every file into an SQLite BLOB. Files larger than several block sizes would be better off on disk as real files (the compressor can, e.g., buffer compressed contents up to, say, 32k, and if they become larger, spill directly into a file; otherwise, dump into a BLOB). If we don't care about a shared pristine store, we don't even need a separate database; these blobs can go into wc.db (which, as Greg points out, also serves as an index).

-- Brane
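Editorial illustration, not part of the thread: the buffer-then-spill idea Brane describes could look roughly like the Python sketch below. The table name `pristine_blob`, the `.z` file suffix and both function names are hypothetical; the 32k limit is the "say, 32k" figure from the mail, and zlib stands in for whichever compressor would actually be chosen.

```python
import os
import sqlite3
import zlib

SPILL_LIMIT = 32 * 1024  # "say, 32k" from the mail; an assumption, not a tuned value


def open_blob_db(path=":memory:"):
    """Open a database holding the (hypothetical) pristine_blob table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pristine_blob "
        "(checksum TEXT PRIMARY KEY, data BLOB)"
    )
    return db


def store_pristine(db, pristine_dir, checksum, chunks, limit=SPILL_LIMIT):
    """Compress `chunks`; keep small results as a BLOB, spill large ones to a file.

    Returns "blob" or "file" to say where the compressed data ended up.
    """
    compressor = zlib.compressobj()
    buffered = bytearray()
    spill = None  # becomes an open file once the buffer exceeds `limit`
    path = os.path.join(pristine_dir, checksum + ".z")

    def emit(data):
        nonlocal spill
        if spill is not None:
            spill.write(data)
            return
        buffered.extend(data)
        if len(buffered) > limit:
            # Too big for a BLOB: move what we have to disk and keep streaming.
            spill = open(path, "wb")
            spill.write(buffered)
            buffered.clear()

    try:
        for chunk in chunks:
            emit(compressor.compress(chunk))
        emit(compressor.flush())
    finally:
        if spill is not None:
            spill.close()

    if spill is not None:
        return "file"
    db.execute(
        "INSERT OR REPLACE INTO pristine_blob (checksum, data) VALUES (?, ?)",
        (checksum, bytes(buffered)),
    )
    db.commit()
    return "blob"
```

The decision is made on the compressed size, as in the mail, so the caller never needs to know the final size up front.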
Re: Compressed Pristines (Design Doc)
On Thu, Mar 22, 2012 at 18:30, Daniel Shahaf danie...@elego.de wrote:

OK, I've had a cruise through now. First of all I have to say it's an order of magnitude larger than what I'd imagined it would be. That makes the "move it elsewhere" idea I'd had less practical than I'd predicted. I'm also not intending to take you up on your offer to proxy me to the doc, though thanks for making it. Design-wise I'm a bit surprised that the choice ended up being rolling a custom file format. Thanks for your work.

+1. I believe we should implement compressed pristines in a simple way: just compress the pristine files themselves, without inventing some new format.

-- Ivan Zhakov
Re: Compressed Pristines (Design Doc)
Hi Ash,

Thanks for picking up the initiative to implement this feature.

On Thu, Mar 22, 2012 at 7:01 PM, Ivan Zhakov i...@visualsvn.com wrote:
On Thu, Mar 22, 2012 at 18:30, Daniel Shahaf danie...@elego.de wrote:

OK, I've had a cruise through now. First of all I have to say it's an order of magnitude larger than what I'd imagined it would be. That makes the "move it elsewhere" idea I'd had less practical than I'd predicted. I'm also not intending to take you up on your offer to proxy me to the doc, though thanks for making it. Design-wise I'm a bit surprised that the choice ended up being rolling a custom file format. Thanks for your work.

+1. I believe we should implement compressed pristines in a simple way: just compress the pristine files themselves, without inventing some new format.

Like the others, I'm surprised we seem to be going with a custom file format. You claim source files are generally small in size and hence only small benefits can be had from compressing them, if at all, due to the fact that they would be of sub-block size already. To check that claim, I took the pristines directory from my Subversion working copy and did some experimenting. See the results below:

    $ ls -ls uncompressed-pristines/*/*.svn-base | awk '{ tot += $1; } END { print "total size: " tot; }'
    total size: 188724
    $ cp -Rp uncompressed-pristines/ compressed-pristines
    $ gzip compressed-pristines/*/*.svn-base
    $ ls -ls compressed-pristines/*/*.svn-base.gz | awk '{ tot += $1; } END { print "total size: " tot; }'
    total size: 52320
    $ cat compressed-pristines/*/*.svn-base.gz > combined-compressed-file
    $ ls -ls combined-compressed-file
    41812

So, if I look at the Subversion pristines in my working copy, individual compression reduces the allocated blocks from 100% to 27%. To be honest, I doubt the complexity we'll be importing just to reduce the allocated number of blocks from 27% to 22% is really worth it: the savings are already tremendous.
Won't the creation of a custom storage format just serve to destabilize our working copy? Do you have data which triggered you to design this custom format? Bye, Erik.
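Editorial illustration, not part of the thread: the percentages Erik quotes follow directly from the three block counts his commands report. The short Python check below uses only those three numbers; the variable and function names are made up for the example.

```python
# Allocated block counts as reported by `ls -ls` in Erik's mail.
uncompressed = 188724
gzip_per_file = 52320
gzip_combined = 41812


def percent_of(part, whole):
    """Express `part` as a percentage of `whole`."""
    return 100.0 * part / whole


per_file_pct = percent_of(gzip_per_file, uncompressed)
combined_pct = percent_of(gzip_combined, uncompressed)

print(f"per-file gzip: {per_file_pct:.1f}% of the uncompressed blocks")
print(f"combined file: {combined_pct:.1f}% of the uncompressed blocks")
print(f"extra saving from combining: {per_file_pct - combined_pct:.1f} percentage points")
```

The 27% and 22% in the mail are these two ratios rounded down; the gap between them is the part of the saving that only a packed format could deliver.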
Re: Compressed Pristines (Design Doc)
Erik Huelsmann ehu...@gmail.com writes:

Like the others, I'm surprised we seem to be going with a custom file format. You claim source files are generally small in size and hence only small benefits can be had from compressing them, if at all, due to the fact that they would be of sub-block size already.

I was surprised too, so I looked at GCC where a trunk checkout has 75,000 files of various types:

    $ find .svn/pristine -type f | wc -l
    75192

Uncompressed:

    $ du -hs .svn/pristine
    635M    .svn/pristine
    $ find .svn/pristine -type f | xargs ls -ls | awk '{tot += $1} END {print tot}'
    641536

Individually compressed is smaller by a factor of 2:

    $ find .svn/pristine -type f | xargs gzip
    $ du -hs .svn/pristine
    367M    .svn/pristine
    $ find .svn/pristine -type f | xargs ls -ls | awk '{tot += $1} END {print tot}'
    365624

As one single file is smaller by another factor of 3:

    $ find .svn/pristine -type f | xargs cat > one-big-file
    $ du -hs one-big-file
    122M    one-big-file
    $ ls -ls one-big-file | awk '{print $1}'
    124516

When individually compressed most of the 75,000 files are less than 4K:

    $ find .svn/pristine -size -4096c | wc -l
    71571

more than half are less than 1K:

    $ find .svn/pristine -size -1024c | wc -l
    53707

and nearly half are less than 0.5K:

    $ find .svn/pristine -size -512c | wc -l
    36521

In the uncompressed state:

    62323 are less than 4K
    36648 are less than 1K
    21828 are less than 0.5K

Maybe GCC is not typical but, rather to my surprise, combining the compressed files would be a significant improvement.

I also have an httpd trunk checkout (needs cleanup so bigger than normal):

    90M uncompressed
    37M individually compressed
    23M as one big file

That's more like your figures for Subversion where the major step is individual compression.

-- uberSVN: Apache Subversion Made Easy http://www.uberSVN.com
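Editorial illustration, not part of the thread: the GCC numbers above come down to block rounding. Every file occupies whole filesystem blocks, so many sub-block files waste most of their last block, while one combined file wastes at most one. The Python sketch below models that; the 4096-byte block size and both function names are assumptions, and the model ignores the tail-packing filesystems mentioned elsewhere in the thread.

```python
import math

BLOCK = 4096  # assumed filesystem block size in bytes


def allocated_bytes(sizes, block=BLOCK):
    """Disk space used when every file is rounded up to whole blocks."""
    return sum(math.ceil(size / block) * block for size in sizes)


def packed_bytes(sizes, block=BLOCK):
    """Disk space used when all files are concatenated into one file."""
    return math.ceil(sum(sizes) / block) * block


# Four made-up compressed pristines, three of them below one block.
sizes = [300, 700, 1500, 5000]
print(allocated_bytes(sizes), "bytes as separate files")
print(packed_bytes(sizes), "bytes as one packed file")
```

With tens of thousands of files under 4K, as in the GCC checkout, the per-file rounding is what the "one big file" measurement removes.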
Re: Compressed Pristines (Design Doc)
All,

I'm happy to share[1] with you the design document for the Compressed Pristines feature. The document is public and anyone can comment on any part (select, right-click and comment away). If you'd like to get *editing* permission, please email me and I'll add you to the list of editors.

I'm sure there will be much to criticize and debate, and I'd love to hear all input. Being pragmatic, though, I would also like to a) experiment and figure out the best approach in practice, backed by real data and consensus, and b) finish this feature rather than debate forever (it's been debated for almost a decade this December!). As such, where things are not clear, I've left them out or written TBD notes. At the same time I've already made experimental changes locally, so that the design is informed by practice rather than purely academic (not to mention reading 100s of dev-list mails). I made a serious attempt at specifying as many of the hard facts/reqs/goals as possible to narrow the scope and avoid feature-creep.

I'd like to take this feature on a lightweight branch and start committing code and getting reviews (and contributions!!) while we finalize the design and decide on the details (those who can create branches and grant commit rights please let me know when is the right time to do this - I'm ready and have code to commit and develop further).

I thank in advance everyone who will help us finally get this done, and I look forward to hearing from you all.
-Ash [1] https://docs.google.com/document/d/1ktIsewfMBMVBxbn-Ng8NwkNwAS_QJ6eC7GOygsbBeEc/edit From: Hyrum K Wright hyrum.wri...@wandisco.com To: Ashod Nakashian ashodnakash...@yahoo.com Cc: Philip Martin philip.mar...@wandisco.com; Greg Stein gst...@gmail.com; dev@subversion.apache.org dev@subversion.apache.org Sent: Monday, March 12, 2012 5:31 PM Subject: Re: Compressed Pristines On Mon, Mar 12, 2012 at 7:11 AM, Ashod Nakashian ashodnakash...@yahoo.com wrote: - Original Message - From: Philip Martin philip.mar...@wandisco.com To: Ashod Nakashian ashodnakash...@yahoo.com Cc: Greg Stein gst...@gmail.com; dev@subversion.apache.org dev@subversion.apache.org Sent: Monday, March 12, 2012 2:40 PM Subject: Re: Compressed Pristines Ashod Nakashian ashodnakash...@yahoo.com writes: * Are there any documentation/design/discussions on this feature that I could study? There has been some discussion in the past on the dev list. I don't think it is written down anywhere else. * Who should coordinate and be contacted on decision points? The dev list. * I know this feature was planned for 1.8. Is that still reasonable? (I can't find a release date for 1.8) Will 1.8 wait for this or has this feature been demoted to a low-priority in general? The reason I ask is to have a vague idea as to how close this feature is on the critical path to future releases. We don't plan like that. The features that will go into 1.8 are the features people choose to write. Got it. Thanks. Here is my plan. I'll study whatever discussion took place on dev-list. My main technical concern is related to any complications that aren't obvious (due to design or requirements elsewhere in svn). I'll come up with an architectural overview for the feature design and a breakdown of major milestones. When ready, I'll share them on this list and open it for discussion. Based on the results, code changes can commence. A few observations based upon my past poking around this area. 
The current implementation of the pristine store is designed to be streamy. That is, external users aren't supposed to know or care where the actual contents live, or how they are accessed, but simply get a stream from which they can read and write the contents. In principle, this should make compressing said contents relatively easy, as we could just insert a compressing stream in this pipeline and everything would automagically work.

Since this hasn't yet happened, you can probably guess that it isn't as easy as that. :) The primary issue when I looked at this problem was that the streamy abstraction is broken in several places, such as when we install the new pristine file. There are also certain consumers, such as external diff tools, that require an uncompressed on-disk file to operate on, and we currently just provide the pristine as that file. Compressed pristines would require recreating the uncompressed version when such a tool is invoked. Whether this is a useful tradeoff, I don't know.

Generally, though, I'm +1 for compressed pristines, as that was one of the design goals of wc-ng in the first place. (Oh, and extra points for selectively compressing pristines based upon mime-type.)

So start digging in, asking on the dev@ list and #svn-dev on Freenode, and sending in patches. You'll find a community of folks eager for your input, and willing to help you.

Hope that helps, -Hyrum -- uberSVN: Apache
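Editorial illustration, not part of the thread: "inserting a compressing stream in the pipeline" can be pictured with the Python sketch below. It is not Subversion's svn_stream_t API; `CompressingWriter` and `read_decompressed` are hypothetical names, and zlib is only a stand-in for whatever codec would be used.

```python
import io
import zlib


class CompressingWriter(io.RawIOBase):
    """Write-only stream that deflates everything written to it into `sink`."""

    def __init__(self, sink):
        self._sink = sink
        self._z = zlib.compressobj()

    def writable(self):
        return True

    def write(self, data):
        # Forward whatever compressed bytes are ready; may be empty for now.
        self._sink.write(self._z.compress(bytes(data)))
        return len(data)

    def close(self):
        if not self.closed:
            # Flush the compressor's tail exactly once; the sink stays open.
            self._sink.write(self._z.flush())
        super().close()


def read_decompressed(source, chunk_size=16384):
    """Yield uncompressed chunks from a stream written by CompressingWriter."""
    z = zlib.decompressobj()
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield z.decompress(chunk)
    yield z.flush()
```

A consumer that needs a real on-disk file, such as the external diff tools mentioned above, would have to copy the output of `read_decompressed` into a temporary file first, which is the tradeoff the mail raises.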
Re: Compressed Pristines (Design Doc)
Ashod Nakashian wrote on Wed, Mar 21, 2012 at 12:19:02 -0700:

All, I'm happy to share[1] with you the design document for the Compressed Pristines feature. The document is public and anyone can comment on any part

I can't. Can you please move the document to our wiki, or dump it in an email to dev@, or on a pastebin, somewhere everyone can read it.

Thanks
Re: Compressed Pristines (Design Doc)
On Wed, Mar 21, 2012 at 16:11, Daniel Shahaf d...@daniel.shahaf.name wrote:
Ashod Nakashian wrote on Wed, Mar 21, 2012 at 12:19:02 -0700:

All, I'm happy to share[1] with you the design document for the Compressed Pristines feature. The document is public and anyone can comment on any part

I can't. Can you please move the document to our wiki, or dump it in an email to dev@, or on a pastebin, somewhere everyone can read it.

I just opened it in an incognito window in Chrome. You should be able to access the thing.

Cheers, -g
Re: Compressed Pristines (Design Doc)
Greg Stein wrote on Wed, Mar 21, 2012 at 16:51:47 -0400:
On Wed, Mar 21, 2012 at 16:11, Daniel Shahaf d...@daniel.shahaf.name wrote:
Ashod Nakashian wrote on Wed, Mar 21, 2012 at 12:19:02 -0700:

All, I'm happy to share[1] with you the design document for the Compressed Pristines feature. The document is public and anyone can comment on any part

I can't. Can you please move the document to our wiki, or dump it in an email to dev@, or on a pastebin, somewhere everyone can read it.

I just opened it in an incognito window in Chrome. You should be able to access the thing.

Tried, I get as far as the doc title. I don't see its contents.
Re: Compressed Pristines (Design Doc)
On 21.03.2012 22:08, Daniel Shahaf wrote:
Greg Stein wrote on Wed, Mar 21, 2012 at 16:51:47 -0400:
On Wed, Mar 21, 2012 at 16:11, Daniel Shahaf d...@daniel.shahaf.name wrote:
Ashod Nakashian wrote on Wed, Mar 21, 2012 at 12:19:02 -0700:

All, I'm happy to share[1] with you the design document for the Compressed Pristines feature. The document is public and anyone can comment on any part

I can't. Can you please move the document to our wiki, or dump it in an email to dev@, or on a pastebin, somewhere everyone can read it.

I just opened it in an incognito window in Chrome. You should be able to access the thing.

Tried, I get as far as the doc title. I don't see its contents.

Enable javascript in your browser. Sorry.

-- Brane