Re: FSFS format7 and compressed XML bundles
On 2013-03-05 16:52:30 +, Julian Foad wrote: Vincent Lefevre wrote: [about server-side vs client-side] But even if there were no problems with the construction/reconstruction, it would be a bad solution, IMHO. Indeed, for a commit, it is the client that is supposed to expand the data before sending the diff to the server. What do you mean, the client [...] is supposed to expand the data? I don't understand why you think the client is supposed to do such a thing. Because the diff between two huge compressed files is generally huge (unless some rsync-friendly option has been applied, when available). So, if the client doesn't uncompress the data for the server, it will have to send a huge diff or a huge compressed file, even though the diff between the uncompressed data may be small. So, if deconstruction/reconstruction is possible (canonical form), it is much more efficient to do this on the client side. That point _is_ specific to a server-side solution. With a client-side solution, the user's word processor may not mind if a versioning operation such as a commit (through a decompressing plug-in) followed by checkout (through a re-compressing plug-in) changes the bit pattern of the compressed file, so long as the uncompressed content that it represents is unchanged. I disagree. It's not clear what you disagree with. With the second sentence (... may not mind ...), thus with the first sentence too. The word processor may not mind (in theory, because in practice, one may have bugs that depend on the bit pattern, and it would be bad to expose the user to such kinds of bugs and non-deterministic behavior), but for the user this may be important. For instance, a different bit pattern will break a possible signature on the compressed file. I agree that it *may* be important for the user, but the users have control so they can use this client-side scheme in scenarios where it works for them and not use it in other scenarios. 
But one would need a scheme that will also work in the case where users care about the bit pattern of the compressed file. Moreover even when the users know that the exact bit pattern of the compressed file is not important at some time, this may no longer be true in the future. For instance, some current word processor may ignore the dates in zip files, but future ones may take them into account. So, you need to wonder what data are important in a zip file, including undocumented ones used by some implementations (as the zip format allows extensions). Taking them into account when it appears that these data become meaningful is too late, because such data would have already been lost in past versions of the Subversion repository. On 2013-03-05 17:10:02 +, Julian Foad wrote: I (Julian Foad) wrote: Vincent Lefevre wrote: On 2013-03-05 13:30:28 +, Julian Foad wrote: Vincent Lefevre wrote: On 2013-03-01 14:58:10 +, Philip Martin wrote: A server-side solution is difficult. Suppose the client has some uncompressed content U which it compresses to C and sends to the server. The server can uncompress C to get U but unless the compression scheme has a canonical compressed form, with no other forms allowed, the server cannot avoid storing C because there is no guarantee that C can be reconstructed from U. This is not specific to the server side. Even on the client side, the reconstruction may not always be possible, e.g. if the system is upgraded or if NFS is used. And the compression level may need to be detected or provided in some way. Hi Vincent. I'm not sure you understood Philip's point. This should be clearer from what I meant below. What I'm saying is that whether this is done entirely on the server side (a bad solution, IMHO) or on the client side (see below why), the problems are similar. The point Philip made is *not* a problem if done client-side; Let me take that back. 
The point that I interpreted as being the most significant impact of what Philip said, namely that the Subversion protocols and system design require reproducible content, is only a problem when done server-side. Other impacts of that same point, such as you mentioned, are applicable no matter whether server-side or client-side. The Subversion protocols and system design *currently* require reproducible content, but if new features are added, e.g. due to the fact that the users don't mind about the exact compressed content of some file, then it could be decided to change the protocols and the requirements (the server could consider some canonical uncompressed form as a reference). [...] So my main point is that the server-side expand/compress is a non-starter of an idea, because it violates basic Subversion requirements, whereas client-side is a viable option for some use cases. I would reject the server-side expand/compress, not because of the current requirements
Re: some questions about the delta operation in SVN
Branko Čibej wrote on Wed, Mar 06, 2013 at 06:41:40 +0100: On 06.03.2013 06:21, Daniel Shahaf wrote: Bo Chen wrote on Tue, Mar 05, 2013 at 23:49:06 -0500: Can anyone help me make clear the following questions? Thanks very much. I make some updates, and the SVN client generates the delta and sends it to the SVN server. Does the server simply store this delta to the repository, or do something more? The latter. The client always generates a delta against file.c@HEAD, but the filesystem stores skip-deltas. That would be file@BASE, since that's what the client has a pristine version of. :) @BASE and @HEAD will be the same node-rev, else the commit will fail.
Re: some questions about the delta operation in SVN
Clarify one point: The file@base (or the file @head) refers to the file I am currently updating, right? I am very curious why the server needs to re-compute the skip-delta. Is there any rule to guide the server which pristine version to be delta-ed against? To optimize the delta (specifically, to optimize the storage for the delta)? Thanks. Bo On Wed, Mar 6, 2013 at 7:05 AM, Daniel Shahaf d...@daniel.shahaf.namewrote: Branko Čibej wrote on Wed, Mar 06, 2013 at 06:41:40 +0100: On 06.03.2013 06:21, Daniel Shahaf wrote: Bo Chen wrote on Tue, Mar 05, 2013 at 23:49:06 -0500: Can anyone help me make clear the following questions? Thanks very much. I make some updates, and the SVN client generates the delta and sends it to the SVN server. Does the server simply store this delta to the repository, or do something more? The latter. The client always generates a delta against file.c@HEAD, but the filesystem stores skip-deltas. That would be file@BASE, since that's what the client has a pristine version of. :) @BASE and @HEAD will be the same node-rev, else the commit will fail.
Re: some questions about the delta operation in SVN
Bo Chen wrote on Wed, Mar 06, 2013 at 10:14:41 -0500: I am very curious why the server needs to re-compute the skip-delta. Is there any rule to guide the server which pristine version to be delta-ed against? To optimize the delta (specifically, to optimize the storage for the delta)? See notes/skip-deltas in trunk. The delta base is not the immediate previous revision in order to reduce the delta chain length needed for reconstructing a random revision of the file from O(log H) to O(H), for a file with H revisions.
Re: some questions about the delta operation in SVN
Daniel Shahaf wrote on Wed, Mar 06, 2013 at 17:24:46 +0200: Bo Chen wrote on Wed, Mar 06, 2013 at 10:14:41 -0500: I am very curious why the server needs to re-compute the skip-delta. Is there any rule to guide the server which pristine version to be delta-ed against? To optimize the delta (specifically, to optimize the storage for the delta)? See notes/skip-deltas in trunk. The delta base is not the immediate previous revision in order to reduce the delta chain length needed for reconstructing a random revision of the file from O(log H) to O(H), for a file with H revisions. Typo in the above: it should read "from O(H) to O(log H)".
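The base-selection rule described in notes/skip-deltas can be illustrated in a few lines: the delta base for change number n is (roughly) n with its lowest set bit cleared, so the reconstruction chain for a file with H revisions has about popcount(H) = O(log H) links instead of H. This is an illustrative sketch of the scheme only, not Subversion's actual FSFS code:

```python
def skip_delta_base(n):
    """Pick the delta base for change number n: clear the lowest set bit.

    Illustrative version of the scheme sketched in notes/skip-deltas;
    the real FSFS implementation has more machinery around this.
    """
    return n & (n - 1)


def chain_length(n):
    """How many deltas must be combined to rebuild change n from change 0."""
    steps = 0
    while n > 0:
        n = skip_delta_base(n)
        steps += 1
    return steps


# chain_length(n) equals the number of set bits in n, i.e. O(log n),
# rather than the O(n) chain a plain previous-revision scheme would give.
```

For example, change 54 (binary 110110) deltas against 52, which deltas against 48, then 32, then 0: four hops for 54 revisions of history.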
Re: some questions about the delta operation in SVN
On Wed, Mar 6, 2013 at 7:24 AM, Daniel Shahaf d...@daniel.shahaf.name wrote: See notes/skip-deltas in trunk. Things are a little bit more complicated in trunk than what's in notes/skip-deltas for fsfs because we now have some knobs that let you adjust how the skip deltas behave. This changes the default behavior of a newly created repo in trunk versus the behavior in previous versions. If you're interested in the way trunk behaves you should also look at the db/fsfs.conf file under a repository created with a trunk svnadmin and look at the deltification section.
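For reference, the knobs mentioned above live in the [deltification] section of db/fsfs.conf in a repository created with a trunk (1.8-era) svnadmin. The option names below are quoted from memory and the values are only the documented-style defaults; check the generated fsfs.conf itself for the authoritative list:

```ini
[deltification]
### Whether directory and property representations may be deltified
### as well as file contents.
# enable-dir-deltification = false
# enable-props-deltification = false
### How far back FSFS may walk looking for a good delta base;
### 0 effectively disables deltification.
# max-deltification-walk = 1024
### How many revisions are deltified linearly (against the immediate
### predecessor) before the skip-delta scheme takes over.
# max-linear-deltification = 16
```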
Re: merge test 125 is coredumping
On Tue, Feb 26, 2013 at 11:37 PM, Julian Foad julianf...@btopenworld.com wrote: -- Certified Supported Apache Subversion Downloads: http://www.wandisco.com/subversion/download - Original Message - From: Philip Martin philip.mar...@wandisco.com To: dev@subversion.apache.org Cc: Sent: Monday, 25 February 2013, 5:37 Subject: Re: merge test 125 is coredumping Stefan Sperling s...@elego.de writes: Merge test 125 is leaving behind a core file on my system which contains the following trace. Is anyone else seeing this? at subversion/libsvn_client/merge.c:4743 4743 SVN_ERR_ASSERT(*gap_start < *gap_end); I see the same assert. This test has always been XFAIL. Sure, I see it too. It's associated with http://subversion.tigris.org/issues/show_bug.cgi?id=4132 "merge of replaced source asserts", as stated in the source comment on the assertion. I have tried investigating before but not yet succeeded in tracing the root cause. Fixed: http://svn.apache.org/viewvc?view=revision&revision=1453425 -- Paul T. Burba CollabNet, Inc. -- www.collab.net -- Enterprise Cloud Development Skype: ptburba
Re: FSFS format7 and compressed XML bundles
This is all very insightful and informative. For fun, I threw together a quick script which commits a series of extremely minor changes to an MS Word file and monitors how the repository size evolves. I then added the following lines to the script to commit not the original Word file but an unzipped and tarred version. I used the following commands to unzip and tar:

mkdir ziptar
(cd ziptar && unzip ../File.docx && tar cvf ../File.docx.tar ./*)
rm -rf ziptar

Here is what I get:

# Original file
Revision 1. 174 KB
Revision 2. 231 KB (delta 57K)
Revision 3. 304 KB (delta 73K)
Revision 4. 377 KB (delta 73K)

# With unzipping and tarring applied
Revision 1. 158 KB
Revision 2. 163 KB (delta 5K)
Revision 3. 172 KB (delta 9K)
Revision 4. 177 KB (delta 5K)

So significant (10X) space savings; with larger documents with heavy imagery the ratios would probably increase. And the second half of the zip-tar round-trip would of course need to be implemented in a hook at the right time. But as Vincent noted, this is not really a satisfying solution. Later versions of the office package may be sensitive to the difference that the zip-tar-zip round-trip introduces. I could see doing this to facilitate frequent intermediate commits of large long-lived documents. But I would probably never feel safe about it if I didn't commit the original at some intervals as well. In the end, the only satisfying long-term solution would be an efficient delta calculation between the two compressed representations, which would probably require the relevant office packages to use some sort of rsync-aware (or rsync-compatible) compression. On 3/6/2013 5:41 AM, Vincent Lefevre wrote: Moreover even when the users know that the exact bit pattern of the compressed file is not important at some time, this may no longer be true in the future. For instance, some current word processor may ignore the dates in zip files, but future ones may take them into account. 
So, you need to wonder what data are important in a zip file, including undocumented ones used by some implementations (as the zip format allows extensions). Taking them into account when it appears that these data become meaningful is too late, because such data would have already been lost in past versions of the Subversion repository.
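The unzip-and-tar normalization used in the experiment above can also be done in-process. The sketch below (a hypothetical helper, not part of any Subversion tooling) repacks a zip such as a .docx into an uncompressed tar so that Subversion's binary deltas operate on uncompressed content. Note that it deliberately normalizes member order and timestamps, which is precisely the kind of metadata loss this thread warns about:

```python
import io
import tarfile
import zipfile


def zip_to_tar_bytes(zip_path):
    """Repack a zip archive (e.g. a .docx) as an uncompressed tar.

    Versioning the tar instead of the zip lets delta compression see the
    uncompressed content, which is what produced the ~10x smaller deltas
    in the experiment above.  Illustrative only: member order is sorted
    and mtimes are zeroed, so the original zip's metadata is lost.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(zip_path) as zf, \
         tarfile.open(fileobj=out, mode="w") as tf:
        for name in sorted(zf.namelist()):   # sorted for a stable byte stream
            data = zf.read(name)
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = 0                   # normalized: zip dates are dropped
            tf.addfile(info, io.BytesIO(data))
    return out.getvalue()
```

A commit hook or client-side wrapper could call this before `svn commit` and reverse it (re-zip) after checkout, with the caveats about bit-pattern fidelity discussed in the other thread.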
Re: FSFS format7 and compressed XML bundles
Vincent Lefevre wrote: On 2013-03-05 16:52:30 +, Julian Foad wrote: Vincent Lefevre wrote: [about server-side vs client-side] [...] Because the diff between two huge compressed files is generally huge (unless some rsync-friendly option has been applied, when available). So, if the client doesn't uncompress the data for the server, it will have to send a huge diff or a huge compressed file, even though the diff between the uncompressed data may be small. So, if deconstruction/reconstruction is possible (canonical form), it is much more efficient to do this on the client side. Certainly that is true. That point _is_ specific to a server-side solution. With a client-side solution, the user's word processor may not mind if a versioning operation such as a commit (through a decompressing plug-in) followed by checkout (through a re-compressing plug-in) changes the bit pattern of the compressed file, so long as the uncompressed content that it represents is unchanged. I disagree. It's not clear what you disagree with. With the second sentence (... may not mind ...), thus with the first sentence too. [...] The word processor may not mind (in theory, because in practice, one may have bugs that depend on the bit pattern, and it would be bad to expose the user to such kinds of bugs and non-deterministic behavior), but for the user this may be important. For instance, a different bit pattern will break a possible signature on the compressed file. I agree that it *may* be important for the user, but the users have control so they can use this client-side scheme in scenarios where it works for them and not use it in other scenarios. But one would need a scheme that will also work in the case where users care about the bit pattern of the compressed file. Moreover even when the users know that the exact bit pattern of the compressed file is not important at some time, this may no longer be true in the future. 
For instance, some current word processor may ignore the dates in zip files, but future ones may take them into account. So, you need to wonder what data are important in a zip file, including undocumented ones used by some implementations (as the zip format allows extensions). Taking them into account when it appears that these data become meaningful is too late, because such data would have already been lost in past versions of the Subversion repository. If you are thinking about a solution that we can apply automatically, then yes it would need to work in the case where users care about preserving the bit pattern. I was thinking about an opt-in system, where the user is in control of specifying which files get processed in this way. If the user is unsure whether the non-preservation of the bit pattern is going to be important for their word processor files in the future, they can ask the provider of their word processor whether this kind of modification is officially supported. In many cases the answer will be "yes, we explicitly support that kind of archiving". On 2013-03-05 17:10:02 +, Julian Foad wrote: [...] Let me take that back. The point that I interpreted as being the most significant impact of what Philip said, namely that the Subversion protocols and system design require reproducible content, is only a problem when done server-side. Other impacts of that same point, such as you mentioned, are applicable no matter whether server-side or client-side. The Subversion protocols and system design *currently* require reproducible content, but if new features are added, e.g. due to the fact that the users don't mind about the exact compressed content of some file, then it could be decided to change the protocols and the requirements (the server could consider some canonical uncompressed form as a reference). Conceivably. [...] 
So my main point is that the server-side expand/compress is a non-starter of an idea, because it violates basic Subversion requirements, whereas client-side is a viable option for some use cases. I would reject the server-side expand/compress, not because of the current requirements (which could be changed to more or less match what happens on the client side), but because of performance reasons (see my first paragraph of this message). Interesting thoughts. The design of a bit-pattern-preserving solution is an interesting challenge. In general a compression algorithm may have no canonical form, and not even be deterministically reproducible using only data that is available in the compressed file, and in those cases I don't see any theoretical solution. However, perhaps some commonly used compressions are found in practice to be in a form which can be reconstructed by the compression algorithm, if given a set of parameters that we are able to extract from the compressed data. Perhaps it would be possible to design a scheme that scans the data stream for
Re: merge test 125 is coredumping
Paul Burba wrote: On Tue, Feb 26, 2013 at 11:37 PM, Julian Foad wrote: Philip Martin philip.mar...@wandisco.com Stefan Sperling s...@elego.de writes: Merge test 125 is leaving behind a core file on my system which contains the following trace. Is anyone else seeing this? at subversion/libsvn_client/merge.c:4743 4743 SVN_ERR_ASSERT(*gap_start < *gap_end); I see the same assert. This test has always been XFAIL. Sure, I see it too. It's associated with http://subversion.tigris.org/issues/show_bug.cgi?id=4132 "merge of replaced source asserts", as stated in the source comment on the assertion. I have tried investigating before but not yet succeeded in tracing the root cause. Fixed: http://svn.apache.org/viewvc?view=revision&revision=1453425 Fantastic! - Julian
[PATCH] correct installation of mod_dontdothat
Hello, The installation of mod_dontdothat was moved to make install-tools, however the trunk code tries to install with libtool which fails with the message: cannot install mod_dontdothat.la to a directory not ending in [...]/lib/apache2/modules The attached patch fixes this. This was mentioned earlier here: http://mail-archives.apache.org/mod_mbox/subversion-dev/201302.mbox/%3c87ip5yhlrd@ntlworld.com%3E This can be seen working with an rpm package of the nightly trunk tarballs here: https://build.opensuse.org/package/show?package=subversion&project=home%3AAndreasStieger%3Asvn18

[[[
* build.conf (mod_dontdothat): install as apache module
]]]

Regards, Andreas

Index: build.conf
===================================================================
--- build.conf	(revision 1453508)
+++ build.conf	(working copy)
@@ -383,7 +383,7 @@
 type = apache-mod
 path = tools/server-side/mod_dontdothat
 nonlibs = mod_dav_svn apr aprutil
 libs = libsvn_subr xml
-install = tools
+install = apache-mod
 msvc-libs = libhttpd.lib
 #
Re: [PATCH] correct installation of mod_dontdothat
[Andreas Stieger] The installation of mod_dontdothat was moved to make install-tools, however the trunk code tries to install with libtool which fails with the message: cannot install mod_dontdothat.la to a directory not ending in [...]/lib/apache2/modules That reminds me. We really should be installing Apache modules with 'libtool --mode=install', because on some platforms that is _not_ just a simple copy like you'd expect; sometimes it has to do other things. (In particular, the executable in-tree copy and the installed copy of an executable or library may need to be linked differently.) We'd have to work around libtool's sanity check on the install path. (Why does it do this? Who knows.) I guess install to a temporary directory, then do a normal copy from there to the real location. This is kind of a bite-sized task if anyone wants to take it on. I've been meaning to get to it myself for ages and ages - I don't have a very good track record of finding time to do even the little things for svn.
Re: [PATCH] correct installation of mod_dontdothat
Andreas Stieger wrote: The installation of mod_dontdothat was moved to make install-tools, however the trunk code tries to install with libtool which fails with the message: cannot install mod_dontdothat.la to a directory not ending in [...]/lib/apache2/modules The attached patch fixes this. This was mentioned earlier here: http://mail-archives.apache.org/mod_mbox/subversion-dev/201302.mbox/%3c87ip5yhlrd@ntlworld.com%3E This can be seen working with an rpm package of the nightly trunk tarballs here: https://build.opensuse.org/package/show?package=subversion&project=home%3AAndreasStieger%3Asvn18 [[[ * build.conf (mod_dontdothat): install as apache module ]]] With this patch, I confirm that my install (from an out-of-source-tree build) now completes without throwing an error. It now installs mod_dontdothat during make install-mods-shared instead of during make install-tools. However, I don't know if that's what we really want. If something is under tools, that suggests to me that it should be installed by install-tools and perhaps not by install-mods-shared... but we have to do something. I have no particular objection, as I'm sure package managers can work around it whatever way they wish. What do others think? - Julian
Re: some questions about the delta operation in SVN
On Wed, Mar 6, 2013 at 4:14 PM, Bo Chen bo.irvine.c...@gmail.com wrote: Clarify one point: The file@base (or the file @head) refers to the file I am currently updating, right? I am very curious why the server needs to re-compute the skip-delta. Is there any rule to guide the server which pristine version to be delta-ed against? To optimize the delta (specifically, to optimize the storage for the delta)? Thanks. Bo Hi Bo, To prevent any confusion, I want to point out that there are two *independent* places where deltas are being applied: (1) Between client and server. This is to conserve network bandwidth and is optional (depending on the protocol, they may simply send fulltext). The delta base is always the latest version that the client has, i.e. BASE. In case of an update, the client tells the server what the respective BASE revision is ("I'm on rev 42 for sub-tree X/Y. Please send me the data for revision 59."). Data sent from the client to the server is *always* fully reconstructed from the incoming delta. This is necessary to calculate and verify the MD5 / SHA1 checksums. All of this happens in a streaming fashion, i.e. the data gets reconstructed and processed *while* coming in. There is no temporary file on the server eventually containing the full file contents. Data sent from the server to the client always starts as a fulltext read from the repository. If the client already has another version of that file and the protocol supports deltas, the server will read that other revision from the repository, too, and then calculate the delta while streaming it to the client. (2) When the server writes data to the repository, it starts off with some fulltext coming in and *may* choose to deltify the new contents against some existing contents. This is done to conserve disk space and results in a chain of deltas that *all* need to be read and combined to reconstruct the fulltext. 
As Ben already pointed out, 1.8 has a number of tuning knobs that allow you to shift the balance between data size (small deltas) and reconstruction effort (number of deltas to read and process for a given fulltext). -- Stefan^2.
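The streaming reconstruction with incremental checksumming described in (1) can be mimicked in a few lines. This is a toy stand-in, not Subversion's svndiff/`svn_txdelta_*` window handling; it only demonstrates that the checksum can be verified while chunks arrive, with no temporary fulltext file:

```python
import hashlib


def receive_fulltext(chunks, expected_md5):
    """Consume delta-reconstructed chunks one at a time.

    The MD5 is updated incrementally as each chunk arrives off the
    wire, so verification needs no temporary file holding the whole
    fulltext -- the same idea as the server-side streaming Stefan
    describes (the real code drives svn_txdelta window handlers).
    """
    md5 = hashlib.md5()
    pieces = []
    for chunk in chunks:      # each chunk arrives as it is reconstructed
        md5.update(chunk)     # checksum advances with the stream
        pieces.append(chunk)  # in FSFS this would go straight to the rev file
    if md5.hexdigest() != expected_md5:
        raise IOError("checksum mismatch: transmission corrupted")
    return b"".join(pieces)
```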
Re: [PATCH] correct installation of mod_dontdothat
Peter Samuelson pe...@p12n.org writes: That reminds me. We really should be installing Apache modules with 'libtool --mode=install', because on some platforms that is _not_ just a simple copy like you'd expect; sometimes it has to do other things. We currently use Apache's apxs to install mod_dav_svn and mod_authz_svn and we leave it up to that script to invoke libtool as required. Are you saying we should explicitly invoke libtool? Does apxs do the wrong thing?
Re: [PATCH] correct installation of mod_dontdothat
That reminds me. We really should be installing Apache modules with 'libtool --mode=install', because on some platforms that is _not_ just a simple copy like you'd expect; sometimes it has to do other things. [Philip Martin] We currently use Apache's apxs to install mod_dav_svn and mod_authz_svn and we leave it up to that script to invoke libtool as required. Yes, well, how would apxs know anything about libtool? apxs just knows there's a module at such-and-such path and it needs to be installed. Are you saying we should explictly invoke libtool? Does apxs do the wrong thing? It certainly does the wrong thing in my Debian build, so I've had to patch it to use 'libtool --mode=install' instead / in addition. Say you build svn in /tmp/xyz. Then in order to make sure you can _run_ your stuff without installing, at least on some platforms, libtool arranges for executables and libraries to include all sorts of paths like /tmp/xyz/subversion/libsvn_client in the default library search path baked into the executable. (This is called the RPATH and you can view it with 'objdump -p'.) When you 'make install', libtool then _relinks_ everything to remove those RPATH references to /tmp/xyz. This actually has security implications. If you build svn in /tmp/xyz, install it system-wide, and a malicious user later creates their own /tmp/xyz/subversion/libsvn_client with a trojaned library ... you don't want the system svn to actually _use_ it. Peter
Re: [PATCH] correct installation of mod_dontdothat
Peter Samuelson pe...@p12n.org writes: That reminds me. We really should be installing Apache modules with 'libtool --mode=install', because on some platforms that is _not_ just a simple copy like you'd expect; sometimes it has to do other things. [Philip Martin] We currently use Apache's apxs to install mod_dav_svn and mod_authz_svn and we leave it up to that script to invoke libtool as required. Yes, well, how would apxs know anything about libtool? apxs just knows there's a module at such-and-such path and it needs to be installed. On my Debian box apxs knows about libtool. Are you saying we should explicitly invoke libtool? Does apxs do the wrong thing? It certainly does the wrong thing in my Debian build, so I've had to patch it to use 'libtool --mode=install' instead / in addition. On my Debian box apxs does the right thing:

if true ; then cd subversion/mod_authz_svn ; /usr/bin/install -c -d /usr/local/subversion/apache ; /usr/bin/apxs2 -i -S LIBEXECDIR=/usr/local/subversion/apache -n authz_svn mod_authz_svn.la ; fi
/usr/share/apache2/build/instdso.sh SH_LIBTOOL='/usr/share/apr-1.0/build/libtool' mod_authz_svn.la /usr/local/subversion/apache
/usr/share/apr-1.0/build/libtool --mode=install cp mod_authz_svn.la /usr/local/subversion/apache/
libtool: install: warning: relinking `mod_authz_svn.la'
libtool: install: (cd /home/pm/sw/subversion/obj/subversion/mod_authz_svn; /bin/sh /home/pm/sw/subversion/obj/libtool --tag CC --silent --mode=relink gcc -shared -g -O2 -pthread -rpath /usr/local/subversion/apache -avoid-version -module -o mod_authz_svn.la mod_authz_svn.lo ../../subversion/libsvn_repos/libsvn_repos-1.la ../../subversion/libsvn_subr/libsvn_subr-1.la )
libtool: install: cp .libs/mod_authz_svn.soT /usr/local/subversion/apache/mod_authz_svn.so
libtool: install: cp .libs/mod_authz_svn.lai /usr/local/subversion/apache/mod_authz_svn.la
libtool: finish: PATH=/usr/local/subversion/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin ldconfig -n /usr/local/subversion/apache

Perhaps you mean mod_dontdothat? That doesn't get installed properly because it doesn't use apxs.
Re: FSFS format7 and compressed XML bundles
On 2013-03-06 18:55:55 +, Julian Foad wrote: I don't know if anything like that would be feasible. It may be possible in theory but too complex in practice. The parameters we need to extract would include such things as the Huffman coding tables used and also parameters that influence deeper implementation details of the compression algorithm. And of course for each compression algorithm we'd need an implementation that accepts all of these parameters -- an off-the-shelf compression library probably wouldn't support this. The parameters could also be provided by the user, e.g. via a svn property. For instance, if the user wants some file file.xz to be handled uncompressed, he can add a svn:compress property whose value is xz -9 (if the -9 option was used). Then the client would do a unxz on the file. If the user wants the bit pattern to be preserved, the client would also do a xz -9 on the unxz output. If some command fails or the result is not identical to the initial file (for preserved bit pattern), the file would be handled compressed (or the client should issue an error message if requested by the user). Otherwise the file could be handled uncompressed. This is the basic idea. Then there are various implementation choices, such as whether the commands should be part of Subversion or external commands provided by the system. With a property, it would not be possible to change the behavior on past revisions, but tools could do that on a svn dump. -- Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/ 100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/ Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
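The round-trip check at the heart of this proposal -- decompress, recompress with the recorded parameters, and fall back to versioning the opaque compressed blob if the bit pattern differs -- can be sketched with Python's lzma module standing in for an external `xz -9`. The svn:compress property and the helper below are hypothetical; nothing like this exists in Subversion:

```python
import lzma


def try_store_uncompressed(blob, preset=9):
    """Return (store_uncompressed, payload).

    Sketch of the hypothetical svn:compress logic: if recompressing the
    expanded data with the recorded preset reproduces the original bit
    pattern exactly, the client may safely version the uncompressed
    form; otherwise it keeps the compressed bytes untouched.
    """
    try:
        expanded = lzma.decompress(blob)
    except lzma.LZMAError:
        return (False, blob)              # not valid xz data: keep as-is
    if lzma.compress(expanded, preset=preset) == blob:
        return (True, expanded)           # bit pattern is reproducible
    return (False, blob)                  # preserve the exact bytes instead
```

Whether a given blob passes depends on the compressor that produced it matching the one used for the check, which is exactly the fragility (system upgrades, different xz builds) raised earlier in the thread.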