Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
I've written a metastore clone for a project where we need to store a linux distribution in version control (legacy code). I'm also using it for my personal vcs-home stuff. It is a naive and bluntly straightforward way to do this, but it seems to be working. You can find it at https://github.com/harleypig/gitperms

I use git hooks and a central file to (re)store the metadata. Maybe it can be of some use to someone else.

--
Harley J Pig
___
vcs-home mailing list
vcs-home@lists.madduck.net
http://lists.madduck.net/listinfo/vcs-home
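A minimal sketch of the "central file" idea mentioned above, for readers who want to see the mechanism. It is not the gitperms implementation: the JSON format and the function names `save_modes` and `restore_modes` are assumptions made for illustration. A pre-commit hook could call the first function and a post-checkout hook the second.

```python
import json
import os
import stat


def save_modes(root, store_path):
    """Record the permission bits of every file under root in one JSON file."""
    modes = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            modes[rel] = stat.S_IMODE(os.lstat(path).st_mode)
    with open(store_path, "w") as fh:
        json.dump(modes, fh, indent=0, sort_keys=True)
    return modes


def restore_modes(root, store_path):
    """Re-apply the recorded permission bits to files that still exist."""
    with open(store_path) as fh:
        modes = json.load(fh)
    for rel, mode in modes.items():
        path = os.path.join(root, rel)
        if os.path.exists(path) and not os.path.islink(path):
            os.chmod(path, mode)
```

Only permission bits are handled here; ownership and timestamps would need extra fields in the stored record.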
Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
On Sun, Apr 10, 2011 at 16:43, Harley J Pig harley...@gmail.com wrote:
> You can find it at https://github.com/harleypig/gitperms

Are you willing to bounce that onto the git list or should I do so?

Richard
Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
On Sun, Apr 10, 2011 at 09:48, Richard Hartmann richih.mailingl...@gmail.com wrote:
> Are you willing to bounce that onto the git list or should I do so?

I'm not subscribed to that list, go ahead and post it if you would. Thank you.

--
Harley J Pig
Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
On Mon, Apr 11, 2011 at 02:07, Harley J Pig harley...@gmail.com wrote:
> I'm not subscribed to that list, go ahead and post it if you would. Thank you.

Done.

Richard
Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
On Sat, Apr 9, 2011 at 04:42, Christophe-Marie Duquesne chm.duque...@gmail.com wrote:
> git-annex does location tracking. Even if you delete the link, the file is still there and other repositories know what repositories have the file. If you want to be sure the file is always reachable, you have to force a repository to act central and to download every file. That is a mount option I have already added (-o getall).

FYI, git-annex gained the ability to use a bup remote. If used correctly, this will solve all problems in this regard and will even give you indefinite and full history.

As an aside, please look here [1] for a current discussion on how to store metadata in git, which would enable git-annex to do so and, in turn, let any FUSE front-ends act more in line with normal file systems. Smudge filters were mentioned so this must be good ;)

Richard

[1] http://marc.info/?l=git&m=130220380412726&w=4
Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
I'll try to gather the things I can answer:

> I see you include fuse.py - http://code.google.com/p/fusepy/ - in your repo. how does it compare to fuse-python - http://pypi.python.org/pypi/fuse-python ?

fusepy is written with ctypes while fuse-python is a full-blown C extension. At first, I was using fuse-python, but I ended up thinking fusepy was less bloated and less painful (just a file to include versus a library to compile and install).

> where will you store this backup copy? introducing a node/repository which will hold backup copies can be considered going to a centralized model; which is something you (Christophe-Marie) try to explicitly avoid, but I think this is not necessarily a problem)

git-annex does location tracking. Even if you delete the link, the file is still there and other repositories know which repositories have the file. If you want to be sure the file is always reachable, you have to force a repository to act as a central one and to download every file. That is a mount option I have already added (-o getall).

> This is also an area I hope to improve in git-annex, by using git smudge filters. So it might get a mode where files can be modified and git commit just annexes the new content.

That would be great. I am not sure using fuse would still be necessary, then.
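To illustrate the ctypes point made above (a single Python file versus a compiled extension), here is a small sketch of how ctypes binds to a shared library at run time. It calls `strlen` from the C library and not anything from libfuse; the wrapper name `c_strlen` is made up for this example, and the sketch assumes a Unix-like system.

```python
import ctypes
import ctypes.util

# Locate and load the C library at run time; no compiler is involved.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Return the length of a NUL-terminated byte string, computed by libc."""
    return libc.strlen(data)
```

A ctypes-based binding declares the signatures of the library's functions in the same way, which is why it can ship as one Python file with nothing to build.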
Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
Hi,

I see there have been some good thoughts given about this. I am currently on vacation in a place where I do not have internet access. I'll come back to you in a week.

Regards,
Christophe-Marie
Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
On Sun, Apr 3, 2011 at 11:35, Dieter Plaetinck die...@plaetinck.be wrote:
> - centralized: have 1 (or more) remotes that always keep a copy of the files which are being removed on all other remotes, these would be backup-nodes, they don't follow the strict always in sync rule that applies to the regular nodes. (they follow the original git-annex idea more strictly)

FWIW, there has been talk about using bup as a storage back-end for git-annex. That would allow you to keep full revision history and all files in one or two main locations and just use plain git-annex on the other ones.

> - decentralized: allow users to remove files by removing the symlink, but still keep the blob in .git-annex on at least one of the nodes, so that it can be restored from that.

Leaving a stale object in the store that no one really knows about seems like an extremely bad idea. And even if git-annex were able to track its existence internally while hiding the symlink from the user, I fear this would cause confusion. I would prefer a way to properly delete a file from all repos, but the bup-backed one would obviously still keep everything around.

Of course, you wouldn't need the bup back-end for your podcasts, but for photos or other important personal data, it would be useful.

Richard
Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
Dieter Plaetinck wrote:
> I think having support for this in git-annex would be very useful, even if it's not that efficient: if this can be dealt with in git-annex, individual higherlevel projects like sharebox and dvcs-autosync have less headaches. Not to mention sharebox/dvcs-autosync would need to do really inefficient things to deal with it anyway. (because they can't involve themselves into the actual git/dvcs tricks, they work on a higher level of abstraction), and it might also benefit some users who work with git-annex manually. How do you see this? How hard/cumbersome is it to implement this in git-annex? Why is it inefficient? It's not really clear to me after reading the smudge information on
> http://www.kernel.org/pub/software/scm/git/docs/gitattributes.html
> http://git-annex.branchable.com/todo/smudge/
>
> > if toobig
> >     then git_annex_add file
> >     else git_add file
> > git_commit file
>
> unfortunately I don't think so:
> - with dvcs-autosync we often commit early, as in, the file could still be in the process of being written to, or it could be modified again after we added it. From what I understand, we would need to forbid our users from changing the file after it is added to git-annex, and worse: if git-annex does its move file, replace file with symlink trick, while the user is writing to it, this might break things.

You're right. However, you would also not want to commit many partial versions of a large file as it was being written.

> - when a remote A pulls in the changes from remote B, for dropbox-like behavior it should also automatically:
>   * run `git annex get`
>   * git commit .git-annex/*/*.log
> Does this seem about right?

Yes.

> - deletes will also need to propagate automatically (see next paragraph), still need to figure out how to do that best. Note that dropbox-like behavior is different from the behavior you usually expect from git-annex users.
>   * usual git-annex behavior: every remote stands on its own, there is no forced being in sync, so that deletes must happen as initiated by the user, and this way you can prevent them from removing files if you expect it could be the last instance of the file.
>   * dropbox-like: remote A removes a file -> *all other remotes* should remove the file, so that their working copy looks the same. BUT the file should still be available *somewhere* so that a restore can be initiated (preferably from any of these nodes)
>
> I see two solutions here:
> - centralized: have 1 (or more) remotes that always keep a copy of the files which are being removed on all other remotes, these would be backup-nodes, they don't follow the strict always in sync rule that applies to the regular nodes. (they follow the original git-annex idea more strictly)
> - decentralized: allow users to remove files by removing the symlink, but still keep the blob in .git-annex on at least one of the nodes, so that it can be restored from that.

Yes, that's the default behavior if the symlink is removed. There is then a git annex unused pass that can be used to find and remove unused content when space is needed. Given the size of modern drives, that could be run nightly or something.

--
see shy jo
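The "centralized" option discussed above can be modelled in a few lines. This is a toy in-memory simulation, not git-annex code: the `Node` class and the helper names are invented for illustration, and it assumes that a delete removes the visible link on every node while only backup nodes keep the content.

```python
class Node:
    """A repository with visible links and locally stored content."""

    def __init__(self, name, is_backup=False):
        self.name = name
        self.is_backup = is_backup
        self.links = set()   # file names visible in the working tree
        self.content = {}    # file name -> bytes held locally


def add_file(nodes, name, data):
    """Synchronise a new file to every node."""
    for node in nodes:
        node.links.add(name)
        node.content[name] = data


def delete_everywhere(nodes, name):
    """Dropbox-like delete: the link vanishes everywhere, backups keep the data."""
    for node in nodes:
        node.links.discard(name)
        if not node.is_backup:
            node.content.pop(name, None)


def restore(nodes, name):
    """Bring a deleted file back from any node that still holds its content."""
    source = next((n for n in nodes if name in n.content), None)
    if source is None:
        raise KeyError(name)
    for node in nodes:
        node.links.add(name)
        node.content[name] = source.content[name]
```

In the "decentralized" variant, the `is_backup` test would be replaced by a rule that keeps the content on at least one arbitrary node until something like the unused pass reclaims it.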
Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
Richard Hartmann wrote:
> I know Joey pondered this as well, you will find some references on git-annex' ikiwiki. This is needed for S3 in the medium term, anyway. Basically, the plan is to encrypt the files with a symmetric key and then allow access to that key via other keys. That way, you can share some files between machines/people and still make sure no one gets at stuff they shouldn't. The way to encrypt object files' names is still somewhat open to discussion, afaik.
>
> Classical dilemma: Where should this be discussed? On this list or within the ikiwiki? Maybe everyone interested should read through the ikiwiki and after some discussion here, we can dump use cases, design decisions etc back into ikiwiki as a TODO once Joey is happy with it?

I've put together my current thoughts at http://git-annex.branchable.com/design/encryption/

Comments appreciated in any medium (except watercolors).

--
see shy jo
Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
On Sun, 3 Apr 2011 11:18:05 -0400 Joey Hess j...@kitenet.net wrote:
> Dieter Plaetinck wrote:
> > I think having support for this in git-annex would be very useful, even if it's not that efficient: if this can be dealt with in git-annex, individual higherlevel projects like sharebox and dvcs-autosync have less headaches. Not to mention sharebox/dvcs-autosync would need to do really inefficient things to deal with it anyway. (because they can't involve themselves into the actual git/dvcs tricks, they work on a higher level of abstraction), and it might also benefit some users who work with git-annex manually. How do you see this? How hard/cumbersome is it to implement this in git-annex? Why is it inefficient? It's not really clear to me after reading the smudge information on
> > http://www.kernel.org/pub/software/scm/git/docs/gitattributes.html
> > http://git-annex.branchable.com/todo/smudge/
> >
> > > if toobig
> > >     then git_annex_add file
> > >     else git_add file
> > > git_commit file
> >
> > unfortunately I don't think so:
> > - with dvcs-autosync we often commit early, as in, the file could still be in the process of being written to, or it could be modified again after we added it. From what I understand, we would need to forbid our users from changing the file after it is added to git-annex, and worse: if git-annex does its move file, replace file with symlink trick, while the user is writing to it, this might break things.
>
> You're right. However, you would also not want to commit many partial versions of a large file as it was being written.

Well, if it ever happens once, that's once too many. Since we're aiming for a dropbox-like near-instant-synchronisation system, the way of working is different than when using git for, say, version controlling source code. So it _will_ happen that we commit versions of files while they are in the process of being written. Even if the user decides to store something like a continuously updated logfile in his dropbox-like system, I want to be able to support that.
Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
Dieter Plaetinck wrote:
> @Joey: you mentioned you think inotify might be a better backend/paradigm for this than fuse, so do you think implementing git-annex in something like dvcs-autosync is feasible? and/or preferable?

Feasible? Certainly. Preferable? I'm in the let a thousand flowers bloom phase. It's spring. :)

As Christophe-Marie has pointed out, git-annex makes annexed files semi-immutable, and FUSE can hide that quirk, while inotify watching cannot. That could be confusing for certain users or use cases, if they are not aware of what is going on. Or it could be something quickly learned about how these special replicated directories work, that files have to be copied to be changed.

This is also an area I hope to improve in git-annex, by using git smudge filters. So it might get a mode where files can be modified and git commit just annexes the new content. Last time I looked at this, git was not *quite* there to let it be done efficiently.

> I quite like dvcs-autosync (partially because inotify is more simple than fuse, partially because it currently works already quite well) and I'm interested in making it support space efficient storage of big files; from what I've read it should be possible to do this with git-annex (which should not even change how we currently deal with small files, they would still be in git) but I'm still doing my first baby steps with git-annex so I wouldn't know. Advice very welcome..

All it probably needs at its simplest is something like this (excuse the haskell):

    toobig <- checkFileSize file
    if toobig
        then git_annex_add file
        else git_add file
    git_commit file

> Another note : files being tracked with git-annex through sharebox or dvcs-autosync or whatever should always have at least 1 backup copy, so that if the file gets deleted everywhere, it still can be retrieved from somewhere (which raises the interesting question: where will you store this backup copy? introducing a node/repository which will hold backup copies can be considered going to a centralized model; which is something you (Christophe-Marie) try to explicitly avoid, but I think this is not necessarily a problem)

This is something git annex goes to large lengths to deal with. It will enforce N backup copies; it tracks which other repositories have which files; it can transfer wanted file contents from other repositories in either a decentralized or a centralized manner; the other repositories can be on other drives of the same computer, or accessible by ssh, or even, now, Amazon S3.

--
see shy jo
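For readers who do not read Haskell, the size-based dispatch sketched in the message above might look roughly like this in Python. The 10 MiB threshold, the commit message, and the function names are assumptions made for this sketch; the functions only build the command lines and do not run them.

```python
import os

# Hypothetical cut-off; the thread does not name a value.
THRESHOLD_BYTES = 10 * 1024 * 1024


def add_command(path, threshold=THRESHOLD_BYTES):
    """Pick `git annex add` for big files and plain `git add` otherwise."""
    if os.path.getsize(path) > threshold:
        return ["git", "annex", "add", path]
    return ["git", "add", path]


def commit_commands(path, threshold=THRESHOLD_BYTES):
    """Return the add command followed by a commit, mirroring the pseudo-code."""
    return [
        add_command(path, threshold),
        ["git", "commit", "-m", "autosync: " + path],
    ]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` inside a repository. As the thread notes, this does not address files that are still being written when they are added.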
Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
On Thu, 31 Mar 2011 18:56:54 +0200 Christophe-Marie Duquesne chm.duque...@gmail.com wrote:
> Hi,
>
> I am currently writing a FUSE file system based on git-annex for replicating binary files on several machines. I thought I could share it here in order to get some ideas and contributors.
>
> What are your goals?
> Seamless synchronization à la dropbox.
> Ability to use with big binary files such as mp3/movies.
> Entirely decentralized.
> Don't use unnecessary space
> Keep it simple: avoid special VCS commands and keep a filesystem interface as much as possible.

you also need to do various git/git-annex commands, or am I missing something?

> Why?
> Because sparkleshare and dvcs-autosync are bad at versioning binary files

I quite like dvcs-autosync, but it indeed lacks space-efficient storage of big files. I would like to try if we can use git-annex to support this in dvcs-autosync, although AFAIK git-annex is not transparent in the way regular git is transparent (i.e. it needs to explicitly copy files between locations), I assume this is the reason you need to go for a FUSE-based approach? or do you just prefer this over regular fs + inotify?

> Because Unison needs disk space for each couple of hosts it synchronizes and thus does not really scale for more than 2 hosts
> Because Coda is not completely decentralized and it bothers me

you actually tried coda? it's something I'm interested in, on paper it looks like an awesome, maybe-even-perfect open source dropbox-clone but the reality is probably different, I never tried it so I wouldn't know.

> What do you have?
> A python implementation. It is about 600 sloc, and you'll find it on https://github.com/chmduquesne/sharebox
> Be careful, it is very alpha and it still does not have a proper conflict handler.
>
> Hey, but copying is slow!
> On my machine, copying files to a sharebox fs is about 10 times slower than copying it on a normal fs. All the time is spent in python's os.write(): I guess the only way to work around this problem is to rewrite the whole thing in C, but I am keeping this for later.

hmm, writing files is i/o-bound, I doubt the language will have much effect here. check with top/vmstat if you get iowait, if so your storage medium is getting saturated and rewriting in C won't help. maybe a network/buffering/.. issue.

> I am interested in:
> - suggestions for the functional design (I have my ideas, but I'd love to be challenged).

in your README you suggest to use a crontab for synchronisation; maybe you can reuse/be inspired by the xmpp system dvcs-autosync uses; it works quite well, it's quite robust and it's instant :)

Dieter
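One way to probe the question raised above (is the time lost in per-call overhead around os.write(), or is the disk simply saturated?) is to write the same amount of data in different chunk sizes and compare. The helper below is a hypothetical measurement sketch, not part of sharebox; the name `timed_write` and the temp-file setup are assumptions.

```python
import os
import tempfile
import time


def timed_write(total_bytes, chunk_size):
    """Write total_bytes of zeros to a temp file in chunk_size pieces via os.write."""
    chunk = b"\0" * chunk_size
    fd, path = tempfile.mkstemp()
    written = 0
    try:
        start = time.perf_counter()
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            # os.write may write fewer bytes than asked; count what it reports.
            written += os.write(fd, chunk[:n])
        os.fsync(fd)  # make sure the data actually reached the device
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return written, elapsed
```

If `timed_write(n, 4096)` and `timed_write(n, 1 << 20)` take roughly the same time for a large `n`, the storage is likely the limit; a large gap would point at per-call overhead instead.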
Re: [announce] Sharebox, a FUSE filesystem relying on git-annex
On Thu, Mar 31, 2011 at 8:04 PM, Dieter Plaetinck die...@plaetinck.be wrote:
> you also need to do various git/git-annex commands, or am I missing something?

Ideally, that would be only at set up time.

> I quite like dvcs-autosync, but it indeed lacks space-efficient storage of big files. I would like to try if we can use git-annex to support this in dvcs-autosync, although AFAIK git-annex is not transparent in the way regular git is transparent (i.e. it needs to explicitly copy files between locations), I assume this is the reason you need to go for a FUSE-based approach? or do you just prefer this over regular fs + inotify?

I don't really like FUSE, and I would actually prefer using inotify, but I think it would not be transparent enough. I think a filesystem is the right abstraction here.

> you actually tried coda? it's something I'm interested in, on paper it looks like an awesome, maybe-even-perfect open source dropbox-clone but the reality is probably different, I never tried it so I wouldn't know.

I did not try it, but I looked at the documentation. It is not purely decentralized: some machines are servers, others are clients and the roles stay the same (if I believe this page: http://www.coda.cs.cmu.edu/ljpaper/lj.html).

> hmm, writing files is i/o-bound, I doubt the language will have much effect here. check with top/vmstat if you get iowait, if so your storage medium is getting saturated and rewriting in C won't help. maybe a network/buffering/.. issue.

I'll have a look. Actually, to come to this conclusion, I used the loopback-fs provided by fusepy, which just mirrors another part of your file system, and I timed the copy of an iso. This copy was 10 times slower than on a real fs (60 seconds instead of 6). I concluded that this was due to python. I have about the same performance on my filesystem. I'll complete the experiment tomorrow with fuse_xmp, which is another fuse loopback-fs, but done in C.

> in your README you suggest to use a crontab for synchronisation; maybe you can reuse/be inspired by the xmpp system dvcs-autosync uses; it works quite well, it's quite robust and it's instant :)

Yes. I had a 'sync=xx' option, for specifying an interval time between synchronisations, but I removed it for this very reason.
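The copy-timing experiment described above (60 seconds through the Python loopback versus 6 on a real file system) can be repeated with a small harness like the following. The function names are made up for this sketch, and the paths passed in would be whatever mount points are being compared.

```python
import os
import shutil
import time


def time_copy(src, dst):
    """Copy src to dst and return (seconds taken, bytes copied)."""
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    return elapsed, os.path.getsize(dst)


def slowdown(seconds_on_fuse, seconds_on_plain_fs):
    """Ratio between the two timings, e.g. 60 s vs 6 s gives 10.0."""
    return seconds_on_fuse / seconds_on_plain_fs
```

Running `time_copy` once with `dst` on the FUSE mount and once with `dst` on a regular directory, then passing both timings to `slowdown`, reproduces the comparison; page-cache effects can skew a single run, so repeating it a few times is advisable.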