Hello fellow hackers!

I have worked a little more on hashed-storage over the last few days. It now manages the index "semi-automatically": it keeps the hashed index up to date with any file changes whenever needed. It will not, however, notice that files were added to or removed from version control, since implementing that efficiently would need help from darcs.

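
As a rough illustration of what keeping the index up to date "whenever needed" could look like, here is a minimal sketch. It is not the hashed-storage code: the types Stat, Entry and Index, the function refresh, and its two callback parameters are all hypothetical names invented for this example. The assumption is that each index entry caches a file's size and modification time next to its hash, so a file only needs rehashing when those differ. The set of tracked paths never changes here, which mirrors the limitation above: additions to and removals from version control are not detected.

```haskell
import qualified Data.Map.Strict as M
import Data.Int (Int64)

-- Hypothetical: the cheap-to-obtain metadata cached for each file.
data Stat = Stat { statSize :: Int64, statMTime :: Int64 }
  deriving (Eq, Show)

-- Hypothetical: one index entry, cached metadata plus the content hash.
data Entry = Entry { entryStat :: Stat, entryHash :: String }
  deriving (Eq, Show)

type Index = M.Map FilePath Entry

-- Refresh every entry already in the index. 'statFile' and 'hashFile'
-- are supplied by the caller (assumptions of this sketch), so the same
-- logic runs against the real filesystem or against test data.
refresh :: Monad m
        => (FilePath -> m (Maybe Stat))  -- current metadata, if the file exists
        -> (FilePath -> m String)        -- expensive: hash the file contents
        -> Index -> m Index
refresh statFile hashFile = M.traverseWithKey update
  where
    update path entry = do
      mstat <- statFile path
      case mstat of
        -- Metadata changed: rehash and store the new metadata.
        Just st | st /= entryStat entry -> do
          h <- hashFile path
          return (Entry st h)
        -- Unchanged or missing on disk: keep the cached entry as it is.
        _ -> return entry
```

The block has no main of its own; it can be loaded into GHCi and driven with any pair of callbacks. Leaving a missing file's entry untouched is a choice made only to keep the sketch short.
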
I will probably wrap this up in a standalone program and distribute it to a wider community to collect benchmark results. It doesn't make sense to pursue this if it doesn't make things faster in real-world use. Ideally, I'd get Simon to try it out in his setting (see http://bugs.darcs.net/issue1202); it would be interesting to know whether it helps there.

In other news, I have started adding unit tests, so the repository now has a few checks in it. The tests also compare against darcs, so we can check that the code behaves the same as current darcs at the file access level. I don't expect the code to introduce correctness bugs into darcs -- the implementation is more of a performance matter than a correctness one. The code is fairly simple and straightforward, with little variation between runs. I'll look into adding more cases of course, and probably also some test data, so I can experiment with working copies and such.

In yet other news, I have also done a bigger benchmark, this time on a repository with 80k files. Extending my pre-existing 40k-file repository with 40k new files, 2000 files per patch, took about half an hour with darcs (probably on the order of minutes with git). Most of this time was apparently spent in code dealing with pristine (again).

Anyway, the current status is that darcs-diff is about 10 times faster than darcs whatsnew and no more than 70% slower than git diff. I don't think there's much leeway left here, since almost all the extra time darcs-diff uses compared to git diff is, as far as I can tell, spent in Data.Binary. To cut down this cost we would have to use the mmap'd index files directly, instead of building a Haskell data structure (a list of tuples) out of them with Data.Binary. I could try to implement a Storable interface instead, which should eliminate at least part of that cost, although it's not very high priority just now. Also, the code is confined to a single module, currently 170 lines long.

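
To illustrate the Storable idea mentioned above, here is a minimal sketch under stated assumptions. It is not the hashed-storage code, and the record layout is hypothetical: three 64-bit fields, with a single Word64 standing in for the hash (real content hashes are longer). The names Entry, entryAt and roundTrip are invented for this example. The point is that with a fixed-size record, one entry can be read straight out of a buffer at a computed offset, without first decoding the whole index into a list. An in-memory array stands in for the mmap'd file.

```haskell
import Foreign.Storable (Storable(..))
import Foreign.Ptr (Ptr)
import Foreign.Marshal.Array (withArray, peekArray)
import Data.Int (Int64)
import Data.Word (Word64)

-- Hypothetical fixed-size index record: 3 fields of 8 bytes = 24 bytes.
data Entry = Entry
  { entrySize  :: Int64
  , entryMTime :: Int64
  , entryHash  :: Word64  -- assumption: stand-in for a real, longer hash
  } deriving (Eq, Show)

instance Storable Entry where
  sizeOf _    = 24
  alignment _ = 8
  -- Read each field from its fixed byte offset within the record.
  peek p = Entry <$> peekByteOff p 0 <*> peekByteOff p 8 <*> peekByteOff p 16
  -- Write each field to the same offsets.
  poke p (Entry s m h) =
    pokeByteOff p 0 s >> pokeByteOff p 8 m >> pokeByteOff p 16 h

-- Read only the i-th entry; nothing before or after it is decoded.
entryAt :: Ptr Entry -> Int -> IO Entry
entryAt = peekElemOff

-- Write a list of entries to a temporary buffer and read them back.
roundTrip :: [Entry] -> IO [Entry]
roundTrip es = withArray es (peekArray (length es))
```

With a real index, the pointer would come from mapping the file rather than from withArray. Which mmap binding to use, and how file names would be stored next to the fixed-size records, are left open here.
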
To move further in the direction of directly benefiting darcs, I will take on the TreeIO monad next, and then probably darcs integration right away. These two combined should make it possible to significantly speed up whatsnew, record, revert and diff (through diffing improvements) as well as pull and apply (through TreeIO). It would also simplify check and repair, since those currently use an ad-hoc version of what TreeIO is supposed to solve more elegantly.

This also means I'm deferring work on a better, packed repository format that would speed up darcs get and remote pull. That work will be much more intrusive, and is tied to many other areas of darcs, such as caching. It will also require a repository format conversion, so we should get it right this time. The repository format will stay compatible at the patch level, so it's sort of like the darcs-1 -> hashed conversion. The other kind (darcs-1 -> darcs-2) will probably come if we migrate to a different patch format. (Probably camp's, to get rid of exponential commutes? But that's a long time from now, we'll see where things wander...)

Yours,
   Petr.

--
Peter Rockai | me()mornfall!net | prockai()redhat!com
http://blog.mornfall.net | http://web.mornfall.net

"In My Egotistical Opinion, most people's C programs should be indented six feet downward and covered with dirt."
   -- Blair P. Houghton on the subject of C program indentation
_______________________________________________
darcs-users mailing list
http://lists.osuosl.org/mailman/listinfo/darcs-users
