Hi,
Here are my initial thoughts about the hashed storage work. I'm still
reading through it and I anticipate having more to say as I spend more
time on it. I'm looking at both the hashed-storage package and the changes
to darcs to make use of it.
I'll start by summarising what it's about, both as an introduction for
others and so any misconceptions or omissions can be corrected.
I haven't yet had time to read through all the past mailing list traffic
on the subject of hashed-storage etc, so please point me to any relevant
posts if it seems appropriate. I will try to read through them asap.
The basic point of this work is to optimise and clean up the way that
darcs interacts with the filesystem. The core of this is the new 'Tree'
type, which replaces the old 'Slurpy' type. It differs most importantly by
using explicitly embedded IO actions for each file or subdirectory, as
opposed to the lazy IO of Slurpy.
The work is split into a new 'hashed-storage' package and some changes to
darcs to make use of this package. In general the changes to darcs are
relatively small and are often simplifications.
As its name suggests, hashed-storage takes over the work of dealing
with hash-structured storage as used by the darcs 1.5/2 formats. It also
introduces the index, which is a cache of the hashes (and related info)
of an entire tree, stored in a single file. This allows for much faster
operations in many cases. hashed-storage also deals with "plain" trees
such as the working copy and old-style pristine.
There is also a 'TreeIO' monad, intended to provide an abstraction for
working with 'Tree's and also some optimisations such as buffering up
writes until a size threshold is reached.
And now my initial comments:
Overall I think something like this is clearly the right way to go and
generally I think it looks nice, with easy to read code. Although I
haven't been through the changes to darcs in great detail yet they
generally look like providing significant improvements in simplicity.
My most important question is about the dividing line between
hashed-storage and darcs-hs. The biggest weakness of hashed-storage as a
separate package is that it has functions that are specific to Darcs
repositories. To my mind, it either needs to abandon this knowledge,
perhaps by abstracting it over a typeclass that is meaningful in itself,
or some or all of hashed-storage needs to move into the darcs tree proper.
Otherwise, changes to some parts of darcs will continue to require
lockstep changes to hashed-storage - which is ok when Petr is developing
the two together intensively on his own, but not really when other people
are involved or over a longer timeframe. One option would be to move it
all into the darcs tree but distribute it as a separate cabal package with
a more darcs-specific name.
Onto more specific comments, which are mainly in note form:
- The naming of Darcs.Gorsvet is obscure and a barrier to future
understanding of the code
- In Darcs.Gorsvet, is there any reason not to use bracket in
mInCurrentDirectory (as a comment suggests)?
- The floatPath function from Storage.Hashed is documented as being unsafe
and then used all over Darcs.Gorsvet. Needs an explanation of what's
unsafe and why it's ok to use it in Darcs.Gorsvet.
- In various places (e.g. deleting on windows, corrupt pending etc)
a file gets renamed to a fixed different name, which could cause
problems if it already exists. There should be some standard scheme to
avoid collisions.
- We need to sort out haskell_policy, the ratification of readFile all
over the place is just silly. Replacing it with hlint as has been
suggested sounds like a very good idea.
- I think treeDiff is actually relatively unlikely to be wrong because it
delegate s actual line by line diffing to Patch.Prim which hasn't changed.
Storage.Hashed.Diff exists but seems unused, what's the status and
plans there?
- unrecordedChanges - Why do IgnoreTimes and LookForAdds lead to ignoring
the index?
- Presumably biggest risk is random bugs with invalid indexes. An
explanation of when the index can become invalid would be helpful (there
may be one in the code that I haven't spotted yet).
- Storage.Hashed.Index: code is full of magic numbers
- Petr and I have already discussed overloading Tree and TreeIO
over the container (IO). This would make writing test rigs
easier, and mean that code that processes Tree could be polymorphic
where appropriate. Overloaded code would typically constrain the container
to be in Monad or perhaps Applicative.
- Win32 removeFile workaround: not used consistently, should be hidden
behind some standard abstraction?
- I don't really understand what virtualTreeIO does and doesn't do.
- I think TreeIO should be abstract, i.e. a newtype, unless there is some
value in exposing the fact that it is a StateT.
- hashedTreeIO never deletes things, so what removes dead stuff?
- why don't the two cases of hashedTreeIO.updateFile use the same code
path?
- I'm very dubious about semantics of finish - why doesn't expandPath call
it? Presumably the documented usage semantics of readIndex have something
to do with this, but it feels fragile, and I only sort of understand the
whole dirs_changed/finish stuff in readIndex'
That's all for now...
Cheers,
Ganesh
_______________________________________________
darcs-users mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-users