On 3/23/2010 0:06, Jason Dagit wrote:
I have too many side projects for the amount of time I give them, but
one idea that keeps coming back up in my brain is to use
criterion/progression to benchmark various parsers for the current darcs
format. I was thinking of pitting attoparsec vs. darcs source vs.
attoparsec-iteratee vs. pure iteratee vs. database backend vs. ??.
Comparing memory usage would be in there too, but I don't think
criterion has a way to do that yet.
Would that, along with some asymptotic memory/time analyses, satisfy
your craving? I ask because it seems like knowing a particular
parser/format works well enough for general purpose usage isn't as good
as having evidence that it works well on a specific specialized task.
Mostly.
I can choose one to use based on the requirements of the current
project. Same for YAML or JSON... But each and every "special" or
"proprietary" parser brings its own learning curve.)
Which one would you pick for a YAML patch format? Suppose Haskell isn't
a consideration.
For YAML there are predominantly two standard parsers available in most
languages: a language-specific parser and a binding around libsyck, the
C SAX-like parser. Most of the language-specific parsers have
SAX-like modes of operation, to further complicate things. I'd start
with the language-specific parser and migrate to the libsyck-based one
if necessary, but it might not be, depending on the language I'm working
with of course.
I'm a little confused by the flow of the conversation here. Are you
implying that even if we had a tested/robust RFC822 parser in Haskell
you'd rather we didn't use that format?
Given the choice between parsing YAML or RFC822, as a third-party
consumer of darcs patches/information/metadata, I'd rather parse YAML.
I'm not completely opposed to RFC822-style patch metadata formatting,
but I definitely think there are better formats worth considering first.
I brought up YAML in particular because I think it can be good for
RFC822-like "style", when read by human eyes, while having an overall
more explicitly defined markup and data structure.
Just some musings about a pony format:
Yes, this and keeping as much on disk as possible while inspecting a
patch sequence lead me recently to wonder again about using a 3rd party
database as the storage. Sqlite is easy, but not my favorite (Mainly I
dislike the lack of foreign keys and type enforcement. Those are merely
annoying but not show stoppers due to features like triggers and using a
typed programming language to interact with sqlite).
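
A rough sketch of the trigger workaround mentioned above, using Python's
standard sqlite3 module. The table and column names (patch, patch_summary,
hash, filename) and the helper names are assumptions made up for
illustration, not anything darcs actually defines. One trigger emulates a
foreign key, the other rejects values that are not stored as text.

```python
import sqlite3

# Hypothetical schema: these tables and columns are illustrative only.
SCHEMA = """
CREATE TABLE patch (
    hash TEXT PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE patch_summary (
    hash     TEXT NOT NULL,
    filename TEXT NOT NULL
);

-- Emulated foreign key: a summary row must point at an existing patch.
CREATE TRIGGER patch_summary_fk
BEFORE INSERT ON patch_summary
FOR EACH ROW BEGIN
    SELECT RAISE(ABORT, 'patch_summary.hash has no matching patch')
    WHERE (SELECT hash FROM patch WHERE hash = NEW.hash) IS NULL;
END;

-- Emulated type enforcement: filename must be stored as text.
CREATE TRIGGER patch_summary_filename_type
BEFORE INSERT ON patch_summary
FOR EACH ROW BEGIN
    SELECT RAISE(ABORT, 'patch_summary.filename must be text')
    WHERE typeof(NEW.filename) <> 'text';
END;
"""


def make_db():
    """Create an in-memory database holding one sample patch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO patch VALUES (?, ?)", ("abc123", "Add README"))
    return conn


def try_insert(conn, patch_hash, filename):
    """Return True if the summary row was accepted, False if a trigger refused it."""
    try:
        conn.execute(
            "INSERT INTO patch_summary VALUES (?, ?)", (patch_hash, filename)
        )
        return True
    except sqlite3.DatabaseError:
        return False
```

With this in place, try_insert(conn, "abc123", "src/Main.hs") is accepted,
while a row naming an unknown patch hash, or one passing a bytes value as the
filename, is refused by the triggers.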
I think I mentioned once before that I do think Sqlite could make a very
nice backend for some potential future darcs/darcs-offspring format. At
the very least it would be something interesting to experiment with. Sqlite
is particularly appealing because it is a single-file DB format and can
be transmitted easily over the wire. Of course, that file could grow
fairly large and you'd end up needing some smart protocol for push/pull
hand-shakes to avoid having to download an entire DB every time... You
could possibly break it into sections like the current inventories, but
I'd assume you would lose some of the advantages of using a DB format in
the first place the smaller you chunk the inventories.
It seems like if we used a relational db we'd be forced to store patch
hunks in the filesystem, but that's probably for the best anyway. With
the hunks stored on disk separately you'd almost never need to have them
in memory (I think). I guess maybe the initial diff that created the
patch or a replace patch might require it. Perhaps some conflicts.
Basically the patch inventory would be in a table and indexed so that
hopefully we'd see good performance when interacting with it.
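
To make the "inventory in an indexed table" idea a bit more concrete, here
is a minimal sketch with Python's standard sqlite3 module. The inventory
table, its columns, and the idx_inventory_author index are hypothetical names
chosen for the example; a real schema would surely carry more metadata. The
sketch only shows that a lookup by author can be answered from an index
rather than by scanning the whole patch sequence.

```python
import sqlite3


def make_inventory(patches):
    """Build an in-memory inventory from (hash, author, name) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE inventory (
            seq    INTEGER PRIMARY KEY,   -- position in the patch sequence
            hash   TEXT NOT NULL UNIQUE,  -- patch identity
            author TEXT NOT NULL,
            name   TEXT NOT NULL
        );
        CREATE INDEX idx_inventory_author ON inventory (author);
    """)
    conn.executemany(
        "INSERT INTO inventory (hash, author, name) VALUES (?, ?, ?)", patches
    )
    conn.commit()
    return conn


def patches_by(conn, author):
    """Return (seq, hash, name) for one author, in sequence order."""
    rows = conn.execute(
        "SELECT seq, hash, name FROM inventory WHERE author = ? ORDER BY seq",
        (author,),
    )
    return rows.fetchall()


def query_plan(conn, author):
    """Return SQLite's plan description for the author lookup as one string."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT seq, hash, name FROM inventory WHERE author = ?",
        (author,),
    )
    return " ".join(str(row[-1]) for row in rows)
```

Calling query_plan on a populated inventory should mention
idx_inventory_author, which is a cheap way to check that the lookup is served
by the index. Only metadata lives in the table here; the hunks themselves
would stay wherever they are stored on disk.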
Of course, you'd certainly want a hashed, packed format for storing all
of those hunks, rather than storing each one individually.
I expect we'd still need hashed-storage to efficiently query/update the
filesystem and we'd probably also want the filecache (not sure). So I'm
really only talking about storing the patch metadata in the database.
I would expect the filecache wouldn't be necessary with a relational
database: the filecache is a cached mapping between a file (name) and
the patches that modify that file. Given a relational DB, it's simply a
join between the patch_summary table and the patch table: SELECT * FROM
patch WHERE patch.hash IN (SELECT hash FROM patch_summary WHERE filename
= '...')
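
As a sketch of how that lookup might replace the filecache, here is a small
runnable version using Python's standard sqlite3 module. The patch and
patch_summary tables follow the names used in the query above, but their
columns, the sample rows, and the idx_summary_filename index are assumptions
for illustration. The subquery is matched with IN rather than = because a
file is normally touched by more than one patch.

```python
import sqlite3


def make_repo():
    """Create an in-memory database with a few made-up patches."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patch (
            hash TEXT PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE patch_summary (
            hash     TEXT NOT NULL,  -- patch that touches the file
            filename TEXT NOT NULL
        );
        CREATE INDEX idx_summary_filename ON patch_summary (filename);
    """)
    conn.executemany("INSERT INTO patch VALUES (?, ?)", [
        ("h1", "Add README"),
        ("h2", "Add parser"),
        ("h3", "Fix parser bug"),
    ])
    conn.executemany("INSERT INTO patch_summary VALUES (?, ?)", [
        ("h1", "README"),
        ("h2", "src/Parser.hs"),
        ("h3", "src/Parser.hs"),
        ("h3", "README"),
    ])
    conn.commit()
    return conn


def patches_touching(conn, filename):
    """Return (hash, name) for every patch that modifies the given file."""
    rows = conn.execute(
        "SELECT hash, name FROM patch "
        "WHERE hash IN (SELECT hash FROM patch_summary WHERE filename = ?) "
        "ORDER BY hash",
        (filename,),
    )
    return rows.fetchall()
```

For example, patches_touching(make_repo(), "src/Parser.hs") returns the two
parser patches, and a file no patch has touched returns an empty list. The
filename is passed as a bound parameter, so no quoting of the path is needed.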
A similar pony repository format idea might be to try experimenting with
one of the new, hip document databases like couchdb. I've thought at
times hashed-storage already seems to be converging in the direction of
a document database... Interesting thought, a couchdb-based darcs...
--
--Max Battcher--
http://worldmaker.net