Re: [fossil-users] Fossil 2.1: Scaling

2015-04-16 Thread Nico Williams
On Mon, Mar 2, 2015 at 6:30 AM, Richard Hipp d...@sqlite.org wrote:
 Ben Pollack's essay at
 http://bitquabit.com/post/unorthodocs-abandon-your-dvcs-and-return-to-sanity/
 succinctly points up some of the problems with DVCS versus centralized
 VCS (like subversion).  Much further discussion occurs on the various
 news aggregator sites.

 So I was thinking, could Fossil 2.0 be enhanced in ways to support
 scaling to the point where it works on really massive projects?

 The key idea would be to relax the requirement that each client load
 the entire history of the project.  Instead, a clone would only load a

git can do this, and it's a relatively new feature.  The really nice
thing would be to load whatever is needed on demand, or to perform
certain operations (e.g., producing annotated sources, viewing
history, ...) on the server.
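In sketch form, the on-demand idea might look like the following (all names here are hypothetical illustration, not Fossil's actual internals):

```python
# Hypothetical sketch of on-demand artifact loading; none of these
# names correspond to real Fossil internals.

class LazyRepo:
    def __init__(self, local, fetch_remote):
        self.local = local                # dict: artifact-id -> content
        self.fetch_remote = fetch_remote  # callback that asks the server

    def get(self, artifact_id):
        if artifact_id not in self.local:
            # Missing content is pulled from the server and cached, so
            # the clone grows only as history is actually visited.
            self.local[artifact_id] = self.fetch_remote(artifact_id)
        return self.local[artifact_id]

server = {"a1": b"old content from 2010", "a2": b"recent content"}
repo = LazyRepo({"a2": b"recent content"}, server.__getitem__)
assert repo.get("a1") == b"old content from 2010"  # fetched on demand
assert "a1" in repo.local                          # now cached locally
```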

 limited amount of history (a month, a year, perhaps even just the most
 recent check-in).  This would make cloning much faster and the
 resulting clone much smaller.  Missing content could be downloaded
 from the server on an as-needed basis.  So, for example, if the user
 does fossil update trunk:2010-01-01 then the local client would
 first have to go back to the server to fetch content from 2010.  The
 additional content would be added to the local repository.  And so the
 repository would still grow.  But it grows only on an as-needed basis
 rather than starting out at full size.  And in the common case where
 the developer never needs to look at any content over a few months
 old, the growth is limited.

 By downloading the meta-data that is currently computed locally by
 rebuild, many operations on older content, such as timelines or
 search, could be performed even without having the data present.  In
 the bsd-src.fossil repository, the content is 78% of the repository
 file and the meta-data is the other 22%.  So a clone that stored only
 the most recent content together with all metadata might be about
 1/4th the size of a full clone.  For even greater savings, perhaps the
 metadata could be time-limited, though not as severely as the content.
 So perhaps the clone would only initialize to the last month of
 content and the last five years of metadata.
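The "about 1/4th" estimate follows directly from the 78%/22% split (illustrative arithmetic only; the 5% recent-content share is an assumed figure):

```python
# Rough arithmetic behind the "about 1/4th the size" estimate above.
full_size = 100.0            # arbitrary units
content = 0.78 * full_size   # file content: 78% of the repository file
metadata = 0.22 * full_size  # computed meta-data: the other 22%

# A clone keeping all metadata but only the most recent content would
# be a bit over the metadata share alone (assume ~5% of content is
# recent -- an illustrative guess, not a measured number):
shallow = metadata + 0.05 * content
assert round(shallow / full_size, 2) == 0.26  # ~1/4 of a full clone
```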

 For wide repositories (such as bsd-src) that hold many thousands of
 files in a single check-out, Fossil could be enhanced to allow
 cloning, checkout, and commit of just a small slice of the entire
 tree.  So, for example, a clone might hold just the bin/ subdirectory
 of bsd-src containing just 56 files, rather than all 147720 files of a
 complete check-out.  Fossil should be able to do everything it
 normally does with just this subset, including commit changes, except
 that on new manifests generated by the commit, the R-card would have
 to be omitted since the entire tree is necessary to compute the
 R-card.  But the R-card is optional already, controlled by the
 repo-cksum setting, which is turned off in bsd-src, so there would
 be no loss in functionality.

Yes, this would be very nice.  Though a BSD would probably need
significant build system rototilling to make it possible for
developers to work on isolated portions of the code with partial
clones only.
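The R-card constraint above can be illustrated with a toy checksum (the real R-card computation differs in detail; this is not Fossil's algorithm):

```python
import hashlib

# Toy illustration of why a partial tree cannot produce a tree-wide
# checksum: the digest depends on every file in the check-out.
# The real R-card computation differs in detail.

def tree_checksum(files):
    h = hashlib.md5()
    for name in sorted(files):          # deterministic order
        h.update(name.encode())
        h.update(files[name])
    return h.hexdigest()

full = {"bin/ls": b"ls source", "lib/libc.c": b"libc source"}
partial = {"bin/ls": b"ls source"}      # a bin/-only slice

# The slice yields a different digest, so a commit made from a
# partial clone must simply omit the R-card.
assert tree_checksum(partial) != tree_checksum(full)
```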

 The sync protocol would need to be greatly enhanced to support this
 functionality.  Also, the schema for the meta-data, which currently is
 an implementation detail, would need to become part of the interface.
 Exposing the meta-data as interface would have been unthinkable a few
 years ago, but at this point we have accumulated enough experience
 about what is needed in the meta-data to perhaps make exposing its
 design a reasonable alternative.

Exposing the metadata would be one of the best things Fossil could do,
IMO, once it's ready.

Nico
--
___
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users


Re: [fossil-users] Fossil 2.1: Scaling

2015-03-03 Thread Warren Young
I’m going to start two different reply forks: I’ll reply to the Pollack article 
here, then send another message later to chime in on your proposal, drh.

On Mar 2, 2015, at 5:30 AM, Richard Hipp d...@sqlite.org wrote:
 
 Ben Pollack's essay at
 http://bitquabit.com/post/unorthodocs-abandon-your-dvcs-and-return-to-sanity/
 succinctly points up some of the problems with DVCS versus centralized
 VCS (like subversion).

Thanks for the pointer.  It sums up most of my problems with the Git and GitHub 
models.  It’s too bad Pollack doesn’t include Fossil in his comparison.

I don’t think all of his points apply to Fossil:

1. “Sanely track renames.”  In this respect, I think Fossil offers one step 
forward, one back relative to Subversion.

While Fossil does seem to realize that a rename isn’t the same thing as 
add+delete *most* of the time — I have managed to confuse it a few times into 
seeing a rename as add+delete — it doesn’t backtrace through a rename in finfo 
output:

  f new x.fossil
  mkdir x
  cd x
  f open ../x.fossil
  touch a
  f add a
  f ci -m 'initial'
  f mv a b
  mv a b
  f ci -m 'renamed a to b'
  f finfo b
  2015-03-03 [bc09e28048] .. (user: warren, artifact: [da39a3ee5e], branch:
   trunk)

Point being, I usually end up having to go into “fossil ui” to trace the 
ancestry of a file back through a rename.  Doubtless it is possible to do this 
from the command line somehow, but I miss the behavior of “svn log” which did 
the backtrace for you.


2. Explosion of manuals and tutorials.

To some extent, the relative paucity of Fossil training material is a 
consequence of its…ummm… unpopularity?  But, it is also the case that it 
doesn’t *need* as much training material.

I read the Red Bean Book, and I still don’t quite understand how svn branching 
and merging is supposed to work in practice.  We ended up ignoring about half 
of the mechanics proposed therein, which made our conversion to Fossil more 
difficult, but it worked at the time for us.

With Fossil, though, I’m branching and merging for the first time without 
difficulty.  The hardest part was getting past the scar tissue laid down by 
svn.  “It’s that simple, really?  No way, can’t be.  There’s got to be more to 
it than that!”

Partly this is due to the ability to create a branch as part of a checkin with 
“f ci --branch”.  Partly it is due to the branch structure being made visible 
in “f ui”.  These are genuine advances over Subversion, and I thank you for them, 
drh.


3. “fossil bundle” makes Fossil nearly as easy to use as Subversion for 
drive-by patches.  I believe the equivalent Fossil sequence is:

a. Clone the master repo
b. Open a copy of the repo
c. Make your change
d. Check it in on a branch; ignore the auto-sync complaint
e. Bundle the new branch
f. Send the bundle to the project maintainer
g. Watch it get ignored

So, just two more steps than svn, rather than 4, as with git.

One of the two extra steps is due to the fact that clone is separate from open 
in Fossil, which I consider a feature.  It allows multiple opens on a local 
clone.  

I absolutely hate the Git alternative where you have to keep switching the 
local checkout/repo to see different branches.  The checkout operation itself 
is time consuming, then it eats more time due to the forced rebuilds, since it 
must rebuild the objects to match the changed sources.  With Fossil, I can keep 
not only multiple source checkouts from a single repo clone, but also multiple 
build trees.

The other extra step is the apparent necessity to check your changes in.  This 
is also a feature since it’s how Fossil records the checkin comment, the user 
info, etc.  If the project maintainer accepts the bundle without changes, this 
step saves him work that he’d otherwise have to do manually in the patch(1) 
case.

If someone wants to tie “fossil bundle” into the ticketing system, we can save 
a step here by bypassing the email step.  (fossil bundle submit?)

Perhaps we could save the other step by offering a clone-and-open mode, perhaps 
by storing the Fossil repo file inside the opened tree?  I propose “fossil 
hack,” so named because it would be used by people who just want to do a quick 
hack on your repo, not seriously spend a lot of time with it.


Re: [fossil-users] Fossil 2.1: Scaling

2015-03-03 Thread Warren Young
On Mar 2, 2015, at 5:30 AM, Richard Hipp d...@sqlite.org wrote:
 
 The key idea would be to relax the requirement that each client load
 the entire history of the project.  Instead, a clone would only load a
 limited amount of history (a month, a year, perhaps even just the most
 recent check-in).

This would be wonderful!

I would suggest a refinement on the simple “SELECT * WHERE modification_date > 
1month” idea, though: I actually want the past month (or whatever) of history 
on *each* open branch, relative to the check-in time at the tip of that 
branch.

That is, if I last changed the “v8” branch two years ago, I still expect it to 
give me a month’s worth of file info on that branch.  I need it if I am going 
to do tech support on that branch: “v10 does thus-and-such when you press the 
Foobie button, but v8 behaves differently.  Can you look into the code to find 
out why?”  I also need it if I’m going to backport a fix to v8 from v10.

I’m thinking this should be the default behavior, at least by configuration at 
the master repo level.  It should take extra flags to get a complete clone.  
This is a break with existing practice, but Pollack’s right: it’s rare for me 
to actually need deep history on every branch.  The longest I typically need 
history for is however long it’s been since the last release on that branch.

If I get this feature, it’s going to make me want another one, though: the 
ability to merge two repositories.

For one of my several svn repos which I converted to Fossil, I purposely 
checked in only the tip of the svn trunk into a fresh repo, rather than convert 
a decade of history.  I did it for the reason Pollack points out in the 
article: faster checkins and other tree traversals.  If I need to go back into 
the pre-Fossil history, I have a separate svn-to-Fossil conversion repo.

If Fossil’s speed becomes independent of the depth of the checkin history, I’d 
like the ability to make those two into a single repo, since the *effect* of a 
clone will then be more like my current setup, where there are relatively few 
checkins, and none of the blobs have forks in their history yet.

I realize I can probably do that by hand with some kind of export-and-reimport 
via the Git fast-export path, but I’d like to do it entirely within Fossil, if 
possible.

 For wide repositories (such as bsd-src) that hold many thousands of
 files in a single check-out, Fossil could be enhanced to allow
 cloning, checkout, and commit of just a small slice of the entire
 tree.

This would also be awesome.  I miss that from Subversion.  Those of us who have 
converted from Subversion often have trees that depend on the ability to check 
out just a slice of the tree.

I had only one Subversion repo at home, with everything “versionable” stored in 
it.  I could then check out different chunks of the tree, placing each where I 
wanted that working subtree to live.

If I were creating such a system fresh in Fossil today, I’d create separate 
Fossil repos for each different sub-tree of files.  Not because this is what 
*I* need, but because this is what *Fossil* expects.

I’m currently hacking around this by checking the monolithic repo out into a 
hidden location, then creating symlinks back into subfolders of that checkout.  
Yes, ick.

I’ve considered reconverting the repo, using Git’s ability to rewrite history 
and thereby slice the repo up, but that’s just more work than it’s worth to me.

What I really want is what you propose: Subversion-like subrepo checkouts.


Re: [fossil-users] Fossil 2.1: Scaling

2015-03-02 Thread Richard Boehme
One question that arises is: how do I define what a server is? Can I
get the complete repository history for everything else but get a more
limited history for files that are larger than a certain size, or that
have certain extensions?

How would this work with sub-repositories? (Sorry, I'm not versed very well
in fossil, but I understand that there can be sub-repositories nested under
the main one, for instance for a directory which contains a lot of videos
or images.)

Thanks.

Richard


On 3/2/15, Richard Hipp d...@sqlite.org wrote:
 [... full text of the proposal, quoted verbatim above, snipped ...]

 Tickets and wiki in a clone might be similarly limited to (say) the
 previous 12 months of content, or the most recent change, whichever is
 larger.

 With these kinds of changes, it seems like Fossil might be made to
 scale to arbitrarily massive repositories on the client side.  On the
 server side, the current design would work until the repository grew
 too big to fit into a single disk file, at which point the server
 would need to be redesigned to use a client/server database, like
 PostgreSQL, that can scale to sizes larger than the 140 terabyte limit
 of SQLite.  But that would be a really big repo.  22 years of BSD
 history fits in 7.2 GB, or 61 GB uncompressed.  So it would take a
 rather larger project to get into the terabyte range.

 The sync protocol would need to be greatly enhanced to support this
 functionality.  Also, the schema for the meta-data, which currently is
 an implementation detail, would need to become part of the interface.
 Exposing the meta-data as interface would have been unthinkable a few
 years ago, but at this point we have accumulated enough experience
 about what is needed in the meta-data to perhaps make exposing its
 design a reasonable alternative.

 These are just thoughts to elicit comments and discussion.  I have
 several unrelated and much higher-priority tasks to keep me busy at
 the moment, so this is not something that would happen right away,
 unless somebody else steps up to do a lot of the implementation work.

 --
 D. Richard Hipp
 d...@sqlite.org



-- 
Thank you.

Richard Boehme

Re: [fossil-users] Fossil 2.1: Scaling

2015-03-02 Thread Richard Hipp
On 3/2/15, Richard Boehme rboe...@gmail.com wrote:
 One question that arises is: how do I define what a server is? Can I
 get the complete repository history for everything else but get a more
 limited history for files that are larger than a certain size, or that
 have certain extensions?

That is theoretically possible given the file format.  It is simply a
question of writing the necessary code to implement that capability.



 How would this work with sub-repositories (sorry, not versed very well
 in fossil, but I understand that there can be sub respositories that
 are nested under the main one (for instance for a directory which
 contains a lot of videos or images))


I think sub-repositories is an orthogonal topic.

-- 
D. Richard Hipp
d...@sqlite.org


Re: [fossil-users] Fossil 2.1: Scaling

2015-03-02 Thread Joerg Sonnenberger
On Mon, Mar 02, 2015 at 07:30:44AM -0500, Richard Hipp wrote:
 So I was thinking, could Fossil 2.0 be enhanced in ways to support
 scaling to the point where it works on really massive projects?

I think the single biggest practical issue right now still goes back to
the baseline manifests not being efficient enough. Would you consider
changing the rules to allow truly incremental manifests? I agree that
having full manifests is sometimes nicer, but I think those could be
built on demand and cached separately. I believe that is the majority of
the current metadata, which matters a lot whenever a rebuild happens.

Joerg


Re: [fossil-users] Fossil 2.1: Scaling

2015-03-02 Thread Richard Hipp
On 3/2/15, Joerg Sonnenberger jo...@britannica.bec.de wrote:
 On Mon, Mar 02, 2015 at 07:30:44AM -0500, Richard Hipp wrote:
 So I was thinking, could Fossil 2.0 be enhanced in ways to support
 scaling to the point where it works on really massive projects?

 I think the single biggest practical issue right now still goes back to
 the baseline manifests not being efficient enough. Would you consider
 changing the rules to allow truly incremental manifests? I agree that
 having full manifests is sometimes nicer, but I think those could be
 built on demand and cached separately. I believe that is the majority of
 the current metadata, which matters a lot whenever a rebuild happens.


The current mechanism is to have periodic full baseline manifests, and
then have deltas against those baselines in between.  Hence, no more
than two artifacts ever need to be decoded in order to access a
manifest - the baseline and its delta.
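The current two-artifact scheme can be sketched as (hypothetical representation, not the actual artifact encoding):

```python
# Sketch of the baseline + delta scheme described above:
# reconstructing any manifest touches at most two artifacts.

baselines = {"b1": {"a.c": "v1", "b.c": "v1", "c.c": "v1"}}
deltas = {  # manifest-id -> (baseline-id, changed files)
    "m2": ("b1", {"a.c": "v2"}),
    "m3": ("b1", {"b.c": "v3"}),
}

def manifest(mid):
    base_id, changes = deltas[mid]
    files = dict(baselines[base_id])  # artifact one: the baseline
    files.update(changes)             # artifact two: the delta
    return files

assert manifest("m3") == {"a.c": "v1", "b.c": "v3", "c.c": "v1"}
```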

Are you proposing to have deltas of deltas, so that a potentially
large number of artifacts need to be decoded in order to reconstruct
the complete manifest?

I don't understand how that would help.  Can you provide more explanation?

-- 
D. Richard Hipp
d...@sqlite.org


Re: [fossil-users] Fossil 2.1: Scaling

2015-03-02 Thread Joerg Sonnenberger
On Mon, Mar 02, 2015 at 11:38:38AM -0500, Richard Hipp wrote:
 On 3/2/15, Joerg Sonnenberger jo...@britannica.bec.de wrote:
  On Mon, Mar 02, 2015 at 07:30:44AM -0500, Richard Hipp wrote:
  So I was thinking, could Fossil 2.0 be enhanced in ways to support
  scaling to the point where it works on really massive projects?
 
  I think the single biggest practical issue right now still goes back to
  the baseline manifests not being efficient enough. Would you consider
  changing the rules to allow truly incremental manifests? I agree that
  having full manifests is sometimes nicer, but I think those could be
  built on demand and cached separately. I believe that is the majority of
  the current metadata, which matters a lot whenever a rebuild happens.
 
 
 The current mechanism is to have periodic full baseline manifests, and
 then have deltas against those baselines in between.  Hence, no more
 than two artifacts ever need to be decoded in order to access a
 manifest - the baseline and its delta.

I know. The manifest contains two parts: non-file content and the file
list. For delta manifests, the file list is encoded as changes relative
to the baseline.

 Are you proposing to have deltas of deltas, so that a potentially
 large number of artifacts need to be decoded in order to reconstruct
 the complete manifest?

I think we have two different situations when it comes to accessing the
file list:

(1) Getting the full list. This is primarily used for initial checks and
as part of the status handling of checkouts, maybe also for the web view.

(2) Getting the changes relative to another checkin. This is what update
etc. is interested in.

The problem with the baseline encoding is that it still has a high
degree of redundancy. While delta compression removes a good chunk of
the overhead in terms of disk space, rebuild still has to process the
full amount. That's a significant part for a large tree. My suggestion
is to store a plain file delta in the manifest. Let's call this a
pure delta manifest. Rebuild parsing is then linear in the number of
changed files. The plink table is a direct mapping of the pure delta
manifest; they have effectively the same data. To keep the performance
of case (1) above, a new full-manifest table is stored separately and
computed on demand. That can happen either during rebuild or on first
access. Heuristics like "X commits since the last full manifest" can be
applied. This is a (local) cache: no need to transfer it via the sync
protocol, and no need to preserve it during rebuild either.
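The pure-delta idea could be sketched as follows (hypothetical representation; the real plink table and manifest format differ):

```python
# Sketch of the "pure delta manifest" idea above: each manifest stores
# only changes against its parent; full file lists live in a local
# cache computed on demand, so rebuild work is linear in changed files.

parents = {"m2": "m1", "m3": "m2"}
changes = {
    "m1": {"a.c": "v1", "b.c": "v1"},  # root: everything is "new"
    "m2": {"a.c": "v2"},
    "m3": {"c.c": "v1"},
}
full_cache = {}  # local cache, not part of the sync protocol

def full_manifest(mid):
    if mid not in full_cache:
        base = full_manifest(parents[mid]) if mid in parents else {}
        merged = dict(base)
        merged.update(changes[mid])
        full_cache[mid] = merged
    return full_cache[mid]

assert full_manifest("m3") == {"a.c": "v2", "b.c": "v1", "c.c": "v1"}
assert "m2" in full_cache  # intermediate results cached for reuse
```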

Joerg