Re: RFC: bitmap storage for precursors and phases
On 02/18/2017 04:51 AM, Jun Wu wrote: Excerpts from Stanislau Hlebik's message of 2017-02-17 16:06:33 +: This is implementation of two caches (nonpublic + precursor) using serialized sorted lists and sets https://bitbucket.org/stashlebik/hg/commits/99879579ac2848a2567810b677d8344150a7b319?at=hiddenbitmaps_lists I had a quick look. Here are my suggestions: 1. Prefer caching at a lower-level, so they get more widely used. +1 on that, Practically, do not change localrepo.py, change phasecache.py and obsolete.py instead. I happened to do some refactoring in phasecache [1]. I'll send a V2 soon, since the change seems related. Note: phasecache is some C code so you need some data to prove the change is worthy. 2. The "Python set" serialization is already in repoview.py. Reuse it. Maybe move "_writehiddencache" to scmutil, or a new simple class. "tryreadcache" could be changed and then moved. Then the functions could be reused in obsolete.getrevs. [1]: https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-February/092817.html ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel -- Pierre-Yves David ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
On 02/14/2017 02:04 AM, Sean Farley wrote: Jun Wu writes: In general, I think this is a good direction. Some random thoughts: - general purposed I think the bitmap is not always a cache, so it should only have operations like set/unset/readfromdisk/writetodisk. Practically, I won't couple cache invalidation with the bitmap implementation. In additional, I'll try to avoid using Python-only types in the interface. So once we decide to rewrite part of the implementation in native C, we won't have trouble. See "revset" below for a possibility that bitmap is used as a non-set. - revset This is a possibility that probably won't happen any time soon. The revset currently uses Python set for maintaining its state. For huge sets, Python sets may not be a good option. And various operations could benefit from an always-topologically-sorted set, which is the bitmap. - mmap My intuition is that bitmaps fit better with mmap which can reduce the reading loading cost. I think "vfs.mmapread" could be a thing, and various places can benefit from it - Gabor recently showed interest in loading revlog data by mmap, I had patches that uses mmap to read revlog index. In additional, not directly related to this series, I'm a big fan of single direction data flow. But the current code base does not seem to do a good job in this area. As we are adding more caching layers to the code base, it'd be nice if we have some tiny framework maintaining the dependency of all kinds of data, to be able to understand the data flow easily, and just to be more confident about loading orders. I think people more experienced on architecture may want to share some ideas here. I was thinking about a more high-level approach (please feel free to pick apart): r = repo.filtered("bitmap1") r2 = r.filtered("bitmap2") So that r2 would be an intersection of bitmap1 and bitmap2 (haven't thought about a union nor the inverse). This double filtering idea is interresting. maybe we could have the 'repoview' API understant repo.filtered("foo+bar") as a combination of filtering of foo+bar. The smart part of repoview (eg: filter hierarchy for cache inheritance, cache key, etc) should be able to automatically compute what do to for a combinaison. Exposing the bitmap at that level seems strange. I think it is better to have the internal implementation of the filtering rely on a bitmat than to have the repository/repoview API to expose bitmap directly. Cheers, -- Pierre-Yves David ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Sean Farley's message of 2017-02-21 15:45:44 -0800: > Augie Fackler writes: > > > On Fri, Feb 17, 2017 at 07:14:12PM -0800, Jun Wu wrote: > >> Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800: > >> > I think there are multiple topics being discussed: > >> > > >> > 1. How to solve the overhead loading hiddenrevs with minimal changes? > >> > 2. Why is the bitmap format interesting (future use-cases)? > >> > > >> For 2, > >> > >> len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think > >> about solutions that scale longer and are not too complex to build. That > >> may or may not be a good habit. > >> > > > > For what it's worth, my clone of hg has 12716 hidden revisions - about > > a quarter of the entire repo history is hidden. > > Actually, that's 40%. Mine is a bit over 25%. I think you have incorrectly excluded hidden revs from repo history: 12716.0 / (len(repo) - len(repo.changelog.filteredrevs) + 12716) = 28.8% # correct 12716.0 / (len(repo) - len(repo.changelog.filteredrevs)) = 40.3% # incorrect Yours should be around 20%: 8000.0 / (len(repo) - len(repo.changelog.filteredrevs) + 8000) = 20.3% ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Augie Fackler writes: > On Fri, Feb 17, 2017 at 07:14:12PM -0800, Jun Wu wrote: >> Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800: >> > I think there are multiple topics being discussed: >> > >> > 1. How to solve the overhead loading hiddenrevs with minimal changes? >> > 2. Why is the bitmap format interesting (future use-cases)? >> > >> For 2, >> >> len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think >> about solutions that scale longer and are not too complex to build. That >> may or may not be a good habit. >> > > For what it's worth, my clone of hg has 12716 hidden revisions - about > a quarter of the entire repo history is hidden. Actually, that's 40%. Mine is a bit over 25%. ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Jun Wu writes: > Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800: >> On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu wrote: >> >> > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +: >> > > As I said before we will load all non-public revs in one set and all >> > >> > The problem is, loading a Python set from disk is O(size-of-the-set). >> > >> > Bitmap's loading cost should be basically 0 (with mmap). I think that's why >> > we want bitmap at the first place. There are other choices like packfile >> > index, hash tables, but bitmap is the simplest and most efficient. >> > >> >> Hey folks, >> >> I haven't yet seen mention of some considerations that seem very important >> in driving the decision-making, so I'd appreciate it if someone could fill >> me in. >> >> Firstly, what's our current understanding of the sizes and compositions of >> these sets of numbers? In theory, we have a lot of data from practical >> application at Facebook, but nobody's brought it up. >> >> To clarify why this matters, if an obsstore contains say 100 entries, then >> a very naive implementation should be fine. If it's related to something >> else, such as the size of, amount of activity in, or local age of a repo, >> then maybe we have a different set of decisions to make. > > First, I agree that if N is small, O(N) or O(log(N)), or O(1), do not matter > much in practice. I tend to always reason about future-proof solutions, > which may or may not be a good habit - I'll adjust in the future. > > I think there are multiple topics being discussed: > > 1. How to solve the overhead loading hiddenrevs with minimal changes? > 2. Why is the bitmap format interesting (future use-cases)? > > For 1, > > I once tried to make one bit of revlog flag as the source of truth of > "hidden", which looks a nice solution. But it got rejected because revlog > has to be append-only. Moving the flag bits out formed the bitmap idea > naturally. > > Bitmap fits here but it does not mean we have to use it to solve this > particular problem. For example, stateful chg will get us some easy wins, > and see the problem analysis and proposed solution below. > > Note that we actually have a simple caching format for the "hiddenrevs". > See repoview.py, computehidden. > > But what's wrong with "computehidden" is the choice of the cache key - it > calls "hideablerevs", which takes 0.5 seconds in obsolete.getrevs(repo, > 'obsolete'), which reads the giant obsstore (in hg-committed) and do some > complex computation. The result is then hashed and used as a cache key of > "computehidden". > > So, if we do not want to know the "obsolete()" revisions, but just want to > know what's hidden. Changing the cache key would be a very straightforward > and effective solution which does not require a new file format. > > That'll help commands like "hg st", "hg log -T foo", etc. But it won't > help complex "log"s which do require the "obsolete()" revisions. To speed > up the latter, we can probably just copy the caching infrastructure from > repoview.py. > > For 2, > > len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think Another data point for my hg-committed: len(filteredrevs) > 8k len(obs markers) > 120k ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Augie Fackler's message of 2017-02-19 21:06:53 -0500: > On Fri, Feb 17, 2017 at 09:59:48PM +, Stanislau Hlebik wrote: > > Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800: > > > On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu wrote: > > > > > > > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +: > > > > > As I said before we will load all non-public revs in one set and all > > > > > > > > The problem is, loading a Python set from disk is O(size-of-the-set). > > > > > > > > Bitmap's loading cost should be basically 0 (with mmap). I think that's > > > > why > > > > we want bitmap at the first place. There are other choices like packfile > > > > index, hash tables, but bitmap is the simplest and most efficient. > > > > > > > > > > Hey folks, > > > > > > I haven't yet seen mention of some considerations that seem very important > > > in driving the decision-making, so I'd appreciate it if someone could fill > > > me in. > > > > > > Firstly, what's our current understanding of the sizes and compositions of > > > these sets of numbers? In theory, we have a lot of data from practical > > > application at Facebook, but nobody's brought it up. > > > > I assume that both sets (set for nonpublic commits and set for > > obsstore) are going to be very small comparing to the repo size. I > > expect both sets < 1% of the repo size. And the sets is going to be > > sparse. > > I replied elsewhere in the thread, but in my clone of hg it's on the > order of 25-30% of the history, so assuming it's going to be very > sparse is probably unwise. In that case it's better to use bitmaps. But to do it we need to get rid of filteredrevs iteration in scmutil.filteredhash() function. ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
On Fri, Feb 17, 2017 at 09:59:48PM +, Stanislau Hlebik wrote: > Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800: > > On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu wrote: > > > > > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +: > > > > As I said before we will load all non-public revs in one set and all > > > > > > The problem is, loading a Python set from disk is O(size-of-the-set). > > > > > > Bitmap's loading cost should be basically 0 (with mmap). I think that's > > > why > > > we want bitmap at the first place. There are other choices like packfile > > > index, hash tables, but bitmap is the simplest and most efficient. > > > > > > > Hey folks, > > > > I haven't yet seen mention of some considerations that seem very important > > in driving the decision-making, so I'd appreciate it if someone could fill > > me in. > > > > Firstly, what's our current understanding of the sizes and compositions of > > these sets of numbers? In theory, we have a lot of data from practical > > application at Facebook, but nobody's brought it up. > > I assume that both sets (set for nonpublic commits and set for > obsstore) are going to be very small comparing to the repo size. I > expect both sets < 1% of the repo size. And the sets is going to be > sparse. I replied elsewhere in the thread, but in my clone of hg it's on the order of 25-30% of the history, so assuming it's going to be very sparse is probably unwise. ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
On Fri, Feb 17, 2017 at 07:14:12PM -0800, Jun Wu wrote: > Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800: > > I think there are multiple topics being discussed: > > > > 1. How to solve the overhead loading hiddenrevs with minimal changes? > > 2. Why is the bitmap format interesting (future use-cases)? > > > For 2, > > len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think > about solutions that scale longer and are not too complex to build. That > may or may not be a good habit. > For what it's worth, my clone of hg has 12716 hidden revisions - about a quarter of the entire repo history is hidden. ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Jun Wu's message of 2017-02-17 19:14:12 -0800: > Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800: > > On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu wrote: > > > > > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +: > > > > As I said before we will load all non-public revs in one set and all > > > > > > The problem is, loading a Python set from disk is O(size-of-the-set). > > > > > > Bitmap's loading cost should be basically 0 (with mmap). I think that's > > > why > > > we want bitmap at the first place. There are other choices like packfile > > > index, hash tables, but bitmap is the simplest and most efficient. > > > > > > > Hey folks, > > > > I haven't yet seen mention of some considerations that seem very important > > in driving the decision-making, so I'd appreciate it if someone could fill > > me in. > > > > Firstly, what's our current understanding of the sizes and compositions of > > these sets of numbers? In theory, we have a lot of data from practical > > application at Facebook, but nobody's brought it up. > > > > To clarify why this matters, if an obsstore contains say 100 entries, then > > a very naive implementation should be fine. If it's related to something > > else, such as the size of, amount of activity in, or local age of a repo, > > then maybe we have a different set of decisions to make. > > First, I agree that if N is small, O(N) or O(log(N)), or O(1), do not matter > much in practice. I tend to always reason about future-proof solutions, > which may or may not be a good habit - I'll adjust in the future. > > I think there are multiple topics being discussed: > > 1. How to solve the overhead loading hiddenrevs with minimal changes? > 2. Why is the bitmap format interesting (future use-cases)? > > For 1, > > I once tried to make one bit of revlog flag as the source of truth of > "hidden", which looks a nice solution. But it got rejected because revlog > has to be append-only. Moving the flag bits out formed the bitmap idea > naturally. > > Bitmap fits here but it does not mean we have to use it to solve this > particular problem. For example, stateful chg will get us some easy wins, > and see the problem analysis and proposed solution below. > > Note that we actually have a simple caching format for the "hiddenrevs". > See repoview.py, computehidden. > > But what's wrong with "computehidden" is the choice of the cache key - it > calls "hideablerevs", which takes 0.5 seconds in obsolete.getrevs(repo, > 'obsolete'), which reads the giant obsstore (in hg-committed) and do some > complex computation. The result is then hashed and used as a cache key of > "computehidden". > > So, if we do not want to know the "obsolete()" revisions, but just want to > know what's hidden. Changing the cache key would be a very straightforward > and effective solution which does not require a new file format. Hm, changing the cache key to smth like phaseroots + obsstore + changelog mtime + size may be an easy and good win. That's a good idea. Although it will still get invalidated after each write. > > That'll help commands like "hg st", "hg log -T foo", etc. But it won't > help complex "log"s which do require the "obsolete()" revisions. To speed > up the latter, we can probably just copy the caching infrastructure from > repoview.py. > > For 2, > > len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think > about solutions that scale longer and are not too complex to build. That > may or may not be a good habit. > > That said, I agree if the set is small and perf, then whatever easier to > implement makes sense. > > I have some crazy ideas about the future where a giant bitmap is > practically useful. For example, source controlling "visible heads" and > the obsstore, then adding some C code to quickly fill a bitmap (requires > "set" and "setancestors/descendants"), then we can show the user whenever > their repo used to look like back in any time by using the bitmap filter. > That'll help investigate hard issues, and implement "hg undo". > > > Secondly, what operations are being performed on these sets today, and do > > we know if those operations are going to change? > > > > For instance, if a may be extended via new elements added to its end, then > > an mmap-based implementation immediately becomes awkward. If a set is > > Not that bad if we preallocate space. > > > traversed or we perform operations like union or intersection, then a > > bitmap might make sense if the set is very small and dense, but not if it's > > big and sparse. > > Not necessarily true if we use an advanced variant, like the "roaringbitmap" > mentioned by you last year [1]. > > [1]: > https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-September/087864.html > > > (Jun's idea that mmap is effectively free is also not borne out in > > practice, but that's
Re: RFC: bitmap storage for precursors and phases
Excerpts from Stanislau Hlebik's message of 2017-02-17 16:06:33 +: > This is implementation of two caches (nonpublic + precursor) using > serialized sorted lists and sets > https://bitbucket.org/stashlebik/hg/commits/99879579ac2848a2567810b677d8344150a7b319?at=hiddenbitmaps_lists I had a quick look. Here are my suggestions: 1. Prefer caching at a lower-level, so they get more widely used. Practically, do not change localrepo.py, change phasecache.py and obsolete.py instead. I happened to do some refactoring in phasecache [1]. I'll send a V2 soon, since the change seems related. Note: phasecache is some C code so you need some data to prove the change is worthy. 2. The "Python set" serialization is already in repoview.py. Reuse it. Maybe move "_writehiddencache" to scmutil, or a new simple class. "tryreadcache" could be changed and then moved. Then the functions could be reused in obsolete.getrevs. [1]: https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-February/092817.html ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800: > On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu wrote: > > > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +: > > > As I said before we will load all non-public revs in one set and all > > > > The problem is, loading a Python set from disk is O(size-of-the-set). > > > > Bitmap's loading cost should be basically 0 (with mmap). I think that's why > > we want bitmap at the first place. There are other choices like packfile > > index, hash tables, but bitmap is the simplest and most efficient. > > > > Hey folks, > > I haven't yet seen mention of some considerations that seem very important > in driving the decision-making, so I'd appreciate it if someone could fill > me in. > > Firstly, what's our current understanding of the sizes and compositions of > these sets of numbers? In theory, we have a lot of data from practical > application at Facebook, but nobody's brought it up. > > To clarify why this matters, if an obsstore contains say 100 entries, then > a very naive implementation should be fine. If it's related to something > else, such as the size of, amount of activity in, or local age of a repo, > then maybe we have a different set of decisions to make. First, I agree that if N is small, O(N) or O(log(N)), or O(1), do not matter much in practice. I tend to always reason about future-proof solutions, which may or may not be a good habit - I'll adjust in the future. I think there are multiple topics being discussed: 1. How to solve the overhead loading hiddenrevs with minimal changes? 2. Why is the bitmap format interesting (future use-cases)? For 1, I once tried to make one bit of revlog flag as the source of truth of "hidden", which looks a nice solution. But it got rejected because revlog has to be append-only. Moving the flag bits out formed the bitmap idea naturally. Bitmap fits here but it does not mean we have to use it to solve this particular problem. For example, stateful chg will get us some easy wins, and see the problem analysis and proposed solution below. Note that we actually have a simple caching format for the "hiddenrevs". See repoview.py, computehidden. But what's wrong with "computehidden" is the choice of the cache key - it calls "hideablerevs", which takes 0.5 seconds in obsolete.getrevs(repo, 'obsolete'), which reads the giant obsstore (in hg-committed) and do some complex computation. The result is then hashed and used as a cache key of "computehidden". So, if we do not want to know the "obsolete()" revisions, but just want to know what's hidden. Changing the cache key would be a very straightforward and effective solution which does not require a new file format. That'll help commands like "hg st", "hg log -T foo", etc. But it won't help complex "log"s which do require the "obsolete()" revisions. To speed up the latter, we can probably just copy the caching infrastructure from repoview.py. For 2, len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think about solutions that scale longer and are not too complex to build. That may or may not be a good habit. That said, I agree if the set is small and perf, then whatever easier to implement makes sense. I have some crazy ideas about the future where a giant bitmap is practically useful. For example, source controlling "visible heads" and the obsstore, then adding some C code to quickly fill a bitmap (requires "set" and "setancestors/descendants"), then we can show the user whenever their repo used to look like back in any time by using the bitmap filter. That'll help investigate hard issues, and implement "hg undo". > Secondly, what operations are being performed on these sets today, and do > we know if those operations are going to change? > > For instance, if a may be extended via new elements added to its end, then > an mmap-based implementation immediately becomes awkward. If a set is Not that bad if we preallocate space. > traversed or we perform operations like union or intersection, then a > bitmap might make sense if the set is very small and dense, but not if it's > big and sparse. Not necessarily true if we use an advanced variant, like the "roaringbitmap" mentioned by you last year [1]. [1]: https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-September/087864.html > (Jun's idea that mmap is effectively free is also not borne out in > practice, but that's a second-order issue here.) > > So tell us what the needed API is, what the characteristics of the data > are, and that'll help understand how to make a good decision. ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800: > On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu wrote: > > > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +: > > > As I said before we will load all non-public revs in one set and all > > > > The problem is, loading a Python set from disk is O(size-of-the-set). > > > > Bitmap's loading cost should be basically 0 (with mmap). I think that's why > > we want bitmap at the first place. There are other choices like packfile > > index, hash tables, but bitmap is the simplest and most efficient. > > > > Hey folks, > > I haven't yet seen mention of some considerations that seem very important > in driving the decision-making, so I'd appreciate it if someone could fill > me in. > > Firstly, what's our current understanding of the sizes and compositions of > these sets of numbers? In theory, we have a lot of data from practical > application at Facebook, but nobody's brought it up. I assume that both sets (set for nonpublic commits and set for obsstore) are going to be very small comparing to the repo size. I expect both sets < 1% of the repo size. And the sets is going to be sparse. > > To clarify why this matters, if an obsstore contains say 100 entries, then > a very naive implementation should be fine. If it's related to something > else, such as the size of, amount of activity in, or local age of a repo, > then maybe we have a different set of decisions to make. > > Secondly, what operations are being performed on these sets today, and do > we know if those operations are going to change? > > For instance, if a may be extended via new elements added to its end, then > an mmap-based implementation immediately becomes awkward. If a set is > traversed or we perform operations like union or intersection, then a > bitmap might make sense if the set is very small and dense, but not if it's > big and sparse. We need three operations: iteration over the whole set, testing if an element belongs to the set and adding new elements. New elements can be added to the end of the set. Not sure if more operations will be added, probably not. Agree, we can't use bitmaps for iteration. But potentially we can clean all of the places where iteration is used. Most of them are pretty straight-forward except for one tricky case in scmutil.filteredhash() function. > > (Jun's idea that mmap is effectively free is also not borne out in > practice, but that's a second-order issue here.) > > So tell us what the needed API is, what the characteristics of the data > are, and that'll help understand how to make a good decision. ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu wrote: > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +: > > As I said before we will load all non-public revs in one set and all > > The problem is, loading a Python set from disk is O(size-of-the-set). > > Bitmap's loading cost should be basically 0 (with mmap). I think that's why > we want bitmap at the first place. There are other choices like packfile > index, hash tables, but bitmap is the simplest and most efficient. > Hey folks, I haven't yet seen mention of some considerations that seem very important in driving the decision-making, so I'd appreciate it if someone could fill me in. Firstly, what's our current understanding of the sizes and compositions of these sets of numbers? In theory, we have a lot of data from practical application at Facebook, but nobody's brought it up. To clarify why this matters, if an obsstore contains say 100 entries, then a very naive implementation should be fine. If it's related to something else, such as the size of, amount of activity in, or local age of a repo, then maybe we have a different set of decisions to make. Secondly, what operations are being performed on these sets today, and do we know if those operations are going to change? For instance, if a may be extended via new elements added to its end, then an mmap-based implementation immediately becomes awkward. If a set is traversed or we perform operations like union or intersection, then a bitmap might make sense if the set is very small and dense, but not if it's big and sparse. (Jun's idea that mmap is effectively free is also not borne out in practice, but that's a second-order issue here.) So tell us what the needed API is, what the characteristics of the data are, and that'll help understand how to make a good decision. ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Jun Wu's message of 2017-02-17 10:30:45 -0800: > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +: > > As I said before we will load all non-public revs in one set and all > > The problem is, loading a Python set from disk is O(size-of-the-set). > > Bitmap's loading cost should be basically 0 (with mmap). I think that's why > we want bitmap at the first place. There are other choices like packfile > index, hash tables, but bitmap is the simplest and most efficient. Agree, but Python set is going to be small and loading it is also fast. > > > precursor revs in another set. So `ispublic()` and `isprecursor()` > > checks are very fast, it's just a set lookup. Same with update - it's > > just an insertion in the set. > > If cache invalidation happens for common write operations and the cache > needs to be rebuilt from scratch, then the solution is not better than > stateful chg - the chg way is easier (no need to serialize data), and safer > (not vulnerable to cache file corruption). > > To make the bitmap cache really useful (solving problems that chg cannot not > solve), we need to incrementally update the cache, and avoid rebuilding it > from scratch, at least for common commands like "rebase", "amend", "commit". > > > > Some code can be changed - "scmutil.filteredhash" seems to be one user > > > that iterates "filteredrevs". But what it needs is only a hash - it > > > could hash something else, like the mtime, size etc. > > > > Bookmarks, changelog, obsstore and tags can affect filtered set. > > For filtered repo we'll need to use size + mtime of bookmarks, > > changelog, obsstore, tags and maybe even smth else. That maybe > > error-prone. > > The "filteredhash" seems to be only used by tags and branchmap cache. I > think we can actually get rid of it by only caching unfiltered versions and > do the filtering later. It also seems wrong to have tags and filteredrevs > mutually depend on each other. If we can get rid of `filteredhash` then we can probably proceed with bitmaps. Although using bitmaps will still require bigger code changes than using serialized sets. > > > > Bitmaps could also be smarter - like maintaining the min and max revisions > > > so it does not need to be exactly len(repo). ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +: > As I said before we will load all non-public revs in one set and all The problem is, loading a Python set from disk is O(size-of-the-set). Bitmap's loading cost should be basically 0 (with mmap). I think that's why we want bitmap at the first place. There are other choices like packfile index, hash tables, but bitmap is the simplest and most efficient. > precursor revs in another set. So `ispublic()` and `isprecursor()` > checks are very fast, it's just a set lookup. Same with update - it's > just an insertion in the set. If cache invalidation happens for common write operations and the cache needs to be rebuilt from scratch, then the solution is not better than stateful chg - the chg way is easier (no need to serialize data), and safer (not vulnerable to cache file corruption). To make the bitmap cache really useful (solving problems that chg cannot not solve), we need to incrementally update the cache, and avoid rebuilding it from scratch, at least for common commands like "rebase", "amend", "commit". > > Some code can be changed - "scmutil.filteredhash" seems to be one user > > that iterates "filteredrevs". But what it needs is only a hash - it > > could hash something else, like the mtime, size etc. > > Bookmarks, changelog, obsstore and tags can affect filtered set. > For filtered repo we'll need to use size + mtime of bookmarks, > changelog, obsstore, tags and maybe even smth else. That maybe > error-prone. The "filteredhash" seems to be only used by tags and branchmap cache. I think we can actually get rid of it by only caching unfiltered versions and do the filtering later. It also seems wrong to have tags and filteredrevs mutually depend on each other. > > Bitmaps could also be smarter - like maintaining the min and max revisions > > so it does not need to be exactly len(repo). ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +: > Excerpts from Jun Wu's message of 2017-02-16 13:42:46 -0800: > > Excerpts from Stanislau Hlebik's message of 2017-02-16 19:39:07 +: > > > Excerpts from Stanislau Hlebik's message of 2017-02-14 09:29:25 +: > > > > Excerpts from Sean Farley's message of 2017-02-13 18:30:25 -0800: > > > > > Jun Wu writes: > > > > > > > > > > > Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800: > > > > > >> I was thinking about a more high-level approach (please feel free > > > > > >> to > > > > > >> pick apart): > > > > > >> > > > > > >> r = repo.filtered("bitmap1") > > > > > >> r2 = r.filtered("bitmap2") > > > > > >> > > > > > >> So that r2 would be an intersection of bitmap1 and bitmap2 (haven't > > > > > >> thought about a union nor the inverse). > > > > > > > > > > > > That does not conflict with my comments. It could be implemented as > > > > > > nested > > > > > > filters, or flatten the bitmap by doing an "or" operation. > > > > > > > > > > Righto. Just wanted to bring it up early before things are set in > > > > > stone. > > > > > > > > Current `repoview.filtered()` implementation can apply only one filter. > > > > I > > > > think it will be error-prone to change it, won't it? > > > > > > It seems that it's better to use sorted lists instead of bitmaps. In a > > > couple of places it is expected that repo.filteredrevs supports > > > iteration. But iteration over a bitmap is very slow. Instead we can > > > store list of non-public revs and list of precursor revs and load them > > > in a set. > > > > > > It will be slow for the case where repo has lots of draft commits. > > > In this case it's probably better to disable this feature completely. > > > > I think it's fine to have slow iteration, as long as the iteration operation > > is uncommon. The point is, the most common operation should be fast - that > > is, ishidden(rev) should be fast. Stateful chg will help the read-only case, > > the bitmap + mmap would ideally make write (like rebase) cases fast > . > As I said before we will load all non-public revs in one set and all > precursor revs in another set. So `ispublic()` and `isprecursor()` > checks are very fast, it's just a set lookup. Same with update - it's > just an insertion in the set. > > > > > Some code can be changed - "scmutil.filteredhash" seems to be one user that > > iterates "filteredrevs". But what it needs is only a hash - it could hash > > something else, like the mtime, size etc. > > Bookmarks, changelog, obsstore and tags can affect filtered set. > For filtered repo we'll need to use size + mtime of bookmarks, > changelog, obsstore, tags and maybe even smth else. That maybe > error-prone. This is implementation of two caches (nonpublic + precursor) using serialized sorted lists and sets https://bitbucket.org/stashlebik/hg/commits/99879579ac2848a2567810b677d8344150a7b319?at=hiddenbitmaps_lists > > > > Bitmaps could also be smarter - like maintaining the min and max revisions > > so it does not need to be exactly len(repo). ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Jun Wu's message of 2017-02-16 13:42:46 -0800: > Excerpts from Stanislau Hlebik's message of 2017-02-16 19:39:07 +: > > Excerpts from Stanislau Hlebik's message of 2017-02-14 09:29:25 +: > > > Excerpts from Sean Farley's message of 2017-02-13 18:30:25 -0800: > > > > Jun Wu writes: > > > > > > > > > Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800: > > > > >> I was thinking about a more high-level approach (please feel free to > > > > >> pick apart): > > > > >> > > > > >> r = repo.filtered("bitmap1") > > > > >> r2 = r.filtered("bitmap2") > > > > >> > > > > >> So that r2 would be an intersection of bitmap1 and bitmap2 (haven't > > > > >> thought about a union nor the inverse). > > > > > > > > > > That does not conflict with my comments. It could be implemented as > > > > > nested > > > > > filters, or flatten the bitmap by doing an "or" operation. > > > > > > > > Righto. Just wanted to bring it up early before things are set in stone. > > > > > > Current `repoview.filtered()` implementation can apply only one filter. I > > > think it will be error-prone to change it, won't it? > > > > It seems that it's better to use sorted lists instead of bitmaps. In a > > couple of places it is expected that repo.filteredrevs supports > > iteration. But iteration over a bitmap is very slow. Instead we can > > store list of non-public revs and list of precursor revs and load them > > in a set. > > > > It will be slow for the case where repo has lots of draft commits. > > In this case it's probably better to disable this feature completely. > > I think it's fine to have slow iteration, as long as the iteration operation > is uncommon. The point is, the most common operation should be fast - that > is, ishidden(rev) should be fast. Stateful chg will help the read-only case, > the bitmap + mmap would ideally make write (like rebase) cases fast . As I said before we will load all non-public revs in one set and all precursor revs in another set. So `ispublic()` and `isprecursor()` checks are very fast, it's just a set lookup. Same with update - it's just an insertion in the set. > > Some code can be changed - "scmutil.filteredhash" seems to be one user that > iterates "filteredrevs". But what it needs is only a hash - it could hash > something else, like the mtime, size etc. Bookmarks, changelog, obsstore and tags can affect filtered set. For filtered repo we'll need to use size + mtime of bookmarks, changelog, obsstore, tags and maybe even smth else. That maybe error-prone. > > Bitmaps could also be smarter - like maintaining the min and max revisions > so it does not need to be exactly len(repo). ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Stanislau Hlebik's message of 2017-02-16 19:39:07 +: > Excerpts from Stanislau Hlebik's message of 2017-02-14 09:29:25 +: > > Excerpts from Sean Farley's message of 2017-02-13 18:30:25 -0800: > > > Jun Wu writes: > > > > > > > Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800: > > > >> I was thinking about a more high-level approach (please feel free to > > > >> pick apart): > > > >> > > > >> r = repo.filtered("bitmap1") > > > >> r2 = r.filtered("bitmap2") > > > >> > > > >> So that r2 would be an intersection of bitmap1 and bitmap2 (haven't > > > >> thought about a union nor the inverse). > > > > > > > > That does not conflict with my comments. It could be implemented as > > > > nested > > > > filters, or flatten the bitmap by doing an "or" operation. > > > > > > Righto. Just wanted to bring it up early before things are set in stone. > > > > Current `repoview.filtered()` implementation can apply only one filter. I > > think it will be error-prone to change it, won't it? > > It seems that it's better to use sorted lists instead of bitmaps. In a > couple of places it is expected that repo.filteredrevs supports > iteration. But iteration over a bitmap is very slow. Instead we can > store list of non-public revs and list of precursor revs and load them > in a set. > > It will be slow for the case where repo has lots of draft commits. > In this case it's probably better to disable this feature completely. I think it's fine to have slow iteration, as long as the iteration operation is uncommon. The point is, the most common operation should be fast - that is, ishidden(rev) should be fast. Stateful chg will help the read-only case, the bitmap + mmap would ideally make write (like rebase) cases fast. Some code can be changed - "scmutil.filteredhash" seems to be one user that iterates "filteredrevs". But what it needs is only a hash - it could hash something else, like the mtime, size etc. Bitmaps could also be smarter - like maintaining the min and max revisions so it does not need to be exactly len(repo). ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Stanislau Hlebik's message of 2017-02-14 09:29:25 +: > Excerpts from Sean Farley's message of 2017-02-13 18:30:25 -0800: > > Jun Wu writes: > > > > > Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800: > > >> I was thinking about a more high-level approach (please feel free to > > >> pick apart): > > >> > > >> r = repo.filtered("bitmap1") > > >> r2 = r.filtered("bitmap2") > > >> > > >> So that r2 would be an intersection of bitmap1 and bitmap2 (haven't > > >> thought about a union nor the inverse). > > > > > > That does not conflict with my comments. It could be implemented as nested > > > filters, or flatten the bitmap by doing an "or" operation. > > > > Righto. Just wanted to bring it up early before things are set in stone. > > Current `repoview.filtered()` implementation can apply only one filter. I > think it will be error-prone to change it, won't it? It seems that it's better to use sorted lists instead of bitmaps. In a couple of places it is expected that repo.filteredrevs supports iteration. But iteration over a bitmap is very slow. Instead we can store list of non-public revs and list of precursor revs and load them in a set. It will be slow for the case where repo has lots of draft commits. In this case it's probably better to disable this feature completely. ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Gregory Szorc's message of 2017-02-13 20:28:37 -0700: > > > On Feb 10, 2017, at 10:33, Stanislau Hlebik wrote: > > > > Last year Durham sent a proposal for bitmap storage for computehidden() > > https://urldefense.proofpoint.com/v2/url?u=https-3A__www.mercurial-2Dscm.org_pipermail_mercurial-2Ddevel_2016-2DSeptember_087860.html&d=DwIFAg&c=5VD0RTtNlTh3ycd41b3MUw&r=1EQ58Dmb5uX1qHujcsT1Mg&m=ee96J8IDM8uhZeONo6skPSXItouNwkeRGNxHeCaXiXM&s=biY_gtRGCl25baAE_Nlw8cl4YUefBM0Xb7L0kuGklRA&e= > > . > > > > It got a few useful comments, two most important comments: > > Use bitmaps for lower-level data structures, for example, bitmap for public > > commits and bitmap for commits that are affected by obsolescence markers > > Add cache validation checks > > > > This is a new RFC that addresses these issues. > > In the code, there is a cache only for precursors. Later I'm planning to > > add cache for public commits. > > Cache validation key was added. It can be any key. For precursorcache I've > > chosen obsstore file size and mtime and cache validation key. For > > phasecache we can use hash of sorted phaseroots file since this file is > > usually small. > > Until there is a better supported mechanism for storing "draft pushes" in an > external store (Facebook's inifinitepush extension may fulfill this), > Mozilla's "Try" and code review repos need to scale to >100,000 heads / phase > roots. So a hash of phaseroots may not scale. (Of course, one solution to > this problem would be inverting phases storage to be head based instead of > root based. That doesn't solve problem if you have 100k public heads and 100k > draft heads. Scaling is hard.) Then I will use the same cache key as for 'obsstore' - mtime + size. BTW infinitepush is ready and opensourced so you can give it a try: https://bitbucket.org/facebook/hg-experimental/src/567c1477eebf43ff5fe7412c650ec49aa3e44a3c/infinitepush/?at=default > > > > > In a few places, I decided to delete cache entirely: > > Caches are blown up if we receive obsmarkers via bundle2 or > > exchange._pullobsolete(). This is not necessary since cache won't be valid > > on the next run anyway because obsstore size and/or mtime was changed. We > > can change it. > > Obsstore can store markers for node that are not present in the repo. Since > > bitmap is basically a dict from rev to boolean it can't store markers for > > revisions that are not in the repo. It doesn't cause issues unless missing > > node is received from bundle or during pull. In this case precursorcache > > doesn't recognize it as obsoleted. To fix it cache is deleted during > > changegroup.apply() if we receive a node that is in obsstore.successors > > dict. > > Similar thing in bundlerepo - new obsoleted nodes may have been added, > > let's just not use precursorcache in bundlerepo. > > > > > > The code is here: > > https://urldefense.proofpoint.com/v2/url?u=https-3A__bitbucket.org_stashlebik_hg_commits_branch_hiddenbitmaps&d=DwIFAg&c=5VD0RTtNlTh3ycd41b3MUw&r=1EQ58Dmb5uX1qHujcsT1Mg&m=ee96J8IDM8uhZeONo6skPSXItouNwkeRGNxHeCaXiXM&s=dp9GiK3_WKq2zB8mVuMy2ksPF5tnGV_1T_OSAp6jD8o&e= > > > > It's not super clean yet, but all tests pass (except maybe for commit/code > > checks). > > Please look and let me know what you think! > > > > > > Thanks, > > Stas > > ___ > > Mercurial-devel mailing list > > Mercurial-devel@mercurial-scm.org > > https://urldefense.proofpoint.com/v2/url?u=https-3A__www.mercurial-2Dscm.org_mailman_listinfo_mercurial-2Ddevel&d=DwIFAg&c=5VD0RTtNlTh3ycd41b3MUw&r=1EQ58Dmb5uX1qHujcsT1Mg&m=ee96J8IDM8uhZeONo6skPSXItouNwkeRGNxHeCaXiXM&s=mqvxWBluu-CxyLDxFLW49s61VhbzU_NiqHHCSu6NVxw&e= > > ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Sean Farley's message of 2017-02-13 18:30:25 -0800: > Jun Wu writes: > > > Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800: > >> I was thinking about a more high-level approach (please feel free to > >> pick apart): > >> > >> r = repo.filtered("bitmap1") > >> r2 = r.filtered("bitmap2") > >> > >> So that r2 would be an intersection of bitmap1 and bitmap2 (haven't > >> thought about a union nor the inverse). > > > > That does not conflict with my comments. It could be implemented as nested > > filters, or flatten the bitmap by doing an "or" operation. > > Righto. Just wanted to bring it up early before things are set in stone. Current `repoview.filtered()` implementation can apply only one filter. I think it will be error-prone to change it, won't it? ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
> On Feb 10, 2017, at 10:33, Stanislau Hlebik wrote: > > Last year Durham sent a proposal for bitmap storage for computehidden() > https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-September/087860.html. > > It got a few useful comments, two most important comments: > Use bitmaps for lower-level data structures, for example, bitmap for public > commits and bitmap for commits that are affected by obsolescence markers > Add cache validation checks > > This is a new RFC that addresses these issues. > In the code, there is a cache only for precursors. Later I'm planning to add > cache for public commits. > Cache validation key was added. It can be any key. For precursorcache I've > chosen obsstore file size and mtime and cache validation key. For phasecache > we can use hash of sorted phaseroots file since this file is usually small. Until there is a better supported mechanism for storing "draft pushes" in an external store (Facebook's inifinitepush extension may fulfill this), Mozilla's "Try" and code review repos need to scale to >100,000 heads / phase roots. So a hash of phaseroots may not scale. (Of course, one solution to this problem would be inverting phases storage to be head based instead of root based. That doesn't solve problem if you have 100k public heads and 100k draft heads. Scaling is hard.) > > In a few places, I decided to delete cache entirely: > Caches are blown up if we receive obsmarkers via bundle2 or > exchange._pullobsolete(). This is not necessary since cache won't be valid on > the next run anyway because obsstore size and/or mtime was changed. We can > change it. > Obsstore can store markers for node that are not present in the repo. Since > bitmap is basically a dict from rev to boolean it can't store markers for > revisions that are not in the repo. It doesn't cause issues unless missing > node is received from bundle or during pull. In this case precursorcache > doesn't recognize it as obsoleted. To fix it cache is deleted during > changegroup.apply() if we receive a node that is in obsstore.successors dict. > Similar thing in bundlerepo - new obsoleted nodes may have been added, let's > just not use precursorcache in bundlerepo. > > > The code is here: > https://bitbucket.org/stashlebik/hg/commits/branch/hiddenbitmaps > It's not super clean yet, but all tests pass (except maybe for commit/code > checks). > Please look and let me know what you think! > > > Thanks, > Stas > ___ > Mercurial-devel mailing list > Mercurial-devel@mercurial-scm.org > https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Jun Wu writes: > Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800: >> I was thinking about a more high-level approach (please feel free to >> pick apart): >> >> r = repo.filtered("bitmap1") >> r2 = r.filtered("bitmap2") >> >> So that r2 would be an intersection of bitmap1 and bitmap2 (haven't >> thought about a union nor the inverse). > > That does not conflict with my comments. It could be implemented as nested > filters, or flatten the bitmap by doing an "or" operation. Righto. Just wanted to bring it up early before things are set in stone. ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800: > I was thinking about a more high-level approach (please feel free to > pick apart): > > r = repo.filtered("bitmap1") > r2 = r.filtered("bitmap2") > > So that r2 would be an intersection of bitmap1 and bitmap2 (haven't > thought about a union nor the inverse). That does not conflict with my comments. It could be implemented as nested filters, or flatten the bitmap by doing an "or" operation. ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Jun Wu writes: > In general, I think this is a good direction. Some random thoughts: > > - general purposed > > I think the bitmap is not always a cache, so it should only have > operations like set/unset/readfromdisk/writetodisk. Practically, I won't > couple cache invalidation with the bitmap implementation. > > In additional, I'll try to avoid using Python-only types in the > interface. So once we decide to rewrite part of the implementation in > native C, we won't have trouble. > > See "revset" below for a possibility that bitmap is used as a non-set. > > - revset > > This is a possibility that probably won't happen any time soon. > > The revset currently uses Python set for maintaining its state. For huge > sets, Python sets may not be a good option. And various operations could > benefit from an always-topologically-sorted set, which is the bitmap. > > - mmap > > My intuition is that bitmaps fit better with mmap which can reduce the > reading loading cost. I think "vfs.mmapread" could be a thing, and > various places can benefit from it - Gabor recently showed interest in > loading revlog data by mmap, I had patches that uses mmap to read revlog > index. > > In additional, not directly related to this series, I'm a big fan of > single direction data flow. But the current code base does not seem to do a > good job in this area. As we are adding more caching layers to the code > base, it'd be nice if we have some tiny framework maintaining the dependency > of all kinds of data, to be able to understand the data flow easily, and > just to be more confident about loading orders. I think people more > experienced on architecture may want to share some ideas here. I was thinking about a more high-level approach (please feel free to pick apart): r = repo.filtered("bitmap1") r2 = r.filtered("bitmap2") So that r2 would be an intersection of bitmap1 and bitmap2 (haven't thought about a union nor the inverse). ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
Excerpts from Jun Wu's message of 2017-02-10 11:04:24 -0800: > In general, I think this is a good direction. Some random thoughts: > > - general purposed > > I think the bitmap is not always a cache, so it should only have > operations like set/unset/readfromdisk/writetodisk. Practically, I won't > couple cache invalidation with the bitmap implementation. > > In additional, I'll try to avoid using Python-only types in the > interface. So once we decide to rewrite part of the implementation in > native C, we won't have trouble. I can decouple them into bitmap and bitmapcache. bitmap then need to have an generic header. bitmapcache will store invalidation key there. > > See "revset" below for a possibility that bitmap is used as a non-set. > > - revset > > This is a possibility that probably won't happen any time soon. > > The revset currently uses Python set for maintaining its state. For huge > sets, Python sets may not be a good option. And various operations could > benefit from an always-topologically-sorted set, which is the bitmap. > > - mmap > > My intuition is that bitmaps fit better with mmap which can reduce the > reading loading cost. I think "vfs.mmapread" could be a thing, and > various places can benefit from it - Gabor recently showed interest in > loading revlog data by mmap, I had patches that uses mmap to read revlog > index. Bitmap files are going to be small and reading them all in memory should be fast. But if it'll ever become a bottleneck we can add mmap read of bitmap files. > > In additional, not directly related to this series, I'm a big fan of > single direction data flow. But the current code base does not seem to do a > good job in this area. As we are adding more caching layers to the code > base, it'd be nice if we have some tiny framework maintaining the dependency > of all kinds of data, to be able to understand the data flow easily, and > just to be more confident about loading orders. I think people more > experienced on architecture may want to share some ideas here. > > Excerpts from Stanislau Hlebik's message of 2017-02-10 17:33:28 +: > > Last year Durham sent a proposal for bitmap storage for computehidden() > > https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-September/087860.html > > . > > > > It got a few useful comments, two most important comments: > > > > 1. Use bitmaps for lower-level data structures, for example, bitmap for > > public commits and bitmap for commits that are affected by obsolescence > > markers > > 2. Add cache validation checks > > > > This is a new RFC that addresses these issues. > > > > 1. In the code, there is a cache only for precursors. Later I'm planning > > to add cache for public commits. > > 2. Cache validation key was added. It can be any key. For precursorcache > > I've chosen obsstore file size and mtime and cache validation key. For > > phasecache we can use hash of sorted phaseroots file since this file is > > usually small. > > > > > > In a few places, I decided to delete cache entirely: > > > > 1. Caches are blown up if we receive obsmarkers via bundle2 or > > exchange._pullobsolete(). This is not necessary since cache won't be valid > > on the next run anyway because obsstore size and/or mtime was changed. We > > can change it. > > 2. Obsstore can store markers for node that are not present in the repo. > > Since bitmap is basically a dict from rev to boolean it can't store markers > > for revisions that are not in the repo. It doesn't cause issues unless > > missing node is received from bundle or during pull. In this case > > precursorcache doesn't recognize it as obsoleted. To fix it cache is > > deleted during changegroup.apply() if we receive a node that is in > > obsstore.successors dict. > > 3. Similar thing in bundlerepo - new obsoleted nodes may have been > > added, let's just not use precursorcache in bundlerepo. > > > > > > The code is here: > > https://bitbucket.org/stashlebik/hg/commits/branch/hiddenbitmaps > > It's not super clean yet, but all tests pass (except maybe for commit/code > > checks). > > Please look and let me know what you think! > > > > > > Thanks, > > Stas ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: RFC: bitmap storage for precursors and phases
In general, I think this is a good direction. Some random thoughts: - general purposed I think the bitmap is not always a cache, so it should only have operations like set/unset/readfromdisk/writetodisk. Practically, I won't couple cache invalidation with the bitmap implementation. In additional, I'll try to avoid using Python-only types in the interface. So once we decide to rewrite part of the implementation in native C, we won't have trouble. See "revset" below for a possibility that bitmap is used as a non-set. - revset This is a possibility that probably won't happen any time soon. The revset currently uses Python set for maintaining its state. For huge sets, Python sets may not be a good option. And various operations could benefit from an always-topologically-sorted set, which is the bitmap. - mmap My intuition is that bitmaps fit better with mmap which can reduce the reading loading cost. I think "vfs.mmapread" could be a thing, and various places can benefit from it - Gabor recently showed interest in loading revlog data by mmap, I had patches that uses mmap to read revlog index. In additional, not directly related to this series, I'm a big fan of single direction data flow. But the current code base does not seem to do a good job in this area. As we are adding more caching layers to the code base, it'd be nice if we have some tiny framework maintaining the dependency of all kinds of data, to be able to understand the data flow easily, and just to be more confident about loading orders. I think people more experienced on architecture may want to share some ideas here. Excerpts from Stanislau Hlebik's message of 2017-02-10 17:33:28 +: > Last year Durham sent a proposal for bitmap storage for computehidden() > https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-September/087860.html > . > > It got a few useful comments, two most important comments: > > 1. Use bitmaps for lower-level data structures, for example, bitmap for > public commits and bitmap for commits that are affected by obsolescence > markers > 2. Add cache validation checks > > This is a new RFC that addresses these issues. > > 1. In the code, there is a cache only for precursors. Later I'm planning > to add cache for public commits. > 2. Cache validation key was added. It can be any key. For precursorcache > I've chosen obsstore file size and mtime and cache validation key. For > phasecache we can use hash of sorted phaseroots file since this file is > usually small. > > > In a few places, I decided to delete cache entirely: > > 1. Caches are blown up if we receive obsmarkers via bundle2 or > exchange._pullobsolete(). This is not necessary since cache won't be valid on > the next run anyway because obsstore size and/or mtime was changed. We can > change it. > 2. Obsstore can store markers for node that are not present in the repo. > Since bitmap is basically a dict from rev to boolean it can't store markers > for revisions that are not in the repo. It doesn't cause issues unless > missing node is received from bundle or during pull. In this case > precursorcache doesn't recognize it as obsoleted. To fix it cache is deleted > during changegroup.apply() if we receive a node that is in > obsstore.successors dict. > 3. Similar thing in bundlerepo - new obsoleted nodes may have been added, > let's just not use precursorcache in bundlerepo. > > > The code is here: > https://bitbucket.org/stashlebik/hg/commits/branch/hiddenbitmaps > It's not super clean yet, but all tests pass (except maybe for commit/code > checks). > Please look and let me know what you think! > > > Thanks, > Stas ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel