Re: [PATCH 3 of 4 flagprocessor v8] revlog: flag processor

2017-01-09 Thread Pierre-Yves David



On 01/08/2017 10:42 PM, Gregory Szorc wrote:

On Thu, Jan 5, 2017 at 9:42 AM, Remi Chaintron > wrote:

# HG changeset patch
# User Remi Chaintron >
# Date 1483636648 0
#  Thu Jan 05 17:17:28 2017 +
# Node ID 8701df1c04340e9951481dc4c366ba550b4e790f
# Parent  a93b1ff78332f4f78e77ec9aaa2c63f7f975399c
revlog: flag processor

Add the ability for revlog objects to process revision flags and apply
registered transforms on read/write operations.

This patch introduces:
- the 'revlog._processflags()' method that looks at revision flags
and applies
  flag processors registered on them. Due to the need to handle
non-commutative
  operations, flag transforms are applied in stable order but the
order in which
  the transforms are applied is reversed between read and write
operations.
- the 'addflagprocessor()' method allowing to register processors on
flags.
  Flag processors are defined as a 3-tuple of (read, write, raw)
functions to be
  applied depending on the operation being performed.
- an update on 'revlog.addrevision()' behavior. The current
flagprocessor design
  relies on extensions to wrap around 'addrevision()' to set flags
on revision
  data, and on the flagprocessor to perform the actual
transformation of its
  contents. In the lfs case, this means we need to process flags
before we meet
  the 2GB size check, leading to performing some operations before
it happens:
  - if flags are set on the revision data, we assume some extensions
might be
modifying the contents using the flag processor next, and we
compute the
node for the original revision data (still allowing extension to
override
the node by wrapping around 'addrevision()').
  - we then invoke the flag processor to apply registered transforms
(in lfs's
case, drastically reducing the size of large blobs).
  - finally, we proceed with the 2GB size check.

Note: In the case a cachedelta is passed to 'addrevision()' and we
detect the
flag processor modified the revision data, we chose to trust the
flag processor
and drop the cachedelta.


I've only been casually following this series, so apologies if my
feedback below has been addressed already.

From an architecture perspective, I really like this series.

But one thing that is rubbing me the wrong way is the reserved USR
flags.


Actually, these USR flag have been dropped of the official and are now 
used in the tests. (so not part of BC). The test should probably use 
explicit name to make this clearer.


The official approach regarding new flags is for people with new flag 
usecase to "quickly" contact upstream to reserve one of the available 
flag. Upstream do need to have actual support but act as a central 
registry for flags meaning.


Rémi said he will follow up with some internal documentation about that.

(note, if someone want to make as case for USR flag, he can do it 
independently once the case series is in)



I like the idea of reserving flags for future use. However, for
features that aren't in core, a simple bit is not sufficient to identify
behavior. For example, extension "fbext" authored by Facebook and
extension "mozext" authored by Mozilla could both use USR0 for different
things. If someone at Facebook running "fbext" clones a Mozilla-hosted
repo, their USR0 processor will do a completely different thing from
what "mozext" intended. Chaos ensues.

There are a number of ways to solve problems like this. One way is a
central registry of transforms. But we only have 4 bits so that won't
scale. Another is to use strings, probably with prefixes to "namespace"
extensions. But we don't have an easy way to inject strings into
revisions (although we could use a bit flag to indicate additional flags
are at the beginning of the revlog chunk data in string format, kinda
like how filelogs store copy/rename metadata). Another idea is repo
requirements. You could add a requirement like "revlogusr0=fbext" to
indicate that the USR0 flag is associated with "fbext." That way if
another loaded extension providing USR0 is present, the repo will fail
to open.

But even the repo requirements solution has problems. What about `hg
clone`? The remote server isn't advertising what the USR flags are
doing. So, a client will happily start receiving changegroup data with
flags having behavior defined by another extension. When they go to
actually read that revision, things blow up. So for repos relying on USR
flags, we need the client and/or server to advertise which USR flags
they support and for what purpose and to fail fast if support isn't
present. This means that a repo needs to know what USR flags are in use.
This could be implemented as a repo requirement.

I think it is fine to reserve a few bits for user flags. 

Re: [PATCH 3 of 4 flagprocessor v8] revlog: flag processor

2017-01-08 Thread Gregory Szorc
On Thu, Jan 5, 2017 at 9:42 AM, Remi Chaintron  wrote:

> # HG changeset patch
> # User Remi Chaintron 
> # Date 1483636648 0
> #  Thu Jan 05 17:17:28 2017 +
> # Node ID 8701df1c04340e9951481dc4c366ba550b4e790f
> # Parent  a93b1ff78332f4f78e77ec9aaa2c63f7f975399c
> revlog: flag processor
>
> Add the ability for revlog objects to process revision flags and apply
> registered transforms on read/write operations.
>
> This patch introduces:
> - the 'revlog._processflags()' method that looks at revision flags and
> applies
>   flag processors registered on them. Due to the need to handle
> non-commutative
>   operations, flag transforms are applied in stable order but the order in
> which
>   the transforms are applied is reversed between read and write operations.
> - the 'addflagprocessor()' method allowing to register processors on flags.
>   Flag processors are defined as a 3-tuple of (read, write, raw) functions
> to be
>   applied depending on the operation being performed.
> - an update on 'revlog.addrevision()' behavior. The current flagprocessor
> design
>   relies on extensions to wrap around 'addrevision()' to set flags on
> revision
>   data, and on the flagprocessor to perform the actual transformation of
> its
>   contents. In the lfs case, this means we need to process flags before we
> meet
>   the 2GB size check, leading to performing some operations before it
> happens:
>   - if flags are set on the revision data, we assume some extensions might
> be
> modifying the contents using the flag processor next, and we compute
> the
> node for the original revision data (still allowing extension to
> override
> the node by wrapping around 'addrevision()').
>   - we then invoke the flag processor to apply registered transforms (in
> lfs's
> case, drastically reducing the size of large blobs).
>   - finally, we proceed with the 2GB size check.
>
> Note: In the case a cachedelta is passed to 'addrevision()' and we detect
> the
> flag processor modified the revision data, we chose to trust the flag
> processor
> and drop the cachedelta.
>

I've only been casually following this series, so apologies if my feedback
below has been addressed already.

>From an architecture perspective, I really like this series.

But one thing that is rubbing me the wrong way is the reserved USR flags. I
like the idea of reserving flags for future use. However, for features that
aren't in core, a simple bit is not sufficient to identify behavior. For
example, extension "fbext" authored by Facebook and extension "mozext"
authored by Mozilla could both use USR0 for different things. If someone at
Facebook running "fbext" clones a Mozilla-hosted repo, their USR0 processor
will do a completely different thing from what "mozext" intended. Chaos
ensues.

There are a number of ways to solve problems like this. One way is a
central registry of transforms. But we only have 4 bits so that won't
scale. Another is to use strings, probably with prefixes to "namespace"
extensions. But we don't have an easy way to inject strings into revisions
(although we could use a bit flag to indicate additional flags are at the
beginning of the revlog chunk data in string format, kinda like how
filelogs store copy/rename metadata). Another idea is repo requirements.
You could add a requirement like "revlogusr0=fbext" to indicate that the
USR0 flag is associated with "fbext." That way if another loaded extension
providing USR0 is present, the repo will fail to open.

But even the repo requirements solution has problems. What about `hg
clone`? The remote server isn't advertising what the USR flags are doing.
So, a client will happily start receiving changegroup data with flags
having behavior defined by another extension. When they go to actually read
that revision, things blow up. So for repos relying on USR flags, we need
the client and/or server to advertise which USR flags they support and for
what purpose and to fail fast if support isn't present. This means that a
repo needs to know what USR flags are in use. This could be implemented as
a repo requirement.

I think it is fine to reserve a few bits for user flags. But we need the
docs to indicate that repos relying on these flags are (currently) very
fragile outside of tightly controlled environments and should not be used
lightly.

There are a few minor nits inline. None require a v9 unless you are already
writing one IMO.


>
> diff --git a/mercurial/bundlerepo.py b/mercurial/bundlerepo.py
> --- a/mercurial/bundlerepo.py
> +++ b/mercurial/bundlerepo.py
> @@ -148,7 +148,10 @@
>  delta = self._chunk(chain.pop())
>  text = mdiff.patches(text, [delta])
>
> -self.checkhash(text, node, rev=rev)
> +text, validatehash = self._processflags(text, self.flags(rev),
> +'read', raw=raw)
> +if validatehash:
> +self.checkhash(text, node, rev=rev)
>

Re: [PATCH 3 of 4 flagprocessor v8] revlog: flag processor

2017-01-06 Thread Rémi Chaintron
On Thu, 5 Jan 2017 at 17:50 Remi Chaintron  wrote:

> # HG changeset patch
> # User Remi Chaintron 
> # Date 1483636648 0
> #  Thu Jan 05 17:17:28 2017 +
> # Node ID 8701df1c04340e9951481dc4c366ba550b4e790f
> # Parent  a93b1ff78332f4f78e77ec9aaa2c63f7f975399c
> revlog: flag processor
>
> Add the ability for revlog objects to process revision flags and apply
> registered transforms on read/write operations.
>
> This patch introduces:
> - the 'revlog._processflags()' method that looks at revision flags and
> applies
>   flag processors registered on them. Due to the need to handle
> non-commutative
>   operations, flag transforms are applied in stable order but the order in
> which
>   the transforms are applied is reversed between read and write operations.
> - the 'addflagprocessor()' method allowing to register processors on flags.
>   Flag processors are defined as a 3-tuple of (read, write, raw) functions
> to be
>   applied depending on the operation being performed.
> - an update on 'revlog.addrevision()' behavior. The current flagprocessor
> design
>   relies on extensions to wrap around 'addrevision()' to set flags on
> revision
>   data, and on the flagprocessor to perform the actual transformation of
> its
>   contents. In the lfs case, this means we need to process flags before we
> meet
>   the 2GB size check, leading to performing some operations before it
> happens:
>   - if flags are set on the revision data, we assume some extensions might
> be
> modifying the contents using the flag processor next, and we compute
> the
> node for the original revision data (still allowing extension to
> override
> the node by wrapping around 'addrevision()').
>   - we then invoke the flag processor to apply registered transforms (in
> lfs's
> case, drastically reducing the size of large blobs).
>   - finally, we proceed with the 2GB size check.
>
> Note: In the case a cachedelta is passed to 'addrevision()' and we detect
> the
> flag processor modified the revision data, we chose to trust the flag
> processor
> and drop the cachedelta.
>
> diff --git a/mercurial/bundlerepo.py b/mercurial/bundlerepo.py
> --- a/mercurial/bundlerepo.py
> +++ b/mercurial/bundlerepo.py
> @@ -148,7 +148,10 @@
>  delta = self._chunk(chain.pop())
>  text = mdiff.patches(text, [delta])
>
> -self.checkhash(text, node, rev=rev)
> +text, validatehash = self._processflags(text, self.flags(rev),
> +'read', raw=raw)
> +if validatehash:
> +self.checkhash(text, node, rev=rev)
>  self._cache = (node, rev, text)
>  return text
>
> diff --git a/mercurial/revlog.py b/mercurial/revlog.py
> --- a/mercurial/revlog.py
> +++ b/mercurial/revlog.py
> @@ -56,6 +56,10 @@
>  REVIDX_ISCENSORED = (1 << 15) # revision has censor metadata, must be
> verified
>  REVIDX_DEFAULT_FLAGS = 0
>  REVIDX_KNOWN_FLAGS = REVIDX_ISCENSORED
> +# stable order in which flags need to be processed and their processors
> applied
> +REVIDX_FLAGS_ORDER = [
> +REVIDX_ISCENSORED,
> +]
>
>  # max size of revlog with inline data
>  _maxinline = 131072
> @@ -64,6 +68,39 @@
>  RevlogError = error.RevlogError
>  LookupError = error.LookupError
>  CensoredNodeError = error.CensoredNodeError
> +ProgrammingError = error.ProgrammingError
> +
> +# Store flag processors (cf. 'addflagprocessor()' to register)
> +_flagprocessors = { }
> +
> +def addflagprocessor(flag, processor):
> +"""Register a flag processor on a revision data flag.
> +
> +Invariant:
> +- Flags need to be defined in REVIDX_KNOWN_FLAGS and
> REVIDX_FLAGS_ORDER.
> +- Only one flag processor can be registered on a specific flag.
> +- flagprocessors must be 3-tuples of functions (read, write, raw)
> with the
> +  following signatures:
> +  - (read)  f(self, text) -> newtext, bool
> +  - (write) f(self, text) -> newtext, bool
> +  - (raw)   f(self, text) -> bool
> +  The boolean returned by these transforms is used to determine
> whether
> +  'newtext' can be used for hash integrity checking.
> +
> +  Note: The 'raw' transform is used for changegroup generation and in
> some
> +  debug commands. In this case the transform only indicates whether
> the
> +  contents can be used for hash integrity checks.
> +"""
> +if not flag & REVIDX_KNOWN_FLAGS:
> +raise ProgrammingError(_(
> +"cannot register processor on unknown flag '%x'." % (flag)))
> +if flag not in REVIDX_FLAGS_ORDER:
> +raise ProgrammingError(_(
> +"flag '%x' undefined in REVIDX_FLAGS_ORDER." % (flag)))
> +if flag in _flagprocessors:
> +raise error.Abort(_(
> +"cannot register multiple processors on flag '%x'." % (flag)))
> +_flagprocessors[flag] = processor
>
>  def getoffset(q):
>  return int(q >> 16)
> @@ -1231,11 +1268,6 @@
>  if rev