Re: [RFC/WIP PATCH 11/11] Document protocol version 2

2015-06-04 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 We should definitely _not_ add anything that scales with the repository
 size. For instance, the symref field only shows the HEAD now. That's
 OK, as it's constant size.

I agree that this is an easy-to-explain rule to keep the design
sensible.

 We do not show _all_ symrefs right now
 because of pkt-line length limitations. With v2, we could in theory
 start doing so (one per pkt-line). But that does not make it a good
 idea.

Very true.  If we want symref information conveyed, then we should
either make a new symref advertisement available (and it is OK to
use a single-bit capability to enable or disable that new section),
or make the ref advertisement natively capable of always doing so
(i.e. if there is no need to make this optional).


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/WIP PATCH 11/11] Document protocol version 2

2015-06-04 Thread Jeff King
On Mon, Jun 01, 2015 at 04:40:54PM -0700, Stefan Beller wrote:

 Thinking about this further, maybe it is a good idea to restrict the
 capabilities advertising to alphabetical order?

This seems like an unnecessary restriction. The main impetus seems to
be:

 So how does parse_capability scale w.r.t the number of capabilities?
 If parse_capability is just a linear search then it is O(n) and with n
 capabilities the client faces an O(n^2) computation which is bad. So
 if we were to require alphabetic capabilities, you could internally
 keep track and the whole operation is O(n). I just wonder if this is
 premature optimization or some thought we need to think of.

but that is an implementation problem, not a protocol problem. The
parsing side can easily use something better than O(n) lookups (e.g., a
binary search). I think we can live with O(n lg n) to parse the whole
list. A true in-order merge would be O(n), but the difference is
probably negligible here, because...

 Now if we assume the number of capabilities grows over time a lot
 (someone may abuse it for a cool feature, similar to the refs
 currently. Nobody thought about having so many refs in advance)

I think if it grows without bound, we are doing it wrong. We should
generally only be adding capabilities that require a single short tag in
the list (server supports foo, client asks for foo). I'd find it
acceptable to add ones that repeat, as long as we are very sure that the
repetition is going to be small, or scales with the size of the config
(e.g., servers says here is a mirror you can seed from; here is another
mirror you can seed from).

We should definitely _not_ add anything that scales with the repository
size. For instance, the symref field only shows the HEAD now. That's
OK, as it's constant size. We do not show _all_ symrefs right now
because of pkt-line length limitations. With v2, we could in theory
start doing so (one per pkt-line). But that does not make it a good
idea. The right way to implement that is:

  1. the server says I can tell you about symrefs

  2. client says OK, I understand how to parse your symref data

  3. for each ref in the advertisement, tack on \0symref=... (or
 whatever).

The capability portion of the conversation remains constant-sized, and
the O(# of refs) portion is part of the ref advertisement, which is
inherently of that magnitude.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/WIP PATCH 11/11] Document protocol version 2

2015-06-02 Thread Junio C Hamano
Stefan Beller sbel...@google.com writes:

 I think the recent issue with the push certificates shows that having 
 arbitrary
 data after the = is a bad idea.

I do not think push certificate episode tells any such thing.

It was about not carefully using cryptography with arbitrary data.
How that arbitrary data came to the machinery is irrelevant.  We
could have used base-64 to encode the server nonce when transferring
it to the machinery via capability, but then decode it in order to
place it in the cerficiate.

Do not restrict transport for such a reason and make legitimate uses
of the transport unnecessarily harder for later users.  What needs
to be done is to think how the data that was transport was used.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/WIP PATCH 11/11] Document protocol version 2

2015-06-01 Thread Stefan Beller
On Fri, May 29, 2015 at 3:21 PM, Jeff King p...@peff.net wrote:
 On Fri, May 29, 2015 at 02:52:14PM -0700, Junio C Hamano wrote:

  Currently we can do a = as part of the line after the first ref, such as
 
  symref=HEAD:refs/heads/master agent=git/2:2.4.0
 
  so I thought we want to keep this.

 I do not understand that statement.

 Capability exchange in v2 is one packet per cap, so the above
 example would be expressed as:

   symref=HEAD:refs/heads/master
 agent=git/2:2.4.0

 right?  Your keyvaluepair is limited to [a-z0-9-_=]*, and neither
 of the above two can be expressed with that, which was why I said
 you need two different set of characters before and after =.  Left
 hand side of = is tightly limited and that is OK.  Right hand side
 may contain characters like ':', '.' and '/', so your alphabet need
 to be more lenient, even in v1 (which I would imagine would be any
 octet other than SP, LF and NUL).

I think the recent issue with the push certificates shows that having arbitrary
data after the = is a bad idea. So we need to be very cautious when to allow
which data after the =.

I'll try split up the patch.


 Yes. See git_user_agent_sanitized(), for example, which allows basically
 any printable ASCII except for SP.

 I think the v2 capabilities do not even need to have that restriction.
 It can allow arbitrary binary data, because it has an 8bit-clean framing
 mechanism (pkt-lines). Of course, that means such capabilities cannot be
 represented in a v1 conversation (whose framing mechanism involves SP
 and NUL). But it's probably acceptable to introduce new capabilities
 which are only available in a v2 conversation. Old clients that do not
 understand v2 would not understand the capability either. It does
 require new clients implementing the capability to _also_ implement v2
 if they have not done so, but I do not mind pushing people in that
 direction.

 The initial v2 client implementation should probably do a few cautionary
 things, then:

   1. Do _not_ fold the per-pkt capabilities into a v1 string; that loses
  the robust framing. I suggested string_list earlier, but probably
  we want a list of ptr/len pair, so that it can remain NUL-clean.

   2. Avoid holding on to unknown packets longer than necessary. Some
  capability pkt-lines may be arbitrarily large (up to 64K). If we do
  not understand them during the v2 read of the capabilities, there
  is no point hanging on to them. It's not _wrong_ to do so, but just
  inefficient; if we know that clients will just throw away unknown
  packets, then we can later introduce new packets with large data,
  without worrying about wasting the client's resources.

  I suspect it's not that big a deal either way, though. I have no
  plans for sending a bunch of large packets, and anyway network
  bandwidth is probably more precious than client memory.

That's very sensible thoughts after rereading this email. The version
I'll be sending out today will not follow those suggestions though. :(


 -Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/WIP PATCH 11/11] Document protocol version 2

2015-06-01 Thread Stefan Beller
On Mon, Jun 1, 2015 at 4:14 PM, Stefan Beller sbel...@google.com wrote:
 On Fri, May 29, 2015 at 3:21 PM, Jeff King p...@peff.net wrote:
 On Fri, May 29, 2015 at 02:52:14PM -0700, Junio C Hamano wrote:

  Currently we can do a = as part of the line after the first ref, such as
 
  symref=HEAD:refs/heads/master agent=git/2:2.4.0
 
  so I thought we want to keep this.

 I do not understand that statement.

 Capability exchange in v2 is one packet per cap, so the above
 example would be expressed as:

   symref=HEAD:refs/heads/master
 agent=git/2:2.4.0

 right?  Your keyvaluepair is limited to [a-z0-9-_=]*, and neither
 of the above two can be expressed with that, which was why I said
 you need two different set of characters before and after =.  Left
 hand side of = is tightly limited and that is OK.  Right hand side
 may contain characters like ':', '.' and '/', so your alphabet need
 to be more lenient, even in v1 (which I would imagine would be any
 octet other than SP, LF and NUL).

 I think the recent issue with the push certificates shows that having 
 arbitrary
 data after the = is a bad idea. So we need to be very cautious when to allow
 which data after the =.

 I'll try split up the patch.


 Yes. See git_user_agent_sanitized(), for example, which allows basically
 any printable ASCII except for SP.

 I think the v2 capabilities do not even need to have that restriction.
 It can allow arbitrary binary data, because it has an 8bit-clean framing
 mechanism (pkt-lines). Of course, that means such capabilities cannot be
 represented in a v1 conversation (whose framing mechanism involves SP
 and NUL). But it's probably acceptable to introduce new capabilities
 which are only available in a v2 conversation. Old clients that do not
 understand v2 would not understand the capability either. It does
 require new clients implementing the capability to _also_ implement v2
 if they have not done so, but I do not mind pushing people in that
 direction.

 The initial v2 client implementation should probably do a few cautionary
 things, then:

   1. Do _not_ fold the per-pkt capabilities into a v1 string; that loses
  the robust framing. I suggested string_list earlier, but probably
  we want a list of ptr/len pair, so that it can remain NUL-clean.

   2. Avoid holding on to unknown packets longer than necessary. Some
  capability pkt-lines may be arbitrarily large (up to 64K). If we do
  not understand them during the v2 read of the capabilities, there
  is no point hanging on to them. It's not _wrong_ to do so, but just
  inefficient; if we know that clients will just throw away unknown
  packets, then we can later introduce new packets with large data,
  without worrying about wasting the client's resources.

  I suspect it's not that big a deal either way, though. I have no
  plans for sending a bunch of large packets, and anyway network
  bandwidth is probably more precious than client memory.

 That's very sensible thoughts after rereading this email. The version
 I'll be sending out today will not follow those suggestions though. :(

Thinking about this further, maybe it is a good idea to restrict the
capabilities
advertising to alphabetical order?

The exchange would look like this:

server:
  for capability in list:
  pkt_write(capability)
  pkt_flush

client:
  do
line = recv_pkt()
parse_capability(line)
  while line != flush

with parse_capability checking if we know the capability and maybe setting some
internal field if we know this capability.

Now if we assume the number of capabilities grows over time a lot (someone may
abuse it for a cool feature, similar to the refs currently. Nobody
thought about
having so many refs in advance)

So how does parse_capability scale w.r.t the number of capabilities?
If parse_capability is just a linear search then it is O(n) and with n
capabilities
the client faces an O(n^2) computation which is bad. So if we were to require
alphabetic capabilities, you could internally keep track and the whole operation
is O(n). I just wonder if this is premature optimization or some thought we need
to think of.

To prevent this problem from popping up, it must be easier to
introduce a new phase
after the capabilities exchange than to just abuse the capabilities
phase for whatever
you plan on doing.

Thanks,
Stefan



 -Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/WIP PATCH 11/11] Document protocol version 2

2015-05-29 Thread Jeff King
On Fri, May 29, 2015 at 02:52:14PM -0700, Junio C Hamano wrote:

  Currently we can do a = as part of the line after the first ref, such as
 
  symref=HEAD:refs/heads/master agent=git/2:2.4.0
 
  so I thought we want to keep this.
 
 I do not understand that statement.
 
 Capability exchange in v2 is one packet per cap, so the above
 example would be expressed as:
 
   symref=HEAD:refs/heads/master
 agent=git/2:2.4.0
 
 right?  Your keyvaluepair is limited to [a-z0-9-_=]*, and neither
 of the above two can be expressed with that, which was why I said
 you need two different set of characters before and after =.  Left
 hand side of = is tightly limited and that is OK.  Right hand side
 may contain characters like ':', '.' and '/', so your alphabet need
 to be more lenient, even in v1 (which I would imagine would be any
 octet other than SP, LF and NUL).

Yes. See git_user_agent_sanitized(), for example, which allows basically
any printable ASCII except for SP.

I think the v2 capabilities do not even need to have that restriction.
It can allow arbitrary binary data, because it has an 8bit-clean framing
mechanism (pkt-lines). Of course, that means such capabilities cannot be
represented in a v1 conversation (whose framing mechanism involves SP
and NUL). But it's probably acceptable to introduce new capabilities
which are only available in a v2 conversation. Old clients that do not
understand v2 would not understand the capability either. It does
require new clients implementing the capability to _also_ implement v2
if they have not done so, but I do not mind pushing people in that
direction.

The initial v2 client implementation should probably do a few cautionary
things, then:

  1. Do _not_ fold the per-pkt capabilities into a v1 string; that loses
 the robust framing. I suggested string_list earlier, but probably
 we want a list of ptr/len pair, so that it can remain NUL-clean.

  2. Avoid holding on to unknown packets longer than necessary. Some
 capability pkt-lines may be arbitrarily large (up to 64K). If we do
 not understand them during the v2 read of the capabilities, there
 is no point hanging on to them. It's not _wrong_ to do so, but just
 inefficient; if we know that clients will just throw away unknown
 packets, then we can later introduce new packets with large data,
 without worrying about wasting the client's resources.

 I suspect it's not that big a deal either way, though. I have no
 plans for sending a bunch of large packets, and anyway network
 bandwidth is probably more precious than client memory.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/WIP PATCH 11/11] Document protocol version 2

2015-05-29 Thread Junio C Hamano
Stefan Beller sbel...@google.com writes:

 +Capability discovery (v2)
 +-
 ...
 +  capability-list  =  *(capability) [agent LF] flush-pkt
 +  capability   =  PKT-LINE(capability: keyvaluepair LF)
 +  agent=  keyvaluepair LF
 +  keyvaluepair =  1*(LC_ALPHA / DIGIT / - / _ / =)

 What is the = doing there?  If you meant to cover things like
 lang=en with this, I do not think it is a good idea.  Rather, it
 should be more like this:

 capability = 1*(LC_ALPHA / DIGIT / - / _) [ = value ]
 value = 0*( any octet other than LF, NUL )

 in order to leave us wiggle room to have more than very limited
 subset of US-ASCII in 'value'.  I suspect that we may want to allow
 anything other than LF (unlike v1 that allowed anything other than
 SP and LF).

 Currently we can do a = as part of the line after the first ref, such as

 symref=HEAD:refs/heads/master agent=git/2:2.4.0

 so I thought we want to keep this.

I do not understand that statement.

Capability exchange in v2 is one packet per cap, so the above
example would be expressed as:

symref=HEAD:refs/heads/master
agent=git/2:2.4.0

right?  Your keyvaluepair is limited to [a-z0-9-_=]*, and neither
of the above two can be expressed with that, which was why I said
you need two different set of characters before and after =.  Left
hand side of = is tightly limited and that is OK.  Right hand side
may contain characters like ':', '.' and '/', so your alphabet need
to be more lenient, even in v1 (which I would imagine would be any
octet other than SP, LF and NUL).
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/WIP PATCH 11/11] Document protocol version 2

2015-05-29 Thread Junio C Hamano
Stefan Beller sbel...@google.com writes:

 @@ -1,11 +1,11 @@
  Packfile transfer protocols
  ===
  
 -Git supports transferring data in packfiles over the ssh://, git:// and
 +Git supports transferring data in packfiles over the ssh://, git://, http:// 
 and

When you have chance, can you do things like this, which is a clear
improvement of the current document even if we never had v2, as
separate patches?

 +Capability discovery (v2)
 +-
 ...
 +  capability-list  =  *(capability) [agent LF] flush-pkt
 +  capability   =  PKT-LINE(capability: keyvaluepair LF)
 +  agent=  keyvaluepair LF
 +  keyvaluepair =  1*(LC_ALPHA / DIGIT / - / _ / =)

What is the = doing there?  If you meant to cover things like
lang=en with this, I do not think it is a good idea.  Rather, it
should be more like this:

capability = 1*(LC_ALPHA / DIGIT / - / _) [ = value ]
value = 0*( any octet other than LF, NUL )

in order to leave us wiggle room to have more than very limited
subset of US-ASCII in 'value'.  I suspect that we may want to allow
anything other than LF (unlike v1 that allowed anything other than
SP and LF).

 +  LC_ALPHA =  %x61-7A
 +
 +
 +The client MUST ignore any data on pkt-lines starting with anything
 +different than capability for future ease of extension.
 +
 +The client MUST NOT ask for capabilities the server did not say it
 +supports. The server MUST diagnose and abort if capabilities it does
 +not understand was requested. The server MUST NOT ignore capabilities
 +that client requested and server advertised.  As a consequence of these
 +rules, server MUST NOT advertise capabilities it does not understand.

I think it was already discussed that we shouldn't do the
capability: and cap: prefixes in reviews of earlier parts, so
the details of this part would be updated?

 @@ -154,10 +203,14 @@ If HEAD is a valid ref, HEAD MUST appear as the first 
 advertised
  ref.  If HEAD is not a valid ref, HEAD MUST NOT appear in the
  advertisement list at all, but other refs may still appear.
  
 -The stream MUST include capability declarations behind a NUL on the
 -first ref. The peeled value of a ref (that is ref^{}) MUST be
 -immediately after the ref itself, if presented. A conforming server
 -MUST peel the ref if it's an annotated tag.
 +In version 1 the stream MUST include capability declarations behind
 +a NUL on the first ref. The peeled value of a ref (that is ref^{})
 +MUST be immediately after the ref itself, if presented. A conforming
 +server MUST peel the ref if it's an annotated tag.
 +
 +In version 2 the capabilities are already negotiated, so the first ref
 +MUST NOT be followed by any capability advertisement, but it should be
 +treated as any other refs advertising line.

Sensible.

 @@ -178,13 +231,28 @@ MUST peel the ref if it's an annotated tag.
shallow  =  PKT-LINE(shallow SP obj-id)
  
capability-list  =  capability *(SP capability)
 -  capability   =  1*(LC_ALPHA / DIGIT / - / _)
 +  capability   =  1*(LC_ALPHA / DIGIT / - / _ / =)

Ditto.

Thanks.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/WIP PATCH 11/11] Document protocol version 2

2015-05-29 Thread Stefan Beller
On Fri, May 29, 2015 at 1:35 PM, Junio C Hamano gits...@pobox.com wrote:
 Stefan Beller sbel...@google.com writes:

 @@ -1,11 +1,11 @@
  Packfile transfer protocols
  ===

 -Git supports transferring data in packfiles over the ssh://, git:// and
 +Git supports transferring data in packfiles over the ssh://, git://, 
 http:// and

 When you have chance, can you do things like this, which is a clear
 improvement of the current document even if we never had v2, as
 separate patches?

will do.


 +Capability discovery (v2)
 +-
 ...
 +  capability-list  =  *(capability) [agent LF] flush-pkt
 +  capability   =  PKT-LINE(capability: keyvaluepair LF)
 +  agent=  keyvaluepair LF
 +  keyvaluepair =  1*(LC_ALPHA / DIGIT / - / _ / =)

 What is the = doing there?  If you meant to cover things like
 lang=en with this, I do not think it is a good idea.  Rather, it
 should be more like this:

 capability = 1*(LC_ALPHA / DIGIT / - / _) [ = value ]
 value = 0*( any octet other than LF, NUL )

 in order to leave us wiggle room to have more than very limited
 subset of US-ASCII in 'value'.  I suspect that we may want to allow
 anything other than LF (unlike v1 that allowed anything other than
 SP and LF).

Currently we can do a = as part of the line after the first ref, such as

symref=HEAD:refs/heads/master agent=git/2:2.4.0

so I thought we want to keep this. And below I just corrected what I thought
was a difference between documentation and implementation.


 +  LC_ALPHA =  %x61-7A
 +
 +
 +The client MUST ignore any data on pkt-lines starting with anything
 +different than capability for future ease of extension.
 +
 +The client MUST NOT ask for capabilities the server did not say it
 +supports. The server MUST diagnose and abort if capabilities it does
 +not understand was requested. The server MUST NOT ignore capabilities
 +that client requested and server advertised.  As a consequence of these
 +rules, server MUST NOT advertise capabilities it does not understand.

 I think it was already discussed that we shouldn't do the
 capability: and cap: prefixes in reviews of earlier parts, so
 the details of this part would be updated?

sure


 @@ -154,10 +203,14 @@ If HEAD is a valid ref, HEAD MUST appear as the first 
 advertised
  ref.  If HEAD is not a valid ref, HEAD MUST NOT appear in the
  advertisement list at all, but other refs may still appear.

 -The stream MUST include capability declarations behind a NUL on the
 -first ref. The peeled value of a ref (that is ref^{}) MUST be
 -immediately after the ref itself, if presented. A conforming server
 -MUST peel the ref if it's an annotated tag.
 +In version 1 the stream MUST include capability declarations behind
 +a NUL on the first ref. The peeled value of a ref (that is ref^{})
 +MUST be immediately after the ref itself, if presented. A conforming
 +server MUST peel the ref if it's an annotated tag.
 +
 +In version 2 the capabilities are already negotiated, so the first ref
 +MUST NOT be followed by any capability advertisement, but it should be
 +treated as any other refs advertising line.

 Sensible.

 @@ -178,13 +231,28 @@ MUST peel the ref if it's an annotated tag.
shallow  =  PKT-LINE(shallow SP obj-id)

capability-list  =  capability *(SP capability)
 -  capability   =  1*(LC_ALPHA / DIGIT / - / _)
 +  capability   =  1*(LC_ALPHA / DIGIT / - / _ / =)

 Ditto.

 Thanks.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html