[squid-dev] [PATCH] Better support for unknown URL schemes

2016-03-15 Thread Amos Jeffries
Squid already contains AnyP::PROTO_UNKNOWN support for unknown protocols
but currently does not preserve the actual string value received for them.

This adds a textual representation ('image') to the UriScheme object to
fill that gap and ensure that all URL representatinos (ie cache keys,
logs and outgoing messages) are generated with the scheme string as it
was received.

Future work:
* add ACL support for arbitrary scheme names
* support for comparisons of unknown schemes

Amos
=== modified file 'src/URL.h'
--- src/URL.h   2016-02-23 08:51:22 +
+++ src/URL.h   2016-03-15 11:03:44 +
@@ -43,6 +43,7 @@
 
 /// convert the URL scheme to that given
 void setScheme(const AnyP::ProtocolType &p) {scheme_=p; touch();}
+void setSchemeImage(const char *str) {scheme_.setImage(str); touch();}
 
 void userInfo(const SBuf &s) {userInfo_=s; touch();}
 const SBuf &userInfo() const {return userInfo_;}

=== modified file 'src/anyp/UriScheme.cc'
--- src/anyp/UriScheme.cc   2016-01-01 00:12:18 +
+++ src/anyp/UriScheme.cc   2016-03-15 15:10:17 +
@@ -14,19 +14,17 @@
 char const *
 AnyP::UriScheme::c_str() const
 {
+if (!schemeImage_.isEmpty())
+return schemeImage_.c_str();
+
 if (theScheme_ == AnyP::PROTO_UNKNOWN)
-return "(unknown)";
-
-static char out[BUFSIZ];
-int p = 0;
+return "(unknown)"; // should have been set explicitly
 
 if (theScheme_ > AnyP::PROTO_NONE && theScheme_ < AnyP::PROTO_MAX) {
-const char *in = AnyP::ProtocolType_str[theScheme_];
-for (; p < (BUFSIZ-1) && in[p] != '\0'; ++p)
-out[p] = xtolower(in[p]);
+schemeImage_ = AnyP::ProtocolType_str[theScheme_];
+schemeImage_.toLower();
 }
-out[p] = '\0';
-return out;
+return schemeImage_.c_str();
 }
 
 unsigned short

=== modified file 'src/anyp/UriScheme.h'
--- src/anyp/UriScheme.h2016-01-01 00:12:18 +
+++ src/anyp/UriScheme.h2016-03-15 11:19:04 +
@@ -10,6 +10,7 @@
 #define SQUID_ANYP_URISCHEME_H
 
 #include "anyp/ProtocolType.h"
+#include "sbuf/SBuf.h"
 
 #include 
 
@@ -27,7 +28,7 @@
 ~UriScheme() {}
 
 operator AnyP::ProtocolType() const { return theScheme_; }
-
+// XXX: does not account for comparison of unknown schemes (by image)
 bool operator != (AnyP::ProtocolType const & aProtocol) const { return 
theScheme_ != aProtocol; }
 
 /** Get a char string representation of the scheme.
@@ -41,9 +42,16 @@
 
 unsigned short defaultPort() const;
 
+/// Sets the string representation of this scheme.
+/// Only needed if the scheme type is PROTO_UNKNOWN.
+void setImage(const char *str) {assert(theScheme_ == AnyP::PROTO_UNKNOWN); 
schemeImage_ = str;}
+
 private:
 /// This is a typecode pointer into the enum/registry of protocols handled.
 AnyP::ProtocolType theScheme_;
+
+/// the string representation to use for theScheme_
+mutable SBuf schemeImage_;
 };
 
 } // namespace AnyP

=== modified file 'src/url.cc'
--- src/url.cc  2016-01-01 00:12:18 +
+++ src/url.cc  2016-03-15 13:52:00 +
@@ -18,6 +18,7 @@
 
 static HttpRequest *urlParseFinish(const HttpRequestMethod& method,
const AnyP::ProtocolType protocol,
+   const char *const protoStr,
const char *const urlpath,
const char *const host,
const SBuf &login,
@@ -157,6 +158,9 @@
 if (strncasecmp(b, "whois", len) == 0)
 return AnyP::PROTO_WHOIS;
 
+if (len > 0)
+return AnyP::PROTO_UNKNOWN;
+
 return AnyP::PROTO_NONE;
 }
 
@@ -215,7 +219,7 @@
URL::Asterisk().cmp(url) == 0) {
 protocol = AnyP::PROTO_HTTP;
 port = AnyP::UriScheme(protocol).defaultPort();
-return urlParseFinish(method, protocol, url, host, SBuf(), port, 
request);
+return urlParseFinish(method, protocol, proto, url, host, SBuf(), 
port, request);
 } else if (!strncmp(url, "urn:", 4)) {
 return urnParse(method, url, request);
 } else {
@@ -420,7 +424,7 @@
 }
 }
 
-return urlParseFinish(method, protocol, urlpath, host, SBuf(login), port, 
request);
+return urlParseFinish(method, protocol, proto, urlpath, host, SBuf(login), 
port, request);
 }
 
 /**
@@ -431,6 +435,7 @@
 static HttpRequest *
 urlParseFinish(const HttpRequestMethod& method,
const AnyP::ProtocolType protocol,
+   const char *const protoStr, // for unknown protocols
const char *const urlpath,
const char *const host,
const SBuf &login,
@@ -443,6 +448,9 @@
 request->initHTTP(method, protocol, urlpath);
 }
 
+if (protocol == AnyP::PROTO_UNKNOWN)
+request->url.setSchemeImage(protoStr);
+
 request->url.host(host);
 request->url.userInfo(login);
 request->url.port(port);

_

Re: [squid-dev] [PATCH] Better support for unknown URL schemes

2016-03-15 Thread Alex Rousskov
On 03/15/2016 09:36 AM, Amos Jeffries wrote:
> Squid already contains AnyP::PROTO_UNKNOWN support for unknown protocols
> but currently does not preserve the actual string value received for them.
> 
> This adds a textual representation ('image') to the UriScheme object to
> fill that gap and ensure that all URL representatinos (ie cache keys,
> logs and outgoing messages) are generated with the scheme string as it
> was received.

Preserving textual representation is the right thing to do, but the
implementation itself needs a lot more work IMO.


>  /// convert the URL scheme to that given
>  void setScheme(const AnyP::ProtocolType &p) {scheme_=p; touch();}
> +void setSchemeImage(const char *str) {scheme_.setImage(str); touch();}


The parameter type should be changed instead. Yes, that will require
more changes, but those changes are essential to properly support
foreign protocol names. Without those changes, you are creating an API
bug in addition to adding a useful feature.

We (mostly you) have already done similar work for foreign HTTP request
methods. This is similar -- UriScheme should be used nearly everywhere
(instead of ProtocolType that does not carry enough information). It
could be argued that the UriScheme class itself should be renamed to
something like Protocol, but I do not want to argue about that.


> +/// Sets the string representation of this scheme.
> +/// Only needed if the scheme type is PROTO_UNKNOWN.
> +void setImage(const char *str) {assert(theScheme_ == 
> AnyP::PROTO_UNKNOWN); schemeImage_ = str;}

This method should be a constructor.


>  char const *
>  AnyP::UriScheme::c_str() const
>  {

This method should be replaced with AnyP::UriScheme::image() returning SBuf.


> +if (!schemeImage_.isEmpty())
> +return schemeImage_.c_str();
> +

Why are we not lowering the case of a foreign protocol image if we are
lowering the case of standard protocol images below? Either both should
be lowered or there should be a source code comment explaining the
discrepancy.

If both should be lowered, the lowering should be done in the
constructor to avoid lowering the same string multiple times.


>  if (theScheme_ > AnyP::PROTO_NONE && theScheme_ < AnyP::PROTO_MAX) {
> -const char *in = AnyP::ProtocolType_str[theScheme_];
> -for (; p < (BUFSIZ-1) && in[p] != '\0'; ++p)
> -out[p] = xtolower(in[p]);
> +schemeImage_ = AnyP::ProtocolType_str[theScheme_];
> +schemeImage_.toLower();
>  }

This slow code may not be needed at all if the rest of the changes are
implemented. If this code stays, please add an "XXX: Slow." or a similar
comment after toLower(). There are many ways to fix/optimize this, but
the best way probably depends on the rest of the changes so I am not
going to suggest any specific TODO here.


> @@ -157,6 +158,9 @@
>  if (strncasecmp(b, "whois", len) == 0)
>  return AnyP::PROTO_WHOIS;
>  
> +if (len > 0)
> +return AnyP::PROTO_UNKNOWN;
> +
>  return AnyP::PROTO_NONE;
>  }

This function should return a UriScheme object instead. Ideally, using
created-once constants for common protocols (so that we do not have to
deal with to-lower-case conversion every time a scheme image is needed).


> @@ -27,7 +28,7 @@
>  ~UriScheme() {}
...
> +/// the string representation to use for theScheme_
> +mutable SBuf schemeImage_;

Please do not name Foo members FooX. One Foo is enough!


> +/// the string representation to use for theScheme_

s/to use for theScheme_//
or
s/to use for theScheme_/of the scheme//

If you can make such a guarantee after all other changes, please add
something like "; always in lower case".


> +mutable SBuf schemeImage_;

Mutable may not be needed after all other changes.


It is difficult for me to review 4-line no-function-names diff, so
forgive me if I missed any context-dependent caveats.


Thank you,

Alex.

___
squid-dev mailing list
squid-dev@lists.squid-cache.org
http://lists.squid-cache.org/listinfo/squid-dev


Re: [squid-dev] [PATCH] Better support for unknown URL schemes

2016-06-15 Thread Amos Jeffries
On 16/03/2016 5:52 a.m., Alex Rousskov wrote:
> On 03/15/2016 09:36 AM, Amos Jeffries wrote:
>> Squid already contains AnyP::PROTO_UNKNOWN support for unknown protocols
>> but currently does not preserve the actual string value received for them.
>>
>> This adds a textual representation ('image') to the UriScheme object to
>> fill that gap and ensure that all URL representatinos (ie cache keys,
>> logs and outgoing messages) are generated with the scheme string as it
>> was received.
> 
> Preserving textual representation is the right thing to do, but the
> implementation itself needs a lot more work IMO.
> 
> 
>>  /// convert the URL scheme to that given
>>  void setScheme(const AnyP::ProtocolType &p) {scheme_=p; touch();}
>> +void setSchemeImage(const char *str) {scheme_.setImage(str); touch();}
> 
> 
> The parameter type should be changed instead. Yes, that will require
> more changes, but those changes are essential to properly support
> foreign protocol names. Without those changes, you are creating an API
> bug in addition to adding a useful feature.
> 
> We (mostly you) have already done similar work for foreign HTTP request
> methods. This is similar -- UriScheme should be used nearly everywhere
> (instead of ProtocolType that does not carry enough information). It
> could be argued that the UriScheme class itself should be renamed to
> something like Protocol, but I do not want to argue about that.
> 

There are two distinct layers involved. I'm sure you know this, but I'll
reiterate for the sake of clarity.

- ProtocolType is the transfer or transport protocol type.
- UriScheme is the URL/URI/URN scheme label.

They have been conflated in the past and are still in the difficult
process of de-tangling. The URI scheme implies a protocol, but is not
one in and of itself. For example HTTP ProtocolType can be used to
transfer ftp:// scheme URLs, etc.

Untangling them is happening, but not part of the scope of this patch
except to the point that the URL image() is being made to no longer
generate directly from the ProtocolType value in most cases.

> 
>> +/// Sets the string representation of this scheme.
>> +/// Only needed if the scheme type is PROTO_UNKNOWN.
>> +void setImage(const char *str) {assert(theScheme_ == 
>> AnyP::PROTO_UNKNOWN); schemeImage_ = str;}
> 
> This method should be a constructor.

Done. Note that this has side effects on HttpRequest construction and
requires move semantics to be added to UriScheme. Not a problem exactly,
but scope creep.

> 
>>  char const *
>>  AnyP::UriScheme::c_str() const
>>  {
> 
> This method should be replaced with AnyP::UriScheme::image() returning SBuf.
> 

Done.

> 
>> +if (!schemeImage_.isEmpty())
>> +return schemeImage_.c_str();
>> +
> 
> Why are we not lowering the case of a foreign protocol image if we are
> lowering the case of standard protocol images below? Either both should
> be lowered or there should be a source code comment explaining the
> discrepancy.

Transfer Protocol type and Scheme labels are both case-sensitive (and
differently cased).

So when using an externally supplied scheme label we must take what was
given.

But, since our registered ProtocolType and its ProtocolType_str is
actually the message transfer protocol label (upper case) we need to
lower the case of ProtocolType_str when using it as implicit URI scheme
for one of the registered/known scheme names.


> 
> If both should be lowered, the lowering should be done in the
> constructor to avoid lowering the same string multiple times.
> 

Trunk is what lowers the string multiple times (on each display of the
URL, including debugs). These patches make that happen only once and
stores the lowered result in the scheme image_ to re-use it for any
followup calls after the initial down-case.
 Initial patch did it on first display. After the ctor changes it now
happens one on construction regardless of whether display ever happens.


> 
>>  if (theScheme_ > AnyP::PROTO_NONE && theScheme_ < AnyP::PROTO_MAX) {
>> -const char *in = AnyP::ProtocolType_str[theScheme_];
>> -for (; p < (BUFSIZ-1) && in[p] != '\0'; ++p)
>> -out[p] = xtolower(in[p]);
>> +schemeImage_ = AnyP::ProtocolType_str[theScheme_];
>> +schemeImage_.toLower();
>>  }
> 
> This slow code may not be needed at all if the rest of the changes are
> implemented. If this code stays, please add an "XXX: Slow." or a similar
> comment after toLower(). There are many ways to fix/optimize this, but
> the best way probably depends on the rest of the changes so I am not
> going to suggest any specific TODO here.

AFAICT the best optimization we can do is just take the image string
from the urlParse logic. Which is what this patch is doing now after the
ctor changes.

The slow logic path is now only done for URI which are generated
internally by Squid from just a ProtocolType enum.

> 
>> @@ -157,6 +158,9 @@
>>  if (strncasecmp(b, "whois", le

Re: [squid-dev] [PATCH] Better support for unknown URL schemes

2016-08-16 Thread Amos Jeffries
Applied as trunk rev.14802.

Amos

___
squid-dev mailing list
squid-dev@lists.squid-cache.org
http://lists.squid-cache.org/listinfo/squid-dev