Re: [PATCH] Add a new log format variable "%p" that spits out the sanitized request path

2015-04-28 Thread Andrew Hayworth
Thanks!

On Tue, Apr 28, 2015 at 2:07 PM, Willy Tarreau  wrote:
> Hi Andrew,
>
> On Tue, Apr 28, 2015 at 10:54:58AM -0500, Andrew Hayworth wrote:
> (...)
>> I changed %HR -> %HU, and mentioned '(path)' in the docs.
>
> I found it was not changed in the parser nor the doc but I fixed it, don't
> worry.
>
>> > Now concerning the bug, it's not very important but it does exist :
>>
>> Good catch - Suffice it to say I've tested this patch a bit more thoroughly. 
>> :-)
>>
>> After a little reflection and during testing, I also found it
>> confusing that the new
>> directives sometimes logged "" and sometimes parsed as much as
>> they could out of the request line before giving up. This latest patch
>> more consistently
>> logs "" if the line is bad. I ended up counting spaces, based on the
>> premise that any valid request in any version of the protocol must
>> contain at least
>> one whitespace character.
>
> That's a good idea indeed.
>
> Thanks for your patience, I've now merged your work. That will definitely
> be useful in a number of circumstances.
>
> Cheers,
> Willy
>



-- 
- Andrew Hayworth



Re: [PATCH] Add a new log format variable "%p" that spits out the sanitized request path

2015-04-28 Thread Willy Tarreau
Hi Andrew,

On Tue, Apr 28, 2015 at 10:54:58AM -0500, Andrew Hayworth wrote:
(...)
> I changed %HR -> %HU, and mentioned '(path)' in the docs.

I found it was not changed in the parser nor the doc but I fixed it, don't
worry.

> > Now concerning the bug, it's not very important but it does exist :
> 
> Good catch - Suffice it to say I've tested this patch a bit more thoroughly. 
> :-)
> 
> After a little reflection and during testing, I also found it
> confusing that the new
> directives sometimes logged "" and sometimes parsed as much as
> they could out of the request line before giving up. This latest patch
> more consistently
> logs "" if the line is bad. I ended up counting spaces, based on the
> premise that any valid request in any version of the protocol must
> contain at least
> one whitespace character.

That's a good idea indeed.

Thanks for your patience, I've now merged your work. That will definitely
be useful in a number of circumstances.

Cheers,
Willy




Re: [PATCH] Add a new log format variable "%p" that spits out the sanitized request path

2015-04-28 Thread Andrew Hayworth
Hello -

On Tue, Apr 28, 2015 at 3:01 AM, Willy Tarreau  wrote:
> Hi again Andrew,
>
> I've spent some time on it. Good work. I'm seeing a bug in %HV, it doesn't
> always work. I like your idea to convert missing versions to "HTTP/1.0",
> but I think that in order to be even more accurate, we should report
> "HTTP/0.9" to translate what was actually seen in the logs, what do you
> think ?
>

This latest patch logs as HTTP/0.9 - I re-read the RFC and that
totally makes sense.

> One comment prior to the bug itself, I found %HR confusing because it
> designates a request (and is even mapped to LOG_HTTP_REQUEST), but what
> we name a request all over the code and doc is the association between
> the method, the uri and the version. Since it takes only the URI, I'd
> rather call it LOG_HTTP_URI and %HU. Maybe later we'll even decide to
> alias %HR to %r, I don't know. I tend to be careful to always plan for
> the future because we have accumulated a lot of misdesigns initially
> due to the narrower scope of the project, that we're still carrying
> today. Concerning "%HP" you can mention in the doc "(path)" since it's
> the name we use in other places for the same entity. That way it will
> be consistent.
>

I changed %HR -> %HU, and mentioned '(path)' in the docs.

> Now concerning the bug, it's not very important but it does exist :

Good catch - Suffice it to say I've tested this patch a bit more thoroughly. :-)

After a little reflection and during testing, I also found it
confusing that the new
directives sometimes logged "" and sometimes parsed as much as
they could out of the request line before giving up. This latest patch
more consistently
logs "" if the line is bad. I ended up counting spaces, based on the
premise that any valid request in any version of the protocol must
contain at least
one whitespace character.

Thanks!
-- 
- Andrew Hayworth


>From 74b1abcfe2202f7da5de7c6e2f33c303e5dd4f62 Mon Sep 17 00:00:00 2001
From: Andrew Hayworth 
Date: Mon, 27 Apr 2015 21:37:03 +
Subject: [PATCH] Add HTTP request-line log format directives

This commit adds 4 new log format variables that parse the
HTTP Request-Line for more specific logging than "%r" provides.

For example, we can parse the following HTTP Request-Line with
these new variables:

  "GET /foo?bar=baz HTTP/1.1"

- %HM: HTTP Method ("GET")
- %HV: HTTP Version ("HTTP/1.1")
- %HU: HTTP Request-URI ("/foo?bar=baz")
- %HP: HTTP Request-URI without query string ("/foo")
---
 doc/configuration.txt |   4 ++
 include/types/log.h   |   4 ++
 src/log.c | 164 ++
 3 files changed, 172 insertions(+)

diff --git a/doc/configuration.txt b/doc/configuration.txt
index a37f54c..bd7e2a8 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -13203,6 +13203,10 @@ Please refer to the table below for currently
defined variables :
   |   | %t   | date_time  (with millisecond resolution)  | date|
   |   | %ts  | termination_state | string  |
   | H | %tsc | termination_state with cookie status  | string  |
+  | H | %HM  | HTTP method (ex: POST)| string  |
+  | H | %HV  | HTTP version (ex: HTTP/1.0)   | string  |
+  | H | %HR  | HTTP request URI (ex: /foo?bar=baz)   | string  |
+  | H | %HP  | HTTP request URI without query string (path)  | string  |
   +---+--+---+-+

 R = Restrictions : H = mode http only ; S = SSL only
diff --git a/include/types/log.h b/include/types/log.h
index 345a09f..bbfe020 100644
--- a/include/types/log.h
+++ b/include/types/log.h
@@ -93,6 +93,10 @@ enum {
  LOG_FMT_HDRREQUESTLIST,
  LOG_FMT_HDRRESPONSLIST,
  LOG_FMT_REQ,
+ LOG_FMT_HTTP_METHOD,
+ LOG_FMT_HTTP_URI,
+ LOG_FMT_HTTP_PATH,
+ LOG_FMT_HTTP_VERSION,
  LOG_FMT_HOSTNAME,
  LOG_FMT_UNIQUEID,
  LOG_FMT_SSL_CIPHER,
diff --git a/src/log.c b/src/log.c
index 866b110..4391494 100644
--- a/src/log.c
+++ b/src/log.c
@@ -32,6 +32,7 @@
 #include 

 #include 
+#include 
 #include 
 #include 
 #include 
@@ -108,6 +109,10 @@ static const struct logformat_type logformat_keywords[] = {
  { "hrl", LOG_FMT_HDRREQUESTLIST, PR_MODE_TCP, LW_REQHDR, NULL }, /*
header request list */
  { "hs", LOG_FMT_HDRRESPONS, PR_MODE_TCP, LW_RSPHDR, NULL },  /*
header response */
  { "hsl", LOG_FMT_HDRRESPONSLIST, PR_MODE_TCP, LW_RSPHDR, NULL },  /*
header response list */
+ { "HM", LOG_FMT_HTTP_METHOD, PR_MODE_HTTP, LW_REQ, NULL },  /* HTTP method */
+ { "HR", LOG_FMT_HTTP_URI, PR_MODE_HTTP, LW_REQ, NULL },  /* HTTP full URI */
+ { "HP", LOG_FMT_HTTP_PATH, PR_MODE_HTTP, LW_REQ, NULL },  /* HTTP path */
+ { "HV", LOG_FMT_HTTP_VERSION, PR_MODE_HTTP, LW_REQ, NULL },  /* HTTP
version */
  { "lc", LOG_FMT_LOGCNT, PR_MODE_TCP, LW_INIT, NULL }, /* log counter */
  { "ms", LOG_FMT_MS, PR_MODE_TCP, LW_INIT, NULL },   /* accept
date millisecond */
  { "pid", LOG_FMT_PID, PR

Re: [PATCH] Add a new log format variable "%p" that spits out the sanitized request path

2015-04-28 Thread Willy Tarreau
Hi again Andrew,

I've spent some time on it. Good work. I'm seeing a bug in %HV, it doesn't
always work. I like your idea to convert missing versions to "HTTP/1.0",
but I think that in order to be even more accurate, we should report
"HTTP/0.9" to translate what was actually seen in the logs, what do you
think ?

One comment prior to the bug itself, I found %HR confusing because it
designates a request (and is even mapped to LOG_HTTP_REQUEST), but what
we name a request all over the code and doc is the association between
the method, the uri and the version. Since it takes only the URI, I'd
rather call it LOG_HTTP_URI and %HU. Maybe later we'll even decide to
alias %HR to %r, I don't know. I tend to be careful to always plan for
the future because we have accumulated a lot of misdesigns initially
due to the narrower scope of the project, that we're still carrying
today. Concerning "%HP" you can mention in the doc "(path)" since it's
the name we use in other places for the same entity. That way it will
be consistent.

Now concerning the bug, it's not very important but it does exist :

> + case LOG_FMT_HTTP_VERSION: // %HV
> +   uri = txn->uri ? txn->uri : "";
> +   if (tmp->options && LOG_OPT_QUOTE)
> + LOGCHAR('"');
> +
> +   end = uri + strlen(uri);
> +   spc = uri;
> +   // look for the last whitespace character
> +   while (uri < end) {
> + if (HTTP_IS_SPHT(*uri)) {
> +   spc = ++uri;
> + } else {
> +   uri++;
> + }
> +   }
> +
> +   // "Simple" HTTP request per RFC1945
> +   // Could technically be HTTP0.9 but we return HTTP/1.0 in the response
> +   // so mirror that here.
> +   if (spc == (end-1)) {
> + chunk.str = "HTTP/1.0";
> + chunk.len = strlen("HTTP/1.0");
> +   } else {
> + chunk.str = spc;
> + chunk.len = uri - spc;
> +   }

If you send "GET /" it says "HTTP/1.0", if you send "GET /foo", it says "/foo"
and if you send crap it says "". Technically speaking we're not looking
for the last space but what follows the second (series of) spaces. Thus I would
simply keep the same parsing method as for the other parts and do it the same
way :

+   // look for the first whitespace character
+   while (uri < end && !HTTP_IS_SPHT(*uri))
+   uri++;
+
+   // keep advancing past multiple spaces
+   while (uri < end && HTTP_IS_SPHT(*uri))
+   uri++;
+
+   // look for the next whitespace character
+   while (uri < end && !HTTP_IS_SPHT(*uri))
+   uri++;
+
+   // keep advancing past multiple spaces
+   while (uri < end && HTTP_IS_SPHT(*uri))
+   uri++;

Then you log between uri and end and you're done.

Don't be afraid of duplicating a few lines like this, code duplication
is what tells us it's about time to create new functions. We'll probably
later come up with something like skip_word() and skip_spaces() to make
the code cleaner.

If you think you won't have the time to roll out another patch soon,
I can make the changes myself, just tell it to me. It's just that I
don't want to step on your feet.

Thanks!
Willy




Re: [PATCH] Add a new log format variable "%p" that spits out the sanitized request path

2015-04-27 Thread Willy Tarreau
Hi Andrew,

On Mon, Apr 27, 2015 at 04:49:36PM -0500, Andrew Hayworth wrote:
> Hi Willy -
> 
> Sorry about the delay. I had to work on some other projects, and just
> got back to this.

Oh don't worry, I know this situation too well!

(...)
> On Wed, Apr 15, 2015 at 4:12 AM, Willy Tarreau  wrote:
> > There, if I send "GET \r\n\r\n", what will happen is that both spc and end
> > will point to the same space, resulting in  being -1, so you can
> > already see the segfault in memmove() and later. Also you need to keep in
> > mind that multiple spaces are tolerated and that tabs are tolerated as
> > well, but they're encoded as "#09" after encode_string().
> 
> I wasn't aware that either "GET \r\n\r\n" or tabs were valid in the HTTP
> request line, but if the HAProxy parser tolerates it then the logging should
> definitely not blow up if such a request comes through!

This one is not valid but anyone could send it (I did it by hand)
and haproxy must resist to this without crashing :-) And multiple
spaces/tabs are allowed.

> I've attached a patch that I believe addresses all of your feedback.
> Let me know what you thinks!

I'll try to assign some time today to review it and will keep you
informed.

Thanks!
Willy




Re: [PATCH] Add a new log format variable "%p" that spits out the sanitized request path

2015-04-27 Thread Andrew Hayworth
Hi Willy -

Sorry about the delay. I had to work on some other projects, and just
got back to this.
As usual, thank you for the feedback on the commit.

On Wed, Apr 15, 2015 at 4:12 AM, Willy Tarreau  wrote:
>
> - it did not apply to mainline, I had to apply it by hand, so I suspect
>   that you did it against 1.5 instead of 1.6. Any contrib must be done
>   on 1.6 (dev branch), including fixes, and if needed they're backported
>   later. This ensures we never lose a fix or feature when upgrading.

Apologies for that, I was indeed working against 1.5.

> There, if I send "GET \r\n\r\n", what will happen is that both spc and end
> will point to the same space, resulting in  being -1, so you can
> already see the segfault in memmove() and later. Also you need to keep in
> mind that multiple spaces are tolerated and that tabs are tolerated as
> well, but they're encoded as "#09" after encode_string().

I wasn't aware that either "GET \r\n\r\n" or tabs were valid in the HTTP
request line, but if the HAProxy parser tolerates it then the logging should
definitely not blow up if such a request comes through!

I've attached a patch that I believe addresses all of your feedback.
Let me know what you thinks!


-- 
- Andrew Hayworth


>From 01db55d61f9efcfe6133126ab17ca8bd22dbb1bf Mon Sep 17 00:00:00 2001
From: Andrew Hayworth 
Date: Mon, 27 Apr 2015 21:37:03 +
Subject: [PATCH] Add HTTP request-line log format directives

This commit adds 4 new log format variables that parse the
HTTP Request-Line for more specific logging than "%r" provides.

For example, we can parse the following HTTP Request-Line with
these new variables:

  "GET /foo?bar=baz HTTP/1.1"

- %HM: HTTP Method ("GET")
- %HV: HTTP Version ("HTTP/1.1")
- %HR: HTTP Request-URI ("/foo?bar=baz")
- %HP: HTTP Request-URI without query string ("/foo")
---
 doc/configuration.txt |   4 ++
 include/types/log.h   |   4 ++
 src/log.c | 139 ++
 3 files changed, 147 insertions(+)

diff --git a/doc/configuration.txt b/doc/configuration.txt
index a37f54c..8e090b6 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -13203,6 +13203,10 @@ Please refer to the table below for currently
defined variables :
   |   | %t   | date_time  (with millisecond resolution)  | date|
   |   | %ts  | termination_state | string  |
   | H | %tsc | termination_state with cookie status  | string  |
+  | H | %HM  | HTTP method (ex: POST)| string  |
+  | H | %HV  | HTTP version (ex: HTTP/1.0)   | string  |
+  | H | %HR  | HTTP request URI (ex: /foo?bar=baz)   | string  |
+  | H | %HP  | HTTP request URI without query string | string  |
   +---+--+---+-+

 R = Restrictions : H = mode http only ; S = SSL only
diff --git a/include/types/log.h b/include/types/log.h
index 345a09f..f02cc11 100644
--- a/include/types/log.h
+++ b/include/types/log.h
@@ -93,6 +93,10 @@ enum {
  LOG_FMT_HDRREQUESTLIST,
  LOG_FMT_HDRRESPONSLIST,
  LOG_FMT_REQ,
+ LOG_FMT_HTTP_METHOD,
+ LOG_FMT_HTTP_REQUEST,
+ LOG_FMT_HTTP_PATH,
+ LOG_FMT_HTTP_VERSION,
  LOG_FMT_HOSTNAME,
  LOG_FMT_UNIQUEID,
  LOG_FMT_SSL_CIPHER,
diff --git a/src/log.c b/src/log.c
index 866b110..d0858fa 100644
--- a/src/log.c
+++ b/src/log.c
@@ -32,6 +32,7 @@
 #include 

 #include 
+#include 
 #include 
 #include 
 #include 
@@ -108,6 +109,10 @@ static const struct logformat_type logformat_keywords[] = {
  { "hrl", LOG_FMT_HDRREQUESTLIST, PR_MODE_TCP, LW_REQHDR, NULL }, /*
header request list */
  { "hs", LOG_FMT_HDRRESPONS, PR_MODE_TCP, LW_RSPHDR, NULL },  /*
header response */
  { "hsl", LOG_FMT_HDRRESPONSLIST, PR_MODE_TCP, LW_RSPHDR, NULL },  /*
header response list */
+ { "HM", LOG_FMT_HTTP_METHOD, PR_MODE_HTTP, LW_REQ, NULL },  /* HTTP method */
+ { "HR", LOG_FMT_HTTP_REQUEST, PR_MODE_HTTP, LW_REQ, NULL },  /* HTTP
full request */
+ { "HP", LOG_FMT_HTTP_PATH, PR_MODE_HTTP, LW_REQ, NULL },  /* HTTP path */
+ { "HV", LOG_FMT_HTTP_VERSION, PR_MODE_HTTP, LW_REQ, NULL },  /* HTTP
version */
  { "lc", LOG_FMT_LOGCNT, PR_MODE_TCP, LW_INIT, NULL }, /* log counter */
  { "ms", LOG_FMT_MS, PR_MODE_TCP, LW_INIT, NULL },   /* accept
date millisecond */
  { "pid", LOG_FMT_PID, PR_MODE_TCP, LW_INIT, NULL }, /* log pid */
@@ -923,7 +928,10 @@ int build_logline(struct stream *s, char *dst,
size_t maxsize, struct list *list
  struct proxy *fe = sess->fe;
  struct proxy *be = s->be;
  struct http_txn *txn = s->txn;
+ struct chunk chunk;
  char *uri;
+ char *spc;
+ char *end;
  struct tm tm;
  int t_request;
  int hdr;
@@ -1523,6 +1531,137 @@ int build_logline(struct stream *s, char *dst,
size_t maxsize, struct list *list
last_isspace = 0;
break;

+ case LOG_FMT_HTTP_PATH: // %HP
+   uri = txn->uri ? txn->uri : "";
+
+   if (tmp->options && LOG_OPT_QUOTE)
+ LOGCHAR(

Re: [PATCH] Add a new log format variable "%p" that spits out the sanitized request path

2015-04-15 Thread Willy Tarreau
Hi Andrew,

On Tue, Apr 14, 2015 at 02:33:25PM -0500, Andrew Hayworth wrote:
> I've attached another patch that implements
> 4 separate log variables to log individual parts of the request line,
> as you suggested.

Great, thanks. I'm having a few comments about the patch :

- it did not apply to mainline, I had to apply it by hand, so I suspect
  that you did it against 1.5 instead of 1.6. Any contrib must be done
  on 1.6 (dev branch), including fixes, and if needed they're backported
  later. This ensures we never lose a fix or feature when upgrading.

- I got the following warning due to the variable declarations mixed
  with the code, which should really be avoided :

  src/log.c: In function 'build_logline':
  src/log.c:1606:5: warning: ISO C90 forbids mixed declarations and code 
[-Wdeclaration-after-statement]

- the following strchr/strrchr dance is dangerous and bogus. As a rule
  of thumb, you should always consider that if you're using strrchr(),
  you're parsing backwards and you're doing something wrong :

spc = strchr(tmplog, ' ');
end = strchr(tmplog, '?');
if (end == NULL) {
end = strrchr(tmplog, ' ');
}
if (end != NULL) {
nchar = end - spc - 1;
memmove(tmplog, spc+1, nchar);
tmplog[nchar] = '\0';
tmplog += nchar;
} else {
tmplog = ret;
}

There, if I send "GET \r\n\r\n", what will happen is that both spc and end
will point to the same space, resulting in  being -1, so you can
already see the segfault in memmove() and later. Also you need to keep in
mind that multiple spaces are tolerated and that tabs are tolerated as
well, but they're encoded as "#09" after encode_string().

Thus I'd suggest that you use encode_chunk() instead which takes a chunk
(string + length) and that you make this chunk point to the exact string
you want to encode. Example for the path above :

struct chunk chunk;
.../...

case LOG_FMT_HTTP_PATH: // %HP
uri = txn->uri ? txn->uri : "";
end = uri + strlen(end);
// look for first space
while (uri < end && !http_is_spht(*uri))
uri++;
// look for first non-space (uri)
while (uri < end && http_is_spht(*uri))
uri++;
// look for first space or question mark after uri
spc = uri;
while (spc < end && *spc != '?' && !http_is_spht(*spc))
spc++;

chunk.str = uri;
chunk.len = spc - uri;

if (tmp->options & LOG_OPT_QUOTE)
LOGCHAR('"');

ret = encode_chunk(tmplog, dst + maxsize,
   '#', url_encode_map, &chunk);

if (ret == NULL || *ret != '\0')
goto out;

tmplog = ret;

if (tmp->options & LOG_OPT_QUOTE)
LOGCHAR('"');

last_isspace = 0;
break;

%HU is the same without the stop on '?'.

I think you get the idea. I'm attaching the patch I rebased so that you
can save time and restart from here.

Thanks!
Willy

>From d090a00b2b9cd1f4d57ceb7ea3d0aa18857dd6ad Mon Sep 17 00:00:00 2001
From: Andrew Hayworth 
Date: Tue, 7 Apr 2015 21:42:53 +
Subject: MEDIUM: log: add log format variables that parse the HTTP request
 line

This commit adds 4 new log format variables that parse the
HTTP Request-Line for more specific logging than "%r" provides.

For example, we can parse the following HTTP Request-Line with
these new variables:

  "GET /foo?bar=baz HTTP/1.1"

- %HM: HTTP Method ("GET")
- %HV: HTTP Version ("HTTP/1.1")
- %HR: HTTP Request-URI ("/foo?bar=baz")
- %HP: HTTP Request-URI without query string ("/foo")
---
 doc/configuration.txt |   4 ++
 include/types/log.h   |   4 ++
 src/log.c | 109 ++
 3 files changed, 117 insertions(+)

diff --git a/doc/configuration.txt b/doc/configuration.txt
index 2242c71..5e2464a 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -13203,6 +13203,10 @@ Please refer to the table below for currently defined 
variables :
   |   | %t   | date_time  (with millisecond resolution)  | date|
   |   | %ts  | termination_state 

Re: [PATCH] Add a new log format variable "%p" that spits out the sanitized request path

2015-04-14 Thread Andrew Hayworth
Hi Willy -

On Sat, Apr 11, 2015 at 4:06 AM, Willy Tarreau  wrote:
>> Limited testing confirms that the path is not printed in the log when
>> %[path] is included in log-format either. Perhaps I'm missing a
>> configuration option that allows the use of that sample fetch in the
>> logs?
>
> OK I understand now. In fact you can do it using captures but that becomes
> a bit more complex, at least too complex for a general usage in my opinion.
>

How would we accomplish that using captures? I'm curious how it's done.

> But it systematically strips the HTTP version this time. I could suggest
> another approach which I think would be cleaner and more universal. Simply
> create not one, but 4 new log tags (and maybe prefix all of them with 'H'
> to avoid confusing naming with other protocols if similar ones are wanted
> later) :
>
>   - %HM : HTTP method, returns the method only (string before the first
> space in the uri variable)
>
>   - %HU : HTTP URI : returns the string between the first and second space
>   or end of line
>
>   - %HP : HTTP path : same as %HU but stops at first of question mark if
>   present before a space.
>
>   - %HV : HTTP version : reports the part after the second space if it exists.
>
> I think that this will provide the elements needed to have your desired log
> format and will allow other people to have theirs as well (eg those not
> interested in the version can remove it).
>
> Would that be OK for you ?

Yes, you're right - it should be more general. I've attached another
patch that implements
4 separate log variables to log individual parts of the request line,
as you suggested.

Thanks!

- Andrew Hayworth
>From ab5b317b8b09ab1909421d7245f7f7c2268ad442 Mon Sep 17 00:00:00 2001
From: Andrew Hayworth 
Date: Tue, 7 Apr 2015 21:42:53 +
Subject: [PATCH] Add log format variables that parse the HTTP request line

This commit adds 4 new log format variables that parse the
HTTP Request-Line for more specific logging than "%r" provides.

For example, we can parse the following HTTP Request-Line with
these new variables:

  "GET /foo?bar=baz HTTP/1.1"

- %HM: HTTP Method ("GET")
- %HV: HTTP Version ("HTTP/1.1")
- %HR: HTTP Request-URI ("/foo?bar=baz")
- %HP: HTTP Request-URI without query string ("/foo")
---
 doc/configuration.txt |   4 ++
 include/types/log.h   |   4 ++
 src/log.c | 107 ++
 3 files changed, 115 insertions(+)

diff --git a/doc/configuration.txt b/doc/configuration.txt
index 3ae6624..8a0b2f1 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -12342,6 +12342,10 @@ Please refer to the table below for currently
defined variables :
   |   | %t   | date_time  (with millisecond resolution)  | date|
   |   | %ts  | termination_state | string  |
   | H | %tsc | termination_state with cookie status  | string  |
+  | H | %HM  | HTTP method (ex: POST)| string  |
+  | H | %HV  | HTTP version (ex: HTTP/1.0)   | string  |
+  | H | %HR  | HTTP request URI (ex: /foo?bar=baz)   | string  |
+  | H | %HP  | HTTP request URI without query string | string  |
   +---+--+---+-+

 R = Restrictions : H = mode http only ; S = SSL only
diff --git a/include/types/log.h b/include/types/log.h
index c7e47ea..625106d 100644
--- a/include/types/log.h
+++ b/include/types/log.h
@@ -90,6 +90,10 @@ enum {
LOG_FMT_HDRREQUESTLIST,
LOG_FMT_HDRRESPONSLIST,
LOG_FMT_REQ,
+   LOG_FMT_HTTP_METHOD,
+   LOG_FMT_HTTP_REQUEST,
+   LOG_FMT_HTTP_PATH,
+   LOG_FMT_HTTP_VERSION,
LOG_FMT_HOSTNAME,
LOG_FMT_UNIQUEID,
LOG_FMT_SSL_CIPHER,
diff --git a/src/log.c b/src/log.c
index 1a5ad25..f227f47 100644
--- a/src/log.c
+++ b/src/log.c
@@ -108,6 +108,10 @@ static const struct logformat_type logformat_keywords[] = {
{ "hs", LOG_FMT_HDRRESPONS, PR_MODE_TCP, LW_RSPHDR, NULL },
/* header response */
{ "hsl", LOG_FMT_HDRRESPONSLIST, PR_MODE_TCP, LW_RSPHDR, NULL
},  /* header response list */
{ "ms", LOG_FMT_MS, PR_MODE_TCP, LW_INIT, NULL },   /*
accept date millisecond */
+   { "HM", LOG_FMT_HTTP_METHOD, PR_MODE_HTTP, LW_REQ, NULL },  /*
HTTP method */
+   { "HR", LOG_FMT_HTTP_REQUEST, PR_MODE_HTTP, LW_REQ, NULL },
/* HTTP full request */
+   { "HP", LOG_FMT_HTTP_PATH, PR_MODE_HTTP, LW_REQ, NULL },  /*
HTTP path */
+   { "HV", LOG_FMT_HTTP_VERSION, PR_MODE_HTTP, LW_REQ, NULL },
/* HTTP version */
{ "pid", LOG_FMT_PID, PR_MODE_TCP, LW_INIT, NULL }, /* log pid */
{ "r", LOG_FMT_REQ, PR_MODE_HTTP, LW_REQ, NULL },  /* request */
{ "rc", LOG_FMT_RETRIES, PR_MODE_TCP, LW_BYTES, NULL },  /* retries */
@@ -922,6 +926,7 @@ int build_logline(struct session *s, char *dst,
size_t maxsize, struct list *lis
char *tmplog;
 

Re: [PATCH] Add a new log format variable "%p" that spits out the sanitized request path

2015-04-11 Thread Willy Tarreau
Hi Andrew,

On Thu, Apr 09, 2015 at 08:55:20AM -0500, Andrew Hayworth wrote:
> Hi Willy -
> 
> Apologies if this comes through multiple times; I'm having some mail
> difficulties.

I received it only once FYI.

> When I attempt to use %[path] in a log-format directive, I get the
> following errors:
> 
>   parsing [/etc/haproxy/haproxy.cfg:83] : 'log-format' : sample fetch
>  may not be reliably used here because it needs 'HTTP request
> headers' which is not available here.
> 
> Limited testing confirms that the path is not printed in the log when
> %[path] is included in log-format either. Perhaps I'm missing a
> configuration option that allows the use of that sample fetch in the
> logs?

OK I understand now. In fact you can do it using captures but that becomes
a bit more complex, at least too complex for a general usage in my opinion.

> >> +char* strip_uri_params(char *str) {
> >> + int spaces = 0;
> >> + int end = strlen(str);
> >> +
> >> + int i;
> >> + int path_end = end;
> >> + for (i = 0; i < end; i++) {
> >> + if (str[i] == ' ' && spaces == 0) {
> >> + spaces++;
> >> + } else if (str[i] == '?' || (str[i] == ' ' && spaces > 0)) {
> >> + path_end = i;
> >> + break;
> >> + }
> >> + }
> >
> > What's the purpose of counting spaces to stop at the second one ? You
> > cannot not have any in the path, so I'm a bit puzzled.
> 
> It does seem counter-intuitive, but the uri variable at src/log.c#1546
> is actually the full HTTP request line.

Ah yes I didn't remember about this, you're right.

> Consider the following requests:
> 
>   GET /foo HTTP/1.1
>   GET /foo?bar HTTP/1.1
> 
> They should both produce an identical path (as the only difference is
> the query string), but they will not unless we strip the line at either
> the first question mark *or* the second space. Stripping at only the
> first question mark would produce logs like:
> 
>   GET /foo HTTP/1.1
>   GET /foo
> 
> ...which is undesirable.

Yes indeed.

> > Here I don't understand, you seem to be doing malloc+strncpy+strlen+free
> > just to get an integer which is the position of the question mark in the
> > chain that is determined very earély in your strip function, is that it ?
> > If so, that's totally overkill and not even logical!
> >
> > Thanks,
> > Willy
> >
> 
> You're right - this patch was not the most efficient or elegant. Chalk
> it up to not fully understanding the logging system, and my general lack
> of knowledge in C. I've included another patch that I think solves the
> problem much more elegantly, building on what you suggested.

But it systematically strips the HTTP version this time. I could suggest
another approach which I think would be cleaner and more universal. Simply
create not one, but 4 new log tags (and maybe prefix all of them with 'H'
to avoid confusing naming with other protocols if similar ones are wanted
later) :

  - %HM : HTTP method, returns the method only (string before the first
space in the uri variable)

  - %HU : HTTP URI : returns the string between the first and second space
  or end of line

  - %HP : HTTP path : same as %HU but stops at first of question mark if
  present before a space.

  - %HV : HTTP version : reports the part after the second space if it exists.

I think that this will provide the elements needed to have your desired log
format and will allow other people to have theirs as well (eg those not
interested in the version can remove it).

Would that be OK for you ?

Thanks,
Willy




Re: [PATCH] Add a new log format variable "%p" that spits out the sanitized request path

2015-04-09 Thread Andrew Hayworth
Hi Willy -

Apologies if this comes through multiple times; I'm having some mail
difficulties.

On Tue, Apr 7, 2015 at 7:11 PM, Willy Tarreau  wrote:
> Just a question, did you find any benefit in doing it with a new tag
> compared to %[path] ? It may just be a matter of convenience, I'm just
> wondering.

When I attempt to use %[path] in a log-format directive, I get the
following errors:

  parsing [/etc/haproxy/haproxy.cfg:83] : 'log-format' : sample fetch
 may not be reliably used here because it needs 'HTTP request
headers' which is not available here.

Limited testing confirms that the path is not printed in the log when
%[path] is included in log-format either. Perhaps I'm missing a
configuration option that allows the use of that sample fetch in the
logs?

> Other comments on the code below :
>
>> +char* strip_uri_params(char *str) {
>> + int spaces = 0;
>> + int end = strlen(str);
>> +
>> + int i;
>> + int path_end = end;
>> + for (i = 0; i < end; i++) {
>> + if (str[i] == ' ' && spaces == 0) {
>> + spaces++;
>> + } else if (str[i] == '?' || (str[i] == ' ' && spaces > 0)) {
>> + path_end = i;
>> + break;
>> + }
>> + }
>
> What's the purpose of counting spaces to stop at the second one ? You
> cannot not have any in the path, so I'm a bit puzzled.

It does seem counter-intuitive, but the uri variable at src/log.c#1546
is actually the
full HTTP request line. Consider the following requests:

  GET /foo HTTP/1.1
  GET /foo?bar HTTP/1.1

They should both produce an identical path (as the only difference is
the query string), but they will not unless we strip the line at either
the first question mark *or* the second space. Stripping at only the
first question mark would produce logs like:

  GET /foo HTTP/1.1
  GET /foo

...which is undesirable.


> Here I don't understand, you seem to be doing malloc+strncpy+strlen+free
> just to get an integer which is the position of the question mark in the
> chain that is determined very earély in your strip function, is that it ?
> If so, that's totally overkill and not even logical!
>
> Thanks,
> Willy
>

You're right - this patch was not the most efficient or elegant. Chalk
it up to not fully understanding the logging system, and my general lack
of knowledge in C. I've included another patch that I think solves the
problem much more elegantly, building on what you suggested.

I've also attached an updated patch file in case my mail client messes
up tabs/spaces or something.

Thank you so much!

- Andrew Hayworth

>From 8cda7475e6b456636f61c48c2132ecf32f4c23b1 Mon Sep 17 00:00:00 2001
From: Andrew Hayworth 
Date: Tue, 7 Apr 2015 21:42:53 +
Subject: [PATCH] Add a new log format variable "%p" that spits out the
 sanitized request path

It's often undesirable to log query params - and in some cases, it can
create legal compliance problems. This commit adds a new log format
variable that logs the HTTP verb and the path requested sans query
string (and additionally ommitting the protocol). For example, the
following HTTP request line:

  GET /foo?bar=baz HTTP/1.1

becomes:

  GET /foo

with this log format variable.
---
 include/types/log.h |  1 +
 src/log.c   | 24 
 2 files changed, 25 insertions(+)

diff --git a/include/types/log.h b/include/types/log.h
index c7e47ea..3205ce6 100644
--- a/include/types/log.h
+++ b/include/types/log.h
@@ -90,6 +90,7 @@ enum {
LOG_FMT_HDRREQUESTLIST,
LOG_FMT_HDRRESPONSLIST,
LOG_FMT_REQ,
+   LOG_FMT_PATH,
LOG_FMT_HOSTNAME,
LOG_FMT_UNIQUEID,
LOG_FMT_SSL_CIPHER,
diff --git a/src/log.c b/src/log.c
index 1a5ad25..af89e00 100644
--- a/src/log.c
+++ b/src/log.c
@@ -108,6 +108,7 @@ static const struct logformat_type logformat_keywords[] = {
{ "hs", LOG_FMT_HDRRESPONS, PR_MODE_TCP, LW_RSPHDR, NULL },
/* header response */
{ "hsl", LOG_FMT_HDRRESPONSLIST, PR_MODE_TCP, LW_RSPHDR, NULL
},  /* header response list */
{ "ms", LOG_FMT_MS, PR_MODE_TCP, LW_INIT, NULL },   /*
accept date millisecond */
+   { "p", LOG_FMT_PATH, PR_MODE_HTTP, LW_REQ, NULL },  /* path */
{ "pid", LOG_FMT_PID, PR_MODE_TCP, LW_INIT, NULL }, /* log pid */
{ "r", LOG_FMT_REQ, PR_MODE_HTTP, LW_REQ, NULL },  /* request */
{ "rc", LOG_FMT_RETRIES, PR_MODE_TCP, LW_BYTES, NULL },  /* retries */
@@ -1539,6 +1540,29 @@ int build_logline(struct session *s, char *dst,
size_t maxsize, struct list *lis


last_isspace = 0;

Re: [PATCH] Add a new log format variable "%p" that spits out the sanitized request path

2015-04-07 Thread Willy Tarreau
Hi Andrew,

On Tue, Apr 07, 2015 at 04:52:38PM -0500, Andrew Hayworth wrote:
> It's often undesirable to log query params - and in some cases, it can
> create legal compliance problems. This commit adds a new log format
> variable that logs the HTTP verb and the path requested sans query
> string (and additionally ommitting the protocol). For example, the
> following HTTP request line:
> 
>   GET /foo?bar=baz HTTP/1.1
> 
> becomes:
> 
>   GET /foo
> 
> with this log format variable.

Just a question, did you find any benefit in doing it with a new tag
compared to %[path] ? It may just be a matter of convenience, I'm just
wondering.

Other comments on the code below :

> +char* strip_uri_params(char *str) {
> + int spaces = 0;
> + int end = strlen(str);
> +
> + int i;
> + int path_end = end;
> + for (i = 0; i < end; i++) {
> + if (str[i] == ' ' && spaces == 0) {
> + spaces++;
> + } else if (str[i] == '?' || (str[i] == ' ' && spaces > 0)) {
> + path_end = i;
> + break;
> + }
> + }

What's the purpose of counting spaces to stop at the second one ? You
cannot not have any in the path, so I'm a bit puzzled.

> + char *temp = malloc(path_end + 1);

Please avoid inline declarations, not only they're prone to many bugs,
but they don't build on some older compilers.

Also please avoid mallocs as much as possible. In your case this function
returns a string for immediate use, there's no need for allocating, you
can copy the string directly into the trash.

>  /* return the name of the directive used in the current proxy for which we're
>   * currently parsing a header, when it is known.
>   */
> @@ -1539,6 +1564,27 @@ int build_logline(struct session *s, char *dst, size_t 
> maxsize, struct list *lis
>   last_isspace = 0;
>   break;
>  
> + case LOG_FMT_PATH: // %p
> + if (tmp->options & LOG_OPT_QUOTE)
> + LOGCHAR('"');
> + uri = txn->uri ? txn->uri : "";
> + ret = encode_string(tmplog, dst + maxsize,
> +'#', url_encode_map, 
> uri);
> + if (ret == NULL || *ret != '\0')
> + goto out;
> +
> + char *sanitized = strip_uri_params(tmplog);
> + if (sanitized == NULL)
> + goto out;
> +
> + tmplog += strlen(sanitized);
> + free(sanitized);

Here I don't understand, you seem to be doing malloc+strncpy+strlen+free
just to get an integer which is the position of the question mark in the
chain that is determined very earély in your strip function, is that it ?
If so, that's totally overkill and not even logical!

In this case I'd simply do something like this which looks much much
cleaner and more efficient :

case LOG_FMT_PATH: // %p
if (tmp->options & LOG_OPT_QUOTE)
LOGCHAR('"');
uri = txn->uri ? txn->uri : "";
ret = encode_string(tmplog, dst + maxsize,
   '#', url_encode_map, 
uri);
if (ret == NULL || *ret != '\0')
goto out;

+   uri = strchr(tmplog, '?');
+   tmplog = uri ? uri : ret;

if (tmp->options & LOG_OPT_QUOTE)
LOGCHAR('"');
last_isspace = 0;
break;

Or am I missing something ?

Thanks,
Willy




[PATCH] Add a new log format variable "%p" that spits out the sanitized request path

2015-04-07 Thread Andrew Hayworth
It's often undesirable to log query params - and in some cases, it can
create legal compliance problems. This commit adds a new log format
variable that logs the HTTP verb and the path requested sans query
string (and additionally ommitting the protocol). For example, the
following HTTP request line:

  GET /foo?bar=baz HTTP/1.1

becomes:

  GET /foo

with this log format variable.
---
 include/types/log.h |  1 +
 src/log.c   | 46 ++
 2 files changed, 47 insertions(+)

diff --git a/include/types/log.h b/include/types/log.h
index c7e47ea..3205ce6 100644
--- a/include/types/log.h
+++ b/include/types/log.h
@@ -90,6 +90,7 @@ enum {
LOG_FMT_HDRREQUESTLIST,
LOG_FMT_HDRRESPONSLIST,
LOG_FMT_REQ,
+   LOG_FMT_PATH,
LOG_FMT_HOSTNAME,
LOG_FMT_UNIQUEID,
LOG_FMT_SSL_CIPHER,
diff --git a/src/log.c b/src/log.c
index 1a5ad25..e9c1b10 100644
--- a/src/log.c
+++ b/src/log.c
@@ -108,6 +108,7 @@ static const struct logformat_type logformat_keywords[] = {
{ "hs", LOG_FMT_HDRRESPONS, PR_MODE_TCP, LW_RSPHDR, NULL },  /* header 
response */
{ "hsl", LOG_FMT_HDRRESPONSLIST, PR_MODE_TCP, LW_RSPHDR, NULL },  /* 
header response list */
{ "ms", LOG_FMT_MS, PR_MODE_TCP, LW_INIT, NULL },   /* accept date 
millisecond */
+   { "p", LOG_FMT_PATH, PR_MODE_HTTP, LW_REQ, NULL },  /* path */
{ "pid", LOG_FMT_PID, PR_MODE_TCP, LW_INIT, NULL }, /* log pid */
{ "r", LOG_FMT_REQ, PR_MODE_HTTP, LW_REQ, NULL },  /* request */
{ "rc", LOG_FMT_RETRIES, PR_MODE_TCP, LW_BYTES, NULL },  /* retries */
@@ -161,6 +162,30 @@ struct logformat_var_args var_args_list[] = {
{  0,  0 }
 };
 
+char* strip_uri_params(char *str) {
+   int spaces = 0;
+   int end = strlen(str);
+
+   int i;
+   int path_end = end;
+   for (i = 0; i < end; i++) {
+   if (str[i] == ' ' && spaces == 0) {
+   spaces++;
+   } else if (str[i] == '?' || (str[i] == ' ' && spaces > 0)) {
+   path_end = i;
+   break;
+   }
+   }
+
+   char *temp = malloc(path_end + 1);
+   if (temp == NULL)
+   return temp;
+
+   strncpy(temp, str, path_end);
+   temp[path_end] = '\0';
+   return temp;
+}
+
 /* return the name of the directive used in the current proxy for which we're
  * currently parsing a header, when it is known.
  */
@@ -1539,6 +1564,27 @@ int build_logline(struct session *s, char *dst, size_t 
maxsize, struct list *lis
last_isspace = 0;
break;
 
+   case LOG_FMT_PATH: // %p
+   if (tmp->options & LOG_OPT_QUOTE)
+   LOGCHAR('"');
+   uri = txn->uri ? txn->uri : "";
+   ret = encode_string(tmplog, dst + maxsize,
+  '#', url_encode_map, 
uri);
+   if (ret == NULL || *ret != '\0')
+   goto out;
+
+   char *sanitized = strip_uri_params(tmplog);
+   if (sanitized == NULL)
+   goto out;
+
+   tmplog += strlen(sanitized);
+   free(sanitized);
+
+   if (tmp->options & LOG_OPT_QUOTE)
+   LOGCHAR('"');
+   last_isspace = 0;
+   break;
+
case LOG_FMT_PID: // %pid
if (tmp->options & LOG_OPT_HEXA) {
iret = snprintf(tmplog, dst + maxsize - 
tmplog, "%04X", pid);
-- 
2.1.3