Re: [squid-users] Force ASCII encoding for access.log fields?

2014-06-27 Thread Mark DeCheser
 [serverIP],[clientIP],
 4012,692,498,GET,200,º^_x°*,username,20/Jun/2014:00:06:36

 The log format you used does not match this log line. The format produces:

 [squid-listening-IP],[clientIP],
 4012,692,498,GET,200,º^_x°*,username,20/Jun/2014:00:06:36

Thanks for the correction.  To expand on that point, on some of our
proxies, we have more than one IP being serviced by a single daemon. 
Recording which IP received the traffic is essential to proper accounting
(e.g. FreeRADIUS).

 URL-encoding is the %xx character encoding. It can be (and is) applied
 to any field that can legitimately contain non-ASCII characters or ASCII
 special characters. The Content-Type header is not one of those places.

 You can use the '#' format modifier to URL-encode that %mt field
 explicitly. Like so:  %#mt

Amos, thank you so much for sharing this.  I plan to try it as soon as ...

 If you will share the exact Squid version you are using, I would also
 like to check the code to see whether the mt code is being set up
 correctly. That log entry looks a bit like random memory being displayed
 as if it were text.

... as soon as I finish upgrading from squid-3.1.10-16.el6 to
3.1.10-20.el6, both of which are packaged and delivered via the CentOS
repo :).  Totally ashamed I didn't even notice there was an update
available before posting.  I plan to schedule an outage to patch and I'll
report back with my findings.  If you suspect random memory chunks are
being written to the file as a consequence of this outdated version of
Squid, and even the more recent version I plan to move to does not address
this condition, feel free to share.

This particular proxy is pretty active.  We're averaging between 800,000 and
1.2M lines in the access log per day.  The proxy is non-caching, running
with 512MB RAM and 1GB swap (don't ask).

More soon,
MD



Re: [squid-users] Force ASCII encoding for access.log fields?

2014-06-26 Thread Amos Jeffries
On 27/06/2014 11:25 a.m., Mark DeCheser wrote:
 Hi everyone --
 
 I recently ran into a strange condition within my Squid access logs which
 is making importing the events into a database a bit more difficult. 
 Note, I am not logging directly to a database, but rather parsing events
 into a centralized database via batch/cron.
 
 Some fields in the access log events, mainly the Content-Type field as far
 as I can see, are being recorded with non-ASCII characters.  When I attempt
 to import the log into PostgreSQL, psql barfs.
 
 Our logfile format in our Squid config looks like this:
 
 logformat my-custom %la,%a,%10tr,%st,%st,%rm,%03Hs,%mt,%[un,%tg
 access_log /var/log/squid/access.log my-custom
 
 Some examples of the events look like this:
 
 [serverIP],[clientIP],
 4012,692,498,GET,200,º^_x°*,username,20/Jun/2014:00:06:36

The log format you used does not match this log line. The format produces:

[squid-listening-IP],[clientIP],
4012,692,498,GET,200,º^_x°*,username,20/Jun/2014:00:06:36

 
 I'm running Squid instances on VPSes in a number of different countries. 
 This particular Squid instance is in Norway, and coincidentally enough
 happens to be the only VPS delivered to my organization that wasn't
 already set to en_US.UTF-8.
 
 # cat /etc/sysconfig/i18n
 LANG=en_US.UTF-8
 SYSFONT=latarcyrheb-sun16
 # echo $LANG
 en_US.UTF-8
 
 It could be a coincidence, but based on the fact that I have instances all
 over the world, and only this instance is giving me trouble ... I found it
 to be an odd coincidence.
 
 Ideally, if it's possible for Squid to force some kind of hex encoding for
 this Content-Type (or really, for any field that receives non-ASCII
 characters), that would be optimal.  There are downstream alternatives,
 which include finding and replacing non-ASCII chars in a preparation script.
 There's also the option to change the charset of the database itself so
 that it doesn't complain about the charset, but these alternatives seem a
 little reactive.
 
 I've reviewed:  http://www.squid-cache.org/Doc/config/logformat/
 I also tried using iconv unsuccessfully: 
 http://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file
 
 It essentially leaves me with offset fields/columns in the logfile.
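
For the preparation-script alternative mentioned above, here is a minimal Python sketch of one way to keep the columns aligned. It is not something Squid provides. It reads each log line as raw bytes and replaces every byte outside printable ASCII (plus '%' itself, so the result stays reversible) with a %XX escape, leaving the commas untouched. The names sanitize_line and sanitize_stream are hypothetical, and the sample bytes in the usage note are only a guess at what the garbled field contained.

```python
import re
from typing import BinaryIO, TextIO

# Match any byte that is not printable ASCII (0x20-0x7E).
# '%' (0x25) is deliberately left out of the safe range so that
# existing percent signs become %25 and the escaping is reversible.
_UNSAFE = re.compile(rb"[^\x20-\x24\x26-\x7e]")


def sanitize_line(line: bytes) -> str:
    """Return the log line as pure ASCII with unsafe bytes written as %XX."""
    body = line.rstrip(b"\r\n")
    encoded = _UNSAFE.sub(lambda m: b"%%%02X" % m.group(0)[0], body)
    return encoded.decode("ascii")


def sanitize_stream(src: BinaryIO, dst: TextIO) -> None:
    """Copy a binary log stream to a text stream, one sanitized line at a time."""
    for raw in src:
        dst.write(sanitize_line(raw) + "\n")
```

Called as sanitize_stream(sys.stdin.buffer, sys.stdout) from a cron job, this would turn a field such as b"\xba\x1fx\xb0*" into %BA%1Fx%B0*. Because commas pass through unchanged, the column count is preserved, but a stray comma inside the garbage itself would still shift the fields.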
 
 I also reviewed Amos' comment here: 
 http://www.squid-cache.org/mail-archive/squid-users/201109/0343.html
 
 The difference in my case is that I'm dealing with Content-Type, not URL. 

URL-encoding is the %xx character encoding. It can be (and is) applied
to any field that can legitimately contain non-ASCII characters or ASCII
special characters. The Content-Type header is not one of those places.

You can use the '#' format modifier to URL-encode that %mt field
explicitly. Like so:  %#mt
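
To illustrate what the %xx encoding does to a value, here is a small Python sketch using the standard library's urllib.parse.quote. It only demonstrates the general encoding. The exact set of characters Squid escapes for %#mt may differ, and the sample bytes are an assumption about what the garbled field held.

```python
from urllib.parse import quote


def percent_encode(raw: bytes) -> str:
    """Percent-encode every byte except ASCII letters, digits, '_.-~' and '/'."""
    return quote(raw, safe="/")


# Hypothetical bytes resembling the garbled Content-Type field in the log.
sample = b"\xba\x1fx\xb0*"
print(percent_encode(sample))        # %BA%1Fx%B0%2A
print(percent_encode(b"text/html"))  # text/html
```

An ordinary Content-Type value passes through unchanged, while the garbage becomes plain ASCII that a database import can accept.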

If you will share the exact Squid version you are using, I would also
like to check the code to see whether the mt code is being set up
correctly. That log entry looks a bit like random memory being displayed
as if it were text.

Amos