On 27/06/2014 11:25 a.m., Mark DeCheser wrote:
Hi everyone --
I recently ran into a strange condition within my Squid access logs which
is making importing the events into a database a bit more difficult.
Note, I am not logging directly to a database, but rather parsing events
into a centralized database via batch/cron.
Some events in the access log, mainly in the Content-Type field as far as
I can see, are being recorded with non-ASCII characters. When I attempt to
import the log into PostgreSQL, psql barfs.
Our logfile format in our Squid config looks like this:
logformat my-custom %la,%a,%10tr,%st,%st,%rm,%03Hs,%mt,%[un,%tg
access_log /var/log/squid/access.log my-custom
Some examples of the events look like this:
[serverIP],[clientIP],
4012,692,498,GET,200,º^_x°*,username,20/Jun/2014:00:06:36
The log format you used does not match this log line. The format produces:
[squid-listening-IP],[clientIP],
4012,692,498,GET,200,º^_x°*,username,20/Jun/2014:00:06:36
I'm running Squid instances on VPSes in a number of different countries.
This particular Squid instance is in Norway, and coincidentally enough
happens to be the only VPS delivered to my organization that wasn't
already set to en_US.UTF-8.
# cat /etc/sysconfig/i18n
LANG=en_US.UTF-8
SYSFONT=latarcyrheb-sun16
# echo $LANG
en_US.UTF-8
It could be a coincidence, but given that I have instances all over the
world and only this instance is giving me trouble, it struck me as odd.
Ideally, if it's possible for Squid to force some kind of hex encoding for
this Content-Type (or really, for any field that receives non-ASCII
characters), that would be optimal. There are downstream alternatives
which include finding / replacing non-ASCII chars in a preparation script.
There's also the option to change the charset of the database itself so
that it doesn't complain about the charset, but these alternatives seem a
little reactive.
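
A minimal sketch of the "preparation script" alternative mentioned above,
assuming the batch job can run Python before the psql import. The function
names and the sample bytes are hypothetical. Instead of dropping bytes (which
is what shifts columns with iconv), it rewrites every byte outside printable
ASCII as %XX, so the comma count per line stays the same. If the garbage
itself happens to contain a comma byte, the columns will still be off.

```python
def escape_non_ascii(line: bytes) -> str:
    """Return the line as ASCII text.

    Bytes outside printable ASCII (0x20-0x7E) become %XX. A literal '%'
    is also written as %25 so the escaping can be reversed later.
    Commas and all other printable bytes are left untouched.
    """
    out = []
    for b in line:
        if 0x20 <= b <= 0x7E and b != 0x25:
            out.append(chr(b))
        else:
            out.append("%{:02X}".format(b))
    return "".join(out)


def clean_log(src, dst):
    """Copy a log from a binary stream to a text stream, escaping each line."""
    for raw in src:
        dst.write(escape_non_ascii(raw.rstrip(b"\r\n")) + "\n")
```

One way to use it from cron would be to call
clean_log(sys.stdin.buffer, sys.stdout) in a small wrapper and pipe
access.log through it before the import step.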
I've reviewed: http://www.squid-cache.org/Doc/config/logformat/
I also tried using iconv unsuccessfully:
http://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file
It essentially leaves me with offset fields/columns in the logfile.
I also reviewed Amos' comment here:
http://www.squid-cache.org/mail-archive/squid-users/201109/0343.html
The difference in my case is that I'm dealing with Content-Type, not URL.
URL-encoding is the %xx character encoding. It can be (and is) applied
to anything which can legitimately contain non-ASCII characters or ASCII
special characters. The Content-Type header is not one of those places.
You can use the '#' format modifier to URL-encode that %mt field
explicitly. Like so: %#mt
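
With %#mt in the logformat line, the field should reach the importer as
plain ASCII with %xx escapes. A rough sketch of how the import side might
handle that, assuming the batch job is written in Python. The function names
are hypothetical, and exactly which characters Squid escapes is not stated
here, so the code only relies on the generic %xx form.

```python
from urllib.parse import unquote_to_bytes


def decode_field(field: str) -> bytes:
    """Turn a %xx-encoded log field back into the raw bytes Squid saw."""
    return unquote_to_bytes(field)


def looks_like_garbage(field: str) -> bool:
    """True if the decoded field contains bytes outside printable ASCII.

    A real Content-Type value such as text/html should never trip this,
    so matching lines are worth keeping aside for a bug report.
    """
    return any(b < 0x20 or b > 0x7E for b in decode_field(field))
```

The encoded text can be stored in PostgreSQL as-is, since it is pure
ASCII. The check is only there to flag the suspicious entries.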
If you share the exact Squid version you are using, I would also like to
check the code to see whether the mt code is being set up correctly. That
log entry looks a bit like random memory being displayed as if it were
text.
Amos