Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Chuck Kollars
> The approach is problematic, especially when 
> using "three-letter word" combinations, which match 
> arbitrary, harmless URLs.

The dreaded "unintended match in the middle of a word"
problem can torpedo almost any approach. Solving it is
not a matter of changing approaches, but rather of
changing tools: the problem can bite virtually any
approach, and conversely it can be "fixed" in
virtually any approach.

What's needed is a way to specify "word boundaries"
during regular expression matching. Unfortunately the
regular expression syntax for word boundaries varies
from tool to tool. Perl and its derivatives let you
specify \b at the beginning and/or end of a word (or
its opposite, \B, for not-a-word-boundary). Classic
`egrep` provides the same functionality but with a
different syntax: \< at the beginning of a word and
\> at the end of a word. GNU egrep, GNU awk, and GNU
Emacs support both syntaxes. Tcl provides word-boundary
functionality with \m, \M, and \y. Both Java and .NET
are Perl-like. The "-F" command-line switch turns GNU
`grep` into its even stupider cousin `fgrep`; neither
lets you specify word boundaries, because in
fixed-string mode there are no regular expressions at all.


GNU grep does, however, let you use Perl-style regular
expressions by specifying "-P" on the command line. And
perhaps most importantly, GNU grep (and GNU egrep,
which is the same program with different switches)
lets you quickly and automatically turn _everything_
in your regular expressions into full words with the
"-w" command-line switch (lots of convenience, not
much control :-).

In summary: if you want to specify word boundaries
inside the regular expressions, use Perl, GNU grep -P,
or some other fairly modern tool. If you want word
boundary functionality withOUT specifying word
boundaries in the regular expressions themselves, use
GNU grep -w. If you have no other choice, you can make
it work with classic egrep by inserting \< and \>
appropriately in your regular expressions. But classic
grep won't do word boundaries no matter what. (You can
sorta fake it, but it's a lot of effort and it doesn't
work in all cases.) Note in particular that the
easy-to-overlook "-w" command-line switch on GNU grep
can make a night-and-day difference.
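To make the difference concrete, here's a quick shell sketch (the log line is invented; "wet" is one of the three-letter words mentioned elsewhere in this thread):

```shell
# A harmless URL containing "wet" in the middle of a word:
line='1213264800.123 GET http://www.wetteronline.de/index.html'

# Plain grep: "wet" matches inside "wetteronline" -- a false positive.
echo "$line" | grep -c 'wet'        # prints 1

# GNU grep -w: the pattern must match a whole word, so no hit.
echo "$line" | grep -cw 'wet'       # prints 0

# The same effect with explicit word boundaries (GNU grep):
echo "$line" | grep -cP '\bwet\b'   # Perl-style \b, needs -P support
echo "$line" | grep -cE '\<wet\>'   # classic egrep-style \< and \>
```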

Please do let this list know your results after a few
months. (It sounds like I'm not the only one who's a
bit skeptical that the "bad words in the URL" approach,
which seemed to work reasonably well a couple of years
ago, will give even ballpark results these days...)

thanks!



-Chuck Kollars


  


Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Michel (M)

Rob Asher wrote in his last message:

> >>> "Michel (M)" <[EMAIL PROTECTED]> 6/12/2008 6:59 AM >>>
>
>> but in the end this entire search might be useless, since there is no
>> guarantee that www.mynewbabyisborn.org is not porn and that www.butt.com
>> is porn. And how do you catch www.m-y.d-i.c-k.a.t.microsoft.com?
>> I abandoned all this keyword-stuff searching a long time ago because
>> even if it worked, the user could still use a fantasy proxy somewhere
>> on port 42779, or a VPN such as Hamachi, and then what do you do?
>>
>> michel
>> ...
>
> I agree too, but until there's a better way, we'll still use the keyword
> searching to find the blatant sites.  In our case, we're blocking egress
> traffic for everything except known services (our own proxies), so anonymous
> proxies and VPNs won't be able to connect UNLESS they can get to them
> through the proxies somehow.  Things like PHProxy and all the anonymizing
> sites make it tougher.  There are ways around anything, I know, but we
> adapt and keep plugging away.
>

sure, if you need it you need it ...
we offer the inverse approach to our customers: we block everything except
the sites the user allows, so the parents decide which sites the kids can
go to and everything else is blocked

michel
...





Tecnologia Internet Matik http://info.matik.com.br
Wireless systems for the broadband ISP
Hosting and personalized email - and of course, in Brazil.




Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Jancs

Quoting Rob Asher <[EMAIL PROTECTED]>:

> blocking egress traffic for everything except known services (our own
> proxies) so anonymous proxies and VPNs won't be able to connect UNLESS
> they can get to them through the proxies somehow.  Things like PHProxy
> and all the anonymizing sites make it tougher.  There are ways around
> anything, I know, but we adapt and keep plugging away.


but there still exists the possibility of connecting to an outside service
listening on, for example, port 80 or 443 (actually very easy to achieve
with only average skills, and working like a charm), and then what?
the only thing that can help in that case is packet analysis (I assume)


J.


This message was sent using IMP, the Internet Messaging Program.




Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Rob Asher


-
Rob Asher
Network Systems Technician
Paragould School District
(870)236-7744 Ext. 169


>>> "Michel (M)" <[EMAIL PROTECTED]> 6/12/2008 6:59 AM >>>


> but in the end this entire search might be useless, since there is no
> guarantee that www.mynewbabyisborn.org is not porn and that www.butt.com is
> porn. And how do you catch www.m-y.d-i.c-k.a.t.microsoft.com?
> I abandoned all this keyword-stuff searching a long time ago because even if
> it worked, the user could still use a fantasy proxy somewhere on port
> 42779, or a VPN such as Hamachi, and then what do you do?
>
> michel
> ...

I agree too, but until there's a better way, we'll still use the keyword 
searching to find the blatant sites.  In our case, we're blocking egress 
traffic for everything except known services (our own proxies), so anonymous 
proxies and VPNs won't be able to connect UNLESS they can get to them 
through the proxies somehow.  Things like PHProxy and all the anonymizing sites 
make it tougher.  There are ways around anything, I know, but we adapt and keep 
plugging away.

Rob




-- 

This message has been scanned for viruses and
dangerous content by the Paragould School District
MailScanner, and is believed to be clean.



Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Ralf Hildebrandt
* Michel (M) <[EMAIL PROTECTED]>:

> but in the end this entire search might be useless, since there is no
> guarantee that www.mynewbabyisborn.org is not porn and that www.butt.com is
> porn. And how do you catch www.m-y.d-i.c-k.a.t.microsoft.com?

Exactly!!

> I abandoned all this keyword-stuff searching a long time ago because even if
> it worked, the user could still use a fantasy proxy somewhere on port
> 42779, or a VPN such as Hamachi, and then what do you do?

Yep.

-- 
Ralf Hildebrandt (i.A. des IT-Zentrums) [EMAIL PROTECTED]
Charite - Universitätsmedizin BerlinTel.  +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-BerlinFax.  +49 (0)30-450 570-962
IT-Zentrum Standort CBF send no mail to [EMAIL PROTECTED]


Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Michel (M)

Ralf Hildebrandt wrote in his last message:
> * Rob Asher <[EMAIL PROTECTED]>:
>> Here's something similar to what you're already doing except comparing
>> to a file of "badwords" to look for in the URLs and then emailing you
>> the results.
>>
>> #!/bin/sh
>> # filter.sh
>> #
>> cd /path/to/filterscript
>> cat /var/log/squid/access.log | grep -if /path/to/filterscript/badwords > hits.out
>
> Useless use of cat:
> grep -if /path/to/filterscript/badwords /var/log/squid/access.log > hits.out
>
>> /path/to/filterscript/wordfilter.gawk hits.out
>>
>> cat /path/to/filterscript/word-report | /bin/mail -s "URL Filter Report" [EMAIL PROTECTED]
>
> Useless use of cat:
> /bin/mail -s "URL Filter Report" [EMAIL PROTECTED] < /path/to/filterscript/word-report
>

well, if you're going to optimize, do it all the way :) - one line only:

grep arg file | $mail_cmd

then, if you awk the log and pipe the buffer into the mail command, you
don't even need to create files and delete them later, so you can have it
all in one line
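As a sketch (not Michel's actual setup: the paths and address are placeholders, and $7/$8 are the URL and user fields of squid's default native log format), the whole report could indeed be a single pipeline with no temporary files:

```shell
#!/bin/sh
# One pass: grep the log, format the hits, pipe straight into mail.
# Nothing to create and nothing to clean up afterwards.
grep -i -f /path/to/badwords /var/log/squid/access.log \
  | awk '{ print $8, "->", $7 }' \
  | /bin/mail -s "URL Filter Report" admin@example.com
```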


but in the end this entire search might be useless, since there is no
guarantee that www.mynewbabyisborn.org is not porn and that www.butt.com is
porn. And how do you catch www.m-y.d-i.c-k.a.t.microsoft.com?
I abandoned all this keyword-stuff searching a long time ago because even if
it worked, the user could still use a fantasy proxy somewhere on port
42779, or a VPN such as Hamachi, and then what do you do?


michel
...









Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Ralf Hildebrandt
* Jason <[EMAIL PROTECTED]>:
> Look at these:
>
> http://www.meadvillelibrary.org/os/osfiltering-ala/smutscript/
> http://www.meadvillelibrary.org/os/osfiltering-ala/
> http://meadvillelibrary.org/os/filtering/filtermaintenance.html
>
> She wrote a script that searches logs for keywords and emails it to her.

I had a look at those.

The approach is problematic, especially when using "three-letter word"
combinations, which match arbitrary, harmless URLs.

Take "wet", which in German is part of the word "Wetter", meaning
"weather" (and yes, my users do care about the weather and whether they
are being rained upon). So these lists need to be adapted heavily.

I'll try to add my size-based approach...

-- 
Ralf Hildebrandt (i.A. des IT-Zentrums) [EMAIL PROTECTED]
Charite - Universitätsmedizin BerlinTel.  +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-BerlinFax.  +49 (0)30-450 570-962
IT-Zentrum Standort CBF send no mail to [EMAIL PROTECTED]


Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Ralf Hildebrandt
* Rob Asher <[EMAIL PROTECTED]>:
> Here's something similar to what you're already doing except comparing to a
> file of "badwords" to look for in the URLs and then emailing you the results.
>
> #!/bin/sh
> # filter.sh
> #
> cd /path/to/filterscript
> cat /var/log/squid/access.log | grep -if /path/to/filterscript/badwords > hits.out

Useless use of cat:
grep -if /path/to/filterscript/badwords /var/log/squid/access.log > hits.out
 
> /path/to/filterscript/wordfilter.gawk hits.out
>
> cat /path/to/filterscript/word-report | /bin/mail -s "URL Filter Report" [EMAIL PROTECTED]

Useless use of cat:
/bin/mail -s "URL Filter Report" [EMAIL PROTECTED] < /path/to/filterscript/word-report

Personally, I use "awk" to check for jpg/jpeg files exceeding a
certain size. I think the two approaches can be combined :)
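Ralf doesn't post his script, but a size check along those lines might look roughly like this (a sketch, not his actual code: the field numbers assume squid's default native logformat, with the reply size in $5 and the URL in $7, and the 100 kB threshold is an arbitrary example):

```shell
# Flag jpg/jpeg responses larger than ~100 kB in a squid access.log.
awk '$5 > 100000 && $7 ~ /\.jpe?g([?;]|$)/ { print $5, $7 }' \
    /var/log/squid/access.log
```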

-- 
Ralf Hildebrandt (i.A. des IT-Zentrums) [EMAIL PROTECTED]
Charite - Universitätsmedizin BerlinTel.  +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-BerlinFax.  +49 (0)30-450 570-962
IT-Zentrum Standort CBF send no mail to [EMAIL PROTECTED]


Re: [squid-users] Searching squid logs for pornographic sites

2008-06-11 Thread Jason

Look at these:

http://www.meadvillelibrary.org/os/osfiltering-ala/smutscript/
http://www.meadvillelibrary.org/os/osfiltering-ala/
http://meadvillelibrary.org/os/filtering/filtermaintenance.html

She wrote a script that searches logs for keywords and emails it to her.

Jason


Steven Engebretson wrote:

I am looking for a tool that will scan the access.log file for pornographic 
sites, and will report the specifics back.  We do not block access to any 
Internet sites, but need to monitor for objectionable content.

What I am doing now is just grepping for some key words, and dumping the output 
into a file.  I am manually going through about 60,000 lines of log file, 
following my grep.  99% of these are false positives.  Any help would be appreciated.

Thank you all.


-Steven E.


--- AV & Spam Filtering by M+Guardian - Risk Free Email (TM) ---



  


Re: [squid-users] Searching squid logs for pornographic sites

2008-06-11 Thread Rob Asher
Here's something similar to what you're already doing, except it compares the 
URLs against a file of "badwords" and then emails you the results.

#!/bin/sh
# filter.sh
#
cd /path/to/filterscript
cat /var/log/squid/access.log | grep -if /path/to/filterscript/badwords > hits.out

/path/to/filterscript/wordfilter.gawk hits.out

cat /path/to/filterscript/word-report | /bin/mail -s "URL Filter Report" [EMAIL PROTECTED]

rm hits.out


#!/bin/gawk -f
# wordfilter.gawk

BEGIN {
    report = "/path/to/filterscript/word-report"
    print "URL Filter Report:" > report
    print "--" >> report
    sp = " -> "
}

{
    # $1 = UNIX timestamp, $7 = URL, $8 = username (native squid logformat)
    print strftime("%m-%d-%Y %H:%M:%S", $1), sp, $8 >> report
    print $7 >> report
    print "" >> report
}



You may need to adjust the columns printed in the awk script.  They're set for 
usernames instead of IPs.  Also, you'll need to make a 
"/path/to/filterscript/badwords" file with the words/regexes you want to search 
for, one per line.  Someone with better regex skills could probably eliminate 
a lot of "false" hits with specific patterns in the "badwords" file.  I'm using 
this in addition to squidGuard and blacklists to catch URLs that were missed, 
so the output isn't nearly as large as what you're getting.

Rob



-
Rob Asher
Network Systems Technician
Paragould School District
(870)236-7744 Ext. 169


>>> "Steven Engebretson" <[EMAIL PROTECTED]> 6/11/2008 1:32 PM >>>
I am looking for a tool that will scan the access.log file for pornographic 
sites, and will report the specifics back.  We do not block access to any 
Internet sites, but need to monitor for objectionable content.

What I am doing now is just grepping for some key words, and dumping the output 
into a file.  I am manually going through about 60,000 lines of log file, 
following my grep.  99% of these are false positives.  Any help would be appreciated.

Thank you all.


-Steven E.





Re: [squid-users] Searching squid logs for pornographic sites

2008-06-11 Thread julian julian

I suggest using a log analyzer like Webalizer or SARG; they are a bit more 
complete for user behavior analysis.

Julián

--- On Wed, 6/11/08, Steven Engebretson <[EMAIL PROTECTED]> wrote:

> From: Steven Engebretson <[EMAIL PROTECTED]>
> Subject: [squid-users] Searching squid logs for pornographic sites
> To: squid-users@squid-cache.org
> Date: Wednesday, June 11, 2008, 11:32 AM
> I am looking for a tool that will scan the access.log file
> for pornographic sites, and will report the specifics back.
>  We do not block access to any Internet sites, but need to
> monitor for objectionable content.
> 
> What I am doing now is just grepping for some key words, and
> dumping the output into a file.  I am manually going through
> about 60,000 lines of log file, following my grep.  99% of
> these are false positives.  Any help would be appreciated.
> 
> Thank you all.
> 
> 
> -Steven E.





Re: [squid-users] Searching squid logs for pornographic sites

2008-06-11 Thread Peter Albrecht
Hi Steven,

> I am looking for a tool that will scan the access.log file for 
> pornographic sites, and will report the specifics back.  We do not block 
> access to any Internet sites, but need to monitor for objectionable 
> content.   
> 
> What I am doing now is just grepping for some key words, and dumping the 
> output into a file.  I am manually going through about 60,000 lines of 
> log file, following my grep.  99% of these are false positives.  Any help 
> would be appreciated.

I'm not sure if I got you right: are you trying to identify unwanted sites 
from your access.log?  Maybe the blacklists from SquidGuard can be of 
some help:

http://www.squidguard.org/blacklists.html

If you compare the sites from your access log file with the blacklists, you 
should be able to figure out which ones are unwanted.  You would need to 
write a script comparing the files and reporting any hits.
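A minimal sketch of such a comparison script (assumptions: a squidGuard "domains" file with one domain per line, placeholder paths, and the URL in field $7 of squid's default native logformat):

```shell
#!/bin/sh
# Report access.log entries whose host appears in a blacklist
# "domains" file.  Paths are placeholders; adjust to taste.
BLACKLIST=/path/to/blacklists/porn/domains
LOG=/var/log/squid/access.log

awk -v bl="$BLACKLIST" '
  BEGIN { while ((getline d < bl) > 0) bad[d] = 1 }
  {
    host = $7
    sub(/^[a-z]+:\/\//, "", host)   # strip the scheme, if any
    sub(/[\/:].*/, "", host)        # strip port and path
    if (host in bad) print          # exact-domain hit: report the line
  }' "$LOG"
```

Note this only matches exact domains; squidGuard's own matching also catches subdomains, so a production version would need to check parent domains too.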

Regards,

Peter

-- 
Peter Albrecht  Tel: +49-(0)-89-287793-83
Open Source School GmbH Fax: +49-(0)-89-287555-63 
Amalienstraße 45 RG
80799 München   http://www.opensourceschool.de

HRB 172645 - Amtsgericht München
Geschäftsführer: Peter Albrecht, Dr. Markus Wirtz



[squid-users] Searching squid logs for pornographic sites

2008-06-11 Thread Steven Engebretson
I am looking for a tool that will scan the access.log file for pornographic 
sites, and will report the specifics back.  We do not block access to any 
Internet sites, but need to monitor for objectionable content.

What I am doing now is just grepping for some key words, and dumping the output 
into a file.  I am manually going through about 60,000 lines of log file, 
following my grep.  99% of these are false positives.  Any help would be appreciated.

Thank you all.


-Steven E.