Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Ralf Hildebrandt
* Rob Asher [EMAIL PROTECTED]:
 Here's something similar to what you're already doing, except it compares the 
 URLs against a file of badwords and then emails you the results.
 
 #!/bin/sh
 # filter.sh
 #
 cd /path/to/filterscript
 cat /var/log/squid/access.log | grep -if /path/to/filterscript/badwords > 
 hits.out

Useless use of cat:
grep -if /path/to/filterscript/badwords /var/log/squid/access.log > hits.out
 
 /path/to/filterscript/wordfilter.gawk hits.out
 
 cat /path/to/filterscript/word-report | /bin/mail -s "URL Filter Report" 
 [EMAIL PROTECTED] 

Useless use of cat:
/bin/mail -s "URL Filter Report" [EMAIL PROTECTED] < 
/path/to/filterscript/word-report

Personally, I use awk to check for jpg/jpeg files exceeding a
certain size. I think the two approaches can be combined :)
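A minimal sketch of that size check (the log path, the field positions, and the 500000-byte threshold are my assumptions, not Ralf's actual values; squid's native log format puts the byte count in field 5 and the URL in field 7):

```shell
# Hedged sketch: flag jpg/jpeg downloads over ~500 KB in a squid
# access.log (native format: $5 = bytes transferred, $7 = URL).
awk '$7 ~ /\.jpe?g([?;]|$)/ && $5 > 500000 { print $5, $7 }' \
    /var/log/squid/access.log > large-images.out
```

Combining this with the badwords grep would just mean running both filters over the same log and merging the hit files.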

-- 
Ralf Hildebrandt (i.A. des IT-Zentrums) [EMAIL PROTECTED]
Charite - Universitätsmedizin BerlinTel.  +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-BerlinFax.  +49 (0)30-450 570-962
IT-Zentrum Standort CBF send no mail to [EMAIL PROTECTED]


Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Ralf Hildebrandt
* Jason [EMAIL PROTECTED]:
 Look at these:

 http://www.meadvillelibrary.org/os/osfiltering-ala/smutscript/
 http://www.meadvillelibrary.org/os/osfiltering-ala/
 http://meadvillelibrary.org/os/filtering/filtermaintenance.html

 She wrote a script that searches logs for keywords and emails it to her.

I had a look at those.

The approach is problematic, especially when using three-letter word
combinations, which match arbitrary, harmless URLs.

Like "wet" (which, in Germany, is part of the word "Wetter", meaning
"weather" - and yes, my users do care about the weather and whether
they are rained upon). So these lists need to be adapted heavily.

I'll try and add my size approach...

-- 
Ralf Hildebrandt (i.A. des IT-Zentrums) [EMAIL PROTECTED]
Charite - Universitätsmedizin BerlinTel.  +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-BerlinFax.  +49 (0)30-450 570-962
IT-Zentrum Standort CBF send no mail to [EMAIL PROTECTED]


Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Michel (M)

Ralf Hildebrandt wrote in the last message:
 * Rob Asher [EMAIL PROTECTED]:
 Here's something similar to what you're already doing, except it compares
 the URLs against a file of badwords and then emails you the results.

 #!/bin/sh
 # filter.sh
 #
 cd /path/to/filterscript
 cat /var/log/squid/access.log | grep -if /path/to/filterscript/badwords
  > hits.out

 Useless use of cat:
 grep -if /path/to/filterscript/badwords /var/log/squid/access.log >
 hits.out

 /path/to/filterscript/wordfilter.gawk hits.out

 cat /path/to/filterscript/word-report | /bin/mail -s "URL Filter Report"
 [EMAIL PROTECTED]

 Useless use of cat:
 /bin/mail -s "URL Filter Report" [EMAIL PROTECTED] <
 /path/to/filterscript/word-report


well, if you're optimizing, do it all the way :) - only one line:

grep arg file | $mail_cmd

then, if you awk the log and pipe the output straight into the mail
command, you don't even need to create files and delete them later, so
you can have it all in one line
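For instance (recipient address and paths are made-up examples), the whole report could collapse into:

```shell
# One-line version, no temporary files: grep the log against the
# badwords list and pipe the hits straight into mail.
grep -if /path/to/badwords /var/log/squid/access.log \
    | /bin/mail -s "URL Filter Report" admin@example.com
```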


but in the end this whole search might be useless, since there is no
guarantee that www.mynewbabyisborn.org is not porn and that www.butt.com is
porn - and how do you catch www.m-y.d-i.c-k.a.t.microsoft.com?
I abandoned all this keyword searching a long time ago, because even if
it worked, the user could still use some fantasy proxy on port 42779, or
a VPN such as Hamachi - and then what do you do?


michel
...





Tecnologia Internet Matik http://info.matik.com.br
Wireless systems for the broadband provider
Hosting and personalized email - and of course, in Brazil.




Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Ralf Hildebrandt
* Michel (M) [EMAIL PROTECTED]:

 but in the end this whole search might be useless, since there is no
 guarantee that www.mynewbabyisborn.org is not porn and that www.butt.com is
 porn - and how do you catch www.m-y.d-i.c-k.a.t.microsoft.com?

Exactly!!

 I abandoned all this keyword searching a long time ago, because even if
 it worked, the user could still use some fantasy proxy on port 42779, or
 a VPN such as Hamachi - and then what do you do?

Yep.

-- 
Ralf Hildebrandt (i.A. des IT-Zentrums) [EMAIL PROTECTED]
Charite - Universitätsmedizin BerlinTel.  +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-BerlinFax.  +49 (0)30-450 570-962
IT-Zentrum Standort CBF send no mail to [EMAIL PROTECTED]


Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Rob Asher


-
Rob Asher
Network Systems Technician
Paragould School District
(870)236-7744 Ext. 169


 Michel (M) [EMAIL PROTECTED] 6/12/2008 6:59 AM 


 but in the end this whole search might be useless, since there is no
 guarantee that www.mynewbabyisborn.org is not porn and that www.butt.com is
 porn - and how do you catch www.m-y.d-i.c-k.a.t.microsoft.com?
 I abandoned all this keyword searching a long time ago, because even if
 it worked, the user could still use some fantasy proxy on port 42779, or
 a VPN such as Hamachi - and then what do you do?
 
 michel
 ...

I agree too, but until there's a better way, we'll still use the keyword 
searching to find the blatant sites.  In our case, we're blocking egress 
traffic for everything except known services (our own proxies), so anonymous 
proxies and VPNs won't be able to connect... UNLESS they can get to them 
through the proxies somehow.  Things like PHProxy and all the anonymizing sites 
make it tougher.  There are ways around anything, I know, but we adapt and keep 
plugging away.

Rob




-- 

This message has been scanned for viruses and
dangerous content by the Paragould School District
MailScanner, and is believed to be clean.



Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Jancs

Quoting Rob Asher [EMAIL PROTECTED]:

 blocking egress traffic for everything except known services (our own
 proxies) so anonymous proxies and VPNs won't be able to
 connect... UNLESS they can get to them through the proxies somehow.
 Things like PHProxy and all the anonymizing sites make it tougher.
 There are ways around anything, I know, but we adapt and keep plugging
 away.


but there still exists the possibility of connecting to an outside service  
sitting on, for example, port 80 or 443 (actually very easy to achieve  
with average skills, and working like a charm) - and then what?  
The only thing which can help in that case is packet analysis (I assume).


J.


This message was sent using IMP, the Internet Messaging Program.




Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Michel (M)

Rob Asher wrote in the last message:


 Michel (M) [EMAIL PROTECTED] 6/12/2008 6:59 AM 


 but in the end this whole search might be useless, since there is no
 guarantee that www.mynewbabyisborn.org is not porn and that www.butt.com
 is porn - and how do you catch www.m-y.d-i.c-k.a.t.microsoft.com?
 I abandoned all this keyword searching a long time ago, because even if
 it worked, the user could still use some fantasy proxy on port 42779, or
 a VPN such as Hamachi - and then what do you do?

 michel
 ...

 I agree too, but until there's a better way, we'll still use the keyword
 searching to find the blatant sites.  In our case, we're blocking egress
 traffic for everything except known services (our own proxies), so anonymous
 proxies and VPNs won't be able to connect... UNLESS they can get to them
 through the proxies somehow.  Things like PHProxy and all the anonymizing
 sites make it tougher.  There are ways around anything, I know, but we adapt
 and keep plugging away.


sure, if you need it, you need it ...
we offer the inverse approach to our customers: we block everything except
the sites the user allows, so the parents decide which sites the kids can
visit and everything else is blocked
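A whitelist-only policy like that could look something like this in squid.conf (just a sketch; the ACL name and file path are made up):

```
# Allow only the domains the parents have approved; deny everything else.
acl allowed_sites dstdomain "/etc/squid/allowed-sites.txt"
http_access allow allowed_sites
http_access deny all
```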

michel
...





Tecnologia Internet Matik http://info.matik.com.br
Wireless systems for the broadband provider
Hosting and personalized email - and of course, in Brazil.




Re: [squid-users] Searching squid logs for pornographic sites

2008-06-12 Thread Chuck Kollars
 The approach is problematic, especially when 
 using three letter word combinations, which match 
 arbitrary, harmless URLs.

The dreaded "unintended match in the middle of a word"
problem can torpedo almost any approach. Solving it is
not a matter of changing approaches, but rather of
changing tools. It can adversely affect virtually any
approach; conversely, it can be fixed in virtually
any approach. 

What's needed is a way to specify word boundaries
during regular expression matching. Unfortunately the
regular expression syntax for word boundaries varies
from tool to tool. Perl and its derivatives let you
specify \b at the beginning and/or end of a word (or
its opposite \B for not-a-word-boundary). Classic
`egrep` provides the same functionality but with the
different regular expression syntax \< at the
beginning of a word and \> at the end of a word. GNU
egrep, GNU awk, and GNU Emacs support both syntaxes.
Tcl provides word boundary functionality with \m, \M,
and \y. Both Java and .NET are Perl-like. The -F
command line switch turns GNU `grep` into its even
stupider cousin `fgrep`, neither of which lets you
specify word boundaries in regular expressions at all.


GNU grep does however let you use Perl-style regular
expressions by specifying -P on the command line. And
perhaps most importantly, GNU grep (and GNU egrep,
which is the same program with different switches)
lets you quickly and automatically turn _everything_
in your regular expressions into full words with the 
-w command line switch (lots of convenience, not
much control :-).

In summary: If you want to specify word boundaries
inside the regular expressions, use either Perl or GNU
grep -P or some other fairly modern tool. If you want
word boundary functionality withOUT specifying word
boundaries in the regular expressions themselves, use
GNU grep -w. If you have no other choice, you can make
it work with classic egrep by inserting \< and \>
appropriately in your regular expressions. But classic
grep won't do word boundaries no matter what. (You can
sorta fake it, but it's a lot of effort and it doesn't
work in all cases.) Note in particular that the
easy-to-overlook -w command line switch on GNU grep
can make a night/day difference. 
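A quick illustration of the -w difference, using Ralf's "wet"/"Wetter" example (the file name is made up):

```shell
printf 'http://wetter.example/today\nhttp://wet.example/x\n' > urls.txt
# Plain match: hits both lines, including the one inside "wetter".
grep -c wet urls.txt    # prints 2
# Whole-word match with GNU grep -w: only the standalone "wet" survives.
grep -cw wet urls.txt   # prints 1
```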

Please do let this list know your results after a few
months. (It sounds like I'm not the only one who's a
bit skeptical that the "bad words in URL" approach,
which seemed to work reasonably well a couple of years
ago, will give even ballpark results these days...)

thanks!



-Chuck Kollars


  


[squid-users] Searching squid logs for pornographic sites

2008-06-11 Thread Steven Engebretson
I am looking for a tool that will scan the access.log file for pornographic 
sites, and will report the specifics back.  We do not block access to any 
Internet sites, but need to monitor for objectionable content.

What I am doing now is just grepping for some key words and dumping the output 
into a file.  I am then manually going through about 60,000 lines of log file 
following my grep.  99% of these are false positives.  Any help would be 
appreciated.

Thank you all.


-Steven E.



Re: [squid-users] Searching squid logs for pornographic sites

2008-06-11 Thread Peter Albrecht
Hi Steven,

 I am looking for a tool that will scan the access.log file for 
 pornographic sites, and will report the specifics back.  We do not block 
 access to any Internet sites, but need to monitor for objectionable 
 content.   
 
 What I am doing now is just grepping for some key words and dumping the 
 output into a file.  I am then manually going through about 60,000 lines of 
 log file following my grep.  99% of these are false positives.  Any help 
 would be appreciated.   

I'm not sure if I understood you correctly: Are you trying to identify unwanted 
sites from your access.log? Maybe the blacklists from SquidGuard can be of 
help:

http://www.squidguard.org/blacklists.html

If you compare the sites from your access logfile with the blacklists, you 
should be able to figure out which are unwanted sites. You would need to 
write a script comparing the files and reporting any hits.
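A sketch of such a comparison script (all paths here are my own examples; squidGuard blacklists ship as plain-text "domains" files, one domain per line):

```shell
# Pull the host part out of each logged URL (native squid log: URL is
# field 7), then intersect the unique hosts with a sorted squidGuard
# domain list using comm(1). Both inputs must be sorted for comm.
awk '{ print $7 }' /var/log/squid/access.log \
    | sed -e 's|^[a-z]*://||' -e 's|[/:].*||' \
    | sort -u > seen-domains.txt
sort -u /var/lib/squidguard/db/porn/domains > blacklist-sorted.txt
comm -12 seen-domains.txt blacklist-sorted.txt > blacklist-hits.txt
```

blacklist-hits.txt then contains only the blacklisted domains that actually appeared in the log, ready to mail out.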

Regards,

Peter

-- 
Peter Albrecht  Tel: +49-(0)-89-287793-83
Open Source School GmbH Fax: +49-(0)-89-287555-63 
Amalienstraße 45 RG
80799 München   http://www.opensourceschool.de

HRB 172645 - Amtsgericht München
Geschäftsführer: Peter Albrecht, Dr. Markus Wirtz



Re: [squid-users] Searching squid logs for pornographic sites

2008-06-11 Thread julian julian

I suggest using a log analyzer like Webalizer or SARG; this is a bit more 
complete for user behavior analysis.

Julián

--- On Wed, 6/11/08, Steven Engebretson [EMAIL PROTECTED] wrote:

 From: Steven Engebretson [EMAIL PROTECTED]
 Subject: [squid-users] Searching squid logs for pornographic sites
 To: squid-users@squid-cache.org
 Date: Wednesday, June 11, 2008, 11:32 AM
 I am looking for a tool that will scan the access.log file
 for pornographic sites, and will report the specifics back.
  We do not block access to any Internet sites, but need to
 monitor for objectionable content.
 
 What I am doing now is just grepping for some key words and
 dumping the output into a file.  I am then manually going through
 about 60,000 lines of log file following my grep.  99% of
 these are false positives.  Any help would be appreciated.
 
 Thank you all.
 
 
 -Steven E.





Re: [squid-users] Searching squid logs for pornographic sites

2008-06-11 Thread Rob Asher
Here's something similar to what you're already doing, except it compares the 
URLs against a file of badwords and then emails you the results.

#!/bin/sh
# filter.sh
#
cd /path/to/filterscript
cat /var/log/squid/access.log | grep -if /path/to/filterscript/badwords > 
hits.out

/path/to/filterscript/wordfilter.gawk hits.out

cat /path/to/filterscript/word-report | /bin/mail -s "URL Filter Report" \
[EMAIL PROTECTED] 

rm hits.out


#!/bin/gawk -f
# wordfilter.gawk

BEGIN {
print "URL Filter Report:" > "/path/to/filterscript/word-report"
print "--" >> "/path/to/filterscript/word-report"
sp = " - "
}

{
print strftime("%m-%d-%Y %H:%M:%S",$1), sp, $8 >> "/path/to/filterscript/word-report"
print $7 >> "/path/to/filterscript/word-report"
print "" >> "/path/to/filterscript/word-report"
}



You may need to adjust the columns printed in the awk script.  They're set for 
usernames instead of IPs.  Also, you'll need to make a 
/path/to/filterscript/badwords file with the words/regexes you want to search 
for...one per line.  Someone with better regex skills could probably eliminate 
a lot of false hits with specific patterns in the badwords file.  I'm using 
this in addition to squidGuard and blacklists to catch URLs that were missed, 
so the output isn't nearly as large as what you're getting.  
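As a hypothetical example of such tightening: grep -if treats each badwords line as a case-insensitive regex, so guarding a short word with non-letter context avoids matches inside longer, harmless words (the pattern below is purely illustrative):

```shell
# Illustrative badwords entry: "[^a-z]wet[^a-z]" only fires when "wet"
# stands alone in the URL, not inside e.g. "wetter".
printf '[^a-z]wet[^a-z]\n' > badwords
printf 'GET http://wetter.example/ x\nGET http://x/wet/y z\n' > access.log
grep -cif badwords access.log   # prints 1
```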

Rob



-
Rob Asher
Network Systems Technician
Paragould School District
(870)236-7744 Ext. 169


 Steven Engebretson [EMAIL PROTECTED] 6/11/2008 1:32 PM 
I am looking for a tool that will scan the access.log file for pornographic 
sites, and will report the specifics back.  We do not block access to any 
Internet sites, but need to monitor for objectionable content.

What I am doing now is just grepping for some key words and dumping the output 
into a file.  I am then manually going through about 60,000 lines of log file 
following my grep.  99% of these are false positives.  Any help would be 
appreciated.

Thank you all.


-Steven E.





Re: [squid-users] Searching squid logs for pornographic sites

2008-06-11 Thread Jason

Look at these:

http://www.meadvillelibrary.org/os/osfiltering-ala/smutscript/
http://www.meadvillelibrary.org/os/osfiltering-ala/
http://meadvillelibrary.org/os/filtering/filtermaintenance.html

She wrote a script that searches logs for keywords and emails it to her.

Jason


Steven Engebretson wrote:

I am looking for a tool that will scan the access.log file for pornographic 
sites, and will report the specifics back.  We do not block access to any 
Internet sites, but need to monitor for objectionable content.

What I am doing now is just grepping for some key words and dumping the output 
into a file.  I am then manually going through about 60,000 lines of log file 
following my grep.  99% of these are false positives.  Any help would be 
appreciated.

Thank you all.


-Steven E.

