Amos Jeffries wrote:
Dave wrote:
Hi,
   Thanks for your reply. The following is the ip and abbreviated msg:
(reason: 554 5.7.1 Service unavailable; Client host [65.24.5.137] blocked using dnsbl-1.uceprotect.net)

To my squid issue: if aufs is less intensive and more efficient, I'll definitely switch over to it. As for your suggestion about splitting into multiple files, I believe the version I have can do this; it has multiple acl statements for the safe_ports definition. My issue, though, is that there are 15000+ lines in this file, and on investigating, some 500 are duplicates. I'd rather not go through it manually to do the split. Is there a way I can split based on the dst, dstdomain, or url_regex you referenced?

I just used the following commands; they pulled off most of the job in a few minutes. The remainder left as regex was small. There are some that are duplicates of the domain-only list, but that can be dealt with later.


# Pull out the IPs
grep -v -E "[a-z]+" porn | sort -u >porn.ipa

# copy everything else (lines containing letters) into a temp file
grep -E "[a-z]+" porn | sort -u >temp.1

# pull out lines with only a domain name
grep -E "^([0-9a-z-]+\.)+[a-z]+$" temp.1 | sort -u >temp.d

# pull out everything without a domain name into another temp
grep -v -E "^([0-9a-z-]+\.)+[a-z]+$" temp.1 | sort -u >temp.2
rm temp.1

# pull out lines that are domain/ or domain<space> and drop the trailing character
grep -E "^([0-9a-z-]+\.)+[a-z]+[/ ]$" temp.2 | sed 's,[/ ]$,,' | sort -u >>temp.d

# leave the rest as regex patterns
grep -v -E "^([0-9a-z-]+\.)+[a-z]+[/ ]$" temp.2 | sort -u >porn.regex
rm temp.2

# sort the just-domains and make sure there are no duplicates.
sort -u temp.d > porn.domains
rm temp.d
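For anyone following along, the recipe above can be sanity-checked end to end on a tiny made-up list (file names and contents here are illustrative, not from Amos's setup):

```shell
# Tiny made-up sample list
cat > sample <<'EOF'
10.0.0.1
example.com
example.com/
bad.example.net/cgi-bin/x
EOF

# IPs: lines with no letters at all
grep -v -E "[a-z]+" sample | sort -u > sample.ipa

# everything with letters goes on for further splitting
grep -E "[a-z]+" sample | sort -u > t1

# bare domain names
grep -E "^([0-9a-z-]+\.)+[a-z]+$" t1 > sample.domains

# domain followed by a bare slash: strip the slash, add to domains
grep -v -E "^([0-9a-z-]+\.)+[a-z]+$" t1 > t2
grep -E "^([0-9a-z-]+\.)+[a-z]+/$" t2 | sed 's,/$,,' >> sample.domains

# whatever is left stays as regex patterns
grep -v -E "^([0-9a-z-]+\.)+[a-z]+/$" t2 > sample.regex
```

The IP lands in sample.ipa, both forms of example.com land in sample.domains, and only the URL with a real path is left as a regex pattern.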

Amos

For what it's worth, this method will not remove overlapping domains: if http://yahoo.com/, http://www.yahoo.com/index.html and http://mail.yahoo.com are all included, you will have more entries than you need. A minor issue, perhaps, but it can lead to unpredictable results (the dstdomain acl type will disregard overlaps to keep the tree sort simple).
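For a shell-only alternative to the script below, here is a rough sketch of one way to collapse such overlaps (not part of Amos's recipe; it assumes one bare domain per line in a hypothetical domains.txt). Reversing the labels makes a parent domain sort immediately before its subdomains, so a single pass can drop the children:

```shell
# Hypothetical input: one bare domain per line
cat > domains.txt <<'EOF'
yahoo.com
mail.yahoo.com
examplefoo.com
example.com
EOF

# Build a reversed-label key (yahoo.com -> "com.yahoo.") so a parent
# sorts immediately before its subdomains; the trailing dot stops
# examplefoo.com from matching example.com.  Then keep a line only if
# its key does not extend the last kept (parent) key.
awk -F. '{ key = ""; for (i = NF; i > 0; i--) key = key $i "."; print key, $0 }' domains.txt \
  | sort \
  | awk '{ if (keep == "" || index($1, keep) != 1) { keep = $1; print $2 } }' \
  > collapsed.txt
```

On the sample input, mail.yahoo.com is dropped as a subdomain of yahoo.com, while examplefoo.com survives alongside example.com.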

Find attached a (less than pretty) Perl script that will resolve these issues*. Critiques and patches welcome (hopefully it's commented enough to make sense to someone else). It's likely not optimized, but in my case the input list is not changed often, so optimization is not critical.

Chris

* It is only set up to handle a list of URLs. It stuffs IP addresses in with the other regexes. It does not account for what I assume to be comment lines (starting with a #), or for strings such as "=female+wrestling"; it will treat them as part of the domain name. Error checking is minimal. It works for me, but comes without warranty. Salt to taste.
#!/usr/bin/perl

# Parses the file, determines whether a line should really be a site or
# a domain block, and pushes the data into the proper files.

use strict;

# Define variables;
$| = 1;
my ($url, $host, $scope, $time, $final);
my %domains = ();
my %regex = ();
my @hosts;
my @site_array;
my @lines;
# Open a bunch of file handles.
my $urlfile = "/etc/squid/acls/ExternalLinks.txt";
open (URLFILE, "< $urlfile") or die "Cannot open $urlfile: $!\n";
my $allowurlfile = "/etc/squid/acls/allowurls";
unlink ($allowurlfile);
open (ALLOWURLS, "> $allowurlfile") or die "Cannot open $allowurlfile: $!\n";
my $allowdomfile = "/etc/squid/acls/allowdoms";
unlink ($allowdomfile);
open (ALLOWDOMS, "> $allowdomfile") or die "Cannot open $allowdomfile: $!\n";

# Start reading input
print "Working...";
while ($url = <URLFILE>) {
  chomp $url;
  my $time = time();
  # grab the host & (if it exists) path
  (undef, undef, $final) = $url =~ m#^(http(s)?://)?(.*)$#i;
  # Split the string on forward slashes
  my @url_array = split "/", $final;
  # Grab the host
  $host = shift @url_array;
  # Split the host into domain components
  my @host_array = split '\.', $host;
  # Check for a leading www (get rid of it!)
  if ($host_array[0] eq "www") {
    shift @host_array;
  }
  # Put the fqdn back together.
  $host = join (".", @host_array);
  if (defined($url_array[0]) || isIP(@host_array)) { # Is this REALLY a site allow?
    # Yes, it's a site.
    my $time = time();
    # grab the host & (if it exists) path
    (undef, undef, $final) = $url =~ m#^(http(s)?://)?(.*)$#;
    # Escape special regex characters
    $final =~ s/(\\|\||\(|\)|\[|\{|\^|\$|\*|\+|\?)/\\$1/g;
    # Split the string on forward slashes
    my @url_array = split "/", $final;
    # Grab the host
    my $host = shift @url_array;
    # Split the host into domain components
    my @host_array = split '\.', $host;
    # Check for a leading www (get rid of it!)
    if ($host_array[0] eq "www") {
      shift @host_array;
    }
    # Put the fqdn back together.
    $host = join (".", @host_array);
    $final = join ('.', @host_array);
    $final .= "/";
    $final .= join ("/", @url_array);
    $final =~ s/\./\\\./g;
    # Now check for a duplicate site block
    if (1 != $regex{$final}->{defined}) { 
      $regex{$final}->{defined} = 1;
      # Create the entry
#print "Added site $url\n";
      $scope = "Site";
      $domains{$url}->{host} = $host;
      $domains{$url}->{final} = $final;
      $domains{$url}->{scope} = $scope;
      $domains{$url}->{time} = $time;
    }
  } else {
    # It's a Domain.
    # Is it a repeat?
    if (1 != $domains{$host}->{defined}) {
      # Haven't seen this one before.  Mark it as seen.
      $domains{$host}->{defined} = 1;
      $scope = "Domain";
      # Rebuild the fqdn with a leading dot (dstdomain form)
      $final = join ('.', @host_array);
      $final = ".$final";
      # Create the entry
#print "Added domain $url\n";
      $domains{$url}->{host} = $host;
      $domains{$url}->{final} = $final;
      $domains{$url}->{scope} = $scope;
      $domains{$url}->{time} = $time;
      push @hosts, $host;
    }
  }
}
# Done reading the file.  Let's filter the data to remove duplication.
# Sort by number of host elements, remove subdomains of defined domains
# (A Schwartzian transform. Somehow, this performs the desired sort.  Perl is weird.)
my @sortedHosts = map { $_->[0] }
                  sort {
                      my @a_fields = @$a[1..$#$a];
                      my @b_fields = @$b[1..$#$b];

                      scalar(@a_fields) <=> scalar(@b_fields)
                  }
                  map { [$_, split'\.'] } @hosts;
foreach $host (@sortedHosts) {
  my $dotHost = ".$host";
  foreach my $urlToTest (keys %domains) {
    my $hostToTest = $domains{$urlToTest}->{host};
    my $dotHostToTest = ".$hostToTest";
    my $deleted = 0;
    my $different = 0;
    # If a subdomain of the host is found, drop it from the list
    if ($hostToTest =~ m/\Q$host\E$/) {
#print "$dotHost - $dotHostToTest - $urlToTest\n";
      # We have a potential match.  Verify further...
      my @host1 = split'\.', $hostToTest;
      my @host2 = split'\.', $host;
      my ($test1, $test2);
      while ($test1 = pop (@host1)) {
        $test2 = pop (@host2);
        if (defined($test1) && defined($test2)) {
          if ($test1 eq $test2) {
#print "# They match so far ($test1 eq $test2), check the next element\n";
            # They match so far, check the next element
            next;
          } else {
#print "# The hosts are different ($hostToTest $host). Break out of here.\n";
            # The hosts are different. Break out of here.
            $different = 1;
            last;
          }
        } elsif (!defined($test2)) {
          # We have a match.  Drop the subdomain.
#print "$hostToTest is a subdomain of $host.  Deleting.\n";
print "."; # So there is SOME indication of work progressing...
          delete $domains{$urlToTest};
          $deleted = 1;
          last;
        }
      }
      if (!$deleted && !$different && ("Domain" ne $domains{$urlToTest}->{scope})) {
#print "$urlToTest is a subdomain of $host.  Deleting.\n";
print "."; # More progress indication
        delete $domains{$urlToTest};
      }
    }
  }
}
print "\n";
# Write the data
# Site-specific hardcoded entry; remove or adjust for your environment.
print ALLOWDOMS ".apexvs.com\n";
foreach $url (keys %domains) {
  $final = $domains{$url}->{final};
  $time = $domains{$url}->{time};
  if ("Site" eq $domains{$url}->{scope}) {
    $scope = "Site";
    print ALLOWURLS "$final\n";
  } else {
    $scope = "Domain";
    print ALLOWDOMS "$final\n";
  }
}
# Close it all up
close URLFILE;
close ALLOWURLS;
close ALLOWDOMS;
# Set proper ownership
# NOTE: change 15 below to the UID of squiduser on YOUR machine.
chown (15, -1, $allowurlfile, $allowdomfile);
chmod (0644, $allowurlfile, $allowdomfile);
print "Done.  Don't forget to reload Squid to make changes effective.\n";
exit 0;

sub isIP {
  my @array = @_;
  for (my $i = 0; $i < 4; $i++) {
    # Search the first 4 parts of the array for alpha and hyphen.
    # Return 0 if found, or if the array is shorter than 4 parts.
    return 0 if !defined($array[$i]) || $array[$i] =~ /[a-zA-Z-]/;
  }
  # No alpha or hyphen found, and there are at least four parts? It
  # could be an IP address.
  return 1;
}
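To wire the script's output files into squid, a fragment along these lines should work (the acl names here are made up; the paths match the ones the script writes, and dstdomain / url_regex are the acl types discussed above):

```
acl AllowedDoms dstdomain "/etc/squid/acls/allowdoms"
acl AllowedURLs url_regex -i "/etc/squid/acls/allowurls"
http_access allow AllowedDoms
http_access allow AllowedURLs
```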
