It was my impression that the domains and urls files had all changes
applied, and that the diff files documented the daily changes. So if
you were going to use the enclosed domains and urls files, the diff
files were unnecessary. But your question got me wondering whether *I*
knew how they worked (which is a good thing), so I did a little research.

I looked inside the blacklist that I downloaded on Friday, 11/23 and the
one downloaded on Monday, 11/26. I compared the porn directory of both
archives, and the 11/26 archive had 2 more files in the porn directory
than the 11/23 archive. The 2 additional files are:
domains.20011125.diff
urls.20011125.diff

I then took about 15 of the domains listed in the domains.20011125.diff
file and checked to see if they were in the 11/23 domains file, and then
if they were in the 11/26 domains file. Every one I checked confirmed
my earlier belief: in any given blacklist archive, the domains file
already contains all of the changes listed in the various
domains.????????.diff files.

So let's say you last updated your domains file on 20011110 (and
everything was synchronized then), and you download the new blacklist
archive on 20011126. You could pull the domains.????????.diff files
dated after 20011110 and before 20011126:

domains.20011112.diff
domains.20011113.diff
domains.20011115.diff
domains.20011118.diff
domains.20011125.diff

Then apply those diff files via the update function, and you should end
up in exactly the same place as if you had used the enclosed domains
file and rebuilt the database. In my script I use the domains/urls
files directly; therefore, for my purposes, those diff files were
unnecessary.
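Here's a rough sketch of the selection step in shell, in case it helps.
The dates and the apply step are placeholders (I've used an echo where
your actual update command would go); the only real assumption is the
domains.YYYYMMDD.diff naming convention from the archive:

```shell
# Set up a scratch directory with sample diff files so the sketch is
# self-contained; in real life these come from the blacklist archive.
tmp=$(mktemp -d); cd "$tmp"
printf '%s\n' x > domains.20011108.diff   # before last sync: skip
printf '%s\n' x > domains.20011112.diff   # in the window: apply
printf '%s\n' x > domains.20011125.diff   # in the window: apply

LAST_SYNC=20011110     # date of your last synchronized update
NEW_ARCHIVE=20011126   # date of the newly downloaded archive

selected=""
for f in domains.*.diff; do
    stamp=${f#domains.}              # strip the "domains." prefix
    stamp=${stamp%.diff}             # strip the ".diff" suffix
    if [ "$stamp" -gt "$LAST_SYNC" ] && [ "$stamp" -lt "$NEW_ARCHIVE" ]; then
        selected="$selected $f"      # here you would apply the diff
    fi
done
echo "selected:$selected"            # the two in-window diffs
```

The date comparison works because YYYYMMDD stamps sort the same way
numerically as chronologically.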

> ... but I assume there is something going on in that to
> prevent any changes made between downloads of fresh blacklists
> from being lost.

Yes, and I'll explain how. First, realize that my diff files only
contain changes that *I* have identified (nothing copied from the diff
files included in the archive). I pick up 20 to 50 new porn domains to
add every week by running a Calamaris report on the squid cache for the
last week, selecting the IP addresses of the PCs my teenaged sons use,
and listing the domains visited. Most of the ones I add are easily
spotted just by scanning down the list. I put those +domains in my diff
file and I don't worry about ever taking them out. I also have
accumulated a list of domains or urls that are included in the blacklist
but shouldn't be there, for example -zdnet.com/pcmag. I add those to my
diff file and leave them there.

The contents of my porn directory diff files can be thought of like
this: The +'s are items that I want to make sure are included in the
database. The -'s are items that I want to ensure are NOT in the
database. It's just that simple.

I download the squidGuard list and a second blacklist from a university
in France. I cat those two porn domains files and the + items from my
diff file together, sort the result, and dedupe it with uniq. No matter
how many times a domain appears in those three input files, it appears
only once in the output.

Then I read the file from the previous step as input, and create a new
output file that does not contain any of the - items from my diff file.
That new output file becomes the new domains file.
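In shell, that two-step merge looks roughly like this. The file names
(squidguard-domains, france-domains, my.diff) and the sample entries
are made up for illustration; the commands are the actual cat / sort /
grep plumbing I described:

```shell
# Scratch directory with stand-in input files so the sketch runs as-is.
tmp=$(mktemp -d); cd "$tmp"
printf '%s\n' alpha.example beta.example    > squidguard-domains
printf '%s\n' beta.example  gamma.example   > france-domains
printf '%s\n' '+delta.example' '-gamma.example' > my.diff

# Step 1: merge the two downloaded lists with the + items from my
# diff file, sorted and deduplicated (sort -u does sort + uniq).
grep '^+' my.diff | sed 's/^+//' > additions
sort -u squidguard-domains france-domains additions > merged

# Step 2: drop every - item from the merged list.  -x matches whole
# lines, -F treats the patterns as fixed strings, -f reads them from
# a file, and -v inverts the match (keep everything NOT listed).
grep '^-' my.diff | sed 's/^-//' > removals
grep -v -x -F -f removals merged > domains

# domains now holds alpha.example, beta.example and delta.example,
# each exactly once; gamma.example has been filtered out.
```

The -x flag matters: without it, a removal like zdnet.com would also
knock out every longer line that happens to contain that string.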

I use that process every time I update the porn files. So far, it works
for me.

Let me mention something else I learned during the diff-file research
mentioned earlier. While looking through domains.20011125.diff I saw a
large number (more than 50) of deletions of what appeared to be porn
domains. Here are just a few
examples:

-10teengirls.com
-maxporno.com
-crazy-xxx-amateurs.com
-crazy-xxx-hardcore.com

These domains were in the 11/23 domains file but were not in the 11/26
domains file. I checked the sites and there is NO QUESTION that they are
porn sites. Why are they being deleted? The only reason they are still
blocked on my system is that the blacklist from France includes them.
So if you aren't combining the two blacklists, those sites are open for
you. (I'd bet users behind squidGuard are submitting the sites and
asking that they be deleted.) Glance down through the deletions in one
of those diff files; it's scary! I may go back, pull the deletions from
the archive diff files, and add them to my own diff files as additions.
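If I do go back and do that, the extraction should be simple. Here's a
sketch; the file names follow the archive's domains.YYYYMMDD.diff
convention, two of the sample deletions are from the examples quoted
above, and +somesite.example is a made-up addition line to show it gets
left alone:

```shell
# Scratch directory with sample archive diffs so the sketch runs as-is.
tmp=$(mktemp -d); cd "$tmp"
printf '%s\n' '-maxporno.com' '+somesite.example' > domains.20011125.diff
printf '%s\n' '-10teengirls.com'                  > domains.20011118.diff

# Collect every deletion line from the archive diffs (-h suppresses
# the file-name prefix), flip the leading "-" to a "+", dedupe, and
# append the result to my own diff file as additions.
grep -h '^-' domains.*.diff | sed 's/^-/+/' | sort -u >> my.diff
```

After that, my.diff carries +maxporno.com and +10teengirls.com, so the
merge step re-adds them no matter what the upstream list does.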

I thought you might like to know about that.

Please let me know if there is still a little haze around what the
script is doing.

Rick


-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Henry Barker
Sent: Thursday, November 29, 2001 5:50 PM
To: [EMAIL PROTECTED]
Subject: Re: Newbie Questions


All this talk about list updates has gotten me thinking about a
question I thought I already understood.

Rick Matthews, in his handy-dandy script, indicates in a comment that
the .diff files included in the squidGuard blacklists are fluff.  I am
wondering if this is because the changes are already included in the
blacklists themselves.

I didn't look at the script logic very well (chances are I wouldn't
understand it without several days' analysis), but I assume there is
something going on in that to prevent any changes made between downloads
of fresh blacklists from being lost.

My solution to preventing wipe-out of custom changes made between
published list updates, was to cat all the changes in the dated diff
files into domains.diff and urls.diff, respectively (for each list
category), and then run the update against those diff files-- in
essence, I am applying only deltas to my lists.  The recent exchanges
regarding updates have made me suspect that perhaps I am misinformed as
to the soundness of this strategy.

Any comments regarding this question are welcome.


Regards,
Henry Barker


