It was my impression that the domains and urls files had all changes applied, and that the diff files documented the daily changes. So if you were going to use the enclosed domains and urls files, the diff files were unnecessary. But your question got me wondering whether *I* really knew how they worked (which is a good thing), so I did a little research.
I looked inside the blacklist that I downloaded on Friday, 11/23, and the one downloaded on Monday, 11/26. I compared the porn directory of both archives; the 11/26 archive had 2 more files in its porn directory than the 11/23 archive. The 2 additional files are:

domains.20011125.diff
urls.20011125.diff

I then took about 15 of the domains listed in the domains.20011125.diff file and checked whether they were in the 11/23 domains file, and then whether they were in the 11/26 domains file. The ones I checked confirmed my previous belief: in any given blacklist archive, the domains file contains all of the changes listed in the various domains.????????.diff files.

So let's say you last updated your domains file on 20011110 (and everything was synchronized then), and you download the new blacklist archive on 20011126. You could pull the domains.????????.diff files dated after 20011110 and before 20011126:

domains.20011112.diff
domains.20011113.diff
domains.20011115.diff
domains.20011118.diff
domains.20011125.diff

Then apply those diff files via the update function, and you should end up in exactly the same place as if you had used the enclosed domains file and rebuilt the database. In my script I am using the domains/urls files; therefore, for my purposes, those diff files were unnecessary.

> ... but I assume there is something going on in that to
> prevent any changes made between downloads of fresh blacklists
> from being lost.

Yes, and I'll explain how. First, realize that my diff files contain only changes that *I* have identified (nothing copied from the diff files included in the archive). I pick up 20 to 50 new porn domains to add every week by running a Calamaris report on the squid cache for the last week, selecting the IP addresses of the PCs my teenaged sons use, and listing the domains visited. Most of the ones I add are easily spotted just by scanning down the list. I put those + domains in my diff file and don't worry about ever taking them out.
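The "pull the dated diffs between two syncs" idea above can be sketched in shell. The directory, file names, and dates here are made-up sample data (the sketch creates its own stand-in .diff files); in real use you would point the loop at the porn/ directory of the unpacked archive, and the apply step is left as a comment because the exact update command depends on your setup:

```shell
#!/bin/sh
# Sketch: select the domains.YYYYMMDD.diff files dated after the last
# sync and before the new archive date. Sample data only.
dir=$(mktemp -d)
for d in 20011105 20011112 20011113 20011115 20011118 20011125; do
    : > "$dir/domains.$d.diff"    # stand-ins for the archive's diff files
done

last=20011110    # date the local domains file was last synchronized
new=20011126     # date of the freshly downloaded archive

selected=""
for f in "$dir"/domains.*.diff; do
    stamp=${f##*/domains.}        # strip directory and "domains." prefix
    stamp=${stamp%.diff}          # strip ".diff", leaving YYYYMMDD
    if [ "$stamp" -gt "$last" ] && [ "$stamp" -lt "$new" ]; then
        selected="$selected $stamp"
        echo "would apply $f"
        # Here you would feed $f to whatever applies diffs on your
        # system (e.g. squidGuard's update mode) and then rebuild.
    fi
done
```

Note that the YYYYMMDD stamps compare correctly as plain integers, which is why a simple -gt/-lt test is enough to carve out the window.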
I have also accumulated a list of domains and urls that are included in the blacklist but shouldn't be there, for example -zdnet.com/pcmag. I add those to my diff file and leave them there. The contents of my porn directory diff files can be thought of like this: the +'s are items that I want to make sure are included in the database, and the -'s are items that I want to ensure are NOT in the database. It's just that simple.

I download the squidGuard list and a second blacklist from a university in France. I cat those 2 porn domains files and the + items from my diff file together, sort the result, and dedupe with uniq. No matter how many times a domain appears in those 3 input files, it appears only once in the output. Then I read the file from the previous step as input and create a new output file that does not contain any of the - items from my diff file. That new output file becomes the new domains file. I use that process every time I update the porn files. So far, it works for me.

Let me mention something else I learned when I did the diff file research mentioned earlier. When I was looking through domains.20011125.diff I saw a large number (more than 50) of deletions listed for what appeared to be porn domains. Here are just a few examples:

-10teengirls.com
-maxporno.com
-crazy-xxx-amateurs.com
-crazy-xxx-hardcore.com

These domains were in the 11/23 domains file but not in the 11/26 domains file. I checked the sites, and there is NO QUESTION that they are porn sites. Why are they being deleted? The only reason they are still blocked on my system is that the blacklist from France includes them. So if you aren't combining the two blacklists, those sites are open for you. (I'd bet users behind squidGuard are sending the sites in and asking for them to be deleted?) Glance down through one of those diff files at the deletions; it's scary! I may go back, pull the deletions from the diff files, and add them to my diff files as additions.
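The merge process described above (cat both lists plus the + entries, sort, dedupe, then strip the - entries) can be sketched like this. The file names and sample domains are invented for illustration, and comm is just one way to do the subtraction step; the post doesn't show the exact commands the script uses:

```shell
#!/bin/sh
# Sketch of the two-step merge, with made-up sample data.
# squidguard_domains / univ_domains stand in for the two downloaded
# blacklists; my.diff holds the local +/- entries.
cd "$(mktemp -d)" || exit 1

printf 'badsite1.com\nbadsite2.com\n' > squidguard_domains
printf 'badsite2.com\nbadsite3.com\n' > univ_domains
printf '+badsite4.com\n-badsite3.com\n' > my.diff

# Step 1: combine both lists with the + entries, sort, and dedupe.
grep '^+' my.diff | cut -c2- |
    cat squidguard_domains univ_domains - | sort | uniq > combined

# Step 2: drop every - entry from the combined list.
# comm -23 prints lines unique to the first file; both inputs are sorted.
grep '^-' my.diff | cut -c2- | sort > removals
comm -23 combined removals > domains

cat domains
```

With the sample data, badsite3.com survives the dedupe in step 1 (it is in the university list) but is removed in step 2, while the locally added badsite4.com is kept, which mirrors the "+ means must be in, - means must be out" rule above.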
I thought you might like to know about that. Please let me know if there is still a little haze around what the script is doing.

Rick

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Henry Barker
Sent: Thursday, November 29, 2001 5:50 PM
To: [EMAIL PROTECTED]
Subject: Re: Newbie Questions

All this talk about list updates has gotten me thinking about a question I thought I already understood. Rick Matthews, in his handy-dandy script, indicates in a comment that the .diff files included in the squidGuard blacklists are fluff. I am wondering if this is because the changes are already included in the blacklists themselves. I didn't look at the script logic very closely (chances are I wouldn't understand it without several days' analysis), but I assume there is something going on in there to prevent any changes made between downloads of fresh blacklists from being lost.

My solution to preventing wipe-out of custom changes made between published list updates was to cat all the changes in the dated diff files into domains.diff and urls.diff, respectively (for each list category), and then run the update against those diff files -- in essence, applying only deltas to my lists. The recent exchanges regarding updates have made me suspect that perhaps I am misinformed as to the validity of this strategy. Any comments regarding this question are welcome.

Regards,
Henry Barker
