On Wed, Mar 20, 2013 at 10:26:21AM +0000, Steve Freegard wrote:
> Listing e-mail addresses and URL paths could be done by normalizing them

yup

> (e.g. lower-case, stripping query parameters etc.)

Not necessarily - as I see it, there would be use cases for complete
URLs as well as for stripped ones, maybe even for the domain part only.

A further aspect: some URLs clearly point to spammy sites, while others
(I see them often in 419s) point to a completely legit page (say, an
article on bbc.co.uk) that is used for illustrative purposes only.

Similarly for email addresses - the domain part or the full address may
be of some value, depending on the situation.

> and then hashing them
> (e.g. MD5/SHA1 etc) and listing the hash.

Good idea. Hashing should completely circumvent character issues.

> As you say though - the issue is collecting the data and populating the
> lists along and maintaining the rest of the infrastructure that serves it.

How about honeynet.org?
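
To make the normalize-then-hash idea from this thread a bit more concrete, here is a rough Python sketch. It is only an illustration under my own assumptions, not an existing list format: the function names (url_variants, email_variants, list_key) are made up, SHA1 is picked from the "MD5/SHA1" suggestion, and lower-casing only the scheme and host of a URL (but the whole email address) is a guess at a sensible normalization. It produces the three URL forms mentioned above (full, query-stripped, domain only) so that each could be listed separately.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def url_variants(url):
    """Return the full, query-stripped and domain-only forms of a URL.

    Assumption: only scheme and host are lower-cased, since paths can be
    case-sensitive. Port, userinfo and fragment are dropped.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    full = urlunsplit((scheme, host, path, parts.query, ""))
    stripped = urlunsplit((scheme, host, path, "", ""))
    return {"full": full, "stripped": stripped, "domain": host}


def email_variants(address):
    """Return the full and domain-only forms of an email address.

    Assumption: the whole address is lower-cased, although the local
    part is formally case-sensitive.
    """
    address = address.strip()
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError("not an email address: %r" % address)
    domain = domain.lower()
    return {"full": local.lower() + "@" + domain, "domain": domain}


def list_key(value):
    """Hash a normalized value; the hex digest is what would be listed."""
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    for form, value in url_variants("HTTP://Example.COM/Path?id=1").items():
        print(form, value, list_key(value))
    for form, value in email_variants("Some.One@Example.ORG").items():
        print(form, value, list_key(value))
```

Since the digest is plain hex, it could be used directly as a lookup label, whatever characters the original URL or address contained. Whether a given hit should count (spammy site vs. a legit page merely linked from a 419) would still have to be decided by whoever populates the list.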