Re: Fuzzy Lookups

2006-02-09 Thread name
Gregory Piñero wrote: > Wow, that looks excellent. I'll definitely try it out. I'm assuming > this is an existing project, e.g. you didn't write it after reading > this thread? Yes, it is an existing project of course ;) Right now I have no time to improve it. I hope that later this summer I

Re: Fuzzy Lookups

2006-02-09 Thread Gregory Piñero
Wow, that looks excellent. I'll definitely try it out. I'm assuming this is an existing project, e.g. you didn't write it after reading this thread? -Greg On 2/9/06, name <[EMAIL PROTECTED]> wrote: > Gregory Piñero wrote: > : > > If anyone would be kind enough to improve it I'd love to ha

Re: Fuzzy Lookups

2006-02-09 Thread name
Gregory Piñero wrote: : > If anyone would be kind enough to improve it I'd love to have these > features but I'm swamped this week! > > - MD5 checking for find exact matches regardless of name > - Put each set of duplicates in its own subfolder. Done? http://pyfdupes.sourceforge.net/ Bye, l

Re: Fuzzy Lookups

2006-02-03 Thread BBands
Diez B. Roggisch wrote: > I did a levenshtein-fuzzy-search myself, however I enhanced my version by > normalizing the distance the following way: > > def relative(a, b): > """ > Computes a relative distance between two strings. It's in the range > (0-1] where 1 means total equality. >

Re: Why checksum? [was Re: Fuzzy Lookups]

2006-02-02 Thread Tom Anderson
On Thu, 1 Feb 2006, it was written: > Tom Anderson <[EMAIL PROTECTED]> writes: > >>> The obvious way is make a list of hashes, and sort the list. >> >> Obvious, perhaps, prudent, no. To make the list of hashes, you have to >> read all of every single file first, which could take a while. If your

Re: Why checksum? [was Re: Fuzzy Lookups]

2006-02-01 Thread Erik Max Francis
Steven D'Aprano wrote: > This isn't a criticism, it is a genuine question. Why do people compare > local files with MD5 instead of doing a byte-to-byte compare? Is it purely > a caching thing (once you have the checksum, you don't need to read the > file again)? Are there any other reasons? Becau

Re: Why checksum? [was Re: Fuzzy Lookups]

2006-02-01 Thread Paul Rubin
Steven D'Aprano <[EMAIL PROTECTED]> writes: > Sure. But if you are just comparing two files, is there any reason to > bother with a checksum? (MD5 or other.) No of course not, except in special situations, like some problem opening and reading both files simultaneously. E.g.: the files are on two
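For the two-file case discussed in this message, a direct byte-for-byte comparison needs no checksum at all. A minimal sketch using the standard library's `filecmp` module; the helper name `same_contents` is hypothetical and does not come from the thread:

```python
import filecmp


def same_contents(path_a, path_b):
    """Return True if the two files hold identical bytes."""
    # shallow=False forces a real content comparison instead of
    # accepting files as equal because their os.stat() signatures match.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

A checksum only starts to pay off in the special situations the message goes on to mention, such as when both files cannot easily be read at the same time.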

Re: Why checksum? [was Re: Fuzzy Lookups]

2006-02-01 Thread Paul Rubin
Tom Anderson <[EMAIL PROTECTED]> writes: > > The obvious way is make a list of hashes, and sort the list. > > Obvious, perhaps, prudent, no. To make the list of hashes, you have to > read all of every single file first, which could take a while. If your > files are reasonably random at the beginni
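The "list of hashes" idea quoted above might look roughly like the sketch below. It is an illustration rather than code from the thread: the function names are hypothetical, and it groups digests in a dictionary instead of sorting a list. As the quoted objection points out, it has to read every file in full.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 16):
    """Hash a file in chunks so large files are never fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(paths):
    """Return lists of paths whose contents share the same MD5 digest."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[md5_of_file(path)].append(path)
    # Only digests seen more than once indicate candidate duplicates.
    return [group for group in by_digest.values() if len(group) > 1]
```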

Re: Why checksum? [was Re: Fuzzy Lookups]

2006-02-01 Thread Steven D'Aprano
On Tue, 31 Jan 2006 13:38:50 -0800, Paul Rubin wrote: > Steven D'Aprano <[EMAIL PROTECTED]> writes: >> This isn't a criticism, it is a genuine question. Why do people compare >> local files with MD5 instead of doing a byte-to-byte compare? Is it purely >> a caching thing (once you have the checksu

Re: Why checksum? [was Re: Fuzzy Lookups]

2006-02-01 Thread Tom Anderson
On Tue, 31 Jan 2006, it was written: > Steven D'Aprano <[EMAIL PROTECTED]> writes: > >> This isn't a criticism, it is a genuine question. Why do people compare >> local files with MD5 instead of doing a byte-to-byte compare? I often wonder that! >> Is it purely a caching thing (once you have th

Re: Why checksum? [was Re: Fuzzy Lookups]

2006-01-31 Thread Paul Rubin
Steven D'Aprano <[EMAIL PROTECTED]> writes: > This isn't a criticism, it is a genuine question. Why do people compare > local files with MD5 instead of doing a byte-to-byte compare? Is it purely > a caching thing (once you have the checksum, you don't need to read the > file again)? Are there any o

Why checksum? [was Re: Fuzzy Lookups]

2006-01-31 Thread Steven D'Aprano
On Tue, 31 Jan 2006 10:51:44 -0500, Gregory Piñero wrote: > http://www.blendedtechnologies.com/removing-duplicate-mp3s-with-python-a-naive-yet-fuzzy-approach/60 > > If anyone would be kind enough to improve it I'd love to have these > features but I'm swamped this week! > > - MD5 checking for fi

Re: Fuzzy Lookups

2006-01-31 Thread Gregory Piñero
I wonder which algorithm determines the similarity between two strings better? On 1/31/06, Kent Johnson <[EMAIL PROTECTED]> wrote: > Gregory Piñero wrote: > > Ok, ok, I got it! The Pythonic way is to use an existing library ;-) > > > > import difflib > > CloseMatches=difflib.get_close_matches(AFi

Re: Fuzzy Lookups

2006-01-31 Thread Kent Johnson
Gregory Piñero wrote: > Ok, ok, I got it! The Pythonic way is to use an existing library ;-) > > import difflib > CloseMatches=difflib.get_close_matches(AFileName,AllFiles,20,.7) > > I wrote a script to delete duplicate mp3's by filename a few years > back with this. If anyone's interested in s

Re: Fuzzy Lookups

2006-01-31 Thread Gregory Piñero
> Thanks for that, I'll have a look. (So many packages, so little > time...) Yes, there's a standard library for everything it seems! Except for a MySQL api :-( > > I wrote a script to delete duplicate mp3's by filename a few years > > back with this. If anyone's interested in seeing it, I'll p

Re: Fuzzy Lookups

2006-01-30 Thread Gregory Piñero
Ok, ok, I got it! The Pythonic way is to use an existing library ;-) import difflib CloseMatches=difflib.get_close_matches(AFileName,AllFiles,20,.7) I wrote a script to delete duplicate mp3's by filename a few years back with this. If anyone's interested in seeing it, I'll post a blog entry on
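A small self-contained version of the `difflib.get_close_matches` call from this message, kept with the same `n=20` and `cutoff=0.7` arguments. The file names are made up for illustration and are not from the original script:

```python
import difflib

# Hypothetical file listing; the real script would scan a directory.
all_files = [
    "Hey Jude.mp3",
    "hey jude.mp3",
    "Hey Jude (live).mp3",
    "Yesterday.mp3",
]

# Return up to 20 names whose similarity ratio to the query is >= 0.7,
# ordered from most to least similar.
close_matches = difflib.get_close_matches("Hey Jude.mp3", all_files, n=20, cutoff=0.7)
print(close_matches)
```

Raising the cutoff makes the match stricter. The comparison is case-sensitive, so lowercasing names first may be worth considering.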

Re: Fuzzy Lookups

2006-01-30 Thread ajones
BBands wrote: > I have some CDs and have been archiving them on a PC. I wrote a Python > script that spans the archive and returns a list of its contents: > [[genre, artist, album, song]...]. I wanted to add a search function to > locate all the versions of a particular song. This is harder than y

Re: Fuzzy Lookups

2006-01-30 Thread gene tani
BBands wrote: > Diez B. Roggisch wrote: > > I did a levenshtein-fuzzy-search myself, however I enhanced my version by > > normalizing the distance the following way: > > Thanks for the snippet. I agree that normalizing is important. A > distance of three is one thing when your strings are long, bu

Re: Fuzzy Lookups

2006-01-30 Thread BBands
Diez B. Roggisch wrote: > I did a levenshtein-fuzzy-search myself, however I enhanced my version by > normalizing the distance the following way: Thanks for the snippet. I agree that normalizing is important. A distance of three is one thing when your strings are long, but quite another when they

Re: Fuzzy Lookups

2006-01-30 Thread Fredrik Lundh
Diez B. Roggisch wrote: > The advantage becomes apparent when you try to e.g. compare > > "Angelina Jolie" > > with > > "AngelinaJolei" > > and > > "Bob" > > Both have a l-dist of 3 >>> distance("Angelina Jolie", "AngelinaJolei") 3 >>> distance("Angelina Jolie", "Bob") 13 what did I miss ?
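The `distance` function used in this message is not shown, so the following is an assumed stand-in: a plain dynamic-programming Levenshtein distance (the name `levenshtein` is hypothetical). It reproduces the two values quoted above.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("Angelina Jolie", "AngelinaJolei"))  # 3
print(levenshtein("Angelina Jolie", "Bob"))            # 13
```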

Re: Fuzzy Lookups

2006-01-30 Thread Diez B. Roggisch
Fredrik Lundh wrote: > Diez B. Roggisch wrote: > >> The advantage becomes apparent when you try to e.g. compare >> >> "Angelina Jolie" >> >> with >> >> "AngelinaJolei" >> >> and >> >> "Bob" >> >> Both have a l-dist of 3 > distance("Angelina Jolie", "AngelinaJolei") > 3 distance("Angeli

Fuzzy Lookups

2006-01-30 Thread BBands
I have some CDs and have been archiving them on a PC. I wrote a Python script that spans the archive and returns a list of its contents: [[genre, artist, album, song]...]. I wanted to add a search function to locate all the versions of a particular song. This is harder than you might think. For exa

Re: Fuzzy Lookups

2006-01-30 Thread Diez B. Roggisch
> As mentioned above this works quite well and I am happy with it, but I > wonder if there is a more Pythonic way of doing this type of lookup? I did a levenshtein-fuzzy-search myself, however I enhanced my version by normalizing the distance the following way: def relative(a, b): """ Com
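The body of `relative` is cut off in this listing, so its actual formula is unknown. As a sketch of what normalizing an edit distance can mean, here is one common choice, assumed for illustration only: divide by the longer string's length and subtract from 1. Its range is [0, 1], which differs from the (0-1] range in the docstring above, so the original probably used a different formula. The name `relative_similarity` is hypothetical.

```python
def relative_similarity(a, b, edit_distance):
    """Scale a precomputed edit distance to a 0..1 similarity score."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance / longest


# Edit distances of 3 and 13 are the values reported elsewhere in the thread.
print(round(relative_similarity("Angelina Jolie", "AngelinaJolei", 3), 2))  # 0.79
print(round(relative_similarity("Angelina Jolie", "Bob", 13), 2))           # 0.07
```

With a score like this, a single threshold works for both short and long strings, which is the point made in the thread about raw distances.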