-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 22/09/13 19:41, Kristopher Roy wrote:
> I have a table of songs, several have similar titles I can't find where
> to get started. I tried this but its not right. Select SongTitle,
> COUNT(SongTitle) AS LIKE_COUNT FROM Songs

I did work with a database that came from a company that sold music.
Their source data came from the various record companies and was a
complete mess.  (Yes record companies would make mistakes even for their
own artists!)  For our purposes the data had to be denormalised,
deduplicated and many items merged where the differences weren't important.

Fixing up the data required probabilistic matching, and can't be done in
simple SQL queries.  For example spelling mistakes had to be accounted
for, truncations, case differences, punctuation differences, numeric
differences (eg "Song One" vs "Song 1", "Album 3" vs "Album III", "Vol 4"
vs "Volume 4."), mixes (eg "Song One" vs "Song One (Radio Edit)"),
compositions of multiple artists or contributing to another artists songs
so "artist" becomes murky, etc.

This could only be achieved by processing all the data in advance.
Essentially every artist had to be scored against every other to see if
they were the same (but not similar), same for every album of that artist
against their other albums, and finally of all the songs within each
album.  It required a lot of inspecting the matches, finding anomalies,
doing google searches to find canonical information, adding heuristics,
making sure that heuristics changes did not break existing good matches,
and endless repeats until things are good enough.

If you are trying to do a good job, then you will need to do something
like that.

If you are trying to fix up your own collection, then first go in and fix
all the meta data.  Musicbrainz is a good source for authoritative
information and there are plenty of apps out there to help you edit and
update tags.

If you want a quick fix, then add another column to your SQLite database
that contains the normalised song title.  You will need to iterate over
all your existing data to calculate a normalised title.  For example
convert to all upper case, remove all punctuation, convert multiple spaces
to single, remove "digits -" as you gave in your example, truncating to 30
characters, and whatever else is relevant for your data.  You can now do
matching against the normalised title column for each title.

Roger


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlI/3EAACgkQmOOfHg372QSuHgCgla77zTSx5knJL036AMpU0Unx
JnEAoJ9Cx/kocO3ue4xafKFkM7BVEviE
=RDic
-----END PGP SIGNATURE-----
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to