-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 22/09/13 19:41, Kristopher Roy wrote: > I have a table of songs, several have similar titles I can't find where > to get started. I tried this but its not right. Select SongTitle, > COUNT(SongTitle) AS LIKE_COUNT FROM Songs
I did work with a database that came from a company that sold music. Their source data came from the various record companies and was a complete mess. (Yes record companies would make mistakes even for their own artists!) For our purposes the data had to be denormalised, deduplicated and many items merged where the differences weren't important. Fixing up the data required probabilistic matching, and can't be done in simple SQL queries. For example spelling mistakes had to be accounted for, truncations, case differences, punctuation differences, numeric differences (eg "Song One" vs "Song 1", "Album 3" vs "Album III", "Vol 4" vs "Volume 4."), mixes (eg "Song One" vs "Song One (Radio Edit)"), compositions of multiple artists or contributing to another artists songs so "artist" becomes murky, etc. This could only be achieved by processing all the data in advance. Essentially every artist had to be scored against every other to see if they were the same (but not similar), same for every album of that artist against their other albums, and finally of all the songs within each album. It required a lot of inspecting the matches, finding anomalies, doing google searches to find canonical information, adding heuristics, making sure that heuristics changes did not break existing good matches, and endless repeats until things are good enough. If you are trying to do a good job, then you will need to do something like that. If you are trying to fix up your own collection, then first go in and fix all the meta data. Musicbrainz is a good source for authoritative information and there are plenty of apps out there to help you edit and update tags. If you want a quick fix, then add another column to your SQLite database that contains the normalised song title. You will need to iterate over all your existing data to calculate a normalised title. For example convert to all upper case, remove all punctuation, convert multiple spaces to single, remove "digits -" as you gave in your example, truncating to 30 characters, and whatever else is relevant for your data. You can now do matching against the normalised title column for each title. Roger -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iEYEARECAAYFAlI/3EAACgkQmOOfHg372QSuHgCgla77zTSx5knJL036AMpU0Unx JnEAoJ9Cx/kocO3ue4xafKFkM7BVEviE =RDic -----END PGP SIGNATURE----- _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users