Roger,
Thank you, I will start working on it per your suggestions, though I would
love to get mine cleaned up easily, I want to provide a solution that other
users can use :)


On Sun, Sep 22, 2013 at 11:14 PM, Roger Binns <rog...@rogerbinns.com> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 22/09/13 19:41, Kristopher Roy wrote:
> > I have a table of songs, several have similar titles I can't find where
> > to get started. I tried this but its not right. Select SongTitle,
> > COUNT(SongTitle) AS LIKE_COUNT FROM Songs
>
> I did work with a database that came from a company that sold music.
> Their source data came from the various record companies and was a
> complete mess.  (Yes record companies would make mistakes even for their
> own artists!)  For our purposes the data had to be denormalised,
> deduplicated and many items merged where the differences weren't important.
>
> Fixing up the data required probabilistic matching, and can't be done in
> simple SQL queries.  For example spelling mistakes had to be accounted
> for, truncations, case differences, punctuation differences, numeric
> differences (eg "Song One" vs "Song 1", "Album 3" vs "Album III", "Vol 4"
> vs "Volume 4."), mixes (eg "Song One" vs "Song One (Radio Edit)"),
> compositions of multiple artists or contributing to another artists songs
> so "artist" becomes murky, etc.
>
> This could only be achieved by processing all the data in advance.
> Essentially every artist had to be scored against every other to see if
> they were the same (but not similar), same for every album of that artist
> against their other albums, and finally of all the songs within each
> album.  It required a lot of inspecting the matches, finding anomalies,
> doing google searches to find canonical information, adding heuristics,
> making sure that heuristics changes did not break existing good matches,
> and endless repeats until things are good enough.
>
> If you are trying to do a good job, then you will need to do something
> like that.
>
> If you are trying to fix up your own collection, then first go in and fix
> all the meta data.  Musicbrainz is a good source for authoritative
> information and there are plenty of apps out there to help you edit and
> update tags.
>
> If you want a quick fix, then add another column to your SQLite database
> that contains the normalised song title.  You will need to iterate over
> all your existing data to calculate a normalised title.  For example
> convert to all upper case, remove all punctuation, convert multiple spaces
> to single, remove "digits -" as you gave in your example, truncating to 30
> characters, and whatever else is relevant for your data.  You can now do
> matching against the normalised title column for each title.
>
> Roger
>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.12 (GNU/Linux)
>
> iEYEARECAAYFAlI/3EAACgkQmOOfHg372QSuHgCgla77zTSx5knJL036AMpU0Unx
> JnEAoJ9Cx/kocO3ue4xafKFkM7BVEviE
> =RDic
> -----END PGP SIGNATURE-----
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
Thank You,
Kristopher C. Roy
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to