Phenomenally helpful. So in short, if I understand it all correctly, IMDb keeps the following in its own database: the real imdbID, whether a title is adult, and whether a title is available on Amazon/Blockbuster. However, none of this is distributed in the plain text data files, and the only way to get any of it is to use a script that does a title search and scrapes the pages?

On Fri, Mar 19, 2010 at 6:16 AM, Davide Alberani <davide.alber...@gmail.com> wrote:
> On Mar 19, Michael Liu <mikel...@gmail.com> wrote:
>
>> I used imdbpy2sql to populate a local database with the IMDB data, but
>> have some questions about the data.
>
> Hi!
> I take this as an opportunity to write some FAQs to put in the
> documentation, since these questions have come up a lot lately. :-)
>
>> In the titles table, the column imdb_id is empty. Am I missing a file
>> I needed to download to fill this in? How can I get imdb_ids?
>
> Q2: why are the movieIDs (and other IDs) used in the 'sql' database
> not the same as those used on the IMDb.com site?
>
> A2: first, a bit of nomenclature: we'll call "movieID" (or things like
> "personID", for instances of the Person class) a unique identifier used
> by IMDbPY to manage a single movie (or other kinds of object).
> We'll call "imdbID" a unique identifier used, for the same kind
> of data, by the IMDb.com site (i.e.: the 7-digit number in tt0094226,
> as seen in the URL for "The Untouchables").
>
> Using IMDbPY to access the web ('http' and 'mobile' data access
> systems), movieIDs and imdbIDs are the same thing - beware that
> in this case a movieID is a string, with the leading zeroes.
>
> Unfortunately, when populating a sql database with data from the plain
> text data files, we don't have access to imdbIDs - since they are
> not distributed at all - and so we have to make them up ourselves
> (they are the 'id' column in tables like 'title' or 'name').
> This means that these values are valid only for your current database:
> if you update it with a newer set of plain text data files, these IDs
> will surely change (and, by the way, they are integers).
> It's also obvious, now, that you can't exchange IDs between the
> 'http' (or 'mobile') data access system and 'sql', and in the same
> way you can't use imdbIDs with your local database or vice-versa.
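
As a small illustration of the web-side ID form described in A2 (a string that keeps its leading zeroes), here is a rough sketch in plain Python. The helper names, the URL layout and the "seven or more digits" pattern are assumptions made for this example and are not part of IMDbPY; the integer 'id' values of a local sql database cannot be fed to these helpers, since they are unrelated to imdbIDs.

```python
import re

# Assumption: title URLs contain "tt" followed by the imdbID digits.
_TT_RE = re.compile(r'tt(\d{7,})')


def imdb_id_from_url(url):
    """Return the digits after 'tt' as a string (zeroes kept), or None."""
    match = _TT_RE.search(url)
    return match.group(1) if match else None


def title_url(imdb_id):
    """Build a title URL, restoring leading zeroes if an int was passed."""
    return 'https://www.imdb.com/title/tt%s/' % str(imdb_id).zfill(7)
```

Keeping the ID as a string avoids silently losing the leading zeroes, which is the pitfall A2 warns about.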
>
> Q3: using a sql database, what's the imdb_id (or something like that)
> column in tables like 'title', 'name' and so on?
>
> A3: it's used internally by IMDbPY to remember the imdbID of a movie
> (the one used by the web site; when accessing the database you'll use
> the numeric value of the 'id' column as movieID) once it has stumbled
> upon it. This way, if IMDbPY is asked again about the imdbID of
> a movie (or person, or ...), it doesn't have to contact the web site
> again. Notice that you have to access the sql database as
> a user with write permission for it to be updated.
>
> As a bonus, when possible, the values of these imdbIDs are saved
> between updates of the sql database (using the imdbpy2sql.py script).
> Beware that it's tricky and not always possible, but the script does
> its best to succeed.
>
> Q4: but what if I really need the imdbIDs, to use my database?
>
> A4: no, you don't. Search for a title, get its information. Be happy!
>
> Q5: I have a great idea: write a script to fetch all the imdbIDs from
> the web site! Can't you do it?
>
> A5: yeah, I can. But I won't. :-)
> It would be somewhat easy to map every title on the web to its
> imdbID, but there are still a lot of problems.
> First of all, every user would end up doing it for their own copy
> of the plain text data files (and this would make the imdbpy2sql.py
> script painfully slow and prone to all sorts of problems).
> Moreover, the imdbIDs are unique and never reused, true, but movie
> titles _do_ change: to fix typos, to override working titles, to cope
> with a new movie with the same title released in the same year (not
> to mention cancelled or postponed movies).
>
> Besides that, we'd have to do the same for persons, characters and
> companies. Believe me: it doesn't make sense.
> Work on your local database using your movieIDs (or even better:
> don't mind about movieIDs and think in terms of searches and Movie
> instances!) and retrieve the imdbIDs only in the rare circumstances
> when you really need them (see the next FAQ).
> Repeat with me: I DON'T NEED ALL THE imdbIDs. :-)
>
>> Without the imdb_id, is it possible for me to generate a link to a
>> given movie on IMDB?
>
> Q6: using a sql database, how can I convert a movieID (whose value
> is valid only locally) to an imdbID (the ID used by the imdb.com site)?
>
> A6: various functions can be used to convert a movieID (or personID or
> other IDs) to the imdbID used by the web site.
> Example of code:
>
>   from imdb import IMDb
>   ia = IMDb('sql', uri=URI_TO_YOUR_SQL_DATABASE)
>   movie = ia.search_movie('The Untouchables')[0]  # a Movie instance.
>   print('The movieID for The Untouchables: %s' % movie.movieID)
>   print('The imdbID used by the site: %s' % ia.get_imdbMovieID(movie.movieID))
>   print('Same ID, smarter function: %s' % ia.get_imdbID(movie))
>
> It goes without saying that get_imdbMovieID has some sibling
> methods: get_imdbPersonID, get_imdbCompanyID and get_imdbCharacterID.
> Also notice that the get_imdbID method is smarter, and takes any kind
> of instance (the other functions need a movieID, personID, ...)
>
> Another method that will try to retrieve the imdbID is get_imdbURL,
> which works like get_imdbID but returns a URL.
>
> In case of problems, these methods will return None.
>
>> Also, the online IMDB is aware of which titles are adult movies, but I
>> don't see any similar column in my local database. How can I determine
>> whether a movie is adult or not?
>
> Read README.adult and see imdb/parser/sql/__init__.py: when searching
> for a title, it tries to guess whether it's an adult title.
> It can't be perfect and I don't assume any responsibility
> on this matter. ;-)
>
>> Lastly, online IMDB seems to know which movies are and aren't
>> available on Amazon and Blockbuster. Is that in the database
>> somewhere?
>
> No.
> Accessing the web ('http' and 'mobile'), there are parsers for
> the 'amazon reviews' page, but this information is not published
> in the plain text data files.
>
>
> HTH,
> --
> Davide Alberani <davide.alber...@gmail.com> [GPG KeyID: 0x465BFD47]
> http://www.mimante.net/
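
For the link question, a minimal sketch of how the lookup described in A6 could be wrapped. It relies only on get_imdbURL behaving as stated there (a URL on success, None in case of problems); resolve_links is a hypothetical helper name, and each call may contact the web site when the imdbID has not been remembered in the database yet, so it is best kept to the few titles you actually need.

```python
def resolve_links(ia, movies):
    """Map each movie's local movieID to its IMDb URL.

    `ia` is expected to be an IMDbPY access object and `movies` an
    iterable of Movie instances. Movies whose URL cannot be resolved
    (get_imdbURL returned None) are left out of the result.
    """
    links = {}
    for movie in movies:
        url = ia.get_imdbURL(movie)  # None in case of problems
        if url is not None:
            links[movie.movieID] = url
    return links
```

With a 'sql' access object created as in A6, something like resolve_links(ia, ia.search_movie('The Untouchables')) would return a dictionary keyed by the local movieIDs.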