Re: [Imdbpy-help] Better imdbID support for imdbpy2sql

Alexmipego Fri, 30 Jul 2010 08:04:24 -0700

Hi,

True, we could simply create a 'repository' with original title +
imdbid, but I considered two issues with that:
1. Storage-wise, the MD5 sum is always 32 chars, so it's easy to parse
and especially easy to store in memory for a lookup table
2. I've found encoding issues with the titles (e.g. if I cut & paste a
title into an echo title | md5sum terminal I don't get the same md5), so
MD5 also helps validate this and make sure every 'repo' has the
same encoding in mind
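
To illustrate the two points above, here is a minimal sketch using
Python's hashlib. The function name title_md5, the sample title and the
choice of UTF-8 as the agreed encoding are assumptions for illustration
only; they are not part of imdbpy. Note also that a plain echo appends
a newline, which by itself changes the digest.

```python
import hashlib


def title_md5(title, encoding="utf-8"):
    """Return the 32-char hex MD5 of a title under an explicit encoding."""
    return hashlib.md5(title.encode(encoding)).hexdigest()


title = "Am\u00e9lie (2001)"  # hypothetical title with a non-ASCII character

utf8_digest = title_md5(title, "utf-8")
latin1_digest = title_md5(title, "latin-1")
newline_digest = title_md5(title + "\n", "utf-8")  # what `echo title | md5sum` hashes

print(len(utf8_digest))               # digest length is fixed
print(utf8_digest == latin1_digest)   # same title, different encoding
print(utf8_digest == newline_digest)  # same title, trailing newline added
```

Whatever encoding is picked, every 'repo' would have to hash the exact
same bytes for the digests to be comparable.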

If the title changes so does the MD5, but titles change very little,
e.g. >1.xM records are more than a year old, so changes
would be very rare, if any at all. I'm not "promising" a total title -
imdbid match, but since imdb doesn't release the ids this would help
bootstrap the process for everyone. When an MD5 changes it's easy to
add it to a "Wanted" list, and we could even make get_imdbID write the
imdbid to the logs or something like that (e.g. write it somewhere so
that a simple cat log | grep 'New MD5' | upload_somewhere would make
it very easy to contribute).
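
A rough sketch of that logging idea, under stated assumptions: the
function resolve_imdb_id, the logger name, the in-memory dict and the
fetch_id callable (a stand-in for whatever online search returns the
id) are all hypothetical, and this is not how get_imdbID is actually
implemented. It only shows one greppable 'New MD5' line being written
per newly learned mapping.

```python
import hashlib
import logging

logging.basicConfig(format="%(message)s", level=logging.INFO)
log = logging.getLogger("imdbid_lookup")  # hypothetical logger name


def resolve_imdb_id(title, known, fetch_id, encoding="utf-8"):
    """Return the imdbid for a raw title, logging any newly learned mapping."""
    digest = hashlib.md5(title.encode(encoding)).hexdigest()
    if digest in known:
        return known[digest]
    imdb_id = fetch_id(title)  # stand-in for an online search
    if imdb_id is not None:
        known[digest] = imdb_id
        # One greppable line per new mapping: "New MD5 <digest> <imdbid>"
        log.info("New MD5 %s %s", digest, imdb_id)
    return imdb_id


known = {}
# The first call learns the mapping; the second is answered from the table.
first = resolve_imdb_id("Example Title (2001)", known, lambda t: "0000001")
second = resolve_imdb_id("Example Title (2001)", known, lambda t: None)
```

Only the first call logs a line, so the log contains each new MD5
once and can be filtered with grep as described above.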

Finally, the main point of this mail wasn't to ask you to host the
list or change imdbpy too much. Once the MD5 column is present in the
distribution, everyone would be compatible with my solution. You asked
when and where this "matching" would be done, and the beauty of
this (for you) is that it can be done completely outside imdbpy2sql.
Once the MD5s are in the database it's easy to read and match them in
a third-party library without any additional help from imdbpy.
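
As a sketch of what such third-party matching could look like, the
example below uses sqlite3 with an in-memory database. The table name
title and the columns md5sum and imdb_id are assumptions made for the
example, as are the placeholder digests and the id; the real schema
produced by imdbpy2sql may differ.

```python
import sqlite3


def apply_mappings(conn, mappings):
    """Fill imdb_id for rows whose md5sum is in mappings; return rows updated."""
    cur = conn.cursor()
    updated = 0
    for digest, imdb_id in mappings.items():
        cur.execute(
            "UPDATE title SET imdb_id = ? "
            "WHERE md5sum = ? AND imdb_id IS NULL",
            (imdb_id, digest),
        )
        updated += cur.rowcount
    conn.commit()
    return updated


# Hypothetical schema and data standing in for the generated database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE title ("
    "id INTEGER PRIMARY KEY, title TEXT, md5sum TEXT, imdb_id INTEGER)"
)
conn.executemany(
    "INSERT INTO title (title, md5sum) VALUES (?, ?)",
    [("Example One (2001)", "a" * 32), ("Example Two (2002)", "b" * 32)],
)

# A shared md5 -> imdbid list, e.g. downloaded from a community 'repo'.
updated = apply_mappings(conn, {"a" * 32: 1234567})
```

Rows whose MD5 is not in the shared list are simply left untouched,
which is the "fail gracefully" behaviour discussed earlier.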

Btw, I initially thought that the ID you assign to each title was
sequential, so I could simply assume the first line in the CSV was
ID 1 and the last would match as well. However, I found
out that the final sql table has more rows than the raw file, which
means something is either wrong or some extra processing is done.
While I could probably just try to understand the processes involved
and duplicate them in my package, it would be less than ideal simply
because every time you changed imdbpy I would need to patch it, and it
would be compatible with a single version only.

Regards,

On Jul 30, 1:30 pm, Davide Alberani <davide.alber...@gmail.com> wrote:
> On Wed, Jul 28, 2010 at 8:41 PM, Alexmipego <alexmip...@gmail.com> wrote:
>
> > For the project I've in mind I really need
> > to have as many imdbid values mapped as possible. During research, and
> > checking the raw files myself, I found that many people ask for it but
> > it's kinda impossible for imdbpy2sql to do better than it does at
> > guessing ids.
>
> More or less. :-)
> The basic problem is that the imdbIDs are not distributed in
> the plain text data files.
>
> > My solution is based on the fact that searching imdb for the raw names
> > (in the movies.list file) returns an exact match almost always. That
> > means, over time, some applications will end up getting the true id of
> > a movie but there is no way for imdbpy2sql/database to recover the
> > original raw title.
>
> I'm not sure I have understood your point.
> What's the advantage of the MD5 sum, over the normal title?
> I mean: if the title changes, also its MD5 will change and you will
> not be able to find the imdbID.
>
> > When changes in titles, new titles, etc... would occur it would simply
> > fail gracefully and over time those new hash-imdbid codes could be
> > made available.
>
> Well, it may work and it's easy to implement, but it means that you
> need a central repository for this hash table.
> Since I (as IMDbPY) don't want to provide it, most of the users
> will use none or create their own.
> By the way, it's not clear to me when you want to ask the hash table
> for an imdbID: when the imdbpy2sql.py script runs (but this will have
> a heavy impact on performance, I fear) or when a single item (movie,
> person, character or company) is requested.
>
> > Let me know what you think. The changes to support a MD5 column are
> > just 2-3 lines iirc and it shouldn't cause any problems to anyone, yet
> > it would allow for this type of feature to be implemented even if
> > outside the imdbpy code base
>
> Yup - I see your point on this, and I'll take it into consideration.
>
> --
> Davide Alberani <d...@mimante.net>  [PGP KeyID: 0x465BFD47]
> http://www.mimante.net/

------------------------------------------------------------------------------
The Palm PDK Hot Apps Program offers developers who use the
Plug-In Development Kit to bring their C/C++ apps to Palm for a share
of $1 Million in cash or HP Products. Visit us here for more details:
http://p.sf.net/sfu/dev2dev-palm
_______________________________________________
Imdbpy-help mailing list
Imdbpy-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-help
