Re: [Imdbpy-help] differences between http and sql results

Davide Alberani Thu, 08 Jan 2009 02:23:51 -0800

Hi all,
I'm copying this mail to imdbpy-devel, because there are some issues
that I'm not sure how to handle.  Feel free to share your thoughts.

On Jan 08, Mike Castle <dalg...@gmail.com> wrote:

> I built a local sqlite db with data from 2009-01-02, using my then
> installed 3.6 version on Debian/testing (currently using 3.8).

3.9 was released 2 days ago. :-)

> I've noticed that there are a few things that are different between
> the http and sql versions.  Some I think (ratings missing from http,
> year as string vs int) will be fixed when I update.

Yes, these are fixed, but some of the other issues have a
different explanation.

> Using http, the title is Jui kuen and akas look like
> u'Challenge::(India: English title)' When using sql, the title
> is Drunken Master and akas look like u'Challenge (1978)::(India:
> English title)'

Let's look at these problems separately.

[THE TITLE IN THE RESULTS LIST ISSUE]
You're right, the titles of the Movie instances returned by the
search_movie method (and ONLY at _that_ time) are different
between 'http' and 'sql' ('local' works like 'sql', btw).

The facts: searching for "Drunken Master", the IMDb web server
returns a list of links with the _original title_, no matter what
you've searched for (possibly listing some "akas", but these can't
be taken into consideration or handled easily).  So you end up with
a Movie instance titled "Jui kuen", in the results list.

With 'sql', we have to scan both the list of original titles and
the list of akas and build the results list with the matching
titles.
I've chosen - for titles in the akas table - to keep _that_ title,
and not the original title it refers to.
So, this time, you have a "Drunken Master" entry in the results list.
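
To make the difference concrete, here is a toy model of the 'sql'
style of search.  It is NOT the actual IMDbPY code: the function
name, the data layout and the movie ID 1 are all invented for the
illustration, and the real matching is more elaborate than a
substring test.

```python
def search_titles(query, titles, akas):
    """Toy model of an 'sql'-style search (hypothetical, not IMDbPY code).

    titles: dict mapping a movie ID to its original title.
    akas:   list of (movie ID, aka title) pairs.
    Returns (movie ID, displayed title) pairs; a match found among the
    akas keeps the aka as its displayed title.
    """
    needle = query.lower()
    results = []
    # Matches on the original titles are listed under the original title.
    for movie_id, title in titles.items():
        if needle in title.lower():
            results.append((movie_id, title))
    # Matches on the akas are listed under the aka itself.
    for movie_id, aka in akas:
        if needle in aka.lower():
            results.append((movie_id, aka))
    return results


# Made-up data: ID 1 is a placeholder, not a real movieID.
titles = {1: "Jui kuen"}
akas = [(1, "Drunken Master")]
print(search_titles("drunken master", titles, akas))
```

The entry in the results list carries the aka, while the web
server would have listed the same movie under "Jui kuen".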

But wait... once you call the 'update' method on a Movie instance,
the _original title_ is set, and the aka ends up in the akas list.
You can see this behavior in your code, too, if you print again
the title _after_ i2.update(m).
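
A minimal sketch of that check, assuming only that the access
object offers search_movie() and update() and that a Movie can be
read with ['title'].  The helper name is made up for this mail, and
picking the first result is just a simplification.

```python
def titles_before_and_after_update(ia, query):
    """Return (title in the results list, title after update).

    'ia' is any IMDbPY-style access object; the helper itself is
    hypothetical and not part of the package.
    """
    results = ia.search_movie(query)
    if not results:
        return None
    movie = results[0]          # simplification: take the first hit
    before = movie['title']     # as shown in the results list
    ia.update(movie)            # fetch the main information
    after = movie['title']      # the original title is now set
    return before, after


# Possible usage (needs IMDbPY plus a network connection or a database):
#   from imdb import IMDb
#   print(titles_before_and_after_update(IMDb('http'), 'Drunken Master'))
```

With 'sql' the two values may differ when the match came from the
akas table; with 'http' they should be the same.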

The same also applies to people names.
While I think the result lists of 'sql/local' are a little more
useful, it's true that this is an important difference from
the 'http' interface, and I'm considering a "fix"
for 4.0.

[THE "AKAS WITH YEARS" ISSUE]
For "Jui kuen (1978)", in the akas list you get
"Challenge::(India: English title)" from the web and
"Challenge (1978)::(India: English title)" from 'sql' (and 'local').

The facts: when you _provide_ akas data to IMDb, their policy states
that you must specify the year for the AKA if and only if it was
released in that country with the given title in a year that is not
the same as the production year.

Now... I'm pretty sure that they previously listed these years
alongside the akas on their web pages, even if that no longer seems
to be the case.
The plain text data files, however, still list the year for every
aka for most (I'd say all) of the entries in the aka-titles.list.gz
file.

Honestly I don't consider it a bug of any sort; if it bothers
you, you can easily strip the year off (btw, it makes me wonder
if we can make the helpers.makeTextNotes function able to take
a callable or two, to easily manage a case like this one).
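
One possible way to strip the year, as a sketch: it assumes the
aka strings have the "title::note" form shown above and that the
year, when present, is the last thing in the title part.  The
function name and the regular expression are mine, not something
IMDbPY provides.

```python
import re

# Trailing "(1978)", "(1978/II)" or "(????)" at the end of the title part.
_YEAR_SUFFIX = re.compile(r'\s*\((?:\d{4}|\?{4})(?:/[IVXLCDM]+)?\)\s*$')


def strip_aka_year(aka):
    """Remove a trailing year from the title part of a 'title::note' aka."""
    # Only the part before '::' is touched, so a year that happens to
    # appear inside the note is left alone.
    title, sep, note = aka.partition('::')
    return _YEAR_SUFFIX.sub('', title) + sep + note


print(strip_aka_year('Challenge (1978)::(India: English title)'))
# prints: Challenge::(India: English title)
```

An aka without a year passes through unchanged, so the same
function can be applied to the results of both interfaces.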

> Is this a bug due to me loading the data with 3.6 and if I reloaded
> with a current version, I'd get more consistent output?

As you can guess now, no.

> Is it a bug in sqlite, and if I used a different DB I'd see
> different results?

Not at all - it's a design flaw (as long as you consider it a flaw
and not a nice feature! ;-)

> It took me over 24 hours to load the data into sqlite on my machine,
> most of it indexing.

Unfortunately SQLite is really slow for our needs. :-/
Consider switching to another database server (MySQL is really fast
when used the way the imdbpy2sql.py script uses it).

> In doing some research, I see that building indexes is known to
> be slow with sqlite, maybe a feature request could be to have an
> option to choose minimal or no indexes with sqlite.  For what _I_
> plan on doing, I won't need most of the indexes and could probably
> do without most of them.

You could comment out the 'createIndexes' call in imdbpy2sql.py.
Before I can consider the introduction of an option to switch off
indexes, I'd prefer to have some performance tests: if the database
is not usable without the indexes (for a "normal" use), it doesn't
make much sense.
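
The kind of test I have in mind could start from something like
the sketch below.  It uses an in-memory SQLite database with a
made-up one-table schema (the table, column and index names are
assumptions, not the schema built by imdbpy2sql.py) and a tiny
number of rows, so the timings only hint at the trend.

```python
import sqlite3
import time


def time_lookups(with_index, rows=20000, lookups=200):
    """Time exact-match title lookups with or without an index.

    Hypothetical schema: a single 'title' table, for illustration only.
    Returns (number of rows found, elapsed seconds).
    """
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE title (id INTEGER PRIMARY KEY, title TEXT)')
    conn.executemany('INSERT INTO title (title) VALUES (?)',
                     (('Movie %d' % n,) for n in range(rows)))
    if with_index:
        conn.execute('CREATE INDEX idx_title ON title (title)')
    start = time.perf_counter()
    found = 0
    for n in range(lookups):
        wanted = 'Movie %d' % ((n * 7) % rows)
        cur = conn.execute('SELECT id FROM title WHERE title = ?', (wanted,))
        found += len(cur.fetchall())
    elapsed = time.perf_counter() - start
    conn.close()
    return found, elapsed


for flag in (False, True):
    found, elapsed = time_lookups(flag)
    print('index=%s: %d rows found in %.4f s' % (flag, found, elapsed))
```

A meaningful test would of course run the queries IMDbPY really
issues against a fully loaded database.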

> Then again, I may just brush the dust off of my postgres server
> and actually set up something there.

The best solution for sure. :-)

Thank you very much for the bug report and the code to test it!

-- 
Davide Alberani <davide.alber...@gmail.com> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/
