[Freevo-devel] Re: IMDB results (was: [PATCH] third set of FreeBSD patches, also for mmpython)

2003-08-27 Thread Aubin Paul
On Wed, Aug 27, 2003 at 11:34:49AM +0200, Dirk Meyer wrote:
> The big question now: how can we make it produce better results? What
> about:
>
> 1. search like we search now, the list may be long
> 2. if the number of returned items is greater than 10, remove all
>    titles which don't include at least one _word_. So results without
>    'fellowship' or 'ext' (only containing 1) will be deleted.
> 3. sort the results:
>    a) Most popular searches to the top
>    b) Inside the two areas (popular and not so popular), sort by number
>       of matched words: each word in the title and not in the search
>       string hitpoint--, each search word in the title hitpoint += 5.
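
The scoring rule in step 3b can be sketched roughly as follows. This is
only an illustration, not code from Freevo: the function names
hitpoints and sort_by_hitpoints are made up here, and the popular /
not-so-popular split from step 3a is left out.

```python
import re

def hitpoints(title, search_words):
    """Score one title against the search words (hypothetical helper)."""
    title_words = re.findall(r"\w+", title.lower())
    wanted = {w.lower() for w in search_words}
    score = 0
    # each search word that appears in the title: hitpoint += 5
    score += 5 * sum(1 for w in wanted if w in title_words)
    # each title word that is not in the search string: hitpoint--
    score -= sum(1 for w in title_words if w not in wanted)
    return score

def sort_by_hitpoints(titles, search_words):
    """Return the titles ordered from highest to lowest score."""
    return sorted(titles, key=lambda t: hitpoints(t, search_words), reverse=True)
```

With this rule a short title that matches a search word beats a long
title that matches the same word, because every extra title word costs
one point.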

If we refilter the results, it might make some sense; just compare the
returned values to the string using some sort of fuzzy match (i.e. 75%
of the characters in common ranks above 50% of the characters in
common).

I haven't written a fuzzy match since 2nd year computer science, and
it was in Pascal, so it'll take a while to remember it :)
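
One way such a fuzzy refilter might look in Python, using the standard
difflib module instead of a hand-written matcher. This is a sketch
under assumptions: the names similarity and refilter and the 0.5
cut-off are invented for illustration, and SequenceMatcher.ratio() is
a similarity measure based on matching blocks, not literally "percent
of characters in common".

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def refilter(titles, query, threshold=0.5):
    """Drop titles below the threshold and sort the rest, best match first."""
    scored = [(similarity(title, query), title) for title in titles]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in kept]
```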


---
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
___
Freevo-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/freevo-devel


[Freevo-devel] Re: IMDB results

2003-08-27 Thread Lars Eggert
Dirk Meyer wrote:
> We search for the label, split on the _. So we search for 'fellowship
> ext 1'. The 1 gives us many results we don't want. But for other cases,
> we need the numbers (Babylon 5, or sequels).
>
> The big question now: how can we make it produce better results? What
> about:
> 1. search like we search now, the list may be long
> 2. if the number of returned items is greater than 10, remove all
>    titles which don't include at least one _word_. So results without
>    'fellowship' or 'ext' (only containing 1) will be deleted.
> 3. sort the results:
>    a) Most popular searches to the top
>    b) Inside the two areas (popular and not so popular), sort by number
>       of matched words: each word in the title and not in the search
>       string hitpoint--, each search word in the title hitpoint += 5.
> What do you think?

I've made a small change locally that (1) throws out any non-word 
characters from the name (\W) and (2) throws out any single-character 
words from the name. This seems to produce much better matches.

In the example above, it would search for dvd fellowship ext and find
it, instead of searching for dvd [fellowship ext d 1].
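
The two clean-up steps described above could look roughly like this.
It is a sketch, not the actual local change: the function name
clean_name is hypothetical, and the name is assumed to have already
been split on underscores, since \W does not match '_'.

```python
import re

def clean_name(name):
    """Replace non-word characters with spaces and drop one-character words."""
    # (1) throw out any non-word characters (\W)
    words = re.sub(r"\W+", " ", name).split()
    # (2) throw out any single-character words
    return " ".join(word for word in words if len(word) > 1)
```

Note that step (2) also drops single digits, so a name like
'Babylon 5' would lose its number with this version.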

Lars
--
Lars Eggert [EMAIL PROTECTED]   USC Information Sciences Institute




Re: [Freevo-devel] Re: IMDB results

2003-08-27 Thread Lars Eggert
Lars Eggert wrote:
> I've made a small change locally that (1) throws out any non-word
> characters from the name (\W) and (2) throws out any single-character
> words from the name. This seems to produce much better matches.
>
> In the example above, it would search for dvd fellowship ext and find
> it, instead of searching for dvd [fellowship ext d 1].

I should add that the idea here is to feed imdb more significant words 
for searching, instead of interpreting the matches it returns. Their 
search algorithm doesn't seem to be too smart about weighing terms.

For the same reason, it may make sense to strip other common short words
(in, the, for, not, a, an, of, etc.) from the search string.
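
Stripping such words is a small filter. The sketch below is only
illustrative: STOP_WORDS and strip_stop_words are made-up names, and
the word list is just the examples given above.

```python
# Example list taken from the words mentioned above; extend as needed.
STOP_WORDS = {"in", "the", "for", "not", "a", "an", "of"}

def strip_stop_words(search_string):
    """Remove common short words from a search string, ignoring case."""
    kept = [w for w in search_string.split() if w.lower() not in STOP_WORDS]
    return " ".join(kept)
```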

Lars
--
Lars Eggert [EMAIL PROTECTED]   USC Information Sciences Institute




[Freevo-devel] Re: IMDB results

2003-08-27 Thread Aubin Paul
I like your idea, Lars. It's a good thing to reduce the amount of
network traffic we need. Thousands of matches are no more useful than
too few, and a shorter result list involves less traffic, so the
overall benefit is greater.

Send the patch and I'll take a look...

Thanks,

Aubin

On Wed, Aug 27, 2003 at 08:29:31AM -0700, Lars Eggert wrote:
> Dirk Meyer wrote:
> > We search for the label, split on the _. So we search for 'fellowship
> > ext 1'. The 1 gives us many results we don't want. But for other cases,
> > we need the numbers (Babylon 5, or sequels).
> >
> > The big question now: how can we make it produce better results? What
> > about:
> >
> > 1. search like we search now, the list may be long
> > 2. if the number of returned items is greater than 10, remove all
> >    titles which don't include at least one _word_. So results without
> >    'fellowship' or 'ext' (only containing 1) will be deleted.
> > 3. sort the results:
> >    a) Most popular searches to the top
> >    b) Inside the two areas (popular and not so popular), sort by number
> >       of matched words: each word in the title and not in the search
> >       string hitpoint--, each search word in the title hitpoint += 5.
> >
> > What do you think?
>
> I've made a small change locally that (1) throws out any non-word
> characters from the name (\W) and (2) throws out any single-character
> words from the name. This seems to produce much better matches.
>
> In the example above, it would search for dvd fellowship ext and find
> it, instead of searching for dvd [fellowship ext d 1].
>
> Lars
> --
> Lars Eggert [EMAIL PROTECTED]   USC Information Sciences Institute






[Freevo-devel] Re: IMDB results

2003-08-27 Thread Dirk Meyer
Lars Eggert wrote:
> Lars Eggert wrote:
> > I've made a small change locally that (1) throws out any non-word
> > characters from the name (\W) and (2) throws out any
> > single-character words from the name. This seems to produce much
> > better matches.
> > In the example above, it would search for dvd fellowship ext and
> > find it, instead of searching for dvd [fellowship ext d 1].
>
> I should add that the idea here is to feed imdb more significant words
> for searching, instead of interpreting the matches it returns. Their
> search algorithm doesn't seem to be too smart about weighing terms.
>
> For the same reason, it may make sense to strip other common short
> words (in, the, for, not, a, an, of, etc.) from the search string.

You mean IMDB_REMOVE_FROM_SEARCHSTRING? Already there. There is also
IMDB_REMOVE_FROM_LABEL. It included season[0-9] and disc[0-9]. I added
d[0-9].

I also checked in a new fxdimdb.py. When building the search string,
it removes all one-letter words (but not numbers, we may need them).
Then it searches. If the result list is too long, it tries to remove
some entries based on the words.

Example:

'fellowship ext d 1' will be searched as 'fellowship ext 1'. We get
too many results, and all results without 'fellowship' or 'ext' will be
ignored. The end result is a list of 4 choices.
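
The behaviour described above might be sketched like this. It is not
the code from fxdimdb.py: the function names are hypothetical, the
limit of 10 is taken from the earlier proposal in this thread and may
differ in the checked-in version, and result titles are assumed to be
plain strings.

```python
import re

def build_search_words(label_words):
    """Drop one-letter words but keep numbers such as '1' or '5'."""
    return [w for w in label_words if len(w) > 1 or w.isdigit()]

def prune_results(titles, search_words, max_results=10):
    """If there are too many results, keep only titles containing a real word."""
    if len(titles) <= max_results:
        return titles
    # Numbers do not count as words for this check.
    real_words = [w.lower() for w in search_words if not w.isdigit()]
    if not real_words:
        return titles
    kept = []
    for title in titles:
        title_words = set(re.findall(r"\w+", title.lower()))
        if any(word in title_words for word in real_words):
            kept.append(title)
    return kept
```

For the example, build_search_words turns the label words
'fellowship', 'ext', 'd', '1' into 'fellowship', 'ext', '1', and
prune_results then ignores every title that contains neither
'fellowship' nor 'ext' once the list is longer than the limit.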


Dischi

-- 
Conversation, n.:
A vocal competition in which the one who is catching his breath
is called the listener.

