[Freevo-devel] Re: IMDB results
Lars Eggert wrote:
> Lars Eggert wrote:
>> I've made a small change locally that (1) throws out any non-word
>> characters from the name (\W) and (2) throws out any
>> single-character words from the name. This seems to produce much
>> better matches.
>>
>> In the example above, it would search for "dvd fellowship ext" and
>> find it, instead of searching for "dvd [fellowship ext d 1]".
>
> I should add that the idea here is to feed imdb more significant words
> for searching, instead of interpreting the matches it returns. Their
> search algorithm doesn't seem to be too smart about weighing terms.
>
> For the same reason, it may make sense to strip other common short
> words (in, the, for, not, a, an, of, etc.) from the search string.

You mean IMDB_REMOVE_FROM_SEARCHSTRING? Already there. There is also
IMDB_REMOVE_FROM_LABEL. It included season[0-9] and disc[0-9]; I added
d[0-9].

I also checked in a new fxdimdb.py. When building the search string, it
removes all one-letter words (but not numbers, we may need them). Then
it searches. If the result list is too long, it tries to remove some
entries based on the words.

Example: 'fellowship ext d 1' will be searched as 'fellowship ext 1'.
We get too many results, and all results without 'fellowship' or 'ext'
will be ignored. The end result is a list of 4 choices.

Dischi

-- 
Conversation, n.: A vocal competition in which the one who is catching
his breath is called the listener.

---
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven. http://thinkgeek.com/sf
___
Freevo-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/freevo-devel
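The two steps Dischi describes (drop one-letter words but keep numbers, then prune an overlong result list by keyword) can be sketched roughly as follows. This is not the code from fxdimdb.py: the function names are hypothetical, and the threshold of 10 is an assumption borrowed from the earlier proposal in this thread.

```python
import re

# Assumed threshold; the message only says "too long".
MAX_RESULTS = 10


def build_search_string(label):
    """Drop one-letter words, but keep single digits (e.g. the 5 in 'Babylon 5')."""
    words = label.split()
    return ' '.join(w for w in words if len(w) > 1 or w.isdigit())


def prune_results(titles, search_string, max_results=MAX_RESULTS):
    """If the list is too long, keep only titles containing a non-numeric search word."""
    if len(titles) <= max_results:
        return titles
    keywords = [w.lower() for w in search_string.split() if not w.isdigit()]
    if not keywords:
        # Nothing to filter on (the search string was only numbers).
        return titles
    kept = []
    for title in titles:
        title_words = re.findall(r'\w+', title.lower())
        if any(k in title_words for k in keywords):
            kept.append(title)
    return kept
```

With the example from the message, `build_search_string('fellowship ext d 1')` gives `'fellowship ext 1'`, and pruning then discards any title that matched only on the `1`.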
[Freevo-devel] Re: IMDB results
I like your idea, Lars. It's a good thing to reduce the amount of
network traffic we need. Thousands of matches are no more useful than
too few, and a shorter list involves less traffic, so the overall
benefit is greater.

Send the patch and I'll take a look...

Thanks,
Aubin

On Wed, Aug 27, 2003 at 08:29:31AM -0700, Lars Eggert wrote:
> Dirk Meyer wrote:
> >
> > We search for the label, split on the _. So we search for 'fellowship
> > ext 1'. The 1 gives us many results we don't want. But for other cases,
> > we need the numbers (Babylon 5, or sequels).
> >
> > The big question now: how can we make it produce better results? What
> > about:
> >
> > 1. search like we search now, the list may be long
> > 2. if the number of returned items is greater than 10, remove all
> >    titles which don't include at least one _word_. So results without
> >    'fellowship' or 'ext' (only containing 1) will be deleted.
> > 3. sort the results:
> >    a) Most popular searches to the top
> >    b) Inside the two areas (popular and not so popular), sort by number
> >       of matched words: each word in the title and not in the search
> >       string hitpoint--, each search word in the title hitpoint += 5.
> >
> > What do you think?
>
> I've made a small change locally that (1) throws out any non-word
> characters from the name (\W) and (2) throws out any single-character
> words from the name. This seems to produce much better matches.
>
> In the example above, it would search for "dvd fellowship ext" and find
> it, instead of searching for "dvd [fellowship ext d 1]".
>
> Lars
> --
> Lars Eggert <[EMAIL PROTECTED]> USC Information Sciences Institute
Re: [Freevo-devel] Re: IMDB results
Lars Eggert wrote:
> I've made a small change locally that (1) throws out any non-word
> characters from the name (\W) and (2) throws out any single-character
> words from the name. This seems to produce much better matches.
>
> In the example above, it would search for "dvd fellowship ext" and find
> it, instead of searching for "dvd [fellowship ext d 1]".

I should add that the idea here is to feed imdb more significant words
for searching, instead of interpreting the matches it returns. Their
search algorithm doesn't seem to be too smart about weighing terms.

For the same reason, it may make sense to strip other common short
words (in, the, for, not, a, an, of, etc.) from the search string.

Lars
-- 
Lars Eggert <[EMAIL PROTECTED]> USC Information Sciences Institute
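A minimal sketch of the cleanup described in this message, assuming it is applied to the raw name before the search. The function name and the `STOP_WORDS` set are hypothetical; the set simply lists the examples given above and is not Freevo's actual configuration.

```python
import re

# Illustrative stop words, copied from the examples in the message.
STOP_WORDS = {'in', 'the', 'for', 'not', 'a', 'an', 'of'}


def significant_words(name, stop_words=STOP_WORDS):
    """Return the words of `name` that are worth sending to the search."""
    # (1) Replace every non-word character (\W) with a space.
    cleaned = re.sub(r'\W', ' ', name)
    words = cleaned.lower().split()
    # (2) Drop single-character words, then drop common short words.
    return [w for w in words if len(w) > 1 and w not in stop_words]
```

Note that `\W` treats the underscore as a word character, so a label such as `fellowship_ext` would need to be split on `_` first.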
[Freevo-devel] Re: IMDB results
Dirk Meyer wrote:
> We search for the label, split on the _. So we search for 'fellowship
> ext 1'. The 1 gives us many results we don't want. But for other cases,
> we need the numbers (Babylon 5, or sequels).
>
> The big question now: how can we make it produce better results? What
> about:
>
> 1. search like we search now, the list may be long
> 2. if the number of returned items is greater than 10, remove all
>    titles which don't include at least one _word_. So results without
>    'fellowship' or 'ext' (only containing 1) will be deleted.
> 3. sort the results:
>    a) Most popular searches to the top
>    b) Inside the two areas (popular and not so popular), sort by number
>       of matched words: each word in the title and not in the search
>       string hitpoint--, each search word in the title hitpoint += 5.
>
> What do you think?

I've made a small change locally that (1) throws out any non-word
characters from the name (\W) and (2) throws out any single-character
words from the name. This seems to produce much better matches.

In the example above, it would search for "dvd fellowship ext" and find
it, instead of searching for "dvd [fellowship ext d 1]".

Lars
-- 
Lars Eggert <[EMAIL PROTECTED]> USC Information Sciences Institute
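The hitpoint scheme in step 3b of the quoted proposal could look something like the sketch below. The function names are hypothetical, and the popular / not-so-popular split from step 3a is left out because it depends on how the results come back from the site.

```python
import re


def hitpoints(title, search_words):
    """Score a title: +5 per search word it contains, -1 per extra title word."""
    title_words = re.findall(r'\w+', title.lower())
    wanted = {w.lower() for w in search_words}
    score = 0
    for word in wanted:
        if word in title_words:
            score += 5
    for word in title_words:
        if word not in wanted:
            score -= 1
    return score


def rank(titles, search_words):
    """Sort titles so that the highest-scoring ones come first."""
    return sorted(titles, key=lambda t: hitpoints(t, search_words), reverse=True)
```

A bare number still earns 5 points under this scheme, so a short title that matches only on the `1` can outrank a long title that matches on 'fellowship'. That is presumably why the proposal prunes such titles in step 2 before sorting.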
[Freevo-devel] Re: IMDB results (was: [PATCH] third set of FreeBSD patches, also for mmpython)
On Wed, Aug 27, 2003 at 11:34:49AM +0200, Dirk Meyer wrote:
> The big question now: how can we make it produce better results? What
> about:
>
> 1. search like we search now, the list may be long
> 2. if the number of returned items is greater than 10, remove all
>    titles which don't include at least one _word_. So results without
>    'fellowship' or 'ext' (only containing 1) will be deleted.
> 3. sort the results:
>    a) Most popular searches to the top
>    b) Inside the two areas (popular and not so popular), sort by number
>       of matched words: each word in the title and not in the search
>       string hitpoint--, each search word in the title hitpoint += 5.

If we refilter the results, it might make some sense to just compare
the returned values to the search string using some sort of fuzzy match
(i.e. a title with 75% of its characters in common ranks above one with
50% of its characters in common).

I haven't written a fuzzy match since 2nd year computer science, and it
was in Pascal, so it'll take a while to remember it :)
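One way to get a fuzzy match without writing one from scratch is Python's standard `difflib` module. The sketch below uses `SequenceMatcher.ratio()` as a stand-in for the "percentage of characters in common" idea; it measures matching subsequences rather than a plain character count, so it is an approximation of what the message describes. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a value in [0, 1]; higher means the strings share more characters in order."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sort_by_similarity(titles, query):
    """Put the titles that look most like the query first."""
    return sorted(titles, key=lambda t: similarity(t, query), reverse=True)
```

Calling `sort_by_similarity(results, 'fellowship ext')` would then push titles containing 'fellowship' ahead of titles that matched only on an unrelated term.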