The above notwithstanding, I've spent a few days hacking video_parser.py in order to get my library recognised. It's an assortment of anime, TV and movies from various sources, and **probably** typical of most users' content. With version 1.05 only 10-20% of the content was recognised and added to the movie/TV show libraries (with 80% uncategorised); now around 90% is categorised, and most of the remaining problematic content is problematic for understandable reasons. I'd like to submit a patch at some point (I need to tidy my code, and I wouldn't mind knocking a few of the outlying cases on the head first) - is it acceptable to submit patches to this list rather than going the bzr route?

Hi Lee,
I'm answering only this part of your email as I see other people have already addressed most of your points in other emails.

I briefly worked on the media scanner myself a few weeks ago, here at Fluendo. We recognized that the media scanner could be improved and started to do some work on it. However, we quickly realized that any change to the media scanner carries a very high risk of introducing regressions, i.e. improving the recognition of some files while failing to recognize files that were recognized before.

Because of that, we decided to put all work on the media scanner on hold until we can write a comprehensive battery of unit tests that lets us check the scanner for regressions. I think there's some time scheduled to create these unit tests somewhere in the coming few months, but at the moment I'm not sure exactly when. It would actually help if you could send me (in private, if you prefer) a list of the file names that you tested the media scanner on. It will help us build the unit tests when we eventually get to doing that.
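For what it's worth, such a regression battery could be as simple as a table of known filenames with their expected (show, season, episode) results. A minimal sketch of the idea; `parse_filename` here is just an illustrative stand-in, not the real moovida scanner API, which the actual tests would import instead:

```python
import re
import unittest

# Illustrative stand-in for the real scanner entry point.
_SEE = re.compile(r"^(.+?)[. _-]+S(\d+)E(\d+)", re.I)

def parse_filename(name):
    """Return (show, season, episode) or None if not recognized."""
    m = _SEE.search(name)
    if m is None:
        return None
    show = m.group(1).replace(".", " ").strip()
    return (show, int(m.group(2)), int(m.group(3)))

class TestScannerRegressions(unittest.TestCase):
    # Filenames known to be recognized correctly today; any future
    # change to the scanner must keep all of these passing.
    KNOWN = [
        ("Day.Break.S01E03.HDTV.XviD-XOR.avi", ("Day Break", 1, 3)),
        ("random.movie.1998.avi", None),  # a year, not season/episode
    ]

    def test_known_filenames(self):
        for filename, expected in self.KNOWN:
            self.assertEqual(parse_filename(filename), expected, filename)
```

Running a suite like this before and after every parser change would flag regressions immediately.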

That said, we already have another community member (in CC) who did some work on the media scanner as well [1][2]. It would be interesting to try to merge the stuff that I did, the stuff that he did and the stuff that you did, and then see if any of you can contribute the unit tests as well. After that we will be more than happy to get it all reviewed and committed.

Clearly it's not exactly a trivial effort we're talking about here, so I understand if you'd rather just wait for us to do it. Still, you can find the code from the other contributor in the links at the bottom, and my code is attached as well, in case you want to do something with it or just look at it for ideas.

Speaking of which, if you look at my code you will see it's basically stand-alone. I ripped the scanner out of the moovida code so I could test it more easily (it's all very crude, of course, as it's just a work in progress that got shelved temporarily).

But that got me (and other people here) thinking about actually having the media scanner as a totally separate Python library, so that it can be used not only by moovida but by other projects as well (besides the obvious advantage of being much more easily testable, etc.).

Mind you, we didn't go much further on this than brainstorming some ideas (e.g. pluggable web-based helpers to search imdb, tmdb or similar sites, pluggable extra filters and recognizers, etc.). So some more discussion about this may be helpful for everyone interested in improving the media scanner, and maybe someone will want to pick up the idea and run with it for a while.
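To make the brainstorm a bit more concrete, here is one possible shape for the pluggable-recognizer idea. Every name below is invented for illustration; nothing like this exists in moovida (or anywhere else) yet:

```python
import re

class SeasonEpisodeRecognizer(object):
    """One pluggable recognizer: extracts show/season/episode from
    'Show.S01E03.*' style filenames."""
    _pattern = re.compile(r"^(.+?)[. _-]+S(\d+)E(\d+)", re.I)

    def recognize(self, filename):
        m = self._pattern.search(filename)
        if m is None:
            return None
        return {"show": m.group(1).replace(".", " ").strip(),
                "season": int(m.group(2)),
                "episode": int(m.group(3))}

class MediaScanner(object):
    """Library core: simply asks each registered plugin in turn."""
    def __init__(self):
        self._recognizers = []

    def register(self, recognizer):
        self._recognizers.append(recognizer)

    def scan(self, filename):
        for recognizer in self._recognizers:
            result = recognizer.recognize(filename)
            if result is not None:
                return result
        return None

scanner = MediaScanner()
scanner.register(SeasonEpisodeRecognizer())
```

A web-based helper (imdb/tmdb lookup) would then just be another recognizer, perhaps consulted only when the filename-based ones fail.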

To me a well designed and reusable media scanner library seems like an interesting "summer" project. Hell, if we were participating in Google's Summer of Code this year I would have put it out as a student project for sure. But maybe even without that there's someone interested in looking at it (when they're not spending their time at the beach ;)).

Cheers,
--
Ugo

[1] https://www.moovida.com/quality/review/request/%[email protected]%3e

[2] https://code.launchpad.net/~mattbrown/elisa/bugfixes
import os
import re

class Test:
    # These match patterns similar to S05E06 and several variations of the
    # same theme (ex: S03-E01 or Season3, Episode1, or [s3]_[e1] etc.)
    # Defined separately to keep regular expressions more readable (i hope)
    _parens_se_pat = r"[\(\[]?\s*(?:s|se|season)\s*([0-9]+)\s*[\]\)]?"
    _parens_se_pat_noncap = r"[\(\[]?\s*(?:s|se|season)\s*(?:[0-9]+)\s*[\]\)]?"
    _parens_ep_pat = r"[\(\[]?\s*(?:e|ep|episode)\s*([0-9]+)\s*[\]\)]?"

    # These patterns cover the cases where all the episode information can be
    # found in the file name, without going to look into the path.

    # Note that patterns below don't cover the cases like
    # lost[s3][e2].avi where we don't have spaces between title and
    # the (possibly parenthesized) se/ep list.
    # It's a pretty dumb naming scheme but we may want to cover it anyway later.
    _tvshow_patterns =  [
                        # ex: lost.[s3]_[e5].hdtv-lol.avi
                        # ex: Chuck - Season 1, Episode 04 [WS DVDRip Xvid].avi
                        # ex: Day.Break.S01E03.HDTV.XviD-XOR.avi
                        r"^(.+)\s+" + _parens_se_pat + r"\s*" +
                        _parens_ep_pat + r"\s*(.*)$",

                        # ex: Dexter - 01x06 (HDTV-LOL) Return to Sender.avi
                        r"^(.+)\s*([0-9]+)x([0-9]+)\s*(.*)$",

                        # ex: lost.305.hdtv-lol.avi
                        # Note that this is assuming at least 3 digits, of which
                        # the last 2 are considered the episode number and the
                        # others the season number. However sometimes this is
                        # just a sequence number (e.g. for anime).
                        # FIXME: one way to fix this is to examine the files in
                        # the same dir and see if they fit the sequence number
                        # pattern or the se/ep pattern. MythTV code has some
                        # implementation of this we can borrow or learn from.
                        r"^(.+)\s+([0-9]+)([0-9][0-9])\s+(.*)$",
                        ]

    _tvshow_regexes = [re.compile(pattern, re.I) \
                       for pattern in _tvshow_patterns]
    
    # These patterns cover the same cases as the _tvshow_patterns, but they
    # don't try to capture the show name and season, because this information
    # will be retrieved from the path. However, they do try to match the season
    # as part of the pattern, when available, to decrease the likelihood of
    # false matches.
    _tvshow_path_patterns = [
                # ex: lost.[s3]_[e5].hdtv-lol.avi
                # ex: The IT Crowd - Season 1, Episode 04 [WS DVDRip Xvid].avi
                # ex: Day.Break.S01E03.HDTV.XviD-XOR.avi
                r".*(?:" + _parens_se_pat_noncap + r"[-_\s\.,]?)?" +
                _parens_ep_pat + r"(.*)$",
  
                # ex: Dexter - 01x06 (HDTV-LOL) Return to Sender.avi
                r".*[0-9]+x([0-9]+)(.*)$",
  
                # ex: lost.305.hdtv-lol.avi
                # please note that in this case the number almost surely is not
                # in "SEE" format but it's just a sequential episode number, so
                # we match it entirely.
                # To avoid false matches with basically anything with a number
                # in the title, we force the number to be at the start of the
                # filename.
                r"([0-9]+)\s+(.*)$",
           ]

    _base_path_pat = r"(.*)" + os.sep + r"(?:season|se|s)\s*([0-9]+)" + os.sep
    _tvshow_path_regexes = [re.compile(_base_path_pat + pattern, re.I) \
                            for pattern in _tvshow_path_patterns]

    _prematch_discard = re.compile(r"[-_\.,]")
    
    def __init__(self):
        self._postmatch_filters = [self.post_match_year]
        # Keep the result lists per-instance; as class attributes they
        # would be shared between all instances.
        self._ok = []
        self._fail = []
    
    def print_result(self, m, f):
        if len(m.groups()) > 4:
            return "TOO_MANY_GROUPS|%s|%s" % (m.groups(), f)
        
        show_name, season_nb, episode_nb, remain = m.groups()
        se = int(season_nb, 10)
        ep = int(episode_nb, 10)
        return "%s|%s|%s|%s|%s" % (show_name, se, ep, remain, f)
    
    def descend(self, dir, output=True, level=0):
        pad = '    ' * level
            
        for f in os.listdir(dir):
            p = os.path.join(dir, f)
            if os.path.isdir(p):
                if output:
                    print pad + "[%s]" % f
                self.descend(p, output, level + 1)
            else:
                m = self.process(p, f)
                if m is None:
                    self._fail.append(p)
                    if output:
                        print pad + "FAIL: %s" % f
                else:
                    self._ok.append((p, m))
                    if output:
                        print pad + self.print_result(m, f)
            
    def process(self, dir, fname):
        text = self.path_tail(dir, 1)
        text = self.clean_video_name_prematch(text)
        
        for rx in self._tvshow_regexes:
            matches = rx.search(text)
            if matches is not None:
                if self.apply_postmatch_filters(dir, matches):
                    return matches
        
        text = self.path_tail(dir, 3)
        text = self.clean_video_name_prematch(text)
        for rx in self._tvshow_path_regexes:
            matches = rx.search(text)
            if matches is not None:
                if self.apply_postmatch_filters(dir, matches):
                    return matches
            
        return None
    
    def path_tail(self, filepath, components):
        """
        Returns the rightmost N components of filepath (joined as a valid path)
        For example path_tail('/videos/tv/lost/season1/01-pilot.avi', 3) will
        return 'lost/season1/01-pilot.avi'
        """
        return os.sep.join(filepath.split(os.sep)[-components:])
        
    def clean_video_name_prematch(self, video_name):
        """
        Remove file extension then replace a set characters often used in
        downloaded filenames in place of spaces (such as . and _) with spaces.
        Also replace common separators (such as - and ,) with spaces.
        """
        extpos = video_name.rfind(os.extsep)
        # str.rfind returns -1 (not None) when the separator is absent
        if extpos != -1:
            video_name = video_name[:extpos]

        return self._prematch_discard.sub(" ", video_name)
        
    def apply_postmatch_filters(self, path, matches):
        """
        Apply all the post-match filters and return True only if all filters
        return True (i.e. all filters approve the match).
        """
        for filt in self._postmatch_filters:
            if not filt(path, matches):
                return False
        return True
            
    def post_match_year(self, path, match):
        """
        This filter checks the result of the regexps that match serial
        numbers. When those regexes are applied to movie titles containing a
        year, they wrongly categorize the movie as a TV show: for example,
        1998 can be read as season 19, episode 98.
        This filter doesn't allow anything of 1900 or above to be considered
        a TV show, since TV shows rarely have more than 18 seasons, while most
        digitally available movies are dated after the year 1900.
        """
        if match is None or len(match.groups()) != 4:
            print "!!! NO MATCH or WRONG MATCH !!!"
            return False
        
        _, season, episode, _ = match.groups()
        year = int(season + episode, 10)
                
        return year < 1900
    
t = Test()
t.descend('/home/uriboni/tmp/video/film', False)
print "----------------- SUCCESS: -----------------------"
for ok in t._ok:
    filepath, matches = ok
    print t.print_result(matches, filepath)
    
print "----------------- FAILURE: ------------------------"
for fail in t._fail:
    print fail
    
print "----------------- STATS: -------------------------"
print "OK:", len(t._ok)
print "FAIL:", len(t._fail)
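The FIXME in the code above, about telling plain sequence numbers (common for anime) apart from "SEE"-style season/episode numbers, could be attacked by looking at the sibling files in the same directory, along the lines of what the MythTV code reportedly does: if the numbers form a dense run starting near 1, they are almost certainly plain sequence numbers. A rough, untested sketch of that heuristic (the function and thresholds are mine, not MythTV's):

```python
import re

_NUMBER = re.compile(r"(\d{2,})")

def looks_like_sequence(filenames):
    """Heuristic: True if the first number found in each sibling file
    forms a mostly consecutive run starting near 1, which suggests
    plain sequence numbering rather than SEE season/episode codes."""
    numbers = []
    for name in filenames:
        m = _NUMBER.search(name)
        if m is not None:
            numbers.append(int(m.group(1)))
    if len(numbers) < 3:
        return False  # not enough siblings to decide either way
    numbers.sort()
    gaps = [b - a for a, b in zip(numbers, numbers[1:])]
    # Starts low and has no big jumps => looks like sequence numbers
    return numbers[0] <= 2 and all(g <= 2 for g in gaps)
```

So a directory of naruto.01.avi ... naruto.26.avi would be flagged as sequentially numbered, while lost.101.avi, lost.102.avi, lost.201.avi (jumping from 102 to 201) would not.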

    
