The above notwithstanding, I've spent a few days hacking video_parser.py in order to get my library recognised. It's an assortment of anime, TV and movies from various sources, and **probably** typical of most users' content. With version 1.05 only 10-20% of the content was recognised and added to the movie/TV show libraries (with 80% uncategorised); now around 90% is categorised, and most of the remaining problematic content is problematic for understandable reasons. I'd like to submit a patch at some point (I need to tidy my code, and I wouldn't mind knocking a few of the outlying cases on the head first) - is it acceptable to submit patches to this list rather than going the bzr route?

Hi Lee,
I'm answering only this part of your email as I see other people have already addressed most of your points in other emails.

I briefly worked on the media scanner myself a few weeks ago, here at Fluendo. We recognized that the media scanner could be improved and started to do some work on it. However, we quickly realized that any change to the media scanner carries a very high risk of introducing regressions, i.e. improving the recognition of some files while failing to recognize files that were recognized before.

Because of that, we decided to put all work on the media scanner on hold until we can write a comprehensive battery of unit tests that lets us check the scanner for regressions. I think there's some time scheduled to create these unit tests somewhere in the coming few months, but at the moment I'm not sure exactly when. It would actually help if you could send me (in private, if you prefer) a list of the file names that you tested the media scanner on. It will help us build the unit tests when we eventually get to doing that.
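For what it's worth, such a regression battery could be as simple as a table of known filenames with their expected (show, season, episode) results. A minimal sketch of the idea; `parse_filename` here is just an illustrative stand-in, not the real moovida scanner API, which the actual tests would import instead:

```python
import re
import unittest

# Illustrative stand-in for the real scanner entry point.
_SEE = re.compile(r"^(.+?)[. _-]+S(\d+)E(\d+)", re.I)

def parse_filename(name):
    """Return (show, season, episode) or None if not recognized."""
    m = _SEE.search(name)
    if m is None:
        return None
    show = m.group(1).replace(".", " ").strip()
    return (show, int(m.group(2)), int(m.group(3)))

class TestScannerRegressions(unittest.TestCase):
    # Filenames known to be recognized correctly today; any future
    # change to the scanner must keep all of these passing.
    KNOWN = [
        ("Day.Break.S01E03.HDTV.XviD-XOR.avi", ("Day Break", 1, 3)),
        ("random.movie.1998.avi", None),  # a year, not season/episode
    ]

    def test_known_filenames(self):
        for filename, expected in self.KNOWN:
            self.assertEqual(parse_filename(filename), expected, filename)
```

Running a suite like this before and after every parser change would flag regressions immediately.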

That said, we already have another community member (in CC) who did some work on the media scanner as well [1][2]. It would be interesting to try to merge the stuff that I did, the stuff that he did and the stuff that you did, and then see if any of you can contribute the unit tests as well. After that we will be more than happy to get it all reviewed and committed.

Clearly it's not exactly a trivial effort we're talking about here, so I understand if you'd rather just wait for us to do it. Still, you can find the code from the other contributor in the links at the bottom, and my code is attached as well, in case you want to do something with it or just look at it for ideas.

Speaking of which, if you look at my code you will see it's basically stand-alone. I ripped the scanner out of the moovida code so I could test it more easily (it's all very crude, of course, as it's just a work in progress that got shelved temporarily).

But that got me (and other people here) thinking about actually having the media scanner as a totally separate Python library, so that it can be used not only by moovida but by other projects as well (besides the obvious advantage of being much more easily testable, etc.).

Mind you, we didn't go much further on this than brainstorming some ideas (e.g. pluggable web-based helpers to search imdb, tmdb or similar sites, pluggable extra filters and recognizers, etc.). So some more discussion about this may be helpful for everyone interested in improving the media scanner, and maybe someone will want to pick up the idea and run with it for a while.
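To make the brainstorm a bit more concrete, here is one possible shape for the pluggable-recognizer idea. Every name below is invented for illustration; nothing like this exists in moovida (or anywhere else) yet:

```python
import re

class SeasonEpisodeRecognizer(object):
    """One pluggable recognizer: extracts show/season/episode from
    'Show.S01E03.*' style filenames."""
    _pattern = re.compile(r"^(.+?)[. _-]+S(\d+)E(\d+)", re.I)

    def recognize(self, filename):
        m = self._pattern.search(filename)
        if m is None:
            return None
        return {"show": m.group(1).replace(".", " ").strip(),
                "season": int(m.group(2)),
                "episode": int(m.group(3))}

class MediaScanner(object):
    """Library core: simply asks each registered plugin in turn."""
    def __init__(self):
        self._recognizers = []

    def register(self, recognizer):
        self._recognizers.append(recognizer)

    def scan(self, filename):
        for recognizer in self._recognizers:
            result = recognizer.recognize(filename)
            if result is not None:
                return result
        return None

scanner = MediaScanner()
scanner.register(SeasonEpisodeRecognizer())
```

A web-based helper (imdb/tmdb lookup) would then just be another recognizer, perhaps consulted only when the filename-based ones fail.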

To me a well designed and reusable media scanner library seems like an interesting "summer" project. Hell, if we were participating in Google's Summer of Code this year I would have put it out as a student project for sure. But maybe even without that there's someone interested in looking at it (when they're not spending their time at the beach ;)).

Cheers,
--
Ugo

[1] https://www.moovida.com/quality/review/request/%[email protected]%3e

[2] https://code.launchpad.net/~mattbrown/elisa/bugfixes
import os
import re

class Test:
    # These match patterns similar to S05E06 and several variations of the
    # same theme (ex: S03-E01 or Season3, Episode1, or [s3]_[e1] etc.)
    # Defined separately to keep regular expressions more readable (i hope)
    _parens_se_pat = r"[\(\[]?\s*(?:s|se|season)\s*([0-9]+)\s*[\]\)]?"
    _parens_se_pat_noncap = r"[\(\[]?\s*(?:s|se|season)\s*(?:[0-9]+)\s*[\]\)]?"
    _parens_ep_pat = r"[\(\[]?\s*(?:e|ep|episode)\s*([0-9]+)\s*[\]\)]?"

    # These patterns cover the cases where all the episode information can be
    # found in the file name, without going to look into the path.

    # Note that patterns below don't cover the cases like
    # lost[s3][e2].avi where we don't have spaces between title and
    # the (possibly parenthesized) se/ep list.
    # It's a pretty dumb naming scheme but we may want to cover it anyway later.
    _tvshow_patterns =  [
                        # ex: lost.[s3]_[e5].hdtv-lol.avi
                        # ex: Chuck - Season 1, Episode 04 [WS DVDRip Xvid].avi
                        # ex: Day.Break.S01E03.HDTV.XviD-XOR.avi
                        r"^(.+)\s+" + _parens_se_pat + r"\s*" +
                        _parens_ep_pat + r"\s*(.*)$",

                        # ex: Dexter - 01x06 (HDTV-LOL) Return to Sender.avi
                        r"^(.+)\s*([0-9]+)x([0-9]+)\s*(.*)$",

                        # ex: lost.305.hdtv-lol.avi
                        # Note that this is assuming at least 3 digits, of which
                        # the last 2 are considered the episode number and the
                        # others the season number. However sometimes this is
                        # just a sequence number (e.g. for anime).
                        # FIXME: one way to fix this is to examine the files in
                        # the same dir and see if they fit the sequence number
                        # pattern or the se/ep pattern. MythTV code has some
                        # implementation of this we can borrow or learn from.
                        r"^(.+)\s+([0-9]+)([0-9][0-9])\s+(.*)$",
                        ]

    _tvshow_regexes = [re.compile(pattern, re.I) \
                       for pattern in _tvshow_patterns]
    
    # These patterns cover the same cases as the _tvshow_patterns, but they
    # don't try to capture the show name and season, because this information
    # will be retrieved from the path. However, they do try to match the season
    # as part of the pattern, when available, to decrease the likelihood of
    # false matches.
    _tvshow_path_patterns = [
                # ex: lost.[s3]_[e5].hdtv-lol.avi
                # ex: The IT Crowd - Season 1, Episode 04 [WS DVDRip Xvid].avi
                # ex: Day.Break.S01E03.HDTV.XviD-XOR.avi
                r".*(?:" + _parens_se_pat_noncap + r"[-_\s\.,]?)?" +
                _parens_ep_pat + r"(.*)$",
  
                # ex: Dexter - 01x06 (HDTV-LOL) Return to Sender.avi
                r".*[0-9]+x([0-9]+)(.*)$",
  
                # ex: lost.305.hdtv-lol.avi
                # please note that in this case the number almost surely is not
                # in "SEE" format but it's just a sequential episode number, so
                # we match it entirely.
                # To avoid false matches with basically anything with a number
                # in the title, we force the number to be at the start of the
                # filename.
                r"([0-9]+)\s+(.*)$",
           ]

    _base_path_pat = r"(.*)" + os.sep + r"(?:season|se|s)\s*([0-9]+)" + os.sep
    _tvshow_path_regexes = [re.compile(_base_path_pat + pattern, re.I) \
                            for pattern in _tvshow_path_patterns]

    _prematch_discard = re.compile(r"[-_\.,]")
    
    def __init__(self):
        self._postmatch_filters = [self.post_match_year]
        # Keep the result lists per-instance; as class attributes they
        # would be shared between all instances.
        self._ok = []
        self._fail = []
    
    def print_result(self, m, f):
        if len(m.groups()) > 4:
            return "TOO_MANY_GROUPS|%s|%s" % (m.groups(), f)
        
        show_name, season_nb, episode_nb, remain = m.groups()
        se = int(season_nb, 10)
        ep = int(episode_nb, 10)
        return "%s|%s|%s|%s|%s" % (show_name, se, ep, remain, f)
    
    def descend(self, dir, output=True, level=0):
        pad = '    ' * level
            
        for f in os.listdir(dir):
            p = os.path.join(dir, f)
            if os.path.isdir(p):
                if output:
                    print pad + "[%s]" % f
                self.descend(p, output, level + 1)
            else:
                m = self.process(p, f)
                if m is None:
                    self._fail.append(p)
                    if output:
                        print pad + "FAIL: %s" % f
                else:
                    self._ok.append((p, m))
                    if output:
                        print pad + self.print_result(m, f)
            
    def process(self, dir, fname):
        text = self.path_tail(dir, 1)
        text = self.clean_video_name_prematch(text)
        
        for rx in self._tvshow_regexes:
            matches = rx.search(text)
            if matches is not None:
                if self.apply_postmatch_filters(dir, matches):
                    return matches
        
        text = self.path_tail(dir, 3)
        text = self.clean_video_name_prematch(text)
        for rx in self._tvshow_path_regexes:
            matches = rx.search(text)
            if matches is not None:
                if self.apply_postmatch_filters(dir, matches):
                    return matches
            
        return None
    
    def path_tail(self, filepath, components):
        """
        Returns the rightmost N components of filepath (joined as a valid path)
        For example path_tail('/videos/tv/lost/season1/01-pilot.avi', 3) will
        return 'lost/season1/01-pilot.avi'
        """
        return os.sep.join(filepath.split(os.sep)[-components:])
        
    def clean_video_name_prematch(self, video_name):
        """
        Remove file extension then replace a set characters often used in
        downloaded filenames in place of spaces (such as . and _) with spaces.
        Also replace common separators (such as - and ,) with spaces.
        """
        extpos = video_name.rfind(os.extsep)
        # str.rfind returns -1 (not None) when the separator is absent
        if extpos != -1:
            video_name = video_name[:extpos]

        return self._prematch_discard.sub(" ", video_name)
        
    def apply_postmatch_filters(self, path, matches):
        """
        Apply all the post-match filters and return True only if all filters
        return True (i.e. all filters approve the match).
        """
        for filt in self._postmatch_filters:
            if not filt(path, matches):
                return False
        return True
            
    def post_match_year(self, path, match):
        """
        This filter checks the result of the regexps that match serial
        numbers. When those regexes are applied to movie titles containing a
        year, they wrongly categorize the movie as a TV show: for example,
        1998 can be read as season 19, episode 98.
        This filter doesn't allow anything of 1900 or above to be considered
        a TV show, since TV shows rarely have more than 18 seasons, while most
        digitally available movies are dated after the year 1900.
        """
        if match is None or len(match.groups()) != 4:
            print "!!! NO MATCH or WRONG MATCH !!!"
            return False
        
        _, season, episode, _ = match.groups()
        year = int(season + episode, 10)
                
        return year < 1900
    
t = Test()
t.descend('/home/uriboni/tmp/video/film', False)
print "----------------- SUCCESS: -----------------------"
for ok in t._ok:
    filepath, matches = ok
    print t.print_result(matches, filepath)
    
print "----------------- FAILURE: ------------------------"
for fail in t._fail:
    print fail
    
print "----------------- STATS: -------------------------"
print "OK:", len(t._ok)
print "FAIL:", len(t._fail)
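The FIXME in the code above, about telling plain sequence numbers (common for anime) apart from "SEE"-style season/episode numbers, could be attacked by looking at the sibling files in the same directory, along the lines of what the MythTV code reportedly does: if the numbers form a dense run starting near 1, they are almost certainly plain sequence numbers. A rough, untested sketch of that heuristic (the function and thresholds are mine, not MythTV's):

```python
import re

_NUMBER = re.compile(r"(\d{2,})")

def looks_like_sequence(filenames):
    """Heuristic: True if the first number found in each sibling file
    forms a mostly consecutive run starting near 1, which suggests
    plain sequence numbering rather than SEE season/episode codes."""
    numbers = []
    for name in filenames:
        m = _NUMBER.search(name)
        if m is not None:
            numbers.append(int(m.group(1)))
    if len(numbers) < 3:
        return False  # not enough siblings to decide either way
    numbers.sort()
    gaps = [b - a for a, b in zip(numbers, numbers[1:])]
    # Starts low and has no big jumps => looks like sequence numbers
    return numbers[0] <= 2 and all(g <= 2 for g in gaps)
```

So a directory of naruto.01.avi ... naruto.26.avi would be flagged as sequentially numbered, while lost.101.avi, lost.102.avi, lost.201.avi (jumping from 102 to 201) would not.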

    
