Re: docx files turned into ocx

Samuele Kaplun Tue, 15 Jun 2010 17:32:08 +0200

Dear Devin,

On Tue, 15/06/2010 at 16:56 +0200, Devin Bougie wrote:
> When we upload .docx files to our installation of v0.99.1, Invenio
>  appears to change the extension of the file from "docx" to "ocx".  Any
>  suggestions for changing this behavior would be greatly appreciated. 
>  Please let me know if there is any more information I can provide.


the heuristics for recognizing extensions have changed quite a lot
since release 0.99.1, and I currently don't have a machine at hand to
perform a quick test.

However, you might try adding docx (and all the other new extensions
from Microsoft Office) to the config variable:

CFG_WEBSUBMIT_ADDITIONAL_KNOWN_FILE_EXTENSIONS.

I also see that this triggers a bug in the 0.99.1 heuristic, because
ocx (a substring of docx) happens to be a valid extension.
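
To illustrate the failure mode, here is a minimal sketch of a suffix-based heuristic like the one in 0.99.1. It is not the actual Invenio code: KNOWN_EXTENSIONS, strip_ext_old and guess_ext_old are hypothetical names, and the sketch assumes that "ocx" is registered without a leading dot, which is what lets it match the tail of "docx".

```python
# Hypothetical, simplified stand-in for a 0.99.1-style suffix heuristic.
# KNOWN_EXTENSIONS is an assumed sample, not the real Invenio list.
# Note that "ocx" carries no leading dot (assumption).
KNOWN_EXTENSIONS = {".doc", ".gz", ".tar", "ocx"}


def strip_ext_old(filename):
    """Repeatedly strip any known suffix from the end of the name."""
    low = filename.lower()
    stripped = True
    while stripped:
        stripped = False
        for ext in KNOWN_EXTENSIONS:
            if low.endswith(ext):
                low = low[:-len(ext)]
                stripped = True
                break
    return filename[:len(low)]


def guess_ext_old(filename):
    """Whatever was stripped is taken to be the extension."""
    return filename[len(strip_ext_old(filename)):]


print(guess_ext_old("report.docx"))  # ocx
print(strip_ext_old("report.docx"))  # report.d
```

Because the match is a plain endswith() on a dotless entry, nothing forces the extension to start at a dot, so "report.docx" is split as "report.d" + "ocx".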

Could you also test this patch (after taking a backup of the
bibdocfile.py module)? It contains an algorithm backported from the
latest Git master.

To apply just do:

$ cd /opt/cds-invenio/lib/python/invenio
$ patch -p4 < /tmp/0001-BibDocFile-backport-extension-guessing-algorithm.patch

Let me know if this fixes your problem (and if you don't see any
collateral issues).
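
For a quick sanity check outside Invenio, the idea behind the backported algorithm can be sketched standalone. This is only an approximation under stated assumptions: SAMPLE_EXTENSIONS is a made-up sample, build_extension_regex and strip_ext are hypothetical names, and re.escape is used here instead of the manual escaping done in the patch.

```python
import os
import re

# Assumed sample of known extensions; entries may come with or without a dot.
SAMPLE_EXTENSIONS = [".doc", "docx", "ocx", ".gz", ".tar", ".bz2"]


def build_extension_regex(extensions):
    """Normalize every extension to start with a dot and anchor it at the end."""
    normalized = set()
    for ext in extensions:
        ext = ext.lower()
        if not ext.startswith("."):
            ext = "." + ext
        normalized.add(ext)
    pattern = "(?:" + "|".join(re.escape(ext) for ext in sorted(normalized)) + ")$"
    return re.compile(pattern, re.I)


def strip_ext(filename, regex):
    """Strip known extensions repeatedly; fall back to os.path.splitext once."""
    stripped = regex.sub("", filename)
    if stripped == filename:
        stripped = os.path.splitext(filename)[0]
    while stripped != filename:
        filename = stripped
        stripped = regex.sub("", filename)
    return stripped


regex = build_extension_regex(SAMPLE_EXTENSIONS)
print(strip_ext("report.docx", regex))  # report
print(strip_ext("foo.tar.gz", regex))   # foo
print(strip_ext("foo.buz.gz", regex))   # foo.buz
```

Since every alternative now begins with a literal dot, "ocx" can only match as ".ocx" and no longer eats the tail of ".docx".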

Best regards,
        Samuele

From 618f4727fe406ec186fc0e89a7fe2cbd8dabfcaa Mon Sep 17 00:00:00 2001
From: Samuele Kaplun <[email protected]>
Date: Tue, 15 Jun 2010 17:24:54 +0200
Subject: [PATCH] BibDocFile: backport extension guessing algorithm

* Fix extension guessing algorithm by backporting latest version from
  master. Previous algorithm was guessing "foo.docx" as having extension
  "ocx". This is fixed.
---
 modules/websubmit/lib/bibdocfile.py |   70 ++++++++++++++++++++++++----------
 1 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/modules/websubmit/lib/bibdocfile.py b/modules/websubmit/lib/bibdocfile.py
index 0689771..a44749a 100644
--- a/modules/websubmit/lib/bibdocfile.py
+++ b/modules/websubmit/lib/bibdocfile.py
@@ -66,32 +66,60 @@ CFG_BIBDOCFILE_STRONG_FORMAT_NORMALIZATION = False
 
 KEEP_OLD_VALUE = 'KEEP-OLD-VALUE'
 
-_mimes = MimeTypes()
+_mimes = MimeTypes(strict=False)
 _mimes.suffix_map.update({'.tbz2' : '.tar.bz2'})
 _mimes.encodings_map.update({'.bz2' : 'bzip2'})
-_extensions = _mimes.encodings_map.keys() + \
-              _mimes.suffix_map.keys() + \
-              _mimes.types_map[1].keys() + \
-              CFG_WEBSUBMIT_ADDITIONAL_KNOWN_FILE_EXTENSIONS
-_extensions.sort()
-_extensions.reverse()
-_extensions = set([ext.lower() for ext in _extensions])
 
-class InvenioWebSubmitFileError(Exception):
-    pass
+def _generate_extensions():
+    """
+    Generate the regular expression to match all the known extensions.
+
+    @return: the regular expression.
+    @rtype: regular expression object
+    """
+    _tmp_extensions = _mimes.encodings_map.keys() + \
+                _mimes.suffix_map.keys() + \
+                _mimes.types_map[1].keys() + \
+                CFG_WEBSUBMIT_ADDITIONAL_KNOWN_FILE_EXTENSIONS
+    extensions = []
+    for ext in _tmp_extensions:
+        if ext.startswith('.'):
+            extensions.append(ext)
+        else:
+            extensions.append('.' + ext)
+    extensions.sort()
+    extensions.reverse()
+    extensions = set([ext.lower() for ext in extensions])
+    extensions = '\\' + '$|\\'.join(extensions) + '$'
+    extensions = extensions.replace('+', '\\+')
+    return re.compile(extensions, re.I)
+
+#: Regular expression to recognize extensions.
+_extensions = _generate_extensions()
 
 def file_strip_ext(afile):
-    """Strip in the best way the extension from a filename"""
-    lowfile = afile.lower()
-    ext = '.'
-    while ext:
-        ext = ''
-        for c_ext in _extensions:
-            if lowfile.endswith(c_ext):
-                lowfile = lowfile[0:-len(c_ext)]
-                ext = c_ext
-                break
-    return afile[:len(lowfile)]
+    """
+    Strip in the best way the extension from a filename.
+
+    >>> file_strip_ext("foo.tar.gz")
+    'foo'
+    >>> file_strip_ext("foo.buz.gz")
+    'foo.buz'
+    >>> file_strip_ext("foo.buz")
+    'foo'
+
+    @param afile: the path/name of a file.
+    @type afile: string
+    @return: the name/path without the extension (and version).
+    @rtype: string
+    """
+    nextfile = _extensions.sub('', afile)
+    if nextfile == afile:
+        nextfile = os.path.splitext(afile)[0]
+    while nextfile != afile:
+        afile = nextfile
+        nextfile = _extensions.sub('', afile)
+    return nextfile
 
 def normalize_format(format):
     """Normalize the format."""
-- 
1.7.0.4
