#36777: Exception raised when accessing files with UTF-8 characters in filename on Debian/Apache
-----------------------+-----------------------------------------
     Reporter:  Caram  |                     Type:  Bug
       Status:  new    |                Component:  Uncategorized
      Version:  6.0    |                 Severity:  Normal
     Keywords:         |             Triage Stage:  Unreviewed
    Has patch:  0      |      Needs documentation:  0
  Needs tests:  0      |  Patch needs improvement:  0
Easy pickings:  0      |                    UI/UX:  0
-----------------------+-----------------------------------------
 = Unicode Filename Handling Issues in Django under Apache/WSGI

 == Environment
 * **Django Version**: 5.2/6.0
 * **Python Version**: 3.12
 * **Web Server**: Apache 2.4.65 with mod_wsgi 5.0.0
 * **OS**: Debian Linux
 * **Database**: MySQL with utf8mb3_general_ci collation

 == Problem Description

 Files with Unicode characters in their filenames (e.g., `Note
 d'information Gestion des récupérations.pdf`) fail under Apache/WSGI in
 two ways:

 1. **File size displays as "0 bytes"** when using `{{
 attachment.file.size|filesizeformat }}`
 2. **File downloads return HTTP 404 errors**

 Both operations work correctly under Django's `runserver` but fail in
 production under Apache/WSGI.

 == Root Cause Analysis

 === 1. ASCII Encoding Default
 Apache/WSGI defaults to ASCII encoding for standard streams, unlike
 `runserver` which uses UTF-8.

 === 2. FieldFile.size Property Failure
 The `size` property of the `FieldFile` behind a `FileField` attempts to
 access file metadata using the default ASCII codec, which fails for
 non-ASCII characters in paths:
 `UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in
 position 88: ordinal not in range(128)`

 === 3. UTF-8 Mojibake
 File paths from the database (stored as UTF-8) get incorrectly interpreted
 as Latin-1 by Apache/WSGI. For example:
 * **Actual filename**: `récupérations.pdf`
 * **In database**: UTF-8 bytes `\xc3\xa9` (correct encoding of "é")
 * **Received by Django**: String `r\xc3\xa9cup\xc3\xa9rations` (UTF-8
 bytes misinterpreted as Latin-1 characters)

 === 4. Filesystem Operations
 `os.path.exists()`, `os.path.getsize()`, and `open()` fail when Python
 tries to encode strings using the default ASCII codec.
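
 To confirm which encodings the Apache/WSGI process is actually using, a
 small diagnostic along these lines could be logged from a view or from the
 WSGI script. This is a sketch only; `describe_encodings` is a hypothetical
 helper name, not part of Django or mod_wsgi.

```python
import locale
import os
import sys


def describe_encodings():
    """Collect the encodings Python uses for paths and text in this process."""
    return {
        # Codec used to convert str paths to bytes for OS calls.
        "filesystem_encoding": sys.getfilesystemencoding(),
        "filesystem_errors": sys.getfilesystemencodeerrors(),
        # Locale-derived default for text I/O.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Locale variables inherited from the parent process, if any.
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
    }


if __name__ == "__main__":
    for key, value in describe_encodings().items():
        print(f"{key}: {value}")
```

 A filesystem encoding of `ascii` would be consistent with the
 `UnicodeEncodeError` shown above; `utf-8` under `runserver` would explain
 why the same code works in development.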

 == Workaround Overview

 The workaround requires three components:

 === 1. Custom `filesize` Template Filter
 Replace `{{ attachment.file.size|filesizeformat }}` with a custom filter
 that:
 * Fixes UTF-8 mojibake by re-encoding:
 `path.encode('latin-1').decode('utf-8')`
 * Uses explicit UTF-8 byte paths: `path.encode('utf-8')`
 * Performs filesystem operations with byte strings to bypass the ASCII codec

 **Usage**:

 {{{
     {{ attachment.file.path|filesize|filesizeformat }}
 }}}
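
 The ticket does not include the filter's source. A minimal sketch of the
 logic described above might look like the following, shown as plain
 functions so it runs without Django. The names `fix_mojibake` and
 `filesize` are illustrative, and returning 0 for a missing file is an
 assumption.

```python
import os


def fix_mojibake(path):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1, if applicable."""
    try:
        return path.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The path was not mojibake (already correct text); leave it as is.
        return path


def filesize(path):
    """Return the size in bytes of the file at `path`, or 0 if unavailable."""
    # Encode explicitly so the process's default codec is never consulted.
    path_bytes = fix_mojibake(path).encode("utf-8")
    try:
        return os.path.getsize(path_bytes)
    except OSError:
        return 0
```

 In a project, `filesize` would presumably live in a `templatetags` module
 and be registered with `@register.filter` on a `template.Library()`
 instance, so that its integer result can be piped into `filesizeformat`.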

 === 2. Custom File Serving View

 Replace `django.views.static.serve` with a Unicode-aware version
 (`serve_unicode`) that:
 - Fixes UTF-8 mojibake in incoming URL paths
 - Converts paths to UTF-8 bytes before filesystem operations
 - Opens files using byte paths: `open(fullpath_bytes, 'rb')`
 - Maintains security checks for path traversal
 - Handles HTTP caching headers properly

 **URL Configuration**:

 {{{
     re_path(r'^%s(?P<path>.*)$' %
 re.escape(settings.MEDIA_URL.lstrip('/')),
             serve_unicode,
             {'document_root': settings.MEDIA_ROOT})
 }}}
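
 The source of `serve_unicode` is not part of this ticket. As a sketch of
 the path-handling steps listed above (mojibake repair, byte paths,
 traversal check), the core could look roughly like this.
 `resolve_media_path` is a hypothetical helper, and the caching-header
 handling is left out.

```python
import os
import posixpath


def resolve_media_path(document_root, url_path):
    """Map a URL path to a UTF-8 byte path inside document_root, or None."""
    try:
        # Undo UTF-8 bytes that were decoded as Latin-1 on the way in.
        url_path = url_path.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass  # Not mojibake; use the path as received.

    # Pure string normalisation; no filesystem access happens here.
    relative = posixpath.normpath(url_path).lstrip("/")

    # From here on, work only with bytes so the default codec is never used.
    root = os.path.realpath(document_root.encode("utf-8"))
    full = os.path.realpath(os.path.join(root, relative.encode("utf-8")))

    # Reject anything that resolves outside the document root.
    if os.path.commonpath([root, full]) != root:
        return None
    return full if os.path.isfile(full) else None
```

 A view built on this would presumably raise `Http404` when `None` comes
 back and otherwise open the returned byte path with `open(..., 'rb')`.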

 === 3. URL Encoding Filter (Optional)

 Add a `urlencode_path` filter that encodes URLs for `href` attributes:
 - Decodes existing encoding to avoid double-encoding
 - Re-encodes with proper UTF-8 percent-encoding
 - Handles special characters (apostrophes, spaces, accented characters)

 **Usage**:

 {{{
     <a href="{{ attachment.file.url|urlencode_path }}?filename={{
 attachment.friendly_name|urlencode }}">
 }}}
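
 The decode-then-encode behaviour described for `urlencode_path` can be
 sketched with the standard library alone. This is one possible
 implementation, not necessarily the reporter's; keeping `/` unescaped is an
 assumption so that path separators survive.

```python
from urllib.parse import quote, unquote


def urlencode_path(url):
    """Percent-encode a URL path as UTF-8 without double-encoding it."""
    # Decode any existing percent-escapes first, then encode exactly once.
    return quote(unquote(url), safe="/")


encoded = urlencode_path("/media/Note d'information récupérations.pdf")
print(encoded)
# /media/Note%20d%27information%20r%C3%A9cup%C3%A9rations.pdf
print(urlencode_path(encoded) == encoded)  # True: already-encoded input is stable
```

 One caveat of this approach: a filename containing a literal `%` followed
 by two hex digits would be treated as an existing escape by `unquote`.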

 == Key Techniques

 === 1. Mojibake Fix

 Convert UTF-8 bytes incorrectly decoded as Latin-1 back to proper UTF-8:

 {{{
     path = path.encode('latin-1').decode('utf-8')
 }}}
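
 A worked round trip may make this clearer. Note that the bare one-liner
 raises if the string was never mangled, so the sketch below wraps it in a
 guard; `repair_mojibake` is an illustrative name.

```python
def repair_mojibake(text):
    """Reverse a Latin-1 misdecoding of UTF-8 bytes; otherwise return text."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or its Latin-1
        # bytes are not valid UTF-8: in both cases it was not mojibake.
        return text


original = "récupérations.pdf"
# Simulate the corruption: UTF-8 bytes read back as Latin-1 characters.
mangled = original.encode("utf-8").decode("latin-1")  # 'rÃ©cupÃ©rations.pdf'

print(repair_mojibake(mangled) == original)        # True
print(repair_mojibake(original) == original)       # True (unguarded call raises)
print(repair_mojibake("文档.pdf") == "文档.pdf")    # True
```

 The guard is a heuristic: a correctly decoded string that happens to look
 like mojibake (for example one that genuinely contains `Ã©`) would still be
 rewritten.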

 === 2. Byte Paths for Filesystem Operations

 Always use byte strings for filesystem access:

 {{{
     path_bytes = path.encode('utf-8')
     if os.path.exists(path_bytes):
         size = os.path.getsize(path_bytes)
         with open(path_bytes, 'rb') as f:
             # ...
 }}}

 === 3. Explicit UTF-8 Encoding

 Never rely on the default encoding: `os.fsencode()` uses the process's
 filesystem encoding, which is ASCII in this Apache/WSGI setup. Always
 specify UTF-8 explicitly: `path.encode('utf-8')`

 == Testing Checklist

 Test with filenames containing:
 - Accented characters: café.pdf
 - Apostrophes: Note d'information.pdf
 - Multiple Unicode characters: récupérations.pdf
 - Spaces and apostrophes: Note d'information Gestion des récupérations.pdf
 - Non-Latin scripts: 文档.pdf
 - Mixed characters: rapport_année_2024.pdf
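
 As a starting point, the filenames above can be exercised at the
 filesystem level with a short script such as the one below. It only checks
 that UTF-8 byte paths can be written and measured in the current process;
 it does not exercise the Django views or filters.
 `check_byte_path_round_trip` is a hypothetical helper.

```python
import os
import tempfile

FILENAMES = [
    "café.pdf",
    "Note d'information.pdf",
    "récupérations.pdf",
    "Note d'information Gestion des récupérations.pdf",
    "文档.pdf",
    "rapport_année_2024.pdf",
]


def check_byte_path_round_trip(names, payload=b"test"):
    """Create each file via a UTF-8 byte path and return {name: size_on_disk}."""
    sizes = {}
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)
        for name in names:
            # Build the full path in bytes so no implicit encoding occurs.
            path_bytes = os.path.join(tmp_bytes, name.encode("utf-8"))
            with open(path_bytes, "wb") as handle:
                handle.write(payload)
            sizes[name] = os.path.getsize(path_bytes)
    return sizes


if __name__ == "__main__":
    results = check_byte_path_round_trip(FILENAMES)
    print(all(size == 4 for size in results.values()))
```

 Running it inside the Apache/WSGI process rather than from a shell is what
 makes the result meaningful, since the shell usually has a UTF-8 locale.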

 == Related Issues

 This ticket describes the common Apache/WSGI Unicode problem where:
 - `UnicodeEncodeError: 'ascii' codec can't encode character` is raised
 - File operations work in development (`runserver`) but fail in production
 (Apache/WSGI)
 - The database stores UTF-8 correctly but Apache/WSGI mangles the encoding
-- 
Ticket URL: <https://code.djangoproject.com/ticket/36777>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
