#9753: makemessages failed on long Chinese text
-------------------------------------------+--------------------------------
          Reporter:  Will                  |         Owner:  nobody
            Status:  new                   |     Milestone:  post-1.0
         Component:  Internationalization  |       Version:  1.0
        Resolution:                        |      Keywords:  django-admin.py makemessages
             Stage:  Unreviewed            |     Has_patch:  0
        Needs_docs:  0                     |   Needs_tests:  0
Needs_better_patch:  0                     |
-------------------------------------------+--------------------------------
Comment (by kmtracey):

 This may be related to #9212.  Using gettext utilities 0.15 (from cygwin)
 on Windows I have no problem with the specified Chinese string in a
 template.  However, using gettext utilities 0.13 on Windows I get these
 errors:

 {{{
 Error: errors happened while running msguniq
 D:\u\kmt\software\web\xword\locale\en\LC_MESSAGES\django.pot:48:76: invalid multibyte sequence
 D:\u\kmt\software\web\xword\locale\en\LC_MESSAGES\django.pot:48:77: invalid multibyte sequence
 D:\u\kmt\software\web\xword\locale\en\LC_MESSAGES\django.pot:49:2: invalid multibyte sequence
 }}}

 Lines 45-50 from the .pot file are:

 {{{
 #: .\crossword\templates\500.html.py:7
 msgid ""
 "四千年前有一个姑娘叫姜嫄,她有一天觉得很空虚,就到郊外玩,看见一只巨人脚印"
 ",也许是外星人留下的,她想上去比一比,看看谁的脚丫子更大,就踩上去。踩上去�"
 "�发现肚子里乱动,跟怀了孕似的。回去以后,肚子里的小孩,又老不出来,过了十二个月才生下来。"
 msgstr ""
 }}}

 which (if it displays as it is in the composition window) shows an invalid
 utf-8 sequence at the end of the 2nd message line and the beginning of the
 3rd (the lines identified in the error messages).
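
 One way to locate such sequences independently of msguniq is to decode the
 file line by line.  The sketch below is illustrative only:
 `find_invalid_utf8_lines` is a hypothetical helper (not part of Django or
 gettext), the sample bytes are made up, and it reports 0-based byte offsets
 rather than the columns msguniq prints.

```python
def find_invalid_utf8_lines(data: bytes):
    """Return (line_number, byte_offset) for each line that is not valid UTF-8."""
    problems = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the 0-based offset of the first offending byte.
            problems.append((lineno, exc.start))
    return problems


# Made-up sample: one 3-byte character split across two quoted lines,
# mimicking the shape of the .pot excerpt above.
ch = "踩".encode("utf-8")
sample = b'msgid ""\n"' + ch[:1] + b'"\n"' + ch[1:] + b'"\nmsgstr ""\n'
print(find_invalid_utf8_lines(sample))
```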

 The problem identified in #9212 was that the older xgettext assumes
 iso8859-1 encoding for Python files, and re-encodes that assumed
 iso8859-1 input as utf-8.  However, Django requires the source to already
 be utf-8 encoded, so xgettext outputs doubly utf-8 encoded data.  We
 "fixed" that by un-doing the extra utf-8 encoding done by xgettext.
 Here, however, it appears that xgettext also splits long messages, and
 it does so on the doubly-encoded data.  The split may fall at a point
 that, once the 2nd utf-8 encoding is un-done, lands in the middle of one
 of the original n-byte utf-8 sequences.  The extra close quote, newline
 and open quote chars are then stuffed into the middle of that sequence,
 resulting in the error.
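
 The mechanism can be reproduced in a few lines.  This is a sketch under
 assumptions, not xgettext's or Django's actual code: the sample string, the
 split position and the helper names are all made up for illustration.

```python
def double_encode(text: str) -> bytes:
    """Mimic the described bug: UTF-8 bytes read as iso8859-1, then re-encoded."""
    return text.encode("utf-8").decode("iso8859-1").encode("utf-8")


def undo_double_encoding(data: bytes) -> bytes:
    """Reverse the extra encoding step, recovering the original UTF-8 bytes."""
    return data.decode("utf-8").encode("iso8859-1")


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


original = "踩上去后"  # four characters, 3 UTF-8 bytes each

# In the doubly-encoded text every original byte looks like one character,
# so a wrap after 10 "characters" is legal there, yet 10 is not a multiple of 3.
doubled_text = double_encode(original).decode("utf-8")
first, second = doubled_text[:10], doubled_text[10:]

first_bytes = undo_double_encoding(first.encode("utf-8"))
second_bytes = undo_double_encoding(second.encode("utf-8"))

print(is_valid_utf8(first_bytes), is_valid_utf8(second_bytes))
print(is_valid_utf8(first_bytes + second_bytes))
```

 Each half is invalid on its own; only the unsplit data decodes back to the
 original text.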

 Now, this is a template file and not a Python file -- it isn't entirely
 clear to me why we say the language is Python for all of the "extra"
 extensions processed.  If that isn't really necessary then perhaps this
 could be worked around by not specifying language Python for extra
 extensions.  But Python code would still potentially suffer from this
 problem (I'm assuming we do need to specify language Python for actual
 Python source).

 Anyway a more straightforward way to get around this problem, for all
 files where we specify language Python to xgettext, is to specify
 --no-wrap to xgettext.  That way it doesn't split lines and we don't wind
 up with invalid utf-8 sequences at line boundaries even after un-doing
 the extra utf-8 encoding.  I thought it would be best to only specify
 --no-wrap when we are in this oddball case of needing to un-do xgettext's
 incorrect double encoding of the original utf-8 data, but it seems
 msguniq (which we run after xgettext) also splits lines, so even when we
 specify --no-wrap to xgettext, we get nicely wrapped lines in the
 ultimate output file.  So I think it would be OK to just add --no-wrap
 unconditionally to the args for xgettext.  Sound OK?
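
 As a rough sketch of the proposal: `build_xgettext_args` is a hypothetical
 helper and the argument list is a simplified assumption, not the command
 makemessages actually builds.  Only the unconditional --no-wrap is the
 point.

```python
def build_xgettext_args(domain: str, source_path: str) -> list:
    """Build an xgettext argument list that always disables line wrapping."""
    return [
        "xgettext",
        "-d", domain,      # default domain
        "-L", "Python",    # language assumed for all processed files
        "--no-wrap",       # keep long messages on one line
        "-o", "-",         # write the result to stdout
        source_path,
    ]


# Hypothetical file name, for illustration only.
args = build_xgettext_args("django", "500.html.py")
print(" ".join(args))
```

 The list could be passed to something like subprocess.run(args,
 capture_output=True), with any wrapping left to msguniq afterwards.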

 All this is assuming the problem I am seeing is the same as the reported
 one -- as the original report lacks the actual error messages seen and any
 details on platform used or gettext utilities version in use, I can't be
 100% sure of that.

-- 
Ticket URL: <http://code.djangoproject.com/ticket/9753#comment:2>