Re: [PATCH] Don't strip Content-Type in .po files on Windows

Julian Foad Tue, 14 Aug 2012 16:06:48 -0700

Philip Martin wrote:
> Роман Донченко <d...@corrigendum.ru> writes:
>>  - On Windows, we only support linking with svn-win32-libintl, which
>>  is hacked to disable all encoding conversions.
> 
> I assume it provides bind_textdomain_codeset as a function that does
> nothing.
> 
>>  - Even if someone links with with his own version of libintl, it's
>>  a safe bet that it will be new enough to support
>>  bind_textdomain_codeset, so we  can just call that.
> 
> If somebody managed to use an old version I guess there would be some
> sort of linker failure.  Build time?  Run time?
> 
> If somebody managed to use their own version that did encoding
> conversions I guess that would also be OK since it would be a
> utf8-utf8 no-op.
> 
> The patch looks OK but I don't know enough about the Windows build to
> know whether the above is correct.


I agree, without fully understanding all the details.

I went digging in the archives to try to understand the issue.  Maybe 
this will help others to do so too.  The email threads around the time 
when this 'charset stripping' was originally added are "RFC: Solving 
gettext & UTF-8 brokenness" [1] and "Stripping 'charset=' from po files 
[the sequal]"[2].  Roman's previous submission of this patch resulted 
in a short discussion that petered out [3].

From what I can gather, the stripping of the 'charset=' was originally 
introduced as a hack to work around a mis-feature of the 'gettext' or 
'libintl'/'libiconv' subsystems, that would undesirably change the 
encoding away from UTF-8.  We have two other ways to stop that 
re-coding: we call bind_textdomain_codeset() if available; and/or we 
link to a 'libintl'/'libiconv' implementation that doesn't do 
re-coding.  On Windows we do the latter; we don't currently do the 
former, but Roman's patch does.

It appears that Roman finds that that those other methods are 
sufficient.  Removing this hack makes his life better (eliminates some 
warnings from some translation tools, and simplifies the build system) 
and still provides the desired UTF-8 result.

The worst that could happen after applying this patch is someone 
building Subversion on Windows, and linking an incompatible 
'libintl'/'libiconv', might find that Subversion error messages are 
improperly displayed or are replaced by an 'Invalid UTF-8 sequence' 
message.  That's no big deal -- it's a straightforward issue for the 
builder/maintainer to solve and is not subtle or data-corrupting.

The .po file format is supposed to have this 'charset' field present, 
so in my opinion we shouldn't strip it if we don't have to: I could 
imagine how stripping it could cause extra warnings in translation 
tools and/or extra work for translators operating those tools, even 
if we hadn't actually had this report and this patch.

So it seems to me that we should apply this patch.

Brane was involved in the previous discussions and he's away today 
so I'd like to wait to hear his input.  If he has no particular 
objection then I'll be happy to commit this even though I can't 
easily test it.

- Julian


[1] Email thread "RFC: Solving gettext & UTF-8 brokenness" on dev@ 
in May 2004: 
<http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=345172>
or <http://svn.haxx.se/dev/archive-2004-05/0047.shtml>.

[2] Email thread "Stripping 'charset=' from po files [the sequal]" on 
dev@ in May 2004; a relevant place to start reading is the message 
from Greg Hudson on 2004-05-13: 
<http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=351459>
or <http://svn.haxx.se/dev/archive-2004-05/0593.shtml>.

[3] Email thread "Re: [PATCH] Remove po file charset stripping on 
Windows" on dev@ in Aug/Sep 2009; a relevant place to start reading 
is Roman's message on 2009-09-13: 
<http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2394298>
or <http://svn.haxx.se/dev/archive-2009-09/0225.shtml>.


Roman Donchenko (Роман Донченко) wrote on 6 August 2012:
> I submitted this patch a couple of years ago, but it didn't get much 
> attention. I still think it's relevant, so I'll try again.
> 
> [[[
> On Windows, don't strip the Content-Type field from .po files during 
> their compilation.
> 
> GNU libintl, by default, converts the l10n strings into the locale
> encoding, while Subversion requires UTF-8. This conversion can be
> suppressed by calling bind_textdomain_codeset, but certain old
> versions of libintl don't have that, so the Unix build system checks
> for the existence of that function, and if it's not present, strips
> the Content-Type header from the .po files (which prevents encoding
> conversion, as well, but makes msgfmt complain).
> 
> When building on Windows, this stripping is done unconditionally,
> but is completely unnecessary:
> 
> - On Windows, we only support linking with svn-win32-libintl, which
> is hacked to disable all encoding conversions.
> - Even if someone links with with his own version of libintl, it's a
> safe bet that it will be new enough to support bind_textdomain_codeset,
> so we can just call that.
>
> This patch removes Content-Type stripping on Windows, which gets rid
> of msgfmt warnings, as well as simplifies the build system.
> 
> * build/generator/build_locale.ezt: Remove the strip-po-charset.py
>   invocation.
> 
> * build/generator/gen_win.py:
>   (POFile.__init__): don't store the .spo file name.
>
> * build/strip-po-charset.py: Delete.
> 
> * subversion/libsvn_subr/nls.c:
>   (svn_nls_init): Move the bind_textdomain_codeset call out of the
>     #ifdef WIN32 block, so it's executed on Windows, as well.
> 
> * subversion/svn_private_config.hw: Indicate that
>   bind_textdomain_codeset is available if NLS is enabled.
> ]]]

Re: [PATCH] Don't strip Content-Type in .po files on Windows

Reply via email to