[Freeipa-devel] not ascii, not utf-8, what's a parser supposed to do?

2010-01-26 Thread John Dennis
I've run into a small problem with xgettext. By default xgettext expects 
all strings in an input file to be encoded in ascii. It also lets you 
override that by specifying that the strings in the input file are utf-8.


In ipapython/ipautil.py, line 296, is the following string:

SAFE_STRING_PATTERN = '(^(\000|\n|\r| |:|<)|[\000\n\r\200-\377]+|[ ]+$)'

In its default ascii mode xgettext throws an error claiming the string 
is not ascii. In fact xgettext is correct: the string is not ascii. (You 
may be wondering why xgettext even cares, since the string is not marked 
as translatable. xgettext fully parses the input before deciding what 
is marked as translatable, so all strings get parsed and decoded.)


If I override the default ascii input by telling xgettext the input 
strings are encoded in utf-8, xgettext stops complaining and the string 
is properly skipped.


But ... the string isn't really utf-8 either, and I'm not sure how 
comfortable I feel about telling xgettext every string in IPA is encoded 
in utf-8 (when it isn't) just to get around this failure. (However, maybe 
we should allow utf-8 as an input format anyway: ascii is a subset of 
utf-8, we might want to use utf-8 in the future, and we can just hold our 
noses with respect to the above regular expression.)
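To make the "not ascii, not utf-8" point concrete, here is a minimal sketch that treats the pattern as raw bytes and tries to decode it. It assumes Python 3 (where a bytes literal keeps \200 and \377 as the single bytes 0x80 and 0xFF); decodes_as is a hypothetical helper, not anything in ipautil.py, and the `<` alternative is taken from the numeric form of the pattern given in this message (code 60).

```python
# The pattern as raw bytes; \200 and \377 are octal for 0x80 and 0xFF.
PATTERN_BYTES = b'(^(\000|\n|\r| |:|<)|[\000\n\r\200-\377]+|[ ]+$)'


def decodes_as(data, encoding):
    """Hypothetical helper: True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# 0x80 and 0xFF are outside ascii, and neither is a valid utf-8 sequence
# on its own, so both strict decodes fail. latin-1 maps every byte value
# to a character, so it always succeeds.
print(decodes_as(PATTERN_BYTES, 'ascii'))    # False
print(decodes_as(PATTERN_BYTES, 'utf-8'))    # False
print(decodes_as(PATTERN_BYTES, 'latin-1'))  # True
```

So declaring the input as utf-8 is not literally true for this one string, which is the discomfort described above.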


Do we have a stake in the ground as to what our input strings are 
encoded in?


Can you think of another way to express the offending string such that 
it doesn't trigger the non-ascii error? The only thing I could think of 
and get to work was this:


SAFE_STRING_PATTERN = '%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c' % \
    (40,94,40,0,124,10,124,13,124,32,124,58,124,60,41,124,91,0,10,13,128,45,255,93,43,124,91,32,93,43,36,41)

Which is pretty unreadable, but with sufficient comments could be 
acceptable.
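As a possible middle ground, here is a sketch that keeps most of the pattern readable and computes only the two non-ascii range endpoints with chr(). It assumes Python 3 string semantics; CODES, HIGH_START, HIGH_END, from_chr and SAFE_STRING_RE are hypothetical names used for illustration. Whether xgettext accepts this spelling has not been checked here, so treat that part as an assumption.

```python
import re

# Code points from the message, one per character of the pattern.
CODES = (40, 94, 40, 0, 124, 10, 124, 13, 124, 32, 124, 58, 124, 60, 41, 124,
         91, 0, 10, 13, 128, 45, 255, 93, 43, 124, 91, 32, 93, 43, 36, 41)

# Form from the message: every character spelled as a number.
from_codes = ('%c' * len(CODES)) % CODES

# Hypothetical alternative: only the two high range endpoints are computed,
# so no string literal in the source contains a non-ascii character.
HIGH_START = chr(0x80)  # octal \200
HIGH_END = chr(0xFF)    # octal \377
from_chr = ('(^(\000|\n|\r| |:|<)|[\000\n\r'
            + HIGH_START + '-' + HIGH_END
            + ']+|[ ]+$)')

# Both spellings should produce the same 32-character pattern.
assert from_codes == from_chr

# The pattern matches strings that are NOT "safe": a bad first character,
# any NUL/CR/LF/high character, or trailing spaces.
SAFE_STRING_RE = re.compile(from_chr)
print(SAFE_STRING_RE.search('hello') is None)       # True
print(SAFE_STRING_RE.search('hello ') is not None)  # True
```

The equality check doubles as the "sufficient comment" for the numeric form: it documents what the 32 numbers spell out.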



--
John Dennis jden...@redhat.com


___
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel


Re: [Freeipa-devel] not ascii, not utf-8, what's a parser supposed to do?

2010-01-26 Thread Jason Gerard DeRose
On Tue, 2010-01-26 at 17:28 -0500, John Dennis wrote:
> I've run into a small problem with xgettext. By default xgettext expects 
> all strings in an input file to be encoded in ascii. It will also allow 
> you to override that by specifying the strings in the input file are utf-8.
> 
> In ipapython/ipautil.py line 296 is the following string:
> 
> SAFE_STRING_PATTERN = '(^(\000|\n|\r| |:|<)|[\000\n\r\200-\377]+|[ ]+$)'

ipapython still has a lot of legacy code, so the first thing we should do
is check whether we even use SAFE_STRING_PATTERN.  Rob, do you know offhand?




Re: [Freeipa-devel] not ascii, not utf-8, what's a parser supposed to do?

2010-01-26 Thread Howard Chu
John Dennis wrote:
> I've run into a small problem with xgettext. By default xgettext expects 
> all strings in an input file to be encoded in ascii. It will also allow 
> you to override that by specifying the strings in the input file are utf-8.

Do you ever expect to run this stuff on IBM mainframes (i.e., systems using
EBCDIC or some other non-ASCII-related character set)?

> Can you think of another way to express the offending string such that 
> it doesn't trigger the non-ascii error? The only thing I could think of 
> and get to work was this:
> 
> SAFE_STRING_PATTERN = '%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c' % \
>     (40,94,40,0,124,10,124,13,124,32,124,58,124,60,41,124,91,0,10,13,128,45,255,93,43,124,91,32,93,43,36,41)
> 
> Which is pretty unreadable, but with sufficient comments could be 
> acceptable.

I had to use similar hacks when porting OpenSSL to z/OS. It kinda sucks, but
it has the virtue of being completely independent of the machine's language
settings. And frankly, it doesn't take too much explanation in the comments to
be understandable.

-- 
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
