[Freeipa-devel] not ascii, not utf-8, what's a parser supposed to do?
I've run into a small problem with xgettext. By default xgettext expects all strings in an input file to be encoded in ascii. It will also allow you to override that by specifying that the strings in the input file are utf-8.

In ipapython/ipautil.py, line 296 is the following string:

    SAFE_STRING_PATTERN = '(^(\000|\n|\r| |:|<)|[\000\n\r\200-\377]+|[ ]+$)'

In its default ascii mode xgettext throws an error claiming the string is not ascii. In fact xgettext is correct: the string is not ascii. (You may be wondering why xgettext even cares, since the string is not marked as translatable. xgettext fully parses the input before deciding what is marked as translatable. Bottom line: all strings get parsed and decoded.)

If I override the default ascii input by telling xgettext the input strings are encoded in utf-8, xgettext stops complaining and the string is properly skipped. But ... the string isn't really utf-8 either, and I'm not sure how comfortable I feel about telling xgettext every string in IPA is encoded in utf-8 (when it isn't) just to get around this failure, especially since the offending string isn't even utf-8. (However, maybe we should allow utf-8 as an input format: ascii is a subset of utf-8, we might want to use utf-8 in the future, and we can just hold our noses with respect to the above regular expression.)

Do we have a stake in the ground as to what our input strings are encoded in?

Can you think of another way to express the offending string such that it doesn't trigger the non-ascii error? The only thing I could think of and get to work was this:

    SAFE_STRING_PATTERN='%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c' % \
        (40,94,40,0,124,10,124,13,124,32,124,58,124,60,41,124,91,0,10,13,128,45,255,93,43,124,91,32,93,43,36,41)

Which is pretty unreadable, but with sufficient comments could be acceptable.

-- 
John Dennis jden...@redhat.com

Looking to carve out IT costs?
www.redhat.com/carveoutcosts/

___
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel
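As a quick check of the workaround in the message above, here is a minimal sketch in Python 3 syntax. The names LITERAL, CODES, FROM_CODES, and non_ascii_positions are illustrative and do not come from ipautil.py. It assumes the escaped literal contains '<' (character code 60 in the tuple), confirms that the %c construction yields the same string, and shows which characters fall outside ascii.

```python
# The pattern written with escape sequences. '<' is assumed to be present
# because the %c tuple contains character code 60.
LITERAL = '(^(\000|\n|\r| |:|<)|[\000\n\r\200-\377]+|[ ]+$)'

# The same pattern expressed as character codes, as in the proposed workaround.
CODES = (40, 94, 40, 0, 124, 10, 124, 13, 124, 32, 124, 58, 124, 60, 41, 124,
         91, 0, 10, 13, 128, 45, 255, 93, 43, 124, 91, 32, 93, 43, 36, 41)

# One %c per code, so the format string can never drift out of sync
# with the tuple length.
FROM_CODES = ('%c' * len(CODES)) % CODES


def non_ascii_positions(text):
    """Return the indexes of characters whose code is above 127."""
    return [i for i, ch in enumerate(text) if ord(ch) > 127]


if __name__ == '__main__':
    print(FROM_CODES == LITERAL)         # the two spellings agree
    print(non_ascii_positions(LITERAL))  # the \200 and \377 range endpoints
```

Only the two endpoints of the \200-\377 range are non-ascii once the escapes are decoded, which is consistent with the error described above.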
Re: [Freeipa-devel] not ascii, not utf-8, what's a parser supposed to do?
On Tue, 2010-01-26 at 17:28 -0500, John Dennis wrote:
> I've run into a small problem with xgettext. By default xgettext expects all strings in an input file to be encoded in ascii. It will also allow you to override that by specifying that the strings in the input file are utf-8.
>
> In ipapython/ipautil.py, line 296 is the following string:
>
>     SAFE_STRING_PATTERN = '(^(\000|\n|\r| |:|<)|[\000\n\r\200-\377]+|[ ]+$)'

ipapython still has a lot of legacy code, so the first thing we should do is check whether we even use SAFE_STRING_PATTERN. Rob, do you know offhand?

> In its default ascii mode xgettext throws an error claiming the string is not ascii. In fact xgettext is correct: the string is not ascii. (You may be wondering why xgettext even cares, since the string is not marked as translatable. xgettext fully parses the input before deciding what is marked as translatable. Bottom line: all strings get parsed and decoded.)
>
> If I override the default ascii input by telling xgettext the input strings are encoded in utf-8, xgettext stops complaining and the string is properly skipped. But ... the string isn't really utf-8 either, and I'm not sure how comfortable I feel about telling xgettext every string in IPA is encoded in utf-8 (when it isn't) just to get around this failure, especially since the offending string isn't even utf-8. (However, maybe we should allow utf-8 as an input format: ascii is a subset of utf-8, we might want to use utf-8 in the future, and we can just hold our noses with respect to the above regular expression.)
>
> Do we have a stake in the ground as to what our input strings are encoded in?
>
> Can you think of another way to express the offending string such that it doesn't trigger the non-ascii error? The only thing I could think of and get to work was this:
>
>     SAFE_STRING_PATTERN='%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c' % \
>         (40,94,40,0,124,10,124,13,124,32,124,58,124,60,41,124,91,0,10,13,128,45,255,93,43,124,91,32,93,43,36,41)
>
> Which is pretty unreadable, but with sufficient comments could be acceptable.
Re: [Freeipa-devel] not ascii, not utf-8, what's a parser supposed to do?
John Dennis wrote:
> I've run into a small problem with xgettext. By default xgettext expects all strings in an input file to be encoded in ascii. It will also allow you to override that by specifying the strings in the input file are utf-8.

Do you ever expect to run this stuff on IBM mainframes (i.e., systems using EBCDIC or some other non-ASCII-related character set)?

> Can you think of another way to express the offending string such that it doesn't trigger the non-ascii error? The only thing I could think of and get to work was this:
>
>     SAFE_STRING_PATTERN='%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c' % \
>         (40,94,40,0,124,10,124,13,124,32,124,58,124,60,41,124,91,0,10,13,128,45,255,93,43,124,91,32,93,43,36,41)
>
> Which is pretty unreadable, but with sufficient comments could be acceptable.

I had to use similar hacks when porting OpenSSL to z/OS. It kinda sucks, but it has the virtue of being completely independent of the machine's language settings. And frankly, it doesn't take too much explanation in the comments to be understandable.

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
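On the question of another way to express the string, one further option, offered only as a sketch, is a raw string. The backslash sequences then stay as plain ascii text in the Python string, and the re module decodes the octal escapes itself when it compiles the pattern. Whether xgettext accepts this form was not checked here, so treat that part as an assumption. The helper pattern_matches is a hypothetical name used for illustration, and '<' is assumed to belong in the pattern because the %c tuple contains code 60.

```python
import re

# Raw string: Python does not decode the escapes, so the string holds only
# ascii characters (backslashes and digits). The re module interprets
# \000, \200 and \377 as octal escapes at compile time.
SAFE_STRING_PATTERN = r'(^(\000|\n|\r| |:|<)|[\000\n\r\200-\377]+|[ ]+$)'

_SAFE_STRING_RE = re.compile(SAFE_STRING_PATTERN)


def pattern_matches(value):
    """Return True if SAFE_STRING_PATTERN matches anywhere in value."""
    return _SAFE_STRING_RE.search(value) is not None


if __name__ == '__main__':
    for sample in ('plain value', ' leading space', 'trailing space ',
                   ':colon first', 'caf\xe9'):
        print(repr(sample), pattern_matches(sample))
```

With str patterns in Python 3 the class \200-\377 covers the characters U+0080 through U+00FF. With byte strings, as in Python 2, it covers the bytes 0x80 through 0xFF.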