Re: Shall we change our file.encoding

Nathan Beyer Thu, 16 Jul 2009 18:36:27 -0700

On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<[email protected]> wrote:
> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<[email protected]> wrote:
>> Hi Nathan,
>>
>> What I got is 936, the code page identifier. Is there an API for us to map
>> 936 to gb2312?
>
> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
> that into a name of some sort. I'll poke around a bit and see what I
> can find.


We'll probably just have to put in a mapping ourselves based on the
documentation. We'd call GetACP [1] and map that to a known alias in
java.nio.charset that matches the definitions[2] of the identifiers.

[1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
[2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
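
A minimal sketch of the kind of mapping I have in mind, in C. The table and
the names charset_for_code_page and default_charset_name are hypothetical
(nothing like this exists in the tree yet), only a handful of identifiers are
listed, and the charset names on the right are my assumption of what
java.nio.charset will resolve. MSDN labels 936 as gb2312; the sketch maps it
to GBK, which is an assumption about the alias we would want.

```c
#include <stddef.h>

/* Hypothetical table: Windows code page identifier -> charset name.
 * Deliberately incomplete; the names are assumed to be resolvable by
 * java.nio.charset and would need checking against the supported aliases. */
struct cp_entry {
    unsigned int id;
    const char *name;
};

static const struct cp_entry CP_TABLE[] = {
    { 936,   "GBK" },           /* MSDN lists this identifier as gb2312 */
    { 1250,  "windows-1250" },
    { 1251,  "windows-1251" },
    { 1252,  "windows-1252" },
    { 20127, "US-ASCII" },
    { 28591, "ISO-8859-1" },
    { 65001, "UTF-8" },
};

/* Returns the charset name for a code page identifier, or NULL when the
 * table has no entry (the caller then decides on a fallback). */
static const char *charset_for_code_page(unsigned int cp)
{
    size_t i;
    for (i = 0; i < sizeof(CP_TABLE) / sizeof(CP_TABLE[0]); i++) {
        if (CP_TABLE[i].id == cp) {
            return CP_TABLE[i].name;
        }
    }
    return NULL;
}

#ifdef _WIN32
#include <windows.h>

/* Name for the current ANSI code page as reported by GetACP, or NULL. */
static const char *default_charset_name(void)
{
    return charset_for_code_page(GetACP());
}
#endif
```

When the lookup returns NULL we would presumably keep the current ISO-8859-1
fallback rather than put a bare number into file.encoding.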

>
>> If we put 936 in the file.encoding, can we successfully get the encoder and
>> decoder by charset?
>>
>> On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <[email protected]> wrote:
>>
>>> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<[email protected]> wrote:
>>> > Hi guys,
>>> >
>>> > I have added the locale function to DRLVM; the patch is attached.
>>> > Please try this new patch on Linux.
>>> >
>>> > The patch should work on Linux but fail on Windows, because Windows
>>> > returns a code page, not a charset, from setlocale.
>>>
>>> Code page and character set are the same thing. We shouldn't need to
>>> convert it as the Charset APIs will have to support the values anyway.
>>>
>>> What's the value you're getting? If it's 'Cp1252', then we're good, as
>>> that's just an alias for 'Windows-1252' (or vice-versa).
>>>
>>> -Nathan
>>>
>>>
>>> > I have tried for a long time to
>>> > get the charset name from the code page, for example:
>>> > CPINFOEX cpInfoEx;
>>> > /* Query information about the system default ANSI code page. */
>>> > BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
>>> > if (iReturn) {
>>> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
>>> > }
>>> > But I only get the full name, without any standard format.
>>> >
>>> > There is a map of code page identifiers on MSDN, detail here. I may
>>> > hard-code this map in the file. But the note on MSDN says:
>>> >      "ANSI code pages can be different on different computers, or can be
>>> > changed for a single computer, leading to data corruption. For the most
>>> > consistent results, applications should use Unicode, such as UTF-8 or
>>> > UTF-16, instead of a specific code page."
>>> > I am afraid hard-coding will fail on some machines. (By the way, this seems
>>> > to suggest UTF-8 as the default again :-)
>>> >
>>> > There is also an Encoding class in VC++, detail here. But we cannot use
>>> > it here.
>>> >
>>> > So does anyone know something about locales on Windows?
>>> > Again, shall we use UTF-8 as our default?
>>> >
>>> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <[email protected]> wrote:
>>> >>
>>> >> It seems we should add it to DRLVM.
>>> >>
>>> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <[email protected]> wrote:
>>> >>>
>>> >>> Nathan Beyer wrote:
>>> >>>>
>>> >>>> Is the IBM VME dealing with this correctly? Do we just need to fix
>>> >>>> DRLVM?
>>> >>>
>>> >>> Yes. I only tested on Linux, where the IBM VME sets the property correctly.
>>> >>>
>>> >>>>
>>> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<[email protected]> wrote:
>>> >>>>>
>>> >>>>> Kevin Zhou wrote:
>>> >>>>>>
>>> >>>>>> Yes. According to luniglob.c, CL attempts to read the "file.encoding"
>>> >>>>>> property from the VM but fails to get the correct encoding.
>>> >>>>>>
>>> >>>>>> Regis, do you know any other specific way that CL can obtain the
>>> >>>>>> right property?
>>> >>>>>
>>> >>>>> We can get it from the OS directly. Maybe just read the environment
>>> >>>>> variables on Linux?
>>> >>>>>
>>> >>>>>> On Wed, Jul 15, 2009 at 9:59 AM, Regis <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>>> Charles Lee wrote:
>>> >>>>>>>
>>> >>>>>>>> Hi Nathan,
>>> >>>>>>>>
>>> >>>>>>>> If the file encoding derives from the OS, there must be a bug
>>> >>>>>>>> somewhere, because on my Linux machine the locale is en_US.UTF-8
>>> >>>>>>>> but our default codec is still ISO8859-1. Do you know where we can
>>> >>>>>>>> find the relevant code?
>>> >>>>>>>>
>>> >>>>>>> Classlib expects the VM to do this and set the property, but it
>>> >>>>>>> doesn't, so we have to do it ourselves.
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <[email protected]>
>>> >>>>>>>> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> Are we talking about Windows or Linux? The default file encoding
>>> >>>>>>>>> should derive from the OS. I believe that's defined by the specs.
>>> >>>>>>>>>
>>> >>>>>>>>> Sent from my iPhone
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <[email protected]>
>>> >>>>>>>>> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <[email protected]>
>>> >>>>>>>>>> wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>>> Hi,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Charles, I believe UTF-8 is the default encoding for the RI, and
>>> >>>>>>>>>>> it sounds reasonable.
>>> >>>>>>>>>>> BTW, we may encounter some compatibility problems; maybe we need
>>> >>>>>>>>>>> to run more tests to verify?
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> 2009/7/14 Charles Lee <[email protected]>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> Hi guys:
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> I am running the Ant JUnit test cases and hitting some encoding
>>> >>>>>>>>>>>> problems. I think they may be caused by the different default
>>> >>>>>>>>>>>> encodings of the RI and Harmony. My locale is en_US.UTF-8; the
>>> >>>>>>>>>>>> RI default is UTF-8 but Harmony's is 8859-1. I then came across
>>> >>>>>>>>>>>> HARMONY-3736 <https://issues.apache.org/jira/browse/HARMONY-3736>
>>> >>>>>>>>>>>> and the two diffs attached to that issue. It seems we always get
>>> >>>>>>>>>>>> 8859-1, because (correct me if I am wrong :-)
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> 1. We removed the code that sets it in the VM, so we always get
>>> >>>>>>>>>>>> null when we call the VM method.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> 2. We set file.encoding in libglob.c: if we get null from the
>>> >>>>>>>>>>>> VM, we set 8859-1.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Sorry, it should be luniglob.c
>>> >>>>>>>>>>
>>> >>>>>>>>>>>> 3. We cannot set file.encoding at run time.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Ant uses UTF-8 to encode file names that contain non-ASCII
>>> >>>>>>>>>>>> characters, so why do we use ISO8859-1 as our unchangeable
>>> >>>>>>>>>>>> default? The wiki at http://en.wikipedia.org/wiki/ISO8859-1
>>> >>>>>>>>>>>> says: "In computing applications, encodings that provide full
>>> >>>>>>>>>>>> UCS support (such as UTF-8 <http://en.wikipedia.org/wiki/UTF-8>
>>> >>>>>>>>>>>> and UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
>>> >>>>>>>>>>>> increasing favor over encodings based on ISO 8859-1." Should we
>>> >>>>>>>>>>>> simply change ISO8859-1 to UTF-8?
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> --
>>> >>>>>>>>>>>> Yours sincerely,
>>> >>>>>>>>>>>> Charles Lee
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>> --
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Best Regards!
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Jimmy, Jing Lv
>>> >>>>>>>>>>> China Software Development Lab, IBM
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>> --
>>> >>>>>>>>>> Yours sincerely,
>>> >>>>>>>>>> Charles Lee
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>> --
>>> >>>>>>> Best Regards,
>>> >>>>>>> Regis.
>>> >>>>>>>
>>> >>>>>
>>> >>>>> --
>>> >>>>> Best Regards,
>>> >>>>> Regis.
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Best Regards,
>>> >>> Regis.
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Yours sincerely,
>>> >> Charles Lee
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Yours sincerely,
>>> > Charles Lee
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> Yours sincerely,
>> Charles Lee
>>
>
