Re: Shall we change our file.encoding

Nathan Beyer Thu, 16 Jul 2009 18:51:10 -0700

On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<[email protected]> wrote:
> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<[email protected]> wrote:
>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<[email protected]> wrote:
>>> Hi Nathan,
>>>
>>> What I got is 936, the code page identifier. Is there an API for us to map
>>> 936 to gb2312?
>>
>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
>> that into a name of some sort. I'll poke around a bit and see what I
>> can find.
>
> We'll probably just have to put in a mapping ourselves based on the
> documentation. We'd call GetACP [1] and map that to a known alias in
> java.nio.charset that matches the definitions[2] of the identifiers.
>
> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
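
A minimal sketch of such a hand-written mapping, assuming we only care
about a handful of code pages. The entries follow the names listed on the
code page identifiers page [2]; the struct, table, and function names are
hypothetical, and whether each name resolves through java.nio.charset
would still need checking.

```c
#include <stddef.h>

/* Hypothetical table: a few Windows code page identifiers and the
 * names the MSDN identifiers page lists for them. Not exhaustive. */
struct codepage_name {
    unsigned int codepage;
    const char *name;
};

static const struct codepage_name CODEPAGE_NAMES[] = {
    { 932,   "shift_jis" },
    { 936,   "gb2312" },
    { 949,   "ks_c_5601-1987" },
    { 950,   "big5" },
    { 1250,  "windows-1250" },
    { 1251,  "windows-1251" },
    { 1252,  "windows-1252" },
    { 20127, "us-ascii" },
    { 28591, "iso-8859-1" },
    { 65001, "utf-8" },
};

/* Returns the name for codepage, or fallback if it is not in the table. */
static const char *codepage_to_charset(unsigned int codepage,
                                       const char *fallback)
{
    size_t i;
    for (i = 0; i < sizeof(CODEPAGE_NAMES) / sizeof(CODEPAGE_NAMES[0]); i++) {
        if (CODEPAGE_NAMES[i].codepage == codepage) {
            return CODEPAGE_NAMES[i].name;
        }
    }
    return fallback;
}
```

On Windows this would be called as codepage_to_charset(GetACP(), "ISO-8859-1"),
so an unknown code page falls back to the current default.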


This may be better - APR has a function for getting the OS default
encoding. This would work across all platforms that APR supports and I
believe we already use APR.

http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e
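
For comparison, a minimal sketch of the POSIX-only route, independent of
APR: setlocale picks up the user's locale from the environment and
nl_langinfo(CODESET) reports its character encoding. The helper name and
the fallback parameter are hypothetical, and this does not cover Windows.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: returns the codeset name of the user's locale
 * (for example "UTF-8" under en_US.UTF-8), or fallback if none is
 * reported. Side effect: switches the process LC_CTYPE locale to the
 * one named by the environment. The returned pointer may be
 * invalidated by later setlocale/nl_langinfo calls, so copy it if it
 * needs to be kept. */
static const char *posix_locale_charset(const char *fallback)
{
    const char *codeset;

    /* Adopt LC_ALL / LC_CTYPE / LANG from the environment. If this
     * fails, the locale is left unchanged. */
    setlocale(LC_CTYPE, "");

    codeset = nl_langinfo(CODESET);
    if (codeset == NULL || *codeset == '\0') {
        return fallback;
    }
    return codeset;
}
```

The result could then be used as the value for file.encoding, for example
posix_locale_charset("ISO-8859-1") to keep the current default as fallback.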

-Nathan
>
>>
>>> If we put 936 in the file.encoding, can we successfully get the encoder and
>>> decoder by charset?
>>>
>>> On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <[email protected]> wrote:
>>>
>>>> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<[email protected]> wrote:
>>>> > Hi guys,
>>>> >
>>>> > I have added the locale function to drlvm; the patch is attached.
>>>> > Please try this new patch on Linux.
>>>> >
>>>> > The patch should work on Linux but fail on Windows, because on
>>>> > Windows setlocale returns a code page, not a charset name.
>>>>
>>>> Code page and character set are the same thing. We shouldn't need to
>>>> convert it as the Charset APIs will have to support the values anyway.
>>>>
>>>> What's the value you're getting? If it's 'Cp1252', then we're good, as
>>>> that's just an alias for 'Windows-1252' (or vice-versa).
>>>>
>>>> -Nathan
>>>>
>>>>
>>>> > I have tried for a long time to
>>>> > get the charset name from the code page, for example:
>>>> > CPINFOEX cpInfoEx;
>>>> > BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
>>>> > if (iReturn) {
>>>> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
>>>> > }
>>>> > But I only get the full descriptive name, not a name in any standard format.
>>>> >
>>>> > There is a map of code page identifiers on MSDN (details here). I could
>>>> > hard-code this map in the file, but the note on MSDN says:
>>>> >      "ANSI code pages can be different on different computers, or can be
>>>> > changed for a single computer, leading to data corruption. For the most
>>>> > consistent results, applications should use Unicode, such as UTF-8 or
>>>> > UTF-16, instead of a specific code page."
>>>> > I am afraid hard-coding will fail on some machines. (By the way, it seems
>>>> > UTF-8 is being suggested as the default again :-)
>>>> >
>>>> > There is also an Encoding class in VC++ (details here), but we cannot
>>>> > use it here.
>>>> >
>>>> > So does anyone know something about locales on Windows?
>>>> > Again, shall we use UTF-8 as our default?
>>>> >
>>>> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <[email protected]> wrote:
>>>> >>
>>>> >> It seems we should add it in drlvm.
>>>> >>
>>>> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <[email protected]> wrote:
>>>> >>>
>>>> >>> Nathan Beyer wrote:
>>>> >>>>
>>>> >>>> Is the IBM VME dealing with this correctly? Do we just need to fix
>>>> >>>> DRLVM?
>>>> >>>
>>>> >>> Yes. I only tested on Linux; IBM VME sets the property correctly.
>>>> >>>
>>>> >>>>
>>>> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<[email protected]> wrote:
>>>> >>>>>
>>>> >>>>> Kevin Zhou wrote:
>>>> >>>>>>
>>>> >>>>>> Yes, in luniglob.c, CL attempts to read the "file.encoding" property
>>>> >>>>>> from the VM but fails to get the correct encoding.
>>>> >>>>>>
>>>> >>>>>> Regis, do you know of any other way for CL to get the right
>>>> >>>>>> property?
>>>> >>>>>
>>>> >>>>> We can get it from the OS directly. Maybe just read env variables on Linux?
>>>> >>>>>
>>>> >>>>>> On Wed, Jul 15, 2009 at 9:59 AM, Regis <[email protected]> wrote:
>>>> >>>>>>
>>>> >>>>>>> Charles Lee wrote:
>>>> >>>>>>>
>>>> >>>>>>>> Hi Nathan,
>>>> >>>>>>>>
>>>> >>>>>>>> If the file encoding derives from the OS, there must be some bug
>>>> >>>>>>>> in it, because on my Linux machine the locale is en_US.UTF-8 but
>>>> >>>>>>>> our default codec is still ISO8859-1. Do you know where we can
>>>> >>>>>>>> find that code?
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>> Classlib expects the VM to do this and set the property, but it
>>>> >>>>>>> doesn't, so we have to do this ourselves.
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <[email protected]>
>>>> >>>>>>>> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>> Are we talking about Windows or Linux? The default file encoding
>>>> >>>>>>>>> should derive from the OS. I believe that's defined by the specs.
>>>> >>>>>>>>>
>>>> >>>>>>>>> Sent from my iPhone
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <[email protected]>
>>>> >>>>>>>>> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>>> On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv
>>>> >>>>>>>>>> <[email protected]> wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>> Hi,
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Charles, I believe UTF-8 is the default encoding for RI, and it
>>>> >>>>>>>>>>> sounds reasonable.
>>>> >>>>>>>>>>> BTW, it may run into some compatibility problems; maybe we need
>>>> >>>>>>>>>>> to run more tests to verify?
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> 2009/7/14 Charles Lee <[email protected]>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>> Hi guys:
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> I am running some of the ant junit test cases and hitting
>>>> >>>>>>>>>>>> some encoding problems. I think they may be caused by the
>>>> >>>>>>>>>>>> different default encodings of RI and harmony. My locale is
>>>> >>>>>>>>>>>> en_US.UTF-8; the RI default is UTF-8 but harmony's is 8859-1.
>>>> >>>>>>>>>>>> I then came across
>>>> >>>>>>>>>>>> HARMONY-3736 <https://issues.apache.org/jira/browse/HARMONY-3736>
>>>> >>>>>>>>>>>> and the two diffs attached to that issue. It seems we always
>>>> >>>>>>>>>>>> get 8859-1, because (correct me if I am wrong :-)
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> 1. We removed the code that sets it in the VM, so we always
>>>> >>>>>>>>>>>>    get null when we call the VM method.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> 2. We set file.encoding in libglob.c: if we got null from
>>>> >>>>>>>>>>>>    the VM, we set 8859-1.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Sorry, it should be luniglob.c
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>>> 3. We cannot set file.encoding at run time.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> Ant uses UTF-8 to encode filenames that contain non-ASCII
>>>> >>>>>>>>>>>> characters. So why do we use iso8859-1 as our unchangeable
>>>> >>>>>>>>>>>> default? The wiki page http://en.wikipedia.org/wiki/ISO8859-1
>>>> >>>>>>>>>>>> says: "In computing applications, encodings that provide full
>>>> >>>>>>>>>>>> UCS support (such as UTF-8 <http://en.wikipedia.org/wiki/UTF-8>
>>>> >>>>>>>>>>>> and UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
>>>> >>>>>>>>>>>> increasing favor over encodings based on ISO 8859-1." Should
>>>> >>>>>>>>>>>> we simply change iso8859-1 to utf-8?
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> --
>>>> >>>>>>>>>>>> Yours sincerely,
>>>> >>>>>>>>>>>> Charles Lee
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>> --
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Best Regards!
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Jimmy, Jing Lv
>>>> >>>>>>>>>>> China Software Development Lab, IBM
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>> --
>>>> >>>>>>>>>> Yours sincerely,
>>>> >>>>>>>>>> Charles Lee
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>> --
>>>> >>>>>>> Best Regards,
>>>> >>>>>>> Regis.
>>>> >>>>>>>
>>>> >>>>>
>>>> >>>>> --
>>>> >>>>> Best Regards,
>>>> >>>>> Regis.
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Best Regards,
>>>> >>> Regis.
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Yours sincerely,
>>>> >> Charles Lee
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Yours sincerely,
>>>> > Charles Lee
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Yours sincerely,
>>> Charles Lee
>>>
>>
>
