Re: Shall we change our file.encoding

Nathan Beyer Thu, 16 Jul 2009 19:05:55 -0700

On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<[email protected]> wrote:
> On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<[email protected]> wrote:
>> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<[email protected]> wrote:
>>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<[email protected]> wrote:
>>>> Hi Nathan,
>>>>
>>>> What I got is 936, the code page identifier. Is there an API for us to map
>>>> 936 to gb2312?
>>>
>>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
>>> that into a name of some sort. I'll poke around a bit and see what I
>>> can find.
>>
>> We'll probably just have to put in a mapping ourselves based on the
>> documentation. We'd call GetACP [1] and map that to a known alias in
>> java.nio.charset that matches the definitions[2] of the identifiers.
>>
>> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
>> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
>
> This may be better - APR has a function for getting the OS default
> encoding. This would work across all platforms that APR supports and I
> believe we already use APR.
>
> http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e


However, the Windows version of this is simply:

    return apr_psprintf(pool, "CP%u", (unsigned) GetACP());

which is essentially "CP" + codePageId.

And the Unix version of this function doesn't look very good for our purposes.
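
A minimal sketch of the hand-written mapping discussed above, assuming we take the identifier from GetACP and translate a few entries of the MSDN code page identifiers table into names, falling back to the APR-style "CP" + codePageId string for anything not listed. The helper name charset_name_for_code_page and the table are hypothetical, not existing Harmony or APR code, and whether each name resolves in java.nio.charset would still need to be checked.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper types; not part of Harmony or APR. */
struct cp_entry {
    unsigned code_page;
    const char *name;
};

/* Small illustrative subset of the MSDN code page identifiers table. */
static const struct cp_entry cp_table[] = {
    {   936, "gb2312"       },
    {  1252, "windows-1252" },
    { 20127, "us-ascii"     },
    { 28591, "iso-8859-1"   },
    { 65001, "utf-8"        },
};

/*
 * Writes a charset name for code_page into buf (of size buf_len) and
 * returns buf. Identifiers missing from the table fall back to "CP<id>",
 * the same form the APR Windows code produces.
 */
static const char *charset_name_for_code_page(unsigned code_page,
                                               char *buf, size_t buf_len)
{
    size_t i;
    for (i = 0; i < sizeof(cp_table) / sizeof(cp_table[0]); i++) {
        if (cp_table[i].code_page == code_page) {
            snprintf(buf, buf_len, "%s", cp_table[i].name);
            return buf;
        }
    }
    snprintf(buf, buf_len, "CP%u", code_page);
    return buf;
}
```

On Windows the caller would pass (unsigned) GetACP() as code_page. With this table, 936 yields "gb2312", and an unlisted identifier such as 437 yields "CP437".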
>
> -Nathan
>>
>>>
>>>> If we put 936 in file.encoding, can we successfully get the encoder and
>>>> decoder from Charset?
>>>>
>>>> On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <[email protected]> wrote:
>>>>
>>>>> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<[email protected]> wrote:
>>>>> > Hi guys,
>>>>> >
>>>>> > I have added the locale function to drlvm; the patch is attached. Please
>>>>> > try this new patch on Linux.
>>>>> >
>>>>> > The patch should work on Linux but fail on Windows, because on Windows
>>>>> > setlocale returns a code page, not a charset.
>>>>>
>>>>> Code page and character set are the same thing. We shouldn't need to
>>>>> convert it as the Charset APIs will have to support the values anyway.
>>>>>
>>>>> What's the value you're getting? If it's 'Cp1252', then we're good, as
>>>>> that's just an alias for 'Windows-1252' (or vice-versa).
>>>>>
>>>>> -Nathan
>>>>>
>>>>>
>>>>> > I have spent a long time trying to
>>>>> > get the charset name from the code page, for example:
>>>>> > CPINFOEX cpInfoEx;
>>>>> > /* CP_ACP selects the system default ANSI code page; flags must be 0 */
>>>>> > BOOL ok = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
>>>>> > if (ok) {
>>>>> >     printf("FULL NAME %s\n", cpInfoEx.CodePageName);
>>>>> > }
>>>>> > But I only get the full descriptive name, not a name in any standard
>>>>> > charset format.
>>>>> >
>>>>> > There is a code page identifiers map on MSDN, detail here. I may
>>>>> > hard-code this map in the file. But the note on MSDN says:
>>>>> >      "ANSI code pages can be different on different computers, or can be
>>>>> > changed for a single computer, leading to data corruption. For the most
>>>>> > consistent results, applications should use Unicode, such as UTF-8 or
>>>>> > UTF-16, instead of a specific code page."
>>>>> > I am afraid a hard-coded map will fail on some machines. (By the way, this
>>>>> > seems to suggest UTF-8 as the default again :-)
>>>>> >
>>>>> > There is also an Encoding class in VC++, detail here. But we cannot
>>>>> > use it here.
>>>>> >
>>>>> > So does anyone know something about locales on Windows?
>>>>> > Again, shall we use UTF-8 as our default?
>>>>> >
>>>>> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <[email protected]> wrote:
>>>>> >>
>>>>> >> It seems we should add it in the drlvm.
>>>>> >>
>>>>> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <[email protected]> wrote:
>>>>> >>>
>>>>> >>> Nathan Beyer wrote:
>>>>> >>>>
>>>>> >>>> Is the IBM VME dealing with this correctly? Do we just need to fix
>>>>> >>>> DRLVM?
>>>>> >>>
>>>>> >>> Yes. I only tested on Linux; the IBM VME sets the property correctly.
>>>>> >>>
>>>>> >>>>
>>>>> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<[email protected]> wrote:
>>>>> >>>>>
>>>>> >>>>> Kevin Zhou wrote:
>>>>> >>>>>>
>>>>> >>>>>> Yes, from luniglob.c, CL attempts to read the "file.encoding" property
>>>>> >>>>>> from the VM but fails to get the correct encoding.
>>>>> >>>>>>
>>>>> >>>>>> Regis, do you know any other specific way that CL can obtain the right
>>>>> >>>>>> property?
>>>>> >>>>>
>>>>> >>>>> We can get it from the OS directly. Maybe just read the environment
>>>>> >>>>> variables on Linux?
>>>>> >>>>>
>>>>> >>>>>> On Wed, Jul 15, 2009 at 9:59 AM, Regis <[email protected]> wrote:
>>>>> >>>>>>
>>>>> >>>>>>> Charles Lee wrote:
>>>>> >>>>>>>
>>>>> >>>>>>>> Hi Nathan,
>>>>> >>>>>>>>
>>>>> >>>>>>>> If the file encoding derives from the OS, there must be a bug somewhere,
>>>>> >>>>>>>> because on my Linux machine the locale is en_US.UTF-8, yet our default
>>>>> >>>>>>>> codec is still ISO8859-1. Do you know where we can find that code?
>>>>> >>>>>>>>
>>>>> >>>>>>>>
>>>>> >>>>>>> Classlib expects the VM to do this and set the property, but it doesn't,
>>>>> >>>>>>> so we have to do it ourselves.
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <[email protected]>
>>>>> >>>>>>>> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>>> Are we talking about Windows or Linux? The default file encoding should
>>>>> >>>>>>>>> derive from the OS. I believe that's defined by the specs.
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> Sent from my iPhone
>>>>> >>>>>>>>>
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <[email protected]>
>>>>> >>>>>>>>> wrote:
>>>>> >>>>>>>>>
>>>>> >>>>>>>>>> On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <[email protected]> wrote:
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>> Hi,
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Charles, I believe UTF-8 is the default encoding for the RI, and it
>>>>> >>>>>>>>>>> sounds reasonable.
>>>>> >>>>>>>>>>> BTW, it may cause some compatibility problems; maybe we need to
>>>>> >>>>>>>>>>> run more tests to verify?
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> 2009/7/14 Charles Lee <[email protected]>
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>>> Hi guys:
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> I am running some of the ant junit test cases and hitting some
>>>>> >>>>>>>>>>>> encoding problems. I think they may be caused by the different
>>>>> >>>>>>>>>>>> default encodings of the RI and harmony. My locale is en_US.UTF-8;
>>>>> >>>>>>>>>>>> the RI default is UTF-8 but harmony's is 8859-1. I then came across
>>>>> >>>>>>>>>>>> HARMONY-3736 <https://issues.apache.org/jira/browse/HARMONY-3736>
>>>>> >>>>>>>>>>>> and the two diffs attached to that issue. It seems we always get
>>>>> >>>>>>>>>>>> 8859-1, because (correct me if wrong :-)
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> 1. We removed the code that sets it in the VM, so we always get null
>>>>> >>>>>>>>>>>>    if we call the VM method.
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> 2. We set file.encoding in libglob.c: if we get null from the VM,
>>>>> >>>>>>>>>>>>    we set 8859-1.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> Sorry, it should be luniglob.c
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>>> 3. We cannot set file.encoding at run time.
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> ant uses UTF-8 to encode filenames that contain non-ASCII
>>>>> >>>>>>>>>>>> characters. So why do we use iso8859-1 as our unchangeable default?
>>>>> >>>>>>>>>>>> The wiki, http://en.wikipedia.org/wiki/ISO8859-1, says "In
>>>>> >>>>>>>>>>>> computing applications, encodings that provide full UCS support
>>>>> >>>>>>>>>>>> (such as UTF-8 <http://en.wikipedia.org/wiki/UTF-8> and
>>>>> >>>>>>>>>>>> UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding
>>>>> >>>>>>>>>>>> increasing favor over encodings based on ISO 8859-1." Should we
>>>>> >>>>>>>>>>>> simply change iso8859-1 to utf-8?
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> --
>>>>> >>>>>>>>>>>> Yours sincerely,
>>>>> >>>>>>>>>>>> Charles Lee
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>> --
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Best Regards!
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Jimmy, Jing Lv
>>>>> >>>>>>>>>>> China Software Development Lab, IBM
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>> --
>>>>> >>>>>>>>>> Yours sincerely,
>>>>> >>>>>>>>>> Charles Lee
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>
>>>>> >>>>>>> --
>>>>> >>>>>>> Best Regards,
>>>>> >>>>>>> Regis.
>>>>> >>>>>>>
>>>>> >>>>>
>>>>> >>>>> --
>>>>> >>>>> Best Regards,
>>>>> >>>>> Regis.
>>>>> >>>>>
>>>>> >>>>
>>>>> >>>
>>>>> >>>
>>>>> >>> --
>>>>> >>> Best Regards,
>>>>> >>> Regis.
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Yours sincerely,
>>>>> >> Charles Lee
>>>>> >>
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Yours sincerely,
>>>>> > Charles Lee
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Yours sincerely,
>>>> Charles Lee
>>>>
>>>
>>
>
