On Thu, Jul 16, 2009 at 8:50 PM, Nathan Beyer<ndbe...@apache.org> wrote:
> On Thu, Jul 16, 2009 at 8:35 PM, Nathan Beyer<ndbe...@apache.org> wrote:
>> On Thu, Jul 16, 2009 at 8:26 PM, Nathan Beyer<ndbe...@apache.org> wrote:
>>> On Thu, Jul 16, 2009 at 8:18 PM, Charles Lee<littlee1...@gmail.com> wrote:
>>>> Hi Nathan,
>>>>
>>>> What I got is 936, the code page identifier. Is there an API for us to
>>>> map 936 to gb2312?
>>>
>>> Oh, the 'identifier' bit was missing - yeah, we'll need to translate
>>> that into a name of some sort. I'll poke around a bit and see what I
>>> can find.
>>
>> We'll probably just have to put in a mapping ourselves based on the
>> documentation. We'd call GetACP [1] and map that to a known alias in
>> java.nio.charset that matches the definitions [2] of the identifiers.
>>
>> [1] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx
>> [2] http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx
>
> This may be better - APR has a function for getting the OS default
> encoding. This would work across all platforms that APR supports and I
> believe we already use APR.
>
> http://apr.apache.org/docs/apr/1.3/group__apr__portabile.html#g6e21845a4a5f3b7dd107b2beea50c91e

However, the Windows version of this is simply:

    return apr_psprintf(pool, "CP%u", (unsigned) GetACP());

which is essentially "CP" + codePageId. And the Unix version of this
function doesn't look very good for our purposes.

> -Nathan
>
>>>> If we put 936 in file.encoding, can we successfully get the encoder
>>>> and decoder by charset?
>>>>
>>>> On Fri, Jul 17, 2009 at 9:05 AM, Nathan Beyer <ndbe...@apache.org> wrote:
>>>>
>>>>> On Thu, Jul 16, 2009 at 1:28 AM, Charles Lee<littlee1...@gmail.com> wrote:
>>>>> > Hi guys,
>>>>> >
>>>>> > I have added the locale function to drlvm; the patch is attached.
>>>>> > Please try this new patch on Linux.
>>>>> >
>>>>> > The patch should work on Linux but fail on Windows, because
>>>>> > Windows returns a code page, not a charset, from setlocale.
>>>>>
>>>>> Code page and character set are the same thing. We shouldn't need to
>>>>> convert it as the Charset APIs will have to support the values anyway.
>>>>>
>>>>> What's the value you're getting? If it's 'Cp1252', then we're good, as
>>>>> that's just an alias for 'Windows-1252' (or vice-versa).
>>>>>
>>>>> -Nathan
>>>>>
>>>>> > I have tried for a long time to get the charset name from the
>>>>> > code page, for example:
>>>>> >     CPINFOEX cpInfoEx;
>>>>> >     BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
>>>>> >     if (iReturn) {
>>>>> >         printf("FULL NAME %s\n", cpInfoEx.CodePageName);
>>>>> >     }
>>>>> > But I only get the full display name, not a charset name in any
>>>>> > standard format.
>>>>> >
>>>>> > There is a code page identifiers map in the MSDN, detail here. I may
>>>>> > hard-code this map in the file. But the note on the MSDN says:
>>>>> > "ANSI code pages can be different on different computers, or can be
>>>>> > changed for a single computer, leading to data corruption. For the most
>>>>> > consistent results, applications should use Unicode, such as UTF-8 or
>>>>> > UTF-16, instead of a specific code page."
>>>>> > I am afraid hard-coding will fail on some machines. (By the way, it
>>>>> > seems UTF-8 is suggested to be the default again :-)
>>>>> >
>>>>> > There is also an Encoding class in VC++, detail here. But we cannot
>>>>> > use it here.
>>>>> >
>>>>> > So does anyone know something about locale on Windows?
>>>>> > Again, shall we use UTF-8 as our default?
>>>>> >
>>>>> > On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <littlee1...@gmail.com> wrote:
>>>>> >>
>>>>> >> It seems we should add it in drlvm.
>>>>> >>
>>>>> >> On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu.re...@gmail.com> wrote:
>>>>> >>>
>>>>> >>> Nathan Beyer wrote:
>>>>> >>>>
>>>>> >>>> Is the IBM VME dealing with this correctly? Do we just need to fix
>>>>> >>>> DRLVM?
>>>>> >>>
>>>>> >>> Yes. I only tested on Linux; the IBM VME sets the property correctly.
>>>>> >>>
>>>>> >>>> On Wed, Jul 15, 2009 at 12:25 AM, Regis<xu.re...@gmail.com> wrote:
>>>>> >>>>>
>>>>> >>>>> Kevin Zhou wrote:
>>>>> >>>>>>
>>>>> >>>>>> Yea, from luniglob.c, CL attempts to read the "file.encoding"
>>>>> >>>>>> property from the VM but fails to get the correct encoding.
>>>>> >>>>>>
>>>>> >>>>>> Regis, do you know any other specific ways that CL can obtain
>>>>> >>>>>> the right property?
>>>>> >>>>>
>>>>> >>>>> We can get it from the OS directly. Maybe just read env variables
>>>>> >>>>> on Linux?
>>>>> >>>>>
>>>>> >>>>>> On Wed, Jul 15, 2009 at 9:59 AM, Regis <xu.re...@gmail.com> wrote:
>>>>> >>>>>>
>>>>> >>>>>>> Charles Lee wrote:
>>>>> >>>>>>>
>>>>> >>>>>>>> Hi Nathan,
>>>>> >>>>>>>>
>>>>> >>>>>>>> If the file encoding derives from the OS, there must be some
>>>>> >>>>>>>> bug in it, because on my Linux machine the locale is
>>>>> >>>>>>>> en_US.UTF-8 and our default codec is still ISO8859-1. Do you
>>>>> >>>>>>>> know where we can find that code?
>>>>> >>>>>>>
>>>>> >>>>>>> Classlib expected the VM to do this and set the property, but it
>>>>> >>>>>>> didn't, so we have to do this ourselves.
>>>>> >>>>>>>
>>>>> >>>>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nbe...@gmail.com> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>>> Are we talking about Windows or Linux? The default file
>>>>> >>>>>>>>> encoding should derive from the OS. I believe that's defined
>>>>> >>>>>>>>> by the specs.
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> Sent from my iPhone
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee <littlee1...@gmail.com> wrote:
>>>>> >>>>>>>>>
>>>>> >>>>>>>>>> On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <firep...@gmail.com> wrote:
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>> Hi,
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Charles, I believe UTF-8 is the default encoding for RI, and
>>>>> >>>>>>>>>>> it sounds reasonable.
>>>>> >>>>>>>>>>> BTW, it may encounter some compatibility problems; maybe we
>>>>> >>>>>>>>>>> need to run more tests to verify?
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> 2009/7/14 Charles Lee <littlee1...@gmail.com>
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>>> Hi guys:
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> I am running some of the ant junit test cases and hitting
>>>>> >>>>>>>>>>>> some encoding problems. I find they may be caused by the
>>>>> >>>>>>>>>>>> different default encodings of RI and harmony. My locale is
>>>>> >>>>>>>>>>>> en_US.UTF-8; the RI default is UTF-8 but harmony's is
>>>>> >>>>>>>>>>>> 8859-1. And then I came across
>>>>> >>>>>>>>>>>> HARMONY-3736 <https://issues.apache.org/jira/browse/HARMONY-3736>
>>>>> >>>>>>>>>>>> and the two diffs attached to that issue. It seems we
>>>>> >>>>>>>>>>>> always get 8859-1, because (correct me if wrong :-):
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> 1. we removed the set code in the vm, so we will always get
>>>>> >>>>>>>>>>>>    null if we call the vm method.
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> 2. we set file.encoding in libglob.c; if we got null from
>>>>> >>>>>>>>>>>>    the vm, we set 8859-1.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> Sorry, it should be luniglob.c
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>>> 3. we cannot set file.encoding at run time.
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> ant uses UTF-8 to encode filenames which contain non-ASCII
>>>>> >>>>>>>>>>>> characters. So why do we use iso8859-1 as our unchangeable
>>>>> >>>>>>>>>>>> default?
>>>>> >>>>>>>>>>>> The wiki, http://en.wikipedia.org/wiki/ISO8859-1, says
>>>>> >>>>>>>>>>>> "In computing applications, encodings that provide full UCS
>>>>> >>>>>>>>>>>> support (such as UTF-8 <http://en.wikipedia.org/wiki/UTF-8>
>>>>> >>>>>>>>>>>> and UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are
>>>>> >>>>>>>>>>>> finding increasing favor over encodings based on ISO
>>>>> >>>>>>>>>>>> 8859-1." Should we simply change iso8859-1 to utf-8?
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> --
>>>>> >>>>>>>>>>>> Yours sincerely,
>>>>> >>>>>>>>>>>> Charles Lee
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> --
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Best Regards!
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Jimmy, Jing Lv
>>>>> >>>>>>>>>>> China Software Development Lab, IBM
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> --
>>>>> >>>>>>>>>> Yours sincerely,
>>>>> >>>>>>>>>> Charles Lee
>>>>> >>>>>>>
>>>>> >>>>>>> --
>>>>> >>>>>>> Best Regards,
>>>>> >>>>>>> Regis.
>>>>> >>>>>
>>>>> >>>>> --
>>>>> >>>>> Best Regards,
>>>>> >>>>> Regis.
>>>>> >>>
>>>>> >>> --
>>>>> >>> Best Regards,
>>>>> >>> Regis.
>>>>> >>
>>>>> >> --
>>>>> >> Yours sincerely,
>>>>> >> Charles Lee
>>>>> >
>>>>> > --
>>>>> > Yours sincerely,
>>>>> > Charles Lee
>>>>
>>>> --
>>>> Yours sincerely,
>>>> Charles Lee
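
For reference, the mapping discussed in this thread (take the identifier from GetACP, look it up in a table built from the MSDN code page list, and fall back to the "CP" + codePageId form that APR produces) could be sketched in C roughly as follows. This is only an illustration, not code from the patch: the struct, the table and the function name charset_name_for_code_page are made up here, and the table holds just a handful of entries from the MSDN list. The lookup is kept free of Windows headers so it compiles anywhere; on Windows the argument would be the value returned by GetACP().

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical table entry: Windows code page identifier -> charset name. */
struct cp_name {
    unsigned id;
    const char *name;
};

/* A few entries taken from the MSDN "Code Page Identifiers" list.
 * Deliberately incomplete; a real table would need many more rows. */
static const struct cp_name CP_NAMES[] = {
    {932,   "shift_jis"},
    {936,   "gb2312"},
    {949,   "ks_c_5601-1987"},
    {950,   "big5"},
    {1250,  "windows-1250"},
    {1251,  "windows-1251"},
    {1252,  "windows-1252"},
    {20127, "us-ascii"},
    {28591, "iso-8859-1"},
    {65001, "utf-8"},
};

/* Writes a charset name for code_page into buf and returns buf.
 * Identifiers missing from the table fall back to "CP<id>", the same
 * form the Windows branch of APR builds with apr_psprintf. */
static const char *charset_name_for_code_page(unsigned code_page,
                                              char *buf, size_t buf_len)
{
    size_t i;

    for (i = 0; i < sizeof(CP_NAMES) / sizeof(CP_NAMES[0]); i++) {
        if (CP_NAMES[i].id == code_page) {
            snprintf(buf, buf_len, "%s", CP_NAMES[i].name);
            return buf;
        }
    }
    snprintf(buf, buf_len, "CP%u", code_page);
    return buf;
}
```

For example, passing 936 gives "gb2312", while an identifier that is not in the table, such as 437, gives "CP437". Whether java.nio.charset accepts each returned name as an alias would still have to be checked against the Charset implementation in use.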