Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
On Thu, 2009-06-18 at 04:47 +0300, Yitzchak Gale wrote:
> I wrote:
> >> OK, would you like me to reflect this discussion in tickets?
> >> Let's see, so far we have #3300, I don't see anything else.
> >>
> >> Do you want two tickets, one each for Windows/Unix? Or
> >> four, separating the FilePath and getArgs issues?
>
> Simon Marlow wrote:
> > One for each issue is usually better, so four.
>
> OK, they are: #3300, #3307, #3308, #3309.

Could we please make clear in those tickets that they only affect Windows? I do hope we are only proposing that FilePath be interpreted as Unicode on Windows and OS X. It would break things to decode to Unicode on Unix systems. On Unix, file paths really are strings of bytes, not an encoding of Unicode code points.

It's true that this is not reflected accurately in the type FilePath = String. FilePath should be an opaque type that allows decoding into a human-readable Unicode String.

I wonder how much code would actually break if FilePath became an opaque type, e.g. if we make it an instance of IsString. It would only need to change in System.IO and System.FilePath, not in the old H98 modules.

Duncan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
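A minimal sketch of the opaque-type idea, under stated assumptions: the names RawPath, fromBytes, toBytes and toDisplayString are hypothetical and not part of any library, string literals are assumed to be encoded as UTF-8 (a real design might use the locale instead), and the human-readable view is deliberately lossy.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.String (IsString (..))
import Data.Word (Word8)

-- Hypothetical opaque path: the raw bytes the Unix file APIs would receive.
newtype RawPath = RawPath [Word8]
  deriving (Eq, Show)

-- Bytes pass through untouched, so names obtained from the OS round-trip.
fromBytes :: [Word8] -> RawPath
fromBytes = RawPath

toBytes :: RawPath -> [Word8]
toBytes (RawPath bs) = bs

-- Assumption: literals are encoded as UTF-8, not in the current locale.
instance IsString RawPath where
  fromString = RawPath . concatMap utf8

-- Plain UTF-8 encoding of a single code point.
utf8 :: Char -> [Word8]
utf8 c
  | cp < 0x80 = [fromIntegral cp]
  | cp < 0x800 = [0xC0 .|. hi 6, lo 0]
  | cp < 0x10000 = [0xE0 .|. hi 12, lo 6, lo 0]
  | otherwise = [0xF0 .|. hi 18, lo 12, lo 6, lo 0]
  where
    cp = ord c
    hi s = fromIntegral (cp `shiftR` s)
    lo s = 0x80 .|. (fromIntegral (cp `shiftR` s) .&. 0x3F)

-- Lossy view for humans: ASCII is shown as-is, every other byte as U+FFFD.
toDisplayString :: RawPath -> String
toDisplayString (RawPath bs) = map display bs
  where
    display b
      | b < 0x80 = chr (fromIntegral b)
      | otherwise = '\xFFFD'
```

With the OverloadedStrings extension, a literal such as "foo.txt" could then be used directly where a RawPath is expected, which is what would limit breakage in existing code that only passes literals around.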
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
I wrote:
>> OK, would you like me to reflect this discussion in tickets?
>> Let's see, so far we have #3300, I don't see anything else.
>>
>> Do you want two tickets, one each for Windows/Unix? Or
>> four, separating the FilePath and getArgs issues?

Simon Marlow wrote:
> One for each issue is usually better, so four.

OK, they are: #3300, #3307, #3308, #3309.

Regards,
Yitz
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
On 17/06/2009 15:03, Yitzchak Gale wrote:
> Simon Marlow wrote:
>> The following cases are currently broken...
>> I propose to fix these (on Windows). It will mean that your second case
>> above will be broken, until someone fixes getDirectoryContents...
>> ...it's a lot easier on Windows...
>> on Unix I don't have a clear idea of how to proceed...
>> If someone else has a good understanding of what
>> needs done, please wade in.
>> I don't know how getArgs fits in here...
>> I agree it's broken and needs to be fixed.
>
> OK, would you like me to reflect this discussion in tickets?
> Let's see, so far we have #3300, I don't see anything else.
>
> Do you want two tickets, one each for Windows/Unix? Or
> four, separating the FilePath and getArgs issues?

One for each issue is usually better, so four. Thanks!

>> On Unix, all file APIs take [Word8]...
>> So we should probably be converting from FilePath to
>> [Word8] by encoding using the current locale...
>> what about encoding errors,
>
> Where relevant, we should emulate what the common shells
> do. In general, I don't see why they should be different
> than any other file operation error.
>
>> and what if encode.decode is not the identity due to normalisation
>
> Well, is it common for people using typical input methods and
> common shells to create file paths containing text that decodes
> to non-normalized Unicode? I'm guessing not. If that's the case,
> then we don't really have to worry about it. People who went out
> of their way to create a weird file name will have the same
> troubles they have always had with that in Unix.
>
> But perhaps a better solution would be to make the underlying
> type of FilePath platform-dependent - e.g., String on Windows
> and [Word8] on Unix - and let it support platform-independent
> methods such as to/from String, to/from Bytes, setEncoding
> (defaulting to the current locale). That way, pass-through file
> paths will always work flawlessly on any platform, and applications
> have complete flexibility to deal with any other scenario
> however they choose.
>
> It's a breaking change though.

Yes, we could do a lot better if FilePath were an abstract type, but sadly it is not, and we can't change that without breaking Haskell 98 compatibility, not to mention tons of existing code.

Cheers,
Simon
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
Ketil Malde wrote:
> If we want to incorporate a translation layer, I think it's fair to
> only support UTF-8 (ignoring locales), but provide a workaround for
> invalid characters.

I disagree. Shells and GUI dialogs use the current locale. I think most other modern programming languages do too, but correct me if I am wrong.

Still, your ideas about dealing with decoding errors sound useful.

Regards,
Yitz
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
Simon Marlow wrote:
>>> The following cases are currently broken...
>>> I propose to fix these (on Windows). It will mean that your second case
>>> above will be broken, until someone fixes getDirectoryContents...

> ...it's a lot easier on Windows...
> on Unix I don't have a clear idea of how to proceed...
> If someone else has a good understanding of what
> needs done, please wade in.

>>> I don't know how getArgs fits in here...

> I agree it's broken and needs to be fixed.

OK, would you like me to reflect this discussion in tickets? Let's see, so far we have #3300, I don't see anything else.

Do you want two tickets, one each for Windows/Unix? Or four, separating the FilePath and getArgs issues?

> On Unix, all file APIs take [Word8]...
> So we should probably be converting from FilePath to
> [Word8] by encoding using the current locale...
> what about encoding errors,

Where relevant, we should emulate what the common shells do. In general, I don't see why they should be different than any other file operation error.

> and what if encode.decode is not the identity due to normalisation

Well, is it common for people using typical input methods and common shells to create file paths containing text that decodes to non-normalized Unicode? I'm guessing not. If that's the case, then we don't really have to worry about it. People who went out of their way to create a weird file name will have the same troubles they have always had with that in Unix.

But perhaps a better solution would be to make the underlying type of FilePath platform-dependent - e.g., String on Windows and [Word8] on Unix - and let it support platform-independent methods such as to/from String, to/from Bytes, setEncoding (defaulting to the current locale). That way, pass-through file paths will always work flawlessly on any platform, and applications have complete flexibility to deal with any other scenario however they choose.

It's a breaking change though.

Thanks,
Yitz
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
Simon Marlow writes:

>> Why only on Windows?

> Just because it's a lot easier on Windows - all the OS APIs take
> Unicode file paths, so it's obvious what to do. In contrast on Unix I
> don't have a clear idea of how to proceed.

> On Unix, all file APIs take [Word8] rather than [Char]. By
> convention, the [Word8] is usually assumed to be a string in the
> locale encoding, but that's only a user-space convention.

If we want to incorporate a translation layer, I think it's fair to only support UTF-8 (ignoring locales), but provide a workaround for invalid characters.

From http://en.wikipedia.org/wiki/UTF-8:

| Therefore many modern UTF-8 converters translate errors to
| something "safe". Only one byte is changed into the error
| replacement and parsing starts again at the next byte, otherwise
| concatenating strings could change good characters into
| errors. Popular replacements for each byte are:
|
|    * nothing (the bytes vanish)
|    * '?' or '¿'
|    * The replacement character (U+FFFD)
|    * The byte from ISO-8859-1 or CP1252
|    * An invalid Unicode code point, usually U+DCxx where xx is the byte's value

How about using the last one? This would allow 'readFile' to work on FilePaths provided by 'getDirectoryContents', while allowing for real Unicode string literals.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
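To make the U+DCxx suggestion concrete, here is a rough sketch assuming a hand-written UTF-8 codec; the names escapeByte, decodeLenient and encodeLenient are made up for illustration and are not an existing API. Each byte that cannot be decoded becomes the code point U+DC00 plus the byte's value, and the encoder turns such code points back into the original byte, so bytes -> String -> bytes should be the identity even for names that are not valid UTF-8.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Escape an undecodable byte as the lone surrogate U+DC00 + byte value.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xDC00 + fromIntegral b)

isCont :: Word8 -> Bool
isCont b = b .&. 0xC0 == 0x80

-- Decode UTF-8; on any error escape exactly one byte and resume at the next.
decodeLenient :: [Word8] -> String
decodeLenient [] = []
decodeLenient (b : rest)
  | b < 0x80 = chr (fromIntegral b) : decodeLenient rest
  | b .&. 0xE0 == 0xC0 = multi 1 0x80 (fromIntegral (b .&. 0x1F))
  | b .&. 0xF0 == 0xE0 = multi 2 0x800 (fromIntegral (b .&. 0x0F))
  | b .&. 0xF8 == 0xF0 = multi 3 0x10000 (fromIntegral (b .&. 0x07))
  | otherwise = bad
  where
    bad = escapeByte b : decodeLenient rest
    multi n minCp lead =
      let (conts, rest') = splitAt n rest
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)) lead conts :: Int
          -- Reject truncated, overlong, out-of-range and surrogate encodings,
          -- so that valid input never collides with the escape range.
          ok = length conts == n && all isCont conts
                 && cp >= minCp && cp <= 0x10FFFF
                 && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if ok then chr cp : decodeLenient rest' else bad

-- Encode to UTF-8, turning escaped code points back into their raw byte.
encodeLenient :: String -> [Word8]
encodeLenient = concatMap enc
  where
    enc c
      | cp >= 0xDC80 && cp <= 0xDCFF = [fromIntegral (cp - 0xDC00)]
      | cp < 0x80 = [fromIntegral cp]
      | cp < 0x800 = [0xC0 .|. hi 6, lo 0]
      | cp < 0x10000 = [0xE0 .|. hi 12, lo 6, lo 0]
      | otherwise = [0xF0 .|. hi 18, lo 12, lo 6, lo 0]
      where
        cp = ord c
        hi s = fromIntegral (cp `shiftR` s)
        lo s = 0x80 .|. (fromIntegral (cp `shiftR` s) .&. 0x3F)
```

In this sketch only bytes of 0x80 and above can ever be escaped, so the escape range is U+DC80 to U+DCFF. A String literal that happens to contain one of those code points would be written out as a raw byte, which is the price of the round trip.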
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
On 17/06/2009 13:21, Yitzchak Gale wrote:
> I wrote:
>>> I think the most important use cases that should not break are:
>>>
>>> o open/read/write a FilePath from getArgs
>>> o open/read/write a FilePath from getDirectoryContents
>
> Simon Marlow wrote:
>> The following cases are currently broken:
>>
>> * Calling openFile on a literal Unicode FilePath (note, not
>>   ACP-encoded, just Unicode).
>>
>> * Reading a Unicode FilePath from a text file and then calling
>>   openFile on it
>>
>> I propose to fix these (on Windows). It will mean that your second case
>> above will be broken, until someone fixes getDirectoryContents.
>
> Why only on Windows?

Just because it's a lot easier on Windows - all the OS APIs take Unicode file paths, so it's obvious what to do. In contrast on Unix I don't have a clear idea of how to proceed.

On Unix, all file APIs take [Word8] rather than [Char]. By convention, the [Word8] is usually assumed to be a string in the locale encoding, but that's only a user-space convention. So we should probably be converting from FilePath to [Word8] by encoding using the current locale. This raises various complications (what about encoding errors, and what if encode.decode is not the identity due to normalisation, etc.).

But you don't have to wait for me to fix this stuff (I'm feeling a bit Unicoded-out right now :) If someone else has a good understanding of what needs done, please wade in.

>> I don't know how getArgs fits in here - should we be decoding argv using the
>> ACP?
>
> And why not also on Unix? On any platform, the expected behavior should
> be that you type a file path at the command line, read it using getArgs,
> and open the file using that.

Right. On Unix it works at the moment because we neither decode argv nor encode FilePaths, so the bytes get passed through unchanged. Same with getDirectoryContents. But I agree it's broken and needs to be fixed.

Cheers,
Simon
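As a rough illustration of the pass-through behaviour described for Unix, the sketch below models it with two hypothetical helpers, bytesToChars and charsToBytes. They are stand-ins written for this example, not the actual library code: each byte is assumed to become one Char on the way in, and each Char is assumed to be truncated to one byte on the way out.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Model of "no decoding": every byte from argv or a directory entry
-- becomes one Char with the same numeric value.
bytesToChars :: [Word8] -> String
bytesToChars = map (chr . fromIntegral)

-- Model of "no encoding": every Char is truncated to its low 8 bits
-- before being handed to the OS.
charsToBytes :: String -> [Word8]
charsToBytes = map (fromIntegral . ord)
```

Under this model a name that came from the OS survives unchanged, since charsToBytes (bytesToChars bs) gives back bs. A Unicode literal does not: the UTF-8 name with bytes C3 A9 shows up as the two-character String "\195\169", while the literal "\233" goes out as the single byte E9, and a character such as U+0142 is truncated to 0x42. That matches the two halves of the discussion: pass-through works by accident, and literal Unicode FilePaths are broken.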
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
I wrote:
>> I think the most important use cases that should not break are:
>>
>> o open/read/write a FilePath from getArgs
>> o open/read/write a FilePath from getDirectoryContents

Simon Marlow wrote:
> The following cases are currently broken:
>
> * Calling openFile on a literal Unicode FilePath (note, not
>   ACP-encoded, just Unicode).
>
> * Reading a Unicode FilePath from a text file and then calling
>   openFile on it
>
> I propose to fix these (on Windows). It will mean that your second case
> above will be broken, until someone fixes getDirectoryContents.

Why only on Windows?

> I don't know how getArgs fits in here - should we be decoding argv using the
> ACP?

And why not also on Unix? On any platform, the expected behavior should be that you type a file path at the command line, read it using getArgs, and open the file using that.

For comparison, Python works that way, even though the variable is called "argv" there.

The current behavior on Unix of returning, say, UTF-8 encoding characters in a String as if they were individual Unicode characters, is queer. Given your fantastic work so far to rid System.IO of those kinds of oddities, perhaps now is the time to finish the job.

If you think we really need to provide access to the raw argv bytes, we could add another platform-independent function that does that.

Thanks,
Yitz
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
On 17/06/2009 09:38, Bulat Ziganshin wrote:
> Hello Simon,
>
> Wednesday, June 17, 2009, 11:55:15 AM, you wrote:
>
>> Right, so getArgs is already fine.
>
> it's what i've found in Jun15 sources:
>
> #ifdef __GLASGOW_HASKELL__
> getArgs :: IO [String]
> getArgs =
>   alloca $ \ p_argc ->
>   alloca $ \ p_argv -> do
>    getProgArgv p_argc p_argv
>    p    <- fromIntegral `liftM` peek p_argc
>    argv <- peek p_argv
>    peekArray (p - 1) (advancePtr argv 1) >>= mapM peekCString
>
> foreign import ccall unsafe "getProgArgv"
>   getProgArgv :: Ptr CInt -> Ptr (Ptr CString) -> IO ()
>
> it uses peekCString so by any means it cannot produce unicode chars

I see, so you were previously quoting code from some other source. Where did the GetCommandLineW version come from? Do you know of any issues that would prevent us using it in GHC?

Cheers,
Simon
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
On 16/06/2009 17:06, Bulat Ziganshin wrote:
> Hello Simon,
>
> Tuesday, June 16, 2009, 7:54:02 PM, you wrote:
>
>> In fact there's not a lot left to convert in System.Directory, as you'll
>> see if you look at the code. Feel like helping?
>
> these functions used there are ACP-only:
>
> c_stat c_chmod System.Win32.getFullPathName c_SearchPath c_SHGetFolderPath

Yes, except for getFullPathName:

  foreign import stdcall unsafe "GetFullPathNameW"
    c_GetFullPathName :: LPCTSTR -> DWORD -> LPTSTR -> Ptr LPTSTR -> IO DWORD

> plus maybe some more functions from the System.Win32 package - i haven't
> looked into it

System.Win32 is using the wide-char APIs exclusively (ok, I haven't checked, but I don't know of any System.Win32 functions still using narrow strings).

So as you can see, there's not much left to do. I'll fix openFile.

Cheers,
Simon
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
On 16/06/2009 21:19, Bulat Ziganshin wrote:
> Hello Simon,
>
> Tuesday, June 16, 2009, 5:02:43 PM, you wrote:
>
>> I don't know how getArgs fits in here - should we be decoding argv using
>> the ACP?
>
> myGetArgs = do
>   alloca $ \p_argc -> do
>     p_argv_w <- commandLineToArgvW getCommandLineW p_argc
>     argc     <- peek p_argc
>     argv_w   <- peekArray (i argc) p_argv_w
>     mapM peekTString argv_w >>= return.tail
>
> foreign import stdcall unsafe "windows.h GetCommandLineW"
>   getCommandLineW :: LPTSTR
>
> foreign import stdcall unsafe "windows.h CommandLineToArgvW"
>   commandLineToArgvW :: LPCWSTR -> Ptr CInt -> IO (Ptr LPWSTR)

Right, so getArgs is already fine.

Cheers,
Simon
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
On 16/06/2009 16:44, Bulat Ziganshin wrote:
> Hello Simon,
>
> Tuesday, June 16, 2009, 7:30:55 PM, you wrote:
>
>> Actually we use a mixture of CRT functions and native Windows API,
>> gradually moving in the direction of the latter.
>
> so file-related APIs are already unpredictable, and will remain in
> this state for an unknown number of ghc versions

Sometimes fixing everything at the same time is too hard :-)

In fact there's not a lot left to convert in System.Directory, as you'll see if you look at the code. Feel like helping?

Cheers,
Simon
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
On 16/06/2009 14:56, Bulat Ziganshin wrote:
> Hello Simon,
>
> Tuesday, June 16, 2009, 5:02:43 PM, you wrote:
>
>> Also currently broken:
>>
>> * calling removeFile on a FilePath you get from getDirectoryContents,
>>   amongst other System.Directory operations
>>
>> Fixing getDirectoryContents will fix these.
>
> no. removeFile like anything else also uses ACP-based api

What code are you looking at? Here is System.Directory.removeFile:

  removeFile :: FilePath -> IO ()
  removeFile path =
  #if mingw32_HOST_OS
    System.Win32.deleteFile path
  #else
    System.Posix.removeLink path
  #endif

and System.Win32.deleteFile:

  deleteFile :: String -> IO ()
  deleteFile name =
    withTString name $ \ c_name ->
    failIfFalse_ "DeleteFile" $ c_DeleteFile c_name

  foreign import stdcall unsafe "windows.h DeleteFileW"
    c_DeleteFile :: LPCTSTR -> IO Bool

Note that it's calling DeleteFileW, and using wide-char strings.

> Windows libraries emulate the POSIX API (open, opendir, stat and so on) by
> translating these (char-based) calls into the A-family. GHC libs are
> written the Unix way, so these are effectively bound to the A-family of
> the Win API

Actually we use a mixture of CRT functions and native Windows API, gradually moving in the direction of the latter.

Cheers,
Simon
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
On 16/06/2009 13:46, Yitzchak Gale wrote:
> Simon Marlow wrote:
>>>> Care to submit a patch to put this in System.Directory, or better still
>>>> put the relevant functionality in System.Win32 and use it in
>>>> System.Directory?
>
> Bulat Ziganshin wrote:
>>> now getDirectoryContents return ACP (ansi code page) names so openFile
>>> works for files 1) and 2).
>>> With such change getDirectoryContents will return correct unicode
>>> names, so openFile will work only with names in first group.
>>> The right way is to fix ALL string-related calls in System.IO,
>>> System.Posix.Internals, System.Environment.
>
>> You're right in that we really ought to fix everything. However, I'm happy
>> to just fix some of these things, even if it introduces some inconsistencies
>> in the meantime. We already have much of System.Directory working with
>> Unicode FilePaths, so there are already inconsistencies here.
>
> +1 for integrating Unicode file paths. Thanks, Bulat!

Excuse my ignorance, but... what Unicode file paths?

> I think the most important use cases that should not break are:
>
> o open/read/write a FilePath from getArgs
> o open/read/write a FilePath from getDirectoryContents
>
> There's not much we can do about non-Latin-1 ACP file paths
> hard coded in Strings. I hope there aren't too many of those
> in the wild.

The following cases are currently broken:

 * Calling openFile on a literal Unicode FilePath (note, not ACP-encoded, just Unicode).

 * Reading a Unicode FilePath from a text file and then calling openFile on it

I propose to fix these (on Windows). It will mean that your second case above will be broken, until someone fixes getDirectoryContents.

Also currently broken:

 * calling removeFile on a FilePath you get from getDirectoryContents, amongst other System.Directory operations

Fixing getDirectoryContents will fix these.

I don't know how getArgs fits in here - should we be decoding argv using the ACP?

Cheers,
Simon
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
Simon Marlow wrote:
>>> Care to submit a patch to put this in System.Directory, or better still
>>> put the relevant functionality in System.Win32 and use it in
>>> System.Directory?

Bulat Ziganshin wrote:
>> now getDirectoryContents return ACP (ansi code page) names so openFile
>> works for files 1) and 2).
>> With such change getDirectoryContents will return correct unicode
>> names, so openFile will work only with names in first group.
>> The right way is to fix ALL string-related calls in System.IO,
>> System.Posix.Internals, System.Environment.

> You're right in that we really ought to fix everything. However, I'm happy
> to just fix some of these things, even if it introduces some inconsistencies
> in the meantime. We already have much of System.Directory working with
> Unicode FilePaths, so there are already inconsistencies here.

+1 for integrating Unicode file paths. Thanks, Bulat!

I think the most important use cases that should not break are:

 o open/read/write a FilePath from getArgs
 o open/read/write a FilePath from getDirectoryContents

There's not much we can do about non-Latin-1 ACP file paths hard coded in Strings. I hope there aren't too many of those in the wild.

Regards,
Yitz
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
On 16/06/2009 12:42, Bulat Ziganshin wrote:
> Hello Simon,
>
> Tuesday, June 16, 2009, 3:30:31 PM, you wrote:
>
>> Care to submit a patch to put this in System.Directory, or better still
>> put the relevant functionality in System.Win32 and use it in
>> System.Directory?
>
> Simon, it will somewhat break openFile. let's see. there are 3 types of
> filenames -
>
> 1) english (latin-1) only
> 2) including local (ansi code page) chars
> 3) including any other unicode chars
>
> now getDirectoryContents returns ACP (ansi code page) names so openFile
> works for files 1) and 2)
>
> with such a change getDirectoryContents will return correct unicode
> names, so openFile will work only with names in the first group
>
> the right way is to fix ALL string-related calls in System.IO,
> System.Posix.Internals, System.Environment

You're right in that we really ought to fix everything. However, I'm happy to just fix some of these things, even if it introduces some inconsistencies in the meantime. We already have much of System.Directory working with Unicode FilePaths, so there are already inconsistencies here.

Thanks for reminding me that openFile is also broken. It's easily fixed, so I'll look into that.

Cheers,
Simon
Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?
Hello Simon,

Tuesday, June 16, 2009, 3:30:31 PM, you wrote:

> Care to submit a patch to put this in System.Directory, or better still
> put the relevant functionality in System.Win32 and use it in
> System.Directory?

Simon, it will somewhat break openFile. let's see. there are 3 types of filenames -

1) english (latin-1) only
2) including local (ansi code page) chars
3) including any other unicode chars

now getDirectoryContents returns ACP (ansi code page) names so openFile works for files 1) and 2)

with such a change getDirectoryContents will return correct unicode names, so openFile will work only with names in the first group

the right way is to fix ALL string-related calls in System.IO, System.Posix.Internals, System.Environment

-- 
Best regards,
 Bulat    mailto:bulat.zigans...@gmail.com
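The three groups of file names and the two situations described above can be written down as a small model. This is only a sketch of the argument, not of any real API: NameClass, Listing, classify and openFileWorks are invented names, the code-page membership test is passed in as a parameter because the active code page depends on the system, and the first group is taken to be Latin-1 only because that is the wording used in this thread.

```haskell
import Data.Char (isLatin1)

-- The three groups of file names from the message above.
data NameClass = Latin1Name | CodePageName | OtherUnicodeName
  deriving (Eq, Show)

-- `inCodePage` is a caller-supplied (hypothetical) test for
-- "representable in the active ANSI code page".
classify :: (Char -> Bool) -> String -> NameClass
classify inCodePage name
  | all isLatin1 name = Latin1Name
  | all inCodePage name = CodePageName
  | otherwise = OtherUnicodeName

-- What getDirectoryContents hands back: ACP names (before the proposed
-- change) or proper Unicode names (after it).
data Listing = AcpListing | UnicodeListing
  deriving (Eq, Show)

-- Would an openFile that still goes through the ACP-based API accept a
-- name of the given class taken from such a listing?
openFileWorks :: Listing -> NameClass -> Bool
openFileWorks AcpListing cls = cls /= OtherUnicodeName
openFileWorks UnicodeListing cls = cls == Latin1Name
```

The table encoded in openFileWorks is the whole point: switching only getDirectoryContents to Unicode takes away group 2, which is why the message argues for converting every string-related call together.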