Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-18 Thread Duncan Coutts
On Thu, 2009-06-18 at 04:47 +0300, Yitzchak Gale wrote:
> I wrote:
> >> OK, would you like me to reflect this discussion in tickets?
> >> Let's see, so far we have #3300, I don't see anything else.
> >>
> >> Do you want two tickets, one each for Windows/Unix? Or
> >> four, separating the FilePath and getArgs issues?
> 
> Simon Marlow wrote:
> > One for each issue is usually better, so four.
> 
> OK, they are: #3300, #3307, #3308, #3309.

Could we please make clear in those tickets that they only affect
Windows? I do hope we are only proposing that FilePath be interpreted as
Unicode on Windows and OS X. It would break things to decode to Unicode on
Unix systems. On Unix, file paths really are strings of bytes, not an
encoding of Unicode code points. It's true that this is not reflected
accurately in the type FilePath = String.

FilePath should be an opaque type that allows decoding into a
human-readable Unicode String.

I wonder how much code would actually break if FilePath became an opaque
type, e.g. if we made it an instance of IsString. It would only need to
change in
System.IO and System.FilePath, not in the old H98 modules.
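
A minimal sketch of what such an opaque type could look like, using only
base. Everything here is hypothetical and not an existing API: the names
RawPath, fromBytes, toBytes and displayPath are invented for illustration,
literals are assumed to be encoded as UTF-8, and displayPath is a
deliberately crude placeholder for a real decoder.

```haskell
import Data.Char (chr, ord)
import Data.String (IsString (..))
import Data.Word (Word8)

-- | Hypothetical opaque path: the raw bytes a Unix system call would see.
newtype RawPath = RawPath [Word8]
  deriving (Eq, Show)

-- | Encode one code point as UTF-8 (the encoding assumed for literals).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 + hi 6, cont 0]
  | n < 0x10000 = [0xE0 + hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 + hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    hi, cont :: Int -> Word8
    hi s   = fromIntegral (n `div` 2 ^ s)
    cont s = fromIntegral (0x80 + (n `div` 2 ^ s) `mod` 64)

-- | String literals become paths by encoding them.
instance IsString RawPath where
  fromString = RawPath . concatMap encodeChar

-- | Pass-through construction from bytes handed over by the OS; never fails.
fromBytes :: [Word8] -> RawPath
fromBytes = RawPath

-- | The exact bytes to hand back to the OS.
toBytes :: RawPath -> [Word8]
toBytes (RawPath bs) = bs

-- | Crude human-readable form: ASCII is shown, every other byte becomes '?'.
displayPath :: RawPath -> String
displayPath (RawPath bs) =
  map (\b -> if b < 0x80 then chr (fromIntegral b) else '?') bs
```

With OverloadedStrings switched on, a literal such as "notes.txt" could be
used directly where a RawPath is expected, while bytes obtained from the OS
go through fromBytes and toBytes without ever being decoded.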

Duncan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-17 Thread Yitzchak Gale
I wrote:
>> OK, would you like me to reflect this discussion in tickets?
>> Let's see, so far we have #3300, I don't see anything else.
>>
>> Do you want two tickets, one each for Windows/Unix? Or
>> four, separating the FilePath and getArgs issues?

Simon Marlow wrote:
> One for each issue is usually better, so four.

OK, they are: #3300, #3307, #3308, #3309.

Regards,
Yitz


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-17 Thread Simon Marlow

On 17/06/2009 15:03, Yitzchak Gale wrote:
> Simon Marlow wrote:
>>>> The following cases are currently broken...
>>>> I propose to fix these (on Windows).  It will mean that your second case
>>>> above will be broken, until someone fixes getDirectoryContents...
>> ...it's a lot easier on Windows...
>> on Unix I don't have a clear idea of how to proceed...
>> If someone else has a good understanding of what
>> needs done, please wade in.
>>>> I don't know how getArgs fits in here...
>> I agree it's broken and needs to be fixed.
>
> OK, would you like me to reflect this discussion in tickets?
> Let's see, so far we have #3300, I don't see anything else.
>
> Do you want two tickets, one each for Windows/Unix? Or
> four, separating the FilePath and getArgs issues?

One for each issue is usually better, so four.  Thanks!

>> On Unix, all file APIs take [Word8]...
>> So we should probably be converting from FilePath to
>> [Word8] by encoding using the current locale...
>> what about encoding errors,
>
> Where relevant, we should emulate what the common
> shells do. In general, I don't see why they should be different
> than any other file operation error.
>
>> and what if encode.decode is not the identity due to normalisation
>
> Well, is it common for people using typical input methods
> and common shells to create file paths containing
> text that decodes to non-normalized Unicode?
>
> I'm guessing not. If that's the case, then we don't really have
> to worry about it. People who went out of their way to create
> a weird file name will have the same troubles they have
> always had with that in Unix.
>
> But perhaps a better solution would be to make the underlying
> type of FilePath platform-dependent - e.g., String on Windows
> and [Word8] on Unix - and let it support
> platform-independent methods such as to/from String, to/from Bytes,
> setEncoding (defaulting to the current locale). That way,
> pass-through file paths will always work flawlessly on any
> platform, and applications have complete flexibility
> to deal with any other scenario however they choose. It's a
> breaking change though.

Yes, we could do a lot better if FilePath was an abstract type, but sadly
it is not, and we can't change that without breaking Haskell 98
compatibility, not to mention tons of existing code.

Cheers,
Simon


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-17 Thread Yitzchak Gale
Ketil Malde wrote:
> If we want to incorporate a translation layer, I think it's fair to
> only support UTF-8 (ignoring locales), but provide a workaround for
> invalid characters.

I disagree. Shells and GUI dialogs use the current locale.
I think most other modern programming languages do too, but
correct me if I am wrong.
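
For reference, a small sketch of how a program can ask which encoding the
current locale selects. It assumes a present-day base library, where
GHC.IO.Encoding exports getLocaleEncoding; the helper name
localeEncodingName is made up for this example.

```haskell
import GHC.IO.Encoding (getLocaleEncoding)

-- | Name of the text encoding chosen by the current locale, for example
-- "UTF-8" on many modern Unix systems. The result depends on the environment.
localeEncodingName :: IO String
localeEncodingName = fmap show getLocaleEncoding
```

A locale-aware translation layer would use this encoding to convert between
String and the bytes passed to the OS, rather than hard-coding UTF-8.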

Still, your ideas about dealing with decoding errors sound useful.

Regards,
Yitz


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-17 Thread Yitzchak Gale
Simon Marlow wrote:
>>> The following cases are currently broken...
>>> I propose to fix these (on Windows).  It will mean that your second case
>>> above will be broken, until someone fixes getDirectoryContents...
> ...it's a lot easier on Windows...
> on Unix I don't have a clear idea of how to proceed...
> If someone else has a good understanding of what
> needs done, please wade in.
>>> I don't know how getArgs fits in here...
> I agree it's broken and needs to be fixed.

OK, would you like me to reflect this discussion in tickets?
Let's see, so far we have #3300, I don't see anything else.

Do you want two tickets, one each for Windows/Unix? Or
four, separating the FilePath and getArgs issues?

> On Unix, all file APIs take [Word8]...
> So we should probably be converting from FilePath to
> [Word8] by encoding using the current locale...
> what about encoding errors,

Where relevant, we should emulate what the common
shells do. In general, I don't see why they should be different
than any other file operation error.

> and what if encode.decode is not the identity due to normalisation

Well, is it common for people using typical input methods
and common shells to create file paths containing
text that decodes to non-normalized Unicode?

I'm guessing not. If that's the case, then we don't really have
to worry about it. People who went out of their way to create
a weird file name will have the same troubles they have
always had with that in Unix.
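
To make the normalisation concern concrete, here is a small illustration
using only base. The two values below are made up for the example: both
render as "é", but they are different sequences of code points, so they are
different Strings.

```haskell
import Data.Char (ord)

-- | "é" as a single precomposed code point (U+00E9).
precomposed :: String
precomposed = "\x00E9"

-- | "é" as "e" followed by a combining acute accent (U+0301).
decomposed :: String
decomposed = "e\x0301"

-- | The code points of a String, for inspection.
codePoints :: String -> [Int]
codePoints = map ord
```

If any layer between the program and the disk converts one form into the
other, a name that is written and then read back no longer compares equal to
the original String, which is the case where encode.decode stops being the
identity.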

But perhaps a better solution would be to make the underlying
type of FilePath platform-dependent - e.g., String on Windows
and [Word8] on Unix - and let it support platform-
independent methods such as to/from String, to/from Bytes,
setEncoding (defaulting to the current locale). That way,
pass-through file paths will always work flawlessly on any
platform, and applications have complete flexibility
to deal with any other scenario however they choose. It's a
breaking change though.

Thanks,
Yitz


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-17 Thread Ketil Malde
Simon Marlow  writes:

>> Why only on Windows?

> Just because it's a lot easier on Windows - all the OS APIs take
> Unicode file paths, so it's obvious what to do.  In contrast on Unix I
> don't have a clear idea of how to proceed.

> On Unix, all file APIs take [Word8] rather than [Char].  By
> convention, the [Word8] is usually assumed to be a string in the
> locale encoding, but that's only a user-space convention.

If we want to incorporate a translation layer, I think it's fair to
only support UTF-8 (ignoring locales), but provide a workaround for
invalid characters. 

From http://en.wikipedia.org/wiki/UTF-8:

|  Therefore many modern UTF-8 converters translate errors to
|  something "safe". Only one byte is changed into the error
|  replacement and parsing starts again at the next byte, otherwise
|  concatenating strings could change good characters into
|  errors. Popular replacements for each byte are: 
|
|* nothing (the bytes vanish)
|* '?' or '�'
|* The replacement character (U+FFFD)
|* The byte from ISO-8859-1 or CP1252
|* An invalid Unicode code point, usually U+DCxx where xx is the byte's value

How about using the last one? This would allow 'readFile' to work on
FilePaths provided by 'getDirectoryContents', while allowing for
real Unicode string literals.
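
A sketch of that last option, using only base. The function names
decodeLenient and encodeLenient are invented for this example. The decoder
accepts well-formed UTF-8 and maps every byte that is not part of a valid
sequence to the code point U+DC00 + byte; the encoder turns those code
points back into the original byte.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Decode UTF-8; a byte that cannot be decoded becomes U+DC00 + byte.
decodeLenient :: [Word8] -> String
decodeLenient [] = []
decodeLenient (b : bs)
  | b < 0x80               = chr (fromIntegral b) : decodeLenient bs
  | b >= 0xC2 && b <= 0xDF = multi 1 0x80 (b .&. 0x1F)
  | b >= 0xE0 && b <= 0xEF = multi 2 0x800 (b .&. 0x0F)
  | b >= 0xF0 && b <= 0xF4 = multi 3 0x10000 (b .&. 0x07)
  | otherwise              = escape
  where
    -- Only the offending byte is replaced; decoding resumes at the next byte.
    escape = chr (0xDC00 + fromIntegral b) : decodeLenient bs

    -- n continuation bytes, smallest code point allowed, payload of lead byte.
    multi :: Int -> Int -> Word8 -> String
    multi n minCp lead =
      let (conts, rest) = splitAt n bs
          cp = foldl (\acc c -> acc * 64 + fromIntegral (c .&. 0x3F))
                     (fromIntegral lead) conts
          ok = length conts == n
               && all (\c -> c .&. 0xC0 == 0x80) conts
               && cp >= minCp && cp <= 0x10FFFF
               && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if ok then chr cp : decodeLenient rest else escape

-- | Encode as UTF-8, turning U+DC80..U+DCFF back into the escaped byte.
encodeLenient :: String -> [Word8]
encodeLenient = concatMap enc
  where
    enc :: Char -> [Word8]
    enc c
      | n >= 0xDC80 && n <= 0xDCFF = [fromIntegral (n - 0xDC00)]
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. top 6, cont 0]
      | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
      where
        n = ord c
        top, cont :: Int -> Word8
        top s  = fromIntegral (n `shiftR` s)
        cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)
```

Because overlong forms and encoded surrogates are rejected by the decoder,
encodeLenient . decodeLenient gives back the original bytes for any input,
so a name returned by a directory listing can still be opened even when it
is not valid UTF-8.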

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-17 Thread Simon Marlow

On 17/06/2009 13:21, Yitzchak Gale wrote:
> I wrote:
>>> I think the most important use cases that should not break are:
>>>
>>> o open/read/write a FilePath from getArgs
>>> o open/read/write a FilePath from getDirectoryContents
>
> Simon Marlow wrote:
>> The following cases are currently broken:
>>
>>  * Calling openFile on a literal Unicode FilePath (note, not
>>    ACP-encoded, just Unicode).
>>
>>  * Reading a Unicode FilePath from a text file and then calling
>>    openFile on it
>>
>> I propose to fix these (on Windows).  It will mean that your second case
>> above will be broken, until someone fixes getDirectoryContents.
>
> Why only on Windows?

Just because it's a lot easier on Windows - all the OS APIs take Unicode
file paths, so it's obvious what to do.  In contrast on Unix I don't
have a clear idea of how to proceed.

On Unix, all file APIs take [Word8] rather than [Char].  By convention,
the [Word8] is usually assumed to be a string in the locale encoding,
but that's only a user-space convention.

So we should probably be converting from FilePath to [Word8] by encoding
using the current locale.  This raises various complications (what about
encoding errors, and what if encode.decode is not the identity due to
normalisation, etc.).

But you don't have to wait for me to fix this stuff (I'm feeling a bit
Unicoded-out right now :)  If someone else has a good understanding of
what needs done, please wade in.

>> I don't know how getArgs fits in here - should we be decoding argv using the
>> ACP?
>
> And why not also on Unix? On any platform, the expected behavior should
> be that you type a file path at the command line, read it using getArgs,
> and open the file using that.

Right.  On Unix it works at the moment because we neither decode argv
nor encode FilePaths, so the bytes get passed through unchanged.  Same
with getDirectoryContents.

But I agree it's broken and needs to be fixed.

Cheers,
Simon


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-17 Thread Yitzchak Gale
I wrote:
>> I think the most important use cases that should not break are:
>>
>> o open/read/write a FilePath from getArgs
>> o open/read/write a FilePath from getDirectoryContents

Simon Marlow wrote:
> The following cases are currently broken:
>
>  * Calling openFile on a literal Unicode FilePath (note, not
>   ACP-encoded, just Unicode).
>
>  * Reading a Unicode FilePath from a text file and then calling
>   openFile on it
>
> I propose to fix these (on Windows).  It will mean that your second case
> above will be broken, until someone fixes getDirectoryContents.

Why only on Windows?

> I don't know how getArgs fits in here - should we be decoding argv using the
> ACP?

And why not also on Unix? On any platform, the expected behavior should
be that you type a file path at the command line, read it using getArgs,
and open the file using that.

For comparison, Python works that way, even though the variable
is called "argv" there.

The current behavior on Unix of returning, say, the bytes of a UTF-8
encoding in a String as if each were an individual Unicode character, is odd.
Given your fantastic work so far to rid System.IO of those kinds of oddities,
perhaps now is the time to finish the job.
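
The oddity can be reproduced in a few lines. This is only a model of the
pass-through behaviour, not the actual getArgs code, and the names
bytesAsChars and eAcuteUtf8 are invented for the example.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- | Model of an undecoded pass-through: one Char per byte.
bytesAsChars :: [Word8] -> String
bytesAsChars = map (chr . fromIntegral)

-- | The UTF-8 bytes for "é" (U+00E9), as a shell in a UTF-8 locale
-- would pass them in argv.
eAcuteUtf8 :: [Word8]
eAcuteUtf8 = [0xC3, 0xA9]
```

Here bytesAsChars eAcuteUtf8 is the two-character String "\195\169", whereas
a program comparing against the literal "\233" expects a single character.
Passing the two-character String straight back to the OS still opens the
right file, which is why the pass-through case works today.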

If you think we really need to provide access to the raw argv bytes,
we could add another platform-independent function that does that.

Thanks,
Yitz


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-17 Thread Simon Marlow

On 17/06/2009 09:38, Bulat Ziganshin wrote:
> Hello Simon,
>
> Wednesday, June 17, 2009, 11:55:15 AM, you wrote:
>
>> Right, so getArgs is already fine.
>
> this is what I found in the Jun 15 sources:
>
> #ifdef __GLASGOW_HASKELL__
> getArgs :: IO [String]
> getArgs =
>    alloca $ \ p_argc ->
>    alloca $ \ p_argv -> do
>      getProgArgv p_argc p_argv
>      p <- fromIntegral `liftM` peek p_argc
>      argv <- peek p_argv
>      peekArray (p - 1) (advancePtr argv 1) >>= mapM peekCString
>
> foreign import ccall unsafe "getProgArgv"
>    getProgArgv :: Ptr CInt -> Ptr (Ptr CString) -> IO ()
>
> it uses peekCString, so it cannot possibly produce Unicode chars

I see, so you were previously quoting code from some other source.
Where did the GetCommandLineW version come from?  Do you know of any
issues that would prevent us using it in GHC?

Cheers,
Simon



Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-17 Thread Simon Marlow

On 16/06/2009 17:06, Bulat Ziganshin wrote:
> Hello Simon,
>
> Tuesday, June 16, 2009, 7:54:02 PM, you wrote:
>
>> In fact there's not a lot left to convert in System.Directory, as you'll
>> see if you look at the code.  Feel like helping?
>
> these functions used there are ACP-only:
>
> c_stat c_chmod System.Win32.getFullPathName c_SearchPath c_SHGetFolderPath

Yes, except for getFullPathName:

foreign import stdcall unsafe "GetFullPathNameW"
  c_GetFullPathName :: LPCTSTR -> DWORD -> LPTSTR -> Ptr LPTSTR -> IO DWORD

> plus maybe some more functions from the System.Win32 package - I haven't
> looked into it

System.Win32 is using the wide-char APIs exclusively (ok, I haven't
checked, but I don't know of any System.Win32 functions still using
narrow strings).

So as you can see, there's not much left to do.  I'll fix openFile.

Cheers,
Simon


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-17 Thread Simon Marlow

On 16/06/2009 21:19, Bulat Ziganshin wrote:
> Hello Simon,
>
> Tuesday, June 16, 2009, 5:02:43 PM, you wrote:
>
>> I don't know how getArgs fits in here - should we be decoding argv using
>> the ACP?
>
> myGetArgs = do
>   alloca $ \p_argc -> do
>     p_argv_w <- commandLineToArgvW getCommandLineW p_argc
>     argc <- peek p_argc
>     argv_w <- peekArray (i argc) p_argv_w
>     mapM peekTString argv_w >>= return.tail
>
> foreign import stdcall unsafe "windows.h GetCommandLineW"
>    getCommandLineW :: LPTSTR
>
> foreign import stdcall unsafe "windows.h CommandLineToArgvW"
>    commandLineToArgvW :: LPCWSTR -> Ptr CInt -> IO (Ptr LPWSTR)

Right, so getArgs is already fine.

Cheers,
Simon



Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-16 Thread Simon Marlow

On 16/06/2009 16:44, Bulat Ziganshin wrote:
> Hello Simon,
>
> Tuesday, June 16, 2009, 7:30:55 PM, you wrote:
>
>> Actually we use a mixture of CRT functions and native Windows API,
>> gradually moving in the direction of the latter.
>
> so file-related APIs are already unpredictable, and will remain in
> this state for an unknown number of GHC versions

Sometimes fixing everything at the same time is too hard :-)

In fact there's not a lot left to convert in System.Directory, as you'll
see if you look at the code.  Feel like helping?

Cheers,
Simon


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-16 Thread Simon Marlow

On 16/06/2009 14:56, Bulat Ziganshin wrote:
> Hello Simon,
>
> Tuesday, June 16, 2009, 5:02:43 PM, you wrote:
>
>> Also currently broken:
>>
>>  * calling removeFile on a FilePath you get from getDirectoryContents,
>>    amongst other System.Directory operations
>>
>> Fixing getDirectoryContents will fix these.
>
> No. removeFile, like everything else, also uses the ACP-based API

What code are you looking at?

Here is System.Directory.removeFile:

removeFile :: FilePath -> IO ()
removeFile path =
#if mingw32_HOST_OS
  System.Win32.deleteFile path
#else
  System.Posix.removeLink path
#endif

and System.Win32.deleteFile:

deleteFile :: String -> IO ()
deleteFile name =
  withTString name $ \ c_name ->
  failIfFalse_ "DeleteFile" $ c_DeleteFile c_name
foreign import stdcall unsafe "windows.h DeleteFileW"
  c_DeleteFile :: LPCTSTR -> IO Bool

note it's calling DeleteFileW, and using wide-char strings.

> Windows libraries emulate the POSIX API (open, opendir, stat and so on)
> by translating these (char-based) calls into the A-family. GHC libs are
> written the Unix way, so they are effectively bound to the A-family of
> the Win API

Actually we use a mixture of CRT functions and native Windows API,
gradually moving in the direction of the latter.

Cheers,
Simon


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-16 Thread Simon Marlow

On 16/06/2009 13:46, Yitzchak Gale wrote:
> Simon Marlow wrote:
>>>> Care to submit a patch to put this in System.Directory, or better still
>>>> put the relevant functionality in System.Win32 and use it in
>>>> System.Directory?
>
> Bulat Ziganshin wrote:
>>> now getDirectoryContents returns ACP (ANSI code page) names so openFile
>>> works for files 1) and 2).
>>> With such a change getDirectoryContents will return correct Unicode
>>> names, so openFile will work only with names in the first group.
>>> The right way is to fix ALL string-related calls in System.IO,
>>> System.Posix.Internals, System.Environment.
>
>> You're right in that we really ought to fix everything.  However, I'm happy
>> to just fix some of these things, even if it introduces some inconsistencies
>> in the meantime.  We already have much of System.Directory working with
>> Unicode FilePaths, so there are already inconsistencies here.
>
> +1 for integrating Unicode file paths. Thanks, Bulat!

Excuse my ignorance, but... what Unicode file paths?

> I think the most important use cases that should not break are:
>
> o open/read/write a FilePath from getArgs
> o open/read/write a FilePath from getDirectoryContents
>
> There's not much we can do about non-Latin-1 ACP file paths
> hard coded in Strings. I hope there aren't too many
> of those in the wild.

The following cases are currently broken:

 * Calling openFile on a literal Unicode FilePath (note, not
   ACP-encoded, just Unicode).

 * Reading a Unicode FilePath from a text file and then calling
   openFile on it

I propose to fix these (on Windows).  It will mean that your second case
above will be broken, until someone fixes getDirectoryContents.

Also currently broken:

 * calling removeFile on a FilePath you get from getDirectoryContents,
   amongst other System.Directory operations

Fixing getDirectoryContents will fix these.

I don't know how getArgs fits in here - should we be decoding argv using
the ACP?

Cheers,
Simon


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-16 Thread Yitzchak Gale
Simon Marlow wrote:
>>> Care to submit a patch to put this in System.Directory, or better still
>>> put the relevant functionality in System.Win32 and use it in
>>> System.Directory?

Bulat Ziganshin wrote:
>> now getDirectoryContents returns ACP (ANSI code page) names so openFile
>> works for files 1) and 2).
>> With such a change getDirectoryContents will return correct Unicode
>> names, so openFile will work only with names in the first group.
>> The right way is to fix ALL string-related calls in System.IO,
>> System.Posix.Internals, System.Environment.

> You're right in that we really ought to fix everything.  However, I'm happy
> to just fix some of these things, even if it introduces some inconsistencies
> in the meantime.  We already have much of System.Directory working with
> Unicode FilePaths, so there are already inconsistencies here.

+1 for integrating Unicode file paths. Thanks, Bulat!

I think the most important use cases that should not break are:

o open/read/write a FilePath from getArgs
o open/read/write a FilePath from getDirectoryContents

There's not much we can do about non-Latin-1 ACP file paths
hard coded in Strings. I hope there aren't too many
of those in the wild.

Regards,
Yitz


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-16 Thread Simon Marlow

On 16/06/2009 12:42, Bulat Ziganshin wrote:
> Hello Simon,
>
> Tuesday, June 16, 2009, 3:30:31 PM, you wrote:
>
>> Care to submit a patch to put this in System.Directory, or better still
>> put the relevant functionality in System.Win32 and use it in
>> System.Directory?
>
> Simon, it will somewhat break openFile. Let's see: there are 3 types
> of filenames:
>
> 1) English (Latin-1) only
> 2) including local (ANSI code page) chars
> 3) including any other Unicode chars
>
> now getDirectoryContents returns ACP (ANSI code page) names, so openFile
> works for files 1) and 2)
>
> with such a change getDirectoryContents will return correct Unicode
> names, so openFile will work only with names in the first group
>
> the right way is to fix ALL string-related calls in System.IO,
> System.Posix.Internals, System.Environment

You're right in that we really ought to fix everything.  However, I'm
happy to just fix some of these things, even if it introduces some
inconsistencies in the meantime.  We already have much of
System.Directory working with Unicode FilePaths, so there are already
inconsistencies here.

Thanks for reminding me that openFile is also broken.  It's easily
fixed, so I'll look into that.

Cheers,
Simon


Re: [Haskell-cafe] Re: Unicode workaround for getDirectoryContents under Windows?

2009-06-16 Thread Bulat Ziganshin
Hello Simon,

Tuesday, June 16, 2009, 3:30:31 PM, you wrote:

> Care to submit a patch to put this in System.Directory, or better still
> put the relevant functionality in System.Win32 and use it in 
> System.Directory?

Simon, it will somewhat break openFile. Let's see: there are 3 types
of filenames:

1) English (Latin-1) only
2) including local (ANSI code page) chars
3) including any other Unicode chars

now getDirectoryContents returns ACP (ANSI code page) names, so openFile
works for files 1) and 2)

with such a change getDirectoryContents will return correct Unicode
names, so openFile will work only with names in the first group

the right way is to fix ALL string-related calls in System.IO,
System.Posix.Internals, System.Environment
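
One way to picture why only the first group would survive, under the
assumption that a not-yet-converted call narrows each Char of the String to
a single byte before handing it to the OS. This is a model for illustration
only; narrow and survivesNarrowing are made-up names, not library functions.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- | Model of narrow marshalling: keep only the low 8 bits of each Char.
narrow :: String -> [Word8]
narrow = map (fromIntegral . ord)

-- | True when no information is lost by narrowing, i.e. every
-- character is below U+0100.
survivesNarrowing :: String -> Bool
survivesNarrowing = all ((< 0x100) . ord)
```

Under this model a name such as "caf\233" keeps all of its information,
while a Cyrillic "\1092" (U+0444) is silently truncated to the byte 0x44,
which is the letter "D". Names in groups 2) and 3) that come back from a
Unicode-aware getDirectoryContents would therefore no longer open.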



-- 
Best regards,
 Bulat                          mailto:bulat.zigans...@gmail.com
