Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-18 Thread Max Bolingbroke
On 10 November 2011 14:35, Simon Marlow  wrote:
> Agreed.

Committed.

>> I'm wondering if we should also have hSetLocaleEncoding,
>> hSetFileSystemEncoding :: TextEncoding ->  IO () and change
>> localeEncoding, fileSystemEncoding :: IO TextEncoding.
>> hSetFileSystemEncoding in particular would let people opt-out of
>> escapes entirely as long as they issued it right at the start of their
>> program before the fileSystemEncoding had been used.
>
> Ok by me.

I've done this as well. One wart is that System.IO.localeEncoding ::
TextEncoding, and I don't want to break that API. So
System.IO.localeEncoding is always the *initial* locale encoding and
does not reflect later changes made via setLocaleEncoding.
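For illustration, a minimal sketch of the resulting behaviour, assuming the
new setter and an IO-based getter end up exported from GHC.IO.Encoding under
these names (getLocaleEncoding is my assumption for the query):

    import System.IO (localeEncoding, utf8)
    import GHC.IO.Encoding (setLocaleEncoding, getLocaleEncoding)  -- assumed

    -- localeEncoding is a pure snapshot taken at program start; the IO
    -- query reflects later changes, the pure value does not.
    demo :: IO ()
    demo = do
      setLocaleEncoding utf8
      current <- getLocaleEncoding
      putStrLn ("initial: " ++ show localeEncoding ++ ", now: " ++ show current)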

Max



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-10 Thread John Millikin
On Thu, Nov 10, 2011 at 03:28, Simon Marlow  wrote:
> I've done a search/replace and called it RawFilePath.  Ok?

Fantastic, thank you very much.



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-10 Thread Simon Marlow

On 10/11/2011 09:28, Max Bolingbroke wrote:


> Is there any consensus about what to do here? My take is that we
> should move back to lone surrogates. This:
>    1. Recovers the roundtrip property, which we appear to believe is essential
>    2. Removes all the weird problems I outlined earlier that can occur
> if your byte strings happen to contain some bytes that decode to
> U+EFxx
>    3. DOES break software that expects Strings not to contain surrogate
> codepoints, but (I agree with you) this is arguably a feature
>
> This is also exactly what Python does so it has the advantage of being
> battle tested.
>
> Agreed?


Agreed.


> We can additionally:
>   * Provide your layer in the "unix" package where FilePath =
> ByteString, for people who for some reason care about performance of
> their FilePath encoding/decoding, OR who don't want to rely on the
> roundtripping property being implemented correctly


I think I'll do this anyway.


>   * Perhaps provide a layer in the "win32" package where FilePath =
> ByteString but where that ByteString is guaranteed to be UTF-16
> encoded (I'm less sure about this, because we can always unambiguously
> decode this without doing any escaping. It's still useful if you care
> about performance.)
>
> I'm wondering if we should also have hSetLocaleEncoding,
> hSetFileSystemEncoding :: TextEncoding -> IO () and change
> localeEncoding, fileSystemEncoding :: IO TextEncoding.
> hSetFileSystemEncoding in particular would let people opt-out of
> escapes entirely as long as they issued it right at the start of their
> program before the fileSystemEncoding had been used.


Ok by me.

Cheers,
Simon



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-10 Thread Simon Marlow

On 09/11/2011 16:42, John Millikin wrote:

> On Wed, Nov 9, 2011 at 08:04, Simon Marlow wrote:
>
>> Ok, I spent most of today adding ByteString alternatives for all of the
>> functions in System.Posix that use FilePath or environment strings.  The
>> Haddocks for my augmented unix package are here:
>>
>> http://community.haskell.org/~simonmar/unix-with-bytestring-extras/index.html
>>
>> In particular, the module System.Posix.ByteString is the whole System.Posix
>> API but with ByteString FilePaths and environment strings:
>>
>> http://community.haskell.org/~simonmar/unix-with-bytestring-extras/System-Posix-ByteString.html
>
> This looks lovely -- thank you.
>
> Once it's released, I'll port all my libraries over to using it.
>
>> It has one addition relative to System.Posix:
>>
>>   getArgs :: IO [ByteString]
>
> Thank you very much! Several tools I use daily accept binary data as
> command-line options, and this will make it much easier to port them
> to Haskell in the future.
>
>> Let me know what you think.  I suspect the main controversial aspect is that
>> I included
>>
>>   type FilePath = ByteString
>>
>> which is a bit cute but might be confusing.
>
> Indeed, I was very confused when I saw that in the docs. If it's not
> too much trouble, could those functions accept/return ByteString
> directly?


I've done a search/replace and called it RawFilePath.  Ok?

Cheers,
Simon




Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-10 Thread Max Bolingbroke
On 9 November 2011 16:29, Simon Marlow  wrote:
> Ok, so since we need something like
>
>  makePrintable :: FilePath -> String
>
> arguably we might as well make that do the locale decoding.  That's
> certainly a good point...

You could, but getArgs :: IO [String], not :: IO [FilePath]. And
locale-decoding command-line arguments is the Right Thing To Do. So
this doesn't really avoid the need to roundtrip, does it?

Is there any consensus about what to do here? My take is that we
should move back to lone surrogates. This:
  1. Recovers the roundtrip property, which we appear to believe is essential
  2. Removes all the weird problems I outlined earlier that can occur
if your byte strings happen to contain some bytes that decode to
U+EFxx
  3. DOES break software that expects Strings not to contain surrogate
codepoints, but (I agree with you) this is arguably a feature

This is also exactly what Python does so it has the advantage of being
battle tested.

Agreed?

We can additionally:
 * Provide your layer in the "unix" package where FilePath =
ByteString, for people who for some reason care about performance of
their FilePath encoding/decoding, OR who don't want to rely on the
roundtripping property being implemented correctly
 * Perhaps provide a layer in the "win32" package where FilePath =
ByteString but where that ByteString is guaranteed to be UTF-16
encoded (I'm less sure about this, because we can always unambiguously
decode this without doing any escaping. It's still useful if you care
about performance.)

I'm wondering if we should also have hSetLocaleEncoding,
hSetFileSystemEncoding :: TextEncoding -> IO () and change
localeEncoding, fileSystemEncoding :: IO TextEncoding.
hSetFileSystemEncoding in particular would let people opt-out of
escapes entirely as long as they issued it right at the start of their
program before the fileSystemEncoding had been used.
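As a concrete sketch of what is being proposed (these names exist only in this
proposal, not in any released base, and char8 is my assumption for a suitable
byte-preserving encoding):

    -- Proposed (hypothetical) additions:
    hSetLocaleEncoding     :: TextEncoding -> IO ()
    hSetFileSystemEncoding :: TextEncoding -> IO ()
    localeEncoding         :: IO TextEncoding
    fileSystemEncoding     :: IO TextEncoding

    -- Opting out of escapes would then be a single call at the very start
    -- of main, before any FilePath has been encoded or decoded, e.g.
    --
    --   main = do hSetFileSystemEncoding char8   -- bytes map to U+00..U+FF
    --             ...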

What do you think?

Max



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-10 Thread Max Bolingbroke
On 10 November 2011 00:17, Ian Lynagh  wrote:
> On Wed, Nov 09, 2011 at 03:58:47PM +, Max Bolingbroke wrote:
>>
>> (Note that the above outlined problems are problems in the current
>> implementation too
>
> Then the proposal seems to me to be strictly better than the current
> system. Under both systems the wrong thing happen when U+EFxx is entered
> as unicode text, but the proposed system works for all filenames read
> from the filesystem.

Your proposal is not *strictly* better than what is implemented in at
least the following ways:
  1. With your proposal, if you read a filename containing U+EF80 into
the variable "fp" and then expect the character U+EF80 to be in fp you
will be surprised to only find its escaped form. In the current
implementation you will in fact find U+EF80.
  2. The performance of iconv-based decoders will suffer because we
will need to do a post-pass in the TextEncoding to do this extra
escaping for U+EFxx characters

I'm really not keen on implementing a fix that addresses such a
limited subset of the problems, anyway.

> In the longer term, I think we need to fix the underlying problem that
> (for example) both getLine and getArgs produce a String from bytes, but
> do so in different ways. At some point we should change the type of
> getArgs and friends.

I'm not sure about this. hGetLine produces a String from bytes in a
different way depending on the encoding set on the Handle, but we
don't try to differentiate in the type system between Strings decoded
using different TextEncodings. Why should getLine and getArgs be
different?

If you are really unhappy about getLine and getArgs having different
behaviour in this sense, one option would be to change the default
stdout/stdin TextEncoding to use the fileSystemEncoding that knows
about escapes. (Note that this would mean that your Haskell program
wouldn't immediately die if you were using the UTF8 locale and then
tried to read some non-UTF8 input from stdin, which might or might not
be a good thing, depending on the application.)
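A minimal sketch of that option (assuming an IO accessor for the filesystem
encoding exists in roughly this form; the exact name is one of the things
being discussed in this thread):

    import System.IO (hSetEncoding, stdin, stdout)
    import GHC.IO.Encoding (getFileSystemEncoding)  -- assumed accessor

    -- Give stdin/stdout the escaping filesystem encoding so that getLine
    -- and getArgs treat undecodable bytes the same way.
    useEscapingStdio :: IO ()
    useEscapingStdio = do
      enc <- getFileSystemEncoding
      hSetEncoding stdin  enc
      hSetEncoding stdout enc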

Max



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread John Lask

My primary concerns are (in order of priority - and I only speak for myself)

(a) consistency across platforms
(b) minimize (unrequired) performance overhead

I would prefer an API which is consistent across Win32, POSIX and other
OSes, and which does only as much as the user (us) wants,

for example ...

module System.Directory.ByteString ...

FilePath = ByteString

getDirectoryContents :: FilePath -> IO [FilePath]

which is the same for both Win32 and POSIX and represents raw,
uninterpreted bytestrings in whatever encoding (or non-encoding) the OS
provides. Implicitly, it is for the user to know and understand what
they're getting (UTF-16 in the case of Windows, raw bytes in the case of
POSIX platforms).



This API can then be re-exported, with the decoding/encoding applied, by
System.Directory/System.IO, which would export FilePath = String --
i.e. a two-level API (see the sketch below).
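For concreteness, a sketch of that second, decoded level wrapping the raw one
(System.Directory.ByteString is the hypothetical raw layer described above;
the Char8 conversions below are only placeholders for real locale
encoding/decoding):

    import qualified Data.ByteString.Char8 as B
    import qualified System.Directory.ByteString as Raw  -- hypothetical raw layer

    -- High-level view: FilePath = String, built on the raw ByteString API.
    getDirectoryContents :: String -> IO [String]
    getDirectoryContents dir =
      map B.unpack `fmap` Raw.getDirectoryContents (B.pack dir)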





Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Ian Lynagh
On Wed, Nov 09, 2011 at 03:58:47PM +, Max Bolingbroke wrote:
> 
> (Note that the above outlined problems are problems in the current
> implementation too

Then the proposal seems to me to be strictly better than the current
system. Under both systems the wrong thing happens when U+EFxx is entered
as Unicode text, but the proposed system works for all filenames read
from the filesystem.


In the longer term, I think we need to fix the underlying problem that
(for example) both getLine and getArgs produce a String from bytes, but
do so in different ways. At some point we should change the type of
getArgs and friends.


Thanks
Ian




Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Simon Marlow

On 09/11/2011 15:58, Max Bolingbroke wrote:


(Note that the above outlined problems are problems in the current
implementation too -- but the current implementation doesn't even
pretend to support U+EFxx characters. Its correctness is entirely
dependent on them never showing up, which is why we chose a part of
the private codepoint region that is reserved specifically for the
purpose of encoding hacks).


But we can't make that assumption, because the user might have 
accidentally set the locale wrong and then all kinds of garbage will 
show up in decoded file paths.  I think it's important that programs 
that just traverse the file system keep working under those conditions, 
rather than randomly failing due to (encode . decode) being almost but 
not quite the identity.


Cheers,
Simon





Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread John Millikin
On Wed, Nov 9, 2011 at 08:04, Simon Marlow  wrote:
> Ok, I spent most of today adding ByteString alternatives for all of the
> functions in System.Posix that use FilePath or environment strings.  The
> Haddocks for my augmented unix package are here:
>
> http://community.haskell.org/~simonmar/unix-with-bytestring-extras/index.html
>
> In particular, the module System.Posix.ByteString is the whole System.Posix
> API but with ByteString FilePaths and environment strings:
>
> http://community.haskell.org/~simonmar/unix-with-bytestring-extras/System-Posix-ByteString.html

This looks lovely -- thank you.

Once it's released, I'll port all my libraries over to using it.

> It has one addition relative to System.Posix:
>
>  getArgs :: IO [ByteString]

Thank you very much! Several tools I use daily accept binary data as
command-line options, and this will make it much easier to port them
to Haskell in the future.

> Let me know what you think.  I suspect the main controversial aspect is that
> I included
>
>  type FilePath = ByteString
>
> which is a bit cute but might be confusing.

Indeed, I was very confused when I saw that in the docs. If it's not
too much trouble, could those functions accept/return ByteString
directly?



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Simon Marlow

On 09/11/2011 13:11, Ian Lynagh wrote:
> On Wed, Nov 09, 2011 at 11:02:54AM +, Simon Marlow wrote:
>>
>> I would be happy with the surrogate approach I think.  Arguable if
>> you try to treat a string with lone surrogates as Unicode and it
>> fails, then that is a feature: the original string wasn't Unicode.
>> All you can do with an invalid Unicode string is use it as a
>> FilePath again, and the right thing will happen.
>
> If we aren't going to guarantee that the encoded string is unicode, then
> is there any benefit to encoding it in the first place?

With a decoded FilePath you can:

  - use it as a FilePath argument to some other function

  - map all the illegal characters to '?' and then treat it as
Unicode, e.g. for printing it out (but then you lost the ability to
roundtrip, which is why we can't do this automatically).

Ok, so since we need something like

  makePrintable :: FilePath -> String

arguably we might as well make that do the locale decoding.  That's 
certainly a good point...
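A minimal sketch of the '?'-replacement version described above, assuming the
private-char escape scheme of the current implementation (escape codepoints in
U+EF00..U+EFFF); under the lone-surrogate scheme discussed elsewhere in the
thread the test would be 0xDC80..0xDCFF instead:

    import Data.Char (ord)

    -- Replace escape codepoints with '?' so the result can safely be
    -- treated as ordinary printable Unicode (at the cost of roundtripping).
    makePrintable :: FilePath -> String
    makePrintable = map replaceEscape
      where
        replaceEscape c
          | 0xEF00 <= ord c && ord c <= 0xEFFF = '?'
          | otherwise                          = c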


Cheers,
Simon



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Simon Marlow

On 08/11/2011 15:42, John Millikin wrote:

On Tue, Nov 8, 2011 at 03:04, Simon Marlow  wrote:

I really think we should provide the native APIs.  The problem is that the
System.Posix.Directory API is all in terms of FilePath (=String), and if we
gave that a different meaning from the System.Directory FilePaths then
confusion would ensue.  So perhaps we need to add another API to
System.Posix with filesystem operations in terms of ByteString, and
similarly for Win32.


+1

I think most users would be OK with having System.Posix treat FilePath
differently, as long as this is clearly documented, but if you feel a
separate API is better then I have no objection. As long as there's
some way to say "I know what I'm doing, here's the bytes" to the
library.

The Win32 package uses wide-character functions, so I'm not sure
whether bytes would be appropriate there. My instinct says to stick
with chars, via withCWString or equivalent. The package maintainer
will have a better idea of what fits with the OS's idioms.


Ok, I spent most of today adding ByteString alternatives for all of the 
functions in System.Posix that use FilePath or environment strings.  The 
Haddocks for my augmented unix package are here:


http://community.haskell.org/~simonmar/unix-with-bytestring-extras/index.html

In particular, the module System.Posix.ByteString is the whole 
System.Posix API but with ByteString FilePaths and environment strings:


http://community.haskell.org/~simonmar/unix-with-bytestring-extras/System-Posix-ByteString.html

It has one addition relative to System.Posix:

  getArgs :: IO [ByteString]

Let me know what you think.  I suspect the main controversial aspect is 
that I included


  type FilePath = ByteString

which is a bit cute but might be confusing.
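For what it's worth, a quick usage sketch against the module as described
(assuming it ships with the getArgs addition above):

    import qualified Data.ByteString.Char8 as B
    import System.Posix.ByteString (getArgs)

    -- Arguments arrive as raw bytes: no locale decoding, no escape characters.
    main :: IO ()
    main = getArgs >>= mapM_ B.putStrLn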

Cheers,
Simon



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Max Bolingbroke
On 9 November 2011 11:02, Simon Marlow  wrote:
> The performance overhead of all this worries me.  withCString has taken a
> huge performance hit, and I think there are people who wnat to know that
> there aren't several complex encoding/decoding passes between their Haskell
> code and the POSIX API.  We ought to be able to program to POSIX directly,
> and the same goes for Win32.

We are only really talking about environment variables, filenames and
command line arguments here. I'm sure there are performance
implications to all this decoding/encoding, but these bits of text are
almost always very short and are unlikely to be causing bottlenecks.
Adding a whole new API *just* to eliminate a hypothetical performance
problem seems like overkill.

OTOH, I'm happy to add it if we stick with using private chars for the
escapes, because then using it or not using it is a *correctness*
issue (albeit in rare cases).

Max



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Max Bolingbroke
On 9 November 2011 13:11, Ian Lynagh  wrote:
> If we aren't going to guarantee that the encoded string is unicode, then
> is there any benefit to encoding it in the first place?

(I think you mean decoded here - my understanding is that decode ::
ByteString -> String, encode :: String -> ByteString)

> Why not encode into private chars, i.e. encode U+EF00 (which in UTF8 is
> 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, etc?
>
> (Max gave some reasons earlier in this thread, but I'd need examples of
> what goes wrong to understand them).

We can do this but it doesn't solve all the problems. Here are three such problems:

PROBLEM 1 (bleeding from non-escaping to escaping TextEncodings)
===

So let's say we are reading a filename from stdin. Currently stdin
uses the utf8 TextEncoding -- this TextEncoding knows nothing about
private-char roundtripping, and will throw an exception when decoding
bad bytes or encoding our private chars.

Now the user types a UTF-8 U+EF80 character - i.e. we get the bytes
0xEE 0xBE 0x80 on stdin.

The utf8 TextEncoding naively decodes this byte sequence to the
character sequence U+EF80.

We have lost at this point: if the user supplies the resulting String
to a function that encodes the String with the fileSystemEncoding, the
String will be encoded into the byte sequence 0x80. This is probably
not what we want to happen! It means that a program like this:

"""
main = do
  fp <- getLine
  readFile fp >>= putStrLn
"""

Will fail ("file not found: \x80") when given the name of an
(existant) file 0xEE 0xBC 0x80.

PROBLEM 2 (bleeding between two different escaping TextEncodings)
===

So let's say the user supplies the UTF-8 encoded U+EF00 (byte sequence
0xEE 0xBC 0x80) as a command line argument, so it goes through the
fileSystemEncoding. In your scheme the resulting Char sequence is
U+EFEE U+EFBC U+EF80.

What happens when we then *encode* that Char sequence using a UTF-16
TextEncoding (one that knows about the 0xEFxx escape mechanism)? The
resulting byte sequence is 0xEE 0xBC 0x80, NOT the UTF-16 encoded
version of U+EF00! This is certainly contrary to what the user would
expect.

PROBLEM 3 (bleeding from escaping to non-escaping TextEncodings)
===

Just as above, let's say the user supplies the UTF-8 encoded U+EF00
(byte sequence 0xEE 0xBC 0x80) as a command line argument, so it goes
through the fileSystemEncoding. In your scheme the resulting Char
sequence is U+EFEE U+EFBC U+EF80.

If you try to write this String to stdout (which uses the UTF-8
encoding that knows nothing about 0xEFxx escapes) you just get an
exception, NOT the UTF-8 encoded version of U+EF00. Game over man,
game over!

CONCLUSION
===

As far as I can see, the proposed escaping scheme recovers the
roundtrip property but fails to regain a lot of other
reasonable-looking behaviours.

(Note that the above outlined problems are problems in the current
implementation too -- but the current implementation doesn't even
pretend to support U+EFxx characters. Its correctness is entirely
dependent on them never showing up, which is why we chose a part of
the private codepoint region that is reserved specifically for the
purpose of encoding hacks).

Max



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Ian Lynagh
On Wed, Nov 09, 2011 at 11:02:54AM +, Simon Marlow wrote:
> 
> I would be happy with the surrogate approach I think.  Arguable if
> you try to treat a string with lone surrogates as Unicode and it
> fails, then that is a feature: the original string wasn't Unicode.
> All you can do with an invalid Unicode string is use it as a
> FilePath again, and the right thing will happen.

If we aren't going to guarantee that the encoded string is unicode, then
is there any benefit to encoding it in the first place?

> Alternatively if we stick with the private char approach, it should
> be possible to have an escaping scheme for 0xEFxx characters in the
> input that would enable us to roundtrip correctly.  That is, escape
> 0xEFxx into a sequence 0xYYEF 0xYYxx for some suitable YY.

Why not encode into private chars, i.e. encode U+EF00 (which in UTF8 is
0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, etc?

(Max gave some reasons earlier in this thread, but I'd need examples of
what goes wrong to understand them).


Thanks
Ian




Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Simon Marlow

On 09/11/2011 10:39, Max Bolingbroke wrote:

On 8 November 2011 11:43, Simon Marlow  wrote:

Don't you mean 1 is what we have?


Yes, sorry!


Failing to roundtrip in some cases, and doing so silently, seems highly
suboptimal to me.  I'm sorry I didn't pick up on this at the time (Unicode
is a swamp :).


I *can* change the implementation back to using lone surrogates. This
gives us guaranteed roundtripping but it means that the user might see
lone-surrogate Char values in Strings from the filesystem/command
line. IIRC this does break some software -- e.g. Brian's "text"
library explicitly checks for such characters and fails if it detects
them.

So whatever happens we are going to end up making some group of users unhappy!
   * No PEP383: Haskellers using non-ASCII get upset when their command
line argument [String]s aren't in fact sequences of characters, but
sequences of bytes in some arbitrary encoding
   * PEP383(surrogates): Unicoders get upset by lone surrogates (which
can actually occur at the moment, independent of PEP383 -- e.g. as
character literals or from FFI)
   * PEP383(private chars): Unixers get upset that we can't roundtrip
byte sequences that look like the codepoint 0xEFXX encoded in the
current locale. In practice, 0xEFXX is only decodable from a UTF
encoding, so we fail to roundtrip byte sequences like the one Ian
posted.

I'm happy to implement any behaviour, I would just like to know that
whatever it is is accepted as the correct tradeoff :-)


I would be happy with the surrogate approach I think.  Arguably, if you
try to treat a string with lone surrogates as Unicode and it fails, then
that is a feature: the original string wasn't Unicode.  All you can do
with an invalid Unicode string is use it as a FilePath again, and the
right thing will happen.


Alternatively if we stick with the private char approach, it should be 
possible to have an escaping scheme for 0xEFxx characters in the input 
that would enable us to roundtrip correctly.  That is, escape 0xEFxx 
into a sequence 0xYYEF 0xYYxx for some suitable YY.  But perhaps that 
would be too expensive - an extra translation pass over the buffer after 
iconv (well, we do this for newline translation, so maybe it's not too bad).
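For concreteness, a sketch of that escaping pass (the choice of YY = 0xF2
below is purely illustrative; the post-pass over decoded text would be
concatMap escapeEF):

    import Data.Char (chr, ord)

    -- Rewrite a character in the escape page U+EF00..U+EFFF as the
    -- two-character sequence 0xYYEF 0xYYxx on another private page, so
    -- that genuine U+EFxx input remains distinguishable from escapes.
    escapeEF :: Char -> String
    escapeEF c
      | 0xEF00 <= n && n <= 0xEFFF = [chr (yy + 0xEF), chr (yy + (n - 0xEF00))]
      | otherwise                  = [c]
      where
        n  = ord c
        yy = 0xF200   -- "some suitable YY" (assumption)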



RE exposing a ByteString based interface to the IO library from
base/unix/whatever: AFAIK Python doesn't do this, and just tells
people to use the (x.encode(sys.getfilesystemencoding(),
"surrogateescape")) escape hatch, which is what I've been
recommending. I think this would be more satisfying to John if it were
actually guaranteed to work on arbitrary byte sequences, not just
*highly likely* to work :-)


The performance overhead of all this worries me.  withCString has taken 
a huge performance hit, and I think there are people who want to know
that there aren't several complex encoding/decoding passes between their 
Haskell code and the POSIX API.  We ought to be able to program to POSIX 
directly, and the same goes for Win32.


Cheers,
Simon





Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Max Bolingbroke
On 7 November 2011 17:32, John Millikin  wrote:
> I am also not convinced that it is possible to correctly implement
> either of these functions if their behavior is dependent on the user's
> locale.

FWIW it's only dependent on the user's locale because whether glibc
iconv detects errors in the *from* sequence depends on what the *to*
locale is. Clearly an invalid *from* sequence should be reported as
invalid regardless of *to*. I know this isn't much comfort to you,
though, since you do have to worry about broken behaviour in 7.2, and
possible future breakage with changes in iconv.

I understand your point that it would be better from a complexity
point of view to just roundtrip the bytes as *bytes* without relying
on all this escaping/unescaping code.

> Please understand, I am not arguing against the existence of this
> encoding layer in general. It's a fine idea for a simplistic
> high-level filesystem interaction library. But it should be
> *optional*, not part of the compiler or "base.

The problem is that I *really really want* getArgs to decode the
command line arguments. That's almost the whole point of this change,
and it is what most users seem to expect. Given this constraint, the
code has to be part of "base", and if getArgs has this behaviour then
any file system function we ship that takes a FilePath (i.e. all the
functions in base, directory, win32 and unix) must be prepared to
handle these escape characters for consistency.

I *would* be happy to expose an alternative file system API from the
posix package that operates with ByteString paths. This package could
provide a function :: FilePath -> ByteString that encodes the string
with the fileSystemEncoding (removing escapes in the process) for
interoperability with file names arriving via getArgs, and at that
point the decision about whether to use the escaping/unescaping code
would be (mostly) in the hands of the user. We could even have posix
expose APIs to get command line arguments/environment variables as
ByteStrings, and then you could avoid escape/unescape entirely.
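As a sketch, that conversion might look like this (assuming GHC.Foreign and a
getFileSystemEncoding accessor are available in roughly this form; neither
name is settled at this point):

    import qualified Data.ByteString as B
    import qualified GHC.Foreign as GHC
    import GHC.IO.Encoding (getFileSystemEncoding)

    -- Re-encode a decoded FilePath with the filesystem encoding, turning
    -- any escape characters back into the bytes they stand for.
    toRawFilePath :: FilePath -> IO B.ByteString
    toRawFilePath fp = do
      enc <- getFileSystemEncoding
      GHC.withCStringLen enc fp B.packCStringLen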

Which of these solutions (if any) would satisfy you?
 1. The current situation, plus an alternative API exposed from
"posix" along the lines described above
 2. The current situation but with the escape/unescape modified so it
allows true roundtripping (at the cost of weird "surrogate" Char
values popping up now and again). If you have this you can reliably
implement the alternative API on top of the String based one, assuming
we got our escape/unescape code right

I hope we can work together to find a solution here.

Cheers,
Max



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Max Bolingbroke
On 8 November 2011 11:43, Simon Marlow  wrote:
> Don't you mean 1 is what we have?

Yes, sorry!

> Failing to roundtrip in some cases, and doing so silently, seems highly
> suboptimal to me.  I'm sorry I didn't pick up on this at the time (Unicode
> is a swamp :).

I *can* change the implementation back to using lone surrogates. This
gives us guaranteed roundtripping but it means that the user might see
lone-surrogate Char values in Strings from the filesystem/command
line. IIRC this does break some software -- e.g. Brian's "text"
library explicitly checks for such characters and fails if it detects
them.

So whatever happens we are going to end up making some group of users unhappy!
  * No PEP383: Haskellers using non-ASCII get upset when their command
line argument [String]s aren't in fact sequences of characters, but
sequences of bytes in some arbitrary encoding
  * PEP383(surrogates): Unicoders get upset by lone surrogates (which
can actually occur at the moment, independent of PEP383 -- e.g. as
character literals or from FFI)
  * PEP383(private chars): Unixers get upset that we can't roundtrip
byte sequences that look like the codepoint 0xEFXX encoded in the
current locale. In practice, 0xEFXX is only decodable from a UTF
encoding, so we fail to roundtrip byte sequences like the one Ian
posted.

I'm happy to implement any behaviour, I would just like to know that
whatever it is is accepted as the correct tradeoff :-)

RE exposing a ByteString based interface to the IO library from
base/unix/whatever: AFAIK Python doesn't do this, and just tells
people to use the (x.encode(sys.getfilesystemencoding(),
"surrogateescape")) escape hatch, which is what I've been
recommending. I think this would be more satisfying to John if it were
actually guaranteed to work on arbitrary byte sequences, not just
*highly likely* to work :-)

Max



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-08 Thread wren ng thornton

On 11/8/11 6:04 AM, Simon Marlow wrote:

I really think we should provide the native APIs. The problem is that
the System.Posix.Directory API is all in terms of FilePath (=String),
and if we gave that a different meaning from the System.Directory
FilePaths then confusion would ensue. So perhaps we need to add another
API to System.Posix with filesystem operations in terms of ByteString,
and similarly for Win32.


+1.

It'd be nice to have an abstract FilePath. But until that happens, it's 
important to distinguish the automagic type from the raw type. H98's 
FilePath=String vs ByteString seems a good way to do that.


--
Live well,
~wren



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-08 Thread John Millikin
On Tue, Nov 8, 2011 at 03:04, Simon Marlow  wrote:
>> As mentioned earlier in the thread, this behavior is breaking things.
>> Due to an implementation error, programs compiled with GHC 7.2 on
>> POSIX systems cannot open files unless their paths also happen to be
>> valid text according to their locale. It is very difficult to work
>> around this error, because the paths-are-text logic was placed at a
>> very low level in the library stack.
>
> So your objection is that there is a bug?  What if we fixed the bug?

My objection is that the current implementation provides no way to
work around potential bugs.

GHC is software. Like all software, it contains errors, and new
features are likely to contain more errors. When adding behavior like
automatic path encoding, there should always be a way to avoid or work
around it, in case a severe bug is discovered.

>>> It would probably be better to have an abstract FilePath type and to keep
>>> the original bytes, decoding on demand.  But that is a big change to the
>>> API
>>> and would break much more code.  One day we'll do this properly; for now
>>> we
>>> have this, which I think is a pretty reasonble compromise.
>>
>> Please understand, I am not arguing against the existence of this
>> encoding layer in general. It's a fine idea for a simplistic
>> high-level filesystem interaction library. But it should be
>> *optional*, not part of the compiler or "base.
>
> Ok, so I was about to reply and say that the low-level API is available via
> the unix and Win32 packages, and then I thought I should check first, and I
> discovered that even using System.Posix you get the magic encoding
> behaviour.
>
> I really think we should provide the native APIs.  The problem is that the
> System.Posix.Directory API is all in terms of FilePath (=String), and if we
> gave that a different meaning from the System.Directory FilePaths then
> confusion would ensue.  So perhaps we need to add another API to
> System.Posix with filesystem operations in terms of ByteString, and
> similarly for Win32.

+1

I think most users would be OK with having System.Posix treat FilePath
differently, as long as this is clearly documented, but if you feel a
separate API is better then I have no objection. As long as there's
some way to say "I know what I'm doing, here's the bytes" to the
library.

The Win32 package uses wide-character functions, so I'm not sure
whether bytes would be appropriate there. My instinct says to stick
with chars, via withCWString or equivalent. The package maintainer
will have a better idea of what fits with the OS's idioms.



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-08 Thread Simon Marlow

On 02/11/2011 21:40, Max Bolingbroke wrote:

> On 2 November 2011 20:16, Ian Lynagh  wrote:
>
>> Are you saying there's a bug that should be fixed?
>
> You can choose between two options:
>
>   1. Failing to roundtrip some strings (in our case, those containing
> the 0xEFNN byte sequences)
>   2. Having GHC's decoding functions return strings including
> codepoints that should not be allowed (i.e. lone surrogates)
>
> At the time I implemented this there was significant support for 2, so
> that is what we have.

Don't you mean 1 is what we have?

> At the time I was convinced that 2 was the right
> thing to do, but now I'm more agnostic. But anyway the current
> behaviour is not really a bug -- it is by design :-)


Failing to roundtrip in some cases, and doing so silently, seems highly 
suboptimal to me.  I'm sorry I didn't pick up on this at the time 
(Unicode is a swamp :).



Cheers,
Simon



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-08 Thread Simon Marlow

On 07/11/2011 17:32, John Millikin wrote:

On Mon, Nov 7, 2011 at 09:02, Simon Marlow  wrote:

I think you might be misunderstanding how the new API works.  Basically,
imagine a reversible transformation:

  encode :: String ->  [Word8]
  decode :: [Word8] ->  String

this transformation is applied in the appropriate direction by the IO
library to translate filesystem paths into FilePath and vice versa.  No
information is lost; furthermore you can apply the transformation yourself
in order to recover the original [Word8] from a String, or to inject your
own [Word8] file path.

Ok?


I understand how the API is intended / designed to work; however, the
implementation does not actually do this. My argument is that this
transformation should be in a high-level library like "directory", and
the low-level libraries like "base" or "unix" ought to provide
functions which do not transform their inputs. That way, when an error
is found in the encoding logic, it can be fixed by just pushing a new
version of the affected library to Hackage, instead of requiring a new
version of the compiler.

I am also not convinced that it is possible to correctly implement
either of these functions if their behavior is dependent on the user's
locale.


All this does is mean that the common case where you want to interpret file
system paths as text works with no fuss, without breaking anything in the
case when the file system paths are not actually text.


As mentioned earlier in the thread, this behavior is breaking things.
Due to an implementation error, programs compiled with GHC 7.2 on
POSIX systems cannot open files unless their paths also happen to be
valid text according to their locale. It is very difficult to work
around this error, because the paths-are-text logic was placed at a
very low level in the library stack.


So your objection is that there is a bug?  What if we fixed the bug?


It would probably be better to have an abstract FilePath type and to keep
the original bytes, decoding on demand.  But that is a big change to the API
and would break much more code.  One day we'll do this properly; for now we
have this, which I think is a pretty reasonble compromise.


Please understand, I am not arguing against the existence of this
encoding layer in general. It's a fine idea for a simplistic
high-level filesystem interaction library. But it should be
*optional*, not part of the compiler or "base.


Ok, so I was about to reply and say that the low-level API is available 
via the unix and Win32 packages, and then I thought I should check 
first, and I discovered that even using System.Posix you get the magic 
encoding behaviour.


I really think we should provide the native APIs.  The problem is that 
the System.Posix.Directory API is all in terms of FilePath (=String), 
and if we gave that a different meaning from the System.Directory 
FilePaths then confusion would ensue.  So perhaps we need to add another 
API to System.Posix with filesystem operations in terms of ByteString, 
and similarly for Win32.


Cheers,
Simon



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-08 Thread Simon Marlow

On 07/11/2011 17:57, Ian Lynagh wrote:

On Mon, Nov 07, 2011 at 05:02:32PM +, Simon Marlow wrote:


Basically, imagine a reversible transformation:

   encode :: String ->  [Word8]
   decode :: [Word8] ->  String

this transformation is applied in the appropriate direction by the
IO library to translate filesystem paths into FilePath and vice
versa.  No information is lost


I think that would be great if it were true, but it isn't:

$ touch `printf '\x80'`
$ touch `printf '\xEE\xBE\x80'`
$ ghc -e 'System.Directory.getDirectoryContents "." >>= print'
["\61312",".","\61312",".."]

Both of those filenames get encoded as \61312 (U+EF80).


Ouch, I missed that.  I was under the impression that we guaranteed 
roundtripping, but it seems not.


Max - we need to fix this.

Cheers,
Simon



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-07 Thread John Millikin
On Mon, Nov 7, 2011 at 15:39, Yitzchak Gale  wrote:
> The problem is that Haskell 98 specifies type FilePath = String.
> In retrospect, we now know that this is too simplistic.
> But that's what we have right now.

This is *a* problem, but not a particularly major one; the definition
of paths in GHC 7.0 (text on some systems, bytes on others) is
inelegant but workable.

The main problem, IMO, is that the semantics of openFile et al changed
in a way that is impossible to check for statically, and there was no
mention of this in the documentation. It's one thing to make a change
which will cause new compilation failures. It's quite another to
introduce an undocumented change in important semantics.

>> As implemented in GHC 7.2, this encoding is a complex and untested
>> behavior with no escape hatch.
>
> Isn't System.Posix.IO the escape hatch?
>
> Even though FilePath is still used there instead of
> ByteString as it should be, this is the
> low-level POSIX-specific library. So the old hack of
> interpreting the lowest 8 bits as bytes makes
> a lot more sense there.

System.Posix.IO, and the "unix" package in general, also perform the
new path encoding/decoding.



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-07 Thread Yitzchak Gale
Simon Marlow wrote:
>> It would probably be better to have an abstract FilePath type and to keep
>> the original bytes, decoding on demand.  But that is a big change to the API
>> and would break much more code.  One day we'll do this properly; for now we
>> have this, which I think is a pretty reasonble compromise.

John Millikin wrote:
> Please understand, I am not arguing against the existence of this
> encoding layer in general. It's a fine idea for a simplistic
> high-level filesystem interaction library. But it should be
> *optional*, not part of the compiler or "base.

The problem is that Haskell 98 specifies type FilePath = String.
In retrospect, we now know that this is too simplistic.
But that's what we have right now.

> As implemented in GHC 7.2, this encoding is a complex and untested
> behavior with no escape hatch.

Isn't System.Posix.IO the escape hatch?

Even though FilePath is still used there instead of
ByteString as it should be, this is the
low-level POSIX-specific library. So the old hack of
interpreting the lowest 8 bits as bytes makes
a lot more sense there.

Thanks,
Yitz



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-07 Thread Ian Lynagh
On Mon, Nov 07, 2011 at 05:02:32PM +, Simon Marlow wrote:
> 
> Basically, imagine a reversible transformation:
> 
>   encode :: String -> [Word8]
>   decode :: [Word8] -> String
> 
> this transformation is applied in the appropriate direction by the
> IO library to translate filesystem paths into FilePath and vice
> versa.  No information is lost

I think that would be great if it were true, but it isn't:

$ touch `printf '\x80'`
$ touch `printf '\xEE\xBE\x80'`
$ ghc -e 'System.Directory.getDirectoryContents "." >>= print'
["\61312",".","\61312",".."]

Both of those filenames get encoded as \61312 (U+EF80).


Thanks
Ian




Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-07 Thread John Millikin
On Mon, Nov 7, 2011 at 09:02, Simon Marlow  wrote:
> I think you might be misunderstanding how the new API works.  Basically,
> imagine a reversible transformation:
>
>  encode :: String -> [Word8]
>  decode :: [Word8] -> String
>
> this transformation is applied in the appropriate direction by the IO
> library to translate filesystem paths into FilePath and vice versa.  No
> information is lost; furthermore you can apply the transformation yourself
> in order to recover the original [Word8] from a String, or to inject your
> own [Word8] file path.
>
> Ok?

I understand how the API is intended / designed to work; however, the
implementation does not actually do this. My argument is that this
transformation should be in a high-level library like "directory", and
the low-level libraries like "base" or "unix" ought to provide
functions which do not transform their inputs. That way, when an error
is found in the encoding logic, it can be fixed by just pushing a new
version of the affected library to Hackage, instead of requiring a new
version of the compiler.

I am also not convinced that it is possible to correctly implement
either of these functions if their behavior is dependent on the user's
locale.

> All this does is mean that the common case where you want to interpret file
> system paths as text works with no fuss, without breaking anything in the
> case when the file system paths are not actually text.

As mentioned earlier in the thread, this behavior is breaking things.
Due to an implementation error, programs compiled with GHC 7.2 on
POSIX systems cannot open files unless their paths also happen to be
valid text according to their locale. It is very difficult to work
around this error, because the paths-are-text logic was placed at a
very low level in the library stack.

> It would probably be better to have an abstract FilePath type and to keep
> the original bytes, decoding on demand.  But that is a big change to the API
> and would break much more code.  One day we'll do this properly; for now we
> have this, which I think is a pretty reasonble compromise.

Please understand, I am not arguing against the existence of this
encoding layer in general. It's a fine idea for a simplistic
high-level filesystem interaction library. But it should be
*optional*, not part of the compiler or "base".

As implemented in GHC 7.2, this encoding is a complex and untested
behavior with no escape hatch.



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-07 Thread Simon Marlow

On 06/11/2011 16:56, John Millikin wrote:

2011/11/6 Max Bolingbroke:

On 6 November 2011 04:14, John Millikin  wrote:

For what it's worth, on my Ubuntu system, Nautilus ignores the locale
and just treats all paths as either UTF8 or invalid.
To me, this seems like the most reasonable option; the concept of
"locale encoding" is entirely vestigal, and should only be used in
certain specialized cases.


Unfortunately non-UTF8 locale encodings are seen in practice quite
often. I'm not sure about Linux, but certainly lots of Windows systems
are configured with a locale encoding like GBK or Big5.


This doesn't really matter for file paths, though. The Win32 file API
uses wide-character functions, which ought to work with Unicode text
regardless of what the user set their locale to.


Paths as text is what *Windows* programmers expect. Paths as bytes is
what's expected by programmers on non-Windows OSes, including Linux
and OS X.


IIRC paths on OS X are guaranteed to be valid UTF-8. The only platform
that uses bytes for paths (that we care about) is Linux.


UTF-8 is bytes. It can be treated as text in some cases, but it's
better to think about it as bytes.


I'm not saying one is inherently better than the other, but
considering that various UNIX  and UNIX-like operating systems have
been using byte-based paths for near on forty years now, trying to
abolish them by redefining the type is not a useful action.


We have to:
  1. Provide an API that makes sense on all our supported OSes
  2. Have getArgs :: IO [String]
  3. Have it such that if you go to your console and write
(./MyHaskellProgram 你好) then getArgs tells you ["你好"]

Given these constraints I don't see any alternative to PEP-383 behaviour.


Requirement #1 directly contradicts #2 and #3.


If you're going to make all the System.IO stuff use text, at least
give us an escape hatch. The "unix" package is ideally suited, as it's
already inherently OS-specific. Something like this would be perfect:


You can already do this with the implemented design. We have:

openFile :: FilePath ->  IO Handle

The FilePath will be encoded in the fileSystemEncoding. On Unix this
will have PEP383 roundtripping behaviour. So if you want openFile' ::
[Byte] ->  IO Handle you can write something like this:

escape = map (\b ->  if b<  128 then chr b else chr (0xEF00 + b))
openFile = openFile' . escape

The bytes that reach the API call will be exactly the ones you supply.
(You can also implement "escape" by just encoding the [Byte] with the
fileSystemEncoding).

Likewise, if you have a String and want to get the [Byte] we decoded
it from, you just need to encode the String again with the
fileSystemEncoding.

If this is not enough for you please let me know, but it seems to me
that it covers all your use cases, without any need to reimplement the
FFI bindings.


This is not enough, since these strings are still being passed through
the potentially (and in 7.2.1, actually) broken path encoder.


I think you might be misunderstanding how the new API works.  Basically, 
imagine a reversible transformation:


  encode :: String -> [Word8]
  decode :: [Word8] -> String

this transformation is applied in the appropriate direction by the IO 
library to translate filesystem paths into FilePath and vice versa.  No 
information is lost; furthermore you can apply the transformation 
yourself in order to recover the original [Word8] from a String, or to 
inject your own [Word8] file path.


Ok?
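A sketch of the inject-your-own-bytes direction (assuming GHC.Foreign and a
getFileSystemEncoding accessor exist in roughly this form; the encode
direction is sketched elsewhere in this thread):

    import Data.Word (Word8)
    import Foreign.C.Types (CChar)
    import Foreign.Marshal.Array (withArrayLen)
    import qualified GHC.Foreign as GHC
    import GHC.IO.Encoding (getFileSystemEncoding)

    -- Decode arbitrary bytes with the filesystem encoding; the resulting
    -- FilePath round-trips back to exactly these bytes.
    bytesToFilePath :: [Word8] -> IO FilePath
    bytesToFilePath bytes = do
      enc <- getFileSystemEncoding
      withArrayLen (map fromIntegral bytes :: [CChar]) $ \len ptr ->
        GHC.peekCStringLen enc (ptr, len)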

All this does is mean that the common case where you want to interpret 
file system paths as text works with no fuss, without breaking anything 
in the case when the file system paths are not actually text.


It would probably be better to have an abstract FilePath type and to 
keep the original bytes, decoding on demand.  But that is a big change 
to the API and would break much more code.  One day we'll do this 
properly; for now we have this, which I think is a pretty reasonable
compromise.
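For what it's worth, a tiny sketch of that abstract type (names are entirely
hypothetical):

    import qualified Data.ByteString.Char8 as B

    -- Keep the original bytes; decode only when a textual view is wanted.
    newtype OsPath = OsPath B.ByteString

    -- Display-only view; a real implementation would decode with the
    -- filesystem encoding rather than this naive byte-per-Char mapping.
    displayPath :: OsPath -> String
    displayPath (OsPath bs) = B.unpack bs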


Cheers,
Simon




Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-06 Thread Daniel Peebles
Can't we just have the usual .Internal module convention, where people who
want internals can get at them if they need to, and most people get a
simpler interface? It's amazingly frustrating when you have a library that
does 99% of what you need it to do, except for one tiny internal detail
that the author didn't foresee anyone needing, so didn't export.

2011/11/6 John Lask 

> for what it is worth, I would like to see both System.IO and Directory
> export "internal functions" where the filepath is a Raw Byte
> representation.
>
> I have utilities that regularly scan 100,000 of files and hash the path
> the details of which are irrelevant to this discussion, the point being
> that the locale encoding/decoding is not relevant in this situation and
> adds unnecessary overhead that would affect the speed of the file-system
> scans.
>
> A  denotation of a filepath as an uninterpreted sequence of bytes is the
> lowest common denominator for all systems that I know of and would be
> worthwhile to export from the system libraries upon which other
> abstractions can be built.
>
> I agree that for the general user the current behavior is sufficient,
> however exporting the raw interface would be beneficial for some users,
> for instance those that have responded to this thread.
>
>
> On 7/11/2011 2:42 AM, Max Bolingbroke wrote:
> > On 6 November 2011 04:14, John Millikin  wrote:
> >> For what it's worth, on my Ubuntu system, Nautilus ignores the locale
> >> and just treats all paths as either UTF8 or invalid.
> >> To me, this seems like the most reasonable option; the concept of
> >> "locale encoding" is entirely vestigal, and should only be used in
> >> certain specialized cases.
> >
> > Unfortunately non-UTF8 locale encodings are seen in practice quite
> > often. I'm not sure about Linux, but certainly lots of Windows systems
> > are configured with a locale encoding like GBK or Big5.
> >
> >> Paths as text is what *Windows* programmers expect. Paths as bytes is
> >> what's expected by programmers on non-Windows OSes, including Linux
> >> and OS X.
> >
> > IIRC paths on OS X are guaranteed to be valid UTF-8. The only platform
> > that uses bytes for paths (that we care about) is Linux.
> >
> >> I'm not saying one is inherently better than the other, but
> >> considering that various UNIX  and UNIX-like operating systems have
> >> been using byte-based paths for near on forty years now, trying to
> >> abolish them by redefining the type is not a useful action.
> >
> > We have to:
> >   1. Provide an API that makes sense on all our supported OSes
> >   2. Have getArgs :: IO [String]
> >   3. Have it such that if you go to your console and write
> > (./MyHaskellProgram 你好) then getArgs tells you ["你好"]
> >
> > Given these constraints I don't see any alternative to PEP-383 behaviour.
> >
> >> If you're going to make all the System.IO stuff use text, at least
> >> give us an escape hatch. The "unix" package is ideally suited, as it's
> >> already inherently OS-specific. Something like this would be perfect:
> >
> > You can already do this with the implemented design. We have:
> >
> > openFile :: FilePath ->  IO Handle
> >
> > The FilePath will be encoded in the fileSystemEncoding. On Unix this
> > will have PEP383 roundtripping behaviour. So if you want openFile' ::
> > [Byte] ->  IO Handle you can write something like this:
> >
> > escape = map (\b ->  if b<  128 then chr b else chr (0xEF00 + b))
> > openFile = openFile' . escape
> >
> > The bytes that reach the API call will be exactly the ones you supply.
> > (You can also implement "escape" by just encoding the [Byte] with the
> > fileSystemEncoding).
> >
> > Likewise, if you have a String and want to get the [Byte] we decoded
> > it from, you just need to encode the String again with the
> > fileSystemEncoding.
> >
> > If this is not enough for you please let me know, but it seems to me
> > that it covers all your use cases, without any need to reimplement the
> > FFI bindings.
> >
> > Max
> >
>
>
>


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-06 Thread John Lask
for what it is worth, I would like to see both System.IO and Directory
export "internal functions" where the filepath is a Raw Byte representation.

I have utilities that regularly scan hundreds of thousands of files and
hash the paths; the details are irrelevant to this discussion, the point
being that the locale encoding/decoding is not relevant in this situation
and adds unnecessary overhead that would affect the speed of the
file-system scans.
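To illustrate the kind of workload (the hash below is just FNV-1a for the
example; the point is that the path never leaves ByteString):

    import qualified Data.ByteString as B
    import Data.Bits (xor)
    import Data.Word (Word64)

    -- Hash a raw path with no decode/encode round trip per file.
    hashPath :: B.ByteString -> Word64
    hashPath = B.foldl' step 0xcbf29ce484222325
      where
        step h b = (h `xor` fromIntegral b) * 0x100000001b3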

A  denotation of a filepath as an uninterpreted sequence of bytes is the
lowest common denominator for all systems that I know of and would be
worthwhile to export from the system libraries upon which other
abstractions can be built.

I agree that for the general user the current behavior is sufficient,
however exporting the raw interface would be beneficial for some users,
for instance those that have responded to this thread.


On 7/11/2011 2:42 AM, Max Bolingbroke wrote:
> On 6 November 2011 04:14, John Millikin  wrote:
>> For what it's worth, on my Ubuntu system, Nautilus ignores the locale
>> and just treats all paths as either UTF8 or invalid.
>> To me, this seems like the most reasonable option; the concept of
>> "locale encoding" is entirely vestigal, and should only be used in
>> certain specialized cases.
> 
> Unfortunately non-UTF8 locale encodings are seen in practice quite
> often. I'm not sure about Linux, but certainly lots of Windows systems
> are configured with a locale encoding like GBK or Big5.
> 
>> Paths as text is what *Windows* programmers expect. Paths as bytes is
>> what's expected by programmers on non-Windows OSes, including Linux
>> and OS X.
> 
> IIRC paths on OS X are guaranteed to be valid UTF-8. The only platform
> that uses bytes for paths (that we care about) is Linux.
> 
>> I'm not saying one is inherently better than the other, but
>> considering that various UNIX  and UNIX-like operating systems have
>> been using byte-based paths for near on forty years now, trying to
>> abolish them by redefining the type is not a useful action.
> 
> We have to:
>   1. Provide an API that makes sense on all our supported OSes
>   2. Have getArgs :: IO [String]
>   3. Have it such that if you go to your console and write
> (./MyHaskellProgram 你好) then getArgs tells you ["你好"]
> 
> Given these constraints I don't see any alternative to PEP-383 behaviour.
> 
>> If you're going to make all the System.IO stuff use text, at least
>> give us an escape hatch. The "unix" package is ideally suited, as it's
>> already inherently OS-specific. Something like this would be perfect:
> 
> You can already do this with the implemented design. We have:
> 
> openFile :: FilePath ->  IO Handle
> 
> The FilePath will be encoded in the fileSystemEncoding. On Unix this
> will have PEP383 roundtripping behaviour. So if you want openFile' ::
> [Byte] ->  IO Handle you can write something like this:
> 
> escape = map (\b ->  if b<  128 then chr b else chr (0xEF00 + b))
> openFile' = openFile . escape
> 
> The bytes that reach the API call will be exactly the ones you supply.
> (You can also implement "escape" by just decoding the [Byte] with the
> fileSystemEncoding).
> 
> Likewise, if you have a String and want to get the [Byte] we decoded
> it from, you just need to encode the String again with the
> fileSystemEncoding.
> 
> If this is not enough for you please let me know, but it seems to me
> that it covers all your use cases, without any need to reimplement the
> FFI bindings.
> 
> Max
> 
> ___
> Glasgow-haskell-users mailing list
> Glasgow-haskell-users@haskell.org
> http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-06 Thread Donn Cave
Quoth John Millikin ,
...
> One is to give low-level access, using abstractions as close to the
> real API as possible. In this model, "unix" would provide functions
> like [[ rename :: ByteString -> ByteString -> IO () ]], and I would
> know that it's not going to do anything weird to the parameters.

I like that a lot.  In the "PEP" I see the phrase "in the same way that
the C interfaces can ignore the encoding" - and the above low level
access seems to belong to that same non-problematic category.

Donn

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-06 Thread John Millikin
2011/11/6 Max Bolingbroke :
> On 6 November 2011 04:14, John Millikin  wrote:
>> For what it's worth, on my Ubuntu system, Nautilus ignores the locale
>> and just treats all paths as either UTF8 or invalid.
>> To me, this seems like the most reasonable option; the concept of
>> "locale encoding" is entirely vestigal, and should only be used in
>> certain specialized cases.
>
> Unfortunately non-UTF8 locale encodings are seen in practice quite
> often. I'm not sure about Linux, but certainly lots of Windows systems
> are configured with a locale encoding like GBK or Big5.

This doesn't really matter for file paths, though. The Win32 file API
uses wide-character functions, which ought to work with Unicode text
regardless of what the user set their locale to.

>> Paths as text is what *Windows* programmers expect. Paths as bytes is
>> what's expected by programmers on non-Windows OSes, including Linux
>> and OS X.
>
> IIRC paths on OS X are guaranteed to be valid UTF-8. The only platform
> that uses bytes for paths (that we care about) is Linux.

UTF-8 is bytes. It can be treated as text in some cases, but it's
better to think about it as bytes.

>> I'm not saying one is inherently better than the other, but
>> considering that various UNIX  and UNIX-like operating systems have
>> been using byte-based paths for near on forty years now, trying to
>> abolish them by redefining the type is not a useful action.
>
> We have to:
>  1. Provide an API that makes sense on all our supported OSes
>  2. Have getArgs :: IO [String]
>  3. Have it such that if you go to your console and write
> (./MyHaskellProgram 你好) then getArgs tells you ["你好"]
>
> Given these constraints I don't see any alternative to PEP-383 behaviour.

Requirement #1 directly contradicts #2 and #3.

>> If you're going to make all the System.IO stuff use text, at least
>> give us an escape hatch. The "unix" package is ideally suited, as it's
>> already inherently OS-specific. Something like this would be perfect:
>
> You can already do this with the implemented design. We have:
>
> openFile :: FilePath -> IO Handle
>
> The FilePath will be encoded in the fileSystemEncoding. On Unix this
> will have PEP383 roundtripping behaviour. So if you want openFile' ::
> [Byte] -> IO Handle you can write something like this:
>
> escape = map (\b -> if b < 128 then chr b else chr (0xEF00 + b))
> openFile' = openFile . escape
>
> The bytes that reach the API call will be exactly the ones you supply.
> (You can also implement "escape" by just decoding the [Byte] with the
> fileSystemEncoding).
>
> Likewise, if you have a String and want to get the [Byte] we decoded
> it from, you just need to encode the String again with the
> fileSystemEncoding.
>
> If this is not enough for you please let me know, but it seems to me
> that it covers all your use cases, without any need to reimplement the
> FFI bindings.

This is not enough, since these strings are still being passed through
the potentially (and in 7.2.1, actually) broken path encoder.

If the "unix" package had defined functions which operate on the
correct type (CString / ByteString), then it would not be necessary to
patch "base". I could just call the POSIX functions from system-fileio
and be done with it.

And this solution still assumes that there is such a thing as a
filesystem encoding in POSIX. There isn't. A file path is an arbitrary
sequence of bytes, with no significance except what the application
user interface decides.

It seems to me that there's two ways to provide bindings to operating
system functionality.

One is to give low-level access, using abstractions as close to the
real API as possible. In this model, "unix" would provide functions
like [[ rename :: ByteString -> ByteString -> IO () ]], and I would
know that it's not going to do anything weird to the parameters.

Another is to pretend that operating systems are all the same, and can
have their APIs abstracted away to some hypothetical virtual system.
This model just makes it more difficult for programmers to access the
OS, as they have to learn both the standard API, *and* whatever weird
thing has been layered on top of it.

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-06 Thread Max Bolingbroke
On 6 November 2011 04:14, John Millikin  wrote:
> For what it's worth, on my Ubuntu system, Nautilus ignores the locale
> and just treats all paths as either UTF8 or invalid.
> To me, this seems like the most reasonable option; the concept of
> "locale encoding" is entirely vestigal, and should only be used in
> certain specialized cases.

Unfortunately non-UTF8 locale encodings are seen in practice quite
often. I'm not sure about Linux, but certainly lots of Windows systems
are configured with a locale encoding like GBK or Big5.

> Paths as text is what *Windows* programmers expect. Paths as bytes is
> what's expected by programmers on non-Windows OSes, including Linux
> and OS X.

IIRC paths on OS X are guaranteed to be valid UTF-8. The only platform
that uses bytes for paths (that we care about) is Linux.

> I'm not saying one is inherently better than the other, but
> considering that various UNIX  and UNIX-like operating systems have
> been using byte-based paths for near on forty years now, trying to
> abolish them by redefining the type is not a useful action.

We have to:
 1. Provide an API that makes sense on all our supported OSes
 2. Have getArgs :: IO [String]
 3. Have it such that if you go to your console and write
(./MyHaskellProgram 你好) then getArgs tells you ["你好"]

Given these constraints I don't see any alternative to PEP-383 behaviour.

> If you're going to make all the System.IO stuff use text, at least
> give us an escape hatch. The "unix" package is ideally suited, as it's
> already inherently OS-specific. Something like this would be perfect:

You can already do this with the implemented design. We have:

openFile :: FilePath -> IO Handle

The FilePath will be encoded in the fileSystemEncoding. On Unix this
will have PEP383 roundtripping behaviour. So if you want openFile' ::
[Byte] -> IO Handle you can write something like this:

escape = map (\b -> if b < 128 then chr b else chr (0xEF00 + b))
openFile' = openFile . escape

The bytes that reach the API call will be exactly the ones you supply.
(You can also implement "escape" by just decoding the [Byte] with the
fileSystemEncoding).

Likewise, if you have a String and want to get the [Byte] we decoded
it from, you just need to encode the String again with the
fileSystemEncoding.
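
For concreteness, here is a minimal self-contained sketch of that escape
hatch (escapeBytes and openFileBytes are illustrative names, and it relies
only on the 0xEF00-based escape behaviour of the fileSystemEncoding
described above):

import Data.Char (chr)
import Data.Word (Word8)
import System.IO (Handle, IOMode, openFile)

-- Map raw bytes to a String that the roundtripping fileSystemEncoding
-- encodes back to exactly those bytes: ASCII passes through unchanged,
-- anything >= 0x80 goes into the private-use escape range 0xEF00 + byte.
escapeBytes :: [Word8] -> String
escapeBytes = map escapeByte
  where
    escapeByte b
      | b < 128   = chr (fromIntegral b)
      | otherwise = chr (0xEF00 + fromIntegral b)

-- Open a file whose name is supplied as raw bytes.
openFileBytes :: [Word8] -> IOMode -> IO Handle
openFileBytes path mode = openFile (escapeBytes path) mode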

If this is not enough for you please let me know, but it seems to me
that it covers all your use cases, without any need to reimplement the
FFI bindings.

Max

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-05 Thread John Millikin
FYI: I just released new versions of system-filepath and
system-fileio, which attempt to work around the changes in GHC 7.2.

On Wed, Nov 2, 2011 at 11:55, Max Bolingbroke
 wrote:
>> Maybe I'm misunderstanding, but it sounds like you're still trying to
>> treat posix file paths as text. There should not be any iconv or
>> locales or anything involved in looking up a posix file path.
>
> The thing is that on every non-Unix OS paths *can* be interpreted as
> text, and people expect them to be. In fact, even on Unix most
> programs/frameworks interpret them as text - e.g. IIRC QT's QString
> class is used for filenames in that framework, and if you view
> filenames in an end-user app like Nautilus it obviously decodes them
> in the current locale for presentation.

There is a difference between how paths are rendered to users, and how
they are handled by applications.

Applications *must* use whatever the operating system says a path is.
If a path is bytes, they must use bytes. If a path is text, they must
use text.

How they present paths to the user is a matter of user interface design.

For what it's worth, on my Ubuntu system, Nautilus ignores the locale
and just treats all paths as either UTF8 or invalid.
To me, this seems like the most reasonable option; the concept of
"locale encoding" is entirely vestigal, and should only be used in
certain specialized cases.

> Paths as text is just what people expect, and is grandfathered into
> the Haskell libraries itself as "type FilePath = String". PEP-383
> behaviour is (I think) a good way to satisfy this expectation while
> still not sacrificing the ability to deal with files that have names
> encoded in some way other than the locale encoding.

Paths as text is what *Windows* programmers expect. Paths as bytes is
what's expected by programmers on non-Windows OSes, including Linux
and OS X.

I'm not saying one is inherently better than the other, but
considering that various UNIX  and UNIX-like operating systems have
been using byte-based paths for near on forty years now, trying to
abolish them by redefining the type is not a useful action.

> (Perhaps if Haskell had an abstract FilePath data type rather than
> FilePath = String we could do something different.

This is the general purpose of my system-filepath package, which
provides a set of generic modifications, applicable to paths from
various OS families.

> But it's not clear
> if we could, without also having ugliness like getArgs :: IO [Byte])

We *ought* to have getArgs :: IO [ByteString], at least on POSIX systems.

It's totally OK if high-level packages like "directory" want to hide
details behind some nice abstractions. But the low-level libraries,
like "base" and "unix" and "Win32", must must must provide direct
low-level access to the operating system's APIs.

The only other option is to re-implement half of the standard library
using FFI bindings, which is ugly (for file/directory manipulation) or
nearly impossible (for opening handles).

If you're going to make all the System.IO stuff use text, at least
give us an escape hatch. The "unix" package is ideally suited, as it's
already inherently OS-specific. Something like this would be perfect:

--
System.Posix.File.openHandle :: CString -> IOMode -> IO Handle

System.Posix.File.rename :: CString -> CString -> IO ()
--
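
For concreteness, a minimal sketch of what a binding in this spirit might
look like (the module name just mirrors the proposal above, renameBytes is
an illustrative extra, and error handling is reduced to an errno check):

{-# LANGUAGE ForeignFunctionInterface #-}
module System.Posix.File (rename, renameBytes) where

import qualified Data.ByteString as B
import Foreign.C.Error (throwErrnoIfMinus1_)
import Foreign.C.String (CString)
import Foreign.C.Types (CInt(..))

foreign import ccall unsafe "stdio.h rename"
  c_rename :: CString -> CString -> IO CInt

-- The paths go straight to rename(2); no locale decoding is involved.
rename :: CString -> CString -> IO ()
rename old new = throwErrnoIfMinus1_ "rename" (c_rename old new)

-- Convenience layer: the ByteString's exact bytes (as a NUL-terminated
-- copy) are what reach the system call.
renameBytes :: B.ByteString -> B.ByteString -> IO ()
renameBytes old new =
  B.useAsCString old $ \cold ->
  B.useAsCString new $ \cnew ->
  rename cold cnew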

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-04 Thread David Brown

On Thu, Nov 03, 2011 at 09:41:32AM +, Max Bolingbroke wrote:

On 2 November 2011 21:46, Ganesh Sittampalam  wrote:

The workaround you propose seems a little complex and it might be a bit
problematic that 100% roundtripping can't be guaranteed even once your
fix is applied.


I can understand this perspective, although the roundtripping as
implemented will only fail in certain very obscure cases.


Depending on the software one is writing, any failure, no matter how
obscure, would not be acceptable.  Think of a file browser, or backup
software.  So, yes, a non-destructive way of reading directories is
important.

At least one Linux distribution (I think Gentoo) actually has invalid
pathnames in the filesystem in order to make sure that software that
is part of the system will be able to handle them.

For 'harchive' (which I am still gradually working on), I had to write
my own version of readDirStream out of Posix that returns both the
path and the inode number (FileID).  Most filesystems on Linux are
vastly faster if you 'stat' the entries of a directory in inode order
rather than the order they were returned by readdir.  In this sense,
I'm not all that concerned if the regular getDirectoryContents isn't
round trippable.

David

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-03 Thread Max Bolingbroke
On 2 November 2011 21:46, Ganesh Sittampalam  wrote:
> The workaround you propose seems a little complex and it might be a bit
> problematic that 100% roundtripping can't be guaranteed even once your
> fix is applied.

I can understand this perspective, although the roundtripping as
implemented will only fail in certain very obscure cases.

> Do you think it would be reasonable/feasible for darcs
> to have its own version of getDirectoryContents that doesn't try to do
> any translation in the first place? It might make sense to make a
> separate package that others could use too.

Yes, absolutely! I think a very valuable contribution would be a
package providing filesystem functions (with an abstract FilePath
type) that is portable across Windows, OS X and *nix-like OSes. This
would be a useful package for anyone who wants to avoid the
performance (and very rare correctness) problems associated with
roundtripping.

> BTW I was trying to find the patch where this changed but couldn't - was
> it a consequence of
> https://github.com/ghc/packages-base/commit/509f28cc93b980d30aca37008cbe66c677a0d6f6
> ?

That is the main patch. I had to patch the libraries as well to make
use of the changed encodings. See (for example)
https://github.com/ghc/packages-unix/commit/bb8a27d14a63fcd126a924d32c69b7694ea709d9

Max

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Ganesh Sittampalam
Hi Max,

On 01/11/2011 10:23, Max Bolingbroke wrote:

> This is my implementation of Python's PEP 383 [1] for Haskell.
> 
> IMHO this behaviour is much closer to what users expect. For example,
> getDirectoryContents "." >>= print shows Unicode filenames properly.
> As a result of this change we were able to close quite a few
> outstanding GHC bugs.

Many thanks for your reply and all the subsequent followups and bugfixing.

The workaround you propose seems a little complex and it might be a bit
problematic that 100% roundtripping can't be guaranteed even once your
fix is applied. Do you think it would be reasonable/feasible for darcs
to have its own version of getDirectoryContents that doesn't try to do
any translation in the first place? It might make sense to make a
separate package that others could use too.

BTW I was trying to find the patch where this changed but couldn't - was
it a consequence of
https://github.com/ghc/packages-base/commit/509f28cc93b980d30aca37008cbe66c677a0d6f6
?

Cheers,

Ganesh

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 20:16, Ian Lynagh  wrote:
> Are you saying there's a bug that should be fixed?

You can choose between two options:

 1. Failing to roundtrip some strings (in our case, those containing
the 0xEFNN byte sequences)
 2. Having GHC's decoding functions return strings including
codepoints that should not be allowed (i.e. lone surrogates)

At the time I implemented this there was significant support for 2, so
that is what we have. At the time I was convinced that 2 was the right
thing to do, but now I'm more agnostic. But anyway the current
behaviour is not really a bug -- it is by design :-)

Max

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Ian Lynagh
On Wed, Nov 02, 2011 at 07:59:21PM +, Max Bolingbroke wrote:
> On 2 November 2011 19:13, Ian Lynagh  wrote:
> 
> > They are allowed to occur in Linux/ext2 filenames, anyway, and I think
> > we ought to be able to handle them correctly if they do.
> 
> In Python, if a filename is decoded using UTF8 and the "surrogate
> escape" error handler, occurrences of lone surrogates are a decoding
> error because they are not allowed to occur in UTF-8 text. As a result
> the lone surrogate is put into the string escaped so it can be
> roundtripped back to a lone surrogate on output. So Python works OK.
> 
> In GHC >= 7.2, if a filename is decoded using UTF8 and the "Roundtrip"
> error handler, occurrences of 0xEFNN are not a decoding error because
> they are perfectly fine Unicode codepoints. As a result they get put
> into the string unescaped, and so when we try to roundtrip the string
> we get the byte 0xNN in the output rather than the UTF-8 encoding of
> 0xEFNN. So GHC does not work OK in this situation :-(

Are you saying there's a bug that should be fixed?


Thanks
Ian


___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 19:13, Ian Lynagh  wrote:
> [snip some stuff I didn't understand. I think I made the mistake of
> entering a Unicode discussion]

Sorry, perhaps that was too opaque! The problem is that if we commit
to support occurrences of the private-use codepoint 0xEF80 then what
happens if we:

1. Decode the UTF-32le data [0x80, 0xEF, 0x00, 0x00] to a string "\xEF80"
2. Pass the string "\xEF80" to a function that encodes it using an
encoding which knows about the escaping mechanism.
3. Consequently encode "\xEF80" as [0x80]

This seems a bit sad.

> They are allowed to occur in Linux/ext2 filenames, anyway, and I think
> we ought to be able to handle them correctly if they do.

In Python, if a filename is decoded using UTF8 and the "surrogate
escape" error handler, occurrences of lone surrogates are a decoding
error because they are not allowed to occur in UTF-8 text. As a result
the lone surrogate is put into the string escaped so it can be
roundtripped back to a lone surrogate on output. So Python works OK.

In GHC >= 7.2, if a filename is decoded using UTF8 and the "Roundtrip"
error handler, occurrences of 0xEFNN are not a decoding error because
they are perfectly fine Unicode codepoints. As a result they get put
into the string unescaped, and so when we try to roundtrip the string
we get the byte 0xNN in the output rather than the UTF-8 encoding of
0xEFNN. So GHC does not work OK in this situation :-(

(The problem I outlined at the start of this email doesn't arise with
the lone surrogate mechanism because surrogates aren't allowed in
UTF-32 text either. So step 1 in the process would have failed with a
decoding error.)

Hope that helps,
Max

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Ian Lynagh
On Wed, Nov 02, 2011 at 07:02:09PM +, Max Bolingbroke wrote:

[snip some stuff I didn't understand. I think I made the mistake of
entering a Unicode discussion]

> This is why the unmodified PEP383 approach is kind of nice - it uses
> lone surrogate (rather than private use) codepoints to do the
> escaping, and these codepoints are simply not allowed to occur in
> valid UTF-encoded text.

If they do not occur, then why does it matter whether or not occurrences
would get escaped?

They are allowed to occur in Linux/ext2 filenames, anyway, and I think
we ought to be able to handle them correctly if they do.


Thanks
Ian


___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 16:29, Ian Lynagh  wrote:
> If I understand correctly, you use U+EF00-U+EFFF to encode the
> characters 0-255 when they are not a valid part of the UTF8 stream.

Yes.

> So why not encode U+EF00 (which in UTF8 is 0xEE 0xBC 0x80) as
> U+EFEE U+EFBC U+EF80, and so on? Doesn't it then become completely
> reversible?

This was also suggested by Mark Lentczner at the time I wrote the
patch, but I raised a few objections (reproduced below):

"""
This would require us to:
 1. Unconditionally decode these byte sequences using the escape
mechanism, even if using a non-roundtripping encoding. This is because
the chars that result might be fed back into a roundtripping encoding,
where they would otherwise get confused with escapes representing some
other bytes.
 2. Unconditionally decode these particular characters from escapes,
even if using a non-roundtripping decoding -- necessary because of 1.

Which are both a little annoying. Perhaps more seriously, it would
play badly with e.g. reading in UTF-8 and writing out UTF-16, because
your UTF-16 would have bits of UTF-8 representing these private-use
chars embedded within it..
"""

So although this is approach is somewhat attractive, I'm not sure the
benefits of complete roundtripping outweigh the costs.

This is why the unmodified PEP383 approach is kind of nice - it uses
lone surrogate (rather than private use) codepoints to do the
escaping, and these codepoints are simply not allowed to occur in
valid UTF-encoded text.

Max

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 17:15, John Millikin  wrote:
> What package does this patch apply to -- "unix", "directory", something else?

The "base" package. The problem lay in the implementation of
GHC.IO.Encoding.fileSystemEncoding on non-Windows OSes.

> Maybe I'm misunderstanding, but it sounds like you're still trying to
> treat posix file paths as text. There should not be any iconv or
> locales or anything involved in looking up a posix file path.

The thing is that on every non-Unix OS paths *can* be interpreted as
text, and people expect them to be. In fact, even on Unix most
programs/frameworks interpret them as text - e.g. IIRC QT's QString
class is used for filenames in that framework, and if you view
filenames in an end-user app like Nautilus it obviously decodes them
in the current locale for presentation.

Paths as text is just what people expect, and is grandfathered into
the Haskell libraries itself as "type FilePath = String". PEP-383
behaviour is (I think) a good way to satisfy this expectation while
still not sacrificing the ability to deal with files that have names
encoded in some way other than the locale encoding.

(Perhaps if Haskell had an abstract FilePath data type rather than
FilePath = String we could do something different. But it's not clear
if we could, without also having ugliness like getArgs :: IO [Byte])

Cheers,
Max

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread John Millikin
On Wed, Nov 2, 2011 at 06:53, Max Bolingbroke
 wrote:
> I've got a patch that will work around the issue in most situations by
> avoiding the iconv code path. With the patch everything will work OK
> as long as the system locale is one that we have a native-Haskell
> decoder for (i.e. basically UTF-8). So you will still be able to get
> the broken behaviour if the above 3 conditions are met AND your system
> locale is not UTF-8.

What package does this patch apply to -- "unix", "directory", something else?

> I think the only way to fix this last case in general is to fix iconv
> itself, so I'm going to see if I can get a patch upstream. Fixing it
> for people with UTF-8 locales should be enough for 99% of users,
> though.

Maybe I'm misunderstanding, but it sounds like you're still trying to
treat posix file paths as text. There should not be any iconv or
locales or anything involved in looking up a posix file path.

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Ian Lynagh
On Wed, Nov 02, 2011 at 01:29:16PM +, Max Bolingbroke wrote:
> On 2 November 2011 10:03, Jean-Marie Gaillourdet  wrote:
> > As far as I know, not all encodings are reversible. I.e. there are byte
> > sequences which are invalid utf-8. Therefore, decoding and re-encoding 
> > might not return the exact same byte sequence.
> 
> The PEP 383 mechanism explicitly recognises this fact and defines a
> reversible way of decoding bytes into strings. The new behaviour is
> guaranteed to be reversible except for certain private use codepoints
> (0xEF00 to 0xEFFF inclusive) which:
>  1. We do not expect to see in practice
>  2. Are unofficially standardised for use with this sort of "encoding hack"

I don't understand this.

If I understand correctly, you use U+EF00-U+EFFF to encode the
characters 0-255 when they are not a valid part of the UTF8 stream.

So why not encode U+EF00 (which in UTF8 is 0xEE 0xBC 0x80) as
U+EFEE U+EFBC U+EF80, and so on? Doesn't it then become completely
reversible?
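
For concreteness, a tiny sketch of the mapping being proposed (the names are
illustrative):

import Data.Char (chr)
import Data.Word (Word8)

-- The existing escape: a byte b that cannot be decoded becomes U+EF00 + b.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xEF00 + fromIntegral b)

-- Under this proposal the UTF8 bytes of U+EF00 (0xEE 0xBC 0x80) would
-- themselves be escaped byte-by-byte:
--   map escapeByte [0xEE, 0xBC, 0x80] == "\xEFEE\xEFBC\xEF80"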


Thanks
Ian


___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 13:53, Max Bolingbroke  wrote:
> I think the only way to fix this last case in general is to fix iconv
> itself, so I'm going to see if I can get a patch upstream. Fixing it
> for people with UTF-8 locales should be enough for 99% of users,
> though.

One last update on this: I've found the cause of the problem in the
GNU iconv source code and submitted a bug report.

I've also found out that with my patch the problem should be fixed in
almost every case (not just 99%!) because GNU iconv will correctly
reject surrogates in the UTF32le<->locale encoding conversion process
with EILSEQ for every non-UTF8 locale encoding that I looked at - even
UTF16 and UTF32!

So in conclusion I think this issue is totally fixed. Please let me
know if you encounter any other problems.

Max

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 09:37, Max Bolingbroke  wrote:
> On 1 November 2011 20:13, John Millikin  wrote:
>> $ ghci-7.2.1
>> GHC> import System.Directory
>> GHC> getDirectoryContents "path-test"
>> ["\161\165","\61345\61349","..","."]
>> GHC> readFile "path-test/\161\165"
>> "world\n"
>> GHC> readFile "path-test/\61345\61349"
>> *** Exception: path-test/: openFile: does not exist (No such file or
>> directory)
>
> Thanks for the example! I can reproduce this on Linux (haven't tried
> OS X or Windows) and AFAICT this behaviour is just a straight-up bug
> and is *not* intended behaviour. I'm not sure why the tests aren't
> catching it.

I've tracked it down and this bug arises in the following situation:
 1. You are not running on Windows
 2. You are attempting to encode a string containing the private-use
escape codepoints
 3. You are using an iconv (such as the one in GNU libc) that, in
contravention of the Unicode standard, does not signal EILSEQ if
surrogate codepoints are encountered in a non-UTF16 input

I've got a patch that will work around the issue in most situations by
avoiding the iconv code path. With the patch everything will work OK
as long as the system locale is one that we have a native-Haskell
decoder for (i.e. basically UTF-8). So you will still be able to get
the broken behaviour if the above 3 conditions are met AND your system
locale is not UTF-8.

I think the only way to fix this last case in general is to fix iconv
itself, so I'm going to see if I can get a patch upstream. Fixing it
for people with UTF-8 locales should be enough for 99% of users,
though.

Max

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 10:03, Jean-Marie Gaillourdet  wrote:
> As far as I know, not all encodings are reversible. I.e. there are byte
> sequences which are invalid utf-8. Therefore, decoding and re-encoding might 
> not return the exact same byte sequence.

The PEP 383 mechanism explicitly recognises this fact and defines a
reversible way of decoding bytes into strings. The new behaviour is
guaranteed to be reversible except for certain private use codepoints
(0xEF00 to 0xEFFF inclusive) which:
 1. We do not expect to see in practice
 2. Are unofficially standardised for use with this sort of "encoding hack"

Max

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Jean-Marie Gaillourdet
Hi,

On 01.11.2011, at 19:43, Max Bolingbroke wrote:

> As I pointed out earlier in the thread you can recover the old
> behaviour if you really want it by manually reencoding the strings, so
> I would dispute the claim that it is "impossible to fix within the
> given API".

As far as I know, not all encodings are reversible. I.e. there are byte
sequences which are invalid utf-8. Therefore, decoding and re-encoding might 
not return the exact same byte sequence.

Cheers,
  Jean
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 1 November 2011 20:13, John Millikin  wrote:
> $ ghci-7.2.1
> GHC> import System.Directory
> GHC> getDirectoryContents "path-test"
> ["\161\165","\61345\61349","..","."]
> GHC> readFile "path-test/\161\165"
> "world\n"
> GHC> readFile "path-test/\61345\61349"
> *** Exception: path-test/: openFile: does not exist (No such file or
> directory)

Thanks for the example! I can reproduce this on Linux (haven't tried
OS X or Windows) and AFAICT this behaviour is just a straight-up bug
and is *not* intended behaviour. I'm not sure why the tests aren't
catching it.

I'm looking into it now.

Max

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread John Millikin
On Tue, Nov 1, 2011 at 11:43, Max Bolingbroke
 wrote:
> Hi John,
>
> On 1 November 2011 17:14, John Millikin  wrote:
>> GHC 7.2 assumes Linux/BSD paths are text, which 1) silently breaks all
>> existing code and 2) makes it impossible to fix within the given API.
>
> Please can you give an example of code that is broken with the new
> behaviour? The PEP 383 mechanism will unavoidably break *some* code
> but I don't expect there to be much of it. One thing that most likely
> *will* be broken is code that attempts to reinterpret a String as a
> "byte string" - i.e. assuming that it was decoded using latin1, but I
> expect that such code can just be deleted when you upgrade to 7.2.

Examples of broken code are Darcs, my system-fileio, and likely
anything else which needs to open Unicode-named files in both 7.0 and
7.2.

As a quick example, consider the case of files with encodings
different from the user's locale. This is *very* common, especially
when interoperating with foreign Windows systems.

$ ghci-7.0.4
GHC> import System.Directory
GHC> createDirectory "path-test"
GHC> writeFile "path-test/\xA1\xA5" "hello\n"
GHC> writeFile "path-test/\xC2\xA1\xC2\xA5" "world\n"
GHC> ^D

$ ghci-7.2.1
GHC> import System.Directory
GHC> getDirectoryContents "path-test"
["\161\165","\61345\61349","..","."]
GHC> readFile "path-test/\161\165"
"world\n"
GHC> readFile "path-test/\61345\61349"
*** Exception: path-test/: openFile: does not exist (No such file or
directory)

> As I pointed out earlier in the thread you can recover the old
> behaviour if you really want it by manually reencoding the strings, so
> I would dispute the claim that it is "impossible to fix within the
> given API".

Please describe how I can, in GHC 7.2, read the contents of the file
"path-test/\xA1\xA5" without changing my locale.

As far as I can tell, there is no way to do this using the standard
libraries. I would have to fall back to the "unix" package, or even
FFI imports, to open that file.

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread Max Bolingbroke
Hi John,

On 1 November 2011 17:14, John Millikin  wrote:
> GHC 7.2 assumes Linux/BSD paths are text, which 1) silently breaks all
> existing code and 2) makes it impossible to fix within the given API.

Please can you give an example of code that is broken with the new
behaviour? The PEP 383 mechanism will unavoidably break *some* code
but I don't expect there to be much of it. One thing that most likely
*will* be broken is code that attempts to reinterpret a String as a
"byte string" - i.e. assuming that it was decoded using latin1, but I
expect that such code can just be deleted when you upgrade to 7.2.

As I pointed out earlier in the thread you can recover the old
behaviour if you really want it by manually reencoding the strings, so
I would dispute the claim that it is "impossible to fix within the
given API".

Cheers,
Max

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread John Millikin
You're right -- many parts of system-fileio (the parts based on
"directory") are broken due to this. I'll need to update it to call
the posix/win32 functions directly.

IMO, the GHC behavior in <=7.0 is ugly, but the behavior in 7.2 is
fundamentally wrong.

Different OSes have different definitions of a "file path". A Windows
path is a sequence of Unicode characters. A Linux/BSD path is a
sequence of bytes. I'm not certain what OSX does, but I believe it
uses bytes also.

In GHC <= 7.0, the String type was used for both sorts of paths, with
interpretation of the contents being OS-dependent. This sort of works,
because it's possible to represent both byte- and text-based paths in
String.

GHC 7.2 assumes Linux/BSD paths are text, which 1) silently breaks all
existing code and 2) makes it impossible to fix within the given API.

On Tue, Nov 1, 2011 at 08:48, Felipe Almeida Lessa
 wrote:
> On Tue, Nov 1, 2011 at 5:16 AM, Ganesh Sittampalam  wrote:
>> I'm just investigating what we can do about a problem with darcs'
>> handling of non-ASCII filenames on GHC 7.2.
>>
>> The issue is apparently that as of GHC 7.2, getDirectoryContents now
>> tries to decode filenames in the current locale, rather than converting
>> a stream of bytes into characters: http://bugs.darcs.net/issue2095
>>
>> I found an old thread on the subject:
>> http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html and
>> some GHC tickets (e.g. http://hackage.haskell.org/trac/ghc/ticket/3300)
>>
>> Can anyone point me at the rationale and details of the change and/or
>> suggest workarounds?
>
> You could try using system-fileio [1], but by reading its source code
> I guess that it may have the same bug (since it tries to decode what
> the directory package gives).  I'm CCing John Millikin, its
> maintainer.
>
> Cheers,
>
> [1] 
> http://hackage.haskell.org/packages/archive/system-fileio/0.3.2.1/doc/html/Filesystem.html#v:listDirectory
>
> --
> Felipe.
>

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread Felipe Almeida Lessa
On Tue, Nov 1, 2011 at 5:16 AM, Ganesh Sittampalam  wrote:
> I'm just investigating what we can do about a problem with darcs'
> handling of non-ASCII filenames on GHC 7.2.
>
> The issue is apparently that as of GHC 7.2, getDirectoryContents now
> tries to decode filenames in the current locale, rather than converting
> a stream of bytes into characters: http://bugs.darcs.net/issue2095
>
> I found an old thread on the subject:
> http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html and
> some GHC tickets (e.g. http://hackage.haskell.org/trac/ghc/ticket/3300)
>
> Can anyone point me at the rationale and details of the change and/or
> suggest workarounds?

You could try using system-fileio [1], but by reading its source code
I guess that it may have the same bug (since it tries to decode what
the directory package gives).  I'm CCing John Millikin, its
maintainer.

Cheers,

[1] 
http://hackage.haskell.org/packages/archive/system-fileio/0.3.2.1/doc/html/Filesystem.html#v:listDirectory

-- 
Felipe.

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread Max Bolingbroke
Hi Ganesh,

On 1 November 2011 07:16, Ganesh Sittampalam  wrote:
> Can anyone point me at the rationale and details of the change and/or
> suggest workarounds?

This is my implementation of Python's PEP 383 [1] for Haskell.

IMHO this behaviour is much closer to what users expect. For example,
getDirectoryContents "." >>= print shows Unicode filenames properly.
As a result of this change we were able to close quite a few
outstanding GHC bugs.

PEP-383 behaviour always does the right thing on setups with a
consistent text encoding for filenames, command line arguments and the
like (Windows, or *nix where the system locale is e.g. UTF-8 and all
filenames are encoded in that locale). However, there are legitimate
use cases where the program has more information about how something
is encoded than just the system locale, and in those cases you should
*encode* the String from getDirectoryContents using
GHC.IO.Encoding.fileSystemEncoding and then *decode* it with your
preferred TextEncoding. In your case I think you want
GHC.IO.Encoding.latin1.

You can use a helper function like this to make this easier:

-- (needs unsafeLocalState from Foreign.Marshal and a qualified import of GHC.Foreign)
reencode :: TextEncoding -> TextEncoding -> String -> String
reencode from_enc to_enc from = unsafeLocalState $
  GHC.Foreign.withCStringLen from_enc from (GHC.Foreign.peekCStringLen to_enc)
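
For concreteness, a minimal usage sketch built on the reencode helper above
(getDirectoryContentsLatin1 is an illustrative name; it assumes the
fileSystemEncoding and latin1 values exported by GHC.IO.Encoding):

import GHC.IO.Encoding (fileSystemEncoding, latin1)
import System.Directory (getDirectoryContents)

-- Recover the pre-7.2 byte-per-Char view of the names by re-encoding the
-- PEP-383-decoded Strings with the file system encoding and decoding the
-- resulting bytes as latin1.
getDirectoryContentsLatin1 :: FilePath -> IO [String]
getDirectoryContentsLatin1 dir =
  fmap (map (reencode fileSystemEncoding latin1)) (getDirectoryContents dir)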

Hope that helps,

Max


[1] http://www.python.org/dev/peps/pep-0383/

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread Ganesh Sittampalam
Hi,

I'm just investigating what we can do about a problem with darcs'
handling of non-ASCII filenames on GHC 7.2.

The issue is apparently that as of GHC 7.2, getDirectoryContents now
tries to decode filenames in the current locale, rather than converting
a stream of bytes into characters: http://bugs.darcs.net/issue2095

I found an old thread on the subject:
http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html and
some GHC tickets (e.g. http://hackage.haskell.org/trac/ghc/ticket/3300)

Can anyone point me at the rationale and details of the change and/or
suggest workarounds?

Cheers,

Ganesh





___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users