Re: Should GHC default to -O1 ?
On Tue, Nov 8, 2011 at 11:28 PM, wagne...@seas.upenn.edu wrote:
> I don't agree that GHC's user interface should be optimized for newcomers to Haskell. GHC is an industrial-strength compiler with some very advanced features; the majority of its target audience is professional programmers. Let its interface reflect that fact. As Simon explained, GHC's current defaults are a very nice point in the programming space for people who are actively building and changing their programs.

It's easy to build arguments for either side, but my experience as a professional developer is that new devs don't know what arguments they need for reasonable performance, often not even knowing what the various optimization flags do, whereas experienced developers do know the difference between -O0 and -O1, and frequently need -debug (not a default option) more than -O0.

Seasoned GHC users can find that -O0 gives miserably slow compile times, and fall back to GHCi for edit/rebuild cycles... which still aren't terribly fast if you're using GHC's advanced features. I have a couple of small modules that take 10 minutes each to compile on a current Core i7 at -O0, and -O2 really doesn't take much longer. GHCi is very slightly faster, but I'll still head directly downstairs for a coffee as soon as either of these bad boys needs rebuilding... and still make it back upstairs before they're done.

And so I'd prefer the default to be -O1 or even -O2 and have people who really need it use -O0. GHC shouldn't be painful on purpose, industrial strength or not.

-n
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: behaviour change in getDirectoryContents in GHC 7.2?
On 8 November 2011 11:43, Simon Marlow marlo...@gmail.com wrote:
> Don't you mean 1 is what we have?

Yes, sorry!

> Failing to roundtrip in some cases, and doing so silently, seems highly suboptimal to me. I'm sorry I didn't pick up on this at the time (Unicode is a swamp :).

I *can* change the implementation back to using lone surrogates. This gives us guaranteed roundtripping but it means that the user might see lone-surrogate Char values in Strings from the filesystem/command line. IIRC this does break some software -- e.g. Bryan's text library explicitly checks for such characters and fails if it detects them.

So whatever happens we are going to end up making some group of users unhappy!

* No PEP383: Haskellers using non-ASCII get upset when their command line argument [String]s aren't in fact sequences of characters, but sequences of bytes in some arbitrary encoding
* PEP383(surrogates): Unicoders get upset by lone surrogates (which can actually occur at the moment, independent of PEP383 -- e.g. as character literals or from FFI)
* PEP383(private chars): Unixers get upset that we can't roundtrip byte sequences that look like the codepoint 0xEFXX encoded in the current locale. In practice, 0xEFXX is only decodable from a UTF encoding, so we fail to roundtrip byte sequences like the one Ian posted.

I'm happy to implement any behaviour, I would just like to know that whatever it is is accepted as the correct tradeoff :-)

RE exposing a ByteString based interface to the IO library from base/unix/whatever: AFAIK Python doesn't do this, and just tells people to use the (x.encode(sys.getfilesystemencoding(), 'surrogateescape')) escape hatch, which is what I've been recommending. I think this would be more satisfying to John if it were actually guaranteed to work on arbitrary byte sequences, not just *highly likely* to work :-)

Max
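To make the "PEP383(surrogates)" option concrete, here is a minimal sketch of the roundtrip idea. It is a toy model, not GHC's implementation: it assumes an ASCII-only locale, so every byte from 0x80 upwards counts as undecodable and is escaped as the lone surrogate U+DC00 + byte. The names decodeEscaped and encodeEscaped are made up for this illustration.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Toy decoder for an assumed ASCII-only locale. Bytes below 0x80 decode
-- to the matching character; every other byte b is "undecodable" and is
-- escaped as the lone surrogate U+DC00 + b (U+DC80 .. U+DCFF).
decodeEscaped :: [Word8] -> String
decodeEscaped = map dec
  where
    dec b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = chr (0xDC00 + fromIntegral b)

-- Inverse of decodeEscaped. Escaped surrogates turn back into the raw
-- byte they stand for; characters this toy locale cannot represent
-- yield Nothing.
encodeEscaped :: String -> Maybe [Word8]
encodeEscaped = mapM enc
  where
    enc c
      | n < 0x80                   = Just (fromIntegral n)
      | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
      | otherwise                  = Nothing
      where
        n = ord c
```

In this model, encoding the result of decoding always gives back the original bytes, which is the guaranteed-roundtrip property discussed above. The cost is also visible: the decoded String can contain lone surrogates, which is exactly what strict Unicode libraries object to.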
Re: behaviour change in getDirectoryContents in GHC 7.2?
On 7 November 2011 17:32, John Millikin jmilli...@gmail.com wrote:
> I am also not convinced that it is possible to correctly implement either of these functions if their behavior is dependent on the user's locale.

FWIW it's only dependent on the user's locale because whether glibc iconv detects errors in the *from* sequence depends on what the *to* locale is. Clearly an invalid *from* sequence should be reported as invalid regardless of *to*.

I know this isn't much comfort to you, though, since you do have to worry about broken behaviour in 7.2, and possible future breakage with changes in iconv. I understand your point that it would be better from a complexity point of view to just roundtrip the bytes as *bytes* without relying on all this escaping/unescaping code.

> Please understand, I am not arguing against the existence of this encoding layer in general. It's a fine idea for a simplistic high-level filesystem interaction library. But it should be *optional*, not part of the compiler or base.

The problem is that I *really really want* getArgs to decode the command line arguments. That's almost the whole point of this change, and it is what most users seem to expect. Given this constraint, the code has to be part of base, and if getArgs has this behaviour then any file system function we ship that takes a FilePath (i.e. all the functions in base, directory, win32 and unix) must be prepared to handle these escape characters for consistency.

I *would* be happy to expose an alternative file system API from the posix package that operates with ByteString paths. This package could provide a function :: FilePath -> ByteString that encodes the string with the fileSystemEncoding (removing escapes in the process) for interoperability with file names arriving via getArgs, and at that point the decision about whether to use the escaping/unescaping code would be (mostly) in the hands of the user.

We could even have posix expose APIs to get command line arguments/environment variables as ByteStrings, and then you could avoid escape/unescape entirely.

Which of these solutions (if any) would satisfy you?

1. The current situation, plus an alternative API exposed from posix along the lines described above
2. The current situation but with the escape/unescape modified so it allows true roundtripping (at the cost of weird surrogate Char values popping up now and again). If you have this you can reliably implement the alternative API on top of the String based one, assuming we got our escape/unescape code right

I hope we can work together to find a solution here.

Cheers,
Max
Re: behaviour change in getDirectoryContents in GHC 7.2?
On Wed, Nov 09, 2011 at 11:02:54AM +, Simon Marlow wrote:
> I would be happy with the surrogate approach I think. Arguably if you try to treat a string with lone surrogates as Unicode and it fails, then that is a feature: the original string wasn't Unicode. All you can do with an invalid Unicode string is use it as a FilePath again, and the right thing will happen.

If we aren't going to guarantee that the encoded string is unicode, then is there any benefit to encoding it in the first place?

> Alternatively if we stick with the private char approach, it should be possible to have an escaping scheme for 0xEFxx characters in the input that would enable us to roundtrip correctly. That is, escape 0xEFxx into a sequence 0xYYEF 0xYYxx for some suitable YY.

Why not encode into private chars, i.e. encode U+EF00 (which in UTF8 is 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, etc?

(Max gave some reasons earlier in this thread, but I'd need examples of what goes wrong to understand them).

Thanks
Ian
Re: Should GHC default to -O1 ?
How much does using ghc without cabal imply a newer programmer? I don't use cabal when trying out small bits of code (maybe I should be using ghci), but am otherwise always using cabal.

On Wed, Nov 9, 2011 at 3:18 AM, Duncan Coutts duncan.cou...@googlemail.com wrote:
> On 9 November 2011 00:17, Felipe Almeida Lessa felipe.le...@gmail.com wrote:
>> On Tue, Nov 8, 2011 at 3:01 PM, Daniel Fischer daniel.is.fisc...@googlemail.com wrote:
>>> On Tuesday 08 November 2011, 17:16:27, Simon Marlow wrote:
>>>> most people know about 1, but I think 2 is probably less well-known. When in the edit-compile-debug cycle it really helps to have -O off, because your compiles will be so much quicker due to both factors 1 and 2.
>>>
>>> Of course. So defaulting to -O1 would mean one has to specify -O0 in the .cabal or Makefile resp. on the command line during development, which certainly is an inconvenience.
>>
>> AFAIK, Cabal already uses -O1 by default.
>
> Indeed, and cabal check / hackage upload complain if you put -O{n} in your .cabal file. The recommended method during development is to use:
>
>   $ cabal configure -O0
>
> Duncan
Re: Should GHC default to -O1 ?
On 9 November 2011 13:53, Greg Weber g...@gregweber.info wrote:
> How much does using ghc without cabal imply a newer programmer? I don't use cabal when trying out small bits of code (maybe I should be using ghci), but am otherwise always using cabal.

The main reason cabal has always defaulted to -O is because historically it's been assumed that the user is installing something rather than just hacking on their own code. If we can distinguish cleanly in the user interface between the installing and hacking use cases then we could default to -O0 for the hacking case.

Duncan
Re: behaviour change in getDirectoryContents in GHC 7.2?
On 9 November 2011 13:11, Ian Lynagh ig...@earth.li wrote:
> If we aren't going to guarantee that the encoded string is unicode, then is there any benefit to encoding it in the first place?

(I think you mean decoded here - my understanding is that decode :: ByteString -> String, encode :: String -> ByteString)

> Why not encode into private chars, i.e. encode U+EF00 (which in UTF8 is 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, etc?
>
> (Max gave some reasons earlier in this thread, but I'd need examples of what goes wrong to understand them).

We can do this but it doesn't solve all problems. Here are three such problems:

PROBLEM 1 (bleeding from non-escaping to escaping TextEncodings)
===

So let's say we are reading a filename from stdin. Currently stdin uses the utf8 TextEncoding -- this TextEncoding knows nothing about private-char roundtripping, and will throw an exception when decoding bad bytes or encoding our private chars.

Now the user types a UTF-8 U+EF80 character - i.e. we get the bytes 0xEE 0xBE 0x80 on stdin.

The utf8 TextEncoding naively decodes this byte sequence to the character sequence U+EF80.

We have lost at this point: if the user supplies the resulting String to a function that encodes the String with the fileSystemEncoding, the String will be encoded into the byte sequence 0x80. This is probably not what we want to happen! It means that a program like this:

  main = do
    fp <- getLine
    readFile fp >>= putStrLn

Will fail (file not found: \x80) when given the name of an (existing) file 0xEE 0xBE 0x80.

PROBLEM 2 (bleeding between two different escaping TextEncodings)
===

So let's say the user supplies the UTF-8 encoded U+EF00 (byte sequence 0xEE 0xBC 0x80) as a command line argument, so it goes through the fileSystemEncoding. In your scheme the resulting Char sequence is U+EFEE U+EFBC U+EF80.

What happens when we then *encode* that Char sequence using a UTF-16 TextEncoding (that knows about the 0xEFxx escape mechanism)? The resulting byte sequence is 0xEE 0xBC 0x80, NOT the UTF-16 encoded version of U+EF00! This is certainly contrary to what the user would expect.

PROBLEM 3 (bleeding from escaping to non-escaping TextEncodings)
===

Just as above, let's say the user supplies the UTF-8 encoded U+EF00 (byte sequence 0xEE 0xBC 0x80) as a command line argument, so it goes through the fileSystemEncoding. In your scheme the resulting Char sequence is U+EFEE U+EFBC U+EF80.

If you try to write this String to stdout (which uses the UTF-8 encoding that knows nothing about 0xEFxx escapes) you just get an exception, NOT the UTF-8 encoded version of U+EF00. Game over man, game over!

CONCLUSION
===

As far as I can see, the proposed escaping scheme recovers the roundtrip property but fails to regain a lot of other reasonable-looking behaviours.

(Note that the above outlined problems are problems in the current implementation too -- but the current implementation doesn't even pretend to support U+EFxx characters. Its correctness is entirely dependent on them never showing up, which is why we chose a part of the private codepoint region that is reserved specifically for the purpose of encoding hacks).

Max
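The first of these problems can be reproduced with a small pure model. This is only a sketch under stated assumptions, not the real TextEncoding machinery: utf8Encode is a hand-written plain UTF-8 encoder, and escapingEncode is a hypothetical stand-in for a fileSystemEncoding-style encoder that treats U+EF80..U+EFFF as escapes for the raw bytes 0x80..0xFF.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Plain UTF-8 encoding of one Char; knows nothing about escapes.
utf8Encode :: Char -> [Word8]
utf8Encode c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6, cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Leading bits that go into the first byte.
    top s  = fromIntegral (n `shiftR` s)
    -- One continuation byte carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Hypothetical escaping encoder: a Char in U+EF80..U+EFFF is assumed to
-- be an escape and is written out as the single raw byte it stands for;
-- every other Char is encoded as ordinary UTF-8.
escapingEncode :: String -> [Word8]
escapingEncode = concatMap enc
  where
    enc c
      | n >= 0xEF80 && n <= 0xEFFF = [fromIntegral (n - 0xEF00)]
      | otherwise                  = utf8Encode c
      where
        n = ord c
```

A genuine U+EF80 that arrived through a non-escaping decoder has the UTF-8 form 0xEE 0xBE 0x80, but escapingEncode cannot tell it apart from an escape and emits the single byte 0x80 instead. That is the "file not found" failure described in Problem 1.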
Re: behaviour change in getDirectoryContents in GHC 7.2?
On 9 November 2011 11:02, Simon Marlow marlo...@gmail.com wrote:
> The performance overhead of all this worries me. withCString has taken a huge performance hit, and I think there are people who want to know that there aren't several complex encoding/decoding passes between their Haskell code and the POSIX API. We ought to be able to program to POSIX directly, and the same goes for Win32.

We are only really talking about environment variables, filenames and command line arguments here. I'm sure there are performance implications to all this decoding/encoding, but these bits of text are almost always very short and are unlikely to be causing bottlenecks. Adding a whole new API *just* to eliminate a hypothetical performance problem seems like overkill.

OTOH, I'm happy to add it if we stick with using private chars for the escapes, because then using it or not using it is a *correctness* issue (albeit in rare cases).

Max
Re: behaviour change in getDirectoryContents in GHC 7.2?
On 08/11/2011 15:42, John Millikin wrote:
> On Tue, Nov 8, 2011 at 03:04, Simon Marlow marlo...@gmail.com wrote:
>> I really think we should provide the native APIs. The problem is that the System.Posix.Directory API is all in terms of FilePath (=String), and if we gave that a different meaning from the System.Directory FilePaths then confusion would ensue. So perhaps we need to add another API to System.Posix with filesystem operations in terms of ByteString, and similarly for Win32.
>
> +1
>
> I think most users would be OK with having System.Posix treat FilePath differently, as long as this is clearly documented, but if you feel a separate API is better then I have no objection. As long as there's some way to say "I know what I'm doing, here's the bytes" to the library.
>
> The Win32 package uses wide-character functions, so I'm not sure whether bytes would be appropriate there. My instinct says to stick with chars, via withCWString or equivalent. The package maintainer will have a better idea of what fits with the OS's idioms.

Ok, I spent most of today adding ByteString alternatives for all of the functions in System.Posix that use FilePath or environment strings. The Haddocks for my augmented unix package are here:

http://community.haskell.org/~simonmar/unix-with-bytestring-extras/index.html

In particular, the module System.Posix.ByteString is the whole System.Posix API but with ByteString FilePaths and environment strings:

http://community.haskell.org/~simonmar/unix-with-bytestring-extras/System-Posix-ByteString.html

It has one addition relative to System.Posix:

  getArgs :: IO [ByteString]

Let me know what you think. I suspect the main controversial aspect is that I included

  type FilePath = ByteString

which is a bit cute but might be confusing.

Cheers,
Simon
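As a rough illustration of what programming against such a byte-level API looks like, here is a small sketch. It assumes the module is available as System.Posix.ByteString in the unix package with getArgs, getFileStatus and fileSize exported from it, and it uses plain ByteString for paths rather than the FilePath synonym mentioned above. The helper names rawFileSize and reportArgs are made up for this example.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BC
import System.Posix.ByteString (fileSize, getArgs, getFileStatus)
import System.Posix.Types (FileOffset)

-- Size in bytes of the file named by a raw byte path. The path is
-- handed to the OS as-is; no locale decoding or escaping is involved.
rawFileSize :: ByteString -> IO FileOffset
rawFileSize path = fmap fileSize (getFileStatus path)

-- Print each command line argument (as raw bytes) with the size of
-- the file it names.
reportArgs :: IO ()
reportArgs = do
  args <- getArgs
  mapM_ report args
  where
    report p = do
      n <- rawFileSize p
      BC.putStrLn p
      print n
```

Because the arguments never pass through a String, a file whose name is not valid in the current locale is handled the same way as any other file.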
Re: behaviour change in getDirectoryContents in GHC 7.2?
On 09/11/2011 13:11, Ian Lynagh wrote:
> On Wed, Nov 09, 2011 at 11:02:54AM +, Simon Marlow wrote:
>> I would be happy with the surrogate approach I think. Arguably if you try to treat a string with lone surrogates as Unicode and it fails, then that is a feature: the original string wasn't Unicode. All you can do with an invalid Unicode string is use it as a FilePath again, and the right thing will happen.
>
> If we aren't going to guarantee that the encoded string is unicode, then is there any benefit to encoding it in the first place?

With a decoded FilePath you can:

- use it as a FilePath argument to some other function
- map all the illegal characters to '?' and then treat it as Unicode, e.g. for printing it out (but then you lose the ability to roundtrip, which is why we can't do this automatically).

Ok, so since we need something like

  makePrintable :: FilePath -> String

arguably we might as well make that do the locale decoding. That's certainly a good point...

Cheers,
Simon
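A minimal sketch of what such a makePrintable could look like under the surrogate approach is given below. It assumes that the "illegal characters" are exactly the surrogate code points U+D800..U+DFFF and does no locale decoding at all, so it only covers the "map to '?'" half of the idea.

```haskell
import Data.Char (ord)

-- Replace every surrogate code point with '?', so that the result is
-- valid Unicode text that is safe to print. This is lossy: the escaped
-- bytes are gone, so the result can no longer be used to reopen the file.
makePrintable :: FilePath -> String
makePrintable = map scrub
  where
    scrub c
      | isSurrogate c = '?'
      | otherwise     = c
    isSurrogate c = ord c >= 0xD800 && ord c <= 0xDFFF
```

Keeping this as an explicit function, instead of applying it automatically, leaves the original FilePath intact for later file system calls.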
Re: behaviour change in getDirectoryContents in GHC 7.2?
On Wed, Nov 9, 2011 at 08:04, Simon Marlow marlo...@gmail.com wrote:
> Ok, I spent most of today adding ByteString alternatives for all of the functions in System.Posix that use FilePath or environment strings. The Haddocks for my augmented unix package are here:
>
> http://community.haskell.org/~simonmar/unix-with-bytestring-extras/index.html
>
> In particular, the module System.Posix.ByteString is the whole System.Posix API but with ByteString FilePaths and environment strings:
>
> http://community.haskell.org/~simonmar/unix-with-bytestring-extras/System-Posix-ByteString.html

This looks lovely -- thank you. Once it's released, I'll port all my libraries over to using it.

> It has one addition relative to System.Posix:
>
>   getArgs :: IO [ByteString]

Thank you very much! Several tools I use daily accept binary data as command-line options, and this will make it much easier to port them to Haskell in the future.

> Let me know what you think. I suspect the main controversial aspect is that I included
>
>   type FilePath = ByteString
>
> which is a bit cute but might be confusing.

Indeed, I was very confused when I saw that in the docs. If it's not too much trouble, could those functions accept/return ByteString directly?
Re: behaviour change in getDirectoryContents in GHC 7.2?
On 09/11/2011 15:58, Max Bolingbroke wrote:
> (Note that the above outlined problems are problems in the current implementation too -- but the current implementation doesn't even pretend to support U+EFxx characters. Its correctness is entirely dependent on them never showing up, which is why we chose a part of the private codepoint region that is reserved specifically for the purpose of encoding hacks).

But we can't make that assumption, because the user might have accidentally set the locale wrong and then all kinds of garbage will show up in decoded file paths. I think it's important that programs that just traverse the file system keep working under those conditions, rather than randomly failing due to (encode . decode) being almost but not quite the identity.

Cheers,
Simon
Re: behaviour change in getDirectoryContents in GHC 7.2?
On Wed, Nov 09, 2011 at 03:58:47PM +, Max Bolingbroke wrote:
> (Note that the above outlined problems are problems in the current implementation too

Then the proposal seems to me to be strictly better than the current system. Under both systems the wrong thing happens when U+EFxx is entered as unicode text, but the proposed system works for all filenames read from the filesystem.

In the longer term, I think we need to fix the underlying problem that (for example) both getLine and getArgs produce a String from bytes, but do so in different ways. At some point we should change the type of getArgs and friends.

Thanks
Ian
Re: behaviour change in getDirectoryContents in GHC 7.2?
My primary concerns are (in order of priority, and I only speak for myself):

(a) consistency across platforms
(b) minimizing (unrequired) performance overhead

I would prefer an API that is consistent across win32, posix and any other OS, and that does only as much as the user (us) wants. For example:

  module System.Directory.ByteString ...
  type FilePath = ByteString
  getDirectoryContents :: FilePath -> IO [FilePath]

which is the same for both win32 and posix and represents raw uninterpreted bytestrings in whatever encoding (or non-encoding) the OS provides. Implicitly, it is for the user to know and understand what they're getting (UTF-16 in the case of Windows, bytes in the case of posix platforms).

This API can then be re-exported with the decoding/encoding by System.Directory/System.IO, which would export FilePath = String, i.e. a two-level API...