RE: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread Glynn Clements

MR K P SCHUPKE wrote:

> >In the general case, it needs to be a bit more complex than that,
> 
> That's why the functions handled lists, not individual characters;
> I was assuming that each [Word8] -> [Char] represented a valid
> and complete encoding block, i.e. at the start of each call it
> assumes no escapes. All this means is that when reading in chunks
> you paste those chunks together before conversion, and you can
> only break outside of escapes. This in my opinion is better
> behaviour anyway... I don't want some hidden escape state mangling
> output, just because some earlier code generated invalid output.

Right. Certainly, a stateless interface will handle converting
"complete" strings (pathnames, arguments, etc).

But, ultimately we will have need of a more general interface. E.g. in
the chunked HTTP example which Oleg gave, you would probably want
separate decoders for the headers and body, switching between them as
you read the stream. You wouldn't want to have to accumulate the
entire body as a single byte string just so that you could decode it
in one go, and you can't just "push" a decoder onto the stream.
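
To make that concrete, here is a minimal sketch of what an
incremental, switchable decoder might look like (hypothetical names,
not an existing API):

import Data.Word (Word8)

-- A decoder consumes one chunk of bytes and returns the characters
-- decoded so far, plus a continuation carrying any leftover state
-- (e.g. the prefix of an incomplete multi-byte sequence).
newtype Decoder = Decoder { feed :: [Word8] -> (String, Decoder) }

-- Running a decoder over a stream of chunks:
decodeChunks :: Decoder -> [[Word8]] -> String
decodeChunks _ []     = ""
decodeChunks d (c:cs) = let (s, d') = feed d c
                        in s ++ decodeChunks d' cs

Switching from the header decoder to the body decoder is then just a
matter of continuing with a different Decoder value.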

-- 
Glynn Clements <[EMAIL PROTECTED]>
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread Glynn Clements

Gabriel Ebner wrote:

> > For case testing, locale-dependent sorting and the like, you need to
> > convert to characters. [Although possibly only temporarily; you can
> > sort a list of byte strings based upon their corresponding character
> > strings using sortBy. This means that a decoding failure only means
> > that the ordering will be wrong. This is essentially what happens with
> > "ls" if you have filenames which aren't valid in the current locale.]
> 
> sortBy could only cope with single-byte encodings.  Multi-byte
> encodings would need something else.

I think that you may have misunderstood my point. I was referring to
something like this:

import Data.List (sortBy)
import Data.Word (Word8)

type ByteString = [Word8]

decode :: ByteString -> String
decode = ...

comparator :: ByteString -> ByteString -> Ordering
comparator s1 s2 = compare (decode s1) (decode s2)

sortByteStrings :: [ByteString] -> [ByteString]
sortByteStrings ss = sortBy comparator ss

The byte strings which are returned from sortByteStrings are the
original byte strings, but the ordering will be determined by the
encoding. This produces the same results as decode->sort->encode (in
the cases where the latter actually works), but is more robust.
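
For example, instantiating decode with Latin-1 (where byte n maps
directly to code point n, so decoding cannot fail) makes the sketch
concrete:

import Data.Char (chr)

decode :: ByteString -> String
decode = map (chr . fromIntegral)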

> > It's broken. Being able to represent filenames as byte strings is
> > fundamental. Being able to convert them to or from character strings
> > is useful but not essential. The only reason why the existing API
> > doesn't cause serious problems is because the translation is currently
> > hardwired to an encoding which can't fail.
> 
> Handling binary filenames is hardly fundamental.  It isn't even very
> portable, see the posts about filename handling under modern Windows.
> It might be an important feature, but there are other programs out
> there (mostly GUIs) that expect filenames to be encoded according to
> the locale settings too.

It's fundamental if you want your programs to be robust. For most
programs, there is no legitimate reason to refuse to read a file
because of its name.

A GUI program (or for that matter, a terminal) might legitimately fail
to *display* a filename correctly if it can't decode it (it has to
index into the font). But that isn't a reason to reject it altogether.

E.g. if I create a file whose name contains control characters, most
GUI programs display it incorrectly in the file selection dialog, but
they still manage to open it.

-- 
Glynn Clements <[EMAIL PROTECTED]>
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Interoperability with other languages and haskell in industry

2004-09-16 Thread Andy Moran
On Thursday 16 September 2004 02:46 pm, Vincenzo aka Nick Name wrote:

> Again, I will try to take advantage of the thread on the "senior" list to
> ask a question to everybody who uses haskell in industry (so you people
> at Galois Connection can't avoid answering, I know you are there :D ):
> are your solutions entirely written in haskell, or are there parts
> written in other languages? If so, how do you let all parts
> interoperate? Do you use some form of RPC, or CORBA, do you just use a
> database to store common data, do you use custom protocols (e.g.
> command line arguments) or what? Do you have experience with wrong ways
> to achieve this goal?

Almost all of our projects are written entirely in Haskell.  The remainder 
use the excellent GHC FFI mechanisms to call out to C, and in one case, Tcl 
(via TclHaskell).  We don't develop multi-process applications, distributed 
or otherwise, so we have no need for RPC or more complex middleware(*), or 
other more baroque solutions.

> I ask this because "it might be" that in the next years our sleeping
> company will produce some software, and I can easily convince other
> people to use "new" languages, if I can assure them that in case it
> proves difficult for any reason, we can finish with a certain module
> and implement the rest of the system using "more conventional"
> technologies.

Here's what you say: "Let's prototype it in Haskell, and then replace any 
performance bottlenecks with C."  For some application areas (like tools), 
you won't have to bother with the last step.  If you do need to rewrite 
some component in C, the FFI is there to help.
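
For reference, here is a minimal sketch of such a call-out using the
GHC FFI (a generic binding to sin() from the C math library, not code
from any of our projects):

{-# LANGUAGE ForeignFunctionInterface #-}
module Main where

import Foreign.C.Types (CDouble)

-- Import the pure C function sin() as a pure Haskell function.
foreign import ccall unsafe "math.h sin"
  c_sin :: CDouble -> CDouble

main :: IO ()
main = print (c_sin 0.5)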

I'd like to say that this approach has worked for us time and time again, 
but, to date, we've never had to rewrite a slow component in C :-)  For us, 
C interoperability has always been a case of linking to third party 
software, or for writing test harnesses to test generated C.

Things are different if your application is multi-process and/or 
distributed, and you're not going to be using an established protocol (like 
HTTP, for instance).  In that case, you might want to look at HDirect 
(giving access to CORBA, COM, DCOM), if you need to talk to CORBA/COM/DCOM 
objects.  There are many simple solutions to RPC available too, if that's 
all you need.

Cheers,

Andy

(*) We have prototyped our own secure middleware, which had an elegant, 
powerful, class-based RPC for Haskell processes, but we haven't really used 
that in other projects.

-- 
Andy Moran                           Ph. (503) 626 6616, x113
Galois Connections Inc.              Fax. (503) 350 0833
12725 SW Millikan Way, Suite #290    http://www.galois.com
Beaverton, OR 97005                  [EMAIL PROTECTED]

___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Interoperability with other languages and haskell in industry

2004-09-16 Thread Vincenzo aka Nick Name
Again, I will try to take advantage of the thread on the "senior" list to 
ask a question to everybody who uses haskell in industry (so you people 
at Galois Connection can't avoid answering, I know you are there :D ): 
are your solutions entirely written in haskell, or are there parts 
written in other languages? If so, how do you let all parts 
interoperate? Do you use some form of RPC, or CORBA, do you just use a 
database to store common data, do you use custom protocols (e.g. 
command line arguments) or what? Do you have experience with wrong ways 
to achieve this goal?

I ask this because "it might be" that in the next years our sleeping 
company will produce some software, and I can easily convince other 
people to use "new" languages, if I can assure them that in case it 
proves difficult for any reason, we can finish with a certain module 
and implement the rest of the system using "more conventional" 
technologies.

Thanks

Vincenzo
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


RE: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread Glynn Clements

Simon Marlow wrote:

> >>> Which is why I'm suggesting changing Char to be a byte, so that we
> >>> can have the basic, robust API now and wait for the more advanced
> >>> API, rather than having to wait for a usable API while people sort
> >>> out all of the issues.
> >> 
> >> An easier way is just to declare that the existing API assumes a
> >> Latin-1 encoding consistently.  Later we might add a way to let the
> >> application pick another encoding, or request that the I/O library
> >> uses the locale encoding.
> > 
> > But how do you do that without breaking stuff? If the application
> > changes the encoding to UTF-8 (either explicitly, or by using the
> > locale's encoding when it happens to be UTF-8), then code such as:
> > 
> > [filename] <- getArgs
> > openFile filename ReadMode
> > 
> > will fail if filename isn't a valid UTF-8 sequence. Similarly for the
> > other cases where the OS accepts/returns byte strings but the Haskell
> > interface uses String.
> 
> And that's the correct behaviour, isn't it?

No. The correct behaviour is to keep such data as byte strings. 
Otherwise it's going to be hard to write robust programs if the
hard-wired ISO-8859-1 encoding is ever changed.

In the current implementation, getArgs gets a list of bytes from
argv[], which it converts to a String. The String is passed to
openFile, which converts it back to a list of bytes which are then
passed to open().

Thus the list of bytes is effectively fed through (encode . decode). 
For ISO-8859-*, this is the identity function. For UTF-8, it's a
subfunction of the identity function, i.e. it either returns its input
or it fails. I don't see what is to be gained by having it fail. It
would be preferable to just pass the byte string directly from argv[]
to open().
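
To spell that out with a sketch (not the actual library code):

import Data.Char (chr, ord)
import Data.Word (Word8)

latin1Decode :: [Word8] -> String
latin1Decode = map (chr . fromIntegral)

latin1Encode :: String -> [Word8]
latin1Encode = map (fromIntegral . ord)

-- latin1Encode . latin1Decode is the identity on any [Word8], so
-- round-tripping argv through it is harmless. A UTF-8 decoder would
-- instead have to be partial, something like
--   utf8Decode :: [Word8] -> Maybe String   -- hypothetical
-- and the round trip would fail on filenames that aren't valid
-- UTF-8, even though open() would have accepted them unchanged.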

> > I'm less concerned about the handling of streams, as you can
> > reasonably add a way to change the encoding before any data has been
> > read or written. I'm more concerned about FilePaths, argv, the
> > environment etc.
> 
> Yes, these are interesting issues.  Filenames are stored as character
> strings on some OSs (eg. Windows) and byte strings on others.  So the
> Haskell portable API should probably use String, and do decoding based
> on the locale (if the programmer asks for it).
> 
> Argv and the environment - I don't know.  Windows CreateProcess() allows
> these to be UTF-16 strings, but I don't know what encoding/decoding
> happens between CreateProcess() and what the target process sees in its
> argv[] (can't be bothered to dig through MSDN right now).  I suspect
> these should be Strings in Haskell too, with appropriate
> decoding/encoding happening under the hood.

I suspect that Windows will convert them according to the active
codepage, so that OpenFileA(argv[i], ...) works. 

-- 
Glynn Clements <[EMAIL PROTECTED]>
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] ghc-6-2-1_1.msi MD5 sum

2004-09-16 Thread Sergey Zaharchenko
Could anyone give me the MD5 sum of

http://www.haskell.org/ghc/dist/6.2.1/ghc-6-2-1_1.msi

I've been writing Haskell programs for FreeBSD, and now I wanted to
compile one under Windows. The .msi file seems to be broken, and I just
wanted to check...

-- 
DoubleF
"The more data I punch in this card, the lighter it becomes, and the
lower the mailing cost."
-- Stan Kelly-Bootle, "The Devil's DP Dictionary"


___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Lifting makes lazy

2004-09-16 Thread Iavor S. Diatchki
hello,
the types IO(IO ()) and IO() are not the same.
think of a value of type "IO a" as a _program_ that when executed
will return a result of type "a" (and while executing may print some 
stuff to the screen).

now consider these types:
putStrLn :: String -> IO ()
this is a _pure function_ which when given a string will return a 
_program_ that later when
executed will print the string.  note however that this function does 
not itself print,
it simply creates a program that can print.   this is the main idea that 
allows haskell
to be pure and still manage to have printing.

liftM :: (a -> b) -> IO a -> IO b
this again is a pure function, that when given a function and a program, 
will make
another program that behaves as follows (when executed):
  * it first executes the argument program to get an "a"
  * then it applies the argument function, to turn the "a" into a "b", 
and this is the final result.
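
(for reference, here is how liftM can be defined in any monad, written
with do-notation:

liftM :: Monad m => (a -> b) -> m a -> m b
liftM f m = do a <- m
               return (f a)

note that nothing is executed until the resulting program is itself
run.)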

at this point you might wonder: well it is all very good that one can 
build these programs, but how do you execute them?
a slightly simplified answer is as follows: the haskell run time system
starts by evaluating the _expression_ "main :: IO ()", i.e. it will
first build an IO program, and then (behind the scenes) it will execute
it.  the reason this is slightly simplified is that in fact the building
of the program and its execution kind of progress in parallel (for more
details see Simon PJ's "tackling the awkward squad" paper).

now back to your question, you were wondering about:
liftM putStrLn (readFile "SomeFile")  :: IO (IO ())
this is a peculiar beast --- a program, that when executed will return 
as a result another program
(a kind of meta-program)
when executed this program will behave as follows:
  * it will execute the (readFile "SomeFile") program, and as a result 
it will get the contents of the file
  * then to get the result it will apply putStrLn to this, which will
produce _another program_
  (remember putStrLn does not print, as haskell is a pure language)
at this point this new program will be returned as the result.  notice
that the only IO that happened was the reading of the file.

to sequence IO programs you should use one of the following: (>>=),
(=<<), or simply the "do"-notation.
so the above program that will first read the file, and then print its 
contents may be written as follows:

do string <- readFile "SomeFile"
   putStrLn string
or alternatively:
putStrLn =<< readFile "SomeFile"
sorry for the long post, and i hope it was useful.
-iavor


[EMAIL PROTECTED] wrote:
L.S.,
In my enthusiasm to reduce imperative style coding to a minimum, I 
changed  a program to something too lazy to do anything. The following 
is an  extremely simplified version of the program:

import Monad

displayFile1 :: IO (IO ())
displayFile1 = liftM putStr contents  --  Displays nothing
  where
    contents :: IO [Char]
    contents = readFile "DisplayFile.lhs"

This should display the contents of a file, but nothing appears. The  
following function should be exactly the same, but this one does 
display  the file:

displayFile2 :: IO ()
displayFile2 = do
  contents <- readFile "DisplayFile.lhs"
  putStr contents

My conclusion is that "putStr" is evaluated strictly, while "liftM
putStr" is not.

I have the following questions:
 - Why is this difference?
 - Is there some method to predict whether my program is sufficiently  
strict to really do what it is supposed to do?
 - Did someone design a method to develop programs not too strict and 
not  too lazy?
 - The manual "A gentle introduction to Haskell" states in section 
6.3:  "adding strictness flags may lead to hard to find infinite loops 
or have  other unexpected consequences"; I would like to know when 
these problems  arise; are these cases described somewhere?

___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Lifting makes lazy

2004-09-16 Thread Mike Gunter

This should *not* display anything.  Look at the types!  displayFile1
returns a value that is the "displaying action".  It does not "run"
it.

Replace "liftM" with "(=<<)" or use "join displayFile1" if you want to
run the displaying action.
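
For example (join comes from the standard Monad module):

import Monad (join)

doDisplay :: IO ()
doDisplay = join displayFile1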

mike

>> displayFile1 :: IO (IO ())
>> displayFile1 =  liftM putStr contents -- Displays nothing
>>   where
>>     contents :: IO [Char]
>>     contents = readFile "DisplayFile.lhs"
>
> This should display the contents of a file, but nothing appears.
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Lifting makes lazy

2004-09-16 Thread Jeremy Shaw
At Thu, 16 Sep 2004 18:26:35 +0200,
[EMAIL PROTECTED] wrote:
> 
> 
> L.S.,
> 
> In my enthusiasm to reduce imperative style coding to a minimum, I changed  
> a program to something too lazy to do anything. The following is an  
> extremely simplified version of the program:
> 
> > import Monad
> 
> > displayFile1 :: IO (IO ())
> > displayFile1 = liftM putStr contents  --  Displays nothing
> >   where
> >     contents :: IO [Char]
> >     contents = readFile "DisplayFile.lhs"
> 
> This should display the contents of a file, but nothing appears. The  
> following function should be exactly the same, but this one does display  
> the file:
> 
> > displayFile2 :: IO ()
> > displayFile2 = do
> >   contents <- readFile "DisplayFile.lhs"
> >   putStr contents
> 
> My conclusion is that "putStr" is evaluated strictly, while "liftM
> putStr" is not.

Has nothing to do with strict/lazy.

> I have the following questions:
>   - Why is this difference?


Notice the type of displayFile1 vs displayFile2

IO (IO ()) vs IO ()

While displayFile2 actually prints something, displayFile1 merely
returns an 'action', which will print something if you 'evaluate' it.

Here is an example of how to use displayFile1 the way you wrote it:

import Monad

displayFile1 :: IO (IO ())
displayFile1 = liftM putStr contents
  where
    contents :: IO [Char]
    contents = readFile "DisplayFile.lhs"


doDisplayFile :: IO ()
doDisplayFile = displayFile1 >>= \action -> action

or alternately, you could write doDisplayFile like this (same thing,
just a different syntax):

doDisplayFile :: IO ()
doDisplayFile =
  do action <- displayFile1
     action


if you wanted displayFile1 to behave more like displayFile2 you could
write it like this:

displayFile3 :: IO ()
displayFile3 = contents >>= putStr
  where
    contents :: IO [Char]
    contents = readFile "DisplayFile.lhs"

or, like this (same thing, different syntax)

displayFile3 :: IO ()
displayFile3 =
  do c <- contents
     putStr c
  where
    contents :: IO [Char]
    contents = readFile "DisplayFile.lhs"

Jeremy Shaw
--

This message contains information which may be confidential and
privileged. Unless you are the addressee (or authorized to receive for
the addressee), you may not use, copy or disclose to anyone the message
or any information contained in the message. If you have received the
message in error, please advise the sender and delete the message.
Thank you.
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Lifting makes lazy

2004-09-16 Thread hjgtuyl
L.S.,
In my enthusiasm to reduce imperative style coding to a minimum, I changed  
a program to something too lazy to do anything. The following is an  
extremely simplified version of the program:

import Monad

displayFile1 :: IO (IO ())
displayFile1 = liftM putStr contents  --  Displays nothing
  where
    contents :: IO [Char]
    contents = readFile "DisplayFile.lhs"

This should display the contents of a file, but nothing appears. The  
following function should be exactly the same, but this one does display  
the file:

displayFile2 :: IO ()
displayFile2 = do
  contents <- readFile "DisplayFile.lhs"
  putStr contents

My conclusion is that "putStr" is evaluated strictly, while "liftM
putStr" is not.

I have the following questions:
 - Why is this difference?
 - Is there some method to predict whether my program is sufficiently  
strict to really do what it is supposed to do?
 - Did someone design a method to develop programs not too strict and not  
too lazy?
 - The manual "A gentle introduction to Haskell" states in section 6.3:  
"adding strictness flags may lead to hard to find infinite loops or have  
other unexpected consequences"; I would like to know when these problems  
arise; are these cases described somewhere?

--
Met vriendelijke groet,
Herzliche Grüße,
Best regards,
Henk-Jan van Tuyl

  Festina Lente
  Hasten Slowly
  Haast U langzaam
  Eile langsam
  Skynd dig langsomt
  Affrettati lentamente
  Spěchej pomalu
  Skynda långsamt
 Desiderius Erasmus

___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Maybe bytes *are* text (was Re: Writing binary files?)

2004-09-16 Thread Ben Rudiak-Gould
On Thu, 16 Sep 2004, Udo Stenzel wrote:

> Having a separate byte-based API is far better.  If you don't know the
> encoding, all you have is bytes, no text.

Okay, after reading large chunks of this discussion, I'm going to rock the
boat a bit by suggesting that bytes *are* text, and *do* belong in the
Char type, and hence that the current binary file API is actually correct,
after a fashion. In fact, I think that we can resolve many of the problems
of this thread by abandoning the conceptual distinction between characters
and bytes.

Suppose I invoke

gcc -o XXX YYY.c

where XXX and YYY are strings of Japanese characters. It has been pointed
out that if GCC treats its filename arguments as opaque byte strings to be
passed back to the appropriate file opening functions, then it will work
even if the current locale isn't Japanese. But that's only true on Posix-
like systems. On NT, filenames are made of Unicode code points, and argv
is encoded according to the current locale. If GCC uses argv, it will fail
on the example above. I've run into this problem many times on my desktop
XP box, which uses a US-English locale but contains some filenames with
Japanese characters in them.

But in any case GCC's arguments aren't really opaque: it needs to check
each argument to see if it's an option, and it needs to look at the
extensions of files like YYY.c to figure out which subprogram to invoke.
Nevertheless, the opaque-filename approach does work on Posix, because --
this is the important bit -- the characters GCC cares about (like '-',
'o', '.', and 'c') have the same representation in every encoding. In
other words, the character encoding is neither transparent nor opaque to
GCC, but sort of "band-limited": it can understand the values from 0 to
127, but the higher values are mysterious to it. They could be Latin-1
code points; they could be EUC half-characters; they could be Unicode code
points. It doesn't know, and it doesn't *need* to know. It will fail if
given an encoding which doesn't follow this rule (e.g. EBCDIC).

We can make GCC (were it implemented in Haskell) work with all filenames
on both major platforms without platform-specific code by representing
command-line arguments and pathnames as Strings = [Char]s, where Char is
defined as the byte values 0-255 on Posix, but the UTF-16 values on Win32.

Clearly this is very fragile, but the type system provides a solution:

newtype {- TransASCIIEncoding a => -} Char a = Chr Word32

type String a = [Char a]

class TransASCIIEncoding a where
  maxValueUsedByEncoding :: Word32

instance TransASCIIEncoding Unicode where ...
instance TransASCIIEncoding UTF16 where ...
instance TransASCIIEncoding UTF8 where ...
instance TransASCIIEncoding GenericByte where ...

'x' :: Char a
'\u1234' :: Char Unicode
'\q789' :: Char WeirdCompilerSupportedEncoding

instance (TransASCIIEncoding a) => Bounded (Char a) where
  minBound = Chr 0
  maxBound = Chr maxValueUsedByEncoding

class CharTranscoding a b where
  transcode :: String a -> String b

ord :: Char a -> Maybe Int  -- Nothing if arg isn't ASCII
ordUnicode :: Char Unicode -> Int

Obvious problems: backward compatibility and codings like ISO 2022 and
Shift-JIS which break the fundamental assumption. I don't think either
problem is fatal. A more flexible subtyping mechanism would be nice, so
that (e.g.) byte-writing functions could take any Char type with a
sufficiently small maxValue.
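
To illustrate the "band-limited" idea with these types: code that only
cares about ASCII can be written once, polymorphic in the encoding. A
sketch using the definitions above (isOption is just an illustration):

isOption :: String a -> Bool
isOption (Chr 0x2D : _) = True   -- a leading '-' has code 0x2D in
isOption _              = False  -- every trans-ASCII encoding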

-- Ben

___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread MR K P SCHUPKE
>CommandLineToArgvW provides a way to obtain a Unicode

Don't forget there are multiple encodings for Unicode:

UTF-8, UTF-16, UTF-32...

Keean.
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread Scott Turner
On 2004 September 16 Thursday 06:19, Simon Marlow wrote:
> Argv and the environment - I don't know.  Windows CreateProcess() allows
> these to be UTF-16 strings, but I don't know what encoding/decoding
> happens between CreateProcess() and what the target process sees in its
> argv[] (can't be bothered to dig through MSDN right now). 

In Windows, CommandLineToArgvW provides a way to obtain a Unicode set of argv 
and argc values from a Unicode command-line string. Visual C++ supports 
defining a wmain function which is like main except it receives a Unicode 
argv. I looked for details of how the args are converted for an ordinary C 
'main' function, but didn't turn up much else while digging through MSDN. 
Windows distinguishes between the system code page and the C runtime locale 
(which is initially ASCII).

So Windows would work best if getArgs returns a String, while on Unix it would 
avoid encoding problems if it returns [Byte].
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread Gabriel Ebner
Glynn Clements <[EMAIL PROTECTED]> writes:

>> > If you want text, well, tough; what comes out most system calls and
>> > core library functions (not just read()) are bytes.
>> 
>> Which need to be interpreted by the program depending on where these
>> bytes come from.
>
> They don't necessarily need to be interpreted.

I was thinking of data read from an fd.

> A lot of data simply gets "routed" from one place to another. E.g. a
> program reads a filename from argv[i] and passes it to open(). It
> doesn't matter if the filename is in Klingon.

Right.

> If you *need* an encoding, and don't have any better information, then
> the locale provides a last resort. Decoding bytes according to the
> locale for the sake of it just adds an unnecessary failure mode.

Right.

> For case testing, locale-dependent sorting and the like, you need to
> convert to characters. [Although possibly only temporarily; you can
> sort a list of byte strings based upon their corresponding character
> strings using sortBy. This means that a decoding failure only means
> that the ordering will be wrong. This is essentially what happens with
> "ls" if you have filenames which aren't valid in the current locale.]

sortBy could only cope with single-byte encodings.  Multi-byte
encodings would need something else.

> It's broken. Being able to represent filenames as byte strings is
> fundamental. Being able to convert them to or from character strings
> is useful but not essential. The only reason why the existing API
> doesn't cause serious problems is because the translation is currently
> hardwired to an encoding which can't fail.

Handling binary filenames is hardly fundamental.  It isn't even very
portable, see the posts about filename handling under modern Windows.
It might be an important feature, but there are other programs out
there (mostly GUIs) that expect filenames to be encoded according to
the locale settings too.

> By "core library functions", I was referring primarily to libc, not
> the Haskell library functions which were built upon them. The Haskell
> developers can change Haskell, they can't change libc.

And they don't need to change libc.  Libc just passes bytes through.

Gabriel.


___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


RE: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread MR K P SCHUPKE
>In the general case, it needs to be a bit more complex than that,

That's why the functions handled lists, not individual characters;
I was assuming that each [Word8] -> [Char] represented a valid
and complete encoding block, i.e. at the start of each call it
assumes no escapes. All this means is that when reading in chunks
you paste those chunks together before conversion, and you can
only break outside of escapes. This in my opinion is better
behaviour anyway... I don't want some hidden escape state mangling
output, just because some earlier code generated invalid output.

Keean.
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


RE: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread Glynn Clements

MR K P SCHUPKE wrote:

> >E.g. what happens if you call getDirectoryContents for a directory
> >which contains filenames which aren't valid in the current encoding?
> 
> Surely this shows the problem with the idea of a 'current encoding'

Yes.

In case I haven't already made this clear, my argument is essentially
that it's the API which is broken, rather than the implementations.

> ... You could be reading files from two remote servers each using
> different encodings...
> 
> So you could have read and write raw [Word8] and read and write char,
> something like:
> 
> readWithEncoder :: ([Word8] -> [Char]) -> IO [Char]
> writeWithEncoder :: ([Char] -> [Word8]) -> [Char] -> IO ()

In the general case, it needs to be a bit more complex than that, in
order to handle stateful encodings (e.g. ISO-2022), or to handle
decoding multi-byte encodings (e.g. UTF-8) in chunks. Unfortunately,
the iconv interface doesn't allow the encoder state to be extracted,
so a generic iconv-based converter would have to be in the IO monad.

-- 
Glynn Clements <[EMAIL PROTECTED]>
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


RE: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread Simon Marlow
On 16 September 2004 10:35, Glynn Clements wrote:

> Simon Marlow wrote:
> 
>>> Which is why I'm suggesting changing Char to be a byte, so that we
>>> can have the basic, robust API now and wait for the more advanced
>>> API, rather than having to wait for a usable API while people sort
>>> out all of the issues.
>> 
>> An easier way is just to declare that the existing API assumes a
>> Latin-1 encoding consistently.  Later we might add a way to let the
>> application pick another encoding, or request that the I/O library
>> uses the locale encoding.
> 
> But how do you do that without breaking stuff? If the application
> changes the encoding to UTF-8 (either explicitly, or by using the
> locale's encoding when it happens to be UTF-8), then code such as:
> 
>   [filename] <- getArgs
>   openFile filename ReadMode
> 
> will fail if filename isn't a valid UTF-8 sequence. Similarly for the
> other cases where the OS accepts/returns byte strings but the Haskell
> interface uses String.

And that's the correct behaviour, isn't it?

Actually I hadn't really considered filenames, I was just talking about
data read & written via the IO library.

> I'm less concerned about the handling of streams, as you can
> reasonably add a way to change the encoding before any data has been
> read or written. I'm more concerned about FilePaths, argv, the
> environment etc.

Yes, these are interesting issues.  Filenames are stored as character
strings on some OSs (eg. Windows) and byte strings on others.  So the
Haskell portable API should probably use String, and do decoding based
on the locale (if the programmer asks for it).

Argv and the environment - I don't know.  Windows CreateProcess() allows
these to be UTF-16 strings, but I don't know what encoding/decoding
happens between CreateProcess() and what the target process sees in its
argv[] (can't be bothered to dig through MSDN right now).  I suspect
these should be Strings in Haskell too, with appropriate
decoding/encoding happening under the hood.

Cheers,
Simon
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


RE: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread MR K P SCHUPKE
>E.g. what happens if you call getDirectoryContents for a directory
>which contains filenames which aren't valid in the current encoding?

Surely this shows the problem with the idea of a 'current encoding'
... You could be reading files from two remote servers each using
different encodings...

So you could have read and write raw [Word8] and read and write char,
something like:

readWithEncoder :: ([Word8] -> [Char]) -> IO [Char]
writeWithEncoder :: ([Char] -> [Word8]) -> [Char] -> IO ()


After all, how can you 'normalise' a filename to a standard encoding
if you don't have a function to do so? In fact, if you encounter
a server with an encoding you have no converter for, you have no
choice but to treat it as raw bytes.
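
Usage would then look something like this (cp1250Decode and utf8Decode
are hypothetical decoder functions, one per server):

-- namesA <- readWithEncoder cp1250Decode
-- namesB <- readWithEncoder utf8Decode

and the raw-bytes fallback is just the conversion that maps byte n to
code point n, which can never fail.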

Keean.
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread Glynn Clements

Udo Stenzel wrote:

> > > One more reason to fix the I/O functions to handle encodings and have
> > > a seperate/underlying binary I/O API.
> > 
> > The problem is that we also need to fix them to handle *no encoding*.
> 
> What are you proposing here?  Making the breakage even worse by specifying
> a text based api that uses "no encoding"?  

No. I'm suggesting that many of the I/O functions shouldn't be
treating their arguments or return values as text.

> Having a seperate byte based api is far better. If you don't know
> the encoding, all you have is bytes, no text.

My point is that many of the existing functions should be changed to
use bytes instead of text (not separate byte/char versions). E.g.:

type FilePath = [Byte]

If you have a reason to treat a FilePath as text, then you convert it.
E.g.

names <- getDirectoryContents dir
let namesT = map (toString localeEncoding) names

We don't need a separate getDirectoryContentsAsText, and we certainly
don't want that to be the default.

For stream I/O, having both text and binary read/write functions
makes sense.
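
A sketch of what that pair of APIs might look like for streams (all
names hypothetical, in the spirit of this thread, not an existing
library):

hGetBytes :: Handle -> Int -> IO [Word8]
hPutBytes :: Handle -> [Word8] -> IO ()

hGetText :: Encoding -> Handle -> Int -> IO String
hPutText :: Encoding -> Handle -> String -> IO ()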

> > >  String's are a list of unicode characters, [Word8] is a
> > > list of bytes.
> > 
> > And what comes out of (and goes into) most core library functions is
> > the latter.
> 
> So System.Directory needs to be specified in terms of bytes, too.  Looks like
> a clean solution to me.

Sure. But I'm looking for a solution which doesn't involve re-writing
everything, and which won't result in lots of programs suddenly
becoming unreliable if the hardwired default ISO-8859-1 conversion is
changed.

-- 
Glynn Clements <[EMAIL PROTECTED]>
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


RE: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread Glynn Clements

Simon Marlow wrote:

> > Which is why I'm suggesting changing Char to be a byte, so that we can
> > have the basic, robust API now and wait for the more advanced API,
> > rather than having to wait for a usable API while people sort out all
> > of the issues.
> 
> An easier way is just to declare that the existing API assumes a Latin-1
> encoding consistently.  Later we might add a way to let the application
> pick another encoding, or request that the I/O library uses the locale
> encoding.  

But how do you do that without breaking stuff? If the application
changes the encoding to UTF-8 (either explicitly, or by using the
locale's encoding when it happens to be UTF-8), then code such as:

[filename] <- getArgs
openFile filename ReadMode

will fail if filename isn't a valid UTF-8 sequence. Similarly for the
other cases where the OS accepts/returns byte strings but the Haskell
interface uses String.

Currently, the use of String for byte strings doesn't cause problems
because decoding using ISO-8859-1 can't fail. Allowing the use of a
fallible decoder introduces a new set of issues.

E.g. what happens if you call getDirectoryContents for a directory
which contains filenames which aren't valid in the current encoding? 
Does the call fail outright, or are invalid entries silently omitted?

I'm less concerned about the handling of streams, as you can
reasonably add a way to change the encoding before any data has been
read or written. I'm more concerned about FilePaths, argv, the
environment etc.

-- 
Glynn Clements <[EMAIL PROTECTED]>
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread Udo Stenzel
Glynn Clements <[EMAIL PROTECTED]> schrieb am 16.09.04 10:46:58:
> Gabriel Ebner wrote:
> > One more reason to fix the I/O functions to handle encodings and have
> > a separate/underlying binary I/O API.
> 
> The problem is that we also need to fix them to handle *no encoding*.

What are you proposing here?  Making the breakage even worse by specifying
a text based api that uses "no encoding"?  

Having a separate byte-based API is far better.  If you don't know the
encoding, all you have is bytes, no text.

> Also, binary data and text aren't disjoint. Everything is binary; some
> of it is *also* text.

No, it isn't.  Everything is binary (read: we need a byte-based I/O
library); only after decoding does it become text (read: we need
explicit support for decoding, and probably a convenience layer that
looks like the old I/O library).

> >  String's are a list of unicode characters, [Word8] is a
> > list of bytes.
> 
> And what comes out of (and goes into) most core library functions is
> the latter.

So System.Directory needs to be specified in terms of bytes, too.  Looks like
a clean solution to me.


Regards,

Udo.



___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


RE: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread Simon Marlow
On 16 September 2004 00:02, Glynn Clements wrote:

> Which is why I'm suggesting changing Char to be a byte, so that we can
> have the basic, robust API now and wait for the more advanced API,
> rather than having to wait for a usable API while people sort out all
> of the issues.

An easier way is just to declare that the existing API assumes a Latin-1
encoding consistently.  Later we might add a way to let the application
pick another encoding, or request that the I/O library uses the locale
encoding.  

Existing code continues to work, and there are no conceptual problems (a
Char is still Unicode).

You have to decide what happens when the programmer tries to output a
Char that is out of range for Latin-1.  The current behaviour is simply
to take the code point mod 0x100, but we could also decide to raise an
exception in this case.
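
For concreteness, the current truncating behaviour amounts to this
(a sketch, not the actual library code):

import Data.Char (ord)
import Data.Word (Word8)

charToByte :: Char -> Word8
charToByte c = fromIntegral (ord c `mod` 0x100)  -- silently wraps

The alternative would be to raise an exception whenever ord c > 0xFF
instead of wrapping.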

Cheers,
Simon
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread Glynn Clements

Gabriel Ebner wrote:

> > The RTS doesn't know the encoding. Assuming that the data will use the
> > locale's encoding will be wrong too often.
> 
> If the program wants to get bytes, it should get bytes explicitly, not
> some sort of pseudo-Unicode String.

Er, that's what I've been saying. And most programs should be getting
filenames as bytes.

> > Like so many other people, you're making an argument based upon
> > fiction (specifically, that you have a closed world where everything
> > always uses the same encoding) then deeming anyone who is unable to
> > maintain the fiction to be "wrong".
> 
> Everything's fine here with LANG=de_AT.utf8.  And I can't recall
> having any problems with it.  But well, YMMV.

So either you never encounter Latin1 files or the programs aren't
trying to decode them.

Bear in mind that the standard libraries don't automatically decode
everything according to the locale's encoding. A lot of programs
completely ignore the locale, and many of those which use it for
something don't decode strings into wide strings.

> > No. If a program just passes bytes around, everything will work so
> > long as the inputs use the encoding which the outputs are assumed to
> > use. And if the inputs aren't in the "correct" encoding, then you have
> > to deal with encodings manually regardless of the default behaviour.
> 
> The only programs that just pass bytes around that come to mind are
> basic Unix utilities.  Basically everything else will somehow process
> the data.

I would suggest that most programs which deal with filenames merely
pass them around. I.e. read bytes from argv or the environment or
files, and pass the bytes to open() etc. And when programs do process
filenames, the processing is usually trivial, and not influenced by
the encoding, e.g. appending or removing directories or extensions.
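
For instance, stripping an extension is a purely byte-level operation
(a sketch; 0x2E is '.', and it assumes the name contains no directory
part):

import Data.Word (Word8)

dropExtension :: [Word8] -> [Word8]
dropExtension name
  | 0x2E `elem` name = reverse (drop 1 (dropWhile (/= 0x2E) (reverse name)))
  | otherwise        = name

No decoding is involved, so it works under any ASCII-compatible
encoding (UTF-8, Latin-*, EUC, ...).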

> > Tough. You already have it, and will do for the foreseeable future. 
> > Many existing APIs (including the core Unix API), protocols and file
> > formats are defined in terms of byte strings with no encoding
> > specified or implied.
> 
> Guess why I like Haskell (the language; the implementations are not up
> to that ideal yet).

You're missing the point. Haskell is implemented upon those existing
APIs, and Haskell programs need to understand those protocols and file
formats. Nothing that Haskell (or an implementation thereof) does can
make the issues go away.

> > I'd like to. But many of the functions which provide or accept binary
> > data (e.g. FilePath) insist on represent it using Strings.
> 
> Good point.  Adding functions that accept bytes instead of strings
> would be a major undertaking.

Which is why I'm suggesting changing Char to be a byte, so that we can
have the basic, robust API now and wait for the more advanced API,
rather than having to wait for a usable API while people sort out all
of the issues.

> > I18N is inherently difficult. Lots of textual data exists in lots of
> > different encodings, and the encoding is frequently unspecified.
> 
> That's the problem with the current API.  You can neither easily
> read/write bytes nor strings in a specified encoding.

No, that's the problem with reality. It has nothing to do with Haskell
beyond the issue of whether Haskell is based upon reality or fiction.

> > The problem is that we also need to fix them to handle *no encoding*.
> 
> That's binary data. (assuming you didn't want to say 'unknown')

Yes. Filenames are binary data; environment strings are binary data;
argv[] is binary data. They may *also* be text, but if they are, the
encoding is, in general, unknown.

> > Also, binary data and text aren't disjoint. Everything is binary; some
> > of it is *also* text.
> 
> Simon's new-io proposal does this very nicely.  Stdin is by default a
> binary stream and you can obtain a TextInputStream for it using either
> the locale's encoding or a specified encoding.  That's the way I'd
> like it to be.

Yes. From what I've seen of it, it's basically the right thing, so far
as it goes, which unfortunately isn't that far. The issues go far
beyond reading and writing streams.

The problem is that this *isn't* the Haskell98 API; it isn't even
included in any existing implentation.

> > Or out of getDirectoryContents, getArgs, getEnv etc. Or to pass a list
> > of Word8s to a handle, or to openFile, getEnv etc.
> 
> That's a real issue.  Adding new functions with a bin- is the only
> solution that comes to my mind.

Well, the other obvious solution is changing the existing functions to
use Word8s (obviously, we want a better name, e.g. Byte or Char, or
even just CChar) and make the new functions use wide characters.

It isn't as if the existing functions actually deal with anything
other than bytes. You never get anything other than Latin1 from a
system function which returns Char (or IO Char), and passing anything
which isn't Latin1 to such a function results in it being silently
cast to a byte.

> > Because we don't have an "oracl

Re: [Haskell-cafe] Writing binary files?

2004-09-16 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

> >> When I switch my environment to UTF-8, which may happen in a few
> >> years, I will convert filenames to UTF-8 and set up mount options to
> >> translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.
> >
> > But what about files which were been created by other people, who
> > don't use UTF-8?
> 
> All people sharing a filesystem should use the same encoding.

Again, this is just "hand waving" the issues away.

> BTW, when ftping files between Windows and Unix, a good ftp client
> should convert filenames to keep the same characters rather than
> bytes, so CP-1250 encoded names don't come as garbage in the encoding
> used on Unix which is definitely different (ISO-8859-2 or UTF-8) or
> vice versa.

Which is fine if the FTP client can figure out which encoding is used
on the remote end. In practice, you have to tell it, i.e. have a list
of which servers (or even which directories on which servers) use
which encoding.

> >> I expect good programs to understand that and display them
> >> correctly no matter what technique they are using for the display.
> >
> > When it comes to display, you have to deal with encoding
> > issues one way or another. But not all programs deal with display.
> 
> So you advocate using multiple encodings internally. This is in
> general more complicated than what I advocate: using only Unicode
> internally, limiting other encodings to I/O boundary.

How do you draw that conclusion from what I wrote here?

There are cases where it's advantages to use multiple encodings, but I
wasn't suggesting that in the above. What I'm suggesting in the above
is to sidestep the encoding issue by keeping filenames as byte strings
wherever possible.

> > The core OS and network server applications essentially remain
> > encoding-agnostic.
> 
> Which is a problem when they generate an email, e.g. to send a
> non-empty output of a cron job, or report unauthorized use of sudo.
> If the data involved is not pure ASCII, it will often be mangled.

It only gets mangled if you feed it to a program which is making
assumptions about the encoding. Non-MIME messages neither specify nor
imply an encoding. MIME messages can use either
"text/plain; charset=x-unknown" or application/octet-stream if they
don't understand the encoding.

And program-generated email notifications frequently include text with
no known encoding (i.e. binary data). Or are you going to demand that
anyone who tries to hack into your system only sends it UTF-8 data so
that the alert messages are displayed correctly in your mail program?

> It's rarely a problem in practice because filenames, command
> arguments, error messages, user full names etc. are usually pure
> ASCII. But this is slowly changing.

To the extent that non-ASCII filenames are used, I've encountered far
more filenames in both Latin1 and ISO-2022 than in UTF-8. Japanese FTP
sites typically use ISO-2022 for everything; even ASCII names may have
"\e(B" prepended to them.

> > But, as I keep pointing out, filenames are byte strings, not
> > character strings. You shouldn't be converting them to character
> > strings unless you have to.
> 
> Processing data in their original byte encodings makes supporting
> multiple languages harder. Filenames which are inexpressible as
> character strings get in the way of clean APIs. When considering only
> filenames, using bytes would be sufficient, but in overall it's more
> convenient to Unicodize them like other strings.

It also harms reliability. Depending upon the encoding, two distinct
byte strings may have the same Unicode representation.

E.g. if you are interfacing to a server which uses ISO-2022 for
filenames, you have to get the escapes correct even when they are
no-ops in terms of the string representation. If you obtain a
directory listing, receive the filename "\e(Bfoo.txt", and convert it
to Unicode, you get "foo.txt". If you then convert it back without the
leading escape, the server is going to say "file not found".

> > The term "mismatch" implies that there have to be at least two things.
> > If they don't match, which one is at fault? If I make a tar file
> > available for you to download, and it contains non-UTF-8 filenames, is
> > that my fault or yours?
> 
> Such tarballs are not portable across systems using different encodings.

Well, programs which treat filenames as byte strings to be read from
argv[] and passed directly to open() won't have any problems with
this. It's only a problem if you make it a problem.

> If I tar a subdirectory stored on ext2 partition, and you untar it on
> a vfat partition, whose fault it is that files which differ only in
> case are conflated?

Arguably, it's Microsoft's fault for not considering the problems
caused by multiple encodings when they decided that filenames were
going to be case-folded.

> > In any case, if a program refuses to deal with a file because it is
> > cannot convert the filename to characters, even when it 

Re: [Haskell-cafe] Writing binary files?

2004-09-16 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements <[EMAIL PROTECTED]> writes:

> But this seems to be assuming a closed world. I.e. the only files
> which the program will ever see are those which were created by you,
> or by others who are compatible with your conventions.

Yes, unless you set the default encoding to Latin1.

>> Some programs use UTF-8 in filenames no matter what the locale is. For
>> example the Evolution mail program which stores mail folders as files
>> under names the user entered in a GUI.
>
> This is entirely reasonable for a file which a program creates. If a
> filename is just a string of bytes, a program can use whatever
> encoding it wants.

But then they display wrong in any other program.

> If it had just treated them as bytes, rather than trying to interpret
> them as characters, there wouldn't have been any problems.

I suspect it treats some characters in these synthesized newsgroup
names, like dots, specially, so it won't work unless it was designed
differently.

>> When I switch my environment to UTF-8, which may happen in a few
>> years, I will convert filenames to UTF-8 and set up mount options to
>> translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.
>
> But what about files which were been created by other people, who
> don't use UTF-8?

All people sharing a filesystem should use the same encoding.

BTW, when ftping files between Windows and Unix, a good ftp client
should convert filenames to keep the same characters rather than
bytes, so CP-1250 encoded names don't come as garbage in the encoding
used on Unix which is definitely different (ISO-8859-2 or UTF-8) or
vice versa.

>> I expect good programs to understand that and display them
>> correctly no matter what technique they are using for the display.
>
> When it comes to display, you have to deal with encoding
> issues one way or another. But not all programs deal with display.

So you advocate using multiple encodings internally. This is in
general more complicated than what I advocate: using only Unicode
internally, limiting other encodings to I/O boundary.

> Assuming that everything is UTF-8 allows a lot of potential problems
> to be ignored.

I don't assume UTF-8 when locale doesn't say this.

> The core OS and network server applications essentially remain
> encoding-agnostic.

Which is a problem when they generate an email, e.g. to send a
non-empty output of a cron job, or report unauthorized use of sudo.
If the data involved is not pure ASCII, it will often be mangled.

It's rarely a problem in practice because filenames, command
arguments, error messages, user full names etc. are usually pure
ASCII. But this is slowly changing.

> But, as I keep pointing out, filenames are byte strings, not
> character strings. You shouldn't be converting them to character
> strings unless you have to.

Processing data in their original byte encodings makes supporting
multiple languages harder. Filenames which are inexpressible as
character strings get in the way of clean APIs. When considering only
filenames, using bytes would be sufficient, but in overall it's more
convenient to Unicodize them like other strings.

> 1. Actually, each user decides which locale they wish to use. Nothing
> forces two users of a system to use the same locale.

Locales may be different, but they should use the same encoding when
they share files. This applies to file contents too - various formats
don't have a fixed encoding and don't specify the encoding explicitly,
so these files are assumed to be in the locale encoding.

> 2. Even if the locale was constant for all users on a system, there's
> still the (not exactly minor) issue of networking.

Depends on the networking protocols. They might insist that filenames
are represented in UTF-8 for example.

>> > Or that every program should pass everything through iconv()
>> > (and handle the failures)?
>> 
>> If it uses Unicode as internal string representation, yes (because the
>> OS API on Unix generally uses byte encodings rather than Unicode).
>
> The problem with that is that you need to *know* the source and
> destination encodings. The program gets to choose one of them, but it
> may not even know the other one.

If it can't know the encoding, it should process the data as a
sequence of bytes, and can output it only to another channel which
accepts raw bytes.

But usually it's either known or can be assumed to be the locale
encoding.

> The term "mismatch" implies that there have to be at least two things.
> If they don't match, which one is at fault? If I make a tar file
> available for you to download, and it contains non-UTF-8 filenames, is
> that my fault or yours?

Such tarballs are not portable across systems using different encodings.

If I tar a subdirectory stored on ext2 partition, and you untar it on
a vfat partition, whose fault it is that files which differ only in
case are conflated?

> In any case, if a program refuses to deal with a file because it is
> cannot convert the filename t

Re: [Haskell-cafe] Re: Writing binary files?

2004-09-16 Thread Glynn Clements

Gabriel Ebner wrote:

> >> 3. The default encoding is settable from Haskell, defaults to
> >>ISO-8859-1.
> >
> > Agreed.
> 
> So every haskell program that did more than just passing raw bytes
> from stdin to stdout should decode the appropriate environment
> variables, and set the encoding by itself?

This statement is too restrictive. Passing bytes isn't limited to
stdin->stdout, and there's no reason why setting the encoding needs to
be any more involved than e.g. "setLocaleEncoding". If you change it
to:

> So every haskell program that did more than just passing raw bytes
> ... should ... set the encoding by itself?

then the answer is yes.

>  IMO that's too much of
> redundancy, the RTS should actually do that.

The RTS doesn't know the encoding. Assuming that the data will use the
locale's encoding will be wrong too often.

> > There are limits to the extent to which this can be achieved. E.g.
> > what happens if you set the encoding to UTF-8, then call
> > getDirectoryContents for a directory which contains filenames which
> > aren't valid UTF-8 strings?
> 
> Then you _seriously_ messed up.  Your terminal would produce garbage,
> Nautilus would break, ...

Like so many other people, you're making an argument based upon
fiction (specifically, that you have a closed world where everything
always uses the same encoding) then deeming anyone who is unable to
maintain the fiction to be "wrong".

> >> 5. The default encoding is settable from Haskell, defaults to the
> >>locale encoding.
> >
> > I feel that the default encoding should be one whose decoder cannot
> > fail, e.g. ISO-8859-1. You should have to explicitly request the use
> > of the locale's encoding (analogous to calling setlocale(LC_CTYPE, "")
> > at the start of a C program; there's a good reason why C doesn't do
> > this without being explicitly told to).
> 
> So that any haskell program that doesn't call setlocale and outputs
> anything other than US-ASCII will produce garbage on a UTF-8 system?

No. If a program just passes bytes around, everything will work so
long as the inputs use the encoding which the outputs are assumed to
use. And if the inputs aren't in the "correct" encoding, then you have
to deal with encodings manually regardless of the default behaviour.

> > Actually, the more I think about it, the more I think that "simple,
> > stupid programs" probably shouldn't be using Unicode at all.
> 
> Care to give any examples?  Everything that has been mentioned until
> now would break with a UTF-8 locale:
> - ls (sorting would break),
> - env (sorting too)

Sorting according to codepoints inevitably involves decoding. However,
getting the order wrong is usually considered less problematic than
failing outright.

> > I.e. Char, String, string literals, and the I/O functions in
> > Prelude, IO etc should all be using bytes,
> 
> I don't want the same mess as in C, where strings and raw data are the
> very same.

Tough. You already have it, and will do for the foreseeable future. 
Many existing APIs (including the core Unix API), protocols and file
formats are defined in terms of byte strings with no encoding
specified or implied.

> Haskell has a nice type system and nicely defined types
> for binary data ([Word8]) and for Strings (String), why don't use it?

I'd like to. But many of the functions which provide or accept binary
data (e.g. FilePath) insist on representing it using Strings.

> > with a distinct wide-character API available for people who want to
> > make the (substantial) effort involved in writing (genuinely)
> > internationalised programs.
> 
> If you introduce an entirely new "i18n-only" API, then it'll surely
> become difficult. :-)

I18N is inherently difficult. Lots of textual data exists in lots of
different encodings, and the encoding is frequently unspecified.

It would be easier if we had a closed world where only one encoding
was ever used. But we don't, and pretending that we do doesn't make it
so.

> > Anything that isn't ISO-8859-1 just doesn't work for the most part,
> > and anyone who wants to provide real I18N first has to work around
> > the pseudo-I18N that's already there (e.g. convert Chars back into
> > Word8s so that they can decode them into real Chars).
> 
> One more reason to fix the I/O functions to handle encodings and have
> a separate/underlying binary I/O API.

The problem is that we also need to fix them to handle *no encoding*.

Also, binary data and text aren't disjoint. Everything is binary; some
of it is *also* text.

> > Oh, and because bytes are being stored in Chars, the type system won't
> > help if you neglect to decode a string, or if you decode it twice.
> 
> Yes, that's the problem with the current approach, i.e. that there's
> no easy way to get a list of Word8's out of a handle.

Or out of getDirectoryContents, getArgs, getEnv etc. Or to pass a list
of Word8s to a handle, or to openFile, getEnv etc.
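
For handles there is at least a workaround via GHC's hGetBuf; a
sketch (this helps not at all with getArgs, getEnv or
getDirectoryContents, which are fixed to String):

import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray)
import System.IO (Handle, hGetBuf)

-- Read up to n raw bytes from a handle, bypassing Char conversion.
hGetBytes :: Handle -> Int -> IO [Word8]
hGetBytes h n =
  allocaBytes n $ \buf -> do
    got <- hGetBuf h buf n
    peekArray got buf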

> >> The current ISO-8859-1 assumption is also wrong. A progr

Re: [Haskell-cafe] Layered I/O

2004-09-16 Thread Marcin 'Qrczak' Kowalczyk
[EMAIL PROTECTED] writes:

> The discussion of i18n i/o highlighted the need for general overlay
> streams. We should be able to place a processing layer onto a handle
> -- and to peel it off and place another one. The layers can do
> character encoding, subranging (limiting the stream to the specified
> number of basic units), base64 and other decoding, signature
> collecting and verification, etc.

My language Kogut uses the following types:

BYTE_INPUT - abstract supertype of a stream from which bytes can be read
CHAR_INPUT, BYTE_OUTPUT, CHAR_OUTPUT - analogously

The above types support i/o in blocks only (an array of bytes / chars
at a time). In particular resizable byte arrays and character arrays
are input and output streams.

BYTE_INPUT_BUFFER - transforms a BYTE_INPUT to another BYTE_INPUT,
   providing buffering, unlimited lookahead and unlimited "unreading"
   (putback)
CHAR_INPUT_BUFFER - analogously; in addition provides function which
   read a line at a time
BYTE_OUTPUT_BUFFER - transforms a BYTE_OUTPUT to another BYTE_OUTPUT,
   providing buffering and explicit flushing
CHAR_OUTPUT_BUFFER - analogously; in addition provides optional
   automatic flushing after outputting full lines

The above types provide i/o in blocks and in individual characters,
and in lines for character buffers. They should be used as the last
component of a stack.
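
Rendered in Haskell terms, the block-only primitives might look like
this (a sketch; Kogut's actual interfaces are richer):

import Data.Word (Word8)

-- Block transfer is the only primitive; everything else stacks on top.
class ByteInput s where
  readBytes :: s -> Int -> IO [Word8]    -- [] means end of stream

class ByteOutput s where
  writeBytes :: s -> [Word8] -> IO ()

class CharInput s where
  readChars :: s -> Int -> IO String

class CharOutput s where
  writeChars :: s -> String -> IO ()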

BYTE_FILTER - defines how a sequence of bytes is transformed to
   another sequence of bytes, by providing a function which transforms
   a block at a time; it consumes some part of input, produces some
   part of output, and tells whether it stopped because it wants more
   input or because it wants more room in output; throws exception
   on errors
CHAR_FILTER - analogously, but for characters
ENCODER - analogously, but transforms characters into bytes
DECODER - analogously, but transforms bytes into characters

The above are only auxiliary types which just do the conversion on a
block, not streams.
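
A Haskell sketch of the same idea, with illustrative names:

import Data.Word (Word8)

-- Why a step stopped: it consumed all usable input, or it filled
-- the room available for output.
data Stopped = WantsInput | WantsRoom

-- A block transformer: consumes part of its input, produces some
-- output, and returns the unconsumed input plus the reason it stopped.
newtype Step i o = Step { runStep :: [i] -> ([o], [i], Stopped) }

type ByteFilter = Step Word8 Word8
type CharFilter = Step Char  Char
type Encoder    = Step Char  Word8
type Decoder    = Step Word8 Char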

BYTE_INPUT_FILTER - a byte input which uses another byte input and
   applies a byte filter to each block read
CHAR_INPUT_FILTER - a char input which uses another char input and
   applies a char filter to each block read
INPUT_DECODER - a char input which uses a byte input and applies
   a decoder to each block read

The above types support i/o in blocks only.

BYTE_OUTPUT_FILTER, CHAR_OUTPUT_FILTER, OUTPUT_ENCODER -
   analogously, but for output

ENCODING - a supertype which denotes an encoding in an abstract way.
   STRING is one of its subtypes (would be "instance" in Haskell)
   which currently means iconv-implemented encoding. There are also
   singleton types for important encodings implemented directly.
   There is a function which yields a new (stateful) encoder from an
   encoding, and another which yields a decoder, but encoding is what
   is used as an optional argument to the function which opens a file
   or converts between a standalone string and byte array.

REPLACE_CODING_ERRORS - transforms an encoding to a related encoding
   which substitutes U+FFFD on decoding, and '?' on encoding, instead
   of throwing an exception on error.

A similar transformer which e.g. produces numeric character
references such as "&#12345;" for unencodable characters could be
written too (not implemented yet).
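
A per-unit sketch of the substitution (a simplification: real coders
are stateful and operate on blocks, as described above):

import Data.Word (Word8)
import Data.Char (ord)
import Data.Maybe (fromMaybe)

-- Make a partial decoder total by substituting U+FFFD, and a partial
-- encoder total by substituting '?'.
replaceDecodeErrors :: (Word8 -> Maybe Char) -> (Word8 -> Char)
replaceDecodeErrors dec = fromMaybe '\xFFFD' . dec

replaceEncodeErrors :: (Char -> Maybe Word8) -> (Char -> Word8)
replaceEncodeErrors enc = fromMaybe (fromIntegral (ord '?')) . enc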

COPYING_FILTER - filter which dumps data passed through it to another
   output stream
APPEND_INPUT - concatenates several input streams into one
NULL_OUTPUT - /dev/null

The above types come in BYTE and CHAR flavors.

FLUSHING_OTHER - a byte input which reads data from another byte
   input, but flushes some specified output stream before each input
   operation; it's used on the *bottom* of stdin stack and flushes the
   *top* of stdout stack, so alternating input and output on
   stdin/stdout comes in the right order even if partial lines are
   output and without explicit flushing
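
In plain Haskell, the effect FLUSHING_OTHER automates looks like this
sketch:

import System.IO (hFlush, stdout)

-- Flush the top of the stdout stack before reading, so a prompt
-- without a trailing newline still appears before the user types.
promptLine :: String -> IO String
promptLine msg = do
  putStr msg
  hFlush stdout
  getLine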

RAW_FILE - a byte input and output at the same time, a direct
   interface to the OS

Some functions and other values:

TextReader - transforms a byte input to a character input by stacking
   a decoder (for the specified or default encoding), a filter for
   newlines (not implemented yet), and char input buffer (with the
   specified or default buffer size)
TextWriter - analogously, for output

OpenRawFile, CreateRawFile - opens a raw file handle, has various
   options (read, write, create, truncate, exclusive, append, mode).

OpenTextFile - a composition of OpenRawFile and TextReader which
   splits optional arguments to both, depending on where they apply
CreateTextFile - a composition of CreateRawFile and TextWriter

BinaryReader, BinaryWriter - only does buffering, has a slightly
   different interface than ByteInputBuffer and ByteOutputBuffer
OpenBinaryFile, CreateBinaryFile - analogously

RawStdIn, RawStdOut, RawStdErr - raw files
StdOut - RawStdOut, transformed by TextWriter with automatic flushing
   after lines turned on (it's normally off by default)
StdErr - similar
StdIn

Re: [Haskell-cafe] FilePath handling [Was: Writing binary files?]

2004-09-16 Thread Henning Thielemann

On Wed, 15 Sep 2004, Glynn Clements wrote:

> Henning Thielemann wrote:
> 
> > I even plead for an abstract data type FilePath which supports operations
> > like 'enter a directory', 'go one level higher' and so on.
> 
> Are you referring to "pure" operations on the FilePath, e.g. appending
> and removing entries?

Yep, thus the signatures should be like

enterDirectory :: FileName -> FilePath -> FilePath
parentDirectory :: FilePath -> FilePath

> That's reasonable enough. But it needs to be
> borne in mind that there's a difference between:
> 
>   setCurrentDirectory ".."
> and:
>   dir <- getCurrentDirectory
>   setCurrentDirectory $ parentDirectory dir
> 
> [where parentDirectory is a pure FilePath -> FilePath function.]
> 
> if the last component in the path is a symlink.
> 
> If you want to make FilePath an instance of Eq, the situation gets
> much more complicated.

I agree, though these are not problems introduced by the abstraction of
FilePath but they already exist for FilePath = String.
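
A minimal sketch of such an abstract type (names invented here; the
operations are purely syntactic, so the symlink caveat above still
applies):

newtype FileName = FileName String        -- one path component
newtype DirPath  = DirPath [FileName]     -- components, root first

enterDirectory :: FileName -> DirPath -> DirPath
enterDirectory d (DirPath ds) = DirPath (ds ++ [d])

parentDirectory :: DirPath -> DirPath
parentDirectory (DirPath []) = DirPath [] -- already at the root
parentDirectory (DirPath ds) = DirPath (init ds)

An Eq instance over this representation would compare components
syntactically, which says nothing about whether two paths name the
same file; that is exactly where it gets complicated.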

