[Haskell-cafe] Re: File path programme

2005-02-02 Thread Peter Simons
Glynn Clements writes:

 >> Well, there is a sort-of canonic version for every path;
 >> on most Unix systems the function realpath(3) will find
 >> it. My interpretation is that two paths are equivalent
 >> iff they point to the same target.

 > I think that any definition which includes an "iff" is
 > likely to be overly optimistic.

I see your point. I guess it comes down to how much effort
is put into implementing a realpath() derivative in Haskell.


 > Even so, you will need to make certain assumptions. E.g.
 > older Unices would allow root to replace the "." and ".."
 > entries; you probably want to assume that can't happen.

My take on things is that it is hopeless to even try and
cover all this weird behavior. I'd like to treat paths as
something abstract. What I'm aiming for is that my library
can be used to manipulate file paths as well as URLs,
namespaces, and whatnot else; so I'll necessarily lose some
functionality that an implementation specifically designed
for file paths could provide. If you want to be portable,
you cannot use any esoteric functionality anyway.


 > There are also issues of definition, e.g. is "/dev/tty"
 > considered "equivalent" to the specific "/dev/ttyXX"
 > device for the current process?

No, because the paths differ. ;-)

Peter

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: File path programme

2005-02-02 Thread Glynn Clements

Peter Simons wrote:

>  > Hmmm, I'm not really sure what "equivalence" for file
>  > paths should mean in the presence of hard/symbolic links,
>  > (NFS-)mounted file systems, etc.
> 
> Well, there is a sort-of canonic version for every path; on
> most Unix systems the function realpath(3) will find it.
> My interpretation is that two paths are equivalent iff they
> point to the same target.

I think that any definition which includes an "iff" is likely to be
overly optimistic.

More likely, you will have to settle for a definition such that, if
two paths are considered equal, they refer to the same "file", but
without the converse (i.e. even if they aren't equal, they might still
refer to the same file).

Even so, you will need to make certain assumptions. E.g. older Unices
would allow root to replace the "." and ".." entries; you probably
want to assume that can't happen.

> You (and the others who pointed it out) are correct, though,
> that the current 'canon' function doesn't accomplish that. I
> guess, I'll have to move it into the IO monad to get it
> right. And I should probably rename it, too. ;-)

A version in the IO monad would allow for a "tighter" definition (i.e. 
more likely to correctly identify that two different path values
actually refer to the same file).

[Certainly, you have to use the IO monad if you want to allow for case
sensitivity, as that depends upon which filesystems are mounted
where.]

Within the IO monad, the obvious approach is to stat() both pathnames
and check whether their targets have the same device/inode pairs. 
That's reasonably simple, and probably about as good as you can get.
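
As a rough sketch of the stat()-based check described above, assuming a
POSIX system and the `unix` package (the name `sameFile` is made up for
illustration and is not part of any proposed library):

```haskell
import System.Posix.Files (getFileStatus, deviceID, fileID)

-- | True when both paths resolve (following symlinks) to the same
-- device/inode pair. Throws an IOError if either path cannot be stat'ed.
-- Hypothetical helper; only as reliable as the inode numbers the
-- underlying file system reports.
sameFile :: FilePath -> FilePath -> IO Bool
sameFile a b = do
  sa <- getFileStatus a
  sb <- getFileStatus b
  return (deviceID sa == deviceID sb && fileID sa == fileID sb)
```

Note that this answers "same file right now", not "equivalent path": it
cannot be used for paths that do not exist yet.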

That still won't handle the case where you mount a single remote
filesystem via both NFS and SMB though. I doubt that anything can
achieve that.

There are also issues of definition, e.g. is "/dev/tty" considered
"equivalent" to the specific "/dev/ttyXX" device for the current
process?

-- 
Glynn Clements <[EMAIL PROTECTED]>


Re: [Haskell-cafe] Re: File path programme

2005-02-01 Thread John Meacham
I have not been following this thread too closely, but I have looked at
the various proposed implementations floating around and have a few
comments.

I noticed some have #ifdefs for platforms, which does not seem very useful.
I write programs that are not just cross-platform, but cross-platform at
the same time, i.e. a Linux program which might be controlling slaves running
on Windows or Linux, meaning it will have to deal with paths from either
operating system in a common way. We should make sure all functionality
is available on all systems as much as is feasible. Another common use
is a program running under Cygwin, where you generally want to let the
user work with Unix- or Windows-style paths.

We need to support paths as black boxes, what is returned from directory
routines should be able to be passed back to system routines without
modification or canonicalization. This will allow people to write
file-chooser type apps, so even if the user visible display name of a
file has been changed by charset conversion/whatnot, we can still pass
the black box gotten out of the directory listing back to an open call
and be assured of getting the right file. 

I don't think this is a good use of typeclasses mainly because I don't
think we should use different types for different platforms. I would
like a single abstract Path type which can represent paths from any
platform or an encapsulated black box from the system.

I liked the root-relative formulation someone mentioned.


data FilePath = Path Root [Relative]

data Root = WindowsDrive Char
          | UnixRoot
          | CurrentDir
          | RootBB BlackBox

data Relative = Dot | DotDot | Sub String | SubBB BlackBox

type BlackBox = UArray Int Word8

note that black boxes can be used as path components as well as the
entire path. Examples of where these would be used would be getting the
current directory would return an opaque black box representing the
whole path, while a directory listing would return relative black boxen.
Or imagine implementing cp -r, you'd want to create the same file names in a
different directory.
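
To make the idea concrete, here is a minimal, self-contained sketch of
the types above together with a display-only renderer. The type is
called `Path` here only to avoid clashing with the Prelude's FilePath,
and `showBB` and `render` are hypothetical helpers, not part of any
proposal in this thread. A black box is shown one Char per octet, which
is exactly the lossy "display name" conversion that must never be fed
back to the system.

```haskell
import Data.Array.Unboxed (UArray, elems)
import Data.Char (chr)
import Data.List (intercalate)
import Data.Word (Word8)

type BlackBox = UArray Int Word8

data Root = WindowsDrive Char | UnixRoot | CurrentDir | RootBB BlackBox

data Relative = Dot | DotDot | Sub String | SubBB BlackBox

-- Named Path (not FilePath) to avoid clashing with the Prelude.
data Path = Path Root [Relative]

-- Hypothetical: show black-box octets one Char each, for display only.
showBB :: BlackBox -> String
showBB = map (chr . fromIntegral) . elems

-- Render with '/' separators. This is a display name, not something to
-- pass back to the operating system.
render :: Path -> String
render (Path root segs) = prefix root ++ intercalate "/" (map seg segs)
  where
    prefix (WindowsDrive c) = [c, ':', '/']
    prefix UnixRoot         = "/"
    prefix CurrentDir       = ""
    prefix (RootBB bb)      = showBB bb ++ "/"
    seg Dot        = "."
    seg DotDot     = ".."
    seg (Sub s)    = s
    seg (SubBB bb) = showBB bb
```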

just some comments... sorry for the hit-n-run.

John

-- 
John Meacham - ⑆repetae.net⑆john⑈


[Haskell-cafe] Re: File path programme

2005-01-31 Thread Aaron Denney
On 2005-01-31, Marcin 'Qrczak' Kowalczyk <[EMAIL PROTECTED]> wrote:
> Peter Simons <[EMAIL PROTECTED]> writes:
>
>>   http://cryp.to/pathspec/PathSpec.hs
>
>> There also is a function which changes a path specification
>> into its canonic form, meaning that all redundant segments
>> are stripped.
>
> It's incorrect: canon (read "x/y/.." :: RelPath Posix) gives "x",
> yet on Unix they aren't equivalent when y is a non-local symlink
> or doesn't exist.

True, but most people want x when they construct x/y/.., in makefiles,
install scripts, etc.  It's not "OS thinks is the same", and shouldn't
be marketed as such, but it is useful as "what people generally want to
refer to".
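
For illustration, the purely lexical transformation under discussion
could look roughly like the following. This is a sketch, not the
'canon' function from PathSpec.hs; the name `collapse` and the choice to
work on a list of segments are assumptions made here. It never consults
the file system, so it gives "what people generally want", not "what
the OS thinks is the same".

```haskell
-- | Lexically drop "." segments and cancel "seg/.." pairs in a
-- relative path given as a list of segments. Leading ".." segments are
-- kept, since they cannot be resolved without knowing the parent.
collapse :: [String] -> [String]
collapse = reverse . foldl step []
  where
    -- The accumulator holds the segments seen so far, most recent first.
    step acc "."                  = acc
    step (s:acc) ".." | s /= ".." = acc
    step acc s                    = s : acc
```

With this definition, collapse ["x","y",".."] yields ["x"], which is
wrong on Unix whenever y is a symlink pointing elsewhere or does not
exist.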

-- 
Aaron Denney
-><-



[Haskell-cafe] Re: File path programme

2005-01-31 Thread Peter Simons
Sven Panne writes:

 > OK, but even paths which realpath normalizes to different
 > things might be the same (hard links!).

Sure, but paths it normalizes to the same thing almost
certainly _are_ the same. ;-) That's all I am looking for.
In general, I think path normalization is a nice-to-have
feature, not a must-have.


 > IMHO we can provide something like realpath in the IO
 > monad, but shouldn't define any equality via it.

You are right; Eq shouldn't be defined on top of that. And
couldn't even, if normalization needs the IO monad anyway.

Peter



Re: [Haskell-cafe] Re: File path programme

2005-01-31 Thread robert dockins

> Well, there is a sort-of canonic version for every path; on
> most Unix systems the function realpath(3) will find it.

Here is the BUGS listing from 'man realpath' on my system:

  Never use this function. It is broken by design since it is
  impossible to determine a suitable size for the output buffer.
  According to POSIX a buffer of size PATH_MAX suffices, but PATH_MAX
  need not be a defined constant, and may have to be obtained using
  pathconf(). And asking pathconf() does not really help, since on
  the one hand POSIX warns that the result of pathconf() may be huge
  and unsuitable for mallocing memory. And on the other hand
  pathconf() may return -1 to signify that PATH_MAX is not bounded.

> My interpretation is that two paths are equivalent iff they
> point to the same target.
You might do better (on *nix) to check if two paths terminate in the 
same filesystem and then see if the inode numbers match (with some stat 
variant).  Even that may break down for networked filesystems or FAT 
wrappers or other things that may lie about the inode number.

You could also unravel the path manually, but that seems error-prone and 
unportable.

This strikes me as yet another case of a simple-seeming operation that 
simply cannot be implemented correctly on file names.



Re: [Haskell-cafe] Re: File path programme

2005-01-31 Thread Sven Panne
Peter Simons wrote:
> Sven Panne writes:
>
>  > Hmmm, I'm not really sure what "equivalence" for file
>  > paths should mean in the presence of hard/symbolic links,
>  > (NFS-)mounted file systems, etc.
>
> Well, there is a sort-of canonic version for every path; on
> most Unix systems the function realpath(3) will find it.

OK, but even paths which realpath normalizes to different things might
be the same (hard links!). This might be OK for some uses, but not for
all.

> My interpretation is that two paths are equivalent iff they
> point to the same target. [...]

This would mean that they are equal iff stat(2) returns the same
device/inode pair for them. But this leaves other questions open:

 * Do we have something stat-like on every platform?

 * What does this mean for network file systems, e.g. in the presence of
   the same files/directories exported under different NFS mounts? I don't
   have enough books/manual pages at hand to answer this currently...

 * What does this mean if the file path doesn't refer to an existing
   file/directory?

IMHO we can provide something like realpath in the IO monad, but shouldn't
define any equality via it.

Cheers,
   S.


[Haskell-cafe] Re: File path programme

2005-01-31 Thread Peter Simons
Sven Panne writes:

 > Hmmm, I'm not really sure what "equivalence" for file
 > paths should mean in the presence of hard/symbolic links,
 > (NFS-)mounted file systems, etc.

Well, there is a sort-of canonic version for every path; on
most Unix systems the function realpath(3) will find it.
My interpretation is that two paths are equivalent iff they
point to the same target.

You (and the others who pointed it out) are correct, though,
that the current 'canon' function doesn't accomplish that. I
guess, I'll have to move it into the IO monad to get it
right. And I should probably rename it, too. ;-)


Ben Rudiak-Gould writes:

 > The Read and Show instances aren't inverses of each
 > other. I don't think we should be using Read for path
 > parsing, for this reason.

That's fine with me; I can change that.


 > I don't understand why the path ADT is parameterized by
 > segment representation, but then the Posix and Windows
 > parameter types are both wrappers for String.

No particular reason. I just wanted to make the library work
with a simple internal representation before doing the more
advanced stuff. It is experimental code.


 > It seems artificial to distinguish read :: String ->
 > RelPath Windows from read :: String -> RelPath Posix in
 > this way.

I think it's pretty neat, actually. You have a way to
specify what kind of path you have -- and the type system
distinguishes it, not a run-time error.
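
A minimal sketch of what "the type system distinguishes it" can mean,
using a phantom type parameter. This is not the actual PathSpec.hs
interface; `parsePosix`, `parseWindows`, `splitOn` and `append` are
names invented for this example, and the separator rules are
deliberately simplistic.

```haskell
-- Uninhabited tag types; they exist only at the type level.
data Posix
data Windows

-- The 'os' parameter is a phantom: it does not appear on the right.
newtype RelPath os = RelPath [String] deriving (Eq, Show)

-- Split a string at every character satisfying the predicate.
splitOn :: (Char -> Bool) -> String -> [String]
splitOn p s = case break p s of
  (seg, [])       -> [seg]
  (seg, _ : rest) -> seg : splitOn p rest

parsePosix :: String -> RelPath Posix
parsePosix = RelPath . splitOn (== '/')

parseWindows :: String -> RelPath Windows
parseWindows = RelPath . splitOn (\c -> c == '\\' || c == '/')

-- Both arguments must carry the same tag, so mixing a Posix path with
-- a Windows path is rejected at compile time.
append :: RelPath os -> RelPath os -> RelPath os
append (RelPath a) (RelPath b) = RelPath (a ++ b)
```

Here append (parsePosix "a/b") (parseWindows "c") is a type error rather
than a run-time failure. The price is the one John Meacham points out
elsewhere in the thread: a program that handles both kinds of path at
once has to carry two types around.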

Peter



Re: [Haskell-cafe] Re: File path programme

2005-01-31 Thread Ben Rudiak-Gould
This is a very good summary, and I'm interested to see what you come up 
with.

robert dockins wrote:
> 1) File names are abstract entities.  There are a number of ways one
> might concretely represent a filename. Among these ways are:
>
>   a) A contiguous sequence of octets in memory
>      (C style string on most modern hardware)
>   b) A sequence of unicode codepoints
>      (Haskell style string)

b') A sequence of octets
    (Haskell style string, in real life)

> 4) In practice, the vast majority of file paths are portable between
> the various forms; the forms are "nearly" isomorphic, with corner
> cases being fairly rare.

I don't think they're so rare. I have files on my XP laptop which can't
be represented in the system code page. It's easy for me to tell which
programs are Unicode-aware and which aren't.

-- Ben


Re: [Haskell-cafe] Re: File path programme

2005-01-31 Thread Marcin 'Qrczak' Kowalczyk
Peter Simons <[EMAIL PROTECTED]> writes:

>   http://cryp.to/pathspec/PathSpec.hs

> There also is a function which changes a path specification
> into its canonic form, meaning that all redundant segments
> are stripped.

It's incorrect: canon (read "x/y/.." :: RelPath Posix) gives "x",
yet on Unix they aren't equivalent when y is a non-local symlink
or doesn't exist.

Also, "x/." is not equivalent to "x": rmdir can be used with "x"
but not with "x/.".

-- 
   __("< Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: [Haskell-cafe] Re: File path programme

2005-01-31 Thread Ben Rudiak-Gould
Peter Simons wrote:
> The module currently knows only _relative_ paths. I am still
> experimenting with absolute paths because I have recently
> learned that on Windows something like "C:foo.txt" is
> actually relative -- not absolute. Very weird.

"\foo.txt" is also relative on Win32. And "con.txt" is absolute.

> There also is a function which changes a path specification
> into its canonic form, meaning that all redundant segments
> are stripped. So although two paths which designate the same
> target may not be equal, they can be tested for equivalence.

Again, while this transformation may be useful in some cases, it is not
a canonicalization operation. "foo/../bar" and "bar" do not in general
refer to the same file, and "foo" and "foo/." are not in general
equivalent. We shouldn't encourage these misconceptions in the library,
even if we do provide a path-collapsing transformation along these lines.

Other comments:
The Read and Show instances aren't inverses of each other. I don't think 
we should be using Read for path parsing, for this reason.

I don't understand why the path ADT is parameterized by segment 
representation, but then the Posix and Windows parameter types are both 
wrappers for String. It seems artificial to distinguish read :: String 
-> RelPath Windows from read :: String -> RelPath Posix in this way.

In general, this library doesn't seem to deal with any of the hard 
cases. The devil's in the details.

-- Ben


Re: [Haskell-cafe] Re: File path programme

2005-01-31 Thread Sven Panne
Peter Simons wrote:
> [...]
> There also is a function which changes a path specification
> into its canonic form, meaning that all redundant segments
> are stripped. So although two paths which designate the same
> target may not be equal, they can be tested for equivalence.

Hmmm, I'm not really sure what "equivalence" for file paths should
mean in the presence of hard/symbolic links, (NFS-)mounted file
systems, etc.  Haskell's stateless (==) function doesn't really
make sense IMHO, but perhaps I've missed something in this epic
discussion... :-]

Cheers,
   S.


[Haskell-cafe] Re: File path programme

2005-01-31 Thread Peter Simons
Robert Dockins writes:

 > 1) File names are abstract entities.  There are a number of
 > ways one might concretely represent a filename. Among these
 > ways are:
 >
 >a) A contiguous sequence of octets in memory
 > (C style string on most modern hardware)
 >b) A sequence of unicode codepoints
 > (Haskell style string)
 >c) Algebraic datatypes supporting path manipulations
 > (yet to be developed)

The solution I have in mind uses algebraic data types which
are parameterized over the actual representation. Thus, you
can use them to represent any type of path (in any kind of
representation). In the spirit of release early, release
often:

  http://cryp.to/pathspec/PathSpec.hs
  darcs get http://cryp.to/pathspec

The module currently knows only _relative_ paths. I am still
experimenting with absolute paths because I have recently
learned that on Windows something like "C:foo.txt" is
actually relative -- not absolute. Very weird.

There also is a function which changes a path specification
into its canonic form, meaning that all redundant segments
are stripped. So although two paths which designate the same
target may not be equal, they can be tested for equivalence.

Suggestions for enhancement are welcome, of course.

Peter



Re: UTF-8 BOM, really!? (was: [Haskell-cafe] Re: File path programme)

2005-01-31 Thread Scott Turner
On 2005 January 31 Monday 04:56, Graham Klyne wrote:
> How can it make sense to have a BOM in UTF-8?  UTF-8 is a sequence of
> octets (bytes);  what ordering is there here that can sensibly be varied?

Correct. There is no order to be varied.

A BOM came to be permitted because it uses the identical code as ZWNBSP
(zero width no-break space, U+FEFF). Earlier versions of Unicode permit
ZWNBSP just about anywhere in the character sequence.  Unicode 4 deprecates
this use of ZWNBSP.

If I read it correctly, Unicode 4 says that a BOM at the beginning of a UTF-8
encoded stream is not to be taken as part of the text. The BOM has no effect.
The rationale for this is that some applications put out a BOM at the
beginning of the output regardless of the encoding.  Other occurrences of
ZWNBSP in a UTF-8 encoded stream are significant.


Re: [Haskell-cafe] Re: File path programme

2005-01-31 Thread robert dockins
I have been ruminating on the various responses my attempted file path 
implementation has generated.  I have a design beginning to form in the 
back of my head which attempts to address the file path problem as I lay 
out below. Before I develop it any further, are there any important 
considerations I am missing?

Here is my conception of the file name problem:
1) File names are abstract entities.  There are a number of ways one 
might concretely represent a filename. Among these ways are:

  a) A contiguous sequence of octets in memory
   (C style string on most modern hardware)
  b) A sequence of unicode codepoints
   (Haskell style string)
  c) Algebraic datatypes supporting path manipulations
   (yet to be developed)
2) We would like these three representations to be isomorphic. 
Unfortunately, this cannot be.  In particular, there are major issues 
with the translations between the (a) and (b) forms given above.  One 
could imagine that translation issues involving the (c) form are also 
possible.

3) Translations between (a) and (b) must be parameterized by a character 
encoding.  Translations to and from (c) will require some manner of 
description of the path syntax, which differs by OS.

4) In practice, the vast majority of file paths are portable between the 
various forms; the forms are "nearly" isomorphic, with corner cases 
being fairly rare.

5) Translations between the various forms cost compute cycles and 
memory, and are not necessarily bijective.  Therefore, translations 
should occur _only_ if absolutely necessary.  In particular, if a file 
name passes through a program as a black box (it is not examined or 
manipulated) it should undergo no transformation.

6) Different OSes handle file names differently.  These differences 
should be accounted for, transparently where possible.  These 
differences, however, should be exposed to developers for whom the 
differences matter.

7) Using simple file names should be easy.  We don't want developers to 
have to worry too much about character encodings, path separators, and 
generally bizarre path syntax just to open files.  The complexities of 
correct file name handling should be hidden from the casual programmer. 
However, developers interested in serious 
portability/internationalization should be able to get down into the 
muck if they need to.
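
Points 2), 3) and 5) can be illustrated with a small sketch in which
the translation between forms (a) and (b) is an explicit, partial
function chosen by encoding. All names here (`Decoder`, `Encoder`,
`decodeAscii`, `decodeLatin1`, `encodeLatin1`) are hypothetical, and
only two trivial encodings are shown.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Form (a) -> form (b): may fail on octets the encoding cannot decode.
type Decoder = [Word8] -> Maybe String

-- Form (b) -> form (a): may fail on code points the encoding lacks.
type Encoder = String -> Maybe [Word8]

decodeAscii :: Decoder
decodeAscii = mapM step
  where
    step b | b < 0x80  = Just (chr (fromIntegral b))
           | otherwise = Nothing

-- Latin-1 decoding is total: every octet maps to one code point.
decodeLatin1 :: Decoder
decodeLatin1 = Just . map (chr . fromIntegral)

-- Latin-1 encoding is partial: code points above U+00FF do not fit.
encodeLatin1 :: Encoder
encodeLatin1 = mapM step
  where
    step c | ord c < 0x100 = Just (fromIntegral (ord c))
           | otherwise     = Nothing
```

The same octets decode differently (or not at all) depending on the
decoder chosen, which is why a file name that merely passes through a
program is better left in form (a).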




UTF-8 BOM, really!? (was: [Haskell-cafe] Re: File path programme)

2005-01-31 Thread Graham Klyne
At 23:39 30/01/05 +0100, Marcin 'Qrczak' Kowalczyk wrote:
> Aaron Denney <[EMAIL PROTECTED]> writes:
>
> >> It provides variants of UTF-16/32 with and without a BOM, but
> >> UTF-8 only has the variant with a BOM. This makes UTF-8 a stateful
> >> encoding.
> >
> > I think you mean "UTF-8 only has the variant without a BOM".
>
> No, unfortunately. Unicode standard section 3.10 defines encoding
> schemes:
>
> - UTF-8    (with a BOM)
> - UTF-16BE (without a BOM)
> - UTF-16LE (without a BOM)
> - UTF-16   (with a BOM)
> - UTF-32BE (without a BOM)
> - UTF-32LE (without a BOM)
> - UTF-32   (with a BOM)
>
> It says about UTF-8 BOM: "Its usage at the beginning of a UTF-8 data
> stream is neither required nor recommended by the Unicode Standard,
> but its presence does not affect conformance to the UTF-8 encoding
> scheme."
>
> IMHO it would be fair if it had two variants of UTF-8 encoding scheme,
> just like it has three variants of UTF-16/32, so it would be unambiguous
> whether "UTF-8" in a particular context allows BOM or not.

I haven't been following this thread in detail, so I may be missing
something, but...

How can it make sense to have a BOM in UTF-8?  UTF-8 is a sequence of 
octets (bytes);  what ordering is there here that can sensibly be varied?

#g

Graham Klyne
For email:
http://www.ninebynine.org/#Contact


Re: [Haskell-cafe] Re: File path programme

2005-01-30 Thread David Menendez
Marcin 'Qrczak' Kowalczyk writes:

> AFAIK MacOS normalizes filenames, but using a slightly different
> algorithm than Unicode (perhaps just an older version).

According to , Mac OS
X uses different forms depending on the file system.

| For example, HFS Plus uses a variant of Normal Form D in which 
| U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through
| U+2FAFF are not decomposed (this avoids problems with round trip
| conversions from old Mac text encodings).

The big catch to watch out for is that Mac OS X supports UFS, which is
case sensitive, and HFS+, which is not. I've had at least one Haskell
program that didn't work properly because it tried to create two files
named "tags" and "TAGS" in the same directory.
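
A program that wants to guard against this could check its output names
for collisions before writing anything. The sketch below is only an
approximation: `caseCollisions` is a made-up name, and Data.Char.toLower
is not the case-folding rule any particular file system actually uses.

```haskell
import Data.Char (toLower)
import Data.Function (on)
import Data.List (groupBy, sortBy)
import Data.Ord (comparing)

-- | Group names that become identical after lower-casing. Only groups
-- with more than one member are returned; within a group the original
-- order is preserved because sortBy is stable.
caseCollisions :: [String] -> [[String]]
caseCollisions names =
  [ map snd grp
  | grp <- groupBy ((==) `on` fst) (sortBy (comparing fst) keyed)
  , length grp > 1
  ]
  where
    keyed = [ (map toLower n, n) | n <- names ]
```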
-- 
David Menendez <[EMAIL PROTECTED]> 


Re: [Haskell-cafe] Re: File path programme

2005-01-30 Thread Marcin 'Qrczak' Kowalczyk
Aaron Denney <[EMAIL PROTECTED]> writes:

> Better yet would be to have the standard never allow the BOM.

If I could decide, I would ban the BOM in UTF-8 altogether, but I'm
afraid the Unicode Consortium doesn't want to do this.

Microsoft Notepad puts a BOM in UTF-8 encoded files.

-- 
   __("< Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


[Haskell-cafe] Re: File path programme

2005-01-30 Thread Aaron Denney
On 2005-01-30, Marcin 'Qrczak' Kowalczyk <[EMAIL PROTECTED]> wrote:
> Aaron Denney <[EMAIL PROTECTED]> writes:
>
>>> It provides variants of UTF-16/32 with and without a BOM, but
>>> UTF-8 only has the variant with a BOM. This makes UTF-8 a stateful
>>> encoding.
>>
>> I think you mean "UTF-8 only has the variant without a BOM".

...

> IMHO it would be fair if it had two variants of UTF-8 encoding scheme,
> just like it has three variants of UTF-16/32, so it would be unambiguous
> whether "UTF-8" in a particular context allows BOM or not.

Ah.  Okay.  It's not that the BOM is always to be there, but that
it's always ambiguous, which was not clear from your initial
description.

Better yet would be to have the standard never allow the BOM.

Since some things can't handle it, on output we should never emit it,
but still must handle it on input.  Bah.
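
In code, "handle it on input, never emit it on output" amounts to
something like the following sketch on raw octets. The names `utf8BOM`
and `stripBOM` are chosen here for illustration; the three octets are
the UTF-8 encoding of U+FEFF.

```haskell
import Data.Word (Word8)

-- The UTF-8 encoding of U+FEFF.
utf8BOM :: [Word8]
utf8BOM = [0xEF, 0xBB, 0xBF]

-- | Drop a single BOM at the very start of the input, if present.
-- Later occurrences of the same octets are left alone, since they are
-- part of the text.
stripBOM :: [Word8] -> [Word8]
stripBOM (0xEF : 0xBB : 0xBF : rest) = rest
stripBOM bytes                       = bytes
```

An encoder following the same policy simply never prepends utf8BOM.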

-- 
Aaron Denney
-><-



Re: [Haskell-cafe] Re: File path programme

2005-01-30 Thread Marcin 'Qrczak' Kowalczyk
Aaron Denney <[EMAIL PROTECTED]> writes:

>> It provides variants of UTF-16/32 with and without a BOM, but
>> UTF-8 only has the variant with a BOM. This makes UTF-8 a stateful
>> encoding.
>
> I think you mean "UTF-8 only has the variant without a BOM".

No, unfortunately. Unicode standard section 3.10 defines encoding
schemes:

- UTF-8    (with a BOM)
- UTF-16BE (without a BOM)
- UTF-16LE (without a BOM)
- UTF-16   (with a BOM)
- UTF-32BE (without a BOM)
- UTF-32LE (without a BOM)
- UTF-32   (with a BOM)

It says about UTF-8 BOM: "Its usage at the beginning of a UTF-8 data
stream is neither required nor recommended by the Unicode Standard,
but its presence does not affect conformance to the UTF-8 encoding
scheme."

IMHO it would be fair if it had two variants of UTF-8 encoding scheme,
just like it has three variants of UTF-16/32, so it would be unambiguous
whether "UTF-8" in a particular context allows BOM or not.

-- 
   __("< Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


[Haskell-cafe] Re: File path programme

2005-01-30 Thread Aaron Denney
On 2005-01-30, Marcin 'Qrczak' Kowalczyk <[EMAIL PROTECTED]> wrote:
> Glynn Clements <[EMAIL PROTECTED]> writes:
>
>> And it isn't a theoretical issue. E.g. in an environment where EUC-JP
>> is used, filenames may begin with $)B (designate JISX0208 to G1),
>> or they may not (because G1 is assumed to contain JISX0208 initially).
>
> I think such encodings are never used as default encodings of a Unix
> locale.
>
>>> The various UTF encodings do not have this particular problem; if a UTF 
>>> string is valid, then it is a unique representation of a unicode string.
>
> BOM is a problem. Unfortunately Unicode mandates that FEFF at the
> start of a UTF-8 text stream is a mark which doesn't belong to the
> text.

Right

> It provides variants of UTF-16/32 with and without a BOM, but
> UTF-8 only has the variant with a BOM. This makes UTF-8 a stateful
> encoding.

I think you mean "UTF-8 only has the variant without a BOM".  Otherwise
I'd like to see a citation in the standard for this.  Because that's
not the reading I get from .
Instead, it seems that whether the BOM is included or not is a function
of the protocol, and that the UTF-8 streams themselves do not include
the BOM.

-- 
Aaron Denney
-><-



Re: [Haskell-cafe] Re: File path programme

2005-01-30 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

> >> The various UTF encodings do not have this particular problem; if a UTF
> >> string is valid, then it is a unique representation of a unicode string.
> >> However, decoding is still a partial function and can fail.
> >
> > And while it is partly true, it is qualified by the problems relative to
> > canonicalization (an "é" in Unicode can both be represented as "é" or as
> > two chars (an e and an accent) and they should (ideally) compare equal).
>
> In what sense "equal"? They are supposed to be equivalent as far
> as the semantics of the text is concerned, but representations are
> clearly different and most programs distinguish them. In particular
> they are different filenames on both Unix and Windows. AFAIK MacOS
> normalizes filenames, but using a slightly different algorithm than
> Unicode (perhaps just an older version).
>
> IMHO it makes no sense to pretend that they are exactly the same when
> strings consist of code points or lower level units (and I don't
> believe another choice for the default string type would be practical).

Well, at least you and I agree on that.

Once you start down the "semantic equivalence" route, you will quickly
run into issues like "ß" == "ss", and it only gets worse from there
on.

-- 
Glynn Clements <[EMAIL PROTECTED]>

Re: [Haskell-cafe] Re: File path programme

2005-01-30 Thread Marcin 'Qrczak' Kowalczyk
Stefan Monnier <[EMAIL PROTECTED]> writes:

>> The various UTF encodings do not have this particular problem; if a UTF
>> string is valid, then it is a unique representation of a unicode string.
>> However, decoding is still a partial function and can fail.
>
> And while it is partly true, it is qualified by the problems relative to
> canonicalization (an "é" in Unicode can both be represented as "é" or as two
> chars (an e and an accent) and they should (ideally) compare equal).

In what sense "equal"? They are supposed to be equivalent as far
as the semantics of the text is concerned, but representations are
clearly different and most programs distinguish them. In particular
they are different filenames on both Unix and Windows. AFAIK MacOS
normalizes filenames, but using a slightly different algorithm than
Unicode (perhaps just an older version).

IMHO it makes no sense to pretend that they are exactly the same when
strings consist of code points or lower level units (and I don't
believe another choice for the default string type would be practical).

-- 
   __("< Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


[Haskell-cafe] Re: File path programme

2005-01-29 Thread Stefan Monnier
> The various UTF encodings do not have this particular problem; if a UTF
> string is valid, then it is a unique representation of a unicode string.
> However, decoding is still a partial function and can fail.

And while it is partly true, it is qualified by the problems relative to
canonicalization (an "é" in Unicode can both be represented as "é" or as two
chars (an e and an accent) and they should (ideally) compare equal).
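
For what it is worth, the two spellings really are different values of
Haskell's String type. A small sketch (the names `precomposed` and
`decomposed` are just for this example):

```haskell
-- U+00E9 LATIN SMALL LETTER E WITH ACUTE, a single code point.
precomposed :: String
precomposed = "\233"

-- U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
decomposed :: String
decomposed = "e\769"

-- (==) on String compares code point by code point, so this is False.
sameSpelling :: Bool
sameSpelling = precomposed == decomposed
```

Making them compare equal would require a Unicode normalization step
before the comparison, which the standard libraries do not provide.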


Stefan



[Haskell-cafe] Re: File path programme

2005-01-25 Thread Peter Simons
Ben Rudiak-Gould writes:

 > 1. Programs using the library will have predictable
 > (exploitable) bugs in pathname handling.

 > 2. It will never be possible to change the current weird
 > behavior, because it might break legacy code.

I completely agree.

Handling file path specifications as Strings is not a good
idea. A path should be an abstract data type -- I think we
all agree on that --; and if it were, it could be made an
instance of Show, and all would be well.

One argument against changing this has always been that it
would break legacy code (as if it were a huge problem to add
a 'show' call here and there). If GHC starts distributing
even _more_ (broken) path manipulation functions which use
String rather than something sensible, I fear that the
window of opportunity for ever getting it right will close.

My suggestion would be to postpone distributing these
modules until it is clear that they do work, or until
someone has written something that does. The sources are
readily available in CVS anyway, if you need them.

Path handling is extremely tricky, especially if you want to
be portable. IMHO, the subject deserves more attention
before any of the existing solutions is labeled
"reliable" -- and that is what shipping the code with GHC
would implicitly do.

Just my 0.02 Euros.

Peter



[Haskell-cafe] Re: File path programme

2005-01-25 Thread Krasimir Angelov
On Tue, 25 Jan 2005 13:32:29 +0200, Krasimir Angelov
<[EMAIL PROTECTED]> wrote:
> >> What about splitFileExt "foo.bar."? ("foo", "bar.") or ("foo.bar.", "")?
> >
> > The latter makes more sense to me, as an extension of the first case
> > you give and splitting "foo.tar.gz" to ("foo.tar", "gz").
> 
> I will take a look at this. I also don't know which case is more natural.

("foo.bar.", "") is more natural for me because it eliminates the
special case for "." and "..". The original definition of splitFileExt
is:

splitFileExt :: FilePath -> (String, String)
splitFileExt p =
  case pre of
    []      -> (p, [])
    (_:pre) -> (reverse (pre++path), reverse suf)
  where
    (fname,path) = break isPathSeparator (reverse p)
    (suf,pre) | fname == "." || fname == ".." = (fname,"")
              | otherwise                     = break (== '.') fname

The definition can be changed to:

splitFileExt :: FilePath -> (String, String)
splitFileExt p =
  case break (== '.') fname of
    (suf@(_:_),_:pre) -> (reverse (pre++path), reverse suf)
    _                 -> (p, [])
  where
    (fname,path) = break isPathSeparator (reverse p)

The latter is simpler: it doesn't treat "." and ".." as special cases,
and for it

splitFileExt "foo.bar." == ("foo.bar.", "")

Cheers,
  Krasimir


Re: [Haskell-cafe] Re: File path programme

2005-01-24 Thread Keean Schupke
Marcin 'Qrczak' Kowalczyk wrote:
> These rules agree on "foo", "foo." and "foo.tar.gz", yet disagree on
> "foo.bar."; I don't know which is more natural.
 

Filename extensions come from the DOS 8.3 format. In this kind of
name only one '.' is allowed. Unix does not have filename extensions,
as '.' is just a normal filename character (with the exception of
'.', '..', and filenames starting with a '.', which are hidden files).
As far as I know, Unix utilities like gzip look for specific extensions
like '.gz', so it would make more sense on a Unix platform to just look
for a filename ending in '.gz'... this applies recursively, so:

fred.tar.gz
is a gzipped tar file, so the first ending is '.gz' and the next is '.tar'...
So as far as Unix is concerned,
"foo.bar." is just as it is... as would be any other combination, unless
the extension matches one specifically used by your application...

So the most sensible approach would be to have a list of known
extensions which can be recursively applied to the filenames, and to
leave any other filenames alone.

[".gz",".tar",".zip"] ...
In other words, just splitting on a '.' seems the wrong operation.
(Imagine gzipping a file called "a...": you get "a....gz", in other
words simply an appended ".gz".)
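
The known-extension approach could be sketched roughly as follows. This
is only an illustration: the names knownExts and splitKnownExts are made
up, and the list of extensions is just the example list above.

```haskell
import Data.List (find, isSuffixOf)

-- Example list of extensions the application cares about (an assumption).
knownExts :: [String]
knownExts = [".gz", ".tar", ".zip"]

-- Peel known extensions off the end of a filename, outermost first.
-- Anything that does not end in a known extension is left alone.
splitKnownExts :: String -> (String, [String])
splitKnownExts name =
  case find (`isSuffixOf` name) knownExts of
    Just ext
      | length name > length ext ->   -- never strip the whole name
          let (base, exts) =
                splitKnownExts (take (length name - length ext) name)
          in  (base, ext : exts)
    _ -> (name, [])
```

Under these assumptions "fred.tar.gz" splits into "fred" with the
extensions ".gz" and then ".tar", while "foo.bar." and "a..." come back
unchanged with no extensions.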

   Keean


Re: [Haskell-cafe] Re: File path programme

2005-01-23 Thread Marcin 'Qrczak' Kowalczyk
Aaron Denney <[EMAIL PROTECTED]> writes:

>> What about splitFileExt "foo.bar."? ("foo", "bar.") or ("foo.bar.", "")?
>
> The latter makes more sense to me, as an extension of the first case
> you give and splitting "foo.tar.gz" to ("foo.tar", "gz").

It's not that obvious: both choices are compatible with these.

The former is produced by rules:
- split the filename before the last dot *which is not the last
  character of the filename*, or at the end if there is no such dot
- remove the first character of the extension if it's non-empty
  (the character must have been a dot)

The latter is produced by rules:
- split the filename before the last dot, or at the end if there is
  no dot at all
- *if the extension is a sole dot, append a dot to the basename*
- remove the first character of the extension if it's non-empty
  (the character must have been a dot)

Special filenames of "." and ".." are treated separately, before these
rules apply.

Both choices are inverted by the same joinFileExt, which inserts a dot
between the name and extension unless the extension is empty.

These rules agree on "foo", "foo." and "foo.tar.gz", yet disagree on
"foo.bar."; I don't know which is more natural.

The difference influences the behavior of changeFileExt. These cases
are the same with both choices:
   changeFileExt "foo.bar" "" = "foo"
   changeFileExt "foo.tar.gz" "" = "foo.tar"
   changeFileExt "foo." "" = "foo."
   changeFileExt "foo." "baz" = "foo..baz"
but these differ - first choice:
   changeFileExt "foo.bar." "" = "foo"
   changeFileExt "foo.bar." "baz" = "foo.baz"
or the second:
   changeFileExt "foo.bar." "" = "foo.bar."
   changeFileExt "foo.bar." "baz" = "foo.bar..baz"
?
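
For anyone who wants to experiment with the two choices, here is a rough
sketch of both rule sets. It assumes bare filenames with no directory
part, and the names splitFormer, splitLatter and changeFileExtWith are
made up for the purpose; only joinFileExt follows the description above.

```haskell
-- First choice: split before the last dot that is not the final
-- character, then drop the leading dot of the extension.
splitFormer :: String -> (String, String)
splitFormer name
  | name `elem` [".", ".."] = (name, "")        -- special names first
  | otherwise =
      case [i | (i, '.') <- zip [0 ..] name, i /= length name - 1] of
        []   -> (name, "")                      -- no usable dot
        dots -> let (base, ext) = splitAt (last dots) name
                in  (base, drop 1 ext)

-- Second choice: split before the last dot; if the extension is a sole
-- dot, the dot goes back onto the basename and the extension is empty.
splitLatter :: String -> (String, String)
splitLatter name
  | name `elem` [".", ".."] = (name, "")        -- special names first
  | otherwise =
      case break (== '.') (reverse name) of
        (_, [])               -> (name, "")     -- no dot at all
        (revExt, _ : revBase)
          | null revExt       -> (name, "")     -- trailing dot
          | otherwise         -> (reverse revBase, reverse revExt)

-- Inserts a dot between name and extension unless the extension is empty.
joinFileExt :: String -> String -> String
joinFileExt base ""  = base
joinFileExt base ext = base ++ "." ++ ext

-- changeFileExt, parameterised over the splitting rule being tried.
changeFileExtWith
  :: (String -> (String, String)) -> String -> String -> String
changeFileExtWith split name ext = joinFileExt (fst (split name)) ext
```

With these definitions, changeFileExtWith splitFormer "foo.bar." "baz"
gives "foo.baz" and changeFileExtWith splitLatter "foo.bar." "baz" gives
"foo.bar..baz", and joinFileExt inverts both splits on the names
discussed here.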

-- 
   __("< Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


[Haskell-cafe] Re: File path programme

2005-01-23 Thread Aaron Denney
On 2005-01-23, Marcin 'Qrczak' Kowalczyk <[EMAIL PROTECTED]> wrote:
> What about splitFileExt "foo.bar."? ("foo", "bar.") or ("foo.bar.", "")?

The latter makes more sense to me, as an extension of the first case you
give and splitting "foo.tar.gz" to ("foo.tar", "gz").

-- 
Aaron Denney
-><-
