Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-04-29 Thread Andy Wingo
Hi Jan,

On Tue 15 Feb 2011 16:35, Jan Nieuwenhuizen  writes:

> From: Jan Nieuwenhuizen 
>
> 2011-02-04  Jan Nieuwenhuizen  
>
> * module/system/base/compile.scm (compiled-file-name): Add
> directory separator and remove colon for Mingw.  Fixes
> compilation on Windows.
> ---
>  module/system/base/compile.scm |9 +++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/module/system/base/compile.scm b/module/system/base/compile.scm
> index 7d46713..8c72e54 100644
> --- a/module/system/base/compile.scm
> +++ b/module/system/base/compile.scm
> @@ -100,11 +100,16 @@
> ".go")
>(else (car %load-compiled-extensions
>(and %compile-fallback-path
> -   (let ((f (string-append
> +   (let* ((c (canonicalize-path file))
> +   (f (string-append
>   %compile-fallback-path
>   ;; no need for '/' separator here, canonicalize-path
>   ;; will give us an absolute path
> - (canonicalize-path file)
> +  (if (eq? (string-ref c 1) #\:)
> +  ;; on Mingw remove drive-letter separator `:' to
> +  ;; obtain valid file name
> +  (substring c 2)
> +  c)
>   (compiled-extension
>   (and (false-if-exception (ensure-writable-dir (dirname f)))
>f

I don't much like this approach.  Besides mixing in a heuristic on all
machines that is win32-specific, it makes c:/foo.scm collide with
d:/foo.scm in the cache, and fails to also modify load.c which also does
autocompilation in other contexts.
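
To make the collision concrete, here is a minimal Guile sketch.
`strip-drive' is a hypothetical helper that mirrors only the
`(substring c 2)' step of the patch, and the two file names are made-up
examples.

```scheme
;; Hypothetical helper mirroring the patch's drive-letter handling:
;; drop the first two characters when the second one is a colon.
(define (strip-drive c)
  (if (and (> (string-length c) 1)
           (char=? (string-ref c 1) #\:))
      (substring c 2)
      c))

(strip-drive "c:/foo.scm")  ; => "/foo.scm"
(strip-drive "d:/foo.scm")  ; => "/foo.scm", the same cache entry
```

Both sources end up with the same name under %compile-fallback-path, so
whichever is compiled last overwrites the other's .go file.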

I think we need a proper path library, and unfortunately I think it
needs to be implemented at least partly in C, due to circularity issues.
See http://docs.racket-lang.org/reference/pathutils.html for an example
of what I'm talking about.

Is anyone interested in implementing a path library?

Andy
-- 
http://wingolog.org/



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-04-29 Thread Noah Lavine
> Is anyone interested in implementing a path library?
>
> Andy

I might be able to work on it. I haven't done much for Guile lately,
but I expect to have a lot more free time once my semester ends on May
7th.

However, I don't know much about how Windows paths work. Are there any
special considerations beyond the directory separator?

Also, are there any characters that are valid in filenames on some
systems but invalid on other systems?

Noah



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-01 Thread Andy Wingo
On Fri 29 Apr 2011 19:30, Noah Lavine  writes:

>> Is anyone interested in implementing a path library?
>
> I might be able to work on it.

Super!

> However, I don't know much about how Windows paths work. Are there any
> special considerations beyond the directory separator?

Yep!  Check that racket web page I linked to.  You don't have to
implement all of it, but it should be possible to implement, given the
path abstraction.

> Also, are there any characters that are valid in filenames on some
> systems but invalid on other systems?

Ah, I see you are under the delusion that paths are composed of
characters :)  This is not the case.  To the OS, paths are
NUL-terminated byte arrays, with some constraints about their
composition, but which are not necessarily representable as strings.  It
is nice to offer the ability to convert to and from strings, when that
is possible, but we must not assume that it is indeed possible.
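
A small sketch of that point, assuming a Guile that provides
`utf8->string' in (rnrs bytevectors).  The helper name and the byte
sequence are made up for illustration.

```scheme
(use-modules (rnrs bytevectors))

;; "foo" followed by the byte 0xFF, which never occurs in valid UTF-8.
(define raw-name #vu8(102 111 111 255))

;; Return the decoded string, or #f when the bytes are not valid UTF-8.
(define (bytes->string-or-false bv)
  (false-if-exception (utf8->string bv)))

(bytes->string-or-false #vu8(102 111 111))  ; => "foo"
(bytes->string-or-false raw-name)           ; => #f
```

A file with the second name can exist on a POSIX file system, yet there
is no string to hand back for it under this encoding.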

Basically I think the plan should be to add scm_from_locale_path,
scm_from_raw_path, etc to filesys.[ch], and change any
pathname-accepting procedure in Guile to accept path objects, producing
them from strings when given strings, and pass the bytevector
representation to the raw o/s procedures like `open' et al.
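
On the Scheme side, the shape of that plan might look roughly like the
sketch below.  Everything in it is hypothetical: the record type, the
name `ensure-path', and the choice of UTF-8 (the proposal above speaks
of the locale encoding) are assumptions for illustration only.

```scheme
(use-modules (srfi srfi-9) (rnrs bytevectors))

;; Hypothetical path object: a thin wrapper around the raw bytes.
(define-record-type <path>
  (make-path bytes)
  path?
  (bytes path-bytes))

;; Accept a path, a bytevector, or a string, and always return a path.
;; UTF-8 is assumed here purely to keep the sketch short.
(define (ensure-path x)
  (cond ((path? x) x)
        ((bytevector? x) (make-path x))
        ((string? x) (make-path (string->utf8 x)))
        (else (error "not a path designator" x))))

;; A pathname-accepting procedure would call ensure-path first and
;; then hand (path-bytes p) to the underlying system call.
(path-bytes (ensure-path "/tmp/foo"))
```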

Then for a lot of the utilities, we can add (ice-9 paths) or something,
and implement most of the utility functions in Scheme.

Andy
-- 
http://wingolog.org/



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-01 Thread Noah Lavine
> Yep!  Check that racket web page I linked to.  You don't have to
> implement all of it, but it should be possible to implement, given the
> path abstraction.

Okay, I've read it. It doesn't seem very complicated. Should we strive
for API compatibility? I don't see any programs needing it right now,
but maybe there would be in the future if we made them compatible.

> Ah, I see you are under the delusion that paths are composed of
> characters :)  This is not the case.  To the OS, paths are
> NUL-terminated byte arrays, with some constraints about their
> composition, but which are not necessarily representable as strings.  It
> is nice to offer the ability to convert to and from strings, when that
> is possible, but we must not assume that it is indeed possible.

Thanks! However, I'm also under a somewhat different delusion, which
the Racket docs disagree with. I think of a path as a vector of "path
elements", each of which represents a directory except that the last
one might represent a file. I notice the Racket path library makes
their path object distinct from this - you can build a path from a
list of path elements with build-path, and turn a path into a list of
path elements with explode-path, but you can't take an actual path
object and manipulate its components (unless I've missed something).
Do you think this is the right way to think of it?

I'd say that my way of thinking makes more sense if you think that a
filesystem is really just a directed acyclic graph (well, usually
acyclic), and a path is a list of graph nodes. I can't quite see what
the alternative model is, but I have a feeling there's another way of
thinking where Racket's path library makes more sense.
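
As a rough illustration of the "list of path elements" view, here is a
sketch for "/"-separated absolute names only.  The starred names are
hypothetical stand-ins, not Racket's or Guile's API.

```scheme
;; Hypothetical analogue of explode-path: drop empty components that
;; come from the leading or from doubled slashes.
(define (explode-path* str)
  (filter (lambda (s) (not (string-null? s)))
          (string-split str #\/)))

;; Hypothetical analogue of build-path: always yields an absolute name.
(define (build-path* elements)
  (string-append "/" (string-join elements "/")))

(explode-path* "/usr/share/guile")      ; => ("usr" "share" "guile")
(build-path* '("usr" "share" "guile"))  ; => "/usr/share/guile"
```

The sketch loses the distinction between absolute and relative names,
which is one reason a bare list of strings is not quite enough.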

> Basically I think the plan should be to add scm_from_locale_path,
> scm_from_raw_path, etc to filesys.[ch], and change any
> pathname-accepting procedure in Guile to accept path objects, producing
> them from strings when given strings, and pass the bytevector
> representation to the raw o/s procedures like `open' et al.
>
> Then for a lot of the utilities, we can add (ice-9 paths) or something,
> and implement most of the utility functions in Scheme.

Sounds like a plan.

Noah



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-01 Thread Andy Wingo
Hi,

On Sun 01 May 2011 21:23, Noah Lavine  writes:

>> Yep!  Check that racket web page I linked to.  You don't have to
>> implement all of it, but it should be possible to implement, given the
>> path abstraction.
>
> Okay, I've read it. It doesn't seem very complicated. Should we strive
> for API compatibility? I don't see any programs needing it right now,
> but maybe there would be in the future if we made them compatible.

I don't think we need to be compatible, no.  That said it does look
pretty good.

> I think of a path as a vector of "path elements", each of which
> represents a directory except that the last one might represent a
> file. I notice the Racket path library makes their path object
> distinct from this - you can build a path from a list of path elements
> with build-path, and turn a path into a list of path elements with
> explode-path, but you can't take an actual path object and manipulate
> its components (unless I've missed something).  Do you think this is
> the right way to think of it?

I think that might be what you want sometimes, but it doesn't correspond
to the underlying OS path concept.  You could build such a thing on top
of the byte arrays, but I don't think a vector (or list, ...) is always
going to be what you want.  I don't know.  I would say to stick with
byte arrays and strings on the lowest level.

Cheers,

Andy
-- 
http://wingolog.org/



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-01 Thread Mark H Weaver
Andy Wingo  writes:
> On Fri 29 Apr 2011 19:30, Noah Lavine  writes:
>> Also, are there any characters that are valid in filenames on some
>> systems but invalid on other systems?
>
> Ah, I see you are under the delusion that paths are composed of
> characters :)  This is not the case.  To the OS, paths are
> NUL-terminated byte arrays, with some constraints about their
> composition, but which are not necessarily representable as strings.

This is the case on POSIX, but we should keep in mind that on some
systems (e.g. Windows NT) filenames are considered character data,
or at least so says PEP 383 

IMHO, it would be best to avoid embedding into Guile the assumption that
filenames, environment variables, command-line arguments, etc, are
really bytevectors.

> Basically I think the plan should be to add scm_from_locale_path,
> scm_from_raw_path, etc to filesys.[ch], and change any
> pathname-accepting procedure in Guile to accept path objects, producing
> them from strings when given strings, and pass the bytevector
> representation to the raw o/s procedures like `open' et al.

I like this idea, but we should keep in mind that we face the same
problem with things like environment variables, command-line arguments,
etc.  Ideally, we should try to come up with a coherent story and set of
APIs for dealing with all of these data that are string-like, but
actually bytevectors on some systems.

 Mark



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-02 Thread Andy Wingo
On Sun 01 May 2011 23:48, Mark H Weaver  writes:

> on some systems (e.g. Windows NT) filenames are considered character
> data, or at least so says PEP 383
> 

Ah, interesting, I was blissfully ignorant; not the desired state when
one is hacking file-name encoding :)

Still, though, I think the basic point stands: copy what Racket does,
because they actually do run well on windows and are happy with their
abstraction.  It's the sincerest form of flattery :)

> Ideally, we should try to come up with a coherent story and set of
> APIs for dealing with all of these data that are string-like, but
> actually bytevectors on some systems.

Environment variables and command-line arguments being the other ones
that you mentioned; and yes, some common conventions here would be good.
I still think, though, that path objects need their own data type.

Peace,

Andy
-- 
http://wingolog.org/



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-02 Thread Ludovic Courtès
Hi,

Andy Wingo  writes:

> Basically I think the plan should be to add scm_from_locale_path,
> scm_from_raw_path, etc to filesys.[ch], and change any
> pathname-accepting procedure in Guile to accept path objects, producing
> them from strings when given strings, and pass the bytevector
> representation to the raw o/s procedures like `open' et al.

Seems to me like a disjoint type “just for Windows” would be overkill, no?

MIT/GNU Scheme has something this overkill [0].

Bigloo has just one variable, ‘file-separator’, which is either #\/ or
#\\ [1].  Vicinities in SLIB/SCM are similar, with ‘vicinity:suffix?’
abstracting over slash vs. backslash [2].  I’m not sure how they handle
MS-DOS volume names.

Thanks,
Ludo’.

[0] 
http://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/Pathnames.html
[1] 
http://www-sop.inria.fr/mimosa/fp/Bigloo/doc/bigloo-7.html#System-Programming
[2] http://people.csail.mit.edu/jaffer/slib_2.html




Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-02 Thread Andy Wingo
On Mon 02 May 2011 22:58, l...@gnu.org (Ludovic Courtès) writes:

> Andy Wingo  writes:
>
>> Basically I think the plan should be to add scm_from_locale_path,
>> scm_from_raw_path, etc to filesys.[ch], and change any
>> pathname-accepting procedure in Guile to accept path objects, producing
>> them from strings when given strings, and pass the bytevector
>> representation to the raw o/s procedures like `open' et al.
>
> Seems to me like a disjoint type “just for Windows” would be overkill, no?

Maybe you're right; hummm!  I have added a kind racketeer on Cc; perhaps
if he has time, he might have some thoughts in this regard.  :-)

> Bigloo has just one variable, ‘file-separator’, which is either #\/ or
> #\\ [1].

The funny thing is that this doesn't matter at all.  Well, I mean that
it's valid to construct pathnames with / as the separator on Windows, as
/ and \ are equivalent there.
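
If one did want a single spelling, a sketch like the following would
do.  `slashify' is a hypothetical helper, and it only makes sense for
Windows-style names, since on POSIX a backslash is an ordinary file
name character.

```scheme
;; Hypothetical helper: rewrite every backslash as a forward slash.
(define (slashify name)
  (string-map (lambda (c) (if (char=? c #\\) #\/ c)) name))

(slashify "c:\\guile\\module\\foo.scm")  ; => "c:/guile/module/foo.scm"
```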

I still think that we need at least the ability to pass a bytevector as
a path name, on GNU systems; and that if we can do so, then any routine
that needs to deal with a path name would then need to deal in byte
vectors in addition to strings, and at that point perhaps it is indeed
useful to have a path library.

> Vicinities in SLIB/SCM are similar, with ‘vicinity:suffix?’
> abstracting over slash vs. backslash [2].  I’m not sure how they handle
> MS-DOS volume names.

I don't think that they do handle volume names; at least, from what I
could see in the API description there.

Good questions!

A
-- 
http://wingolog.org/



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-02 Thread Ludovic Courtès
Hello!

Andy Wingo  writes:

> On Mon 02 May 2011 22:58, l...@gnu.org (Ludovic Courtès) writes:

[...]

> The funny thing is that this doesn't matter at all.  Well, I mean that
> it's valid to construct pathnames with / as the separator on Windows, as
> / and \ are equivalent there.

Oh, good.

> I still think that we need at least the ability to pass a bytevector as
> a path name, on GNU systems; and that if we can do so, then any routine
> that needs to deal with a path name would then need to deal in byte
> vectors in addition to strings, and at that point perhaps it is indeed
> useful to have a path library.

To accommodate various file name encodings, right?  Then yes.

I think GLib and the like expect UTF-8 as the file name encoding and
complain otherwise, so UTF-8 might be a better default than locale
encoding (and it’s certainly wiser to be locale-independent.)

>> Vicinities in SLIB/SCM are similar, with ‘vicinity:suffix?’
>> abstracting over slash vs. backslash [2].  I’m not sure how they handle
>> MS-DOS volume names.
>
> I don't think that they do handle volume names; at least, from what I
> could see in the API description there.

So volumes matter in the file name canonicalization of the .go cache
right?

Couldn’t we mimic /cygdrive/c, etc.?

Thanks,
Ludo’.



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-02 Thread Eli Barzilay
[Second attempt, my Emacs has unfortunate issues with Ludovic's
name...]


An hour ago, Andy Wingo wrote:
> On Mon 02 May 2011 22:58, l...@gnu.org (Ludovic Courtès) writes:
> 
> > Andy Wingo  writes:
> >
> >> Basically I think the plan should be to add scm_from_locale_path,
> >> scm_from_raw_path, etc to filesys.[ch], and change any
> >> pathname-accepting procedure in Guile to accept path objects,
> >> producing them from strings when given strings, and pass the
> >> bytevector representation to the raw o/s procedures like `open'
> >> et al.
> >
> > Seems to me like a disjoint type “just for Windows” would be
> > overkill, no?
> 
> Maybe you're right; hummm!  I have added a kind racketeer on Cc; perhaps
> if he has time, he might have some thoughts in this regard.  :-)

I don't think that I can contribute much -- I'm mostly looking at
these things from a user's point of view...  Roughly speaking (mostly
because I don't know what the issues that you're up against), our path
values have "just paths" for whatever the OS wants -- so on Windows
they might have either backslashes or slashes (since Racket accepts
both).

To write portable code we don't have a `file-separator' thing,
instead, we have `build-path' that combines two paths with the right
separator.  Similarly, we have `split-path' to split up a path to the
directory part and the last part.  I think that it's generally better
this way, since it represents the higher level operation rather than
fiddling with the semantics of where and how to put separators
directly (but this is not some religious issue, just seems to me like
it would be more convenient).
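
In Guile terms, the two operations could be approximated as in the
sketch below.  The names `split-path*' and `join-path' are
hypothetical, and the sketch hard-codes "/" where a real library would
pick the separator per platform.

```scheme
;; Hypothetical split-path analogue built on Guile's dirname/basename:
;; returns the directory part and the last component as two values.
(define (split-path* name)
  (values (dirname name) (basename name)))

;; Hypothetical two-argument build-path analogue.  The separator is a
;; private detail instead of something callers splice in by hand.
(define (join-path dir last)
  (if (string-suffix? "/" dir)
      (string-append dir last)
      (string-append dir "/" last)))

(join-path "/usr/lib" "foo.scm")  ; => "/usr/lib/foo.scm"
```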

Also, we have cases where we want something that looks like a portable
path (for example, naming relative file names in `require') -- for
those we use /-separated strings that are limited to "safe"
characters.  And related, in cases where we want to encode path in
code (for example, some macro that wants to generate a path), we'll
use strings or byte strings, with the latter more common for lower
level things.

(But I'm just rambling now, I haven't slept in N days -- so feel free
to ignore me...)

-- 
  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
http://barzilay.org/   Maze is Life!



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-03 Thread Andy Wingo
On Tue 03 May 2011 00:18, l...@gnu.org (Ludovic Courtès) writes:

>> I still think that we need at least the ability to pass a bytevector as
>> a path name, on GNU systems; and that if we can do so, then any routine
>> that needs to deal with a path name would then need to deal in byte
>> vectors in addition to strings, and at that point perhaps it is indeed
>> useful to have a path library.
>
> To accommodate various file name encodings, right?  Then yes.

That's the crazy thing: file names on GNU aren't in any encoding!  They
are byte strings that may or may not decode to a string, given some
encoding.  Granted, they're mostly UTF-8 these days, but users have the
darndest files...

> I think GLib and the like expect UTF-8 as the file name encoding and
> complain otherwise, so UTF-8 might be a better default than locale
> encoding (and it’s certainly wiser to be locale-independent.)

It's more complicated than that.  Here's the old interface that they
used, which attempted to treat paths as utf-8:

  http://developer.gnome.org/glib/unstable/glib-Character-Set-Conversion.html
  (search for "file name encoding")

The new API is abstract, so it allows operations like "get-display-name"
and "get-bytes":

  http://developer.gnome.org/gio/2.28/GFile.html  (search for "encoding"
  in that page)

  "All GFiles have a basename (get with g_file_get_basename()). These
  names are byte strings that are used to identify the file on the
  filesystem (relative to its parent directory) and there is no
  guarantees that they have any particular charset encoding or even make
  any sense at all. If you want to use filenames in a user interface you
  should use the display name that you can get by requesting the
  G_FILE_ATTRIBUTE_STANDARD_DISPLAY_NAME attribute with
  g_file_query_info(). This is guaranteed to be in utf8 and can be used
  in a user interface. But always store the real basename or the GFile
  to use to actually access the file, because there is no way to go from
  a display name to the actual name."

> So volumes matter in the file name canonicalization of the .go cache
> right?
>
> Couldn’t we mimic /cygdrive/c, etc.?

Is that what cygwin does?  We certainly could, yes; though for the
purposes of joining the cache dir to an absolute filename, I guess we
could simply change c:/foo to /c/foo...  Hum!
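
A minimal sketch of that rewrite, with `drive->directory' as a
hypothetical helper name and made-up file names:

```scheme
;; Hypothetical helper: turn "c:/foo" into "/c/foo" so that the drive
;; letter survives in the cache file name.  Names without a drive
;; prefix pass through unchanged.
(define (drive->directory name)
  (if (and (>= (string-length name) 2)
           (char-alphabetic? (string-ref name 0))
           (char=? (string-ref name 1) #\:))
      (string-append "/" (string (string-ref name 0)) (substring name 2))
      name))

(drive->directory "c:/foo.scm")     ; => "/c/foo.scm"
(drive->directory "d:/foo.scm")     ; => "/d/foo.scm"
(drive->directory "/home/foo.scm")  ; => "/home/foo.scm"
```

Unlike dropping the drive prefix outright, this keeps c:/foo.scm and
d:/foo.scm apart in the cache.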

Andy
-- 
http://wingolog.org/



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-03 Thread Ludovic Courtès
Hi,

Andy Wingo  writes:

> On Tue 03 May 2011 00:18, l...@gnu.org (Ludovic Courtès) writes:
>
>>> I still think that we need at least the ability to pass a bytevector as
>>> a path name, on GNU systems; and that if we can do so, then any routine
>>> that needs to deal with a path name would then need to deal in byte
>>> vectors in addition to strings, and at that point perhaps it is indeed
>>> useful to have a path library.
>>
>> To accommodate various file name encodings, right?  Then yes.
>
> That's the crazy thing: file names on GNU aren't in any encoding!

Yes, that’s POSIX.

>> I think GLib and the like expect UTF-8 as the file name encoding and
>> complain otherwise, so UTF-8 might be a better default than locale
>> encoding (and it’s certainly wiser to be locale-independent.)
>
> It's more complicated than that.  Here's the old interface that they
> used, which attempted to treat paths as utf-8:
>
>   http://developer.gnome.org/glib/unstable/glib-Character-Set-Conversion.html
>   (search for "file name encoding")
>
> The new API is abstract, so it allows operations like "get-display-name"
> and "get-bytes":
>
>   http://developer.gnome.org/gio/2.28/GFile.html  (search for "encoding"
>   in that page)

Interesting.

But when I launch Geeqie there’s a GLib warning when it encounters a
non-UTF-8-encoded name, which basically makes me feel guilty for not
using UTF-8.

>> So volumes matter in the file name canonicalization of the .go cache
>> right?
>>
>> Couldn’t we mimic /cygdrive/c, etc.?
>
> Is that what cygwin does?  We certainly could, yes; though for the
> purposes of joining the cache dir to an absolute filename, I guess we
> could simply change c:/foo to /c/foo...  Hum!

Yes, that should be good enough (but that’s really just for Guile on
MinGW since Guile on Cygwin cannot have this problem, AIUI.)

Thanks,
Ludo’.



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-03 Thread Mark H Weaver
Andy Wingo  writes:
> That's the crazy thing: file names on GNU aren't in any encoding!  They
> are byte strings that may or may not decode to a string, given some
> encoding.  Granted, they're mostly UTF-8 these days, but users have the
> darndest files...
[...]
> On Tue 03 May 2011 00:18, l...@gnu.org (Ludovic Courtès) writes:
>> I think GLib and the like expect UTF-8 as the file name encoding and
>> complain otherwise, so UTF-8 might be a better default than locale
>> encoding (and it’s certainly wiser to be locale-independent.)
>
> It's more complicated than that.  Here's the old interface that they
> used, which attempted to treat paths as utf-8:
>
>   http://developer.gnome.org/glib/unstable/glib-Character-Set-Conversion.html
>   (search for "file name encoding")
>
> The new API is abstract, so it allows operations like "get-display-name"
> and "get-bytes":
>
>   http://developer.gnome.org/gio/2.28/GFile.html  (search for "encoding"
>   in that page)
>
>   "All GFiles have a basename (get with g_file_get_basename()). These
>   names are byte strings that are used to identify the file on the
>   filesystem (relative to its parent directory) and there is no
>   guarantees that they have any particular charset encoding or even make
>   any sense at all. If you want to use filenames in a user interface you
>   should use the display name that you can get by requesting the
>   G_FILE_ATTRIBUTE_STANDARD_DISPLAY_NAME attribute with
>   g_file_query_info(). This is guaranteed to be in utf8 and can be used
>   in a user interface. But always store the real basename or the GFile
>   to use to actually access the file, because there is no way to go from
>   a display name to the actual name."

In my opinion, this is a bad approach to take in Guile.  When developers
are careful to robustly handle filenames with invalid encoding, it will
lead to overly complex code.  More often, when developers write more
straightforward code, it will lead to code that works most of the time
but fails badly when confronted with weird filenames.  This is the same
type of problem that plagues Bourne shell scripts.  Let's please not go
down that road.

There is a better way.  We can do a variant of what Python 3 does,
described in PEP 383.

Basically, the idea is to provide alternative versions of
scm_{to,from}_stringn that allow arbitrary bytevectors to be turned into
strings and back again without any lossage.  These alternative versions
would be used for operations involving filenames et al, and should
probably also be made available to users.

Basically the idea is that "invalid bytes" are mapped to code points
that will never appear in any valid encoding.  PEP 383 maps such bytes
to a range of surrogate code points that are reserved for use in UTF-16
surrogate pairs, and are otherwise considered invalid by Unicode.  There
are other possible mapping schemes as well.  See section 3.7 of Unicode
Technical Report #36 for more
discussion on this.
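
The sketch below is not PEP 383.  It is a much cruder mapping that
shares only the lossless round-trip property: every byte becomes the
character with the same code point (Latin-1 style), so valid UTF-8
sequences are not decoded into the characters they stand for.  Both
procedure names are hypothetical.

```scheme
(use-modules (rnrs bytevectors))

;; Map each byte to the character with the same code point.
(define (bytes->lossless-string bv)
  (list->string (map integer->char (bytevector->u8-list bv))))

;; Inverse mapping; only valid for strings produced by the above.
(define (lossless-string->bytes str)
  (u8-list->bytevector (map char->integer (string->list str))))

;; "foo" followed by two bytes that are not valid UTF-8.
(define odd-name #vu8(102 111 111 255 128))

(equal? odd-name
        (lossless-string->bytes (bytes->lossless-string odd-name)))
;; => #t
```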

I can understand why some say that filenames in GNU are not really
strings but rather bytevectors.  I respectfully disagree.  Filenames,
environment variables, command-line arguments, etc, are _conceptually_
strings.  Let's not muddle that concept just because the transition to
Unicode has not yet been completed in the GNU world.

Hopefully in the future, these old-style POSIX byte strings will once
again become true strings in concept.  All that's required for this to
happen is for popular software to agree to standardize on the use of
UTF-8 for all of these things.  This is reasonably likely to happen at
some point.

In practice, I see no advantage to calling them bytevectors other than
to allow lossless storage of oddball filenames.  It's not as if any sane
user interface is going to display them in hex.  Think about it.  What
are you really going to do with the bytevector version, other than to
store it in case you want to convert it back into a filename,
environment variable, or command-line argument?  Think about the mess
that this will make to otherwise simple code.  Also think about the
obscure bugs that will arise from programmers who balk at this and
simply pass around the strings instead.

Let's keep things simple.  Let's use plain strings for everything that
is _conceptually_ a string.  Let's instead deal with the occasional
ill-encoded-filename by allowing strings to represent these oddballs.

 Best,
  Mark



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-03 Thread Noah Lavine
Hello all,

I have another issue to raise. I think this is actually parallel to
some of the stuff in the (web) module, as you will see.

I've always thought it was ridiculous and hackish that I had to escape
spaces in path strings. For instance, I have a folder called "Getting
a Job" on my desktop, whose path is ~/Desktop/Getting\ a\ Job.

The reason this strangeness enters is that path strings are actually
lists (or vectors) encoded as strings. Conceptually, the path
~/Desktop/Getting\ a\ Job is the list ("~" "Desktop" "Getting a Job").
In this representation, there are no escapes and no separators. It
always seemed cleaner to me to think about it that way. I think there
should be some mechanism by which Guile users never have to think
about escaping spaces (and any other characters they want in their
paths). We don't have to represent them with lists or vectors, but
there should be some mechanism for avoiding this.

I said this is similar to the (web) module because of all of the
discussion there of how HTTP encodes data types in text, and how it's
better to think of a URI as URI type rather than a special string,
etc. I think the same issue applies here - you've got a list (or a list
of lists, if you have a whole command-line with arguments) encoded as
a string using ' ' and '/' as separators, and then you have to escape
those characters when you want to use them in a different way, and the
whole thing gets unnecessarily complicated because the right way to
think about this is as lists of strings.

Noah


Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-04 Thread Ludovic Courtès
Hi Noah,

Noah Lavine  writes:

> The reason this strangeness enters is that path strings are actually
> lists (or vectors) encoded as strings. Conceptually, the path
> ~/Desktop/Getting\ a\ Job is the list ("~" "Desktop" "Getting a Job").
> In this representation, there are no escapes and no separators. It
> always seemed cleaner to me to think about it that way.

Agreed.

However, POSIX procedures deal with strings, so you still need to
convert to a string at some point.  So I think there are few places
where you could really use anything other than strings to represent file
names—unless all of libguile is changed to deal with that, which seems
unreasonable to me.

MIT Scheme’s API goes this route, but that’s heavyweight and can hardly
be retrofitted in a file-name-as-strings implementation, I think.

> I said this is similar to the (web) module because of all of the
> discussion there of how HTTP encodes data types in text, and how it's
> better to think of a URI as URI type rather than a special string,
> etc.

Yes.

Thanks,
Ludo’.



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-17 Thread Noah Lavine
Hello all,

I've been scanning some file api documentation and wondering what we
could do that would translate across platforms reliably. I've been
thinking of sort of concentric circles of operations, where the inner
circles can easily be supported in a cross-platform way, and the outer
ones require more and more hackery. What do you think of the
following?

Group 1: Treat pathnames as opaque objects that come from outside APIs
and can only be used by passing them to APIs. We can support these in
a way that will be compatible everywhere.
Operations: open file, close file, stat file.
In order to be useful, we might also provide a
"command-line-argument->file" operation, but probably no reverse
operation.

Group 2: Treat pathnames as vectors of opaque path components.
Operations: list items in a directory

Group 3: now we need to care about encoding
Operations: string->path, path->string.
This will be much harder than groups 1 and 2.

I think group 1 by itself would allow for most command-line programs
that people want to write. If you add group 2, you could write find,
ls, cat, and probably others. You need group 3 to write grep and a web
server.

My thought right now is that group 3 is going to have a complex API if
we really want to get encodings right. Our goal should be that this
complexity doesn't affect group 1 and group 2, which really should
have very simple APIs.

Now, some thoughts on group 3:
Mark is right that paths are basically just strings, even though
occasionally they're not. I sort of like the idea of the PEP-383
encoding (making paths strings that can potentially contain unused
codepoints, which represent non-character bytes), but would that make
path strings break under some Guile string operations?

Also, when we convert strings to paths, we need to know what encoding
the local filesystem uses. That will usually be UTF-8, but potentially
might not be, correct? If we can auto-discover the correct encoding,
we might be able to keep all of that in the background and just
pretend that we can convert Guile strings to file system paths in a
clean way.

Noah



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-17 Thread Mark H Weaver
Hi Noah,

Thanks for thinking about this thorny issue.

Noah Lavine writes:
> Group 1: Treat pathnames as opaque objects that come from outside APIs
> and can only be used by passing them to APIs. We can support these in
> a way that will be compatible everywhere.
> Operations: open file, close file, stat file.
> In order to be useful, we might also provide a
> "command-line-argument->file" operation, but probably no reverse
> operation.

Unfortunately, we'd need more than just that one operation.  What if you
need to run an external command on a filename received from readdir?
For this you need `file->command-line-argument'.  What if you need to
put that filename into an environment variable?  Then you need
`file->environment-variable-value'.

What if you want to use an environment variable's value (which contains
a filename) to either open the file directly or call an external command
on it?  For this you need `environment-value->file' or
`environment-variable-value->command-line-argument'.

What if you want to put a command-line argument into an environment
variable?
For this you need `command-line-argument->environment-variable-value'.

What if you want to split the PATH environment variable (or another one
like it) up into components, and then use those components to either
read those component directories from scheme, or run external commands
on those components, or put the components into environment variables?

Also, what are we to do about backward-compatibility for all of the
existing POSIX interfaces in Guile which have always returned strings?
What are we to do with procedures like `program-arguments',
`command-line', `environ', `getenv', `readdir' and `passwd:dir'?

What can we pass to `main' that would both incorporate this new distinct
command-line-argument type and maintain backward compatibility with
scripts that expect strings?

Best,
 Mark



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-17 Thread Mark H Weaver
Noah Lavine writes:
> Mark is right that paths are basically just strings, even though
> occasionally they're not. I sort of like the idea of the PEP-383
> encoding (making paths strings that can potentially contain unused
> codepoints, which represent non-character bytes), but would that make
> path strings break under some Guile string operations?

Yes, this is indeed a problem.  Instead of using isolated surrogate code
points as recommended by PEP-383, I think we should instead use one of
the alternative mappings proposed in section 3.7.4 of Unicode Technical
Report #36:

1. Use 256 private-use code points, somewhere in the ranges F0000..FFFFD
   or 100000..10FFFD. This would probably cause the fewest security and
   interoperability problems. There is, however, some possibility of
   collision with other uses of private-use characters.

2. Use pairs of noncharacter code points in the range FDD0..FDEF. These
   are "super" private-use characters, and are discouraged for general
   interchange. The transformation would take each nibble of a byte Y,
   and add to FDD0 and FDE0, respectively. However, noncharacter code
   points may be replaced by U+FFFD ( � ) REPLACEMENT CHARACTER by some
   implementations, especially when they use them internally. (Again,
   incoming characters must never be deleted, because that can cause
   security problems.)
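
To make the two mappings concrete, here is a minimal sketch that works on
code point integers only.  Nothing here is an existing Guile API: all
procedure names are hypothetical, the base #xF0000 for option 1 is an
arbitrary choice within the private-use range, and for option 2 the
assignment of the high nibble to FDD0 and the low nibble to FDE0 is one
reading of "respectively".

```scheme
;; Option 1 sketch: one ill-formed byte -> one private-use code point.
;; Assumption: the 256 code points start at #xF0000.
(define private-use-base #xF0000)

(define (byte->private-use byte)
  (+ private-use-base byte))

(define (private-use->byte code-point)
  (- code-point private-use-base))

;; Option 2 sketch: one ill-formed byte -> two noncharacter code points.
;; Assumption: high nibble is added to #xFDD0, low nibble to #xFDE0.
(define (byte->noncharacter-pair byte)
  (list (+ #xFDD0 (ash byte -4))          ; high nibble, FDD0..FDDF
        (+ #xFDE0 (logand byte #x0F))))   ; low nibble,  FDE0..FDEF

(define (noncharacter-pair->byte high low)
  ;; Inverse of the mapping above: recombine the two nibbles.
  (logior (ash (- high #xFDD0) 4)
          (- low #xFDE0)))
```

Both mappings are reversible for every byte value, which is the property
needed to hand the original byte sequence back to POSIX unchanged.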

> Also, when we convert strings to paths, we need to know what encoding
> the local filesystem uses. That will usually be UTF-8, but potentially
> might not be, correct?

Yes, that is correct.  I haven't looked deeply into this, but clearly a
lot of software uses the current locale encoding to interpret these
POSIX byte strings, and I suspect at least some software uses UTF-8 to
interpret filenames.  Fortunately, most popular modern distributions of
GNU are now using UTF-8 locales by default, which basically makes the
problem disappear.

Regardless, this method of mapping ill-formed byte sequences to
private-use code points can be used with _any_ encoding, not just UTF-8.

Best,
 Mark



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-20 Thread Jan Nieuwenhuizen
Andy Wingo writes:

> I don't much like this approach.  Besides mixing in a heuristic on all
> machines that is win32-specific, it makes c:/foo.scm collide with
> d:/foo.scm in the cache, and fails to also modify load.c which also does
> autocompilation in other contexts.

Yes, a newer version of this patch also includes load.c and boot-9.scm.
Of course the drive letter should be kept.

> Is anyone interested in implementing a path library?

What's the status/estimate on this -- of course I agree this would be
nicer, otoh, a patch to these three files is available that makes guile
run on mingw right now.

Greetings, Jan

-- 
Jan Nieuwenhuizen  | GNU LilyPond http://lilypond.org
Freelance IT http://JoyofSource.com | Avatar®  http://AvatarAcademy.nl



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-05-20 Thread Andy Wingo
Hi Jan,

On Fri 20 May 2011 15:47, Jan Nieuwenhuizen writes:

>> Is anyone interested in implementing a path library?
>
> What's the status/estimate on this -- of course I agree this would be
> nicer, otoh, a patch to these three files is available that makes guile
> run on mingw right now.

Unclear :)  I think the thing you did for your autobuilder was the right
strategy for you and for lilypond, but that we can do even better in
Guile if we give ourselves a bit of time.

Andy
-- 
http://wingolog.org/



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-06-16 Thread Andy Wingo
Hi,

This discussion strayed a bit far from the initial need to concatenate
"/foo/bar" with "c:/baz/qux".

On Tue 03 May 2011 09:44, Andy Wingo writes:

> On Tue 03 May 2011 00:18, l...@gnu.org (Ludovic Courtès) writes:
>
>> So volumes matter in the file name canonicalization of the .go cache
>> right?
>>
>> Couldn’t we mimic /cygdrive/c, etc.?
>
> Is that what cygwin does?  We certainly could, yes; though for the
> purposes of joining the cache dir to an absolute filename, I guess we
> could simply change c:/foo to /c/foo...  Hum!

MSYS apparently does this as well.  Probably it's what we should do in
the case of caches.  But this sort of thing is very nasty without a path
library :-/

Andy
-- 
http://wingolog.org/



Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.

2011-06-30 Thread Andy Wingo
On Fri 20 May 2011 15:47, Jan Nieuwenhuizen writes:

> Andy Wingo writes:
>
>> I don't much like this approach.  Besides mixing in a heuristic on all
>> machines that is win32-specific, it makes c:/foo.scm collide with
>> d:/foo.scm in the cache, and fails to also modify load.c which also does
>> autocompilation in other contexts.
>
> Yes, a newer version of this patch also includes load.c and boot-9.scm.
> Of course the drive letter should be kept.

I have applied a similar patch to Guile, before realizing you had a
newer version.  Sorry for the huge delay here.

[I said:]
>> Is anyone interested in implementing a path library?

I don't think this would have helped very much in this case, given that
taking an absolute path and turning it into a path suffix is not
something that a path library is really good for.  In reality all we
need is a key that corresponds in a 1-to-1 relationship with a source
file -- a SHA1 hash would have done as well.  But oh well, c:/foo ->
/c/foo it is!
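
For illustration only, the c:/foo -> /c/foo rewrite might look like the
sketch below.  It is not the patch that was applied:
`file-name->cache-suffix' is a hypothetical name, and the drive-letter
test is the same kind of heuristic objected to earlier, so real code would
probably restrict it to Windows builds.

```scheme
;; Hypothetical helper, not the code applied to Guile.
;; Turns "c:/foo" into "/c/foo" so it can be appended to a cache
;; directory; names without a drive-letter prefix are returned unchanged.
(define (file-name->cache-suffix name)
  (if (and (>= (string-length name) 2)
           (char-alphabetic? (string-ref name 0))
           (char=? (string-ref name 1) #\:))
      (string-append "/"
                     (string (string-ref name 0))  ; keep the drive letter
                     (substring name 2))           ; drop the ":"
      name))

(file-name->cache-suffix "c:/foo.scm")
```

Because the drive letter is kept, c:/foo.scm and d:/foo.scm map to
different suffixes and no longer collide in the cache.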

Andy
-- 
http://wingolog.org/