> -----Original Message-----
> From: Fleshgrinder [mailto:p...@fleshgrinder.com]
> Sent: Saturday, April 1, 2017 2:43 PM
> To: Anatol Belski <weltl...@outlook.de>; Rasmus Schultz
> <ras...@mindplay.dk>
> Cc: PHP internals <internals@lists.php.net>
> Subject: Re: [PHP-DEV] Directory separators on Windows
>
> On 4/1/2017 2:01 PM, Anatol Belski wrote:
> > 1. optionally - yes, otherwise it should do platform default
> > 2. no, this kind of operation is a pure parsing, no I/O related checks
> >    needed
> > 3. irrelevant, but can be defined
> >
> > Other points yet I'd care about
> > - result should be correct for target platform disregarding actual
> >   platform, fe target Linux path Windows, or Windows path on Mac, etc.
> > - validation, particularly for reserved words and chars, also other
> >   platform aspects
> > - encodings have to be respected, or UTF-8 only, to define
> > - probably should be compatible with PHP stream wrapper namespaces
> >
> >
> > Thanks
> >
> > Anatol
> >
>
> 1. How do you envision that? If the path is `/a/b/../c` where only `/a`
> exists right now? It's unresolvable, assuming that `../` points to `/a` is
> wrong if `b/` is a symbolic link that points to `/x/y`.
>
> 2. Here I agree, casing cannot be decided without hitting the filesystem.
> Some are case-sensitive, some insensitive, and others configurable.
>

Basically, it is the same as your points 8., 9. and 10. - it deals with the
given path itself, so no symlinks, etc. The snippet /a/b/../c is parsed as
follows:
- parse up to /a/b/../
- scroll back to /a
- append the remainder so it becomes /a/c

A similar process applies to /a/./b, which would become /a/b, and to others.
It is string traversal only. What is done in dirname() uses this approach. In
general one can say that normalization is a path simplification, with no drive
access like realpath() does. For example, it lets one know that the path
itself would be correct before it comes to the actual file operation, and not
bother with I/O otherwise.

> 3. Does not matter for Windows itself, it is case-insensitive.
>
> (I continue the numbering for the points you raised.)
>
> 4. How would we go about normalizing a Windows path to POSIX? `C:\a` is not
> necessarily the same as `/a`, or should it produce `C:/a`?
>

As mentioned in an earlier post, it might make sense to have flags to control
the behavior. Maybe a signature like

    string canonicalize_path(string $path, int $flags = 0);

The function of course knows the current platform. Flags like
PATH_TARGET_WINDOWS | PATH_UNIXIFY would control the path separator behavior.
Generally, regarding paths without a drive letter - on Windows I'd strongly
advise not to use them in configs, etc. because of the multiple root issues
mentioned already. But in principle, say one has the same FS structure on
different platforms and just wants to mirror it, that would be ok with flags
like PATH_TARGET_LINUX | PATH_STRIP_DRIVE, as Linux implies forward slashes.
Or, for the reverse case - generating a path on Linux that is to be used on
Windows - the flags might contain only PATH_TARGET_WINDOWS, which would
produce backslashes as the system default. Maybe that's too much or unrelated,
and only platform targets should be provided, dunno, just a mind game for now.

> 5. 👍
>
> 6. I vote for UTF-8 only. We already have locale dependent filesystem
> functions, which also makes them kind of weird to use, especially in
> libraries. Another very important aspect to take care of this point is
> normalization forms.
> Filesystems generally store stuff as is, that means that we can create two
> files with the same name, at least by the looks of it, which are actually
> different ones. Think of `ä` which can also be `ä`. It is generally most
> advisable to stick to NFC, because that is also how users usually produce
> those chars.
>

Yeah, UTF-8 would probably be the simplest for a cross-platform
implementation. Regarding the encoding variant - that's where more care would
be needed. E.g. see https://github.com/aws/aws-cli/issues/1639 , that's where
we would care about PATH_TARGET_MAC specific things. A comparable situation is
where you want to escapeshell* something, but the result will be invalid on
another platform or possibly with another shell, given how it currently works.

> 7. 👍 just forward I'd say.
>
> 8. Collapse multiple separators (e.g. `a//b` ~> `a/b`).
>
> 9. Resolve self-references, unless they are leading (e.g. `a/./b` ~> `a/b`
> but `./a/b` stays `./a/b`).
>
> 10. Trim separators from the end (e.g. `a/` ~> `a`).
>

These last 3 points, as well as the one above, are canonicalization. Of
course, in the imaginary function, it could be decoupled with something like
PATH_NO_CANONIC if it's not wanted, or PATH_CANONICALIZE_ONLY to omit other
conversions. It's only about having sensible behaviors. Possible other flags
could be, e.g., PATH_STRIP_TRAILING_SLASH, PATH_ALLOW_RELATIVE and other fine
things. But by default, the function should do the default thing for the
target platform, based on the current platform. Thus, producing NFD for Mac
and NFC otherwise, backslash for Windows and forward slash otherwise, and
other things that will for sure pop up. As mentioned earlier, this still
requires some re-implementation of the platform APIs, even if we talk about
slashes only - for ASCII paths I'm not sure we can even differentiate the
UTF-8 encoding forms without involving yet another library, so this might be
tricky.

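To make the string-only processing more concrete, here is a minimal sketch of
points 8. to 10. plus the ".." handling, under several assumptions: the
function name canonicalize_path_sketch() and the numeric flag values are made
up for illustration, only three of the flags named above are handled, both
slash kinds are treated as separators on input, and no encoding or
reserved-name handling is attempted. It is not an existing PHP API.

```php
<?php
// Hypothetical flags: the names come from this thread, the values are invented.
const PATH_TARGET_LINUX   = 1;
const PATH_TARGET_WINDOWS = 2;
const PATH_STRIP_DRIVE    = 4;

// String-only simplification: no filesystem access, no symlink resolution.
function canonicalize_path_sketch(string $path, int $flags = 0): string
{
    // Default to the current platform's separator unless a target is given.
    $sep = DIRECTORY_SEPARATOR;
    if ($flags & PATH_TARGET_WINDOWS) {
        $sep = '\\';
    } elseif ($flags & PATH_TARGET_LINUX) {
        $sep = '/';
    }

    // Assumption: both slash kinds are separators on input.
    $unified = str_replace('\\', '/', $path);

    // Split off an optional drive prefix such as "C:".
    $drive = '';
    if (preg_match('~^[A-Za-z]:~', $unified, $m)) {
        $drive = ($flags & PATH_STRIP_DRIVE) ? '' : $m[0];
        $unified = substr($unified, 2);
    }

    $absolute = $unified !== '' && $unified[0] === '/';
    $out = [];
    foreach (explode('/', $unified) as $i => $part) {
        if ($part === '') {
            continue;               // 8. and 10.: collapse "a//b", drop trailing "/"
        }
        if ($part === '.') {
            if ($i === 0) {
                $out[] = '.';       // 9.: a leading "./" stays
            }
            continue;
        }
        if ($part === '..') {
            $last = end($out);
            if ($last !== false && $last !== '..' && $last !== '.') {
                array_pop($out);    // "scroll back" one segment
            } elseif (!$absolute) {
                $out[] = '..';      // nothing to scroll back to in a relative path
            }
            continue;               // ".." directly under the root is dropped
        }
        $out[] = $part;
    }

    $result = $drive . ($absolute ? $sep : '') . implode($sep, $out);
    return $result === '' ? '.' : $result;
}
```

With these assumptions, canonicalize_path_sketch('/a/b/../c', PATH_TARGET_LINUX)
gives /a/c, 'a//b' gives a/b, './a/b' stays ./a/b, and 'C:\a\b' with
PATH_TARGET_LINUX | PATH_STRIP_DRIVE gives /a/b. Whether ".." above the root
should be dropped or rejected is one of the things that would have to be
defined.
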
Simply exposing that part of the realpath() processing might solve several
things for one given platform, that's for sure. The initial case Rasmus
reported was about cross-platform handling, but the topic is indeed slightly
bigger than just path separators, so IMO the sensible way would be to take
care of a cross-platform approach. I have no info on how relevant such
cross-platform path issues really are, so that might be another thing to
investigate before one starts any implementation. At least grouping some cases
and thoughts, maybe as an RFC, could be good to track the topic.

Thanks

Anatol