New submission from Barney Gale <barney.g...@gmail.com>:

Capturing a write-up by eryksun on GitHub into a new bug. Link:
https://github.com/python/cpython/pull/25264#pullrequestreview-631787754

> `nt._getfinalpathname()` opens a handle to a file/directory with
> `CreateFileW()` and calls `GetFinalPathNameByHandleW()`. The latter makes a
> few system calls to get the final opened path in the filesystem (e.g.
> "\Windows\explorer.exe") and the canonical DOS name of the volume device on
> which the filesystem is mounted (e.g. "\Device\HarddiskVolume2" -> "\\?\C:")
> in order to return a canonical DOS path (e.g. "\\?\C:\Windows\explorer.exe").
>
> Opening a handle with `CreateFileW()` entails first getting a fully-qualified
> and normalized NT path, which, among other things, entails resolving ".."
> components naively in the path string. This does not take reparse points such
> as symlinks and mountpoints into account. The only time Windows parses ".."
> components in an opened path the way POSIX does is in the kernel when they're
> in the target path of a relative symlink.
>
> `nt.readlink()` opens a handle to the file with the flag
> `FILE_FLAG_OPEN_REPARSE_POINT`. If the final path component is a reparse
> point, it opens it instead of traversing it. Then the reparse point is read
> with the filesystem control request, `FSCTL_GET_REPARSE_POINT`. System
> symlinks and mountpoints (`IO_REPARSE_TAG_SYMLINK` and
> `IO_REPARSE_TAG_MOUNT_POINT`) are the only supported name-surrogate
> reparse-point types, though `os.stat()` and `os.lstat()` handle all
> name-surrogate types as 'links'. Moreover, only symlinks get the `S_IFLNK`
> mode flag in a stat result, because they're the only ones we can create with
> `os.symlink()` to satisfy the usage
> `if os.path.islink(src): os.symlink(os.readlink(src), dst)`.
>
> > What would it take to do a POSIX-style "normalize as we resolve",
> > and would we want to?
> > I guess we'd need to call nt._getfinalpathname()
> > on each path component in turn (C:, C:\Users, C:\Users\Barney etc),
> > which from my pretty basic Windows knowledge might be rather slow if
> > that involves file handles.
>
> You asked, so I decided to write up an outline of what implementing a
> POSIX-style `realpath()` might look like in Windows. At its core, it's
> similar to POSIX: lstat(), and, for a symlink, readlink() and recur. The
> equivalent calls in Windows are the following:
>
>   * `CreateFileW()` (open a handle)
>
>   * `GetFileInformationByHandleEx()`: `FileAttributeTagInfo`
>
>   * `DeviceIoControl()`: `FSCTL_GET_REPARSE_POINT`
>
> A symlink has the reparse tag `IO_REPARSE_TAG_SYMLINK`.
>
> Filesystem mountpoints (aka junctions, which are like Unix bind mountpoints)
> must be retained in the resolved path in order to correctly resolve relative
> symlinks such as "\spam" (relative to the resolved device) and "..\..\spam".
> Anyway, this is consistent with the UNC case, since mountpoints on a remote
> server can never be resolved (i.e. a final UNC path never resolves
> mountpoints).
>
> Here are some of the notable differences compared to POSIX:
>
>   * If the source path is not a "\\?\" verbatim path, `GetFullPathNameW()`
>     must be called initially. However, ".." components in the target path of a
>     relative symlink must be resolved the POSIX way, else symlinks in the target
>     path may be removed incorrectly before their target is resolved (e.g.
>     "foo\symlink\..\bar" incorrectly resolved as "foo\bar"). The opened path is
>     initially normalized as follows:
>
>       * replace forward slashes with backslashes
>       * collapse repeated backslashes (except the UNC root must have exactly
>         two backslashes)
>       * resolve a relative path (e.g. "spam"), drive-relative path (e.g.
>         "Z:spam"), or rooted path (e.g. "\spam") as a fully-qualified path (e.g.
>         "Z:\eggs\spam")
>       * resolve "." and ".."
>         components in the opened path (naive to symlinks)
>       * strip trailing spaces and dots from the final component (e.g.
>         "C:\spam. . ." -> "C:\spam")
>       * resolve reserved device names in the final component of a non-UNC
>         path (e.g. "C:\nul" -> "\\.\nul")
>
>   * Substitute drives (e.g. created by "subst.exe", or `DefineDosDeviceW`)
>     and mapped drives (e.g. created by "net.exe", or `WNetAddConnection2W`) must
>     be resolved, respectively via `QueryDosDeviceW()` and
>     `WNetGetUniversalNameW()`. Like all DOS 'devices', these drives are
>     implemented as object symlinks (i.e. symlinks in the object namespace, not to
>     be confused with filesystem symlinks). The target path of these drives,
>     however, is not a Device object, but rather a filesystem path on a device
>     that can include any number of path components, some of which may be
>     filesystem symlinks that need to be resolved. Normally when a path is opened,
>     the system object manager reparses all DOS 'devices' to the path of an actual
>     Device object, or a path on a Device object, before the I/O manager's parse
>     routine ever sees the path. Such drives need to be resolved whenever parsing
>     starts or restarts at a drive, but the result can be cached in case multiple
>     filesystem symlinks target the same drive.
>
>       * Substitute drives can target paths on other substitute drives, so
>         `QueryDosDeviceW()` has to be called in a loop that accumulates the tail path
>         components until it reaches a real device (i.e. a target path that doesn't
>         begin with "\??\").
>       * `WNetGetUniversalNameW()` has to be called after resolving substitute
>         drives. It resolves the underlying UNC path of a mapped drive. The target
>         path of the object symlink that implements a mapped drive is of the form
>         "\Device\<redirector device name>\;<something>\server\share\some\filesystem\path".
>         The "redirector device name" component is usually (post Windows Vista)
>         an object symlink to a path on the system's Multiple UNC Provider (MUP)
>         device, "\Device\Mup". The mapped-drive target path ultimately resolves
>         to a redirected filesystem that's mounted in the MUP device namespace at
>         the "share" name. This is an implementation detail of the filesystem
>         redirector and MUP device, which the Multiple Provider Router (MPR) WNet
>         API encapsulates. For example, for the mapped drive path "Z:\spam\eggs",
>         it returns a UNC path of the form
>         "\\server\share\some\filesystem\path\spam\eggs".
>
>   * A join that tries to resolve ".." against the drive or share root path
>     must fail, whereas this is ignored for the root path in POSIX. For example,
>     `symlink_join("C:\\", "..\\spam")` must fail, since the system would fail an
>     open that tried to reparse that symlink target.
>
>   * At the end, the resolved path should be tested to try to remove "\\?\"
>     if the source path didn't have this prefix. Call `GetFullPathNameW()` to
>     check for a reserved name in the final component and
>     `PathCchCanonicalizeEx()` to check for long-path support. (The latter calls
>     the system runtime library function `RtlAreLongPathsEnabled`, but that's an
>     undocumented implementation detail.)
>
> `GetFinalPathNameByHandleW()` is not required. Optionally, it can be called
> for the last valid component if the caller wants a final path with all
> mountpoints resolved, i.e. add a `final_path=False` option. Of course, a
> final UNC path must retain mountpoints, so there's nothing we can do in that
> case. It's fine that this `realpath()` implementation would return a path
> that contains mountpoints in Windows (as the current implementation also does
> for UNC paths). They are not symlinks, and this matches the behavior of POSIX.
>
> I'd include a warning in the documentation that getting a final path via
> `GetFinalPathNameByHandleW()` in the non-strict case may be dysfunctional.
> The unresolved tail end of the path may become valid again if a server or
> device comes back online. If the unresolved part contains symlinks with
> relative targets such as "\spam" and "..\..\spam", and the `realpath()` call
> resolved away mountpoints, the remaining path may not resolve correctly
> against the final path, as compared to how it would resolve against the
> original path. It definitely will not resolve the same for a rooted target
> path such as "\spam" if the last resolved reparse point in the original path
> was a mountpoint, since it will reparse to the root path of the mountpoint
> device instead of the original opened device, or instead of the last resolved
> device of a symlink in the path.

----------
components: Library (Lib)
messages: 391804
nosy: barneygale
priority: normal
severity: normal
status: open
title: os.path.realpath() normalizes paths before resolving links on Windows
versions: Python 3.10, Python 3.11, Python 3.6, Python 3.7, Python 3.8, Python 3.9

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43936>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
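
For illustration only, here is a minimal sketch of the POSIX-style walk outlined in the write-up above. It is not CPython code. It assumes the input path is already fully qualified and normalized with backslashes (the job `GetFullPathNameW()` does), and that symlink targets are relative, rooted, or fully qualified. The names `split_root()`, `resolve()` and the `readlink` callback are hypothetical; the callback stands in for the `CreateFileW()` / `FSCTL_GET_REPARSE_POINT` sequence and returns `None` for anything that is not a symlink, so mountpoints simply stay in the result. Substitute drives, mapped drives and the final "\\?\" handling are left out.

```python
import ntpath


def split_root(path):
    """Split a fully-qualified path into (root, [components])."""
    drive, tail = ntpath.splitdrive(path)
    return drive + "\\", [p for p in tail.split("\\") if p]


def resolve(path, readlink, max_links=63):
    """Resolve symlinks component by component, applying '..' only
    against components that are already known not to be symlinks.

    `readlink(p)` is a hypothetical callback: it returns the target
    string of the symlink at `p`, or None if `p` is not a symlink.
    """
    root, pending = split_root(path)
    pending.reverse()          # use as a stack: next component on top
    resolved = []              # components already checked
    links = 0
    while pending:
        name = pending.pop()
        if name == ".":
            continue
        if name == "..":
            if not resolved:
                # '..' against the drive or share root must fail
                raise OSError(f"'..' would escape the root {root!r}")
            resolved.pop()
            continue
        target = readlink(root + "\\".join(resolved + [name]))
        if target is None:
            resolved.append(name)
            continue
        links += 1
        if links > max_links:
            raise OSError("too many levels of symbolic links")
        drive, tail = ntpath.splitdrive(target)
        if drive:
            # fully-qualified target: restart at the target's own root
            root, resolved = drive + "\\", []
        elif tail.startswith("\\"):
            # rooted target such as "\spam": restart at the current root
            resolved = []
        # otherwise the target is relative to the link's parent, which
        # is exactly what `resolved` holds at this point
        pending.extend(reversed([p for p in tail.split("\\") if p]))
    return root + "\\".join(resolved)


if __name__ == "__main__":
    # Made-up link table standing in for a real filesystem.
    LINKS = {
        r"C:\link": r"foo\symlink\..\bar",
        r"C:\foo\symlink": r"C:\data\deep",
    }
    # Naive normalization drops "symlink" before it is ever resolved.
    print(ntpath.normpath(r"C:\foo\symlink\..\bar"))  # C:\foo\bar
    # The component-wise walk follows "symlink" first, then applies "..".
    print(resolve(r"C:\link", LINKS.get))             # C:\data\bar
```

With the same made-up table, a link whose target is "..\spam" placed directly under "C:\" makes `resolve()` raise `OSError`, which mirrors the `symlink_join("C:\\", "..\\spam")` failure case described in the write-up.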