[chromium-dev] Re: Changes to FilePath?
On Wed, May 13, 2009 at 7:24 PM, Brett Wilson wrote:
> You can't actually canonicalize a filename on Windows, so I think it's
> dangerous to write a component that claims to do it.

You can do it under controlled conditions, especially if the file already exists on disk and is accessible. For instance, if you don't try to handle NTFS mount points or the (non-deterministic) 8.3 names of files that don't exist yet or no longer exist, I think you can fairly safely apply the "regular" rules to canonicalize paths (and even if you applied the rules to those cases, most of the time they would still work). I would make sure that the class only claims to canonicalize paths that it really knows it can handle, of course.

Look, I know there are tough problems here, but why not TRY to solve them as well as possible?

FilePath is fine for simple manipulations, and it is a good, lightweight container if you're not planning on doing anything complex with the file names. If you actually need to do more interesting things with them, like display the names, convert to relative paths, compare them for equality, or pass them off to a third party in a particular encoding, it's not sufficient.

I could write a half-assed implementation that kinda works if you don't throw anything wonky at it. I've got that now. I want something more bulletproof. It can't be perfect because file paths are non-deterministic on all three systems in not-so-obvious ways, but why should everyone who needs more than FilePath have to climb that learning curve? And we can only give out information that is as good as what we get from the OS -- if the OS isn't able to present a filesystem that makes sense, we can only provide the best gibberish we can get our hands on.

-Greg.

--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com
View archives, change email options, or unsubscribe:
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---
[chromium-dev] Re: Changes to FilePath?
I mean... there's a registry setting or something that can be set to disable it.

-darin

On Wed, May 13, 2009 at 8:40 PM, Darin Fisher wrote:
> FYI: Don't use GetShortPathName. It isn't supported on some Windows
> systems. We had a significant number of users that could not use Firefox
> until we stopped using it.
> -Darin
[chromium-dev] Re: Changes to FilePath?
FYI: Don't use GetShortPathName. It isn't supported on some Windows systems. We had a significant number of users that could not use Firefox until we stopped using it.

-Darin

On Wed, May 13, 2009 at 7:29 PM, Brett Wilson wrote:
> I guess you could call GetShortPathName every time you see a name. But
> I think that's a crazy solution. I still think you should do my
> suggestion below.
>
>> I think you just need to come up with some simple rules that make it
>> work most of the time. Personally I would do ASCII lowercasing and
>> stop worrying about it. If you use ICU to lower-case "correctly,"
>> Windows won't necessarily agree and you won't be able to use that
>> file.
[chromium-dev] Re: Changes to FilePath?
On Wed, May 13, 2009 at 7:24 PM, Brett Wilson wrote:
> You can't actually canonicalize a filename on Windows, so I think it's
> dangerous to write a component that claims to do it.

I guess you could call GetShortPathName every time you see a name. But I think that's a crazy solution. I still think you should do my suggestion below.

> I think you just need to come up with some simple rules that make it
> work most of the time. Personally I would do ASCII lowercasing and
> stop worrying about it. If you use ICU to lower-case "correctly,"
> Windows won't necessarily agree and you won't be able to use that
> file.
[chromium-dev] Re: Changes to FilePath?
On Wed, May 13, 2009 at 6:12 PM, Greg Spencer wrote:
> I do need to know if I have the same file more than once in a set because
> the COLLADA file might reference the same texture multiple times, or (more
> dangerous) it might reference a file that is one file on Windows,
> but (incorrectly) maps to two different files in the (Unix-path-format) .tgz
> files. To detect that, I need canonicalization.

You can't actually canonicalize a filename on Windows, so I think it's dangerous to write a component that claims to do it.

I think you just need to come up with some simple rules that make it work most of the time. Personally I would do ASCII lowercasing and stop worrying about it. If you use ICU to lower-case "correctly," Windows won't necessarily agree and you won't be able to use that file.

Brett
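
The "ASCII lowercasing" rule suggested above can be sketched as follows. This is only an illustration, not code from Chromium or O3D: the names `ascii_lower_key` and `dedupe_paths` are hypothetical, and the sketch assumes paths arrive as Python strings. It folds only A-Z and leaves every other character alone, which is the point of avoiding a full Unicode lower-casing that the filesystem might disagree with.

```python
def ascii_lower_key(path: str) -> str:
    """Return a comparison key that lowercases only ASCII A-Z.

    Non-ASCII characters are left untouched on purpose, so the key never
    depends on Unicode case-folding rules.
    """
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in path
    )


def dedupe_paths(paths):
    """Keep the first spelling seen for each ASCII-case-insensitive key."""
    seen = {}
    for p in paths:
        # setdefault stores p only if the key has not been seen before.
        seen.setdefault(ascii_lower_key(p), p)
    return list(seen.values())
```

Such a key is a heuristic for catching duplicate references in a set; it does not prove that two names refer to the same file on disk.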
[chromium-dev] Re: Changes to FilePath?
On Wed, May 13, 2009 at 4:07 PM, Brett Wilson wrote:
> I think this is very dangerous.
>
> I think Greg should not be talking to the filesystem when inserting
> filenames into a set. We don't allow filesystem access from the UI
> thread of Chrome, and I think other parts of our system should also
> not do filesystem access on their critical threads, especially if they
> want to be more part of Chrome in the future.

Well, so the use I have for this in O3D at the moment is in our importer, which currently is a separate command-line tool that reads Collada files and writes out our wire format for geometry. So it isn't meant to be occurring in a UI thread, but I could see times when it might be useful to know for sure if two files reference the same file in the UI thread (dragging and dropping a file onto a drop zone, for instance).

I do need to know if I have the same file more than once in a set because the COLLADA file might reference the same texture multiple times, or (more dangerous) it might reference a file that is one file on Windows, but (incorrectly) maps to two different files in the (Unix-path-format) .tgz files. To detect that, I need canonicalization.

I also need to convert paths in the Collada file to relative paths in our tgz files. In order to do that, I need to be able to normalize the path to the Collada file so I can normalize the paths to the referenced texture files and strip off common base directories.

I'd really like to avoid the filesystem access too -- it's a real pain in the ass to do, which is why it hasn't been done yet. Currently, the user has to tell me the string to strip off of the pathnames to make them relative, and if files collide or split, then the output is just 2x bigger, or just doesn't work. I'd like to fix those things, but to do it right, I need a better set of tools, and it seemed to me that if I was needing these tools, then someone else could use them too.

-Greg.
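
The "normalize, then strip off common base directories" step described above might look roughly like this. It is a purely lexical sketch under stated assumptions: the function name `relative_to_common_base` and the example paths are hypothetical, all inputs are assumed to be absolute POSIX-style paths, and `..` is collapsed textually, so it shares the symlink limitation of `os.path.normpath` raised elsewhere in this thread.

```python
import posixpath


def relative_to_common_base(collada_path, texture_paths):
    """Return (base, relative_texture_paths) without touching the disk.

    The base is the deepest directory shared by the Collada file's
    directory and every referenced texture.
    """
    collada_dir = posixpath.dirname(posixpath.normpath(collada_path))
    textures = [posixpath.normpath(p) for p in texture_paths]
    # commonpath compares whole path components, not raw string prefixes.
    base = posixpath.commonpath([collada_dir, *textures])
    return base, [posixpath.relpath(p, base) for p in textures]
```

For example, with a Collada file at `/proj/scene/model.dae` and textures at `/proj/textures/wood.png` and `/proj/scene/../textures/brick.png`, the base comes out as `/proj` and both textures become paths under `textures/`.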
[chromium-dev] Re: Changes to FilePath?
On Wed, May 13, 2009 at 4:35 PM, Darin Fisher wrote:
> The "solution" is to not convert to UTF-16 unless you are trying to
> generate a string to display to the user. Then you should use the LANG
> information to determine how best to render the text for display to the
> user.

Yeah, that would be nice, and I agree, but the reason I need it is that some third party APIs (probably wrongly) take UTF16 to represent an input file in their API. So in order for the third party API to load the file properly, I need a UTF16 version of the file path.

Also, in all of the O3D code, we assume that strings are encoded in UTF8 (which is fine and correct for any string except for filenames on Linux), so any string that might come from the user would come in as UTF8, and I'd have to translate it into a FilePath (somehow).

> I know this doesn't really help. I think it is reasonable to have a
> utility somewhere to perform a conversion to UTF-16 (or UTF-8), but it
> should come with a stern warning, and I kind of prefer it not being a method
> on FilePath since I would prefer people not be tempted to overuse it.

Yeah, I think we've beat that to death: it won't be in FilePath.

-Greg.
[chromium-dev] Re: Changes to FilePath?
On Wed, May 13, 2009 at 4:34 PM, Amanda Walker wrote:
> But in context, he's passing these things to 3rd party libraries that
> will be doing plenty of file system access (importing and exporting
> data, for example). That's why I was suggesting something separate
> from FilePath for such use.

Then he doesn't need canonicalization at all. He needs to know how the third party library is going to use the string for filesystem access and then do the corresponding transformations. That does not involve filesystem access.

Brett
[chromium-dev] Re: Changes to FilePath?
On Wed, May 13, 2009 at 2:20 PM, Greg Spencer wrote:
> Yes, I know that this is how it works (see earlier messages in this
> thread), but can you tell me if there are any Linux apps that manage to do
> this correctly (i.e. without having this bug), and how they do it?
>
> I can't see how any Linux app can do any better than looking at LANG and
> LC_CTYPE and hoping that they're set correctly. Certainly there's no way to
> decode a pathname that includes multiple encodings, and I have no idea what
> happens with NFS mounts between machines with different settings.
>
> I'm just saying why not just do as well as can be done by the best app out
> there, and punt after that?
>
> -Greg.

Sorry to repeat information. This is a long thread!

The "solution" is to not convert to UTF-16 unless you are trying to generate a string to display to the user. Then you should use the LANG information to determine how best to render the text for display to the user. The program should try its best to preserve the file path in its original form and not try to convert to UTF-16 and back again, since that conversion may be lossy.

I know this doesn't really help. I think it is reasonable to have a utility somewhere to perform a conversion to UTF-16 (or UTF-8), but it should come with a stern warning, and I kind of prefer it not being a method on FilePath since I would prefer people not be tempted to overuse it.

-Darin
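
To illustrate why a round trip through Unicode can be lossy on Linux, here is a small sketch. The helper names `display_name` and `survives_round_trip` are hypothetical, and the sketch assumes the native path is held as raw bytes and that the expected encoding is UTF-8; it is not how FilePath or the base conversion functions are implemented.

```python
def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a native path for display only.

    Undecodable bytes become U+FFFD, so the result must never be
    converted back and used to open the file.
    """
    return raw.decode(encoding, errors="replace")


def survives_round_trip(raw: bytes, encoding: str = "utf-8") -> bool:
    """Report whether decoding and re-encoding reproduces the same bytes."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeDecodeError:
        return False
```

A file named with the Latin-1 bytes `b"caf\xe9.txt"` fails the round-trip check under UTF-8, which is the case where keeping the original bytes around is the only safe way to reopen the file.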
[chromium-dev] Re: Changes to FilePath?
On Wed, May 13, 2009 at 7:07 PM, Brett Wilson wrote:
> I think this is very dangerous.
>
> I think Greg should not be talking to the filesystem when inserting
> filenames into a set. We don't allow filesystem access from the UI
> thread of Chrome, and I think other parts of our system should also
> not do filesystem access on their critical threads, especially if they
> want to be more part of Chrome in the future.

But in context, he's passing these things to 3rd party libraries that will be doing plenty of file system access (importing and exporting data, for example). That's why I was suggesting something separate from FilePath for such use.

--Amanda
[chromium-dev] Re: Changes to FilePath?
On Wed, May 13, 2009 at 3:51 PM, Amanda Walker wrote:
> Perhaps what we need is a companion to FilePath. For example:
>
> FilePath: much as it is now, lightweight, "alternative to string
> manipulation".
> FileReference: heavierweight, can talk to the file system and have
> carnal knowledge of platform specifics for things like resolving /
> canonicalizing pathnames, determining whether or not they refer to the
> same files, generating C strings that can be passed to 3rd party
> libraries, etc.

I think this is very dangerous.

I think Greg should not be talking to the filesystem when inserting filenames into a set. We don't allow filesystem access from the UI thread of Chrome, and I think other parts of our system should also not do filesystem access on their critical threads, especially if they want to be more part of Chrome in the future.

Brett
[chromium-dev] Re: Changes to FilePath?
Perhaps what we need is a companion to FilePath. For example:

FilePath: much as it is now, lightweight, "alternative to string manipulation".

FileReference: heavierweight, can talk to the file system and have carnal knowledge of platform specifics for things like resolving / canonicalizing pathnames, determining whether or not they refer to the same files, generating C strings that can be passed to 3rd party libraries, etc.

--Amanda

On Wed, May 13, 2009 at 5:22 PM, Greg Spencer wrote:
> On Wed, May 13, 2009 at 1:03 PM, Mark Mentovai wrote:
>> If you've got to take an arbitrary FilePath and convert it for display
>> to the user, or take an arbitrary string in a known encoding and
>> re-encode it for the filesystem, then we don't have anything in
>> FilePath for this. I believe that if we do add something, it should
>> strictly operate only on single pathname components at a time, and not
>> entire pathnames. We could add it to FilePath or we could add it
>> somewhere else, because it is sort of distinct from what FilePath is
>> really supposed to be, which is just a container for ferrying around
>> native paths.
>
> OK, I can see the allure of dealing in terms of lists of encoded strings
> so that you can encode them separately. For my purposes, I need to get a
> string encoded as UTF16 (on Windows) or UTF8 (on other platforms) that
> represents a filename so that I can pass it to third party APIs, so it has
> to include the path separators. But that can be done as a "join" operation
> when I get the string out.
>
>>>> It's also a specification and implementation nightmare. Everyone has
>>>> a different idea of what "normalization" means. What's your idea?
>>>
>>> Yes, I know it's a nightmare all around, but I think it would be useful
>>> to have something that addresses this. My idea would be the same as
>>> Python's os.path.normpath, mainly because it's a well-tested, seasoned
>>> example with test cases. Windows also has a routine for this
>>> (PathCanonicalize) that could be used (but I know it doesn't work for
>>> UNC paths).
>>
>> Why would it be useful? Do you want to compare paths for equality?
>
> Yes, for instance to be able to place them into a map or set and be sure I
> only have one entry for a particular file. And I want to be able to do
> absolute to relative path conversions (as far as possible, anyhow). And
> yes, I know that those are *really hard* to do properly, which argues even
> more for implementing one in a common library so that individual developers
> don't roll their own all the time, thinking that it is easy (and
> consequently producing buggy implementations).
>
>> Then we should have an API that compares paths for equality. It would
>> have to hit the disk to do so. You might need general-purpose
>> canonization to implement that on some systems. Great, you need to
>> hit the disk to do that too. It's fine if you want these things, but
>> we can't put them into FilePath. It's important that FilePath remain
>> lightweight and not make any system calls, because system calls can
>> block and FilePath is just a data carrier.
>
> Which is why I proposed in my last message not putting them into FilePath,
> since I can see that it is not your intention that it support anything that
> hits the filesystem (and I can see why you would want that).
>
>> os.path.normpath is known to be buggy. It might be well-tested and
>> seasoned, but only within the confines of its known limitations.
>> Watch this. [...]
>
> Yes, I'm aware that you can create situations (especially with symbolic
> links) where the same path conversions will succeed or fail depending on
> the filesystem contents. This is why the class would have to have access to
> the filesystem.
>
>> Again, it sounds like what you really want is a pathname comparator
>> that hits the disk. You really can't do this stuff correctly on most
>> systems without talking to the filesystem. You can't even do
>> general-purpose canonization without talking to the filesystem.
>
> Yep. Totally agreed. (and normcase is probably not the behavior I'm looking
> for, you're right).
>
>> Let me make clear: I'm not trying to shoot down the idea of needing to
>> be able to compare paths or even necessarily canonize them. I'm
>> arguing primarily against doing it in FilePath, but I'm also
>> trying to illustrate that doing proper comparisons and canonization is
>> harder than it seems, that even "seasoned and well-tested" APIs are
>> limited in ways that developers don't necessarily expect, and that the
>> semantics and expectations need to be well-defined.
>
> Very well illustrated, and I assure you that I'm well aware that it's a
> bitch to do right.
>
> -Greg.
[chromium-dev] Re: Changes to FilePath?
This post made me think that we should have infrastructure so that certain unit tests can opt to run in a restricted environment to enforce that someone doesn't come along and add filesystem-access code or other known-bad synchronous APIs.

I realize that that is probably hard, and that patches would be welcome. Just throwing it out there in hopes that someone says "Hey, I know how to do that" and someone else says "Hey, do that".

-scott

[It could also be a rathole that only seems like a good idea until you actually try it, like getting const-ness propagation thoroughly correct.]

On Wed, May 13, 2009 at 1:03 PM, Mark Mentovai wrote:
> If you've got a file that begins its life as something on-disk, and
> you just need to carry the path to it around, then that's fine, it
> should live its life as a FilePath.
>
> If you've got to create a file using some name where the name is some
> constant in code, use FilePath with ASCII constants. AppendASCII
> exists to stick new ASCII components onto existing FilePaths. This is
> fine and is considered safe because ASCII is a subset of any rational
> filesystem encoding.
>
> If you've got to take an arbitrary FilePath and convert it for display
> to the user, or take an arbitrary string in a known encoding and
> re-encode it for the filesystem, then we don't have anything in
> FilePath for this. I believe that if we do add something, it should
> strictly operate only on single pathname components at a time, and not
> entire pathnames. We could add it to FilePath or we could add it
> somewhere else, because it is sort of distinct from what FilePath is
> really supposed to be, which is just a container for ferrying around
> native paths.
>
>>> It's also a specification and implementation nightmare. Everyone has
>>> a different idea of what "normalization" means. What's your idea?
>>
>> Yes, I know it's a nightmare all around, but I think it would be useful to
>> have something that addresses this. My idea would be the same as Python's
>> os.path.normpath, mainly because it's a well-tested, seasoned example with
>> test cases. Windows also has a routine for this (PathCanonicalize) that
>> could be used (but I know it doesn't work for UNC paths).
>
> Why would it be useful? Do you want to compare paths for equality?
> Then we should have an API that compares paths for equality. It would
> have to hit the disk to do so. You might need general-purpose
> canonization to implement that on some systems. Great, you need to
> hit the disk to do that too. It's fine if you want these things, but
> we can't put them into FilePath. It's important that FilePath remain
> lightweight and not make any system calls, because system calls can
> block and FilePath is just a data carrier.
>
> os.path.normpath is known to be buggy. It might be well-tested and
> seasoned, but only within the confines of its known limitations.
> Watch this.
>
> m...@anodizer bash$ ls -l a/b/../c
> -rw-r--r-- 1 mark staff 0 May 13 15:47 a/b/../c
> m...@anodizer bash$ python
> Python 2.5.1 (r251:54863, Feb 6 2009, 19:02:12)
> [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os.path
> >>> os.path.normpath('a/b/../c')
> 'a/c'
> >>> ^D
> m...@anodizer bash$ ls -l a/c
> ls: a/c: No such file or directory
>
>> Probably the same as os.path.normcase in Python. I want this stuff so that
>> I can make sure that I can at least semi-reliably compare/manipulate
>> FilePaths to do things like absolute->relative path conversion, or store
>> FilePaths in a set or map and be sure I don't have multiple entries pointing
>> to the same file. Without these kinds of operations, doing these things is
>> pretty much impossible.
>
> I don't think os.path.normcase does what you're asking for either.
>
> m...@anodizer bash$ ls -lid /System/Library
> 81 drwxr-xr-x 64 root wheel 2176 May 12 18:37 /System/Library
> m...@anodizer bash$ ls -lid /system/LIBRARY
> 81 drwxr-xr-x 64 root wheel 2176 May 12 18:37 /system/LIBRARY
> m...@anodizer bash$ python
> Python 2.5.1 (r251:54863, Feb 6 2009, 19:02:12)
> [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import sys
> >>> sys.platform
> 'darwin'
> >>> import os.path
> >>> os.path.normcase('/System/Library')
> '/System/Library'
> >>> os.path.normcase('/system/LIBRARY')
> '/system/LIBRARY'
> >>> ^D
>
> Even os.path.realpath returns the same results.
>
> Again, it sounds like what you really want is a pathname comparator
> that hits the disk. You really can't do this stuff correctly on most
> systems without talking to the filesystem. You can't even do
> general-purpose canonization without talking to the filesystem.
>
> Let me make clear: I'm not trying to shoot down the idea of needing to
> be able to compare paths or even necessarily canonize them. I'm
> arguing primarily against doing it in FilePath, but I'm als
[chromium-dev] Re: Changes to FilePath?
On Wed, May 13, 2009 at 2:05 PM, Darin Fisher wrote: > > That conversion is not defined. If you are on Linux, the contents of the > file path is just an array of bytes. It might be UTF-8, in which case you > can convert to UTF-16. However, it may also be some crazy encoding or it > may not match any encoding. This OS does not require it to match an > encoding. > > When we need to convert a FilePath to Unicode, we use the SysWideToNativeMB > and SysNativeMBToWide functions from base. This works by inspecting what > the system thinks the current multi-byte encoding is. On Mac that is UTF-8. > On Linux, it depends on the value of $LANG. Each time we do such a > conversion, we are introducing a potential bug in the product (on Linux at > least), so we try hard to avoid them. > Yes, I know that this is how it works (see earlier messages in this thread), but can you tell me if there are any Linux apps that manage to do this correctly (i.e. without having this bug), and how they do it? I can't see how any Linux app can do any better than looking at LANG and LC_CTYPE and hoping that they're set correctly. Certainly there's no way to decode a pathname that includes multiple encodings, and I have no idea what happens with NFS mounts between machines with different settings. I'm just saying: why not do as well as can be done by the best app out there, and punt after that? -Greg.
[chromium-dev] Re: Changes to FilePath?
On Wed, May 13, 2009 at 1:03 PM, Mark Mentovai wrote: > If you've got to take an arbitrary FilePath and convert it for display > to the user, or take an arbitrary string in a known encoding and > re-encode it for the filesystem, then we don't have anything in > FilePath for this. I believe that if we do add something, it should > strictly operate only on single pathname components at a time, and not > entire pathnames. We could add it to FilePath or we could add it > somewhere else, because it is sort of distinct from what FilePath is > really supposed to be, which is just a container for ferrying around > native paths. OK, I can see the allure of dealing in terms of lists of encoded strings so that you can encode them separately. For my purposes, I need to get a string encoded as UTF16 (on Windows) or UTF8 (on other platforms) that represents a filename so that I can pass it to third party APIs, so it has to include the path separators. But that can be done as a "join" operation when I get the string out. >> It's also a specification and implementation nightmare. Everyone has > >> a different idea of what "normalization" means. What's your idea? > > > > Yes, I know it's a nightmare all around, but I think it would be useful > to > > have something that addresses this. My idea would be the same as > Python's > > os.path.normpath, mainly because it's a well-tested, seasoned example > with > > test cases. Windows also has a routine for this (PathCanonicalize) that > > could be used (but I know it doesn't work for UNC paths). > > Why would it be useful? Do you want to compare paths for equality? Yes, for instance to be able to place them into a map or set and be sure I only have one entry for a particular file. And I want to be able to do absolute to relative path conversions (as far as possible, anyhow). 
And yes, I know that those are *really hard* to do properly, which argues even more for implementing one in a common library so that individual developers don't roll their own all the time, thinking that it is easy (and consequently producing buggy implementations). > Then we should have an API that compares paths for equality. It would > have to hit the disk to do so. You might need general-purpose > canonization to implement that on some systems. Great, you need to > hit the disk to do that too. It's fine if you want these things, but > we can't put them into FilePath. It's important that FilePath remain > lightweight and not make any system calls, because system calls can > block and FilePath is just a data carrier. Which is why I proposed in my last message not putting them into FilePath, since I can see that it is not your intention that it support anything that hits the filesystem (and I can see why you would want that). > os.path.normpath is known to be buggy. It might be well-tested and > seasoned, but only within the confines of its known limitations. > Watch this. [...] Yes, I'm aware that you can create situations (especially with symbolic links) where the same path conversions will succeed or fail depending on the filesystem contents. This is why the class would have to have access to the filesystem. > Again, it sounds like what you really want is a pathname comparator > that hits the disk. You really can't do this stuff correctly on most > systems without talking to the filesystem. You can't even do > general-purpose canonization without talking to the filesystem. Yep. Totally agreed. (And normcase is probably not the behavior I'm looking for, you're right.) > Let me make clear: I'm not trying to shoot down the idea of needing to > be able to compare paths or even necessarily canonize them. 
I'm > arguing primarily against doing it in FilePath, but I'm also > trying to illustrate that doing proper comparisons and canonization is > harder than it seems, that even "seasoned and well-tested" APIs are > limited in ways that developers don't necessarily expect, and that the > semantics and expectations need to be well-defined. Very well illustrated, and I assure you that I'm well aware that it's a bitch to do right. -Greg.
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 2:47 PM, Greg Spencer wrote: > On Tue, Apr 28, 2009 at 2:41 PM, Amanda Walker wrote: > >> >> On Tue, Apr 28, 2009 at 4:39 PM, Greg Spencer >> wrote: >> > 1) I'd like to add some explicit routines for converting to/from UTF8 >> and >> > UTF16. While it's nice (and important) that FilePath uses the >> platform's >> > native string, we've found that many third party libraries have made >> other >> > assumptions, where they always expect UTF8 (char) or UTF16 (wchar_t) >> paths >> > regardless of platform, and converting a FilePath to and from those >> forms is >> > a platform-dependent exercise which should be centralized into the class >> > (i.e. adding "ToUTF8" and "ToWide" functions to the class, and explicit >> > constructors that take each type). >> >> One thing many of us have found, across multiple projects, is that >> wchar_t is fraught with complication as soon as more than one platform >> is involved. "wchar_t == UTF16" is a Windowsism (gcc defaults to 4 >> bytes, for example, and L"mumble" gets stored in UCS-4, not UTF-16). >> Chrome started with more or less what you are suggesting, and we moved >> off of it after much pain. > > > I understand those issues quite well (but I probably should call the > conversion method ToUTF16, now that you mention it). And char* isn't > necessarily UTF8 on all platforms either. > > OK, so what's the currently recommended path for converting to UTF16 or > UTF8 from a FilePath? > That conversion is not defined. If you are on Linux, the contents of the file path is just an array of bytes. It might be UTF-8, in which case you can convert to UTF-16. However, it may also be some crazy encoding or it may not match any encoding. This OS does not require it to match an encoding. When we need to convert a FilePath to Unicode, we use the SysWideToNativeMB and SysNativeMBToWide functions from base. This works by inspecting what the system thinks the current multi-byte encoding is. On Mac that is UTF-8. 
On Linux, it depends on the value of $LANG. Each time we do such a conversion, we are introducing a potential bug in the product (on Linux at least), so we try hard to avoid them. -Darin
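The point about Linux filenames being plain bytes can be illustrated with a small Python sketch. This is not the Chromium code (SysNativeMBToWide is C++ in base); decode_for_display and roundtrip are hypothetical names, and using the locale's preferred encoding as a stand-in for "what the system thinks the current multi-byte encoding is" is an assumption of this sketch.

```python
import locale
from typing import Optional


def decode_for_display(raw_name: bytes, encoding: Optional[str] = None) -> str:
    """Best-effort, lossy conversion of a native byte filename to text."""
    if encoding is None:
        # Assumption: ask the locale (driven by LANG / LC_* on Linux).
        encoding = locale.getpreferredencoding(False)
    # Undecodable bytes become U+FFFD, so the result is only fit for display.
    return raw_name.decode(encoding, errors="replace")


def roundtrip(raw_name: bytes, encoding: str = "utf-8") -> bytes:
    """Decode and re-encode without losing undecodable bytes."""
    # surrogateescape smuggles bad bytes through as lone surrogates.
    text = raw_name.decode(encoding, errors="surrogateescape")
    return text.encode(encoding, errors="surrogateescape")
```

The lossy form is what makes each conversion a potential bug: once a byte has been replaced, the original filename cannot be recovered from the text. The round-trip form keeps the bytes intact but produces text that is not valid for display or for a third-party API expecting clean UTF-8.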
[chromium-dev] Re: Changes to FilePath?
If you've got a file that begins its life as something on-disk, and you just need to carry the path to it around, then that's fine, it should live its life as a FilePath. If you've got to create a file using some name where the name is some constant in code, use FilePath with ASCII constants. AppendASCII exists to stick new ASCII components onto existing FilePaths. This is fine and is considered safe because ASCII is a subset of any rational filesystem encoding. If you've got to take an arbitrary FilePath and convert it for display to the user, or take an arbitrary string in a known encoding and re-encode it for the filesystem, then we don't have anything in FilePath for this. I believe that if we do add something, it should strictly operate only on single pathname components at a time, and not entire pathnames. We could add it to FilePath or we could add it somewhere else, because it is sort of distinct from what FilePath is really supposed to be, which is just a container for ferrying around native paths. >> It's also a specification and implementation nightmare. Everyone has >> a different idea of what "normalization" means. What's your idea? > > Yes, I know it's a nightmare all around, but I think it would be useful to > have something that addresses this. My idea would be the same as Python's > os.path.normpath, mainly because it's a well-tested, seasoned example with > test cases. Windows also has a routine for this (PathCanonicalize) that > could be used (but I know it doesn't work for UNC paths). Why would it be useful? Do you want to compare paths for equality? Then we should have an API that compares paths for equality. It would have to hit the disk to do so. You might need general-purpose canonization to implement that on some systems. Great, you need to hit the disk to do that too. It's fine if you want these things, but we can't put them into FilePath. 
It's important that FilePath remain lightweight and not make any system calls, because system calls can block and FilePath is just a data carrier.

os.path.normpath is known to be buggy. It might be well-tested and seasoned, but only within the confines of its known limitations. Watch this.

m...@anodizer bash$ ls -l a/b/../c
-rw-r--r-- 1 mark staff 0 May 13 15:47 a/b/../c
m...@anodizer bash$ python
Python 2.5.1 (r251:54863, Feb 6 2009, 19:02:12)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os.path
>>> os.path.normpath('a/b/../c')
'a/c'
>>> ^D
m...@anodizer bash$ ls -l a/c
ls: a/c: No such file or directory

> Probably the same as os.path.normcase in Python. I want this stuff so that
> I can make sure that I can at least semi-reliably compare/manipulate
> FilePaths to do things like absolute->relative path conversion, or store
> FilePaths in a set or map and be sure I don't have multiple entries pointing
> to the same file. Without these kinds of operations, doing these things is
> pretty much impossible.

I don't think os.path.normcase does what you're asking for either.

m...@anodizer bash$ ls -lid /System/Library
81 drwxr-xr-x 64 root wheel 2176 May 12 18:37 /System/Library
m...@anodizer bash$ ls -lid /system/LIBRARY
81 drwxr-xr-x 64 root wheel 2176 May 12 18:37 /system/LIBRARY
m...@anodizer bash$ python
Python 2.5.1 (r251:54863, Feb 6 2009, 19:02:12)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.platform
'darwin'
>>> import os.path
>>> os.path.normcase('/System/Library')
'/System/Library'
>>> os.path.normcase('/system/LIBRARY')
'/system/LIBRARY'
>>> ^D

Even os.path.realpath returns the same results.

Again, it sounds like what you really want is a pathname comparator that hits the disk. You really can't do this stuff correctly on most systems without talking to the filesystem.
You can't even do general-purpose canonization without talking to the filesystem. Let me make clear: I'm not trying to shoot down the idea of needing to be able to compare paths or even necessarily canonize them. I'm arguing primarily against doing it in FilePath, but I'm also trying to illustrate that doing proper comparisons and canonization is harder than it seems, that even "seasoned and well-tested" APIs are limited in ways that developers don't necessarily expect, and that the semantics and expectations need to be well-defined. Mark
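A pathname comparator that hits the disk, as argued for above, might look like the following Python sketch. It rebuilds the a/b/../c situation with a symbolic link in a temporary directory and compares paths with os.path.samefile, which stats both paths. The directory layout and the names same_file and demo are made up for illustration, and the sketch assumes a POSIX system where symlinks can be created.

```python
import os
import tempfile


def same_file(path_a: str, path_b: str) -> bool:
    """Compare two paths by asking the filesystem (device and inode)."""
    try:
        return os.path.samefile(path_a, path_b)
    except OSError:
        # One of the paths does not exist or is not accessible.
        return False


def demo() -> dict:
    with tempfile.TemporaryDirectory() as root:
        # Layout: root/real/sub/, root/real/c, and root/a/b -> root/real/sub
        os.makedirs(os.path.join(root, "real", "sub"))
        os.makedirs(os.path.join(root, "a"))
        open(os.path.join(root, "real", "c"), "w").close()
        os.symlink(os.path.join(root, "real", "sub"),
                   os.path.join(root, "a", "b"))

        through_link = os.path.join(root, "a", "b", "..", "c")
        lexical = os.path.normpath(through_link)  # root/a/c, computed textually
        return {
            "through_link_exists": os.path.exists(through_link),
            "lexical_exists": os.path.exists(lexical),
            "same_as_real": same_file(through_link,
                                      os.path.join(root, "real", "c")),
        }
```

The textual normalization points at a file that does not exist, while the disk-touching comparison recognizes that a/b/../c and real/c are the same file. That is also why such a comparator would not fit a class that is specified never to make system calls.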
[chromium-dev] Re: Changes to FilePath?
(ping) So, I had another idea. How about a separate file path manipulation class that has a well defined character encoding, so that we can do filename manipulations like with FilePath (and a few more). It could convert from a FilePath if given an encoding, and convert back to a FilePath with the platform's default encoding (using LC_*/LANG on Linux, falling back to ASCII), or a given encoding. It could touch the filesystem so that it could know what encoding methods and manipulations were valid for the platform/drive combination. Since it seems like this is not really something that Chromium needs or wants right now (and it doesn't belong in base anyhow because of needing to touch the filesystem), I think I'll work on this for O3D, and later you can see if you want to use it for Chromium. -Greg. On Wed, Apr 29, 2009 at 3:58 PM, Greg Spencer wrote: > On Wed, Apr 29, 2009 at 12:22 PM, Mark Mentovai wrote: > >> I understand your problem. You're saying "I have user-supplied data >> that I want to build a filename from," and "I have this pathname that >> I want to display back to the user." I agree that it would be good to > > have a way to handle these cases in base. I don't know if FilePath >> proper is the right place to do it. If we do it in FilePath, it still >> won't really be right. > > > OK, so it sounds like you're telling me not to use FilePath to represent > file paths from a disk for my purposes because they can't ever be converted > reliably to a particular encoding on Linux (which is a requirement for me, > because of the third party libraries that require a particular encoding). > > That's fine, but what do I do instead? Roll my own FilePath clone that has > some encoding assumptions? I can do that, but it has the same issues as the > ones you're worried about with FilePath, so it seems better to solve the > issue in one place rather than have two versions that are both insufficient. > Man, it would be better if FilePath could reliably know its encoding! 
(I > realize that Linux makes this impossible, it just seems like it would be > better that way. :-) > > Since Linux is the only platform where the encoding is unclear, what if we > did the best we could on Linux: > > When constructing a FilePath from a char* string on Linux: > - Test the input string for values > 127 to determine if it's really just > ASCII (and if so, we're out of the woods). > - Then check LANG, LC_CTYPE, LC_ALL (through appropriate Linux APIs) for an > encoding that we can support, and note the encoding for later if we are > requested to do a conversion. > - If we run into an invalid sequence during a conversion, or an encoding we > can't convert from, then use a CHECK to crash. > > This should work on most filenames, in almost all situations -- I'll bet > most filenames are ASCII, even on foreign systems, and the ones that aren't > ASCII have set LANG to something in /etc/profile, so all filenames created > by any app running on that machine should match that encoding. > > Where they don't do that correctly, they're already getting garbage (and > should expect garbage) from any application they use, not just Chrome, since > there is no way *any *app can decode a path with multiple encodings in it, > or where the encoding is different than LANG (or LC_*) says it is. > > Chrome already crashes like this when it encounters situations where it's > just impossible to know what's right, so it's consistent with Chrome's > behavior in other areas. > > >> it should be the caller's responsibility to only deal with user-created >> names with >> this interface. > > > What do you mean here? Isn't that the case now with FilePath? (It's the > file_util routines that actually read the filesystem and make FilePaths out > of them, afterall). As for your suggestion to only deal with path > components, how would you propose to parse user-supplied paths into one of > these? 
> > >> > 2) I'd like to make it possible to instantiate a POSIX FilePath object >> on >> > Windows and a Windows FilePath on POSIX platforms. This is because some >> > libraries (e.g. the zip library, or tar files), use POSIX semantics for >> > their paths even on Windows (I haven't seen a use case for Windows paths >> on >> > POSIX yet, actually). This would make it possible to use the nice API >> that >> > FilePath has to manipulate paths appropriately for these other >> libraries. >> > This could be easily accomplished by having POSIX and Windows versions >> of >> > FilePath, and then typedef'ing FilePath differently on different >> platforms >> > to one of these versions. >> >> Sounds pretty Pythonic. >> >> FilePath already sort of has some support for this - it does a bunch >> of things based on feature macros, mostly so that as I was writing it, >> I could test the Windows semantics without having to (shudder) resort >> to running on Windows. These could probably be adapted to do what >> you're asking. > > > Cool. > > >> > 3) It would be helpful
[chromium-dev] Re: Changes to FilePath?
On Wed, Apr 29, 2009 at 12:22 PM, Mark Mentovai wrote: > I understand your problem. You're saying "I have user-supplied data > that I want to build a filename from," and "I have this pathname that > I want to display back to the user." I agree that it would be good to have a way to handle these cases in base. I don't know if FilePath > proper is the right place to do it. If we do it in FilePath, it still > won't really be right. OK, so it sounds like you're telling me not to use FilePath to represent file paths from a disk for my purposes because they can't ever be converted reliably to a particular encoding on Linux (which is a requirement for me, because of the third party libraries that require a particular encoding). That's fine, but what do I do instead? Roll my own FilePath clone that has some encoding assumptions? I can do that, but it has the same issues as the ones you're worried about with FilePath, so it seems better to solve the issue in one place rather than have two versions that are both insufficient. Man, it would be better if FilePath could reliably know its encoding! (I realize that Linux makes this impossible, it just seems like it would be better that way. :-) Since Linux is the only platform where the encoding is unclear, what if we did the best we could on Linux: When constructing a FilePath from a char* string on Linux: - Test the input string for values > 127 to determine if it's really just ASCII (and if so, we're out of the woods). - Then check LANG, LC_CTYPE, LC_ALL (through appropriate Linux APIs) for an encoding that we can support, and note the encoding for later if we are requested to do a conversion. - If we run into an invalid sequence during a conversion, or an encoding we can't convert from, then use a CHECK to crash. 
This should work on most filenames, in almost all situations -- I'll bet most filenames are ASCII, even on foreign systems, and the ones that aren't ASCII have set LANG to something in /etc/profile, so all filenames created by any app running on that machine should match that encoding. Where they don't do that correctly, they're already getting garbage (and should expect garbage) from any application they use, not just Chrome, since there is no way *any* app can decode a path with multiple encodings in it, or where the encoding is different than LANG (or LC_*) says it is. Chrome already crashes like this when it encounters situations where it's just impossible to know what's right, so it's consistent with Chrome's behavior in other areas. > it should be the caller's responsibility to only deal with user-created > names with > this interface. What do you mean here? Isn't that the case now with FilePath? (It's the file_util routines that actually read the filesystem and make FilePaths out of them, after all). As for your suggestion to only deal with path components, how would you propose to parse user-supplied paths into one of these? > > 2) I'd like to make it possible to instantiate a POSIX FilePath object on > > Windows and a Windows FilePath on POSIX platforms. This is because some > > libraries (e.g. the zip library, or tar files), use POSIX semantics for > > their paths even on Windows (I haven't seen a use case for Windows paths > on > > POSIX yet, actually). This would make it possible to use the nice API > that > > FilePath has to manipulate paths appropriately for these other libraries. > > This could be easily accomplished by having POSIX and Windows versions of > > FilePath, and then typedef'ing FilePath differently on different > platforms > > to one of these versions. > > Sounds pretty Pythonic. 
> > FilePath already sort of has some support for this - it does a bunch > of things based on feature macros, mostly so that as I was writing it, > I could test the Windows semantics without having to (shudder) resort > to running on Windows. These could probably be adapted to do what > you're asking. Cool. > > 3) It would be helpful to have real path normalization for each of the > > platforms (although I know what a testing nightmare that can be). I > might > > try and tackle this if people think it would be beneficial. > > It's also a specification and implementation nightmare. Everyone has > a different idea of what "normalization" means. What's your idea? Yes, I know it's a nightmare all around, but I think it would be useful to have something that addresses this. My idea would be the same as Python's os.path.normpath, mainly because it's a well-tested, seasoned example with test cases. Windows also has a routine for this (PathCanonicalize) that could be used (but I know it doesn't work for UNC paths). > 4) Make sure we handle case sensitivity vs case preservation correctly. > > It's unclear to me that FilePath does this correctly on the Mac -- Mac > file > > names are case preserving, but case insensitive, Unix filenames are both > > (and windows filenames are neither :-). > > Again with the normalization.
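The best-effort Linux steps proposed in this message (test for bytes above 127, consult the locale, fail hard on an invalid sequence) could be sketched in Python roughly as follows. The function names are hypothetical, locale.getpreferredencoding stands in for "appropriate Linux APIs", and a raised exception stands in for a CHECK crash; none of this is FilePath's actual behavior.

```python
import codecs
import locale
from typing import Optional


def is_ascii(raw: bytes) -> bool:
    """Step 1: no byte value above 127 means the name is plain ASCII."""
    return all(byte < 128 for byte in raw)


def guess_encoding(raw: bytes, locale_encoding: Optional[str] = None) -> str:
    """Step 2: fall back to the locale's encoding for non-ASCII names."""
    if is_ascii(raw):
        return "ascii"
    name = locale_encoding or locale.getpreferredencoding(False)
    # codecs.lookup raises LookupError for an encoding we cannot support.
    return codecs.lookup(name).name


def to_utf8(raw: bytes, locale_encoding: Optional[str] = None) -> bytes:
    """Step 3: convert strictly; an invalid sequence raises UnicodeDecodeError."""
    encoding = guess_encoding(raw, locale_encoding)
    return raw.decode(encoding).encode("utf-8")
```

Note what the sketch cannot do: a name whose bytes happen to be valid in the locale's encoding but were written under a different one converts without error into the wrong text, which is the case the rest of the thread worries about.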
[chromium-dev] Re: Changes to FilePath?
Greg Spencer wrote: > So there's currently no right way to do the conversion, but I still think > that the FilePath constructor is probably in the best position to inspect > LC_ALL, etc. and do as close to the right thing as possible. I doubt most > Linux developers even think about this, and so the chances that they will > implement anything other than assuming that it's ASCII are slim -- this > would allow us to at least implement a baseline for them. Not doing the conversion is kinda the point. Well, it's exactly the point. (Hi, I'm the author of FilePath.) If you've got an arbitrary path, it might be encoded in some scheme, and it might not, and it might contain a mix of encodings. The point of FilePath is "we know it's a path and we don't necessarily know anything else." Chromium didn't used to have FilePath. Everything was a wstring which implied UTF-16/32, and the conversions implied UTF-8 because we couldn't do anything smarter, and there was all sorts of potential for messing things up. Not a pretty story. When FilePath was born, the *Hack methods showed up to give us a way to transition the old-style wstring APIs to new-style FilePath APIs at reasonable cut points, instead of having to do everything all at once. I understand your problem. You're saying "I have user-supplied data that I want to build a filename from," and "I have this pathname that I want to display back to the user." I agree that it would be good to have a way to handle these cases in base. I don't know if FilePath proper is the right place to do it. If we do it in FilePath, it still won't really be right. If we had something, it should probably be made to operate only on single pathname components, and it should be the caller's responsibility to only deal with user-created names with this interface. > 2) I'd like to make it possible to instantiate a POSIX FilePath object on > Windows and a Windows FilePath on POSIX platforms. This is because some > libraries (e.g. 
the zip library, or tar files), use POSIX semantics for > their paths even on Windows (I haven't seen a use case for Windows paths on > POSIX yet, actually). This would make it possible to use the nice API that > FilePath has to manipulate paths appropriately for these other libraries. > This could be easily accomplished by having POSIX and Windows versions of > FilePath, and then typedef'ing FilePath differently on different platforms > to one of these versions. Sounds pretty Pythonic. FilePath already sort of has some support for this - it does a bunch of things based on feature macros, mostly so that as I was writing it, I could test the Windows semantics without having to (shudder) resort to running on Windows. These could probably be adapted to do what you're asking. > 3) It would be helpful to have real path normalization for each of the > platforms (although I know what a testing nightmare that can be). I might > try and tackle this if people think it would be beneficial. It's also a specification and implementation nightmare. Everyone has a different idea of what "normalization" means. What's your idea? > 4) Make sure we handle case sensitivity vs case preservation correctly. > It's unclear to me that FilePath does this correctly on the Mac -- Mac file > names are case preserving, but case insensitive, Unix filenames are both > (and windows filenames are neither :-). Again with the normalization. What do you want this stuff for? What's your idea of how this should work? Remember: FilePath is specified to be light and to never touch the disk. If you've got a disk-touching operation, it probably doesn't belong in FilePath proper. Mark --~--~-~--~~~---~--~~ Chromium Developers mailing list: chromium-dev@googlegroups.com View archives, change email options, or unsubscribe: http://groups.google.com/group/chromium-dev -~--~~~~--~~--~--~---
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 3:26 PM, Erik Kay wrote: > On Tue, Apr 28, 2009 at 3:19 PM, Greg Spencer wrote: > >> But that's exactly the point. FilePath is the class that created the path >> to begin with. So it can know what the LC_*/LANG variables were when it >> was created, and do the right conversion when you ask the FilePath to >> convert to UTF16. Also, if the developer calls something called >> FilePath::CreateFromUTF8, then it can know it was supposed to be UTF8 and >> remember that. >> > > If you created it yourself, that's fine. FilePaths aren't always created > manually by users. They often are populated from system APIs where you > can't know. See file_util* for some examples. So the problem is that if > you add this API, people will mistakenly use the conversion functions when > they can't be safe. I agree it sucks. I just don't know of a reasonable > solution. > So there's currently no right way to do the conversion, but I still think that the FilePath constructor is probably in the best position to inspect LC_ALL, etc. and do as close to the right thing as possible. I doubt most Linux developers even think about this, and so the chances that they will implement anything other than assuming that it's ASCII are slim -- this would allow us to at least implement a baseline for them. Or would that just screw things up worse? Doesn't this mean that it's possible that the path manipulation routines fail for sufficiently odd encodings? (jis or something where an encoded char might include a "/"?) -Greg.
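The closing question can be probed with a short Python sketch. In Shift-JIS the second byte of a two-byte character can be 0x5C, the backslash; as far as I know the "/" byte 0x2F is not a valid trail byte there, so for that encoding the risk is to Windows-style separators rather than POSIX ones. The helper names below are hypothetical, and the sample character U+8868 is just one well-known case.

```python
# Byte values of the two path separators discussed in the thread.
SEPARATOR_BYTES = {"/": 0x2F, "\\": 0x5C}


def separators_hidden_in(text, encoding):
    """List separators whose byte value occurs inside a multi-byte character."""
    hidden = []
    for char in text:
        encoded = char.encode(encoding)
        if len(encoded) < 2:
            continue  # a single-byte character really is that character
        for name, value in SEPARATOR_BYTES.items():
            if value in encoded and name not in hidden:
                hidden.append(name)
    return hidden


def naive_split(raw, separator):
    """Split an encoded path byte-wise, ignoring its encoding."""
    return raw.split(separator)


# U+8868 encodes to the bytes 0x95 0x5C in Shift-JIS.
SAMPLE = "dir\\\u8868.txt"
```

Calling naive_split(SAMPLE.encode("shift_jis"), b"\\") cuts the character in half and yields three components instead of two, whereas the same split on the UTF-8 encoding is safe because UTF-8 never reuses ASCII byte values inside a multi-byte sequence.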
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 3:26 PM, Erik Kay wrote: >> But that's exactly the point. FilePath is the class that created the path >> to begin with. So it can know what the LC_*/LANG variables were when it >> was created, and do the right conversion when you ask the FilePath to >> convert to UTF16. Also, if the developer calls something called >> FilePath::CreateFromUTF8, then it can know it was supposed to be UTF8 and >> remember that. > > > If you created it yourself, that's fine. FilePaths aren't always created > manually by users. They often are populated from system APIs where you > can't know. See file_util* for some examples. So the problem is that if > you add this API, people will mistakenly use the conversion functions when > they can't be safe. I agree it sucks. I just don't know of a reasonable > solution. We have this problem already, when FilePaths need to work with wstring-based APIs like the win32 one. What we've done so far is use a function with an awkward name (ToWStringHack, FromWStringHack) to try to create bias against them. On the other hand, the codebase now has 309 lines containing "WStringHack" so I don't know that it's been too successful. It might be worth figuring out a name that does what Greg needs that is similarly awkward but doesn't involve "Hack" for circumstances where you really just need to do the conversion.
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 3:19 PM, Greg Spencer wrote: > On Tue, Apr 28, 2009 at 3:11 PM, Erik Kay wrote: > >> The biggest problem with this change is that it's not possible to do this >> conversion on Linux in a safe way. In Linux, there is no charset defined by >> the filesystem. Each filename is just a blob of bytes. Apps are supposed >> to respect an environment variable, but since this environment variable >> could change over time and be different from user to user, there's no >> reliable way to know what the charset is, so you can't convert from a >> FilePath on Linux to UTF8 or UTF16 unless you were the one who created the >> path to begin with. >> > > But that's exactly the point. FilePath is the class that created the path > to begin with. So it can know what the LC_*/LANG variables were when it > was created, and do the right conversion when you ask the FilePath to > convert to UTF16. Also, if the developer calls something called > FilePath::CreateFromUTF8, then it can know it was supposed to be UTF8 and > remember that. > If you created it yourself, that's fine. FilePaths aren't always created manually by users. They often are populated from system APIs where you can't know. See file_util* for some examples. So the problem is that if you add this API, people will mistakenly use the conversion functions when they can't be safe. I agree it sucks. I just don't know of a reasonable solution. Erik
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 3:19 PM, Greg Spencer wrote:
> On Tue, Apr 28, 2009 at 3:11 PM, Erik Kay wrote:
>> The biggest problem with this change is that it's not possible to do this conversion on Linux in a safe way.

And besides -- this problem isn't introduced by this change: it exists already because currently there's no safe way to convert, regardless of the API (since a consumer of a FilePath doesn't know what encoding it contains).

-Greg.
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 3:11 PM, Erik Kay wrote:
> The biggest problem with this change is that it's not possible to do this conversion on Linux in a safe way. In Linux, there is no charset defined by the filesystem. Each filename is just a blob of bytes. Apps are supposed to respect an environment variable, but since this environment variable could change over time and be different from user to user, there's no reliable way to know what the charset is, so you can't convert from a FilePath on Linux to UTF8 or UTF16 unless you were the one who created the path to begin with.

But that's exactly the point. FilePath is the class that created the path to begin with. So it can know what the LC_*/LANG variables were when it was created, and do the right conversion when you ask the FilePath to convert to UTF16. Also, if the developer calls something called FilePath::CreateFromUTF8, then it can know it was supposed to be UTF8 and remember that.

-Greg.
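A minimal sketch of the "remember the encoding at creation time" idea, in case it helps make the proposal concrete. Everything here is hypothetical (TaggedPath, PathEncoding, FromSystemBytes are invented names, and this is not the real FilePath API); the only name taken from the thread is CreateFromUTF8. The assumption is that a path built from bytes of unknown origin should refuse to convert rather than guess.

```cpp
#include <optional>
#include <string>
#include <utility>

// Hypothetical names throughout; none of this is the real FilePath API.
enum class PathEncoding { kUnknown, kUTF8 };

class TaggedPath {
 public:
  // Bytes handed back by a system API: the encoding cannot be known.
  static TaggedPath FromSystemBytes(std::string bytes) {
    return TaggedPath(std::move(bytes), PathEncoding::kUnknown);
  }

  // The caller states that the bytes are UTF-8, so the path remembers that.
  static TaggedPath CreateFromUTF8(std::string utf8) {
    return TaggedPath(std::move(utf8), PathEncoding::kUTF8);
  }

  const std::string& value() const { return bytes_; }
  PathEncoding encoding() const { return encoding_; }

  // Returns the UTF-8 form only when the encoding was recorded as UTF-8.
  std::optional<std::string> ToUTF8() const {
    if (encoding_ != PathEncoding::kUTF8) return std::nullopt;
    return bytes_;
  }

 private:
  TaggedPath(std::string bytes, PathEncoding encoding)
      : bytes_(std::move(bytes)), encoding_(encoding) {}

  std::string bytes_;
  PathEncoding encoding_;
};
```

With this shape, TaggedPath::CreateFromUTF8("Foo.bar").ToUTF8() yields a value, while a path made with FromSystemBytes yields an empty optional, so the caller has to handle the "can't know" case explicitly instead of converting silently.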
[chromium-dev] Re: Changes to FilePath?
(resend - arg)

On Tue, Apr 28, 2009 at 2:47 PM, Greg Spencer wrote:
> On Tue, Apr 28, 2009 at 2:41 PM, Amanda Walker wrote:
>> On Tue, Apr 28, 2009 at 4:39 PM, Greg Spencer wrote:
>>> 1) I'd like to add some explicit routines for converting to/from UTF8 and UTF16. While it's nice (and important) that FilePath uses the platform's native string, we've found that many third party libraries have made other assumptions, where they always expect UTF8 (char) or UTF16 (wchar_t) paths regardless of platform, and converting a FilePath to and from those forms is a platform-dependent exercise which should be centralized into the class (i.e. adding "ToUTF8" and "ToWide" functions to the class, and explicit constructors that take each type).
>>
>> One thing many of us have found, across multiple projects, is that wchar_t is fraught with complication as soon as more than one platform is involved. "wchar_t == UTF16" is a Windowsism (gcc defaults to 4 bytes, for example, and L"mumble" gets stored in UCS-4, not UTF-16). Chrome started with more or less what you are suggesting, and we moved off of it after much pain.
>
> I understand those issues quite well (but I probably should call the conversion method ToUTF16, now that you mention it). And char* isn't necessarily UTF8 on all platforms either.
>
> OK, so what's the currently recommended path for converting to UTF16 or UTF8 from a FilePath?

The biggest problem with this change is that it's not possible to do this conversion on Linux in a safe way. In Linux, there is no charset defined by the filesystem. Each filename is just a blob of bytes. Apps are supposed to respect an environment variable, but since this environment variable could change over time and be different from user to user, there's no reliable way to know what the charset is, so you can't convert from a FilePath on Linux to UTF8 or UTF16 unless you were the one who created the path to begin with.

Erik
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 2:51 PM, Peter Kasting wrote:
> On Tue, Apr 28, 2009 at 2:48 PM, Greg Spencer wrote:
>> So, I was unable to find the conversion utilities in base that do the conversion to/from UTF8. What are they called? If I missed them (and I looked for a while before I gave up), then maybe they need to be more prominent?
>
> See base/string_util.h, UTF8ToUTF16() etc.

Yes, but those are generic string conversions, and so to convert a FilePath to UTF16 on all platforms, my code has to look something like:

--
FilePath path(FILE_PATH_LITERAL("Foo.bar"));
collada::fstring collada_path;  // a UTF16 path.
#if defined(OS_WIN)
  collada_path = path.value();
#elif defined(OS_MACOSX)
  collada_path = UTF8ToUTF16(path.value());
#elif defined(OS_LINUX)  // (or whatever this linux flavor uses for a filename encoding.)
  collada_path = Latin1ToUTF16(path.value());
#endif
--

This seems like code that belongs in FilePath because it knows exactly what the filename encoding would be on each platform.

> Yes, partly because including dedicated helpers like this makes it sound as if the class is somehow special-cased or fastpathed to deal better with these than a generic converter would be.

But it can. For instance, on the Mac, we know that filenames are UTF8 encoded. We have no such guarantee on Linux, even though both use a char* format in FilePath. If FilePath were doing the conversion, then it could be very picky about doing the conversion properly on each platform, because converting a Latin-1 string to wide chars using a UTF8 codec may end up with some strange results.

> The other argument is simply that converting utf8 to utf16 is a generic sort of functionality that belongs in base/ or another similar general-purpose location, rather than specifically in FilePath.

And the implementation in FilePath would be using those generic functions, but it would be using them (or not) as applied to the specific platform it is compiled on, whereas the conversion routines don't know anything about FilePath's platform-specific semantics.

-Greg.
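To illustrate the point about applying the wrong codec, here is a small self-contained sketch. The function names mirror the ones used in the snippet above, but the bodies are stand-ins written for this example (assumptions, not the routines in base/string_util.h). It shows that the same bytes convert cleanly when treated as Latin-1 yet are rejected by a strict UTF-8 decoder.

```cpp
#include <cstddef>
#include <string>

// Latin-1 maps each byte directly to the code point of the same value.
std::u16string Latin1ToUTF16(const std::string& in) {
  std::u16string out;
  for (unsigned char c : in) out.push_back(static_cast<char16_t>(c));
  return out;
}

// Stand-in strict UTF-8 decoder. Returns false on malformed input.
bool UTF8ToUTF16(const std::string& in, std::u16string* out) {
  // Smallest code point that legitimately needs a sequence of each length.
  static const char32_t kMin[] = {0, 0, 0x80, 0x800, 0x10000};
  out->clear();
  for (std::size_t i = 0; i < in.size();) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    char32_t cp;
    std::size_t len;
    if (lead < 0x80) { cp = lead; len = 1; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
    else return false;                      // stray continuation or bad lead
    if (i + len > in.size()) return false;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(in[i + k]);
      if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong forms, surrogates, and values past U+10FFFF.
    if (cp < kMin[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return false;
    if (cp < 0x10000) {
      out->push_back(static_cast<char16_t>(cp));
    } else {
      cp -= 0x10000;  // encode as a surrogate pair
      out->push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out->push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    i += len;
  }
  return true;
}
```

For example, the bytes "caf\xE9" (a Latin-1 "café") convert through Latin1ToUTF16 but fail UTF8ToUTF16, while "caf\xC3\xA9" decodes as UTF-8. A lenient decoder would instead produce replacement characters, which is presumably the kind of "strange result" meant above.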
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 2:48 PM, Greg Spencer wrote:
> So, I was unable to find the conversion utilities in base that do the conversion to/from UTF8. What are they called? If I missed them (and I looked for a while before I gave up), then maybe they need to be more prominent?

See base/string_util.h, UTF8ToUTF16() etc.

> What is the danger here of being lazy? Is it that developers will unwittingly do expensive conversions?

Yes, partly because including dedicated helpers like this makes it sound as if the class is somehow special-cased or fastpathed to deal better with these than a generic converter would be.

The other argument is simply that converting utf8 to utf16 is a generic sort of functionality that belongs in base/ or another similar general-purpose location, rather than specifically in FilePath.

PK
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 2:31 PM, Peter Kasting wrote:
> On Tue, Apr 28, 2009 at 1:39 PM, Greg Spencer wrote:
>> 1) I'd like to add some explicit routines for converting to/from UTF8 and UTF16. While it's nice (and important) that FilePath uses the platform's native string, we've found that many third party libraries have made other assumptions, where they always expect UTF8 (char) or UTF16 (wchar_t) paths regardless of platform, and converting a FilePath to and from those forms is a platform-dependent exercise which should be centralized into the class (i.e. adding "ToUTF8" and "ToWide" functions to the class, and explicit constructors that take each type).
>
> I'm pretty strongly against this for the same reasons as Evan. I think consumers who need to convert should be doing the conversion using their own routines (e.g. Chrome uses ones in our base/ module).

So, I was unable to find the conversion utilities in base that do the conversion to/from UTF8. What are they called? If I missed them (and I looked for a while before I gave up), then maybe they need to be more prominent?

What is the danger here of being lazy? Is it that developers will unwittingly do expensive conversions? If so, I would expect that a member function called "ToUTF8" would be just as much of a performance warning as a helper function called "FilePathToUTF8", but be a heck of a lot more convenient (since it would not require the developer to create a local variable for use as a return value from the helper, and can be used as an argument to another library's functions).

I can see the argument for not having a casting constructor that isn't from the platform native form, but in that case, a factory method called "CreateFromUTF8" should be a sufficient warning to the developer that it might be expensive.

-Greg.
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 2:41 PM, Amanda Walker wrote:
> On Tue, Apr 28, 2009 at 4:39 PM, Greg Spencer wrote:
>> 1) I'd like to add some explicit routines for converting to/from UTF8 and UTF16. While it's nice (and important) that FilePath uses the platform's native string, we've found that many third party libraries have made other assumptions, where they always expect UTF8 (char) or UTF16 (wchar_t) paths regardless of platform, and converting a FilePath to and from those forms is a platform-dependent exercise which should be centralized into the class (i.e. adding "ToUTF8" and "ToWide" functions to the class, and explicit constructors that take each type).
>
> One thing many of us have found, across multiple projects, is that wchar_t is fraught with complication as soon as more than one platform is involved. "wchar_t == UTF16" is a Windowsism (gcc defaults to 4 bytes, for example, and L"mumble" gets stored in UCS-4, not UTF-16). Chrome started with more or less what you are suggesting, and we moved off of it after much pain.

I understand those issues quite well (but I probably should call the conversion method ToUTF16, now that you mention it). And char* isn't necessarily UTF8 on all platforms either.

OK, so what's the currently recommended path for converting to UTF16 or UTF8 from a FilePath?

-Greg.
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 4:39 PM, Greg Spencer wrote:
> 1) I'd like to add some explicit routines for converting to/from UTF8 and UTF16. While it's nice (and important) that FilePath uses the platform's native string, we've found that many third party libraries have made other assumptions, where they always expect UTF8 (char) or UTF16 (wchar_t) paths regardless of platform, and converting a FilePath to and from those forms is a platform-dependent exercise which should be centralized into the class (i.e. adding "ToUTF8" and "ToWide" functions to the class, and explicit constructors that take each type).

One thing many of us have found, across multiple projects, is that wchar_t is fraught with complication as soon as more than one platform is involved. "wchar_t == UTF16" is a Windowsism (gcc defaults to 4 bytes, for example, and L"mumble" gets stored in UCS-4, not UTF-16). Chrome started with more or less what you are suggesting, and we moved off of it after much pain.

--Amanda
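A quick way to see the wchar_t difference described above is to count the code units a single character outside the Basic Multilingual Plane occupies in each string type. This is only an illustrative sketch (the helper names are made up for the example, and it assumes a compiler with C++11 Unicode literals): the fixed-width types give the same answer everywhere, while the wide string's answer depends on sizeof(wchar_t).

```cpp
#include <cstddef>
#include <string>

// Code units needed to store U+1F600 in each string type.
// UTF-16 always needs a surrogate pair for this character.
inline std::size_t UnitsInUtf16() {
  return std::u16string(u"\U0001F600").size();
}

// UTF-32 always stores it in a single unit.
inline std::size_t UnitsInUtf32() {
  return std::u32string(U"\U0001F600").size();
}

// The wide string depends on the platform: two units where wchar_t is
// 16 bits (the Windows convention), one unit where it is 32 bits.
inline std::size_t UnitsInWide() {
  return std::wstring(L"\U0001F600").size();
}
```

Code that assumes std::wstring holds UTF-16 will therefore compute different lengths and offsets depending on where it is built, which is one reason an explicitly sized type is the less surprising choice for a "UTF16" interface.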
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 1:39 PM, Greg Spencer wrote:
> 1) I'd like to add some explicit routines for converting to/from UTF8 and UTF16. While it's nice (and important) that FilePath uses the platform's native string, we've found that many third party libraries have made other assumptions, where they always expect UTF8 (char) or UTF16 (wchar_t) paths regardless of platform, and converting a FilePath to and from those forms is a platform-dependent exercise which should be centralized into the class (i.e. adding "ToUTF8" and "ToWide" functions to the class, and explicit constructors that take each type).

I'm pretty strongly against this for the same reasons as Evan. I think consumers who need to convert should be doing the conversion using their own routines (e.g. Chrome uses ones in our base/ module).

PK
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 1:57 PM, Thomas Van Lenten wrote:
> On Tue, Apr 28, 2009 at 4:39 PM, Greg Spencer wrote:
>> 4) Make sure we handle case sensitivity vs case preservation correctly. It's unclear to me that FilePath does this correctly on the Mac -- Mac file names are case preserving, but case insensitive, Unix filenames are both (and windows filenames are neither :-).
>
> FYI - it's a drive format time option on the Mac, so they can be case preserving and case sensitive.

Thanks for pointing that out. In fact, NTFS is actually case sensitive, where FAT32 is not (see http://support.microsoft.com/kb/100625). So we have issues there as well.

The real issue would be dealing with relative paths that don't exist yet -- there would be no way to inspect the file location to find out what mode it was in. I think I would just punt and go with the widely-used defaults (the ones I mentioned above), since most apps seem to assume those limitations.

An alternative would be to have an API to specify the desired mode, and default to the common case on each platform.

-Greg.
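The "specify the desired mode, default to the common case" alternative could look roughly like the sketch below. CaseMode, kDefaultCaseMode and PathsEqual are hypothetical names invented for the example, the per-platform defaults are the assumed common cases rather than something queried from the volume, and the comparison folds ASCII only (real file systems apply their own, more involved folding rules).

```cpp
#include <cstddef>
#include <string>

// Hypothetical API: the caller may override how names are compared.
enum class CaseMode { kSensitive, kInsensitive };

// Assumed common-case defaults; an actual volume may be formatted otherwise.
#if defined(_WIN32) || defined(__APPLE__)
constexpr CaseMode kDefaultCaseMode = CaseMode::kInsensitive;
#else
constexpr CaseMode kDefaultCaseMode = CaseMode::kSensitive;
#endif

// ASCII-only lowercase, so the result does not depend on the C locale.
inline char AsciiLower(char c) {
  return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Compares two path strings under the requested mode.
inline bool PathsEqual(const std::string& a, const std::string& b,
                       CaseMode mode = kDefaultCaseMode) {
  if (mode == CaseMode::kSensitive) return a == b;
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (AsciiLower(a[i]) != AsciiLower(b[i])) return false;
  }
  return true;
}
```

A caller that knows it is working on a case-sensitive volume would pass CaseMode::kSensitive explicitly; everyone else gets the platform default, including for relative paths that don't exist yet and so can't be inspected.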
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 1:39 PM, Greg Spencer wrote:
> 1) I'd like to add some explicit routines for converting to/from UTF8 and UTF16. While it's nice (and important) that FilePath uses the platform's native string, we've found that many third party libraries have made other assumptions, where they always expect UTF8 (char) or UTF16 (wchar_t) paths regardless of platform, and converting a FilePath to and from those forms is a platform-dependent exercise which should be centralized into the class (i.e. adding "ToUTF8" and "ToWide" functions to the class, and explicit constructors that take each type).

Can you give some examples of where this is needed? We've historically fought against this pretty hard, and as soon as accessors are available users will get lazy about it.
[chromium-dev] Re: Changes to FilePath?
On Tue, Apr 28, 2009 at 4:39 PM, Greg Spencer wrote:
> Hi Chromium Developers,
>
> I'm working on Google's O3D (http://code.google.com/p/o3d), and we (naturally) share some of Chrome's base classes for our code, including the very useful class FilePath.
>
> However, in using FilePath in the last few months, I've seen that it needs some refinement. I'd like to augment the FilePath class with some things that would make it more generally useful -- it's very nicely set up, but it's missing a few things that make it harder to work with than it needs to be:
>
> 1) I'd like to add some explicit routines for converting to/from UTF8 and UTF16. While it's nice (and important) that FilePath uses the platform's native string, we've found that many third party libraries have made other assumptions, where they always expect UTF8 (char) or UTF16 (wchar_t) paths regardless of platform, and converting a FilePath to and from those forms is a platform-dependent exercise which should be centralized into the class (i.e. adding "ToUTF8" and "ToWide" functions to the class, and explicit constructors that take each type).
>
> 2) I'd like to make it possible to instantiate a POSIX FilePath object on Windows and a Windows FilePath on POSIX platforms. This is because some libraries (e.g. the zip library, or tar files), use POSIX semantics for their paths even on Windows (I haven't seen a use case for Windows paths on POSIX yet, actually). This would make it possible to use the nice API that FilePath has to manipulate paths appropriately for these other libraries. This could be easily accomplished by having POSIX and Windows versions of FilePath, and then typedef'ing FilePath differently on different platforms to one of these versions.
>
> 3) It would be helpful to have real path normalization for each of the platforms (although I know what a testing nightmare that can be). I might try and tackle this if people think it would be beneficial.
>
> 4) Make sure we handle case sensitivity vs case preservation correctly. It's unclear to me that FilePath does this correctly on the Mac -- Mac file names are case preserving, but case insensitive, Unix filenames are both (and windows filenames are neither :-).

FYI - it's a drive format time option on the Mac, so they can be case preserving and case sensitive.

TVL

> So, is there any resistance to any of the above? Do you have other suggestions that I might take into account? Am I violating any design assumptions of FilePath? For #2, is speed/size enough of a concern to avoid a virtual base class (I wouldn't think so, but you never know..)?
>
> -Greg.
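Item 2 in the quoted proposal (POSIX and Windows flavors, with FilePath typedef'd to one of them per platform) can be sketched without a virtual base class by parameterizing on a small traits type. All names below (BasicPath, PosixTraits, WindowsTraits, NativePath) are hypothetical, and the sketch simplifies by storing narrow strings for both flavors, whereas the real FilePath uses the platform's native string type.

```cpp
#include <string>
#include <utility>

// Hypothetical traits describing each path flavor's separator.
struct PosixTraits {
  static constexpr char Separator() { return '/'; }
};
struct WindowsTraits {
  static constexpr char Separator() { return '\\'; }
};

// One implementation shared by both flavors; no virtual dispatch needed.
template <typename Traits>
class BasicPath {
 public:
  explicit BasicPath(std::string value) : value_(std::move(value)) {}

  // Joins a component using this flavor's separator.
  BasicPath Append(const std::string& component) const {
    if (value_.empty()) return BasicPath(component);
    std::string joined = value_;
    if (joined.back() != Traits::Separator())
      joined.push_back(Traits::Separator());
    joined += component;
    return BasicPath(std::move(joined));
  }

  const std::string& value() const { return value_; }

 private:
  std::string value_;
};

using PosixPath = BasicPath<PosixTraits>;
using WindowsPath = BasicPath<WindowsTraits>;

// The platform default is just an alias for one of the two flavors.
#if defined(_WIN32)
using NativePath = WindowsPath;
#else
using NativePath = PosixPath;
#endif
```

Code that builds zip or tar entry names could then use PosixPath("dir").Append("file.txt") on any platform, while ordinary file access keeps using NativePath. Because the flavor is chosen at compile time, this shape sidesteps the speed/size question about a virtual base class, at the cost of making the two flavors distinct types.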