Re: [vfs] parsing uri
As you might have seen I implemented the plugin-resolve-stuff. Now we could extend VFS by simply drop a jar into the classpath and if we find a /META-INF/vfs-plugins.xml it would be added. That way we could keep the VFS core slim and could provide extension jars to allow whatever we can think of. I think this is a good compromise. Is this the point of view that I am trying to compromise with? We should add everything to vfs that seems at least remotely useful or if not useful then at least somewhat cool. And if at somepoint something we added is no longer neither useful nor cool we still keep it around to keep vfs backward compatible. Did I get this right? My point of view is: We should clearly and explicitly define the scope of vfs to be an excellent api to filesystems in general in heterogenous and distributed environment. We should write elegant, logicallly correct and well documented piece of software to do that. And make it extremely robust. So the compromise is this (please confirm): We make all providers to be pluggable so that there is the vfs-core with maybe one provider for logical testing of the core. And a bunch of provider plugins nicely packaged so that you can just grab the once you need and ignore the rest. And the core will not get any extra quirks because it would be nice when doing something with hibernate through vfs. So, yes, I think this could be a good compromise between the conservatists (me) and the liberal (them). Note that politically I am liberal but logically I am conservatist. But I already talked about, think of accessing your mailfolder through an imap provider and your mailcontent through an mime provider. e.g. mime:imap://[EMAIL PROTECTED]/INBOX/mail9012718!/part1.txt Sooner or later, this might happen ... and why not - its cool, isnt it? Our ideas of coolness slightly differ. My idea of coolness would be that the imap protocol would be better defined and more to the point (pop3 was much better in this). 
I remember the times when I was planning on accessing lotus notes through it's imap interface that supposedly could give you a hierarchical representation of notes databases. Here's a new cool provider idea for vfs. Lotus Notes provider that uses the imap service of notes. This way you could nicely present notes documents and forms as files and folders. And write a few books about the possible semantics. Plus since notes is commercial one could actually make a few dollars out of it. So you get the point. Not all that glitters is gold. But sometimes all we really want and need is just the glitter :) - rami - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [vfs] parsing uri
Hello! So the compromise is this (please confirm): We make all providers to be pluggable so that there is the vfs-core with maybe one provider for logical testing of the core. And a bunch of provider plugins nicely packaged so that you can just grab the once you need and ignore the rest. Yes! Though I am not sure what the core should be. Maybe we would define the current state as core? VFS in its current packaging is widely used and accepted. Every new filesystem will be extra packaged and only if the community might find it useful a voting could start to decide if we put it into the core. And the core will not get any extra quirks because it would be nice when doing something with hibernate through vfs. YES! And I hope my progress so far made clear that I am definitely would NOT put some quirks into VFS. In fact I would say I am VERY conservative in the stuff I do. All I have done so far was to stabilize and complete VFS: *) cache *) RandomAccessContent *) compressed files *) filename parsing *) plugin *) now and then (more then then now ;-) ) a little bit documentation After every change I let VFS settle down a sufficient time period to gave everyone a chance to test and report errors. So you get the point. Not all that glitters is gold. But sometimes all we really want and need is just the glitter :) And sometimes we just find some spare minutes and would like to experiment a little bit - even if the result is glitter, maybe for the time being it become gold. --- Mario - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
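The drop-a-jar idea discussed above can be sketched with plain JDK class-loader calls. This is only an illustration of the discovery step, assuming the descriptor path META-INF/vfs-plugins.xml named in the thread; the class name PluginDescriptorScan is made up, and how VFS itself parses and registers the descriptors is not shown.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of VFS: lists every copy of a descriptor
// file that is visible on the classpath (one per jar that ships it).
final class PluginDescriptorScan {
    // Resource names passed to a ClassLoader carry no leading slash.
    static final String DESCRIPTOR = "META-INF/vfs-plugins.xml";

    static List<URL> findDescriptors(ClassLoader loader, String resourceName) throws IOException {
        return new ArrayList<>(Collections.list(loader.getResources(resourceName)));
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        for (URL url : findDescriptors(loader, DESCRIPTOR)) {
            System.out.println("plugin descriptor: " + url);
        }
    }
}
```

With no extension jars on the classpath the list is simply empty, which matches the "grab the ones you need and ignore the rest" goal.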
Re: [vfs] parsing uri
And sometimes we just find some spare minutes and would like to experiment a little bit I have absolutely nothing against experimenting and coding all kinds of weird and useful things for your own purposes. I was just talking about what should be included into vfs. even if the result is glitter, maybe for the time being it become gold. The difference between gold and glitter is that the good feeling that gold gives lasts long. With glitter it only lasts a moment. YES! And I hope my progress so far made clear that I am definitely would NOT put some quirks into VFS. In fact I would say I am VERY conservative in the stuff I do. 12 points ... Mario. - rami - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: [vfs] parsing uri
By the way could you add the following to VFS public static void close() { BTW you know its sideeffects? You no longer can use VFS in threaded environments as this close closes ALL filesystems. (I have some gc problems and this alleviates it a bit) Lets try to figure it out. I use vfs in an webapp and never had a problem with memory consumption. Maybe I can add a method to dump the cache content, that way we might find whats the problem - if its VFS fault. Well I don´t see the gc problem as vfs problem as much as jvm gc problem. My app is a plugin to an editor that I want to reload by removing all references to plugin classloader. And then load a new version with a new classloader. The problem is that no matter how much I get rid of references (and see with a memory debugger that there are no references) the classloader and the classes it has loaded will not go away. I have tested that the unloading of classloaders does work in simple cases but not when a lot has happened in the classloader. Now when the classloader does not get unloaded neither will any static fields. VFS has some static fields for perfectly good reasons (caching). So all I need is the ability to tell VFS that I am done with you try to free the memory you have reserved (in static fields) because your classloader does not want to unload you. Actually I should send this request to SUN and say that there should be a close method in classloader but I don´t think they would listen. Note. I also use the deprecated stop method in thread class because in some cases it is the only way and is very good. So if it could help you could also make the close method deprecated and say in docs that one should be VERY careful when using it. - rami - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [vfs] parsing uri
Your asumption about the used servers is correct. Now why uml or vmware: It is a pain to setup all this stuff and keep it in sync with any junit changes. With uml or vmware I can provide a image one simply can drop into its box and startup the tests. So no security problem, just to simplify the installation. Sounds good. Vmware is the best IMHO around. I have used it only their cracked open source so I don't know about their goodwill to open source dudes :) Just as a sidenote: I think it is not the responsibility of VFS to ensure running with different server implementations. The used libraries should handle this. Though, we should do what we can to support them finding problems with exotic platforms. Good point. agree. I am not at home now, I will send one later. Take your time. I don't pay anything for this :) Tempfs uses the DefaultFileReplicator to handle its content. So where are the files stored? Do they get deleted when vfs closes. Or when jvm closes? what if jvm crashes? - url provider bothers me because it kind of duplicates vfs. And it DUPLICATES the effort of vfs (http, ftp, jar ...) Now you get emotional ;-) Its better to integrate than to rule out. We also provide a method to wrap VFS into a URLConnection. I was not emotional. I was rational. Now that I have been sipping some italian red wine I am ready to get emotional. What do you mean by integration? Integrate into what? The point is that it does not offer any capabilities that are not already provided by vfs. So i does not give any further integrative possibilities. What it does give is undocumented features that duplicate documented features. And it does not work (probably) with all implementations of Java API. And the whole project of accessing any urls with some api (like the URLConnection API) is doomed to fail because url is such a broad concept and there will be cases of url that fit VERY badly to the API. I mean you can point to anything with URL (that is where the universal comes from). 
And you cannot have a meaningful API to ANYTHING. URIs and URLs are about universal naming in the world of computers (and the internet specifically). APIs tend to go beyond naming. Further, this "let's embrace everything" attitude will take vfs into the world of yet another universal whatever. And the evolution goes like this: a lot of good things and features are provided that are trendy at the moment. When the system becomes too messy to understand, it is forgotten. "Virtual filesystem" can mean anything because of the magic word "virtual". But I wish this would be just a filesystem that can integrate different kinds of filesystems on the network. That already is a tall order. And also note that the filesystem model is a very simple hierarchical model. So we should not see it as the ultimate way to model and interact with data. I think I am still being rational but in a good emotional way :) - rami
Re: [vfs] parsing uri
Rami Ojares wrote: So where are the files stored? The FileReplicator creates a vfs_cache directory and takes care of its content. Do they get deleted when vfs closes. Yes, but only if you tell VFS to close e.g. ((DefaultFileSystemManager) VFS.getManager()).close(); Oh, boy - how do you to this? How do you find all those stinky nooks? ;-) Or when jvm closes? Nope! Before you ask - No, I dont want to use deleteFileOnExit function - we already discussed it. But what I can do is to implement a shutdown-hook and try to cleanly shutdown VFS then. what if jvm crashes? Bad luck! ;-) Now that I have been sipping some italian red wine I am ready to get emotional. I go green with envy! What do you mean by integration? Integrate into what? The point is that it does not offer any capabilities that are not already provided by vfs. So i does not give any further integrative possibilities. Yes, yes and yes, you might be right with all you say, but ... now we have it. And I think the main intention was to allow ftp and http (read-onl) access on systems where no commons-net or httpclient available, though do not know how well it works. You and I do not use it, and do not like it ... good, our point of view. Again - now we have it and there is no need to remove it - we wont be bothered. Further this let's embrace everything attitude will take vfs into the world of yet another universal whatever. As you might have seen I implemented the plugin-resolve-stuff. Now we could extend VFS by simply drop a jar into the classpath and if we find a /META-INF/vfs-plugins.xml it would be added. That way we could keep the VFS core slim and could provide extension jars to allow whatever we can think of. I think this is a good compromise. And also note that filesystem model is very simple hierarchical model. So we should not see it as the ultimate way to model and interact with data. No one do, do we? 
But I already talked about, think of accessing your mailfolder through an imap provider and your mailcontent through an mime provider. e.g. mime:imap://[EMAIL PROTECTED]/INBOX/mail9012718!/part1.txt Sooner or later, this might happen ... and why not - its cool, isnt it? Now that we have the plugin stuff we could do it without bloating the core. --- Mario - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
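The shutdown hook Mario mentions for cleaning up on JVM exit could look roughly like the following. It is a sketch only: a plain java.io.Closeable stands in for the file system manager, and CloseOnExit is a made-up name. As noted in the thread, nothing runs if the JVM crashes.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper: closes a resource when the JVM shuts down normally.
final class CloseOnExit {
    /** Registers a shutdown hook that closes the resource and returns the hook thread. */
    static Thread register(Closeable resource) {
        Thread hook = new Thread(() -> {
            try {
                resource.close();
            } catch (IOException e) {
                // Nothing sensible left to do while the JVM is going down.
                e.printStackTrace();
            }
        }, "close-on-exit");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        AtomicBoolean closed = new AtomicBoolean(false);
        // Stand-in for the real manager; a real caller would pass something
        // whose close() removes the temporary cache directory.
        register(() -> closed.set(true));
        System.out.println("hook registered; it runs on normal JVM exit, not on a crash");
    }
}
```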
Re: [vfs] parsing uri
Hi Rami! There are new vfs nightlies [1] available with my reworked filename parsing. All tests passed, but I would really appreciate if you could take some time to make some tests with it. Thanks! Mario [1] http://cvs.apache.org/builds/jakarta-commons/nightly/commons-vfs/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [vfs] parsing uri
I glanced at the tests you have in place for URIs and naming, and they seem extensive, so I am not going to test this extensively myself. But the case that brought this issue up was the following. I called

    FileObject resolveFile(File baseFile, String name)

on FileSystemManager, where the baseFile had a path something like /foo/%bar and the name was just some ordinary fname. So just make sure you have this in your JUnit tests. Anyway, quickly trying the snapshot in my program brings no errors.

By the way, could you add the following to VFS:

    // needs: import java.lang.reflect.Method;
    public static void close() {
        try {
            // Close the FileSystemManager instance via reflection
            final Method closeMethod = instance.getClass().getMethod("close", (Class[]) null);
            closeMethod.invoke(instance, (Object[]) null);
        } catch (Exception e) {
            // Closing failed: report it and carry on
            e.printStackTrace();
        }
        // Drop the static reference so the manager can be collected
        instance = null;
    }

DefaultFileSystemManager already has the close method, and this would mirror the init method. (I have some gc problems and this alleviates them a bit.)

About the testing environment: what exactly is needed to run the tests? A quick guess would be
- ftp server
- sftp (ssh) server
- samba server
- tomcat for http and webdav

Why is uml or vmware needed? The way I would test is to just have those servers running on my machine. Of course, if something crashes, vmware can bring security, but if none of the services run as root the setup should be secure enough. I have never run the tests but I could try doing it. It would be easy for me to set up those services on my gentoo. Of course, if you are planning on testing stuff on many platforms and different server implementations then vmware would be needed, but isn't that overkill? I mean, how many different ftp servers are there? And where do you draw the line? I am sure that you are not going to test all ftp servers running on OS/400 using EBCDIC encoding :?) If you could give a quick tutorial (that could then be added to the docs) about how to run the tests, I could give it a shot. And I even have an XP on a separate machine for smb testing.
And maybe there could be some kind of a profile where you tell what services you have on for testing and where they can be found? And now for some random thoughts: It seems to my that currently the providers could be categorized into 4 categories: - local filesystem - network protocol based providers - ftp, sftp, smb, webdav, http - layered filesystems - tar, jar, bzip2, compressed, gzip, zip - filesystems based on concepts from java environment - temp, url, res I really have no deep understanding about this but please enlighten me where I am wrong. - temp seems to have a special place because almost nothing is implemented under temp package. So the implementation must be somewhere higher. I assume that the implementation uses java's temporary file concept from java.io.File API ??? - resources have a special place for a java program and earn their place because of that. - url provider bothers me because it kind of duplicates vfs. Basically it says that you can access any url but the reality is that you can access only urls for which there exists a provider inside sun's jdk. The set of these providers is not part of the api and thus undocumented and subject to change any day. And do we find all those providers in other jdk's. And it DUPLICATES the effort of vfs (http, ftp, jar ...) And then one question about layered filesystems. Can you layer them as much as you like. smb - zip - jar etc. Time to go to sleep. - rami - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [vfs] parsing uri
Rami Ojares wrote: Anyway quickly trying the snapshot in my program brings no errors. Well, that great! By the way could you add the following to VFS public static void close() { BTW you know its sideeffects? You no longer can use VFS in threaded environments as this close closes ALL filesystems. (I have some gc problems and this alleviates it a bit) Lets try to figure it out. I use vfs in an webapp and never had a problem with memory consumption. Maybe I can add a method to dump the cache content, that way we might find whats the problem - if its VFS fault. Why is uml or vmware needed? Your asumption about the used servers is correct. Now why uml or vmware: It is a pain to setup all this stuff and keep it in sync with any junit changes. With uml or vmware I can provide a image one simply can drop into its box and startup the tests. So no security problem, just to simplify the installation. Of course if you are planning on testing stuff on many platforms and different server implementations Just as a sidenote: I think it is not the responsibility of VFS to ensure running with different server implementations. The used libraries should handle this. Though, we should do what we can to support them finding problems with exotic platforms. If you could give a quick tutorial (that could be added then to docs) about how to run tests I could give it a shot. And I have even an XP on separate machine for smb testing. I am not at home now, I will send one later. It seems to my that currently the providers could be categorized into 4 categories: correct. - temp seems to have a special place because almost nothing is implemented under temp package. So the implementation must be somewhere higher. Tempfs uses the DefaultFileReplicator to handle its content. I assume that the implementation uses java's temporary file concept from java.io.File API ??? Not exactly as the filename handling is somewhat different. 
- resources have a special place for a java program and earn their place because of that. Yes. - url provider bothers me because it kind of duplicates vfs. And it DUPLICATES the effort of vfs (http, ftp, jar ...) Now you get emotional ;-) Its better to integrate than to rule out. We also provide a method to wrap VFS into a URLConnection. And then one question about layered filesystems. Can you layer them as much as you like. smb - zip - jar etc. Yes - should work. It is done by expanding every archive into the temporary store. --- Mario - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [vfs] parsing uri
My daily status report ;-)

Rami Ojares wrote: That being said I think that the absolute minimum is only % But what is the practical minimum is left in the air.

This is what I have done now. I changed the code to pass the given URI down to the filesystem and only honor the % sign as a special character. From the point of view of VFS, only the ! needs special care, as it is used by the layered filesystems (tar, zip, jar, ...).

Now if a URI contains a %nn sequence, it is meant to cancel any special meaning of the corresponding character, i.e. if one encodes the . as %2e it is no longer meant as the current directory. Sure, none of the currently available filesystems can create a file named . - but what if we one day have a filesystem backed by a database ...

Let's come back to the !. Again a new incompatibility with the previous naming scheme. In the past it was necessary to encode the character to access a nested archive, i.e.:

tar:tar:file:/home/tar1.tar%21/tar2.tar!/entry.txt

Now it is (I think) more natural:

tar:tar:file:/home/tar1.tar!/tar2.tar!/entry.txt

And the ! is usable in filenames as %21. I think it was really worth the work.

Now that it is possible to safely pass URIs, we could have a look at how we should encode. I will try to figure out how local-file, ftp, http, webdav, smb and sftp handle filenames with special characters.

--- Mario
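The new role of ! and %21 can be illustrated with a small splitter. This is not VFS's actual parser, just a sketch of the rule described above: a literal ! separates archive layers, while %21 is decoded afterwards and ends up as a plain ! inside a name. The class and method names are made up, and the decoder only handles single-byte %nn escapes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "!" rule for layered URIs.
final class LayeredUriSplit {
    /** Decodes %nn escapes one byte at a time; malformed escapes are kept as-is. */
    static String decode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.append((char) (hi * 16 + lo));
                    i += 2;
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    /** Splits at every literal '!' first, then decodes each layer. */
    static List<String> splitLayers(String uri) {
        List<String> layers = new ArrayList<>();
        for (String raw : uri.split("!", -1)) {
            layers.add(decode(raw));
        }
        return layers;
    }

    public static void main(String[] args) {
        System.out.println(splitLayers("tar:tar:file:/home/tar1.tar!/tar2.tar!/entry.txt"));
        System.out.println(splitLayers("zip:file:/home/a%21b.zip!/x.txt"));
    }
}
```

Splitting before decoding is the point of the design: an escaped ! never reaches the splitter as a separator.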
Re: [vfs] parsing uri
Hello!

Now that it is possible to safely pass URIs, we could have a look at how we should encode. I will try to figure out how local-file, ftp, http, webdav, smb and sftp handle filenames with special characters.

During my tests I found some side effects which need some thought:

1) The cache

The cache uses the filename as key. Now if I try to resolve a file named webdav:/anydir/test%0d.txt, webdav will return a file named webdav:/anydir/test\r.txt (\r = the unencoded %0d). As you can see, the two filenames are different and thus two different entries are created in the cache (which is not acceptable). If I ask webdav to return the escaped form of the name, it will return webdav:/anydir/test%0D.txt (notice the uppercase D) - again a different name. However, what if someone is funny and tries to resolve webdav:/anydir/%74est%0d.txt? In this case the filename from the file provider is different regardless of whether I get the normal or the escaped form. So my conclusion is to always use the decoded form of the filename for the cache key, knowing that in the very rare cases where the decoding is not symmetric I might have a problem with the cache.

2) German umlauts ... and any other non-ASCII character

I can't use the encoded form of the filename from the filesystem provider, as I would then have to know the encoding (ISO, UTF-8). Currently the filesystem libraries are responsible for the correct decoding - and I don't want to enter a charset war - so again, it is best to use the decoded filename.

Result: VFS should not introduce its own encoding. Only the % (and the ! for the layered filesystems) needs some special handling, to allow the case where one needs to pass a special URL down to the filesystem.

Comments?

--- Mario
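The cache conclusion can be sketched as follows: every spelling of a name is decoded before it is used as a map key, so the three webdav spellings from the message collapse into one entry. DecodedKeyCache is a made-up stand-in, not the VFS cache, and its decoder handles only single-byte %nn escapes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache keyed by the decoded form of the file name.
final class DecodedKeyCache {
    private final Map<String, Object> entries = new HashMap<>();

    /** Cache key: the name with every well-formed %nn escape decoded. */
    static String key(String uri) {
        StringBuilder out = new StringBuilder(uri.length());
        for (int i = 0; i < uri.length(); i++) {
            char c = uri.charAt(i);
            if (c == '%' && i + 2 < uri.length()) {
                int hi = Character.digit(uri.charAt(i + 1), 16);
                int lo = Character.digit(uri.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.append((char) (hi * 16 + lo));
                    i += 2;
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    void put(String uri, Object file) { entries.put(key(uri), file); }

    Object get(String uri) { return entries.get(key(uri)); }

    int size() { return entries.size(); }

    public static void main(String[] args) {
        DecodedKeyCache cache = new DecodedKeyCache();
        cache.put("webdav:/anydir/test%0d.txt", "file object");
        cache.put("webdav:/anydir/test%0D.txt", "file object");
        cache.put("webdav:/anydir/%74est%0d.txt", "file object");
        System.out.println("entries: " + cache.size());
    }
}
```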
Re: [vfs] parsing uri
The URI spec dudes talk about the canonical form of a URI. This is left for the scheme to define. Now if vfs is in control of the URIs that come in and go out, then it would be possible to canonicalize a URI when it enters the core areas of vfs that are not provider (scheme) specific. The cache, I believe, is in that core area.

Let's say someone points to a file with the URI

webdav:/anydir/%74est%0d.txt

This is canonicalized into

webdav:/anydir/test%0D.txt

So when someone next points to the URI

webdav:/anydir/tes%74%0D.txt

he will get the cached file.

Note: canonicalization could be provider specific, so that different schemes could escape different sets of characters.

What do you think?
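One possible reading of the canonicalization proposed here, as a sketch: escapes of ASCII letters and digits are decoded, every other escape is kept but its hex digits are uppercased. Which characters a scheme chooses to unescape is an assumption of this example (the thread leaves it to the provider), and UriCanonicalizer is a made-up name.

```java
// Hypothetical canonicalizer: %74 -> t, %0d -> %0D.
final class UriCanonicalizer {
    private static boolean isAsciiLetterOrDigit(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    static String canonicalize(String uri) {
        StringBuilder out = new StringBuilder(uri.length());
        for (int i = 0; i < uri.length(); i++) {
            char c = uri.charAt(i);
            int hi = -1;
            int lo = -1;
            if (c == '%' && i + 2 < uri.length()) {
                hi = Character.digit(uri.charAt(i + 1), 16);
                lo = Character.digit(uri.charAt(i + 2), 16);
            }
            if (hi < 0 || lo < 0) {
                // Ordinary character or malformed escape: copy unchanged.
                out.append(c);
                continue;
            }
            char decoded = (char) (hi * 16 + lo);
            if (isAsciiLetterOrDigit(decoded)) {
                out.append(decoded);
            } else {
                out.append('%')
                   .append(Character.toUpperCase(uri.charAt(i + 1)))
                   .append(Character.toUpperCase(uri.charAt(i + 2)));
            }
            i += 2;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("webdav:/anydir/%74est%0d.txt"));
        System.out.println(canonicalize("webdav:/anydir/tes%74%0D.txt"));
    }
}
```

Both calls print webdav:/anydir/test%0D.txt, so either spelling would hit the same cache entry. An escaped dot (%2e) stays escaped, which keeps it distinct from the "current directory" dot.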
Re: [vfs] parsing uri
Rami Ojares wrote: webdav:/anydir/%74est%0d.txt This is canonicalized into webdav:/anydir/test%0D.txt So when someone points next time to uri webdav:/anydir/tes%74%0D.txt then he will get the cached file.

I still do not know how to canonicalize a file name with umlauts. E.g. VFS (running on an ISO-8859-1 filesystem) accesses a webdav filesystem running on a UTF-8 system. Now the VFS URL might be something like

webdav:/anydir/R%E4tsel.txt

which means webdav:/anydir/Rätsel.txt in ISO-8859-1. But notice, I can't canonicalize it, as I do not know the encoding of the URL. And the encoded form reported by webdav might be the UTF-8 encoded form:

webdav:/anydir/R%C3%A4tsel.txt

Notice: if you use the plain form webdav:/anydir/Rätsel.txt it is always possible to resolve the file, regardless of the destination charset. The library is able to convert, as it knows the source charset and can send the name to the server accordingly.

Currently I can't see any other option than NOT to use encoded URIs at all. It is now possible to encode using the %, but it should not be the preferred way. And the canonicalized form is the decoded form - except for % (and sometimes !).

--- Mario
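The umlaut problem is easy to reproduce with the JDK's java.net.URLEncoder and URLDecoder. They implement form encoding rather than URI path escaping (a space becomes +, for instance), so treat this purely as a demonstration of how the escaped form depends on the charset; the class name UmlautEscapes is made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Shows that one file name has different %-escaped forms per charset.
final class UmlautEscapes {
    static String escape(String name, String charset) throws UnsupportedEncodingException {
        return URLEncoder.encode(name, charset);
    }

    static String unescape(String escaped, String charset) throws UnsupportedEncodingException {
        return URLDecoder.decode(escaped, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String name = "R\u00e4tsel.txt"; // "Rätsel.txt", written with an escape to be source-encoding safe
        String latin1 = escape(name, "ISO-8859-1");
        String utf8 = escape(name, "UTF-8");
        System.out.println(latin1);
        System.out.println(utf8);
        // Decoding with the wrong charset does not give the original name back.
        System.out.println(name.equals(unescape(latin1, "UTF-8")));
    }
}
```

The plain, unescaped string carries no such ambiguity inside the JVM, which is the argument for keeping the decoded form as the canonical one.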
Re: [vfs] parsing uri
Again quoting the RFC: For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used. It is expected that a systematic treatment of character encoding within URI will be developed as a future modification of this specification. I quess http schema sticks to US-ASCII for now. But maybe with escapes you could access on some web servers pages like http://aku.suomi.fi/k%E4%E4k.html = http://aku.suomi.fi/kääk.html To be honest I don't know. Also I don't know if the systemic treatment has already happened or when it will happen. So it is up to us to decide how we deal with charsets. Since vfs is written in java it would make sense to first turn the character sequence of to 16 bit unicode (UTF-16?) And then encode every character above US-ASCII (7 bit) or ISO-LATIN-1 (8 bit). But this would not make the visual representation of URI very nice. According to URI spec one should be able to read URI on the radio :-) If you are in japan every character would be encoded and very difficult to read for the announcer. But if you don't encode then that URI would look to westerner a sequence of those boxes that represent character for which there is no font. Let's get practical. Someone wrote the following uri in ant build file (and some ant task uses vfs). webdav:/höh/kääk.ini Ant when reading the string knows that it is encoded in iso-latin-1 But the string in jvm is in unicode. Ant gives this string (uri) to vfs that encodes all character above us-ascii. 
So it is now webdav:/h%F6h/k%E4%E4k.ini

Now the webdav provider makes an http request, let's say to tomcat. The question arises: can tomcat (or the webdav protocol spec) handle unicode characters in resource names? I don't know. But maybe the webdav provider implementor knows. So if webdav names only handle us-ascii, then the provider can say right away, when it is asked to canonicalize the URI, that this is not a proper webdav URI. Or maybe this is not specified, and some webdav servers could handle the URI and some could not. Maybe the webdav provider could then ask the server what it supports. But maybe there is no single standardized way to ask this. At this point a sane person starts to give up and thinks: Whatever! Just pass the string and let the user handle errors.

But let's say that webdav can handle iso-latin-1 and the request is sent to the server. The server's filesystem is encoded in some other coding (EBCDIC?) that maps ö and ä to different numbers. So in order to do the mapping, the webdav server would need to know what character encoding vfs uses (UTF-16). But since this is not specified (at least in the RFC I am quoting), it would probably unescape using its own encoding and request the wrong resource from its filesystem.

This state of affairs makes me wonder whether the standard makers really want to make standards or just pretend to. The answer is of course that the industry wants to make standards up to a point, because confusion and protectionism make the IT business thrive.

That being said, I think one pragmatic approach could be to treat URI characters as being from the Unicode character set. When transported they would be in US-ASCII, where everything above us-ascii is escaped. So to answer your question: ü = %FC (its code point is U+00FC). But all this is just assuming and making things up. I guess the decision is in your hands since you write the code. - rami
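For comparison, the JDK's own java.net.URI settles the charset question by always using UTF-8 when it has to produce a pure US-ASCII form. The sketch below only shows that JDK behaviour for the example name from this message; it says nothing about what a webdav server expects, and the class name AsciiForm is made up.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Shows how java.net.URI turns non-ASCII path characters into escapes.
final class AsciiForm {
    static URI build(String scheme, String path) throws URISyntaxException {
        // The multi-argument constructor quotes illegal characters itself.
        return new URI(scheme, null, path, null);
    }

    public static void main(String[] args) throws URISyntaxException {
        URI uri = build("webdav", "/h\u00f6h/k\u00e4\u00e4k.ini"); // /höh/kääk.ini
        System.out.println(uri);                 // readable form, umlauts kept
        System.out.println(uri.toASCIIString()); // umlauts as UTF-8 escapes
    }
}
```

Note that the ASCII form uses two-byte UTF-8 escapes (%C3%B6 for ö), not the single-byte %F6 written above, which is exactly the kind of disagreement the message describes.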
Re: [vfs] parsing uri
I wrote the previous email too quickly, so there are many errors in the details. So please read it without paying too much attention to the details. I guess those URIs with non-US-ASCII characters always get sent in some encoding. It would work nicely if it could be US-ASCII. But with encoding the interpretation problem is just lifted one level up, since we don't know how to negotiate the encoding. So I guess a minimal encoding policy is better, because it works just as well and is of course much less hassle. - rami
RE: [vfs] parsing uri
Can someone please help me get off this list? I have tried to unsubscribe unsuccessfully. thanks -Original Message- From: Mario Ivankovits [mailto:[EMAIL PROTECTED] Sent: Thursday, March 10, 2005 2:16 PM To: Jakarta Commons Developers List Subject: Re: [vfs] parsing uri Hello! Now that it is possible to safely pass uris we could have a look how we should encode. I will try to figure out how local-file, ftp, http, webdav, smb, sft will handle filenames with special characters. During my tests I found some sideeffects which needs some thoughts: 1) The cache The cache uses the filename as key - now if I try to resolve a file named webdav:/anydir/test%0d.txt the webdav will return a file named webdav:/anydir/test\r.txt (\r = the unencode %0d) As you might see, both filenames are different and thus it will create two different entries in the cache (which is not acceptable). If i ask wedav to return the escaped form of the name it will return webdav:/anydir/test%0D.txt (notice the uppercase D) - again a different name. However, what if one is funny and tries to resolve webdav:/anydir/%74est%0d.txt In this case the filename from the fileprovider is different - regardless if I get the normal or escaped form. So my conclusion is to always use a decoded form of the filename for the cache key - knowing that in the very very rare cases where the decoding is not symmetric I might have a problem with the cache. 2) German Umlauts ... and any other non ascii character. I cant use the encoded form of the filename from the filesystem provider as I have to know the encoding then (ISO, UTF-8). Currently the filesystem libraries are responsible for the correct decoding - and I dont want to enter a charset war - again, its best to use the decoded filename. Result: VFS should not introduce its own encoding, only the % (and ! for the layered filesystem) needs some addressing and to allow the case where one needs to pass down a special url to the filesystem. Comments? 
--- Mario
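The cache-key conclusion in the message above can be sketched as follows. This is only an illustration, not VFS code: the class and method names are made up, and it assumes that escaped bytes are UTF-8, which the message explicitly leaves to the filesystem libraries.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of VFS: derives a cache key by decoding
// every %xx escape, so differently escaped spellings share one key.
class CacheKeySketch {
    static String decode(String name) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < name.length()) {
            if (name.charAt(i) != '%') {
                sb.append(name.charAt(i++));
                continue;
            }
            // Collect a run of consecutive escapes so that multi-byte
            // characters are decoded together (UTF-8 is an assumption).
            ByteArrayOutputStream run = new ByteArrayOutputStream();
            while (i < name.length() && name.charAt(i) == '%') {
                if (i + 3 > name.length()) {
                    throw new IllegalArgumentException("Truncated escape in " + name);
                }
                run.write(Integer.parseInt(name.substring(i + 1, i + 3), 16));
                i += 3;
            }
            sb.append(new String(run.toByteArray(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = decode("webdav:/anydir/test%0d.txt");
        String b = decode("webdav:/anydir/test%0D.txt");
        String c = decode("webdav:/anydir/%74est%0d.txt");
        System.out.println(a.equals(b) && b.equals(c));
    }
}
```

All three spellings from the message decode to the same string, so they would map to a single cache entry. A name that contains a literal % would still have to arrive escaped as %25.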
Re: [vfs] parsing uri
The File-URI codec on unix encodes \foo\bar --> %5Cfoo%5Cbar

This is to be interpreted as a file or dir named \foo\bar. If you send this uri to a jvm on windows you get new File(new URI(uriStr)), which is interpreted as file or dir bar under dir foo, which is under root.

So it seems that on unix %5C is not interpreted as having special meaning, but on windows it is. The other alternative on windows would be to throw an exception because a file with the given path can't be created. So it is thought that it makes life easier if %5C is interpreted as path separator on windows.

The same question applies to . (dot = current dir), .. (double dot = parent dir) and any other characters that we might want to assign some special meaning to (eg. ~ tilde). When do we interpret a special character to have its special meaning, and how do we escape away that special meaning?

Well, the answer is simple and in line with what you think is right. The %xx notation ESCAPES the character and NEGATES the possible special meaning it might have.

So therefore I think it would be more correct if %5Cfoo%5Cbar on windows would throw an exception.

And your intuition is correct.

But note: if I have a path ../xtc then the corresponding uri should be ../xtc, because in this case we want the dots to have their special meaning. But what if the % character had a special meaning (let's imagine it points to the parent of the parent if one exists, or else to root)? Then path %/xtc should be uri %/xtc, BUT this is not possible because % has a special meaning in URI as the escape character.

All the other excluded characters MUST be encoded because of the URI spec. The reasons being eg. that a uri could be printed on paper, and newline characters would be hard to read if they were not escaped.

So let's recap the excluded character list:

ctrl-chars | space | < | > | # | % | "

None of these have any special meaning in any filesystem. Thus we are saved.

The rest of the encodings are because of the schema specific rules and serve the purpose of escaping the schema specific meaning of the character.

Therefore the uri corresponding to the path @foo/%bar/+xtc should be @foo/%25bar/+xtc

Do these thoughts clarify? :-)

- rami

Hello!

Sounds like a long night today :-)

Hard work - it might take some time until I can commit the new naming stuff. The whole procedure of parsing a uri needs to be refactored; currently I am fighting against the layered stuff, e.g. tar:tar:file:/dir/first.tar!/second.tar!/entry

And I have already implemented some incompatibilities between the old and the new VFS naming:

Current: file = getManager().resolveFile("%2e"); resolves to the current directory
New: resolves to a file or directory NAMED .

Current: file = getManager().resolveFile("dir%2fchild"); resolves to a file child in directory dir
New: resolves to a file or directory named dir/child

Current: file = getManager().resolveFile("dir%5cchild"); resolves to a file child in directory dir
New: resolves to a file or directory named dir\child

I leave it up to the filesystem whether such a file or directory can be created.

The above examples are those from the unit test, so the old behaviour was wanted. But I think the new one is the right one. I think it is very unlikely that those constructs can be found in the wild, but if one used VFS that way it IS broken.

Any comments?

--- Mario
Re: [vfs] parsing uri
Mario Ivankovits wrote:

Current: file = getManager().resolveFile("%2e"); resolves to the current directory
New: resolves to a file or directory NAMED .

I don't think there is a filesystem where this is possible. I'd need to read the relevant W3C specs to be sure.

resolves to a file or directory named dir/child
resolves to a file or directory named dir\child
I leave it up to the filesystem if such a file or directory could be created.

Now these are more interesting. What a load of corner cases to test for!

Cheers, --binkley
Re: [vfs] parsing uri
I'm unsure that the URI specs intend to distinguish a string from its encoded form for the purposes of naming. I believe they are to be interpreted equivalently, and that the encoding exists only to permit uncorrupted transmission of forbidden characters.

You have found something interesting about encoded URIs if a difference exists, but yours is a lot of work and I'd double-check the assumption before proceeding further. Laziness is one of the three virtues. :-)

Cheers, --binkley
Re: [vfs] parsing uri
[EMAIL PROTECTED] wrote:

I'm unsure that the URI specs intend to distinguish a string from it's encoded form for the purposes of naming. I believe they are to be interpreted equivalently, and that the encoding exists only to permit uncorrupted transmission of forbidden characters.

Quote from RFC 2396, Uniform Resource Identifiers (URI): Generic Syntax

2.4.2. When to Escape and Unescape

A URI is always in an escaped form, since escaping or unescaping a completed URI might change its semantics. Normally, the only time escape encodings can safely be made is when the URI is being created from its component parts; each component may have its own set of characters that are reserved, so only the mechanism responsible for generating or interpreting that component can determine whether or not escaping a character will change its semantics. Likewise, a URI must be separated into its components before the escaped characters within those components can be safely decoded.

In some cases, data that could be represented by an unreserved character may appear escaped; for example, some of the unreserved "mark" characters are automatically escaped by some systems. If the given URI scheme defines a canonicalization algorithm, then unreserved characters may be unescaped according to that algorithm. For example, "%7e" is sometimes used instead of "~" in an http URL path, but the two are equivalent for an http URL.

Because the percent "%" character always has the reserved purpose of being the escape indicator, it must be escaped as "%25" in order to be used as data within a URI. Implementers should be careful not to escape or unescape the same string more than once, since unescaping an already unescaped string might lead to misinterpreting a percent data character as another escaped character, or vice versa in the case of escaping an already escaped string.

Important passage: each component may have its own set of characters that are reserved, so only the mechanism responsible for generating or interpreting that component can determine whether or not escaping a character will change its semantics.

At this point the RFC indirectly says that only % MUST always be encoded. But later it excludes other characters from ever existing in a URI for reasons of readability when the uri is eg. printed.

Think if you see somewhere URI: foo Is this URI foo or foo ? The same applies to URI: foo

That being said, I think that the absolute minimum is only %. But what the practical minimum is, is left in the air.

- rami

You have found something interesting to encoded URIs if a difference exists, but yours is a lot of work and I'd double-check the assumption before proceeding further. Laziness is one of the three virtues. :-)

Cheers, --binkley
Re: [vfs] parsing uri
Hello!

Anyway there is nothing to it so mario can probably make the fix right away. But the list of special characters needs still to be addressed. I think at least {'#', ' '}

I tried to find a way without decoding/encoding the url again. This turns out to work - could you please check it out.

btw. you caught a vespiary - using the '%' as a valid filename character turns out to be a problem in all archive-like filesystem providers (tar, zip, ..). Also, FileObject.getName().getURI() didn't correctly encode the path, i.e. one can't use its result to resolve a file again. I have to investigate this in more detail.

If I could I would assign you 12 points (the maximum) for catching this problem ;-)

--- Mario
Re: [vfs] parsing uri
btw. you catched a vespiary - usign the '%' as valid filename character turns out to be a problem through all archive like filesystem providers (tar, zip, ..). Also the FileObject.getName().getURI() didnt correctly encode the path i.e. one cant use its result to resolve a file again. I have to investigate this in more detail.

Already one year ago, when I was fiddling with classloaders, I found out how URL encoding (eg. in URLClassLoader) is completely flawed in Java. There are bug reports about this on java.sun.com, but the official answer seems to be that the way URL encoding is done now is too central to be changed, since big software has been written that assumes that URLs are encoded wrongly.

Therefore encode the file path to URL in vfs. It's not hard and it is the only way.

This theme brings up an interesting topic about the set of characters that are allowed to appear in a file name. As we know, the set of prohibited characters on different operating systems is - well, different. Since vfs is a cross-platform file system it should define its own set of prohibited characters. Maybe the union of prohibited characters on win/unix/mac. But that is impossible, since it will find files on unix that do have characters that are prohibited - say, on windows. Maybe FileSystemProvider, when instantiated, has to be able to tell which characters are allowed.

Of course vfs can be completely neutral about the issue and let the os / network protocol tell that something is wrong when an illegal filename was used. Nevertheless it would be excellent to document these kinds of issues as part of the vfs project. Then it would also be easier to say for sure which characters need to be encoded for URL. Also I think decodeURI and encodeURI should be symmetrical.

Maybe we don't need to know anything about filenames. We only need to know about URI. What is the set of characters that need to be encoded in URI? Well, let's see RFC 2396:

reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","

These are reserved characters because they have a special meaning in URI. They work as delimiters between different components, and the schema finally decides if they are delimiters or not (I think). They should be escaped, but note:

2.4.2. When to Escape and Unescape

A URI is always in an escaped form, since escaping or unescaping a completed URI might change its semantics. Normally, the only time escape encodings can safely be made is when the URI is being created from its component parts; each component may have its own set of characters that are reserved, so only the mechanism responsible for generating or interpreting that component can determine whether or not escaping a character will change its semantics. Likewise, a URI must be separated into its components before the escaped characters within those components can be safely decoded.

So when I have a path like /foo/%bar I should encode % but not /

Looking at the reserved character set in the case of the file: schema, I think none of them should be escaped.

2.4.3. Excluded US-ASCII Characters

control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
space = <US-ASCII coded character 20 hexadecimal>
delims = "<" | ">" | "#" | "%" | <">

The angle-bracket "<" and ">" and double-quote (") characters are excluded because they are often used as the delimiters around URI in text documents and protocol fields. The character "#" is excluded because it is used to delimit a URI from a fragment identifier in URI references (Section 4). The percent character "%" is excluded because it is used for the encoding of escaped characters.

I think these should always be encoded in URI.

There also exist unwise characters:

Other characters are excluded because gateways and other transport agents are known to sometimes modify such characters, or they are used as delimiters.

unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

But I don't think these should be encoded.

So all in all, for the file URI schema I think the characters to encode are:

control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
space = <US-ASCII coded character 20 hexadecimal>
delims = "<" | ">" | "#" | "%" | <">

On my Linux I can create directory /#%/ I just need write mkdir \\#%\

Also it has happened to me that a program has created a file name that contains newlines and some other non-printable characters. Copying this folder to some other os would (probably) result in an exception.

If I could I would assign you 12 points (the maximium) for catching this problem ;-)

Why can't you ?-)

- rami
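The character set proposed in the message above could be turned into an encoder roughly like this. It is a sketch only: the class and method names are hypothetical, and escaping non-ASCII bytes as UTF-8 is an added assumption that the message does not discuss.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical encoder for file: URI paths. It escapes only the RFC 2396
// "excluded" set named in the message: control characters, space and the
// delims < > # % ". Reserved and unwise characters are left untouched.
class FilePathEncoderSketch {
    private static final String DELIMS = "<>#%\"";

    static String encodePath(String path) {
        StringBuilder sb = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean excluded = c <= 0x20      // control characters and space
                    || c == 0x7F              // DEL
                    || c >= 0x80              // assumption: escape non-ASCII bytes too
                    || DELIMS.indexOf(c) >= 0;
            if (excluded) {
                sb.append('%').append(String.format("%02X", c));
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodePath("@foo/%bar/+xtc"));
        System.out.println(encodePath("/a b#c"));
    }
}
```

With this set, the path separator / passes through unchanged, which matches the remark that for /foo/%bar the % is encoded but the / is not.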
Re: [vfs] parsing uri
Hello Rami!

Thanks for collecting all this information, this is very useful and I will try my best to implement it in VFS.

Therefore encode the file path to URL in vfs. It's not hard and it is the only way.

Currently VFS tries to decode as soon as possible - and yes, I think that's wrong. I think (no decision has been made yet) I will change this to decode only if it is needed, e.g. when the real physical access is made - and then only if the underlying library requires it. E.g. ftp or http might work even better with the encoded uris.

Sounds like a long night today :-)

Why can't you ?-)

Rami 12 points.

--- Mario
Re: [vfs] parsing uri
Rami 12 points.

I'm honored.

- Rami
Re: [vfs] parsing uri
Rami Ojares wrote:

file.toURI().toString() is not the way to go. The reason is simple. It does not work.

What does "it does not work" mean? That is, what is an example failure case?

Cheers, --binkley
Re: [vfs] parsing uri
What does it does not work mean? That is, what is an example failure case?

Good question. Because it does work :) All I can say in my defense is that my library management is a mess!

Therefore I decided to make the simplest possible class for testing how file.toURI().toString() behaves. It encodes all excluded characters (space, %, #, ...). From the reserved characters it encodes (on my linux) only ? (question mark). Then from the unwise characters ({}|\\^[]`) it encodes all.

But maybe it is not necessary to know how it encodes, because the inverse operation can be done too.

new File(new URI( (new File($%[EMAIL PROTECTED]|\\^[]`$)).toURI().toString() )).getPath()

Returns $%[EMAIL PROTECTED]|\\^[]`$

Which is correct.

Once again, all this confusion was produced because my library management is in a state of flux and I have had bad experiences with this issue in the past. Also, I remembered the bug about this encoding issue, but this really seems to work. My java -version returns 1.4.2_06-b03. This might not work on 1.3, but I am not sure.

Like I said before, the URI encoding is schema specific, so it should be done separately for different providers. And it seems that for local files the URI and File classes could work as the codec.

Thanks binkley!

- rami
Re: [vfs] parsing uri
Slight correction

new File(new URI( (new File($%[EMAIL PROTECTED]|\\^[]`$)).toURI().toString() )).getPath()

Returns $%[EMAIL PROTECTED]|\\^[]`$

Return value is $%[EMAIL PROTECTED]|\^[]`$ (Only one backslash)

- rami
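The round trip described in these two messages can be reproduced with a few lines. This is a sketch with a made-up class name and a made-up file name (the name used in the original message did not survive the archive); it uses URI.create instead of new URI only to avoid the checked exception.

```java
import java.io.File;
import java.net.URI;

// Sketch of the File -> URI string -> File round trip from the thread.
class FileUriRoundTripSketch {
    static String roundTrip(String path) {
        // File.toURI() makes the path absolute and escapes illegal characters.
        String uri = new File(path).toURI().toString();
        // new File(URI) decodes the escapes again.
        return new File(URI.create(uri)).getPath();
    }

    public static void main(String[] args) {
        String name = "a b#c%d{e}^f"; // hypothetical file name with awkward characters
        System.out.println(new File(name).toURI());
        System.out.println(roundTrip(name).equals(new File(name).getAbsolutePath()));
    }
}
```

Note that the result is compared against the absolute path, because toURI() resolves relative names against the current directory first.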
Re: [vfs] parsing uri
Rami Ojares wrote:

new File(new URI( (new File($%[EMAIL PROTECTED]|\\^[]`$)).toURI().toString() )).getPath()

Returns $%[EMAIL PROTECTED]|\\^[]`$

Which is correct.

Yikes! I want to hire you to do all my software testing. That is diabolical.

Cheers, --binkley
Re: [vfs] parsing uri
Hello!

Sounds like a long night today :-)

Hard work - it might take some time until I can commit the new naming stuff. The whole procedure of parsing a uri needs to be refactored; currently I am fighting against the layered stuff, e.g. tar:tar:file:/dir/first.tar!/second.tar!/entry

And I have already implemented some incompatibilities between the old and the new VFS naming:

Current: file = getManager().resolveFile("%2e"); resolves to the current directory
New: resolves to a file or directory NAMED .

Current: file = getManager().resolveFile("dir%2fchild"); resolves to a file child in directory dir
New: resolves to a file or directory named dir/child

Current: file = getManager().resolveFile("dir%5cchild"); resolves to a file child in directory dir
New: resolves to a file or directory named dir\child

I leave it up to the filesystem whether such a file or directory can be created.

The above examples are those from the unit test, so the old behaviour was wanted. But I think the new one is the right one. I think it is very unlikely that those constructs can be found in the wild, but if one used VFS that way it IS broken.

Any comments?

--- Mario
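The "new" behaviour listed above follows from splitting the path into components before decoding, as RFC 2396 section 2.4.2 recommends. A minimal sketch of that order of operations, with hypothetical names and an assumed UTF-8 byte encoding (this is not the VFS parser):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Splits on the literal '/' first and decodes each component afterwards,
// so an escaped separator or dot never regains its special meaning.
class SplitThenDecodeSketch {
    static List<String> components(String encodedPath) {
        List<String> names = new ArrayList<>();
        for (String raw : encodedPath.split("/")) {
            if (!raw.isEmpty()) {
                names.add(decode(raw));
            }
        }
        return names;
    }

    private static String decode(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) != '%') {
                sb.append(s.charAt(i++));
                continue;
            }
            ByteArrayOutputStream run = new ByteArrayOutputStream();
            while (i < s.length() && s.charAt(i) == '%') {
                if (i + 3 > s.length()) {
                    throw new IllegalArgumentException("Truncated escape in " + s);
                }
                run.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 3;
            }
            sb.append(new String(run.toByteArray(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(components("dir/child"));   // two components
        System.out.println(components("dir%2fchild")); // one component containing a slash
        System.out.println(components("%2e"));         // one component named "."
    }
}
```

Decoding first and splitting afterwards would give the "current" behaviour instead, which is why the two orders are incompatible.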
[vfs] parsing uri
In DefaultLocalFileProvider there is a method findLocalFile. Its idea is to convert a File object into a FileObject object.

    public FileObject findLocalFile(final File file) throws FileSystemException {
        // TODO - tidy this up, should build file object straight from the file
        return findFile(null, "file:" + file.getAbsolutePath(), null);
    }

It calls findFile, which is in AbstractOriginatingFileProvider. The signature of the method is

    findFile(final FileObject baseFile, final String uri, final FileSystemOptions fileSystemOptions) throws FileSystemException

Notice the name of the second argument: 'uri'

Here's the problem: let's say I have a file whose absolute path is /foo/%bar

Its uri should be file:/foo/%25bar but now it is just file:/foo/%bar, which is not a correct uri. This leads to an exception later, when the system tries to decode the uri and complains: Invalid URI escape sequence %ba

So the method should be

    public FileObject findLocalFile(final File file) throws FileSystemException {
        // TODO - tidy this up, should build file object straight from the file
        return findFile(null, "file:" + ENCODE_URI_SOMEHOW(file.getAbsolutePath()), null);
    }

The same remark applies to

    public FileObject findLocalFile(final String name) throws FileSystemException {
        // TODO - tidy this up, no need to turn the name into an absolute URI,
        // and then straight back again
        return findFile(null, "file:" + name, null);
    }

- Rami Ojares

Ps. I hope this remark is valid since I haven't updated the sources for a long time.
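One way to fill in the ENCODE_URI_SOMEHOW placeholder with the JDK alone is the multi-argument java.net.URI constructor, which always quotes the % character in its path argument. The sketch below is not what VFS does; the class name is made up, and it assumes a Unix-style absolute path that starts with a slash.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Builds a file: URI from a raw absolute path, letting java.net.URI
// do the escaping instead of concatenating "file:" + path.
class LocalFileUriSketch {
    static String toFileUri(String absolutePath) {
        try {
            // scheme, host, path, fragment; the path is given unescaped.
            return new URI("file", null, absolutePath, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String uri = toFileUri("/foo/%bar");
        System.out.println(uri);
        // Decoding the path component gives back the original name.
        System.out.println(URI.create(uri).getPath());
    }
}
```

With plain concatenation the same path becomes file:/foo/%bar, where a URI parser reads %ba as an escape followed by r, so the original name is lost.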
Re: [vfs] parsing uri
Rami Ojares wrote:

    public FileObject findLocalFile(final File file) throws FileSystemException {
        // TODO - tidy this up, should build file object straight from the file
        return findFile(null, "file:" + ENCODE_URI_SOMEHOW(file.getAbsolutePath()), null);
    }

I would do even less work than that (being as lazy as I am):

    public FileObject findLocalFile(final File file) throws FileSystemException {
        return findFile(null, file.toURI().toString(), null);
    }

java.io.File is handy that way. :-)

Cheers, --binkley
Re: [vfs] parsing uri
Here is my proposal, using the idea from binkley:

    /**
     * Finds a local file, from its local name.
     */
    public FileObject findLocalFile(final String name) throws FileSystemException {
        // TODO - tidy this up, no need to turn the name into an absolute URI,
        // and then straight back again
        return findFile(null, (new File(name)).toURI().toString(), null);
    }

    /**
     * Finds a local file.
     */
    public FileObject findLocalFile(final File file) throws FileSystemException {
        // TODO - tidy this up, should build file object straight from the file
        return findFile(null, file.getAbsoluteFile().toURI().toString(), null);
    }

I tried it and it worked.

- rami
Re: [vfs] parsing uri
file.toURI().toString() is not the way to go. The reason is simple. It does not work. I don't know why.

So I think we should use ParseUtil.encode(..), which does work, and decide which characters to include as special ones. I did this and it works (last time I said this I was wrong because a jar did not get updated ..). But now I'm home, so I will submit it tomorrow.

Anyway, there is nothing to it, so Mario can probably make the fix right away. But the list of special characters still needs to be addressed. I think at least {'#', ' '}

- rami