Re: Various musings about the request URL / URI / whatever
Hmmm, go away for two days and a mail storm erupts. :-(

I may never be able to catch up and digest this mail thread, but I'll try to add a few comments of my own.

On 01/12/2005, at 8:41 AM, Nicolas Lehuen wrote:
> c) We don't have a req.base_uri (to follow Jim's naming suggestion) or
> req.script_name that would be equivalent to
> req.subprocess_env.get('SCRIPT_NAME'), but we have a req.path_info...
> Why is this missing?

Note that SCRIPT_NAME as obtainable from Apache appears to be broken. See:

    http://issues.apache.org/jira/browse/MODPYTHON-68

Also note that with respect to req.uri and req.path_info, be aware that req.path_info is normalised and req.uri is not. I.e., the latter may contain duplicated instances of '/', whereas that cannot occur in req.path_info. This makes it error-prone to take the len() of req.path_info to work out how much to drop off req.uri to obtain the base URI or script name. You need to normalise req.uri first, but then the result will be missing the duplicate instances of '/' unless you use some more elaborate algorithm to work it out.

Graham
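The pitfall Graham describes can be sketched in a few lines. This is only an illustration working on plain strings that stand in for req.uri and req.path_info; the function name split_base_uri is hypothetical and not part of mod_python.

```python
import re


def split_base_uri(uri, path_info):
    """Return the part of `uri` that precedes `path_info`.

    `uri` and `path_info` are plain strings standing in for req.uri and
    req.path_info; this helper is hypothetical, not a mod_python API.
    """
    # Collapse runs of '/' so the URI has the same form as path_info.
    normalised = re.sub(r"/+", "/", uri)
    if path_info and normalised.endswith(path_info):
        return normalised[: len(normalised) - len(path_info)]
    # Nothing to strip: return the normalised URI unchanged.
    return normalised
```

As noted above, the value returned comes from the normalised URI, so any duplicated '/' in the original base part is lost. Slicing the raw URI by len(path_info) instead would cut at the wrong position whenever the path_info portion of the raw URI contains duplicates.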
Re: Various musings about the request URL / URI / whatever
Gregory (Grisha) Trubetskoy writes:
> I think a properly designed site should insist on its host name, i.e. I
> see you think I'm gobbledygook.bleh, but I'm going to redirect you to
> http://www.modpython.org/ because that is my true name. This is very
> common with sites that respond to both www.site.com and site.com, but
> insist on only one of those by redirecting.

As I said in my previous email to the list, I *think* if you use virtual hosts and your real sites are NOT the first real host, then you are forcing clients to speak HTTP/1.1, thus forcing the Host header to be sent. If you then put in your first, default virtualhost:

    RedirectPermanent / http://realserver/

then you protect yourself from gobbledygook.bleh because that will be sent to the default virtualhost, which will redirect.

Right? If so, it's a bit convoluted and not accessible to the novice. Perhaps a Tips & Tricks chapter for the manual?

Daniel Popowich
---
http://home.comcast.net/~d.popowich/mpservlets/
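The "insist on one host name" rule can also be expressed as a small decision function. The sketch below is an assumption-laden illustration, not Apache or mod_python code: the name canonical_redirect and the default host www.example.org are made up, and it only handles plain host[:port] values in the Host header.

```python
def canonical_redirect(host_header, uri, canonical_host="www.example.org"):
    """Return the URL to redirect to, or None if the host is already canonical.

    Hypothetical helper: `host_header` stands in for the value of the
    Host header and `uri` for the requested path.
    """
    # Drop any ":port" suffix and compare case-insensitively.
    requested = (host_header or "").split(":")[0].lower()
    if requested == canonical_host.lower():
        return None
    # Any other name (or a missing Host header) is sent to the true name.
    return "http://%s%s" % (canonical_host, uri)
```

This mirrors what the default virtual host with RedirectPermanent achieves: every request that does not name the real site is bounced to it.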
Re: Various musings about the request URL / URI / whatever
Jim Gallacher wrote:
> Gregory (Grisha) Trubetskoy wrote:
>> I don't know what the specific issue is with parsed_uri, if this is a
>> mod_python bug it should just be fixed BUT if this is an issue with
>> httpd, I don't think we should cover the problem up by having
>> mod_python fix it. Since we are part of the HTTP Server project, we
>> should just fix it in httpd. Either way, it should be fixed.

I think maybe it's not really broken.

> In case anyone is not familiar with the issue, a request for
> http://example.com/tests/mptest?view=form currently gives a tuple that
> looks something like this:

That's not true. That's what you might see in your client browser, but (usually) it only asks for /tests/mptest?view=form, regardless of the name it used to find the server. It may use the Host: header to negotiate the right virtual host, but the Host: header is not part of the string that parsed_uri is actually parsing.

>     (None, None, None, None, None, None, '/tests/mptest', 'view=form', None)
>
> which is not what we expect. This is what the mod_python docs have to say:
>
>     parsed_uri
>         Tuple. The URI broken down into pieces. (scheme, hostinfo, user,
>         password, hostname, port, path, query, fragment). The apache
>         module defines a set of URI_* constants that should be used to
>         access elements of this tuple. Example:
>
>             fname = req.parsed_uri[apache.URI_PATH]
>
>         (Read-Only)

This is all correct. I think the problem is that developers are hoping to use parsed_uri in a use case for which it is inappropriate. Those values are populated *if present* in the supplied request URI, but the only *minimal* requirement would be a / for the path. If you want to know what resource the client *really* requested (and inquiring minds do), you *must not* attempt to rewrite or repopulate this.
Re: Various musings about the request URL / URI / whatever
Nicolas Lehuen wrote:
> Hi,
>
> Is it me or is it quite tiresome to get the URL that called us, or get
> the complete URL that would call another function?
>
> When performing an external redirect (using mod_python.util.redirect for
> example), we MUST (as per RFC) provide a full URL, not a relative one.
> Instead of util.redirect(req,'/foo/bar.py'), we should write
> util.redirect(req,'https://whatever:8443/foo/bar.py').
>
> The problem is, writing this is always tiresome, as it means building a
> string like this:
>
>     def current_url(req):
>         req.add_common_vars()
>         current_url = []
>
>         # protocol
>         if req.subprocess_env.get('HTTPS') == 'on':
>             current_url.append('https')
>             default_port = 443
>         else:
>             current_url.append('http')
>             default_port = 80
>         current_url.append('://')
>
>         # host
>         current_url.append(req.hostname)
>
>         # port
>         port = req.connection.local_addr[1]
>         if port != default_port:
>             current_url.append(':')
>             current_url.append(str(port))
>
>         # URI
>         current_url.append(req.uri)
>
>         return ''.join(current_url)
>
> So I have two questions:
>
> First question, is there a simpler way to do this? Ironically, when
> using mod_rewrite, you get an environment variable named SCRIPT_URI
> which is precisely what I need (SCRIPT_URL, also added by mod_rewrite,
> is equivalent to req.uri... Don't ask me why). But relying on it isn't
> safe since mod_rewrite isn't always used.

I guess you could just assemble the parts from the req.parsed_uri tuple, except that apache doesn't actually fill in parsed_uri. :(

> Second question, if there isn't any simpler way to do this, should we
> add it to mod_python? Either as a function like above in
> mod_python.util, or as a member of the request object (named something
> like url to match the other member named uri, but that's just teasing).

I'm not against it, but for my purposes I think it would be more useful for parsed_uri to just work properly.

> And third question (in pure Spanish inquisition style): why is
> req.parsed_uri returning me a tuple full of Nones except for the uri and
> path_info part?

It comes from apache that way. I sure don't know why though. Maybe we're missing some magic apache call that would fill it in?

> Ah, fourth question: why are we (mod_python, mod_rewrite and the CGI
> environment variables) using the terms URI and URL to distinguish
> between a full, absolute resource path and a path relative to the
> server, whereas the definition of URLs and URIs is very vague and
> nothing close to this
> (http://www.w3.org/TR/uri-clarification/#contemporary)? Shouldn't we
> save our souls and a lot of saliva by choosing better names?

Strangely I was reading the cited page just last week, for perhaps the 100th time. I keep hoping I'll find enlightenment but alas no.

The danger of choosing new names (i.e. absolute_thingy or relative_thingy) is that we also add another layer of confusion. I'm not saying new names are a bad idea, just that we need to be very careful.

> OK, OK, fifth question: we made req.filename and other members
> writable. But when those attributes are changed, as Graham noted a
> while ago, the other dependent ones aren't, leading to inconsistencies
> (for example, if you change req.filename, req.canonical_filename isn't
> changed). Should we try to solve this and provide clear definition of
> the various parts of a request for mod_python 3.3?

That would make sense. I'm wondering how often people make use of req.canonical_filename (CFN*)? Also, just how would the CFN be adjusted if the user code changes req.filename, since the user is free to put any string in there they want? Maybe CFN just gets changed to the same string. Hopefully Graham will shed some light on this, since it was his use case.

Regards,
Jim

* Because I can't type canonical_filename the same way twice. Stupid fingers.
Re: Various musings about the request URL / URI / whatever
On Tue, 29 Nov 2005, Nicolas Lehuen wrote:
>     def current_url(req):
>         [snip]
>         # host
>         current_url.append(req.hostname)
>         [snip]

This part isn't going to work reliably if you are not using virtual hosts and just bind to an IP number.

Deciphering the URL is an impossible task. I used to have similar code in my applications, but lately I realized that it does not work reliably and it is much simpler to just treat it as a configuration item...

> First question, is there a simpler way to do this? Ironically, when
> using mod_rewrite, you get an environment variable named SCRIPT_URI
> which is precisely what I need (SCRIPT_URL, also added by mod_rewrite,
> is equivalent to req.uri... Don't ask me why). But relying on it isn't
> safe since mod_rewrite isn't always used.

Well, here's how it does it:

    /*
     *  create the SCRIPT_URI variable for the env
     */

    /* add the canonical URI of this URL */
    thisserver = ap_get_server_name(r);
    port = ap_get_server_port(r);
    if (ap_is_default_port(port, r)) {
        thisport = "";
    }
    else {
        apr_snprintf(buf, sizeof(buf), ":%u", port);
        thisport = buf;
    }
    thisurl = apr_table_get(r->subprocess_env, ENVVAR_SCRIPT_URL);

    /* set the variable */
    var = apr_pstrcat(r->pool, ap_http_method(r), "://", thisserver,
                      thisport, thisurl, NULL);
    apr_table_setn(r->subprocess_env, ENVVAR_SCRIPT_URI, var);

    /* if filename was not initially set,
     * we start with the requested URI
     */
    if (r->filename == NULL) {
        r->filename = apr_pstrdup(r->pool, r->uri);
        rewritelog(r, 2, "init rewrite engine with requested uri %s",
                   r->filename);
    }

> Second question, if there isn't any simpler way to do this, should we
> add it to mod_python? Either as a function like above in
> mod_python.util, or as a member of the request object (named something
> like url to match the other member named uri, but that's just teasing).

I don't know... Since the result is going to be half-baked... I think a more interesting and mod_python-ish thing to do would be to expose all the APIs used in the above code (e.g. ap_get_server_name, ap_is_default_port, ap_http_method) FIRST, then think about this.

> And third question (in pure Spanish inquisition style): why is
> req.parsed_uri returning me a tuple full of Nones except for the uri and
> path_info part?

This is an httpd question most likely...

> Ah, fourth question: why are we (mod_python, mod_rewrite and the CGI
> environment variables) using the terms URI and URL to distinguish
> between a full, absolute resource path and a path relative to the
> server, whereas the definition of URLs and URIs is very vague and
> nothing close to this
> (http://www.w3.org/TR/uri-clarification/#contemporary)? Shouldn't we
> save our souls and a lot of saliva by choosing better names?

No, we (mod_python) should just use the exact same name that httpd uses. If we come up with better names, then it's just going to make it even more confusing.

> OK, OK, fifth question: we made req.filename and other members
> writable. But when those attributes are changed, as Graham noted a
> while ago, the other dependent ones aren't, leading to inconsistencies
> (for example, if you change req.filename, req.canonical_filename isn't
> changed). Should we try to solve this

The solution is to make req.canonical_filename writable too and document that if you change req.filename, you may consider changing canonical_filename as well, and what will happen if you do not.

> and provide clear definition of the various parts of a request for
> mod_python 3.3?

Yes, that'd be good :)

Grisha
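For readers more at home in Python, the SCRIPT_URI assembly in the mod_rewrite excerpt can be paraphrased roughly as follows. This is a sketch on plain values, not a binding to the httpd functions named above: script_uri and DEFAULT_PORTS are hypothetical names, and the caller is assumed to already know the scheme, server name and port.

```python
# Default ports per scheme, standing in for ap_is_default_port().
DEFAULT_PORTS = {"http": 80, "https": 443}


def script_uri(scheme, server_name, port, uri):
    """Assemble scheme://host[:port]uri, omitting a default port.

    Hypothetical helper mirroring the shape of the C code; all four
    arguments are supplied by the caller.
    """
    if port == DEFAULT_PORTS.get(scheme):
        port_part = ""
    else:
        port_part = ":%d" % port
    return "%s://%s%s%s" % (scheme, server_name, port_part, uri)
```

The hard part, as the message points out, is not the string assembly but obtaining a trustworthy server name in the first place.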
Re: Various musings about the request URL / URI / whatever
Daniel J. Popowich wrote:
> Jim Gallacher writes:
>> Nicolas Lehuen wrote:
>>> Second question, if there isn't any simpler way to do this, should we
>>> add it to mod_python? Either as a function like above in
>>> mod_python.util, or as a member of the request object (named
>>> something like url to match the other member named uri, but that's
>>> just teasing).
>>
>> I'm not against it, but for my purposes I think it would be more
>> useful for parsed_uri to just work properly.
>
> Hear, hear!! I've wanted parsed_uri to work as expected for quite some
> time... I'm actually in a position where I could devote some time to
> tracking this down. If apache doesn't provide it, I think mod_python
> should at least fill it in, right?

+1

> Can someone nudge me in the right direction to start? Haven't poked
> around apache source and/or developer docs in years.

All I can say is grep is your friend. :)

I've found http://docx.webperf.org to be useful. Unfortunately you can only drill down into the header files, not the C files (unless I'm missing something). I might even be tempted to generate my own local copy of the apache docs using doxygen so that the C files get included. I've been playing with doxygen + mod_python and it's pretty cool.

Searching docx for parse_uri turns up ap_parse_uri.

    http://docx.webperf.org/group__APACHE__CORE__PROTO.html#ga44

Grab the src and put grep to work. I'll dig in and help any way I can.

Jim
Re: Various musings about the request URL / URI / whatever
Gregory (Grisha) Trubetskoy wrote:
> On Tue, 29 Nov 2005, Jim Gallacher wrote:
>> Daniel J. Popowich wrote:
>>> Hear, hear!! I've wanted parsed_uri to work as expected for quite
>>> some time... I'm actually in a position where I could devote some
>>> time to tracking this down. If apache doesn't provide it, I think
>>> mod_python should at least fill it in, right?
>>
>> +1
>
> I don't know what the specific issue is with parsed_uri, if this is a
> mod_python bug it should just be fixed BUT if this is an issue with
> httpd, I don't think we should cover the problem up by having mod_python
> fix it. Since we are part of the HTTP Server project, we should just
> fix it in httpd. Either way, it should be fixed.

In case anyone is not familiar with the issue, a request for http://example.com/tests/mptest?view=form currently gives a tuple that looks something like this:

    (None, None, None, None, None, None, '/tests/mptest', 'view=form', None)

which is not what we expect. This is what the mod_python docs have to say:

    parsed_uri
        Tuple. The URI broken down into pieces. (scheme, hostinfo, user,
        password, hostname, port, path, query, fragment). The apache
        module defines a set of URI_* constants that should be used to
        access elements of this tuple. Example:

            fname = req.parsed_uri[apache.URI_PATH]

        (Read-Only)

Jim
Re: Various musings about the request URL / URI / whatever
Gregory (Grisha) Trubetskoy writes:
> On Tue, 29 Nov 2005, Nicolas Lehuen wrote:
>> If I understand you correctly, req.hostname is not reliable in case
>> where virtual hosting is not used. What about server.server_hostname,
>> which seems to be used by the code from mod_rewrite you posted below?
>> Can it be used reliably?
>
> I don't think so. If I do this:
>
>     telnet some.host.com 80
>     GET /index.html
>
> How would apache know what the hostname is?

By the Host header.

I've been looking into this issue tonight and think I have the answers (but it's really late, so I'll save the gory details for tomorrow). In brief: typically, req.hostname is set from the Host header and, in fact, when I telnet to apache and issue a GET by hand, if I don't send the Host header, apache barfs with a 400, Bad Request, response. (apache 2.0.54, debian testing)

As for the larger issue at hand: the reason req.parsed_uri is not filled in is that browsers don't send the info in the GET, e.g., browsers send this:

    GET /index.py?a=b&c=d HTTP/1.1

not

    GET http://user:[EMAIL PROTECTED]:80/index.py?a=b&c=d#here HTTP/1.1

If they did, parsed_uri would be filled in. req.unparsed_uri is whatever word follows GET in the HTTP protocol exchange; req.parsed_uri is the parsing of that word. Given the full URI spec:

    SCHEME://[USER[:PASSWORD]@]HOSTNAME[:PORT]/PATH?QUERY#FRAGMENT

you can see where eight of the nine elements of the parsed_uri tuple come from; the ninth, hostinfo, is the combination of [USER[:PASSWORD]@]HOSTNAME[:PORT] (everything between // and /).

Unfortunately, browsers only send:

    /PATH?QUERY

and that's why path and query are all we ever see in unparsed_uri and parsed_uri.

Again, lots more to share... in the morrow...

Daniel Popowich
---
http://home.comcast.net/~d.popowich/mpservlets/
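The effect described in this message can be reproduced outside Apache with Python's standard library. The sketch below is not how httpd or mod_python builds parsed_uri; parse_request_target is a hypothetical helper that uses urllib.parse.urlsplit to produce a tuple in the same field order as the mod_python docs, and the host and credentials in the usage note are made-up values.

```python
from urllib.parse import urlsplit


def parse_request_target(target):
    """Split the word after GET into the nine parsed_uri-style fields.

    Field order: scheme, hostinfo, user, password, hostname, port,
    path, query, fragment. Missing parts are reported as None.
    """
    parts = urlsplit(target)
    return (
        parts.scheme or None,
        parts.netloc or None,   # hostinfo: everything between // and /
        parts.username,
        parts.password,
        parts.hostname,
        parts.port,
        parts.path or None,
        parts.query or None,
        parts.fragment or None,
    )
```

Feeding it the path-only form a browser sends, such as /tests/mptest?view=form, leaves every field except path and query as None, while an absolute form such as http://user:secret@example.com:80/index.py?a=b&c=d#here fills in all nine.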
Re: Various musings about the request URL / URI / whatever
Daniel J. Popowich wrote:
> By the Host header.
>
> I've been looking into this issue tonight and think I have the answers
> (but it's really late, so I'll save the gory details for tomorrow). In
> brief: typically, req.hostname is set from the Host header and, in
> fact, when I telnet to apache and issue a GET by hand, if I don't send
> the Host header, apache barfs with a 400, Bad Request, response.
> (apache 2.0.54, debian testing)

It will only do that if you claim to be an HTTP/1.1 client. If you send

    GET / HTTP/1.0

it will not complain about the Host header. Sending:

    GET / HTTP/1.1

will get you a 400 response, because you MUST supply it (says RFC 2068, and whatever superseded that one). There is more you must do to be able to call yourself HTTP/1.1, by the way, such as keep-alive connections and chunked encoding.

--
Mike Looijmans
Philips Natlab / Topic Automation
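The rule Mike describes can be modelled as a small check on the raw request text. This is a toy model of the Host requirement only, not Apache's actual request parser; the function name rejects_for_missing_host is hypothetical.

```python
def rejects_for_missing_host(raw_request):
    """Return True if the request claims HTTP/1.1 but has no Host header.

    Toy model: `raw_request` is the request text with CRLF line endings.
    """
    lines = raw_request.split("\r\n")
    request_line = lines[0].split()
    # Third token of the request line is the protocol version, if given.
    version = request_line[2] if len(request_line) >= 3 else None
    header_names = set()
    for line in lines[1:]:
        if not line:
            break  # blank line ends the header section
        name, sep, _value = line.partition(":")
        if sep:
            header_names.add(name.strip().lower())
    return version == "HTTP/1.1" and "host" not in header_names
```

Under this model an HTTP/1.0 request without a Host header passes, while the same request labelled HTTP/1.1 is the case that would draw the 400 response.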