Hi Larry,

The same file does work directly from WebHDFS (see below). Looking more closely at the logs I sent previously, it looks like Knox (or something in the chain I'm unaware of) is decoding the %20-encoded spaces, then re-encoding them as +, i.e.
17/05/24 15:51:05 ||88ce58ea-d7c5-46cd-a87a-c2f96b38130e|audit|WEBHDFS||||access|uri|/gateway/<cluster>/webhdfs/v1/docs/filename with spaces.pdf?op=OPEN|unavailable|Request method: GET
..
17/05/24 15:51:05 ||88ce58ea-d7c5-46cd-a87a-c2f96b38130e|audit|WEBHDFS|<username>|||dispatch|uri|http://<namenode>.<cluster>:50070/webhdfs/v1/docs/filename+with+spaces.pdf?op=OPEN&doAs=<username>|success|Response status: 404

With thanks, Alex

Direct WebHDFS request (hostnames redacted)

# curl -si -u: "http://<namenode>:50070/webhdfs/v1/docs/filename%20with%20spaces.pdf?op=OPEN" --negotiate -L | head -n40
HTTP/1.1 401 Authentication required
Cache-Control: must-revalidate,no-cache,no-store
Date: Wed, 24 May 2017 19:01:41 GMT
Pragma: no-cache
Date: Wed, 24 May 2017 19:01:41 GMT
Pragma: no-cache
X-FRAME-OPTIONS: SAMEORIGIN
WWW-Authenticate: Negotiate
Set-Cookie: hadoop.auth=; Path=/; HttpOnly
Content-Type: text/html; charset=iso-8859-1
Content-Length: 1533
Server: Jetty(6.1.26.hwx)

HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Wed, 24 May 2017 19:01:42 GMT
Date: Wed, 24 May 2017 19:01:42 GMT
Pragma: no-cache
Expires: Wed, 24 May 2017 19:01:42 GMT
Date: Wed, 24 May 2017 19:01:42 GMT
Pragma: no-cache
X-FRAME-OPTIONS: SAMEORIGIN
WWW-Authenticate: Negotiate YGkGCSqGSIb3EgECAgIAb1owWKADAgEFoQMCAQ+iTDBKoAMCARKiQwRBQM/auuLcl2xey6wMp6EjCPJFSqK3snscxMzW7RvfgxOo7182GzD5N9jf+OWGr+tjpvlRX0c/7iTBfYKSetf4ekU=
Set-Cookie: hadoop.auth="u=admin&p=admin@CYSAFA&t=kerberos&e=1495688502002&s=b7p35TgaxItAUTkKJuSXuynoq9E="; Path=/; HttpOnly
Content-Type: application/octet-stream
Location: http://<datanode3>:1022/webhdfs/v1/docs/filename%20with%20spaces.pdf?op=OPEN&delegation=HgAFYWRtaW4FYWRtaW4AigFcO9YJ8ooBXF_ijfJFAxSBYFUnsXY3up11ZNIi4hIi__5RvRJXRUJIREZTIGRlbGVnYXRpb24PMTcyLjE4LjAuOTo4MDIw&namenoderpcaddress=<namenode>:8020&offset=0
Content-Length: 0
Server: Jetty(6.1.26.hwx)

HTTP/1.1 200 OK
Access-Control-Allow-Methods: GET
Access-Control-Allow-Origin: *
Content-Type: application/octet-stream
Connection: close
Content-Length: 13365618

%����1.6
<</Filter/FlateDecode/First 157/Length 5350/N 16/Type/ObjStm>>stream
...

________________________________
From: larry mccay [[email protected]]
Sent: 24 May 2017 18:05
To: [email protected]
Subject: Re: Encoding/escaping whitespace in WebHDFS requests

Hi Alex -

I notice from the audit log that the 404 is actually coming from WebHDFS, not from Knox. Can you confirm that direct access to WebHDFS, without going through Knox, works with the same URL?

thanks,

--larry

On Wed, May 24, 2017 at 12:32 PM, Willmer, Alex (UK Defence) <[email protected]> wrote:

How should I encode space characters in the URL when I make a request to WebHDFS through Knox? Or should I be enabling/configuring something in Knox to handle them?

I'm making the following request (redacted values in <>) to WebHDFS, through Knox

curl "https://<hostname>:18443/gateway/<cluster>/webhdfs/v1/docs/filename%20with%20spaces.pdf?op=OPEN" \
    -u <username>:<password> -k -s

However, Knox is returning HTTP 404 with the following body (whitespace/formatting added by me)

{"exception":"FileNotFoundException",
 "javaClassName":"java.io.FileNotFoundException",
 "message":"File /docs/filename+with+spaces.pdf not found."}}

I've tried encoding the spaces as + (same result), and not encoding them (HTTP 400 Unknown Version). If I request a file whose path does not contain spaces, then it works.

Any ideas?

With thanks, Alex

PS In anticipation of queries: I'm using Knox 0.11.0 with OpenJDK 1.8.0_131 on CentOS 7, with an HDP 2.6 (Hadoop 2.7.x) cluster. Kerberos is enabled in the cluster.

The (redacted) response headers for the %20 encoded request

< HTTP/1.1 404 Not Found
< Date: Wed, 24 May 2017 15:34:26 GMT
< Set-Cookie: JSESSIONID=15acwo8gt9qr8gdbvk48y9yjh;Path=/gateway/<cluster>;Secure;HttpOnly
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Set-Cookie: rememberMe=deleteMe; Path=/gateway/cysafa; Max-Age=0; Expires=Tue, 23-May-2017 15:34:26 GMT
< Cache-Control: no-cache
< Expires: Wed, 24 May 2017 15:34:26 GMT
< Date: Wed, 24 May 2017 15:34:26 GMT
< Pragma: no-cache
< Expires: Wed, 24 May 2017 15:34:26 GMT
< Date: Wed, 24 May 2017 15:34:26 GMT
< Pragma: no-cache
< X-FRAME-OPTIONS: SAMEORIGIN
< Content-Type: application/json; charset=UTF-8
< Server: Jetty(6.1.26.hwx)
< Content-Length: 252

The (redacted) Knox logs for the %20 encoded request

==> /var/log/hadoop/knox/gateway-audit.log <==
17/05/24 15:51:05 ||88ce58ea-d7c5-46cd-a87a-c2f96b38130e|audit|WEBHDFS||||access|uri|/gateway/<cluster>/webhdfs/v1/docs/filename with spaces.pdf?op=OPEN|unavailable|Request method: GET
17/05/24 15:51:05 ||88ce58ea-d7c5-46cd-a87a-c2f96b38130e|audit|WEBHDFS|<username>|||authentication|uri|/gateway/<cluster>/webhdfs/v1/docs/filename with spaces.pdf?op=OPEN|success|
17/05/24 15:51:05 ||88ce58ea-d7c5-46cd-a87a-c2f96b38130e|audit|WEBHDFS|<username>|||authentication|uri|/gateway/<cluster>/webhdfs/v1/docs/filename with spaces.pdf?op=OPEN|success|Groups: []
17/05/24 15:51:05 ||88ce58ea-d7c5-46cd-a87a-c2f96b38130e|audit|WEBHDFS|<username>|||authorization|uri|/gateway/<cluster>/webhdfs/v1/docs/filename with spaces.pdf?op=OPEN|success|
17/05/24 15:51:05 ||88ce58ea-d7c5-46cd-a87a-c2f96b38130e|audit|WEBHDFS|<username>|||dispatch|uri|http://<namenode>.<cluster>:50070/webhdfs/v1/docs/filename+with+spaces.pdf?op=OPEN&doAs=<username>|unavailable|Request method: GET
17/05/24 15:51:05 ||88ce58ea-d7c5-46cd-a87a-c2f96b38130e|audit|WEBHDFS|<username>|||dispatch|uri|http://<namenode>.<cluster>:50070/webhdfs/v1/docs/filename+with+spaces.pdf?op=OPEN&doAs=<username>|success|Response status: 404
17/05/24 15:51:05 ||88ce58ea-d7c5-46cd-a87a-c2f96b38130e|audit|WEBHDFS|<username>|||access|uri|/gateway/<cluster>/webhdfs/v1/docs/filename with spaces.pdf?op=OPEN|success|Response status: 404

==> /var/log/hadoop/knox/gateway.log <==
2017-05-24 15:51:05,254 INFO hadoop.gateway (KnoxLdapRealm.java:getUserDn(691)) - Computed userDn: uid=<username>,cn=users,cn=accounts,dc=<cluster> using dnTemplate for principal: <username>
2017-05-24 15:51:05,259 INFO hadoop.gateway (AclsAuthorizationFilter.java:doFilter(85)) - Access Granted: true

The (redacted) topology

<topology>
  <gateway>
    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <param>
        <name>sessionTimeout</name>
        <value>30</value>
      </param>
      <param>
        <name>main.ldapRealm</name>
        <value>org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm</value>
      </param>
      <param>
        <name>main.ldapContextFactory</name>
        <value>org.apache.hadoop.gateway.shirorealm.KnoxLdapContextFactory</value>
      </param>
      <param>
        <name>main.ldapRealm.contextFactory</name>
        <value>$ldapContextFactory</value>
      </param>
      <param>
        <name>main.ldapRealm.userDnTemplate</name>
        <value>uid={0},cn=users,cn=accounts,dc=<cluster></value>
      </param>
      <param>
        <name>main.ldapRealm.contextFactory.url</name>
        <value>ldap://<freeipa_node>:389</value>
      </param>
      <param>
        <name>main.ldapRealm.contextFactory.authenticationMechanism</name>
        <value>simple</value>
      </param>
      <param>
        <name>urls./**</name>
        <value>authcBasic</value>
      </param>
    </provider>
    <provider>
      <role>authorization</role>
      <name>AclsAuthz</name>
      <enabled>true</enabled>
      <param>
        <name>knox.acl</name>
        <value>admin;*;*</value>
      </param>
    </provider>
    <provider>
      <role>identity-assertion</role>
      <name>Default</name>
      <enabled>true</enabled>
    </provider>
    <provider>
      <role>hostmap</role>
      <name>static</name>
      <enabled>false</enabled>
      <param><name>localhost</name><value>sandbox,sandbox.hortonworks.com</value></param>
    </provider>
  </gateway>
  <service>
    <role>WEBHDFS</role>
    <url>http://<namenode>:50070/webhdfs</url>
  </service>
  <service>
    <role>SOLRAPI</role>
    <url>http://<solrnode>:6083/solr</url>
  </service>
</topology>
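
For anyone following the thread, the encoding mismatch seen in the audit log can be reproduced in isolation. The following is only a minimal Python sketch of the two encoding styles, not Knox's or WebHDFS's actual code; the helper name decode_path_segment is hypothetical, and the assumption that a URL path is decoded with percent-escapes only (so + stays a literal plus sign) is inferred from the "File /docs/filename+with+spaces.pdf not found" message above.

```python
from urllib.parse import quote, quote_plus, unquote

name = "filename with spaces.pdf"

# Path-style percent-encoding: each space becomes %20.
path_encoded = quote(name)

# Form-style (query string) encoding: each space becomes "+".
form_encoded = quote_plus(name)


def decode_path_segment(segment: str) -> str:
    """Hypothetical helper: decode a URL path segment.

    Assumption: only percent-escapes are decoded in a path,
    so "+" is kept as a literal plus sign.
    """
    return unquote(segment)


print(path_encoded)                       # filename%20with%20spaces.pdf
print(form_encoded)                       # filename+with+spaces.pdf
print(decode_path_segment(path_encoded))  # filename with spaces.pdf
print(decode_path_segment(form_encoded))  # filename+with+spaces.pdf
```

Under that assumption, a path that has been re-encoded form-style resolves to a different file name than the one requested, which would be consistent with the 404 on the dispatch line.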
