I understand the reasoning behind the design decision, which requires making files available to a specific user. However, there are clients in certain industries who are willing to accept a performance hit (possibly caused by loading/caching metadata per user) as long as they can have user-specific permissions at all storage levels (HDFS, Accumulo and ....).
IMO, Impala should make this possible as a configuration option.

On Thu, Jan 3, 2019 at 10:22 AM Bharath Vissapragada <[email protected]> wrote:

> Agree with Tim's points. My opinion is also the same, given the current
> Catalog architecture.
>
> On Thu, Jan 3, 2019 at 10:17 AM Tim Armstrong <[email protected]>
> wrote:
>
>> Right, we could use requesting_user for logging, statistics, etc, but it
>> would be problematic to impersonate that user when loading metadata.
>>
>> It's of course possible that I'm missing something here.
>>
>> On Thu, Jan 3, 2019 at 10:05 AM mhd wrk <[email protected]> wrote:
>>
>>> Thanks for the link. So the final answer is that even if the libhdfs
>>> bug gets fixed there won't be any changes to Impala to expose
>>> requesting_user in Catalog Service, right?
>>>
>>> On Thu, Jan 3, 2019 at 9:46 AM Tim Armstrong <[email protected]>
>>> wrote:
>>>
>>>> > catalog server ignores file system authorization model
>>>> The catalog daemon does this by design - the idea is that the catalog
>>>> server can load and cache metadata on behalf of multiple users. It requires
>>>> that the catalogd user (usually "impala") has permissions to read
>>>> filesystem metadata.
>>>>
>>>> The "user account requirements" section in our docs explains this:
>>>> https://impala.apache.org/docs/build/html/topics/impala_prereqs.html#prereqs
>>>> and
>>>> https://impala.apache.org/docs/build/html/topics/impala_security_files.html
>>>>
>>>> On Wed, Jan 2, 2019 at 5:52 PM mhd wrk <[email protected]> wrote:
>>>>
>>>>> it's more about enforcing Hadoop file system authorisation. All we
>>>>> have done is implementing a custom Hadoop File System (
>>>>> org.apache.hadoop.fs.FileSystem) and now trying to use Impala to
>>>>> query files hosted on that file system and it fails because catalog server
>>>>> ignores file system authorization model. The same file system works nicely
>>>>> with HDFS commands (e.g. hdfs dfs -ls ...) as well as HiveServer.
>>>>>
>>>>> Our clients expect us to enforce authorization at all levels (HDFS,
>>>>> Accumulo, Hive, Impala and ....)
>>>>>
>>>>> On Wed, Jan 2, 2019 at 4:56 PM Tim Armstrong <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Stepping back for a second, doesn't what you're trying to do assume
>>>>>> that each user will load metadata for each table separately? The whole
>>>>>> point of the catalog server is that we load the metadata once and then
>>>>>> share it between queries and users.
>>>>>>
>>>>>> I don't think we want to have the catalog server load different
>>>>>> versions of a table depending on which user initially loaded the table?
>>>>>> That would cause all sorts of issues.
>>>>>>
>>>>>> On Wed, Jan 2, 2019 at 12:36 PM mhd wrk <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I see. I was wondering how it works inside hive server. Basically
>>>>>>> this is a HDFS C API issue. Thanks for the elaborate explanation.
>>>>>>>
>>>>>>> On Wed, Jan 2, 2019 at 12:27 PM Shant Hovsepian <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Problem is mostly with libhdfs as documented here HADOOP-12953
>>>>>>>>
>>>>>>>> On a kerberized setup the service principal gets picked up. There
>>>>>>>> are workarounds in the Java HDFS API but the C-based one in libhdfs
>>>>>>>> has this issue. Of course caching HDFS will be trickier in Impala as
>>>>>>>> well but first this one API in libhdfs needs to be enhanced.
>>>>>>>>
>>>>>>>> Also in general having database authorization at the file level may
>>>>>>>> not be a good idea or clean design and using Sentry and extending its
>>>>>>>> authorization mechanisms would be cleaner.
>>>>>>>>
>>>>>>>> -Shant
>>>>>>>>
>>>>>>>> On Wed, Jan 2, 2019, 12:21 PM mhd wrk <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for further info. Not sure if our Product Management is OK,
>>>>>>>>> at this point, with us patching Impala server to get our solution
>>>>>>>>> working.
>>>>>>>>> Our product is supposed to work with already installed servers.
>>>>>>>>>
>>>>>>>>> Any plans to address the gap (making requesting_user visible
>>>>>>>>> inside catalog server) in a future release?
>>>>>>>>>
>>>>>>>>> On Wed, Jan 2, 2019 at 11:50 AM Bharath Vissapragada <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I was poking around in the code and it looks like we have most of
>>>>>>>>>> the code in place
>>>>>>>>>> <https://github.com/apache/impala/blob/27577dd652554dda5a03016e2d1e3ab66fe6b1f5/common/thrift/CatalogService.thrift#L47>
>>>>>>>>>>
>>>>>>>>>> // Common header included in all CatalogService requests.
>>>>>>>>>> // TODO: The CatalogServiceVersion/protocol version should be part of the header.
>>>>>>>>>> // This would require changes in BDR and break their compatibility story. We should
>>>>>>>>>> // coordinate a joint change somewhere down the line.
>>>>>>>>>> struct TCatalogServiceRequestHeader {
>>>>>>>>>>   // The effective user who submitted this request.
>>>>>>>>>>   1: optional string requesting_user
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> That header is included in all the RPCs. However, that is an
>>>>>>>>>> optional field and may not be set in a few places (since we don't
>>>>>>>>>> actually rely on that currently). So you could start with making it a
>>>>>>>>>> "required" field and see what all breaks. HTH.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 2, 2019 at 11:35 AM Bharath Vissapragada <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think we expose it via UDF effective_user() (effective user
>>>>>>>>>>> could be different from the connected user if delegation/doas is
>>>>>>>>>>> enabled). You can run a query like "select effective_user()" in a
>>>>>>>>>>> session.
>>>>>>>>>>>
>>>>>>>>>>> You can also look it up in the /sessions page on the coordinator
>>>>>>>>>>> web UI (<coordinator>:25000/sessions?json) and you can get a json
>>>>>>>>>>> formatted string containing the connected and delegate user for each
>>>>>>>>>>> session.
>>>>>>>>>>>
>>>>>>>>>>> If you want it on the Catalog side, you probably have to plumb
>>>>>>>>>>> it through the RPC calls (change the thrift spec and pass it along
>>>>>>>>>>> from the coordinator session handling code to the Catalog RPC code).
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 2, 2019 at 11:19 AM mhd wrk <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Is there any Impala/Sentry specific API we can use inside our
>>>>>>>>>>>> code to figure out who current user is?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 2, 2019 at 11:12 AM Bharath Vissapragada <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yes. I think Jeszy is right. Per my understanding too, we
>>>>>>>>>>>>> don't impersonate the client user on the Catalog server. Instead,
>>>>>>>>>>>>> we enforce the authorization via Sentry during query planning.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 2, 2019 at 7:06 AM mhd wrk <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> IMPALA-2177 sounds like the correct issue.
>>>>>>>>>>>>>> Here are log messages from authentication.cc for impalad and
>>>>>>>>>>>>>> catalogd respectively:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I0102 14:15:06.722666 28195 authentication.cc:478] Successfully authenticated client user *"[email protected] <[email protected]>"*
>>>>>>>>>>>>>>> I0102 03:40:07.972348 27948 authentication.cc:445] Successfully authenticated principal *"impala/[email protected] <[email protected]>"* on an internal connection
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As you can see from the messages above, impalad is able to
>>>>>>>>>>>>>> identify the currently connected user correctly. However catalogd
>>>>>>>>>>>>>> always authenticates as impala which causes the problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jan 2, 2019 at 4:19 AM Jeszy <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> IIUC your question correctly, this is a limitation. IMPALA-2177
>>>>>>>>>>>>>>> looks to be the appropriate jira.
>>>>>>>>>>>>>>> Most users use Impala together with Sentry, where the recommended
>>>>>>>>>>>>>>> approach is to disable impersonation (even in services that allow
>>>>>>>>>>>>>>> it, like Hive).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Can you add the stack trace here if possible? It is not
>>>>>>>>>>>>>>> > super clear where exactly the problem is.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>>>> > Bharath
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > On Tue, Jan 1, 2019 at 6:34 PM mhd wrk <[email protected]> wrote:
>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>> >> We have our own implementation of Hadoop FileSystem which
>>>>>>>>>>>>>>> >> relies on the current user in a kerberized environment to locate
>>>>>>>>>>>>>>> >> user-specific files in HDFS. This custom file system works fine
>>>>>>>>>>>>>>> >> inside Hive to create external tables and query them. However,
>>>>>>>>>>>>>>> >> trying to access the same tables via Impala (JDBC driver) fails.
>>>>>>>>>>>>>>> >> From the log messages, it seems that when impalad sends requests
>>>>>>>>>>>>>>> >> to catalogd to get the metadata of a given table, the current user
>>>>>>>>>>>>>>> >> returned by UserGroupInformation is the service account running
>>>>>>>>>>>>>>> >> the server (impala/[email protected]) instead of the currently
>>>>>>>>>>>>>>> >> connected user.
>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>> >> Is this a known issue or limitation of Impala?
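
P.S. To make the trade-off discussed in this thread concrete, here is a toy sketch in Python. It is not Impala code. It only models the difference between a cache that loads metadata once as the service user and one that loads it per requesting user. The class name, the per_user flag, and the demo ACLs are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# Hypothetical loader signature: (user, table) -> files visible to that user.
Loader = Callable[[str, str], Tuple[str, ...]]


@dataclass
class MetadataCache:
    """Toy model of a catalog cache; not Impala's actual implementation."""
    loader: Loader
    service_user: str = "impala"
    per_user: bool = False  # hypothetical configuration option
    loads: int = 0          # counts expensive loader calls
    _entries: Dict[Tuple[str, str], Tuple[str, ...]] = field(default_factory=dict)

    def get(self, requesting_user: str, table: str) -> Tuple[str, ...]:
        # Shared mode: always load as the service user, one entry per table.
        # Per-user mode: load as the requester, one entry per (user, table).
        user = requesting_user if self.per_user else self.service_user
        key = (user, table)
        if key not in self._entries:
            self.loads += 1
            self._entries[key] = self.loader(user, table)
        return self._entries[key]


def demo_loader(user: str, table: str) -> Tuple[str, ...]:
    # Made-up file system ACLs: the service user can see every file.
    acl = {"impala": ("a.parq", "b.parq"), "alice": ("a.parq",), "bob": ("b.parq",)}
    return acl.get(user, ())


shared = MetadataCache(demo_loader)
isolated = MetadataCache(demo_loader, per_user=True)
for u in ("alice", "bob", "alice"):
    shared.get(u, "t")
    isolated.get(u, "t")
```

After the loop, the shared cache has called the loader once and every user sees the service user's view, while the per-user cache has called it once per distinct user and each user sees only their own files. That extra loading and memory is the cost I am suggesting some deployments would accept.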
