I see. I was wondering how it works inside the Hive server. Basically, this is an HDFS C API issue. Thanks for the detailed explanation.
On Wed, Jan 2, 2019 at 12:27 PM Shant Hovsepian <[email protected]> wrote:

> Problem is mostly with libhdfs, as documented in HADOOP-12953.
>
> On a kerberized setup the service principal gets picked up. There are
> workarounds in the Java HDFS API, but the C-based one in libhdfs has this
> issue. Of course caching HDFS will be trickier in Impala as well, but first
> this one API in libhdfs needs to be enhanced.
>
> Also, in general, having database authorization at the file level may not be
> a good idea or clean design; using Sentry and extending its authorization
> mechanisms would be cleaner.
>
> -Shant
>
> On Wed, Jan 2, 2019, 12:21 PM mhd wrk <[email protected]> wrote:
>
>> Thanks for further info. Not sure if our Product Management is OK, at
>> this point, with us patching Impala server to get our solution working.
>> Our product is supposed to work with already installed servers.
>>
>> Any plans to address the gap (making requesting_user visible inside
>> catalog server) in a future release?
>>
>> On Wed, Jan 2, 2019 at 11:50 AM Bharath Vissapragada <
>> [email protected]> wrote:
>>
>>> I was poking around in the code and it looks like we have most of the code
>>> in place
>>> <https://github.com/apache/impala/blob/27577dd652554dda5a03016e2d1e3ab66fe6b1f5/common/thrift/CatalogService.thrift#L47>
>>>
>>> // Common header included in all CatalogService requests.
>>> // TODO: The CatalogServiceVersion/protocol version should be part of the header.
>>> // This would require changes in BDR and break their compatibility story. We should
>>> // coordinate a joint change somewhere down the line.
>>> struct TCatalogServiceRequestHeader {
>>>   // The effective user who submitted this request.
>>>   1: optional string requesting_user
>>> }
>>>
>>> That header is included in all the RPCs. However, that is an optional
>>> field and may not be set in a few places (since we don't actually rely on
>>> that currently).
>>> So you could start with making it a "required" field and see
>>> what all breaks. HTH.
>>>
>>> On Wed, Jan 2, 2019 at 11:35 AM Bharath Vissapragada <
>>> [email protected]> wrote:
>>>
>>>> I think we expose it via the UDF effective_user() (the effective user
>>>> could be different from the connected user if delegation/doas is
>>>> enabled). You can run a query like "select effective_user()" in a session.
>>>>
>>>> You can also look it up in the /sessions page on the coordinator web UI
>>>> (<coordinator>:25000/sessions?json), where you can get a JSON-formatted
>>>> string containing the connected and delegated user for each session.
>>>>
>>>> If you want it on the Catalog side, you probably have to plumb it
>>>> through the RPC calls (change the thrift spec and pass it along from the
>>>> coordinator session handling code to the Catalog RPC code).
>>>>
>>>> On Wed, Jan 2, 2019 at 11:19 AM mhd wrk <[email protected]> wrote:
>>>>
>>>>> Is there any Impala/Sentry specific API we can use inside our code to
>>>>> figure out who the current user is?
>>>>>
>>>>> On Wed, Jan 2, 2019 at 11:12 AM Bharath Vissapragada <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Yes. I think Jeszy is right. Per my understanding too, we don't
>>>>>> impersonate the client user on the Catalog server. Instead, we enforce
>>>>>> the authorization via Sentry during query planning.
>>>>>>
>>>>>> On Wed, Jan 2, 2019 at 7:06 AM mhd wrk <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> IMPALA-2177 sounds like the correct issue.
>>>>>>> Here are log messages from authentication.cc for impalad and
>>>>>>> catalogd respectively:
>>>>>>>
>>>>>>>> I0102 14:15:06.722666 28195 authentication.cc:478] Successfully
>>>>>>>> authenticated client user *"[email protected] <[email protected]>"*
>>>>>>>> I0102 03:40:07.972348 27948 authentication.cc:445] Successfully
>>>>>>>> authenticated principal *"impala/[email protected]
>>>>>>>> <[email protected]>"* on an internal connection
>>>>>>>
>>>>>>> As you can see from the messages above, impalad is able to identify
>>>>>>> the currently connected user correctly. However, catalogd always
>>>>>>> authenticates as impala, which causes the problem.
>>>>>>>
>>>>>>> On Wed, Jan 2, 2019 at 4:19 AM Jeszy <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey,
>>>>>>>>
>>>>>>>> If I understand your question correctly, this is a limitation.
>>>>>>>> IMPALA-2177 looks to be the appropriate jira.
>>>>>>>> Most users use Impala together with Sentry, where the recommended
>>>>>>>> approach is to disable impersonation (even in services that allow it,
>>>>>>>> like Hive).
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada <
>>>>>>>> [email protected]> wrote:
>>>>>>>> >
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > Can you add the stack trace here if possible? It is not super
>>>>>>>> > clear where exactly the problem is.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Bharath
>>>>>>>> >
>>>>>>>> > On Tue, Jan 1, 2019 at 6:34 PM mhd wrk <[email protected]>
>>>>>>>> > wrote:
>>>>>>>> >>
>>>>>>>> >> We have our own implementation of Hadoop FileSystem which relies
>>>>>>>> >> on the current user in a kerberized environment to locate
>>>>>>>> >> user-specific files in HDFS. This custom file system works fine
>>>>>>>> >> inside Hive to create external tables and query them. However,
>>>>>>>> >> trying to access the same tables via Impala (JDBC driver) fails.
>>>>>>>> >> Watching the log messages, it seems that when impalad sends
>>>>>>>> >> requests to catalogd to get the metadata of a given table, the
>>>>>>>> >> current user returned by UserGroupInformation is the service
>>>>>>>> >> account running the server (impala/[email protected]) instead of
>>>>>>>> >> the currently connected user.
>>>>>>>> >>
>>>>>>>> >> Is this a known issue or limitation of Impala?
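
For readers following the thread, the plumbing idea discussed above (carrying the effective user in the request header rather than relying on the identity of the server process) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Impala code: only the field name `requesting_user` comes from the quoted Thrift struct, while `RequestHeader`, `effective_user`, `resolve_user`, and the example principals are hypothetical names made up for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestHeader:
    """Hypothetical stand-in for TCatalogServiceRequestHeader."""
    # Optional, as in the quoted Thrift struct: callers may leave it unset.
    requesting_user: Optional[str] = None


def effective_user(connected_user: str, delegated_user: Optional[str] = None) -> str:
    """Coordinator side: a delegated (doas) user takes precedence over the connected user."""
    return delegated_user or connected_user


def resolve_user(header: RequestHeader, service_principal: str) -> str:
    """Catalog side (hypothetical): prefer the user carried in the header.

    When the field is unset, the only identity left is the service
    principal, which matches the symptom reported in the thread.
    """
    return header.requesting_user or service_principal


if __name__ == "__main__":
    service = "impala/host@EXAMPLE.COM"  # hypothetical service principal
    header = RequestHeader(requesting_user=effective_user("alice@EXAMPLE.COM"))
    print(resolve_user(header, service))           # header carries the end user
    print(resolve_user(RequestHeader(), service))  # unset field: service principal
```

Making the field "required", as suggested in the thread, would roughly correspond to removing the fallback in `resolve_user` so that any caller that forgets to set the user fails loudly instead of silently acting as the service account.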
