I see. I was wondering how it works inside HiveServer. Basically this is an
HDFS C API issue. Thanks for the elaborate explanation.

On Wed, Jan 2, 2019 at 12:27 PM Shant Hovsepian <[email protected]>
wrote:

> The problem is mostly with libhdfs, as documented in HADOOP-12953.
>
> On a kerberized setup the service principal gets picked up. There are
> workarounds in the Java HDFS API, but the C-based one in libhdfs has this
> issue. Of course caching HDFS will be trickier in Impala as well, but first
> this one API in libhdfs needs to be enhanced.
>
> Also, in general, having database authorization at the file level may not be
> a good idea or a clean design; using Sentry and extending its
> authorization mechanisms would be cleaner.
>
> -Shant
>
> On Wed, Jan 2, 2019, 12:21 PM mhd wrk <[email protected]> wrote:
>
>> Thanks for further info. Not sure if our Product Management is OK, at
>> this point, with us patching Impala server to get our solution working. Our
>> product is supposed to work with already installed servers.
>>
>> Are there any plans to address the gap (making requesting_user visible
>> inside the catalog server) in a future release?
>>
>>
>>
>> On Wed, Jan 2, 2019 at 11:50 AM Bharath Vissapragada <
>> [email protected]> wrote:
>>
>>> I was poking around in the code and it looks like we have most of the code
>>> in place
>>> <https://github.com/apache/impala/blob/27577dd652554dda5a03016e2d1e3ab66fe6b1f5/common/thrift/CatalogService.thrift#L47>
>>>
>>> // Common header included in all CatalogService requests.
>>> // TODO: The CatalogServiceVersion/protocol version should be part of the header.
>>> // This would require changes in BDR and break their compatibility story. We should
>>> // coordinate a joint change somewhere down the line.
>>> struct TCatalogServiceRequestHeader {
>>>   // The effective user who submitted this request.
>>>   1: optional string requesting_user
>>> }
>>>
>>> That header is included in all the RPCs. However, it is an optional
>>> field and may not be set in a few places (since we don't actually rely on
>>> it currently). So you could start by making it a "required" field and see
>>> what breaks. HTH.
>>>
>>> On Wed, Jan 2, 2019 at 11:35 AM Bharath Vissapragada <
>>> [email protected]> wrote:
>>>
>>>> I think we expose it via the UDF effective_user() (the effective user
>>>> could be different from the connected user if delegation/doas is enabled).
>>>> You can run a query like "select effective_user()" in a session.
>>>>
>>>> You can also look it up on the /sessions page of the coordinator web UI
>>>> (<coordinator>:25000/sessions?json), which returns a JSON-formatted string
>>>> containing the connected and delegate user for each session.
>>>>
>>>> If you want it on the Catalog side, you probably have to plumb it
>>>> through the RPC calls (change the thrift spec and pass it along from the
>>>> coordinator session handling code to the Catalog RPC code).
>>>>
>>>> On Wed, Jan 2, 2019 at 11:19 AM mhd wrk <[email protected]> wrote:
>>>>
>>>>> Is there any Impala/Sentry-specific API we can use inside our code to
>>>>> figure out who the current user is?
>>>>>
>>>>> On Wed, Jan 2, 2019 at 11:12 AM Bharath Vissapragada <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Yes. I think Jeszy is right. Per my understanding too, we don't
>>>>>> impersonate the client user on the Catalog server. Instead, we enforce
>>>>>> the authorization via Sentry during query planning.
>>>>>>
>>>>>> On Wed, Jan 2, 2019 at 7:06 AM mhd wrk <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> IMPALA-2177 sounds like the correct issue.
>>>>>>> Here are log messages from authentication.cc for impalad and
>>>>>>> catalogd respectively:
>>>>>>>
>>>>>>>> I0102 14:15:06.722666 28195 authentication.cc:478] Successfully
>>>>>>>> authenticated client user "[email protected]"
>>>>>>>> I0102 03:40:07.972348 27948 authentication.cc:445] Successfully
>>>>>>>> authenticated principal "impala/[email protected]" on an internal
>>>>>>>> connection
>>>>>>>
>>>>>>>
>>>>>>> As you can see from the messages above, impalad is able to identify
>>>>>>> the currently connected user correctly. However, catalogd always
>>>>>>> authenticates as impala, which causes the problem.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jan 2, 2019 at 4:19 AM Jeszy <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey,
>>>>>>>>
>>>>>>>> If I understand your question correctly, this is a limitation.
>>>>>>>> IMPALA-2177 looks to be the appropriate JIRA.
>>>>>>>> Most users use Impala together with Sentry, where the recommended
>>>>>>>> approach is to disable impersonation (even in services that allow
>>>>>>>> it, like Hive).
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada <
>>>>>>>> [email protected]> wrote:
>>>>>>>> >
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > Can you add the stack trace here if possible? It is not super
>>>>>>>> > clear where exactly the problem is.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Bharath
>>>>>>>> >
>>>>>>>> > On Tue, Jan 1, 2019 at 6:34 PM mhd wrk <[email protected]> wrote:
>>>>>>>> >>
>>>>>>>> >> We have our own implementation of Hadoop FileSystem which relies
>>>>>>>> >> on the current user in a kerberized environment to locate
>>>>>>>> >> user-specific files in HDFS. This custom file system works fine
>>>>>>>> >> inside Hive to create external tables and query them. However,
>>>>>>>> >> trying to access the same tables via Impala (JDBC driver) fails.
>>>>>>>> >> From the log messages, it seems that when impalad sends requests
>>>>>>>> >> to catalogd to get the metadata of a given table, the current user
>>>>>>>> >> returned by UserGroupInformation is the service account running
>>>>>>>> >> the server (impala/[email protected]) instead of the currently
>>>>>>>> >> connected user.
>>>>>>>> >>
>>>>>>>> >> Is this a known issue or limitation of Impala?
>>>>>>>>
>>>>>>>
