>  catalog server ignores file system authorization model
The catalog daemon does this by design - the idea is that the catalog
server can load and cache metadata on behalf of multiple users. It requires
that the catalogd user (usually "impala") has permissions to read
filesystem metadata.

The "user account requirements" section in our docs explains this:
https://impala.apache.org/docs/build/html/topics/impala_prereqs.html#prereqs
and
https://impala.apache.org/docs/build/html/topics/impala_security_files.html
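
To make the design concrete, here is a toy sketch in Python. It is not Impala code, and every name in it (CatalogSketch, plan_query, the privileges map) is hypothetical. It only illustrates the two points above: metadata is loaded once under the service account and shared across users, and per-user authorization is checked separately at planning time (by Sentry, per the thread below) rather than by the file system.

```python
class CatalogSketch:
    """Toy stand-in for a shared metadata cache (hypothetical, not Impala code)."""

    def __init__(self, service_user="impala"):
        self.service_user = service_user
        self._cache = {}
        self.loads = []  # records (table, user the load ran as)

    def get_table(self, table, requesting_user):
        # requesting_user is accepted but deliberately unused: the load
        # always runs as the service account, so the file system never
        # sees the end user's identity.
        if table not in self._cache:
            self.loads.append((table, self.service_user))
            self._cache[table] = {"table": table, "loaded_as": self.service_user}
        return self._cache[table]


def plan_query(catalog, privileges, user, table):
    """Check per-user privileges first, then read the shared metadata."""
    if table not in privileges.get(user, set()):
        raise PermissionError(f"{user} may not access {table}")
    return catalog.get_table(table, user)


if __name__ == "__main__":
    catalog = CatalogSketch()
    privileges = {"alice": {"db.t1"}, "bob": {"db.t1"}}
    first = plan_query(catalog, privileges, "alice", "db.t1")
    second = plan_query(catalog, privileges, "bob", "db.t1")
    print(first is second)   # both users share one cached entry
    print(catalog.loads)     # a single load, performed as the service user
```

A file system that picks files based on the calling user would see only "impala" in this model, which matches the behaviour described in the thread.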

On Wed, Jan 2, 2019 at 5:52 PM mhd wrk <[email protected]> wrote:

> It's more about enforcing Hadoop file system authorization. All we have
> done is implement a custom Hadoop file system
> (org.apache.hadoop.fs.FileSystem), and now we are trying to use Impala to
> query files hosted on that file system. It fails because catalog server
> ignores file system authorization model. The same file system works nicely
> with HDFS commands (e.g. hdfs dfs -ls ...) as well as HiveServer.
>
> Our clients expect us to enforce authorization at all levels (HDFS,
> Accumulo, Hive, Impala, and so on).
>
> On Wed, Jan 2, 2019 at 4:56 PM Tim Armstrong <[email protected]>
> wrote:
>
>> Stepping back for a second, doesn't what you're trying to do assume that
>> each user will load metadata for each table separately? The whole point of
>> the catalog server is that we load the metadata once and then share it
>> between queries and users.
>>
>> I don't think we want to have the catalog server load different versions
>> of a table depending on which user initially loaded the table? That would
>> cause all sorts of issues.
>>
>> On Wed, Jan 2, 2019 at 12:36 PM mhd wrk <[email protected]> wrote:
>>
>>> I see. I was wondering how it works inside Hive server. Basically this
>>> is an HDFS C API issue. Thanks for the elaborate explanation.
>>>
>>> On Wed, Jan 2, 2019 at 12:27 PM Shant Hovsepian <[email protected]>
>>> wrote:
>>>
>>>> The problem is mostly with libhdfs, as documented in HADOOP-12953.
>>>>
>>>> On a kerberized setup the service principal gets picked up. There are
>>>> workarounds in the Java HDFS API, but the C-based one in libhdfs has this
>>>> issue. Of course, caching HDFS will be trickier in Impala as well, but first
>>>> this one API in libhdfs needs to be enhanced.
>>>>
>>>> Also, in general, having database authorization at the file level may not
>>>> be a good idea or a clean design; using Sentry and extending its
>>>> authorization mechanisms would be cleaner.
>>>>
>>>> -Shant
>>>>
>>>> On Wed, Jan 2, 2019, 12:21 PM mhd wrk <[email protected]> wrote:
>>>>
>>>>> Thanks for the further info. I'm not sure our Product Management is OK, at
>>>>> this point, with us patching the Impala server to get our solution working.
>>>>> Our product is supposed to work with already installed servers.
>>>>>
>>>>> Any plans to address the gap (making requesting_user visible inside the
>>>>> catalog server) in a future release?
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jan 2, 2019 at 11:50 AM Bharath Vissapragada <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I was poking around in the code and it looks like we have most of the
>>>>>> code in place
>>>>>> <https://github.com/apache/impala/blob/27577dd652554dda5a03016e2d1e3ab66fe6b1f5/common/thrift/CatalogService.thrift#L47>
>>>>>>
>>>>>> // Common header included in all CatalogService requests.
>>>>>> // TODO: The CatalogServiceVersion/protocol version should be part of
>>>>>> // the header. This would require changes in BDR and break their
>>>>>> // compatibility story. We should coordinate a joint change somewhere
>>>>>> // down the line.
>>>>>> struct TCatalogServiceRequestHeader {
>>>>>>   // The effective user who submitted this request.
>>>>>>   1: optional string requesting_user
>>>>>> }
>>>>>>
>>>>>> That header is included in all the RPCs. However, it is an optional
>>>>>> field and may not be set in a few places (since we don't actually rely on
>>>>>> it currently). So you could start by making it a "required" field and see
>>>>>> what breaks. HTH.
>>>>>>
>>>>>> On Wed, Jan 2, 2019 at 11:35 AM Bharath Vissapragada <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I think we expose it via the UDF effective_user() (the effective user
>>>>>>> could be different from the connected user if delegation/doas is
>>>>>>> enabled). You can run a query like "select effective_user()" in a session.
>>>>>>>
>>>>>>> You can also look it up in the /sessions page on the coordinator web
>>>>>>> UI (<coordinator>:25000/sessions?json) and you can get a json formatted
>>>>>>> string containing the connected and delegate user for each session.
>>>>>>>
>>>>>>> If you want it on the Catalog side, you probably have to plumb it
>>>>>>> through the RPC calls (change the thrift spec and pass it along from the
>>>>>>> coordinator session handling code to the Catalog RPC code).
>>>>>>>
>>>>>>> On Wed, Jan 2, 2019 at 11:19 AM mhd wrk <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is there any Impala/Sentry-specific API we can use inside our code
>>>>>>>> to figure out who the current user is?
>>>>>>>>
>>>>>>>> On Wed, Jan 2, 2019 at 11:12 AM Bharath Vissapragada <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yes, I think Jeszy is right. Per my understanding too, we don't
>>>>>>>>> impersonate the client user on the Catalog server. Instead, we
>>>>>>>>> enforce authorization via Sentry during query planning.
>>>>>>>>>
>>>>>>>>> On Wed, Jan 2, 2019 at 7:06 AM mhd wrk <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> IMPALA-2177 sounds like the correct issue.
>>>>>>>>>> Here are log messages from authentication.cc for impalad and
>>>>>>>>>> catalogd respectively:
>>>>>>>>>>
>>>>>>>>>> I0102 14:15:06.722666 28195 authentication.cc:478] Successfully
>>>>>>>>>>> authenticated client user *"[email protected]
>>>>>>>>>>> <[email protected]>"*
>>>>>>>>>>> I0102 03:40:07.972348 27948 authentication.cc:445] Successfully
>>>>>>>>>>> authenticated principal *"impala/[email protected]
>>>>>>>>>>> <[email protected]>"* on an internal connection
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> As you can see from the messages above, impalad is able to
>>>>>>>>>> identify the currently connected user correctly. However, catalogd
>>>>>>>>>> always authenticates as impala, which causes the problem.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 2, 2019 at 4:19 AM Jeszy <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey,
>>>>>>>>>>>
>>>>>>>>>>> If I understand your question correctly, this is a limitation.
>>>>>>>>>>> IMPALA-2177 looks to be the appropriate JIRA.
>>>>>>>>>>> Most users use Impala together with Sentry, where the recommended
>>>>>>>>>>> approach is to disable impersonation (even in services that allow
>>>>>>>>>>> it, like Hive).
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hi,
>>>>>>>>>>> >
>>>>>>>>>>> > Can you add the stack trace here if possible? It is not super
>>>>>>>>>>> > clear where exactly the problem is.
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks,
>>>>>>>>>>> > Bharath
>>>>>>>>>>> >
>>>>>>>>>>> > On Tue, Jan 1, 2019 at 6:34 PM mhd wrk <[email protected]> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >> We have our own implementation of Hadoop FileSystem which
>>>>>>>>>>> >> relies on the current user in a kerberized environment to locate
>>>>>>>>>>> >> user-specific files in HDFS. This custom file system works fine
>>>>>>>>>>> >> inside Hive to create external tables and query them. However,
>>>>>>>>>>> >> trying to access the same tables via Impala (JDBC driver) fails.
>>>>>>>>>>> >> From the log messages, it seems that when impalad sends requests
>>>>>>>>>>> >> to catalogd to get the metadata of a given table, the current
>>>>>>>>>>> >> user returned by UserGroupInformation is the service account
>>>>>>>>>>> >> running the server (impala/[email protected]) instead of the
>>>>>>>>>>> >> currently connected user.
>>>>>>>>>>> >>
>>>>>>>>>>> >> Is this a known issue or limitation of Impala?
>>>>>>>>>>>
>>>>>>>>>>
