I understand the reasoning behind the design decision, which requires making
files available to one specific user (the catalogd service account). However,
there are clients in certain industries who would accept a performance hit
(possibly caused by loading/caching metadata per user) as long as they can have
user-specific permissions at all storage levels (HDFS, Accumulo, and so on).

IMO, Impala should make this possible as a configuration option.
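
To make the trade-off concrete, here is a toy Python sketch of what such an
option could look like. It is not Impala code: MetadataCache, list_files,
VISIBLE_FILES and the per_user flag are hypothetical names made up for
illustration, and the real catalogd is far more involved. With the flag off,
metadata is loaded once as the service user and shared; with it on, each
requesting user gets a separate cached copy, at the cost of extra loads.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-user view of a file system: which files each user may list.
VISIBLE_FILES: Dict[str, List[str]] = {
    "impala": ["part-0", "part-1"],  # service account sees everything
    "alice": ["part-0"],
    "bob": ["part-1"],
}


def list_files(table: str, user: str) -> List[str]:
    """Stand-in for a file listing performed with `user`'s permissions."""
    return [f"{table}/{name}" for name in VISIBLE_FILES.get(user, [])]


class MetadataCache:
    """Toy cache contrasting shared and per-user metadata loading."""

    def __init__(self, loader: Callable[[str, str], List[str]],
                 service_user: str = "impala", per_user: bool = False) -> None:
        self._loader = loader
        self._service_user = service_user
        self._per_user = per_user  # hypothetical configuration option
        self._entries: Dict[Tuple[str, str], List[str]] = {}
        self.loads = 0  # number of times the loader was actually invoked

    def get(self, table: str, requesting_user: str) -> List[str]:
        # Shared mode always loads as the service user, so every caller
        # hits the same cache entry. Per-user mode keys the entry by caller.
        load_as = requesting_user if self._per_user else self._service_user
        key = (table, load_as)
        if key not in self._entries:
            self._entries[key] = self._loader(table, load_as)
            self.loads += 1
        return self._entries[key]
```

In shared mode, alice and bob both get the listing loaded as "impala" after a
single load. In per-user mode each sees only their own files, and the table is
loaded once per user, which is the performance hit mentioned above.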



On Thu, Jan 3, 2019 at 10:22 AM Bharath Vissapragada <[email protected]>
wrote:

> Agree with Tim's points. My opinion is also the same, given the current
> Catalog architecture.
>
> On Thu, Jan 3, 2019 at 10:17 AM Tim Armstrong <[email protected]>
> wrote:
>
>> Right, we could use requesting_user for logging, statistics, etc, but it
>> would be problematic to impersonate that user when loading metadata.
>>
>> It's of course possible that I'm missing something here.
>>
>> On Thu, Jan 3, 2019 at 10:05 AM mhd wrk <[email protected]> wrote:
>>
>>> Thanks for the link. So the final answer is that even if the libhdfs
>>> bug gets fixed, there won't be any changes to Impala to expose
>>> requesting_user in the Catalog Service, right?
>>>
>>> On Thu, Jan 3, 2019 at 9:46 AM Tim Armstrong <[email protected]>
>>> wrote:
>>>
>>>> >  catalog server ignores file system authorization model
>>>> The catalog daemon does this by design - the idea is that the catalog
>>>> server can load and cache metadata on behalf of multiple users. It requires
>>>> that the catalogd user (usually "impala") has permissions to read
>>>> filesystem metadata.
>>>>
>>>> The "user account requirements" section in our docs explains this:
>>>> https://impala.apache.org/docs/build/html/topics/impala_prereqs.html#prereqs
>>>> and
>>>> https://impala.apache.org/docs/build/html/topics/impala_security_files.html
>>>>
>>>> On Wed, Jan 2, 2019 at 5:52 PM mhd wrk <[email protected]> wrote:
>>>>
>>>>> It's more about enforcing Hadoop file system authorisation. All we
>>>>> have done is implement a custom Hadoop File System
>>>>> (org.apache.hadoop.fs.FileSystem), and we are now trying to use Impala to
>>>>> query files hosted on that file system. It fails because catalog server
>>>>> ignores file system authorization model. The same file system works nicely
>>>>> with HDFS commands (e.g. hdfs dfs -ls ...) as well as HiveServer.
>>>>>
>>>>> Our clients expect us to enforce authorization at all levels (HDFS,
>>>>> Accumulo, Hive, Impala and ....)
>>>>>
>>>>> On Wed, Jan 2, 2019 at 4:56 PM Tim Armstrong <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Stepping back for a second, doesn't what you're trying to do assume
>>>>>> that each user will load metadata for each table separately? The whole
>>>>>> point of the catalog server is that we load the metadata once and then
>>>>>> share it between queries and users.
>>>>>>
>>>>>> I don't think we want to have the catalog server load different
>>>>>> versions of a table depending on which user initially loaded the table?
>>>>>> That would cause all sorts of issues.
>>>>>>
>>>>>> On Wed, Jan 2, 2019 at 12:36 PM mhd wrk <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I see. I was wondering how it works inside HiveServer. Basically
>>>>>>> this is an HDFS C API issue. Thanks for the elaborate explanation.
>>>>>>>
>>>>>>> On Wed, Jan 2, 2019 at 12:27 PM Shant Hovsepian <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> The problem is mostly with libhdfs, as documented in HADOOP-12953.
>>>>>>>>
>>>>>>>> On a kerberized setup the service principal gets picked up. There
>>>>>>>> are workarounds in the Java HDFS API, but the C-based one in libhdfs
>>>>>>>> has this issue. Of course, caching HDFS will be trickier in Impala as
>>>>>>>> well, but first this one API in libhdfs needs to be enhanced.
>>>>>>>>
>>>>>>>> Also, in general, having database authorization at the file level may
>>>>>>>> not be a good idea or a clean design; using Sentry and extending its
>>>>>>>> authorization mechanisms would be cleaner.
>>>>>>>>
>>>>>>>> -Shant
>>>>>>>>
>>>>>>>> On Wed, Jan 2, 2019, 12:21 PM mhd wrk <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the further info. Not sure if our Product Management is
>>>>>>>>> OK, at this point, with us patching Impala server to get our solution
>>>>>>>>> working. Our product is supposed to work with already installed servers.
>>>>>>>>>
>>>>>>>>> Any plans to address the gap (making requesting_user visible
>>>>>>>>> inside catalog server) in a future release?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jan 2, 2019 at 11:50 AM Bharath Vissapragada <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I was poking around in the code and it looks like we have most of
>>>>>>>>>> the code in place
>>>>>>>>>> <https://github.com/apache/impala/blob/27577dd652554dda5a03016e2d1e3ab66fe6b1f5/common/thrift/CatalogService.thrift#L47>
>>>>>>>>>>
>>>>>>>>>> // Common header included in all CatalogService requests.
>>>>>>>>>> // TODO: The CatalogServiceVersion/protocol version should be part of
>>>>>>>>>> // the header. This would require changes in BDR and break their
>>>>>>>>>> // compatibility story. We should coordinate a joint change somewhere
>>>>>>>>>> // down the line.
>>>>>>>>>> struct TCatalogServiceRequestHeader {
>>>>>>>>>>   // The effective user who submitted this request.
>>>>>>>>>>   1: optional string requesting_user
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> That header is included in all the RPCs. However, that is an
>>>>>>>>>> optional field and may not be set in a few places (since we don't
>>>>>>>>>> actually rely on it currently). So you could start by making it a
>>>>>>>>>> "required" field and seeing what breaks. HTH.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 2, 2019 at 11:35 AM Bharath Vissapragada <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think we expose it via the UDF effective_user() (the effective
>>>>>>>>>>> user could be different from the connected user if delegation/doas
>>>>>>>>>>> is enabled). You can run a query like "select effective_user()" in
>>>>>>>>>>> a session.
>>>>>>>>>>>
>>>>>>>>>>> You can also look it up in the /sessions page on the coordinator
>>>>>>>>>>> web UI (<coordinator>:25000/sessions?json), where you can get a
>>>>>>>>>>> JSON-formatted string containing the connected and delegate user
>>>>>>>>>>> for each session.
>>>>>>>>>>>
>>>>>>>>>>> If you want it on the Catalog side, you probably have to plumb
>>>>>>>>>>> it through the RPC calls (change the thrift spec and pass it along
>>>>>>>>>>> from the coordinator session handling code to the Catalog RPC code).
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 2, 2019 at 11:19 AM mhd wrk <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Is there any Impala/Sentry specific API we can use inside our
>>>>>>>>>>>> code to figure out who the current user is?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 2, 2019 at 11:12 AM Bharath Vissapragada <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yes. I think Jeszy is right. Per my understanding too, we
>>>>>>>>>>>>> don't impersonate the client user on the Catalog server. Instead,
>>>>>>>>>>>>> we enforce the authorization via Sentry during query planning.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 2, 2019 at 7:06 AM mhd wrk <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> IMPALA-2177 sounds like the correct issue.
>>>>>>>>>>>>>> Here are log messages from authentication.cc for impalad and
>>>>>>>>>>>>>> catalogd respectively:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I0102 14:15:06.722666 28195 authentication.cc:478]
>>>>>>>>>>>>>>> Successfully authenticated client user "[email protected]"
>>>>>>>>>>>>>>> I0102 03:40:07.972348 27948 authentication.cc:445]
>>>>>>>>>>>>>>> Successfully authenticated principal "impala/[email protected]"
>>>>>>>>>>>>>>> on an internal connection
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As you can see from the messages above, impalad is able to
>>>>>>>>>>>>>> identify the currently connected user correctly. However,
>>>>>>>>>>>>>> catalogd always authenticates as impala, which causes the problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jan 2, 2019 at 4:19 AM Jeszy <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If I understand your question correctly, this is a limitation.
>>>>>>>>>>>>>>> IMPALA-2177 looks to be the appropriate JIRA.
>>>>>>>>>>>>>>> Most users use Impala together with Sentry, where the
>>>>>>>>>>>>>>> recommended approach is to disable impersonation (even in
>>>>>>>>>>>>>>> services that allow it, like Hive).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Can you add the stack trace here if possible? It is not
>>>>>>>>>>>>>>> > super clear where exactly the problem is.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>>>> > Bharath
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > On Tue, Jan 1, 2019 at 6:34 PM mhd wrk <[email protected]> wrote:
>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>> >> We have our own implementation of Hadoop FileSystem, which
>>>>>>>>>>>>>>> >> relies on the current user in a kerberized environment to
>>>>>>>>>>>>>>> >> locate user-specific files in HDFS. This custom file system
>>>>>>>>>>>>>>> >> works fine inside Hive to create external tables and query
>>>>>>>>>>>>>>> >> them. However, trying to access the same tables via Impala
>>>>>>>>>>>>>>> >> (JDBC driver) fails. From the log messages, it seems that
>>>>>>>>>>>>>>> >> when impalad sends requests to catalogd to get the metadata
>>>>>>>>>>>>>>> >> of a given table, the current user returned by
>>>>>>>>>>>>>>> >> UserGroupInformation is the service account running the
>>>>>>>>>>>>>>> >> server (impala/[email protected]) instead of the
>>>>>>>>>>>>>>> >> currently connected user.
>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>> >> Is this a known issue or limitation of Impala?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
