gardenia opened a new issue, #2032:
URL: https://github.com/apache/iceberg-python/issues/2032
### Apache Iceberg version
None
### Please describe the bug 🐞
Hi,
I'm using the following code to connect to a kerberized hive metastore:
```
from pyiceberg.catalog import load_catalog

# Set up the Iceberg catalog
catalog = load_catalog("hive", **{
    "type": "hive",
    "uri": "thrift://cluster1-hive-server:9083",
    "hive.kerberos-authentication": "true",
})
print("Initial Namespaces:", catalog.list_namespaces())
```
Before running this I did a kinit:
kinit -kt /var/keytabs/hive.keytab hiveuser/[email protected]
When I run the script I get the following error:
```
Traceback (most recent call last):
  File "/home/sandbox-user/connect-to-hive-metastore-and-list-namespaces.py", line 20, in <module>
    print("Initial Namespaces:", catalog.list_namespaces())
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sandbox-user/venv/lib/python3.12/site-packages/pyiceberg/catalog/hive.py", line 707, in list_namespaces
    with self._client as open_client:
  File "/home/sandbox-user/venv/lib/python3.12/site-packages/pyiceberg/catalog/hive.py", line 172, in __enter__
    self._transport.open()
  File "/home/sandbox-user/venv/lib/python3.12/site-packages/thrift/transport/TTransport.py", line 381, in open
    self.send_sasl_msg(self.OK, self.sasl.process())
                                ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/puresasl/client.py", line 16, in wrapped
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/puresasl/client.py", line 148, in process
    return self._chosen_mech.process(challenge)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/puresasl/mechanisms.py", line 495, in process
    kerberos.authGSSClientStep(self.context, '')
kerberos.GSSError: (('Unspecified GSS failure. Minor code may provide more information', 851968), ('Server hive/[email protected] not found in Kerberos database', -1765328377))
```
NOTE: I can connect just fine with Java Iceberg in the same situation.
I then ran the script with KRB5_TRACE=/dev/stdout and captured the following
additional output:
```
[345] 1747909692.415510: ccselect module realm chose cache
FILE:/tmp/krb5cc_1001 with client principal
hiveuser/[email protected] for server principal
hive/[email protected]
[345] 1747909692.415511: Getting credentials
hiveuser/[email protected] ->
hive/[email protected] using ccache FILE:/tmp/krb5cc_1001
[345] 1747909692.415512: Retrieving
hiveuser/[email protected] ->
krb5_ccache_conf_data/start_realm@X-CACHECONF: from FILE:/tmp/krb5cc_1001 with
result: -1765328243/Matching credential not found (filename: /tmp/krb5cc_1001)
[345] 1747909692.415513: Retrieving
hiveuser/[email protected] ->
hive/[email protected] from FILE:/tmp/krb5cc_1001 with result:
-1765328243/Matching credential not found (filename: /tmp/krb5cc_1001)
[345] 1747909692.415514: Retrieving
hiveuser/[email protected] -> krbtgt/[email protected]
from FILE:/tmp/krb5cc_1001 with result: 0/Success
[345] 1747909692.415515: Starting with TGT for client realm:
hiveuser/[email protected] -> krbtgt/[email protected]
[345] 1747909692.415516: Requesting tickets for
hive/[email protected], referrals on
[345] 1747909692.415517: Generated subkey for TGS request: aes256-cts/6798
[345] 1747909692.415518: etypes requested in TGS request: aes256-cts
[345] 1747909692.415520: Encoding request body and padata into FAST request
[345] 1747909692.415521: Sending request (1080 bytes) to CLUSTER1.COM
[345] 1747909692.415522: Resolving hostname cluster1-kerberos-server
[345] 1747909692.415523: Sending initial UDP request to dgram 192.168.0.5:88
[345] 1747909692.415524: Received answer (468 bytes) from dgram
192.168.0.5:88
[345] 1747909692.415525: Response was not from primary KDC
[345] 1747909692.415526: Decoding FAST response
[345] 1747909692.415527: TGS request result: -1765328377/Server
hive/[email protected] not found in Kerberos database
[345] 1747909692.415528: Requesting tickets for
hive/[email protected], referrals off
[345] 1747909692.415529: Generated subkey for TGS request: aes256-cts/5F8A
[345] 1747909692.415530: etypes requested in TGS request: aes256-cts
[345] 1747909692.415532: Encoding request body and padata into FAST request
[345] 1747909692.415533: Sending request (1080 bytes) to CLUSTER1.COM
[345] 1747909692.415534: Resolving hostname cluster1-kerberos-server
[345] 1747909692.415535: Sending initial UDP request to dgram 192.168.0.5:88
[345] 1747909692.415536: Received answer (468 bytes) from dgram
192.168.0.5:88
[345] 1747909692.415537: Response was not from primary KDC
[345] 1747909692.415538: Decoding FAST response
[345] 1747909692.415539: TGS request result: -1765328377/Server
hive/[email protected] not found in Kerberos database
```
To me this line stands out:
```
[345] 1747909692.415511: Getting credentials
hiveuser/[email protected] ->
hive/[email protected] using ccache FILE:/tmp/krb5cc_1001
```
It was not clear to me why the "hiveuser" prefix in the principal was being
remapped to "hive", or where that remapping was coming from. At first I
thought it might be something in my krb5.conf (or perhaps something that
should be there but isn't). But the fact that this works fine with Java
Iceberg makes me question that.
In an effort to explain the above, I looked in the pyiceberg code and found
this line in pyiceberg/catalog/hive.py:
```
return TTransport.TSaslClientTransport(socket, host=url_parts.hostname, service="hive")
```
When I speculatively changed that service="hive" part to service="hiveuser"
in that code and re-ran the script it then worked as expected:
```
[350] 1747910147.748592: ccselect module realm chose cache
FILE:/tmp/krb5cc_1001 with client principal
hiveuser/[email protected] for server principal
hiveuser/[email protected]
[350] 1747910147.748593: Getting credentials
hiveuser/[email protected] ->
hiveuser/[email protected] using ccache FILE:/tmp/krb5cc_1001
[350] 1747910147.748594: Retrieving
hiveuser/[email protected] ->
krb5_ccache_conf_data/start_realm@X-CACHECONF: from FILE:/tmp/krb5cc_1001 with
result: -1765328243/Matching credential not found (filename: /tmp/krb5cc_1001)
[350] 1747910147.748595: Retrieving
hiveuser/[email protected] ->
hiveuser/[email protected] from FILE:/tmp/krb5cc_1001 with
result: 0/Success
[350] 1747910147.748596: Creating authenticator for
hiveuser/[email protected] ->
hiveuser/[email protected], seqnum 821973613, subkey
aes256-cts/C292, session key aes256-cts/8A76
[350] 1747910147.748598: Read AP-REP, time 1747910147.748597, subkey (null),
seqnum 214032946
Initial Namespaces: [('default',)]
```
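For what it's worth, the two traces are consistent with the client deriving the requested server principal from the `service` argument plus the metastore host, rather than from my own client principal. A minimal sketch of that naming convention follows. `expected_server_principal` is a hypothetical helper written only for illustration (it is not part of pyiceberg, thrift, or puresasl), and the host and realm values are placeholders, not my real ones.

```python
def expected_server_principal(service: str, host: str, realm: str) -> str:
    """Format a Kerberos service principal as service/host@REALM.

    Hypothetical helper for illustration only: it mirrors the usual
    host-based service naming convention and calls no Kerberos API.
    """
    return f"{service}/{host}@{realm}"


# Placeholder values, not the real host or realm from my cluster.
host = "metastore.example.com"
realm = "EXAMPLE.COM"

# With the hard-coded service name, the client asks for a "hive/..." principal.
hardcoded = expected_server_principal("hive", host, realm)
# With my local change, it asks for the principal that exists in my KDC.
patched = expected_server_principal("hiveuser", host, realm)

print(hardcoded)
print(patched)
```

If that reading is right, the "remapping" I saw is not a remapping at all: the server principal is simply built independently of whichever principal I ran kinit with.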
Obviously this band-aid is very specific to my situation, but the fact that
it worked makes me wonder whether that hard-coded "hive" service name should
be a parameter, auto-detected, or otherwise not hard-coded.
My questions are:
* Is there something I'm missing in my usage of pyiceberg that would let me
avoid this problem without resorting to this band-aid?
* If not, does pyiceberg/catalog/hive.py need an enhancement to make the
hard-coded "hive" service name configurable?
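To make the second question concrete, here is a rough sketch of the kind of change I have in mind. It assumes a new catalog property, which I'm calling `hive.kerberos-service-name` purely as a placeholder. Both that property name and the `resolve_kerberos_service_name` helper are names I made up for this sketch. The default stays "hive" so existing setups would behave as before.

```python
from typing import Mapping

# Hypothetical property name, used here only for illustration.
HIVE_KERBEROS_SERVICE_NAME = "hive.kerberos-service-name"
# Matches the value that is hard-coded in the line quoted from hive.py.
DEFAULT_KERBEROS_SERVICE_NAME = "hive"


def resolve_kerberos_service_name(properties: Mapping[str, str]) -> str:
    """Return the SASL service name, falling back to "hive" when unset or blank."""
    value = properties.get(HIVE_KERBEROS_SERVICE_NAME, "").strip()
    return value or DEFAULT_KERBEROS_SERVICE_NAME


# The same properties as in my script, plus the hypothetical override.
properties = {
    "type": "hive",
    "uri": "thrift://cluster1-hive-server:9083",
    "hive.kerberos-authentication": "true",
    HIVE_KERBEROS_SERVICE_NAME: "hiveuser",
}

service = resolve_kerberos_service_name(properties)
print(service)
```

The resolved value would then be passed as `service=` to `TTransport.TSaslClientTransport` in place of the literal "hive".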
### Willingness to contribute
- [x] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]