Thanks Tom & John! modifying spark-env.sh did the trick - my last line in the file is now:
export SPARK_DIST_CLASSPATH=$(paste -sd: "$SELF/classpath.txt"):`hbase classpath`:/etc/hbase/conf:/etc/hbase/conf/hbase-site.xml Now o.a.s.d.y.Client logs “Added HBase security token to credentials” and the .count() on my HBase RDD works fine. From: Ellis, Tom (Financial Markets IT) [mailto:tom.el...@lloydsbanking.com] Sent: 19 May 2016 09:51 To: 'John Trengrove'; Meyerhoefer, Philipp (TR Technology & Ops) Cc: user Subject: RE: HBase / Spark Kerberos problem Yeah we ran into this issue. Key part is to have the hbase jars and hbase-site.xml config on the classpath of the spark submitter. We did it slightly differently from Y Bodnar, where we set the required jars and config on the env var SPARK_DIST_CLASSPATH in our spark env file (rather than SPARK_CLASSPATH which is deprecated). With this and –principal/--keytab, if you turn DEBUG logging for org.apache.spark.deploy.yarn you should see “Added HBase security token to credentials.” Otherwise you should at least hopefully see the error where it fails to add the HBase tokens. Check out the source of Client [1] and YarnSparkHadoopUtil [2] – you’ll see how obtainTokenForHBase is being done. It’s a bit confusing as to why it says you haven’t kinited even when you do loginUserFromKeytab – I haven’t quite worked through the reason for that yet. Cheers, Tom Ellis telli...@gmail.com [1] https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala [2] https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala From: John Trengrove [mailto:john.trengr...@servian.com.au] Sent: 19 May 2016 08:09 To: philipp.meyerhoe...@thomsonreuters.com Cc: user Subject: Re: HBase / Spark Kerberos problem -- This email has reached the Bank via an external source -- Have you had a look at this issue? https://issues.apache.org/jira/browse/SPARK-12279 There is a comment by Y Bodnar on how they successfully got Kerberos and HBase working. 2016-05-18 18:13 GMT+10:00 <philipp.meyerhoe...@thomsonreuters.com>: Hi all, I have been puzzling over a Kerberos problem for a while now and wondered if anyone can help. For spark-submit, I specify --keytab x --principal y, which creates my SparkContext fine. Connections to Zookeeper Quorum to find the HBase master work well too. But when it comes to a .count() action on the RDD, I am always presented with the stack trace at the end of this mail. We are using CDH5.5.2 (spark 1.5.0), and com.cloudera.spark.hbase.HBaseContext is a wrapper around TableInputFormat/hadoopRDD (see https://github.com/cloudera-labs/SparkOnHBase), as you can see in the stack trace. Am I doing something obvious wrong here? A similar flow, inside test code, works well, only going via spark-submit exposes this issue. Code snippet (I have tried using the commented-out lines in various combinations, without success): val conf = new SparkConf(). set("spark.shuffle.consolidateFiles", "true"). set("spark.kryo.registrationRequired", "false"). set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). set("spark.kryoserializer.buffer", "30m") val sc = new SparkContext(conf) val cfg = sc.hadoopConfiguration // cfg.addResource(new org.apache.hadoop.fs.Path("/etc/hbase/conf/hbase-site.xml")) // UserGroupInformation.getCurrentUser.setAuthenticationMethod(UserGroupInformation.AuthenticationMethod.KERBEROS) // cfg.set("hbase.security.authentication", "kerberos") val hc = new HBaseContext(sc, cfg) val scan = new Scan scan.setTimeRange(startMillis, endMillis) val matchesInRange = hc.hbaseRDD(MY_TABLE, scan, resultToMatch) val cnt = matchesInRange.count() log.info(s"matches in range $cnt") Stack trace / log: 16/05/17 17:04:47 INFO SparkContext: Starting job: count at Analysis.scala:93 16/05/17 17:04:47 INFO DAGScheduler: Got job 0 (count at Analysis.scala:93) with 1 output partitions 16/05/17 17:04:47 INFO DAGScheduler: Final stage: ResultStage 0(count at Analysis.scala:93) 16/05/17 17:04:47 INFO DAGScheduler: Parents of final stage: List() 16/05/17 17:04:47 INFO DAGScheduler: Missing parents: List() 16/05/17 17:04:47 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at HBaseContext.scala:580), which has no missing parents 16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(3248) called with curMem=428022, maxMem=244187136 16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.2 KB, free 232.5 MB) 16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(2022) called with curMem=431270, maxMem=244187136 16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2022.0 B, free 232.5 MB) 16/05/17 17:04:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.6.164.40:33563 (size: 2022.0 B, free: 232.8 MB) 16/05/17 17:04:47 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:861 16/05/17 17:04:47 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at HBaseContext.scala:580) 16/05/17 17:04:47 INFO YarnScheduler: Adding task set 0.0 with 1 tasks 16/05/17 17:04:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, hpg-dev-vm, partition 0,PROCESS_LOCAL, 2208 bytes) 16/05/17 17:04:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on hpg-dev-vm:52698 (size: 2022.0 B, free: 388.4 MB) 16/05/17 17:04:48 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on hpg-dev-vm:52698 (size: 26.0 KB, free: 388.4 MB) 16/05/17 17:04:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hpg-dev-vm): org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the location at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:308) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:155) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:63) at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200) at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:314) at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:289) at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:161) at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:156) at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:888) at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.restart(TableRecordReaderImpl.java:90) at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.initialize(TableRecordReaderImpl.java:167) at org.apache.hadoop.hbase.mapreduce.TableRecordReader.initialize(TableRecordReader.java:138) at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase$1.initialize(TableInputFormatBase.java:200) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:153) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:124) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Could not set up IO Streams to hpg-dev-vm /127.0.0.1:60020 at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:773) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:890) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:859) at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1193) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:32627) at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1583) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1293) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1125) at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:299) ... 26 more Caused by: java.lang.RuntimeException: SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'. at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$1.run(RpcClientImpl.java:673) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.handleSaslConnectionFailure(RpcClientImpl.java:631) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:739) ... 36 more Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:605) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:154) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:731) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:728) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:728) ... 36 more Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt) at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147) at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122) at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187) at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224) at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212) at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179) at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192) ... 45 more -- Philipp Meyerhoefer Thomson Reuters philipp.meyerhoe...@tr.com ________________________________ This e-mail is for the sole use of the intended recipient and contains information that may be privileged and/or confidential. If you are not an intended recipient, please notify the sender by return e-mail and delete this e-mail and any attachments. Certain required legal entity disclosures can be accessed on our website.<http://site.thomsonreuters.com/site/disclosures/> Lloyds Banking Group plc. Registered Office: The Mound, Edinburgh EH1 1YZ. Registered in Scotland no. SC95000. Telephone: 0131 225 4555. Lloyds Bank plc. Registered Office: 25 Gresham Street, London EC2V 7HN. Registered in England and Wales no. 2065. Telephone 0207626 1500. Bank of Scotland plc. Registered Office: The Mound, Edinburgh EH1 1YZ. Registered in Scotland no. SC327000. Telephone: 03457 801 801. Cheltenham & Gloucester plc. Registered Office: Barnett Way, Gloucester GL4 3RL. Registered in England and Wales 2299428. Telephone: 0345 603 1637 Lloyds Bank plc, Bank of Scotland plc are authorised by the Prudential Regulation Authority and regulated by the Financial Conduct Authority and Prudential Regulation Authority. Cheltenham & Gloucester plc is authorised and regulated by the Financial Conduct Authority. Halifax is a division of Bank of Scotland plc. Cheltenham & Gloucester Savings is a division of Lloyds Bank plc. HBOS plc. Registered Office: The Mound, Edinburgh EH1 1YZ. Registered in Scotland no. SC218813. This e-mail (including any attachments) is private and confidential and may contain privileged material. If you have received this e-mail in error, please notify the sender and delete it (including any attachments) immediately. You must not copy, distribute, disclose or use any of the information in it or any attachments. Telephone calls may be monitored or recorded.