A bug in spark or hadoop RPC with kerberos authentication?

2017-08-22 Thread Sun, Keith
Hello,

I ran into a very weird issue that is easy to reproduce but kept me stuck for more
than a day. I suspect this may be a bug related to the class loader.
Can you help confirm the root cause?

I want to specify a customized Hadoop configuration set instead of the one on the
classpath (we have a few Hadoop clusters, all with Kerberos security, and I want to
support different configurations).
Code and error output are below.


The workaround I found is to place a core-site.xml with the two properties below on
the classpath.
After checking the RPC code under org.apache.hadoop.ipc.RPC, I suspect the RPC code
may not see the UGI class in the same classloader, so UGI is initialized with the
default values from the classpath, which means simple authentication.

core-site.xml with the security setup on the classpath:

<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
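For reference, a minimal sketch of what I believe the same workaround amounts to when
done programmatically instead of via a classpath core-site.xml (the principal and
keytab path here are placeholders, not real values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosConfSketch {
    public static void main(String[] args) throws Exception {
        // Start from an empty Configuration so nothing is picked up from the classpath.
        Configuration hc = new Configuration(false);

        // The two properties that the workaround core-site.xml provides.
        hc.set("hadoop.security.authentication", "kerberos");
        hc.set("hadoop.security.authorization", "true");

        // UGI must see this Configuration before any RPC client is created,
        // otherwise it falls back to SIMPLE authentication.
        UserGroupInformation.setConfiguration(hc);
        UserGroupInformation.loginUserFromKeytab("principal@REALM", "/path/to/user.keytab");
    }
}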

---error---
2673 [main] DEBUG 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil  - 
DataTransferProtocol using SaslPropertiesResolver, configured QOP 
dfs.data.transfer.protection = privacy, configured class 
dfs.data.transfer.saslproperties.resolver.class = class 
org.apache.hadoop.security.WhitelistBasedResolver
2696 [main] DEBUG org.apache.hadoop.service.AbstractService  - Service: 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl entered state INITED
2744 [main] DEBUG org.apache.hadoop.security.UserGroupInformation  - 
PrivilegedAction as:x@xxxCOM (auth:KERBEROS) 
from:org.apache.hadoop.yarn.client.RMProxy.getProxy(RMProxy.java:136) //
2746 [main] DEBUG org.apache.hadoop.yarn.ipc.YarnRPC  - Creating YarnRPC for 
org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC
2746 [main] DEBUG org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC  - Creating a 
HadoopYarnProtoRpc proxy for protocol interface 
org.apache.hadoop.yarn.api.ApplicationClientProtocol
2801 [main] DEBUG org.apache.hadoop.ipc.Client  - getting client out of cache: 
org.apache.hadoop.ipc.Client@748fe51d
2981 [main] DEBUG org.apache.hadoop.service.AbstractService  - Service 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl is started
3004 [main] DEBUG org.apache.hadoop.ipc.Client  - The ping interval is 6 ms.
3005 [main] DEBUG org.apache.hadoop.ipc.Client  - Connecting to 
yarn-rm-1/x:8032
3019 [IPC Client (2012095985) connection to yarn-rm-1/x:8032 from 
x@xx] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (2012095985) 
connection to yarn-rm-1/x:8032 from x@xx: starting, having 
connections 1
3020 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client  - 
IPC Client (2012095985) connection to yarn-rm-1/x:8032 from x@xx 
sending #0
3025 [IPC Client (2012095985) connection to yarn-rm-1/x:8032 from 
x@xx] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (2012095985) 
connection to yarn-rm-1/x:8032 from x@xx got value #-1
3026 [IPC Client (2012095985) connection to yarn-rm-1/x:8032 from 
x@xx] DEBUG org.apache.hadoop.ipc.Client  - closing ipc connection to 
yarn-rm-1/x:8032: SIMPLE authentication is not enabled.  Available:[TOKEN, 
KERBEROS]
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
 SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1131)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:979)
---code---

Configuration hc = new Configuration(false);

hc.addResource("myconf/yarn-site.xml");
hc.addResource("myconf/core-site.xml");
hc.addResource("myconf/hdfs-site.xml");
hc.addResource("myconf/hive-site.xml");

SparkConf sc = new SparkConf(true);
// Add the config to the Spark conf, as there is no xml on the classpath
// except the *-default.xml files from the Hadoop jars.
hc.forEach(entry -> {
    if (entry.getKey().startsWith("hive")) {
        sc.set(entry.getKey(), entry.getValue());
    } else {
        sc.set("spark.hadoop." + entry.getKey(), entry.getValue());
    }
});

UserGroupInformation.setConfiguration(hc);
UserGroupInformation.loginUserFromKeytab(Principal, Keytab);

System.out.println("spark-conf##");
System.out.println(sc.toDebugString());

SparkSession sparkSession = SparkSession
    .builder()
    .master("yarn-client") // "yarn-client", "local"
    .config(sc)
    .appName(SparkEAZDebug.class.getName())
    .enableHiveSupport()
    .getOrCreate();

Thanks very much.
Keith



RE: A bug in spark or hadoop RPC with kerberos authentication?

2017-08-23 Thread Sun, Keith
I finally found the root cause and raised a bug issue at
https://issues.apache.org/jira/browse/SPARK-21819



Thanks very much.
Keith

RE: A bug in spark or hadoop RPC with kerberos authentication?

2017-08-23 Thread Sun, Keith
Thanks for the reply. I filed an issue in JIRA:
https://issues.apache.org/jira/browse/SPARK-21819

I submitted the job from the Java API, not via the spark-submit command line, as we
want to run Spark processing as a service.

Configuration hc = new Configuration(false);
String yarnxml = String.format("%s/%s", ConfigLocation, "yarn-site.xml");
String corexml = String.format("%s/%s", ConfigLocation, "core-site.xml");
String hdfsxml = String.format("%s/%s", ConfigLocation, "hdfs-site.xml");
String hivexml = String.format("%s/%s", ConfigLocation, "hive-site.xml");

hc.addResource(yarnxml);
hc.addResource(corexml);
hc.addResource(hdfsxml);
hc.addResource(hivexml);

// Manually set all the Hadoop config in the SparkConf.
SparkConf sc = new SparkConf(true);
hc.forEach(entry -> {
    if (entry.getKey().startsWith("hive")) {
        sc.set(entry.getKey(), entry.getValue());
    } else {
        sc.set("spark.hadoop." + entry.getKey(), entry.getValue());
    }
});

UserGroupInformation.setConfiguration(hc);
UserGroupInformation.loginUserFromKeytab(Principal, Keytab);

SparkSession sparkSession = SparkSession
    .builder()
    .master("yarn-client") // "yarn-client", "local"
    .config(sc)
    .appName(SparkEAZDebug.class.getName())
    .enableHiveSupport()
    .getOrCreate();


Thanks very much.
Keith

From: 周康 [mailto:zhoukang199...@gmail.com]
Sent: August 22, 2017 20:22
To: Sun, Keith 
Cc: user@spark.apache.org
Subject: Re: A bug in spark or hadoop RPC with kerberos authentication?

You can check out the Hadoop credential-related classes in the Spark YARN module.
During spark-submit, it will use the config on the classpath.
I wonder, how do you reference your own config?


How to find the temporary views' DDL

2017-10-01 Thread Sun, Keith
Hello,

Is there a way to find the DDL of a "temporary" view created in the current session
with Spark SQL?

For example:
create or replace temporary view tmp_v as
select c1 from table_x;

"SHOW CREATE TABLE" does not work for this case, as it is not a table.
"DESCRIBE" shows the columns but not the DDL.
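(As a stopgap, a minimal sketch of the kind of workaround we could use, assuming the
session does not keep the original SQL text around: record the DDL ourselves when
creating the view. The helper class and map here are purely illustrative.)

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.SparkSession;

public class TempViewDdlTracker {
    // Remembers the DDL text of each temporary view created through this helper.
    private final Map<String, String> ddlByView = new HashMap<>();
    private final SparkSession spark;

    public TempViewDdlTracker(SparkSession spark) {
        this.spark = spark;
    }

    // Run the CREATE ... TEMPORARY VIEW statement and keep its text for later lookup.
    public void createTempView(String viewName, String ddl) {
        spark.sql(ddl);
        ddlByView.put(viewName, ddl);
    }

    public String getDdl(String viewName) {
        return ddlByView.get(viewName);
    }
}

Usage:
tracker.createTempView("tmp_v",
    "create or replace temporary view tmp_v as select c1 from table_x");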


Thanks very much.
Keith

From: Anastasios Zouzias [mailto:zouz...@gmail.com]
Sent: Sunday, October 1, 2017 3:05 PM
To: Kanagha Kumar 
Cc: user @spark 
Subject: Re: Error - Spark reading from HDFS via dataframes - Java

Hi,

Set the inferSchema option to true in spark-csv. You may also want to set the mode
option; see the README below.

https://github.com/databricks/spark-csv/blob/master/README.md
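A minimal sketch of that suggestion, assuming Spark 2.x's built-in CSV reader (which
accepts the same option names as the spark-csv package); the path is just the
placeholder from the original question:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvInferSchemaSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("csv-infer-schema").getOrCreate();

        Dataset<Row> df = spark.read()
                .option("header", "false")
                .option("inferSchema", "true")   // let Spark sample the file and pick column types
                .option("mode", "DROPMALFORMED") // drop rows that do not match the parsed schema
                .csv("hdfs:/inputpath/*");       // placeholder path

        df.printSchema();
    }
}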

Best,
Anastasios

On 01.10.2017 07:58, "Kanagha Kumar" <kpra...@salesforce.com> wrote:
Hi,

I'm trying to read data from HDFS in Spark as dataframes. Printing the schema, I see
all columns are being read as strings. I'm converting it to RDDs and creating another
dataframe by passing in the correct schema (how the rows should finally be
interpreted).

I'm getting the following error:

Caused by: java.lang.RuntimeException: java.lang.String is not a valid external 
type for schema of bigint


Spark read API:

Dataset<Row> hdfs_dataset = new SQLContext(spark).read().option("header",
"false").csv("hdfs:/inputpath/*");

Dataset<Row> ds = new
SQLContext(spark).createDataFrame(hdfs_dataset.toJavaRDD(), conversionSchema);
This is the schema to be converted to:
StructType(StructField(COL1,StringType,true),
StructField(COL2,StringType,true),
StructField(COL3,LongType,true),
StructField(COL4,StringType,true),
StructField(COL5,StringType,true),
StructField(COL6,LongType,true))

This is the original schema obtained once read API was invoked
StructType(StructField(_c1,StringType,true),
StructField(_c2,StringType,true),
StructField(_c3,StringType,true),
StructField(_c4,StringType,true),
StructField(_c5,StringType,true),
StructField(_c6,StringType,true))

My interpretation is that even when the JavaRDD is converted to a dataframe by passing
in the new schema, the values are not getting type cast.
This occurs because the read API above reads the data from HDFS as string types.

How can I convert an RDD to a dataframe by passing in the correct schema once it is read?
How can the values be type cast correctly during this RDD-to-dataframe conversion?

Or how can I read data from HDFS with an input schema in Java?
Any suggestions are helpful. Thanks!
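A minimal sketch of the last option, assuming the built-in CSV reader and the target
schema from the question (column names and the path are placeholders); supplying the
schema up front avoids the string-typed intermediate dataframe entirely:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CsvExplicitSchemaSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("csv-explicit-schema").getOrCreate();

        // Target schema from the question: four string columns and two long columns.
        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("COL1", DataTypes.StringType, true),
                DataTypes.createStructField("COL2", DataTypes.StringType, true),
                DataTypes.createStructField("COL3", DataTypes.LongType, true),
                DataTypes.createStructField("COL4", DataTypes.StringType, true),
                DataTypes.createStructField("COL5", DataTypes.StringType, true),
                DataTypes.createStructField("COL6", DataTypes.LongType, true)
        });

        // With an explicit schema, Spark parses each column to the declared type while reading.
        Dataset<Row> df = spark.read()
                .option("header", "false")
                .schema(schema)
                .csv("hdfs:/inputpath/*"); // placeholder path from the original question

        df.printSchema();
    }
}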




RE: how to use cluster sparkSession like localSession

2018-11-04 Thread Sun, Keith
Hello,

I think you can try the below; the reason is that only yarn-client mode is supported
for your scenario.

master("yarn-client")
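A minimal sketch of what that looks like in a plain Java program (assuming
HADOOP_CONF_DIR/YARN_CONF_DIR point at the cluster configuration and the Spark jars
are reachable from the driver, e.g. via spark.yarn.jars; those settings are
assumptions, not something this thread confirms):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class YarnClientSketch {
    public static void main(String[] args) {
        // yarn-client keeps the driver in this JVM, so no application jar has to be spark-submitted.
        SparkSession spark = SparkSession
                .builder()
                .master("yarn-client")
                .appName("spark test")
                .getOrCreate();

        Dataset<Row> testData = spark.read().csv("hdfs:/some/input/path"); // placeholder path
        testData.printSchema();
        testData.show();
    }
}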



Thanks very much.
Keith
From: 张万新 
Sent: Thursday, November 1, 2018 11:36 PM
To: 崔苗(数据与人工智能产品开发部) <0049003...@znv.com>
Cc: user 
Subject: Re: how to use cluster sparkSession like localSession

I think you should investigate Apache Zeppelin and Livy.
On Friday, November 2, 2018 at 11:01, 崔苗(数据与人工智能产品开发部) <0049003...@znv.com> wrote:

Hi,
We want to execute Spark code without submitting an application.jar, like this code:

public static void main(String args[]) throws Exception {
    SparkSession spark = SparkSession
            .builder()
            .master("local[*]")
            .appName("spark test")
            .getOrCreate();

    Dataset<Row> testData =
            spark.read().csv(".\\src\\main\\java\\Resources\\no_schema_iris.scv");
    testData.printSchema();
    testData.show();
}

The above code works well in IDEA; we do not need to generate a jar file and submit
it. But if we replace master("local[*]") with master("yarn"), it does not work. So is
there a way to use a cluster SparkSession like a local SparkSession? We need to
dynamically execute Spark code in a web server according to different requests, e.g.
a filter request will call dataset.filter(), so there is no application.jar to submit.
