[ 
https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29018:
------------------------------
    Description: 
SPIP: Build Spark thrift server based on thrift protocol v11
h2. Background

    As Spark and Hive have evolved, the current sql/hive-thriftserver module 
requires a lot of work to resolve code conflicts across the different built-in 
Hive versions. This is annoying, never-ending work, and it has limited our 
ability to conveniently develop new features for Spark’s thrift server.

    We propose to implement a new thrift server and JDBC driver based on the 
latest v11 of Hive’s TCLIService.thrift protocol. The new thrift server will 
have the features below:
 # Build a new module, spark-service, as Spark’s thrift server
 # Avoid the heavy reflection and inherited code of the current `hive-thriftserver` module
 # Support all functions that the current `sql/hive-thriftserver` supports
 # Keep all code maintained by Spark itself, with no dependency on Hive
 # Support the existing functions in Spark’s own way, no longer limited by Hive’s code
 # Support running with or without a Hive metastore
 # Support user impersonation for multi-tenant use by splitting Hive authentication from DFS authentication
 # Support session hooks with Spark’s own code
 # Add a new JDBC driver, spark-jdbc, with Spark’s own connection URL `jdbc:spark://<host>:<port>/<db>`
 # Support both the hive-jdbc and spark-jdbc clients, so we can cover most clients and BI platforms

 
h2. How to start?

     We can start the new thrift server with *sbin/start-spark-thriftserver.sh* 
and stop it with *sbin/stop-spark-thriftserver.sh*. It does not need HiveConf’s 
configurations to determine the behavior of the Spark thrift server: all needed 
configuration is implemented by Spark itself in 
`org.apache.spark.sql.service.internal.ServiceConf`, and hive-site.xml is used 
only to connect to the Hive metastore. All needed confs can be written in 
*conf/spark-defaults.conf* or passed at startup with *--conf*, as in the sketch 
below.
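For illustration, a hedged sketch of possible *conf/spark-defaults.conf* entries. Apart from `spark.sql.thriftserver.proxy.user`, which appears later in this proposal, the keys here are hypothetical and only show how ServiceConf-style confs would be set:
{code}
# conf/spark-defaults.conf -- hypothetical ServiceConf keys, illustration only;
# the real key list would live in org.apache.spark.sql.service.internal.ServiceConf
spark.sql.service.port              10000
spark.sql.service.authentication    KERBEROS
{code}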
h2. How to connect through jdbc?

   Now we support both hive-jdbc and spark-jdbc; users can choose whichever 
they prefer.

 
h3. spark-jdbc

 
 # Use `SparkDriver` as the JDBC driver class
 # Connection URL 
`jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`,
 mostly the same as Hive’s but with Spark’s own URL prefix `jdbc:spark` (see 
the sketch after this list)
 # For proxying, SparkDriver users should set the proxy conf 
`spark.sql.thriftserver.proxy.user=username`
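A minimal Scala sketch, assuming a server on localhost:10000; this SPIP fixes only the short class name `SparkDriver`, so the fully qualified name below is illustrative:
{code:scala}
import java.sql.DriverManager

object SparkJdbcExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical fully qualified name; the proposal only names `SparkDriver`.
    Class.forName("org.apache.spark.sql.jdbc.SparkDriver")

    // Spark's own URL prefix; host, port, and db are placeholders.
    val conn = DriverManager.getConnection("jdbc:spark://localhost:10000/default", "user", "")
    try {
      val rs = conn.createStatement().executeQuery("SELECT 1")
      while (rs.next()) println(rs.getInt(1))
    } finally {
      conn.close()
    }
  }
}
{code}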




h3. hive-jdbc

 
 # Use `HiveDriver` as the JDBC driver class
 # Connection URL 
`jdbc:hive2://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`,
 as before
 # For proxying, HiveDriver users should set the proxy conf 
`hive.server2.proxy.user=username`; the current server supports both proxy 
configs (see the sketch after this list)
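A REPL-style Scala sketch of what differs from the spark-jdbc example above, using the stock `org.apache.hive.jdbc.HiveDriver` class from the hive-jdbc artifact (host, port, and user are placeholders):
{code:scala}
import java.sql.DriverManager

// Stock Hive JDBC driver class, unchanged from plain HiveServer2 usage.
Class.forName("org.apache.hive.jdbc.HiveDriver")

// Proxy/impersonation via the session var list; per this proposal the server
// also accepts spark.sql.thriftserver.proxy.user here.
val url = "jdbc:hive2://localhost:10000/default;hive.server2.proxy.user=etl_user"
val conn = DriverManager.getConnection(url, "admin", "")
{code}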

 
h2. How is it done today, and what are the limits of current practice?
h3. Current practice

We have completed the two modules, `spark-service` and `spark-jdbc`. They run 
well; we have ported the original UTs to both modules, and all of them pass. 
For impersonation, we have written the code and tested it in our kerberized 
environment; it works well and is waiting for review. We will now raise PRs 
against the apache/spark master branch step by step.
h3. Here are some known changes:

 # No Hive code is used in the `spark-service` and `spark-jdbc` modules
 # In the current service, the default rc file suffix `.hiverc` is replaced by 
`.sparkrc`
 # When using SparkDriver as the JDBC driver class, the URL should be 
`jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`
 # When using SparkDriver as the JDBC driver class, the proxy conf should be 
`spark.sql.thriftserver.proxy.user=proxy_user_name`
 # `hiveconf` and `hivevar` session confs are supported through hive-jdbc 
connections (see the URL sketch after this list)
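To illustrate the last item, a hedged REPL-style sketch of how hive-jdbc’s standard URL segments would carry these settings (host, port, and the particular conf/var names are placeholders):
{code:scala}
// In the standard hive-jdbc URL layout, the segment after '?' carries
// hiveconf-style session confs and the segment after '#' carries
// hivevar-style substitution variables.
val url = "jdbc:hive2://localhost:10000/default" +
  "?hive.exec.parallel=true" +   // session conf (as with --hiveconf)
  "#year=2019"                   // substitution variable (as with --hivevar)
{code}
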
h2. What are the risks?

    This is a totally new module, and it won’t change other modules’ code 
except to support impersonation. Apart from impersonation, we have added a lot 
of UTs adapted from the original ones (to fit the grammar without Hive), and 
all of them pass. Impersonation has been tested in our kerberized environment, 
but it still needs a detailed review since it changes a lot.

 
h2. How long will it take?

       We have finished all of this work in our own repo; we now plan to merge 
the code into master step by step.
 # Phase 1: PR that builds the new module *spark-service* in the folder *sql/service*
 # Phase 2: PR with the thrift protocol and the generated thrift protocol Java code
 # Phase 3: PR with all *spark-service* module code, a description of the design, and unit tests
 # Phase 4: PR that builds the new module *spark-jdbc* in the folder *sql/jdbc*
 # Phase 5: PR with all *spark-jdbc* module code and unit tests
 # Phase 6: PR that supports thrift server impersonation
 # Phase 7: PR that builds Spark's own beeline client, *spark-beeline*
 # Phase 8: PR with Spark's own CLI client code to support the *Spark SQL CLI*, 
in a module named *spark-cli*

 
h3. Appendix A. Proposed API Changes. Optional section defining API changes, 
if any. Backward and forward compatibility must be taken into account.

Compared to the current `sql/hive-thriftserver`, the corresponding API changes 
are as follows:

 
 # Add a new class, org.apache.spark.sql.service.internal.ServiceConf, 
containing all needed configuration for the Spark thrift server
 # ServiceSessionXxx replaces the original HiveSessionXxx
 # In ServiceSessionImpl, remove code Spark won’t use
 # In ServiceSessionImpl, set the session conf directly on sqlConf, as in 
[https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L67-L69]
 # Remove SparkSQLSessionManager and move its logic into SessionManager
 # Move all OperationManager logic into SparkSQLOperationManager and rename it 
to OperationManager
 # Add the SQLContext to ServiceSessionImpl as a member variable; don’t pass it 
through SparkSQLOperationManager, just get it with parentSession.getSqlContext(); 
the session conf is set on this sqlContext’s sqlConf (see the sketch after this 
list)
 # Remove HiveServer2, since we don’t need its logic
 # Remove the Hive impersonation logic, since it won’t be useful in the Spark 
thrift server, and remove the delegationTokenStr parameter in 
ServiceSessionImplWithUGI 
([https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIService.java#L352-L353]);
 we will use a new way for Spark’s impersonation
 # Remove ThriftserverShimUtils, since we don’t need it
 # Remove SparkSQLCLIService and just use CLIService
 # Remove ReflectionUtils and ReflectedCompositeService, since we don’t need 
inheritance and reflection
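For the SQLContext change, a minimal Scala sketch of the intended shape; method bodies and any name not listed above are illustrative, not the final API:
{code:scala}
import org.apache.spark.sql.SQLContext

// Illustrative only: the session owns its SQLContext, and operations fetch it
// from the parent session instead of having the manager thread it through.
class ServiceSessionImpl(private val sqlContext: SQLContext) {
  // Session confs are applied directly to sqlContext.conf at open time.
  def getSqlContext(): SQLContext = sqlContext
}

class OperationManager {
  def newExecuteStatementOperation(parentSession: ServiceSessionImpl,
                                   statement: String) = {
    val ctx = parentSession.getSqlContext()  // no SparkSQLOperationManager pass-through
    ctx.sql(statement)
  }
}
{code}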

> Build spark thrift server on its own code based on protocol v11
> ----------------------------------------------------------------
>
>                 Key: SPARK-29018
>                 URL: https://issues.apache.org/jira/browse/SPARK-29018
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: angerszhu
>            Priority: Major
>


