angerszhu resolved SPARK-29018.
-------------------------------
Resolution: Won't Fix

Build Spark Thrift Server on its own code based on protocol v11
----------------------------------------------------------------
                Key: SPARK-29018
                URL: https://issues.apache.org/jira/browse/SPARK-29018
            Project: Spark
         Issue Type: Umbrella
         Components: SQL
   Affects Versions: 3.0.0
           Reporter: angerszhu
           Priority: Major

h2. Background

With the development of both Spark and Hive, the current sql/hive-thriftserver module requires a lot of work to resolve code conflicts between the different built-in Hive versions. This is annoying, never-ending work under the current approach, and these issues have limited our ability and convenience to develop new features for Spark's thrift server.

We propose to implement a new thrift server and JDBC driver based on Hive's latest v11 TCLIService.thrift protocol. The new thrift server will have the following features:
# Build a new module spark-service as Spark's thrift server
# Avoid the heavy reflection and inherited code of the `hive-thriftserver` module
# Support all functions the current `sql/hive-thriftserver` supports
# Use code maintained entirely by Spark itself, with no dependency on Hive
# Support existing functionality in Spark's own way, no longer limited by Hive's code
# Support running with or without a Hive metastore
# Support user impersonation for multi-tenancy by splitting Hive authentication and DFS authentication
# Support session hooks with Spark's own code
# Add a new JDBC driver spark-jdbc, with Spark's own connection URL `jdbc:spark://<host>:<port>/<db>`
# Support both hive-jdbc and spark-jdbc clients, so we can support most clients and BI platforms

h2. How to start?

We can start the new thrift server with *sbin/start-spark-thriftserver.sh* and stop it with *sbin/stop-spark-thriftserver.sh*. HiveConf's configurations are no longer needed to determine the behaviour of the Spark thrift server: all needed configuration is implemented by Spark itself in `org.apache.spark.sql.service.internal.ServiceConf`, and hive-site.xml is only used to connect to the Hive metastore. All needed configuration can be written in *conf/spark-defaults.conf* or passed on the startup command with *--conf*.

h2. How to connect through JDBC?

We now support both hive-jdbc and spark-jdbc; users can choose either one. A hedged connection sketch in Scala follows the two lists below.

h3. spark-jdbc
# Use `SparkDriver` as the JDBC driver class
# Connection URL: `jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`, mostly the same as Hive's but with Spark's own URL prefix `jdbc:spark`
# For impersonation with SparkDriver, set the proxy conf `spark.sql.thriftserver.proxy.user=username`

h3. hive-jdbc
# Use `HiveDriver` as the JDBC driver class
# Connection URL: `jdbc:hive2://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`, unchanged from the original
# For impersonation with HiveDriver, set the proxy conf `hive.server2.proxy.user=username`; the server supports both configs
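For illustration only, here is a minimal Scala sketch of opening a connection with each driver against the proposed server. The host, port, database, user, and query are placeholders; the fully qualified class name of the proposed `SparkDriver` is not given in this proposal, and exactly where the proxy conf sits inside the spark-jdbc URL is also an assumption.

{code:scala}
import java.sql.DriverManager

object SparkThriftServerJdbcSketch {
  def main(args: Array[String]): Unit = {
    // spark-jdbc: proposed URL prefix "jdbc:spark". Putting the proxy conf in the
    // sess_var_list segment is an assumption; the proposal only names the conf key.
    val sparkJdbcUrl =
      "jdbc:spark://host1:10000/default;spark.sql.thriftserver.proxy.user=proxy_user"

    // hive-jdbc: the existing HiveDriver and URL prefix keep working against the server.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val hiveJdbcUrl =
      "jdbc:hive2://host1:10000/default;hive.server2.proxy.user=proxy_user"

    Seq(sparkJdbcUrl, hiveJdbcUrl).foreach { url =>
      val conn = DriverManager.getConnection(url, "user", "")
      try {
        // Run a trivial query to verify the session works.
        val rs = conn.createStatement().executeQuery("SELECT 1")
        while (rs.next()) println(s"$url -> ${rs.getInt(1)}")
      } finally {
        conn.close()
      }
    }
  }
}
{code}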
h2. How is it done today, and what are the limits of current practice?

h3. Current practice

We have completed the two modules `spark-service` and `spark-jdbc`. They run well, we have ported the original unit tests to both modules, and all of them pass. For impersonation, we have written the code and tested it in our Kerberized environment; it works well and is waiting for review. We will now raise PRs against the apache/spark master branch step by step.

h3. Here are some known changes:
# No Hive code is used in the `spark-service` and `spark-jdbc` modules
# In the new service, the default rc file suffix `.hiverc` is replaced by `.sparkrc`
# When using SparkDriver as the JDBC driver class, the URL should be `jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`
# When using SparkDriver as the JDBC driver class, the proxy conf should be `spark.sql.thriftserver.proxy.user=proxy_user_name`
# `hiveconf` and `hivevar` session conf are supported through hive-jdbc connections (a hedged sketch follows this list)
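To make the last item concrete, here is a small Scala sketch of a hive-jdbc URL broken into its three segments against the proposed server. The concrete keys and values are placeholders, and whether the new server applies `conf_list` entries exactly this way is an assumption based on the item above.

{code:scala}
import java.sql.DriverManager

object HiveJdbcSessionConfSketch {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Anatomy of the hive-jdbc URL (values are placeholders):
    //   ;sess_var_list  - session variables such as the proxy user
    //   ?conf_list      - hiveconf-style session confs, here a Spark SQL conf
    //   #var_list       - hivevar-style variables
    val url = "jdbc:hive2://host1:10000/default" +
      ";hive.server2.proxy.user=proxy_user" +
      "?spark.sql.shuffle.partitions=8" +
      "#region=us_east"

    val conn = DriverManager.getConnection(url, "user", "")
    try {
      val rs = conn.createStatement().executeQuery("SELECT 1")
      while (rs.next()) println(rs.getInt(1))
    } finally {
      conn.close()
    }
  }
}
{code}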
h2. What are the risks?

These are totally new modules that do not change other modules' code, except for impersonation support. Apart from impersonation, we have added many unit tests adapted from the original ones (with the grammar adjusted to work without Hive), and all of them pass. Impersonation has been tested in our Kerberized environment, but it still needs a detailed review since it changes a lot.

h2. How long will it take?

We have done all this work in our own repo; now we plan to merge the code into master step by step.
# Phase 1: PR to build the new module *spark-service* in folder *sql/service*
# Phase 2: PR with the thrift protocol and the generated thrift protocol Java code
# Phase 3: PR with all *spark-service* module code, a description of the design, and unit tests
# Phase 4: PR to build the new module *spark-jdbc* in folder *sql/jdbc*
# Phase 5: PR with all *spark-jdbc* module code and unit tests
# Phase 6: PR to support thrift server impersonation
# Phase 7: PR to build Spark's own beeline client *spark-beeline*
# Phase 8: PR with Spark's own CLI client code to support *Spark SQL CLI*, in a module named *spark-cli*

h3. Appendix A. Proposed API Changes

Optional section defining API changes, if any. Backward and forward compatibility must be taken into account. Compared to the current `sql/hive-thriftserver`, the corresponding API changes are as follows:
# Add a new class org.apache.spark.sql.service.internal.ServiceConf, containing all needed configuration for the Spark thrift server
# ServiceSessionXxx replaces the original HiveSessionXxx
# In ServiceSessionImpl, remove code Spark won't use
# In ServiceSessionImpl, set session conf directly on sqlConf, as in https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L67-L69 (see the sketch after this list)
# Remove SparkSQLSessionManager and move its logic into SessionManager
# Move all OperationManager logic into SparkSQLOperationManager and rename it to OperationManager
# Add SQLContext to ServiceSessionImpl as a member variable; don't pass it through SparkSQLOperationManager, just get it via parentSession.getSqlContext(); session conf is set on this sqlContext's conf (see the sketch after this list)
# Remove HiveServer2 since we don't need its logic
# Remove the logic around Hive impersonation since it won't be useful in the Spark thrift server, and remove the delegationTokenStr parameter in ServiceSessionImplWithUGI (https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIService.java#L352-L353); we will use a new way for Spark's impersonation
# Remove ThriftserverShimUtils, since we don't need it
# Remove SparkSQLCLIService and just use CLIService
# Remove ReflectionUtils and ReflectedCompositeService since we no longer need inheritance and reflection
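To make the ServiceSessionImpl items above concrete, here is a rough Scala sketch (not the actual implementation) of a session that owns its SQLContext, applies JDBC-supplied session confs directly to that context, and exposes it to operations via getSqlContext(). The class shapes and everything except the names taken from the proposal text are assumptions.

{code:scala}
import org.apache.spark.sql.SQLContext

// Hypothetical shape of the proposed ServiceSessionImpl: the session owns its
// SQLContext, session confs go straight into that context's conf, and operations
// reach the context through parentSession.getSqlContext().
class ServiceSessionImpl(sessionConf: Map[String, String], rootSqlContext: SQLContext) {
  // Each session gets its own SQLContext, e.g. a new session of the shared one.
  private val sqlContext: SQLContext = rootSqlContext.newSession()

  // Apply JDBC-supplied session confs directly, as SparkSQLSessionManager does today.
  sessionConf.foreach { case (key, value) =>
    sqlContext.setConf(key, value)
  }

  def getSqlContext(): SQLContext = sqlContext
}

// An operation no longer needs the context passed through an operation manager;
// it simply asks its parent session.
class ExecuteStatementOperation(parentSession: ServiceSessionImpl, statement: String) {
  def run(): Unit = {
    val df = parentSession.getSqlContext().sql(statement)
    df.show()
  }
}
{code}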