[ https://issues.apache.org/jira/browse/SPARK-33164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216174#comment-17216174 ]
Arnaud Nauwynck commented on SPARK-33164:
-----------------------------------------

Notice that there is also the "REPLACE" feature, which might be implemented as in BigQuery:
{noformat}
select * (REPLACE expr as name)
{noformat}
see: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_replace

Finally, another piece of syntactic sugar might be
{noformat}
select * (RENAME oldname as newname)
{noformat}
... which would be equivalent (apart from column order) to
{noformat}
select * (EXCEPT oldname) FROM (select *, oldname as newname FROM ...) ..
{noformat}

> SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to
> DataSet.dropColumn(someColumn)
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33164
>                 URL: https://issues.apache.org/jira/browse/SPARK-33164
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>            Reporter: Arnaud Nauwynck
>            Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely
> no jargon.
>
> I would like the extended SQL syntax "SELECT * EXCEPT someColumn FROM .."
> to be able to select all columns except some in a SELECT clause.
> It would be similar to the SQL syntax of some databases, such as Google BigQuery
> or PostgreSQL.
> https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
> Google the question "select * EXCEPT one column", and you will see that many
> developers have the same problem.
> Example posts:
> https://blog.jooq.org/2018/05/14/selecting-all-columns-except-one-in-postgresql/
> https://www.thetopsites.net/article/53001825.shtml
>
> There are several typical examples where it is very helpful:
>
> use-case 1:
> you add a "count(*) countCol" column, and then filter on it using for
> example "having countCol = 1",
> and then you want to select all columns EXCEPT this dummy column, which is
> always "1":
> {noformat}
> select * (EXCEPT countCol)
> from (
>    select count(*) countCol, *
>    from MyTable
>    where ...
>    group by ... having countCol = 1
> )
> {noformat}
>
> use-case 2:
> the same with an analytic function "rank() over (partition by ...) rankCol
> ... where rankCol = 1",
> for example to get the latest row before a given time in a time-series
> table.
> These are the "time-travel" queries addressed by frameworks like DeltaLake.
> {noformat}
> CREATE table t_updates (update_time timestamp, id string, col1 type1, col2 type2, ... col42 type42)
>
> pastTime=..
> SELECT * (EXCEPT rankCol)
> FROM (
>    SELECT *,
>       RANK() OVER (PARTITION BY id ORDER BY update_time DESC) rankCol
>    FROM t_updates
>    WHERE update_time < pastTime
> ) WHERE rankCol = 1
> {noformat}
>
> use-case 3:
> copy some data from table "t" to a corresponding table "t_snapshot", and back
> to "t":
> {noformat}
> CREATE TABLE t (col1 type1, col2 type2, col3 type3, ... col42 type42) ...
>
> /* create the corresponding table: (snap_id string, col1 type1, col2 type2,
>    col3 type3, ... col42 type42) */
> CREATE TABLE t_snapshot
> AS SELECT '' as snap_id, * FROM t WHERE 1=2
>
> /* insert data from t into some snapshot */
> INSERT INTO t_snapshot
> SELECT 'snap1' as snap_id, * from t
>
> /* select some data from the snapshot table (without the snap_id column) */
> SELECT * (EXCEPT snap_id) FROM t_snapshot WHERE snap_id='snap1'
> {noformat}
>
> *Q2.* What problem is this proposal NOT designed to solve?
>
> It is only SQL syntactic sugar.
> It does not change the SQL execution plan or anything complex.
>
> *Q3.* How is it done today, and what are the limits of current practice?
>
> Today, you can either use the DataSet API, with .dropColumn(someColumn),
> or you need to HARD-CODE all columns manually in your SQL. Therefore your
> code is NOT generic (unless you are using a SQL meta-code generator).
>
> *Q4.* What is new in your approach and why do you think it will be successful?
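The "SQL meta-code generator" workaround mentioned in Q3 can be sketched as a tiny helper that builds the explicit column list mechanically. This is an illustrative sketch only; the function name `select_except` and the example columns are hypothetical, not part of any Spark or BigQuery API:

```python
def select_except(table, all_columns, excluded):
    """Emit an explicit SELECT listing every column of `table` except `excluded`.

    This hand-rolls what the proposed "SELECT * (EXCEPT ...)" syntax would do,
    given that the caller already knows the table's full column list.
    """
    excluded = set(excluded)
    unknown = excluded - set(all_columns)
    if unknown:
        raise ValueError("unknown columns: %s" % sorted(unknown))
    kept = [c for c in all_columns if c not in excluded]
    return "SELECT %s FROM %s" % (", ".join(kept), table)

# e.g. the snapshot query of use-case 3, without hand-writing all 42 columns
# (here only 3 columns are shown for brevity):
sql = select_except("t_snapshot", ["snap_id", "col1", "col2"], ["snap_id"])
```

The obvious limit of this workaround is that the column list must be fetched or maintained out of band, which is exactly the genericity problem the proposed syntax removes.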
>
> It is NOT new... it is already a proven solution in DataSet.dropColumn(),
> PostgreSQL, and BigQuery.
>
> *Q5.* Who cares? If you are successful, what difference will it make?
>
> It simplifies the life of developers, DBAs, data analysts, and end users.
> It simplifies the development of SQL code, in a more generic way, for many tasks.
>
> *Q6.* What are the risks?
>
> There is VERY limited risk on Spark SQL, because the feature already exists in
> the DataSet API.
> It is an extension of the SQL syntax, so the risk is annoying some IDE SQL
> editors with a new SQL syntax.
>
> *Q7.* How long will it take?
>
> No idea. I guess someone experienced in the Spark SQL internals might do it
> relatively "quickly".
> It is a kind of syntactic sugar: add an antlr grammar rule, then transform it
> into the DataSet API.
>
> *Q8.* What are the mid-term and final "exams" to check for success?
>
> The 3 standard use-cases given in question Q1.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
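The rewrite hinted at in Q7 (grammar rule, then transformation against the child plan's schema) boils down to expanding the star against a known column list. A minimal sketch of the intended semantics, assuming a resolved schema; this is NOT Spark's actual analyzer code and the name `expand_star_except` is hypothetical:

```python
def expand_star_except(schema, except_cols):
    """Expand "* (EXCEPT c1, c2, ...)" to an explicit column list.

    `schema` is the ordered column list of the child relation. Unknown names
    in the EXCEPT list are an analysis error, mirroring how unresolved column
    references fail in SQL analysis.
    """
    missing = [c for c in except_cols if c not in schema]
    if missing:
        raise ValueError("EXCEPT refers to unknown columns: %s" % missing)
    dropped = set(except_cols)
    # keep the child's column order, as DataSet.dropColumn does
    return [c for c in schema if c not in dropped]

# e.g. use-case 2: dropping the window-function helper column
cols = expand_star_except(["update_time", "id", "rankCol"], ["rankCol"])
```

This also makes the claim in Q2 concrete: the execution plan is untouched, since the rewrite happens before planning, exactly as if the user had typed the explicit column list.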