beliefer opened a new pull request #27429: [SPARK-28330][SQL] Support ANSI SQL: 
result offset clause in query expression
URL: https://github.com/apache/spark/pull/27429
 
 
   ### What changes were proposed in this pull request?
   This is a ANSI SQL and feature id is `F861`
   ```
   <query expression> ::=
   [ <with clause> ] <query expression body>
   [ <order by clause> ] [ <result offset clause> ] [ <fetch first clause> ]
   
   <result offset clause> ::=
   OFFSET <offset row count> { ROW | ROWS }
   ```
   For example:
   ```
   SELECT customer_name, customer_gender FROM customer_dimension 
      WHERE occupation='Dancer' AND customer_city = 'San Francisco' ORDER BY 
customer_name;
       customer_name     | customer_gender
   ----------------------+-----------------
    Amy X. Lang          | Female
    Anna H. Li           | Female
    Brian O. Weaver      | Male
    Craig O. Pavlov      | Male
    Doug Z. Goldberg     | Male
    Harold S. Jones      | Male
    Jack E. Perkins      | Male
    Joseph W. Overstreet | Male
    Kevin . Campbell     | Male
    Raja Y. Wilson       | Male
    Samantha O. Brown    | Female
    Steve H. Gauthier    | Male
    William . Nielson    | Male
    William Z. Roy       | Male
   (14 rows)
   
   SELECT customer_name, customer_gender FROM customer_dimension 
      WHERE occupation='Dancer' AND customer_city = 'San Francisco' ORDER BY 
customer_name OFFSET 8;
      customer_name   | customer_gender
   -------------------+-----------------
    Kevin . Campbell  | Male
    Raja Y. Wilson    | Male
    Samantha O. Brown | Female
    Steve H. Gauthier | Male
    William . Nielson | Male
    William Z. Roy    | Male
   (6 rows)
   ```
   There are some mainstream database support the syntax.
   **PostgreSQL:**
   https://www.postgresql.org/docs/11/queries-limit.html
   
   **Vertica:**
   
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/SELECT/OFFSETClause.htm?zoom_highlight=offset
   
   **MySQL:**
   https://dev.mysql.com/doc/refman/5.6/en/select.html
   
   The description for design:
   **1**. Consider `OFFSET` as the special case of `LIMIT`. For example:
   `SELECT * FROM a limit 10;` similar to `SELECT * FROM a limit 10 offset 0;`
   `SELECT * FROM a offset 10;` similar to `SELECT * FROM a limit -1 offset 10;`
   **2**. Because the current implement of `LIMIT` has good performance. For 
example:
   `SELECT * FROM a limit 10;` parsed to the logic plan as below:
   ```
   GlobalLimit (limit = 10)
       |--LocalLimit (limit = 10)
   ```
   and then the physical plan as below:
   ```
   GlobalLimitExec (limit = 10) // Take the first 10 rows globally
       |--LocalLimitExec (limit = 10) // Take the first 10 rows locally
   ```
   This operator reduce massive shuffle and has good performance.
   Sometimes, the logic plan transformed to the physical plan as:
   ```
   CollectLimitExec (limit = 10) // Take the first 10 rows globally
   ```
   If the SQL contains order by, such as `SELECT * FROM a order by c limit 10;`.
   This SQL will be transformed to the physical plan as below:
   ```
   TakeOrderedAndProjectExec (limit = 10) // Take the first 10 rows after sort 
globally
   ```
   
   Based on this situation, this PR produces the following operations. For 
example:
   `SELECT * FROM a limit 10 offset 10;` parsed to the logic plan as below:
   ```
   GlobalLimit (limit = 10, offset = 10)
       |--LocalLimit (limit = 10, offset = 10)
   ```
   and then the physical plan as below:
   ```
   GlobalLimitExec (limit = 10, offset = 10) // Skip the first 10 rows and take 
the next 10 rows globally
       |--LocalLimitExec (limit = 10, offset = 10) // Take the first 20(limit + 
offset) rows locally
   ```
   Sometimes, the logic plan transformed to the physical plan as:
   ```
   CollectLimitExec (limit = 10, offset = 10) // Skip the first 10 rows and 
take the next 10 rows globally
   ```
   If the SQL contains order by, such as `SELECT * FROM a order by c limit 10 
offset 10;`.
   This SQL will be transformed to the physical plan as below:
   ```
   TakeOrderedAndProjectExec (limit = 10, offset 10) // Skip the first 10 rows 
and take the next 10 rows after sort globally
   ```
   **3**.In addition to the above, there is a special case that is only offset 
but no limit. For example:
   `SELECT * FROM a offset 10;` parsed to the logic plan as below:
   ```
   GlobalLimit (limit = -1, offset = 10) // -1 means invalid
       |--LocalLimit (limit = -1, offset = 10) // -1 means invalid
   ```
   If offset is very large, will generate a lot of overhead. So I add a 
configuration item `spark.sql.forceUsingOffsetWithoutLimit` to force running 
query when user knows the offset is small enough. The default value of 
`spark.sql.forceUsingOffsetWithoutLimit` is false.
   
   Note: The origin PR to support this feature is 
https://github.com/apache/spark/pull/25416.
   Because the origin PR too old, there exists massive conflict which is hard 
to resolve. So I open this new PR to support this feature.
   
   ### Why are the changes needed?
   new feature
   
   ### Does this PR introduce any user-facing change?
   'No'
   
   
   ### How was this patch tested?
   Exists and new UT

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to