This is an automated email from the ASF dual-hosted git repository. dzamo pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/drill-site.git
commit e3968908c335705a8ea876d3876223fdcd3a34ce Author: James Turton <[email protected]> AuthorDate: Fri Mar 3 13:15:54 2023 +0200 Document new statistical functions. --- .../sql-reference/sql-commands/084-limit-clause.md | 9 ++ .../sql-commands/087-set-operators.md | 23 +++-- .../050-aggregate-and-aggregate-statistical.md | 111 +++++++++++++++++---- 3 files changed, 115 insertions(+), 28 deletions(-) diff --git a/_docs/en/sql-reference/sql-commands/084-limit-clause.md b/_docs/en/sql-reference/sql-commands/084-limit-clause.md index 160128628..e2e5be611 100644 --- a/_docs/en/sql-reference/sql-commands/084-limit-clause.md +++ b/_docs/en/sql-reference/sql-commands/084-limit-clause.md @@ -11,6 +11,7 @@ The LIMIT clause limits the result set to the specified number of rows. You can You can also configure an automatic limit on the number of rows returned from a result set. See [Setting an Auto Limit on the Number of Rows Returned for Result Sets]({{site.baseurl}}/docs/planning-and-execution-options/#setting-an-auto-limit-on-the-number-of-rows-returned-for-result-sets). ## Syntax + The LIMIT clause supports the following syntaxes: ``` LIMIT { count | ALL } @@ -19,6 +20,7 @@ FETCH NEXT count { ROW | ROWS } ONLY ``` ## Parameters + *count* Specifies the maximum number of rows to return. If the count expression evaluates to NULL, Drill treats it as LIMIT ALL. @@ -42,6 +44,7 @@ LIMIT 0 quickly returns an empty set. Use LIMIT 0 to test the validity of the SQ Drill optimizes LIMIT 0 queries through the following options, which are enabled by default: **planner.enable\_limit0_optimization** + Enables the query planner to determine data types returned by a query during the planning phase (before scanning data). Since Drill is a schema-free engine, the query planner does not know the column types of the result set before reading records. However, the query planner can infer the column types of the result set during query validation if you provide enough type information to Drill about the input column types. If the planner can infer all column types, a short execution path is p [...] To provide column type information, you can: @@ -52,12 +55,14 @@ To provide column type information, you can: - Issue queries on views with casts on table columns. **planner.enable\_limit0\_on_scan** + Supported in Drill 1.14 and later. Enables Drill to determine data types as Drill scans data. This optimization is used when the query planner cannot infer types of columns during validation (prior to scanning). Drill exits and terminates the query immediately after resolving the types, and the query execution is not parallelized. When this optimization is applied, the query plan contains a LIMIT (0) above every SCAN, with an optional PROJECT in between. ### LIMIT 0 Limitations The following sections list the types, operators, and functions that LIMIT 0 optimizations do not support. **Unsupported Types** + LIMIT 0 optimizations do not apply to the following types: REAL, BINARY, VARBINARY, NULL, ANY, SYMBOL, MULTISET, ARRAY, MAP, DISTINCT, STRUCTURED, ROW, OTHER, CURSOR, COLUMN_LIST @@ -65,6 +70,7 @@ REAL, BINARY, VARBINARY, NULL, ANY, SYMBOL, MULTISET, ARRAY, MAP, DISTINCT, STRU **NOTE:** The query planner in Drill has other types that are not optimized because they are undefined in Drill’s type system. **Unsupported Hive UDF Types** + Hive has a list of functions with return type NVARCHAR, which the query planner in Drill does not support; therefore, type inference does not work for the following functions: @@ -76,6 +82,7 @@ Hive has a list of functions with return type NVARCHAR, which the query planner - XPATH_STRING **Unsupported Operators and Functions** + LIMIT 0 optimizations do not work for queries with the UNION [ALL] set operator or the following complex functions: - KVGEN or MAPPIFY @@ -87,12 +94,14 @@ LIMIT 0 optimizations do not work for queries with the UNION [ALL] set operator - AVG (window function) **Recursive file listing in Drill 1.21+** + Since version 1.21, Drill will exit early from recursive file listing during the planning of a query against filesystem storage if has detected a LIMIT 0 in the outermost SELECT. This optimization is aimed at queries of the following form. ``` SELECT * FROM dfs.`huge_directory` LIMIT 0; ``` ## Examples + The following example query includes the ORDER BY and LIMIT clauses and returns the top 20 sales totals by month and state: 0: jdbc:drill:> SELECT `month`, state, SUM(order_total) diff --git a/_docs/en/sql-reference/sql-commands/087-set-operators.md b/_docs/en/sql-reference/sql-commands/087-set-operators.md index 3dd251dca..8d2f30e83 100644 --- a/_docs/en/sql-reference/sql-commands/087-set-operators.md +++ b/_docs/en/sql-reference/sql-commands/087-set-operators.md @@ -6,23 +6,23 @@ parent: "SQL Commands" The UNION, INTERSECT and EXCEPT set operators combine the result sets of two separate query expressions. The result set of each query must have the same number of columns and compatible data types. -# UNION +## UNION The UNION operator computes the set union of its two arguments, automatically removing duplicate records from the result set. UNION ALL returns all duplicate records. -## Syntax +### Syntax The UNION operator supports the following syntax: query { UNION [ ALL ] } another_query -## Parameters +### Parameters *query*, *another_query* Any SELECT query that Drill supports. See [SELECT]({{site.baseurl}}/docs/select/). -## Examples +### Examples The following example uses the UNION ALL set operator to combine click activity data before and after a marketing campaign. The data in the example exists in the `dfs.clicks workspace`. 0: jdbc:drill:> SELECT t.trans_id transaction, t.user_info.cust_id customer @@ -47,12 +47,13 @@ If a query on either side of the UNION operator queries an empty directory, as s Drill treats the empty directory as a schemaless table and returns results as if the UNION operator is not included in the query. -# INTERSECT and EXCEPT +## INTERSECT and EXCEPT **Introduced in release: 1.21** + The INTERSECT and EXCEPT operators respectively compute the set intersection and the set difference of their two arguments. As with the UNION operator, duplicate records are automatically removed from the result set. -## Syntax +### Syntax These set operators support the following syntax: @@ -62,14 +63,14 @@ query another_query ``` -## Parameters +### Parameters *query*, *another_query* Any SELECT query that Drill supports. See [SELECT]({{site.baseurl}}/docs/select/). -## Examples +### Examples -### Compute { 1, 2, 4 } ∩ { 4, 5, 6 }. +#### Compute { 1, 2, 4 } ∩ { 4, 5, 6 }. ``` apache drill> WITH X AS (SELECT employee_id AS col FROM cp.`employee.json` LIMIT 3 OFFSET 0), @@ -83,7 +84,7 @@ col 4 1 row selected (0.449 seconds) ``` -### Compute { 1, 2, 4 } ∖ { 4, 5, 6 } +#### Compute { 1, 2, 4 } ∖ { 4, 5, 6 } ``` apache drill> WITH @@ -100,7 +101,7 @@ col 2 2 rows selected (0.41 seconds) ``` -# Set operator usage notes +## Set operator usage notes * The two SELECT query expressions that represent the direct operands of the set operator must produce the same number of columns. Corresponding columns must contain compatible data types. See [Supported Data Types]({{site.baseurl}}/docs/supported-data-types/). * Multiple set operators in the same SELECT statement are evaluated left to right, unless otherwise indicated by parentheses. * You can only use * on either side of a set operator when the data source has a defined schema, such as data in Hive or views. diff --git a/_docs/en/sql-reference/sql-functions/050-aggregate-and-aggregate-statistical.md b/_docs/en/sql-reference/sql-functions/050-aggregate-and-aggregate-statistical.md index 49bcde22b..cb58d1ecd 100644 --- a/_docs/en/sql-reference/sql-functions/050-aggregate-and-aggregate-statistical.md +++ b/_docs/en/sql-reference/sql-functions/050-aggregate-and-aggregate-statistical.md @@ -10,21 +10,24 @@ The following table lists the aggregate functions that you can use in Drill queries. -| **Function** | **Argument Type** | **Return Type** | -| ------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | -| ANY_VALUE(expression) | BIT, INT, BIGINT, FLOAT4, FLOAT8, DATE, TIMESTAMP, TIME, VARCHAR, VARBINARY, LIST, MAP, INTERVAL, INTERVALDAY, INTERVALYEAR, VARDECIMAL | Same as argument type | -| AVG(expression) | SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, INTERVAL | DECIMAL for DECIMAL argument, DOUBLE for all other arguments | -| BOOL_AND(expression), BOOL_OR(expression) | BIT | BIT | -| BIT_AND(expression), BIT_OR(expression), BIT_XOR(expression) | INT, BIGINT | Same as argument type | -| CORR(x,y) | Numeric | Double | -| COVAR_POP(x,y) | Numeric | Double -| COVAR_SAMP(x,y) | Numeric | Double -| COUNT(\*) | - | BIGINT | -| COUNT(\[DISTINCT\] expression) | any | BIGINT | -| MAX(expression), MIN(expression) | BINARY, DECIMAL, VARCHAR, DATE, TIME, or TIMESTAMP | Same as argument type | -| STDDEV, STDDEV_POP, STDDEV_SAMP | Numeric | Double -| SUM(expression) | SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, INTERVAL | DECIMAL for DECIMAL argument, BIGINT for any integer-type argument (including BIGINT), DOUBLE for floating-point arguments | -| VARIANCE, VAR_POP, VAR_SAMP | Numeric | Numeric +| **Function** | **Argument Type** | **Return Type** | +| ------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | +| ANY_VALUE(expression) | BIT, INT, BIGINT, FLOAT4, FLOAT8, DATE, TIMESTAMP, TIME, VARCHAR, VARBINARY, LIST, MAP, INTERVAL, INTERVALDAY, INTERVALYEAR, VARDECIMAL | Same as argument type | +| AVG(expression) | SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, INTERVAL | DECIMAL for DECIMAL argument, DOUBLE for all other arguments | +| BOOL_AND(expression), BOOL_OR(expression) | BIT | BIT | +| BIT_AND(expression), BIT_OR(expression), BIT_XOR(expression) | INT, BIGINT | Same as argument type | +| CORR(x,y) | Numeric | Double | +| COVAR_POP(x,y) | Numeric | Double | +| COVAR_SAMP(x,y) | Numeric | Double | +| COUNT(\*) | - | BIGINT | +| COUNT(\[DISTINCT\] expression) | any | BIGINT | +| KENDALL_CORRELATION | Numeric | Double | +| MAX(expression), MIN(expression) | BINARY, DECIMAL, VARCHAR, DATE, TIME, or TIMESTAMP | Same as argument type | +| REGR_INTERCEPT | Numeric | Double | +| REGR_SLOPE | Numeric | Double | +| STDDEV, STDDEV_POP, STDDEV_SAMP | Numeric | Double | +| SUM(expression) | SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, INTERVAL | DECIMAL for DECIMAL argument, BIGINT for any integer-type argument (including BIGINT), DOUBLE for floating-point arguments | +| VARIANCE, VAR_POP, VAR_SAMP | Numeric | Numeric | - Drill 1.14 and later supports the ANY_VALUE function. - Starting in Drill 1.14, the DECIMAL data type is enabled by default. @@ -40,8 +43,9 @@ queries. ## ANY_VALUE -Supported in Drill 1.14 and later. Returns one of the values of value across all -input values. This function is NOT specified in the SQL standard. +**Introdcued in release: 1.14** + +Returns one of the values of value across all input values. This function is NOT specified in the SQL standard. ### ANY_VALUE Syntax @@ -99,6 +103,7 @@ SELECT ANY_VALUE(employee_id) as anyemp FROM cp.`employee.json` GROUP BY salary ``` ## Filtered Aggregates + **Introduced in release: 1.21** Starting in Drill 1.21 it is possible to follow an aggregate function invocation with a boolean expression that will filter the values procesed by the aggregate using the following syntax. @@ -363,6 +368,30 @@ SELECT COUNT(*) FROM cp.`employee.json`; 1 row selected (0.174 seconds) ``` +## KENDALL_CORRELATION + +Returns the Kendall correlation coefficient. + +### KENDALL_CORRELATION Syntax + +``` +KENDALL_CORRELATION( expression1, expression2 ) +``` + +### KENDALL_CORRELATION Examples + +```sql +with seq as ( + select row_number() over (order by 1) x from cp.`employee.json` limit 10 +) +select kendall_correlation(x+random(), x+random()+5) from seq; +``` +``` +EXPR$0 0.2 + +1 row selected (0.213 seconds) +``` + ## MIN and MAX These functions return the smallest and largest values of the selected @@ -521,6 +550,54 @@ computation of table statistics accordingly. {% include endnote.html %} - exec.statistics.ndv_extrapolation_bf_elements - exec.statistics.ndv_extrapolation_bf_fpprobability +## REGR_INTERCEPT + +Returns the intercept of the least squares linear regression fit. + +### REGR_INTERCEPT Syntax + +``` +REGR_INTERCEPT( expression1, expression2 ) +``` + +### REGR_INTERCEPT Examples + +```sql +with seq as ( + select row_number() over (order by 1) x from cp.`employee.json` limit 10 +) +select regr_intercept(x+random(), x+random()+5) from seq; +``` +``` +EXPR$0 4.9715020855109255 + +1 row selected (0.221 seconds) +``` + +## REGR_SLOPE + +Returns the slope of the least squares linear regression fit. + +### REGR_INTERCEPT Syntax + +``` +REGR_INTERCEPT( expression1, expression2 ) +``` + +### REGR_INTERCEPT Examples + +```sql +with seq as ( + select row_number() over (order by 1) x from cp.`employee.json` limit 10 +) +select regr_slope(x+random(), x+random()+5) from seq; +``` +``` +EXPR$0 0.9719696129009783 + +1 row selected (0.283 seconds) +``` + ## STDDEV Returns the sample standard deviation.
