This is an automated email from the ASF dual-hosted git repository.
dzamo pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/drill-site.git
The following commit(s) were added to refs/heads/master by this push:
new 114dacb7e Re-add new blog post for Drill 1.21.
114dacb7e is described below
commit 114dacb7e9eff30014380b121dffc44f59e75e1b
Author: James Turton <[email protected]>
AuthorDate: Fri Mar 3 13:54:33 2023 +0200
Re-add new blog post for Drill 1.21.
---
.../sql-reference/sql-functions/015-statistical.md | 9 ++-
.../050-aggregate-and-aggregate-statistical.md | 55 +++++++++-------
blog/_posts/en/2023-03-02-drill-1.21.0-released.md | 73 ++++++++++++++++++++++
3 files changed, 109 insertions(+), 28 deletions(-)
diff --git a/_docs/en/sql-reference/sql-functions/015-statistical.md
b/_docs/en/sql-reference/sql-functions/015-statistical.md
index 9e9497f76..1cadec553 100644
--- a/_docs/en/sql-reference/sql-functions/015-statistical.md
+++ b/_docs/en/sql-reference/sql-functions/015-statistical.md
@@ -3,18 +3,17 @@ title: "Statistical"
slug: "Statistical"
parent: "SQL Functions"
---
-
Drill supports the scalar statistical functions shown in the following table.
## Table of statistical functions
-| Function | Return Type | Description
|
-| ------------ | ----------- |
------------------------------------------------------------------------------------------------------------------------------
|
-| width_bucket | INT | Returns the 1-based bucket index of the value
after dividing the interval between min and max into the given number of
buckets |
+| Function | Return Type | Description
|
+| -------------------------------------- | ----------- |
------------------------------------------------------------------------------------------------------------------------------
|
+| width_bucket(value, min, max, buckets) | INT | Returns the 1-based
bucket index of the value after dividing the interval between min and max into
the given number of buckets |
## WIDTH_BUCKET
-Returns the 1-based bucket index of _value_ after dividing the interval
between _min_ and_ max_ into the given number of buckets. A _value_ that falls
outside the given range are given an index of 0 (_value_ < _min_) or _buckets_
+ 1 (_value_ > _max_).
+Returns the 1-based bucket index of _value_ after dividing the interval
between _min_ and _max_ into the given number of buckets. A _value_ that falls
outside the given range are given an index of 0 (_value_ \< _min_) or _buckets_
+ 1 (_value_ > _max_).
### WIDTH_BUCKET Syntax
diff --git
a/_docs/en/sql-reference/sql-functions/050-aggregate-and-aggregate-statistical.md
b/_docs/en/sql-reference/sql-functions/050-aggregate-and-aggregate-statistical.md
index cb58d1ecd..7f6704d46 100644
---
a/_docs/en/sql-reference/sql-functions/050-aggregate-and-aggregate-statistical.md
+++
b/_docs/en/sql-reference/sql-functions/050-aggregate-and-aggregate-statistical.md
@@ -9,25 +9,24 @@ parent: "SQL Functions"
The following table lists the aggregate functions that you can use in Drill
queries.
-
-| **Function** | **Argument
Type**
| **Return Type**
|
-| ------------------------------------------------------------ |
---------------------------------------------------------------------------------------------------------------------------------------
|
----------------------------------------------------------------------------------------------------------------------------------
|
-| ANY_VALUE(expression) | BIT, INT,
BIGINT, FLOAT4, FLOAT8, DATE, TIMESTAMP, TIME, VARCHAR, VARBINARY, LIST, MAP,
INTERVAL, INTERVALDAY, INTERVALYEAR, VARDECIMAL | Same as argument type
|
-| AVG(expression) | SMALLINT,
INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, INTERVAL
| DECIMAL for DECIMAL argument,
DOUBLE for all other arguments
|
-| BOOL_AND(expression), BOOL_OR(expression) | BIT
| BIT
|
-| BIT_AND(expression), BIT_OR(expression), BIT_XOR(expression) | INT, BIGINT
| Same as argument type
|
-| CORR(x,y) | Numeric
| Double
|
-| COVAR_POP(x,y) | Numeric
| Double
|
-| COVAR_SAMP(x,y) | Numeric
| Double
|
-| COUNT(\*) | -
| BIGINT
|
-| COUNT(\[DISTINCT\] expression) | any
| BIGINT
|
-| KENDALL_CORRELATION | Numeric
| Double
|
-| MAX(expression), MIN(expression) | BINARY,
DECIMAL, VARCHAR, DATE, TIME, or TIMESTAMP
| Same as argument type
|
-| REGR_INTERCEPT | Numeric
| Double
|
-| REGR_SLOPE | Numeric
| Double
|
-| STDDEV, STDDEV_POP, STDDEV_SAMP | Numeric
| Double
|
-| SUM(expression) | SMALLINT,
INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, INTERVAL
| DECIMAL for DECIMAL argument,
BIGINT for any integer-type argument (including BIGINT), DOUBLE for
floating-point arguments |
-| VARIANCE, VAR_POP, VAR_SAMP | Numeric
| Numeric
|
+| **Function** | **Argument Type**
| **Return Type**
|
+| --------------------------------- |
---------------------------------------------------------------------------------------------------------------------------------------
|
----------------------------------------------------------------------------------------------------------------------------------
|
+| ANY_VALUE(x) | BIT, INT, BIGINT, FLOAT4, FLOAT8, DATE,
TIMESTAMP, TIME, VARCHAR, VARBINARY, LIST, MAP, INTERVAL, INTERVALDAY,
INTERVALYEAR, VARDECIMAL | Same as argument type
|
+| AVG(x) | SMALLINT, INTEGER, BIGINT, FLOAT,
DOUBLE, DECIMAL, INTERVAL
| DECIMAL for DECIMAL argument, DOUBLE for all other
arguments |
+| BOOL_AND(x), BOOL_OR(x) | BIT
| BIT
|
+| BIT_AND(x), BIT_OR(x), BIT_XOR(x) | INT, BIGINT
| Same as argument type
|
+| CORR(x,y) | Numeric
| DOUBLE
|
+| COVAR_POP(x,y) | Numeric
| DOUBLE
|
+| COVAR_SAMP(x,y) | Numeric
| DOUBLE
|
+| COUNT(\*) | -
| BIGINT
|
+| COUNT(\[DISTINCT\] expression) | any
| BIGINT
|
+| KENDALL_CORRELATION(x, y) | Numeric
| DOUBLE
|
+| MAX(expression), MIN(expression) | BINARY, DECIMAL, VARCHAR, DATE, TIME, or
TIMESTAMP
| Same as argument type
|
+| REGR_INTERCEPT(x, y) | Numeric
| DOUBLE
|
+| REGR_SLOPE(x, y) | Numeric
| DOUBLE
|
+| STDDEV, STDDEV_POP, STDDEV_SAMP | Numeric
| DOUBLE
|
+| SUM(expression) | SMALLINT, INTEGER, BIGINT, FLOAT,
DOUBLE, DECIMAL, INTERVAL
| DECIMAL for DECIMAL argument, BIGINT for any
integer-type argument (including BIGINT), DOUBLE for floating-point arguments
|
+| VARIANCE, VAR_POP, VAR_SAMP | Numeric
| Numeric
|
- Drill 1.14 and later supports the ANY_VALUE function.
- Starting in Drill 1.14, the DECIMAL data type is enabled by default.
@@ -107,12 +106,14 @@ SELECT ANY_VALUE(employee_id) as anyemp FROM
cp.`employee.json` GROUP BY salary
**Introduced in release: 1.21**
Starting in Drill 1.21 it is possible to follow an aggregate function
invocation with a boolean expression that will filter the values procesed by
the aggregate using the following syntax.
+
```
agg_func( column ) FILTER(WHERE boolean_expression)
```
For example
-``` sql
+
+```sql
SELECT
count(n_name) FILTER(WHERE n_regionkey = 1) AS nations_count_in_1_region,
count(n_name) FILTER(WHERE n_regionkey = 2) AS nations_count_in_2_region,
@@ -121,7 +122,9 @@ SELECT
count(n_name) FILTER(WHERE n_regionkey = 0) AS nations_count_in_0_region
FROM cp.`tpch/nation.parquet`
```
+
will return
+
```
+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| nations_count_in_1_region | nations_count_in_2_region |
nations_count_in_3_region | nations_count_in_4_region |
nations_count_in_0_region |
@@ -130,7 +133,7 @@ will return
+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
```
-**N.B.** Some versions of Drill prior to 1.21 do not fail if FILTER
expressions are included with aggregate function calls, but silently do no
filtering yielding incorrect results. Filtered aggregates are only supported
from version 1.21 onward.
+{% include startnote.html %}Some versions of Drill prior to 1.21 do not fail
if FILTER expressions are included with aggregate function calls, but silently
do no filtering yielding incorrect results. Filtered aggregates are only
supported from version 1.21 onward.{% include endnote.html %}
## AVG
@@ -277,7 +280,8 @@ SELECT BIT_XOR(position_id) FROM cp.`employee.json`;
```
## CORR
-Returns the [Pearson Correlation
Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) for
a given x, y.
+
+Returns the [Pearson Correlation
Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) for
a given x, y.
```
SELECT CORR (department_id, salary) as correlation FROM cp.`employee.json`;
@@ -289,6 +293,7 @@ SELECT CORR (department_id, salary) as correlation FROM
cp.`employee.json`;
```
## COVAR_POP
+
Returns the population covariance for a data set.
```
@@ -301,6 +306,7 @@ SELECT covar_pop (department_id, salary) AS covariance FROM
cp.`employee.json`;
```
## COVAR_SAMP
+
Returns the sample covariance for a data set.
```
@@ -330,7 +336,7 @@ SELECT COUNT(*) FROM . . .
- ALL expression\
Returns the number of values of the specified expression.
- \* (asterisk)\
-- Returns the number of records in the table.
+ Returns the number of records in the table.
### COUNT Examples
@@ -386,6 +392,7 @@ with seq as (
)
select kendall_correlation(x+random(), x+random()+5) from seq;
```
+
```
EXPR$0 0.2
@@ -568,6 +575,7 @@ with seq as (
)
select regr_intercept(x+random(), x+random()+5) from seq;
```
+
```
EXPR$0 4.9715020855109255
@@ -592,6 +600,7 @@ with seq as (
)
select regr_slope(x+random(), x+random()+5) from seq;
```
+
```
EXPR$0 0.9719696129009783
diff --git a/blog/_posts/en/2023-03-02-drill-1.21.0-released.md
b/blog/_posts/en/2023-03-02-drill-1.21.0-released.md
new file mode 100644
index 000000000..f610b3342
--- /dev/null
+++ b/blog/_posts/en/2023-03-02-drill-1.21.0-released.md
@@ -0,0 +1,73 @@
+---
+layout: post
+title: "Announcing Drill 1.21!"
+code: drill-1.21-announcement
+excerpt: "Announcing Drill 1.21: New Connectors, Functions and Much Better
Stability."
+
+authors: ["cgivre"]
+---
+
+
+# Announcing Drill 1.21: New Connectors, Functions and Much Better Stability
+The Apache Drill PMC is pleased to announce a milestone release of Apache
Drill. Since the last release of Drill the team has been hard at work quashing
bugs and making overall functionality improvements. The TL;DR includes the
following:
+
+* New connectors including Apache Iceberg, Delta Lake, Microsoft Access,
GoogleSheets, and Box
+* Efficient cross-cloud query capability
+* Greatly improved access controls to include user translation support for all
storage plugins
+* Greatly improved query planning and implicit casting.
+* New BI-focused SQL operators including `PIVOT`, `UNPIVOT`, `EXCEPT` and
`INTERSECT`
+* New functions for computing regression lines and trends.
+* New and updated date manipulation functions.
+
+Overall, Drill 1.21 is much more capable and stable than previous versions.
+
+## Calcite, We’re Back!
+Drill relies on another open source project, Apache Calcite for its query
planning. The query planning process is a huge part of the overall
functionality of Drill. Unfortunately, about three years ago, there were some
issues in Calcite which forced Drill to fork it and rely on that fork. As a
result, Drill was essentially stuck with a three year old query planner, but
more importantly, bugs that were fixed in Calcite, as well as new capabilities
were not finding their way into Drill.
+
+That is no longer the case. Drill 1.21 is now running on the latest stable
version of Calcite, version 1.33. As a result, we’ve been able to close
countless JIRA tickets of various queries failing and other random bugs that
were the result of query planning bugs.
+
+What this means for you as a user is that you’ll see much fewer queries
failing and better overall performance in terms of speed and stability. You’ll
see better optimizations being pushed down to JDBC data sources as well as
support for BigQuery, Athena and other JDBC data sources. We hope to keep Drill
away from Calcite forks so I hope that we will work with the Calcite community
to keep our tools in sync.
+
+## Improved Implicit Casting Rules Reduce Schema Change Failures
+From this author’s perspective, one of the biggest improvements in Drill is
one of the least noticeable and that is the result of improved implicit
casting. One of Drill’s unique features is its ability to infer the structure,
or schema of your data. However, this can be problematic when the schema
changes. When I used to teach Drill, I used to have spend a considerable amount
of time teaching students how to cast data from one data type to another to
ensure that the queries would succeed.
+
+When using latest version of Drill, you’ll find that queries will work without
the need for much if any casting. In short, they’ll do what you expect them to
do. It’s really a high on magic functionality.
+
+## Integrations with the Modern and Not-so-Modern Data Stack
+The new version of Drill features several new connectors and readers that will
enable users to connect to the “modern data stack”, specifically support for
Apache Iceberg and Delta Lake.
+
+### Breaking the Iceberg
+Iceberg is a high-performance format for huge analytic tables. Iceberg brings
the reliability and simplicity of SQL tables to big data, while making it
possible for engines like Drill to safely work with the same tables, at the
same time. In addition to being able to query data directly from Iceberg
tables, Drill also allows users to query the Iceberg table metadata as well as
snapshots. [Complete documentation is available
here](https://drill.apache.org/docs/iceberg-format-plugin/).
+
+### Querying Delta Lake
+Lest we offend someone, we’re not going to get into the debate between Iceberg
and Delta lake (after all, let’s not argue about who killed whom), but Delta
Lake, if you aren’t familiar with it, is another modern table format which
allows ACID transactions, versioning etc. In version 1.21, Drill adds support
for Delta Lake tables, so users can query Delta Lake tables as well as
associated metadata. You can also query specific versions of files in delta
lake. [Complete documentation is av [...]
+
+### Accessing Access
+A surprising number of people use Microsoft Access as a database for their
business data. With version 1.21, Apache Drill can now natively query Microsoft
Access database files using Drill. This can be a major benefit for those
looking to migrate data from Access into more modern formats such as parquet or
even other relational databases. Drill will support Access files from version
1997 and up.
+
+### Oh Sheets!
+In addition to all of the above, Drill can now query data directly from
GoogleSheets. In addition to being able to query this data source, Drill can
read, write, delete and append to GoogleSheets. Google doesn’t make it easy, so
if this is a feature you are interested in, you’ll definitely want to [read the
documentation
here](https://drill.apache.org/docs/google-sheets-storage-plugin/).
+
+### Remote Data
+As you can see, Drill has significantly expanded the number of data sources
and types that it can query. A part of this work has also been to improve the
implementation behind filesystems. As a result, Drill can now query data stored
on Dropbox, and Box. We added support for filesystems which use OAuth 2.0 for
authorization so this means that more extended file systems are likely coming
your way for the next release.
+
+## Greatly Improved Access Controls
+Managing access controls and credentials on a federated query engine is a
complicated task. Drill has supported a concept called user impersonation which
basically means that Drill can execute queries using the credentials of the
logged in user. This concept works well for querying file systems such as
Hadoop, and other data sources that have the same concepts, however it does not
work at all with data sources that have different concepts of users, or in the
case of OAuth enabled plugins [...]
+
+To answer this challenge, Drill 1.21 introduces the concept of user
translation. The idea of user translation is that, when enabled, every user
will have their own unique credentials for specific data sources. Thus, when
that user queries a specific data source, that user’s credentials are used to
execute the query. This is configurable on an individual data source basis.
Ultimately, what this means is that you no longer have to create service
accounts to access data via Drill.
+
+## Drilling Across the Clouds
+While we’re on the subject of clouds, as you may be aware, Drill can query
data stored in cloud-based file systems such as S3, Azure, GCP etc. One of the
challenges however, is that if you have data stored in multiple clouds, it can
become very inefficient to query this data, especially from the perspective of
network IO. As of Drill 1.21, Drill adds a storage plugin which we are calling
Drill on Drill.
+
+Let’s say that you had a Drill cluster in S3, but you had data in both S3 and
Azure. With the new Drill on Drill capability, you could install an additional
Drill cluster in Azure, then query both from either Drill cluster. The
advantage is that the queries would be pushed down to the Drill cluster where
the data resides. So if you query Azure from S3, you aren’t sending tons of
data back and forth.
+
+## Drill Now Supports More BI Operators
+While Drill held more or less to the SQL standard, it was missing some BI
operators that had become commonplace among SQL platforms. Drill 1.21
introduces the `PIVOT`, and `UNPIVOT` operators which covert rows to columns or
vice versa, much in the same way a pivot table works in Excel. Additionally, we
added set operators `INTERSECT` and `EXCEPT` which have become part of the SQL
standard.
+
+## New Statistical Functions
+Drill 1.21 adds new SQL functions for statistical summaries including
`kendall_correlation` for calculating correlation coefficients, `width_bucket`
which is a SQL function for computing histograms and distributions, and two
other functions for computing regression lines.
+
+Lastly, we’ve also added additional date/time manipulation functions which
will make working with dates significantly easier.
+
+## What’s Next?
+The big question is where do we go from here? We’ve already started working on
adding support for additional BI operators such as `CUBE`, `GROUPING SETS` and
`ROLLUP`, as well as `REGEXP_EXTRACT`. Since the new version of Calcite has
support for numerous optimizations around materialized views this is also
something which is being discussed. If you like what you are seeing, please
download Drill and try it out. Feedback is always welcome on the Drill slack
channel or on our mailing lists [...]