ccaominh commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408977929
##
File path: docs/querying/datasource.md
##
@@ -22,43 +22,317 @@ title: "Datasources"
~ under the License.
-->
+Datasources in Apache Druid are things that you can query. The most common
kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table
datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always
creating or writing into a table
+datasource. But at query time, there are many other types of datasources
available.
-A data source is the Apache Druid equivalent of a database table. However, a
query can also masquerade as a data source, providing subquery-like
functionality. Query data sources are currently supported only by
[GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM`
clause](sql.html#from).
-### Table datasource
-The table data source is the most common type. It's represented by a string,
or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S)
when it appears in API requests and
+responses.
+## Datasource type
+
+### `table`
+
+
+
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+
+```json
+{
+ "queryType": "scan",
+ "dataSource": "dataSourceName",
+ "columns": ["column1", "column2"],
+ "intervals": ["/3000"]
+}
+```
+
+
+The table datasource is the most common type. This is the kind of datasource
you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments,
distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid`
schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply
`dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as
strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+ "type": "table",
+ "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+
+
+```sql
+SELECT k, v FROM lookup.countries
+```
+
+```json
+{
+ "queryType": "scan",
+ "dataSource": {
+"type": "lookup",
+"lookup": "countries"
+ },
+ "columns": ["k", "v"],
+ "intervals": ["/3000"]
+}
+```
+
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html)
objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all
servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns:
`k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an
explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas
the
+> `LOOKUP` function can defer evaluation until after an aggregation phase.
This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+
+
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+ AVG(cnt) AS average_hits_per_page
+FROM
+ (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+
+```json
+{
+ "queryType": "timeseries",
+ "dataSource": {
+"type": "query",
+"query": {
+ "queryType": "groupBy",
+ "dataSource": "site_traffic",
+ "intervals": ["/3000"],
+ "granularity": "all",
+ "dimensions": ["page"],
+ "aggregations": [
+{ "type": "count", "name": "hits" }
+ ]
+}
+ },
+ "intervals": ["/3000"],
+ "granularity": "all",
+ "aggregations": [
+{ "type": "longSum", "name": "hits", "fieldName": "hits" },
+{ "type": "count", "name": "pages" }
+ ],
+ "postAggregations": [
+{ "type": "expression", "name": "average_hits_per_page", "expression":
"hits / pages" }
+ ]
+}
+```
+
+
+Query datasources allow you to issue subqueries. In native queries, they can
appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always
surrounded by parentheses:
+
+- The FROM clause: `FROM ()`.
+- As inputs to a JOIN: ` t1 INNER JOIN
t2 ON t1. = t2.`.
+- In the WHERE clause: `WHERE { IN | NOT IN } ()`. These
are translated to joins by