vtlim commented on code in PR #16762: URL: https://github.com/apache/druid/pull/16762#discussion_r1693360525
########## docs/tutorials/tutorial-rollup.md: ########## @@ -49,150 +52,98 @@ For this tutorial, we'll use a small sample of network flow event data, represen {"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818} ``` -A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`. - -We'll ingest this data using the following ingestion task spec, located at `quickstart/tutorial/rollup-index.json`. - -```json -{ - "type" : "index_parallel", - "spec" : { - "dataSchema" : { - "dataSource" : "rollup-tutorial", - "dimensionsSpec" : { - "dimensions" : [ - "srcIP", - "dstIP" - ] - }, - "timestampSpec": { - "column": "timestamp", - "format": "iso" - }, - "metricsSpec" : [ - { "type" : "count", "name" : "count" }, - { "type" : "longSum", "name" : "packets", "fieldName" : "packets" }, - { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" } - ], - "granularitySpec" : { - "type" : "uniform", - "segmentGranularity" : "week", - "queryGranularity" : "minute", - "intervals" : ["2018-01-01/2018-01-03"], - "rollup" : true - } - }, - "ioConfig" : { - "type" : "index_parallel", - "inputSource" : { - "type" : "local", - "baseDir" : "quickstart/tutorial", - "filter" : "rollup-data.json" - }, - "inputFormat" : { - "type" : "json" - }, - "appendToExisting" : false - }, - "tuningConfig" : { - "type" : "index_parallel", - "partitionsSpec": { - "type": "dynamic" - }, - "maxRowsInMemory" : 25000 - } - } -} +Load the sample dataset using the [`INSERT INTO`](../multi-stage-query/reference.md/#insert) statement and the [`EXTERN`](../multi-stage-query/reference.md/#extern-function) function to ingest the data inline. In the [Druid web console](../operations/web-console.md), go to the **Query** view and run the following query: + +```sql +INSERT INTO "rollup_tutorial" +WITH "inline_data" AS ( + SELECT * + FROM TABLE(EXTERN('{ + "type":"inline", + "data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}\n{\"timestamp\":\"2018-01-01T01:02:14Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-01T01:01:59Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":11,\"bytes\":5780}\n{\"timestamp\":\"2018-01-01T01:01:51Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":255,\"bytes\":21133}\n{\"timestamp\":\"2018-01-01T01:02:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":377,\"bytes\":359971}\n{\"timestamp\":\"2018-01-01T01:03:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":49,\"bytes\":10204}\n{\"timestamp\":\"2018-01-02T21:33:14Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-02T21:33:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":123,\"bytes\":93999}\n{\"timestamp\":\"2018-01-02T21:35:45Z\",\"srcIP\" :\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":12,\"bytes\":2818}"}', + '{"type":"json"}')) + EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" BIGINT, "bytes" BIGINT) +) +SELECT + FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time, + "srcIP", + "dstIP", + SUM("bytes") AS "bytes", + SUM("packets") AS "packets", + COUNT(*) AS "count" +FROM "inline_data" +GROUP BY 1, 2, 3 +PARTITIONED BY DAY ``` -Rollup has been enabled by setting `"rollup" : true` in the `granularitySpec`. - -Note that we have `srcIP` and `dstIP` defined as dimensions, a longSum metric is defined for the `packets` and `bytes` columns, and the `queryGranularity` has been defined as `minute`. +In the query, you group by dimensions, `timestamp`, `srcIP`, and `dstIP`. Note that the query uses the `FLOOR` function to bucket rows based on MINUTE granularity. +For the metrics, you apply aggregations to sum the `bytes` and `packets` columns and add a column that counts the number of rows that get rolled-up. Review Comment: Pulling this description from the [rollup](https://druid.apache.org/docs/latest/ingestion/rollup) doc to make it more instructive for users following the tutorial. This is more of a general preview, then the "view in action" section explains it in more detail for the user. ```suggestion With rollup, Druid combines rows with identical timestamp and dimension values after the timestamp truncation. Druid computes and stores the metric values using the specified aggregation function over each set of rolled-up rows. ``` ########## docs/tutorials/tutorial-rollup.md: ########## @@ -49,150 +52,98 @@ For this tutorial, we'll use a small sample of network flow event data, represen {"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818} ``` -A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`. - -We'll ingest this data using the following ingestion task spec, located at `quickstart/tutorial/rollup-index.json`. - -```json -{ - "type" : "index_parallel", - "spec" : { - "dataSchema" : { - "dataSource" : "rollup-tutorial", - "dimensionsSpec" : { - "dimensions" : [ - "srcIP", - "dstIP" - ] - }, - "timestampSpec": { - "column": "timestamp", - "format": "iso" - }, - "metricsSpec" : [ - { "type" : "count", "name" : "count" }, - { "type" : "longSum", "name" : "packets", "fieldName" : "packets" }, - { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" } - ], - "granularitySpec" : { - "type" : "uniform", - "segmentGranularity" : "week", - "queryGranularity" : "minute", - "intervals" : ["2018-01-01/2018-01-03"], - "rollup" : true - } - }, - "ioConfig" : { - "type" : "index_parallel", - "inputSource" : { - "type" : "local", - "baseDir" : "quickstart/tutorial", - "filter" : "rollup-data.json" - }, - "inputFormat" : { - "type" : "json" - }, - "appendToExisting" : false - }, - "tuningConfig" : { - "type" : "index_parallel", - "partitionsSpec": { - "type": "dynamic" - }, - "maxRowsInMemory" : 25000 - } - } -} +Load the sample dataset using the [`INSERT INTO`](../multi-stage-query/reference.md/#insert) statement and the [`EXTERN`](../multi-stage-query/reference.md/#extern-function) function to ingest the data inline. In the [Druid web console](../operations/web-console.md), go to the **Query** view and run the following query: + +```sql +INSERT INTO "rollup_tutorial" +WITH "inline_data" AS ( + SELECT * + FROM TABLE(EXTERN('{ + "type":"inline", + "data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}\n{\"timestamp\":\"2018-01-01T01:02:14Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-01T01:01:59Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":11,\"bytes\":5780}\n{\"timestamp\":\"2018-01-01T01:01:51Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":255,\"bytes\":21133}\n{\"timestamp\":\"2018-01-01T01:02:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":377,\"bytes\":359971}\n{\"timestamp\":\"2018-01-01T01:03:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":49,\"bytes\":10204}\n{\"timestamp\":\"2018-01-02T21:33:14Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-02T21:33:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":123,\"bytes\":93999}\n{\"timestamp\":\"2018-01-02T21:35:45Z\",\"srcIP\" :\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":12,\"bytes\":2818}"}', + '{"type":"json"}')) + EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" BIGINT, "bytes" BIGINT) +) +SELECT + FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time, + "srcIP", + "dstIP", + SUM("bytes") AS "bytes", + SUM("packets") AS "packets", + COUNT(*) AS "count" +FROM "inline_data" +GROUP BY 1, 2, 3 +PARTITIONED BY DAY ``` -Rollup has been enabled by setting `"rollup" : true` in the `granularitySpec`. - -Note that we have `srcIP` and `dstIP` defined as dimensions, a longSum metric is defined for the `packets` and `bytes` columns, and the `queryGranularity` has been defined as `minute`. +In the query, you group by dimensions, `timestamp`, `srcIP`, and `dstIP`. Note that the query uses the `FLOOR` function to bucket rows based on MINUTE granularity. +For the metrics, you apply aggregations to sum the `bytes` and `packets` columns and add a column that counts the number of rows that get rolled-up. -We will see how these definitions are used after we load this data. +After the ingestion completes, you can query the data. -## Load the example data +## Query the example data -From the apache-druid-{{DRUIDVERSION}} package root, run the following command: +In the web console, open a new tab in the **Query** view. Run the following query to view the ingested data: -```bash -bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081 +```sql +SELECT * FROM "rollup_tutorial" ``` -After the script completes, we will query the data. +Returns the following: -## Query the example data +| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | +| -- | -- | -- | -- | -- | -- | +| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` | +| `2018-01-01T01:02:00.000Z` | `1.1.1.1` | `2.2.2.2` | `366,260` | `2` | `415` | +| `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` | +| `2018-01-02T21:33:00.000Z` | `7.7.7.7` | `8.8.8.8` | `100,288` | `2` | `161` | +| `2018-01-02T21:35:00.000Z` | `7.7.7.7` | `8.8.8.8` | `2,818` | `1` | `12` | -Let's run `bin/dsql` and issue a `select * from "rollup-tutorial";` query to see what data was ingested. - -```bash -$ bin/dsql -Welcome to dsql, the command-line client for Druid SQL. -Type "\h" for help. -dsql> select * from "rollup-tutorial"; -┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐ -│ __time │ bytes │ count │ dstIP │ packets │ srcIP │ -├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤ -│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │ -│ 2018-01-01T01:02:00.000Z │ 366260 │ 2 │ 2.2.2.2 │ 415 │ 1.1.1.1 │ -│ 2018-01-01T01:03:00.000Z │ 10204 │ 1 │ 2.2.2.2 │ 49 │ 1.1.1.1 │ -│ 2018-01-02T21:33:00.000Z │ 100288 │ 2 │ 8.8.8.8 │ 161 │ 7.7.7.7 │ -│ 2018-01-02T21:35:00.000Z │ 2818 │ 1 │ 8.8.8.8 │ 12 │ 7.7.7.7 │ -└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘ -Retrieved 5 rows in 1.18s. - -dsql> -``` +Notice there are only six rows as opposed to the nine rows in the example data. The next section covers how ingestion with rollup accomplishes this. Review Comment: You're not really going into "how rollup works" -- maybe this phrasing would be more accurate ```suggestion Notice there are only six rows as opposed to the nine rows in the example data. In the next section, you explore the components of the rolled-up rows. ``` ########## docs/tutorials/tutorial-rollup.md: ########## @@ -49,150 +52,98 @@ For this tutorial, we'll use a small sample of network flow event data, represen {"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818} ``` -A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`. - -We'll ingest this data using the following ingestion task spec, located at `quickstart/tutorial/rollup-index.json`. - -```json -{ - "type" : "index_parallel", - "spec" : { - "dataSchema" : { - "dataSource" : "rollup-tutorial", - "dimensionsSpec" : { - "dimensions" : [ - "srcIP", - "dstIP" - ] - }, - "timestampSpec": { - "column": "timestamp", - "format": "iso" - }, - "metricsSpec" : [ - { "type" : "count", "name" : "count" }, - { "type" : "longSum", "name" : "packets", "fieldName" : "packets" }, - { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" } - ], - "granularitySpec" : { - "type" : "uniform", - "segmentGranularity" : "week", - "queryGranularity" : "minute", - "intervals" : ["2018-01-01/2018-01-03"], - "rollup" : true - } - }, - "ioConfig" : { - "type" : "index_parallel", - "inputSource" : { - "type" : "local", - "baseDir" : "quickstart/tutorial", - "filter" : "rollup-data.json" - }, - "inputFormat" : { - "type" : "json" - }, - "appendToExisting" : false - }, - "tuningConfig" : { - "type" : "index_parallel", - "partitionsSpec": { - "type": "dynamic" - }, - "maxRowsInMemory" : 25000 - } - } -} +Load the sample dataset using the [`INSERT INTO`](../multi-stage-query/reference.md/#insert) statement and the [`EXTERN`](../multi-stage-query/reference.md/#extern-function) function to ingest the data inline. In the [Druid web console](../operations/web-console.md), go to the **Query** view and run the following query: + +```sql +INSERT INTO "rollup_tutorial" +WITH "inline_data" AS ( + SELECT * + FROM TABLE(EXTERN('{ + "type":"inline", + "data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}\n{\"timestamp\":\"2018-01-01T01:02:14Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-01T01:01:59Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":11,\"bytes\":5780}\n{\"timestamp\":\"2018-01-01T01:01:51Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":255,\"bytes\":21133}\n{\"timestamp\":\"2018-01-01T01:02:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":377,\"bytes\":359971}\n{\"timestamp\":\"2018-01-01T01:03:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":49,\"bytes\":10204}\n{\"timestamp\":\"2018-01-02T21:33:14Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-02T21:33:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":123,\"bytes\":93999}\n{\"timestamp\":\"2018-01-02T21:35:45Z\",\"srcIP\" :\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":12,\"bytes\":2818}"}', + '{"type":"json"}')) + EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" BIGINT, "bytes" BIGINT) +) +SELECT + FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time, + "srcIP", + "dstIP", + SUM("bytes") AS "bytes", + SUM("packets") AS "packets", + COUNT(*) AS "count" +FROM "inline_data" +GROUP BY 1, 2, 3 +PARTITIONED BY DAY ``` -Rollup has been enabled by setting `"rollup" : true` in the `granularitySpec`. - -Note that we have `srcIP` and `dstIP` defined as dimensions, a longSum metric is defined for the `packets` and `bytes` columns, and the `queryGranularity` has been defined as `minute`. +In the query, you group by dimensions, `timestamp`, `srcIP`, and `dstIP`. Note that the query uses the `FLOOR` function to bucket rows based on MINUTE granularity. +For the metrics, you apply aggregations to sum the `bytes` and `packets` columns and add a column that counts the number of rows that get rolled-up. -We will see how these definitions are used after we load this data. +After the ingestion completes, you can query the data. -## Load the example data +## Query the example data -From the apache-druid-{{DRUIDVERSION}} package root, run the following command: +In the web console, open a new tab in the **Query** view. Run the following query to view the ingested data: -```bash -bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081 +```sql +SELECT * FROM "rollup_tutorial" ``` -After the script completes, we will query the data. +Returns the following: -## Query the example data +| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | +| -- | -- | -- | -- | -- | -- | +| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` | +| `2018-01-01T01:02:00.000Z` | `1.1.1.1` | `2.2.2.2` | `366,260` | `2` | `415` | +| `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` | +| `2018-01-02T21:33:00.000Z` | `7.7.7.7` | `8.8.8.8` | `100,288` | `2` | `161` | +| `2018-01-02T21:35:00.000Z` | `7.7.7.7` | `8.8.8.8` | `2,818` | `1` | `12` | -Let's run `bin/dsql` and issue a `select * from "rollup-tutorial";` query to see what data was ingested. - -```bash -$ bin/dsql -Welcome to dsql, the command-line client for Druid SQL. -Type "\h" for help. -dsql> select * from "rollup-tutorial"; -┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐ -│ __time │ bytes │ count │ dstIP │ packets │ srcIP │ -├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤ -│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │ -│ 2018-01-01T01:02:00.000Z │ 366260 │ 2 │ 2.2.2.2 │ 415 │ 1.1.1.1 │ -│ 2018-01-01T01:03:00.000Z │ 10204 │ 1 │ 2.2.2.2 │ 49 │ 1.1.1.1 │ -│ 2018-01-02T21:33:00.000Z │ 100288 │ 2 │ 8.8.8.8 │ 161 │ 7.7.7.7 │ -│ 2018-01-02T21:35:00.000Z │ 2818 │ 1 │ 8.8.8.8 │ 12 │ 7.7.7.7 │ -└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘ -Retrieved 5 rows in 1.18s. - -dsql> -``` +Notice there are only six rows as opposed to the nine rows in the example data. The next section covers how ingestion with rollup accomplishes this. Review Comment: "six rows" --> isn't this five? ########## docs/tutorials/tutorial-rollup.md: ########## @@ -49,150 +52,98 @@ For this tutorial, we'll use a small sample of network flow event data, represen {"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818} ``` -A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`. - -We'll ingest this data using the following ingestion task spec, located at `quickstart/tutorial/rollup-index.json`. - -```json -{ - "type" : "index_parallel", - "spec" : { - "dataSchema" : { - "dataSource" : "rollup-tutorial", - "dimensionsSpec" : { - "dimensions" : [ - "srcIP", - "dstIP" - ] - }, - "timestampSpec": { - "column": "timestamp", - "format": "iso" - }, - "metricsSpec" : [ - { "type" : "count", "name" : "count" }, - { "type" : "longSum", "name" : "packets", "fieldName" : "packets" }, - { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" } - ], - "granularitySpec" : { - "type" : "uniform", - "segmentGranularity" : "week", - "queryGranularity" : "minute", - "intervals" : ["2018-01-01/2018-01-03"], - "rollup" : true - } - }, - "ioConfig" : { - "type" : "index_parallel", - "inputSource" : { - "type" : "local", - "baseDir" : "quickstart/tutorial", - "filter" : "rollup-data.json" - }, - "inputFormat" : { - "type" : "json" - }, - "appendToExisting" : false - }, - "tuningConfig" : { - "type" : "index_parallel", - "partitionsSpec": { - "type": "dynamic" - }, - "maxRowsInMemory" : 25000 - } - } -} +Load the sample dataset using the [`INSERT INTO`](../multi-stage-query/reference.md/#insert) statement and the [`EXTERN`](../multi-stage-query/reference.md/#extern-function) function to ingest the data inline. In the [Druid web console](../operations/web-console.md), go to the **Query** view and run the following query: + +```sql +INSERT INTO "rollup_tutorial" +WITH "inline_data" AS ( + SELECT * + FROM TABLE(EXTERN('{ + "type":"inline", + "data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}\n{\"timestamp\":\"2018-01-01T01:02:14Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-01T01:01:59Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":11,\"bytes\":5780}\n{\"timestamp\":\"2018-01-01T01:01:51Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":255,\"bytes\":21133}\n{\"timestamp\":\"2018-01-01T01:02:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":377,\"bytes\":359971}\n{\"timestamp\":\"2018-01-01T01:03:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":49,\"bytes\":10204}\n{\"timestamp\":\"2018-01-02T21:33:14Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-02T21:33:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":123,\"bytes\":93999}\n{\"timestamp\":\"2018-01-02T21:35:45Z\",\"srcIP\" :\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":12,\"bytes\":2818}"}', + '{"type":"json"}')) + EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" BIGINT, "bytes" BIGINT) +) +SELECT + FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time, + "srcIP", + "dstIP", + SUM("bytes") AS "bytes", + SUM("packets") AS "packets", + COUNT(*) AS "count" +FROM "inline_data" +GROUP BY 1, 2, 3 +PARTITIONED BY DAY ``` -Rollup has been enabled by setting `"rollup" : true` in the `granularitySpec`. - -Note that we have `srcIP` and `dstIP` defined as dimensions, a longSum metric is defined for the `packets` and `bytes` columns, and the `queryGranularity` has been defined as `minute`. +In the query, you group by dimensions, `timestamp`, `srcIP`, and `dstIP`. Note that the query uses the `FLOOR` function to bucket rows based on MINUTE granularity. +For the metrics, you apply aggregations to sum the `bytes` and `packets` columns and add a column that counts the number of rows that get rolled-up. -We will see how these definitions are used after we load this data. +After the ingestion completes, you can query the data. -## Load the example data +## Query the example data -From the apache-druid-{{DRUIDVERSION}} package root, run the following command: +In the web console, open a new tab in the **Query** view. Run the following query to view the ingested data: -```bash -bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081 +```sql +SELECT * FROM "rollup_tutorial" ``` -After the script completes, we will query the data. +Returns the following: -## Query the example data +| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | +| -- | -- | -- | -- | -- | -- | +| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` | +| `2018-01-01T01:02:00.000Z` | `1.1.1.1` | `2.2.2.2` | `366,260` | `2` | `415` | +| `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` | +| `2018-01-02T21:33:00.000Z` | `7.7.7.7` | `8.8.8.8` | `100,288` | `2` | `161` | +| `2018-01-02T21:35:00.000Z` | `7.7.7.7` | `8.8.8.8` | `2,818` | `1` | `12` | -Let's run `bin/dsql` and issue a `select * from "rollup-tutorial";` query to see what data was ingested. - -```bash -$ bin/dsql -Welcome to dsql, the command-line client for Druid SQL. -Type "\h" for help. -dsql> select * from "rollup-tutorial"; -┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐ -│ __time │ bytes │ count │ dstIP │ packets │ srcIP │ -├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤ -│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │ -│ 2018-01-01T01:02:00.000Z │ 366260 │ 2 │ 2.2.2.2 │ 415 │ 1.1.1.1 │ -│ 2018-01-01T01:03:00.000Z │ 10204 │ 1 │ 2.2.2.2 │ 49 │ 1.1.1.1 │ -│ 2018-01-02T21:33:00.000Z │ 100288 │ 2 │ 8.8.8.8 │ 161 │ 7.7.7.7 │ -│ 2018-01-02T21:35:00.000Z │ 2818 │ 1 │ 8.8.8.8 │ 12 │ 7.7.7.7 │ -└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘ -Retrieved 5 rows in 1.18s. - -dsql> -``` +Notice there are only six rows as opposed to the nine rows in the example data. The next section covers how ingestion with rollup accomplishes this. + +## View rollup in action -Let's look at the three events in the original input data that occurred during `2018-01-01T01:01`: +Consider the three events in the original input data that occur over the course of minute `2018-01-01T01:01`: ```json {"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024} {"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133} {"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780} ``` -These three rows have been "rolled up" into the following row: +Druid combines the three rows into one during rollup: -```bash -┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐ -│ __time │ bytes │ count │ dstIP │ packets │ srcIP │ -├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤ -│ 2018-01-01T01:01:00.000Z │ 35937 │ 3 │ 2.2.2.2 │ 286 │ 1.1.1.1 │ -└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘ -``` +| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | +| -- | -- | -- | -- | -- | -- | +| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` | -The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`. +Before the grouping occurs, the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` expression buckets (floors) the timestamp column of the original input by minute. -Before the grouping occurs, the timestamps of the original input data are bucketed/floored by minute, due to the `"queryGranularity":"minute"` setting in the ingestion spec. +The input rows are grouped because they have the same values for their dimension columns `{timestamp, srcIP, dstIP}`. The metric columns calculate the sum aggregation of the grouped rows for `packets` and `bytes`. The `count` metric shows how many rows from the original input data contributed to the final rolled-up row. -Likewise, these two events that occurred during `2018-01-01T01:02` have been rolled up: +Now, consider the two events in the original input data that occur over the course of minute `2018-01-01T01:02`: ```json {"timestamp":"2018-01-01T01:02:14Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":38,"bytes":6289} {"timestamp":"2018-01-01T01:02:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":377,"bytes":359971} ``` -```bash -┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐ -│ __time │ bytes │ count │ dstIP │ packets │ srcIP │ -├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤ -│ 2018-01-01T01:02:00.000Z │ 366260 │ 2 │ 2.2.2.2 │ 415 │ 1.1.1.1 │ -└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘ -``` +The rows are grouped into the following during rollup: -For the last event recording traffic between 1.1.1.1 and 2.2.2.2, no rollup took place, because this was the only event that occurred during `2018-01-01T01:03`: +| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | +| -- | -- | -- | -- | -- | -- | +| `2018-01-01T01:02:00.000Z` | `1.1.1.1` | `2.2.2.2` | `366,260` | `2` | `415` | + +In the original input data, only one event occurs over the course of minute `2018-01-01T01:03`: ```json {"timestamp":"2018-01-01T01:03:29Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":49,"bytes":10204} ``` -```bash -┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐ -│ __time │ bytes │ count │ dstIP │ packets │ srcIP │ -├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤ -│ 2018-01-01T01:03:00.000Z │ 10204 │ 1 │ 2.2.2.2 │ 49 │ 1.1.1.1 │ -└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘ -``` +Therefore, no rollup takes place: + +| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` | +| -- | -- | -- | -- | -- | -- | +| `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` | + -Note that the `count` metric shows how many rows in the original input data contributed to the final "rolled up" row. Review Comment: At the end, can you add a "Learn more" section? See the append data tutorial for a model. You can link to the rollup section of the SQL-based ingestion docs, as well as the more general rollup documentation. Add anything else you think may be helpful for the reader. ########## docs/tutorials/tutorial-rollup.md: ########## @@ -49,150 +52,98 @@ For this tutorial, we'll use a small sample of network flow event data, represen {"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818} ``` -A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`. - -We'll ingest this data using the following ingestion task spec, located at `quickstart/tutorial/rollup-index.json`. - -```json -{ - "type" : "index_parallel", - "spec" : { - "dataSchema" : { - "dataSource" : "rollup-tutorial", - "dimensionsSpec" : { - "dimensions" : [ - "srcIP", - "dstIP" - ] - }, - "timestampSpec": { - "column": "timestamp", - "format": "iso" - }, - "metricsSpec" : [ - { "type" : "count", "name" : "count" }, - { "type" : "longSum", "name" : "packets", "fieldName" : "packets" }, - { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" } - ], - "granularitySpec" : { - "type" : "uniform", - "segmentGranularity" : "week", - "queryGranularity" : "minute", - "intervals" : ["2018-01-01/2018-01-03"], - "rollup" : true - } - }, - "ioConfig" : { - "type" : "index_parallel", - "inputSource" : { - "type" : "local", - "baseDir" : "quickstart/tutorial", - "filter" : "rollup-data.json" - }, - "inputFormat" : { - "type" : "json" - }, - "appendToExisting" : false - }, - "tuningConfig" : { - "type" : "index_parallel", - "partitionsSpec": { - "type": "dynamic" - }, - "maxRowsInMemory" : 25000 - } - } -} +Load the sample dataset using the [`INSERT INTO`](../multi-stage-query/reference.md/#insert) statement and the [`EXTERN`](../multi-stage-query/reference.md/#extern-function) function to ingest the data inline. In the [Druid web console](../operations/web-console.md), go to the **Query** view and run the following query: + +```sql +INSERT INTO "rollup_tutorial" +WITH "inline_data" AS ( + SELECT * + FROM TABLE(EXTERN('{ + "type":"inline", + "data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}\n{\"timestamp\":\"2018-01-01T01:02:14Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-01T01:01:59Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":11,\"bytes\":5780}\n{\"timestamp\":\"2018-01-01T01:01:51Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":255,\"bytes\":21133}\n{\"timestamp\":\"2018-01-01T01:02:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":377,\"bytes\":359971}\n{\"timestamp\":\"2018-01-01T01:03:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":49,\"bytes\":10204}\n{\"timestamp\":\"2018-01-02T21:33:14Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-02T21:33:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":123,\"bytes\":93999}\n{\"timestamp\":\"2018-01-02T21:35:45Z\",\"srcIP\" :\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":12,\"bytes\":2818}"}', + '{"type":"json"}')) + EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" BIGINT, "bytes" BIGINT) +) +SELECT + FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time, + "srcIP", + "dstIP", + SUM("bytes") AS "bytes", + SUM("packets") AS "packets", + COUNT(*) AS "count" +FROM "inline_data" +GROUP BY 1, 2, 3 +PARTITIONED BY DAY ``` -Rollup has been enabled by setting `"rollup" : true` in the `granularitySpec`. - -Note that we have `srcIP` and `dstIP` defined as dimensions, a longSum metric is defined for the `packets` and `bytes` columns, and the `queryGranularity` has been defined as `minute`. +In the query, you group by dimensions, `timestamp`, `srcIP`, and `dstIP`. Note that the query uses the `FLOOR` function to bucket rows based on MINUTE granularity. +For the metrics, you apply aggregations to sum the `bytes` and `packets` columns and add a column that counts the number of rows that get rolled-up. Review Comment: It seems better to put timestamp first, so you can describe metrics and dimensions together. Or maybe a list like follows: ```suggestion Note the following aspects of the ingestion statement: * You transform the timestamp field using the `FLOOR` function to round timestamps down to the minute. * You group by the dimensions `timestamp`, `srcIP`, and `dstIP`. * You create the `bytes` and `packets` metrics, which are summed from their respective input fields. * You also create the `count` metric that records the number of rows that get rolled-up per each row in the datasource. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org