ektravel commented on code in PR #16762:
URL: https://github.com/apache/druid/pull/16762#discussion_r1688371371
##########
docs/tutorials/tutorial-rollup.md:
##########
@@ -49,150 +49,101 @@ For this tutorial, we'll use a small sample of network flow event data, represen
 {"timestamp":"2018-01-02T21:35:45Z","srcIP":"7.7.7.7", "dstIP":"8.8.8.8","packets":12,"bytes":2818}
 ```
-A file containing this sample input data is located at `quickstart/tutorial/rollup-data.json`.
+This tutorial shows you how to ingest this data using rollup.
-We'll ingest this data using the following ingestion task spec, located at `quickstart/tutorial/rollup-index.json`.
+## Load the example data
-```json
-{
-  "type" : "index_parallel",
-  "spec" : {
-    "dataSchema" : {
-      "dataSource" : "rollup-tutorial",
-      "dimensionsSpec" : {
-        "dimensions" : [
-          "srcIP",
-          "dstIP"
-        ]
-      },
-      "timestampSpec": {
-        "column": "timestamp",
-        "format": "iso"
-      },
-      "metricsSpec" : [
-        { "type" : "count", "name" : "count" },
-        { "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
-        { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
-      ],
-      "granularitySpec" : {
-        "type" : "uniform",
-        "segmentGranularity" : "week",
-        "queryGranularity" : "minute",
-        "intervals" : ["2018-01-01/2018-01-03"],
-        "rollup" : true
-      }
-    },
-    "ioConfig" : {
-      "type" : "index_parallel",
-      "inputSource" : {
-        "type" : "local",
-        "baseDir" : "quickstart/tutorial",
-        "filter" : "rollup-data.json"
-      },
-      "inputFormat" : {
-        "type" : "json"
-      },
-      "appendToExisting" : false
-    },
-    "tuningConfig" : {
-      "type" : "index_parallel",
-      "partitionsSpec": {
-        "type": "dynamic"
-      },
-      "maxRowsInMemory" : 25000
-    }
-  }
-}
+Load the sample dataset using the INSERT and EXTERN functions. The EXTERN function lets you read external data or write to an external location.
+
+In the Druid web console, go to the Query view and run the following query:
+
+```sql
+INSERT INTO "rollup_tutorial"
+WITH "inline_data" AS (
+  SELECT *
+  FROM TABLE(EXTERN('{
+    "type":"inline",
+    "data":"{\"timestamp\":\"2018-01-01T01:01:35Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":20,\"bytes\":9024}\n{\"timestamp\":\"2018-01-01T01:02:14Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-01T01:01:59Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":11,\"bytes\":5780}\n{\"timestamp\":\"2018-01-01T01:01:51Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":255,\"bytes\":21133}\n{\"timestamp\":\"2018-01-01T01:02:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":377,\"bytes\":359971}\n{\"timestamp\":\"2018-01-01T01:03:29Z\",\"srcIP\":\"1.1.1.1\",\"dstIP\":\"2.2.2.2\",\"packets\":49,\"bytes\":10204}\n{\"timestamp\":\"2018-01-02T21:33:14Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":38,\"bytes\":6289}\n{\"timestamp\":\"2018-01-02T21:33:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":123,\"bytes\":93999}\n{\"timestamp\":\"2018-01-02T21:35:45Z\",\"srcIP\":\"7.7.7.7\",\"dstIP\":\"8.8.8.8\",\"packets\":12,\"bytes\":2818}"}',
+    '{"type":"json"}'))
+  EXTEND ("timestamp" VARCHAR, "srcIP" VARCHAR, "dstIP" VARCHAR, "packets" BIGINT, "bytes" BIGINT)
+)
+SELECT
+  FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time,
+  "srcIP",
+  "dstIP",
+  SUM("bytes") AS "bytes",
+  COUNT(*) AS "count",
+  SUM("packets") AS "packets"
+FROM "inline_data"
+GROUP BY 1, 2, 3
+PARTITIONED BY DAY
 ```
-Rollup has been enabled by setting `"rollup" : true` in the `granularitySpec`.
+Note that the query uses the `FLOOR` function to give `__time` a granularity of `MINUTE`.
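As an aside (not part of the diff): readers can see what the `FLOOR` call does in isolation, since Druid SQL accepts a scalar `SELECT` without a `FROM` clause. A minimal sketch to run in the Query view:

```sql
-- TIME_PARSE converts the ISO 8601 string into a timestamp, and
-- FLOOR(... TO MINUTE) truncates it to the start of its minute, so every
-- event from 01:01:00 through 01:01:59 lands in the same one-minute bucket.
SELECT
  TIME_PARSE('2018-01-01T01:01:35Z') AS "raw_time",                      -- 2018-01-01T01:01:35.000Z
  FLOOR(TIME_PARSE('2018-01-01T01:01:35Z') TO MINUTE) AS "floored_time"  -- 2018-01-01T01:01:00.000Z
```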
+The query defines the dimensions of the rollup by grouping columns 1, 2, and 3, which correspond to the `timestamp`, `srcIP`, and `dstIP` columns. The query defines the metrics of the rollup by aggregating the `bytes` and `packets` columns.
-Note that we have `srcIP` and `dstIP` defined as dimensions, a longSum metric is defined for the `packets` and `bytes` columns, and the `queryGranularity` has been defined as `minute`.
+After the ingestion completes, you can query the data.
-We will see how these definitions are used after we load this data.
-
-## Load the example data
+## Query the example data
-From the apache-druid-{{DRUIDVERSION}} package root, run the following command:
+Open a new tab in the Query view and run the following query to see what data was ingested:
-```bash
-bin/post-index-task --file quickstart/tutorial/rollup-index.json --url http://localhost:8081
+```sql
+SELECT * FROM "rollup_tutorial"
 ```
-After the script completes, we will query the data.
+The query returns the following results:
-## Query the example data
+| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` |
+| -- | -- | -- | -- | -- | -- |
+| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` |
+| `2018-01-01T01:02:00.000Z` | `1.1.1.1` | `2.2.2.2` | `366,260` | `2` | `415` |
+| `2018-01-01T01:03:00.000Z` | `1.1.1.1` | `2.2.2.2` | `10,204` | `1` | `49` |
+| `2018-01-02T21:33:00.000Z` | `7.7.7.7` | `8.8.8.8` | `100,288` | `2` | `161` |
+| `2018-01-02T21:35:00.000Z` | `7.7.7.7` | `8.8.8.8` | `2,818` | `1` | `12` |
-Let's run `bin/dsql` and issue a `select * from "rollup-tutorial";` query to see what data was ingested.
-
-```bash
-$ bin/dsql
-Welcome to dsql, the command-line client for Druid SQL.
-Type "\h" for help.
-dsql> select * from "rollup-tutorial";
-┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
-│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
-├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
-│ 2018-01-01T01:01:00.000Z │  35937 │     3 │ 2.2.2.2 │     286 │ 1.1.1.1 │
-│ 2018-01-01T01:02:00.000Z │ 366260 │     2 │ 2.2.2.2 │     415 │ 1.1.1.1 │
-│ 2018-01-01T01:03:00.000Z │  10204 │     1 │ 2.2.2.2 │      49 │ 1.1.1.1 │
-│ 2018-01-02T21:33:00.000Z │ 100288 │     2 │ 8.8.8.8 │     161 │ 7.7.7.7 │
-│ 2018-01-02T21:35:00.000Z │   2818 │     1 │ 8.8.8.8 │      12 │ 7.7.7.7 │
-└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
-Retrieved 5 rows in 1.18s.
-
-dsql>
-```
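A further aside on querying rolled-up data (a sketch, not part of the diff): because each stored row now represents one or more input events, the number of original events comes from summing the `count` metric, not from `COUNT(*)`:

```sql
-- COUNT(*) counts stored rows in the datasource: 5 after rollup.
-- SUM("count") adds up the per-row event counts, recovering the
-- 9 original input events.
SELECT
  COUNT(*)     AS "rolled_up_rows",
  SUM("count") AS "original_events"
FROM "rollup_tutorial"
```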
-Let's look at the three events in the original input data that occurred during `2018-01-01T01:01`:
+Consider the three events in the original input data that occur over the course of minute `2018-01-01T01:01`:
 ```json
 {"timestamp":"2018-01-01T01:01:35Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":20,"bytes":9024}
 {"timestamp":"2018-01-01T01:01:51Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":255,"bytes":21133}
 {"timestamp":"2018-01-01T01:01:59Z","srcIP":"1.1.1.1", "dstIP":"2.2.2.2","packets":11,"bytes":5780}
 ```
-These three rows have been "rolled up" into the following row:
+Druid combines the three rows into the following during rollup:
-```bash
-┌──────────────────────────┬────────┬───────┬─────────┬─────────┬─────────┐
-│ __time                   │ bytes  │ count │ dstIP   │ packets │ srcIP   │
-├──────────────────────────┼────────┼───────┼─────────┼─────────┼─────────┤
-│ 2018-01-01T01:01:00.000Z │  35937 │     3 │ 2.2.2.2 │     286 │ 1.1.1.1 │
-└──────────────────────────┴────────┴───────┴─────────┴─────────┴─────────┘
-```
+| `__time` | `srcIP` | `dstIP` | `bytes` | `count` | `packets` |
+| -- | -- | -- | -- | -- | -- |
+| `2018-01-01T01:01:00.000Z` | `1.1.1.1` | `2.2.2.2` | `35,937` | `3` | `286` |
 The input rows have been grouped by the timestamp and dimension columns `{timestamp, srcIP, dstIP}` with sum aggregations on the metric columns `packets` and `bytes`.
-Before the grouping occurs, the timestamps of the original input data are bucketed/floored by minute, due to the `"queryGranularity":"minute"` setting in the ingestion spec.
+Before the grouping occurs, the timestamps of the original input data are bucketed/floored by minute, due to the `FLOOR(TIME_PARSE("timestamp") TO MINUTE)` function in the query.
-Likewise, these two events that occurred during `2018-01-01T01:02` have been rolled up:
+Consider the two events in the original input data that occur over the course of minute `2018-01-01T01:02`:

Review Comment:
```suggestion
Consider the two events in the original input data that occur over the course of the minute `2018-01-01T01:02`:
```
Should this say "the course of the minute" or "the course of one minute"?
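One last aside on the rolled-up row discussed above (a sketch, not part of the diff or the suggestion): you can verify a single minute's rollup by filtering on the bucketed timestamp, assuming Druid SQL's `TIMESTAMP` literal syntax:

```sql
-- Fetch only the 01:01 bucket. The sums match the raw events:
-- bytes:   9024 + 21133 + 5780 = 35937
-- packets:   20 +   255 +   11 =   286
-- count:   three input rows    =     3
SELECT *
FROM "rollup_tutorial"
WHERE "__time" = TIMESTAMP '2018-01-01 01:01:00'
```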