2bethere commented on code in PR #16515:
URL: https://github.com/apache/druid/pull/16515#discussion_r1619536941


##########
docs/tutorials/tutorial-latest-by.md:
##########
@@ -0,0 +1,219 @@
+---
+id: tutorial-latest-by
+title: Query for latest values
+sidebar_label: Query for latest and deduplicated data
+description: How to use LATEST_BY or deltas for up-to-date values
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+This tutorial describes strategies in Apache Druid for use cases that might be handled by UPSERT in other databases. You can use the LATEST_BY aggregation at query time or "deltas" for numeric dimensions at insert time.
+
+The [Update data](./tutorial-update-data.md) tutorial demonstrates how to use batch operations to update data according to the timestamp, including UPSERT cases. However, with streaming data, you can potentially use LATEST_BY or deltas to satisfy requirements otherwise handled with updates.
+
+## Prerequisites
+
+Before you follow the steps in this tutorial, download Druid as described in the [Local quickstart](index.md) and have it running on your local machine. You don't need to load any data into the Druid cluster.
+
+You should be familiar with data querying in Druid. If you haven't already, go through the [Query data](../tutorials/tutorial-query.md) tutorial first.
+
+## Use LATEST_BY to retrieve updated values
+
+Sometimes you want to read the latest value of one dimension or measure as it relates to another dimension. In a transactional database, you might maintain the dimension or measure using UPSERT, but in Druid you can append all updates or changes during ingestion. The LATEST_BY function lets you get the most recent value for the dimension with the following type of query:
+
+```sql
+SELECT dimension,
+       LATEST_BY(changed_dimension, update_timestamp)
+FROM my_table
+GROUP BY 1
+```
+
+For example, consider the following table of events that log the total number of points over time for a user:
+
+| `__time` | `user_id` | `points` |
+| --- | --- | --- |
+| `2024-01-01T01:00:00.000Z`|`funny_bunny1`| 10 |
+| `2024-01-01T01:05:00.000Z`|`funny_bunny1`| 30 |
+| `2024-01-01T02:00:00.000Z`|`funny_bunny1`| 35 |
+| `2024-01-01T02:00:00.000Z`|`silly_monkey2`| 30 |
+| `2024-01-01T02:05:00.000Z`|`silly_monkey2`| 55 |
+| `2024-01-01T03:00:00.000Z`|`funny_bunny1`| 40 |
+
+<details>
+<summary>Insert sample data</summary>
+
+```sql
+REPLACE INTO "latest_by_tutorial1" OVERWRITE ALL
+WITH "ext" AS (
+  SELECT *
+  FROM TABLE(
+    EXTERN(
+     '{"type":"inline","data":"{\"timestamp\":\"2024-01-01T01:00:00Z\",\"user_id\":\"funny_bunny1\",\"points\":10}\n{\"timestamp\":\"2024-01-01T01:05:00Z\",\"user_id\":\"funny_bunny1\",\"points\":30}\n{\"timestamp\":\"2024-01-01T02:00:00Z\",\"user_id\":\"funny_bunny1\",\"points\":35}\n{\"timestamp\":\"2024-01-01T02:00:00Z\",\"user_id\":\"silly_monkey2\",\"points\":30}\n{\"timestamp\":\"2024-01-01T02:05:00Z\",\"user_id\":\"silly_monkey2\",\"points\":55}\n{\"timestamp\":\"2024-01-01T03:00:00Z\",\"user_id\":\"funny_bunny1\",\"points\":40}"}',
+     '{"type":"json"}'
+    )
+  ) EXTEND ("timestamp" VARCHAR, "user_id" VARCHAR, "points" BIGINT)
+)
+SELECT
+  TIME_PARSE("timestamp") AS "__time",
+  "user_id",
+  "points"
+FROM "ext"
+PARTITIONED BY DAY
+```
+</details>
+
+The following query gives us the most recent `points` value for each `user_id`:
+
+```sql
+SELECT user_id,
+       LATEST_BY("points", "__time") AS latest_points
+FROM latest_by_tutorial1
+GROUP BY 1
+```
+
+The query returns the following results:
+
+| `user_id` | `latest_points` |
+| --- | --- |
+| `silly_monkey2` | 55 |
+| `funny_bunny1` | 40 |
+
+In the example, the values increase each time, but this method works even if the values fluctuate up and down.
+
+You can use this query shape as a subquery to do additional processing, as shown in the sketch below. However, if there are a lot of values for `user_id`, the query can be expensive.
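+
+For example, here is a minimal sketch that wraps the query above in a subquery; the outer `AVG` aggregation and the `avg_latest_points` alias are illustrative additions, not part of the tutorial datasource:
+
+```sql
+-- Average each user's latest score; the inner query is the
+-- LATEST_BY query from above.
+SELECT AVG("latest_points") AS "avg_latest_points"
+FROM (
+  SELECT user_id,
+         LATEST_BY("points", "__time") AS latest_points
+  FROM latest_by_tutorial1
+  GROUP BY 1
+)
+```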
+
+If you want to track the latest value for different times within a larger time frame, you need an additional timestamp that records update times so Druid can track the latest version. Consider the following data, which represents points for various users updated within an hour time frame:
+
+| `__time` | `updated_timestamp` | `user_id` | `points` |
+| --- | --- | --- | --- |
+| `2024-01-01T01:00:00.000Z` | `2024-01-01T01:00:00.000Z` | `funny_bunny1` | 10 |
+| `2024-01-01T01:00:00.000Z` | `2024-01-01T01:05:00.000Z` | `funny_bunny1` | 30 |
+| `2024-01-01T02:00:00.000Z` | `2024-01-01T02:00:00.000Z` | `funny_bunny1` | 35 |
+| `2024-01-01T02:00:00.000Z` | `2024-01-01T02:00:00.000Z` | `silly_monkey2` | 30 |
+| `2024-01-01T02:00:00.000Z` | `2024-01-01T02:05:00.000Z` | `silly_monkey2` | 55 |
+| `2024-01-01T03:00:00.000Z` | `2024-01-01T03:00:00.000Z` | `funny_bunny1` | 40 |
+
+<details>
+<summary>Insert sample data</summary>
+
+```sql
+REPLACE INTO "latest_by_tutorial2" OVERWRITE ALL
+WITH "ext" AS (
+  SELECT *
+  FROM TABLE(
+    EXTERN(
+     '{"type":"inline","data":"{\"timestamp\":\"2024-01-01T01:00:00Z\",\"updated_timestamp\":\"2024-01-01T01:00:00Z\",\"user_id\":\"funny_bunny1\",\"points\":10}\n{\"timestamp\":\"2024-01-01T01:00:00Z\",\"updated_timestamp\":\"2024-01-01T01:05:00Z\",\"user_id\":\"funny_bunny1\",\"points\":30}\n{\"timestamp\":\"2024-01-01T02:00:00Z\",\"updated_timestamp\":\"2024-01-01T02:00:00Z\",\"user_id\":\"funny_bunny1\",\"points\":35}\n{\"timestamp\":\"2024-01-01T02:00:00Z\",\"updated_timestamp\":\"2024-01-01T02:00:00Z\",\"user_id\":\"silly_monkey2\",\"points\":30}\n{\"timestamp\":\"2024-01-01T02:00:00Z\",\"updated_timestamp\":\"2024-01-01T02:05:00Z\",\"user_id\":\"silly_monkey2\",\"points\":55}\n{\"timestamp\":\"2024-01-01T03:00:00Z\",\"updated_timestamp\":\"2024-01-01T03:00:00Z\",\"user_id\":\"funny_bunny1\",\"points\":40}"}',
+     '{"type":"json"}'
+    )
+  ) EXTEND ("timestamp" VARCHAR, "updated_timestamp" VARCHAR, "user_id" VARCHAR, "points" BIGINT)
+)
+SELECT
+  TIME_PARSE("timestamp") AS "__time",
+  "updated_timestamp",
+  "user_id",
+  "points"
+FROM "ext"
+PARTITIONED BY DAY
+```
+</details>
+
+
+The following query retrieves the latest points value by user for each hour:
+
+```sql
+SELECT FLOOR("__time" TO HOUR) AS "hour_time",
+       "user_id",
+       LATEST_BY("points", TIME_PARSE("updated_timestamp")) AS "latest_points_hour"
+FROM latest_by_tutorial2
+GROUP BY 1,2
+```
+
+The results are as follows:
+
+| `hour_time` | `user_id` | `latest_points_hour` |
+|---|---|---|
+|`2024-01-01T01:00:00.000Z`|`funny_bunny1`|30|
+|`2024-01-01T02:00:00.000Z`|`funny_bunny1`|35|
+|`2024-01-01T02:00:00.000Z`|`silly_monkey2`|55|
+|`2024-01-01T03:00:00.000Z`|`funny_bunny1`|40|
+
+To mitigate the cost of these kinds of queries, you can set up a periodic batch ingestion job that reindexes modified data into a new datasource that you can query directly without grouping.
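+
+For example, here is a minimal sketch of such a reindexing job, assuming a hypothetical target datasource named `latest_points_per_hour`:
+
+```sql
+-- Hypothetical periodic job: pre-compute the latest points per user and hour
+-- into a separate datasource that can be queried without grouping.
+REPLACE INTO "latest_points_per_hour" OVERWRITE ALL
+SELECT
+  FLOOR("__time" TO HOUR) AS "__time",
+  "user_id",
+  LATEST_BY("points", TIME_PARSE("updated_timestamp")) AS "latest_points"
+FROM "latest_by_tutorial2"
+GROUP BY 1, 2
+PARTITIONED BY DAY
+```
+
+The trade-off is freshness: the pre-computed values are only as current as the most recent reindexing run.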

Review Comment:
   ```suggestion
   LATEST_BY() is an aggregation function. It is efficient when only a few update rows match a given dimension value (for example, `user_id`), but it scans all rows with the same dimension value. For any combination of dimensions with many updates (for example, a user who has played the game a million times), especially when the updates don't arrive in timely order, Druid must process every row matching the `user_id` to find the row with the maximum timestamp and return the latest data.
   
   As a rough guide, if updates constitute 1-5% of your data, you'll get good query performance; if updates constitute 50% or more of your data, your queries will be slow.
   
   To mitigate this, you can set up a periodic batch ingestion job that reindexes modified data into a new datasource for direct querying without grouping. This reduces the cost of these kinds of queries by essentially pre-computing the latest values and storing them. Note, however, that your view of the latest data will not be up to date until the next refresh happens.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

