(kudu) branch gh-pages updated: [blog] blogpost about auto-incrementing column in Kudu

alexey Tue, 23 Apr 2024 16:47:21 -0700

This is an automated email from the ASF dual-hosted git repository.

alexey pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/kudu.git



The following commit(s) were added to refs/heads/gh-pages by this push:
     new b7bf6b03b [blog] blogpost about auto-incrementing column in Kudu
b7bf6b03b is described below

commit b7bf6b03b01a7c8f561d9bd7ff132fc255b2209f
Author: Abhishek Chennaka <achenn...@cloudera.com>
AuthorDate: Thu Mar 7 17:43:30 2024 -0800

    [blog] blogpost about auto-incrementing column in Kudu
    
    Change-Id: I39f34eea6877a8e050ba2e187ff71555256bf797
    Reviewed-on: http://gerrit.cloudera.org:8080/21119
    Reviewed-by: Alexey Serbin <ale...@apache.org>
    Tested-by: Abhishek Chennaka <achenn...@cloudera.com>
---
 ...4-03-07-introducing-auto-incrementing-column.md | 178 +++++++++++++++++++++
 1 file changed, 178 insertions(+)

diff --git a/_posts/2024-03-07-introducing-auto-incrementing-column.md 
b/_posts/2024-03-07-introducing-auto-incrementing-column.md
new file mode 100644
index 000000000..d43bf442c
--- /dev/null
+++ b/_posts/2024-03-07-introducing-auto-incrementing-column.md
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Introducing Auto-incrementing Column in Kudu"
+author: Abhishek Chennaka
+---
+<!--more-->
+
+# Introduction
+
+Kudu has a strict requirement for a primary key presence in a table. This is 
primarily to help in
+point lookups and support DELETE and UPDATE operations on the table data. 
There are situations where
+users are unable to define a unique primary key in their data set and have to 
either introduce
+additional columns to be a part of the primary key or define a new column and 
maintain it to enforce
+uniqueness. Kudu 1.17 has introduced support for the auto-incrementing column 
to have partially
+defined primary keys (keys which are not unique across the table) during table 
creation. This way a
+user does not have to worry about the uniqueness constraint when defining a 
primary key.
+
+# Implementation Details
+
+When a primary key is partially defined, Kudu internally creates a new column 
named
+“auto_incrementing_id” as a part of the primary key. The column is populated 
with a monotonically
+increasing counter. The system updates the counter value upon every INSERT 
operation and populates
+the "auto_incrementing_id" column on the server side. The counter is 
partition-local i.e. every
+tablet has its own counter.
+
+## Server Side
+
+When a user writes data into a table with the auto-incrementing column, the 
server makes sure that
+no INSERT operations have the “auto_incrementing_id” column field value set 
and populates this
+column value. The highest value of the counter written into the 
"auto_incrementing_id" column
+until any particular point is stored in memory and this is used to set the 
column value for the
+next INSERT operation.
+
+## Client Side
+
+When creating a table without an explicitly defined primary key, users will 
have to declare the key
+as non-unique. Internally, the client builds a schema with an extra column 
named
+“auto_incrementing_id” and forwards the request to the server where the table 
is created. For
+INSERT operations, the user shouldn’t specify the “auto_incrementing_id” 
column value as it will be
+populated on the server side.
+
+### Impala Integration
+
+In Impala, the new column is not exposed to the user by default. This is due 
to the reason that it
+is not a part of the user table schema. The below query will not return the 
“auto_incrementing_id”
+column
+SELECT \* FROM &lt;tablename&gt;
+
+If the auto-incrementing column's data is needed, the column name has to be 
specifically requested.
+The below query will return the column values:
+SELECT \*, auto_incrementing_id FROM &lt;tablename&gt;
+
+#### Examples
+
+Create a table with two columns and two hash partitions:
+
+```
+default> CREATE TABLE demo_table(id INT NON UNIQUE PRIMARY KEY, name STRING) 
PARTITION BY HASH (id) PARTITIONS 2 STORED AS KUDU;
+Query: CREATE TABLE demo_table(id INT NON UNIQUE PRIMARY KEY, name STRING) 
PARTITION BY HASH (id) PARTITIONS 2 STORED AS KUDU
++-------------------------+
+| summary                 |
++-------------------------+
+| Table has been created. |
++-------------------------+
+Fetched 1 row(s) in 3.94s
+```
+
+Describe the table:
+
+```
+default> DESCRIBE demo_table;
+Query: DESCRIBE demo_table
++----------------------+--------+---------+-------------+------------+----------+---------------+---------------+---------------------+------------+
+| name                 | type   | comment | primary_key | key_unique | 
nullable | default_value | encoding      | compression         | block_size |
++----------------------+--------+---------+-------------+------------+----------+---------------+---------------+---------------------+------------+
+| id                   | int    |         | true        | false      | false   
 |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
+| auto_incrementing_id | bigint |         | true        | false      | false   
 |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
+| name                 | string |         | false       |            | true    
 |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
++----------------------+--------+---------+-------------+------------+----------+---------------+---------------+---------------------+------------+
+```
+
+<br/>Insert rows with duplicate partial primary key column values:
+
+```
+default> INSERT INTO demo_table VALUES (1, 'John'), (2, 'Bob'), (3, 'Mary'), 
(1, 'Joe');
+Query: INSERT INTO demo_table VALUES (1, 'John'), (2, 'Bob'), (3, 'Mary'), (1, 
'Joe')
+..
+Modified 4 row(s), 0 row error(s) in 0.41s
+```
+<br/>Scan the table (notice the duplicate values in the 'id' column):
+
+```
+default> SELECT * FROM demo_table;
+Query: SELECT * FROM demo_table
+..
++----+------+
+| id | name |
++----+------+
+| 3  | Mary |
+| 1  | John |
+| 1  | Joe  |
+| 2  | Bob  |
++----+------+
+Fetched 4 row(s) in 0.24s
+```
+
+Explicitly specify the auto-incrementing column name to fetch the column 
values:
+```
+default> SELECT *, auto_incrementing_id FROM demo_table;
+Query: SELECT *, auto_incrementing_id FROM demo_table
+..
++----+------+----------------------+
+| id | name | auto_incrementing_id |
++----+------+----------------------+
+| 3  | Mary | 1                    |
+| 1  | John | 1                    |
+| 1  | Joe  | 2                    |
+| 2  | Bob  | 2                    |
++----+------+----------------------+
+Fetched 4 row(s) in 0.24s
+```
+
+Update and Delete rows:
+```
+default> UPDATE demo_table SET name='Matt' WHERE id=3;
+Query: UPDATE demo_table SET name='Matt' WHERE id=3
+Modified 1 row(s), 0 row error(s) in 1.99s
+default> UPDATE demo_table SET name='Liam' WHERE id=1;
+Query: UPDATE demo_table SET name='Liam' WHERE id=1
+Modified 2 row(s), 0 row error(s) in 2.15s
+default> DELETE FROM demo_table where id=2;
+Query: DELETE FROM demo_table where id=2;
+Modified 1 row(s), 0 row error(s) in 1.40s
+```
+Scan all the columns of the table:
+```
+default> SELECT *, auto_incrementing_id FROM demo_table;
+Query: SELECT *, auto_incrementing_id FROM demo_table
+..
++----+------+----------------------+
+| id | name | auto_incrementing_id |
++----+------+----------------------+
+| 3  | Matt | 1                    |
+| 1  | Liam | 1                    |
+| 1  | Liam | 2                    |
++----+------+----------------------+
+Fetched 3 row(s) in 0.20s
+```
+#### Limitations
+
+Impala doesn’t support UPSERT operations on tables with the auto-incrementing 
column as of writing
+this article.
+
+### Kudu clients(Java, C++, Python)
+
+Unlike in Impala, scanning the table fetches all the table data including the 
auto incrementing column.
+There is no need to explicitly request the auto-incrementing column.
+
+There is also support for UPSERT operations but the user has to provide the 
entire primary key
+including the value for the auto-incrementing column. If the row is present, 
it will be considered a
+regular UPDATE operation. If the row is not present, it is considered an 
INSERT operation.
+
+#### Examples
+
+<https://github.com/apache/kudu/blob/master/examples/cpp/non_unique_primary_key.cc>
+
+<https://github.com/apache/kudu/blob/master/examples/python/basic-python-example/non_unique_primary_key.py>
+
+##Backup and Restore
+
+The Kudu backup tool from Kudu 1.17 and later supports backing up tables with 
the
+auto-incrementing column. The prior backup tools will fail with an error 
message -
+"auto_incrementing_id is a reserved column name" during backup and restore 
operations.
+
+The backed up data (from Kudu 1.17 and later) includes the auto-incrementing 
column in the table
+schema and the column values as well. Restoring this backed up table with the 
Kudu restore tool
+will create a table with the auto-incrementing column and the column values 
identical to the
+original source table.
\ No newline at end of file

(kudu) branch gh-pages updated: [blog] blogpost about auto-incrementing column in Kudu

Reply via email to