[ 
https://issues.apache.org/jira/browse/KUDU-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780130#comment-17780130
 ] 

ASF subversion and git services commented on KUDU-3483:
-------------------------------------------------------

Commit 453d3aa5515d73ab7df8a41a38ed56cbfdfbc6b9 in kudu's branch 
refs/heads/branch-1.17.x from xinghuayu007
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=453d3aa55 ]

KUDU-3483 Fix flushing data in batch when table schema changed

In AUTO_FLUSH_BACKGROUND or MANUAL_FLUSH mode, applying an operation
first inserts the row into the buffer. When the buffer is full or
flush() is called, the client tries to flush multiple rows to the
Kudu server. First, it groups the buffered rows by tablet id into
batches; a batch may contain multiple rows that belong to the same
tablet. Each batch is then encoded into bytes. At that point, the
encoder reads the table schema of the first row and uses it to decide
the format of the data. If two rows in the same tablet's batch have
different schemas, for example because the table was altered between
the two inserts, encoding fails with an array index out-of-bounds
exception.

This patch validates the schemas of the rows that belong to the same
tablet. Rows with different schemas are put into different groups and
sent as separate batches.

Change-Id: Ie6501962b32814d121f180b2942999c402d927db
Reviewed-on: http://gerrit.cloudera.org:8080/19949
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <ale...@apache.org>
(cherry picked from commit c9e6e36a742c1164bf20a1913a50b9bd03807ed7)
Reviewed-on: http://gerrit.cloudera.org:8080/20622
Tested-by: Alexey Serbin <ale...@apache.org>
Reviewed-by: Wang Xixu <1450306...@qq.com>
Reviewed-by: Yifan Zhang <chinazhangyi...@163.com>
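
The grouping idea behind the fix can be sketched as keying each pending
batch by both tablet id and schema version, so rows written under
different schemas never share a batch. This is a minimal illustration
with hypothetical names (PendingOp, groupIntoBatches), not the actual
Kudu client internals:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchBySchema {
    // Hypothetical pending operation: the tablet it routes to plus the
    // schema version the row was built against.
    record PendingOp(String tabletId, int schemaVersion, String row) {}

    // Group operations so that each batch holds rows for exactly one
    // tablet AND one schema version. Grouping by tablet id alone is
    // what let rows with different schemas end up in the same batch.
    static Map<List<Object>, List<PendingOp>> groupIntoBatches(List<PendingOp> buffer) {
        Map<List<Object>, List<PendingOp>> batches = new LinkedHashMap<>();
        for (PendingOp op : buffer) {
            batches.computeIfAbsent(List.of(op.tabletId(), op.schemaVersion()),
                                    k -> new ArrayList<>()).add(op);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<PendingOp> buffer = List.of(
            new PendingOp("tablet-1", 1, "row-a"),  // written before ALTER TABLE
            new PendingOp("tablet-1", 2, "row-b"),  // written after ALTER TABLE
            new PendingOp("tablet-1", 2, "row-c"));
        Map<List<Object>, List<PendingOp>> batches = groupIntoBatches(buffer);
        // Same tablet, but two batches: one per schema version.
        System.out.println(batches.size());
    }
}
```

With the composite key, the pre-ALTER row and the two post-ALTER rows
for the same tablet land in two separate batches, so each batch can be
encoded against a single consistent schema.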


> Flushing data in AUTO_FLUSH_BACKGROUND mode fails when the table's schema is 
> changing
> -------------------------------------------------------------------------------------
>
>                 Key: KUDU-3483
>                 URL: https://issues.apache.org/jira/browse/KUDU-3483
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Xixu Wang
>            Priority: Major
>             Fix For: 1.18.0
>
>         Attachments: image-2023-05-30-16-12-20-361.png
>
>
>  
> *1.The problem*
> Flushing multiple rows in AUTO_FLUSH_BACKGROUND mode may fail when the table
> schema has changed. The following is the error message:
> !image-2023-05-30-16-12-20-361.png!
>  
> *2.How to repeat the case*
> 1. Create a table with 2 columns.
> 2. Insert a row into the table in AUTO_FLUSH_BACKGROUND mode.
> 3. Add 3 new columns to the table.
> 4. Reopen the table.
> 5. Insert another row into the table in AUTO_FLUSH_BACKGROUND mode.
> 6. Flush the buffer.
> {code:java}
> KuduTable table = createTable(ImmutableList.of());
> final KuduSession session = client.newSession();
> session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
>
> // Insert a row against the original two-column schema.
> Insert insert = table.newInsert();
> PartialRow row = insert.getRow();
> row.addInt("c0", 101);
> row.addInt("c1", 101);
> session.apply(insert);
>
> // Add some new columns.
> client.alterTable(tableName, new AlterTableOptions()
>     .addColumn("addNonNull", Type.INT32, 100)
>     .addNullableColumn("addNullable", Type.INT32)
>     .addNullableColumn("addNullableDef", Type.INT32, 200));
>
> // Reopen the table to pick up the new schema.
> table = client.openTable(tableName);
> assertEquals(5, table.getSchema().getColumnCount());
>
> // Insert a row against the new five-column schema, with addNullableDef=null.
> Insert newInsert = table.newInsert();
> PartialRow newRow = newInsert.getRow();
> newRow.addInt("c0", 101);
> newRow.addInt("c1", 101);
> newRow.addInt("addNonNull", 101);
> newRow.addInt("addNullable", 101);
> newRow.setNull("addNullableDef");
> session.apply(newInsert);
> session.flush();
> {code}
>  
> *3.Why this problem happened*
> In AUTO_FLUSH_BACKGROUND mode, applying an operation first inserts the row
> into the buffer. When the buffer is full or flush() is called, the client
> tries to flush multiple rows to the Kudu server. First, it groups the
> buffered rows by tablet id into batches; a batch may contain multiple rows
> that belong to the same tablet. Each batch is then encoded into bytes. At
> that point, the encoder reads the table schema of the first row and uses it
> to decide the format of the data. If two rows in the same batch have
> different schemas, because the table was altered between the two inserts,
> encoding fails with an array index out-of-bounds exception.
>  
> As a side note, the whole process is hard to trace, especially on the Kudu
> tablet server; it would be better to log the downstream IP and client id.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
