[ https://issues.apache.org/jira/browse/IMPALA-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302967#comment-17302967 ]

Wenzhe Zhou commented on IMPALA-10564:
--------------------------------------

Since we write column by column when writing a row, we have to rewind the table 
writer past a partially written row if we want to skip a row with invalid 
column data.

I read the code of the table writers for the different formats to confirm 
whether we can rewind the writer for a partially written row. It seems not 
hard for the Kudu, Text and HBase formats, since they buffer row data before 
writing a row to the table, but it's really hard to find a good way for 
Parquet to skip a partially written row.

The Kudu table sink (KuduTableSink::Send() in kudu-table-sink.cc) creates a 
KuduWriteOperation object for each row and pushes the object into a vector 
after adding all columns. We could change the code to not push the 
KuduWriteOperation object into the vector if any column has invalid data, as 
in the sketch below.
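
A self-contained sketch of the shape of that change; RowOp and SetColumn are 
stand-ins I'm using for KuduWriteOperation and the real per-column evaluation, 
and the overflow check is just a placeholder, not the actual Impala code:

// Sketch only: build a per-row write operation, but only enqueue it into the
// batch vector if every column converts cleanly.
#include <cstdio>
#include <memory>
#include <vector>

struct RowOp { std::vector<double> cols; };  // stand-in for KuduWriteOperation

// Stand-in for per-column evaluation; returns false on e.g. decimal overflow.
bool SetColumn(RowOp* op, double val, double max_abs) {
  if (val > max_abs || val < -max_abs) return false;
  op->cols.push_back(val);
  return true;
}

int AppendRows(const std::vector<std::vector<double>>& rows,
               std::vector<std::unique_ptr<RowOp>>* write_ops) {
  int skipped = 0;
  for (const auto& row : rows) {
    auto op = std::make_unique<RowOp>();
    bool ok = true;
    for (double v : row) {
      if (!SetColumn(op.get(), v, /*max_abs=*/1e6)) { ok = false; break; }
    }
    if (ok) {
      write_ops->push_back(std::move(op));  // only fully populated rows
    } else {
      ++skipped;  // the partially populated op is simply discarded
    }
  }
  return skipped;
}

int main() {
  std::vector<std::unique_ptr<RowOp>> ops;
  int skipped = AppendRows({{1, 2}, {1, 9e9}, {3, 4}}, &ops);
  std::printf("batched=%zu skipped=%d\n", ops.size(), skipped);  // 2 and 1
}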

The text table writer (HdfsTextTableWriter::AppendRows() in 
hdfs-text-table-writer.cc) uses a stringstream to buffer row data. The 
stringstream itself cannot be rewound, but we could save the ending offset of 
the last complete row and, if we hit an invalid column value while writing a 
new row, flush the stringstream only up to that saved offset, then reset the 
stringstream for the next row.
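
A small self-contained sketch of that save-offset approach (the row data and 
the failure condition are made up for illustration):

// Remember where the last complete row ends; on a bad column value, truncate
// the buffer back to that offset before continuing with the next row.
#include <iostream>
#include <sstream>
#include <string>

int main() {
  std::stringstream rowbuf;
  std::streamoff last_row_end = 0;

  rowbuf << "1,alice\n";
  last_row_end = rowbuf.tellp();  // ending offset of the last good row

  rowbuf << "2,";           // start writing a new row...
  bool column_ok = false;   // ...and hit an invalid column value
  if (!column_ok) {
    // "Rewind": keep only the bytes up to the last complete row.
    std::string good = rowbuf.str().substr(
        0, static_cast<std::string::size_type>(last_row_end));
    rowbuf.str(good);
    rowbuf.seekp(0, std::ios::end);  // resume appending after the good data
  }
  std::cout << rowbuf.str();  // prints only "1,alice"
}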

Although HBase is often described as a column-oriented database (where column 
data is stored together), the data for a particular row actually stays 
together in HBase; a column's data is spread across the table rather than 
stored contiguously. The HBase table writer (HBaseTableWriter::AppendRows() in 
hbase-table-writer.cc) creates one "Put" object for each row, collects a batch 
of "Put" objects in a Java ArrayList, and then writes the whole batch in one 
function call. We could change the code to not add the "Put" to the ArrayList 
right after creating it, and instead add it only after all columns have been 
successfully added to the "Put", as sketched below.

Parquet, by contrast, is a truly column-oriented format. The Parquet table 
writer (HdfsParquetTableWriter in hdfs-parquet-table-writer.cc) creates a 
BaseColumnWriter object for each column, and each BaseColumnWriter buffers 
column data in its own data page. When the current data page of a 
BaseColumnWriter is full, it is finalized and flushed by calling 
FinalizeCurrentPage(). Rewinding a BaseColumnWriter after its data page has 
been flushed would be really complicated, and since the data pages of 
different BaseColumnWriter objects are flushed independently, it's hard to 
rewind the table writer to skip a partially written row.
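
To illustrate why, here is a self-contained toy model of independently 
flushing column writers; the page capacities and values are made up, and 
FinalizeCurrentPage() is modeled by simply discarding the buffered page:

// By the time one column of a row fails, another column's value for that same
// row may already sit in a finalized page that cannot easily be reopened.
#include <cstdio>
#include <vector>

struct ColumnWriter {
  explicit ColumnWriter(std::size_t cap) : capacity(cap) {}
  std::size_t capacity;    // values per page, tiny to force early flushes
  std::vector<int> page;   // current in-memory data page
  int flushed_pages = 0;
  void Append(int v) {
    page.push_back(v);
    if (page.size() == capacity) {  // models FinalizeCurrentPage()
      ++flushed_pages;
      page.clear();  // buffered values are gone after this point
    }
  }
};

int main() {
  std::vector<ColumnWriter> cols;
  cols.emplace_back(1);  // flushes after every value
  cols.emplace_back(4);
  cols.emplace_back(4);
  for (int row = 0; row < 2; ++row)
    for (auto& c : cols) c.Append(row);  // rows 0 and 1 succeed
  cols[0].Append(2);  // row 2, col 0: page finalized immediately
  cols[1].Append(2);  // row 2, col 1: still buffered
  // Suppose col 2 now hits an invalid value. Undoing row 2 means popping
  // col 1's in-memory buffer but reopening col 0's finalized page -- the
  // part that makes a clean rewind so complicated.
  for (std::size_t i = 0; i < cols.size(); ++i)
    std::printf("col %zu: flushed=%d buffered=%zu\n",
                i, cols[i].flushed_pages, cols[i].page.size());
}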
 

Based on the above investigation, we will not support row skipping. Instead, 
we will add a new query option, "use_null_for_decimal_errors", as mentioned in 
Aman's comments.

 

> No error returned when inserting an overflowed value into a decimal column
> --------------------------------------------------------------------------
>
>                 Key: IMPALA-10564
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10564
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend, Frontend
>    Affects Versions: Impala 4.0
>            Reporter: Wenzhe Zhou
>            Assignee: Wenzhe Zhou
>            Priority: Major
>
> When using CTAS statements or INSERT-SELECT statements to insert rows into a 
> table with decimal columns, Impala inserts NULL for overflowed decimal 
> values instead of returning an error. This issue happens when the data 
> expression for the decimal column in the SELECT sub-query contains at least 
> one alias. The issue is similar to IMPALA-6340, but IMPALA-6340 only fixed 
> the cases where the data expression for the decimal column is a constant, so 
> that the overflowed decimal value could be detected by the frontend during 
> expression analysis. If there is an alias (a variable) in the data 
> expression for the decimal column, the frontend cannot evaluate the data 
> expression in the expression analysis phase; only the backend can evaluate 
> it, when it executes the fragment instances for the SELECT sub-query. The 
> log messages showed that the executor detected the decimal overflow error, 
> but somehow it did not propagate the error to the coordinator, hence the 
> error was not returned to the client.


