Peng Yu created FLINK-22143:
-------------------------------

             Summary: Flink returns fewer rows than expected when using LIMIT in SQL
                 Key: FLINK-22143
                 URL: https://issues.apache.org/jira/browse/FLINK-22143
             Project: Flink
          Issue Type: Bug
          Components: Table SQL / Runtime
    Affects Versions: 1.13.0
            Reporter: Peng Yu
             Fix For: 1.13.0
Flink's Blink runtime returns fewer rows than expected when querying Hive tables with a LIMIT clause.

{code:sql}
select i_item_sk from tpcds_1g_snappy.item limit 5000;
{code}

The query above returns only *4998* rows in some cases. The problem can be reproduced under the following conditions:
# The source is a Hive table stored in Parquet format.
# The query uses LIMIT and runs on the Blink planner (since Flink version 1.12.0).
# The input table is small, with only one data file containing a single row group (e.g. 1 GB of TPC-DS benchmark data), so there is only one input split.
# The row count requested by LIMIT exceeds the batch size (2048 by default).

After investigation, the bug was traced to the *LimitableBulkFormat* class. In this class, for each batch, *numRead* is incremented one more time than the actual number of rows returned by reader.readBatch(): *numRead* is also incremented when next() reaches the end of the current batch. When there is only one input split, no rows from other splits can make up the shortfall, so Flink returns fewer rows than requested.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
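The arithmetic behind the off-by-one can be illustrated with a minimal, self-contained simulation. This is NOT the actual Flink *LimitableBulkFormat* code; the class and method names below are hypothetical, and it only models how a limit counter that is also bumped on the end-of-batch call loses one row of "budget" per completed batch (two full 2048-row batches before LIMIT 5000 is reached, hence 5000 - 2 = 4998 rows).

{code:java}
/**
 * Hypothetical simulation of the reported off-by-one (not Flink code).
 * With buggy == true, the counter is incremented even on the call that
 * merely signals the end of a batch, mirroring the described behaviour.
 */
public class LimitCounterDemo {

    public static long countEmittedRows(int totalRows, int batchSize,
                                        long limit, boolean buggy) {
        long numRead = 0;   // rows the limiter believes it has consumed
        long emitted = 0;   // rows actually handed downstream
        int produced = 0;   // rows pulled from the simulated file so far

        while (produced < totalRows && numRead < limit) {
            int size = Math.min(batchSize, totalRows - produced); // one batch
            int i = 0;
            while (numRead < limit) {
                boolean endOfBatch = i >= size;   // next() would return null here
                if (buggy) {
                    numRead++;                    // bug: counted even for the null result
                }
                if (endOfBatch) {
                    break;
                }
                if (!buggy) {
                    numRead++;                    // fix: count only real rows
                }
                emitted++;
                i++;
            }
            produced += size;
        }
        return emitted;
    }

    public static void main(String[] args) {
        // One split, default batch size 2048, LIMIT 5000:
        System.out.println(countEmittedRows(10_000, 2048, 5000, true));  // 4998
        System.out.println(countEmittedRows(10_000, 2048, 5000, false)); // 5000
    }
}
{code}

Tracing the buggy path: each of the first two batches emits 2048 rows but costs 2049 counter increments, so the counter reaches the limit of 5000 after only 4998 rows have been emitted, matching the observed result.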