wesleydeng created ORC-1172:
-------------------------------

             Summary: add row count limit config for one stripe 
                 Key: ORC-1172
                 URL: https://issues.apache.org/jira/browse/ORC-1172
             Project: ORC
          Issue Type: New Feature
          Components: Java
            Reporter: wesleydeng


For query engines like Presto, the stripe is the basic unit of query 
concurrency: one stripe can only be processed by one split.
In the current implementation of the ORC writer, the only config that can 
control the row count in a stripe is "orc.stripe.size".
But the row count it produces varies widely between different kinds of tables, 
so it is difficult to use for this purpose.
 * for a table with many columns (e.g. 100 columns), 64MB may contain 5000 rows.
 * for a table with few columns (e.g. 5 columns), 64MB may contain 100000 rows.
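
To make the gap concrete, here is a small arithmetic sketch using only the two example figures above. The class and method names are made up for illustration, and 64MB is taken as 64 * 1024 * 1024 bytes, which is an assumption.

```java
public class StripeRowEstimate {
    // Average bytes per row implied by a stripe size and the rows it holds.
    // Integer division: the result is rounded down.
    static long avgRowBytes(long stripeBytes, long rows) {
        return stripeBytes / rows;
    }

    public static void main(String[] args) {
        long stripeBytes = 64L * 1024 * 1024; // 64MB, assumed to mean 64 MiB

        // Wide table example from above: 5000 rows per stripe.
        System.out.println("wide table:   " + avgRowBytes(stripeBytes, 5_000) + " bytes/row");

        // Narrow table example from above: 100000 rows per stripe.
        System.out.println("narrow table: " + avgRowBytes(stripeBytes, 100_000) + " bytes/row");
    }
}
```

With the same byte limit, the narrow table ends up with 20 times as many rows in a stripe as the wide one.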

For Presto, a normal OLAP query only reads a subset of the table's columns, so 
the row count is the key factor in query performance. If one stripe contains 
too many rows, query performance may become too low.

So, besides the config "orc.stripe.size", we need another config like 
"orc.stripe.row.count" to control the row count of one stripe.
A similar config has been introduced to cudf (a GPU DataFrame library based 
on Apache Arrow): 
[rapidsai/cudf#9261|https://github.com/rapidsai/cudf/issues/9261]
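
A rough sketch of the intended behaviour, not the actual ORC writer code: a stripe would be closed as soon as either limit is reached. The class StripeFlushSketch and the method shouldFlush are hypothetical, "orc.stripe.row.count" is only the name proposed above, and both defaults (64MB and "no row limit") are assumptions made for the example.

```java
import java.util.Properties;

public class StripeFlushSketch {
    final long stripeSizeBytes; // existing byte limit ("orc.stripe.size")
    final long stripeRowCount;  // proposed row limit ("orc.stripe.row.count")

    StripeFlushSketch(Properties conf) {
        // Assumed defaults: 64MB for the size, effectively unlimited rows.
        stripeSizeBytes = Long.parseLong(
            conf.getProperty("orc.stripe.size", String.valueOf(64L * 1024 * 1024)));
        stripeRowCount = Long.parseLong(
            conf.getProperty("orc.stripe.row.count", String.valueOf(Long.MAX_VALUE)));
    }

    // Close the current stripe when either the byte limit or the row limit is hit.
    boolean shouldFlush(long bufferedBytes, long bufferedRows) {
        return bufferedBytes >= stripeSizeBytes || bufferedRows >= stripeRowCount;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("orc.stripe.row.count", "10000"); // example value only

        StripeFlushSketch policy = new StripeFlushSketch(conf);
        // Narrow table: far below 64MB, but the row limit is reached.
        System.out.println(policy.shouldFlush(1L * 1024 * 1024, 10_000));
        // Wide table: few rows, but the byte limit is reached.
        System.out.println(policy.shouldFlush(64L * 1024 * 1024, 5_000));
    }
}
```

With this, a narrow table is cut by the row limit and a wide table is still cut by "orc.stripe.size", so existing behaviour is unchanged when the new config is not set.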



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
