[ https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dechang Gu closed DRILL-5266.
-----------------------------
Verified. See closure comment in DRILL-5267.

> Parquet Reader produces "low density" record batches - bits vs. bytes
> ---------------------------------------------------------------------
>
>                 Key: DRILL-5266
>                 URL: https://issues.apache.org/jira/browse/DRILL-5266
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>              Labels: ready-to-commit
>             Fix For: 1.10.0
>
>
> Testing with the managed sort revealed that, for at least one file, Parquet
> produces "low-density" batches: batches in which only 5% of each value vector
> contains actual data, with the rest being unused space. When fed into the
> sort, we end up buffering 95% wasted space, using only 5% of available
> memory to hold actual query data. The result is poor sort performance, as the
> sort must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use
> estimates. The following is the output for the Parquet file in question:
> {code}
> Actual batch schema & sizes {
>   T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   ...
>   c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
>   Records: 1129, Total size: 32006144, Row width: 28350, Density: 5
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
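As a side note on reading the log above: the per-column "density" figures appear consistent with the percentage of allocated vector memory that holds actual data, rounded up. This is a hypothetical reconstruction from the logged numbers, not Drill's actual code; the function name `density_pct` is made up for illustration.

```python
import math

def density_pct(data_size: int, vector_size: int) -> int:
    """Percentage of a value vector's allocated bytes occupied by real
    data, rounded up. Hypothetical reconstruction of the 'density'
    column in the batch-size log; not Apache Drill's implementation."""
    return math.ceil(100 * data_size / vector_size)

# Figures taken from the log above:
print(density_pct(4516, 131072))   # cs_sold_date_sk -> 4
print(density_pct(30327, 49152))   # c_email_address -> 62
```

Under this reading, a density of 4 means roughly 96% of the vector's memory is unused, which matches the reported sort-spilling behavior.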