[ https://issues.apache.org/jira/browse/SPARK-25354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-25354:
------------------------------------

    Assignee: Apache Spark

> Parquet vectorized record reader has unneeded operation in several methods
> --------------------------------------------------------------------------
>
>                 Key: SPARK-25354
>                 URL: https://issues.apache.org/jira/browse/SPARK-25354
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: SongYadong
>            Assignee: Apache Spark
>            Priority: Major
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> The VectorizedParquetRecordReader class has unneeded operations in its nextKeyValue() method and in the functions called from it:
> 1. In nextKeyValue(), resultBatch() is called only to initialize the columnar batch if it has not been initialized yet, not to return the batch, so the initBatch() operation can be moved into nextBatch().
> 2. In nextBatch(), the column vectors need not be reset on every call: when rowsReturned >= totalRowCount the function returns and the reset cost is wasted, so "if (rowsReturned >= totalRowCount) return false;" can be placed before the column-vector reset for performance.
> 3. In nextBatch(), checkEndOfRowGroup() need not be called every time: when rowsReturned != totalCountLoadedSoFar it does nothing but return, so it can be called only when rowsReturned == totalCountLoadedSoFar to reduce function-call overhead.
> 4. In checkEndOfRowGroup(), the columns of requestedSchema need not be fetched every time; they can be fetched on the first call and cached for later use.
> According to an analysis of a Spark application with the JMC tool, the Parquet vectorized record reader calls nextKeyValue() and its callees very frequently, so optimizing this path is worth doing.
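Taken together, points 1-4 amount to a restructured read loop. The following self-contained Java sketch is hypothetical: it borrows the field and method names from VectorizedParquetRecordReader but replaces real Parquet I/O with simple counters, only to illustrate the proposed control flow (initialization moved into nextBatch(), the early return before the reset, checkEndOfRowGroup() gated on the row-group boundary, and the requested columns cached on first use).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for VectorizedParquetRecordReader; real I/O is
// replaced by counters so the restructured control flow can be observed.
public class ReaderSketch {
    static final int CAPACITY = 4096;

    long rowsReturned = 0;
    long totalRowCount;
    long totalCountLoadedSoFar = 0;
    List<String> requestedColumns;   // point 4: cached on first use
    boolean batchInitialized = false;
    int resetCount = 0;              // instrumentation for this sketch only
    int rowGroupChecks = 0;

    ReaderSketch(long totalRowCount) { this.totalRowCount = totalRowCount; }

    // Point 1: nextKeyValue() no longer calls resultBatch() just for its
    // initialization side effect; initBatch() happens inside nextBatch().
    boolean nextKeyValue() {
        return nextBatch();
    }

    boolean nextBatch() {
        if (!batchInitialized) initBatch();
        // Point 2: return before resetting the column vectors when exhausted,
        // so the reset cost is not wasted on the final call.
        if (rowsReturned >= totalRowCount) return false;
        resetColumnVectors();
        // Point 3: only probe for a new row group at a row-group boundary.
        if (rowsReturned == totalCountLoadedSoFar) checkEndOfRowGroup();
        long num = Math.min(CAPACITY, totalCountLoadedSoFar - rowsReturned);
        rowsReturned += num;
        return true;
    }

    void initBatch() { batchInitialized = true; }

    void resetColumnVectors() { resetCount++; }

    void checkEndOfRowGroup() {
        rowGroupChecks++;
        // Point 4: fetch the requested columns once and reuse them.
        if (requestedColumns == null) {
            requestedColumns = Arrays.asList("a", "b", "c");
        }
        // Pretend one 10,000-row row group is loaded.
        totalCountLoadedSoFar += Math.min(10_000, totalRowCount - totalCountLoadedSoFar);
    }

    public static void main(String[] args) {
        ReaderSketch r = new ReaderSketch(10_000);
        int batches = 0;
        while (r.nextKeyValue()) batches++;
        // → 3 batches, 1 row-group checks, 3 resets
        System.out.println(batches + " batches, " + r.rowGroupChecks
            + " row-group checks, " + r.resetCount + " resets");
    }
}
```

With 10,000 rows in one row group, the boundary check fires once instead of on every batch, and no reset happens on the terminating call, which is the point of reorderings 2 and 3.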
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org