Henry Robinson created SPARK-22736:
--------------------------------------

             Summary: Consider caching decoded dictionaries in VectorizedColumnReader
                 Key: SPARK-22736
                 URL: https://issues.apache.org/jira/browse/SPARK-22736
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.1
            Reporter: Henry Robinson

{{VectorizedColumnReader.decodeDictionaryIds()}} calls {{dictionary.decodeToX}} for every dictionary ID encountered in a dict-encoded Parquet page. The whole idea of dictionary encoding is that a) values are repeated in a page and b) the dictionary only contains values that are in a page. So we should be able to save some decoding cost by decoding the entire dictionary page once and then using the decoded dictionary to populate rows. This costs some memory, although in theory we could discard the encoded dictionary afterwards, I think.

The saving should be largest for TIMESTAMP data, which after SPARK-12297 might have a timezone conversion as part of its decoding step.
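
A minimal sketch of the proposal, not Spark's actual code. {{LongDictionary}} is a hypothetical stand-in for the Parquet dictionary (only the {{getMaxId}} / {{decodeToLong}} shape is borrowed), and {{decodeAll}}, {{populate}} and the fixed offset used as a "timezone conversion" are assumptions made for illustration. It decodes each dictionary entry once, applies the conversion once per entry, and then fills rows by array lookup.

```java
import java.util.Arrays;
import java.util.function.LongUnaryOperator;

public class DecodedDictionaryCache {

    /** Hypothetical stand-in for a Parquet dictionary of long values. */
    public interface LongDictionary {
        int getMaxId();              // largest valid dictionary ID
        long decodeToLong(int id);   // decode a single entry
    }

    /**
     * Decode every dictionary entry exactly once, applying a per-value
     * conversion (for example a timezone adjustment) at the same time.
     */
    public static long[] decodeAll(LongDictionary dictionary, LongUnaryOperator convert) {
        long[] decoded = new long[dictionary.getMaxId() + 1];
        for (int id = 0; id < decoded.length; id++) {
            decoded[id] = convert.applyAsLong(dictionary.decodeToLong(id));
        }
        return decoded;
    }

    /** Populate rows from the cached values: one array lookup per row. */
    public static void populate(int[] dictionaryIds, long[] decoded, long[] out) {
        for (int row = 0; row < dictionaryIds.length; row++) {
            out[row] = decoded[dictionaryIds[row]];
        }
    }

    public static void main(String[] args) {
        final int[] decodeCalls = {0};
        final long[] raw = {1_000L, 2_000L, 3_000L};
        LongDictionary dictionary = new LongDictionary() {
            @Override public int getMaxId() { return raw.length - 1; }
            @Override public long decodeToLong(int id) { decodeCalls[0]++; return raw[id]; }
        };

        // Assumed conversion: shift every timestamp by a fixed offset.
        final long offset = 500L;
        long[] decoded = decodeAll(dictionary, v -> v + offset);

        // Six rows that reference only three distinct dictionary entries.
        int[] ids = {0, 2, 2, 1, 0, 2};
        long[] out = new long[ids.length];
        populate(ids, decoded, out);

        System.out.println(Arrays.toString(out));
        System.out.println("decode calls: " + decodeCalls[0]);
    }
}
```

In this sketch the number of decode (and conversion) calls is bounded by the dictionary size rather than by the row count, which is the trade described above: an extra array of decoded values in exchange for less per-row work.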