Github user MickDavies commented on the pull request:

https://github.com/apache/spark/pull/4187#issuecomment-71377752

I don't think the line in question is hot, but I think your suggestions are good, so I have made the changes.

I also looked a bit more into the Parquet code. I think that the array will be created per column per row group. It looks like Parquet uses a dictionary until a maximum number of bytes per column per row group has been added. Extract from ParquetOutputFormat:

```java
 * # There is one dictionary page per column per row group when dictionary encoding is used.
 * # The dictionary page size works like the page size but for dictionary.
 * parquet.dictionary.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
```

and

```java
/**
 * Will attempt to encode values using a dictionary and fall back to plain encoding
 * if the dictionary gets too big
 *
 * @author Julien Le Dem
 *
 */
public abstract class DictionaryValuesWriter extends ValuesWriter implements RequiresFallback
```

Here "bytes" means the size of the Binary data plus a 4-byte overhead per entry. I think this imposes an upper limit of a few MB on the size of this array and the related Strings.
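As a minimal sketch of the bound being discussed (my illustration, not code from this PR): the `parquet.dictionary.page.size` property quoted above can be set on the Hadoop `Configuration` used by a Parquet write job, and together with the 4-byte per-entry overhead it caps how many dictionary entries a column can accumulate before Parquet falls back to plain encoding. The 512 KB value below is an arbitrary example.

```java
import org.apache.hadoop.conf.Configuration;

public class DictionaryPageSizeSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Default is 1 MB (1048576 bytes) per column per row group, as
        // documented in ParquetOutputFormat. Lowering it makes Parquet
        // fall back to plain encoding sooner, shrinking the dictionary
        // array discussed in this thread.
        conf.setInt("parquet.dictionary.page.size", 512 * 1024);

        // Rough upper bound on entries: with the default 1 MB cap and a
        // 4-byte overhead per entry, a column of ~20-byte strings holds
        // at most about 1048576 / (20 + 4) ~= 43,690 entries.
        int cap = 1048576;
        int avgValueBytes = 20; // assumed average value size, for illustration
        System.out.println("max entries ~= " + cap / (avgValueBytes + 4));
    }
}
```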