Github user MickDavies commented on the pull request:

    https://github.com/apache/spark/pull/4187#issuecomment-71377752
  
    I don't think the line in question is hot, but your suggestions are
    good, so I have made the changes.
    
    I also looked a bit more into the Parquet code. I think that the
    array will be created per column per row group. It looks like Parquet
    uses a dictionary until a maximum number of bytes per column per row
    group has been added. Here is an extract from ParquetOutputFormat:
    ```java
     * # There is one dictionary page per column per row group when dictionary encoding is used.
     * # The dictionary page size works like the page size but for dictionary
     * parquet.dictionary.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
    ```
    and, from DictionaryValuesWriter:
    ```java
    /**
     * Will attempt to encode values using a dictionary and fall back to plain encoding
     *  if the dictionary gets too big
     *
     * @author Julien Le Dem
     *
     */
    public abstract class DictionaryValuesWriter extends ValuesWriter implements RequiresFallback
    ```
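
    To make the fallback behaviour concrete, here is a minimal sketch of
    the idea as I read it; the class and method names below are invented
    for illustration and are not Parquet's actual DictionaryValuesWriter
    API:
    ```java
    import java.nio.charset.StandardCharsets;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch of dictionary encoding with a size-based
    // fallback; not Parquet's actual implementation.
    public class DictionaryEncodingSketch {
        // Mirrors the parquet.dictionary.page.size default of 1 MB.
        private static final int MAX_DICTIONARY_BYTES = 1024 * 1024;

        private final Map<String, Integer> dictionary = new LinkedHashMap<>();
        private long dictionaryBytes = 0;
        private boolean fallenBack = false;

        public void write(String value) {
            if (fallenBack) {
                writePlain(value);
                return;
            }
            Integer id = dictionary.get(value);
            if (id == null) {
                // Each new entry costs its Binary bytes plus a 4-byte
                // overhead, as described below.
                dictionaryBytes += value.getBytes(StandardCharsets.UTF_8).length + 4;
                if (dictionaryBytes > MAX_DICTIONARY_BYTES) {
                    // Dictionary got too big: fall back to plain encoding.
                    fallenBack = true;
                    writePlain(value);
                    return;
                }
                id = dictionary.size();
                dictionary.put(value, id);
            }
            writeDictionaryId(id);
        }

        private void writePlain(String value) { /* plain encoding elided */ }
        private void writeDictionaryId(int id) { /* id write elided */ }
    }
    ```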
    
    Here "bytes" is the size of the Binary data plus a 4-byte overhead
    per entry. I think this imposes an upper limit of a few MB on the
    size of this array and the related Strings.
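
    To put a rough number on it (my arithmetic, not taken from the
    Parquet code): with the default 1 MB dictionary page size and the
    4-byte overhead per entry, the bound works out as follows:
    ```java
    public class DictionaryBound {
        public static void main(String[] args) {
            final int maxDictionaryBytes = 1024 * 1024; // parquet.dictionary.page.size default
            final int perEntryOverhead = 4;             // 4-byte overhead per entry

            // Worst case for entry count: zero-length values, where each
            // entry costs only the overhead.
            int maxEntries = maxDictionaryBytes / perEntryOverhead;
            System.out.println("Max entries: " + maxEntries); // 262144

            // The Binary data itself is also capped at ~1 MB, so the array
            // plus the decoded Strings (about 2 bytes per char plus object
            // overhead in the JVM) stays at a few MB per column per row
            // group.
        }
    }
    ```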

