Github user MickDavies commented on the pull request:

    https://github.com/apache/spark/pull/4187#issuecomment-71326437
  
    The dictionary already exists; the change will cause an additional array to 
be created to hold the converted values, but I do not think this is very 
significant. It is possible that the converted Strings in the array 
themselves increase non-short-lived memory, but this is probably not an extra 
cost, as they will very likely have been referenced further upstream in the 
Spark code.
    
    Adding an array to hold converted String values appears to be the standard 
pattern for implementing this form of converter, and a number of similar 
examples can be seen in the Parquet code base, for example: 
parquet.avro.AvroIndexedRecordConverter.FieldStringConverter
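
    A minimal sketch of the caching pattern described above (class and method 
names here are illustrative, not the actual Spark or Parquet classes): for a 
dictionary-encoded column, each dictionary entry is decoded from UTF-8 bytes 
to a String exactly once, and repeated values are then served by index lookup 
with no further decoding or allocation.

    ```java
    import java.nio.charset.StandardCharsets;

    // Hypothetical converter: pre-decodes the dictionary into a String[] cache.
    public class CachingStringConverter {
        private final String[] cache;  // one converted String per dictionary id

        public CachingStringConverter(byte[][] dictionary) {
            cache = new String[dictionary.length];
            for (int i = 0; i < dictionary.length; i++) {
                // UTF-8 decode happens exactly once per distinct value
                cache[i] = new String(dictionary[i], StandardCharsets.UTF_8);
            }
        }

        // Called once per row; returns the cached instance, allocating nothing
        public String convert(int dictionaryId) {
            return cache[dictionaryId];
        }

        public static void main(String[] args) {
            byte[][] dict = {
                "alpha".getBytes(StandardCharsets.UTF_8),
                "beta".getBytes(StandardCharsets.UTF_8)
            };
            CachingStringConverter c = new CachingStringConverter(dict);
            System.out.println(c.convert(0));                 // prints "alpha"
            System.out.println(c.convert(1) == c.convert(1)); // prints "true"
        }
    }
    ```

    Because convert returns the same String instance for a repeated dictionary 
id, rows sharing a value share one object, which is the source of the reduced 
String creation and GC pressure noted below.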
    
    The improved performance is due not only to reduced CPU from performing 
fewer UTF8 conversions, but also to the significant reduction in String 
creation, which results in less GC time.

