[ https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-19843.
---------------------------------
    Resolution: Fixed
 Fix Version/s: 2.2.0

Issue resolved by pull request 17184
[https://github.com/apache/spark/pull/17184]

> UTF8String => (int / long) conversion expensive for invalid inputs
> ------------------------------------------------------------------
>
>                 Key: SPARK-19843
>                 URL: https://issues.apache.org/jira/browse/SPARK-19843
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Tejas Patil
>             Fix For: 2.2.0
>
>
> For invalid inputs, converting a UTF8String to an int or long returns null.
> This comes at a cost: the conversion method (e.g. [0]) throws an exception,
> and exception handling is expensive, since it converts the UTF8String into a
> Java string and populates the stack trace (a native call). While migrating
> workloads from Hive to Spark, I see that, at an aggregate level, this hurts
> query performance compared with Hive. The exception only indicates that the
> conversion failed; it is not propagated to users, so it would be good to
> avoid it.
> A couple of options:
> - Return Integer / Long (instead of primitive types), which can be set to
>   null if the conversion fails. This requires boxing, which is very bad for
>   performance, so a clear no.
> - Hive has a pre-check [1] for this, which is not a perfect safety net but
>   helps catch typical bad inputs, e.g. empty string, "null".
> [0] : https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
> [1] : https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
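To illustrate the exception-free direction (this is only a sketch, not the actual Spark patch; the class and method names below, `SafeIntParse` and `tryParseInt`, are hypothetical): instead of throwing NumberFormatException on bad input, the parser can validate each byte inline and report failure through a boolean return, writing the parsed value into a caller-supplied slot. This combines the Hive-style pre-check (rejecting inputs like "" or "null" cheaply) with overflow-safe digit accumulation, so invalid rows never pay for stack-trace construction.

```java
// Hypothetical sketch of an exception-free int parse over raw bytes,
// in the spirit of SPARK-19843: failure is signaled by the return value,
// never by a thrown NumberFormatException.
public final class SafeIntParse {

    // Returns true and stores the parsed value in out[0] on success;
    // returns false for any invalid input (empty, non-digits, overflow).
    public static boolean tryParseInt(byte[] bytes, int[] out) {
        if (bytes == null || bytes.length == 0) {
            return false;                       // cheap pre-check: empty input
        }
        int i = 0;
        boolean negative = bytes[0] == '-';
        if (negative || bytes[0] == '+') {
            if (bytes.length == 1) {
                return false;                   // a lone sign is not a number
            }
            i = 1;
        }
        long result = 0;                        // accumulate in a long to detect int overflow
        for (; i < bytes.length; i++) {
            byte b = bytes[i];
            if (b < '0' || b > '9') {
                return false;                   // rejects "null", "1.5", etc. without throwing
            }
            result = result * 10 + (b - '0');
            if (result > (long) Integer.MAX_VALUE + 1) {
                return false;                   // magnitude already too large
            }
        }
        result = negative ? -result : result;
        if (result < Integer.MIN_VALUE || result > Integer.MAX_VALUE) {
            return false;                       // catches 2147483648 (positive overflow)
        }
        out[0] = (int) result;
        return true;
    }

    public static void main(String[] args) {
        int[] out = new int[1];
        System.out.println(tryParseInt("123".getBytes(), out) + " -> " + out[0]);
        System.out.println(tryParseInt("null".getBytes(), out));
        System.out.println(tryParseInt("".getBytes(), out));
    }
}
```

The caller-supplied `int[]` slot stands in for whatever mutable holder a real implementation would use; the point is that the hot path for invalid inputs is a plain branch and return, with no String allocation and no native fillInStackTrace call.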