Tejas Patil created SPARK-19843:
-----------------------------------

             Summary: UTF8String => (int / long) conversion expensive for 
invalid inputs
                 Key: SPARK-19843
                 URL: https://issues.apache.org/jira/browse/SPARK-19843
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Tejas Patil


For invalid inputs, converting a UTF8String to an int or long returns null. 
This comes at a cost: the conversion method (e.g. [0]) throws an exception, 
and exception handling is expensive because it converts the UTF8String into 
a Java string and populates the stack trace (a native call). While migrating 
workloads from Hive to Spark, I see that, at an aggregate level, this hurts 
query performance compared with Hive.

The exception merely indicates that the conversion failed; it is not 
propagated to users, so it would be good to avoid it.

A couple of options:
- Return Integer / Long (instead of primitive types) and set the result to 
NULL if the conversion fails. This involves boxing, which is very bad for 
performance, so it is a big no.
- Hive has a pre-check [1] for this. It is not a perfect safety net, but it 
helps catch typical bad inputs, e.g. the empty string and "null".

[0] : 
https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
[1] : 
https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90
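To illustrate the second option, here is a minimal sketch of a Hive-style 
pre-check: a cheap byte scan that rejects inputs which cannot possibly be a 
valid signed integer, so the exception-throwing parse is never attempted for 
common bad inputs. The names here (`maybeNumeric`, `NumericPrecheck`) are 
hypothetical for illustration and are not Spark's or Hive's actual API; like 
Hive's check, it is not a perfect safety net (e.g. it still lets overflowing 
digit strings through to the real parser).

```java
import java.nio.charset.StandardCharsets;

public class NumericPrecheck {

    /**
     * Cheap pre-check: returns false for inputs that are definitely not a
     * signed integer (empty string, bare sign, any non-digit character).
     * Inputs that pass may still fail the real parse, e.g. on overflow.
     */
    static boolean maybeNumeric(byte[] bytes, int start, int len) {
        if (len == 0) return false;                 // empty string
        int i = start;
        if (bytes[i] == '+' || bytes[i] == '-') {
            if (len == 1) return false;             // bare "+" or "-"
            i++;
        }
        for (; i < start + len; i++) {
            byte b = bytes[i];
            if (b < '0' || b > '9') return false;   // catches "null", "abc", "1.5"
        }
        return true;
    }

    public static void main(String[] args) {
        String[] inputs = {"123", "", "null", "-7", "1.5"};
        for (String s : inputs) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            if (!maybeNumeric(b, 0, b.length)) {
                // Fast path: rejected without ever throwing an exception.
                System.out.println("\"" + s + "\" -> rejected by pre-check");
            } else {
                System.out.println("\"" + s + "\" -> " + Integer.parseInt(s));
            }
        }
    }
}
```

The point is that the common bad inputs (empty string, "null") never reach 
the exception path, so no Java string conversion or stack-trace fill-in 
happens for them, while valid inputs pay only one extra linear byte scan.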



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
