Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18865
  
    @HyukjinKwon 's provided use case looks pretty fair. There, the corrupt 
record is a whole line that doesn't follow the JSON format. It is different 
from the corrupt-record case where some JSON fields can't be correctly 
converted to the desired data type.
    
    These two kinds of corrupt records can be mixed in one JSON file. E.g.,
    
        echo '{"field": 1
         {"field" 2}
         {"field": 3}
         {"field": "4"}' >/tmp/sample.json
    
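    (`dfFromFile` is not shown above; a minimal sketch of how it could be 
created, assuming a user-specified schema with an `IntegerType` `field` 
column and the default `_corrupt_record` column name:)

        scala> import org.apache.spark.sql.types._
        scala> val schema = new StructType()
                 .add("field", IntegerType)
                 .add("_corrupt_record", StringType)
        scala> val dfFromFile = spark.read.schema(schema).json("/tmp/sample.json")
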
        scala> dfFromFile.show(false)
        +-----+---------------+
        |field|_corrupt_record|
        +-----+---------------+
        |null |{"field": 1    |
        |null | {"field" 2}   |
        |3    |null           |
        |null |{"field": "4"} |
        +-----+---------------+
    
        scala> dfFromFile.select($"_corrupt_record").show()
        +---------------+
        |_corrupt_record|
        +---------------+
        |    {"field": 1|
        |    {"field" 2}|
        |           null|
        |           null|
        +---------------+
    
    
    In the second query, the corrupt record for `{"field": "4"}` is lost: 
after column pruning, only `_corrupt_record` is required, so the conversion 
error on `field` never happens and no corrupt record is produced for that 
line.

    At least we should clearly explain the difference in the error message. 
Maybe something like: After optimization, the query to execute requires only 
`_corrupt_record`. When there are corrupt records caused by JSON field 
conversion errors, those corrupt records might not be correctly generated, 
because no other JSON fields are actually required. To obtain the most 
accurate result, we recommend that users cache or save the dataset before 
running queries that require only `_corrupt_record`.
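
    The workaround suggested in that message could look like this (a sketch; 
`cache()` materializes the fully parsed result, so the later 
`_corrupt_record`-only query reads the cached data instead of re-parsing 
with a pruned schema):

        scala> dfFromFile.cache()
        scala> dfFromFile.select($"_corrupt_record").show()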
    


