Ladislav Jech created SPARK-27593:
-------------------------------------

             Summary: CSV Parser returns 2 DataFrame - Valid and Malformed DFs
                 Key: SPARK-27593
                 URL: https://issues.apache.org/jira/browse/SPARK-27593
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.4.2
            Reporter: Ladislav Jech


When we process CSV in any kind of data warehouse, it is common practice to 
report corrupted records for audit purposes and to feed them back to the vendor 
so they can improve their process. CSV is no different from XSD in that it 
defines a schema, although in a very limited way (in some cases only as a 
number of columns, without headers or types). But when I check an XML document 
against an XSD file, I get an exact report of whether the file is completely 
valid and, if it is not, exactly which records do not follow the schema. 

Such a feature would be of great value in Spark for CSV: collect the malformed 
records into a separate DataFrame, together with the line number (a pointer 
within the data object), so that I can log both the pointer and the actual data 
(line/row) and trigger an action on this unfortunate event.
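
To make the request concrete, here is a minimal sketch of the behaviour I have 
in mind. It is plain Python, not Spark code and not a proposed API: the function 
name split_csv and the parameter expected_columns are hypothetical, the only 
"schema" checked is the number of columns, and it assumes that no field 
contains an embedded newline.

```python
import csv


def split_csv(text, expected_columns):
    """Split CSV text into valid rows and malformed lines.

    A line counts as valid when it parses into exactly `expected_columns`
    fields. Malformed lines are kept as (line_number, raw_line) pairs so
    that both the pointer and the original data can be logged.
    Assumption: one record per physical line (no embedded newlines).
    """
    valid = []
    malformed = []
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        # Parse a single line with the standard csv module.
        fields = next(csv.reader([raw_line]))
        if len(fields) == expected_columns:
            valid.append(fields)
        else:
            malformed.append((line_number, raw_line))
    return valid, malformed


sample = "1,alice,10\n2,bob\n3,carol,30,extra\n4,dave,40\n"
valid, malformed = split_csv(sample, expected_columns=3)

for line_number, raw_line in malformed:
    print(f"malformed record at line {line_number}: {raw_line}")
```

In Spark the two results would be two DataFrames rather than two lists, with 
the line number available as a column of the malformed one.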


