Maxim Gekk created SPARK-25393:
----------------------------------

             Summary: Parsing CSV strings in a column
                 Key: SPARK-25393
                 URL: https://issues.apache.org/jira/browse/SPARK-25393
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Maxim Gekk
There are use cases in which content in CSV format is dumped into external storage as one of the columns. For example, CSV records are stored in Kafka together with other meta-information. The current Spark API does not allow parsing such columns directly: the existing [csv()|https://github.com/apache/spark/blob/e754887182304ad0d622754e33192ebcdd515965/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L487] method requires a dataset with a single string column, which is inconvenient when the CSV content is just one of many columns in a dataset. This ticket aims to add a new function, similar to [from_json()|https://github.com/apache/spark/blob/d749d034a80f528932f613ac97f13cfb99acd207/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3456], with the following signature in Scala:

{code:scala}
def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column
{code}

and, for use from Python, R and Java:

{code:scala}
def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
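To illustrate the intended per-record semantics of the proposed {{from_csv}} (a string record plus a schema yields typed named fields), here is a minimal plain-Scala sketch. The {{Field}}/{{FieldType}} types below are hypothetical stand-ins for Spark's {{StructType}}, and the parser deliberately ignores quoting and escaping; this is not Spark code, just a sketch of the mapping the new function would perform per row:

```scala
// Hypothetical, simplified schema types standing in for Spark's StructType.
sealed trait FieldType
case object IntType extends FieldType
case object StringType extends FieldType

final case class Field(name: String, tpe: FieldType)

object CsvColumnSketch {
  // Parse one CSV record into name -> typed value, following the schema.
  // No quoting/escaping support: a sketch of the semantics, not a CSV parser.
  def fromCsv(record: String,
              schema: Seq[Field],
              delimiter: Char = ','): Map[String, Any] = {
    val cells = record.split(delimiter).map(_.trim)
    schema.zip(cells).map { case (field, cell) =>
      val value: Any = field.tpe match {
        case IntType    => cell.toInt
        case StringType => cell
      }
      field.name -> value
    }.toMap
  }
}
```

In real Spark this mapping would run inside an expression over a {{Column}}, so the whole dataset keeps its other columns while the CSV column is expanded into a struct.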