[ https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840018#comment-16840018 ]
Ruslan Dautkhanov edited comment on SPARK-15463 at 5/15/19 4:00 PM:
--------------------------------------------------------------------

[~hyukjin.kwon] would it be possible to make CSV parsing optional (and keep only the schema inference)? We have an RDD with the columns stored separately in a tuple, but all as strings. It would be great to infer the schema without parsing a single String as CSV.

Our current workaround is to glue all the strings together (with proper quoting, escaping, etc.) just so that Spark can split the columns back apart here [https://github.com/apache/spark/blob/3f42c4cc7b93e32cb8d4f2517987097b73e733fd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L156] into a list of columns and finally run schema inference.

infer() already accepts `RDD[Array[String]]`, but the current public API only accepts `RDD[String]` (or `Dataset[String]`) [https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L51]. I don't think calling infer() directly from [https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L30] counts as public API? That would be a better workaround than collapsing all-string columns into a CSV line only for Spark to parse it back internally just to infer the data types of those columns. Thank you for any leads.

> Support for creating a dataframe from CSV in Dataset[String]
> ------------------------------------------------------------
>
>                 Key: SPARK-15463
>                 URL: https://issues.apache.org/jira/browse/SPARK-15463
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: PJ Fanning
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 2.2.0
>
>
> I currently use Databrick's spark-csv lib but some features don't work with
> Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV
> support into spark-sql directly, that spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in
> RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]) but other formats don't
> appear to support the creation of DataFrames based on loading from
> RDD[String].
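The quote-and-glue workaround described in the comment can be sketched with Python's stdlib csv module (a minimal illustration, not Spark code; `glue_columns` is a hypothetical helper name). It joins an all-string tuple of columns back into one CSV line with RFC 4180-style quoting and escaping, so the resulting strings could then be fed to a CSV reader with schema inference enabled:

```python
import csv
import io

def glue_columns(columns, delimiter=","):
    """Join an all-string tuple of columns into a single CSV line.

    csv.QUOTE_MINIMAL quotes only the fields that contain the delimiter,
    a quote character, or a newline, and escapes embedded quotes by
    doubling them -- the "proper quoting, escaping etc" step above.
    """
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter,
               quoting=csv.QUOTE_MINIMAL).writerow(columns)
    # writerow appends a line terminator; strip it for a bare line
    return buf.getvalue().rstrip("\r\n")

print(glue_columns(("plain", "has,comma", 'has "quote"')))
# -> plain,"has,comma","has ""quote"""
```

The round trip this enables (glue, then have the engine re-split and infer types) is exactly the wasted work the comment is asking to avoid by exposing inference over already-split columns.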
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org