Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22676#discussion_r223698787

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
    @@ -505,20 +505,14 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
         val actualSchema =
           StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))

    -    val linesWithoutHeader = if (parsedOptions.headerFlag && maybeFirstLine.isDefined) {
    -      val firstLine = maybeFirstLine.get
    -      val parser = new CsvParser(parsedOptions.asParserSettings)
    -      val columnNames = parser.parseLine(firstLine)
    -      CSVDataSource.checkHeaderColumnNames(
    +    val linesWithoutHeader: RDD[String] = maybeFirstLine.map { firstLine =>
    +      val headerChecker = new CSVHeaderChecker(
             actualSchema,
    -        columnNames,
    -        csvDataset.getClass.getCanonicalName,
    -        parsedOptions.enforceSchema,
    -        sparkSession.sessionState.conf.caseSensitiveAnalysis)
    +        parsedOptions,
    +        source = s"CSV source: ${csvDataset.getClass.getCanonicalName}")

    --- End diff --

    Would it be better to output more concrete info about the dataset? For example, `toString` outputs field names at least. I think it would help in log analysis.
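A minimal, self-contained sketch of the reviewer's point, contrasting a log label built from only the class name with one built from `toString` (which, for a Spark `Dataset`, renders the schema, e.g. `[value: string]`). The names `FakeDataset` and `SourceLabel` are illustrative stand-ins, not Spark's actual API:

```scala
// FakeDataset mimics the relevant behavior of Spark's Dataset:
// its toString renders the schema, e.g. "[value: string]".
case class FakeDataset(fields: Seq[String]) {
  override def toString: String = fields.mkString("[", ", ", "]")
}

object SourceLabel {
  // What the PR currently does: identify the dataset only by class name.
  def byClassName(ds: FakeDataset): String =
    s"CSV source: ${ds.getClass.getName}"

  // The suggestion: interpolate the dataset itself, so toString is called
  // and the field names end up in the log message.
  def byToString(ds: FakeDataset): String =
    s"CSV source: $ds"
}

object Demo {
  def main(args: Array[String]): Unit = {
    val ds = FakeDataset(Seq("value: string"))
    println(SourceLabel.byClassName(ds)) // class name only
    println(SourceLabel.byToString(ds))  // prints "CSV source: [value: string]"
  }
}
```

The second form makes log lines from different CSV datasets distinguishable by their schemas, which is the "log analysis" benefit the comment refers to.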