[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104581628

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, inputPaths, parsedOptions)
-    val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+    val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+    csv.flatMap { lines =>
       UnivocityParser.tokenizeStream(
         CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-        false,
+        shouldDropHeader = false,
         new CsvParser(parsedOptions.asParserSettings))
-    }.take(1).headOption
-
-    if (maybeFirstRow.isDefined) {
-      val firstRow = maybeFirstRow.get
-      val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
-      val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-      val tokenRDD = csv.flatMap { lines =>
-        UnivocityParser.tokenizeStream(
-          CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-          parsedOptions.headerFlag,
-          new CsvParser(parsedOptions.asParserSettings))
-      }
-      Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-    } else {
-      // If the first row could not be read, just return the empty schema.
-      Some(StructType(Nil))
+    }.take(1).headOption match {
--- End diff --

Thank you both for bearing with me.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user wojtek-szymanski commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104527927

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala (same hunk as quoted above, ending at `}.take(1).headOption match {`) ---
--- End diff --

@HyukjinKwon @cloud-fan many thanks for your effort. I really appreciate it and I will take it into account when working with the codebase.
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17068
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104357269

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala (same hunk as quoted above, ending at `}.take(1).headOption match {`) ---
--- End diff --

https://github.com/apache/spark/pull/17068#discussion_r104356866 had not shown up yet when I wrote my comment. I am fine with this as is; it is not my call to make.
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104356921

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala (same hunk as quoted above, ending at `}.take(1).headOption match {`) ---
--- End diff --

All three patterns I mentioned are used across the codebase, and neither https://github.com/databricks/scala-style-guide nor http://spark.apache.org/contributing.html has a style guide for this. In this case, matching the new code to the other similar code nearby is the better choice, since it reduces the number of changed lines; personal taste should be secondary.
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104356866

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala (same hunk as quoted above, ending at `}.take(1).headOption match {`) ---
--- End diff --

I don't have a strong preference; this looks fine.
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user wojtek-szymanski commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104353760

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala (same hunk as quoted above, ending at `}.take(1).headOption match {`) ---
--- End diff --

I would leave it as it is, since pattern matching still looks a bit clearer than conditionals. If minimizing changes is so critical, I can revert to the previous version here and replace the pattern matching with conditionals in my fix. @cloud-fan please advise.
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104350586

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala (same hunk as quoted above, ending at `}.take(1).headOption match {`) ---
--- End diff --

IMHO, `Option.isDefined` with `Option.get`, `Option.map` with `Option.getOrElse`, and `Option` with `match case Some ... case None` are all fine. But how about minimising the change by matching this one to the `Option.isDefined` with `Option.get` style above? Then it would not require the changes here.
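For readers following along, the three equivalent `Option`-consumption styles under discussion can be sketched in plain Scala. This is an illustrative sketch only (the variable names mirror the PR's `maybeFirstRow`, but the string results are hypothetical stand-ins for the real schema-inference calls):

```scala
// Hypothetical first row, as produced by take(1).headOption in the PR.
val maybeFirstRow: Option[Array[String]] = Some(Array("a", "b"))

// Style 1: Option.isDefined with Option.get
val schema1 = if (maybeFirstRow.isDefined) {
  s"inferred from ${maybeFirstRow.get.length} columns"
} else {
  "empty schema"
}

// Style 2: Option.map with Option.getOrElse
val schema2 = maybeFirstRow
  .map(row => s"inferred from ${row.length} columns")
  .getOrElse("empty schema")

// Style 3: pattern matching with Some/None
val schema3 = maybeFirstRow match {
  case Some(row) => s"inferred from ${row.length} columns"
  case None      => "empty schema"
}
```

All three produce the same result; the disagreement in the thread is purely about readability and diff size, not behavior.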
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104329421

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala (essentially the same hunk as quoted above, ending at `}.take(1).headOption match {`) ---
--- End diff --

same question
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104329410

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -133,20 +133,24 @@ object TextInputCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: Dataset[String] = createBaseDataset(sparkSession, inputPaths, parsedOptions)
-    val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, parsedOptions).first()
-    val firstRow = new CsvParser(parsedOptions.asParserSettings).parseLine(firstLine)
-    val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
-    val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-    val tokenRDD = csv.rdd.mapPartitions { iter =>
-      val filteredLines = CSVUtils.filterCommentAndEmpty(iter, parsedOptions)
-      val linesWithoutHeader =
-        CSVUtils.filterHeaderLine(filteredLines, firstLine, parsedOptions)
-      val parser = new CsvParser(parsedOptions.asParserSettings)
-      linesWithoutHeader.map(parser.parseLine)
+    val csv = createBaseDataset(sparkSession, inputPaths, parsedOptions)
+    CSVUtils.filterCommentAndEmpty(csv, parsedOptions).take(1).headOption match {
--- End diff --

why call `take(1)`?
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user wojtek-szymanski commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104317488

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -133,8 +133,19 @@ object TextInputCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: Dataset[String] = createBaseDataset(sparkSession, inputPaths, parsedOptions)
-    val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, parsedOptions).first()
+    val csv = createBaseDataset(sparkSession, inputPaths, parsedOptions)
+    CSVUtils.filterCommentAndEmpty(csv, parsedOptions)
+      .take(1)
+      .headOption
+      .map(firstLine => infer(sparkSession, parsedOptions, csv, firstLine))
+      .orElse(Some(StructType(Seq(
--- End diff --

I would suggest that we use pattern matching in order to make it more readable and avoid code like this:

    if (maybeFirstRow.isDefined) {
      val firstRow = maybeFirstRow.get

I also touched `WholeFileCSVDataSource` to unify both implementations. What's your opinion? Regarding code de-duplication, I fully agree that it should be done in a separate PR.
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user wojtek-szymanski commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104316989

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -305,13 +305,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   test("test with empty file and known schema") {
     val result = spark.read
       .format("csv")
-      .schema(StructType(List(StructField("column", StringType, false))))
+      .schema(StructType(List(StructField("column", StringType, nullable = false))))
       .load(testFile(emptyFile))
-    assert(result.collect.size === 0)
+    assert(result.collect().isEmpty)
     assert(result.schema.fieldNames.size === 1)
   }

+  test("test with empty file without schema") {
--- End diff --

Good idea, done
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104288543

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (same hunk as quoted above, at `test("test with empty file without schema") {`) ---
--- End diff --

Let's re-use the test in [CSVSuite.scala#L1083](https://github.com/wojtek-szymanski/spark/blob/bdf189087c52e8934cee4ee5563dffea9dde6a99/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala#L1083). We could:

```scala
test("Empty file produces empty dataframe with empty schema") {
  Seq(false, true).foreach { wholeFile =>
    val df = spark.read.format("csv")
      .option("header", true)
      .option("wholeFile", wholeFile)
      .load(testFile(emptyFile))

    assert(df.schema === spark.emptyDataFrame.schema)
    checkAnswer(df, spark.emptyDataFrame)
  }
}
```
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r104288502

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala (same hunk as quoted above, ending at `.orElse(Some(StructType(Seq(`) ---
--- End diff --

Could we maybe just match it to [CSVDataSource.scala#L204-L224](https://github.com/wojtek-szymanski/spark/blob/bdf189087c52e8934cee4ee5563dffea9dde6a99/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L204-L224), just for consistency for now? Personally, I think such chaining sometimes makes the code hard to read. Maybe we could consider de-duplicating this code in another PR; it would be easier if the two implementations at least looked similar.
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user wojtek-szymanski commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r103353729

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
@@ -40,7 +41,19 @@ private[csv] object CSVInferSchema {
       csv: Dataset[String],
       caseSensitive: Boolean,
       options: CSVOptions): StructType = {
-    val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, options).first()
+    val lines = CSVUtils.filterCommentAndEmpty(csv, options)
--- End diff --

You are absolutely right. Relying on exception handling is smelly, while `Option` gives more opportunities. I also see no difference from a performance point of view, since both `first()` and `take(1)` call the same function, `head(1)`.
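The behavioral difference the reviewers are weighing can be shown with a plain-Scala analogy (no Spark required). On a collection, `.head` behaves like `Dataset.first()`: it throws `NoSuchElementException` on empty input, while `take(1).headOption` turns "no first element" into a `None` the caller can handle:

```scala
import scala.util.Try

val empty: Seq[String] = Seq.empty

// first()-style access: throws NoSuchElementException on empty input,
// so we wrap it in Try here just to observe the failure.
val viaHead: Try[String] = Try(empty.head)
assert(viaHead.isFailure)

// take(1).headOption-style access: total, no exception to handle.
val viaHeadOption: Option[String] = empty.take(1).headOption
assert(viaHeadOption.isEmpty)

// On non-empty input both approaches yield the first element.
val nonEmpty = Seq("col_a,col_b")
assert(nonEmpty.take(1).headOption.contains("col_a,col_b"))
```

This is why the PR replaces `first()` with `take(1).headOption` at the schema-inference entry points: an empty CSV file then falls through to the empty-schema branch instead of propagating an exception.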
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17068#discussion_r103214603

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala (same hunk as quoted above, ending at `val lines = CSVUtils.filterCommentAndEmpty(csv, options)`) ---
--- End diff --

Hi @wojtek-szymanski, I think we should not rely on exception handling. I can think of `take(1).headOption`, but we could use a shorter one if you know any other good way. What do you think about this?
[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...
GitHub user wojtek-szymanski opened a pull request: https://github.com/apache/spark/pull/17068

[SPARK-19709][SQL] Read empty file with CSV data source

## What changes were proposed in this pull request?

Bugfix for reading an empty file with the CSV data source. Instead of throwing `NoSuchElementException`, an empty data frame is returned.

## How was this patch tested?

Added a new unit test in `org.apache.spark.sql.execution.datasources.csv.CSVSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wojtek-szymanski/spark SPARK-19709

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17068.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17068

commit 1c80d4c7933ce823129642c9bfa5d291e7530671
Author: Wojtek Szymanski
Date: 2017-02-25T23:37:19Z

    Fix: SPARK-19709 CSV datasource fails to read empty file

commit 90f315cea819b521745b637c36026fcdc16bfb0d
Author: Wojtek Szymanski
Date: 2017-02-25T23:43:17Z

    test renamed
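The core shape of the fix discussed in the thread above, inferring a schema from the first row when one exists and otherwise falling back to an empty schema, can be sketched in plain Scala. The types and names here (`Field`, `Schema`, `inferSchema`, `_c0`-style column names) are simplified stand-ins for Spark's `StructField`/`StructType` and real inference logic, not the actual Spark API:

```scala
// Simplified stand-ins for Spark's StructField/StructType, just to show
// the control flow of the fix (not the real Spark API).
case class Field(name: String)
case class Schema(fields: Seq[Field])

// `rows` stands in for the tokenized lines of a CSV file; it is empty
// for an empty file.
def inferSchema(rows: Seq[Array[String]]): Schema =
  rows.take(1).headOption match {
    case Some(firstRow) =>
      // A real implementation would run type inference over all rows;
      // here we just derive placeholder column names from the first row.
      Schema(firstRow.indices.map(i => Field(s"_c$i")))
    case None =>
      // Empty file: return an empty schema instead of letting a
      // NoSuchElementException escape, which is the bug this PR fixes.
      Schema(Nil)
  }
```

With this shape, `inferSchema(Seq.empty)` yields a schema with no fields rather than throwing, matching the PR's intent of returning an empty data frame for an empty file.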