[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104581628
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, inputPaths, parsedOptions)
-    val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+    val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+    csv.flatMap { lines =>
       UnivocityParser.tokenizeStream(
         CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-        false,
+        shouldDropHeader = false,
         new CsvParser(parsedOptions.asParserSettings))
-    }.take(1).headOption
-
-    if (maybeFirstRow.isDefined) {
-      val firstRow = maybeFirstRow.get
-      val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
-      val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-      val tokenRDD = csv.flatMap { lines =>
-        UnivocityParser.tokenizeStream(
-          CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-          parsedOptions.headerFlag,
-          new CsvParser(parsedOptions.asParserSettings))
-      }
-      Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-    } else {
-      // If the first row could not be read, just return the empty schema.
-      Some(StructType(Nil))
+    }.take(1).headOption match {
--- End diff --

Thank you both for bearing with me.





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-06 Thread wojtek-szymanski
Github user wojtek-szymanski commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104527927
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, inputPaths, parsedOptions)
-    val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+    val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+    csv.flatMap { lines =>
       UnivocityParser.tokenizeStream(
         CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-        false,
+        shouldDropHeader = false,
         new CsvParser(parsedOptions.asParserSettings))
-    }.take(1).headOption
-
-    if (maybeFirstRow.isDefined) {
-      val firstRow = maybeFirstRow.get
-      val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
-      val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-      val tokenRDD = csv.flatMap { lines =>
-        UnivocityParser.tokenizeStream(
-          CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-          parsedOptions.headerFlag,
-          new CsvParser(parsedOptions.asParserSettings))
-      }
-      Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-    } else {
-      // If the first row could not be read, just return the empty schema.
-      Some(StructType(Nil))
+    }.take(1).headOption match {
--- End diff --

@HyukjinKwon @cloud-fan many thanks for your effort. I really appreciate it 
and I will take it into account when working with the codebase.





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17068





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104357269
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, inputPaths, parsedOptions)
-    val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+    val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+    csv.flatMap { lines =>
       UnivocityParser.tokenizeStream(
         CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-        false,
+        shouldDropHeader = false,
         new CsvParser(parsedOptions.asParserSettings))
-    }.take(1).headOption
-
-    if (maybeFirstRow.isDefined) {
-      val firstRow = maybeFirstRow.get
-      val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
-      val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-      val tokenRDD = csv.flatMap { lines =>
-        UnivocityParser.tokenizeStream(
-          CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-          parsedOptions.headerFlag,
-          new CsvParser(parsedOptions.asParserSettings))
-      }
-      Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-    } else {
-      // If the first row could not be read, just return the empty schema.
-      Some(StructType(Nil))
+    }.take(1).headOption match {
--- End diff --

https://github.com/apache/spark/pull/17068#discussion_r104356866 had not shown up yet while I was writing my comment. I am fine with it as is; it is not my call to decide this.





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104356921
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, inputPaths, parsedOptions)
-    val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+    val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+    csv.flatMap { lines =>
       UnivocityParser.tokenizeStream(
         CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-        false,
+        shouldDropHeader = false,
         new CsvParser(parsedOptions.asParserSettings))
-    }.take(1).headOption
-
-    if (maybeFirstRow.isDefined) {
-      val firstRow = maybeFirstRow.get
-      val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
-      val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-      val tokenRDD = csv.flatMap { lines =>
-        UnivocityParser.tokenizeStream(
-          CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-          parsedOptions.headerFlag,
-          new CsvParser(parsedOptions.asParserSettings))
-      }
-      Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-    } else {
-      // If the first row could not be read, just return the empty schema.
-      Some(StructType(Nil))
+    }.take(1).headOption match {
--- End diff --

All three patterns I mentioned are used across the codebase, and neither https://github.com/databricks/scala-style-guide nor http://spark.apache.org/contributing.html gives guidance on this.

In this case, matching the new code to the other similar ones is the better choice, because it reduces the number of changed lines, rather than doing the opposite. Personal taste should be secondary.





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104356866
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, inputPaths, parsedOptions)
-    val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+    val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+    csv.flatMap { lines =>
       UnivocityParser.tokenizeStream(
         CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-        false,
+        shouldDropHeader = false,
         new CsvParser(parsedOptions.asParserSettings))
-    }.take(1).headOption
-
-    if (maybeFirstRow.isDefined) {
-      val firstRow = maybeFirstRow.get
-      val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
-      val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-      val tokenRDD = csv.flatMap { lines =>
-        UnivocityParser.tokenizeStream(
-          CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-          parsedOptions.headerFlag,
-          new CsvParser(parsedOptions.asParserSettings))
-      }
-      Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-    } else {
-      // If the first row could not be read, just return the empty schema.
-      Some(StructType(Nil))
+    }.take(1).headOption match {
--- End diff --

I don't have a strong preference; this looks fine.





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread wojtek-szymanski
Github user wojtek-szymanski commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104353760
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, inputPaths, parsedOptions)
-    val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+    val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+    csv.flatMap { lines =>
       UnivocityParser.tokenizeStream(
         CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-        false,
+        shouldDropHeader = false,
         new CsvParser(parsedOptions.asParserSettings))
-    }.take(1).headOption
-
-    if (maybeFirstRow.isDefined) {
-      val firstRow = maybeFirstRow.get
-      val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
-      val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-      val tokenRDD = csv.flatMap { lines =>
-        UnivocityParser.tokenizeStream(
-          CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-          parsedOptions.headerFlag,
-          new CsvParser(parsedOptions.asParserSettings))
-      }
-      Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-    } else {
-      // If the first row could not be read, just return the empty schema.
-      Some(StructType(Nil))
+    }.take(1).headOption match {
--- End diff --

I would leave it as it is, since pattern matching still reads a bit more clearly than conditionals. If minimizing changes is critical, I can revert to the previous version here and replace pattern matching with conditionals in my fix. @cloud-fan, please advise.





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104350586
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, inputPaths, parsedOptions)
-    val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+    val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+    csv.flatMap { lines =>
       UnivocityParser.tokenizeStream(
         CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-        false,
+        shouldDropHeader = false,
         new CsvParser(parsedOptions.asParserSettings))
-    }.take(1).headOption
-
-    if (maybeFirstRow.isDefined) {
-      val firstRow = maybeFirstRow.get
-      val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
-      val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-      val tokenRDD = csv.flatMap { lines =>
-        UnivocityParser.tokenizeStream(
-          CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-          parsedOptions.headerFlag,
-          new CsvParser(parsedOptions.asParserSettings))
-      }
-      Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-    } else {
-      // If the first row could not be read, just return the empty schema.
-      Some(StructType(Nil))
+    }.take(1).headOption match {
--- End diff --

IMHO, `Option.isDefined` with `Option.get`, `Option.map` with `Option.getOrElse`, and `Option` with `match ... case Some ... case None` are all fine. But how about minimising the change by matching the code above to the `Option.isDefined` with `Option.get` style? Then the changes here would not be required.
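
For concreteness, a minimal sketch of the three styles side by side (illustrative only; `maybeFirstRow` and the `inferFrom` helper are hypothetical stand-ins for the values in this method):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical stand-ins: maybeFirstRow: Option[Array[String]],
// inferFrom: Array[String] => StructType.

// 1. Option.isDefined with Option.get
val schema1 =
  if (maybeFirstRow.isDefined) inferFrom(maybeFirstRow.get)
  else StructType(Nil)

// 2. Option.map with Option.getOrElse
val schema2 = maybeFirstRow.map(inferFrom).getOrElse(StructType(Nil))

// 3. match with case Some / case None
val schema3 = maybeFirstRow match {
  case Some(firstRow) => inferFrom(firstRow)
  case None => StructType(Nil)
}
```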





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104329421
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -190,28 +194,26 @@ object WholeFileCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, inputPaths, parsedOptions)
-    val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+    val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+    csv.flatMap { lines =>
       UnivocityParser.tokenizeStream(
         CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-        false,
+        shouldDropHeader = false,
         new CsvParser(parsedOptions.asParserSettings))
-    }.take(1).headOption
-
-    if (maybeFirstRow.isDefined) {
-      val firstRow = maybeFirstRow.get
-      val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
-      val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-      val tokenRDD = csv.flatMap { lines =>
-        UnivocityParser.tokenizeStream(
-          CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, lines.getPath()),
-          parsedOptions.headerFlag,
-          new CsvParser(parsedOptions.asParserSettings))
-      }
-      Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-    } else {
-      // If the first row could not be read, just return the empty schema.
-      Some(StructType(Nil))
+    }.take(1).headOption match {
--- End diff --

same question





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104329410
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -133,20 +133,24 @@ object TextInputCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: Dataset[String] = createBaseDataset(sparkSession, inputPaths, parsedOptions)
-    val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, parsedOptions).first()
-    val firstRow = new CsvParser(parsedOptions.asParserSettings).parseLine(firstLine)
-    val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
-    val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-    val tokenRDD = csv.rdd.mapPartitions { iter =>
-      val filteredLines = CSVUtils.filterCommentAndEmpty(iter, parsedOptions)
-      val linesWithoutHeader =
-        CSVUtils.filterHeaderLine(filteredLines, firstLine, parsedOptions)
-      val parser = new CsvParser(parsedOptions.asParserSettings)
-      linesWithoutHeader.map(parser.parseLine)
+    val csv = createBaseDataset(sparkSession, inputPaths, parsedOptions)
+    CSVUtils.filterCommentAndEmpty(csv, parsedOptions).take(1).headOption match {
--- End diff --

why call `take(1)`?





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread wojtek-szymanski
Github user wojtek-szymanski commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104317488
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -133,8 +133,19 @@ object TextInputCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: Dataset[String] = createBaseDataset(sparkSession, inputPaths, parsedOptions)
-    val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, parsedOptions).first()
+    val csv = createBaseDataset(sparkSession, inputPaths, parsedOptions)
+    CSVUtils.filterCommentAndEmpty(csv, parsedOptions)
+      .take(1)
+      .headOption
+      .map(firstLine => infer(sparkSession, parsedOptions, csv, firstLine))
+      .orElse(Some(StructType(Seq(
--- End diff --

I would suggest that we use pattern matching in order to make the code more explicit and avoid code like this:

    if (maybeFirstRow.isDefined) {
      val firstRow = maybeFirstRow.get

I also touched `WholeFileCSVDataSource` to unify both implementations; a sketch of what I mean follows below. What's your opinion?
Regarding code de-duplication, I fully agree that it should be done in a separate PR.
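
For illustration, a minimal sketch of the match-based shape (reusing names from the `WholeFileCSVDataSource` diff quoted elsewhere in this thread, and assuming its surrounding definitions; not the exact committed code):

```scala
// Sketch only: maybeFirstRow, makeSafeHeader, caseSensitive, tokenRDD,
// parsedOptions and CSVInferSchema.infer come from the surrounding method.
maybeFirstRow match {
  case Some(firstRow) =>
    val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
    Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
  case None =>
    // If the first row could not be read, just return the empty schema.
    Some(StructType(Nil))
}
```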





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread wojtek-szymanski
Github user wojtek-szymanski commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104316989
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -305,13 +305,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   test("test with empty file and known schema") {
     val result = spark.read
       .format("csv")
-      .schema(StructType(List(StructField("column", StringType, false))))
+      .schema(StructType(List(StructField("column", StringType, nullable = false))))
       .load(testFile(emptyFile))
 
-    assert(result.collect.size === 0)
+    assert(result.collect().isEmpty)
     assert(result.schema.fieldNames.size === 1)
   }
 
+  test("test with empty file without schema") {
--- End diff --

Good idea, done





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104288543
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -305,13 +305,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   test("test with empty file and known schema") {
     val result = spark.read
       .format("csv")
-      .schema(StructType(List(StructField("column", StringType, false))))
+      .schema(StructType(List(StructField("column", StringType, nullable = false))))
       .load(testFile(emptyFile))
 
-    assert(result.collect.size === 0)
+    assert(result.collect().isEmpty)
     assert(result.schema.fieldNames.size === 1)
   }
 
+  test("test with empty file without schema") {
--- End diff --

Let's re-use the test in [CSVSuite.scala#L1083](https://github.com/wojtek-szymanski/spark/blob/bdf189087c52e8934cee4ee5563dffea9dde6a99/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala#L1083).

We could:

```scala
  test("Empty file produces empty dataframe with empty schema") {
Seq(false, true).foreach { wholeFile =>
  val df = spark.read.format("csv")
.option("header", true)
.option("wholeFile", true)
.load(testFile(emptyFile))

assert(df.schema === spark.emptyDataFrame.schema)
checkAnswer(df, spark.emptyDataFrame)
  }
}
  }
```





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104288502
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ---
@@ -133,8 +133,19 @@ object TextInputCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
-    val csv: Dataset[String] = createBaseDataset(sparkSession, inputPaths, parsedOptions)
-    val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, parsedOptions).first()
+    val csv = createBaseDataset(sparkSession, inputPaths, parsedOptions)
+    CSVUtils.filterCommentAndEmpty(csv, parsedOptions)
+      .take(1)
+      .headOption
+      .map(firstLine => infer(sparkSession, parsedOptions, csv, firstLine))
+      .orElse(Some(StructType(Seq(
--- End diff --

Could we maybe just match it to [CSVDataSource.scala#L204-L224](https://github.com/wojtek-szymanski/spark/blob/bdf189087c52e8934cee4ee5563dffea9dde6a99/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L204-L224) just for consistency for now?

Personally, I think such chaining sometimes makes the code hard to read. Maybe we could consider the code de-duplication in another PR. It would be easier if they at least looked similar.





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-02-27 Thread wojtek-szymanski
Github user wojtek-szymanski commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r103353729
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
@@ -40,7 +41,19 @@ private[csv] object CSVInferSchema {
       csv: Dataset[String],
       caseSensitive: Boolean,
       options: CSVOptions): StructType = {
-    val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, options).first()
+    val lines = CSVUtils.filterCommentAndEmpty(csv, options)
--- End diff --

You are absolutely right. Relying on exception handling is smelly, while `Option` gives more opportunities. I also see no difference from a performance point of view, since both `first()` and `take(1)` call the same function, `head(1)`.





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r103214603
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
@@ -40,7 +41,19 @@ private[csv] object CSVInferSchema {
       csv: Dataset[String],
       caseSensitive: Boolean,
       options: CSVOptions): StructType = {
-    val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, options).first()
+    val lines = CSVUtils.filterCommentAndEmpty(csv, options)
--- End diff --

Hi @wojtek-szymanski, I think we should not rely on exception handling. I can think of `take(1).headOption`, but we could use a shorter one if you know any other good way. What do you think about this?
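
A minimal sketch of the difference (assuming `lines: Dataset[String]`; the helper name is hypothetical):

```scala
import org.apache.spark.sql.Dataset

// first() throws NoSuchElementException on an empty Dataset, whereas
// take(1) returns a possibly-empty Array whose headOption is simply None.
def firstLineOption(lines: Dataset[String]): Option[String] =
  lines.take(1).headOption
```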





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-02-25 Thread wojtek-szymanski
GitHub user wojtek-szymanski opened a pull request:

https://github.com/apache/spark/pull/17068

[SPARK-19709][SQL] Read empty file with CSV data source

## What changes were proposed in this pull request?

Bugfix for reading an empty file with the CSV data source. Instead of throwing `NoSuchElementException`, an empty data frame is now returned.
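
A minimal repro (sketch; assumes an active `spark` session and a hypothetical empty file at `/tmp/empty.csv`):

```scala
// Before this fix, schema inference on an empty file threw
// NoSuchElementException; with it, an empty data frame comes back.
val df = spark.read.format("csv").load("/tmp/empty.csv")
assert(df.schema.isEmpty)      // no columns inferred
assert(df.collect().isEmpty)   // no rows
```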
 
## How was this patch tested?

Added a new unit test in `org.apache.spark.sql.execution.datasources.csv.CSVSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wojtek-szymanski/spark SPARK-19709

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17068.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17068


commit 1c80d4c7933ce823129642c9bfa5d291e7530671
Author: Wojtek Szymanski 
Date:   2017-02-25T23:37:19Z

Fix: SPARK-19709 CSV datasource fails to read empty file

commit 90f315cea819b521745b637c36026fcdc16bfb0d
Author: Wojtek Szymanski 
Date:   2017-02-25T23:43:17Z

test renamed



