[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12904#issuecomment-218976260 @sureshthalamati oh, the comments are not related with this PR but moving the discussion to here was suggested. So, i did. Sorry for that if it was confusing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...
Github user sureshthalamati commented on the pull request: https://github.com/apache/spark/pull/12904#issuecomment-218975151 @HyukjinKwon does your previous comment for meant for some other PR ? This PR does not have any change you mentioned above. Am I missing some thing ? @andrewor14 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12904#issuecomment-218680978 After the discussion https://github.com/apache/spark/pull/13041, I think the field below: ```scala StructField("", StringType, nullable =true) ``` should be renamed to a field having prefix `_c`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12904#issuecomment-218635820 After the discussion here, https://github.com/apache/spark/pull/12904, then I think ```scala StructField("", StringType, nullable = true) ``` should be renamed to `_c0`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12904#issuecomment-218633000 @andrewor14 Here it seems it is concluded that `""` is a string and a empty string is `null`. Because `""` is a legitimate string, this can be a field name whereas a empty string might not be able (`null`). Although we cannot query to empty strings, we can still see when it performs `show()`. It seems anyway it is decided that empty strings will be unnamed columns with prefix (`_c`). Do you think this can be generalized to data sources and Spark SQL including fields name with spaces (eg. `""`)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12904#issuecomment-217765823 +1 for @sureshthalamati #12921 handles the inconsistent behaviour and this is why I think we should hold off this until that PR is merged. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...
Github user sureshthalamati commented on the pull request: https://github.com/apache/spark/pull/12904#issuecomment-217765505 I am not sure what was the history behind returning empty String for null value. In my opinion it should be null be default. current behavior is also inconsistent; for numerics it will return null and for strings it will return empty string by default. Example: See the Year (int), and comment (String in the following data). year,make,model,comment,price 2017,Tesla,Mode 3,looks nice.,35000.99 ,Chevy,Bolt,,29000.00 2015,Porsche,"",, scala> val df= sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/tmp/test1.csv") df: org.apache.spark.sql.DataFrame = [year: int, make: string ... 3 more fields] scala> df.show ++---+--+---++ |year| make| model|comment| price| ++---+--+---++ |2017| Tesla|Mode 3|looks nice.|35000.99| |null| Chevy| Bolt| | 29000.0| |2015|Porsche| null| |null| ++---+--+---++ I can update this PR to change the nullValue default if needed, --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...
Github user sureshthalamati commented on a diff in the pull request: https://github.com/apache/spark/pull/12904#discussion_r62444274 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -555,4 +558,37 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { assert(numbers.count() == 8) } + + test("load data with empty quoted string fields.") { +val results = sqlContext + .read + .format("csv") + .options(Map( +"header" -> "true", +"nullValue" -> null, --- End diff -- If nullValue is not set it will return empty string for null values by default. The reason I set it explicitly is to make sure my fix is working. Before my fix it was retruning null for the empty quoted string , and empty string for null values by default. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12904#issuecomment-217605838 In case of writing, I think ``` Row("", "null", null) ``` should produce the CSV as below: 1. With the option, `nullValue` set to `"null"`, I think ```csv ,null,null ``` 2. Without any options, I think ```csv ,null, ``` 3. With the option, `nullValue` set to `""`, I think ```csv ,null, ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12904#issuecomment-217605160 Here is what I think CSV datasource should handle `""`, empty string and `nullValue`. With the option, `nullValue` set to `"null"`, I think ```csv ,"","null" ``` should produce the records as below: ``` Row(null, "", null) ``` Would this make sense? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/12904#issuecomment-217604986 @rxin @sureshthalamati Do you mind holding off this change until #12921 is merged? That PR also handles `nullValue`. Apparently, I guess `nullValue` could affect this behaviour. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/12904#discussion_r62411095 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -555,4 +558,37 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { assert(numbers.count() == 8) } + + test("load data with empty quoted string fields.") { +val results = sqlContext + .read + .format("csv") + .options(Map( +"header" -> "true", +"nullValue" -> null, --- End diff -- Could I ask what happen if we don't set `nullValue`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...
Github user sureshthalamati commented on the pull request: https://github.com/apache/spark/pull/12904#issuecomment-217555333 Thank you for the feedback , Reynold , HyukjinKwon. Update the PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org