[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...

2016-05-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the pull request:

https://github.com/apache/spark/pull/12904#issuecomment-218976260
  
@sureshthalamati oh, the comments are not related with this PR but moving 
the discussion to here was suggested. So, i did. Sorry for that if it was 
confusing. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...

2016-05-13 Thread sureshthalamati
Github user sureshthalamati commented on the pull request:

https://github.com/apache/spark/pull/12904#issuecomment-218975151
  
@HyukjinKwon  does your previous comment for meant for  some other PR ? 
This PR does not have any change you mentioned above. Am I missing some thing ?

@andrewor14 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...

2016-05-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the pull request:

https://github.com/apache/spark/pull/12904#issuecomment-218680978
  
After the discussion https://github.com/apache/spark/pull/13041, I think 
the field below:

```scala
StructField("", StringType, nullable =true)
```

should be renamed to a field having prefix `_c`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...

2016-05-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the pull request:

https://github.com/apache/spark/pull/12904#issuecomment-218635820
  
After the discussion here, https://github.com/apache/spark/pull/12904, then 
I think 

```scala
StructField("", StringType, nullable = true)
```

should be renamed to `_c0`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...

2016-05-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the pull request:

https://github.com/apache/spark/pull/12904#issuecomment-218633000
  
@andrewor14 Here it seems it is concluded that `""` is a string and a empty 
string is `null`. Because `""` is a legitimate string, this can be a field name 
whereas a empty string might not be able (`null`). Although we cannot query to 
empty strings, we can still see when it performs `show()`.

It seems anyway it is decided that empty strings will be unnamed columns 
with prefix (`_c`). Do you think this can be generalized to data sources and 
Spark SQL including fields name with spaces (eg. `""`)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...

2016-05-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the pull request:

https://github.com/apache/spark/pull/12904#issuecomment-217765823
  
+1 for @sureshthalamati  
#12921 handles the inconsistent behaviour and this is why I think we should 
hold off this until that PR is merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...

2016-05-08 Thread sureshthalamati
Github user sureshthalamati commented on the pull request:

https://github.com/apache/spark/pull/12904#issuecomment-217765505
  
I am not sure what was the history behind returning empty String for null 
value. In my opinion it should be null be default.  current behavior is also  
inconsistent; for numerics it will return null and for strings it will return 
empty string by default. 

Example:
See the Year (int), and comment (String in the following data).
year,make,model,comment,price
2017,Tesla,Mode 3,looks nice.,35000.99
,Chevy,Bolt,,29000.00
2015,Porsche,"",,
scala> val df= sqlContext.read.format("csv").option("header", 
"true").option("inferSchema", "true").load("/tmp/test1.csv")
df: org.apache.spark.sql.DataFrame = [year: int, make: string ... 3 more 
fields]

scala> df.show
++---+--+---++
|year|   make| model|comment|   price|
++---+--+---++
|2017|  Tesla|Mode 3|looks nice.|35000.99|
|null|  Chevy|  Bolt|   | 29000.0|
|2015|Porsche|  null|   |null|
++---+--+---++


I can update this PR to change the nullValue default if  needed, 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...

2016-05-08 Thread sureshthalamati
Github user sureshthalamati commented on a diff in the pull request:

https://github.com/apache/spark/pull/12904#discussion_r62444274
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -555,4 +558,37 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 
 assert(numbers.count() == 8)
   }
+
+  test("load data with empty quoted string fields.") {
+val results = sqlContext
+  .read
+  .format("csv")
+  .options(Map(
+"header" -> "true",
+"nullValue" -> null,
--- End diff --

If nullValue is not set  it  will return empty string for null values by 
default.  The reason I set it explicitly is to make sure my fix is working.  
Before my fix it was  retruning null for the empty quoted string , and empty  
string for null values by default.  



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...

2016-05-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the pull request:

https://github.com/apache/spark/pull/12904#issuecomment-217605838
  
In case of writing, I think

```
Row("", "null", null)
```
should produce the CSV as below:

1. With the option, `nullValue` set to `"null"`, I think 

  ```csv
,null,null
```

2. Without any options, I think 

  ```csv
,null,
```

3. With the option, `nullValue` set to `""`, I think 

  ```csv
,null,
```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...

2016-05-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the pull request:

https://github.com/apache/spark/pull/12904#issuecomment-217605160
  
Here is what I think CSV datasource should handle `""`, empty string and 
`nullValue`.

With the option, `nullValue` set to `"null"`, I think 

```csv
,"","null"
```

should produce the records as below:

```
Row(null, "", null)
```

Would this make sense?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...

2016-05-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the pull request:

https://github.com/apache/spark/pull/12904#issuecomment-217604986
  
@rxin @sureshthalamati Do you mind holding off this change until #12921 is 
merged? That PR also handles `nullValue`. Apparently, I guess `nullValue` could 
affect this behaviour. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...

2016-05-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/12904#discussion_r62411095
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -555,4 +558,37 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 
 assert(numbers.count() == 8)
   }
+
+  test("load data with empty quoted string fields.") {
+val results = sqlContext
+  .read
+  .format("csv")
+  .options(Map(
+"header" -> "true",
+"nullValue" -> null,
--- End diff --

Could I ask what happen if we don't set `nullValue`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15125][SQL] Changing CSV data source ma...

2016-05-06 Thread sureshthalamati
Github user sureshthalamati commented on the pull request:

https://github.com/apache/spark/pull/12904#issuecomment-217555333
  
Thank you for the feedback , Reynold , HyukjinKwon.  Update the PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org