[GitHub] spark issue #19613: Fixed a typo

2017-10-30 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19613
  
Hi @ganeshchand, could you also fix the typo in `JdbcUtils.scala`? Thanks!
#L459 underling => underlying


---




[GitHub] spark issue #19604: [SPARK-22291][SQL] Conversion error when transforming ar...

2017-10-30 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19604
  
Sure, I'll close it. Thank you, and thanks to @viirya as well.


---




[GitHub] spark pull request #19604: [SPARK-22291][SQL] Conversion error when transfor...

2017-10-30 Thread jmchung
Github user jmchung closed the pull request at:

https://github.com/apache/spark/pull/19604


---




[GitHub] spark issue #19604: [SPARK-22291][SQL][FOLLOWUP] Conversion error when trans...

2017-10-29 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19604
  
I found that the Docker-based integration tests on this branch fail because the image `wnameless/oracle-xe-11g:14.04.4` cannot be pulled. Should we move to `wnameless/oracle-xe-11g`?

```
Error response from daemon: manifest for wnameless/oracle-xe-11g:14.04.4 not found
```


---




[GitHub] spark pull request #19604: [SPARK-22291][SQL][FOLLOWUP] Conversion error whe...

2017-10-29 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19604#discussion_r147605119
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
@@ -440,8 +440,9 @@ object JdbcUtils extends Logging {
 
 case StringType =>
   (array: Object) =>
-array.asInstanceOf[Array[java.lang.String]]
-  .map(UTF8String.fromString)
+// some underling types are not String such as uuid, inet, cidr, etc.
--- End diff --

Oops, that was a typo. Thanks @viirya!!


---




[GitHub] spark issue #19604: [SPARK-22291][SQL][FOLLOWUP] Conversion error when trans...

2017-10-29 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19604
  
cc @cloud-fan, this is the follow-up PR for 2.2, thanks!


---




[GitHub] spark pull request #19604: [SPARK-22291][SQL][FOLLOWUP] Conversion error whe...

2017-10-29 Thread jmchung
GitHub user jmchung opened a pull request:

https://github.com/apache/spark/pull/19604

[SPARK-22291][SQL][FOLLOWUP] Conversion error when transforming array types of uuid, inet and cidr to StringType in PostgreSQL

… types of uuid, inet and cidr to StringType in PostgreSQL

## What changes were proposed in this pull request?

This is a follow-up of #19567, to fix the conversion error when transforming array types of uuid, inet and cidr to StringType in PostgreSQL for Spark 2.2.

## How was this patch tested?

Added test in `PostgresIntegrationSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jmchung/spark SPARK-22291-FOLLOWUP

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19604.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19604


commit 549cb814e01c2338a67c4a9efa4d880a3fb9cdac
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-10-30T01:25:28Z

[SPARK-22291][SQL][FOLLOWUP] Conversion error when transforming array types of uuid, inet and cidr to StringType in PostgreSQL




---




[GitHub] spark pull request #19567: [SPARK-22291][SQL] Conversion error when transfor...

2017-10-29 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19567#discussion_r147581349
  
--- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala ---
@@ -134,11 +149,28 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite {
 assert(schema(1).dataType == ShortType)
   }
 
-  test("SPARK-20557: column type TIMESTAMP with TIME ZONE and TIME with TIME ZONE should be recognized") {
+  test("SPARK-20557: column type TIMESTAMP with TIME ZONE and TIME with TIME ZONE " +
+"should be recognized") {
 val dfRead = sqlContext.read.jdbc(jdbcUrl, "ts_with_timezone", new Properties)
 val rows = dfRead.collect()
 val types = rows(0).toSeq.map(x => x.getClass.toString)
 assert(types(1).equals("class java.sql.Timestamp"))
 assert(types(2).equals("class java.sql.Timestamp"))
   }
+
+  test("SPARK-22291: PostgreSQL UUID[] to StringType: Conversion Error") {
--- End diff --

Oops, I forgot it, thanks!!


---




[GitHub] spark pull request #19567: [SPARK-22291][SQL] Conversion error when transfor...

2017-10-29 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19567#discussion_r147579149
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
@@ -456,8 +456,10 @@ object JdbcUtils extends Logging {
 
 case StringType =>
   (array: Object) =>
-array.asInstanceOf[Array[java.lang.String]]
-  .map(UTF8String.fromString)
+// some underling types are not String such as uuid, inet, cidr, etc.
+array.asInstanceOf[Array[java.lang.Object]]
+  .map(obj => UTF8String.fromString(
+if (obj == null) null else obj.toString))
--- End diff --

Thanks @cloud-fan !


---




[GitHub] spark issue #19567: [SPARK-22291][SQL] Conversion error when transforming ar...

2017-10-29 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19567
  
gentle ping @cloud-fan and @viirya, there is some feedback about the behavior of converting obj to string.

``` scala
case StringType =>
  (array: Object) =>
array.asInstanceOf[Array[java.lang.Object]]
  .map(obj => UTF8String.fromString(obj.toString))
```
As @HyukjinKwon mentioned before, it may be unsafe to convert the object to a string directly; we get a `java.lang.NullPointerException` in the test cases. `StringType` differs from the other types, which use the `nullSafeConvert` method to deal with nulls; here the null handling happens inside `UTF8String.fromString`.

Since `UTF8String.fromString` returns either null or a UTF8String, we cannot use `String.valueOf(obj)` to avoid the NPE; it raises a different error (because `null != "null"`):

```
- Type mapping for various types *** FAILED ***
  Array("a", "null", "b") did not equal List("a", null, "b") (PostgresIntegrationSuite.scala:120)
```
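
For illustration, a minimal sketch of the `null != "null"` pitfall (assuming Spark's `org.apache.spark.unsafe.types.UTF8String` is on the classpath):

```scala
import org.apache.spark.unsafe.types.UTF8String

val obj: Object = null

// String.valueOf(null) yields the four-character string "null", not a null
// reference, so a SQL NULL would round-trip as the text "null".
val wrong: String = String.valueOf(obj)              // "null"

// UTF8String.fromString passes a null reference straight through,
// preserving SQL NULL semantics.
val right: UTF8String = UTF8String.fromString(null)  // null
```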

IMHO, if we don't want to split this into separate match cases, we can keep nulls as nulls to match the original design, e.g.,

``` scala
case StringType =>
  (array: Object) =>
array.asInstanceOf[Array[java.lang.Object]]
  .map(obj => UTF8String.fromString(
if (obj == null) null else obj.toString))
```

An alternative is to keep separate String and Object cases. Would you mind giving some comments? I can keep working on this :)


---




[GitHub] spark issue #19567: [SPARK-22291][SQL] Conversion error when transforming ar...

2017-10-28 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19567
  
Thanks @viirya, the title has been changed. Please correct me if the 
modified title is still inappropriate.


---




[GitHub] spark pull request #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra...

2017-10-27 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19567#discussion_r147542014
  
--- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala ---
@@ -134,11 +149,28 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite {
 assert(schema(1).dataType == ShortType)
   }
 
-  test("SPARK-20557: column type TIMESTAMP with TIME ZONE and TIME with TIME ZONE should be recognized") {
+  test("SPARK-20557: column type TIMESTAMP with TIME ZONE and TIME with TIME ZONE " +
+"should be recognized") {
 val dfRead = sqlContext.read.jdbc(jdbcUrl, "ts_with_timezone", new Properties)
 val rows = dfRead.collect()
 val types = rows(0).toSeq.map(x => x.getClass.toString)
 assert(types(1).equals("class java.sql.Timestamp"))
 assert(types(2).equals("class java.sql.Timestamp"))
   }
+
+  test("SPARK-22291: Postgresql UUID[] to Cassandra: Conversion Error") {
--- End diff --

Thanks @gatorsmile, `Cassandra` has been removed from the title. Feel free to correct me if the revised title is still inappropriate. Do you have more comments?


---




[GitHub] spark pull request #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra...

2017-10-26 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19567#discussion_r147316748
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
@@ -456,8 +456,17 @@ object JdbcUtils extends Logging {
 
 case StringType =>
   (array: Object) =>
-array.asInstanceOf[Array[java.lang.String]]
-  .map(UTF8String.fromString)
+array match {
+  case _: Array[java.lang.String] =>
+array.asInstanceOf[Array[java.lang.String]]
+  .map(UTF8String.fromString)
+  case _: Array[java.util.UUID] =>
+array.asInstanceOf[Array[java.util.UUID]]
--- End diff --

Yes, `Object` can cover `UUID` since it's the superclass of every class in Java. Should we avoid distinguishing explicitly between the classes? Would you mind giving some comments? Thanks!


---




[GitHub] spark pull request #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra...

2017-10-26 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19567#discussion_r147314628
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
@@ -456,8 +456,17 @@ object JdbcUtils extends Logging {
 
 case StringType =>
   (array: Object) =>
-array.asInstanceOf[Array[java.lang.String]]
-  .map(UTF8String.fromString)
+array match {
+  case _: Array[java.lang.String] =>
+array.asInstanceOf[Array[java.lang.String]]
+  .map(UTF8String.fromString)
+  case _: Array[java.util.UUID] =>
+array.asInstanceOf[Array[java.util.UUID]]
+  .map(uuid => UTF8String.fromString(uuid.toString))
+  case _ =>
--- End diff --

I was wrong: `org.postgresql.util.PGobject` cannot be cast to `java.lang.String`.


---




[GitHub] spark pull request #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra...

2017-10-26 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19567#discussion_r147312311
  
--- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala ---
@@ -18,7 +18,7 @@
 package org.apache.spark.sql.jdbc
 
 import java.sql.Connection
-import java.util.Properties
+import java.util.{Properties, UUID}
--- End diff --

Thanks! Removed the unused import statement.


---




[GitHub] spark pull request #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra...

2017-10-26 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19567#discussion_r147309952
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
@@ -456,8 +456,17 @@ object JdbcUtils extends Logging {
 
 case StringType =>
   (array: Object) =>
-array.asInstanceOf[Array[java.lang.String]]
-  .map(UTF8String.fromString)
+array match {
+  case _: Array[java.lang.String] =>
+array.asInstanceOf[Array[java.lang.String]]
+  .map(UTF8String.fromString)
+  case _: Array[java.util.UUID] =>
+array.asInstanceOf[Array[java.util.UUID]]
+  .map(uuid => UTF8String.fromString(uuid.toString))
+  case _ =>
--- End diff --

Would it be more appropriate to use `obj.asInstanceOf[String]` for the conversion?


---




[GitHub] spark issue #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra: Conve...

2017-10-26 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19567
  
Thanks @HyukjinKwon :)


---




[GitHub] spark issue #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra: Conve...

2017-10-25 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19567
  
gentle ping @wangyum and @viirya, the test case has been added to `PostgresIntegrationSuite`. As @viirya mentioned earlier, the data types `inet[]` and `cidr[]` do not arrive as instances of String; they are both mapped to Java's Object class.


---




[GitHub] spark issue #19567: [SPARK-22291] Postgresql UUID[] to Cassandra: Conversion...

2017-10-24 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19567
  
Thanks @wangyum and @viirya, I'll add the corresponding tests in `PostgresIntegrationSuite`.
To @viirya: I'm not sure whether the other data types will work; I'll consider including them in the tests, thanks!


---




[GitHub] spark issue #19567: [SPARK-22291] Postgresql UUID[] to Cassandra: Conversion...

2017-10-24 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19567
  
cc @viirya Can you help review this? Thanks.


---




[GitHub] spark pull request #19567: [SPARK-22291] Postgresql UUID[] to Cassandra: Con...

2017-10-24 Thread jmchung
GitHub user jmchung opened a pull request:

https://github.com/apache/spark/pull/19567

[SPARK-22291] Postgresql UUID[] to Cassandra: Conversion Error

## What changes were proposed in this pull request?

This PR fixes the conversion error when reading data from a PostgreSQL table that contains columns of the `UUID[]` data type.

For example, create a table with the `UUID[]` data type and insert the test data.
```SQL
CREATE TABLE users
(
id smallint NOT NULL,
name character varying(50),
user_ids uuid[],
PRIMARY KEY (id)
)

INSERT INTO users ("id", "name","user_ids") 
VALUES (1, 'foo', ARRAY
['7be8aaf8-650e-4dbb-8186-0a749840ecf2'
,'205f9bfc-018c-4452-a605-609c0cfad228']::UUID[]
)
```
Then it will throw the following exceptions when trying to load the data.
```
java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to [Ljava.lang.String;
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:459)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:458)
...
```
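
The root cause is the unconditional cast to `Array[java.lang.String]`. For reference, a minimal sketch of the null-safe, element-wise conversion discussed in the review, simplified from the `JdbcUtils` diff (`toUtf8Strings` is a hypothetical name for illustration):

```scala
import org.apache.spark.unsafe.types.UTF8String

// The PostgreSQL driver materializes uuid[]/inet[]/cidr[] array elements as
// java.util.UUID or PGobject rather than java.lang.String, so cast to a
// generic Object array and stringify each element, preserving nulls.
def toUtf8Strings(array: Object): Array[UTF8String] =
  array.asInstanceOf[Array[java.lang.Object]]
    .map(obj => UTF8String.fromString(if (obj == null) null else obj.toString))
```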


## How was this patch tested?

Existing tests.

I tried to imitate the tests for the above case in `JDBCSuite`, but the `ARRAY` type is unsupported there for now. Therefore I ran the above example against my Postgres instance and verified it with the following code.

```scala
val opts = Map(
  "url" -> "jdbc:postgresql://localhost:5432/postgres?user=postgres&password=postgres",
  "dbtable" -> "users")
val df = spark.read.format("jdbc").options(opts).load()
df.show(truncate = false)

+---+----+----------------------------------------------------------------------------+
|id |name|user_ids                                                                    |
+---+----+----------------------------------------------------------------------------+
|1  |foo |[7be8aaf8-650e-4dbb-8186-0a749840ecf2, 205f9bfc-018c-4452-a605-609c0cfad228]|
+---+----+----------------------------------------------------------------------------+

```


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jmchung/spark SPARK-22291

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19567.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19567


commit d84b1bb89e3be33931531345fb23cadd8fe6868f
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-10-24T18:24:43Z

[SPARK-22291] Postgresql UUID[] to Cassandra: Conversion Error




---




[GitHub] spark issue #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not han...

2017-09-12 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19199
  
Thanks @HyukjinKwon and @viirya :)


---




[GitHub] spark pull request #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are ...

2017-09-12 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19199#discussion_r138287636
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---
@@ -109,6 +109,20 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {
   }
 }
 
+if (requiredSchema.length == 1 &&
+  requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
+  throw new AnalysisException(
+"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the\n" +
+  "referenced columns only include the internal corrupt record column\n" +
+  s"(named ${parsedOptions.columnNameOfCorruptRecord} by default). For example:\n" +
--- End diff --

Thanks @viirya. Should we also replace the weird part in `JsonFileFormat` with `_corrupt_record`?


---




[GitHub] spark issue #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not han...

2017-09-12 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19199
  
cc @gatorsmile, @HyukjinKwon and @viirya. Could you guys help to review 
this? Thanks.


---




[GitHub] spark pull request #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are ...

2017-09-12 Thread jmchung
GitHub user jmchung opened a pull request:

https://github.com/apache/spark/pull/19199

[SPARK-21610][SQL][FOLLOWUP] Corrupt records are not handled properly when 
creating a dataframe from a file

## What changes were proposed in this pull request?

When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and the `_corrupt_record` values are null for all rows. This PR captures the above situation and raises an exception with a reasonable workaround message so that users know what happened and how to fix the query.

## How was this patch tested?

Added unit test in `CSVSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jmchung/spark SPARK-21610-FOLLOWUP

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19199.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19199


commit e703fc8f33d1fde90d790057481f1d23f466f378
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-09-12T06:48:33Z

follow-up PR for CSV




---




[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-10 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/18865
  
@gatorsmile Sure, I'll make a follow-up PR for CSV.
Many thanks for everyone's feedback on this patch; I really learned a lot from it.


---




[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-10 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/18865
  
cc @gatorsmile Please take another look when you have time. I've already 
updated. Thanks!


---




[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-09 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/18865
  
@gatorsmile those negative cases are already added in `JsonSuite`.


---




[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...

2017-09-09 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18865#discussion_r137928316
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1542,6 +1542,10 @@ options.
 
 # Migration Guide
 
+## Upgrading From Spark SQL 2.2 to 2.3
+
+  - The queries which select only `spark.sql.columnNameOfCorruptRecord` 
column are disallowed now. Notice that the queries which have only the column 
after column pruning (e.g. filtering on the column followed by a counting 
operation) are also disallowed. If you want to select only the corrupt records, 
you should cache or save the underlying Dataset and DataFrame before running 
such queries.
--- End diff --

Maybe a typo: `_corrupt_column` should be `_corrupt_record`.


---




[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...

2017-09-09 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18865#discussion_r137924841
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2034,4 +2034,25 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
   }
 }
   }
+
+  test("SPARK-21610: Corrupt records are not handled properly when creating a dataframe " +
--- End diff --

Sure, I'll update the PR.


---




[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-08 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/18865
  
Thank you for the review, @HyukjinKwon.

cc @gatorsmile
Could you review this again when you have some time? Thanks!


---




[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-08 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/18865
  
@viirya, thank you so much for taking a look and your time.


---




[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-05 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/18865
  
Could @gatorsmile and @HyukjinKwon please share some guidance on how the exception message should be revised? The current message explains why the query is disallowed when users select only `_corrupt_record`, and provides an alternative way to get the corrupt records.


---




[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...

2017-09-03 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18865#discussion_r136715367
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
@@ -113,6 +113,18 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
   }
 }
 
+if (requiredSchema.length == 1 &&
+  requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
+  throw new AnalysisException(
+s"""
+   |'${parsedOptions.columnNameOfCorruptRecord}' cannot be selected alone without other
+   |data columns. If you want to select corrupt records only, cache or save the Dataset
+   |before executing queries. For example:
+   |df.cache() then
+   |df.select("${parsedOptions.columnNameOfCorruptRecord}")
+ """.stripMargin.replaceAll(System.lineSeparator(), " "))
--- End diff --

agreed, will fix it in the next commit, thanks.


---



[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-03 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/18865
  
To @viirya and @gatorsmile, I made some modifications as follows:
1. moved the check of `_corrupt_record` out of the function block, to fail fast on the driver side instead of the executor side.
2. changed `sparkContext.parallelize(data, 1)` to `sparkContext.parallelize(Seq(data), 1)`; without `Seq`, the data is stored with one character per line (see the sketch after the schema below).
3. `schema` is still required to load the JSON file in this test case. Could it be that schema inference did not include `_corrupt_record`?

BTW, the previous test case gets the following schema:
```
root
 |-- _corrupt_record: string (nullable = true)
```
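
A minimal illustration of point 2, assuming a live `SparkContext` named `sc`:

```scala
val data = """{"field": 1}"""

// A String is implicitly a Seq[Char], so each character becomes its own
// RDD element, and therefore its own line when written out as text.
sc.parallelize(data, 1).collect()       // Array('{', '"', 'f', 'i', ...)

// Wrapping the string in Seq keeps the whole record as a single element.
sc.parallelize(Seq(data), 1).collect()  // Array({"field": 1})
```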


---



[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-02 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/18865
  
@gatorsmile Thanks for your feedback.


---



[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-02 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/18865
  
@viirya Thanks, the PR description has been updated.


---



[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-02 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/18865
  
Thanks for @viirya's suggestion; the redundant comment has been removed and `withTempPath` is now used in the test case.


---



[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...

2017-09-01 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/18865
  
gentle ping @viirya, @gatorsmile, I made a minor change to throw a reasonable workaround message.


---



[GitHub] spark issue #19017: [SPARK-21804][SQL] json_tuple returns null values within...

2017-08-24 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19017
  
Thanks @viirya, @HyukjinKwon and @gatorsmile.


---



[GitHub] spark issue #19017: [SPARK-21804][SQL] json_tuple returns null values within...

2017-08-24 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19017
  
@gatorsmile ok, and many thanks for all the nice comments.


---



[GitHub] spark pull request #19017: [SPARK-21804][SQL] json_tuple returns null values...

2017-08-23 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19017#discussion_r134925669
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -447,7 +448,18 @@ case class JsonTuple(children: Seq[Expression])
   generator => copyCurrentStructure(generator, parser)
 }
 
-row(idx) = UTF8String.fromBytes(output.toByteArray)
+val jsonValue = UTF8String.fromBytes(output.toByteArray)
+row(idx) = jsonValue
+idx = idx + 1
+
+// SPARK-21804: json_tuple returns null values within repeated columns
+// except the first one; so that we need to check the remaining fields.
+while (idx < fieldNames.length) {
+  if (fieldNames(idx) == jsonField) {
+row(idx) = jsonValue
+  }
+  idx = idx + 1
+}
--- End diff --

If I comment out L451-452, the repeated fields still get the same jsonValue because `fieldNames(idx) == jsonField`; but the first comparison is unnecessary, since `idx >= 0` already means a match.

Could you please give me some advice?


---



[GitHub] spark issue #19017: [SPARK-21804][SQL] json_tuple returns null values within...

2017-08-23 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19017
  
@viirya PR title fixed, thanks.


---



[GitHub] spark issue #19017: SPARK-21804: json_tuple returns null values within repea...

2017-08-23 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19017
  
retest this please


---



[GitHub] spark issue #19017: SPARK-21804: json_tuple returns null values within repea...

2017-08-22 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19017
  
@HyukjinKwon @viirya I replaced the functional transformations with a while 
loop. 
What do you think about this? Thanks.


---



[GitHub] spark issue #19017: SPARK-21804: json_tuple returns null values within repea...

2017-08-22 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19017
  
@HyukjinKwon That's a good point, thanks.


---



[GitHub] spark issue #19017: SPARK-21804: json_tuple returns null values within repea...

2017-08-22 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/19017
  
cc @viirya 


---



[GitHub] spark pull request #19017: SPARK-21804: json_tuple returns null values withi...

2017-08-22 Thread jmchung
GitHub user jmchung opened a pull request:

https://github.com/apache/spark/pull/19017

SPARK-21804: json_tuple returns null values within repeated columns except 
the first one

## What changes were proposed in this pull request?

When json_tuple extracts values from JSON, it returns null for repeated columns except the first one, as below:

``` scala
scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'a')""").show()
+---+---+----+
| c0| c1|  c2|
+---+---+----+
|  1|  2|null|
+---+---+----+
```

I think this should be consistent with Hive's implementation:
```
hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a');
...
1	1
```

In this PR, we locate all the matched indices in `fieldNames` instead of returning only the first matched index (i.e., `indexOf`); see the sketch below.
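
A minimal sketch of the idea with hypothetical values (`fieldNames` and `jsonField` stand in for the evaluated field names and the field emitted by the parser):

```scala
val fieldNames = Seq("a", "b", "a")
val jsonField = "a"

// Collect every index whose requested field name matches the parsed JSON
// field, so all repeated columns receive the value, not just the first.
val matchedIndices = fieldNames.zipWithIndex.collect {
  case (name, idx) if name == jsonField => idx
}
// matchedIndices == Seq(0, 2)
```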

## How was this patch tested?

Added test in JsonExpressionsSuite.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jmchung/spark SPARK-21804

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19017.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19017


commit f04b896f3a8b3befdc1cbfb60464dfdcb019b684
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-22T06:38:40Z

SPARK-21804: json_tuple returns null values within repeated columns except the first one




---



[GitHub] spark issue #18930: [SPARK-21677][SQL] json_tuple throws NullPointException ...

2017-08-17 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/18930
  
Thanks @viirya @HyukjinKwon @gatorsmile, I learned a lot from this journey.


---



[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...

2017-08-17 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18930#discussion_r133723664
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/json-functions.sql ---
@@ -20,3 +20,9 @@ select from_json('{"a":1}', 'a InvalidType');
 select from_json('{"a":1}', 'a INT', named_struct('mode', 'PERMISSIVE'));
 select from_json('{"a":1}', 'a INT', map('mode', 1));
 select from_json();
+-- json_tuple
+SELECT json_tuple('{"a" : 1, "b" : 2}', CAST(NULL AS STRING), 'b', CAST(NULL AS STRING), 'a');
+CREATE TEMPORARY VIEW jsonTable(jsonField, a, b) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a', 'b');
--- End diff --

ok, we have changed the `'a'` to the table field `a`.


---



[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...

2017-08-17 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18930#discussion_r133723113
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/json-functions.sql ---
@@ -20,3 +20,9 @@ select from_json('{"a":1}', 'a InvalidType');
 select from_json('{"a":1}', 'a INT', named_struct('mode', 'PERMISSIVE'));
 select from_json('{"a":1}', 'a INT', map('mode', 1));
 select from_json();
+-- json_tuple
+SELECT json_tuple('{"a" : 1, "b" : 2}', CAST(NULL AS STRING), 'b', CAST(NULL AS STRING), 'a');
+CREATE TEMPORARY VIEW jsonTable(jsonField, a, b) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a', 'b');
+SELECT json_tuple(jsonField, b, CAST(NULL AS STRING), 'a') FROM jsonTable;
+-- Clean up
+DROP VIEW IF EXISTS jsonTable;
--- End diff --

@HyukjinKwon @viirya Fixed.


---



[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...

2017-08-16 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18930#discussion_r133614456
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/json-functions.sql ---
@@ -20,3 +20,9 @@ select from_json('{"a":1}', 'a InvalidType');
 select from_json('{"a":1}', 'a INT', named_struct('mode', 'PERMISSIVE'));
 select from_json('{"a":1}', 'a INT', map('mode', 1));
 select from_json();
+-- json_tuple
+describe function json_tuple;
+describe function extended json_tuple;
+select json_tuple('{"a" : 1, "b" : 2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a')
+create temporary view jsonTable(jsonField, a, b) as select * from values '{"a": 1, "b": 2}', 'a', 'b';
+SELECT json_tuple(jsonField, b, cast(NULL AS STRING), 'a') FROM jsonTable
--- End diff --

@gatorsmile @viirya Thank you for your time reviewing the code. The SQL statements are now consistent in style, and the golden file for `json-functions.sql` has also been committed.


---



[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...

2017-08-16 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18930#discussion_r133479566
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2034,4 +2034,25 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
   }
 }
   }
+
+  test("SPARK-21677: json_tuple throws NullPointException when column is 
null as string type") {
+checkAnswer(sql(
+  """
+|SELECT json_tuple('{"a" : 1, "b" : 2}'
+|, cast(NULL AS STRING), 'b'
+|, cast(NULL AS STRING), 'a')
+  """.stripMargin), Row(null, "2", null, "1"))
--- End diff --

ok, thanks


---



[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...

2017-08-16 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18930#discussion_r133480372
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2034,4 +2034,25 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
   }
 }
   }
+
+  test("SPARK-21677: json_tuple throws NullPointException when column is 
null as string type") {
--- End diff --

@gatorsmile: a unit test case has been added in `JsonExpressionsSuite`.
@viirya: an end-to-end test has also been added in `json-functions.sql`.


---



[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...

2017-08-16 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18930#discussion_r133479509
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2034,4 +2034,25 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
   }
 }
   }
+
+  test("SPARK-21677: json_tuple throws NullPointException when column is null as string type") {
+checkAnswer(sql(
+  """
+|SELECT json_tuple('{"a" : 1, "b" : 2}'
+|, cast(NULL AS STRING), 'b'
+|, cast(NULL AS STRING), 'a')
+  """.stripMargin), Row(null, "2", null, "1"))
+
+// mixes constant field name and non constant one
+withTempView("jsonTable") {
+  Seq(("""{"a": 1, "b": 2}""", "a", "b"))
+.toDF("jsonField", "a", "b")
+.createOrReplaceTempView("jsonTable")
+
+  checkAnswer(
+sql("""SELECT json_tuple(jsonField, b, cast(NULL AS STRING), 'a') 
FROM jsonTable"""),
--- End diff --

will move L2053 to json-functions.sql


---



[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...

2017-08-15 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18930#discussion_r133200977
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -362,12 +362,12 @@ case class JsonTuple(children: Seq[Expression])
   @transient private lazy val fieldExpressions: Seq[Expression] = children.tail
 
   // eagerly evaluate any foldable the field names
-  @transient private lazy val foldableFieldNames: IndexedSeq[String] = {
+  @transient private lazy val foldableFieldNames: IndexedSeq[Option[String]] = {
 fieldExpressions.map {
-  case expr if expr.foldable => expr.eval().asInstanceOf[UTF8String].toString
+  case expr if expr.foldable => Option(expr.eval()).map(_.asInstanceOf[UTF8String].toString)
   case _ => null
-}.toIndexedSeq
-  }
+}
+  }.toIndexedSeq
--- End diff --

@viirya Done: (1) removed the redundant comment; (2) moved `toIndexedSeq` after the map.
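
For reference, a minimal sketch of why the `Option` wrapper matters here (assuming Spark's `org.apache.spark.unsafe.types.UTF8String`):

```scala
import org.apache.spark.unsafe.types.UTF8String

// For a foldable field name like CAST(NULL AS STRING), expr.eval() is null.
val evaluated: Any = null

// Option(null) is None, so the .map is skipped and no NullPointerException
// is thrown when converting the evaluated value to a String.
val fieldName: Option[String] =
  Option(evaluated).map(_.asInstanceOf[UTF8String].toString)  // None
```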


---



[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...

2017-08-15 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18930#discussion_r133135302
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -359,14 +359,14 @@ case class JsonTuple(children: Seq[Expression])
   @transient private lazy val jsonExpr: Expression = children.head
 
   // the fields to query are the remaining children
-  @transient private lazy val fieldExpressions: Seq[Expression] = children.tail
+  @transient private lazy val fieldExpressions: Array[Expression] = children.tail.toArray
 
   // eagerly evaluate any foldable the field names
-  @transient private lazy val foldableFieldNames: IndexedSeq[String] = {
+  @transient private lazy val foldableFieldNames: Array[Option[String]] = {
--- End diff --

@viirya ok, thanks


---



[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...

2017-08-14 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18930#discussion_r133116488
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2034,4 +2034,13 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
   }
 }
   }
+
+  test("SPARK-21677: json_tuple throws NullPointException when column is null as string type") {
+checkAnswer(sql(
+  """
+|SELECT json_tuple('{"a" : 1, "b" : 2}'
+|, cast(NULL AS STRING), 'b'
+|, cast(NULL AS STRING), 'a')
+  """.stripMargin), Row(null, "2", null, "1"))
--- End diff --

@viirya Done; the added test case covers a column name, a constant field name, and a null field name.


---



[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...

2017-08-14 Thread jmchung
Github user jmchung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18930#discussion_r132984129
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -361,10 +361,18 @@ case class JsonTuple(children: Seq[Expression])
   // the fields to query are the remaining children
   @transient private lazy val fieldExpressions: Seq[Expression] = children.tail
 
+  // a field name given with constant null will be replaced with this pseudo field name
+  private val nullFieldName = "__NullFieldName"
--- End diff --

@HyukjinKwon @viirya Yep, we've discarded the fake field name and used Option here. We made a slight revision to deal with None in `foldableFieldNames` instead of creating a new function.


---



[GitHub] spark issue #18930: Spark 21677

2017-08-12 Thread jmchung
Github user jmchung commented on the issue:

https://github.com/apache/spark/pull/18930
  
cc @viirya 


---



[GitHub] spark pull request #18930: Spark 21677

2017-08-12 Thread jmchung
GitHub user jmchung opened a pull request:

https://github.com/apache/spark/pull/18930

Spark 21677

## What changes were proposed in this pull request?
``` scala
scala> Seq(("""{"Hyukjin": 224, "John": 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
...
java.lang.NullPointerException
at ...
```

Currently a `null` field name throws a NullPointerException. Since a null field name can't match any field name in the JSON, we just output null as its column value. This PR achieves that by returning a very unlikely column name, `__NullFieldName`, when evaluating the field names.
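
A minimal sketch of the substitution, with hypothetical values:

```scala
// A null field name is replaced with a pseudo name that can never match a
// real JSON field, so its output column simply evaluates to null.
val nullFieldName = "__NullFieldName"
val fieldNames = Seq("a", null, "b").map {
  case null => nullFieldName
  case name => name
}
// fieldNames == Seq("a", "__NullFieldName", "b")
```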



## How was this patch tested?
Added unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jmchung/spark SPARK-21677

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18930.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18930


commit 596c2804265e457c378cec433ba2b7fe64f4efdd
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-12T08:37:04Z

fix null field name and add corresponding test

commit 796041f0f84eac70a2528c37e02dbe4163ddf3aa
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-12T08:39:56Z

add description

commit f07a9f7e73a17ab5447945d8139a40a1ba8372c5
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-13T01:48:31Z

modify the comment about nullFieldName




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...

2017-08-07 Thread jmchung
GitHub user jmchung reopened a pull request:

https://github.com/apache/spark/pull/18865

[SPARK-21610][SQL] Corrupt records are not handled properly when creating a 
dataframe from a file

## What changes were proposed in this pull request?
```
echo '{"field": 1}
{"field": 2}
{"field": "3"}' >/tmp/sample.json
```

```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("field", ByteType)
  .add("_corrupt_record", StringType)

val file = "/tmp/sample.json"

val dfFromFile = spark.read.schema(schema).json(file)

scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|1    |null           |
|2    |null           |
|null |{"field": "3"} |
+-----+---------------+

scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
res1: Long = 0

scala> dfFromFile.filter($"_corrupt_record".isNull).count()
res2: Long = 3
```
When the `requiredSchema` contains only `_corrupt_record`, the derived `actualSchema` is empty and `_corrupt_record` is null for every row. When users request only `_corrupt_record`, we assume corrupt records should be detected against all of the JSON fields.
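
A minimal sketch of the idea, with identifiers and schemas assumed for illustration (the actual patch lives in the JSON read path):

```scala
import org.apache.spark.sql.types._

val corruptFieldName = "_corrupt_record"
val dataSchema = new StructType()
  .add("field", ByteType)
  .add(corruptFieldName, StringType)
val requiredSchema = new StructType().add(corruptFieldName, StringType)

// If only the corrupt-record column is requested, parse against the full
// data schema so malformed rows can still be detected; otherwise parse
// just the requested fields.
val requiredIsOnlyCorruptRecord =
  requiredSchema.fieldNames.sameElements(Array(corruptFieldName))
val actualSchema = StructType(
  (if (requiredIsOnlyCorruptRecord) dataSchema else requiredSchema)
    .filterNot(_.name == corruptFieldName))
// actualSchema now contains `field`, so {"field": "3"} is caught as corrupt.
```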

## How was this patch tested?

    Added test case.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jmchung/spark SPARK-21610

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18865.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18865


commit 09aa76cc228162edba7ece45063592cd17ae4a27
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-07T03:52:45Z

[SPARK-21610][SQL] Corrupt records are not handled properly when creating a 
dataframe from a file

commit f73c3874a9e6a35344a3dc8f6ec8cfb17a1be2f8
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-07T04:39:36Z

add explanation to schema change and minor refactor in test case

commit 7a595984f16f6c998883f271bf63e2e84af5f046
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-07T04:59:07Z

move test case from DataFrameReaderWriterSuite to JsonSuite

commit 97290f0f891f4261bf173c5ff596d0bb33168d57
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-07T05:41:15Z

filter not _corrupt_record in dataSchema

commit f5eec40d51bec8ed0f79f52c5a408ba98f26ca1a
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-07T06:17:48Z

code refactor




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...

2017-08-07 Thread jmchung
Github user jmchung closed the pull request at:

https://github.com/apache/spark/pull/18865


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...

2017-08-07 Thread jmchung
GitHub user jmchung opened a pull request:

https://github.com/apache/spark/pull/18865

[SPARK-21610][SQL] Corrupt records are not handled properly when creating a 
dataframe from a file

## What changes were proposed in this pull request?
```
echo '{"field": 1}
{"field": 2}
{"field": "3"}' >/tmp/sample.json
```

```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("field", ByteType)
  .add("_corrupt_record", StringType)

val file = "/tmp/sample.json"

val dfFromFile = spark.read.schema(schema).json(file)

scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|1    |null           |
|2    |null           |
|null |{"field": "3"} |
+-----+---------------+

scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
res1: Long = 0

scala> dfFromFile.filter($"_corrupt_record".isNull).count()
res2: Long = 3
```
When the `requiredSchema` contains only `_corrupt_record`, the derived `actualSchema` is empty and `_corrupt_record` is null for every row. When users request only `_corrupt_record`, we assume corrupt records should be detected against all of the JSON fields.

## How was this patch tested?

    Added test case.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jmchung/spark SPARK-21610

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18865.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18865


commit 09aa76cc228162edba7ece45063592cd17ae4a27
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-07T03:52:45Z

[SPARK-21610][SQL] Corrupt records are not handled properly when creating a 
dataframe from a file

commit f73c3874a9e6a35344a3dc8f6ec8cfb17a1be2f8
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-07T04:39:36Z

add explanation to schema change and minor refactor in test case

commit 7a595984f16f6c998883f271bf63e2e84af5f046
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-07T04:59:07Z

move test case from DataFrameReaderWriterSuite to JsonSuite

commit 97290f0f891f4261bf173c5ff596d0bb33168d57
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-07T05:41:15Z

filter not _corrupt_record in dataSchema

commit f5eec40d51bec8ed0f79f52c5a408ba98f26ca1a
Author: Jen-Ming Chung <jenmingi...@gmail.com>
Date:   2017-08-07T06:17:48Z

code refactor




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org