[GitHub] spark issue #19613: Fixed a typo
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19613 Hi @ganeshchand , could you also fix the typo in `JdbcUtils.scala`? Thanks! #L459 underling => underlying --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19604: [SPARK-22291][SQL] Conversion error when transfor...
Github user jmchung closed the pull request at: https://github.com/apache/spark/pull/19604
[GitHub] spark issue #19604: [SPARK-22291][SQL] Conversion error when transforming ar...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19604 Sure, I'll close it. Thank you, and thanks to @viirya as well.
[GitHub] spark issue #19604: [SPARK-22291][SQL][FOLLOWUP] Conversion error when trans...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19604 I found that on this branch the Docker-based integration tests fail because the image `wnameless/oracle-xe-11g:14.04.4` cannot be pulled. Should we move to `wnameless/oracle-xe-11g`? ``` Error response from daemon: manifest for wnameless/oracle-xe-11g:14.04.4 not found ```
[GitHub] spark pull request #19604: [SPARK-22291][SQL][FOLLOWUP] Conversion error whe...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/19604#discussion_r147605119 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala --- @@ -440,8 +440,9 @@ object JdbcUtils extends Logging { case StringType => (array: Object) => -array.asInstanceOf[Array[java.lang.String]] - .map(UTF8String.fromString) +// some underling types are not String such as uuid, inet, cidr, etc. --- End diff -- Oops, a typo, thanks @viirya!!
[GitHub] spark issue #19604: [SPARK-22291][SQL][FOLLOWUP] Conversion error when trans...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19604 cc @cloud-fan, the follow-up PR for 2.2, thanks!
[GitHub] spark pull request #19604: [SPARK-22291][SQL][FOLLOWUP] Conversion error whe...
GitHub user jmchung opened a pull request: https://github.com/apache/spark/pull/19604 [SPARK-22291][SQL][FOLLOWUP] Conversion error when transforming array types of uuid, inet and cidr to StringType in PostgreSQL ## What changes were proposed in this pull request? This is a follow-up of #19567, to fix the conversion error when transforming array types of uuid, inet and cidr to StringType in PostgreSQL for Spark 2.2. ## How was this patch tested? Added a test in `PostgresIntegrationSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jmchung/spark SPARK-22291-FOLLOWUP Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19604.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19604 commit 549cb814e01c2338a67c4a9efa4d880a3fb9cdac Author: Jen-Ming Chung Date: 2017-10-30T01:25:28Z [SPARK-22291][SQL][FOLLOWUP] Conversion error when transforming array types of uuid, inet and cidr to StringType in PostgreSQL
[GitHub] spark pull request #19567: [SPARK-22291][SQL] Conversion error when transfor...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/19567#discussion_r147581349 --- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala --- @@ -134,11 +149,28 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite { assert(schema(1).dataType == ShortType) } - test("SPARK-20557: column type TIMESTAMP with TIME ZONE and TIME with TIME ZONE should be recognized") { + test("SPARK-20557: column type TIMESTAMP with TIME ZONE and TIME with TIME ZONE " + +"should be recognized") { val dfRead = sqlContext.read.jdbc(jdbcUrl, "ts_with_timezone", new Properties) val rows = dfRead.collect() val types = rows(0).toSeq.map(x => x.getClass.toString) assert(types(1).equals("class java.sql.Timestamp")) assert(types(2).equals("class java.sql.Timestamp")) } + + test("SPARK-22291: PostgreSQL UUID[] to StringType: Conversion Error") { --- End diff -- Oops, I forgot it, thanks!!
[GitHub] spark pull request #19567: [SPARK-22291][SQL] Conversion error when transfor...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/19567#discussion_r147579149 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala --- @@ -456,8 +456,10 @@ object JdbcUtils extends Logging { case StringType => (array: Object) => -array.asInstanceOf[Array[java.lang.String]] - .map(UTF8String.fromString) +// some underling types are not String such as uuid, inet, cidr, etc. +array.asInstanceOf[Array[java.lang.Object]] + .map(obj => UTF8String.fromString( +if (obj == null) null else obj.toString)) --- End diff -- Thanks @cloud-fan !
[GitHub] spark issue #19567: [SPARK-22291][SQL] Conversion error when transforming ar...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19567 gentle ping @cloud-fan and @viirya, there is some feedback about the behavior of converting obj to string. ```scala case StringType => (array: Object) => array.asInstanceOf[Array[java.lang.Object]] .map(obj => UTF8String.fromString(obj.toString)) ``` As @HyukjinKwon mentioned before, it may be unsafe to convert the object to a string directly; we'll get a `java.lang.NullPointerException` in the test cases. `StringType` differs from the other types, which use the `nullSafeConvert` method to handle nulls; here the null handling happens within `UTF8String.fromString`. `UTF8String.fromString` returns either null or a `UTF8String`, so we cannot use `String.valueOf(obj)` to avoid the NPE; it raises another error (because `null != "null"`): ``` - Type mapping for various types *** FAILED *** Array("a", "null", "b") did not equal List("a", null, "b") (PostgresIntegrationSuite.scala:120) ``` IMHO, if we don't want to separate the match cases, we may keep the null to match the original design, e.g., ```scala case StringType => (array: Object) => array.asInstanceOf[Array[java.lang.Object]] .map(obj => UTF8String.fromString( if (obj == null) null else obj.toString)) ``` The alternative is to keep separate String and Object cases. Would you mind giving some comments? I can keep working on this :)
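The null-preserving conversion under discussion can be sketched outside Spark with plain Scala. Here `fromString` is a hypothetical stand-in for `UTF8String.fromString` (which passes a null input through as null); the object and names are illustrative, not the real Spark internals:

```scala
object NullSafeConvertSketch {
  // Hypothetical stand-in for UTF8String.fromString: null in, null out.
  private def fromString(s: String): String = s

  // Map each array element to its string form, preserving nulls instead of
  // producing the literal "null" (which String.valueOf(obj) would give).
  def convert(array: Array[AnyRef]): Array[String] =
    array.map(obj => fromString(if (obj == null) null else obj.toString))

  def main(args: Array[String]): Unit = {
    // A UUID element, as produced by the PostgreSQL uuid[] type.
    val uuid = java.util.UUID.fromString("7be8aaf8-650e-4dbb-8186-0a749840ecf2")
    val out = convert(Array[AnyRef]("a", null, uuid))
    assert(out.sameElements(
      Array("a", null, "7be8aaf8-650e-4dbb-8186-0a749840ecf2")))
  }
}
```

The key point is that the null check happens per element, so a null in the source array stays null rather than becoming the string `"null"`.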
[GitHub] spark issue #19567: [SPARK-22291][SQL] Conversion error when transforming ar...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19567 Thanks @viirya, the title has been changed. Please correct me if the modified title is still inappropriate.
[GitHub] spark pull request #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/19567#discussion_r147542014 --- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala --- @@ -134,11 +149,28 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite { assert(schema(1).dataType == ShortType) } - test("SPARK-20557: column type TIMESTAMP with TIME ZONE and TIME with TIME ZONE should be recognized") { + test("SPARK-20557: column type TIMESTAMP with TIME ZONE and TIME with TIME ZONE " + +"should be recognized") { val dfRead = sqlContext.read.jdbc(jdbcUrl, "ts_with_timezone", new Properties) val rows = dfRead.collect() val types = rows(0).toSeq.map(x => x.getClass.toString) assert(types(1).equals("class java.sql.Timestamp")) assert(types(2).equals("class java.sql.Timestamp")) } + + test("SPARK-22291: Postgresql UUID[] to Cassandra: Conversion Error") { --- End diff -- Thanks @gatorsmile, `Cassandra` has been removed from the title. Feel free to correct me if the revised title is inappropriate. Do you have more comments?
[GitHub] spark pull request #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/19567#discussion_r147316748 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala --- @@ -456,8 +456,17 @@ object JdbcUtils extends Logging { case StringType => (array: Object) => -array.asInstanceOf[Array[java.lang.String]] - .map(UTF8String.fromString) +array match { + case _: Array[java.lang.String] => +array.asInstanceOf[Array[java.lang.String]] + .map(UTF8String.fromString) + case _: Array[java.util.UUID] => +array.asInstanceOf[Array[java.util.UUID]] --- End diff -- Yes, `Object` covers `UUID`, since it is the superclass of every Java class. Should we avoid distinguishing the classes explicitly? Would you mind giving some comments? Thanks!
[GitHub] spark pull request #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/19567#discussion_r147314628 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala --- @@ -456,8 +456,17 @@ object JdbcUtils extends Logging { case StringType => (array: Object) => -array.asInstanceOf[Array[java.lang.String]] - .map(UTF8String.fromString) +array match { + case _: Array[java.lang.String] => +array.asInstanceOf[Array[java.lang.String]] + .map(UTF8String.fromString) + case _: Array[java.util.UUID] => +array.asInstanceOf[Array[java.util.UUID]] + .map(uuid => UTF8String.fromString(uuid.toString)) + case _ => --- End diff -- I was wrong, org.postgresql.util.PGobject cannot be cast to java.lang.String.
[GitHub] spark pull request #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/19567#discussion_r147312311 --- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala --- @@ -18,7 +18,7 @@ package org.apache.spark.sql.jdbc import java.sql.Connection -import java.util.Properties +import java.util.{Properties, UUID} --- End diff -- Thanks! Removed the unused import statement.
[GitHub] spark pull request #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/19567#discussion_r147309952 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala --- @@ -456,8 +456,17 @@ object JdbcUtils extends Logging { case StringType => (array: Object) => -array.asInstanceOf[Array[java.lang.String]] - .map(UTF8String.fromString) +array match { + case _: Array[java.lang.String] => +array.asInstanceOf[Array[java.lang.String]] + .map(UTF8String.fromString) + case _: Array[java.util.UUID] => +array.asInstanceOf[Array[java.util.UUID]] + .map(uuid => UTF8String.fromString(uuid.toString)) + case _ => --- End diff -- Would it be more appropriate to use `obj.asInstanceOf[String]` for the conversion?
[GitHub] spark issue #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra: Conve...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19567 Thanks @HyukjinKwon :)
[GitHub] spark issue #19567: [SPARK-22291][SQL] Postgresql UUID[] to Cassandra: Conve...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19567 gentle ping @wangyum and @viirya, the test case has been added to `PostgresIntegrationSuite`. As @viirya mentioned earlier, the `inet[]` and `cidr[]` data types do not work as instances of `String`; they are both mapped to Java's `Object` class.
[GitHub] spark issue #19567: [SPARK-22291] Postgresql UUID[] to Cassandra: Conversion...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19567 Thanks @wangyum and @viirya, I'll add the corresponding tests in `PostgresIntegrationSuite`. To @viirya: I'm not sure whether the other data types will work; I'll consider them in the tests, thanks!
[GitHub] spark issue #19567: [SPARK-22291] Postgresql UUID[] to Cassandra: Conversion...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19567 cc @viirya Can you help review this? Thanks.
[GitHub] spark pull request #19567: [SPARK-22291] Postgresql UUID[] to Cassandra: Con...
GitHub user jmchung opened a pull request: https://github.com/apache/spark/pull/19567 [SPARK-22291] Postgresql UUID[] to Cassandra: Conversion Error ## What changes were proposed in this pull request? This PR fixes the conversion error when reading data from a PostgreSQL table that contains columns of the `UUID[]` data type. For example, create a table with the UUID[] data type, and insert the test data. ```SQL CREATE TABLE users ( id smallint NOT NULL, name character varying(50), user_ids uuid[], PRIMARY KEY (id) ); INSERT INTO users ("id", "name","user_ids") VALUES (1, 'foo', ARRAY ['7be8aaf8-650e-4dbb-8186-0a749840ecf2' ,'205f9bfc-018c-4452-a605-609c0cfad228']::UUID[] ); ``` Then it throws the following exception when trying to load the data. ``` java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to [Ljava.lang.String; at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:459) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:458) ... ``` ## How was this patch tested? Existing tests. I tried to imitate the above case in `JDBCSuite`, but `ARRAY` is an unsupported type there. Therefore I ran the above example against my Postgres instance and verified it with the following code.
```scala
val opts = Map(
  "url" -> "jdbc:postgresql://localhost:5432/postgres?user=postgres&password=postgres",
  "dbtable" -> "users")
val df = spark.read.format("jdbc").options(opts).load()
df.show(truncate = false)

+---+----+-----------------------------------------------------------------------------+
|id |name|user_ids                                                                     |
+---+----+-----------------------------------------------------------------------------+
|1  |foo |[7be8aaf8-650e-4dbb-8186-0a749840ecf2, 205f9bfc-018c-4452-a605-609c0cfad228] |
+---+----+-----------------------------------------------------------------------------+
```
You can merge this pull request into a Git repository by running: $ git pull https://github.com/jmchung/spark SPARK-22291 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19567.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19567 commit d84b1bb89e3be33931531345fb23cadd8fe6868f Author: Jen-Ming Chung Date: 2017-10-24T18:24:43Z [SPARK-22291] Postgresql UUID[] to Cassandra: Conversion Error
[GitHub] spark issue #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not han...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19199 Thanks @HyukjinKwon and @viirya :)
[GitHub] spark pull request #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are ...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/19199#discussion_r138287636 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -109,6 +109,20 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister { } } +if (requiredSchema.length == 1 && + requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) { + throw new AnalysisException( +"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the\n" + + "referenced columns only include the internal corrupt record column\n" + + s"(named ${parsedOptions.columnNameOfCorruptRecord} by default). For example:\n" + --- End diff -- Thanks @viirya. Should we also replace the weird part in `JsonFileFormat` with `_corrupt_record`?
[GitHub] spark issue #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not han...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19199 cc @gatorsmile, @HyukjinKwon and @viirya. Could you guys help to review this? Thanks.
[GitHub] spark pull request #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are ...
GitHub user jmchung opened a pull request: https://github.com/apache/spark/pull/19199 [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not handled properly when creating a dataframe from a file ## What changes were proposed in this pull request? When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and the `_corrupt_record` values are all null for all rows. This PR captures the above situation and raises an exception with a reasonable workaround message so that users can know what happened and how to fix the query. ## How was this patch tested? Added a unit test in `CSVSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jmchung/spark SPARK-21610-FOLLOWUP Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19199.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19199 commit e703fc8f33d1fde90d790057481f1d23f466f378 Author: Jen-Ming Chung Date: 2017-09-12T06:48:33Z follow-up PR for CSV
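The fail-fast idea described above can be sketched in isolation. This is a simplified, Spark-free illustration, not the actual `CSVFileFormat` code; the object name, exception type, and message are assumptions:

```scala
object CorruptColumnCheckSketch {
  // Illustrative fail-fast check: reject a read whose required schema contains
  // only the corrupt-record column, since its values would all come back null.
  def checkRequiredColumns(requiredNames: Seq[String], corruptCol: String): Unit =
    if (requiredNames.length == 1 && requiredNames.head == corruptCol) {
      throw new IllegalArgumentException(
        s"Referencing only '$corruptCol' is disallowed; cache or save the " +
          "Dataset before running such queries.")
    }

  def main(args: Array[String]): Unit = {
    // Fine: other data columns are also referenced.
    checkRequiredColumns(Seq("name", "_corrupt_record"), "_corrupt_record")
    // Rejected: only the corrupt-record column is referenced.
    val rejected =
      try { checkRequiredColumns(Seq("_corrupt_record"), "_corrupt_record"); false }
      catch { case _: IllegalArgumentException => true }
    assert(rejected)
  }
}
```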
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18865 @gatorsmile Sure, I'll make a follow-up PR for CSV. Great thanks for everyone's feedback on this patch; I really learned a lot from it.
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18865 cc @gatorsmile Please take another look when you have time. I've already updated it. Thanks!
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18865 @gatorsmile those negative cases are already added in `JsonSuite`.
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18865#discussion_r137928316 --- Diff: docs/sql-programming-guide.md --- @@ -1542,6 +1542,10 @@ options. # Migration Guide +## Upgrading From Spark SQL 2.2 to 2.3 + + - The queries which select only `spark.sql.columnNameOfCorruptRecord` column are disallowed now. Notice that the queries which have only the column after column pruning (e.g. filtering on the column followed by a counting operation) are also disallowed. If you want to select only the corrupt records, you should cache or save the underlying Dataset and DataFrame before running such queries. --- End diff -- Maybe a typo: `_corrupt_column` should be `_corrupt_record`.
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18865#discussion_r137924841 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2034,4 +2034,25 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { } } } + + test("SPARK-21610: Corrupt records are not handled properly when creating a dataframe " + --- End diff -- Sure, I'll update the PR.
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18865 Thank you for the review, @HyukjinKwon. cc @gatorsmile Could you review this again when you have some time? Thanks!
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18865 @viirya, thank you so much for your time and for taking a look.
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18865 Could @gatorsmile and @HyukjinKwon please share some suggestions for revising the exception message? The current message explains why the query is disallowed when users select only `_corrupt_record`, and provides an alternative way to get the corrupt records.
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18865#discussion_r136715367 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala --- @@ -113,6 +113,18 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister { } } +if (requiredSchema.length == 1 && + requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) { + throw new AnalysisException( +s""" + |'${parsedOptions.columnNameOfCorruptRecord}' cannot be selected alone without other + |data columns. If you want to select corrupt records only, cache or save the Dataset + |before executing queries. For example: + |df.cache() then + |df.select("${parsedOptions.columnNameOfCorruptRecord}") + """.stripMargin.replaceAll(System.lineSeparator(), " ")) --- End diff -- Agreed, will fix it in the next commit, thanks.
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18865 To @viirya and @gatorsmile, I made some modifications as follows: 1. Moved the check of `_corrupt_record` out of the function block to fail fast on the driver side instead of the executor side. 2. Changed `sparkContext.parallelize(data, 1)` to `sparkContext.parallelize(Seq(data), 1)`; without `Seq`, the data is stored as one record per character. 3. `schema` is still required to load the JSON file in this test case. Could it be that the inferred schema does not contain `_corrupt_record`? BTW, the previous test case produces the following schema: ``` root |-- _corrupt_record: string (nullable = true) ```
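Point 2 follows from a `String` itself being a sequence of characters in Scala, so passing it directly to `parallelize` yields one record per character. A small Spark-free sketch of why wrapping the data in `Seq` matters (the JSON line is an illustrative example):

```scala
object ParallelizeInputShape {
  def main(args: Array[String]): Unit = {
    val data = """{"a": 1}"""
    // Passing the string directly hands parallelize a Seq[Char]:
    // one element (and thus one record) per character.
    val perChar: Seq[Char] = data.toSeq
    assert(perChar.length == data.length)
    // Wrapping it in Seq(...) keeps the whole JSON line as a single record.
    val perLine: Seq[String] = Seq(data)
    assert(perLine.length == 1 && perLine.head == data)
  }
}
```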
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18865 @gatorsmile Thanks for your feedback.
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18865 @viirya Thanks, the description of the PR has been updated.
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18865 Thanks to @viirya's suggestion, the redundant comment has been removed and `withTempPath` is applied in the test case.
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18865 gentle ping @viirya, @gatorsmile, made a minor change to throw a reasonable workaround message.
[GitHub] spark issue #19017: [SPARK-21804][SQL] json_tuple returns null values within...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19017 Thanks @viirya, @HyukjinKwon and @gatorsmile.
[GitHub] spark issue #19017: [SPARK-21804][SQL] json_tuple returns null values within...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19017 @gatorsmile OK, and really thanks for all the nice comments.
[GitHub] spark pull request #19017: [SPARK-21804][SQL] json_tuple returns null values...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/19017#discussion_r134925669 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -447,7 +448,18 @@ case class JsonTuple(children: Seq[Expression]) generator => copyCurrentStructure(generator, parser) } -row(idx) = UTF8String.fromBytes(output.toByteArray) +val jsonValue = UTF8String.fromBytes(output.toByteArray) +row(idx) = jsonValue +idx = idx + 1 + +// SPARK-21804: json_tuple returns null values within repeated columns +// except the first one; so that we need to check the remaining fields. +while (idx < fieldNames.length) { + if (fieldNames(idx) == jsonField) { +row(idx) = jsonValue + } + idx = idx + 1 +} --- End diff -- If I comment out L451-452, the repeated fields still get the same `jsonValue` because `fieldNames(idx) == jsonField`, but then the first comparison is unnecessary, since `idx >= 0` already means a match. Could you please give me some advice?
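The propagation logic in the diff above can be sketched in isolation. This is a simplified illustration of the idea, not the actual `JsonTuple` internals; the object, method, and parameter names are stand-ins:

```scala
object RepeatedFieldFillSketch {
  // After parsing the field at position `matched`, copy its value to every
  // later position that requests the same field name, so repeated columns
  // no longer come back null.
  def fillRepeated(fieldNames: Array[String], matched: Int, value: String,
                   row: Array[String]): Unit = {
    row(matched) = value
    var idx = matched + 1
    while (idx < fieldNames.length) {
      if (fieldNames(idx) == fieldNames(matched)) row(idx) = value
      idx += 1
    }
  }

  def main(args: Array[String]): Unit = {
    // json_tuple(..., 'f1', 'f2', 'f1'): the repeated 'f1' gets the same value.
    val names = Array("f1", "f2", "f1")
    val row = new Array[String](names.length)
    fillRepeated(names, 0, "v", row)
    assert(row.sameElements(Array("v", null, "v")))
  }
}
```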
[GitHub] spark issue #19017: [SPARK-21804][SQL] json_tuple returns null values within...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19017 @viirya PR title fixed, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19017: SPARK-21804: json_tuple returns null values within repea...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19017 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19017: SPARK-21804: json_tuple returns null values within repea...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19017 @HyukjinKwon @viirya I replaced the functional transformations with a while loop. What do you think about this? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19017: SPARK-21804: json_tuple returns null values within repea...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19017 @HyukjinKwon That's a good point, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19017: SPARK-21804: json_tuple returns null values within repea...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19017 cc @viirya --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19017: SPARK-21804: json_tuple returns null values withi...
GitHub user jmchung opened a pull request: https://github.com/apache/spark/pull/19017

SPARK-21804: json_tuple returns null values within repeated columns except the first one

## What changes were proposed in this pull request?

When json_tuple extracts values from JSON, it returns null for repeated columns except the first one:

```scala
scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'a')""").show()
+---+---+----+
| c0| c1|  c2|
+---+---+----+
|  1|  2|null|
+---+---+----+
```

This should be consistent with Hive's implementation:

```
hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a');
...
1    1
```

In this PR, we locate all the matched indices in `fieldNames` instead of returning only the first matched index (i.e., `indexOf`).

## How was this patch tested?

Added a test in JsonExpressionsSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jmchung/spark SPARK-21804

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19017.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19017

commit f04b896f3a8b3befdc1cbfb60464dfdcb019b684
Author: Jen-Ming Chung
Date: 2017-08-22T06:38:40Z
SPARK-21804: json_tuple returns null values within repeated columns except the first one
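A hedged sketch of the idea described above, in plain Scala: `indexOf` yields only the first matching column, so the fix instead needs every index whose requested field name equals the parsed JSON field.

```scala
// Columns requested in json_tuple(..., 'a', 'b', 'a')
val fieldNames = Seq("a", "b", "a")
val jsonField = "a" // the field found while parsing the JSON

// Old behavior: only the first occurrence is located
val firstOnly = fieldNames.indexOf(jsonField)

// New behavior (as described in the PR): all matched indices
val allMatches = fieldNames.zipWithIndex.collect {
  case (name, idx) if name == jsonField => idx
}
// firstOnly is 0; allMatches is Seq(0, 2), so both 'a' columns get filled
```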
[GitHub] spark issue #18930: [SPARK-21677][SQL] json_tuple throws NullPointException ...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18930 Thanks @viirya @HyukjinKwon @gatorsmile, I learned a lot from this journey. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r133723664

--- Diff: sql/core/src/test/resources/sql-tests/inputs/json-functions.sql ---
@@ -20,3 +20,9 @@ select from_json('{"a":1}', 'a InvalidType');
 select from_json('{"a":1}', 'a INT', named_struct('mode', 'PERMISSIVE'));
 select from_json('{"a":1}', 'a INT', map('mode', 1));
 select from_json();
+-- json_tuple
+SELECT json_tuple('{"a" : 1, "b" : 2}', CAST(NULL AS STRING), 'b', CAST(NULL AS STRING), 'a');
+CREATE TEMPORARY VIEW jsonTable(jsonField, a, b) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a', 'b');
--- End diff --

OK, we've changed `'a'` to the table field `a`.
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r133723113

--- Diff: sql/core/src/test/resources/sql-tests/inputs/json-functions.sql ---
@@ -20,3 +20,9 @@ select from_json('{"a":1}', 'a InvalidType');
 select from_json('{"a":1}', 'a INT', named_struct('mode', 'PERMISSIVE'));
 select from_json('{"a":1}', 'a INT', map('mode', 1));
 select from_json();
+-- json_tuple
+SELECT json_tuple('{"a" : 1, "b" : 2}', CAST(NULL AS STRING), 'b', CAST(NULL AS STRING), 'a');
+CREATE TEMPORARY VIEW jsonTable(jsonField, a, b) AS SELECT * FROM VALUES ('{"a": 1, "b": 2}', 'a', 'b');
+SELECT json_tuple(jsonField, b, CAST(NULL AS STRING), 'a') FROM jsonTable;
+-- Clean up
+DROP VIEW IF EXISTS jsonTable;
--- End diff --

@HyukjinKwon @viirya Fixed.
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r133614456

--- Diff: sql/core/src/test/resources/sql-tests/inputs/json-functions.sql ---
@@ -20,3 +20,9 @@ select from_json('{"a":1}', 'a InvalidType');
 select from_json('{"a":1}', 'a INT', named_struct('mode', 'PERMISSIVE'));
 select from_json('{"a":1}', 'a INT', map('mode', 1));
 select from_json();
+-- json_tuple
+describe function json_tuple;
+describe function extended json_tuple;
+select json_tuple('{"a" : 1, "b" : 2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a')
+create temporary view jsonTable(jsonField, a, b) as select * from values '{"a": 1, "b": 2}', 'a', 'b';
+SELECT json_tuple(jsonField, b, cast(NULL AS STRING), 'a') FROM jsonTable
--- End diff --

@gatorsmile @viirya Thank you for taking the time to review the code. The SQL statements are now consistent in style, and the golden file for `json-functions.sql` has also been committed.
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r133479566

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2034,4 +2034,25 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
     }
   }
 }
+
+  test("SPARK-21677: json_tuple throws NullPointException when column is null as string type") {
+    checkAnswer(sql(
+      """
+        |SELECT json_tuple('{"a" : 1, "b" : 2}'
+        |, cast(NULL AS STRING), 'b'
+        |, cast(NULL AS STRING), 'a')
+      """.stripMargin), Row(null, "2", null, "1"))
--- End diff --

OK, thanks.
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r133480372

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2034,4 +2034,25 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
     }
   }
 }
+
+  test("SPARK-21677: json_tuple throws NullPointException when column is null as string type") {
--- End diff --

@gatorsmile: added a unit test case in `JsonExpressionsSuite`. @viirya: also added an end-to-end test in `json-functions.sql`.
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r133479509

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2034,4 +2034,25 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
     }
   }
 }
+
+  test("SPARK-21677: json_tuple throws NullPointException when column is null as string type") {
+    checkAnswer(sql(
+      """
+        |SELECT json_tuple('{"a" : 1, "b" : 2}'
+        |, cast(NULL AS STRING), 'b'
+        |, cast(NULL AS STRING), 'a')
+      """.stripMargin), Row(null, "2", null, "1"))
+
+    // mixes constant field name and non constant one
+    withTempView("jsonTable") {
+      Seq(("""{"a": 1, "b": 2}""", "a", "b"))
+        .toDF("jsonField", "a", "b")
+        .createOrReplaceTempView("jsonTable")
+
+      checkAnswer(
+        sql("""SELECT json_tuple(jsonField, b, cast(NULL AS STRING), 'a') FROM jsonTable"""),
--- End diff --

Will move L2053 to json-functions.sql.
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r133200977

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -362,12 +362,12 @@ case class JsonTuple(children: Seq[Expression])
   @transient private lazy val fieldExpressions: Seq[Expression] = children.tail

   // eagerly evaluate any foldable the field names
-  @transient private lazy val foldableFieldNames: IndexedSeq[String] = {
+  @transient private lazy val foldableFieldNames: IndexedSeq[Option[String]] = {
     fieldExpressions.map {
-      case expr if expr.foldable => expr.eval().asInstanceOf[UTF8String].toString
+      case expr if expr.foldable => Option(expr.eval()).map(_.asInstanceOf[UTF8String].toString)
       case _ => null
-    }.toIndexedSeq
-  }
+    }
+  }.toIndexedSeq
--- End diff --

@viirya Done: (1) remove redundant comment, (2) move `toIndexedSeq` after the map.
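A minimal sketch of the `Option`-based change in the diff above, with plain strings standing in for `expr.eval()` results (the Spark expression machinery is not needed to see the effect): wrapping the evaluated value in `Option` turns a null field name into `None` instead of a raw null, so downstream code can pattern-match on it safely.

```scala
// Stand-ins for the results of expr.eval() on foldable field-name
// expressions; the middle one models a constant-null field name.
val evaluated: Seq[Any] = Seq("a", null, "b")

// Option(null) is None, so null evaluations become None rather than
// a null String that would later blow up with a NullPointerException.
val foldableFieldNames: IndexedSeq[Option[String]] =
  evaluated.map(v => Option(v).map(_.toString)).toIndexedSeq
// foldableFieldNames is Vector(Some("a"), None, Some("b"))
```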
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r133135302

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -359,14 +359,14 @@ case class JsonTuple(children: Seq[Expression])
   @transient private lazy val jsonExpr: Expression = children.head

   // the fields to query are the remaining children
-  @transient private lazy val fieldExpressions: Seq[Expression] = children.tail
+  @transient private lazy val fieldExpressions: Array[Expression] = children.tail.toArray

   // eagerly evaluate any foldable the field names
-  @transient private lazy val foldableFieldNames: IndexedSeq[String] = {
+  @transient private lazy val foldableFieldNames: Array[Option[String]] = {
--- End diff --

@viirya ok, thanks
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r133116488

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2034,4 +2034,13 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
     }
   }
 }
+
+  test("SPARK-21677: json_tuple throws NullPointException when column is null as string type") {
+    checkAnswer(sql(
+      """
+        |SELECT json_tuple('{"a" : 1, "b" : 2}'
+        |, cast(NULL AS STRING), 'b'
+        |, cast(NULL AS STRING), 'a')
+      """.stripMargin), Row(null, "2", null, "1"))
--- End diff --

@viirya Done, the added test case covers a column name, a constant field name, and a null field name.
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r132984129

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -361,10 +361,18 @@ case class JsonTuple(children: Seq[Expression])
   // the fields to query are the remaining children
   @transient private lazy val fieldExpressions: Seq[Expression] = children.tail

+  // a field name given with constant null will be replaced with this pseudo field name
+  private val nullFieldName = "__NullFieldName"
--- End diff --

@HyukjinKwon @viirya Yep, we've discarded the fake field name and use `Option` here. We made a slight revision to handle `None` in `foldableFieldNames` instead of creating a new function.
[GitHub] spark issue #18930: Spark 21677
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/18930 cc @viirya --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18930: Spark 21677
GitHub user jmchung opened a pull request: https://github.com/apache/spark/pull/18930

Spark 21677

## What changes were proposed in this pull request?

```scala
scala> Seq(("""{"Hyukjin": 224, "John": 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
...
java.lang.NullPointerException
at ...
```

Currently, a `null` field name throws a `NullPointerException`. Since a `null` field name can't match any field name in the JSON, we simply output null as its column value. This PR achieves that by returning a very unlikely column name, `__NullFieldName`, when evaluating the field names.

## How was this patch tested?

Added unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jmchung/spark SPARK-21677

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18930.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18930

commit 596c2804265e457c378cec433ba2b7fe64f4efdd
Author: Jen-Ming Chung
Date: 2017-08-12T08:37:04Z
fix null field name and add corresponding test

commit 796041f0f84eac70a2528c37e02dbe4163ddf3aa
Author: Jen-Ming Chung
Date: 2017-08-12T08:39:56Z
add description

commit f07a9f7e73a17ab5447945d8139a40a1ba8372c5
Author: Jen-Ming Chung
Date: 2017-08-13T01:48:31Z
modify the comment about nullFieldName
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
GitHub user jmchung reopened a pull request: https://github.com/apache/spark/pull/18865

[SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file

## What changes were proposed in this pull request?

```
echo '{"field": 1}
{"field": 2}
{"field": "3"}' > /tmp/sample.json
```

```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("field", ByteType)
  .add("_corrupt_record", StringType)
val file = "/tmp/sample.json"
val dfFromFile = spark.read.schema(schema).json(file)

scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|1    |null           |
|2    |null           |
|null |{"field": "3"} |
+-----+---------------+

scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
res1: Long = 0

scala> dfFromFile.filter($"_corrupt_record".isNull).count()
res2: Long = 3
```

When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and `_corrupt_record` is null for all rows. When a user requires only `_corrupt_record`, we assume that the corrupt records are required for all JSON fields.

## How was this patch tested?

Added test case.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jmchung/spark SPARK-21610

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18865.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18865

commit 09aa76cc228162edba7ece45063592cd17ae4a27
Author: Jen-Ming Chung
Date: 2017-08-07T03:52:45Z
[SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file

commit f73c3874a9e6a35344a3dc8f6ec8cfb17a1be2f8
Author: Jen-Ming Chung
Date: 2017-08-07T04:39:36Z
add explanation to schema change and minor refactor in test case

commit 7a595984f16f6c998883f271bf63e2e84af5f046
Author: Jen-Ming Chung
Date: 2017-08-07T04:59:07Z
move test case from DataFrameReaderWriterSuite to JsonSuite

commit 97290f0f891f4261bf173c5ff596d0bb33168d57
Author: Jen-Ming Chung
Date: 2017-08-07T05:41:15Z
filter not _corrupt_record in dataSchema

commit f5eec40d51bec8ed0f79f52c5a408ba98f26ca1a
Author: Jen-Ming Chung
Date: 2017-08-07T06:17:48Z
code refactor
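A simplified model of the all-null behavior described in the PR body, in plain Scala (Spark is not required to see the mechanism; the names `requiredSchema` and `actualSchema` are taken from the description, the list-of-strings representation is an illustrative stand-in for `StructType`): deriving `actualSchema` by filtering the corrupt-record column out of `requiredSchema` leaves nothing to parse when the user selects only `_corrupt_record`, so no record is ever marked corrupt.

```scala
val columnNameOfCorruptRecord = "_corrupt_record"

// The user selected only the corrupt-record column, e.g.
// dfFromFile.filter($"_corrupt_record".isNotNull).count()
val requiredSchema = Seq("_corrupt_record")

// actualSchema drops the corrupt-record column before parsing;
// here it ends up empty, so the parser checks no JSON fields and
// every row's _corrupt_record stays null.
val actualSchema = requiredSchema.filterNot(_ == columnNameOfCorruptRecord)
```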
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
Github user jmchung closed the pull request at: https://github.com/apache/spark/pull/18865 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
GitHub user jmchung opened a pull request: https://github.com/apache/spark/pull/18865

[SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file

## What changes were proposed in this pull request?

```
echo '{"field": 1}
{"field": 2}
{"field": "3"}' > /tmp/sample.json
```

```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("field", ByteType)
  .add("_corrupt_record", StringType)
val file = "/tmp/sample.json"
val dfFromFile = spark.read.schema(schema).json(file)

scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|1    |null           |
|2    |null           |
|null |{"field": "3"} |
+-----+---------------+

scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
res1: Long = 0

scala> dfFromFile.filter($"_corrupt_record".isNull).count()
res2: Long = 3
```

When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and `_corrupt_record` is null for all rows. When a user requires only `_corrupt_record`, we assume that the corrupt records are required for all JSON fields.

## How was this patch tested?

Added test case.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jmchung/spark SPARK-21610

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18865.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18865

commit 09aa76cc228162edba7ece45063592cd17ae4a27
Author: Jen-Ming Chung
Date: 2017-08-07T03:52:45Z
[SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file

commit f73c3874a9e6a35344a3dc8f6ec8cfb17a1be2f8
Author: Jen-Ming Chung
Date: 2017-08-07T04:39:36Z
add explanation to schema change and minor refactor in test case

commit 7a595984f16f6c998883f271bf63e2e84af5f046
Author: Jen-Ming Chung
Date: 2017-08-07T04:59:07Z
move test case from DataFrameReaderWriterSuite to JsonSuite

commit 97290f0f891f4261bf173c5ff596d0bb33168d57
Author: Jen-Ming Chung
Date: 2017-08-07T05:41:15Z
filter not _corrupt_record in dataSchema

commit f5eec40d51bec8ed0f79f52c5a408ba98f26ca1a
Author: Jen-Ming Chung
Date: 2017-08-07T06:17:48Z
code refactor