[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22893 Please fix the PR title as described in https://spark.apache.org/contributing.html and read it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22844#discussion_r229243855 --- Diff: sql/core/benchmarks/JSONBenchmarks-results.txt --- @@ -0,0 +1,33 @@ + +Benchmark for performance of JSON parsing + + +OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1 +Intel64 Family 6 Model 94 Stepping 3, GenuineIntel +JSON schema inferring: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative + +No encoding 48088 / 48180 2.1 480.9 1.0X +UTF-8 is set 71881 / 71992 1.4 718.8 0.7X + +OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1 +Intel64 Family 6 Model 94 Stepping 3, GenuineIntel +JSON per-line parsing: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative + +No encoding 12107 / 12246 8.3 121.1 1.0X +UTF-8 is set 12375 / 12475 8.1 123.8 1.0X --- End diff -- Ah, I see. This is also because of the count optimization. The ratio looks weird, but it is actually a performance improvement in both cases. It shouldn't be a big deal. ---
[GitHub] spark issue #22888: SPARK-25881
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22888 I would close this, @351zyf. ---
[GitHub] spark issue #22888: SPARK-25881
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22888 You're introducing a flag to convert. I think enabling the flag is virtually the same as calling a function to convert. ---
[GitHub] spark pull request #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22844#discussion_r229214337 --- Diff: sql/core/benchmarks/JSONBenchmarks-results.txt --- @@ -0,0 +1,33 @@ + +Benchmark for performance of JSON parsing + + +OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1 +Intel64 Family 6 Model 94 Stepping 3, GenuineIntel +JSON schema inferring: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative + +No encoding 48088 / 48180 2.1 480.9 1.0X +UTF-8 is set 71881 / 71992 1.4 718.8 0.7X + +OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1 +Intel64 Family 6 Model 94 Stepping 3, GenuineIntel +JSON per-line parsing: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative + +No encoding 12107 / 12246 8.3 121.1 1.0X +UTF-8 is set 12375 / 12475 8.1 123.8 1.0X --- End diff -- Let me take a quick look within a few days. This is the basic per-line case, which affects many users. ---
[GitHub] spark pull request #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22844#discussion_r229213742 --- Diff: sql/core/benchmarks/JSONBenchmarks-results.txt --- @@ -0,0 +1,33 @@ + +Benchmark for performance of JSON parsing + + +OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1 +Intel64 Family 6 Model 94 Stepping 3, GenuineIntel +JSON schema inferring: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative + +No encoding 48088 / 48180 2.1 480.9 1.0X +UTF-8 is set 71881 / 71992 1.4 718.8 0.7X + +OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1 +Intel64 Family 6 Model 94 Stepping 3, GenuineIntel +JSON per-line parsing: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative + +No encoding 12107 / 12246 8.3 121.1 1.0X +UTF-8 is set 12375 / 12475 8.1 123.8 1.0X --- End diff -- IIRC, this benchmark was added so that we can make sure that setting an encoding does not affect the performance of the no-encoding path (right, @MaxGekk?). We should fix this. @cloud-fan ---
[GitHub] spark pull request #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22844#discussion_r229212923 --- Diff: sql/core/benchmarks/JSONBenchmarks-results.txt --- @@ -0,0 +1,33 @@ + +Benchmark for performance of JSON parsing + + +OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1 +Intel64 Family 6 Model 94 Stepping 3, GenuineIntel +JSON schema inferring: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative + +No encoding 48088 / 48180 2.1 480.9 1.0X +UTF-8 is set 71881 / 71992 1.4 718.8 0.7X + +OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1 +Intel64 Family 6 Model 94 Stepping 3, GenuineIntel +JSON per-line parsing: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative + +No encoding 12107 / 12246 8.3 121.1 1.0X +UTF-8 is set 12375 / 12475 8.1 123.8 1.0X --- End diff -- Wait... this is almost 50% slower. This should have been around 8000-ish. ---
[GitHub] spark issue #22888: SPARK-25881
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22888 Then you can convert the type into double or float in the Spark DataFrame. This is very easy to work around in either the Pandas DataFrame or Spark's DataFrame. I don't think we should add this flag. BTW, the same feature would have to be added for when Arrow optimization is enabled as well. ---
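The workaround suggested above is just a cast instead of a conversion flag. A minimal sketch in plain Python, outside of Spark (the `rows` data is made up for illustration):

```python
from decimal import Decimal

# Spark's DecimalType surfaces as decimal.Decimal objects in a Pandas DataFrame.
# Instead of a conversion flag, the values can simply be cast to float, either on
# the Spark side (cast the column to "double" before toPandas()) or on the Pandas
# side after conversion, as sketched here.
rows = [Decimal("1.10"), Decimal("2.25"), Decimal("3.00")]
as_floats = [float(v) for v in rows]
```

In PySpark the equivalent on the Spark side would be a `cast("double")` on the column before calling `toPandas()`; it is shown here in plain Python only because it needs a running SparkSession.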
[GitHub] spark issue #22885: [BUILD][MINOR] release script should not interrupt by sv...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22885 retest this please ---
[GitHub] spark issue #22888: SPARK-25881
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22888 I think you can just manually convert it in the Pandas DataFrame, no? ---
[GitHub] incubator-livy issue #121: [WIP][LIVY-518][BUILD] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/incubator-livy/pull/121 As of RC3, all the unit tests passed (https://travis-ci.org/HyukjinKwon/incubator-livy/builds/441687251). I am running tests against RC5 - https://travis-ci.org/HyukjinKwon/incubator-livy/builds/448148672 ---
[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16429 This is fixed as of Spark 1.6.4, 2.0.3, 2.1.1 and 2.2.0. ---
[GitHub] spark issue #22878: [SPARK-25789][SQL] Support for Dataset of Avro
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22878 I just quickly and roughly tested this. The merge script looks like it only recognises the main author of each commit in a PR. Let's just push a commit here. ---
[GitHub] spark issue #22878: [SPARK-25789][SQL] Support for Dataset of Avro
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22878 I wonder if that can be handled by the merge script, though. I think it's okay to just pick up some commits there and rebase them onto this branch, even if they are empty commits. That makes it easier for committers to set the right primary author when it's merged. ---
[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22666 Thanks, @cloud-fan. The change looks good to me. Let me take another look at this and leave a sign-off (which means a sign-off for @MaxGekk's code changes). ---
[GitHub] spark issue #22877: [MINOR][SQL] Avoid hardcoded configuration keys in SQLCo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22877 Thanks, @kiszk and @dongjoon-hyun ---
[GitHub] spark issue #22877: [MINOR][SQL] Avoid hardcoded configuration keys in SQLCo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22877 Merged to master ---
[GitHub] spark issue #21654: [SPARK-24671][PySpark] DataFrame length using a dunder/m...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21654 Thanks, @holdenk for addressing my concern. I will try to join as well. ---
[GitHub] spark pull request #22666: [SPARK-25672][SQL] schema_of_csv() - schema infer...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22666#discussion_r228949792 --- Diff: sql/core/src/test/resources/sql-tests/inputs/csv-functions.sql --- @@ -7,3 +7,11 @@ select from_csv('1', 'a InvalidType'); select from_csv('1', 'a INT', named_struct('mode', 'PERMISSIVE')); select from_csv('1', 'a INT', map('mode', 1)); select from_csv(); +-- infer schema of json literal +select from_csv('1,abc', schema_of_csv('1,abc')); +select schema_of_csv('1|abc', map('delimiter', '|')); +select schema_of_csv(null); +CREATE TEMPORARY VIEW csvTable(csvField, a) AS SELECT * FROM VALUES ('1,abc', 'a'); +SELECT schema_of_csv(csvField) FROM csvTable; +-- Clean up +DROP VIEW IF EXISTS csvTable; --- End diff -- I see, but isn't it still better to explicitly clean the tables up? ---
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588 Yes, that was what I was thinking as the worst case. For clarification, @wangyum made an attempt and at least all tests passed - https://github.com/apache/spark/pull/20659. Given that attempt, upgrading without breaking compatibility looks quite feasible. ---
[GitHub] spark pull request #22877: [MINOR][SQL] Avoid hardcoded configuration keys i...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/22877 [MINOR][SQL] Avoid hardcoded configuration keys in SQLConf's `doc` ## What changes were proposed in this pull request? This PR proposes to avoid hardcoded configuration keys in SQLConf's `doc`. ## How was this patch tested? Manually verified. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark minor-conf-name Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22877.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22877 commit 599557012cc9f4d4e5f36744180121cc8845413c Author: hyukjinkwon Date: 2018-10-29T12:42:54Z Avoid hardcoded configuration keys in SQLConf's `doc` ---
[GitHub] spark issue #22872: [SPARK-25864][SQL][TEST] Make main args accessible for B...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22872 retest this please ---
[GitHub] spark issue #22530: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand's input...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22530 retest this please ---
[GitHub] spark issue #22530: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand's input...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22530 retest this please ---
[GitHub] spark pull request #22078: [SPARK-25085][SQL] Table subdirectories should in...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22078#discussion_r228882841 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -626,6 +626,14 @@ object SQLConf { .stringConf .createWithDefault("parquet") + val DATA_SOURCE_TABLE_INHERIT_PERMS = buildConf("spark.sql.datasource.table.inherit.perms") +.internal() +.doc("Set this to true if the table directories should be inheriting the permission " + + "of the warehouse or database directory " + --- End diff -- Actually, the Hive configuration doc itself doesn't look quite clear to me either. ---
[GitHub] spark pull request #22078: [SPARK-25085][SQL] Table subdirectories should in...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22078#discussion_r228881996 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala --- @@ -70,7 +76,6 @@ case class InsertIntoHadoopFsRelationCommand( val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(options) val fs = outputPath.getFileSystem(hadoopConf) val qualifiedOutputPath = outputPath.makeQualified(fs.getUri, fs.getWorkingDirectory) - --- End diff -- I would remove unrelated newline changes. ---
[GitHub] spark pull request #22078: [SPARK-25085][SQL] Table subdirectories should in...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22078#discussion_r228881824 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala --- @@ -261,4 +272,69 @@ case class InsertIntoHadoopFsRelationCommand( } }.toMap } + + private def isExtendedAclEnabled(hadoopConf: Configuration): Boolean = +hadoopConf.getBoolean(DFS_NAMENODE_ACLS_ENABLED_KEY, DFS_NAMENODE_ACLS_ENABLED_DEFAULT) + + private def getFullFileStatus( + conf: SQLConf, + hadoopConf: Configuration, + pathExists: Boolean, + fs: FileSystem, + file: Path): (Option[FsPermission], Option[AclStatus]) = { +var permission: Option[FsPermission] = None +var aclStatus: Option[AclStatus] = None +if (conf.isDataSouceTableInheritPerms && pathExists) { + permission = Some(fs.getFileStatus(file).getPermission) + if (isExtendedAclEnabled(hadoopConf)) aclStatus = Some(fs.getAclStatus(file)) +} +(permission, aclStatus) --- End diff -- I would remove the `var`s by, for instance:

```scala
val shouldFetchPermission = conf.isDataSouceTableInheritPerms && pathExists
val shouldFetchAclStatus = shouldFetchPermission && isExtendedAclEnabled(hadoopConf)
val permission = if (shouldFetchPermission) {
  Some(fs.getFileStatus(file).getPermission)
} else {
  None
}
val aclStatus = if (shouldFetchAclStatus) {
  Some(fs.getAclStatus(file))
} else {
  None
}
(permission, aclStatus)
```
---
[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3206 It should be usable if the change is cherry-picked properly. This PR basically just replaces one line: https://github.com/apache/zeppelin/blob/v0.8.0/spark/scala-2.11/src/main/scala/org/apache/zeppelin/spark/SparkScala211Interpreter.scala#L84 with a private function. ---
[GitHub] spark issue #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send o...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22275 retest this please ---
[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3206 The error message: ``` [ERROR] /home/cloud-user/ajay/code/csf-cc-zeppelin-k8szep/spark/scala-2.11/src/main/scala/org/apache/zeppelin/spark/SparkScala211Interpreter.scala:37: error: class SparkScala211Interpreter needs to be abstract, since method interpret in class BaseSparkScalaInterpreter of type (code: String, context: org.apache.zeppelin.interpreter.InterpreterContext)org.apache.zeppelin.interpreter.InterpreterResult is not defined ``` complains that there's no `interpret` method defined in `BaseSparkScalaInterpreter`. ---
[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3206 Does that happen only with these code changes? The change here does not touch the signature of `class SparkScala211Interpreter(`, and the error message looks pretty unrelated. The whole change proposed here does not change any signatures. ---
[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22870 retest this please ---
[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22871 Thanks, @dongjoon-hyun and @gatorsmile ---
[GitHub] spark issue #22530: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand's input...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22530 retest this please ---
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22326 late LGTM ---
[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22847#discussion_r228789484 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -812,6 +812,17 @@ object SQLConf { .intConf .createWithDefault(65535) + val CODEGEN_METHOD_SPLIT_THRESHOLD = buildConf("spark.sql.codegen.methodSplitThreshold") +.internal() +.doc("The threshold of source code length without comment of a single Java function by " + + "codegen to be split. When the generated Java function source code exceeds this threshold" + + ", it will be split into multiple small functions. We can't know how many bytecode will " + --- End diff -- Not a big deal but I would avoid abbreviation in documentation. `can't` -> `cannot` ---
[GitHub] spark pull request #22666: [SPARK-25672][SQL] schema_of_csv() - schema infer...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22666#discussion_r228787126 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala --- @@ -19,14 +19,39 @@ package org.apache.spark.sql.catalyst.expressions import org.apache.spark.sql.AnalysisException import org.apache.spark.sql.catalyst.util.ArrayBasedMapData -import org.apache.spark.sql.types.{MapType, StringType, StructType} +import org.apache.spark.sql.types.{DataType, MapType, StringType, StructType} +import org.apache.spark.unsafe.types.UTF8String object ExprUtils { - def evalSchemaExpr(exp: Expression): StructType = exp match { -case Literal(s, StringType) => StructType.fromDDL(s.toString) + def evalSchemaExpr(exp: Expression): StructType = { +// Use `DataType.fromDDL` since the type string can be struct<...>. +val dataType = exp match { + case Literal(s, StringType) => +DataType.fromDDL(s.toString) + case e @ SchemaOfCsv(_: Literal, _) => +val ddlSchema = e.eval(EmptyRow).asInstanceOf[UTF8String] +DataType.fromDDL(ddlSchema.toString) + case e => throw new AnalysisException( +"Schema should be specified in DDL format as a string literal or output of " + + s"the schema_of_csv function instead of ${e.sql}") +} + +if (!dataType.isInstanceOf[StructType]) { + throw new AnalysisException( +s"Schema should be struct type but got ${dataType.sql}.") +} +dataType.asInstanceOf[StructType] + } + + def evalTypeExpr(exp: Expression): DataType = exp match { +case Literal(s, StringType) => DataType.fromDDL(s.toString) --- End diff -- Yup, that's what I initially thought: that we should allow constant-foldable expressions as well, but I decided to follow the initial intent of literal-only support. I also wasn't sure when we would need constant folding to construct a JSON example, because I suspected such examples are usually copied and pasted from, for instance, a file. ---
[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22871 cc @BryanCutler and @gatorsmile. ---
[GitHub] spark pull request #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType s...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/22871 [SPARK-25179][PYTHON][DOCS] Document BinaryType support in Arrow conversion ## What changes were proposed in this pull request? This PR targets to document binary type in "Apache Arrow in Spark". ## How was this patch tested? Manually built the documentation and checked. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-25179 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22871.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22871 commit 34109bc63dd12f733cf2052b048093799f9b0102 Author: hyukjinkwon Date: 2018-10-29T02:53:18Z Document BinaryType support in Arrow conversion ---
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588 @dongjoon-hyun and @wangyum, please fix my comment if I am wrong at any point - I believe you guys took a look at this part more than I did. ---
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588 > Does this upgrade Hive for execution or also for metastore? Spark supports virtually all Hive metastore versions out there, and a lot of deployments do run different versions of Spark against the same old Hive metastore, and it'd be bad to break connectivity to old Hive metastores. > The execution part is a different story and we can upgrade them easily. The upgrade basically targets upgrading Hive for execution (let me know if I am mistaken). For metastore compatibility, I believe we are able to provide metastore jars and support other Hive versions by explicitly configuring the JARs, loaded via the isolated classloader. I believe we have basic tests for different Hive versions. I would cautiously like to raise an option: drop the built-in metastore support by default at 3.0 if the upgrade makes keeping the built-in metastore support hard enough. ---
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22868#discussion_r228776349 --- Diff: docs/sql-migration-guide-hive-compatibility.md --- @@ -51,6 +51,9 @@ Spark SQL supports the vast majority of Hive features, such as: * Explain * Partitioned tables including dynamic partition insertion * View + * If column aliases are not specified in view definition queries, both Spark and Hive will +generate alias names, but in different ways. In order for Spark to be able to read views created +by Hive, users should explicitly specify column aliases in view definition queries. --- End diff -- +1 ---
[GitHub] spark pull request #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFi...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22865#discussion_r228776300 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -462,7 +462,7 @@ object SQLConf { val PARQUET_RECORD_FILTER_ENABLED = buildConf("spark.sql.parquet.recordLevelFilter.enabled") .doc("If true, enables Parquet's native record-level filtering using the pushed down " + "filters. This configuration only has an effect when 'spark.sql.parquet.filterPushdown' " + - "is enabled.") + "is enabled and spark.sql.parquet.enableVectorizedReader is disabled.") --- End diff -- SGTM ---
[GitHub] spark issue #22858: [SPARK-24709][SQL][2.4] use str instead of basestring in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22858 Oops, mind fixing the PR title too? ---
[GitHub] spark issue #22858: [SPARK-24709][SQL][2.4] use str instead of basestring in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22858 @cloud-fan, thanks for doing this backport! ---
[GitHub] spark issue #22858: [SPARK-24709][SQL][2.4] use str instead of basestring in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22858 Merged to branch-2.4. ---
[GitHub] spark pull request #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFi...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22865#discussion_r228731568 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -462,7 +462,7 @@ object SQLConf { val PARQUET_RECORD_FILTER_ENABLED = buildConf("spark.sql.parquet.recordLevelFilter.enabled") .doc("If true, enables Parquet's native record-level filtering using the pushed down " + "filters. This configuration only has an effect when 'spark.sql.parquet.filterPushdown' " + - "is enabled.") + "is enabled and spark.sql.parquet.enableVectorizedReader is disabled.") --- End diff -- BTW, a vectorized reader is in progress on the Parquet side. Ideally, 3.0 might be a good target for replacing ours with it. ---
[GitHub] spark pull request #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFi...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22865#discussion_r228731385 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -462,7 +462,7 @@ object SQLConf { val PARQUET_RECORD_FILTER_ENABLED = buildConf("spark.sql.parquet.recordLevelFilter.enabled") .doc("If true, enables Parquet's native record-level filtering using the pushed down " + "filters. This configuration only has an effect when 'spark.sql.parquet.filterPushdown' " + - "is enabled.") + "is enabled and spark.sql.parquet.enableVectorizedReader is disabled.") --- End diff -- I intentionally didn't document this, because the vectorized reader can be disabled under other conditions even when `spark.sql.parquet.enableVectorizedReader` is enabled (e.g., for complex types). Shall we clearly document this? ---
[GitHub] spark issue #22858: [SPARK-24709][SQL][2.4] use str instead of basestring in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22858 Yup, I think strictly we should change it. It looks like there are two occurrences of `isinstance(..., str)`, at `udf` and `pandas_udf`. Another problem in PySpark is inconsistent type comparison, like `type(...) == t` vs `isinstance(..., t)`. For instance, with `type(...) == dict` vs `isinstance(..., dict)`, the former does not allow `OrderedDict` but the latter does. Another problem is that some types, like `bool` in Python, inherit from `int`. In this case, `isinstance(...)` might produce unexpected results, for instance:

```python
>>> isinstance(True, int)
True
```

I was nervous about the cases above and haven't made those changes so far. ---
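Both pitfalls mentioned above can be demonstrated in a few lines of plain Python (a standalone illustration, not code from PySpark itself):

```python
from collections import OrderedDict

# Exact-type comparison rejects subclasses; isinstance accepts them.
od = OrderedDict(a=1)
exact_match = type(od) == dict         # False: OrderedDict is not exactly dict
subclass_match = isinstance(od, dict)  # True: OrderedDict subclasses dict

# bool subclasses int, so isinstance is permissive where an
# exact-type check would not be.
bool_is_int = isinstance(True, int)    # True
bool_exactly_int = type(True) == int   # False
```

This is why switching between `type(...) == t` and `isinstance(..., t)` is not a mechanical change: each occurrence has to be checked for which behavior was intended.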
[GitHub] spark pull request #22858: [SPARK-24709][SQL][2.4] use str instead of basest...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22858#discussion_r228731178 --- Diff: python/pyspark/sql/functions.py --- @@ -2326,7 +2326,7 @@ def schema_of_json(json): >>> df.select(schema_of_json('{"a": 0}').alias("json")).collect() [Row(json=u'struct')] """ -if isinstance(json, basestring): +if isinstance(json, str): --- End diff -- Yea, we should. They are added only where needed, because there are so many cases like that (for instance, `imap` in Python 2 vs `map` in Python 3). Looks like that was added in another PR, in the master branch only. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 I meant to use https://github.com/apache/spark/blob/a97001d21757ae214c86371141bd78a376200f66/python/pyspark/serializers.py#L583 instead of https://github.com/apache/spark/blob/a97001d21757ae214c86371141bd78a376200f66/python/pyspark/serializers.py#L561 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22858: [SPARK-24709][SQL][2.4] use str instead of basest...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22858#discussion_r228713086 --- Diff: python/pyspark/sql/functions.py --- @@ -2326,7 +2326,7 @@ def schema_of_json(json): >>> df.select(schema_of_json('{"a": 0}').alias("json")).collect() [Row(json=u'struct')] """ -if isinstance(json, basestring): +if isinstance(json, str): --- End diff -- The problem here is we will not support unicode in Python 2 .. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22858: [SPARK-24709][SQL][2.4] use str instead of basestring in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22858 Wenchen, this is because

```python
if sys.version >= '3':
    basestring = str
```

is missing. Python 3 does not have `basestring`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
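For context, that shim lets the same `isinstance` check run on both Python lines; a minimal standalone sketch:

```python
import sys

# Python 3 removed `basestring`; aliasing it to `str` keeps Python 2-era
# isinstance checks working (on Python 2, the real basestring also matched
# unicode strings).
if sys.version_info[0] >= 3:
    basestring = str

print(isinstance("hello", basestring))       # True
print(isinstance(u"caf\u00e9", basestring))  # True
```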
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 Adding @gatorsmile and @cloud-fan as well, since this might be a potentially breaking change for the 3.0 release (it only affects RDD operations with namedtuples in certain cases, though). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 And you can also run the profiler to show the performance effect. See https://github.com/apache/spark/pull/19246#discussion_r139874732 for how to run the profiler. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 You can just replace it with CloudPickler, remove the changes in the tests, and push that commit here to show that no case is broken. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22666 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22666 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20503 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22775 Oh, you mean fixing the conflict is not that hard. Thanks for doing this, @cloud-fan. I planned to do this today .. :-). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 Yea, so to avoid breaking it, we could change the default pickler to CloudPickler or document this workaround. @superbobry, can you check if the case can be preserved if we use CloudPickler instead? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22775 Yea, but I meant it's a bit complicated; still, I'm okay with that approach, @cloud-fan. Thanks for doing that. I planned to do it today (now) :). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588 > Hive 2.3 works with Hadoop 2.x (Hive 3.x works with Hadoop 3.x). This is essentially what we need for Hadoop 3 support; see release-2.3.2 (https://github.com/apache/hive/blob/rel/release-2.3.2/shims/common/src/main/java/org/apache/hadoop/hive/shims/ShimLoader.java#L152-L154). See https://issues.apache.org/jira/browse/HIVE-16081. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22775 Sure! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22850: [MINOR][DOC] Fix comment error of HiveUtils
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22850 Yea, I was aware of it. I think there are some more old comments in this file if I remember this correctly. Can you double check and fix them while we are here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22850: [MINOR][DOC] Fix comment error of HiveUtils
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22850 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22775#discussion_r228520891 --- Diff: python/pyspark/sql/functions.py --- @@ -2365,30 +2365,32 @@ def to_json(col, options={}): @ignore_unicode_prefix @since(2.4) -def schema_of_json(col, options={}): +def schema_of_json(json, options={}): """ -Parses a column containing a JSON string and infers its schema in DDL format. +Parses a JSON string and infers its schema in DDL format. -:param col: string column in json format +:param json: a JSON string or a string literal containing a JSON string. :param options: options to control parsing. accepts the same options as the JSON datasource .. versionchanged:: 3.0 It accepts `options` parameter to control schema inferring. ->>> from pyspark.sql.types import * ->>> data = [(1, '{"a": 1}')] ->>> df = spark.createDataFrame(data, ("key", "value")) ->>> df.select(schema_of_json(df.value).alias("json")).collect() -[Row(json=u'struct')] +>>> df = spark.range(1) >>> df.select(schema_of_json(lit('{"a": 0}')).alias("json")).collect() [Row(json=u'struct')] ->>> schema = schema_of_json(lit('{a: 1}'), {'allowUnquotedFieldNames':'true'}) +>>> schema = schema_of_json('{a: 1}', {'allowUnquotedFieldNames':'true'}) >>> df.select(schema.alias("json")).collect() [Row(json=u'struct')] """ +if isinstance(json, basestring): --- End diff -- Actually, Wenchen, I think that's going to make it a bit complicated after few more thoughts .. I think it's okay to go ahead. It's kind of a mistake that we had to do in 2.4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22775#discussion_r228504453 --- Diff: python/pyspark/sql/functions.py --- @@ -2365,30 +2365,32 @@ def to_json(col, options={}): @ignore_unicode_prefix @since(2.4) -def schema_of_json(col, options={}): +def schema_of_json(json, options={}): """ -Parses a column containing a JSON string and infers its schema in DDL format. +Parses a JSON string and infers its schema in DDL format. -:param col: string column in json format +:param json: a JSON string or a string literal containing a JSON string. :param options: options to control parsing. accepts the same options as the JSON datasource .. versionchanged:: 3.0 It accepts `options` parameter to control schema inferring. ->>> from pyspark.sql.types import * ->>> data = [(1, '{"a": 1}')] ->>> df = spark.createDataFrame(data, ("key", "value")) ->>> df.select(schema_of_json(df.value).alias("json")).collect() -[Row(json=u'struct')] +>>> df = spark.range(1) >>> df.select(schema_of_json(lit('{"a": 0}')).alias("json")).collect() [Row(json=u'struct')] ->>> schema = schema_of_json(lit('{a: 1}'), {'allowUnquotedFieldNames':'true'}) +>>> schema = schema_of_json('{a: 1}', {'allowUnquotedFieldNames':'true'}) >>> df.select(schema.alias("json")).collect() [Row(json=u'struct')] """ +if isinstance(json, basestring): --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22771: [SPARK-25773][Core]Cancel zombie tasks in a result stage...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22771 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22775 Yup, yup .. I should sync the tests --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22814: [SPARK-25819][SQL] Support parse mode option for the fun...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22814 Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21588 Yup, it supports Hadoop 3, and the other fixes @wangyum mentioned. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22814: [SPARK-25819][SQL] Support parse mode option for the fun...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22814 LGTM too --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22814: [SPARK-25819][SQL] Support parse mode option for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22814#discussion_r228381742 --- Diff: docs/sql-data-sources-avro.md --- @@ -177,6 +180,19 @@ Data source options of Avro can be set using the `.option` method on `DataFrameR Currently supported codecs are uncompressed, snappy, deflate, bzip2 and xz. If the option is not set, the configuration spark.sql.avro.compression.codec config is taken into account. write + +mode +FAILFAST +The mode option allows to specify parse mode for function from_avro. + Currently supported modes are: + +FAILFAST: Throws an exception on processing corrupted record. +PERMISSIVE: Corrupt records are processed as null result. To implement this, the --- End diff -- Not a big deal, but `To implement this` sounds a little bit awkward. I think you can just say `Corrupt records are processed as null; therefore, the given schema is forced to be fully nullable.` (feel free to adjust the wording as you see fit). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22814: [SPARK-25819][SQL] Support parse mode option for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22814#discussion_r228380951 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/package.scala --- @@ -31,10 +32,32 @@ package object avro { * @since 2.4.0 */ @Experimental - def from_avro(data: Column, jsonFormatSchema: String): Column = { -new Column(AvroDataToCatalyst(data.expr, jsonFormatSchema)) + def from_avro( + data: Column, + jsonFormatSchema: String): Column = { +new Column(AvroDataToCatalyst(data.expr, jsonFormatSchema, Map.empty)) + } + + /** + * Converts a binary column of avro format into its corresponding catalyst value. The specified + * schema must match the read data, otherwise the behavior is undefined: it may fail or return + * arbitrary result. + * + * @param data the binary column. + * @param jsonFormatSchema the avro schema in JSON string format. + * @param options options to control how the Avro record is parsed. + * + * @since 3.0.0 + */ + @Experimental + def from_avro( + data: Column, + jsonFormatSchema: String, + options: Map[String, String]): Column = { --- End diff -- One thing I am less sure about, though: what do you think about exposing just one Java-specific overload? This one will only be usable from the Scala side. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22814: [SPARK-25819][SQL] Support parse mode option for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22814#discussion_r228380639 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroFunctionsSuite.scala --- @@ -61,6 +59,24 @@ class AvroFunctionsSuite extends QueryTest with SharedSQLContext { checkAnswer(avroStructDF.select(from_avro('avro, avroTypeStruct)), df) } + test("handle invalid input in from_avro") { +val count = 10 +val df = spark.range(count).select(struct('id, 'id.as("id2")).as("struct")) +val avroStructDF = df.select(to_avro('struct).as("avro")) +val avroTypeStruct = s""" + |{ + | "type": "record", + | "name": "struct", + | "fields": [ + |{"name": "col1", "type": "long"}, + |{"name": "col2", "type": "double"} + | ] + |} +""".stripMargin +val expected = (0 until count).map(_ => Row(Row(null, null))) +checkAnswer(avroStructDF.select(from_avro('avro, avroTypeStruct)), expected) --- End diff -- Thanks. Makes sense to me --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22827: [SPARK-25832][SQL][BRANCH-2.4] Revert newly added map re...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22827 LGTM too --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22841: [SPARK-25842][SQL] Deprecate rangeBetween APIs introduce...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22841 Looks good to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22841: [SPARK-25842][SQL] Deprecate rangeBetween APIs in...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22841#discussion_r228376996 --- Diff: python/pyspark/sql/window.py --- @@ -239,34 +212,27 @@ def rangeBetween(self, start, end): and "5" means the five off after the current row. We recommend users use ``Window.unboundedPreceding``, ``Window.unboundedFollowing``, -``Window.currentRow``, ``pyspark.sql.functions.unboundedPreceding``, -``pyspark.sql.functions.unboundedFollowing`` and ``pyspark.sql.functions.currentRow`` -to specify special boundary values, rather than using integral values directly. +and ``Window.currentRow`` to specify special boundary values, rather than using integral --- End diff -- Oops, right. I misread. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22815: [SPARK-25821][SQL] Remove SQLContext methods depr...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22815#discussion_r228376272 --- Diff: R/pkg/R/SQLContext.R --- @@ -434,6 +388,7 @@ read.orc <- function(path, ...) { #' Loads a Parquet file, returning the result as a SparkDataFrame. #' #' @param path path of file to read. A vector of multiple paths is allowed. +#' @param ... additional external data source specific named properties. --- End diff -- Yeah, actually I opened a PR: https://github.com/srowen/spark/pull/2 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22841: [SPARK-25842][SQL] Deprecate rangeBetween APIs in...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22841#discussion_r228376015 --- Diff: python/pyspark/sql/window.py --- @@ -239,34 +212,27 @@ def rangeBetween(self, start, end): and "5" means the five off after the current row. We recommend users use ``Window.unboundedPreceding``, ``Window.unboundedFollowing``, -``Window.currentRow``, ``pyspark.sql.functions.unboundedPreceding``, -``pyspark.sql.functions.unboundedFollowing`` and ``pyspark.sql.functions.currentRow`` -to specify special boundary values, rather than using integral values directly. +and ``Window.currentRow`` to specify special boundary values, rather than using integral --- End diff -- Reynold, it looks like we should deprecate `rangeBetween` here on the Python side as well:

```python
def rangeBetween(...):
    """
    ...

    .. note:: Deprecated in Spark 2.4. See SPARK-25842 for more information.
    """
    warnings.warn(
        "Deprecated in Spark 2.4. See SPARK-25842 for more information.",
        DeprecationWarning)
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
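The suggested pattern can be exercised in plain Python; a minimal standalone sketch (the class and method here are toy stand-ins, not the actual PySpark code):

```python
import warnings


class Window(object):
    """Toy stand-in for pyspark.sql.window.Window, for illustration only."""

    def rangeBetween(self, start, end):
        warnings.warn(
            "Deprecated in Spark 2.4. See SPARK-25842 for more information.",
            DeprecationWarning)
        return (start, end)  # placeholder for the real boundary handling


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Window().rangeBetween(0, 5)

# A DeprecationWarning was recorded for the call.
print(any(issubclass(w.category, DeprecationWarning) for w in caught))  # True
```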
[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16812 This can be easily worked around, no? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22841: [SPARK-25842][SQL] Deprecate rangeBetween APIs introduce...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22841 Yup, I also agree with this revert. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22775 Maybe I am being too careful about it, but I am kind of nervous about this column case. I don't intend to disallow it entirely - only for Spark 2.4. We might have to find a way to support a column with `from_json`, but I want to defer that to 3.0. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22775 Actually, that use case can be more easily accomplished by simply inferring the schema via the JSON datasource. Yea, I indeed suggested that as a workaround for this issue before. Let's say, `spark.read.json(df.select("json").as[String]).schema`. I know the use case of `schema_of_json` is not super clear, and actually that's also partly why I want to allow only what we need for this expression now. @rxin, WDYT? This PR tries to allow only what we need for now. Let's say we disallow:

```scala
schema_of_json(column)
```

and only allow

```scala
schema_of_json(literal)
```

because the main use case is:

```scala
from_json(schema_of_json(literal))
```

and

```scala
from_json(schema_of_json(column))
```

is already not supported. My judgement was that `schema_of_json(column)` doesn't have to be exposed for now, and I actually want to talk with @MaxGekk about this when he comes back from his vacation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22621: [SPARK-25602][SQL] SparkPlan.getByteArrayRdd should not ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22621 That's my point - why do we have to document a fix for unexpected results? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22747: [SPARK-25760][SQL] Set AddJarCommand return empty
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22747 Yup, that's a similar argument to the one I made in https://github.com/apache/spark/pull/22773#issuecomment-432923361. I think we should clarify what to document there. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22814: [SPARK-25819][SQL] Support parse mode option for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22814#discussion_r228113346 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -10,6 +10,9 @@ displayTitle: Spark SQL Upgrading Guide ## Upgrading From Spark SQL 2.4 to 3.0 - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder comes to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`. + - In Avro data source, the function `from_avro` supports following parse modes: --- End diff -- nit: let's add a space to make it more readable (I added each space between each item). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22814: [SPARK-25819][SQL] Support parse mode option for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22814#discussion_r228115771 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroFunctionsSuite.scala --- @@ -61,6 +59,24 @@ class AvroFunctionsSuite extends QueryTest with SharedSQLContext { checkAnswer(avroStructDF.select(from_avro('avro, avroTypeStruct)), df) } + test("handle invalid input in from_avro") { +val count = 10 +val df = spark.range(count).select(struct('id, 'id.as("id2")).as("struct")) +val avroStructDF = df.select(to_avro('struct).as("avro")) +val avroTypeStruct = s""" + |{ + | "type": "record", + | "name": "struct", + | "fields": [ + |{"name": "col1", "type": "long"}, + |{"name": "col2", "type": "double"} + | ] + |} +""".stripMargin +val expected = (0 until count).map(_ => Row(Row(null, null))) +checkAnswer(avroStructDF.select(from_avro('avro, avroTypeStruct)), expected) --- End diff -- BTW, when would there be malformed records - is this usually only when the schema is different? The main purpose of parse modes in CSV and JSON is to cover the limitations of semi-structured input data, so I was wondering how useful this is in Avro if most cases arise only when the input schema is different. Also, one good thing about PERMISSIVE mode is that it allows filling invalid records into `columnNameOfCorruptRecord`; here, it looks like it doesn't quite make sense to add that functionality to Avro. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22775 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22814: [SPARK-25819][SQL] Support parse mode option for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22814#discussion_r228065259 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala --- @@ -21,16 +21,31 @@ import org.apache.avro.Schema import org.apache.avro.generic.GenericDatumReader import org.apache.avro.io.{BinaryDecoder, DecoderFactory} -import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression} +import org.apache.spark.SparkException +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, SpecificInternalRow, UnaryExpression} import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, ExprCode} -import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType} +import org.apache.spark.sql.catalyst.util.{FailFastMode, ParseMode, PermissiveMode} +import org.apache.spark.sql.types._ -case class AvroDataToCatalyst(child: Expression, jsonFormatSchema: String) +case class AvroDataToCatalyst( +child: Expression, +jsonFormatSchema: String, +options: Map[String, String]) extends UnaryExpression with ExpectsInputTypes { override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType) - override lazy val dataType: DataType = SchemaConverters.toSqlType(avroSchema).dataType + override lazy val dataType: DataType = { +val dt = SchemaConverters.toSqlType(avroSchema).dataType +parseMode match { + // With PermissiveMode, the output Catalyst row might contain columns of null values for + // corrupt records, even if some of the columns are not nullable in the user-provided schema. + // Therefore we force the schema to be all nullable here. + case PermissiveMode => dt.asNullable --- End diff -- This looks like it's going to be an external behaviour change for users coming from 2.4.0. Please update the migration guide as well.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
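The nullable-forcing behaviour discussed above can be sketched on a toy schema representation in plain Python (an illustration of the idea only, not Spark's actual `asNullable` implementation):

```python
def as_nullable(node):
    """Recursively mark a toy struct schema and all nested fields nullable."""
    node = dict(node, nullable=True)  # shallow copy; leaves the input intact
    if node["type"] == "struct":
        node["fields"] = [as_nullable(f) for f in node["fields"]]
    return node


# A user-provided schema with a non-nullable field, as in the from_avro case.
schema = {
    "type": "struct",
    "nullable": False,
    "fields": [
        {"name": "col1", "type": "long", "nullable": False},
        {"name": "col2", "type": "double", "nullable": True},
    ],
}

relaxed = as_nullable(schema)
print(all(f["nullable"] for f in relaxed["fields"]))  # True
print(schema["fields"][0]["nullable"])                # False: original untouched
```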
[GitHub] spark issue #22621: [SPARK-25602][SQL] SparkPlan.getByteArrayRdd should not ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22621 Let's say, this can be a behaviour change too, since the metrics are now changed. Should we update the migration guide for safety? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22690: [SPARK-19287][CORE][STREAMING] JavaPairRDD flatMapValues...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22690 cc @cloud-fan and @gatorsmile Should we update migration guide as well? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22503 @justinuang, this might affect existing user applications. Although this matches the behaviour of non-multiline mode, can we explicitly mention it in the migration guide? cc @cloud-fan and @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22747: [SPARK-25760][SQL] Set AddJarCommand return empty
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22747 This also looks like an external change for existing application users. Shall we update the migration guide? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22773: [SPARK-25785][SQL] Add prettyNames for from_json, to_jso...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22773 Yup, will encourage to update the migration guide in that way. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22728: [SPARK-25736][SQL][TEST] add tests to verify the behavio...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22728 (From https://github.com/apache/spark/pull/22773#issuecomment-432917994) @gatorsmile and @cloud-fan, let's say this will break `DESCRIBE FUNCTION EXTENDED`. Should we update migration guide as well? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22815: [SPARK-25821][SQL] Remove SQLContext methods deprecated ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22815 BTW, should we update migration guide too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22773: [SPARK-25785][SQL] Add prettyNames for from_json, to_jso...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22773 Sure, so for clarification, we will document everything that affects external user applications, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22775 @cloud-fan, looks we are going to start another RC. Would you mind if I ask to take a quick look before the new RC? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org