[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/7#discussion_r217953294

--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -1803,6 +1803,18 @@ test_that("string operators", {
     collect(select(df4, split_string(df4$a, "")))[1, 1],
     list(list("a.b@c.d 1", "b"))
   )
+  expect_equal(
+    collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1],
+    list(list("a", "b@c.d 1\\b"))
+  )
+  expect_equal(
+    collect(select(df4, split_string(df4$a, "b", -2)))[1, 1],
+    list(list("a.", "@c.d 1\\", ""))
+  )
+  expect_equal(
+    collect(select(df4, split_string(df4$a, "b", 0)))[1, 1],
--- End diff --

For context, we've had some cases in the past where the wrong value was passed for a parameter - so let's at least have one test with and one without the optional parameter.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
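The limit semantics being tested mirror Java's `String.split(regex, limit)` (a JVM-side sketch for context, not the R API itself): a positive limit applies the pattern at most limit - 1 times, a limit of zero drops trailing empty strings, and a negative limit keeps them.

```scala
object SplitLimitDemo extends App {
  val s = "a.b.c.."
  // positive limit: pattern applied at most limit - 1 times
  assert(s.split("\\.", 2).toList == List("a", "b.c.."))
  // zero limit: trailing empty strings are removed
  assert(s.split("\\.", 0).toList == List("a", "b", "c"))
  // negative limit: trailing empty strings are kept
  assert(s.split("\\.", -1).toList == List("a", "b", "c", "", ""))
  println("ok")
}
```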
[GitHub] spark pull request #22418: [SPARK-25427][SQL][TEST] Add BloomFilter creation...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22418#discussion_r217952724

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -50,6 +55,66 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
       .createOrReplaceTempView("orc_temp_table")
   }

+  protected def testBloomFilterCreation(bloomFilterKind: Kind) {
+    val tableName = "bloomFilter"
+
+    withTempDir { dir =>
+      withTable(tableName) {
+        val sqlStatement = orcImp match {
+          case "native" =>
+            s"""
+               |CREATE TABLE $tableName (a INT, b STRING)
+               |USING ORC
+               |OPTIONS (
+               |  path '${dir.toURI}',
+               |  orc.bloom.filter.columns '*',
+               |  orc.bloom.filter.fpp 0.1
+               |)
+             """.stripMargin
+          case "hive" =>
+            s"""
+               |CREATE TABLE $tableName (a INT, b STRING)
+               |STORED AS ORC
+               |LOCATION '${dir.toURI}'
+               |TBLPROPERTIES (
+               |  orc.bloom.filter.columns='*',
+               |  orc.bloom.filter.fpp=0.1
+               |)
+             """.stripMargin
+          case impl =>
+            throw new UnsupportedOperationException(s"Unknown ORC implementation: $impl")
+        }
+
+        sql(sqlStatement)
+        sql(s"INSERT INTO $tableName VALUES (1, 'str')")
+
+        val partFiles = dir.listFiles()
+          .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
+        assert(partFiles.length === 1)
+
+        val orcFilePath = new Path(partFiles.head.getAbsolutePath)
+        val readerOptions = OrcFile.readerOptions(new Configuration())
+        val reader = OrcFile.createReader(orcFilePath, readerOptions)
+        var recordReader: RecordReaderImpl = null
+        try {
+          recordReader = reader.rows.asInstanceOf[RecordReaderImpl]
+
+          // BloomFilter array is created for all types; `struct`, int (`a`), string (`b`)
+          val sargColumns = Array(true, true, true)
+          val orcIndex = recordReader.readRowIndex(0, null, sargColumns)
+
+          // Check the types and counts of bloom filters
+          assert(orcIndex.getBloomFilterKinds.forall(_ === bloomFilterKind))
--- End diff --

Do you mean how we extend this test case?
If so, I think it's fine, since what we need to test within Spark is whether the specified bloom filter works or not. It's rather all or none, so one test case should be okay.
[GitHub] spark issue #22439: [SPARK-25444][SQL] Refactor GenArrayData.genCodeToCreate...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22439 **[Test build #96120 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96120/testReport)** for PR 22439 at commit [`24fbf74`](https://github.com/apache/spark/commit/24fbf742fdd8490f57d29325100036e556847c77).
[GitHub] spark issue #22439: [SPARK-25444][SQL] Refactor GenArrayData.genCodeToCreate...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22439 Merged build finished. Test PASSed.
[GitHub] spark issue #22439: [SPARK-25444][SQL] Refactor GenArrayData.genCodeToCreate...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22439 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3146/ Test PASSed.
[GitHub] spark pull request #22439: [SPARK-25444][SQL] Refactor GenArrayData.genCodeT...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/22439

[SPARK-25444][SQL] Refactor GenArrayData.genCodeToCreateArrayData method

## What changes were proposed in this pull request?

This PR simplifies the `GenArrayData.genCodeToCreateArrayData` method by using the `ArrayData.createArrayData` method.

Before this PR, `genCodeToCreateArrayData` was complicated:
* Generated a temporary Java array to create `ArrayData`
* Had separate code generation paths to assign values for `GenericArrayData` and `UnsafeArrayData`

After this PR, the method:
* Directly generates `GenericArrayData` or `UnsafeArrayData` without a temporary array
* Has only one code generation path to assign values

## How was this patch tested?

Existing UTs

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kiszk/spark SPARK-25444

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22439.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22439

commit 24fbf742fdd8490f57d29325100036e556847c77
Author: Kazuaki Ishizaki
Date: 2018-09-17T05:28:00Z
initial commit
[GitHub] spark issue #22408: [SPARK-25417][SQL] ArrayContains function may return inc...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22408 LGTM. I think the last piece is the migration guide, to explain what changed from 2.3 to 2.4.
[GitHub] spark pull request #22418: [SPARK-25427][SQL][TEST] Add BloomFilter creation...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22418#discussion_r217949342

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -50,6 +55,66 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
       .createOrReplaceTempView("orc_temp_table")
   }

+  protected def testBloomFilterCreation(bloomFilterKind: Kind) {
+    val tableName = "bloomFilter"
+
+    withTempDir { dir =>
+      withTable(tableName) {
+        val sqlStatement = orcImp match {
+          case "native" =>
+            s"""
+               |CREATE TABLE $tableName (a INT, b STRING)
+               |USING ORC
+               |OPTIONS (
+               |  path '${dir.toURI}',
+               |  orc.bloom.filter.columns '*',
+               |  orc.bloom.filter.fpp 0.1
+               |)
+             """.stripMargin
+          case "hive" =>
+            s"""
+               |CREATE TABLE $tableName (a INT, b STRING)
+               |STORED AS ORC
+               |LOCATION '${dir.toURI}'
+               |TBLPROPERTIES (
+               |  orc.bloom.filter.columns='*',
+               |  orc.bloom.filter.fpp=0.1
+               |)
+             """.stripMargin
+          case impl =>
+            throw new UnsupportedOperationException(s"Unknown ORC implementation: $impl")
+        }
+
+        sql(sqlStatement)
+        sql(s"INSERT INTO $tableName VALUES (1, 'str')")
+
+        val partFiles = dir.listFiles()
+          .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
+        assert(partFiles.length === 1)
+
+        val orcFilePath = new Path(partFiles.head.getAbsolutePath)
+        val readerOptions = OrcFile.readerOptions(new Configuration())
+        val reader = OrcFile.createReader(orcFilePath, readerOptions)
+        var recordReader: RecordReaderImpl = null
+        try {
+          recordReader = reader.rows.asInstanceOf[RecordReaderImpl]
+
+          // BloomFilter array is created for all types; `struct`, int (`a`), string (`b`)
+          val sargColumns = Array(true, true, true)
+          val orcIndex = recordReader.readRowIndex(0, null, sargColumns)
+
+          // Check the types and counts of bloom filters
+          assert(orcIndex.getBloomFilterKinds.forall(_ === bloomFilterKind))
--- End diff --

How can we extend it in the future?
How can we change the bloom filter kind via the CREATE TABLE statement?
[GitHub] spark issue #22438: [SPARK-25443][INFRA] fix issues when building docs with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22438 **[Test build #96119 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96119/testReport)** for PR 22438 at commit [`dbb4fa2`](https://github.com/apache/spark/commit/dbb4fa2469f89c612e7c8ef966d001e828fc8b91).
[GitHub] spark issue #22438: [SPARK-25443][INFRA] fix issues when building docs with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22438 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3145/ Test PASSed.
[GitHub] spark issue #22438: [SPARK-25443][INFRA] fix issues when building docs with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22438 Merged build finished. Test PASSed.
[GitHub] spark issue #22438: [SPARK-25443][INFRA] fix issues when building docs with ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22438 cc @vanzin @felixcheung @srowen @jerryshao
[GitHub] spark pull request #22438: [SPARK-25443][INFRA] fix issues when building doc...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/22438

[SPARK-25443][INFRA] fix issues when building docs with release scripts in docker

## What changes were proposed in this pull request?

These 2 changes are required to build the docs for Spark 2.4.0 RC1:
1. install `mkdocs` in the docker image
2. set the locale to C.UTF-8; otherwise jekyll fails to build the docs

## How was this patch tested?

Tested manually when doing the 2.4.0 RC1.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark infra

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22438.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22438

commit dbb4fa2469f89c612e7c8ef966d001e828fc8b91
Author: Wenchen Fan
Date: 2018-09-17T04:48:28Z
fix issues when building docs
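The two fixes described in the PR can be sketched roughly as follows (assumed commands for illustration; the actual changes live in the release scripts and the docker image definition):

```shell
# 1) make mkdocs available inside the docker image (assumed install command):
#    pip install mkdocs
# 2) jekyll fails to build the docs without a UTF-8 locale, so force one:
export LC_ALL=C.UTF-8
export LANG=C.UTF-8
echo "locale set to $LC_ALL"
```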
[GitHub] spark issue #22437: [SPARK-25431][SQL][EXAMPLES] Fix function examples and t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22437 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3144/ Test PASSed.
[GitHub] spark issue #22437: [SPARK-25431][SQL][EXAMPLES] Fix function examples and t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22437 Merged build finished. Test PASSed.
[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22428 The performance issue was introduced by repeated query plan analysis, which is resolved in the current master if I am not mistaken - if you're in doubt, I would suggest doing a quick benchmark. I think this is something we should do with a one-liner helper on the application side.
[GitHub] spark issue #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSourceSca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22435 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96115/ Test PASSed.
[GitHub] spark issue #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSourceSca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22435 Merged build finished. Test PASSed.
[GitHub] spark issue #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSourceSca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22435 **[Test build #96115 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96115/testReport)** for PR 22435 at commit [`da86846`](https://github.com/apache/spark/commit/da868465de9ccdd302699786db30fe4fe90e4cfa).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22437: [SPARK-25431][SQL][EXAMPLES] Fix function examples and t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22437 **[Test build #96118 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96118/testReport)** for PR 22437 at commit [`1ae77da`](https://github.com/apache/spark/commit/1ae77dad91a26e1390de070ab677b270c6309065).
[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...
Github user goungoun commented on the issue: https://github.com/apache/spark/pull/22428 @HyukjinKwon, thanks for your review. Actually, that is the reason I opened this pull request. I think it is better to give users a reusable option than to have them repeat so much of the same code in their analysis. In a notebook environment, whenever visualization is required in the middle of an analysis, I had to convert column names rather than use them as-is, so that I could deliver the right messages to the report readers. In the process, I had to repeat withColumnRenamed too many times. So I researched how other users are trying to overcome the limitation. It seems that users tend to use foldLeft or a for loop with withColumnRenamed, which can cause a performance issue by creating too many DataFrames inside the Spark engine, even without them knowing it. The arguments can be found as follows.

StackOverflow
- https://stackoverflow.com/questions/38798567/pyspark-rename-more-than-one-column-using-withcolumnrenamed
- https://stackoverflow.com/questions/35592917/renaming-column-names-of-a-dataframe-in-spark-scala?noredirect=1=1

Spark issues
- [SPARK-12225] Support adding or replacing multiple columns at once in DataFrame API
- [SPARK-21582] DataFrame.withColumnRenamed cause huge performance overhead - if foldLeft is used, too many columns can cause a performance issue
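The foldLeft workaround discussed above has this shape. On a real DataFrame it would fold `withColumnRenamed` over a map of renames; the sketch below applies the same pattern to a plain sequence of column names so it stands alone (all names are illustrative):

```scala
object RenameFold extends App {
  // illustrative renames, in the shape the proposed API would accept
  val renames = Map("c1" -> "first_column", "c2" -> "second_column")

  // stand-in for df.columns; with Spark this pattern would be:
  //   renames.foldLeft(df) { case (d, (from, to)) => d.withColumnRenamed(from, to) }
  val columns = Seq("c1", "c2", "c3")
  val renamed = renames.foldLeft(columns) { case (cols, (from, to)) =>
    cols.map(c => if (c == from) to else c)
  }
  assert(renamed == Seq("first_column", "second_column", "c3"))
  println(renamed.mkString(","))
}
```

Each fold step produces a new intermediate value - on DataFrames, a new query plan per rename - which is exactly the overhead SPARK-21582 describes.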
[GitHub] spark issue #22418: [SPARK-25427][SQL][TEST] Add BloomFilter creation test c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22418 Merged build finished. Test PASSed.
[GitHub] spark issue #22418: [SPARK-25427][SQL][TEST] Add BloomFilter creation test c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22418 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96113/ Test PASSed.
[GitHub] spark issue #22418: [SPARK-25427][SQL][TEST] Add BloomFilter creation test c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22418 **[Test build #96113 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96113/testReport)** for PR 22418 at commit [`a378adb`](https://github.com/apache/spark/commit/a378adb85ef58a603ca4f9d6a7a527c35e0f2db5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22436: [SPARK-24654][BUILD][FOLLOWUP] Update, fix LICENSE and N...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22436 Thanks for the fix! I'm not familiar with this part though, let's ping @vanzin @felixcheung @jerryshao
[GitHub] spark issue #22437: [SPARK-25431][SQL][EXAMPLES] Fix function examples and t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22437 **[Test build #96117 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96117/testReport)** for PR 22437 at commit [`865b09b`](https://github.com/apache/spark/commit/865b09bffc964a6c7411b50abe44bb1bab68f649).
[GitHub] spark issue #22437: [SPARK-25431][SQL][EXAMPLES] Fix function examples and t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22437 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3143/ Test PASSed.
[GitHub] spark issue #22437: [SPARK-25431][SQL][EXAMPLES] Fix function examples and t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22437 Merged build finished. Test PASSed.
[GitHub] spark pull request #22437: [SPARK-25431][SQL][EXAMPLES] Fix function example...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/22437

[SPARK-25431][SQL][EXAMPLES] Fix function examples and the example results.

## What changes were proposed in this pull request?

There are some mistakes in the examples of newly added functions. Also, the format of the example results is not unified. We should fix them.

## How was this patch tested?

Manually executed the examples.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-25431/fix_examples_2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22437.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22437

commit 865b09bffc964a6c7411b50abe44bb1bab68f649
Author: Takuya UESHIN
Date: 2018-09-14T09:19:56Z
Fix function examples and the example results.
[GitHub] spark issue #22437: [SPARK-25431][SQL][EXAMPLES] Fix function examples and t...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22437 cc @dongjoon-hyun @gatorsmile
[GitHub] spark issue #22436: [SPARK-24654][BUILD][FOLLOWUP] Update, fix LICENSE and N...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22436 **[Test build #96116 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96116/testReport)** for PR 22436 at commit [`e4cad8c`](https://github.com/apache/spark/commit/e4cad8c60f3c959af63a900232f14c378cef7928).
[GitHub] spark issue #22395: [SPARK-16323][SQL] Add IntegralDivide expression
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22395 Merged build finished. Test FAILed.
[GitHub] spark issue #22395: [SPARK-16323][SQL] Add IntegralDivide expression
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22395 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96114/ Test FAILed.
[GitHub] spark issue #22436: [SPARK-24654][BUILD][FOLLOWUP] Update, fix LICENSE and N...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22436 Merged build finished. Test PASSed.
[GitHub] spark issue #22395: [SPARK-16323][SQL] Add IntegralDivide expression
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22395 **[Test build #96114 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96114/testReport)** for PR 22395 at commit [`71255a1`](https://github.com/apache/spark/commit/71255a1787012baf2d5188991421e8197ec44733).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22436: [SPARK-24654][BUILD][FOLLOWUP] Update, fix LICENSE and N...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22436 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3142/ Test PASSed.
[GitHub] spark issue #22436: [SPARK-24654][BUILD][FOLLOWUP] Update, fix LICENSE and N...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22436 CC @cloud-fan This one doesn't block 2.4.0 but would be nice to have. Certainly if there's a second RC.
[GitHub] spark pull request #22436: [SPARK-24654][BUILD][FOLLOWUP] Update, fix LICENS...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/22436

[SPARK-24654][BUILD][FOLLOWUP] Update, fix LICENSE and NOTICE, and specialize for source vs binary

## What changes were proposed in this pull request?

Fix the location of licenses-binary in the binary release, and remove binary items from the source release.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-24654.2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22436.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22436

commit e4cad8c60f3c959af63a900232f14c378cef7928
Author: Sean Owen
Date: 2018-09-17T03:42:10Z
Fix location of licenses-binary in binary release, and remove binary items from source release
[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/7 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96112/ Test PASSed.
[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/7 Merged build finished. Test PASSed.
[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/7 **[Test build #96112 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96112/testReport)** for PR 7 at commit [`5c8f487`](https://github.com/apache/spark/commit/5c8f48715748bdeda703761fba6a4d1828a19985).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22433: [SPARK-25442][SQL][K8S] Support STS to run in k8s deploy...
Github user suryag10 commented on the issue: https://github.com/apache/spark/pull/22433

> I'm wondering, is there some reason this isn't supported in cluster mode for yarn & mesos? Or put another way, what is the rationale for k8s being added as an exception to this rule?

I don't know the specific reason why this was not supported in YARN and Mesos. The initial contributions to Spark on K8S started with cluster mode (with a restriction for client mode). So this PR enhances things such that STS can run in K8S deployments in Spark cluster mode. (In the latest Spark code I observed that client mode also works; I need to cross-verify this once.)
[GitHub] spark pull request #22395: [SPARK-16323][SQL] Add IntegralDivide expression
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22395#discussion_r217942351

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala ---
@@ -143,16 +143,14 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper
     }
   }

-  // By fixing SPARK-15776, Divide's inputType is required to be DoubleType of DecimalType.
-  // TODO: in future release, we should add a IntegerDivide to support integral types.
-  ignore("/ (Divide) for integral type") {
-    checkEvaluation(Divide(Literal(1.toByte), Literal(2.toByte)), 0.toByte)
-    checkEvaluation(Divide(Literal(1.toShort), Literal(2.toShort)), 0.toShort)
-    checkEvaluation(Divide(Literal(1), Literal(2)), 0)
-    checkEvaluation(Divide(Literal(1.toLong), Literal(2.toLong)), 0.toLong)
-    checkEvaluation(Divide(positiveShortLit, negativeShortLit), 0.toShort)
-    checkEvaluation(Divide(positiveIntLit, negativeIntLit), 0)
-    checkEvaluation(Divide(positiveLongLit, negativeLongLit), 0L)
+  test("/ (Divide) for integral type") {
+    checkEvaluation(IntegralDivide(Literal(1.toByte), Literal(2.toByte)), 0L)
+    checkEvaluation(IntegralDivide(Literal(1.toShort), Literal(2.toShort)), 0L)
+    checkEvaluation(IntegralDivide(Literal(1), Literal(2)), 0L)
+    checkEvaluation(IntegralDivide(Literal(1.toLong), Literal(2.toLong)), 0L)
+    checkEvaluation(IntegralDivide(positiveShortLit, negativeShortLit), 0L)
+    checkEvaluation(IntegralDivide(positiveIntLit, negativeIntLit), 0L)
+    checkEvaluation(IntegralDivide(positiveLongLit, negativeLongLit), 0L)
--- End diff --

Good catch! We should clearly define the behavior in the doc string too.
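For context on the expectations in the test above: the new `IntegralDivide` evaluates to a Long result even for byte/short/int inputs, and JVM integral division truncates toward zero. A plain-JVM sketch of those semantics (not Spark's expression itself; the helper name is illustrative):

```scala
object IntegralDivideSketch extends App {
  // widen any integral input to Long, then divide;
  // JVM long division truncates toward zero
  def integralDivide(a: Long, b: Long): Long = a / b

  assert(integralDivide(1L, 2L) == 0L)              // 1 div 2 = 0, as in the test
  assert(integralDivide(1.toByte.toLong, 2L) == 0L) // byte inputs still yield a Long
  assert(integralDivide(-7L, 2L) == -3L)            // truncation toward zero, not floor (-4)
  println("ok")
}
```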
[GitHub] spark pull request #22432: [SPARK-22713][CORE][TEST][FOLLOWUP] Fix flaky Ext...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22432
[GitHub] spark issue #22432: [SPARK-22713][CORE][TEST][FOLLOWUP] Fix flaky ExternalAp...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22432 thanks, merging to master/2.4!
[GitHub] spark issue #22231: [SPARK-25238][PYTHON] lint-python: Upgrade pycodestyle t...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22231 Yeah I noticed that. I think we should leave it, and, if somehow RC1 passes, we'll mark this as fixed for a later release.
[GitHub] spark issue #21677: [SPARK-24692][TESTS] Improvement FilterPushdownBenchmark
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21677 > So, are you heading main-method style with separate BM output files? Yes. So it's not reverting this PR, since writing BM results to a file is good. But we should update these BMs to use main-method style.
[GitHub] spark issue #22231: [SPARK-25238][PYTHON] lint-python: Upgrade pycodestyle t...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22231

Note that RC1 was cut before merging this PR, which means this patch is not available in 2.4.0. I hit some problems running the release scripts and spent quite a lot of time fixing them, so the final vote is several days behind the RC1 tag creation. @srowen please advise whether we should:
1. fail RC1 to include this patch
2. do nothing and release it with 2.4.1
3. revert it from 2.4 since it's an upgrade

Thanks!
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/21860 cc @maropu @kiszk @cloud-fan
[GitHub] spark issue #22428: [SPARK-25430][SQL] Add map parameter for withColumnRenam...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22428 Can we simply call the API multiple times? I think we haven't usually added such aliases for an API unless there's a strong argument for it.
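The Map-based variant under discussion is effectively a fold of the existing single-column call. A minimal Python sketch over a toy schema (the names `with_column_renamed`, `with_columns_renamed`, and the list-of-names "schema" are illustrative, not Spark's API):

```python
from functools import reduce

def with_column_renamed(schema, existing, new):
    # Rename one column; a no-op when `existing` is absent,
    # mirroring Dataset.withColumnRenamed's contract.
    return [new if c == existing else c for c in schema]

def with_columns_renamed(schema, mapping):
    # The proposed Map overload is just repeated single renames.
    return reduce(lambda s, kv: with_column_renamed(s, kv[0], kv[1]),
                  mapping.items(), schema)
```

For example, `with_columns_renamed(["c1", "c2", "c3"], {"c1": "first_column", "c2": "second_column"})` returns `["first_column", "second_column", "c3"]`, the same result as two successive single-column calls.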
[GitHub] spark pull request #22428: [SPARK-25430][SQL] Add map parameter for withColu...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22428#discussion_r217937566 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -2300,6 +2300,37 @@ class Dataset[T] private[sql]( } } + /** + * Returns a new Dataset with columns renamed. + * This is a no-op if schema doesn't contain existingNames in columnMap. + * {{{ + * df.withColumnRenamed(Map( + * "c1" -> "first_column", + * "c2" -> "second_column" + * )) + * }}} + * + * @group untypedrel + * @since 2.4.0 --- End diff -- branch-2.4 is cut out. We will probably target 3.0.0 if we happen to add new APIs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18304: [SPARK-21098] Set lineseparator csv multiline and csv wr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18304 CSV's `lineSep` is not added yet. The problem here is specific to CSV - the newline separator is OS-dependent because of Univocity, which is not the case with Jackson, and it can be worked around once CSV's newline option is added. I was working on this feature but faced some problems with handling `multiLine` in CSV. Will make a PR when I'm available. @danielvdende, let's leave this closed for now. Will ping you in the PR I will open later.
[GitHub] spark pull request #22395: [SPARK-16323][SQL] Add IntegralDivide expression
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22395#discussion_r217935398 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala --- @@ -143,16 +143,14 @@ class ArithmeticExpressionSuite extends SparkFunSuite with ExpressionEvalHelper } } - // By fixing SPARK-15776, Divide's inputType is required to be DoubleType of DecimalType. - // TODO: in future release, we should add a IntegerDivide to support integral types. - ignore("/ (Divide) for integral type") { -checkEvaluation(Divide(Literal(1.toByte), Literal(2.toByte)), 0.toByte) -checkEvaluation(Divide(Literal(1.toShort), Literal(2.toShort)), 0.toShort) -checkEvaluation(Divide(Literal(1), Literal(2)), 0) -checkEvaluation(Divide(Literal(1.toLong), Literal(2.toLong)), 0.toLong) -checkEvaluation(Divide(positiveShortLit, negativeShortLit), 0.toShort) -checkEvaluation(Divide(positiveIntLit, negativeIntLit), 0) -checkEvaluation(Divide(positiveLongLit, negativeLongLit), 0L) + test("/ (Divide) for integral type") { +checkEvaluation(IntegralDivide(Literal(1.toByte), Literal(2.toByte)), 0L) +checkEvaluation(IntegralDivide(Literal(1.toShort), Literal(2.toShort)), 0L) +checkEvaluation(IntegralDivide(Literal(1), Literal(2)), 0L) +checkEvaluation(IntegralDivide(Literal(1.toLong), Literal(2.toLong)), 0L) +checkEvaluation(IntegralDivide(positiveShortLit, negativeShortLit), 0L) +checkEvaluation(IntegralDivide(positiveIntLit, negativeIntLit), 0L) +checkEvaluation(IntegralDivide(positiveLongLit, negativeLongLit), 0L) --- End diff -- Could you add a test case for `divide by zero` like `test("/ (Divide) basic")`? For now, this PR seems to follow the behavior of Spark `/` instead of Hive `div`. We had better be clear on our decision and prevent future unintended behavior changes. 
```scala
scala> sql("select 2 / 0, 2 div 0").show()
+---------------------------------------+---------+
|(CAST(2 AS DOUBLE) / CAST(0 AS DOUBLE))|(2 div 0)|
+---------------------------------------+---------+
|                                   null|     null|
+---------------------------------------+---------+
```
```sql
0: jdbc:hive2://ctr-e138-1518143905142-477481> select 2 / 0;
+-------+
|  _c0  |
+-------+
| NULL  |
+-------+
0: jdbc:hive2://ctr-e138-1518143905142-477481> select 2 div 0;
Error: Error while compiling statement: FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments '0': org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public org.apache.hadoop.io.LongWritable org.apache.hadoop.hive.ql.udf.UDFOPLongDivide.evaluate(org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.LongWritable) with arguments {2,0}:/ by zero (state=42000,code=10014)
```
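The null-on-zero behavior being compared above can be sketched in plain Python (an illustration of the semantics, not Spark's implementation; `integral_divide` is a hypothetical helper name):

```python
def integral_divide(a, b):
    # Follow Spark's `/` convention: division by zero yields NULL
    # (None here) instead of raising, unlike Hive's `div`.
    if b == 0:
        return None
    # Truncate toward zero, matching JVM long division (-7 div 2 == -3).
    return int(a / b)
```

Spelling the divide-by-zero case out like this is what the requested test would pin down, so a later change can't silently switch to Hive's throwing behavior.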
[GitHub] spark issue #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSourceSca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22435 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3141/ Test PASSed.
[GitHub] spark issue #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSourceSca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22435 Merged build finished. Test PASSed.
[GitHub] spark issue #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSourceSca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22435 **[Test build #96115 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96115/testReport)** for PR 22435 at commit [`da86846`](https://github.com/apache/spark/commit/da868465de9ccdd302699786db30fe4fe90e4cfa).
[GitHub] spark pull request #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSo...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22435#discussion_r217934875 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala --- @@ -83,4 +83,20 @@ class DataSourceScanExecRedactionSuite extends QueryTest with SharedSQLContext { } } + test("FileSourceScanExec metadata") { +withTempDir { dir => + val basePath = dir.getCanonicalPath + spark.range(0, 10).toDF("a").write.parquet(new Path(basePath, "foo=1").toString) + val df = spark.read.parquet(basePath).filter("a = 1") --- End diff -- Thanks @dongjoon-hyun I fixed it.
[GitHub] spark issue #22395: [SPARK-16323][SQL] Add IntegralDivide expression
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22395 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3140/ Test PASSed.
[GitHub] spark issue #22395: [SPARK-16323][SQL] Add IntegralDivide expression
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22395 Merged build finished. Test PASSed.
[GitHub] spark issue #22395: [SPARK-16323][SQL] Add IntegralDivide expression
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22395 **[Test build #96114 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96114/testReport)** for PR 22395 at commit [`71255a1`](https://github.com/apache/spark/commit/71255a1787012baf2d5188991421e8197ec44733).
[GitHub] spark issue #22395: [SPARK-16323][SQL] Add IntegralDivide expression
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22395 Retest this please
[GitHub] spark issue #22418: [SPARK-25427][SQL][TEST] Add BloomFilter creation test c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22418 **[Test build #96113 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96113/testReport)** for PR 22418 at commit [`a378adb`](https://github.com/apache/spark/commit/a378adb85ef58a603ca4f9d6a7a527c35e0f2db5).
[GitHub] spark issue #22418: [SPARK-25427][SQL][TEST] Add BloomFilter creation test c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22418 Merged build finished. Test PASSed.
[GitHub] spark issue #22418: [SPARK-25427][SQL][TEST] Add BloomFilter creation test c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22418 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3139/ Test PASSed.
[GitHub] spark issue #22418: [SPARK-25427][SQL][TEST] Add BloomFilter creation test c...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22418 Retest this please.
[GitHub] spark pull request #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSo...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22435#discussion_r217933342 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala --- @@ -83,4 +83,20 @@ class DataSourceScanExecRedactionSuite extends QueryTest with SharedSQLContext { } } + test("FileSourceScanExec metadata") { +withTempDir { dir => + val basePath = dir.getCanonicalPath + spark.range(0, 10).toDF("a").write.parquet(new Path(basePath, "foo=1").toString) + val df = spark.read.parquet(basePath).filter("a = 1") --- End diff -- Hi, @wangyum . I know that you follow the style of the other test cases in this suite, but could you simplify it like the following? We had better keep a single test case as simple as possible by excluding irrelevant stuff. ```scala withTempPath { path => val dir = path.getCanonicalPath spark.range(0, 10).toDF("a").write.parquet(dir) val df = spark.read.parquet(dir).filter("a = 1") ```
[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/7 **[Test build #96112 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96112/testReport)** for PR 7 at commit [`5c8f487`](https://github.com/apache/spark/commit/5c8f48715748bdeda703761fba6a4d1828a19985).
[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...
Github user phegstrom commented on a diff in the pull request: https://github.com/apache/spark/pull/7#discussion_r217930978 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -1803,6 +1803,18 @@ test_that("string operators", { collect(select(df4, split_string(df4$a, "")))[1, 1], list(list("a.b@c.d 1", "b")) ) + expect_equal( +collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1], +list(list("a", "b@c.d 1\\b")) + ) + expect_equal( +collect(select(df4, split_string(df4$a, "b", -2)))[1, 1], +list(list("a.", "@c.d 1\\", "")) + ) + expect_equal( +collect(select(df4, split_string(df4$a, "b", 0)))[1, 1], --- End diff -- per @felixcheung's comment, I added back the `limit = 0` case
[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...
Github user phegstrom commented on a diff in the pull request: https://github.com/apache/spark/pull/7#discussion_r217930893 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -1803,6 +1803,10 @@ test_that("string operators", { collect(select(df4, split_string(df4$a, "")))[1, 1], list(list("a.b@c.d 1", "b")) ) + expect_equal( +collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1], +list(list("a", "b@c.d 1\\b")) --- End diff -- added a test for `limit = 0` to catch the behavior-change case
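The three `limit` cases exercised in these R tests follow `java.lang.String.split` semantics, which the underlying SQL function delegates to. A Python sketch of those rules (illustrative only; `java_split` is not part of any Spark API):

```python
import re

def java_split(s, regex, limit=-1):
    # limit > 0:  at most `limit` parts; the last part keeps the remainder.
    # limit < 0:  unlimited parts; trailing empty strings are kept.
    # limit == 0: unlimited parts; trailing empty strings are removed.
    if limit > 0:
        return re.split(regex, s, maxsplit=limit - 1)
    parts = re.split(regex, s)
    if limit == 0:
        while parts and parts[-1] == "":
            parts.pop()
    return parts
```

On the test string `"a.b@c.d 1\\b"` (Python literal), `java_split(s, r"\.", 2)` gives `["a", "b@c.d 1\\b"]`, `java_split(s, "b", -2)` gives `["a.", "@c.d 1\\", ""]`, and `java_split(s, "b", 0)` drops the trailing empty string — matching the three `expect_equal` cases above.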
[GitHub] spark pull request #22429: [SPARK-25440][SQL] Dumping query execution info t...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22429#discussion_r217928631 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala --- @@ -469,7 +470,17 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { def treeString: String = treeString(verbose = true) def treeString(verbose: Boolean, addSuffix: Boolean = false): String = { -generateTreeString(0, Nil, new StringBuilder, verbose = verbose, addSuffix = addSuffix).toString +val baos = new ByteArrayOutputStream() --- End diff -- In this particular method, there is no benefit. This was changed to reuse the method which accepts `OutputStream` instead of `StringBuilder`. The benefit of an `OutputStream` over a `StringBuilder` is no full materialization in memory and no string size limit.
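The trade-off described here — streaming a plan tree to a writer rather than accumulating one giant string — can be sketched in Python (illustrative only; names like `write_tree` are not Spark's):

```python
import io

def write_tree(node, out, depth=0):
    # Stream each rendered line immediately: no full in-memory
    # materialization of the tree string, and no maximum-string-size cap
    # when `out` is backed by a file or socket.
    out.write("  " * depth + node["name"] + "\n")
    for child in node.get("children", []):
        write_tree(child, out, depth + 1)

def tree_string(node):
    # The string-returning variant is just the stream variant
    # pointed at an in-memory buffer.
    buf = io.StringIO()
    write_tree(node, buf)
    return buf.getvalue()
```

This mirrors the refactoring's shape: `treeString` keeps its signature while delegating to the stream-based method, and callers that dump to a file can pass a file-backed stream instead.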
[GitHub] spark pull request #22429: [SPARK-25440][SQL] Dumping query execution info t...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/22429#discussion_r217928428 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala --- @@ -250,5 +254,36 @@ class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) { def codegenToSeq(): Seq[(String, String)] = { org.apache.spark.sql.execution.debug.codegenStringSeq(executedPlan) } + +/** + * Dumps debug information about query execution into the specified file. + */ +def toFile(path: String): Unit = { + val maxFields = SparkEnv.get.conf.getInt(Utils.MAX_TO_STRING_FIELDS, +Utils.DEFAULT_MAX_TO_STRING_FIELDS) + val filePath = new Path(path) + val fs = FileSystem.get(filePath.toUri, sparkSession.sessionState.newHadoopConf()) + val writer = new BufferedWriter(new OutputStreamWriter(fs.create(filePath))) + + try { +SparkEnv.get.conf.set(Utils.MAX_TO_STRING_FIELDS, Int.MaxValue.toString) +writer.write("== Parsed Logical Plan ==\n") --- End diff -- Can we combine this entire block with what is done in the `toString()` method?
[GitHub] spark pull request #22429: [SPARK-25440][SQL] Dumping query execution info t...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/22429#discussion_r217928334 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala --- @@ -250,5 +254,36 @@ class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) { def codegenToSeq(): Seq[(String, String)] = { org.apache.spark.sql.execution.debug.codegenStringSeq(executedPlan) } + +/** + * Dumps debug information about query execution into the specified file. + */ +def toFile(path: String): Unit = { + val maxFields = SparkEnv.get.conf.getInt(Utils.MAX_TO_STRING_FIELDS, +Utils.DEFAULT_MAX_TO_STRING_FIELDS) + val filePath = new Path(path) + val fs = FileSystem.get(filePath.toUri, sparkSession.sessionState.newHadoopConf()) + val writer = new BufferedWriter(new OutputStreamWriter(fs.create(filePath))) + + try { +SparkEnv.get.conf.set(Utils.MAX_TO_STRING_FIELDS, Int.MaxValue.toString) --- End diff -- It is generally a bad idea to change this conf as people expect that it is immutable. Also this change has some far reaching consequences, others will now also be exposed to a different `Utils.MAX_TO_STRING_FIELDS` value when calling `explain()`. Can you please just pass the parameter down the tree?
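The suggestion — thread the limit through as an argument instead of mutating shared configuration — can be illustrated with a small Python sketch (hypothetical names, not Spark code):

```python
def truncated_string(fields, max_fields):
    # `max_fields` arrives as a plain parameter, so different callers
    # (an explain() with the default limit, a toFile() with a huge one)
    # can coexist without touching any process-global setting.
    if len(fields) <= max_fields:
        return ", ".join(fields)
    omitted = len(fields) - max_fields
    return ", ".join(fields[:max_fields]) + f", ... {omitted} more fields"
```

Because the limit is an argument, concurrent callers cannot observe each other's value — which is exactly the hazard of temporarily rewriting a shared conf entry.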
[GitHub] spark pull request #22429: [SPARK-25440][SQL] Dumping query execution info t...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/22429#discussion_r217928262 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala --- @@ -469,7 +470,17 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { def treeString: String = treeString(verbose = true) def treeString(verbose: Boolean, addSuffix: Boolean = false): String = { -generateTreeString(0, Nil, new StringBuilder, verbose = verbose, addSuffix = addSuffix).toString +val baos = new ByteArrayOutputStream() --- End diff -- What is the benefit of using this instead of using a `java.io.StringWriter` or `org.apache.commons.io.output.StringBuilderWriter`?
[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217926918 --- Diff: docs/sql-programming-guide.md --- @@ -1897,7 +1897,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see - In version 2.3 and earlier, CSV rows are considered as malformed if at least one column value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered as malformed only when it contains malformed column values requested from CSV datasource, other values can be ignored. As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`. - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`. - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation. - - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string. + - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string. + - Since Spark 2.4, The LOAD DATA command supports wildcard characters ? and *, which match any one character, and zero or more characters, respectively. Example: LOAD DATA INPATH '/tmp/folder*/ or LOAD DATA INPATH /tmp/part-?. Special Characters like spaces also now work in paths. Example: LOAD DATA INPATH /tmp/folder name/. --- End diff -- The commands and paths should be back-tick-quoted for readability. I think they may be interpreted as markdown otherwise.
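The `?`/`*` semantics the doc change describes match classic glob matching; a quick illustration with the Python standard library (this shows the matching rules only, not how Spark's LOAD DATA resolves paths internally):

```python
from fnmatch import fnmatchcase

# `?` matches exactly one character; `*` matches zero or more characters.
assert fnmatchcase("part-7", "part-?")
assert not fnmatchcase("part-10", "part-?")  # two characters after the dash
assert fnmatchcase("folder123", "folder*")
assert fnmatchcase("folder", "folder*")      # `*` may match the empty string
```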
[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22429 Merged build finished. Test PASSed.
[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22429 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96109/ Test PASSed.
[GitHub] spark issue #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSourceSca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22435 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96111/ Test PASSed.
[GitHub] spark issue #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSourceSca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22435 Merged build finished. Test PASSed.
[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22429 **[Test build #96109 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96109/testReport)** for PR 22429 at commit [`ce2c086`](https://github.com/apache/spark/commit/ce2c08688bb8b51e97f686c95279a5f42b52116a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSourceSca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22435 **[Test build #96111 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96111/testReport)** for PR 22435 at commit [`830e188`](https://github.com/apache/spark/commit/830e1881b4ef4d9bb661d8b6635470e2596d4eaa). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22393: [MINOR][DOCS] Axe deprecated doc refs
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22393
[GitHub] spark issue #22393: [MINOR][DOCS] Axe deprecated doc refs
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22393 thx. merged to master/2.4
[GitHub] spark issue #22432: [SPARK-22713][CORE][TEST][FOLLOWUP] Fix flaky ExternalAp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22432 Merged build finished. Test PASSed.
[GitHub] spark issue #22432: [SPARK-22713][CORE][TEST][FOLLOWUP] Fix flaky ExternalAp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22432 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96108/ Test PASSed.
[GitHub] spark issue #22432: [SPARK-22713][CORE][TEST][FOLLOWUP] Fix flaky ExternalAp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22432 **[Test build #96108 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96108/testReport)** for PR 22432 at commit [`04c3f7b`](https://github.com/apache/spark/commit/04c3f7b3c2a1b6a79d571ca2079ca6cc477027a7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22396 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96110/ Test PASSed.
[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22396 Merged build finished. Test PASSed.
[GitHub] spark issue #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22396 **[Test build #96110 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96110/testReport)** for PR 22396 at commit [`b34b962`](https://github.com/apache/spark/commit/b34b96208dc86e9642dbc65e33a643df7b7ee406). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22343 Thank YOU for your PR and open discussion on this, @seancxmao. Let's continue in other PRs.
[GitHub] spark issue #21677: [SPARK-24692][TESTS] Improvement FilterPushdownBenchmark
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/21677 Yes, @cloud-fan. We can extend that concept to all the other main-method style benchmarks. Previously, we manually copied the result to the nearest place next to the corresponding BM code, which was not easy to automate. With @wangyum's contribution, we can automate all benchmarks. Possibly, we can use that in the release process, too. So, are you heading toward `main-method` style with separate BM output files? For me, +1.
[GitHub] spark pull request #22427: [SPARK-25438][SQL][TEST] Fix FilterPushdownBenchm...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22427#discussion_r217923482 --- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt --- @@ -2,737 +2,669 @@
Pushdown for many distinct value case
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
-Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz

Select 0 string row (value IS NULL):       Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
-Parquet Vectorized                              8970 / 9122         1.8         570.3       1.0X
-Parquet Vectorized (Pushdown)                    471 / 491          33.4          30.0      19.0X
-Native ORC Vectorized                           7661 / 7853         2.1         487.0       1.2X
-Native ORC Vectorized (Pushdown)                1134 / 1161        13.9          72.1       7.9X
+Parquet Vectorized                             11405 / 11485        1.4         725.1       1.0X
+Parquet Vectorized (Pushdown)                    675 / 690          23.3          42.9      16.9X
+Native ORC Vectorized                           7127 / 7170         2.2         453.1       1.6X
+Native ORC Vectorized (Pushdown)                 519 / 541          30.3          33.0      22.0X

Select 0 string row ('7864320' < value < '7864320'): Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
-Parquet Vectorized                              9246 / 9297         1.7         587.8       1.0X
-Parquet Vectorized (Pushdown)                    480 / 488          32.8          30.5      19.3X
-Native ORC Vectorized                           7838 / 7850         2.0         498.3       1.2X
-Native ORC Vectorized (Pushdown)                1054 / 1118        14.9          67.0       8.8X
+Parquet Vectorized                             11457 / 11473        1.4         728.4       1.0X
+Parquet Vectorized (Pushdown)                    656 / 686          24.0          41.7      17.5X
+Native ORC Vectorized                           7328 / 7342         2.1         465.9       1.6X
+Native ORC Vectorized (Pushdown)                 539 / 565          29.2          34.2      21.3X

Select 1 string row (value = '7864320'):   Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
-Parquet Vectorized                              8989 / 9100         1.7         571.5       1.0X
-Parquet Vectorized (Pushdown)                    448 / 467          35.1          28.5      20.1X
-Native ORC Vectorized                           7680 / 7768         2.0         488.3       1.2X
-Native ORC Vectorized (Pushdown)                1067 / 1118        14.7          67.8       8.4X
+Parquet Vectorized                             11878 / 11888        1.3         755.2       1.0X
+Parquet Vectorized (Pushdown)                    630 / 654          25.0          40.1      18.9X
+Native ORC Vectorized                           7342 / 7362         2.1         466.8       1.6X
+Native ORC Vectorized (Pushdown)                 519 / 537          30.3          33.0      22.9X

Select 1 string row (value <=> '7864320'): Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
-Parquet Vectorized                              9115 / 9266         1.7         579.5       1.0X
-Parquet Vectorized (Pushdown)                    466 / 492          33.7          29.7      19.5X
-Native ORC Vectorized                           7800 / 7914         2.0         495.9
[GitHub] spark issue #22433: [SPARK-25442][SQL][K8S] Support STS to run in k8s deploy...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22433 > As this script is common start point for all the resource managers(k8s/yarn/mesos/standalone/local), i guess changing this to fit for all the cases has a value add, instead of doing at each resource manager level. Thoughts? Please note that I am specifically referring only to the need for changing the application `name`. The rationale given, that `name` should be DNS compliant, is a restriction specific to k8s and not Spark. Instead of doing one-off renames, the right approach would be to handle this name translation so that it benefits not just STS, but any user application. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22433: [SPARK-25442][SQL][K8S] Support STS to run in k8s deploy...
Github user jacobdr commented on the issue: https://github.com/apache/spark/pull/22433 > a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.' Your changes to the name handling don't comply with this, so I agree with @mridulm: you should move this change elsewhere and more broadly support name validation/sanitization for submitted applications in Kubernetes.
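The generic name translation suggested in the two comments above could look something like the sketch below: lowercase the application name, replace every character outside the DNS-1123 allowed set with '-', and trim to a valid length. This is a hypothetical helper, not code from the PR or from Spark itself:

```python
import re

def sanitize_app_name(name, max_len=253):
    """Translate an arbitrary Spark application name into a DNS-1123
    subdomain: lowercase alphanumerics, '-' or '.', at most max_len
    characters, starting and ending with an alphanumeric."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9.-]", "-", name)  # allowed: a-z, 0-9, '-', '.'
    name = name[:max_len]
    name = name.strip("-.")                   # must start/end alphanumeric
    return name or "spark-app"                # fall back if nothing survives

# The Spark Thrift Server's default app name contains spaces and slashes:
print(sanitize_app_name("Thrift JDBC/ODBC Server"))  # → thrift-jdbc-odbc-server
```

Doing the translation once in a shared place, as proposed, would make any user-supplied application name safe for k8s resource naming rather than special-casing STS.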
[GitHub] spark issue #22434: [SPARK-24685][BUILD][FOLLOWUP] Fix the nonexist profile ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22434 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96107/ Test PASSed.
[GitHub] spark issue #22434: [SPARK-24685][BUILD][FOLLOWUP] Fix the nonexist profile ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22434 Merged build finished. Test PASSed.
[GitHub] spark issue #22434: [SPARK-24685][BUILD][FOLLOWUP] Fix the nonexist profile ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22434 **[Test build #96107 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96107/testReport)** for PR 22434 at commit [`18a9135`](https://github.com/apache/spark/commit/18a91354abdf793a569a84046f3bf2016b2ccd03). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217920696 --- Diff: docs/sql-programming-guide.md --- @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`. - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation. - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string. + - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv') --- End diff -- @gatorsmile I just used a common encoding (%20) in our example.
[GitHub] spark pull request #22396: [SPARK-23425][SQL][FOLLOWUP] Support wildcards in...
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/22396#discussion_r217920417 --- Diff: docs/sql-programming-guide.md --- @@ -1898,6 +1898,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`. - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation. - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files. For example, the row of `"a", null, "", 1` was writted as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to empty (not quoted) string. + - Since Spark 2.4 load command from local filesystem supports wildcards in the folder level paths(e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/).Also in Older versions space in folder/file names has been represented using '%20'(e.g. LOAD DATA INPATH 'tmp/folderName/myFile%20Name.csv), this usage will not be supported from spark 2.4 version. Since Spark 2.4, Spark supports normal space character in folder/file names (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/file Name.csv') and wildcard character '?' can be used. (e.g. LOAD DATA INPATH 'hdfs://tmp/folderName/fileName?.csv') --- End diff -- @srowen Sorry Sean, I missed your suggested text; I have updated the message based on your suggestions. Actually, I got a bit confused because this PR is a combination of a bug fix and an improvement :)
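The wildcard semantics described in the migration note above ('*' matching any run of characters at the folder level, '?' matching exactly one character) behave like ordinary filesystem globbing. A minimal illustration with Python's `fnmatch`, using hypothetical file names rather than anything from the PR:

```python
from fnmatch import fnmatch

paths = [
    "tmp/folder1/fileName1.csv",
    "tmp/folder2/fileName2.csv",
    "tmp/other/file Name.csv",  # spaces in names are now matched literally
]

# '*' matches any run of characters at the folder level ...
print([p for p in paths if fnmatch(p, "tmp/folder*/*.csv")])

# ... while '?' matches exactly one character in a file name.
print([p for p in paths if fnmatch(p, "tmp/*/fileName?.csv")])

# A literal space in a path now matches as-is, with no %20 encoding.
print(fnmatch("tmp/folderName/file Name.csv", "tmp/*/file Name.csv"))
```

Note that `fnmatch` is a simplification: unlike path-aware globbing, its '*' can also cross '/' boundaries, so this only sketches the matching behavior, not Spark's actual path resolution.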
[GitHub] spark issue #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSourceSca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22435 **[Test build #96111 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96111/testReport)** for PR 22435 at commit [`830e188`](https://github.com/apache/spark/commit/830e1881b4ef4d9bb661d8b6635470e2596d4eaa).
[GitHub] spark issue #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSourceSca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22435 Merged build finished. Test PASSed.