[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20626 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/930/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20625 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/931/ Test PASSed.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20626 **[Test build #87503 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87503/testReport)** for PR 20626 at commit [`68edf0f`](https://github.com/apache/spark/commit/68edf0f3463daed3bb7042becb333788b22b23b0).
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user Fokko commented on the issue: https://github.com/apache/spark/pull/20057 Hi @gatorsmile, thanks for putting it to the test. The main reasons why I personally dislike Sqoop are:

- **Legacy.** The old MapReduce should be buried in the coming years. As a data engineering consultant, I see more people questioning the whole Hadoop stack. With Sqoop you still need to run MapReduce tasks, and this isn't easy on other platforms like Kubernetes.
- **Stability.** I see Sqoop jobs fail quite often, and there isn't a nice way of retrying them atomically. For example, with a Sqoop job on Airflow, we cannot simply retry the operation. When we import data from an RDBMS to HDFS, we have to make sure that the target directory of the previous run has been deleted. This is also where spark-jdbc comes in; for example, in the future we would like to delete single partitions, but this is WIP.

Maybe @danielvdende can elaborate a bit on their use case.
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user danielvdende commented on the issue: https://github.com/apache/spark/pull/20057 Hmm, now it fails the OrcQuerySuite. This PR doesn't touch any of the Orc implementation in Spark. Could this be a flaky test @gatorsmile ?

```
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 12 times over 10.15875468798 seconds. Last failure message: There are 1 possibly leaked file streams..
```
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20625 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/932/ Test PASSed.
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20625 Merged build finished. Test PASSed.
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20625 **[Test build #87504 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87504/testReport)** for PR 20625 at commit [`4e5708c`](https://github.com/apache/spark/commit/4e5708ca01f048f2408ded0b039ae724b806977c).
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20625 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87502/ Test FAILed.
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20625 **[Test build #87502 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87502/testReport)** for PR 20625 at commit [`c79c6df`](https://github.com/apache/spark/commit/c79c6df7284b9717fe4e4c26090dcb51bf7712da).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20626 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87503/ Test FAILed.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20626 **[Test build #87503 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87503/testReport)** for PR 20626 at commit [`68edf0f`](https://github.com/apache/spark/commit/68edf0f3463daed3bb7042becb333788b22b23b0).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20625 Merged build finished. Test FAILed.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20626 Merged build finished. Test FAILed.
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user danielvdende commented on the issue: https://github.com/apache/spark/pull/20057 Hi guys, @Fokko @gatorsmile, completely agree with what @Fokko mentioned; our main reason for wanting to move away from Sqoop is also stability, and to get rid of MapReduce in preparation for our move to Kubernetes (or something similar). We've also seen it be much faster than Sqoop. As for why we need the feature in this PR: we have some tables in PostgreSQL with foreign keys linking them. We have also specified a schema for these tables. If we use the drop-and-recreate option, Spark will determine the schema, overriding our PostgreSQL schema. Obviously these should match up, but I personally don't like that Spark can do this (and that you can't explicitly tell it not to). Because of this behaviour, we currently need 2 tasks in Airflow (as @Fokko mentioned) to ensure the tables are truncated but the schema stays in place. This PR would enable us to specify, in a single idempotent (Airflow) task, that we want to truncate the table before putting new data in it. The cascade enables us to avoid breaking foreign key relations and causing errors. To be clear, this therefore isn't emulating a Sqoop feature (as a Sqoop task isn't idempotent), but is in fact improving on what Sqoop offers.
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20625 **[Test build #87504 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87504/testReport)** for PR 20625 at commit [`4e5708c`](https://github.com/apache/spark/commit/4e5708ca01f048f2408ded0b039ae724b806977c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20625 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87504/ Test PASSed.
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20625 Merged build finished. Test PASSed.
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20621#discussion_r168697351

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,29 @@ object PartitioningUtils {
       Literal(bigDecimal)
     }

+    val dateTry = Try {
+      // try and parse the date, if no exception occurs this is a candidate to be resolved as
+      // DateType
+      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
--- End diff --

Actually, all of `DateFormat`'s `parse` methods allow extra characters after a valid date: https://docs.oracle.com/javase/7/docs/api/java/text/DateFormat.html#parse(java.lang.String)
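The lenient `DateFormat.parse` behaviour described in the comment above is easy to reproduce outside Spark. A minimal standalone Java sketch (class name is illustrative, not from the PR), contrasting the lenient legacy API with the strict `java.time.LocalDate.parse` for comparison:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

public class LenientParseDemo {
    public static void main(String[] args) throws ParseException {
        // DateFormat.parse(String) only consumes as much of the input as it
        // needs, so trailing garbage after a valid date does not fail.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(fmt.format(fmt.parse("2018-02-16-this-is-not-a-date")));
        // -> 2018-02-16 (parse succeeded despite the trailing text)

        // java.time.LocalDate.parse, by contrast, rejects unconsumed input.
        boolean strictRejected = false;
        try {
            LocalDate.parse("2018-02-16-this-is-not-a-date");
        } catch (DateTimeParseException e) {
            strictRejected = true;
        }
        System.out.println(strictRejected); // true
    }
}
```

This is why the PR adds a second validation step (the `Cast` check) instead of relying on the format parse alone.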
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20621#discussion_r168697699

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,29 @@ object PartitioningUtils {
       Literal(bigDecimal)
     }

+    val dateTry = Try {
+      // try and parse the date, if no exception occurs this is a candidate to be resolved as
+      // DateType
+      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
+      // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
+      // We need to check that we can cast the raw string since we later can use Cast to get
+      // the partition values with the right DataType (see
+      // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
+      val dateOption = Option(Cast(Literal(raw), DateType).eval())
--- End diff --

Sure, aren't these comments enough? Could you please suggest how you would like them improved, i.e. what is missing or unclear? Thanks.
[GitHub] spark issue #20568: [SPARK-23381][CORE] Murmur3 hash generates a different v...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20568 retest this please.
[GitHub] spark pull request #20568: [SPARK-23381][CORE] Murmur3 hash generates a diff...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20568#discussion_r168698344

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
@@ -218,4 +221,32 @@ object FeatureHasher extends DefaultParamsReadable[FeatureHasher] {
   @Since("2.3.0")
   override def load(path: String): FeatureHasher = super.load(path)
+
+  private val seed = OldHashingTF.seed
+
+  /**
+   * Calculate a hash code value for the term object using
+   * Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32).
+   * This is the default hash algorithm used from Spark 2.0 onwards.
+   * Use hashUnsafeBytes2 to match the original algorithm with the value.
+   * See SPARK-23381.
+   */
+  @Since("2.3.0")
+  def murmur3Hash(term: Any): Int = {
--- End diff --

Maybe `private[feature]`?
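For readers unfamiliar with the algorithm the Javadoc names, here is a standalone sketch of MurmurHash3_x86_32 (Austin Appleby's algorithm). This is NOT Spark's own implementation (that lives in `org.apache.spark.unsafe.hash.Murmur3_x86_32`); the class name is illustrative. It also shows why SPARK-23381 matters: the result depends entirely on which byte representation of a term you feed in.

```java
import java.nio.charset.StandardCharsets;

public class Murmur3Demo {
    // Standalone sketch of MurmurHash3_x86_32; not Spark's implementation.
    static int murmur3_x86_32(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51, c2 = 0x1b873593;
        int h = seed;
        int len = data.length;
        // Body: process 4-byte little-endian blocks.
        int nblocks = len / 4;
        for (int i = 0; i < nblocks; i++) {
            int k = (data[4 * i] & 0xff)
                  | ((data[4 * i + 1] & 0xff) << 8)
                  | ((data[4 * i + 2] & 0xff) << 16)
                  | ((data[4 * i + 3] & 0xff) << 24);
            k *= c1;
            k = Integer.rotateLeft(k, 15);
            k *= c2;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }
        // Tail: mix in the remaining 1-3 bytes, if any.
        int k = 0;
        int base = nblocks * 4;
        switch (len & 3) {
            case 3: k ^= (data[base + 2] & 0xff) << 16; // fall through
            case 2: k ^= (data[base + 1] & 0xff) << 8;  // fall through
            case 1: k ^= (data[base] & 0xff);
                    k *= c1;
                    k = Integer.rotateLeft(k, 15);
                    k *= c2;
                    h ^= k;
        }
        // Finalization: avalanche the bits.
        h ^= len;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        // Known test vector: empty input with seed 0 hashes to 0.
        System.out.println(murmur3_x86_32(new byte[0], 0)); // 0
        // The hash depends on the byte encoding of the term, which is the
        // crux of SPARK-23381 (UTF-8 bytes vs. Spark's internal string bytes).
        byte[] utf8 = "spark".getBytes(StandardCharsets.UTF_8);
        System.out.println(murmur3_x86_32(utf8, 42));
    }
}
```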
[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20464 @felixcheung Thanks!
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/20626 jenkins retest this please
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20626 Merged build finished. Test PASSed.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/20626 So I was able to find quite a few cases where the `DUMMY` placeholder caught uses of the `value` field outside of appropriate null-checked regions. I'll check the individual cases and then update this PR.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20626 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/933/ Test PASSed.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20626 **[Test build #87505 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87505/testReport)** for PR 20626 at commit [`68edf0f`](https://github.com/apache/spark/commit/68edf0f3463daed3bb7042becb333788b22b23b0).
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20621#discussion_r168700662

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -407,6 +407,29 @@ object PartitioningUtils {
       Literal(bigDecimal)
     }

+    val dateTry = Try {
+      // try and parse the date, if no exception occurs this is a candidate to be resolved as
+      // DateType
+      DateTimeUtils.getThreadLocalDateFormat.parse(raw)
+      // SPARK-23436: Casting the string to date may still return null if a bad Date is provided.
+      // We need to check that we can cast the raw string since we later can use Cast to get
+      // the partition values with the right DataType (see
+      // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning)
+      val dateOption = Option(Cast(Literal(raw), DateType).eval())
--- End diff --

I mean... simply something like:

```
// Disallow date type if the cast returned null blah blah
require(dateOption.isDefined)
```

nothing special. I am fine with not adding it, too.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/20626 cc @cloud-fan @hvanhovell Note: this is for master and branch-2.3 post 2.3.0 release.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20626 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/934/ Test PASSed.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20626 Merged build finished. Test PASSed.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20626 **[Test build #87506 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87506/testReport)** for PR 20626 at commit [`d709e24`](https://github.com/apache/spark/commit/d709e246d99c0d821238afda1b203b9880eb1ed1).
[GitHub] spark pull request #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure par...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20627

[SPARK-23217][ML][PYTHON] Add distanceMeasure param to ClusteringEvaluator Python API

## What changes were proposed in this pull request?

The PR adds the `distanceMeasure` param to ClusteringEvaluator in the Python API. This allows the user to specify `cosine` as the distance measure in addition to the default `squaredEuclidean`.

## How was this patch tested?

Added UTs.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-23217_python

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20627.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20627

commit 8fe8efaaf0202f804e80b36ec11b43d5aa34d511
Author: Marco Gaido
Date: 2018-02-16T09:24:45Z

    [SPARK-23217][ML][PYTHON] Add distanceMeasure param to ClusteringEvaluator Python API
[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20627 **[Test build #87507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87507/testReport)** for PR 20627 at commit [`8fe8efa`](https://github.com/apache/spark/commit/8fe8efaaf0202f804e80b36ec11b43d5aa34d511).
[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20627 cc @srowen @BryanCutler
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/20626 Ah...I see, there are more places where they're statically referencing some variable but dynamically those variables would always be null. I'll update the PR later to fix those places as well.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20626 **[Test build #87505 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87505/testReport)** for PR 20626 at commit [`68edf0f`](https://github.com/apache/spark/commit/68edf0f3463daed3bb7042becb333788b22b23b0).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20626 Merged build finished. Test FAILed.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20626 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87505/ Test FAILed.
[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20627 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/935/ Test PASSed.
[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20627 Merged build finished. Test PASSed.
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20621 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/936/ Test PASSed.
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20621 **[Test build #87508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87508/testReport)** for PR 20621 at commit [`6274537`](https://github.com/apache/spark/commit/6274537139b2282ac5f9ded605037f63c7bee2f9).
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20621 Merged build finished. Test PASSed.
[GitHub] spark pull request #20625: [SPARK-23446][PYTHON] Explicitly check supported ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20625#discussion_r168709505

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -2000,10 +2001,12 @@ def toPandas(self):
                 return _check_dataframe_localize_timestamps(pdf, timezone)
             else:
                 return pd.DataFrame.from_records([], columns=self.columns)
-        except ImportError as e:
-            msg = "note: pyarrow must be installed and available on calling Python process " \
-                  "if using spark.sql.execution.arrow.enabled=true"
-            raise ImportError("%s\n%s" % (_exception_message(e), msg))
+        except Exception as e:
+            msg = (
+                "Note: toPandas attempted Arrow optimization because "
+                "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
+                "to disable this.")
--- End diff --

Hmm, this says why it's trying Arrow and how to turn it off, but it doesn't say why I have to turn it off. Perhaps say something like "pyarrow is not found" (if that is the cause and we know it)?
[GitHub] spark pull request #20619: [SPARK-23390][SQL] Register task completion liste...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20619#discussion_r168709918

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -395,16 +395,19 @@ class ParquetFileFormat
       ParquetInputFormat.setFilterPredicate(hadoopAttemptContext.getConfiguration, pushed.get)
     }
     val taskContext = Option(TaskContext.get())
-    val parquetReader = if (enableVectorizedReader) {
+    val iter = if (enableVectorizedReader) {
       val vectorizedReader = new VectorizedParquetRecordReader(
         convertTz.orNull, enableOffHeapColumnVector && taskContext.isDefined, capacity)
+      val recordReaderIterator = new RecordReaderIterator(vectorizedReader)
+      // Register a task completion lister before `initalization`.
--- End diff --

Could `new VectorizedParquetRecordReader` or `new RecordReaderIterator` fail?
[GitHub] spark issue #20568: [SPARK-23381][CORE] Murmur3 hash generates a different v...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20568 Jenkins, retest this please.
[GitHub] spark issue #20568: [SPARK-23381][CORE] Murmur3 hash generates a different v...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20568 **[Test build #87509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87509/testReport)** for PR 20568 at commit [`c20cd97`](https://github.com/apache/spark/commit/c20cd97d7ce5690993b4490bb7cca955e7703d90).
[GitHub] spark pull request #20619: [SPARK-23390][SQL] Register task completion liste...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20619#discussion_r168711619

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -395,16 +395,19 @@ class ParquetFileFormat
       ParquetInputFormat.setFilterPredicate(hadoopAttemptContext.getConfiguration, pushed.get)
     }
     val taskContext = Option(TaskContext.get())
-    val parquetReader = if (enableVectorizedReader) {
+    val iter = if (enableVectorizedReader) {
       val vectorizedReader = new VectorizedParquetRecordReader(
         convertTz.orNull, enableOffHeapColumnVector && taskContext.isDefined, capacity)
+      val recordReaderIterator = new RecordReaderIterator(vectorizedReader)
+      // Register a task completion lister before `initalization`.
--- End diff --

Those constructors didn't look heavy to me.
[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20627 **[Test build #87507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87507/testReport)** for PR 20627 at commit [`8fe8efa`](https://github.com/apache/spark/commit/8fe8efaaf0202f804e80b36ec11b43d5aa34d511).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20627 Merged build finished. Test PASSed.
[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20627 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87507/ Test PASSed.
[GitHub] spark pull request #20619: [SPARK-23390][SQL] Register task completion liste...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20619#discussion_r168714722 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -395,16 +395,19 @@ class ParquetFileFormat ParquetInputFormat.setFilterPredicate(hadoopAttemptContext.getConfiguration, pushed.get) } val taskContext = Option(TaskContext.get()) - val parquetReader = if (enableVectorizedReader) { + val iter = if (enableVectorizedReader) { val vectorizedReader = new VectorizedParquetRecordReader( convertTz.orNull, enableOffHeapColumnVector && taskContext.isDefined, capacity) +val recordReaderIterator = new RecordReaderIterator(vectorizedReader) +// Register a task completion lister before `initalization`. --- End diff -- ok
[GitHub] spark pull request #20625: [SPARK-23446][PYTHON] Explicitly check supported ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20625#discussion_r168718112 --- Diff: python/pyspark/sql/dataframe.py --- @@ -2000,10 +2001,12 @@ def toPandas(self): return _check_dataframe_localize_timestamps(pdf, timezone) else: return pd.DataFrame.from_records([], columns=self.columns) -except ImportError as e: -msg = "note: pyarrow must be installed and available on calling Python process " \ - "if using spark.sql.execution.arrow.enabled=true" -raise ImportError("%s\n%s" % (_exception_message(e), msg)) +except Exception as e: +msg = ( +"Note: toPandas attempted Arrow optimization because " +"'spark.sql.execution.arrow.enabled' is set to true. Please set it to false " +"to disable this.") --- End diff -- Oh, that should be part of the original message. For example, I don't have PyArrow in `pypy` in my local. It shows the error like: ``` RuntimeError: PyArrow >= 0.8.0 must be installed; however, it was not found. Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this. ```
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20626 **[Test build #87506 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87506/testReport)** for PR 20626 at commit [`d709e24`](https://github.com/apache/spark/commit/d709e246d99c0d821238afda1b203b9880eb1ed1). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20626 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87506/ Test FAILed.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20626 Merged build finished. Test FAILed.
[GitHub] spark issue #20626: [SPARK-23447][SQL] Cleanup codegen template for Literal
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/20626 You are going to need to 'type' null values for this to work; I think casting would be enough.
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20621 **[Test build #87508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87508/testReport)** for PR 20621 at commit [`6274537`](https://github.com/apache/spark/commit/6274537139b2282ac5f9ded605037f63c7bee2f9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20621 Merged build finished. Test PASSed.
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20621 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87508/ Test PASSed.
[GitHub] spark issue #20568: [SPARK-23381][CORE] Murmur3 hash generates a different v...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20568 **[Test build #87509 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87509/testReport)** for PR 20568 at commit [`c20cd97`](https://github.com/apache/spark/commit/c20cd97d7ce5690993b4490bb7cca955e7703d90). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20568: [SPARK-23381][CORE] Murmur3 hash generates a different v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20568 Merged build finished. Test FAILed.
[GitHub] spark issue #20568: [SPARK-23381][CORE] Murmur3 hash generates a different v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20568 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87509/ Test FAILed.
[GitHub] spark issue #20568: [SPARK-23381][CORE] Murmur3 hash generates a different v...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20568 Jenkins, retest this please.
[GitHub] spark pull request #20628: Preserve extraJavaOptions ordering
GitHub user andrusha opened a pull request: https://github.com/apache/spark/pull/20628 Preserve extraJavaOptions ordering For some JVM options, like `-XX:+UnlockExperimentalVMOptions` ordering is necessary. ## What changes were proposed in this pull request? Keep original extraJavaOptions ordering, when passing them through environment variables inside the Docker container. ## How was this patch tested? Ran base branch a couple of times and checked startup command in logs. Ordering differed every time. Added sorting, ordering was consistent to what user had in `extraJavaOptions`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/andrusha/spark patch-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20628.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20628 commit 6759e9e9f9075427b87fe5071e803c60d7521629 Author: Andrew Korzhuev Date: 2018-02-16T14:24:48Z Preserve extraJavaOptions ordering For some JVM options, like `-XX:+UnlockExperimentalVMOptions` ordering is necessary.
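The ordering problem the PR describes can be illustrated with a short Python sketch (hypothetical variable names; the real implementation lives in the K8s shell entrypoint): export each JVM option as a numbered environment variable, then reassemble by sorting on the numeric suffix so the user's original order survives.

```python
# Hypothetical illustration of preserving JVM option order across an
# environment-variable round trip. Options like
# -XX:+UnlockExperimentalVMOptions must precede the options they unlock.

opts = ["-XX:+UnlockExperimentalVMOptions", "-XX:+UseCGroupMemoryLimitForHeap"]

# Export phase: one env var per option, suffixed with its position.
env = {"SPARK_JAVA_OPT_%d" % i: v for i, v in enumerate(opts)}

# Reassembly phase: sort on the numeric suffix (not lexicographically,
# which would break past index 9), restoring the original order.
restored = [env[k] for k in sorted(env, key=lambda k: int(k.rsplit("_", 1)[1]))]
assert restored == opts
```

Without the explicit sort, iteration order over the container's environment is unspecified, which matches the "ordering differed every time" behavior the author observed.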
[GitHub] spark issue #20628: [SPARK-23449][K8S] Preserve extraJavaOptions ordering
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20628 Can one of the admins verify this patch?
[GitHub] spark issue #20619: [SPARK-23390][SQL] Register task completion listeners f...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20619 can we provide a manual test like the OOM one in your ORC PR?
[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20629 [SPARK-23451][ML] Deprecate KMeans.computeCost ## What changes were proposed in this pull request? Deprecate `KMeans.computeCost` which was introduced as a temp fix and now it is not needed anymore, since we introduced `ClusteringEvaluator`. ## How was this patch tested? manual test (deprecation warning displayed) Scala ``` ... scala> model.computeCost(dataset) warning: there was one deprecation warning; re-run with -deprecation for details res1: Double = 0.0 ``` Python ``` >>> import warnings >>> warnings.simplefilter('always', DeprecationWarning) ... >>> model.computeCost(df) /Users/mgaido/apache/spark/python/pyspark/ml/clustering.py:330: DeprecationWarning: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead. " instead.", DeprecationWarning) ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-23451 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20629.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20629 commit 2f79bb2d5c7e29e85a4a7abe63254d392a49fe53 Author: Marco Gaido Date: 2018-02-16T16:03:09Z [SPARK-23451][ML] Deprecate KMeans.computeCost
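The Python deprecation mechanism shown in the PR description can be sketched in a few lines (class and method names here are illustrative stand-ins, not the real pyspark code): the method keeps working but emits a `DeprecationWarning` pointing at the replacement API.

```python
import warnings

# Illustrative sketch of the deprecation pattern from the PR above.
# KMeansModelSketch is a hypothetical stand-in for pyspark's KMeansModel.

class KMeansModelSketch:
    def compute_cost(self, dataset):
        warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. "
                      "Use ClusteringEvaluator instead.", DeprecationWarning)
        return 0.0  # placeholder cost

# DeprecationWarning is hidden by default outside __main__; enable it the
# same way the PR's manual test does.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    KMeansModelSketch().compute_cost(None)

assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

This mirrors the manual test in the PR: without `simplefilter('always', DeprecationWarning)` the warning would typically be suppressed, which is why the description enables it explicitly.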
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20629 Merged build finished. Test PASSed.
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20629 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/937/ Test PASSed.
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20629 **[Test build #87510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87510/testReport)** for PR 20629 at commit [`2f79bb2`](https://github.com/apache/spark/commit/2f79bb2d5c7e29e85a4a7abe63254d392a49fe53).
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20057 Our overwrite semantics are confusing to most users. We need to correct them in the next release, i.e., Spark 2.4. Even if we try our best to keep the schema of the original table, the actual CREATE TABLE statements still take a lot of vendor-specific info, and it is hard for us to rebuild all of it. I can understand your use case for truncate. I am sorry this will not be part of the Spark 2.3 release; we will include it in the next release. You can still make the change in your forked Spark. Just feel free to let us know if you find anything we should do in Spark SQL JDBC to match the corresponding features in Sqoop. Thanks!
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20057 This is a flaky test; your changes did not fail any test case. I will review your PR after the 2.3 release. Thanks again! cc @dongjoon-hyun Do you want to take a look at this?
[GitHub] spark pull request #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20511#discussion_r168817045 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala --- @@ -160,6 +160,15 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll { } } } + + test("SPARK-23340 Empty float/double array columns raise EOFException") { +Seq(Seq(Array.empty[Float]).toDF(), Seq(Array.empty[Double]).toDF()).foreach { df => + withTempPath { path => --- End diff -- We have three ORC readers, right? We need to check all of them, and the vectorized reader too, even if they do not support it.
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20621 This is a blocker-level regression.
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387 @cloud-fan, is there anything else that needs to be updated, or is this ready to be merged?
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20621 It sounds like Spark 2.2 already has this bug. This causes an incorrect result.
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20567 Thanks! Happy Lunar New Year!
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20625 Thanks for the fast fix! We need to merge it to Spark 2.3.0 before RC4. Will merge it now. We can improve the fix later if anybody has better ideas. Thanks! Merged to master/2.3. Happy Lunar New Year!
[GitHub] spark issue #20424: [Spark-23240][python] Better error message when extraneo...
Github user squito commented on the issue: https://github.com/apache/spark/pull/20424 still lgtm, thanks
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20611 ok to test
[GitHub] spark pull request #20625: [SPARK-23446][PYTHON] Explicitly check supported ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20625
[GitHub] spark pull request #20625: [SPARK-23446][PYTHON] Explicitly check supported ...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/20625#discussion_r168823994 --- Diff: python/pyspark/sql/dataframe.py --- @@ -2000,10 +2001,12 @@ def toPandas(self): return _check_dataframe_localize_timestamps(pdf, timezone) else: return pd.DataFrame.from_records([], columns=self.columns) -except ImportError as e: -msg = "note: pyarrow must be installed and available on calling Python process " \ - "if using spark.sql.execution.arrow.enabled=true" -raise ImportError("%s\n%s" % (_exception_message(e), msg)) +except Exception as e: +msg = ( +"Note: toPandas attempted Arrow optimization because " +"'spark.sql.execution.arrow.enabled' is set to true. Please set it to false " +"to disable this.") +raise RuntimeError("%s\n%s" % (_exception_message(e), msg)) --- End diff -- Should the same type of error be raised instead of `RuntimeError`?
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20611 **[Test build #87511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87511/testReport)** for PR 20611 at commit [`af17f65`](https://github.com/apache/spark/commit/af17f65d2d60b69fe0c4addff5299153d4af37c0).
[GitHub] spark issue #20568: [SPARK-23381][CORE] Murmur3 hash generates a different v...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/20568 @mrkm4ntr this is a legitimate failure. Can you fix the Python tests?
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20621 @gatorsmile thanks for checking. Yes, Spark 2.2 is affected too, so I am not sure whether this should be considered a blocker regression. But I think we should fix it as soon as possible nonetheless.
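The intended semantics of the fix in PR 20621 can be sketched in Python (a hedged stand-in; the real implementation is in Spark's Scala partition-inference code): a partition value should be inferred as a Date only when the *entire* string parses as a date, not when a lenient parser happens to accept a prefix of it.

```python
from datetime import datetime

# Illustrative sketch of strict date inference: accept a value as a Date
# only if the whole string matches the date format. (Python's strptime is
# strict by construction, which is exactly the behavior the fix wants.)

def infers_as_date(value):
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

assert infers_as_date("2018-02-16")
assert not infers_as_date("2018-02-16 16:03:09")  # timestamp, not a date
assert not infers_as_date("2018-02-16-extra")     # trailing garbage rejected
```

A lenient parser that stops at the first unparseable character would silently truncate the second and third values to `2018-02-16`, which is the incorrect-result bug gatorsmile flags above.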
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20611 Merged build finished. Test FAILed.
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20611 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87511/ Test FAILed.
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20611 **[Test build #87511 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87511/testReport)** for PR 20611 at commit [`af17f65`](https://github.com/apache/spark/commit/af17f65d2d60b69fe0c4addff5299153d4af37c0). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20625: [SPARK-23446][PYTHON] Explicitly check supported ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20625#discussion_r168825608 --- Diff: python/pyspark/sql/dataframe.py --- @@ -2000,10 +2001,12 @@ def toPandas(self): return _check_dataframe_localize_timestamps(pdf, timezone) else: return pd.DataFrame.from_records([], columns=self.columns) -except ImportError as e: -msg = "note: pyarrow must be installed and available on calling Python process " \ - "if using spark.sql.execution.arrow.enabled=true" -raise ImportError("%s\n%s" % (_exception_message(e), msg)) +except Exception as e: +msg = ( +"Note: toPandas attempted Arrow optimization because " +"'spark.sql.execution.arrow.enabled' is set to true. Please set it to false " +"to disable this.") +raise RuntimeError("%s\n%s" % (_exception_message(e), msg)) --- End diff -- Yup, please open a PR if you have a better idea.
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20625 This was my best attempt at a fix as small and safe as possible. Thanks sincerely for merging it, @gatorsmile. This was my last concern about PyArrow and Pandas. I don't mind at all if anyone opens another PR with a better idea, to be clear.
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20625 I think `RuntimeError` is fine for now and we can improve this later with logic to fallback too - best not to try and get too clever so close to the release :) Thanks for catching this and the quick fix @HyukjinKwon !
[GitHub] spark issue #20625: [SPARK-23446][PYTHON] Explicitly check supported types i...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20625 Thank you @BryanCutler!
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20629 **[Test build #87510 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87510/testReport)** for PR 20629 at commit [`2f79bb2`](https://github.com/apache/spark/commit/2f79bb2d5c7e29e85a4a7abe63254d392a49fe53). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20629 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87510/ Test PASSed.
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20629 Merged build finished. Test PASSed.
[GitHub] spark issue #20424: [Spark-23240][python] Better error message when extraneo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20424 Thanks @squito. Will merge this one in few days.