[GitHub] spark issue #16762: [SPARK-19419] [SPARK-19420] Fix the cross join detection
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16762 **[Test build #73921 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73921/testReport)** for PR 16762 at commit [`efdf04e`](https://github.com/apache/spark/commit/efdf04ee00c68c4914dd52e8262bda8dfef476da). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17161

Firstly, I see this as slightly different from Python, in that in R it is common to have built-in datasets, and users are possibly used to having them and to having examples that use them. And as of now, many of our examples are not meant to be runnable, and they are clearly indicated as such.

I have done a pass on the changes in this PR and I'm happy with changing from a non-existent JSON file to `mtcars`. I'm slightly concerned about the few cases of artificial 3-row data (like [here](https://github.com/apache/spark/pull/17161/files#diff-508641a8bd6c6b59f3e77c80cdcfa6a9R2483)) - more on that below, under small datasets.

That said, I wonder about the verbosity of adding to examples like this, similarly as in the Python discussions, and, since we have more than 300 pages of API doc, changing them all is not a simple task. But I do agree that not having broken or incorrect examples is very important.

My concerns are:
- how much work and how much change is it to change all examples (this is only 1 .R file out of the 20-something we have, with a total of 300+ methods, which is on the high side for R packages)
- how much churn will it be to keep them up-to-date when we have changes to the API (e.g. `sparkR.session()`), especially since, in order to make examples self-contained, we tend to add additional calls to manipulate data, thereby increasing the number of references to API calls
- perhaps more importantly, how practical or useful would it be to use built-in datasets or a native R data.frame (`mtcars`, `cars`, `Titanic`, `iris`, or some made-up data, all of which are super small) on a scalable data platform like Spark? Perhaps it is better to demonstrate, in examples, how to work with external data sources, multiple file formats etc.?
- and lastly, we still have about a dozen methods without an example that are being flagged by CRAN checks (but not enough to fail them yet)

A couple of *random* thoughts (would be interested to see how they look first!):
- group smaller functions into a single page that shares a longer, more concrete example (need to check whether it messes up parameter documentation or makes it more confusing, or how it might affect method help discoverability, like with `?predict`) (btw, this is the approach we have for ML methods)
- reference external example files
- have examples using datasets that come with Spark (like [this one](https://github.com/apache/spark/blob/master/examples/src/main/resources/people.json))
- have examples in templates and reuse them
- keep the existing page breakdown but, instead of scattering examples around in each page, link to a special group of pages (via `@seealso`) with longer, more concrete examples (e.g. a column manipulation set)
- make examples run (i.e. remove dontrun); this, of course, would require making sure examples are self-contained and correct (this is a bigger effort; it could possibly extend build time and/or make the build fail more often, as examples will then run as a part of the CRAN check) (?!)

I suspect we would likely need a combination or subset of these techniques. To me, the high-level priority would be, in order: i) example correctness; ii) example coverage - we should have some examples for every method; iii) better, richer, self-contained examples in strategic places.

Thoughts?
[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17161#discussion_r104306085

    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -741,12 +724,12 @@ setMethod("coalesce",
     #' @examples
     #'\dontrun{
     #' sparkR.session()
    -#' path <- "path/to/file.json"
    -#' df <- read.json(path)
    +#' df <- createDataFrame(mtcars)
    +#' newDF <- coalesce(df, 1L)
    --- End diff --

should probably not have coalesce in the example blob for repartition
[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17161#discussion_r104306095

    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -548,10 +537,9 @@ setMethod("registerTempTable",
     #' @examples
     #'\dontrun{
     #' sparkR.session()
    -#' df <- read.df(path, "parquet")
    -#' df2 <- read.df(path2, "parquet")
    -#' createOrReplaceTempView(df, "table1")
    -#' insertInto(df2, "table1", overwrite = TRUE)
    +#' df <- limit(createDataFrame(faithful), 5)
    --- End diff --

why limit?
[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17161#discussion_r104306091

    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -741,12 +724,12 @@ setMethod("coalesce",
     #' @examples
     #'\dontrun{
     #' sparkR.session()
    -#' path <- "path/to/file.json"
    -#' df <- read.json(path)
    +#' df <- createDataFrame(mtcars)
    +#' newDF <- coalesce(df, 1L)
     #' newDF <- repartition(df, 2L)
     #' newDF <- repartition(df, numPartitions = 2L)
    -#' newDF <- repartition(df, col = df$"col1", df$"col2")
    -#' newDF <- repartition(df, 3L, col = df$"col1", df$"col2")
    +#' newDF <- repartition(df, col = df[[1]], df[[2]])
    --- End diff --

showing as an example column reference with `$name` is important too
[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17161#discussion_r104306047

    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -2805,10 +2779,9 @@ setMethod("except",
     #' @examples
     #'\dontrun{
     #' sparkR.session()
    -#' path <- "path/to/file.json"
    -#' df <- read.json(path)
    -#' write.df(df, "myfile", "parquet", "overwrite")
    -#' saveDF(df, parquetPath2, "parquet", mode = saveMode, mergeSchema = mergeSchema)
    +#' df <- createDataFrame(mtcars)
    +#' write.df(df, tempfile(), "parquet", "overwrite")
    --- End diff --

I think we should avoid having `tempfile()` as the output path in an example, as that might point users in the wrong direction - anything saved in tempfile will disappear as soon as the R session ends.
[GitHub] spark pull request #17161: [SPARK-19819][SparkR] Use concrete data in SparkR...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17161#discussion_r104306070

    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1123,10 +1096,9 @@ setMethod("dim",
     #' @examples
     #'\dontrun{
     #' sparkR.session()
    -#' path <- "path/to/file.json"
    -#' df <- read.json(path)
    +#' df <- createDataFrame(mtcars)
     #' collected <- collect(df)
    -#' firstName <- collected[[1]]$name
    +#' collected[[1]]
    --- End diff --

right, that seems rather unnecessary. any other idea on how to show it is a data.frame?
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/17166 What is the rationale for this change? Is it to propagate the task kill reason to the UI, i.e. the one line in https://github.com/apache/spark/pull/17166/files#diff-b8adb646ef90f616c34eb5c98d1ebd16R357? Or did I miss some other use for this?
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query result d...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17145 cc @srowen
[GitHub] spark issue #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOper...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17167 cc @zsxwing
[GitHub] spark issue #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOper...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17167 cc @srowen
[GitHub] spark issue #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOper...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17167 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73920/ Test PASSed.
[GitHub] spark issue #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOper...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17167 Merged build finished. Test PASSed.
[GitHub] spark issue #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOper...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17167 **[Test build #73920 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73920/testReport)** for PR 17167 at commit [`72f1963`](https://github.com/apache/spark/commit/72f1963a36f9f1abfe8ca10d30b01f52c2281d82). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query result d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17145 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73919/ Test PASSed.
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query result d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17145 Merged build finished. Test PASSed.
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query result d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17145 **[Test build #73919 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73919/testReport)** for PR 17145 at commit [`2cff2b2`](https://github.com/apache/spark/commit/2cff2b2e3261bb988391200c366a10ca0f274fc8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16656: [SPARK-18116][DStream] Report stream input information a...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/16656 ping @zsxwing
[GitHub] spark issue #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOper...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17167 **[Test build #73920 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73920/testReport)** for PR 17167 at commit [`72f1963`](https://github.com/apache/spark/commit/72f1963a36f9f1abfe8ca10d30b01f52c2281d82).
[GitHub] spark pull request #17167: [SPARK-19822][TEST] CheckpointSuite.testCheckpoin...
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/17167

[SPARK-19822][TEST] CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string.

## What changes were proposed in this pull request?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73800/testReport/

```
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 617 times over 10.003740484 seconds. Last failure message: 8 did not equal 2.
```

The check condition is:

```
val checkpointFilesOfLatestTime = Checkpoint.getCheckpointFiles(checkpointDir).filter {
  _.toString.contains(clock.getTimeMillis.toString)
}
// Checkpoint files are written twice for every batch interval. So assert that both
// are written to make sure that both of them have been written.
assert(checkpointFilesOfLatestTime.size === 2)
```

The path string itself may contain `clock.getTimeMillis.toString`, for example:

```
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-500
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1000
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1500
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2000
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2500
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3000
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500.bk
file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500
------
```

So we should check only the filename, not the whole path.

## How was this patch tested?

Jenkins.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uncleGen/spark flaky-CheckpointSuite

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17167.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17167

commit 72f1963a36f9f1abfe8ca10d30b01f52c2281d82
Author: uncleGen
Date: 2017-03-03T10:11:52Z

    flaky CheckpointSuite test failure
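The failure mode described in this pull request can be reproduced outside Spark. The following is only a minimal Python sketch, not the actual Scala fix: the helper names `match_whole_path` and `match_file_name` are hypothetical, and the `file:` scheme prefix from the listing is dropped for simplicity. It reuses the directory and file names quoted above, where the random directory name `spark-20035007-...` happens to contain the digits `3500`.

```python
from pathlib import PurePosixPath

# Directory and file names copied from the listing in the pull request.
BASE = PurePosixPath(
    "/root/dev/spark/assembly/CheckpointSuite/"
    "spark-20035007-9891-4fb6-91c1-cc15b7ccaf15"
)
NAMES = [
    "checkpoint-500", "checkpoint-1000", "checkpoint-1500", "checkpoint-2000",
    "checkpoint-2500", "checkpoint-3000", "checkpoint-3500.bk", "checkpoint-3500",
]
PATHS = [BASE / name for name in NAMES]


def match_whole_path(paths, time_ms):
    """Hypothetical helper: substring match against the full path string."""
    return [p for p in paths if str(time_ms) in str(p)]


def match_file_name(paths, time_ms):
    """Hypothetical helper: substring match against the final path component only."""
    return [p for p in paths if str(time_ms) in p.name]


latest = 3500
print(len(match_whole_path(PATHS, latest)))  # 8: the directory name contains "3500"
print(len(match_file_name(PATHS, latest)))   # 2: checkpoint-3500.bk and checkpoint-3500
```

With the whole-path check every file matches, which lines up with the reported "8 did not equal 2"; restricting the check to the file name gives the two files the assertion expects.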
[GitHub] spark issue #17134: [SPARK-19795][SPARKR] add column functions to_json, from...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17134 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73918/ Test PASSed.
[GitHub] spark issue #17134: [SPARK-19795][SPARKR] add column functions to_json, from...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17134 Merged build finished. Test PASSed.
[GitHub] spark issue #17134: [SPARK-19795][SPARKR] add column functions to_json, from...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17134 **[Test build #73918 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73918/testReport)** for PR 17134 at commit [`3748d9b`](https://github.com/apache/spark/commit/3748d9b081a83a0f97c4c711d3dba06ee350435b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17159: [SPARK-19818][SparkR] union should check for name consis...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17159

hmm... this is somewhat by design - `union` can take in 2 DataFrames that might not match in column names or types. In that case, values in one of the DataFrames will be coerced to make things fit:

```
>>> d = spark.createDataFrame([{'name': 'Alice', 'age': 1}])
>>> l = spark.createDataFrame([(1, 2)])
>>> d.union(l).head(2)
[Row(age=1, name=u'Alice'), Row(age=1, name=u'2')]
>>> l.dtypes
[('_1', 'bigint'), ('_2', 'bigint')]
>>> d.dtypes
[('age', 'bigint'), ('name', 'string')]
```

Do you see this as something that might be unexpected for R users (in which case `rbind` might be the overload to look into) or SQL users (documented as equivalent to SQL UNION ALL)?
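To make the positional behaviour in the session above easier to follow, here is a toy model in plain Python. It is a sketch under stated assumptions, not Spark's implementation: `union_by_position` is a hypothetical helper, schemas are modelled as lists of `(name, type)` pairs, and the only coercion rule modelled is "mismatched types fall back to string", which is far simpler than Spark's real type coercion.

```python
def union_by_position(left_cols, left_rows, right_cols, right_rows):
    """Toy positional union (hypothetical helper, not a Spark API).

    Columns are paired by position, the result keeps the left side's names,
    and a column whose types differ is widened to "string".
    """
    if len(left_cols) != len(right_cols):
        raise ValueError("both sides need the same number of columns")

    out_cols = []
    for (left_name, left_type), (_right_name, right_type) in zip(left_cols, right_cols):
        out_type = left_type if left_type == right_type else "string"
        out_cols.append((left_name, out_type))

    def coerce(row):
        # Stringify values that land in a column widened to "string".
        return tuple(
            str(value) if col_type == "string" else value
            for value, (_name, col_type) in zip(row, out_cols)
        )

    return out_cols, [coerce(row) for row in left_rows + right_rows]


# Same shapes as `d` and `l` in the session above.
d_cols, d_rows = [("age", "bigint"), ("name", "string")], [(1, "Alice")]
l_cols, l_rows = [("_1", "bigint"), ("_2", "bigint")], [(1, 2)]

cols, rows = union_by_position(d_cols, d_rows, l_cols, l_rows)
print(cols)  # [('age', 'bigint'), ('name', 'string')]
print(rows)  # [(1, 'Alice'), (1, '2')]
```

The second row shows the same effect as `Row(age=1, name=u'2')` in the session: the right-hand column names `_1` and `_2` play no part, and the integer `2` ends up as a string under `name`.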
[GitHub] spark issue #17145: [SPARK-19805][TEST] Log the row type when query result d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17145 **[Test build #73919 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73919/testReport)** for PR 17145 at commit [`2cff2b2`](https://github.com/apache/spark/commit/2cff2b2e3261bb988391200c366a10ca0f274fc8).
[GitHub] spark issue #17144: [SPARK-19803][TEST] flaky BlockManagerReplicationSuite t...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17144 @kayousterhout sure, I have been working on that flaky test.
[GitHub] spark pull request #17145: [SPARK-19805][TEST] Log the row type when query r...
Github user uncleGen commented on a diff in the pull request: https://github.com/apache/spark/pull/17145#discussion_r104304108

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala ---
    @@ -312,13 +312,23 @@ object QueryTest {
           sparkAnswer: Seq[Row],
           isSorted: Boolean = false): Option[String] = {
         if (prepareAnswer(expectedAnswer, isSorted) != prepareAnswer(sparkAnswer, isSorted)) {
    +      val getRowType: Option[Row] => String = row =>
    +        "RowType" + row.map(row =>
    --- End diff --

@hvanhovell After using `schema.catalogString`:

```
!== Correct Answer - 1 ==       == Spark Answer - 1 ==
!struct<_1:string,_2:string>    struct<_1:int,_2:string>
![1,a]                          [1,a]
```
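The output above shows why logging the row type helps: both answers print as `[1,a]`, yet they differ in the type of the first field. A rough Python analogue of that idea follows. It is only a sketch; `describe_schema`, `render` and `mismatch_report` are hypothetical names, Python type names stand in for Spark SQL types, and the layout only loosely imitates the side-by-side format.

```python
def describe_schema(row):
    """Render a struct-like type string, e.g. struct<_1:int,_2:str>."""
    fields = ",".join(
        f"_{index}:{type(value).__name__}" for index, value in enumerate(row, start=1)
    )
    return f"struct<{fields}>"


def render(row):
    """Render the row values only; type information is lost here."""
    return "[" + ",".join(str(value) for value in row) + "]"


def mismatch_report(expected, actual):
    """Return None when the rows are equal, else a side-by-side report with types."""
    if expected == actual:
        return None
    left = [describe_schema(expected), render(expected)]
    right = [describe_schema(actual), render(actual)]
    width = max(len(text) for text in left) + 2
    lines = []
    for left_text, right_text in zip(left, right):
        marker = "!" if left_text != right_text else " "
        lines.append(f"{marker}{left_text.ljust(width)}{right_text}")
    return "\n".join(lines)


# The values render identically, so only the schema line reveals the difference.
report = mismatch_report(("1", "a"), (1, "a"))
print(report)
```

Without the schema line, the report would show two identical-looking rows and give no hint about why the comparison failed.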
[GitHub] spark issue #17134: [SPARK-19795][SPARKR] add column functions to_json, from...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17134 **[Test build #73918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73918/testReport)** for PR 17134 at commit [`3748d9b`](https://github.com/apache/spark/commit/3748d9b081a83a0f97c4c711d3dba06ee350435b).
[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...
Github user crackcell commented on the issue: https://github.com/apache/spark/pull/17123 @imatiach-msft @cloud-fan I updated the code and replaced `java.lang.Double` with `isNullAt()` and `getDouble()`.
[GitHub] spark issue #16954: [SPARK-18874][SQL] First phase: Deferring the correlated...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16954 Merged build finished. Test PASSed.
[GitHub] spark issue #16954: [SPARK-18874][SQL] First phase: Deferring the correlated...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16954 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73915/ Test PASSed.
[GitHub] spark issue #16954: [SPARK-18874][SQL] First phase: Deferring the correlated...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16954 **[Test build #73915 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73915/testReport)** for PR 16954 at commit [`7178719`](https://github.com/apache/spark/commit/7178719aaae961a3b5b38132d09a0d4d91ade692). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Merged build finished. Test FAILed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73917/ Test FAILed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #73917 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73917/testReport)** for PR 17166 at commit [`91b8aef`](https://github.com/apache/spark/commit/91b8aeff8adca4454b9631a0bfa01876de71bb53). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class TaskKilled(reason: String, override val shouldRetry: Boolean) extends TaskFailedReason ` * ` case class KillTask(`
[GitHub] spark issue #17035: [SPARK-19705][SQL] Preferred location supporting HDFS ca...
Github user tanejagagan commented on the issue: https://github.com/apache/spark/pull/17035 @hvanhovell Can you help me with this pull request?
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #73917 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73917/testReport)** for PR 17166 at commit [`91b8aef`](https://github.com/apache/spark/commit/91b8aeff8adca4454b9631a0bfa01876de71bb53).
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73916/ Test FAILed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #73916 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73916/testReport)** for PR 17166 at commit [`1a716aa`](https://github.com/apache/spark/commit/1a716aa31ff2e8e6f5d8e3b73362d28b944319f2). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class TaskKilled(reason: String, override val shouldRetry: Boolean) extends TaskFailedReason ` * ` case class KillTask(`
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Merged build finished. Test FAILed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #73916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73916/testReport)** for PR 17166 at commit [`1a716aa`](https://github.com/apache/spark/commit/1a716aa31ff2e8e6f5d8e3b73362d28b944319f2).
[GitHub] spark pull request #17136: [SPARK-19783][SQL] Treat shorter/longer lengths o...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17136#discussion_r104301364

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -246,8 +246,8 @@ test_that("read/write csv as DataFrame", {
   mockLinesCsv <- c("year,make,model,comment,blank",
                     "\"2012\",\"Tesla\",\"S\",\"No comment\",",
                     "1997,Ford,E350,\"Go get one now they are going fast\",",
-                    "2015,Chevy,Volt",
-                    "NA,Dummy,Placeholder")
+                    "2015,Chevy,Volt,,",
--- End diff --

Is there no way to support a variable number of values (and commas) in a CSV row?
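One way to picture what supporting a variable number of values per row could mean is to normalize every tokenized row to the header width. The sketch below is only an illustration of that idea, not what PR 17136 or Spark's CSV reader does: `CsvRowSketch` and `normalize` are hypothetical names, the rows are assumed to be already split into tokens, and missing fields are represented as `None`.

```scala
object CsvRowSketch {
  // Hypothetical helper: force a tokenized row to exactly `width` fields.
  // Extra trailing tokens are dropped; missing trailing fields become None.
  def normalize(tokens: Seq[String], width: Int): Seq[Option[String]] =
    tokens.take(width).map(Option(_)).padTo(width, None)
}
```

With a five-column header, the short row `2015,Chevy,Volt` from the test above would come out as three present values followed by two missing ones, instead of needing the trailing `,,` in the input.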
[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17094 Merged build finished. Test PASSed.
[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17094 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73914/ Test PASSed.
[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17094 **[Test build #73914 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73914/testReport)** for PR 17094 at commit [`d7dceeb`](https://github.com/apache/spark/commit/d7dceebb5fecc22c74a4ba2a334ab8ca492a518b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16954: [SPARK-18874][SQL] First phase: Deferring the correlated...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16954 **[Test build #73915 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73915/testReport)** for PR 16954 at commit [`7178719`](https://github.com/apache/spark/commit/7178719aaae961a3b5b38132d09a0d4d91ade692).
[GitHub] spark pull request #16954: [SPARK-18874][SQL] First phase: Deferring the cor...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/16954#discussion_r104301109

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -109,6 +109,26 @@ object TypeCoercion {
   }

   /**
+   * This function determines the target type of a comparison operator when one operand
+   * is a String and the other is not. It also handles when one op is a Date and the
+   * other is a Timestamp by making the target type to be String. Currently this is used
+   * to coerce types between LHS and RHS of the IN expression.
+   */
+  val findCommonTypeForBinaryComparison: (DataType, DataType) => Option[DataType] = {
+    case (StringType, DateType) => Some(StringType)
+    case (DateType, StringType) => Some(StringType)
+    case (StringType, TimestampType) => Some(StringType)
+    case (TimestampType, StringType) => Some(StringType)
+    case (TimestampType, DateType) => Some(StringType)
--- End diff --

@hvanhovell Thanks! I had tried to do this before as well, since it came up during the internal review. I have made another attempt. Please let me know what you think.
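To make the quoted rule easier to try in isolation, here is a minimal self-contained sketch of the same pattern. It assumes stand-in type objects rather than Spark's real `org.apache.spark.sql.types` hierarchy, and the diff is cut off after the fifth case, so the `(DateType, TimestampType)` case and the `None` fallback below are assumptions added for illustration, not part of the quoted patch.

```scala
// Stand-in types for illustration only; these are not Spark's DataType classes.
sealed trait DataType
case object StringType extends DataType
case object DateType extends DataType
case object TimestampType extends DataType
case object IntegerType extends DataType

object CoercionSketch {
  // Pattern-matching function literal over a pair of types, as in the diff.
  val findCommonTypeForBinaryComparison: (DataType, DataType) => Option[DataType] = {
    case (StringType, DateType) => Some(StringType)
    case (DateType, StringType) => Some(StringType)
    case (StringType, TimestampType) => Some(StringType)
    case (TimestampType, StringType) => Some(StringType)
    case (TimestampType, DateType) => Some(StringType)
    // Assumed: the symmetric case, which the truncated diff does not show.
    case (DateType, TimestampType) => Some(StringType)
    // Assumed fallback: no common comparison type is known for other pairs.
    case _ => None
  }
}
```

Under these assumptions, `CoercionSketch.findCommonTypeForBinaryComparison(DateType, StringType)` yields `Some(StringType)`, while a pair not covered by the listed cases, such as `(IntegerType, IntegerType)`, falls through to `None`.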
[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Also please note the [UnsafeArrayData-producing branch](https://github.com/michalsenkyr/spark/compare/dataset-seq-builder...michalsenkyr:dataset-seq-builder-unsafe) that is not yet merged into this branch. I'd like to get somebody's opinion on that before I do it.
[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Would it be possible for somebody to review this PR for me? I have a few ideas that depend on it and I'd like to start working on them, most notably support for Java Lists. Maybe @cloud-fan could take a look at this?
[GitHub] spark issue #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow che...
Github user brkyvz commented on the issue: https://github.com/apache/spark/pull/16842 retest this please
[GitHub] spark issue #16933: [SPARK-19601] [SQL] Fix CollapseRepartition rule to pres...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16933 Merged build finished. Test PASSed.
[GitHub] spark issue #16933: [SPARK-19601] [SQL] Fix CollapseRepartition rule to pres...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16933 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73911/ Test PASSed.
[GitHub] spark issue #16933: [SPARK-19601] [SQL] Fix CollapseRepartition rule to pres...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16933 **[Test build #73911 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73911/testReport)** for PR 16933 at commit [`680c3af`](https://github.com/apache/spark/commit/680c3afa4f29aeffadd17798b7a06f1664964683). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `abstract class RepartitionOperation extends UnaryNode `
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Merged build finished. Test FAILed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73913/ Test FAILed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #73913 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73913/testReport)** for PR 17166 at commit [`ba7cbd0`](https://github.com/apache/spark/commit/ba7cbd09ec0602ac8c9ad59966b2b45a70354bf7). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17094 **[Test build #73914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73914/testReport)** for PR 17094 at commit [`d7dceeb`](https://github.com/apache/spark/commit/d7dceebb5fecc22c74a4ba2a334ab8ca492a518b).
[GitHub] spark issue #17094: [SPARK-19762][ML] Hierarchy for consolidating ML aggrega...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17094 Jenkins test this please.
[GitHub] spark issue #16998: [SPARK-19665][SQL] Improve constraint propagation
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16998 @hvanhovell Do you have any thoughts on this already? Please let me know. Thanks!
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #73913 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73913/testReport)** for PR 17166 at commit [`ba7cbd0`](https://github.com/apache/spark/commit/ba7cbd09ec0602ac8c9ad59966b2b45a70354bf7).
[GitHub] spark issue #17163: [SPARK-16617][BUILD][CORE] Upgrade to Avro 1.8.x
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/17163 If Avro is good at backwards compatibility it shouldn't be an issue; @JoshRosen seems to maintain the spark-avro package so he might have more insights.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73912/ Test FAILed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #73912 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73912/testReport)** for PR 17166 at commit [`e9178b6`](https://github.com/apache/spark/commit/e9178b61f356ecf4469a58a05ee4183e7beb4bf9). * This patch **fails to build**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class TaskKilled(reason: String, override val shouldRetry: Boolean) extends TaskFailedReason ` * ` case class KillTask(`
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Merged build finished. Test FAILed.
[GitHub] spark issue #16933: [SPARK-19601] [SQL] Fix CollapseRepartition rule to pres...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16933 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73910/ Test PASSed.
[GitHub] spark issue #16933: [SPARK-19601] [SQL] Fix CollapseRepartition rule to pres...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16933 Merged build finished. Test PASSed.
[GitHub] spark issue #16933: [SPARK-19601] [SQL] Fix CollapseRepartition rule to pres...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16933 **[Test build #73910 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73910/testReport)** for PR 16933 at commit [`0f95a6f`](https://github.com/apache/spark/commit/0f95a6f564b044c7f866ab69edd2ba0a565bb47b). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `abstract class RepartitionOperation(numPartitions: Int) extends UnaryNode `
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #73912 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73912/testReport)** for PR 17166 at commit [`e9178b6`](https://github.com/apache/spark/commit/e9178b61f356ecf4469a58a05ee4183e7beb4bf9).
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
GitHub user ericl opened a pull request: https://github.com/apache/spark/pull/17166

[SPARK-19820] [core] Allow reason to be specified for task kill

## What changes were proposed in this pull request?

This refactors the task kill path to allow specifying a reason for the task kill. The reason is propagated opaquely through events, and will show up in the UI automatically as `(N tasks killed: $reason)` and `TaskKilled: $reason`. Also, make the logic for whether a task failure should be retried explicit rather than special casing TaskKilled messages.

cc @rxin

## How was this patch tested?

Existing tests, tried killing some stages in the UI and verified the messages are as expected.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericl/spark kill-reason

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17166.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17166

commit e9178b61f356ecf4469a58a05ee4183e7beb4bf9
Author: Eric Liang
Date: 2017-03-04T23:47:36Z

Allow reason to be specified for task kill
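The two ideas in this description, a kill reason carried on the failure object and an explicit retry flag instead of special-casing `TaskKilled`, can be sketched roughly as follows. Only the `TaskKilled(reason, shouldRetry)` signature comes from the SparkQA reports for this PR; the trait body, the default for `shouldRetry`, `ExceptionFailure` as written here, and the `KillReasonSketch` helpers are assumptions made up for illustration and are not Spark's actual code.

```scala
// Stand-in for Spark's TaskFailedReason; only what this sketch needs.
sealed trait TaskFailedReason {
  def toErrorString: String
  // Assumed default: a failure is retried unless a subclass says otherwise.
  def shouldRetry: Boolean = true
}

// Signature as listed in the SparkQA report for PR 17166.
case class TaskKilled(reason: String, override val shouldRetry: Boolean)
  extends TaskFailedReason {
  override def toErrorString: String = s"TaskKilled: $reason"
}

// Hypothetical second failure type, included only for contrast.
case class ExceptionFailure(description: String) extends TaskFailedReason {
  override def toErrorString: String = description
}

object KillReasonSketch {
  // Hypothetical: the decision reads the flag rather than matching on TaskKilled.
  def nextAction(failure: TaskFailedReason): String =
    if (failure.shouldRetry) "retry" else "do not retry"

  // Hypothetical: builds the "(N tasks killed: $reason)" summary per reason.
  def killedSummary(killed: Seq[TaskKilled]): String =
    killed.groupBy(_.reason).toSeq.sortBy(_._1)
      .map { case (reason, tasks) => s"(${tasks.size} tasks killed: $reason)" }
      .mkString(" ")
}
```

The point of the flag is that code deciding whether to reschedule a task no longer needs to know which failure types exist; it only asks the failure object.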
[GitHub] spark issue #17164: [SPARK-16844][SQL][WIP] Support codegen for sort-based a...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17164 @maropu I think this is pretty exciting. This is very useful in situations where we have a lot of groups; in that case I will happily take a 2x performance improvement any day. This is still pretty decent if you consider that this aggregate is dominated by sorting.
[GitHub] spark issue #16933: [SPARK-19601] [SQL] Fix CollapseRepartition rule to pres...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16933 **[Test build #73911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73911/testReport)** for PR 16933 at commit [`680c3af`](https://github.com/apache/spark/commit/680c3afa4f29aeffadd17798b7a06f1664964683).
[GitHub] spark issue #16933: [SPARK-19601] [SQL] Fix CollapseRepartition rule to pres...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16933 **[Test build #73910 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73910/testReport)** for PR 16933 at commit [`0f95a6f`](https://github.com/apache/spark/commit/0f95a6f564b044c7f866ab69edd2ba0a565bb47b).
[GitHub] spark issue #12461: [SPARK-14409][ML] Adding a RankingEvaluator to ML
Github user yongtang commented on the issue: https://github.com/apache/spark/pull/12461 /cc @daniloascione please take a look.
[GitHub] spark pull request #17165: [DO NOT MERGE][TESTING] Vince shieh spark 17498
Github user jkbradley closed the pull request at: https://github.com/apache/spark/pull/17165
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16883#discussion_r104296156

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -142,18 +166,18 @@ class StringIndexerModel (
   }
   /** @group setParam */
-  @Since("1.6.0")
-  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
-  setDefault(handleInvalid, "error")
-
-  /** @group setParam */
   @Since("1.4.0")
   def setInputCol(value: String): this.type = set(inputCol, value)
   /** @group setParam */
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
+  /** @group setParam */
+  @Since("2.2.0")
+  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
+  setDefault(handleInvalid, StringIndexer.ERROR_UNSEEN_LABEL)
--- End diff --

No need to set default here since it's set in the trait
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16883#discussion_r104296099

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -105,7 +125,11 @@ class StringIndexer @Since("1.4.0") (
 @Since("1.6.0")
 object StringIndexer extends DefaultParamsReadable[StringIndexer] {
-
+  private[feature] val SKIP_UNSEEN_LABEL: String = "skip"
+  private[feature] val ERROR_UNSEEN_LABEL: String = "error"
+  private[feature] val KEEP_UNSEEN_LABEL: String = "keep"
+  private[feature] val supportedHandleInvalids: Array[String] =
+    Array(SKIP_UNSEEN_LABEL, ERROR_UNSEEN_LABEL, KEEP_UNSEEN_LABEL)
   @Since("1.6.0")
--- End diff --

style: add newline here
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16883#discussion_r104296562

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -163,25 +187,28 @@ class StringIndexerModel (
     }
     transformSchema(dataset.schema, logging = true)
+    val metadata = NominalAttribute.defaultAttr
+      .withName($(outputCol)).withValues(labels).toMetadata()
+    // If we are skipping invalid records, filter them out.
+    val (filteredDataset, keepInvalid) = getHandleInvalid match {
--- End diff --

I'm OK with returning a tuple; that's a common pattern. Do you mean that it makes the code inside the match statement confusing?
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16883#discussion_r104296367

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -163,25 +190,28 @@ class StringIndexerModel (
     }
     transformSchema(dataset.schema, logging = true)
+    val metadata = NominalAttribute.defaultAttr
+      .withName($(outputCol)).withValues(labels).toMetadata()
--- End diff --

Yep, that's what I meant: In ```withValues(labels)```, labels can be set as:
```
val labels = getHandleInvalid match {
  case StringIndexer.KEEP_UNSEEN_LABEL => labels :+ "__unknown"
  case _ => labels
}
```
I'm adding underscores to the attribute name to make it a little less likely to hit conflicts.
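The suggestion above can be tried out without Spark. The following is a minimal sketch under assumptions: `UnseenLabelMetadataSketch`, `KeepUnseenLabel` and `metadataLabels` are hypothetical names, and the `"keep"` string is copied from the constants quoted in the diff. The result is bound to a name other than `labels`, since a local `val labels` defined in terms of itself, as written in the review snippet, would likely not compile inside the same scope.

```scala
object UnseenLabelMetadataSketch {
  // Same string as the KEEP_UNSEEN_LABEL option in the diff; redefined so the sketch stands alone.
  val KeepUnseenLabel: String = "keep"

  /** Returns the labels to store in the output column's metadata. */
  def metadataLabels(labels: Array[String], handleInvalid: String): Array[String] =
    handleInvalid match {
      // With "keep", one extra bucket is appended for labels not seen during fitting.
      case KeepUnseenLabel => labels :+ "__unknown"
      // With "skip" or "error", the metadata lists only the fitted labels.
      case _ => labels
    }
}
```

Because the extra value is appended last, its position equals the number of fitted labels, which matches the index the diff assigns to unseen labels.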
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16883#discussion_r104296546

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -71,18 +92,17 @@ class StringIndexer @Since("1.4.0") (
   def this() = this(Identifiable.randomUID("strIdx"))
   /** @group setParam */
-  @Since("1.6.0")
-  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
-  setDefault(handleInvalid, "error")
-
-  /** @group setParam */
   @Since("1.4.0")
   def setInputCol(value: String): this.type = set(inputCol, value)
   /** @group setParam */
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
+  /** @group setParam */
+  @Since("2.2.0")
+  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
--- End diff --

+1 for maintaining order. setDefault will go in the trait (except in cases where it belongs in just one of the Estimator or Model)
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16883#discussion_r104296045

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -34,8 +36,27 @@ import org.apache.spark.util.collection.OpenHashMap
 /**
  * Base trait for [[StringIndexer]] and [[StringIndexerModel]].
  */
-private[feature] trait StringIndexerBase extends Params with HasInputCol with HasOutputCol
-with HasHandleInvalid {
+private[feature] trait StringIndexerBase extends Params with HasInputCol with HasOutputCol {
+
+  /**
+   * Param for how to handle unseen labels. Options are 'skip' (filter out rows with
+   * unseen labels), 'error' (throw an error), or 'keep' (put unseen labels in a special additional
+   * bucket, at index numLabels.
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.1.0")
+  val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle " +
+    "unseen labels. Options are 'skip' (filter out rows with unseen labels), " +
+    "error (throw an error), or 'keep' (put unseen labels in a special additional bucket," +
--- End diff --

need space after comma: "bucket, "
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16883#discussion_r104296396

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -163,25 +190,28 @@ class StringIndexerModel (
     }
     transformSchema(dataset.schema, logging = true)
+    val metadata = NominalAttribute.defaultAttr
+      .withName($(outputCol)).withValues(labels).toMetadata()
+    // If we are skipping invalid records, filter them out.
+    val (filteredDataset, keepInvalid) = getHandleInvalid match {
+      case SKIP_UNSEEN_LABEL =>
+        val filterer = udf { label: String =>
+          labelToIndex.contains(label)
+        }
+        (dataset.where(filterer(dataset($(inputCol)))), false)
+      case _ => (dataset, getHandleInvalid == KEEP_UNSEEN_LABEL)
+    }
+
     val indexer = udf { label: String =>
       if (labelToIndex.contains(label)) {
         labelToIndex(label)
+      } else if (keepInvalid) {
+        labels.length
       } else {
         throw new SparkException(s"Unseen label: $label.")
--- End diff --

Can you improve the error message?
```
throw new SparkException(s"Unseen label: $label. To handle unseen labels, set Param handleInvalid to ${StringIndexer.KEEP_UNSEEN_LABEL}.")
```
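The three behaviours under review (skip, error, keep) can be illustrated on plain collections. This is a hedged sketch rather than the PR's code: `HandleInvalidSketch` and `indexLabels` are hypothetical names, a `Seq[String]` stands in for the input column, and `IllegalArgumentException` is swapped in for `SparkException` so that the example needs no Spark dependency.

```scala
object HandleInvalidSketch {
  // Option strings as quoted in the diff; redefined so the sketch stands alone.
  val Skip: String = "skip"
  val Error: String = "error"
  val Keep: String = "keep"

  /** Maps each input label to its index, applying the chosen unseen-label strategy. */
  def indexLabels(labels: Array[String], input: Seq[String], handleInvalid: String): Seq[Double] = {
    val labelToIndex: Map[String, Double] =
      labels.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap

    // "skip": drop rows whose label was not seen during fitting.
    val rows = if (handleInvalid == Skip) input.filter(labelToIndex.contains) else input
    val keepInvalid = handleInvalid == Keep

    rows.map { label =>
      labelToIndex.getOrElse(
        label,
        // "keep": unseen labels share one extra bucket at index labels.length.
        if (keepInvalid) labels.length.toDouble
        // "error": fail, and tell the user which setting avoids the failure.
        else throw new IllegalArgumentException(
          s"Unseen label: $label. To handle unseen labels, set handleInvalid to $Keep.")
      )
    }
  }
}
```

For fitted labels `a`, `c`, `b` and input `a`, `d`, `b`, the `skip` mode drops the row containing `d`, the `keep` mode maps `d` to index 3, and the `error` mode throws.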
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16883#discussion_r104296526

--- Diff: docs/ml-features.md ---
@@ -542,12 +543,13 @@ column, we should get the following:
 "a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
 index `2`.
-Additionally, there are two strategies regarding how `StringIndexer` will handle
+Additionally, there are three strategies regarding how `StringIndexer` will handle
 unseen labels when you have fit a `StringIndexer` on one dataset and then use it
 to transform another:
 - throw an exception (which is the default)
 - skip the row containing the unseen label entirely
+- map the unseen labels with indices [numLabels]
--- End diff --

Or just match the phrasing in the doc param
[GitHub] spark pull request #16883: [SPARK-17498][ML] StringIndexer enhancement for h...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16883#discussion_r104296075

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -105,7 +125,11 @@ class StringIndexer @Since("1.4.0") (
 @Since("1.6.0")
 object StringIndexer extends DefaultParamsReadable[StringIndexer] {
-
+  private[feature] val SKIP_UNSEEN_LABEL: String = "skip"
+  private[feature] val ERROR_UNSEEN_LABEL: String = "error"
+  private[feature] val KEEP_UNSEEN_LABEL: String = "keep"
--- End diff --

At some point, let's do that, but not yet. I like keeping things private at first in case we find mistakes after release and need to change things.
[GitHub] spark issue #17165: [DO NOT MERGE][TESTING] Vince shieh spark 17498
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17165 Merged build finished. Test FAILed.
[GitHub] spark issue #17165: [DO NOT MERGE][TESTING] Vince shieh spark 17498
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17165 **[Test build #73909 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73909/testReport)** for PR 17165 at commit [`67f02d5`](https://github.com/apache/spark/commit/67f02d565685dc4b9be2709783539f7af1ea1bb5).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #17165: [DO NOT MERGE][TESTING] Vince shieh spark 17498
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17165 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73909/ Test FAILed.
[GitHub] spark issue #17161: [SPARK-19819][SparkR] Use concrete data in SparkR DataFr...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17161

I think most examples in R packages are (supposed to be) runnable. Coming from a user perspective, I find it useful if I can run the examples directly and see what the function does in action. Since we already have the pseudo-code here, wouldn't it be better to change it to real data? Especially for the more complicated cases like `join`, providing self-contained examples will save users much time in constructing their own examples.

Indeed, by making the examples runnable, I have found and fixed several issues with the pseudo example. For example, the original example in `insertInto` seems to be wrong:
```
createOrReplaceTempView(df, "table1") # This should be saveAsTable
insertInto(df2, "table1", overwrite = TRUE)
```
This is very hard to find without running real examples.

@srowen @felixcheung
[GitHub] spark issue #17165: [DO NOT MERGE][TESTING] Vince shieh spark 17498
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17165 **[Test build #73909 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73909/testReport)** for PR 17165 at commit [`67f02d5`](https://github.com/apache/spark/commit/67f02d565685dc4b9be2709783539f7af1ea1bb5).
[GitHub] spark pull request #17165: [DO NOT MERGE][TESTING] Vince shieh spark 17498
GitHub user jkbradley opened a pull request:

https://github.com/apache/spark/pull/17165

[DO NOT MERGE][TESTING] Vince shieh spark 17498

Temp PR to reproduce Jenkins compilation error

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark VinceShieh-spark-17498

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17165.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17165

commit b970728f48f22f0c2789a941c1fe1ac6b94a3b49
Author: VinceShieh
Date: 2017-02-10T05:50:30Z

    [SPARK-17498][ML] StringIndexer handles unseen labels

    This PR is an enhancement to ML StringIndexer. Before this PR, String Indexer only supports "skip"/"error" options to deal with unseen records. But sometimes those unseen records might still be useful in certain use cases, so user would like to keep the unseen labels. This PR enables StringIndexer to support keeping unseen labels as indices [numLabels].

    '''Before
    StringIndexer().setHandleInvalid("skip")
    StringIndexer().setHandleInvalid("error")
    '''After support the third option "keep"
    StringIndexer().setHandleInvalid("keep")

    Signed-off-by: VinceShieh

commit 5d4b07f517cdf52e5b3b0b786e1dba1993659b2e
Author: VinceShieh
Date: 2017-02-10T07:02:44Z

    fix compilation issue

    Signed-off-by: VinceShieh

commit 0eb7f0784a71cb695f4d936255abbe8ad30bd95d
Author: VinceShieh
Date: 2017-02-10T08:16:57Z

    code refactoring

    Signed-off-by: VinceShieh

commit 9a4174579aa811c99a81967dd829e506c0096ccd
Author: VinceShieh
Date: 2017-02-10T09:08:30Z

    add exclusion rules in mima to pass binary compability check

    Signed-off-by: VinceShieh

commit 1736057d055ad4a01dac3e9e79950bfcd9b91e1e
Author: VinceShieh
Date: 2017-02-10T09:33:31Z

    update document

    Signed-off-by: VinceShieh

commit ebe9ddb0dc3dd597d435f8a641fce790b4033a64
Author: VinceShieh
Date: 2017-02-10T09:37:43Z

    Revert "add exclusion rules in mima to pass binary compability check"

    This reverts commit 9a4174579aa811c99a81967dd829e506c0096ccd.

commit 27c1b10f25db851cd1e670bd6a0d6e6f59c2ce1e
Author: VinceShieh
Date: 2017-02-10T09:42:56Z

    Mima changes to pass binary compatibility check

    Signed-off-by: VinceShieh

commit 9bcaffc19e7a11d31aa6bb9ebbcd96367fc1cd38
Author: VinceShieh
Date: 2017-03-01T02:09:36Z

    update

    Signed-off-by: VinceShieh

commit 4dc10e6390b30fa8df9789479430e0a3f7c65c39
Author: VinceShieh
Date: 2017-03-01T02:16:29Z

    update target version

    Signed-off-by: VinceShieh

commit fa24e433c3f9fe6f76fe0a55df4551881f194d7b
Author: VinceShieh
Date: 2017-03-01T02:26:43Z

    fix compilation on val (filteredDataset, keepInvalid) = getHandleInvalid match { case .. }

    Signed-off-by: VinceShieh

commit 67f02d565685dc4b9be2709783539f7af1ea1bb5
Author: Joseph K. Bradley
Date: 2017-03-04T20:08:21Z

    remove scala existentials import
[GitHub] spark issue #15274: [SPARK-17699] Support for parsing JSON string columns
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15274 Based on the comment from @marmbrus in a JIRA, we prefer using our DDL format. For example, as we did for CREATE TABLE, we can specify the schema using `a int, b string`.
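To make the shape of such a schema string concrete, here is a deliberately small sketch of reading `a int, b string` into name and type pairs. It is not Spark's parser and makes strong assumptions: `DdlSchemaSketch`, `Field` and `parse` are hypothetical names, and only flat columns with simple type names are handled (a type such as `decimal(10,2)` or a nested struct would break the comma split).

```scala
object DdlSchemaSketch {
  /** One column of the schema: a name and the type name as written. */
  final case class Field(name: String, dataType: String)

  /** Splits a flat "name type, name type" string into fields. */
  def parse(ddl: String): Seq[Field] =
    ddl.split(",").toSeq.map(_.trim).filter(_.nonEmpty).map { column =>
      // Split on the first run of whitespace: left is the name, right is the type.
      column.split("\\s+", 2) match {
        case Array(name, dataType) => Field(name, dataType.trim.toLowerCase)
        case _ =>
          throw new IllegalArgumentException(s"Expected '<name> <type>' but got: $column")
      }
    }
}
```

Calling `DdlSchemaSketch.parse("a int, b string")` yields two fields, `a` of type `int` and `b` of type `string`, in the order written.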
[GitHub] spark issue #17163: [SPARK-16617][BUILD][CORE] Upgrade to Avro 1.8.x
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17163 Merged build finished. Test PASSed.
[GitHub] spark issue #17163: [SPARK-16617][BUILD][CORE] Upgrade to Avro 1.8.x
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17163 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73906/ Test PASSed.
[GitHub] spark issue #17163: [SPARK-16617][BUILD][CORE] Upgrade to Avro 1.8.x
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17163 **[Test build #73906 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73906/testReport)** for PR 17163 at commit [`9461741`](https://github.com/apache/spark/commit/94617414ef580bc0ce2934c1c8e7e22423eff51e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #15274: [SPARK-17699] Support for parsing JSON string columns
Github user Sazpaimon commented on the issue: https://github.com/apache/spark/pull/15274 @gatorsmile Alternatively, one can do what brickhouse's `from_json` Hive UDF does ( https://gist.github.com/jeromebanks/8855408#file-gistfile1-sql ). (For the record, I actually need this in SQL.)
[GitHub] spark issue #17164: [SPARK-16844][SQL][WIP] Support codegen for sort-based a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17164 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73908/ Test FAILed.
[GitHub] spark issue #17164: [SPARK-16844][SQL][WIP] Support codegen for sort-based a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17164 Merged build finished. Test FAILed.
[GitHub] spark issue #17164: [SPARK-16844][SQL][WIP] Support codegen for sort-based a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17164 **[Test build #73908 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73908/testReport)** for PR 17164 at commit [`9a26a0a`](https://github.com/apache/spark/commit/9a26a0a0e9c7f9d0e90dc5257eb5038eafeb206c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `abstract class AggregateExec extends UnaryExecNode `
   * `trait CodegenAggregateSupport extends CodegenSupport `
[GitHub] spark issue #16611: [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16611 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73905/ Test PASSed.
[GitHub] spark issue #16611: [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16611 Merged build finished. Test PASSed.
[GitHub] spark issue #16611: [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16611 **[Test build #73905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73905/testReport)** for PR 16611 at commit [`9f7e679`](https://github.com/apache/spark/commit/9f7e679586b9ede33d10ef0cd7db2fba3237c712).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.