[GitHub] spark pull request #17148: [SPARK-17075][SQL][followup] fix filter estimatio...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17148
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17096 Thank you @viirya for your sign-off.
[GitHub] spark issue #17148: [SPARK-17075][SQL][followup] fix filter estimation issue...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17148 thanks, merging to master!
[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17090
I commented further on the [JIRA](https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15898855&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15898855). Sorry if my other comments here and on JIRA were unclear. But the proposed schema for input to `RankingEvaluator` is:

### Schema 1
```
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+------+-------+------+----------+
```

You will notice that the `rating` and `prediction` columns can be `null`. This is by design. There are three cases shown above:
1. 1st row indicates a (user-item) pair that occurs in *both* the ground-truth set *and* the top-k predictions;
2. 2nd row indicates a (user-item) pair that occurs in the ground-truth set, *but not* in the top-k predictions;
3. 3rd row indicates a (user-item) pair that occurs in the top-k predictions, *but not* in the ground-truth set.

_Note_ for reference, the input to the current `mllib` `RankingMetrics` is:

### Schema 2
```
RDD[(true labels array, predicted labels array)], i.e. RDD of ([318, 3424, 7139,...], [81191, 93040, 31...])
```

(So actually neither of the above schemas is easily compatible with the return schema here - but I think it is not really necessary to match the `mllib.RankingMetrics` format)

### ALS cross-validation
My proposal for fitting ALS into cross-validation is that `ALSModel.transform` will output a DF of **Schema 1** - *only* when the parameters `k` and `recommendFor` are appropriately set, and the input DF contains both `user` and `item` columns. In practice, this scenario will occur during cross-validation only. So what I am saying is that ALS itself (not the evaluator) must know how to return the correct DataFrame output from `transform` such that it can be used in a cross-validation as input to the `RankingEvaluator`.

__Concretely:__
```scala
val als = new ALS().setRecommendFor("user").setK(10)
val validator = new TrainValidationSplit()
  .setEvaluator(new RankingEvaluator().setK(10))
  .setEstimator(als)
  .setEstimatorParamMaps(...)
val bestModel = validator.fit(ratings)
```

So while it is complex under the hood - to users it's simply a case of setting 2 params and the rest is as normal.

Now, we have the best model selected by cross-validation. We can make recommendations using these convenience methods (I think it will need a cast):
```scala
val recommendations = bestModel.asInstanceOf[ALSModel].recommendItemsforUsers(10)
```

Alternatively, the `transform` version looks like this:
```scala
val usersDF = ...
+------+
|userId|
+------+
|     1|
|     2|
|     3|
+------+

val recommendations = bestModel.transform(usersDF)
```

So the questions:
1. should we support the above `transform`-based recommendations? Or only support it for cross-validation purposes as a special case?
2. if we do, what should the output schema of the above `transform` version look like? It must certainly match the output of the `recommendX` methods. The options are:

(1) The schema in this PR:
**Pros**: as you mention above - also more "compact"
**Cons**: doesn't match up so closely with the `transform` "cross-validation" schema above

(2) The schema below. It is basically an "exploded" version of option (1):
```
+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|     1|      1|       4.3|
|     1|      5|       3.2|
|     1|      9|       2.1|
+------+-------+----------+
```
**Pros**: matches more closely with the cross-validation / evaluator input format.
Perhaps slightly more "dataframe-like". **Cons**: less compact; lose ordering?; may require more munging to save to external data stores etc. Anyway sorry for hijacking this PR discussion - but as I think you can see, the evaluator / ALS transform interplay is a bit subtle and requires some thought to get the right approach.
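To make option (2) concrete, here is a minimal, self-contained sketch of deriving the exploded layout from a nested top-k output. The nested representation (an array of (movieId, rating) pairs per user) and all column names are assumptions for illustration, not the PR's actual schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().master("local[*]").appName("explode-recs-sketch").getOrCreate()
import spark.implicits._

// One row per user; "recommendations" is an array of (movieId, rating) pairs (struct fields _1/_2).
val nested = Seq(
  (1, Seq((1, 4.3), (5, 3.2), (9, 2.1)))
).toDF("userId", "recommendations")

// Explode into one row per (user, item) with a flat prediction column -- option (2)'s shape.
val exploded = nested
  .select($"userId", explode($"recommendations").as("rec"))
  .select($"userId", $"rec._1".as("movieId"), $"rec._2".as("prediction"))

exploded.show()
```

The reverse direction (collecting the flat rows back into a per-user array) is also possible with `collect_list`, though the within-array ordering would then need to be re-established explicitly.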
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104605419 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala --- @@ -17,89 +17,55 @@ package org.apache.spark.sql.hive +import org.apache.spark.SparkContext import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.analysis.Analyzer -import org.apache.spark.sql.execution.SparkPlanner +import org.apache.spark.sql.catalyst.analysis.{Analyzer, FunctionRegistry} +import org.apache.spark.sql.catalyst.parser.ParserInterface +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.execution.{QueryExecution, SparkPlanner, SparkSqlParser} import org.apache.spark.sql.execution.datasources._ import org.apache.spark.sql.hive.client.HiveClient -import org.apache.spark.sql.internal.SessionState +import org.apache.spark.sql.internal.{SessionState, SharedState, SQLConf} +import org.apache.spark.sql.streaming.StreamingQueryManager /** * A class that holds all session-specific state in a given [[SparkSession]] backed by Hive. + * @param catalog A Hive client used for interacting with the metastore. + * @param analyzer An analyzer that uses the Hive metastore. + * @param plannerCreator Lambda to create a [[SparkPlanner]] that converts optimized logical + * plans to physical plans. */ -private[hive] class HiveSessionState(sparkSession: SparkSession) - extends SessionState(sparkSession) { - - self => - - /** - * A Hive client used for interacting with the metastore. - */ - lazy val metadataHive: HiveClient = - sparkSession.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client.newSession() - - /** - * Internal catalog for managing table and database states. - */ - override lazy val catalog = { -new HiveSessionCatalog( - sparkSession.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog], - sparkSession.sharedState.globalTempViewManager, - sparkSession, - functionResourceLoader, - functionRegistry, +private[hive] class HiveSessionState( +sparkContext: SparkContext, +sharedState: SharedState, +conf: SQLConf, +experimentalMethods: ExperimentalMethods, +functionRegistry: FunctionRegistry, +override val catalog: HiveSessionCatalog, +sqlParser: ParserInterface, +val metadataHive: HiveClient, +analyzer: Analyzer, +streamingQueryManager: StreamingQueryManager, +queryExecutionCreator: LogicalPlan => QueryExecution, +val plannerCreator: () => SparkPlanner) --- End diff -- How about adding `val planner` to `SessionState`? So far, the interface of `HiveSessionState` looks a little bit complex to me.
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104604870 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala --- @@ -17,89 +17,55 @@ package org.apache.spark.sql.hive +import org.apache.spark.SparkContext import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.analysis.Analyzer -import org.apache.spark.sql.execution.SparkPlanner +import org.apache.spark.sql.catalyst.analysis.{Analyzer, FunctionRegistry} +import org.apache.spark.sql.catalyst.parser.ParserInterface +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.execution.{QueryExecution, SparkPlanner, SparkSqlParser} import org.apache.spark.sql.execution.datasources._ import org.apache.spark.sql.hive.client.HiveClient -import org.apache.spark.sql.internal.SessionState +import org.apache.spark.sql.internal.{SessionState, SharedState, SQLConf} +import org.apache.spark.sql.streaming.StreamingQueryManager /** * A class that holds all session-specific state in a given [[SparkSession]] backed by Hive. + * @param catalog A Hive client used for interacting with the metastore. + * @param analyzer An analyzer that uses the Hive metastore. + * @param plannerCreator Lambda to create a [[SparkPlanner]] that converts optimized logical + * plans to physical plans. */ -private[hive] class HiveSessionState(sparkSession: SparkSession) - extends SessionState(sparkSession) { - - self => - - /** - * A Hive client used for interacting with the metastore. - */ - lazy val metadataHive: HiveClient = - sparkSession.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client.newSession() - - /** - * Internal catalog for managing table and database states. - */ - override lazy val catalog = { -new HiveSessionCatalog( - sparkSession.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog], - sparkSession.sharedState.globalTempViewManager, - sparkSession, - functionResourceLoader, - functionRegistry, +private[hive] class HiveSessionState( +sparkContext: SparkContext, +sharedState: SharedState, +conf: SQLConf, +experimentalMethods: ExperimentalMethods, +functionRegistry: FunctionRegistry, +override val catalog: HiveSessionCatalog, +sqlParser: ParserInterface, +val metadataHive: HiveClient, --- End diff -- This is for avoiding using `lazy val`?
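For background on the trade-off being asked about, a small self-contained illustration (not Spark code; the class names are made up): a `lazy val` hides construction inside the class and defers it until first access, while a plain constructor parameter forces the caller to build and inject the value eagerly, which presumably is what makes the refactored session state easier to construct and clone explicitly.

```scala
class ExpensiveClient {
  println("building client") // side effect just to show when construction happens
}

class LazyState {
  lazy val client: ExpensiveClient = new ExpensiveClient // built lazily, on first access
}

class InjectedState(val client: ExpensiveClient) // built eagerly by whoever constructs the state

object LazyVsInjected extends App {
  val lazyState = new LazyState            // nothing printed yet
  lazyState.client                         // "building client" printed here
  new InjectedState(new ExpensiveClient)   // "building client" printed immediately
}
```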
[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17178#discussion_r104604621 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -1342,28 +1342,52 @@ test_that("column functions", { df <- read.json(mapTypeJsonPath) j <- collect(select(df, alias(to_json(df$info), "json"))) expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}") - df <- as.DataFrame(j) - schema <- structType(structField("age", "integer"), - structField("height", "double")) - s <- collect(select(df, alias(from_json(df$json, schema), "structcol"))) - expect_equal(ncol(s), 1) - expect_equal(nrow(s), 3) - expect_is(s[[1]][[1]], "struct") - expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } ))) - - # passing option - df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) - schema2 <- structType(structField("date", "date")) - expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))), -error = function(e) { stop(e) }), - paste0(".*(java.lang.NumberFormatException: For input string:).*")) - s <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) - expect_is(s[[1]][[1]]$date, "Date") - expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21") - - # check for unparseable - df <- as.DataFrame(list(list("a" = ""))) - expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA) + + schemas <- list(structType(structField("age", "integer"), structField("height", "double")), + "struct") --- End diff -- Yes.. I persuaded myself that it is a valid string for `structField`. If you prefer the optional-parameter approach, I could try that. Otherwise, let me close this for now if you are not sure about either way :).
[GitHub] spark issue #17146: [SPARK-19806][ML][PySpark] PySpark GeneralizedLinearRegr...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17146 @actuaryzhang would you take a look at this one? If I recall correctly, it's one option we considered for the R API.
[GitHub] spark pull request #17146: [SPARK-19806][ML][PySpark] PySpark GeneralizedLin...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17146#discussion_r104604254 --- Diff: python/pyspark/ml/regression.py --- @@ -1344,40 +1347,53 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha family = Param(Params._dummy(), "family", "The name of family which is a description of " + "the error distribution to be used in the model. Supported options: " + - "gaussian (default), binomial, poisson and gamma.", + "gaussian (default), binomial, poisson, gamma and tweedie.", typeConverter=TypeConverters.toString) link = Param(Params._dummy(), "link", "The name of link function which provides the " + "relationship between the linear predictor and the mean of the distribution " + "function. Supported options: identity, log, inverse, logit, probit, cloglog " + "and sqrt.", typeConverter=TypeConverters.toString) linkPredictionCol = Param(Params._dummy(), "linkPredictionCol", "link prediction (linear " + "predictor) column name", typeConverter=TypeConverters.toString) +variancePower = Param(Params._dummy(), "variancePower", "The power in the variance function " + + "of the Tweedie distribution which characterizes the relationship " + + "between the variance and mean of the distribution. Only applicable " + + "for the Tweedie family. Supported values: 0 and [1, Inf).", + typeConverter=TypeConverters.toFloat) +linkPower = Param(Params._dummy(), "linkPower", "The index in the power link function. " + + "Only applicable for the Tweedie family.", + typeConverter=TypeConverters.toFloat) @keyword_only def __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction", family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, --- End diff -- hmm, it sounds like `link` should really be `None` then
[GitHub] spark issue #16896: [SPARK-19561][Python] cast TimestampType.toInternal outp...
Github user davies commented on the issue: https://github.com/apache/spark/pull/16896 lgtm, will merge it when I get a chance.
[GitHub] spark issue #16290: [SPARK-18817] [SPARKR] [SQL] Set default warehouse dir t...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16290 so based on this comment https://github.com/apache/spark/pull/16330#issuecomment-282101389 doesn't it mean we shouldn't set the warehouse dir to be under tempdir()?
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17096 Remaining changes LGTM. cc @holdenk
[GitHub] spark issue #17182: [SPARK-19840][SQL] Disallow creating permanent functions...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17182 **[Test build #74078 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74078/testReport)** for PR 17182 at commit [`7641789`](https://github.com/apache/spark/commit/7641789c00360ed1ba0fe4475b350415fea56c95).
[GitHub] spark issue #17186: [SPARK-19846][SQL] Add a flag to disable constraint prop...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17186 Merged build finished. Test PASSed.
[GitHub] spark issue #17186: [SPARK-19846][SQL] Add a flag to disable constraint prop...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17186 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74071/ Test PASSed.
[GitHub] spark issue #17186: [SPARK-19846][SQL] Add a flag to disable constraint prop...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17186 **[Test build #74071 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74071/testReport)** for PR 17186 at commit [`44e494b`](https://github.com/apache/spark/commit/44e494bd1725645a2ce13321ec846098d8da6ae4). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class InferFiltersFromConstraints(conf: CatalystConf)` * `case class PruneFilters(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper ` * `case class EliminateOuterJoin(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper `
[GitHub] spark issue #16856: [SPARK-19516][DOC] update public doc to use SparkSession...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16856 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74077/ Test PASSed.
[GitHub] spark issue #16856: [SPARK-19516][DOC] update public doc to use SparkSession...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16856 **[Test build #74077 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74077/testReport)** for PR 16856 at commit [`e284665`](https://github.com/apache/spark/commit/e2846652588d95a9b5e393f974846cb5d0ce2e31). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16856: [SPARK-19516][DOC] update public doc to use SparkSession...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16856 Merged build finished. Test PASSed.
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/16867 Thanks a lot for the comments. I refined it accordingly :)
[GitHub] spark pull request #17144: [SPARK-19803][TEST] flaky BlockManagerReplication...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17144#discussion_r104602172 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala --- @@ -494,7 +494,9 @@ class BlockManagerProactiveReplicationSuite extends BlockManagerReplicationBehav val newLocations = master.getLocations(blockId).toSet logInfo(s"New locations : $newLocations") -assert(newLocations.size === replicationFactor) +eventually(timeout(5 seconds), interval(10 millis)) { + assert(newLocations.size === replicationFactor) --- End diff -- Also can you remove the two sleeps above now?
[GitHub] spark pull request #17144: [SPARK-19803][TEST] flaky BlockManagerReplication...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17144#discussion_r104602103 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala --- @@ -494,7 +494,9 @@ class BlockManagerProactiveReplicationSuite extends BlockManagerReplicationBehav val newLocations = master.getLocations(blockId).toSet logInfo(s"New locations : $newLocations") -assert(newLocations.size === replicationFactor) +eventually(timeout(5 seconds), interval(10 millis)) { + assert(newLocations.size === replicationFactor) --- End diff -- line 495 needs to be in here too -- otherwise you're continually checking the same set of locations
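A sketch of the structure being suggested (reusing the suite's existing `master`, `blockId`, `replicationFactor`, and `logInfo`, which are assumed to be in scope): recomputing the locations inside `eventually` means each retry re-queries the master instead of re-checking a stale snapshot, which also makes the preceding sleeps unnecessary.

```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Poll until replication has actually caught up, re-fetching the locations on every attempt.
eventually(timeout(5.seconds), interval(10.millis)) {
  val newLocations = master.getLocations(blockId).toSet
  logInfo(s"New locations : $newLocations")
  assert(newLocations.size === replicationFactor)
}
```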
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16867 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74070/ Test PASSed.
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16867 Merged build finished. Test PASSed.
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16867 **[Test build #74070 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74070/testReport)** for PR 16867 at commit [`ee4c486`](https://github.com/apache/spark/commit/ee4c4865036063ff1bc6de0a5ebccbcdd5711717). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...
Github user carlos-verdes commented on the issue: https://github.com/apache/spark/pull/15229
Hi @rxin,
In Hive you have two levels, the partition and the buckets. The partitions are translated to folders on HDFS, for example:
```bash
/apps/hive/warehouse/model_table/date=6
```
Where model_table is the name of the table and date is the partition. Inside a folder you will have n files, and Hive lets you decide how many files you want to create (buckets) and which data you want to store within. If you create a table like this in Hive:
```sql
create table events (
  timestamp: long,
  userId: String,
  event: String
)
partitioned by (event_date int)
clustered by (userId) sorted by (userId, timestamp) into 10 buckets;
```
Then there will be only 10 files per partition, and all the events for one user will be in a single bucket, sorted by time. If you insert data into this table using the following query in Hive, you will see that the clustering policy is respected:
```sql
set hive.enforce.bucketing = true; -- (Note: Not needed in Hive 2.x onward)
from event_feed_source e
insert overwrite table events partition (event_date = 20170307)
select e.*, 20170307 where event_day = 20170307;
```
However, if you do the same insert with Spark:
```scala
sqlContext.sql("insert overwrite table events partition (event_date = 20170307) select e.*,1 from event_feed_source e")
```
you will see that the data is stored with the same partitioning as the source dataset.
What is the benefit of respecting the Hive clustering policy? The main benefit is to avoid shuffles and to control the number of partitions.
To give an example, we have a pipeline that reads thousands of events per user and saves them into another table (the model), which means the events table is going to have many times more data than the model table (imagine a factor of 10x).
The first point is that, if the source data are clustered properly, we can read all the events per user without a shuffle (I mean that something like `events.groupBy(user).mapValues(_.sortBy(timestamp))` will be done without a shuffle).
The second point is when we generate the model RDD/Dataset from the event RDD/Dataset. Spark respects the source partitioning (unless you indicate otherwise), which means it is going to save into Hive 10 times as many files for the model as needed (not respecting the Hive clustering policy). This implies that we have 10x more partitions than needed and also that the queries over the model table are not "clustered"... which means a full scan every time we need to run a query (a full scan over 10 times the optimal number of partitions).
I hope this clarifies the point about Hive clustering ;)
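For comparison, Spark 2.x already exposes its own bucketing on write; a minimal sketch follows (the target table name is illustrative, and the source table/columns follow the example above). The caveat, which is essentially the gap described in this discussion, is that the layout and metadata written by `bucketBy`/`sortBy` are Spark-specific and were not interoperable with Hive's `clustered by ... sorted by` bucketing at the time.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Write a bucketed, sorted copy of the events using Spark's own bucketing metadata.
spark.table("event_feed_source")
  .write
  .bucketBy(10, "userId")          // 10 buckets hashed on userId
  .sortBy("userId", "timestamp")   // sort within each bucket
  .saveAsTable("events_bucketed")  // hypothetical target table name
```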
[GitHub] spark issue #16896: [SPARK-19561][Python] cast TimestampType.toInternal outp...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16896 +1 LGTM. Could you review and merge this please, @davies?
[GitHub] spark issue #16856: [SPARK-19516][DOC] update public doc to use SparkSession...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16856 **[Test build #74077 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74077/testReport)** for PR 16856 at commit [`e284665`](https://github.com/apache/spark/commit/e2846652588d95a9b5e393f974846cb5d0ce2e31).
[GitHub] spark pull request #16896: [SPARK-19561][Python] cast TimestampType.toIntern...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/16896#discussion_r104601283 --- Diff: python/pyspark/sql/types.py --- @@ -189,7 +189,7 @@ def toInternal(self, dt): if dt is not None: seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo else time.mktime(dt.timetuple())) -return int(seconds) * 100 + dt.microsecond +return long(seconds) * 100 + dt.microsecond --- End diff -- Yep. It looks to me like every review comment has been applied.
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16867 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74069/ Test PASSed.
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16867 Merged build finished. Test PASSed.
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16867 **[Test build #74069 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74069/testReport)** for PR 16867 at commit [`a2381b6`](https://github.com/apache/spark/commit/a2381b6e96b3e05cce83dcae6af960f33e9243c8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16856: [SPARK-19516][DOC] update public doc to use Spark...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16856#discussion_r104600900 --- Diff: docs/quick-start.md --- @@ -438,8 +412,7 @@ Lines with a: 46, Lines with b: 23 # Where to Go from Here Congratulations on running your first Spark application! -* For an in-depth overview of the API, start with the [Spark programming guide](programming-guide.html), - or see "Programming Guides" menu for other components. +* For an in-depth overview of the API, start with the [RDD programming guide](rdd-programming-guide.html) and the [SQL programming guide](sql-programming-guide.html), or see "Programming Guides" menu for other components. --- End diff -- users still need to read the RDD programming guide to learn some basic concepts like driver, executor, etc.
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104600104 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala --- @@ -64,7 +64,7 @@ private[spark] object UIData { var numCompletedTasks: Int = 0, var numSkippedTasks: Int = 0, var numFailedTasks: Int = 0, -var numKilledTasks: Int = 0, +var numKilledTasks: Map[String, Int] = Map.empty, --- End diff -- reasonToNumKilledTasks? (here and the other two places)
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104599195 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala --- @@ -54,6 +54,9 @@ private[spark] trait TaskScheduler { // Cancel a stage. def cancelTasks(stageId: Int, interruptThread: Boolean): Unit + // Kill a task. --- End diff -- A comment isn't very helpful if it just adds "a" to the method name :). Can you remove this? Also as above "killAndReschedule" would be a better name here
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104598647 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -239,8 +244,9 @@ private[spark] class Executor( */ @volatile var task: Task[Any] = _ -def kill(interruptThread: Boolean): Unit = { - logInfo(s"Executor is trying to kill $taskName (TID $taskId)") +def kill(interruptThread: Boolean, reason: String): Unit = { + logInfo(s"Executor is trying to kill $taskName (TID $taskId), reason: $reason") --- End diff -- extra paren
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104599383 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging { } /** + * Kill a given task. It will be retried. --- End diff -- "Kill and reschedule the given task."? (makes it slightly more obvious why this might be useful; "It will be retried" sounds almost like the retry is accidental)
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104599319 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging { } /** + * Kill a given task. It will be retried. + * + * @param taskId the task ID to kill + */ + def killTask(taskId: Long): Unit = { +killTask(taskId, "cancelled") --- End diff -- "killed by user via SparkContext.killTask"? These things can be hard to debug when they're unexpected and "cancelled" isn't very helpful
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104600582 --- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala --- @@ -538,10 +538,37 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu } } + test("Killing tasks") { +sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")) + +SparkContextSuite.isTaskStarted = false +SparkContextSuite.taskKilled = false + --- End diff -- Can you add a comment about what this test is doing? e.g., Launches one task that will run forever. Once the SparkListener detects the task has started, kill and re-schedule it. The second run of the task will complete immediately. If this test times out, then the first version of the task wasn't killed successfully.
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104600689 --- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala --- @@ -538,10 +538,37 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu } } + test("Killing tasks") { +sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")) + +SparkContextSuite.isTaskStarted = false +SparkContextSuite.taskKilled = false + +val listener = new SparkListener { + override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = { +eventually(timeout(10.seconds)) { + assert(SparkContextSuite.isTaskStarted) +} +if (!SparkContextSuite.taskKilled) { + SparkContextSuite.taskKilled = true + sc.killTask(taskStart.taskInfo.taskId, "first attempt will hang") +} + } +} +sc.addSparkListener(listener) +sc.parallelize(1 to 1).foreach { x => --- End diff -- Can you wrap this part in an eventually statement so that the test will fail relatively quickly (i.e., in less than the 250m overall test timeout) if the kill fails
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104599420 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging { } /** + * Kill a given task. It will be retried. + * + * @param taskId the task ID to kill + */ + def killTask(taskId: Long): Unit = { +killTask(taskId, "cancelled") --- End diff -- Also why not make this a default argument below and then have just one method?
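A minimal, self-contained sketch of that suggestion (the names and the stand-in `sendKill` are illustrative, not Spark's actual internals): a single method with a default, descriptive reason replaces the zero-argument overload that hard-coded "cancelled".

```scala
object KillTaskSketch {
  // Stand-in for whatever the PR actually calls on the scheduler backend.
  private def sendKill(taskId: Long, interruptThread: Boolean, reason: String): Unit =
    println(s"killing task $taskId (interruptThread=$interruptThread): $reason")

  // One public method; callers that don't supply a reason still get a debuggable default.
  def killTask(
      taskId: Long,
      interruptThread: Boolean = false,
      reason: String = "killed by user via SparkContext.killTask"): Unit =
    sendKill(taskId, interruptThread, reason)
}

// Usage:
//   KillTaskSketch.killTask(42L)
//   KillTaskSketch.killTask(42L, reason = "first attempt will hang")
```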
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104599899 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -710,7 +710,11 @@ private[spark] class TaskSetManager( logInfo(s"Killing attempt ${attemptInfo.attemptNumber} for task ${attemptInfo.id} " + s"in stage ${taskSet.id} (TID ${attemptInfo.taskId}) on ${attemptInfo.host} " + s"as the attempt ${info.attemptNumber} succeeded on ${info.host}") - sched.backend.killTask(attemptInfo.taskId, attemptInfo.executorId, true) + sched.backend.killTask( +attemptInfo.taskId, +attemptInfo.executorId, +interruptThread = true, +reason = "another attempt succeeded") --- End diff -- Can you put the longer version that's in the info log here? (s"Attempt${info.attemptNumber} succeeded on ${info.host}")
[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17178#discussion_r104600695 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -1342,28 +1342,52 @@ test_that("column functions", { df <- read.json(mapTypeJsonPath) j <- collect(select(df, alias(to_json(df$info), "json"))) expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}") - df <- as.DataFrame(j) - schema <- structType(structField("age", "integer"), - structField("height", "double")) - s <- collect(select(df, alias(from_json(df$json, schema), "structcol"))) - expect_equal(ncol(s), 1) - expect_equal(nrow(s), 3) - expect_is(s[[1]][[1]], "struct") - expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } ))) - - # passing option - df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) - schema2 <- structType(structField("date", "date")) - expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))), -error = function(e) { stop(e) }), - paste0(".*(java.lang.NumberFormatException: For input string:).*")) - s <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) - expect_is(s[[1]][[1]]$date, "Date") - expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21") - - # check for unparseable - df <- as.DataFrame(list(list("a" = ""))) - expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA) + + schemas <- list(structType(structField("age", "integer"), structField("height", "double")), + "struct") --- End diff -- I wonder if this is too loosely typed, and whether the `struct` string format would be hard to explain/illustrate to an R user.
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104600696 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -278,6 +278,8 @@ private[hive] class HiveClientImpl( state.getConf.setClassLoader(clientLoader.classLoader) // Set the thread local metastore client to the client associated with this HiveClientImpl. Hive.set(client) +// Replace conf in the thread local Hive with current conf +Hive.get(conf) --- End diff -- Because `IsolatedClientLoader` is shared? If we do not make this change, does any test case fail?
[GitHub] spark issue #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PARTITION_P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17176 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74072/ Test FAILed.
[GitHub] spark issue #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PARTITION_P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17176 **[Test build #74072 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74072/testReport)** for PR 17176 at commit [`22b1f53`](https://github.com/apache/spark/commit/22b1f538b31c50a6328afa9051e1e0ddc29da072). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PARTITION_P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17176 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16944: [SPARK-19611][SQL] Introduce configurable table schema i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16944 **[Test build #74076 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74076/testReport)** for PR 16944 at commit [`9bf6a3a`](https://github.com/apache/spark/commit/9bf6a3a75d9a9f287d5b05eab4071c0c3e9db267). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...
Github user budde commented on a diff in the pull request: https://github.com/apache/spark/pull/16944#discussion_r104598894 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -476,46 +476,6 @@ object ParquetFileFormat extends Logging { } /** - * Reconciles Hive Metastore case insensitivity issue and data type conflicts between Metastore - * schema and Parquet schema. - * - * Hive doesn't retain case information, while Parquet is case sensitive. On the other hand, the - * schema read from Parquet files may be incomplete (e.g. older versions of Parquet doesn't - * distinguish binary and string). This method generates a correct schema by merging Metastore - * schema data types and Parquet schema field names. - */ - def mergeMetastoreParquetSchema( - metastoreSchema: StructType, - parquetSchema: StructType): StructType = { -def schemaConflictMessage: String = - s"""Converting Hive Metastore Parquet, but detected conflicting schemas. Metastore schema: - |${metastoreSchema.prettyJson} - | - |Parquet schema: - |${parquetSchema.prettyJson} - """.stripMargin - -val mergedParquetSchema = mergeMissingNullableFields(metastoreSchema, parquetSchema) --- End diff -- Oversight on my part. Removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104598515 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/executor/MesosExecutorBackend.scala --- @@ -104,7 +104,8 @@ private[spark] class MesosExecutorBackend logError("Received KillTask but executor was null") } else { // TODO: Determine the 'interruptOnCancel' property set for the given job. - executor.killTask(t.getValue.toLong, interruptThread = false) + executor.killTask( +t.getValue.toLong, interruptThread = false, reason = "killed intentionally") --- End diff -- can you make this say "killed by Mesos" instead? (assuming this is correct?) "intentionally" doesn't provide much value since it's always intentional --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17097: [SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17097 @gatorsmile I have put the JIRA number in this PR title without adding a test, because the new behavior this PR introduces obviously fixes that bug. I don't want to add a lot of end-to-end tests to prove a single piece of functionality. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104598293 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -467,7 +474,7 @@ private[spark] class TaskSchedulerImpl private[scheduler]( taskState: TaskState, reason: TaskFailedReason): Unit = synchronized { taskSetManager.handleFailedTask(tid, taskState, reason) -if (!taskSetManager.isZombie && taskState != TaskState.KILLED) { +if (!taskSetManager.isZombie) { --- End diff -- I need to do some git blaming to be 100% sure this is OK...I'll take a look tomorrow --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/17090 @MLnick OK I think I misunderstood some of your comments above then. I see the proposal in SPARK-14409 differs from this PR, so I agree it'd be nice to resolve it. We can make changes to this PR's schema as long as it happens soon. Here are the pros of each as I see them: 1. Nested schema (as in this PR): ```[user, Array((item, rating))]``` * Easy to work with both nested & flattened schema (```df.select("recommendations.item")```) (AFAIK there's no simple way to zip and nest the 2 columns when starting with the flattened schema.) 2. Flattened schema (as in SPARK-14409): ```[user, Array(item), Array(rating)]``` * More efficient to store in Row-based formats like Avro I'm not sure if there's a performance difference in the formats when stored in Tungsten Rows. I think not, but that'd be good to know. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
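To make the trade-off above concrete, here is a small sketch with made-up data (not the actual `ALSModel` output) showing that the nested layout supports both direct field access and a simple projection into the flattened arrays layout, whereas going the other way would require zipping and re-nesting the two array columns.

```scala
import org.apache.spark.sql.SparkSession

object RecommendationSchemaSketch {
  // Named fields so that "recommendations.item" resolves; purely illustrative.
  case class Rec(item: Int, rating: Float)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rec schema sketch").getOrCreate()
    import spark.implicits._

    // Option 1: nested schema [user, Array((item, rating))]
    val nested = Seq(
      (1, Seq(Rec(1, 4.3f), Rec(5, 3.2f), Rec(9, 2.1f))),
      (2, Seq(Rec(7, 4.9f), Rec(3, 4.1f)))
    ).toDF("user", "recommendations")

    // Field access works directly on the nested form.
    nested.select($"user", $"recommendations.item").show(false)

    // Option 2 is a simple projection away: [user, Array(item), Array(rating)].
    // The reverse direction (zipping the two arrays back into an array of
    // structs) is the awkward case noted in the comment above.
    nested.select(
      $"user",
      $"recommendations.item".as("items"),
      $"recommendations.rating".as("ratings")
    ).show(false)

    spark.stop()
  }
}
```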
[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16944#discussion_r104598287 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -476,46 +476,6 @@ object ParquetFileFormat extends Logging { } /** - * Reconciles Hive Metastore case insensitivity issue and data type conflicts between Metastore - * schema and Parquet schema. - * - * Hive doesn't retain case information, while Parquet is case sensitive. On the other hand, the - * schema read from Parquet files may be incomplete (e.g. older versions of Parquet doesn't - * distinguish binary and string). This method generates a correct schema by merging Metastore - * schema data types and Parquet schema field names. - */ - def mergeMetastoreParquetSchema( - metastoreSchema: StructType, - parquetSchema: StructType): StructType = { -def schemaConflictMessage: String = - s"""Converting Hive Metastore Parquet, but detected conflicting schemas. Metastore schema: - |${metastoreSchema.prettyJson} - | - |Parquet schema: - |${parquetSchema.prettyJson} - """.stripMargin - -val mergedParquetSchema = mergeMissingNullableFields(metastoreSchema, parquetSchema) --- End diff -- No other places use `mergeMissingNullableFields` anymore. If we remove `mergeMetastoreParquetSchema`, `mergeMissingNullableFields` can be removed too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
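For context on what the removed method did, here is a rough, illustrative sketch of the reconciliation described in its doc comment: keep the Parquet field names (which preserve case) but trust the Metastore data types (e.g. string vs binary), matching names case-insensitively. This is not the deleted Spark code itself, only a sketch of the technique it documented.

```scala
import org.apache.spark.sql.types.{StructField, StructType}

object SchemaReconciliationSketch {
  // Illustrative only: merge Metastore data types with Parquet field names.
  def reconcile(metastoreSchema: StructType, parquetSchema: StructType): StructType = {
    val parquetByLowerName: Map[String, StructField] =
      parquetSchema.fields.map(f => f.name.toLowerCase -> f).toMap
    StructType(metastoreSchema.fields.map { metastoreField =>
      parquetByLowerName.get(metastoreField.name.toLowerCase) match {
        // Keep the Parquet casing but take the (authoritative) Metastore type.
        case Some(parquetField) => parquetField.copy(dataType = metastoreField.dataType)
        // Field missing from the Parquet footer: fall back to the Metastore field.
        case None => metastoreField
      }
    })
  }
}
```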
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104598221 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging { } /** + * Kill a given task. It will be retried. + * + * @param taskId the task ID to kill + */ + def killTask(taskId: Long): Unit = { +killTask(taskId, "cancelled") + } + + /** + * Kill a given task. It will be retried. + * + * @param taskId the task ID to kill + * @param reason the reason for killing the task, which should be a short string + */ + def killTask(taskId: Long, reason: String): Unit = { --- End diff -- What about calling it "killAndRescheduleTask" then? Otherwise kill is a little misleading -- since where we use it elsewhere (to kill a stage) it implies no retry --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17173: [SPARK-19832][SQL]DynamicPartitionWriteTask get p...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17173 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17187: [SPARK-19847][SQL] port hive read to FileFormat API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17187 **[Test build #74075 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74075/testReport)** for PR 17187 at commit [`ad47887`](https://github.com/apache/spark/commit/ad478870c652553b8b225a569e460fb6ccef0c36). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15363 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74068/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15363 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17173: [SPARK-19832][SQL]DynamicPartitionWriteTask get partitio...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17173 thanks, merging to master! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15363 **[Test build #74068 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74068/testReport)** for PR 15363 at commit [`072e3a9`](https://github.com/apache/spark/commit/072e3a95b0d09b185dfd2ce270bdaadea3825431). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17187: [SPARK-19847][SQL] port hive read to FileFormat API
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17187 cc @sameeragarwal @rxin @gatorsmile --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17187: [SPARK-19847][SQL] port hive read to FileFormat A...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/17187 [SPARK-19847][SQL] port hive read to FileFormat API ## What changes were proposed in this pull request? Implement the read logic in `HiveFileFormat` to unify the table read path between data source and Hive serde tables. The major change is that a Hive partition may have a different serde, so the planner should put more information in `PartitionedFile` and send it to executors. Two things need to be improved in the future: 1. Due to the way we read Hive table files, we do not support reading a partial file yet, which may reduce the parallelism for large files. 2. Hive tables with storage handlers (non-file-based) still go to the old code path. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark hive-read Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17187.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17187 commit ad478870c652553b8b225a569e460fb6ccef0c36 Author: Wenchen Fan Date: 2017-03-02T07:15:42Z port hive read to FileFormat API --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
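A purely hypothetical sketch of the idea in the description above. This is not Spark's actual `PartitionedFile`; every name and field here is invented for illustration. The point is only that each planned file split carries enough per-partition serde metadata for the executor to pick the right deserializer, since different Hive partitions may use different serdes.

```scala
// Invented types for illustration only -- not the real PartitionedFile API.
case class IllustrativeFileSplit(
    filePath: String,
    start: Long,
    length: Long,
    serdeClassName: String,                // may differ per Hive partition
    serdeProperties: Map[String, String])  // e.g. field delimiters

object IllustrativeHiveRead {
  // The executor-side reader would dispatch on the serde info carried by each split.
  def describe(split: IllustrativeFileSplit): String =
    s"read ${split.filePath} [${split.start}, ${split.start + split.length}) " +
      s"with ${split.serdeClassName} and ${split.serdeProperties.size} serde properties"

  def main(args: Array[String]): Unit = {
    val splits = Seq(
      IllustrativeFileSplit("/warehouse/t/p=1/part-0", 0L, 128L * 1024 * 1024,
        "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe", Map("field.delim" -> ",")),
      IllustrativeFileSplit("/warehouse/t/p=2/part-0", 0L, 64L * 1024 * 1024,
        "org.apache.hadoop.hive.ql.io.orc.OrcSerde", Map.empty)
    )
    splits.map(describe).foreach(println)
  }
}
```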
[GitHub] spark issue #16290: [SPARK-18817] [SPARKR] [SQL] Set default warehouse dir t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16290 **[Test build #74074 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74074/testReport)** for PR 16290 at commit [`7a98b91`](https://github.com/apache/spark/commit/7a98b91274e6108a2fcbccf6e6e47d49964cf22c). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104596342 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala --- @@ -17,43 +17,70 @@ package org.apache.spark.sql.internal -import java.io.File - import org.apache.hadoop.conf.Configuration -import org.apache.hadoop.fs.Path +import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.{Analyzer, FunctionRegistry} import org.apache.spark.sql.catalyst.catalog._ import org.apache.spark.sql.catalyst.optimizer.Optimizer import org.apache.spark.sql.catalyst.parser.ParserInterface import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule import org.apache.spark.sql.execution._ -import org.apache.spark.sql.execution.command.AnalyzeTableCommand import org.apache.spark.sql.execution.datasources._ -import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryManager} +import org.apache.spark.sql.streaming.StreamingQueryManager import org.apache.spark.sql.util.ExecutionListenerManager /** * A class that holds all session-specific state in a given [[SparkSession]]. + * @param functionRegistry Internal catalog for managing functions registered by the user. + * @param catalog Internal catalog for managing table and database states. + * @param sqlParser Parser that extracts expressions, plans, table identifiers etc. from SQL texts. + * @param analyzer Logical query plan analyzer for resolving unresolved attributes and relations. + * @param streamingQueryManager Interface to start and stop + * [[org.apache.spark.sql.streaming.StreamingQuery]]s. + * @param queryExecutionCreator Lambda to create a [[QueryExecution]] from a [[LogicalPlan]] */ -private[sql] class SessionState(sparkSession: SparkSession) { +private[sql] class SessionState( +sparkContext: SparkContext, +sharedState: SharedState, +val conf: SQLConf, +val experimentalMethods: ExperimentalMethods, +val functionRegistry: FunctionRegistry, +val catalog: SessionCatalog, +val sqlParser: ParserInterface, +val analyzer: Analyzer, +val streamingQueryManager: StreamingQueryManager, +val queryExecutionCreator: LogicalPlan => QueryExecution) { - // Note: These are all lazy vals because they depend on each other (e.g. conf) and we - // want subclasses to override some of the fields. Otherwise, we would get a lot of NPEs. + /** + * Interface exposed to the user for registering user-defined functions. + * Note that the user-defined functions must be deterministic. + */ + val udf: UDFRegistration = new UDFRegistration(functionRegistry) /** - * SQL-specific key-value configurations. + * Logical query plan optimizer. --- End diff -- Nit: remove the extra space --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104596281 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala --- @@ -17,43 +17,70 @@ package org.apache.spark.sql.internal -import java.io.File - import org.apache.hadoop.conf.Configuration -import org.apache.hadoop.fs.Path +import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.{Analyzer, FunctionRegistry} import org.apache.spark.sql.catalyst.catalog._ import org.apache.spark.sql.catalyst.optimizer.Optimizer import org.apache.spark.sql.catalyst.parser.ParserInterface import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule import org.apache.spark.sql.execution._ -import org.apache.spark.sql.execution.command.AnalyzeTableCommand import org.apache.spark.sql.execution.datasources._ -import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryManager} +import org.apache.spark.sql.streaming.StreamingQueryManager import org.apache.spark.sql.util.ExecutionListenerManager /** * A class that holds all session-specific state in a given [[SparkSession]]. + * @param functionRegistry Internal catalog for managing functions registered by the user. + * @param catalog Internal catalog for managing table and database states. + * @param sqlParser Parser that extracts expressions, plans, table identifiers etc. from SQL texts. + * @param analyzer Logical query plan analyzer for resolving unresolved attributes and relations. + * @param streamingQueryManager Interface to start and stop + * [[org.apache.spark.sql.streaming.StreamingQuery]]s. + * @param queryExecutionCreator Lambda to create a [[QueryExecution]] from a [[LogicalPlan]] --- End diff -- Let us document all parms? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74065/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #74065 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74065/testReport)** for PR 17166 at commit [`170fa34`](https://github.com/apache/spark/commit/170fa34cdc607285a315cbb5c6fd98461ac10fe8). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class TaskKilled(reason: String) extends TaskFailedReason ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104594383 --- Diff: R/pkg/R/generics.R --- @@ -1420,6 +1420,17 @@ setGeneric("spark.posterior", function(object, newData) { standardGeneric("spark #' @export setGeneric("spark.perplexity", function(object, data) { standardGeneric("spark.perplexity") }) +#' @rdname spark.fpGrowth +#' @export +setGeneric("spark.fpGrowth", function(data, ...) { standardGeneric("spark.fpGrowth") }) + +#' @rdname spark.fpGrowth +#' @export +setGeneric("freqItemsets", function(object) { standardGeneric("freqItemsets") }) --- End diff -- We seem to follow the pattern `spark.something` - see LDA. Do you think it makes sense here too? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
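For reference, the generic names under discussion mirror the Scala `ml.fpm` API that the R wrapper delegates to. A minimal usage sketch, assuming the `FPGrowth`/`FPGrowthModel` API as of the 2.2 line (method names may differ slightly from the version being reviewed):

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

object FPGrowthSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("fpgrowth sketch").getOrCreate()
    import spark.implicits._

    // Transactions as an array<string> column, the same shape the R example builds with split().
    val data = Seq("1 2 5", "1 2 3 5", "1 2").map(_.split(" ")).toDF("features")

    val model = new FPGrowth()
      .setItemsCol("features")
      .setMinSupport(0.3)
      .setMinConfidence(0.8)
      .fit(data)

    model.freqItemsets.show(false)       // counterpart of freqItemsets(model) in R
    model.associationRules.show(false)   // counterpart of associationRules(model) in R
    model.transform(data).show(false)    // counterpart of predict(model, newData) in R

    spark.stop()
  }
}
```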
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104595228 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model --- End diff -- could you use the long form name (eg. look at LDA) and drop the word "Model" which we avoid using --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104595814 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/FPGrowthWrapper.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.hadoop.fs.Path +import org.json4s._ +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ --- End diff -- do we need these? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104594800 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model +#' +#' Provides FP-growth algorithm to mine frequent itemsets. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. +#' @param predictionCol Prediction column name. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d")) +#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("b", "a,c", "d")) +#' new_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted") +#' } +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + featuresCol = "features", predictionCol = "prediction") { +if (!is.numeric(minSupport) || minSupport < 0 || minSupport > 1) { + stop("minSupport should be a number [0, 1].") +} +if (!is.numeric(minConfidence) || minConfidence < 0 || minConfidence > 1) { + stop("minConfidence should be a number [0, 1].") +} + +jobj <- callJStatic("org.apache.spark.ml.r.FPGrowthWrapper", "fit", +data@sdf, minSupport, minConfidence, --- End diff -- you may want to `as.numeric` on `minSupport`, `minConfidence` in case someone is 
passing in an integer and `callJStatic` would fail to match the wrapper method --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
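The reason for the suggested coercion, sketched from the JVM side: SparkR's backend resolves the target method by argument types, so an R integer arriving as a JVM integer will not match a parameter declared as `Double`. The signature below is an assumption for illustration only and may not match the real `FPGrowthWrapper.fit`.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical stand-in for the wrapper the R code calls via callJStatic;
// parameter names and types are assumptions for illustration.
object IllustrativeFPGrowthWrapper {
  def fit(
      data: DataFrame,
      minSupport: Double,      // an R integer (JVM int) would fail to match this Double
      minConfidence: Double,
      featuresCol: String,
      predictionCol: String): Unit = {
    println(s"would fit FPGrowth(minSupport=$minSupport, minConfidence=$minConfidence) " +
      s"on column '$featuresCol', writing '$predictionCol'")
    // the real wrapper would construct and fit an ml.fpm.FPGrowth here
  }
}
```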
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104595735 --- Diff: R/pkg/inst/tests/testthat/test_mllib_fpm.R --- @@ -0,0 +1,74 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +library(testthat) + +context("MLlib frequent pattern mining") + +# Tests for MLlib frequent pattern mining algorithms in SparkR +sparkSession <- sparkR.session(enableHiveSupport = FALSE) + +test_that("spark.fpGrowth", { + data <- selectExpr(createDataFrame(data.frame(features = c( +"1,2", +"1,2", +"1,2,3", +"1,3" + ))), "split(features, ',') as features") + + model <- spark.fpGrowth(data, minSupport = 0.3, minConfidence = 0.8) + + itemsets <- collect(freqItemsets(model)) + + expected_itemsets <- data.frame( +items = I(list(list("3"), list("3", "1"), list("2"), list("2", "1"), list("1"))), +freq = c(2, 2, 3, 3, 4) + ) + + expect_equivalent(expected_itemsets, collect(freqItemsets(model))) --- End diff -- don't repeat `freqItemsets(model)` - use `itemsets` from above --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104594654 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model +#' +#' Provides FP-growth algorithm to mine frequent itemsets. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. +#' @param predictionCol Prediction column name. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d")) +#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("b", "a,c", "d")) +#' new_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted") +#' } +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + featuresCol = "features", predictionCol = "prediction") { --- End diff -- we generally avoid allow setting `predictionCol` too --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104594539 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model +#' +#' Provides FP-growth algorithm to mine frequent itemsets. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. +#' @param predictionCol Prediction column name. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d")) +#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("b", "a,c", "d")) +#' new_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted") +#' } +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + featuresCol = "features", predictionCol = "prediction") { --- End diff -- instead of `features` it should take a formula? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104595125 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model +#' +#' Provides FP-growth algorithm to mine frequent itemsets. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. +#' @param predictionCol Prediction column name. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d")) +#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("b", "a,c", "d")) +#' new_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted") +#' } +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + featuresCol = "features", predictionCol = "prediction") { +if (!is.numeric(minSupport) || minSupport < 0 || minSupport > 1) { + stop("minSupport should be a number [0, 1].") +} +if (!is.numeric(minConfidence) || minConfidence < 0 || minConfidence > 1) { + stop("minConfidence should be a number [0, 1].") +} + +jobj <- callJStatic("org.apache.spark.ml.r.FPGrowthWrapper", "fit", +data@sdf, minSupport, minConfidence, +featuresCol, predictionCol) +new("FPGrowthModel", jobj = jobj) + }) + +# Get frequent itemsets. 
+#' @param object a fitted FPGrowth model. +#' @return A DataFrame with frequent itemsets. +#' +#' @rdname spark.fpGrowth +#' @aliases freqItemsets,FPGrowthModel-method +#' @export +#' @note freqItemsets(FPGrowthModel) since 2.2.0 +setMethod("freqItemsets", signature(object = "FPGrowthModel"), + function(object) { +jobj <- object@jobj +freqItemsets <- callJMethod(jobj, "freqItemsets") +dataFrame(freqItemsets) --- End diff -- It might make sense to do this in a single line: ``` dataFrame(callJMethod(object@jobj, "freqItemsets")) ``` It might be more readable that way. Ditto with Association Rules below. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104595392 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model +#' +#' Provides FP-growth algorithm to mine frequent itemsets. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. +#' @param predictionCol Prediction column name. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d")) +#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") --- End diff -- instead of duplicating `createDataFrame`, set `itemsets <- createDataFrame(data.frame(features = c("a,b", "a,b,c", "c,d")))` btw, do we have real data to use instead? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104594212 --- Diff: R/pkg/DESCRIPTION --- @@ -54,5 +55,5 @@ Collate: 'types.R' 'utils.R' 'window.R' -RoxygenNote: 5.0.1 +RoxygenNote: 6.0.1 --- End diff -- let's revert this - new roxygen2 seems to have some new features we are not ready for yet --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104594501 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model +#' +#' Provides FP-growth algorithm to mine frequent itemsets. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. +#' @param predictionCol Prediction column name. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d")) +#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("b", "a,c", "d")) +#' new_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted") +#' } +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, --- End diff -- should it have `numPartitions`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104595454

--- Diff: R/pkg/R/mllib_fpm.R ---
@@ -0,0 +1,144 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration
+
+#' S4 class that represents a FPGrowthModel
+#'
+#' @param jobj a Java object reference to the backing Scala FPGrowthModel
+#' @export
+#' @note FPGrowthModel since 2.2.0
+setClass("FPGrowthModel", slots = list(jobj = "jobj"))
+
+#' FPGrowth Model
+#'
+#' Provides FP-growth algorithm to mine frequent itemsets.
+#'
+#' @param data A SparkDataFrame for training.
+#' @param minSupport Minimal support level.
+#' @param minConfidence Minimal confidence level.
+#' @param featuresCol Features column name.
+#' @param predictionCol Prediction column name.
+#' @param ... additional argument(s) passed to the method.
+#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model.
+#'
+#' @rdname spark.fpGrowth
+#' @name spark.fpGrowth
+#' @aliases spark.fpGrowth,SparkDataFrame-method
+#' @export
+#' @examples
+#' \dontrun{
+#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d"))
+#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features")
+#' model <- spark.fpGrowth(data)
+#'
+#' # Show frequent itemsets
+#' frequent_itemsets <- freqItemsets(model)
+#' showDF(frequent_itemsets)
--- End diff --

collapse this to `head(freqItemsets(model))`
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104595706

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging {
   }

   /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   */
+  def killTask(taskId: Long): Unit = {
+    killTask(taskId, "cancelled")
+  }
+
+  /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   * @param reason the reason for killing the task, which should be a short string
+   */
+  def killTask(taskId: Long, reason: String): Unit = {
--- End diff --

Ah, OK. For some reason I read it as: `killTask(long)` is kill without retry, and `killTask(long, string)` is kill with retry.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #74073 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74073/testReport)** for PR 17166 at commit [`f03de61`](https://github.com/apache/spark/commit/f03de619a5e2df41fa7d70dd574159da5b4fa4db).
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104595290

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala ---
@@ -1197,6 +1199,65 @@ class SessionCatalogSuite extends PlanTest {
     }
   }

+  test("clone SessionCatalog - temp views") {
+    val externalCatalog = newEmptyCatalog()
+    val original = new SessionCatalog(externalCatalog)
+    val tempTable1 = Range(1, 10, 1, 10)
+    original.createTempView("copytest1", tempTable1, overrideIfExists = false)
+
+    // check if tables copied over
+    val clone = original.clone(
+      SimpleCatalystConf(caseSensitiveAnalysis = true),
+      new Configuration(),
+      new SimpleFunctionRegistry,
+      CatalystSqlParser)
+    assert(original ne clone)
+    assert(clone.getTempView("copytest1") == Option(tempTable1))
+
+    // check if clone and original independent
+    clone.dropTable(TableIdentifier("copytest1"), ignoreIfNotExists = false, purge = false)
+    assert(original.getTempView("copytest1") == Option(tempTable1))
+
+    val tempTable2 = Range(1, 20, 2, 10)
+    original.createTempView("copytest2", tempTable2, overrideIfExists = false)
+    assert(clone.getTempView("copytest2").isEmpty)
+  }
+
+  test("clone SessionCatalog - current db") {
+    val externalCatalog = newEmptyCatalog()
+    externalCatalog.createDatabase(newDb("copytest1"), true)
+    externalCatalog.createDatabase(newDb("copytest2"), true)
+    externalCatalog.createDatabase(newDb("copytest3"), true)
+
+    val original = new SessionCatalog(externalCatalog)
+    val tempTable1 = Range(1, 10, 1, 10)
+    val db1 = "copytest1"
+    original.createTempView(db1, tempTable1, overrideIfExists = false)
+    original.setCurrentDatabase(db1)
+
+    // check if current db copied over
+    val clone = original.clone(
+      SimpleCatalystConf(caseSensitiveAnalysis = true),
+      new Configuration(),
+      new SimpleFunctionRegistry,
+      CatalystSqlParser)
+    assert(original ne clone)
+    assert(clone.getCurrentDatabase == db1)
+
+    // check if clone and original independent
+    val db2 = "copytest2"
+    val tempTable2 = Range(1, 20, 2, 20)
+    clone.createTempView(db2, tempTable2, overrideIfExists = false)
+    clone.setCurrentDatabase(db2)
+    assert(original.getCurrentDatabase == db1)
+
+    val db3 = "copytest3"
+    val tempTable3 = Range(1, 30, 2, 30)
+    original.createTempView(db3, tempTable3, overrideIfExists = false)
+    original.setCurrentDatabase(db3)
+    assert(clone.getCurrentDatabase == db2)
+  }
--- End diff --

How about?
```Scala
test("clone SessionCatalog - current db") {
  val externalCatalog = newEmptyCatalog()
  val db1 = "db1"
  val db2 = "db2"
  val db3 = "db3"
  val data = Range(1, 10, 1, 10)

  externalCatalog.createDatabase(newDb(db1), ignoreIfExists = true)
  externalCatalog.createDatabase(newDb(db2), ignoreIfExists = true)
  externalCatalog.createDatabase(newDb(db3), ignoreIfExists = true)

  val original = new SessionCatalog(externalCatalog)
  original.createTempView("view1", data, overrideIfExists = false)
  original.setCurrentDatabase(db1)

  // check if current db copied over
  val clone = original.clone(
    SimpleCatalystConf(caseSensitiveAnalysis = true),
    new Configuration(),
    new SimpleFunctionRegistry,
    CatalystSqlParser)
  assert(original != clone)
  assert(clone.getCurrentDatabase == db1)

  // check if clone and original independent
  clone.createTempView("view2", data, overrideIfExists = false)
  clone.setCurrentDatabase(db2)
  assert(original.getCurrentDatabase == db1)

  original.setCurrentDatabase(db3)
  assert(clone.getCurrentDatabase == db2)
}
```
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104595023

--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -158,7 +158,8 @@ private[spark] class Executor(
     threadPool.execute(tr)
   }

-  def killTask(taskId: Long, interruptThread: Boolean): Unit = {
+  def killTask(
--- End diff --

Fixed
[GitHub] spark issue #17164: [SPARK-16844][SQL][WIP] Support codegen for sort-based a...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/17164 This PR adds a new SQL option, `spark.sql.aggregate.preferSortAggregate`, to preferentially select `SortAggregate`, which makes it easier to test in `DataFrameAggregateSuite.scala`. In some cases (e.g., when the input data is already sorted in the cache), sort aggregation is faster than hash aggregation (see: https://issues.apache.org/jira/browse/SPARK-18591). However, the current Spark does not adaptively select sort aggregation in these cases, so I think this option would be useful for letting users control the aggregation strategy. What do you think? cc: @hvanhovell If so, I'd like to open another PR to add this option before this one is reviewed. https://github.com/apache/spark/compare/master...maropu:SPARK-16844-3
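If the option lands as proposed, end users would toggle it like any other SQL conf. Below is a minimal usage sketch, assuming a `spark-shell`-style `SparkSession` named `spark` and assuming `spark.sql.aggregate.preferSortAggregate` is added as a boolean conf; the flag is only proposed in this thread, so its name, default, and behavior are not settled.

```scala
// Hypothetical: the conf key comes from the proposal above and may never be merged.
spark.conf.set("spark.sql.aggregate.preferSortAggregate", "true")

val df = spark.range(0, 1000000).selectExpr("id % 10 AS k", "id AS v")
// If the planner honored the flag, explain() would show SortAggregate
// instead of the default HashAggregate for this group-by.
df.groupBy("k").sum("v").explain()
```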
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104595065

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
@@ -40,7 +40,8 @@ private[spark] object CoarseGrainedClusterMessages {
   // Driver to executors
   case class LaunchTask(data: SerializableBuffer) extends CoarseGrainedClusterMessage

-  case class KillTask(taskId: Long, executor: String, interruptThread: Boolean)
+  case class KillTask(
--- End diff --

Fixed
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104594970

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging {
   }

   /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   */
+  def killTask(taskId: Long): Unit = {
+    killTask(taskId, "cancelled")
+  }
+
+  /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   * @param reason the reason for killing the task, which should be a short string
+   */
+  def killTask(taskId: Long, reason: String): Unit = {
--- End diff --

Well, it turns out there's not a good reason not to retry. The task will get retried eventually anyway unless the stage is cancelled.
[GitHub] spark issue #17148: [SPARK-17075][SQL][followup] fix filter estimation issue...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17148 Merged build finished. Test PASSed.
[GitHub] spark issue #17148: [SPARK-17075][SQL][followup] fix filter estimation issue...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17148 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74067/
[GitHub] spark pull request #17151: [ML][Minor] Separate estimator and model params f...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17151#discussion_r104594062

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
@@ -372,16 +372,18 @@ class DecisionTreeClassifierSuite
     // Categorical splits with tree depth 2
     val categoricalData: DataFrame =
       TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2)
-    testEstimatorAndModelReadWrite(dt, categoricalData, allParamSettings, checkModelData)
+    testEstimatorAndModelReadWrite(dt, categoricalData, allParamSettings,
--- End diff --

Good points. I still think it's better to just add the extra constructor, but I don't feel strongly about it. So we can proceed with whatever you feel is best. Thanks!
[GitHub] spark pull request #17159: [SPARK-19818][SparkR] rbind should check for name...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17159
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104593920

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
@@ -40,7 +40,8 @@ private[spark] object CoarseGrainedClusterMessages {
   // Driver to executors
   case class LaunchTask(data: SerializableBuffer) extends CoarseGrainedClusterMessage

-  case class KillTask(taskId: Long, executor: String, interruptThread: Boolean)
+  case class KillTask(
--- End diff --

this also seems to fit in one line?
```
case class KillTask(taskId: Long, executor: String, interruptThread: Boolean, reason: String)
```
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104593862

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala ---
@@ -1197,6 +1199,65 @@ class SessionCatalogSuite extends PlanTest {
     }
   }

+  test("clone SessionCatalog - temp views") {
+    val externalCatalog = newEmptyCatalog()
+    val original = new SessionCatalog(externalCatalog)
+    val tempTable1 = Range(1, 10, 1, 10)
+    original.createTempView("copytest1", tempTable1, overrideIfExists = false)
+
+    // check if tables copied over
+    val clone = original.clone(
+      SimpleCatalystConf(caseSensitiveAnalysis = true),
+      new Configuration(),
+      new SimpleFunctionRegistry,
+      CatalystSqlParser)
+    assert(original ne clone)
+    assert(clone.getTempView("copytest1") == Option(tempTable1))
+
+    // check if clone and original independent
+    clone.dropTable(TableIdentifier("copytest1"), ignoreIfNotExists = false, purge = false)
+    assert(original.getTempView("copytest1") == Option(tempTable1))
+
+    val tempTable2 = Range(1, 20, 2, 10)
+    original.createTempView("copytest2", tempTable2, overrideIfExists = false)
+    assert(clone.getTempView("copytest2").isEmpty)
+  }
+
+  test("clone SessionCatalog - current db") {
+    val externalCatalog = newEmptyCatalog()
+    externalCatalog.createDatabase(newDb("copytest1"), true)
+    externalCatalog.createDatabase(newDb("copytest2"), true)
+    externalCatalog.createDatabase(newDb("copytest3"), true)
+
+    val original = new SessionCatalog(externalCatalog)
+    val tempTable1 = Range(1, 10, 1, 10)
+    val db1 = "copytest1"
+    original.createTempView(db1, tempTable1, overrideIfExists = false)
+    original.setCurrentDatabase(db1)
+
+    // check if current db copied over
+    val clone = original.clone(
+      SimpleCatalystConf(caseSensitiveAnalysis = true),
+      new Configuration(),
+      new SimpleFunctionRegistry,
+      CatalystSqlParser)
+    assert(original ne clone)
+    assert(clone.getCurrentDatabase == db1)
+
+    // check if clone and original independent
+    val db2 = "copytest2"
+    val tempTable2 = Range(1, 20, 2, 20)
+    clone.createTempView(db2, tempTable2, overrideIfExists = false)
--- End diff --

the same here.
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104593825

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -732,6 +732,13 @@ class DAGScheduler(
   }

   /**
+   * Kill a given task. It will be retried.
+   */
+  def killTask(taskId: Long, reason: String): Unit = {
--- End diff --

Similar to the public API, we should separate retry from reason...
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104593790

--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -158,7 +158,8 @@ private[spark] class Executor(
     threadPool.execute(tr)
   }

-  def killTask(taskId: Long, interruptThread: Boolean): Unit = {
+  def killTask(
--- End diff --

this fits in one line?
```
def killTask(taskId: Long, interruptThread: Boolean, reason: String): Unit = {
```
[GitHub] spark issue #17159: [SPARK-19818][SparkR] rbind should check for name consis...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17159 merged to master. thanks
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104593710

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging {
   }

   /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   */
+  def killTask(taskId: Long): Unit = {
+    killTask(taskId, "cancelled")
+  }
+
+  /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   * @param reason the reason for killing the task, which should be a short string
+   */
+  def killTask(taskId: Long, reason: String): Unit = {
--- End diff --

Hm, I don't think we should automatically retry just by providing a reason. Perhaps this:
```
def killTask(taskId: Long, reason: String): Unit
def killTaskAndRetry(taskId: Long, reason: String): Unit
```
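To make the distinction concrete, here is how the two proposed entry points might read at a call site. This is only a sketch of the API shape suggested in the comment above; the method names come from the review thread, not from a released `SparkContext` API, and the helper, task ID, and reason strings are hypothetical.

```scala
import org.apache.spark.SparkContext

// Hypothetical helper: killTask / killTaskAndRetry mirror the proposal above and are
// not guaranteed to match what was eventually merged into Spark.
def handleStraggler(sc: SparkContext, taskId: Long, speculativeCopyDone: Boolean): Unit = {
  if (speculativeCopyDone) {
    // The result is already available from a speculative copy, so no retry is wanted.
    sc.killTask(taskId, "straggler: speculative copy already finished")
  } else {
    // The work still has to happen somewhere, so ask the scheduler to reschedule it.
    sc.killTaskAndRetry(taskId, "running on an executor being decommissioned")
  }
}
```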
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104593724

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging {
   }

   /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   */
+  def killTask(taskId: Long): Unit = {
+    killTask(taskId, "cancelled")
+  }
+
+  /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   * @param reason the reason for killing the task, which should be a short string
+   */
+  def killTask(taskId: Long, reason: String): Unit = {
--- End diff --

Same thing for the lower-level DAGScheduler API.
[GitHub] spark issue #17164: [SPARK-16844][SQL][WIP] Support codegen for sort-based a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17164 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74066/