[GitHub] spark pull request #17148: [SPARK-17075][SQL][followup] fix filter estimatio...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17148
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17096 Thank you @viirya for your sign-off.
[GitHub] spark issue #17148: [SPARK-17075][SQL][followup] fix filter estimation issue...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17148 thanks, merging to master!
[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17090
I commented further on the [JIRA](https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15898855&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15898855). Sorry if my other comments here and on JIRA were unclear. But the proposed schema for input to `RankingEvaluator` is:

### Schema 1
```
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   230|    318|   5.0| 4.2403245|
|   230|   3424|   4.0|      null|
|   230|  81191|  null|  4.317455|
+------+-------+------+----------+
```

You will notice that the `rating` and `prediction` columns can be `null`. This is by design. There are three cases shown above:
1. 1st row indicates a (user-item) pair that occurs in *both* the ground-truth set *and* the top-k predictions;
2. 2nd row indicates a (user-item) pair that occurs in the ground-truth set, *but not* in the top-k predictions;
3. 3rd row indicates a (user-item) pair that occurs in the top-k predictions, *but not* in the ground-truth set.

_Note_ for reference, the input to the current `mllib` `RankingMetrics` is:

### Schema 2
```
RDD[(true labels array, predicted labels array)], i.e. RDD of ([318, 3424, 7139,...], [81191, 93040, 31...])
```

(So actually neither of the above schemas is easily compatible with the return schema here - but I think it is not really necessary to match the `mllib.RankingMetrics` format)

### ALS cross-validation
My proposal for fitting ALS into cross-validation is that `ALSModel.transform` will output a DF of **Schema 1** - *only* when the parameters `k` and `recommendFor` are appropriately set, and the input DF contains both `user` and `item` columns. In practice, this scenario will occur during cross-validation only. So what I am saying is that ALS itself (not the evaluator) must know how to return the correct DataFrame output from `transform` such that it can be used in a cross-validation as input to the `RankingEvaluator`.

__Concretely:__
```scala
val als = new ALS().setRecommendFor("user").setK(10)
val validator = new TrainValidationSplit()
  .setEvaluator(new RankingEvaluator().setK(10))
  .setEstimator(als)
  .setEstimatorParamMaps(...)
val bestModel = validator.fit(ratings)
```

So while it is complex under the hood - to users it's simply a case of setting 2 params and the rest is as normal.

Now, we have the best model selected by cross-validation. We can make recommendations using these convenience methods (I think it will need a cast):
```scala
val recommendations = bestModel.asInstanceOf[ALSModel].recommendItemsforUsers(10)
```

Alternatively, the `transform` version looks like this:
```scala
val usersDF = ...
+------+
|userId|
+------+
|     1|
|     2|
|     3|
+------+

val recommendations = bestModel.transform(usersDF)
```

So the questions:
1. should we support the above `transform`-based recommendations? Or only support it for cross-validation purposes as a special case?
2. if we do, what should the output schema of the above `transform` version look like? It must certainly match the output of the `recommendX` methods. The options are:

(1) The schema in this PR:
**Pros**: as you mention above - also more "compact"
**Cons**: doesn't match up so closely with the `transform` "cross-validation" schema above

(2) The schema below. It is basically an "exploded" version of option (1):
```
+------+-------+----------+
|userId|movieId|prediction|
+------+-------+----------+
|     1|      1|       4.3|
|     1|      5|       3.2|
|     1|      9|       2.1|
+------+-------+----------+
```
**Pros**: matches more closely with the cross-validation / evaluator input format.
Perhaps slightly more "dataframe-like". **Cons**: less compact; lose ordering?; may require more munging to save to external data stores etc. Anyway sorry for hijacking this PR discussion - but as I think you can see, the evaluator / ALS transform interplay is a bit subtle and requires some thought to get the right approach.
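To make option (2) concrete, here is a minimal, self-contained sketch of deriving the exploded layout from a nested top-k output. The nested representation (an array of (movieId, rating) pairs per user) and all column names are assumptions for illustration, not the PR's actual schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().master("local[*]").appName("explode-recs-sketch").getOrCreate()
import spark.implicits._

// One row per user; "recommendations" is an array of (movieId, rating) pairs (struct fields _1/_2).
val nested = Seq(
  (1, Seq((1, 4.3), (5, 3.2), (9, 2.1)))
).toDF("userId", "recommendations")

// Explode into one row per (user, item) with a flat prediction column -- option (2)'s shape.
val exploded = nested
  .select($"userId", explode($"recommendations").as("rec"))
  .select($"userId", $"rec._1".as("movieId"), $"rec._2".as("prediction"))

exploded.show()
```

The reverse direction (collecting the flat rows back into a per-user array) is also possible with `collect_list`, though the within-array ordering would then need to be re-established explicitly.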
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104605419 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala --- @@ -17,89 +17,55 @@ package org.apache.spark.sql.hive +import org.apache.spark.SparkContext import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.analysis.Analyzer -import org.apache.spark.sql.execution.SparkPlanner +import org.apache.spark.sql.catalyst.analysis.{Analyzer, FunctionRegistry} +import org.apache.spark.sql.catalyst.parser.ParserInterface +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.execution.{QueryExecution, SparkPlanner, SparkSqlParser} import org.apache.spark.sql.execution.datasources._ import org.apache.spark.sql.hive.client.HiveClient -import org.apache.spark.sql.internal.SessionState +import org.apache.spark.sql.internal.{SessionState, SharedState, SQLConf} +import org.apache.spark.sql.streaming.StreamingQueryManager /** * A class that holds all session-specific state in a given [[SparkSession]] backed by Hive. + * @param catalog A Hive client used for interacting with the metastore. + * @param analyzer An analyzer that uses the Hive metastore. + * @param plannerCreator Lambda to create a [[SparkPlanner]] that converts optimized logical + * plans to physical plans. */ -private[hive] class HiveSessionState(sparkSession: SparkSession) - extends SessionState(sparkSession) { - - self => - - /** - * A Hive client used for interacting with the metastore. - */ - lazy val metadataHive: HiveClient = - sparkSession.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client.newSession() - - /** - * Internal catalog for managing table and database states. - */ - override lazy val catalog = { -new HiveSessionCatalog( - sparkSession.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog], - sparkSession.sharedState.globalTempViewManager, - sparkSession, - functionResourceLoader, - functionRegistry, +private[hive] class HiveSessionState( +sparkContext: SparkContext, +sharedState: SharedState, +conf: SQLConf, +experimentalMethods: ExperimentalMethods, +functionRegistry: FunctionRegistry, +override val catalog: HiveSessionCatalog, +sqlParser: ParserInterface, +val metadataHive: HiveClient, +analyzer: Analyzer, +streamingQueryManager: StreamingQueryManager, +queryExecutionCreator: LogicalPlan => QueryExecution, +val plannerCreator: () => SparkPlanner) --- End diff -- How about adding `val planner` to `SessionState`? So far, the interface of `HiveSessionState` looks a little bit complex to me.
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104604870 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala --- @@ -17,89 +17,55 @@ package org.apache.spark.sql.hive +import org.apache.spark.SparkContext import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.analysis.Analyzer -import org.apache.spark.sql.execution.SparkPlanner +import org.apache.spark.sql.catalyst.analysis.{Analyzer, FunctionRegistry} +import org.apache.spark.sql.catalyst.parser.ParserInterface +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.execution.{QueryExecution, SparkPlanner, SparkSqlParser} import org.apache.spark.sql.execution.datasources._ import org.apache.spark.sql.hive.client.HiveClient -import org.apache.spark.sql.internal.SessionState +import org.apache.spark.sql.internal.{SessionState, SharedState, SQLConf} +import org.apache.spark.sql.streaming.StreamingQueryManager /** * A class that holds all session-specific state in a given [[SparkSession]] backed by Hive. + * @param catalog A Hive client used for interacting with the metastore. + * @param analyzer An analyzer that uses the Hive metastore. + * @param plannerCreator Lambda to create a [[SparkPlanner]] that converts optimized logical + * plans to physical plans. */ -private[hive] class HiveSessionState(sparkSession: SparkSession) - extends SessionState(sparkSession) { - - self => - - /** - * A Hive client used for interacting with the metastore. - */ - lazy val metadataHive: HiveClient = - sparkSession.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client.newSession() - - /** - * Internal catalog for managing table and database states. - */ - override lazy val catalog = { -new HiveSessionCatalog( - sparkSession.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog], - sparkSession.sharedState.globalTempViewManager, - sparkSession, - functionResourceLoader, - functionRegistry, +private[hive] class HiveSessionState( +sparkContext: SparkContext, +sharedState: SharedState, +conf: SQLConf, +experimentalMethods: ExperimentalMethods, +functionRegistry: FunctionRegistry, +override val catalog: HiveSessionCatalog, +sqlParser: ParserInterface, +val metadataHive: HiveClient, --- End diff -- This is for avoiding using `lazy val`?
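For background on the trade-off being asked about, a small self-contained illustration (not Spark code; the class names are made up): a `lazy val` hides construction inside the class and defers it until first access, while a plain constructor parameter forces the caller to build and inject the value eagerly, which presumably is what makes the refactored session state easier to construct and clone explicitly.

```scala
class ExpensiveClient {
  println("building client") // side effect just to show when construction happens
}

class LazyState {
  lazy val client: ExpensiveClient = new ExpensiveClient // built lazily, on first access
}

class InjectedState(val client: ExpensiveClient) // built eagerly by whoever constructs the state

object LazyVsInjected extends App {
  val lazyState = new LazyState            // nothing printed yet
  lazyState.client                         // "building client" printed here
  new InjectedState(new ExpensiveClient)   // "building client" printed immediately
}
```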
[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17178#discussion_r104604621 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -1342,28 +1342,52 @@ test_that("column functions", { df <- read.json(mapTypeJsonPath) j <- collect(select(df, alias(to_json(df$info), "json"))) expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}") - df <- as.DataFrame(j) - schema <- structType(structField("age", "integer"), - structField("height", "double")) - s <- collect(select(df, alias(from_json(df$json, schema), "structcol"))) - expect_equal(ncol(s), 1) - expect_equal(nrow(s), 3) - expect_is(s[[1]][[1]], "struct") - expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } ))) - - # passing option - df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) - schema2 <- structType(structField("date", "date")) - expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))), -error = function(e) { stop(e) }), - paste0(".*(java.lang.NumberFormatException: For input string:).*")) - s <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) - expect_is(s[[1]][[1]]$date, "Date") - expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21") - - # check for unparseable - df <- as.DataFrame(list(list("a" = ""))) - expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA) + + schemas <- list(structType(structField("age", "integer"), structField("height", "double")), + "struct") --- End diff -- Yes.. I persuaded myself that it is a valid string for `structField`. If you prefer the optional-parameter approach, I could try that. Otherwise, let me close this for now if you are not sure about either way :).
[GitHub] spark issue #17146: [SPARK-19806][ML][PySpark] PySpark GeneralizedLinearRegr...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17146 @actuaryzhang would you take a look at this one? If I recall correctly, it's one option we considered for the R API.
[GitHub] spark pull request #17146: [SPARK-19806][ML][PySpark] PySpark GeneralizedLin...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17146#discussion_r104604254 --- Diff: python/pyspark/ml/regression.py --- @@ -1344,40 +1347,53 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha family = Param(Params._dummy(), "family", "The name of family which is a description of " + "the error distribution to be used in the model. Supported options: " + - "gaussian (default), binomial, poisson and gamma.", + "gaussian (default), binomial, poisson, gamma and tweedie.", typeConverter=TypeConverters.toString) link = Param(Params._dummy(), "link", "The name of link function which provides the " + "relationship between the linear predictor and the mean of the distribution " + "function. Supported options: identity, log, inverse, logit, probit, cloglog " + "and sqrt.", typeConverter=TypeConverters.toString) linkPredictionCol = Param(Params._dummy(), "linkPredictionCol", "link prediction (linear " + "predictor) column name", typeConverter=TypeConverters.toString) +variancePower = Param(Params._dummy(), "variancePower", "The power in the variance function " + + "of the Tweedie distribution which characterizes the relationship " + + "between the variance and mean of the distribution. Only applicable " + + "for the Tweedie family. Supported values: 0 and [1, Inf).", + typeConverter=TypeConverters.toFloat) +linkPower = Param(Params._dummy(), "linkPower", "The index in the power link function. " + + "Only applicable for the Tweedie family.", + typeConverter=TypeConverters.toFloat) @keyword_only def __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction", family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, --- End diff -- hmm, it sounds like `link` should really be `None` then
[GitHub] spark issue #16896: [SPARK-19561][Python] cast TimestampType.toInternal outp...
Github user davies commented on the issue: https://github.com/apache/spark/pull/16896 lgtm, will merge it when I get a chance.
[GitHub] spark issue #16290: [SPARK-18817] [SPARKR] [SQL] Set default warehouse dir t...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16290 so based on this comment https://github.com/apache/spark/pull/16330#issuecomment-282101389 doesn't it mean we shouldn't set the warehouse dir to be under tempdir()?
[GitHub] spark issue #17096: [SPARK-15243][ML][SQL][PYTHON] Add missing support for u...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17096 Remaining changes LGTM. cc @holdenk
[GitHub] spark issue #17182: [SPARK-19840][SQL] Disallow creating permanent functions...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17182 **[Test build #74078 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74078/testReport)** for PR 17182 at commit [`7641789`](https://github.com/apache/spark/commit/7641789c00360ed1ba0fe4475b350415fea56c95).
[GitHub] spark issue #17186: [SPARK-19846][SQL] Add a flag to disable constraint prop...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17186 Merged build finished. Test PASSed.
[GitHub] spark issue #17186: [SPARK-19846][SQL] Add a flag to disable constraint prop...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17186 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74071/ Test PASSed.
[GitHub] spark issue #17186: [SPARK-19846][SQL] Add a flag to disable constraint prop...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17186 **[Test build #74071 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74071/testReport)** for PR 17186 at commit [`44e494b`](https://github.com/apache/spark/commit/44e494bd1725645a2ce13321ec846098d8da6ae4). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class InferFiltersFromConstraints(conf: CatalystConf)` * `case class PruneFilters(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper ` * `case class EliminateOuterJoin(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper `
[GitHub] spark issue #16856: [SPARK-19516][DOC] update public doc to use SparkSession...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16856 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74077/ Test PASSed.
[GitHub] spark issue #16856: [SPARK-19516][DOC] update public doc to use SparkSession...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16856 **[Test build #74077 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74077/testReport)** for PR 16856 at commit [`e284665`](https://github.com/apache/spark/commit/e2846652588d95a9b5e393f974846cb5d0ce2e31). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16856: [SPARK-19516][DOC] update public doc to use SparkSession...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16856 Merged build finished. Test PASSed.
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/16867 Thanks a lot for the comments. I refined it accordingly :)
[GitHub] spark pull request #17144: [SPARK-19803][TEST] flaky BlockManagerReplication...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17144#discussion_r104602172 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala --- @@ -494,7 +494,9 @@ class BlockManagerProactiveReplicationSuite extends BlockManagerReplicationBehav val newLocations = master.getLocations(blockId).toSet logInfo(s"New locations : $newLocations") -assert(newLocations.size === replicationFactor) +eventually(timeout(5 seconds), interval(10 millis)) { + assert(newLocations.size === replicationFactor) --- End diff -- Also can you remove the two sleeps above now?
[GitHub] spark pull request #17144: [SPARK-19803][TEST] flaky BlockManagerReplication...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17144#discussion_r104602103 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala --- @@ -494,7 +494,9 @@ class BlockManagerProactiveReplicationSuite extends BlockManagerReplicationBehav val newLocations = master.getLocations(blockId).toSet logInfo(s"New locations : $newLocations") -assert(newLocations.size === replicationFactor) +eventually(timeout(5 seconds), interval(10 millis)) { + assert(newLocations.size === replicationFactor) --- End diff -- line 495 needs to be in here too -- otherwise you're continually checking the same set of locations
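A sketch of the structure being suggested (reusing the suite's existing `master`, `blockId`, `replicationFactor`, and `logInfo`, which are assumed to be in scope): recomputing the locations inside `eventually` means each retry re-queries the master instead of re-checking a stale snapshot, which also makes the preceding sleeps unnecessary.

```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Poll until replication has actually caught up, re-fetching the locations on every attempt.
eventually(timeout(5.seconds), interval(10.millis)) {
  val newLocations = master.getLocations(blockId).toSet
  logInfo(s"New locations : $newLocations")
  assert(newLocations.size === replicationFactor)
}
```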
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16867 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74070/ Test PASSed.
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16867 Merged build finished. Test PASSed.
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16867 **[Test build #74070 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74070/testReport)** for PR 16867 at commit [`ee4c486`](https://github.com/apache/spark/commit/ee4c4865036063ff1bc6de0a5ebccbcdd5711717). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...
Github user carlos-verdes commented on the issue: https://github.com/apache/spark/pull/15229
Hi @rxin,
In Hive you have two levels, the partition and the buckets. The partitions are translated to folders on HDFS, for example:
```bash
/apps/hive/warehouse/model_table/date=6
```
Where model_table is the name of the table and date is the partition. Inside a folder you will have n files, and Hive lets you decide how many files you want to create (buckets) and which data you want to store within. If you create a table like this in Hive:
```sql
create table events (
  timestamp: long,
  userId: String,
  event: String
)
partitioned by (event_date int)
clustered by (userId) sorted by (userId, timestamp) into 10 buckets;
```
Then there will be only 10 files per partition, and all the events for one user will be in a single bucket, sorted by time. If you insert data into this table using the following query in Hive, you will see that the clustering policy is respected:
```sql
set hive.enforce.bucketing = true; -- (Note: Not needed in Hive 2.x onward)
from event_feed_source e
insert overwrite table events partition (event_date = 20170307)
select e.*, 20170307 where event_day = 20170307;
```
However, if you do the same insert with Spark:
```scala
sqlContext.sql("insert overwrite table events partition (event_date = 20170307) select e.*,1 from event_feed_source e")
```
you will see that the data is stored with the same partitioning as the source dataset.
What is the benefit of respecting the Hive clustering policy? The main benefit is to avoid shuffles and to control the number of partitions.
To give an example, we have a pipeline that reads thousands of events per user and saves them into another table (the model), which means the events table is going to have many times more data than the model table (imagine a factor of 10x).
The first point is that, if the source data are clustered properly, we can read all the events per user without a shuffle (I mean that something like `events.groupBy(user).mapValues(_.sortBy(timestamp))` will be done without a shuffle).
The second point is when we generate the model RDD/Dataset from the event RDD/Dataset. Spark respects the source partitioning (unless you indicate otherwise), which means it is going to save into Hive 10 times as many files for the model as needed (not respecting the Hive clustering policy). This implies that we have 10x more partitions than needed and also that the queries over the model table are not "clustered"... which means a full scan every time we need to run a query (a full scan over 10 times the optimal number of partitions).
I hope this clarifies the point about Hive clustering ;)
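For comparison, Spark 2.x already exposes its own bucketing on write; a minimal sketch follows (the target table name is illustrative, and the source table/columns follow the example above). The caveat, which is essentially the gap described in this discussion, is that the layout and metadata written by `bucketBy`/`sortBy` are Spark-specific and were not interoperable with Hive's `clustered by ... sorted by` bucketing at the time.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Write a bucketed, sorted copy of the events using Spark's own bucketing metadata.
spark.table("event_feed_source")
  .write
  .bucketBy(10, "userId")          // 10 buckets hashed on userId
  .sortBy("userId", "timestamp")   // sort within each bucket
  .saveAsTable("events_bucketed")  // hypothetical target table name
```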
[GitHub] spark issue #16896: [SPARK-19561][Python] cast TimestampType.toInternal outp...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16896 +1 LGTM. Could you review and merge this please, @davies?
[GitHub] spark issue #16856: [SPARK-19516][DOC] update public doc to use SparkSession...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16856 **[Test build #74077 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74077/testReport)** for PR 16856 at commit [`e284665`](https://github.com/apache/spark/commit/e2846652588d95a9b5e393f974846cb5d0ce2e31).
[GitHub] spark pull request #16896: [SPARK-19561][Python] cast TimestampType.toIntern...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/16896#discussion_r104601283 --- Diff: python/pyspark/sql/types.py --- @@ -189,7 +189,7 @@ def toInternal(self, dt): if dt is not None: seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo else time.mktime(dt.timetuple())) -return int(seconds) * 100 + dt.microsecond +return long(seconds) * 100 + dt.microsecond --- End diff -- Yep. It looks to me like every review comment has been applied.
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16867 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74069/ Test PASSed.
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16867 Merged build finished. Test PASSed.
[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16867 **[Test build #74069 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74069/testReport)** for PR 16867 at commit [`a2381b6`](https://github.com/apache/spark/commit/a2381b6e96b3e05cce83dcae6af960f33e9243c8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16856: [SPARK-19516][DOC] update public doc to use Spark...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16856#discussion_r104600900 --- Diff: docs/quick-start.md --- @@ -438,8 +412,7 @@ Lines with a: 46, Lines with b: 23 # Where to Go from Here Congratulations on running your first Spark application! -* For an in-depth overview of the API, start with the [Spark programming guide](programming-guide.html), - or see "Programming Guides" menu for other components. +* For an in-depth overview of the API, start with the [RDD programming guide](rdd-programming-guide.html) and the [SQL programming guide](sql-programming-guide.html), or see "Programming Guides" menu for other components. --- End diff -- users still need to read the RDD programming guide to learn some basic concepts like driver, executor, etc.
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104600104 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala --- @@ -64,7 +64,7 @@ private[spark] object UIData { var numCompletedTasks: Int = 0, var numSkippedTasks: Int = 0, var numFailedTasks: Int = 0, -var numKilledTasks: Int = 0, +var numKilledTasks: Map[String, Int] = Map.empty, --- End diff -- reasonToNumKilledTasks? (here and the other two places)
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104599195 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala --- @@ -54,6 +54,9 @@ private[spark] trait TaskScheduler { // Cancel a stage. def cancelTasks(stageId: Int, interruptThread: Boolean): Unit + // Kill a task. --- End diff -- A comment isn't very helpful if it just adds "a" to the method name :). Can you remove this? Also as above "killAndReschedule" would be a better name here
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104598647 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -239,8 +244,9 @@ private[spark] class Executor( */ @volatile var task: Task[Any] = _ -def kill(interruptThread: Boolean): Unit = { - logInfo(s"Executor is trying to kill $taskName (TID $taskId)") +def kill(interruptThread: Boolean, reason: String): Unit = { + logInfo(s"Executor is trying to kill $taskName (TID $taskId), reason: $reason") --- End diff -- extra paren
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104599383 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging { } /** + * Kill a given task. It will be retried. --- End diff -- "Kill and reschedule the given task."? (makes it slightly more obvious why this might be useful; "It will be retried" sounds almost like the retry is accidental)
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104599319 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging { } /** + * Kill a given task. It will be retried. + * + * @param taskId the task ID to kill + */ + def killTask(taskId: Long): Unit = { +killTask(taskId, "cancelled") --- End diff -- "killed by user via SparkContext.killTask"? These things can be hard to debug when they're unexpected and "cancelled" isn't very helpful
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104600582 --- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala --- @@ -538,10 +538,37 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu } } + test("Killing tasks") { +sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")) + +SparkContextSuite.isTaskStarted = false +SparkContextSuite.taskKilled = false + --- End diff -- Can you add a comment about what this test is doing? e.g., Launches one task that will run forever. Once the SparkListener detects the task has started, kill and re-schedule it. The second run of the task will complete immediately. If this test times out, then the first version of the task wasn't killed successfully.
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104600689 --- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala --- @@ -538,10 +538,37 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu } } + test("Killing tasks") { +sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")) + +SparkContextSuite.isTaskStarted = false +SparkContextSuite.taskKilled = false + +val listener = new SparkListener { + override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = { +eventually(timeout(10.seconds)) { + assert(SparkContextSuite.isTaskStarted) +} +if (!SparkContextSuite.taskKilled) { + SparkContextSuite.taskKilled = true + sc.killTask(taskStart.taskInfo.taskId, "first attempt will hang") +} + } +} +sc.addSparkListener(listener) +sc.parallelize(1 to 1).foreach { x => --- End diff -- Can you wrap this part in an eventually statement so that the test will fail relatively quickly (i.e., in less than the 250m overall test timeout) if the kill fails
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104599420 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging { } /** + * Kill a given task. It will be retried. + * + * @param taskId the task ID to kill + */ + def killTask(taskId: Long): Unit = { +killTask(taskId, "cancelled") --- End diff -- Also why not make this a default argument below and then have just one method?
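A minimal, self-contained sketch of that suggestion (the names and the stand-in `sendKill` are illustrative, not Spark's actual internals): a single method with a default, descriptive reason replaces the zero-argument overload that hard-coded "cancelled".

```scala
object KillTaskSketch {
  // Stand-in for whatever the PR actually calls on the scheduler backend.
  private def sendKill(taskId: Long, interruptThread: Boolean, reason: String): Unit =
    println(s"killing task $taskId (interruptThread=$interruptThread): $reason")

  // One public method; callers that don't supply a reason still get a debuggable default.
  def killTask(
      taskId: Long,
      interruptThread: Boolean = false,
      reason: String = "killed by user via SparkContext.killTask"): Unit =
    sendKill(taskId, interruptThread, reason)
}

// Usage:
//   KillTaskSketch.killTask(42L)
//   KillTaskSketch.killTask(42L, reason = "first attempt will hang")
```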
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104599899 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -710,7 +710,11 @@ private[spark] class TaskSetManager( logInfo(s"Killing attempt ${attemptInfo.attemptNumber} for task ${attemptInfo.id} " + s"in stage ${taskSet.id} (TID ${attemptInfo.taskId}) on ${attemptInfo.host} " + s"as the attempt ${info.attemptNumber} succeeded on ${info.host}") - sched.backend.killTask(attemptInfo.taskId, attemptInfo.executorId, true) + sched.backend.killTask( +attemptInfo.taskId, +attemptInfo.executorId, +interruptThread = true, +reason = "another attempt succeeded") --- End diff -- Can you put the longer version that's in the info log here? (s"Attempt${info.attemptNumber} succeeded on ${info.host}")
[GitHub] spark pull request #17178: [SPARK-19828][R] Support array type in from_json ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17178#discussion_r104600695 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -1342,28 +1342,52 @@ test_that("column functions", { df <- read.json(mapTypeJsonPath) j <- collect(select(df, alias(to_json(df$info), "json"))) expect_equal(j[order(j$json), ][1], "{\"age\":16,\"height\":176.5}") - df <- as.DataFrame(j) - schema <- structType(structField("age", "integer"), - structField("height", "double")) - s <- collect(select(df, alias(from_json(df$json, schema), "structcol"))) - expect_equal(ncol(s), 1) - expect_equal(nrow(s), 3) - expect_is(s[[1]][[1]], "struct") - expect_true(any(apply(s, 1, function(x) { x[[1]]$age == 16 } ))) - - # passing option - df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}"))) - schema2 <- structType(structField("date", "date")) - expect_error(tryCatch(collect(select(df, from_json(df$col, schema2))), -error = function(e) { stop(e) }), - paste0(".*(java.lang.NumberFormatException: For input string:).*")) - s <- collect(select(df, from_json(df$col, schema2, dateFormat = "dd/MM/"))) - expect_is(s[[1]][[1]]$date, "Date") - expect_equal(as.character(s[[1]][[1]]$date), "2014-10-21") - - # check for unparseable - df <- as.DataFrame(list(list("a" = ""))) - expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA) + + schemas <- list(structType(structField("age", "integer"), structField("height", "double")), + "struct") --- End diff -- I wonder if this is too loosely typed, and whether the `struct` string format would be hard to explain/illustrate to an R user.
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104600696 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -278,6 +278,8 @@ private[hive] class HiveClientImpl( state.getConf.setClassLoader(clientLoader.classLoader) // Set the thread local metastore client to the client associated with this HiveClientImpl. Hive.set(client) +// Replace conf in the thread local Hive with current conf +Hive.get(conf) --- End diff -- Because `IsolatedClientLoader` is shared? If we do not make this change, does any test case fail?
[GitHub] spark issue #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PARTITION_P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17176 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74072/ Test FAILed.
[GitHub] spark issue #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PARTITION_P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17176 **[Test build #74072 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74072/testReport)** for PR 17176 at commit [`22b1f53`](https://github.com/apache/spark/commit/22b1f538b31c50a6328afa9051e1e0ddc29da072). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PARTITION_P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17176 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16944: [SPARK-19611][SQL] Introduce configurable table schema i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16944 **[Test build #74076 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74076/testReport)** for PR 16944 at commit [`9bf6a3a`](https://github.com/apache/spark/commit/9bf6a3a75d9a9f287d5b05eab4071c0c3e9db267). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...
Github user budde commented on a diff in the pull request: https://github.com/apache/spark/pull/16944#discussion_r104598894 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -476,46 +476,6 @@ object ParquetFileFormat extends Logging { } /** - * Reconciles Hive Metastore case insensitivity issue and data type conflicts between Metastore - * schema and Parquet schema. - * - * Hive doesn't retain case information, while Parquet is case sensitive. On the other hand, the - * schema read from Parquet files may be incomplete (e.g. older versions of Parquet doesn't - * distinguish binary and string). This method generates a correct schema by merging Metastore - * schema data types and Parquet schema field names. - */ - def mergeMetastoreParquetSchema( - metastoreSchema: StructType, - parquetSchema: StructType): StructType = { -def schemaConflictMessage: String = - s"""Converting Hive Metastore Parquet, but detected conflicting schemas. Metastore schema: - |${metastoreSchema.prettyJson} - | - |Parquet schema: - |${parquetSchema.prettyJson} - """.stripMargin - -val mergedParquetSchema = mergeMissingNullableFields(metastoreSchema, parquetSchema) --- End diff -- Oversight on my part. Removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104598515 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/executor/MesosExecutorBackend.scala --- @@ -104,7 +104,8 @@ private[spark] class MesosExecutorBackend logError("Received KillTask but executor was null") } else { // TODO: Determine the 'interruptOnCancel' property set for the given job. - executor.killTask(t.getValue.toLong, interruptThread = false) + executor.killTask( +t.getValue.toLong, interruptThread = false, reason = "killed intentionally") --- End diff -- can you make this say "killed by Mesos" instead? (assuming this is correct?) "intentionally" doesn't provide much value since it's always intentional --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17097: [SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17097 @gatorsmile I have put the JIRA number in this PR title without adding a test, because the new behavior this PR introduces obviously fixes that bug. I don't want to add a lot of end-to-end tests to prove a single piece of functionality. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104598293 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -467,7 +474,7 @@ private[spark] class TaskSchedulerImpl private[scheduler]( taskState: TaskState, reason: TaskFailedReason): Unit = synchronized { taskSetManager.handleFailedTask(tid, taskState, reason) -if (!taskSetManager.isZombie && taskState != TaskState.KILLED) { +if (!taskSetManager.isZombie) { --- End diff -- I need to do some git blaming to be 100% sure this is OK...I'll take a look tomorrow --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/17090 @MLnick OK I think I misunderstood some of your comments above then. I see the proposal in SPARK-14409 differs from this PR, so I agree it'd be nice to resolve it. We can make changes to this PR's schema as long as it happens soon. Here are the pros of each as I see them: 1. Nested schema (as in this PR): ```[user, Array((item, rating))]``` * Easy to work with both nested & flattened schema (```df.select("recommendations.item")```) (AFAIK there's no simple way to zip and nest the 2 columns when starting with the flattened schema.) 2. Flattened schema (as in SPARK-14409): ```[user, Array(item), Array(rating)]``` * More efficient to store in Row-based formats like Avro I'm not sure if there's a performance difference in the formats when stored in Tungsten Rows. I think not, but that'd be good to know. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
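To make the trade-off above concrete, here is a small sketch with made-up data (not the actual `ALSModel` output) showing that the nested layout supports both direct field access and a simple projection into the flattened arrays layout, whereas going the other way would require zipping and re-nesting the two array columns.

```scala
import org.apache.spark.sql.SparkSession

object RecommendationSchemaSketch {
  // Named fields so that "recommendations.item" resolves; purely illustrative.
  case class Rec(item: Int, rating: Float)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rec schema sketch").getOrCreate()
    import spark.implicits._

    // Option 1: nested schema [user, Array((item, rating))]
    val nested = Seq(
      (1, Seq(Rec(1, 4.3f), Rec(5, 3.2f), Rec(9, 2.1f))),
      (2, Seq(Rec(7, 4.9f), Rec(3, 4.1f)))
    ).toDF("user", "recommendations")

    // Field access works directly on the nested form.
    nested.select($"user", $"recommendations.item").show(false)

    // Option 2 is a simple projection away: [user, Array(item), Array(rating)].
    // The reverse direction (zipping the two arrays back into an array of
    // structs) is the awkward case noted in the comment above.
    nested.select(
      $"user",
      $"recommendations.item".as("items"),
      $"recommendations.rating".as("ratings")
    ).show(false)

    spark.stop()
  }
}
```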
[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16944#discussion_r104598287 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -476,46 +476,6 @@ object ParquetFileFormat extends Logging { } /** - * Reconciles Hive Metastore case insensitivity issue and data type conflicts between Metastore - * schema and Parquet schema. - * - * Hive doesn't retain case information, while Parquet is case sensitive. On the other hand, the - * schema read from Parquet files may be incomplete (e.g. older versions of Parquet doesn't - * distinguish binary and string). This method generates a correct schema by merging Metastore - * schema data types and Parquet schema field names. - */ - def mergeMetastoreParquetSchema( - metastoreSchema: StructType, - parquetSchema: StructType): StructType = { -def schemaConflictMessage: String = - s"""Converting Hive Metastore Parquet, but detected conflicting schemas. Metastore schema: - |${metastoreSchema.prettyJson} - | - |Parquet schema: - |${parquetSchema.prettyJson} - """.stripMargin - -val mergedParquetSchema = mergeMissingNullableFields(metastoreSchema, parquetSchema) --- End diff -- No other places use `mergeMissingNullableFields` anymore. If we remove `mergeMetastoreParquetSchema`, `mergeMissingNullableFields` can be removed too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
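For context on what the removed method did, here is a rough, illustrative sketch of the reconciliation described in its doc comment: keep the Parquet field names (which preserve case) but trust the Metastore data types (e.g. string vs binary), matching names case-insensitively. This is not the deleted Spark code itself, only a sketch of the technique it documented.

```scala
import org.apache.spark.sql.types.{StructField, StructType}

object SchemaReconciliationSketch {
  // Illustrative only: merge Metastore data types with Parquet field names.
  def reconcile(metastoreSchema: StructType, parquetSchema: StructType): StructType = {
    val parquetByLowerName: Map[String, StructField] =
      parquetSchema.fields.map(f => f.name.toLowerCase -> f).toMap
    StructType(metastoreSchema.fields.map { metastoreField =>
      parquetByLowerName.get(metastoreField.name.toLowerCase) match {
        // Keep the Parquet casing but take the (authoritative) Metastore type.
        case Some(parquetField) => parquetField.copy(dataType = metastoreField.dataType)
        // Field missing from the Parquet footer: fall back to the Metastore field.
        case None => metastoreField
      }
    })
  }
}
```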
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104598221 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging { } /** + * Kill a given task. It will be retried. + * + * @param taskId the task ID to kill + */ + def killTask(taskId: Long): Unit = { +killTask(taskId, "cancelled") + } + + /** + * Kill a given task. It will be retried. + * + * @param taskId the task ID to kill + * @param reason the reason for killing the task, which should be a short string + */ + def killTask(taskId: Long, reason: String): Unit = { --- End diff -- What about calling it "killAndRescheduleTask" then? Otherwise kill is a little misleading -- since where we use it elsewhere (to kill a stage) it implies no retry --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17173: [SPARK-19832][SQL]DynamicPartitionWriteTask get p...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17173 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17187: [SPARK-19847][SQL] port hive read to FileFormat API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17187 **[Test build #74075 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74075/testReport)** for PR 17187 at commit [`ad47887`](https://github.com/apache/spark/commit/ad478870c652553b8b225a569e460fb6ccef0c36). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15363 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74068/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15363 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17173: [SPARK-19832][SQL]DynamicPartitionWriteTask get partitio...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17173 thanks, merging to master! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15363 **[Test build #74068 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74068/testReport)** for PR 15363 at commit [`072e3a9`](https://github.com/apache/spark/commit/072e3a95b0d09b185dfd2ce270bdaadea3825431). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17187: [SPARK-19847][SQL] port hive read to FileFormat API
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17187 cc @sameeragarwal @rxin @gatorsmile --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17187: [SPARK-19847][SQL] port hive read to FileFormat A...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/17187 [SPARK-19847][SQL] port hive read to FileFormat API ## What changes were proposed in this pull request? Implement the read logic in `HiveFileFormat` to unify the table read path between data source and Hive serde tables. The major change is that a Hive partition may have a different serde, so the planner should put more information in `PartitionedFile` and send it to executors. Two things need to be improved in the future: 1. Due to the way we read Hive table files, we do not support reading a partial file yet, which may reduce the parallelism for large files. 2. Hive tables with storage handlers (non-file-based) still go to the old code path. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark hive-read Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17187.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17187 commit ad478870c652553b8b225a569e460fb6ccef0c36 Author: Wenchen Fan Date: 2017-03-02T07:15:42Z port hive read to FileFormat API --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
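A purely hypothetical sketch of the idea in the description above. This is not Spark's actual `PartitionedFile`; every name and field here is invented for illustration. The point is only that each planned file split carries enough per-partition serde metadata for the executor to pick the right deserializer, since different Hive partitions may use different serdes.

```scala
// Invented types for illustration only -- not the real PartitionedFile API.
case class IllustrativeFileSplit(
    filePath: String,
    start: Long,
    length: Long,
    serdeClassName: String,                // may differ per Hive partition
    serdeProperties: Map[String, String])  // e.g. field delimiters

object IllustrativeHiveRead {
  // The executor-side reader would dispatch on the serde info carried by each split.
  def describe(split: IllustrativeFileSplit): String =
    s"read ${split.filePath} [${split.start}, ${split.start + split.length}) " +
      s"with ${split.serdeClassName} and ${split.serdeProperties.size} serde properties"

  def main(args: Array[String]): Unit = {
    val splits = Seq(
      IllustrativeFileSplit("/warehouse/t/p=1/part-0", 0L, 128L * 1024 * 1024,
        "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe", Map("field.delim" -> ",")),
      IllustrativeFileSplit("/warehouse/t/p=2/part-0", 0L, 64L * 1024 * 1024,
        "org.apache.hadoop.hive.ql.io.orc.OrcSerde", Map.empty)
    )
    splits.map(describe).foreach(println)
  }
}
```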
[GitHub] spark issue #16290: [SPARK-18817] [SPARKR] [SQL] Set default warehouse dir t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16290 **[Test build #74074 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74074/testReport)** for PR 16290 at commit [`7a98b91`](https://github.com/apache/spark/commit/7a98b91274e6108a2fcbccf6e6e47d49964cf22c). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104596342 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala --- @@ -17,43 +17,70 @@ package org.apache.spark.sql.internal -import java.io.File - import org.apache.hadoop.conf.Configuration -import org.apache.hadoop.fs.Path +import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.{Analyzer, FunctionRegistry} import org.apache.spark.sql.catalyst.catalog._ import org.apache.spark.sql.catalyst.optimizer.Optimizer import org.apache.spark.sql.catalyst.parser.ParserInterface import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule import org.apache.spark.sql.execution._ -import org.apache.spark.sql.execution.command.AnalyzeTableCommand import org.apache.spark.sql.execution.datasources._ -import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryManager} +import org.apache.spark.sql.streaming.StreamingQueryManager import org.apache.spark.sql.util.ExecutionListenerManager /** * A class that holds all session-specific state in a given [[SparkSession]]. + * @param functionRegistry Internal catalog for managing functions registered by the user. + * @param catalog Internal catalog for managing table and database states. + * @param sqlParser Parser that extracts expressions, plans, table identifiers etc. from SQL texts. + * @param analyzer Logical query plan analyzer for resolving unresolved attributes and relations. + * @param streamingQueryManager Interface to start and stop + * [[org.apache.spark.sql.streaming.StreamingQuery]]s. + * @param queryExecutionCreator Lambda to create a [[QueryExecution]] from a [[LogicalPlan]] */ -private[sql] class SessionState(sparkSession: SparkSession) { +private[sql] class SessionState( +sparkContext: SparkContext, +sharedState: SharedState, +val conf: SQLConf, +val experimentalMethods: ExperimentalMethods, +val functionRegistry: FunctionRegistry, +val catalog: SessionCatalog, +val sqlParser: ParserInterface, +val analyzer: Analyzer, +val streamingQueryManager: StreamingQueryManager, +val queryExecutionCreator: LogicalPlan => QueryExecution) { - // Note: These are all lazy vals because they depend on each other (e.g. conf) and we - // want subclasses to override some of the fields. Otherwise, we would get a lot of NPEs. + /** + * Interface exposed to the user for registering user-defined functions. + * Note that the user-defined functions must be deterministic. + */ + val udf: UDFRegistration = new UDFRegistration(functionRegistry) /** - * SQL-specific key-value configurations. + * Logical query plan optimizer. --- End diff -- Nit: remove the extra space --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104596281 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala --- @@ -17,43 +17,70 @@ package org.apache.spark.sql.internal -import java.io.File - import org.apache.hadoop.conf.Configuration -import org.apache.hadoop.fs.Path +import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.{Analyzer, FunctionRegistry} import org.apache.spark.sql.catalyst.catalog._ import org.apache.spark.sql.catalyst.optimizer.Optimizer import org.apache.spark.sql.catalyst.parser.ParserInterface import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule import org.apache.spark.sql.execution._ -import org.apache.spark.sql.execution.command.AnalyzeTableCommand import org.apache.spark.sql.execution.datasources._ -import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryManager} +import org.apache.spark.sql.streaming.StreamingQueryManager import org.apache.spark.sql.util.ExecutionListenerManager /** * A class that holds all session-specific state in a given [[SparkSession]]. + * @param functionRegistry Internal catalog for managing functions registered by the user. + * @param catalog Internal catalog for managing table and database states. + * @param sqlParser Parser that extracts expressions, plans, table identifiers etc. from SQL texts. + * @param analyzer Logical query plan analyzer for resolving unresolved attributes and relations. + * @param streamingQueryManager Interface to start and stop + * [[org.apache.spark.sql.streaming.StreamingQuery]]s. + * @param queryExecutionCreator Lambda to create a [[QueryExecution]] from a [[LogicalPlan]] --- End diff -- Let us document all parms? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74065/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #74065 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74065/testReport)** for PR 17166 at commit [`170fa34`](https://github.com/apache/spark/commit/170fa34cdc607285a315cbb5c6fd98461ac10fe8). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class TaskKilled(reason: String) extends TaskFailedReason ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104594383 --- Diff: R/pkg/R/generics.R --- @@ -1420,6 +1420,17 @@ setGeneric("spark.posterior", function(object, newData) { standardGeneric("spark #' @export setGeneric("spark.perplexity", function(object, data) { standardGeneric("spark.perplexity") }) +#' @rdname spark.fpGrowth +#' @export +setGeneric("spark.fpGrowth", function(data, ...) { standardGeneric("spark.fpGrowth") }) + +#' @rdname spark.fpGrowth +#' @export +setGeneric("freqItemsets", function(object) { standardGeneric("freqItemsets") }) --- End diff -- We seem to follow the pattern `spark.something` - see LDA. Do you think it makes sense here too? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
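For reference, the generic names under discussion mirror the Scala `ml.fpm` API that the R wrapper delegates to. A minimal usage sketch, assuming the `FPGrowth`/`FPGrowthModel` API as of the 2.2 line (method names may differ slightly from the version being reviewed):

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

object FPGrowthSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("fpgrowth sketch").getOrCreate()
    import spark.implicits._

    // Transactions as an array<string> column, the same shape the R example builds with split().
    val data = Seq("1 2 5", "1 2 3 5", "1 2").map(_.split(" ")).toDF("features")

    val model = new FPGrowth()
      .setItemsCol("features")
      .setMinSupport(0.3)
      .setMinConfidence(0.8)
      .fit(data)

    model.freqItemsets.show(false)       // counterpart of freqItemsets(model) in R
    model.associationRules.show(false)   // counterpart of associationRules(model) in R
    model.transform(data).show(false)    // counterpart of predict(model, newData) in R

    spark.stop()
  }
}
```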
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104595228 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model --- End diff -- could you use the long form name (eg. look at LDA) and drop the word "Model" which we avoid using --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104595814 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/FPGrowthWrapper.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.hadoop.fs.Path +import org.json4s._ +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ --- End diff -- do we need these? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104594800 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model +#' +#' Provides FP-growth algorithm to mine frequent itemsets. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. +#' @param predictionCol Prediction column name. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d")) +#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("b", "a,c", "d")) +#' new_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted") +#' } +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + featuresCol = "features", predictionCol = "prediction") { +if (!is.numeric(minSupport) || minSupport < 0 || minSupport > 1) { + stop("minSupport should be a number [0, 1].") +} +if (!is.numeric(minConfidence) || minConfidence < 0 || minConfidence > 1) { + stop("minConfidence should be a number [0, 1].") +} + +jobj <- callJStatic("org.apache.spark.ml.r.FPGrowthWrapper", "fit", +data@sdf, minSupport, minConfidence, --- End diff -- you may want to `as.numeric` on `minSupport`, `minConfidence` in case someone is 
passing in an integer and `callJStatic` would fail to match the wrapper method --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
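The reason for the suggested coercion, sketched from the JVM side: SparkR's backend resolves the target method by argument types, so an R integer arriving as a JVM integer will not match a parameter declared as `Double`. The signature below is an assumption for illustration only and may not match the real `FPGrowthWrapper.fit`.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical stand-in for the wrapper the R code calls via callJStatic;
// parameter names and types are assumptions for illustration.
object IllustrativeFPGrowthWrapper {
  def fit(
      data: DataFrame,
      minSupport: Double,      // an R integer (JVM int) would fail to match this Double
      minConfidence: Double,
      featuresCol: String,
      predictionCol: String): Unit = {
    println(s"would fit FPGrowth(minSupport=$minSupport, minConfidence=$minConfidence) " +
      s"on column '$featuresCol', writing '$predictionCol'")
    // the real wrapper would construct and fit an ml.fpm.FPGrowth here
  }
}
```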
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104595735 --- Diff: R/pkg/inst/tests/testthat/test_mllib_fpm.R --- @@ -0,0 +1,74 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +library(testthat) + +context("MLlib frequent pattern mining") + +# Tests for MLlib frequent pattern mining algorithms in SparkR +sparkSession <- sparkR.session(enableHiveSupport = FALSE) + +test_that("spark.fpGrowth", { + data <- selectExpr(createDataFrame(data.frame(features = c( +"1,2", +"1,2", +"1,2,3", +"1,3" + ))), "split(features, ',') as features") + + model <- spark.fpGrowth(data, minSupport = 0.3, minConfidence = 0.8) + + itemsets <- collect(freqItemsets(model)) + + expected_itemsets <- data.frame( +items = I(list(list("3"), list("3", "1"), list("2"), list("2", "1"), list("1"))), +freq = c(2, 2, 3, 3, 4) + ) + + expect_equivalent(expected_itemsets, collect(freqItemsets(model))) --- End diff -- don't repeat `freqItemsets(model)` - use `itemsets` from above --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104594654 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model +#' +#' Provides FP-growth algorithm to mine frequent itemsets. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. +#' @param predictionCol Prediction column name. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d")) +#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("b", "a,c", "d")) +#' new_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted") +#' } +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + featuresCol = "features", predictionCol = "prediction") { --- End diff -- we generally avoid allow setting `predictionCol` too --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104594539 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model +#' +#' Provides FP-growth algorithm to mine frequent itemsets. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. +#' @param predictionCol Prediction column name. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d")) +#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("b", "a,c", "d")) +#' new_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted") +#' } +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + featuresCol = "features", predictionCol = "prediction") { --- End diff -- instead of `features` it should take a formula? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104595125 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model +#' +#' Provides FP-growth algorithm to mine frequent itemsets. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. +#' @param predictionCol Prediction column name. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d")) +#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("b", "a,c", "d")) +#' new_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted") +#' } +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + featuresCol = "features", predictionCol = "prediction") { +if (!is.numeric(minSupport) || minSupport < 0 || minSupport > 1) { + stop("minSupport should be a number [0, 1].") +} +if (!is.numeric(minConfidence) || minConfidence < 0 || minConfidence > 1) { + stop("minConfidence should be a number [0, 1].") +} + +jobj <- callJStatic("org.apache.spark.ml.r.FPGrowthWrapper", "fit", +data@sdf, minSupport, minConfidence, +featuresCol, predictionCol) +new("FPGrowthModel", jobj = jobj) + }) + +# Get frequent itemsets. 
+#' @param object a fitted FPGrowth model. +#' @return A DataFrame with frequent itemsets. +#' +#' @rdname spark.fpGrowth +#' @aliases freqItemsets,FPGrowthModel-method +#' @export +#' @note freqItemsets(FPGrowthModel) since 2.2.0 +setMethod("freqItemsets", signature(object = "FPGrowthModel"), + function(object) { +jobj <- object@jobj +freqItemsets <- callJMethod(jobj, "freqItemsets") +dataFrame(freqItemsets) --- End diff -- It might make sense to do this in a single line: ``` dataFrame(callJMethod(object@jobj, "freqItemsets")) ``` It might be more readable that way. Ditto with Association Rules below. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104595392 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model +#' +#' Provides FP-growth algorithm to mine frequent itemsets. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. +#' @param predictionCol Prediction column name. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d")) +#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") --- End diff -- instead of duplicating `createDataFrame`, set `itemsets <- createDataFrame(data.frame(features = c("a,b", "a,b,c", "c,d")))` btw, do we have real data to use instead? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104594212 --- Diff: R/pkg/DESCRIPTION --- @@ -54,5 +55,5 @@ Collate: 'types.R' 'utils.R' 'window.R' -RoxygenNote: 5.0.1 +RoxygenNote: 6.0.1 --- End diff -- let's revert this - new roxygen2 seems to have some new features we are not ready for yet --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104594501 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,144 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth Model +#' +#' Provides FP-growth algorithm to mine frequent itemsets. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. +#' @param predictionCol Prediction column name. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d")) +#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("b", "a,c", "d")) +#' new_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted") +#' } +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, --- End diff -- should it have `numPartitions`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r104595454

--- Diff: R/pkg/R/mllib_fpm.R ---
@@ -0,0 +1,144 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration
+
+#' S4 class that represents a FPGrowthModel
+#'
+#' @param jobj a Java object reference to the backing Scala FPGrowthModel
+#' @export
+#' @note FPGrowthModel since 2.2.0
+setClass("FPGrowthModel", slots = list(jobj = "jobj"))
+
+#' FPGrowth Model
+#'
+#' Provides FP-growth algorithm to mine frequent itemsets.
+#'
+#' @param data A SparkDataFrame for training.
+#' @param minSupport Minimal support level.
+#' @param minConfidence Minimal confidence level.
+#' @param featuresCol Features column name.
+#' @param predictionCol Prediction column name.
+#' @param ... additional argument(s) passed to the method.
+#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model.
+#'
+#' @rdname spark.fpGrowth
+#' @name spark.fpGrowth
+#' @aliases spark.fpGrowth,SparkDataFrame-method
+#' @export
+#' @examples
+#' \dontrun{
+#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d"))
+#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features")
+#' model <- spark.fpGrowth(data)
+#'
+#' # Show frequent itemsets
+#' frequent_itemsets <- freqItemsets(model)
+#' showDF(frequent_itemsets)
--- End diff --

collapse this to `head(freqItemsets(model))`
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104595706

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging {
   }

   /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   */
+  def killTask(taskId: Long): Unit = {
+    killTask(taskId, "cancelled")
+  }
+
+  /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   * @param reason the reason for killing the task, which should be a short string
+   */
+  def killTask(taskId: Long, reason: String): Unit = {
--- End diff --

Ah, OK. For some reason I read it as: `killTask(long)` is kill without retry, and `killTask(long, string)` is kill with retry.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #74073 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74073/testReport)** for PR 17166 at commit [`f03de61`](https://github.com/apache/spark/commit/f03de619a5e2df41fa7d70dd574159da5b4fa4db).
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104595290

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala ---
@@ -1197,6 +1199,65 @@ class SessionCatalogSuite extends PlanTest {
     }
   }

+  test("clone SessionCatalog - temp views") {
+    val externalCatalog = newEmptyCatalog()
+    val original = new SessionCatalog(externalCatalog)
+    val tempTable1 = Range(1, 10, 1, 10)
+    original.createTempView("copytest1", tempTable1, overrideIfExists = false)
+
+    // check if tables copied over
+    val clone = original.clone(
+      SimpleCatalystConf(caseSensitiveAnalysis = true),
+      new Configuration(),
+      new SimpleFunctionRegistry,
+      CatalystSqlParser)
+    assert(original ne clone)
+    assert(clone.getTempView("copytest1") == Option(tempTable1))
+
+    // check if clone and original independent
+    clone.dropTable(TableIdentifier("copytest1"), ignoreIfNotExists = false, purge = false)
+    assert(original.getTempView("copytest1") == Option(tempTable1))
+
+    val tempTable2 = Range(1, 20, 2, 10)
+    original.createTempView("copytest2", tempTable2, overrideIfExists = false)
+    assert(clone.getTempView("copytest2").isEmpty)
+  }
+
+  test("clone SessionCatalog - current db") {
+    val externalCatalog = newEmptyCatalog()
+    externalCatalog.createDatabase(newDb("copytest1"), true)
+    externalCatalog.createDatabase(newDb("copytest2"), true)
+    externalCatalog.createDatabase(newDb("copytest3"), true)
+
+    val original = new SessionCatalog(externalCatalog)
+    val tempTable1 = Range(1, 10, 1, 10)
+    val db1 = "copytest1"
+    original.createTempView(db1, tempTable1, overrideIfExists = false)
+    original.setCurrentDatabase(db1)
+
+    // check if current db copied over
+    val clone = original.clone(
+      SimpleCatalystConf(caseSensitiveAnalysis = true),
+      new Configuration(),
+      new SimpleFunctionRegistry,
+      CatalystSqlParser)
+    assert(original ne clone)
+    assert(clone.getCurrentDatabase == db1)
+
+    // check if clone and original independent
+    val db2 = "copytest2"
+    val tempTable2 = Range(1, 20, 2, 20)
+    clone.createTempView(db2, tempTable2, overrideIfExists = false)
+    clone.setCurrentDatabase(db2)
+    assert(original.getCurrentDatabase == db1)
+
+    val db3 = "copytest3"
+    val tempTable3 = Range(1, 30, 2, 30)
+    original.createTempView(db3, tempTable3, overrideIfExists = false)
+    original.setCurrentDatabase(db3)
+    assert(clone.getCurrentDatabase == db2)
+  }
--- End diff --

How about?
```Scala
test("clone SessionCatalog - current db") {
  val externalCatalog = newEmptyCatalog()
  val db1 = "db1"
  val db2 = "db2"
  val db3 = "db3"
  val data = Range(1, 10, 1, 10)

  externalCatalog.createDatabase(newDb(db1), ignoreIfExists = true)
  externalCatalog.createDatabase(newDb(db2), ignoreIfExists = true)
  externalCatalog.createDatabase(newDb(db3), ignoreIfExists = true)

  val original = new SessionCatalog(externalCatalog)
  original.createTempView("view1", data, overrideIfExists = false)
  original.setCurrentDatabase(db1)

  // check if current db copied over
  val clone = original.clone(
    SimpleCatalystConf(caseSensitiveAnalysis = true),
    new Configuration(),
    new SimpleFunctionRegistry,
    CatalystSqlParser)
  assert(original != clone)
  assert(clone.getCurrentDatabase == db1)

  // check if clone and original independent
  clone.createTempView("view2", data, overrideIfExists = false)
  clone.setCurrentDatabase(db2)
  assert(original.getCurrentDatabase == db1)

  original.setCurrentDatabase(db3)
  assert(clone.getCurrentDatabase == db2)
}
```
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104595023

--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -158,7 +158,8 @@ private[spark] class Executor(
     threadPool.execute(tr)
   }

-  def killTask(taskId: Long, interruptThread: Boolean): Unit = {
+  def killTask(
--- End diff --

Fixed
[GitHub] spark issue #17164: [SPARK-16844][SQL][WIP] Support codegen for sort-based a...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/17164 This PR adds a new SQL option, `spark.sql.aggregate.preferSortAggregate`, to preferentially select `SortAggregate`, which makes it easier to test in `DataFrameAggregateSuite.scala`. In some cases (e.g., when the input data is already sorted in the cache), sort aggregation is faster than hash aggregation (see: https://issues.apache.org/jira/browse/SPARK-18591). However, the current Spark does not adaptively select sort aggregation in these cases, so I think this option would be useful for letting users control the aggregation strategy. What do you think? cc: @hvanhovell If so, I'd like to open another PR to add this option before this one is reviewed. https://github.com/apache/spark/compare/master...maropu:SPARK-16844-3
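If the option lands as proposed, end users would toggle it like any other SQL conf. Below is a minimal usage sketch, assuming a `spark-shell`-style `SparkSession` named `spark` and assuming `spark.sql.aggregate.preferSortAggregate` is added as a boolean conf; the flag is only proposed in this thread, so its name, default, and behavior are not settled.

```scala
// Hypothetical: the conf key comes from the proposal above and may never be merged.
spark.conf.set("spark.sql.aggregate.preferSortAggregate", "true")

val df = spark.range(0, 1000000).selectExpr("id % 10 AS k", "id AS v")
// If the planner honored the flag, explain() would show SortAggregate
// instead of the default HashAggregate for this group-by.
df.groupBy("k").sum("v").explain()
```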
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104595065

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
@@ -40,7 +40,8 @@ private[spark] object CoarseGrainedClusterMessages {
   // Driver to executors
   case class LaunchTask(data: SerializableBuffer) extends CoarseGrainedClusterMessage

-  case class KillTask(taskId: Long, executor: String, interruptThread: Boolean)
+  case class KillTask(
--- End diff --

Fixed
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104594970

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging {
   }

   /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   */
+  def killTask(taskId: Long): Unit = {
+    killTask(taskId, "cancelled")
+  }
+
+  /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   * @param reason the reason for killing the task, which should be a short string
+   */
+  def killTask(taskId: Long, reason: String): Unit = {
--- End diff --

Well, it turns out there's not a good reason not to retry. The task will get retried eventually anyway unless the stage is cancelled.
[GitHub] spark issue #17148: [SPARK-17075][SQL][followup] fix filter estimation issue...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17148 Merged build finished. Test PASSed.
[GitHub] spark issue #17148: [SPARK-17075][SQL][followup] fix filter estimation issue...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17148 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74067/
[GitHub] spark pull request #17151: [ML][Minor] Separate estimator and model params f...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17151#discussion_r104594062

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
@@ -372,16 +372,18 @@ class DecisionTreeClassifierSuite
     // Categorical splits with tree depth 2
     val categoricalData: DataFrame =
       TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2)
-    testEstimatorAndModelReadWrite(dt, categoricalData, allParamSettings, checkModelData)
+    testEstimatorAndModelReadWrite(dt, categoricalData, allParamSettings,
--- End diff --

Good points. I still think it's better to just add the extra constructor, but I don't feel strongly about it. So we can proceed with whatever you feel is best. Thanks!
[GitHub] spark pull request #17159: [SPARK-19818][SparkR] rbind should check for name...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17159
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104593920

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
@@ -40,7 +40,8 @@ private[spark] object CoarseGrainedClusterMessages {
   // Driver to executors
   case class LaunchTask(data: SerializableBuffer) extends CoarseGrainedClusterMessage

-  case class KillTask(taskId: Long, executor: String, interruptThread: Boolean)
+  case class KillTask(
--- End diff --

this also seems to fit in one line?
```
case class KillTask(taskId: Long, executor: String, interruptThread: Boolean, reason: String)
```
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r104593862

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala ---
@@ -1197,6 +1199,65 @@ class SessionCatalogSuite extends PlanTest {
     }
   }

+  test("clone SessionCatalog - temp views") {
+    val externalCatalog = newEmptyCatalog()
+    val original = new SessionCatalog(externalCatalog)
+    val tempTable1 = Range(1, 10, 1, 10)
+    original.createTempView("copytest1", tempTable1, overrideIfExists = false)
+
+    // check if tables copied over
+    val clone = original.clone(
+      SimpleCatalystConf(caseSensitiveAnalysis = true),
+      new Configuration(),
+      new SimpleFunctionRegistry,
+      CatalystSqlParser)
+    assert(original ne clone)
+    assert(clone.getTempView("copytest1") == Option(tempTable1))
+
+    // check if clone and original independent
+    clone.dropTable(TableIdentifier("copytest1"), ignoreIfNotExists = false, purge = false)
+    assert(original.getTempView("copytest1") == Option(tempTable1))
+
+    val tempTable2 = Range(1, 20, 2, 10)
+    original.createTempView("copytest2", tempTable2, overrideIfExists = false)
+    assert(clone.getTempView("copytest2").isEmpty)
+  }
+
+  test("clone SessionCatalog - current db") {
+    val externalCatalog = newEmptyCatalog()
+    externalCatalog.createDatabase(newDb("copytest1"), true)
+    externalCatalog.createDatabase(newDb("copytest2"), true)
+    externalCatalog.createDatabase(newDb("copytest3"), true)
+
+    val original = new SessionCatalog(externalCatalog)
+    val tempTable1 = Range(1, 10, 1, 10)
+    val db1 = "copytest1"
+    original.createTempView(db1, tempTable1, overrideIfExists = false)
+    original.setCurrentDatabase(db1)
+
+    // check if current db copied over
+    val clone = original.clone(
+      SimpleCatalystConf(caseSensitiveAnalysis = true),
+      new Configuration(),
+      new SimpleFunctionRegistry,
+      CatalystSqlParser)
+    assert(original ne clone)
+    assert(clone.getCurrentDatabase == db1)
+
+    // check if clone and original independent
+    val db2 = "copytest2"
+    val tempTable2 = Range(1, 20, 2, 20)
+    clone.createTempView(db2, tempTable2, overrideIfExists = false)
--- End diff --

the same here.
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104593825

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -732,6 +732,13 @@ class DAGScheduler(
   }

   /**
+   * Kill a given task. It will be retried.
+   */
+  def killTask(taskId: Long, reason: String): Unit = {
--- End diff --

Similar to the public API, we should separate retry from reason...
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104593790

--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -158,7 +158,8 @@ private[spark] class Executor(
     threadPool.execute(tr)
   }

-  def killTask(taskId: Long, interruptThread: Boolean): Unit = {
+  def killTask(
--- End diff --

this fits in one line?
```
def killTask(taskId: Long, interruptThread: Boolean, reason: String): Unit = {
```
[GitHub] spark issue #17159: [SPARK-19818][SparkR] rbind should check for name consis...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17159 merged to master. thanks
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104593710

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging {
   }

   /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   */
+  def killTask(taskId: Long): Unit = {
+    killTask(taskId, "cancelled")
+  }
+
+  /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   * @param reason the reason for killing the task, which should be a short string
+   */
+  def killTask(taskId: Long, reason: String): Unit = {
--- End diff --

Hm, I don't think we should automatically retry just by providing a reason. Perhaps this:
```
def killTask(taskId: Long, reason: String): Unit
def killTaskAndRetry(taskId: Long, reason: String): Unit
```
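To make the distinction concrete, here is how the two proposed entry points might read at a call site. This is only a sketch of the API shape suggested in the comment above; the method names come from the review thread, not from a released `SparkContext` API, and the helper, task ID, and reason strings are hypothetical.

```scala
import org.apache.spark.SparkContext

// Hypothetical helper: killTask / killTaskAndRetry mirror the proposal above and are
// not guaranteed to match what was eventually merged into Spark.
def handleStraggler(sc: SparkContext, taskId: Long, speculativeCopyDone: Boolean): Unit = {
  if (speculativeCopyDone) {
    // The result is already available from a speculative copy, so no retry is wanted.
    sc.killTask(taskId, "straggler: speculative copy already finished")
  } else {
    // The work still has to happen somewhere, so ask the scheduler to reschedule it.
    sc.killTaskAndRetry(taskId, "running on an executor being decommissioned")
  }
}
```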
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r104593724

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -2250,6 +2250,25 @@ class SparkContext(config: SparkConf) extends Logging {
   }

   /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   */
+  def killTask(taskId: Long): Unit = {
+    killTask(taskId, "cancelled")
+  }
+
+  /**
+   * Kill a given task. It will be retried.
+   *
+   * @param taskId the task ID to kill
+   * @param reason the reason for killing the task, which should be a short string
+   */
+  def killTask(taskId: Long, reason: String): Unit = {
--- End diff --

Same thing for the lower-level DAGScheduler API.
[GitHub] spark issue #17164: [SPARK-16844][SQL][WIP] Support codegen for sort-based a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17164 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74066/