[GitHub] spark issue #13116: [SPARK-15324] [SQL] Add the takeSample function to the D...

2017-03-13 Thread burness
Github user burness commented on the issue:

https://github.com/apache/spark/pull/13116
  
@HyukjinKwon It is too hard for me to solve the OOM issue; I'm sorry.





[GitHub] spark issue #17250: [SPARK-19911][STREAMING] Add builder interface for Kines...

2017-03-13 Thread budde
Github user budde commented on the issue:

https://github.com/apache/spark/pull/17250
  
@brkyvz I think if we're eliminating the constructor arguments, then the 
second approach you've proposed might make more sense. I can't think of 
anything cleaner.





[GitHub] spark pull request #16954: [SPARK-18874][SQL] First phase: Deferring the cor...

2017-03-13 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/16954#discussion_r105831346
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---
@@ -123,19 +123,36 @@ case class Not(child: Expression)
   */
 @ExpressionDescription(
   usage = "expr1 _FUNC_(expr2, expr3, ...) - Returns true if `expr` equals to any valN.")
-case class In(value: Expression, list: Seq[Expression]) extends Predicate
-  with ImplicitCastInputTypes {
+case class In(value: Expression, list: Seq[Expression]) extends Predicate {
 
   require(list != null, "list should not be null")
-
-  override def inputTypes: Seq[AbstractDataType] = value.dataType +: list.map(_.dataType)
-
   override def checkInputDataTypes(): TypeCheckResult = {
-    if (list.exists(l => l.dataType != value.dataType)) {
-      TypeCheckResult.TypeCheckFailure(
-        "Arguments must be same type")
-    } else {
-      TypeCheckResult.TypeCheckSuccess
+    list match {
+      case ListQuery(sub, _, _) :: Nil =>
+        val valExprs = value match {
+          case cns: CreateNamedStruct => cns.valExprs
+          case expr => Seq(expr)
+        }
+        val isTypeMismatched = valExprs.zip(sub.output).exists {
+          case (l, r) => l.dataType != r.dataType
+        }
+        if (isTypeMismatched) {
--- End diff --

@hvanhovell The new error message looks like the following. Does this look okay to you?

```
Error in query: cannot resolve '(named_struct('c1', at1.`c1`, 'c2', at1.`c2`) IN (listquery()))' due to data type mismatch: 
The data type of one or more elements in the left hand side of an IN subquery
is not compatible with the data type of the output of the subquery
Mismatched columns:
[(at1.`c1`:decimal(10,0), at2.`c1`:timestamp), (at1.`c2`:timestamp, at2.`c2`:decimal(10,0))]
Left side:
[decimal(10,0), timestamp].
Right side:
[timestamp, decimal(10,0)].
```
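
For context, a minimal query that should hit this check, as a sketch only (the view names `at1`/`at2` and columns `c1`/`c2` follow the error text above; the DDL is an assumption about the test setup):

```scala
import org.apache.spark.sql.SparkSession

object InSubqueryMismatchExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("in-subquery-mismatch").getOrCreate()
    // Column types are deliberately swapped: (decimal, timestamp) on the left vs (timestamp, decimal) on the right.
    spark.sql("CREATE TEMPORARY VIEW at1 AS SELECT CAST(1 AS DECIMAL(10,0)) AS c1, CAST('2017-03-13' AS TIMESTAMP) AS c2")
    spark.sql("CREATE TEMPORARY VIEW at2 AS SELECT CAST('2017-03-13' AS TIMESTAMP) AS c1, CAST(1 AS DECIMAL(10,0)) AS c2")
    // Analysis should fail with the "Mismatched columns" message quoted above.
    spark.sql("SELECT * FROM at1 WHERE (c1, c2) IN (SELECT c1, c2 FROM at2)").show()
    spark.stop()
  }
}
```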





[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17109
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74488/
Test PASSed.





[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17109
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17109
  
**[Test build #74488 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74488/testReport)**
 for PR 17109 at commit 
[`cbb784a`](https://github.com/apache/spark/commit/cbb784a1a278f2d0db5c5122d52c30dfd26fc3db).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class MesosSchedulerBackendUtilSuite extends SparkFunSuite `





[GitHub] spark issue #17277: [SPARK-19887][SQL] dynamic partition keys can be null or...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17277
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17277: [SPARK-19887][SQL] dynamic partition keys can be null or...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17277
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74482/
Test PASSed.





[GitHub] spark pull request #16373: [SPARK-18961][SQL] Support `SHOW TABLE EXTENDED ....

2017-03-13 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16373#discussion_r105831078
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -925,6 +925,26 @@ class DDLSuite extends QueryTest with SharedSQLContext 
with BeforeAndAfterEach {
 }
   }
 
+  test("show table extended ... partition") {
--- End diff --

Okay, I'll update that later.





[GitHub] spark issue #17277: [SPARK-19887][SQL] dynamic partition keys can be null or...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17277
  
**[Test build #74482 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74482/testReport)**
 for PR 17277 at commit 
[`8896507`](https://github.com/apache/spark/commit/889650770345d93d520007a39a2f140350c3b104).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16373: [SPARK-18961][SQL] Support `SHOW TABLE EXTENDED ....

2017-03-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16373#discussion_r105830656
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -925,6 +925,26 @@ class DDLSuite extends QueryTest with SharedSQLContext 
with BeforeAndAfterEach {
 }
   }
 
+  test("show table extended ... partition") {
--- End diff --

Then, you just need to improve the function `getNormalizedResult` in 
SQLQueryTestSuite to mask it. 
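
For reference, a hedged sketch of the kind of masking being suggested (this is not the actual `SQLQueryTestSuite` code; the helper name and regex are invented for illustration):

```scala
object LocationMasking {
  // Replace the machine-specific warehouse path with a stable placeholder before the
  // result lines are compared against the golden files.
  def maskLocation(resultLines: Seq[String]): Seq[String] =
    resultLines.map(_.replaceAll("Location: file:[^\\s)]+", "Location: <location>"))

  def main(args: Array[String]): Unit = {
    val sample = Seq("Storage(Location: file:/tmp/spark-warehouse/showdb.db/show_t1/c=Ch/d=1)")
    maskLocation(sample).foreach(println)  // prints: Storage(Location: <location>)
  }
}
```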





[GitHub] spark issue #17277: [SPARK-19887][SQL] dynamic partition keys can be null or...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17277
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74480/
Test PASSed.





[GitHub] spark issue #17277: [SPARK-19887][SQL] dynamic partition keys can be null or...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17277
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17277: [SPARK-19887][SQL] dynamic partition keys can be null or...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17277
  
**[Test build #74480 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74480/testReport)**
 for PR 17277 at commit 
[`a04e7e5`](https://github.com/apache/spark/commit/a04e7e5b22105188d076010bf9c6adffdcfa1f7e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16954: [SPARK-18874][SQL] First phase: Deferring the correlated...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16954
  
**[Test build #74489 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74489/testReport)**
 for PR 16954 at commit 
[`19cdbb0`](https://github.com/apache/spark/commit/19cdbb040ccf2e74e1271ca33e6842607c1e0760).





[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17109
  
**[Test build #74488 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74488/testReport)**
 for PR 17109 at commit 
[`cbb784a`](https://github.com/apache/spark/commit/cbb784a1a278f2d0db5c5122d52c30dfd26fc3db).





[GitHub] spark pull request #16954: [SPARK-18874][SQL] First phase: Deferring the cor...

2017-03-13 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/16954#discussion_r105830367
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -365,17 +368,73 @@ object TypeCoercion {
   }
 
   /**
-   * Convert the value and in list expressions to the common operator type
-   * by looking at all the argument types and finding the closest one that
-   * all the arguments can be cast to. When no common operator type is found
-   * the original expression will be returned and an Analysis Exception will
-   * be raised at type checking phase.
+   * Handles type coercion for both IN expression with subquery and IN
+   * expressions without subquery.
+   * 1. In the first case, find the common type by comparing the left hand side (LHS)
+   *    expression types against corresponding right hand side (RHS) expression derived
+   *    from the subquery expression's plan output. Inject appropriate casts in the
+   *    LHS and RHS side of IN expression.
+   *
+   * 2. In the second case, convert the value and in list expressions to the
+   *    common operator type by looking at all the argument types and finding
+   *    the closest one that all the arguments can be cast to. When no common
+   *    operator type is found the original expression will be returned and an
+   *    Analysis Exception will be raised at the type checking phase.
    */
   object InConversion extends Rule[LogicalPlan] {
     def apply(plan: LogicalPlan): LogicalPlan = plan resolveExpressions {
       // Skip nodes who's children have not been resolved yet.
       case e if !e.childrenResolved => e
 
+      // Handle type casting required between value expression and subquery output
+      // in IN subquery.
+      case i @ In(a, Seq(ListQuery(sub, children, exprId))) if !i.resolved =>
+        // LHS is the value expression of IN subquery.
+        val lhs = a match {
+          // Multi columns in IN clause is represented as a CreateNamedStruct.
+          // flatten the named struct to get the list of expressions.
+          case cns: CreateNamedStruct => cns.valExprs
+          case expr => Seq(expr)
+        }
+
+        // RHS is the subquery output.
+        val rhs = sub.output
+        require(lhs.length == rhs.length)
+
+        val commonTypes = lhs.zip(rhs).flatMap { case (l, r) =>
+          findCommonTypeForBinaryComparison(l.dataType, r.dataType) match {
+            case d @ Some(_) => d
+            case _ => findTightestCommonType(l.dataType, r.dataType)
+          }
+        }
+
+        // The number of columns/expressions must match between LHS and RHS of an
+        // IN subquery expression.
+        if (commonTypes.length == lhs.length) {
+          val castedRhs = rhs.zip(commonTypes).map {
+            case (e, dt) if e.dataType != dt => Alias(Cast(e, dt), e.name)()
+            case (e, _) => e
+          }
+          val castedLhs = lhs.zip(commonTypes).map {
+            case (e, dt) if e.dataType != dt => Cast(e, dt)
+            case (e, _) => e
+          }
+
+          // Before constructing the In expression, wrap the multi values in LHS
+          // in a CreatedNamedStruct.
+          val newLhs = a match {
--- End diff --

@hvanhovell Thanks a lot. You are right, we don't care about the names. 
This looks much better.
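
To make the coercion concrete, a small hedged example (the table names are invented and it assumes an active `SparkSession` named `spark`): with an `int` value expression and a `bigint` subquery output, the rule should inject casts so both sides of the IN are compared as `bigint`.

```scala
spark.sql("CREATE TEMPORARY VIEW t1 AS SELECT 1 AS id")                  // id: int
spark.sql("CREATE TEMPORARY VIEW t2 AS SELECT CAST(1 AS BIGINT) AS id")  // id: bigint
// After InConversion the predicate is effectively
// CAST(t1.id AS BIGINT) IN (SELECT id FROM t2); explain(true) shows the injected casts.
spark.sql("SELECT * FROM t1 WHERE id IN (SELECT id FROM t2)").explain(true)
```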





[GitHub] spark issue #13656: [SPARK-15938]Adding "support" property to MLlib Associat...

2017-03-13 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/13656
  
Closing this and adding the support to ml.fpm instead: https://github.com/apache/spark/pull/17280





[GitHub] spark pull request #13656: [SPARK-15938]Adding "support" property to MLlib A...

2017-03-13 Thread hhbyyh
Github user hhbyyh closed the pull request at:

https://github.com/apache/spark/pull/13656





[GitHub] spark pull request #17255: [SPARK-19918][SQL] Use TextFileFormat in implemen...

2017-03-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17255#discussion_r105830235
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala ---
@@ -40,18 +40,11 @@ private[sql] object JsonInferSchema {
       json: RDD[T],
       configOptions: JSONOptions,
       createParser: (JsonFactory, T) => JsonParser): StructType = {
-    require(configOptions.samplingRatio > 0,
-      s"samplingRatio (${configOptions.samplingRatio}) should be greater than 0")
     val shouldHandleCorruptRecord = configOptions.permissive
     val columnNameOfCorruptRecord = configOptions.columnNameOfCorruptRecord
-    val schemaData = if (configOptions.samplingRatio > 0.99) {
-      json
-    } else {
-      json.sample(withReplacement = false, configOptions.samplingRatio, 1)
-    }
--- End diff --

why move the sample logic out?
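
For reference, the removed sampling step restated as a standalone helper (a hedged sketch; where the PR actually relocates this logic is not shown in this diff):

```scala
import org.apache.spark.rdd.RDD

object JsonSampling {
  // Mirrors the removed logic: keep the full input when the ratio is ~1.0,
  // otherwise take a deterministic, without-replacement sample before schema inference.
  def sampleForSchemaInference[T](json: RDD[T], samplingRatio: Double): RDD[T] = {
    require(samplingRatio > 0, s"samplingRatio ($samplingRatio) should be greater than 0")
    if (samplingRatio > 0.99) json
    else json.sample(withReplacement = false, samplingRatio, 1)
  }
}
```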





[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-13 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/16867
  
@squito Thanks a lot for the comments. I've refined it. :)





[GitHub] spark issue #16373: [SPARK-18961][SQL] Support `SHOW TABLE EXTENDED ... PART...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16373
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74481/
Test PASSed.





[GitHub] spark issue #16373: [SPARK-18961][SQL] Support `SHOW TABLE EXTENDED ... PART...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16373
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16373: [SPARK-18961][SQL] Support `SHOW TABLE EXTENDED ... PART...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16373
  
**[Test build #74481 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74481/testReport)**
 for PR 16373 at commit 
[`b46d771`](https://github.com/apache/spark/commit/b46d7717aa823f839d4790b097fd841440d70660).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17279: Added dayofweek function to Functions.scala

2017-03-13 Thread RishikeshTeke
Github user RishikeshTeke closed the pull request at:

https://github.com/apache/spark/pull/17279





[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15628
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74479/
Test FAILed.





[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15628
  
Build finished. Test FAILed.





[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15628
  
**[Test build #74479 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74479/consoleFull)**
 for PR 15628 at commit 
[`254b9fb`](https://github.com/apache/spark/commit/254b9fb07a35d6927fefe1a4abe6f8a24ae81d4a).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17267
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74487/
Test FAILed.





[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17267
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17267
  
**[Test build #74487 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74487/testReport)**
 for PR 17267 at commit 
[`5bc1d8e`](https://github.com/apache/spark/commit/5bc1d8e75b3690b911cf88bcf2fba561bc63e354).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13116: [SPARK-15324] [SQL] Add the takeSample function to the D...

2017-03-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/13116
  
Hi @burness, what's the state of this PR?





[GitHub] spark pull request #17267: [SPARK-19926][PYSPARK] Make pyspark exception mor...

2017-03-13 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17267#discussion_r105827541
  
--- Diff: python/pyspark/sql/utils.py ---
@@ -24,7 +24,7 @@ def __init__(self, desc, stackTrace):
         self.stackTrace = stackTrace
 
     def __str__(self):
-        return repr(self.desc)
+        return str(self.desc)
--- End diff --

based on latest commit:

```
>>> df.select("아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 75, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: cannot resolve '`아`' given input columns: [age, name];;
'Project ['아]
+- Relation[age#0L,name#1] json
```



[GitHub] spark issue #17267: [SPARK-19926][PYSPARK] Make pyspark exception more reada...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17267
  
**[Test build #74487 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74487/testReport)**
 for PR 17267 at commit 
[`5bc1d8e`](https://github.com/apache/spark/commit/5bc1d8e75b3690b911cf88bcf2fba561bc63e354).





[GitHub] spark pull request #16373: [SPARK-18961][SQL] Support `SHOW TABLE EXTENDED ....

2017-03-13 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16373#discussion_r105827253
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -925,6 +925,26 @@ class DDLSuite extends QueryTest with SharedSQLContext 
with BeforeAndAfterEach {
 }
   }
 
+  test("show table extended ... partition") {
--- End diff --

Yes, it works, but it outputs the absolute path for `Location`, so the test 
suite will fail in another environment.





[GitHub] spark pull request #16373: [SPARK-18961][SQL] Support `SHOW TABLE EXTENDED ....

2017-03-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16373#discussion_r105826784
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -925,6 +925,26 @@ class DDLSuite extends QueryTest with SharedSQLContext 
with BeforeAndAfterEach {
 }
   }
 
+  test("show table extended ... partition") {
--- End diff --

If we change `hiveResultString` to
```
case command @ ExecutedCommandExec(s: ShowTablesCommand) if !s.isExtended =>
  command.executeCollect().map(_.getString(1))
```

I gave it a try; it works. Below is the output.


```

-- !query 22
SHOW TABLE EXTENDED LIKE 'show_t1' PARTITION(c='Ch', d=1)
-- !query 22 schema

struct
-- !query 22 output
showdb  show_t1 false   CatalogPartition(
Partition Values: [c=Ch, d=1]
Storage(Location: file:/Users/xiao/IdeaProjects/sparkDelivery/sql/core/spark-warehouse/showdb.db/show_t1/c=Ch/d=1)
Partition Parameters:{})
```






[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15628
  
**[Test build #74486 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74486/testReport)**
 for PR 15628 at commit 
[`baa8c9d`](https://github.com/apache/spark/commit/baa8c9daff8e405575c1c733e4001a0c1ccb6796).





[GitHub] spark pull request #17175: [SPARK-19931][SQL] InMemoryTableScanExec should r...

2017-03-13 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17175#discussion_r105825600
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 ---
@@ -41,11 +41,31 @@ case class InMemoryTableScanExec(
 
   override def output: Seq[Attribute] = attributes
 
+  private def updateAttribute(expr: Expression, attrMap: 
AttributeMap[Attribute]): Expression =
--- End diff --

Then, when processing `outputOrdering`, we will create `attrMap` many times.





[GitHub] spark pull request #17175: [SPARK-19931][SQL] InMemoryTableScanExec should r...

2017-03-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17175#discussion_r105825397
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 ---
@@ -41,11 +41,31 @@ case class InMemoryTableScanExec(
 
   override def output: Seq[Attribute] = attributes
 
+  private def updateAttribute(expr: Expression, attrMap: 
AttributeMap[Attribute]): Expression =
--- End diff --

we can create the `attrMap` in this method





[GitHub] spark issue #17285: [SPARK-19944][SQL] Move SQLConf from sql/core to sql/cat...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17285
  
**[Test build #74485 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74485/testReport)**
 for PR 17285 at commit 
[`c199469`](https://github.com/apache/spark/commit/c1994696172192f0808cd210ed7f453ec2e7ef7d).





[GitHub] spark pull request #17265: [SPARK-19924] [SQL] Handle InvocationTargetExcept...

2017-03-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17265





[GitHub] spark issue #17265: [SPARK-19924] [SQL] Handle InvocationTargetException for...

2017-03-13 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/17265
  
LGTM, merging to master!





[GitHub] spark issue #17285: [SPARK-19944][SQL] Move SQLConf from sql/core to sql/cat...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17285
  
**[Test build #74484 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74484/testReport)**
 for PR 17285 at commit 
[`bbf0211`](https://github.com/apache/spark/commit/bbf02110a9232a545370c05dcdac7840f5b96af7).





[GitHub] spark pull request #17285: [SPARK-19944][SQL] Move SQLConf from sql/core to ...

2017-03-13 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/17285

[SPARK-19944][SQL] Move SQLConf from sql/core to sql/catalyst

## What changes were proposed in this pull request?
This patch moves SQLConf from sql/core to sql/catalyst. To minimize the 
changes, the patch keeps CatalystConf as a type alias for SQLConf and 
SimpleCatalystConf as a concrete class that extends SQLConf.

The motivation is that it is awkward to have SQLConf only in sql/core, which 
forces us to duplicate the config options that affect the optimizer/analyzer 
in sql/catalyst via CatalystConf.
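
A minimal sketch of the aliasing approach described above (names follow this description; the exact definitions in sql/catalyst may differ):

```scala
import org.apache.spark.sql.internal.SQLConf

object ConfCompatSketch {
  // CatalystConf survives only as an alias for SQLConf...
  type CatalystConf = SQLConf

  // ...and SimpleCatalystConf becomes a thin concrete class on top of SQLConf,
  // shown here with a single illustrative setting.
  case class SimpleCatalystConf(override val caseSensitiveAnalysis: Boolean) extends SQLConf
}
```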

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-19944

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17285.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17285


commit bbf02110a9232a545370c05dcdac7840f5b96af7
Author: Reynold Xin 
Date:   2017-03-14T04:01:17Z

[SPARK-19944][SQL] Move SQLConf from sql/core to sql/catalyst







[GitHub] spark issue #17241: [SPARK-19877][SQL] Restrict the nested level of a view

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17241
  
**[Test build #74483 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74483/testReport)**
 for PR 17241 at commit 
[`5c91ab7`](https://github.com/apache/spark/commit/5c91ab7fb1dede638b246e1fb2d7b7018e0b284f).





[GitHub] spark issue #17270: [SPARK-19929] [SQL] Showing Hive Managed table's LOCATION...

2017-03-13 Thread ouyangxiaochen
Github user ouyangxiaochen commented on the issue:

https://github.com/apache/spark/pull/17270
  
cc @gatorsmile, is this reasonable? Thanks!





[GitHub] spark pull request #17179: [SPARK-19067][SS] Processing-time-based timeout i...

2017-03-13 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/17179#discussion_r105822080
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/KeyedState.scala ---
@@ -61,25 +65,50 @@ import 
org.apache.spark.sql.catalyst.plans.logical.LogicalKeyedState
  *  - After that, if `update(newState)` is called, then `exists()` will 
again return `true`,
  *`get()` and `getOption()`will return the updated value.
  *
+ * Important points to note about using `KeyedStateTimeout`.
+ *  - The timeout type is a global param across all the keys (set as 
`timeout` param in
+ *`[map|flatMap]GroupsWithState`, but the exact timeout duration is 
configurable per key
+ *(by calling `setTimeout...()` in `KeyedState`).
+ *  - When the timeout occurs for a key, the function is called with no 
values, and
+ *`KeyedState.isTimingOut()` set to true.
+ *  - The timeout is reset for key every time the function is called on 
the key, that is,
+ *when the key has new data, or the key has timed out. So the user has 
to set the timeout
+ *duration every time the function is called, otherwise there will not 
be any timeout set.
+ *  - Guarantees provided on processing-time-based timeout of key, when 
timeout duration is D ms:
+ *- Timeout will never be called before real clock time has advanced 
by D ms
+ *- Timeout will be called eventually when there is a trigger in the 
query
+ *  (i.e. after D ms). So there is a no strict upper bound on when the 
timeout would occur.
+ *  For example, the trigger interval of the query will affect when 
the timeout is actually hit.
+ *  If there is no data in the stream (for any key) for a while, then 
their will not be
+ *  any trigger and timeout will not be hit until there is data.
+ *
  * Scala example of using KeyedState in `mapGroupsWithState`:
  * {{{
  * // A mapping function that maintains an integer state for string keys 
and returns a string.
--- End diff --

Could you update this comment to describe the timeout behavior of the 
function?





[GitHub] spark pull request #17179: [SPARK-19067][SS] Processing-time-based timeout i...

2017-03-13 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/17179#discussion_r105821698
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala ---
@@ -298,12 +368,14 @@ class KeyValueGroupedDataset[K, V] private[sql](
* @param outputMode The output mode of the function.
*
* See [[Encoder]] for more details on what types are encodable to Spark 
SQL.
-   * @since 2.1.1
+   * @since 2.2.0
*/
   @Experimental
   @InterfaceStability.Evolving
   def flatMapGroupsWithState[S: Encoder, U: Encoder](
-  func: (K, Iterator[V], KeyedState[S]) => Iterator[U], outputMode: 
OutputMode): Dataset[U] = {
+  func: (K, Iterator[V], KeyedState[S]) => Iterator[U],
--- End diff --

Another option here would be to put the function at the end so that you could do this:

```scala
df.flatMapGroupsWithState(Append) { (key, iter, state: KeyedState[Int]) =>
   ...
}
```






[GitHub] spark pull request #17179: [SPARK-19067][SS] Processing-time-based timeout i...

2017-03-13 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/17179#discussion_r105823059
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala
 ---
@@ -0,0 +1,270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.streaming
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, 
AttributeReference, Expression, Literal, SortOrder, SpecificInternalRow, 
UnsafeProjection, UnsafeRow}
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalKeyedState, 
ProcessingTimeTimeout}
+import 
org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, 
Distribution, Partitioning}
+import org.apache.spark.sql.execution._
+import org.apache.spark.sql.execution.streaming.state._
+import org.apache.spark.sql.streaming.{KeyedStateTimeout, OutputMode}
+import org.apache.spark.sql.types.{BooleanType, IntegerType}
+import org.apache.spark.util.CompletionIterator
+
+/**
+ * Physical operator for executing `FlatMapGroupsWithState.`
+ *
+ * @param func function called on each group
+ * @param keyDeserializer used to extract the key object for each group.
+ * @param valueDeserializer used to extract the items in the iterator from 
an input row.
+ * @param groupingAttributes used to group the data
+ * @param dataAttributes used to read the data
+ * @param outputObjAttr used to define the output object
+ * @param stateEncoder used to serialize/deserialize state before calling 
`func`
+ * @param outputMode the output mode of `func`
+ * @param timeout used to timeout groups that have not received data in a 
while
+ * @param batchTimestampMs processing timestamp of the current batch.
+ */
+case class FlatMapGroupsWithStateExec(
+func: (Any, Iterator[Any], LogicalKeyedState[Any]) => Iterator[Any],
+keyDeserializer: Expression,
+valueDeserializer: Expression,
+groupingAttributes: Seq[Attribute],
+dataAttributes: Seq[Attribute],
+outputObjAttr: Attribute,
+stateId: Option[OperatorStateId],
+stateEncoder: ExpressionEncoder[Any],
+outputMode: OutputMode,
+timeout: KeyedStateTimeout,
+batchTimestampMs: Long,
+child: SparkPlan) extends UnaryExecNode with ObjectProducerExec with 
StateStoreWriter {
+
+  private val isTimeoutEnabled = timeout == ProcessingTimeTimeout
+  private val timestampTimeoutAttribute =
+AttributeReference("timeoutTimestamp", dataType = IntegerType, 
nullable = false)()
+  private val stateExistsAttribute =
+AttributeReference("stateExists", dataType = BooleanType, nullable = 
false)()
+  private val stateAttributes: Seq[Attribute] = {
+val encoderSchemaAttributes = stateEncoder.schema.toAttributes
+if (isTimeoutEnabled) {
+  encoderSchemaAttributes :+ stateExistsAttribute :+ 
timestampTimeoutAttribute
+} else encoderSchemaAttributes
+  }
+
+  import KeyedStateImpl._
+  override def outputPartitioning: Partitioning = child.outputPartitioning
--- End diff --

This is not true, right?  They could be outputting whatever they want.





[GitHub] spark pull request #17179: [SPARK-19067][SS] Processing-time-based timeout i...

2017-03-13 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/17179#discussion_r105822317
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/KeyedState.scala ---
@@ -61,25 +65,50 @@ import 
org.apache.spark.sql.catalyst.plans.logical.LogicalKeyedState
  *  - After that, if `update(newState)` is called, then `exists()` will 
again return `true`,
  *`get()` and `getOption()`will return the updated value.
  *
+ * Important points to note about using `KeyedStateTimeout`.
+ *  - The timeout type is a global param across all the keys (set as 
`timeout` param in
+ *`[map|flatMap]GroupsWithState`, but the exact timeout duration is 
configurable per key
+ *(by calling `setTimeout...()` in `KeyedState`).
+ *  - When the timeout occurs for a key, the function is called with no 
values, and
+ *`KeyedState.isTimingOut()` set to true.
+ *  - The timeout is reset for key every time the function is called on 
the key, that is,
+ *when the key has new data, or the key has timed out. So the user has 
to set the timeout
+ *duration every time the function is called, otherwise there will not 
be any timeout set.
+ *  - Guarantees provided on processing-time-based timeout of key, when 
timeout duration is D ms:
+ *- Timeout will never be called before real clock time has advanced 
by D ms
+ *- Timeout will be called eventually when there is a trigger in the 
query
+ *  (i.e. after D ms). So there is a no strict upper bound on when the 
timeout would occur.
+ *  For example, the trigger interval of the query will affect when 
the timeout is actually hit.
+ *  If there is no data in the stream (for any key) for a while, then 
there will not be
+ *  any trigger and timeout will not be hit until there is data.
--- End diff --

How hard would it be to remove this limitation? It seems like it's very hard to 
build reliable monitoring applications on this API unless we fix this.
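
To make the re-arming requirement concrete, here is a minimal sketch based only on the `KeyedState` API quoted in this diff (the counting logic is a made-up example, not from the PR): the timeout duration has to be set again on every invocation, otherwise no timeout remains armed for that key.

```scala
// Hedged sketch: re-arm the processing-time timeout on every call.
def mappingFunction(key: String, values: Iterator[Int], state: KeyedState[Int]): Int = {
  val count = (if (state.exists) state.get else 0) + values.size
  state.update(count)
  // If this call is skipped in any invocation, the key has no timeout afterwards.
  state.setTimeoutDuration("10 minutes")
  count
}
```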


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17179: [SPARK-19067][SS] Processing-time-based timeout i...

2017-03-13 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/17179#discussion_r105822109
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/KeyedState.scala ---
@@ -61,25 +65,50 @@ import 
org.apache.spark.sql.catalyst.plans.logical.LogicalKeyedState
  *  - After that, if `update(newState)` is called, then `exists()` will 
again return `true`,
  *`get()` and `getOption()`will return the updated value.
  *
+ * Important points to note about using `KeyedStateTimeout`.
+ *  - The timeout type is a global param across all the keys (set as 
`timeout` param in
+ *`[map|flatMap]GroupsWithState`), but the exact timeout duration is 
configurable per key
+ *(by calling `setTimeout...()` in `KeyedState`).
+ *  - When the timeout occurs for a key, the function is called with no 
values, and
+ *`KeyedState.isTimingOut()` set to true.
+ *  - The timeout is reset for a key every time the function is called on 
the key, that is,
+ *when the key has new data, or the key has timed out. So the user has 
to set the timeout
+ *duration every time the function is called, otherwise there will not 
be any timeout set.
+ *  - Guarantees provided on processing-time-based timeout of a key, when 
timeout duration is D ms:
+ *- Timeout will never be called before real clock time has advanced 
by D ms
+ *- Timeout will be called eventually when there is a trigger in the 
query
+ *  (i.e. after D ms). So there is no strict upper bound on when the 
timeout would occur.
+ *  For example, the trigger interval of the query will affect when 
the timeout is actually hit.
+ *  If there is no data in the stream (for any key) for a while, then 
there will not be
+ *  any trigger and timeout will not be hit until there is data.
+ *
  * Scala example of using KeyedState in `mapGroupsWithState`:
  * {{{
  * // A mapping function that maintains an integer state for string keys 
and returns a string.
  * def mappingFunction(key: String, value: Iterator[Int], state: 
KeyedState[Int]): String = {
- *   // Check if state exists
- *   if (state.exists) {
- * val existingState = state.get  // Get the existing state
- * val shouldRemove = ... // Decide whether to remove the state
+ *
+ *   if (state.isTimingOut) {// If called when timing out, 
remove the state
+ * state.remove()
+ *
+ *   } else if (state.exists) {  // If state exists, use it 
for processing
+ * val existingState = state.get // Get the existing state
+ * val shouldRemove = ...// Decide whether to remove 
the state
  * if (shouldRemove) {
- *   state.remove() // Remove the state
+ *   state.remove()  // Remove the state
+ *
  * } else {
  *   val newState = ...
- *   state.update(newState)// Set the new state
+ *   state.update(newState)  // Set the new state
  * }
+ *
  *   } else {
  * val initialState = ...
- * state.update(initialState)  // Set the initial state
+ * state.update(initialState)// Set the initial state
  *   }
- *   ... // return something
+ *   state.setTimeoutDuration("1 hour")  // Set the timeout
--- End diff --

Does this set a timeout on a removed state?  What does that do?
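
For reference, the sequence being questioned, reduced to a sketch of the example above (this only restates the quoted code path; it does not assert what the actual semantics are):

```scala
if (state.isTimingOut) {
  state.remove()                      // state for this key was just removed
}
// ...
state.setTimeoutDuration("1 hour")    // still executed -- timeout on removed state?
```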


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17179: [SPARK-19067][SS] Processing-time-based timeout i...

2017-03-13 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/17179#discussion_r105821496
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala ---
@@ -249,6 +250,43 @@ class KeyValueGroupedDataset[K, V] private[sql](
 dataAttributes,
 OutputMode.Update,
 isMapGroupsWithState = true,
+KeyedStateTimeout.none,
+child = logicalPlan))
+  }
+
+  /**
+   * ::Experimental::
+   * (Scala-specific)
+   * Applies the given function to each group of data, while maintaining a 
user-defined per-group
+   * state. The result Dataset will represent the objects returned by the 
function.
+   * For a static batch Dataset, the function will be invoked once per 
group. For a streaming
+   * Dataset, the function will be invoked for each group repeatedly in 
every trigger, and
+   * updates to each group's state will be saved across invocations.
+   * See [[org.apache.spark.sql.streaming.KeyedState]] for more details.
+   *
+   * @tparam S The type of the user-defined state. Must be encodable to 
Spark SQL types.
+   * @tparam U The type of the output objects. Must be encodable to Spark 
SQL types.
+   * @param func Function to be called on every group.
+   * @param timeout Timeout information for groups that do not receive 
data for a while
+   *
+   * See [[Encoder]] for more details on what types are encodable to Spark 
SQL.
+   * @since 2.2.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def mapGroupsWithState[S: Encoder, U: Encoder](
+  func: (K, Iterator[V], KeyedState[S]) => U,
+  timeout: KeyedStateTimeout): Dataset[U] = {
--- End diff --

`timeoutType`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

2017-03-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17240#discussion_r105823209
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -122,46 +119,48 @@ case class CostBasedJoinReorder(conf: CatalystConf) 
extends Rule[LogicalPlan] wi
  * level 3: p({A, B, C, D})
  * where p({A, B, C, D}) is the final output plan.
  *
- * For cost evaluation, since physical costs for operators are not 
available currently, we use
- * cardinalities and sizes to compute costs.
+ * To evaluate cost for a given plan, we calculate the sum of 
cardinalities for all intermediate
+ * joins in the plan.
  */
 object JoinReorderDP extends PredicateHelper {
 
   def search(
   conf: CatalystConf,
   items: Seq[LogicalPlan],
-  conditions: Set[Expression],
-  topOutput: AttributeSet): Option[LogicalPlan] = {
+  conditions: Set[Expression]): Option[LogicalPlan] = {
 
 // Level i maintains all found plans for i + 1 items.
 // Create the initial plans: each plan is a single item with zero cost.
-val itemIndex = items.zipWithIndex
+val itemIndex = items.zipWithIndex.map(_.swap).toMap
 val foundPlans = mutable.Buffer[JoinPlanMap](itemIndex.map {
-  case (item, id) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 
0))
-}.toMap)
+  case (id, item) => Set(id) -> JoinPlan(Set(id), item, cost = 0)
+})
 
-for (lev <- 1 until items.length) {
+while (foundPlans.size < items.length && foundPlans.last.size > 1) {
   // Build plans for the next level.
-  foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
+  foundPlans += searchLevel(foundPlans, conf, conditions)
 }
 
-val plansLastLevel = foundPlans(items.length - 1)
-if (plansLastLevel.isEmpty) {
-  // Failed to find a plan, fall back to the original plan
-  None
-} else {
-  // There must be only one plan at the last level, which contains all 
items.
-  assert(plansLastLevel.size == 1 && plansLastLevel.head._1.size == 
items.length)
-  Some(plansLastLevel.head._2.plan)
+// Find the best plan
+assert(foundPlans.last.size <= 1)
+val bestJoinPlan = foundPlans.last.headOption
--- End diff --

And what if the last level has more than one entry? Shall we pick the best 
among them?
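
If several plans could survive to the last level, the selection might look roughly like the sketch below (assuming `JoinPlanMap` holds `JoinPlan` values with a comparable `cost` field, as in the quoted code; this illustrates the question rather than the final implementation):

```scala
// Hedged sketch: pick the cheapest plan if the last level holds more than one.
val bestJoinPlan: Option[JoinPlan] =
  foundPlans.last.values.toSeq.sortBy(_.cost).headOption
```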


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16867
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74476/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16867
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16867
  
**[Test build #74476 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74476/testReport)**
 for PR 16867 at commit 
[`5aa2fcf`](https://github.com/apache/spark/commit/5aa2fcf8c244e4503302053a98ef12c7d5c80878).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

2017-03-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17240#discussion_r105822819
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -122,46 +119,48 @@ case class CostBasedJoinReorder(conf: CatalystConf) 
extends Rule[LogicalPlan] wi
  * level 3: p({A, B, C, D})
  * where p({A, B, C, D}) is the final output plan.
  *
- * For cost evaluation, since physical costs for operators are not 
available currently, we use
- * cardinalities and sizes to compute costs.
+ * To evaluate cost for a given plan, we calculate the sum of 
cardinalities for all intermediate
+ * joins in the plan.
  */
 object JoinReorderDP extends PredicateHelper {
 
   def search(
   conf: CatalystConf,
   items: Seq[LogicalPlan],
-  conditions: Set[Expression],
-  topOutput: AttributeSet): Option[LogicalPlan] = {
+  conditions: Set[Expression]): Option[LogicalPlan] = {
 
 // Level i maintains all found plans for i + 1 items.
 // Create the initial plans: each plan is a single item with zero cost.
-val itemIndex = items.zipWithIndex
+val itemIndex = items.zipWithIndex.map(_.swap).toMap
 val foundPlans = mutable.Buffer[JoinPlanMap](itemIndex.map {
-  case (item, id) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 
0))
-}.toMap)
+  case (id, item) => Set(id) -> JoinPlan(Set(id), item, cost = 0)
+})
 
-for (lev <- 1 until items.length) {
+while (foundPlans.size < items.length && foundPlans.last.size > 1) {
   // Build plans for the next level.
-  foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
+  foundPlans += searchLevel(foundPlans, conf, conditions)
 }
 
-val plansLastLevel = foundPlans(items.length - 1)
-if (plansLastLevel.isEmpty) {
-  // Failed to find a plan, fall back to the original plan
-  None
-} else {
-  // There must be only one plan at the last level, which contains all 
items.
-  assert(plansLastLevel.size == 1 && plansLastLevel.head._1.size == 
items.length)
-  Some(plansLastLevel.head._2.plan)
+// Find the best plan
+assert(foundPlans.last.size <= 1)
+val bestJoinPlan = foundPlans.last.headOption
--- End diff --

What if the last level has 0 entries but the previous level has some?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16867
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74475/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16867
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16867
  
**[Test build #74475 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74475/testReport)**
 for PR 16867 at commit 
[`318a172`](https://github.com/apache/spark/commit/318a172130bd84c0f36494f839a87b86c6750f66).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

2017-03-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17240#discussion_r105822517
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -122,46 +119,48 @@ case class CostBasedJoinReorder(conf: CatalystConf) 
extends Rule[LogicalPlan] wi
  * level 3: p({A, B, C, D})
  * where p({A, B, C, D}) is the final output plan.
  *
- * For cost evaluation, since physical costs for operators are not 
available currently, we use
- * cardinalities and sizes to compute costs.
+ * To evaluate cost for a given plan, we calculate the sum of 
cardinalities for all intermediate
+ * joins in the plan.
  */
 object JoinReorderDP extends PredicateHelper {
 
   def search(
   conf: CatalystConf,
   items: Seq[LogicalPlan],
-  conditions: Set[Expression],
-  topOutput: AttributeSet): Option[LogicalPlan] = {
+  conditions: Set[Expression]): Option[LogicalPlan] = {
 
 // Level i maintains all found plans for i + 1 items.
 // Create the initial plans: each plan is a single item with zero cost.
-val itemIndex = items.zipWithIndex
+val itemIndex = items.zipWithIndex.map(_.swap).toMap
 val foundPlans = mutable.Buffer[JoinPlanMap](itemIndex.map {
-  case (item, id) => Set(id) -> JoinPlan(Set(id), item, Set(), Cost(0, 
0))
-}.toMap)
+  case (id, item) => Set(id) -> JoinPlan(Set(id), item, cost = 0)
+})
 
-for (lev <- 1 until items.length) {
+while (foundPlans.size < items.length && foundPlans.last.size > 1) {
--- End diff --

Add some comments to explain why we can stop when the last level has no more 
than one entry.
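
For illustration, such a comment might read like the sketch below (the rationale is our guess at the intent and would need to be confirmed by the author; the loop itself is copied from the quoted diff):

```scala
// Stop when either we have built a level covering all items, or the last level
// holds at most one plan: that plan already contains every item that can be
// joined, and the remaining unjoinable items are appended after reordering
// finishes, so further levels cannot improve on it.
while (foundPlans.size < items.length && foundPlans.last.size > 1) {
  foundPlans += searchLevel(foundPlans, conf, conditions)
}
```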


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17233: [SPARK-11569][ML] Fix StringIndexer to handle null value...

2017-03-13 Thread crackcell
Github user crackcell commented on the issue:

https://github.com/apache/spark/pull/17233
  
@jkbradley Hi, I have made some updates according to your comments. Please 
review it again. :-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

2017-03-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17240#discussion_r105821744
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -87,8 +84,8 @@ case class CostBasedJoinReorder(conf: CatalystConf) 
extends Rule[LogicalPlan] wi
   val replacedLeft = replaceWithOrderedJoin(left)
   val replacedRight = replaceWithOrderedJoin(right)
   OrderedJoin(j.copy(left = replacedLeft, right = replacedRight))
-case p @ Project(_, join) =>
-  p.copy(child = replaceWithOrderedJoin(join))
+case p @ Project(projectList, j: Join) =>
--- End diff --

Now the result of join reordering won't have a Project, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17278: [SPARK-19933][SQL] Do not change output of a subq...

2017-03-13 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17278#discussion_r105821394
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -140,7 +140,8 @@ abstract class Optimizer(sessionCatalog: 
SessionCatalog, conf: CatalystConf)
   object OptimizeSubqueries extends Rule[LogicalPlan] {
 def apply(plan: LogicalPlan): LogicalPlan = plan 
transformAllExpressions {
   case s: SubqueryExpression =>
-s.withNewPlan(Optimizer.this.execute(s.plan))
+val ReturnAnswer(newPlan) = 
Optimizer.this.execute(ReturnAnswer(s.plan))
--- End diff --

How about using a case class like `OptimizedSubquery` that extends 
`SubqueryExpression`? I think it would be easier to understand from the name.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17277: [SPARK-19887][SQL] dynamic partition keys can be null or...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17277
  
**[Test build #74482 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74482/testReport)**
 for PR 17277 at commit 
[`8896507`](https://github.com/apache/spark/commit/889650770345d93d520007a39a2f140350c3b104).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16373: [SPARK-18961][SQL] Support `SHOW TABLE EXTENDED ....

2017-03-13 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16373#discussion_r105820580
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -642,18 +644,34 @@ case class ShowTablesCommand(
 // instead of calling tables in sparkSession.
 val catalog = sparkSession.sessionState.catalog
 val db = databaseName.getOrElse(catalog.getCurrentDatabase)
-val tables =
-  tableIdentifierPattern.map(catalog.listTables(db, 
_)).getOrElse(catalog.listTables(db))
-tables.map { tableIdent =>
-  val database = tableIdent.database.getOrElse("")
-  val tableName = tableIdent.table
-  val isTemp = catalog.isTemporaryTable(tableIdent)
-  if (isExtended) {
-val information = 
catalog.getTempViewOrPermanentTableMetadata(tableIdent).toString
-Row(database, tableName, isTemp, s"${information}\n")
-  } else {
-Row(database, tableName, isTemp)
+if (partitionSpec.isEmpty) {
+  // Show the information of tables.
+  val tables =
+tableIdentifierPattern.map(catalog.listTables(db, 
_)).getOrElse(catalog.listTables(db))
+  tables.map { tableIdent =>
+val database = tableIdent.database.getOrElse("")
--- End diff --

Temporary views have an empty database.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16373: [SPARK-18961][SQL] Support `SHOW TABLE EXTENDED ... PART...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16373
  
**[Test build #74481 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74481/testReport)**
 for PR 16373 at commit 
[`b46d771`](https://github.com/apache/spark/commit/b46d7717aa823f839d4790b097fd841440d70660).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17277: [SPARK-19887][SQL] dynamic partition keys can be null or...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17277
  
**[Test build #74480 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74480/testReport)**
 for PR 17277 at commit 
[`a04e7e5`](https://github.com/apache/spark/commit/a04e7e5b22105188d076010bf9c6adffdcfa1f7e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17194: [SPARK-19851] Add new aggregates EVERY and ANY (SOME).

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17194
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74478/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17194: [SPARK-19851] Add new aggregates EVERY and ANY (SOME).

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17194
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17194: [SPARK-19851] Add new aggregates EVERY and ANY (SOME).

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17194
  
**[Test build #74478 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74478/testReport)**
 for PR 17194 at commit 
[`1f49c7f`](https://github.com/apache/spark/commit/1f49c7f715df3c871463a23fac1b7a6bc8c9bfc5).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...

2017-03-13 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17233#discussion_r105820314
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -188,35 +189,45 @@ class StringIndexerModel (
 transformSchema(dataset.schema, logging = true)
 
 val filteredLabels = getHandleInvalid match {
-  case StringIndexer.KEEP_UNSEEN_LABEL => labels :+ "__unknown"
+  case StringIndexer.KEEP_INVALID => labels :+ "__unknown"
   case _ => labels
 }
 
 val metadata = NominalAttribute.defaultAttr
   .withName($(outputCol)).withValues(filteredLabels).toMetadata()
 // If we are skipping invalid records, filter them out.
 val (filteredDataset, keepInvalid) = getHandleInvalid match {
-  case StringIndexer.SKIP_UNSEEN_LABEL =>
+  case StringIndexer.SKIP_INVALID =>
 val filterer = udf { label: String =>
   labelToIndex.contains(label)
 }
-(dataset.where(filterer(dataset($(inputCol, false)
-  case _ => (dataset, getHandleInvalid == 
StringIndexer.KEEP_UNSEEN_LABEL)
+
(dataset.na.drop(Array($(inputCol))).where(filterer(dataset($(inputCol, 
false)
+  case _ => (dataset, getHandleInvalid == StringIndexer.KEEP_INVALID)
 }
 
-val indexer = udf { label: String =>
-  if (labelToIndex.contains(label)) {
-labelToIndex(label)
-  } else if (keepInvalid) {
-labels.length
+val indexer = udf { row: Row =>
--- End diff --

got it


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...

2017-03-13 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17233#discussion_r105820279
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -122,6 +122,86 @@ class StringIndexerSuite
 assert(output === expected)
   }
 
+  test("StringIndexer with a string input column with NULLs") {
+val data: Seq[java.lang.String] = Seq("a", "b", "b", null)
+val data2: Seq[java.lang.String] = Seq("a", "b", null)
+val expectedSkip = Array(1.0, 0.0)
+val expectedKeep = Array(1.0, 0.0, 2.0)
+val df = data.toDF("label")
+val df2 = data2.toDF("label")
+
+val indexer = new StringIndexer()
+  .setInputCol("label")
+  .setOutputCol("labelIndex")
+
+withClue("StringIndexer should throw error when setHandleInvalid=error 
when given NULL values") {
+  intercept[SparkException] {
+indexer.setHandleInvalid("error")
+indexer.fit(df).transform(df2).collect()
+  }
+}
+
+indexer.setHandleInvalid("skip")
+val transformedSkip = indexer.fit(df).transform(df2)
+val attrSkip = Attribute
+  .fromStructField(transformedSkip.schema("labelIndex"))
+  .asInstanceOf[NominalAttribute]
+assert(attrSkip.values.get === Array("b", "a"))
+assert(transformedSkip.select("labelIndex").rdd.map { r =>
+  r.getDouble(0)
+}.collect() === expectedSkip)
+
+indexer.setHandleInvalid("keep")
+val transformedKeep = indexer.fit(df).transform(df2)
+val attrKeep = Attribute
+  .fromStructField(transformedKeep.schema("labelIndex"))
+  .asInstanceOf[NominalAttribute]
+assert(attrKeep.values.get === Array("b", "a", "__unknown"))
+assert(transformedKeep.select("labelIndex").rdd.map { r =>
+  r.getDouble(0)
+}.collect() === expectedKeep)
+  }
+
+  test("StringIndexer with a numeric input column with NULLs") {
--- End diff --

OK, I'll remove the numeric test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...

2017-03-13 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17233#discussion_r105820283
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -122,6 +122,86 @@ class StringIndexerSuite
 assert(output === expected)
   }
 
+  test("StringIndexer with a string input column with NULLs") {
+val data: Seq[java.lang.String] = Seq("a", "b", "b", null)
+val data2: Seq[java.lang.String] = Seq("a", "b", null)
+val expectedSkip = Array(1.0, 0.0)
+val expectedKeep = Array(1.0, 0.0, 2.0)
+val df = data.toDF("label")
+val df2 = data2.toDF("label")
+
+val indexer = new StringIndexer()
+  .setInputCol("label")
+  .setOutputCol("labelIndex")
+
+withClue("StringIndexer should throw error when setHandleInvalid=error 
when given NULL values") {
+  intercept[SparkException] {
+indexer.setHandleInvalid("error")
+indexer.fit(df).transform(df2).collect()
+  }
+}
+
+indexer.setHandleInvalid("skip")
+val transformedSkip = indexer.fit(df).transform(df2)
+val attrSkip = Attribute
+  .fromStructField(transformedSkip.schema("labelIndex"))
+  .asInstanceOf[NominalAttribute]
+assert(attrSkip.values.get === Array("b", "a"))
+assert(transformedSkip.select("labelIndex").rdd.map { r =>
+  r.getDouble(0)
+}.collect() === expectedSkip)
--- End diff --

roger


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16373: [SPARK-18961][SQL] Support `SHOW TABLE EXTENDED ....

2017-03-13 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16373#discussion_r105819853
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -925,6 +925,26 @@ class DDLSuite extends QueryTest with SharedSQLContext 
with BeforeAndAfterEach {
 }
   }
 
+  test("show table extended ... partition") {
--- End diff --

In `show-tables.sql`, we only output the value of the column `tableName`; we 
should verify the schema here.
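
A minimal sketch of the kind of schema check being suggested (the table name, partition spec, and expected column names below are assumptions based on the quoted `ShowTablesCommand` code, not taken from the PR's test):

```scala
// Hedged sketch: verify the full output schema, not just the tableName values.
val result = sql("SHOW TABLE EXTENDED LIKE 'tab1' PARTITION (a = '1')")
assert(result.schema.fieldNames ===
  Array("database", "tableName", "isTemporary", "information"))
```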


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15628
  
**[Test build #74479 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74479/consoleFull)**
 for PR 15628 at commit 
[`254b9fb`](https://github.com/apache/spark/commit/254b9fb07a35d6927fefe1a4abe6f8a24ae81d4a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17186: [SPARK-19846][SQL] Add a flag to disable constraint prop...

2017-03-13 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17186
  
@sameeragarwal Thanks for the comment. I've updated 
`InferFiltersFromConstraints`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17274: [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFi...

2017-03-13 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17274#discussion_r105818725
  
--- Diff: R/pkg/inst/tests/testthat/test_context.R ---
@@ -177,6 +177,13 @@ test_that("add and get file to be downloaded with 
Spark job on every node", {
   spark.addFile(path)
   download_path <- spark.getSparkFiles(filename)
   expect_equal(readLines(download_path), words)
+
+  # Test spark.getSparkFiles works well on executors.
+  seq <- seq(from = 1, to = 10, length.out = 5)
+  f <- function(seq) { readLines(spark.getSparkFiles(filename)) }
+  results <- spark.lapply(seq, f)
+  for (i in 1:5) { expect_equal(results[[i]], words) }
+
--- End diff --

```
Failed 
-
1. Error: add and get file to be downloaded with Spark job on every node 
(@test_context.R#184) 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 
in stage 0.0 failed 1 times, most recent failure: Lost task 2.0 in stage 0.0 
(TID 2, localhost, executor driver): org.apache.spark.SparkException: R 
computation failed with
 [1] 3
[1] 2
[1] 3
[1][1] 1 1

[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open file 
'/tmp/spark-82bf379c-0f31-47c0-8ac6-6b764c3cfc90/userFiles-1327d8ac-889e-4861-8263-4f270697fb85/hello221b332ef87f.txt':
 No such file or directory
```
The error is weird, since it passes if I paste this code into the SparkR 
console. It also passes if I write this code in a separate script and submit it 
with ```bin/spark-submit```. Any thoughts? cc @felixcheung @shivaram 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17251: [SPARK-19910][SQL] `stack` should not reject NULL...

2017-03-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17251#discussion_r105818676
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -590,6 +591,22 @@ object TypeCoercion {
   }
 
   /**
+   * Coerces NullTypes of a Stack function to the corresponding column 
types.
+   */
+  object StackCoercion extends Rule[LogicalPlan] {
+def apply(plan: LogicalPlan): LogicalPlan = plan resolveExpressions {
+  case s @ Stack(children @ Literal(_, IntegerType) :: _) if 
s.childrenResolved =>
--- End diff --

What about `1 + 2`? The requirement is a foldable int-type expression; it 
doesn't have to be an int literal.
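
A hedged way to state that requirement (assuming Catalyst's `Expression` with its `foldable` and `dataType` members; this is an illustration of the check, not the committed pattern):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.IntegerType

// `1 + 2` is foldable and of IntegerType, so it should be accepted as the
// row-count argument even though it is not a Literal.
def isValidStackRowCount(e: Expression): Boolean =
  e.foldable && e.dataType == IntegerType
```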


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17186: [SPARK-19846][SQL] Add a flag to disable constrai...

2017-03-13 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17186#discussion_r105818486
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -190,6 +190,15 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val CONSTRAINT_PROPAGATION_ENABLED = 
buildConf("spark.sql.constraintPropagation.enabled")
+.internal()
--- End diff --

Given that a few users have reported hitting this issue, we may need to expose 
it as an external flag.

However, I would think that a large portion of external users may not know 
about constraint propagation. It might not be intuitive for them to link the 
problem they hit to constraint propagation and to find this config, even if it 
is external.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17250: [SPARK-19911][STREAMING] Add builder interface for Kines...

2017-03-13 Thread brkyvz
Github user brkyvz commented on the issue:

https://github.com/apache/spark/pull/17250
  
Good point @budde. I can think of two options:

 1. Leave it as a constructor param
 2. Make the `Builder` class non-generic and have the `build` function take 
the message handler:

```scala
class Builder {
  
  def build(): KinesisInputDStream[Array[Byte]]

  def buildWithMessageHandler[T](f: Record => T): KinesisInputDStream[T]
}
```

It's a matter of taking it as the first parameter or the final parameter. 
There are other ways to do it as well, but they would throw runtime exceptions 
instead of failing at compile time.
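
For reference, option 2 would read roughly like this at the call site (a sketch: `builder` stands for an already configured `Builder` instance, and everything except `build()` / `buildWithMessageHandler()` is a placeholder):

```scala
// Plain byte-array stream:
val byteStream: KinesisInputDStream[Array[Byte]] = builder.build()

// Typed stream via the handler-taking build method:
val keyStream: KinesisInputDStream[String] =
  builder.buildWithMessageHandler(record => record.getPartitionKey)
```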

cc @rxin for input on APIs


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17186: [SPARK-19846][SQL] Add a flag to disable constraint prop...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17186
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17186: [SPARK-19846][SQL] Add a flag to disable constraint prop...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17186
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74473/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17186: [SPARK-19846][SQL] Add a flag to disable constraint prop...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17186
  
**[Test build #74473 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74473/testReport)**
 for PR 17186 at commit 
[`0e204bc`](https://github.com/apache/spark/commit/0e204bc226cfa520fe76a83e233790153c776522).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17240: [SPARK-19915][SQL] Improve join reorder: simplify cost e...

2017-03-13 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/17240
  
@nsyca Thanks. I know there could be such cases where size is also useful. 
However, big tables (fact tables) usually have more columns than small tables, 
so cardinality and size are positively correlated, i.e. a relation with larger 
cardinality also has a larger size.
Again, I agree with you that in some cases this could be violated. But we also 
need to consider the implementation aspect. Doing column pruning after 
reordering makes more sense and makes the code much more concise.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17186: [SPARK-19846][SQL] Add a flag to disable constrai...

2017-03-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17186#discussion_r105815354
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -190,6 +190,15 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val CONSTRAINT_PROPAGATION_ENABLED = 
buildConf("spark.sql.constraintPropagation.enabled")
+.internal()
--- End diff --

To determine whether a flag is internal or not, we should consider the 
impact on external users. If users could easily hit this, we might need to 
expose it as an external flag and document it in [the public 
document](http://spark.apache.org/docs/latest/sql-programming-guide.html).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...

2017-03-13 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17240#discussion_r105815087
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -204,63 +206,37 @@ object JoinReorderDP extends PredicateHelper {
   oneJoinPlan: JoinPlan,
   otherJoinPlan: JoinPlan,
   conf: CatalystConf,
-  conditions: Set[Expression],
-  topOutput: AttributeSet): JoinPlan = {
+  conditions: Set[Expression]): Option[JoinPlan] = {
 
 val onePlan = oneJoinPlan.plan
 val otherPlan = otherJoinPlan.plan
-// Now both onePlan and otherPlan become intermediate joins, so the 
cost of the
-// new join should also include their own cardinalities and sizes.
-val newCost = if (isCartesianProduct(onePlan) || 
isCartesianProduct(otherPlan)) {
-  // We consider cartesian product very expensive, thus set a very 
large cost for it.
-  // This enables to plan all the cartesian products at the end, 
because having a cartesian
-  // product as an intermediate join will significantly increase a 
plan's cost, making it
-  // impossible to be selected as the best plan for the items, unless 
there's no other choice.
-  Cost(
-rows = BigInt(Long.MaxValue) * BigInt(Long.MaxValue),
-size = BigInt(Long.MaxValue) * BigInt(Long.MaxValue))
-} else {
-  val onePlanStats = onePlan.stats(conf)
-  val otherPlanStats = otherPlan.stats(conf)
-  Cost(
-rows = oneJoinPlan.cost.rows + onePlanStats.rowCount.get +
-  otherJoinPlan.cost.rows + otherPlanStats.rowCount.get,
-size = oneJoinPlan.cost.size + onePlanStats.sizeInBytes +
-  otherJoinPlan.cost.size + otherPlanStats.sizeInBytes)
-}
-
-// Put the deeper side on the left, tend to build a left-deep tree.
-val (left, right) = if (oneJoinPlan.itemIds.size >= 
otherJoinPlan.itemIds.size) {
-  (onePlan, otherPlan)
-} else {
-  (otherPlan, onePlan)
-}
 val joinConds = conditions
   .filterNot(l => canEvaluate(l, onePlan))
   .filterNot(r => canEvaluate(r, otherPlan))
   .filter(e => e.references.subsetOf(onePlan.outputSet ++ 
otherPlan.outputSet))
-// We use inner join whether join condition is empty or not. Since 
cross join is
-// equivalent to inner join without condition.
-val newJoin = Join(left, right, Inner, joinConds.reduceOption(And))
-val collectedJoinConds = joinConds ++ oneJoinPlan.joinConds ++ 
otherJoinPlan.joinConds
-val remainingConds = conditions -- collectedJoinConds
-val neededAttr = AttributeSet(remainingConds.flatMap(_.references)) ++ 
topOutput
-val neededFromNewJoin = newJoin.outputSet.filter(neededAttr.contains)
-val newPlan =
-  if ((newJoin.outputSet -- neededFromNewJoin).nonEmpty) {
-Project(neededFromNewJoin.toSeq, newJoin)
+if (joinConds.isEmpty) {
+  // Cartesian product is very expensive, so we exclude them from 
candidate plans.
+  // This also helps us to reduce the search space. Unjoinable items 
will be put at the end
+  // of the plan when the reordering phase finishes.
+  None
+} else {
+  // Put the deeper side on the left, tend to build a left-deep tree.
+  val (left, right) = if (oneJoinPlan.itemIds.size >= 
otherJoinPlan.itemIds.size) {
+(onePlan, otherPlan)
   } else {
-newJoin
+(otherPlan, onePlan)
   }
+  val newJoin = Join(left, right, Inner, joinConds.reduceOption(And))
+  val itemIds = oneJoinPlan.itemIds.union(otherJoinPlan.itemIds)
 
-val itemIds = oneJoinPlan.itemIds.union(otherJoinPlan.itemIds)
-JoinPlan(itemIds, newPlan, collectedJoinConds, newCost)
-  }
+  // Now onePlan/otherPlan becomes an intermediate join (if it's a 
non-leaf item),
+  // so the cost of the new join should also include their own 
cardinalities.
+  val newCost = oneJoinPlan.cost + otherJoinPlan.cost +
+(if (oneJoinPlan.itemIds.size > 1) 
onePlan.stats(conf).rowCount.get else 0) +
+(if (otherJoinPlan.itemIds.size > 1) 
otherPlan.stats(conf).rowCount.get else 0)
--- End diff --

The filtering factor is considered in `def stats`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14547: [SPARK-16718][MLlib] gbm-style treeboost

2017-03-13 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14547#discussion_r105814881
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impurity/ApproxBernoulliImpurity.scala
 ---
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impurity
+
+import org.apache.spark.annotation.{DeveloperApi, Since}
+import org.apache.spark.mllib.tree.impurity._
+
+/**
+ * [[ApproxBernoulliImpurity]] currently uses variance as a (proxy) impurity measure
+ * during tree construction. The main purpose of the class is to have an alternative
+ * leaf prediction calculation.
+ *
+ * Only data with examples each of weight 1.0 is supported.
+ *
+ * Class for calculating variance during regression.
+ */
+@Since("2.1")
+private[spark] object ApproxBernoulliImpurity extends Impurity {
+
+  /**
+   * :: DeveloperApi ::
+   * information calculation for multiclass classification
+   * @param counts Array[Double] with counts for each label
+   * @param totalCount sum of counts for all labels
+   * @return information value, or 0 if totalCount = 0
+   */
+  @Since("2.1")
+  @DeveloperApi
+  override def calculate(counts: Array[Double], totalCount: Double): Double =
+    throw new UnsupportedOperationException("ApproxBernoulliImpurity.calculate")
+
+  /**
+   * :: DeveloperApi ::
+   * variance calculation
+   * @param count number of instances
+   * @param sum sum of labels
+   * @param sumSquares summation of squares of the labels
+   * @return information value, or 0 if count = 0
+   */
+  @Since("2.1")
+  @DeveloperApi
+  override def calculate(count: Double, sum: Double, sumSquares: Double): Double = {
+    Variance.calculate(count, sum, sumSquares)
+  }
+}
+
+/**
+ * Class for updating views of a vector of sufficient statistics,
+ * in order to compute impurity from a sample.
+ * Note: Instances of this class do not hold the data; they operate on views of the data.
+ */
+private[spark] class ApproxBernoulliAggregator
+  extends ImpurityAggregator(statsSize = 4) with Serializable {
+
+  /**
+   * Update stats for one (node, feature, bin) with the given label.
+   * @param allStats  Flat stats array, with stats for this (node, feature, bin) contiguous.
+   * @param offset    Start index of stats for this (node, feature, bin).
+   */
+  def update(allStats: Array[Double], offset: Int, label: Double, instanceWeight: Double): Unit = {
+    allStats(offset) += instanceWeight
+    allStats(offset + 1) += instanceWeight * label
+    allStats(offset + 2) += instanceWeight * label * label
+    allStats(offset + 3) += instanceWeight * Math.abs(label)
+  }
+
+  /**
+   * Get an [[ImpurityCalculator]] for a (node, feature, bin).
+   * @param allStats  Flat stats array, with stats for this (node, feature, bin) contiguous.
+   * @param offset    Start index of stats for this (node, feature, bin).
+   */
+  def getCalculator(allStats: Array[Double], offset: Int): ApproxBernoulliCalculator = {
+    new ApproxBernoulliCalculator(allStats.view(offset, offset + statsSize).toArray)
+  }
+}
+
+/**
+ * Stores statistics for one (node, feature, bin) for calculating impurity.
+ * Unlike [[ImpurityAggregator]], this class stores its own data and is for a specific
+ * (node, feature, bin).
+ * @param stats  Array of sufficient statistics for a (node, feature, bin).
+ */
+private[spark] class ApproxBernoulliCalculator(stats: Array[Double])
+  extends ImpurityCalculator(stats) {
+
+  require(stats.length == 4,
+    s"ApproxBernoulliCalculator requires sufficient statistics array stats to be of length 4," +
+      s" but was given array of length ${stats.length}.")
+
+  /**
+   * Make a deep copy of this [[ImpurityCalculator]].
+   */
+  def copy:
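The quoted diff is cut off above. For orientation, here is a standalone sketch of the four sufficient statistics that `ApproxBernoulliAggregator.update` accumulates; the sample labels are made up, and this only mirrors the `update` body shown above, not the truncated `copy` method.

```
// Standalone illustration of the four sufficient statistics accumulated by update():
// (weight, weighted label sum, weighted squared-label sum, weighted |label| sum).
val labels = Seq(1.0, -1.0, 1.0)   // made-up labels, e.g. pseudo-residuals in a boosting step
val stats = new Array[Double](4)
for (label <- labels) {
  val instanceWeight = 1.0         // only unit weights are supported, per the class doc
  stats(0) += instanceWeight
  stats(1) += instanceWeight * label
  stats(2) += instanceWeight * label * label
  stats(3) += instanceWeight * math.abs(label)
}
// stats == Array(3.0, 1.0, 3.0, 3.0)
```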

[GitHub] spark issue #17272: [SPARK-19724][SQL]create a managed table with an existed...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17272
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74472/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17272: [SPARK-19724][SQL]create a managed table with an existed...

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17272
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17272: [SPARK-19724][SQL]create a managed table with an existed...

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17272
  
**[Test build #74472 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74472/testReport)**
 for PR 17272 at commit 
[`cd4a091`](https://github.com/apache/spark/commit/cd4a0912bc934ca475f2d9097ab1f684351b7bf8).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17277: [SPARK-19887][SQL] null is a valid partition value

2017-03-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17277
  
Found a JIRA https://issues.apache.org/jira/browse/IMPALA-252 that explains 
how Impala handles it.

**Static partition keys may not be NULL or the empty string**
So `INSERT INTO TABLE tbl PARTITION(part="") SELECT ...` will raise an 
error.
**Dynamic partition keys may be empty or NULL**
So `INSERT INTO TABLE tbl PARTITION(part) SELECT ...`, `NULL` will work.
**Partitions with NULL or empty string keys are mapped to 
`__HIVE_DEFAULT_PARTITION__`**
Whether the keys are `NULL` or "", both will be written to the same 
`__HIVE_DEFAULT_PARTITION__` partition.
**Values read from the partitioned column in partition 
__HIVE_DEFAULT_PARTITION__ are mapped back to NULL**
Here we deviate from Hive; Hive returns `__HIVE_DEFAULT_PARTITION__` - 
even if the partition column is of integer type. This finally crosses the line 
of what we are willing to do to be compatible.
**ALTER TABLE [ADD|DROP] will reject partitions with NULL or empty 
partition keys**
You cannot create or delete default partitions manually.
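
For comparison, a rough sketch of how the same mapping surfaces through Spark's file-source partitioning; the path and data below are hypothetical, and the comments describe the behavior discussed above rather than a verified test.

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Two rows, one with a NULL partition value (output path is hypothetical).
val df = Seq((1, Some("a")), (2, Option.empty[String])).toDF("a", "part")
df.write.partitionBy("part").parquet("/tmp/null_partition_demo")
// On disk this yields part=a/ and part=__HIVE_DEFAULT_PARTITION__/ directories.

// Reading back, the default partition should be surfaced as NULL again,
// so this filter matches the row written with the NULL partition value.
spark.read.parquet("/tmp/null_partition_demo").filter($"part".isNull).show()
```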


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17283: [SPARK-19940][ML][MINOR] FPGrowthModel.transform ...

2017-03-13 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17283#discussion_r105813634
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala 
---
@@ -103,6 +103,22 @@ class FPGrowthSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defaul
   FPGrowthSuite.allParamSettings, checkModelData)
   }
 
+  test("SPARK-19940 - FPGrowth prediction should not contain duplicates") {
+// This should generate the same rules for t and s
+val dataset = spark.createDataFrame(Seq(
+  Array("1", "3"),
+  Array("2", "3")
+).map(Tuple1(_))).toDF("features")
+val model = new FPGrowth().fit(dataset)
+
+val predictions = model.transform(
+  spark.createDataFrame(Seq(Tuple1(Array("1", "2".toDF("features")
+)
+
+val prediction = predictions.first().getAs[Seq[String]]("prediction")
+assert(prediction.size === 1)
--- End diff --

nit: assert(prediction === Seq("3")) may be clearer.
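
For context on why a single element is expected, a small standalone sketch using plain Scala collections (not the FPGrowthModel internals, which presumably deduplicate the matched consequents):

```
// With the toy data in the diff, both rules {1} => {3} and {2} => {3} fire for the
// basket ["1", "2"]; naively concatenating their consequents would give a duplicate.
val firedConsequents = Seq(Seq("3"), Seq("3"))          // consequents of the matching rules
val naive = firedConsequents.flatten                    // Seq("3", "3")
val deduplicated = firedConsequents.flatten.distinct    // Seq("3"), matching the assertion above
```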


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17283: [SPARK-19940][ML][MINOR] FPGrowthModel.transform ...

2017-03-13 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17283#discussion_r105813424
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala 
---
@@ -103,6 +103,22 @@ class FPGrowthSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defaul
   FPGrowthSuite.allParamSettings, checkModelData)
   }
 
+  test("SPARK-19940 - FPGrowth prediction should not contain duplicates") {
+// This should generate the same rules for t and s
+val dataset = spark.createDataFrame(Seq(
+  Array("1", "3"),
+  Array("2", "3")
+).map(Tuple1(_))).toDF("features")
+val model = new FPGrowth().fit(dataset)
+
+val predictions = model.transform(
+  spark.createDataFrame(Seq(Tuple1(Array("1", "2".toDF("features")
+)
+
+val prediction = predictions.first().getAs[Seq[String]]("prediction")
--- End diff --

we can merge this line into the preceding statement.
```
model.transform(
...
).first().getAs[Seq[String]]("prediction")
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17283: [SPARK-19940][ML][MINOR] FPGrowthModel.transform ...

2017-03-13 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17283#discussion_r105813550
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala 
---
@@ -103,6 +103,22 @@ class FPGrowthSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defaul
   FPGrowthSuite.allParamSettings, checkModelData)
   }
 
+  test("SPARK-19940 - FPGrowth prediction should not contain duplicates") {
--- End diff --

This may be a violation of the style guide. Not sure we need the JIRA id here, 
as the test name is self-explanatory.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17194: [SPARK-19851] Add new aggregates EVERY and ANY (SOME).

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17194
  
**[Test build #74478 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74478/testReport)**
 for PR 17194 at commit 
[`1f49c7f`](https://github.com/apache/spark/commit/1f49c7f715df3c871463a23fac1b7a6bc8c9bfc5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2017-03-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15628
  
**[Test build #74477 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74477/testReport)**
 for PR 15628 at commit 
[`cf8945a`](https://github.com/apache/spark/commit/cf8945a83530324701d51fcc212d0becdf4a70c9).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15628
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74477/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2017-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15628
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15628: [SPARK-17471][ML] Add compressed method to ML matrices

2017-03-13 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/15628
  
@dbtsai Let me know your thoughts on the comments I left. Thanks for the 
review!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


