[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...

2017-02-24 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r103073487
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ExperimentalMethods.scala ---
@@ -46,4 +46,10 @@ class ExperimentalMethods private[sql]() {
 
   @volatile var extraOptimizations: Seq[Rule[LogicalPlan]] = Nil
 
+  override def clone(): ExperimentalMethods = {
--- End diff --

It sounds like we also need to add synchronization for both `extraStrategies` and `extraOptimizations`.
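
A minimal sketch of the synchronization being suggested, inside `ExperimentalMethods` (assuming `clone` simply copies the two `@volatile` fields; the merged change may differ):

```scala
override def clone(): ExperimentalMethods = {
  val result = new ExperimentalMethods
  // Copy both fields under one lock so a concurrent writer cannot update
  // one of them between the two reads.
  synchronized {
    result.extraStrategies = extraStrategies
    result.extraOptimizations = extraOptimizations
  }
  result
}
```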





[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...

2017-02-24 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r103073400
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -1178,4 +1181,36 @@ class SessionCatalog(
 }
   }
 
+  /**
+   * Get an identical copy of the `SessionCatalog`.
+   * The temporary tables and function registry are retained.
--- End diff --

`temporary tables` -> `temporary views`





[GitHub] spark pull request #17027: [SPARK-19650] Commands should not trigger a Spark...

2017-02-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17027





[GitHub] spark issue #17027: [SPARK-19650] Commands should not trigger a Spark job

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/17027
  
merging to master!





[GitHub] spark pull request #17063: [SPARK-19735][SQL] Remove HOLD_DDLTIME from Catal...

2017-02-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17063





[GitHub] spark pull request #17027: [SPARK-19650] Commands should not trigger a Spark...

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17027#discussion_r103073282
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -175,19 +175,14 @@ class Dataset[T] private[sql](
   }
 
   @transient private[sql] val logicalPlan: LogicalPlan = {
-    def hasSideEffects(plan: LogicalPlan): Boolean = plan match {
-      case _: Command |
-           _: InsertIntoTable => true
-      case _ => false
-    }
-
+    // For various commands (like DDL) and queries with side effects, we force query execution
+    // to happen right away to let these side effects take place eagerly.
     queryExecution.analyzed match {
       // For various commands (like DDL) and queries with side effects, we force query execution
--- End diff --

actually let me remove it while merging
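
For context, a small sketch of the eager behavior the duplicated comment describes (hypothetical table name, assuming a `SparkSession` named `spark`):

```scala
// Commands execute eagerly, at Dataset creation time:
val df = spark.sql("CREATE TABLE t (i INT) USING parquet")
// The CREATE TABLE side effect has already happened here, before any
// action such as df.collect() is called.
```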





[GitHub] spark issue #17063: [SPARK-19735][SQL] Remove HOLD_DDLTIME from Catalog APIs

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/17063
  
LGTM, merging to master!





[GitHub] spark issue #17065: [SPARK-17075][SQL][followup] fix some minor issues and c...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17065
  
**[Test build #73466 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73466/testReport)** for PR 17065 at commit [`5fa69b3`](https://github.com/apache/spark/commit/5fa69b36294940e1406bb3b1515539decbb9e03a).





[GitHub] spark issue #17065: [SPARK-17075][SQL][followup] fix some minor issues and c...

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/17065
  
CC @ron8hu @wzhfy 





[GitHub] spark issue #17060: [SQL] Duplicate test exception in SQLQueryTestSuite due ...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17060
  
**[Test build #73463 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73463/testReport)** for PR 17060 at commit [`4837de8`](https://github.com/apache/spark/commit/4837de867c0e93a9b2801cce352573d68684aa35).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17060: [SQL] Duplicate test exception in SQLQueryTestSuite due ...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17060
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17060: [SQL] Duplicate test exception in SQLQueryTestSuite due ...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17060
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73463/
Test PASSed.





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103073176
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -361,57 +343,52 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    */
 
   def evaluateInSet(
-      attrRef: AttributeReference,
+      attr: Attribute,
       hSet: Set[Any],
-      update: Boolean)
-    : Option[Double] = {
-    if (!mutableColStats.contains(attrRef.exprId)) {
-      logDebug("[CBO] No statistics for " + attrRef)
+      update: Boolean): Option[Double] = {
+    if (!colStatsMap.contains(attr)) {
+      logDebug("[CBO] No statistics for " + attr)
       return None
     }
 
-    val aColStat = mutableColStats(attrRef.exprId)
-    val ndv = aColStat.distinctCount
-    val aType = attrRef.dataType
-    var newNdv: Long = 0
+    val colStat = colStatsMap(attr)
+    val ndv = colStat.distinctCount
+    val dataType = attr.dataType
+    var newNdv = ndv
 
     // use [min, max] to filter the original hSet
-    aType match {
-      case _: NumericType | DateType | TimestampType =>
-        val statsRange =
-          Range(aColStat.min, aColStat.max, aType).asInstanceOf[NumericRange]
-
-        // To facilitate finding the min and max values in hSet, we map hSet values to BigDecimal.
-        // Using hSetBigdec, we can find the min and max values quickly in the ordered hSetBigdec.
-        val hSetBigdec = hSet.map(e => BigDecimal(e.toString))
-        val validQuerySet = hSetBigdec.filter(e => e >= statsRange.min && e <= statsRange.max)
-        // We use hSetBigdecToAnyMap to help us find the original hSet value.
-        val hSetBigdecToAnyMap: Map[BigDecimal, Any] =
-          hSet.map(e => BigDecimal(e.toString) -> e).toMap
+    dataType match {
+      case _: NumericType | BooleanType | DateType | TimestampType =>
--- End diff --

add boolean type
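
A small illustration of the [min, max] pruning idea with made-up values (not the PR's exact code):

```scala
// Suppose the column's statistics say its values lie in [0, 10].
val hSet: Set[Any] = Set(1, 5, 50)
val (min, max) = (BigDecimal(0), BigDecimal(10))
// Map to BigDecimal so heterogeneous literals compare uniformly, then drop
// values outside [min, max] -- they can never match a row.
val validQuerySet = hSet.map(e => BigDecimal(e.toString)).filter(e => e >= min && e <= max)
// validQuerySet contains 1 and 5; 50 is pruned away.
```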





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103073163
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -297,6 +278,8 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
         Some(DateTimeUtils.toJavaDate(litValue.toString.toInt))
       case TimestampType =>
         Some(DateTimeUtils.toJavaTimestamp(litValue.toString.toLong))
+      case _: DecimalType =>
+        Some(litValue.asInstanceOf[Decimal].toJavaBigDecimal)
--- End diff --

@ron8hu the external value type of `DecimalType` is Java `BigDecimal`, while the internal value type is `Decimal`, so we need to convert it.
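
A tiny sketch of the two representations (assuming a plain `Decimal` literal value):

```scala
import org.apache.spark.sql.types.Decimal

val internal: Decimal = Decimal("3.14")                         // internal value type
val external: java.math.BigDecimal = internal.toJavaBigDecimal  // external value type
```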





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103073122
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -258,27 +246,20 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    */
   def evaluateBinary(
       op: BinaryComparison,
-      attrRef: AttributeReference,
+      attr: Attribute,
       literal: Literal,
-      update: Boolean)
-    : Option[Double] = {
-    if (!mutableColStats.contains(attrRef.exprId)) {
-      logDebug("[CBO] No statistics for " + attrRef)
-      return None
-    }
-
-    op match {
-      case EqualTo(l, r) => evaluateEqualTo(attrRef, literal, update)
+      update: Boolean): Option[Double] = {
+    attr.dataType match {
+      case _: NumericType | DateType | TimestampType =>
+        evaluateBinaryForNumeric(op, attr, literal, update)
+      case StringType | BinaryType =>
+        // TODO: It is difficult to support other binary comparisons for String/Binary
+        // type without min/max and advanced statistics like histogram.
+        logDebug("[CBO] No range comparison statistics for String/Binary type " + attr)
+        None
       case _ =>
-        attrRef.dataType match {
-          case _: NumericType | DateType | TimestampType =>
-            evaluateBinaryForNumeric(op, attrRef, literal, update)
-          case StringType | BinaryType =>
--- End diff --

Previously we totally missed `BooleanType` and would throw a `MatchError` if the attribute is boolean. But the logic in `evaluateBinaryForNumeric` doesn't work for booleans, so I treat it as unsupported for now. @wzhfy do you have time to work on it?
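
A sketch of the new shape of the match, assuming the catch-all simply returns `None`:

```scala
attr.dataType match {
  case _: NumericType | DateType | TimestampType =>
    evaluateBinaryForNumeric(op, attr, literal, update)
  case StringType | BinaryType =>
    None
  case _ =>
    // BooleanType lands here now: no MatchError, but also no estimate
    // until boolean-specific logic is implemented.
    None
}
```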





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103073081
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -140,56 +129,56 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    * @param condition a single logical expression
    * @param update a boolean flag to specify if we need to update ColumnStat of a column
    *               for subsequent conditions
-   * @return Option[Double] value to show the percentage of rows meeting a given condition.
+   * @return an optional double value to show the percentage of rows meeting a given condition.
    *         It returns None if the condition is not supported.
    */
   def calculateSingleCondition(condition: Expression, update: Boolean): Option[Double] = {
     condition match {
       // For evaluateBinary method, we assume the literal on the right side of an operator.
       // So we will change the order if not.
 
-      // EqualTo does not care about the order
-      case op @ EqualTo(ar: AttributeReference, l: Literal) =>
-        evaluateBinary(op, ar, l, update)
-      case op @ EqualTo(l: Literal, ar: AttributeReference) =>
-        evaluateBinary(op, ar, l, update)
+      // EqualTo/EqualNullSafe does not care about the order
--- End diff --

also support `EqualNullSafe`





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103073076
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -95,15 +84,16 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    * @param condition the compound logical expression
    * @param update a boolean flag to specify if we need to update ColumnStat of a column
    *               for subsequent conditions
-   * @return a double value to show the percentage of rows meeting a given condition.
+   * @return an optional double value to show the percentage of rows meeting a given condition.
    *         It returns None if the condition is not supported.
    */
   def calculateFilterSelectivity(condition: Expression, update: Boolean = true): Option[Double] = {
-
     condition match {
       case And(cond1, cond2) =>
-        (calculateFilterSelectivity(cond1, update), calculateFilterSelectivity(cond2, update))
-        match {
+        // For ease of debugging, we compute percent1 and percent2 in 2 statements.
+        val percent1 = calculateFilterSelectivity(cond1, update)
+        val percent2 = calculateFilterSelectivity(cond2, update)
+        (percent1, percent2) match {
          case (Some(p1), Some(p2)) => Some(p1 * p2)
          case (Some(p1), None) => Some(p1)
--- End diff --

Here we are actually over-estimating: if a condition inside `And` is unsupported, we assume it has 100% selectivity, which may lead to under-estimation when this `And` is wrapped in a `Not`.

We should

1. treat the whole `And` as unsupported if one of its conditions is unsupported
2. not handle nested `Not`

cc @wzhfy @ron8hu 
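
A small numeric illustration of the concern (made-up selectivities):

```scala
val p1 = 0.1                 // sel(cond1), supported
val p2Assumed = 1.0          // cond2 unsupported, assumed 100% selective
val andSel = p1 * p2Assumed  // 0.1: over-estimates And (with a true sel(cond2) of 0.5 it would be 0.05)
val notSel = 1.0 - andSel    // 0.9: under-estimates Not(And(...)) (the true value would be 0.95)
```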





[GitHub] spark issue #17065: [SPARK-17075][SQL][followup] fix some minor issues and c...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17065
  
**[Test build #73465 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73465/testReport)** for PR 17065 at commit [`04cc681`](https://github.com/apache/spark/commit/04cc6811445790a636a534c42e2caf053a238eb8).





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-24 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/17065

[SPARK-17075][SQL][followup] fix some minor issues and clean up the code

## What changes were proposed in this pull request?

This fixes some code style issues, naming issues, some missing cases in 
pattern match, etc.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark follow-up

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17065.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17065


commit 04cc6811445790a636a534c42e2caf053a238eb8
Author: Wenchen Fan 
Date:   2017-02-25T01:06:37Z

fix some minor issues and clean up the code







[GitHub] spark issue #17063: [SPARK-19735][SQL] Remove HOLD_DDLTIME from Catalog APIs

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17063
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17063: [SPARK-19735][SQL] Remove HOLD_DDLTIME from Catalog APIs

2017-02-24 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17063
  
cc @cloud-fan @yhuai @sameeragarwal 





[GitHub] spark issue #17063: [SPARK-19735][SQL] Remove HOLD_DDLTIME from Catalog APIs

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17063
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73462/
Test PASSed.





[GitHub] spark issue #16592: [SPARK-19235] [SQL] [TESTS] Enable Test Cases in DDLSuit...

2017-02-24 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16592
  
Let me continue the work now





[GitHub] spark issue #17063: [SPARK-19735][SQL] Remove HOLD_DDLTIME from Catalog APIs

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17063
  
**[Test build #73462 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73462/testReport)** for PR 17063 at commit [`d4ac13c`](https://github.com/apache/spark/commit/d4ac13cf3a0ac3d4734eaa54dad50501374b7405).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17064: [SPARK-19736][SQL] refreshByPath should clear all cached...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17064
  
**[Test build #73464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73464/testReport)** for PR 17064 at commit [`dd6d8ca`](https://github.com/apache/spark/commit/dd6d8ca1c091d00bfc29363ebc0d518b12927325).





[GitHub] spark pull request #17064: [SPARK-19736][SQL] refreshByPath should clear all...

2017-02-24 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/17064

[SPARK-19736][SQL] refreshByPath should clear all cached plans with the 
specified path

## What changes were proposed in this pull request?

`Catalog.refreshByPath` can refresh the cache entry and the associated metadata for all DataFrames (if any) that contain the given data source path.

However, `CacheManager.invalidateCachedPath` doesn't clear all cached plans with the specified path, which causes some of the strange behaviors reported in SPARK-15678.
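
A hypothetical reproduction sketch (assuming a `SparkSession` named `spark` and parquet data at `path`):

```scala
val df = spark.read.parquet(path)
df.cache()
df.count()                          // materializes the cached plan
// ... files under `path` are rewritten by an external process ...
spark.catalog.refreshByPath(path)   // should invalidate every cached plan reading `path`
spark.read.parquet(path).count()    // expected to reflect the rewritten data
```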

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 fix-refreshByPath

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17064.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17064


commit dd6d8ca1c091d00bfc29363ebc0d518b12927325
Author: Liang-Chi Hsieh 
Date:   2017-02-25T05:58:40Z

refreshByPath should clear all cached plans with the specified path.







[GitHub] spark issue #17060: [SQL] Duplicate test exception in SQLQueryTestSuite due ...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17060
  
**[Test build #73463 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73463/testReport)** for PR 17060 at commit [`4837de8`](https://github.com/apache/spark/commit/4837de867c0e93a9b2801cce352573d68684aa35).





[GitHub] spark issue #14789: [SPARK-17209][YARN] Add the ability to manually update c...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14789
  
Build finished. Test FAILed.





[GitHub] spark issue #14789: [SPARK-17209][YARN] Add the ability to manually update c...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14789
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73461/
Test FAILed.





[GitHub] spark issue #14789: [SPARK-17209][YARN] Add the ability to manually update c...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14789
  
**[Test build #73461 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73461/consoleFull)** for PR 14789 at commit [`b1cac5c`](https://github.com/apache/spark/commit/b1cac5c95f3c7f77a8253a88aff48d2a1934072c).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark pull request #17060: [SQL] Duplicate test exception in SQLQueryTestSui...

2017-02-24 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17060#discussion_r103071356
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
@@ -241,7 +243,8 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
   private def listTestCases(): Seq[TestCase] = {
     listFilesRecursively(new File(inputFilePath)).map { file =>
       val resultFile = file.getAbsolutePath.replace(inputFilePath, goldenFilePath) + ".out"
-      TestCase(file.getName, file.getAbsolutePath, resultFile)
+      val absPath = file.getAbsolutePath
+      TestCase(absPath.stripPrefix(inputFilePath + File.separator), absPath, resultFile)
--- End diff --

@srowen Sure. Although I think we are safe, I would go ahead and make that change.





[GitHub] spark issue #17056: [SPARK-17495] [SQL] Support Decimal type in Hive-hash

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17056
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17056: [SPARK-17495] [SQL] Support Decimal type in Hive-hash

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17056
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73460/
Test PASSed.





[GitHub] spark issue #17056: [SPARK-17495] [SQL] Support Decimal type in Hive-hash

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17056
  
**[Test build #73460 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73460/testReport)** for PR 17056 at commit [`8595305`](https://github.com/apache/spark/commit/8595305c2dc3b276d6390724ca1f1469794540f5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17060: [SQL] Duplicate test exception in SQLQueryTestSui...

2017-02-24 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17060#discussion_r103071094
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
@@ -98,7 +98,9 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
 
   /** List of test cases to ignore, in lower cases. */
   private val blackList = Set(
-    "blacklist.sql"   // Do NOT remove this one. It is here to test the blacklist functionality.
+    "blacklist.sql",  // Do NOT remove this one. It is here to test the blacklist functionality.
+    ".ds_store"       // A meta-file that may be created on Mac by Finder App.
--- End diff --

That's right, you'd have to lower-case both. I think that's clearer.





[GitHub] spark pull request #17060: [SQL] Duplicate test exception in SQLQueryTestSui...

2017-02-24 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17060#discussion_r103070909
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
@@ -98,7 +98,9 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
 
   /** List of test cases to ignore, in lower cases. */
   private val blackList = Set(
-    "blacklist.sql"   // Do NOT remove this one. It is here to test the blacklist functionality.
+    "blacklist.sql",  // Do NOT remove this one. It is here to test the blacklist functionality.
+    ".ds_store"       // A meta-file that may be created on Mac by Finder App.
--- End diff --

@srowen I did think about it. If we want to keep the mixed case here in this list, then we have to change the code later to lowercase both sides before comparing. Would you prefer to change the comparison to the following?
```
if (blackList.exists(t => testCase.name.toLowerCase.contains(t.toLowerCase))) {
  ...
}
```





[GitHub] spark pull request #17060: [SQL] Duplicate test exception in SQLQueryTestSui...

2017-02-24 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17060#discussion_r103070856
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
@@ -241,7 +243,8 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
   private def listTestCases(): Seq[TestCase] = {
     listFilesRecursively(new File(inputFilePath)).map { file =>
       val resultFile = file.getAbsolutePath.replace(inputFilePath, goldenFilePath) + ".out"
-      TestCase(file.getName, file.getAbsolutePath, resultFile)
+      val absPath = file.getAbsolutePath
+      TestCase(absPath.stripPrefix(inputFilePath + File.separator), absPath, resultFile)
--- End diff --

Does this get into problems where inputFilePath might already have a trailing separator and then isn't a prefix? Maybe it can't happen, but maybe it's easier to strip leading separators from the final result directly, if that's what you mean to do.
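
A sketch of the edge case being raised (hypothetical values):

```scala
import java.io.File

val inputFilePath = "/home/spark/sql-tests/inputs/"  // note the trailing separator
val absPath = inputFilePath + "basic.sql"
val name = absPath.stripPrefix(inputFilePath + File.separator)
// inputFilePath + File.separator ends in "//", which is not a prefix of
// absPath, so stripPrefix returns absPath unchanged, leading path and all.
```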





[GitHub] spark issue #17063: [SPARK-19735][SQL] Remove HOLD_DDLTIME from Catalog APIs

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17063
  
**[Test build #73462 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73462/testReport)** for PR 17063 at commit [`d4ac13c`](https://github.com/apache/spark/commit/d4ac13cf3a0ac3d4734eaa54dad50501374b7405).





[GitHub] spark issue #16976: [SPARK-19610][SQL] Support parsing multiline CSV files

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16976
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16976: [SPARK-19610][SQL] Support parsing multiline CSV files

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16976
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73457/
Test PASSed.





[GitHub] spark issue #16976: [SPARK-19610][SQL] Support parsing multiline CSV files

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16976
  
**[Test build #73457 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73457/testReport)** for PR 16976 at commit [`b7d56c6`](https://github.com/apache/spark/commit/b7d56c6b6db7df7fe12831efb8338787b3108c0e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17063: [SPARK-19735][SQL] Remove HOLD_DDLTIME from Catal...

2017-02-24 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/17063

[SPARK-19735][SQL] Remove HOLD_DDLTIME from Catalog APIs

### What changes were proposed in this pull request?
As explained in the Hive JIRA https://issues.apache.org/jira/browse/HIVE-12224, HOLD_DDLTIME was broken as soon as it landed. Hive 2.0 removed HOLD_DDLTIME from the API. In Spark SQL, we always set it to FALSE; like Hive, we should also remove it from our Catalog APIs.

### How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark removalHoldDDLTime

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17063.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17063


commit d4ac13cf3a0ac3d4734eaa54dad50501374b7405
Author: Xiao Li 
Date:   2017-02-25T04:14:34Z

fix.







[GitHub] spark pull request #17060: [SQL] Duplicate test exception in SQLQueryTestSui...

2017-02-24 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17060#discussion_r103070711
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
@@ -241,7 +243,8 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
   private def listTestCases(): Seq[TestCase] = {
     listFilesRecursively(new File(inputFilePath)).map { file =>
       val resultFile = file.getAbsolutePath.replace(inputFilePath, goldenFilePath) + ".out"
-      TestCase(file.getName, file.getAbsolutePath, resultFile)
+      val absPath = file.getAbsolutePath
+      TestCase(absPath.stripPrefix(inputFilePath + File.separator), absPath, resultFile)
--- End diff --

@srowen I am adding a separator to make sure the test case name does not have a leading separator character. Example:
```
file = /home/spark/sql/core/src/test/resources/sql-tests/inputs/subquery/exists-subquery/exists-basic.sql
inputFilePath = /home/spark/sql/core/src/test/resources/sql-tests/inputs
```
So I wanted the test case name to be `subquery/exists-subquery/exists-basic.sql` and not `/subquery/exists-subquery/exists-basic.sql`.
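
A sketch of that stripping, using the paths from the example above:

```scala
import java.io.File

val inputFilePath = "/home/spark/sql/core/src/test/resources/sql-tests/inputs"
val absPath = inputFilePath + "/subquery/exists-subquery/exists-basic.sql"
val name = absPath.stripPrefix(inputFilePath + File.separator)
// name == "subquery/exists-subquery/exists-basic.sql" -- no leading separator
```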





[GitHub] spark pull request #17060: [SQL] Duplicate test exception in SQLQueryTestSui...

2017-02-24 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17060#discussion_r103070622
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
@@ -121,7 +123,7 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
   }
 
   private def createScalaTestCase(testCase: TestCase): Unit = {
-    if (blackList.contains(testCase.name.toLowerCase)) {
+    if (blackList.exists(testCase.name.toLowerCase.contains(_))) {
--- End diff --

@srowen Thanks!! I will change.






[GitHub] spark issue #17062: [SPARK-17495] [SQL] Support date, timestamp and interval...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17062
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17062: [SPARK-17495] [SQL] Support date, timestamp and interval...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17062
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73459/
Test FAILed.





[GitHub] spark issue #17062: [SPARK-17495] [SQL] Support date, timestamp and interval...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17062
  
**[Test build #73459 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73459/testReport)** for PR 17062 at commit [`332475c`](https://github.com/apache/spark/commit/332475c1641f61080aa41dda9f1ceec237351d75).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16976: [SPARK-19610][SQL] Support parsing multiline CSV files

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16976
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73456/
Test PASSed.





[GitHub] spark issue #16976: [SPARK-19610][SQL] Support parsing multiline CSV files

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16976
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16976: [SPARK-19610][SQL] Support parsing multiline CSV files

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16976
  
**[Test build #73456 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73456/testReport)** for PR 16976 at commit [`6187f6c`](https://github.com/apache/spark/commit/6187f6c0617fa0aec75d1eb638a00c6e7a350d61).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17061: [SPARK-13446] [SQL] Support reading data from Hive 2.0.1...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17061
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17061: [SPARK-13446] [SQL] Support reading data from Hive 2.0.1...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17061
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73458/
Test PASSed.





[GitHub] spark issue #17061: [SPARK-13446] [SQL] Support reading data from Hive 2.0.1...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17061
  
**[Test build #73458 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73458/testReport)** for PR 17061 at commit [`60af17f`](https://github.com/apache/spark/commit/60af17f0178ba8ab7d7881c118915334e2c824eb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16976: [SPARK-19610][SQL] Support parsing multiline CSV files

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16976
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73455/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16976: [SPARK-19610][SQL] Support parsing multiline CSV files

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16976
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16976: [SPARK-19610][SQL] Support parsing multiline CSV files

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16976
  
**[Test build #73455 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73455/testReport)**
 for PR 16976 at commit 
[`b4e6983`](https://github.com/apache/spark/commit/b4e6983351192b788708ba3cd2a4c8e7321f34b0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17047: [SPARK-19720][SPARK SUBMIT] Redact sensitive info...

2017-02-24 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17047#discussion_r103069680
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2574,13 +2575,30 @@ private[spark] object Utils extends Logging {
 
   def redact(conf: SparkConf, kvs: Seq[(String, String)]): Seq[(String, String)] = {
     val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
+    redact(redactionPattern, kvs)
+  }
+
+  private def redact(redactionPattern: Regex, kvs: Seq[(String, String)]): Seq[(String, String)] = {
     kvs.map { kv =>
       redactionPattern.findFirstIn(kv._1)
         .map { ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
         .getOrElse(kv)
     }
   }
 
+  /**
+   * Looks up the redaction regex from within the key value pairs and uses it to redact the rest
+   * of the key value pairs. No care is taken to make sure the redaction property itself is not
+   * redacted. So theoretically, the property itself could be configured to redact its own value
+   * when printing.
+   * @param kvs
+   * @return
+   */
+  def redact(kvs: Map[String, String]): Seq[(String, String)] = {
--- End diff --

(Nit: I'd omit `@param` and `@return` if they're not filled in.)
So this is used in cases where there isn't a conf object available yet, but 
the argument map itself carries the redaction config? I was slightly worried about 
the parallel implementation, but that would be a reasonable justification for it. 
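    
As an aside, the self-lookup behavior described in the doc comment can be 
sketched standalone. This is a minimal sketch, not the exact Spark internals; 
the constant values and the `spark.redaction.regex` key name here are assumptions:

```scala
import scala.util.matching.Regex

object RedactSketch {
  // Hypothetical stand-ins for Spark's internal constants.
  val REDACTION_REPLACEMENT_TEXT = "*********(redacted)"
  val REDACTION_KEY = "spark.redaction.regex" // assumed key name

  // Looks up the redaction pattern in the map itself, then redacts values
  // whose keys match. As the doc comment notes, the pattern key is not
  // excluded, so a pattern matching REDACTION_KEY would redact its own value.
  def redact(kvs: Map[String, String]): Seq[(String, String)] = {
    kvs.get(REDACTION_KEY).map(_.r) match {
      case Some(pattern: Regex) =>
        kvs.toSeq.map { case (k, v) =>
          if (pattern.findFirstIn(k).isDefined) (k, REDACTION_REPLACEMENT_TEXT)
          else (k, v)
        }
      case None => kvs.toSeq
    }
  }
}
```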


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16976: [SPARK-19610][SQL] Support parsing multiline CSV files

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16976
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73454/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16976: [SPARK-19610][SQL] Support parsing multiline CSV files

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16976
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16976: [SPARK-19610][SQL] Support parsing multiline CSV files

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16976
  
**[Test build #73454 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73454/testReport)**
 for PR 16976 at commit 
[`321d082`](https://github.com/apache/spark/commit/321d082e4f0dbac34c39618adbded99d119e06ac).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14963: [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14963
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73451/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14963: [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14963
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14963: [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14963
  
**[Test build #73451 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73451/testReport)**
 for PR 14963 at commit 
[`215b7b3`](https://github.com/apache/spark/commit/215b7b34170f112c4448fba98b02a50dbb19b2a7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17052: [SPARK-19690][SS] Join a streaming DataFrame with a batc...

2017-02-24 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17052
  
@zsxwing got it


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-02-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15415#discussion_r103068598
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.fpm.{AssociationRules => 
MLlibAssociationRules,
+  FPGrowth => MLlibFPGrowth}
+import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Common params for FPGrowth and FPGrowthModel
+ */
+private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with 
HasPredictionCol {
+
+  /**
+   * Minimal support level of the frequent pattern. [0.0, 1.0]. Any 
pattern that appears
+   * more than (minSupport * size-of-the-dataset) times will be output
+   * Default: 0.3
+   * @group param
+   */
+  @Since("2.2.0")
+  val minSupport: DoubleParam = new DoubleParam(this, "minSupport",
+"the minimal support level of a frequent pattern",
+ParamValidators.inRange(0.0, 1.0))
+  setDefault(minSupport -> 0.3)
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getMinSupport: Double = $(minSupport)
+
+  /**
+   * Number of partitions (>=1) used by parallel FP-growth. By default the 
param is not set, and
+   * partition number of the input dataset is used.
+   * @group expertParam
+   */
+  @Since("2.2.0")
+  val numPartitions: IntParam = new IntParam(this, "numPartitions",
+"Number of partitions used by parallel FP-growth", 
ParamValidators.gtEq[Int](1))
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getNumPartitions: Int = $(numPartitions)
+
+  /**
+   * Minimal confidence for generating Association Rule.
+   * Note that minConfidence has no effect during fitting.
+   * Default: 0.8
+   * @group param
+   */
+  @Since("2.2.0")
+  val minConfidence: DoubleParam = new DoubleParam(this, "minConfidence",
+"minimal confidence for generating Association Rule",
+ParamValidators.inRange(0.0, 1.0))
+  setDefault(minConfidence -> 0.8)
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getMinConfidence: Double = $(minConfidence)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  @Since("2.2.0")
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(featuresCol)).dataType
+require(inputType.isInstanceOf[ArrayType],
+  s"The input column must be ArrayType, but got $inputType.")
+SchemaUtils.appendColumn(schema, $(predictionCol), 
schema($(featuresCol)).dataType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * A parallel FP-growth algorithm to mine frequent itemsets. The algorithm 
is described in
+ * <a href="http://dx.doi.org/10.1145/1454008.1454027">Li et al., PFP: Parallel FP-Growth for Query
+ * Recommendation</a>. PFP distributes computation in such a way that each worker executes an
+ * independent group of mining tasks. The FP-Growth algorithm is described in
+ * <a href="http://dx.doi.org/10.1145/335191.335372">Han et al., Mining frequent patterns without
+ * candidate generation</a>. Note null values in the feature column are ignored during fit().
+ *
+ * @see <a href="http://en.wikipedia.org/wiki/Association_rule_learning">
+ * Association rule learning (Wikiped

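As context for the params quoted above: `ParamValidators.inRange(0.0, 1.0)` is 
inclusive on both bounds by default, so the 0.3 and 0.8 defaults as well as the 
endpoints pass. A minimal standalone sketch of that check (not the spark.ml 
implementation itself):

```scala
// Mirrors the semantics of ParamValidators.inRange(0.0, 1.0) with the
// default inclusive bounds.
def inRange(lower: Double, upper: Double): Double => Boolean =
  v => v >= lower && v <= upper

val validFraction = inRange(0.0, 1.0)
assert(validFraction(0.3))   // default minSupport
assert(validFraction(0.8))   // default minConfidence
assert(!validFraction(1.5))  // would be rejected by the validator
```
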
[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-02-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15415#discussion_r103068619
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.fpm.{AssociationRules => 
MLlibAssociationRules,
+  FPGrowth => MLlibFPGrowth}
+import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Common params for FPGrowth and FPGrowthModel
+ */
+private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with 
HasPredictionCol {
+
+  /**
+   * Minimal support level of the frequent pattern. [0.0, 1.0]. Any 
pattern that appears
+   * more than (minSupport * size-of-the-dataset) times will be output
+   * Default: 0.3
+   * @group param
+   */
+  @Since("2.2.0")
+  val minSupport: DoubleParam = new DoubleParam(this, "minSupport",
+"the minimal support level of a frequent pattern",
+ParamValidators.inRange(0.0, 1.0))
+  setDefault(minSupport -> 0.3)
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getMinSupport: Double = $(minSupport)
+
+  /**
+   * Number of partitions (>=1) used by parallel FP-growth. By default the 
param is not set, and
+   * partition number of the input dataset is used.
+   * @group expertParam
+   */
+  @Since("2.2.0")
+  val numPartitions: IntParam = new IntParam(this, "numPartitions",
+"Number of partitions used by parallel FP-growth", 
ParamValidators.gtEq[Int](1))
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getNumPartitions: Int = $(numPartitions)
+
+  /**
+   * Minimal confidence for generating Association Rule.
+   * Note that minConfidence has no effect during fitting.
+   * Default: 0.8
+   * @group param
+   */
+  @Since("2.2.0")
+  val minConfidence: DoubleParam = new DoubleParam(this, "minConfidence",
+"minimal confidence for generating Association Rule",
+ParamValidators.inRange(0.0, 1.0))
+  setDefault(minConfidence -> 0.8)
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getMinConfidence: Double = $(minConfidence)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  @Since("2.2.0")
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(featuresCol)).dataType
+require(inputType.isInstanceOf[ArrayType],
+  s"The input column must be ArrayType, but got $inputType.")
+SchemaUtils.appendColumn(schema, $(predictionCol), 
schema($(featuresCol)).dataType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * A parallel FP-growth algorithm to mine frequent itemsets. The algorithm 
is described in
+ * <a href="http://dx.doi.org/10.1145/1454008.1454027">Li et al., PFP: Parallel FP-Growth for Query
+ * Recommendation</a>. PFP distributes computation in such a way that each worker executes an
+ * independent group of mining tasks. The FP-Growth algorithm is described in
+ * <a href="http://dx.doi.org/10.1145/335191.335372">Han et al., Mining frequent patterns without
+ * candidate generation</a>. Note null values in the feature column are ignored during fit().
+ *
+ * @see <a href="http://en.wikipedia.org/wiki/Association_rule_learning">
+ * Association rule learning (Wikiped

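A side note on the `validateAndTransformSchema` helper quoted above: its 
`require` means the features column must be an array type. A minimal sketch of 
the same check, assuming a column literally named `features`:

```scala
import org.apache.spark.sql.types._

// Mirrors the require() in validateAndTransformSchema: the input
// column must be an array of items, not a scalar.
def checkFeatures(schema: StructType): Unit = {
  val inputType = schema("features").dataType
  require(inputType.isInstanceOf[ArrayType],
    s"The input column must be ArrayType, but got $inputType.")
}

checkFeatures(StructType(Seq(StructField("features", ArrayType(StringType)))))  // passes
// checkFeatures(StructType(Seq(StructField("features", StringType))))          // throws
```
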
[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16626
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73450/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16626
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16944: [SPARK-19611][SQL] Introduce configurable table schema i...

2017-02-24 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16944
  
A few small comments left. LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16626
  
**[Test build #73450 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73450/testReport)**
 for PR 16626 at commit 
[`4bd04fc`](https://github.com/apache/spark/commit/4bd04fc556fde3805f58e39a0e6ad50c2a1e8aec).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r103068578
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSchemaInferenceSuite.scala
 ---
@@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.util.concurrent.{Executors, TimeUnit}
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.execution.datasources.FileStatusCache
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.hive.client.HiveClient
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf.HiveCaseSensitiveInferenceMode
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.sql.types._
+
+class HiveSchemaInferenceSuite
+  extends QueryTest with TestHiveSingleton with SQLTestUtils with 
BeforeAndAfterEach {
+
+  import HiveSchemaInferenceSuite._
+
+  override def beforeEach(): Unit = {
+super.beforeEach()
+FileStatusCache.resetForTesting()
+  }
+
+  override def afterEach(): Unit = {
+super.afterEach()
+FileStatusCache.resetForTesting()
+  }
+
+  private val externalCatalog = 
spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog]
+  private val lowercaseSchema = StructType(Seq(
+StructField("fieldone", LongType),
+StructField("partcol1", IntegerType),
+StructField("partcol2", IntegerType)))
+  private val caseSensitiveSchema = StructType(Seq(
+StructField("fieldOne", LongType),
+// Partition columns remain case-insensitive
+StructField("partcol1", IntegerType),
+StructField("partcol2", IntegerType)))
+
+  // Create a CatalogTable instance modeling an external Hive Metastore 
table backed by
+  // Parquet data files.
+  private def hiveExternalCatalogTable(
+  tableName: String,
+  location: String,
+  schema: StructType,
+  partitionColumns: Seq[String],
+  properties: Map[String, String] = Map.empty): CatalogTable = {
+CatalogTable(
+  identifier = TableIdentifier(table = tableName, database = 
Option(DATABASE)),
+  tableType = CatalogTableType.EXTERNAL,
+  storage = CatalogStorageFormat(
+locationUri = Option(location),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serialization.format" -> "1")),
+  schema = schema,
+  provider = Option("hive"),
+  partitionColumnNames = partitionColumns,
+  properties = properties)
+  }
+
+  // Creates CatalogTablePartition instances for adding partitions of data 
to our test table.
+  private def hiveCatalogPartition(location: String, index: Int): 
CatalogTablePartition
+= CatalogTablePartition(
+  spec = Map("partcol1" -> index.toString, "partcol2" -> 
index.toString),
+  storage = CatalogStorageFormat(
+locationUri = 
Option(s"${location}/partCol1=$index/partCol2=$index/"),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serialization.format" -> "1")))
+
+  // Creates a ca

[GitHub] spark issue #17056: [SPARK-17495] [SQL] Support Decimal type in Hive-hash

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17056
  
**[Test build #73460 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73460/testReport)**
 for PR 17056 at commit 
[`8595305`](https://github.com/apache/spark/commit/8595305c2dc3b276d6390724ca1f1469794540f5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14789: [SPARK-17209][YARN] Add the ability to manually update c...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14789
  
**[Test build #73461 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73461/consoleFull)**
 for PR 14789 at commit 
[`b1cac5c`](https://github.com/apache/spark/commit/b1cac5c95f3c7f77a8253a88aff48d2a1934072c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r103068501
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSchemaInferenceSuite.scala
 ---
@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.util.concurrent.{Executors, TimeUnit}
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.execution.datasources.FileStatusCache
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.hive.client.HiveClient
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf.HiveCaseSensitiveInferenceMode
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.sql.types._
+
+class HiveSchemaInferenceSuite
+  extends QueryTest with TestHiveSingleton with SQLTestUtils with 
BeforeAndAfterEach {
+
+  import HiveSchemaInferenceSuite._
+  import HiveExternalCatalog.SPARK_SQL_PREFIX
+
+  override def beforeEach(): Unit = {
+super.beforeEach()
+FileStatusCache.resetForTesting()
+  }
+
+  override def afterEach(): Unit = {
+super.afterEach()
+FileStatusCache.resetForTesting()
+  }
+
+  private val externalCatalog = 
spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog]
+  private val lowercaseSchema = StructType(Seq(
+StructField("fieldone", LongType),
+StructField("partcol1", IntegerType),
+StructField("partcol2", IntegerType)))
+  private val caseSensitiveSchema = StructType(Seq(
+StructField("fieldOne", LongType),
+// Partition columns remain case-insensitive
+StructField("partcol1", IntegerType),
+StructField("partcol2", IntegerType)))
+
+  // Create a CatalogTable instance modeling an external Hive Metastore 
table backed by
+  // Parquet data files.
+  private def hiveExternalCatalogTable(
+  tableName: String,
+  location: String,
+  schema: StructType,
+  partitionColumns: Seq[String],
+  properties: Map[String, String] = Map.empty): CatalogTable = {
+CatalogTable(
+  identifier = TableIdentifier(table = tableName, database = 
Option(DATABASE)),
+  tableType = CatalogTableType.EXTERNAL,
+  storage = CatalogStorageFormat(
+locationUri = Option(location),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serialization.format" -> "1")),
+  schema = schema,
+  provider = Option("hive"),
+  partitionColumnNames = partitionColumns,
+  properties = properties)
+  }
+
+  // Creates CatalogTablePartition instances for adding partitions of data 
to our test table.
+  private def hiveCatalogPartition(location: String, index: Int): 
CatalogTablePartition
+= CatalogTablePartition(
+  spec = Map("partcol1" -> index.toString, "partcol2" -> 
index.toString),
+  storage = CatalogStorageFormat(
+locationUri = 
Option(s"${location}/partCol1=$index/partCol2=$index/"),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serializa

[GitHub] spark issue #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-02-24 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/15415
  
I don't think we need to support a default prediction (for empty/null 
inputs) now.  I agree we could use an imputer or add something as an option 
later on.

Will take a final look now


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

2017-02-24 Thread datumbox
Github user datumbox commented on the issue:

https://github.com/apache/spark/pull/17059
  
@srowen Yes, the behaviour of the method remains the same. This patch 
helped me get a measurable improvement on GC overhead in Spark 2.0, so I thought 
it would be beneficial for others. Anyway, thanks for the comments. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r103068442
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSchemaInferenceSuite.scala
 ---
@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.util.concurrent.{Executors, TimeUnit}
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.execution.datasources.FileStatusCache
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.hive.client.HiveClient
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf.HiveCaseSensitiveInferenceMode
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.sql.types._
+
+class HiveSchemaInferenceSuite
+  extends QueryTest with TestHiveSingleton with SQLTestUtils with 
BeforeAndAfterEach {
+
+  import HiveSchemaInferenceSuite._
+  import HiveExternalCatalog.SPARK_SQL_PREFIX
+
+  override def beforeEach(): Unit = {
+super.beforeEach()
+FileStatusCache.resetForTesting()
+  }
+
+  override def afterEach(): Unit = {
+super.afterEach()
+FileStatusCache.resetForTesting()
+  }
+
+  private val externalCatalog = 
spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog]
+  private val lowercaseSchema = StructType(Seq(
+StructField("fieldone", LongType),
+StructField("partcol1", IntegerType),
+StructField("partcol2", IntegerType)))
+  private val caseSensitiveSchema = StructType(Seq(
+StructField("fieldOne", LongType),
+// Partition columns remain case-insensitive
+StructField("partcol1", IntegerType),
+StructField("partcol2", IntegerType)))
+
+  // Create a CatalogTable instance modeling an external Hive Metastore 
table backed by
+  // Parquet data files.
+  private def hiveExternalCatalogTable(
+  tableName: String,
+  location: String,
+  schema: StructType,
+  partitionColumns: Seq[String],
+  properties: Map[String, String] = Map.empty): CatalogTable = {
+CatalogTable(
+  identifier = TableIdentifier(table = tableName, database = 
Option(DATABASE)),
+  tableType = CatalogTableType.EXTERNAL,
+  storage = CatalogStorageFormat(
+locationUri = Option(location),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serialization.format" -> "1")),
+  schema = schema,
+  provider = Option("hive"),
+  partitionColumnNames = partitionColumns,
+  properties = properties)
+  }
+
+  // Creates CatalogTablePartition instances for adding partitions of data 
to our test table.
+  private def hiveCatalogPartition(location: String, index: Int): 
CatalogTablePartition
+= CatalogTablePartition(
+  spec = Map("partcol1" -> index.toString, "partcol2" -> 
index.toString),
+  storage = CatalogStorageFormat(
+locationUri = 
Option(s"${location}/partCol1=$index/partCol2=$index/"),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serializa

[GitHub] spark issue #17056: [SPARK-17495] [SQL] Support Decimal type in Hive-hash

2017-02-24 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/17056
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16824: [SPARK-18069][PYTHON] Make PySpark doctests for S...

2017-02-24 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/16824


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16824: [SPARK-18069][PYTHON] Make PySpark doctests for SQL self...

2017-02-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16824
  
Thanks @holdenk, let me close this for now.

@davies, please give me your opinion. If you think it is worthwhile, I will 
reopen. If not, I will resolve the JIRA.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r103068403
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSchemaInferenceSuite.scala
 ---
@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.util.concurrent.{Executors, TimeUnit}
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.execution.datasources.FileStatusCache
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.hive.client.HiveClient
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf.HiveCaseSensitiveInferenceMode
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.sql.types._
+
+class HiveSchemaInferenceSuite
+  extends QueryTest with TestHiveSingleton with SQLTestUtils with 
BeforeAndAfterEach {
+
+  import HiveSchemaInferenceSuite._
+  import HiveExternalCatalog.SPARK_SQL_PREFIX
+
+  override def beforeEach(): Unit = {
+super.beforeEach()
+FileStatusCache.resetForTesting()
+  }
+
+  override def afterEach(): Unit = {
+super.afterEach()
+FileStatusCache.resetForTesting()
+  }
+
+  private val externalCatalog = 
spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog]
+  private val lowercaseSchema = StructType(Seq(
+StructField("fieldone", LongType),
+StructField("partcol1", IntegerType),
+StructField("partcol2", IntegerType)))
+  private val caseSensitiveSchema = StructType(Seq(
+StructField("fieldOne", LongType),
+// Partition columns remain case-insensitive
+StructField("partcol1", IntegerType),
+StructField("partcol2", IntegerType)))
+
+  // Create a CatalogTable instance modeling an external Hive Metastore 
table backed by
+  // Parquet data files.
+  private def hiveExternalCatalogTable(
+  tableName: String,
+  location: String,
+  schema: StructType,
+  partitionColumns: Seq[String],
+  properties: Map[String, String] = Map.empty): CatalogTable = {
+CatalogTable(
+  identifier = TableIdentifier(table = tableName, database = 
Option(DATABASE)),
+  tableType = CatalogTableType.EXTERNAL,
+  storage = CatalogStorageFormat(
+locationUri = Option(location),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serialization.format" -> "1")),
+  schema = schema,
+  provider = Option("hive"),
+  partitionColumnNames = partitionColumns,
+  properties = properties)
+  }
+
+  // Creates CatalogTablePartition instances for adding partitions of data 
to our test table.
+  private def hiveCatalogPartition(location: String, index: Int): 
CatalogTablePartition
+= CatalogTablePartition(
+  spec = Map("partcol1" -> index.toString, "partcol2" -> 
index.toString),
+  storage = CatalogStorageFormat(
+locationUri = 
Option(s"${location}/partCol1=$index/partCol2=$index/"),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serializa

[GitHub] spark issue #16826: [SPARK-19540][SQL] Add ability to clone SparkSession whe...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16826
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16826: [SPARK-19540][SQL] Add ability to clone SparkSession whe...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16826
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73448/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r103068378
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSchemaInferenceSuite.scala
 ---
@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.util.concurrent.{Executors, TimeUnit}
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.execution.datasources.FileStatusCache
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.hive.client.HiveClient
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf.HiveCaseSensitiveInferenceMode
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.sql.types._
+
+class HiveSchemaInferenceSuite
+  extends QueryTest with TestHiveSingleton with SQLTestUtils with 
BeforeAndAfterEach {
+
+  import HiveSchemaInferenceSuite._
+  import HiveExternalCatalog.SPARK_SQL_PREFIX
+
+  override def beforeEach(): Unit = {
+super.beforeEach()
+FileStatusCache.resetForTesting()
+  }
+
+  override def afterEach(): Unit = {
+super.afterEach()
+FileStatusCache.resetForTesting()
+  }
+
+  private val externalCatalog = 
spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog]
+  private val lowercaseSchema = StructType(Seq(
+StructField("fieldone", LongType),
+StructField("partcol1", IntegerType),
+StructField("partcol2", IntegerType)))
+  private val caseSensitiveSchema = StructType(Seq(
+StructField("fieldOne", LongType),
+// Partition columns remain case-insensitive
+StructField("partcol1", IntegerType),
+StructField("partcol2", IntegerType)))
+
+  // Create a CatalogTable instance modeling an external Hive Metastore 
table backed by
+  // Parquet data files.
+  private def hiveExternalCatalogTable(
+  tableName: String,
+  location: String,
+  schema: StructType,
+  partitionColumns: Seq[String],
+  properties: Map[String, String] = Map.empty): CatalogTable = {
+CatalogTable(
+  identifier = TableIdentifier(table = tableName, database = 
Option(DATABASE)),
+  tableType = CatalogTableType.EXTERNAL,
+  storage = CatalogStorageFormat(
+locationUri = Option(location),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serialization.format" -> "1")),
+  schema = schema,
+  provider = Option("hive"),
+  partitionColumnNames = partitionColumns,
+  properties = properties)
+  }
+
+  // Creates CatalogTablePartition instances for adding partitions of data 
to our test table.
+  private def hiveCatalogPartition(location: String, index: Int): 
CatalogTablePartition
+= CatalogTablePartition(
+  spec = Map("partcol1" -> index.toString, "partcol2" -> 
index.toString),
+  storage = CatalogStorageFormat(
+locationUri = 
Option(s"${location}/partCol1=$index/partCol2=$index/"),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serializa

[GitHub] spark issue #16826: [SPARK-19540][SQL] Add ability to clone SparkSession whe...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16826
  
**[Test build #73448 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73448/testReport)**
 for PR 16826 at commit 
[`437b0bc`](https://github.com/apache/spark/commit/437b0bca7bc29809083f26b8a4848d53d999d097).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17060: [SQL] Duplicate test exception in SQLQueryTestSui...

2017-02-24 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17060#discussion_r103068302
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
@@ -121,7 +123,7 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
   }
 
   private def createScalaTestCase(testCase: TestCase): Unit = {
-    if (blackList.contains(testCase.name.toLowerCase)) {
+    if (blackList.exists(testCase.name.toLowerCase.contains(_))) {
--- End diff --

Nit, you can remove the `(_)` here
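    
For illustration, both forms compile to the same thing; the compiler 
eta-expands the `contains` method reference. A hypothetical snippet (the values 
are made up, just to show the two shapes):

```scala
val blackList = Set("blacklist.sql", ".ds_store")
val name = "subquery/blacklist.sql".toLowerCase

blackList.exists(name.contains(_))  // explicit placeholder
blackList.exists(name.contains)     // method reference, eta-expanded
```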


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17060: [SQL] Duplicate test exception in SQLQueryTestSui...

2017-02-24 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17060#discussion_r103068318
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
@@ -241,7 +243,8 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
   private def listTestCases(): Seq[TestCase] = {
     listFilesRecursively(new File(inputFilePath)).map { file =>
       val resultFile = file.getAbsolutePath.replace(inputFilePath, goldenFilePath) + ".out"
-      TestCase(file.getName, file.getAbsolutePath, resultFile)
+      val absPath = file.getAbsolutePath
+      TestCase(absPath.stripPrefix(inputFilePath + File.separator), absPath, resultFile)
--- End diff --

Why add the separator here?
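    
Presumably so the test name comes out relative without a leading separator; a 
hypothetical illustration on a Unix-like filesystem (paths made up):

```scala
import java.io.File

val inputFilePath = "/repo/sql-tests/inputs"
val absPath = inputFilePath + File.separator + "subquery/scalar.sql"

absPath.stripPrefix(inputFilePath)                  // "/subquery/scalar.sql"
absPath.stripPrefix(inputFilePath + File.separator) // "subquery/scalar.sql"
```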


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17060: [SQL] Duplicate test exception in SQLQueryTestSui...

2017-02-24 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17060#discussion_r103068296
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
@@ -98,7 +98,9 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
 
   /** List of test cases to ignore, in lower cases. */
   private val blackList = Set(
-    "blacklist.sql"  // Do NOT remove this one. It is here to test the blacklist functionality.
+    "blacklist.sql",  // Do NOT remove this one. It is here to test the blacklist functionality.
+    ".ds_store"       // A meta-file that may be created on Mac by Finder App.
--- End diff --

`.DS_Store`, right? I know you compare lower-case later, but might as well 
write it as it appears


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

2017-02-24 Thread datumbox
Github user datumbox commented on the issue:

https://github.com/apache/spark/pull/17059
  
jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r103068277
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSchemaInferenceSuite.scala ---
@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.util.concurrent.{Executors, TimeUnit}
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.execution.datasources.FileStatusCache
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.hive.client.HiveClient
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf.HiveCaseSensitiveInferenceMode
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.sql.types._
+
+class HiveSchemaInferenceSuite
+  extends QueryTest with TestHiveSingleton with SQLTestUtils with BeforeAndAfterEach {
+
+  import HiveSchemaInferenceSuite._
+  import HiveExternalCatalog.SPARK_SQL_PREFIX
+
+  override def beforeEach(): Unit = {
+    super.beforeEach()
+    FileStatusCache.resetForTesting()
+  }
+
+  override def afterEach(): Unit = {
+    super.afterEach()
+    FileStatusCache.resetForTesting()
+  }
+
+  private val externalCatalog = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog]
+  private val lowercaseSchema = StructType(Seq(
+    StructField("fieldone", LongType),
+    StructField("partcol1", IntegerType),
+    StructField("partcol2", IntegerType)))
+  private val caseSensitiveSchema = StructType(Seq(
+    StructField("fieldOne", LongType),
+    // Partition columns remain case-insensitive
+    StructField("partcol1", IntegerType),
+    StructField("partcol2", IntegerType)))
+
+  // Create a CatalogTable instance modeling an external Hive Metastore table backed by
+  // Parquet data files.
+  private def hiveExternalCatalogTable(
+      tableName: String,
+      location: String,
+      schema: StructType,
+      partitionColumns: Seq[String],
+      properties: Map[String, String] = Map.empty): CatalogTable = {
+    CatalogTable(
+      identifier = TableIdentifier(table = tableName, database = Option(DATABASE)),
+      tableType = CatalogTableType.EXTERNAL,
+      storage = CatalogStorageFormat(
+        locationUri = Option(location),
+        inputFormat = Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+        outputFormat = Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+        serde = Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+        compressed = false,
+        properties = Map("serialization.format" -> "1")),
+      schema = schema,
+      provider = Option("hive"),
+      partitionColumnNames = partitionColumns,
+      properties = properties)
+  }
+
+  // Creates CatalogTablePartition instances for adding partitions of data to our test table.
+  private def hiveCatalogPartition(location: String, index: Int): CatalogTablePartition
+    = CatalogTablePartition(
+      spec = Map("partcol1" -> index.toString, "partcol2" -> index.toString),
+      storage = CatalogStorageFormat(
+        locationUri = Option(s"${location}/partCol1=$index/partCol2=$index/"),
+        inputFormat = Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+        outputFormat = Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+        serde = Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+        compressed = false,
+        properties = Map("serializa

[GitHub] spark issue #17062: [SPARK-17495] [SQL] Support date, timestamp and interval...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17062
  
**[Test build #73459 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73459/testReport)** for PR 17062 at commit [`332475c`](https://github.com/apache/spark/commit/332475c1641f61080aa41dda9f1ceec237351d75).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r103068259
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSchemaInferenceSuite.scala ---
@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.util.concurrent.{Executors, TimeUnit}
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.execution.datasources.FileStatusCache
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.hive.client.HiveClient
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf.HiveCaseSensitiveInferenceMode
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.sql.types._
+
+class HiveSchemaInferenceSuite
+  extends QueryTest with TestHiveSingleton with SQLTestUtils with BeforeAndAfterEach {
+
+  import HiveSchemaInferenceSuite._
+  import HiveExternalCatalog.SPARK_SQL_PREFIX
+
+  override def beforeEach(): Unit = {
+    super.beforeEach()
+    FileStatusCache.resetForTesting()
+  }
+
+  override def afterEach(): Unit = {
+    super.afterEach()
+    FileStatusCache.resetForTesting()
+  }
+
+  private val externalCatalog = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog]
+  private val lowercaseSchema = StructType(Seq(
+    StructField("fieldone", LongType),
+    StructField("partcol1", IntegerType),
+    StructField("partcol2", IntegerType)))
+  private val caseSensitiveSchema = StructType(Seq(
+    StructField("fieldOne", LongType),
+    // Partition columns remain case-insensitive
+    StructField("partcol1", IntegerType),
+    StructField("partcol2", IntegerType)))
+
+  // Create a CatalogTable instance modeling an external Hive Metastore table backed by
+  // Parquet data files.
+  private def hiveExternalCatalogTable(
+      tableName: String,
+      location: String,
+      schema: StructType,
+      partitionColumns: Seq[String],
+      properties: Map[String, String] = Map.empty): CatalogTable = {
+    CatalogTable(
+      identifier = TableIdentifier(table = tableName, database = Option(DATABASE)),
+      tableType = CatalogTableType.EXTERNAL,
+      storage = CatalogStorageFormat(
+        locationUri = Option(location),
+        inputFormat = Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+        outputFormat = Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+        serde = Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+        compressed = false,
+        properties = Map("serialization.format" -> "1")),
+      schema = schema,
+      provider = Option("hive"),
+      partitionColumnNames = partitionColumns,
+      properties = properties)
+  }
+
+  // Creates CatalogTablePartition instances for adding partitions of data to our test table.
+  private def hiveCatalogPartition(location: String, index: Int): CatalogTablePartition
+    = CatalogTablePartition(
+      spec = Map("partcol1" -> index.toString, "partcol2" -> index.toString),
+      storage = CatalogStorageFormat(
+        locationUri = Option(s"${location}/partCol1=$index/partCol2=$index/"),
+        inputFormat = Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+        outputFormat = Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+        serde = Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+        compressed = false,
+        properties = Map("serializa

[GitHub] spark issue #17060: [SQL] Duplicate test exception in SQLQueryTestSuite due ...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17060
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16826: [SPARK-19540][SQL] Add ability to clone SparkSession whe...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16826
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16826: [SPARK-19540][SQL] Add ability to clone SparkSession whe...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16826
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73447/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17060: [SQL] Duplicate test exception in SQLQueryTestSuite due ...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17060
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73449/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16826: [SPARK-19540][SQL] Add ability to clone SparkSession whe...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16826
  
**[Test build #73447 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73447/testReport)** for PR 16826 at commit [`fd11ee2`](https://github.com/apache/spark/commit/fd11ee2289ae26b3061659dc26b1f09ded32d039).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17060: [SQL] Duplicate test exception in SQLQueryTestSuite due ...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17060
  
**[Test build #73449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73449/testReport)** for PR 17060 at commit [`5f2a6a8`](https://github.com/apache/spark/commit/5f2a6a82244a623fed142555f55dd34060426e5a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...

2017-02-24 Thread tejasapatil
GitHub user tejasapatil opened a pull request:

https://github.com/apache/spark/pull/17062

[SPARK-17495] [SQL] Support date, timestamp and interval types in Hive hash

## What changes were proposed in this pull request?

- Timestamp hashing is done as per [TimestampWritable.hashCode()](https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/io/TimestampWritable.java#L406) in Hive (see the sketch after this list).
- Interval hashing is done as per [HiveIntervalDayTime.hashCode()](https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/storage-api/src/java/org/apache/hadoop/hive/common/type/HiveIntervalDayTime.java#L178). Note that there are inherent differences in how Hive and Spark store intervals under the hood, which limits the ability to stay completely in sync with Hive's hashing function. I have explained this in the method doc.
- Date type was already supported. This PR adds a test for that.
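
A rough sketch of the timestamp variant (my paraphrase of the linked Hive code, not the exact patch; names are mine):

```scala
// Fold seconds and nanoseconds into one long, then XOR the two 32-bit
// halves, mirroring TimestampWritable.hashCode() as I read it.
def hiveHashTimestamp(epochSeconds: Long, nanos: Int): Int = {
  var folded = epochSeconds
  folded <<= 30      // the nanosecond part (< 10^9) fits in 30 bits
  folded |= nanos
  ((folded >>> 32) ^ folded).toInt
}
```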

## How was this patch tested?

Added unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tejasapatil/spark SPARK-17495_time_related_types

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17062.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17062


commit cc359fc45547b7ba3fd4c1d11d3dcfbaf71ea66a
Author: Tejas Patil 
Date:   2017-02-25T00:18:03Z

[SPARK-17495] [SQL] Support date, timestamp datatypes in Hive hash

commit 332475c1641f61080aa41dda9f1ceec237351d75
Author: Tejas Patil 
Date:   2017-02-25T02:23:41Z

minor refac




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17062: [SPARK-17495] [SQL] Support date, timestamp and interval...

2017-02-24 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/17062
  
ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


