[GitHub] spark issue #13508: [SPARK-15766][SparkR]:R should export is.nan
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13508 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13508: [SPARK-15766][SparkR]:R should export is.nan
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13508 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59984/ Test PASSed.
[GitHub] spark issue #13508: [SPARK-15766][SparkR]:R should export is.nan
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13508 **[Test build #59984 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59984/consoleFull)** for PR 13508 at commit [`da04a0d`](https://github.com/apache/spark/commit/da04a0d80578cc2d5d5a87a61ac2377df740c3ae).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #12258: [SPARK-14485][CORE] ignore task finished for exec...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/12258#discussion_r65797802

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---

@@ -343,17 +343,31 @@ private[spark] class TaskSchedulerImpl(
      }
      taskIdToTaskSetManager.get(tid) match {
        case Some(taskSet) =>
+          var executorId: String = null
          if (TaskState.isFinished(state)) {
            taskIdToTaskSetManager.remove(tid)
            taskIdToExecutorId.remove(tid).foreach { execId =>
+              executorId = execId
              if (executorIdToTaskCount.contains(execId)) {
                executorIdToTaskCount(execId) -= 1
              }
            }
          }
          if (state == TaskState.FINISHED) {
-            taskSet.removeRunningTask(tid)
-            taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
+            // In some cases the executor has already been removed by the driver for a
+            // heartbeat timeout, but before the executor is killed by the cluster, a task
+            // running on it finishes and reports a success state back to the driver.
+            // Such tasks should be ignored, because the tasks on this executor have
+            // already been re-queued by the driver. For more details, see SPARK-14485.
+            if (executorId.ne(null) && !executorIdToTaskCount.contains(executorId)) {
+              taskSet.removeRunningTask(tid)
+              logWarning(

--- End diff --

Yes, you are right. I will change it soon, thanks.
[GitHub] spark issue #13494: [SPARK-15752] [SQL] support optimization for metadata on...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13494 Can you try to write a design doc on this? Would be great to discuss the reasons why we might want this, the kind of queries that can be answered, corner cases, and how it should be implemented. Thanks.
[GitHub] spark pull request #12258: [SPARK-14485][CORE] ignore task finished for exec...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/12258#discussion_r6579

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---

@@ -343,17 +343,31 @@ private[spark] class TaskSchedulerImpl(
      }
      taskIdToTaskSetManager.get(tid) match {
        case Some(taskSet) =>
+          var executorId: String = null
          if (TaskState.isFinished(state)) {
            taskIdToTaskSetManager.remove(tid)
            taskIdToExecutorId.remove(tid).foreach { execId =>
+              executorId = execId
              if (executorIdToTaskCount.contains(execId)) {
                executorIdToTaskCount(execId) -= 1
              }
            }
          }
          if (state == TaskState.FINISHED) {
-            taskSet.removeRunningTask(tid)
-            taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
+            // In some cases the executor has already been removed by the driver for a
+            // heartbeat timeout, but before the executor is killed by the cluster, a task
+            // running on it finishes and reports a success state back to the driver.
+            // Such tasks should be ignored, because the tasks on this executor have
+            // already been re-queued by the driver. For more details, see SPARK-14485.
+            if (executorId.ne(null) && !executorIdToTaskCount.contains(executorId)) {
+              taskSet.removeRunningTask(tid)

--- End diff --

I see. I will fix it soon.
[GitHub] spark pull request #12258: [SPARK-14485][CORE] ignore task finished for exec...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/12258#discussion_r65797773

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---

@@ -343,17 +343,31 @@ private[spark] class TaskSchedulerImpl(
      }
      taskIdToTaskSetManager.get(tid) match {
        case Some(taskSet) =>
+          var executorId: String = null
          if (TaskState.isFinished(state)) {
            taskIdToTaskSetManager.remove(tid)
            taskIdToExecutorId.remove(tid).foreach { execId =>
+              executorId = execId
              if (executorIdToTaskCount.contains(execId)) {
                executorIdToTaskCount(execId) -= 1
              }
            }
          }
          if (state == TaskState.FINISHED) {
-            taskSet.removeRunningTask(tid)
-            taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
+            // In some cases the executor has already been removed by the driver for a
+            // heartbeat timeout, but before the executor is killed by the cluster, a task
+            // running on it finishes and reports a success state back to the driver.
+            // Such tasks should be ignored, because the tasks on this executor have
+            // already been re-queued by the driver. For more details, see SPARK-14485.
+            if (executorId.ne(null) && !executorIdToTaskCount.contains(executorId)) {

--- End diff --

Thanks for your comments. I will fix it soon.
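[Editor's note] The guard being reviewed in this thread can be sketched in isolation. The following is a standalone illustration, not Spark's actual `TaskSchedulerImpl`; the object and method names (`StatusUpdateDemo`, `shouldIgnore`) are invented for the demo. The idea: a task result is only treated as a success if its executor is still registered in `executorIdToTaskCount`; results from executors the driver has already removed (e.g. after a heartbeat timeout) are dropped, because those tasks were already re-queued.

```scala
// Standalone sketch of the SPARK-14485 guard (hypothetical names, not Spark code).
object StatusUpdateDemo {
  // Stands in for TaskSchedulerImpl.executorIdToTaskCount: only live executors appear here.
  val executorIdToTaskCount = scala.collection.mutable.Map("exec-1" -> 1)

  // A finished task is ignored when its executor is known but no longer registered.
  def shouldIgnore(executorId: String): Boolean =
    executorId != null && !executorIdToTaskCount.contains(executorId)

  def main(args: Array[String]): Unit = {
    assert(!shouldIgnore("exec-1"))   // live executor: enqueue the successful task
    assert(shouldIgnore("exec-gone")) // removed executor: drop the stale result
    assert(!shouldIgnore(null))       // state transition never recorded an executor
    println("guard behaves as expected")
  }
}
```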
[GitHub] spark issue #13508: [SPARK-15766][SparkR]:R should export is.nan
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13508 **[Test build #59984 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59984/consoleFull)** for PR 13508 at commit [`da04a0d`](https://github.com/apache/spark/commit/da04a0d80578cc2d5d5a87a61ac2377df740c3ae).
[GitHub] spark pull request #13500: [SPARK-15756] [SQL] Support command 'create table...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13500
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13505 hm probably shouldn't happen in this pr but i'm wondering if it'd make sense to generalize AttributeSeq and use it everywhere, rather than Seq[Attribute].
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65797603

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala ---

@@ -86,11 +86,31 @@ package object expressions {
  /**
   * Helper functions for working with `Seq[Attribute]`.
   */
-  implicit class AttributeSeq(attrs: Seq[Attribute]) {
+  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
    /** Creates a StructType with a schema matching this `Seq[Attribute]`. */
    def toStructType: StructType = {
      StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable)))
    }
+
+    private lazy val inputArr = attrs.toArray
+
+    private lazy val inputToOrdinal = {
+      val map = new java.util.HashMap[ExprId, Int](inputArr.length * 2)
+      var index = 0
+      attrs.foreach { attr =>
+        if (!map.containsKey(attr.exprId)) {
+          map.put(attr.exprId, index)
+        }
+        index += 1
+      }
+      map
+    }
+
+    def apply(ordinal: Int): Attribute = inputArr(ordinal)
+
+    def getOrdinal(exprId: ExprId): Int = {

--- End diff --

yup ...
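[Editor's note] The lookup structure in the diff above can be sketched with plain Scala collections. This is a simplified illustration, not Spark's actual `AttributeSeq` (the `Attribute` case class and `ordinalByExprId` name are invented here): an implicit class over `Seq[Attribute]` builds a hash map from expression id to ordinal once, turning each reference binding into an O(1) probe instead of an O(N) scan, which is what removes the N^2 loop. As in the diff, duplicate expression ids keep the first ordinal.

```scala
// Simplified sketch of hash-based ordinal lookup over Seq[Attribute] (not Spark code).
object AttributeSeqDemo {
  case class Attribute(exprId: Long, name: String)

  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
    // Built once per wrapper; groupBy preserves order, so head._2 is the FIRST index
    // for each exprId, matching the "first wins" behavior in the diff.
    private lazy val ordinalByExprId: Map[Long, Int] =
      attrs.zipWithIndex.groupBy(_._1.exprId).map { case (id, hits) => id -> hits.head._2 }

    def getOrdinal(exprId: Long): Int = ordinalByExprId.getOrElse(exprId, -1)
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(Attribute(10L, "a"), Attribute(20L, "b"), Attribute(10L, "a2"))
    assert(input.getOrdinal(20L) == 1)
    assert(input.getOrdinal(10L) == 0) // duplicate exprId keeps the first ordinal
    assert(input.getOrdinal(99L) == -1) // not bound
    println("ordinal lookups OK")
  }
}
```

Note that with an implicit class a fresh wrapper (and thus a fresh map) is created at each conversion; the real payoff, as the follow-up comment suggests, comes from passing an `AttributeSeq` around so the map is built once per input schema.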
[GitHub] spark issue #13500: [SPARK-15756] [SQL] Support command 'create table stored...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13500 Thanks - merging in master/2.0.
[GitHub] spark pull request #13508: [SPARK-15766][SparkR]:R should export is.nan
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/13508 [SPARK-15766][SparkR]:R should export is.nan

## What changes were proposed in this pull request?

When reviewing SPARK-15545, we found that is.nan is not exported; it should be. Add it to the NAMESPACE.

## How was this patch tested?

Manual tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangmiao1981/spark unused

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13508.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13508

commit da04a0d80578cc2d5d5a87a61ac2377df740c3ae
Author: wm...@hotmail.com
Date: 2016-06-04T05:15:20Z

    export is.nan
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13504 LGTM other than that small comment.
[GitHub] spark pull request #13504: [SPARK-15762][SQL] Cache Metadata & StructType ha...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13504#discussion_r65797565

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala ---

@@ -104,7 +104,8 @@ sealed class Metadata private[types] (private[types] val map: Map[String, Any])
    }
  }

-  override def hashCode: Int = Metadata.hash(this)
+  private lazy val _hashCode: Int = Metadata.hash(this)
+  override def hashCode: Int = _hashCode

--- End diff --

any reason why this is not just
```
override lazy val hashCode: Int = Metadata.hash(this)
```
?
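[Editor's note] The two forms in this review comment are behaviorally equivalent; `override lazy val hashCode` is just the shorter spelling, since a Scala `lazy val` may override a `def` and memoizes its body on first access. A minimal standalone sketch (hypothetical `Meta` class and `hashComputations` counter, not Spark's `Metadata`) showing that the hash is computed exactly once:

```scala
// Sketch of caching hashCode via `override lazy val` (illustrative, not Spark code).
object LazyHashDemo {
  var hashComputations = 0 // counts how often the "expensive" hash body runs

  class Meta(val map: Map[String, Any]) {
    // A lazy val can override def hashCode: computed on first access, cached after.
    override lazy val hashCode: Int = {
      hashComputations += 1 // stands in for the expensive Metadata.hash(this)
      map.hashCode()
    }
  }

  def main(args: Array[String]): Unit = {
    val m = new Meta(Map("a" -> 1))
    m.hashCode
    m.hashCode // second call reuses the cached value; the body does not run again
    assert(hashComputations == 1)
    println("hash body ran once")
  }
}
```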
[GitHub] spark pull request #13506: [SPARK-15763][SQL] Support DELETE FILE command na...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13506#discussion_r65797550

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---

@@ -1441,6 +1441,32 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  }

  /**
+   * Delete a file to be downloaded with this Spark job on every node.
+   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
+   * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
+   * use `SparkFiles.get(fileName)` to find its download location.
+   */
+  def deleteFile(path: String): Unit = {

--- End diff --

this is fairly confusing -- i'd assume this is actually deleting the path given.
[GitHub] spark issue #13507: [SPARK-15765][SQL][Streaming] Make continuous Parquet wr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13507 **[Test build #59983 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59983/consoleFull)** for PR 13507 at commit [`60a2c8e`](https://github.com/apache/spark/commit/60a2c8ee7c610a783e65b78ac21e25661b84f49d).
[GitHub] spark pull request #13507: [SPARK-15765][SQL][Streaming] Make continuous Par...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/13507 [SPARK-15765][SQL][Streaming] Make continuous Parquet writing consistent with non-continuous Parquet writing

## What changes were proposed in this pull request?

Currently there is some code duplication between continuous Parquet writing (as in Structured Streaming) and non-continuous batch writing; see [ParquetFileFormat#prepareWrite()](https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L68) and [ParquetFileFormat#ParquetOutputWriterFactory](https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L414). This may lead to inconsistent behavior when we change one piece of code but not the other. By extracting the common code, this patch fixes the inconsistency. As a result, Structured Streaming now also benefits from [SPARK-15719](https://github.com/apache/spark/pull/13455).

## How was this patch tested?

This is just code refactoring without any logic change, so it should be covered by existing suites.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lw-lin/spark parquet-conf-deduplicate

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13507.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13507

commit 60a2c8ee7c610a783e65b78ac21e25661b84f49d
Author: Liwei Lin
Date: 2016-06-03T14:31:56Z

    Make continuous writing consistent with non-consistent writing
[GitHub] spark issue #13442: [SPARK-15654][SQL] Check if all the input files are spli...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/13442 @rxin plz check this.
[GitHub] spark issue #13442: [SPARK-15654][SQL] Check if all the input files are spli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13442 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59982/ Test PASSed.
[GitHub] spark issue #13442: [SPARK-15654][SQL] Check if all the input files are spli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13442 Merged build finished. Test PASSed.
[GitHub] spark issue #13442: [SPARK-15654][SQL] Check if all the input files are spli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13442 **[Test build #59982 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59982/consoleFull)** for PR 13442 at commit [`d46bfdf`](https://github.com/apache/spark/commit/d46bfdf01b6af86f0b74dd482f46a31d7cc7c632).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #13413: [SPARK-15663][SQL] SparkSession.catalog.listFunct...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/13413#discussion_r65796528

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---

@@ -88,14 +106,6 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
    checkKeywordsExist(sql("describe functioN abcadf"), "Function: abcadf not found.")
  }

-  test("SPARK-14415: All functions should have own descriptions") {
-    for (f <- spark.sessionState.functionRegistry.listFunction()) {
-      if (!Seq("cube", "grouping", "grouping_id", "rollup", "window").contains(f)) {
-        checkKeywordsNotExist(sql(s"describe function `$f`"), "N/A.")
-      }
-    }
-  }
-

--- End diff --

Why was this test removed? We should check that these functions have their own descriptions.
[GitHub] spark pull request #13413: [SPARK-15663][SQL] SparkSession.catalog.listFunct...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/13413#discussion_r65796510

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---

@@ -58,15 +59,32 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
  test("show functions") {
    def getFunctions(pattern: String): Seq[Row] = {
-      StringUtils.filterPattern(spark.sessionState.functionRegistry.listFunction(), pattern)
+      StringUtils.filterPattern(
+        spark.sessionState.catalog.listFunctions("default").map(_.funcName), pattern)
        .map(Row(_))
    }
+
+    def createFunction(names: Seq[String]): Unit = {
+      names.foreach { name =>
+        spark.udf.register(name, (arg1: Int, arg2: String) => arg2 + arg1)
+      }
+    }
+
+    assert(sql("SHOW functions").collect().isEmpty)
+
+    createFunction(Seq("ilog", "logi", "logii", "logiii"))
+    createFunction(Seq("crc32i", "cubei", "cume_disti"))
+    createFunction(Seq("isize", "ispace"))
+    createFunction(Seq("to_datei", "date_addi", "current_datei"))
+
    checkAnswer(sql("SHOW functions"), getFunctions("*"))
+
    Seq("^c*", "*e$", "log*", "*date*").foreach { pattern =>
      // For the pattern part, only '*' and '|' are allowed as wildcards.
      // For '*', we need to replace it to '.*'.
      checkAnswer(sql(s"SHOW FUNCTIONS '$pattern'"), getFunctions(pattern))
    }
+

--- End diff --

Nit: Remove this line
[GitHub] spark pull request #13413: [SPARK-15663][SQL] SparkSession.catalog.listFunct...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/13413#discussion_r65796505

--- Diff: python/pyspark/sql/tests.py ---

@@ -1481,17 +1481,7 @@ def test_list_functions(self):
        spark.sql("CREATE DATABASE some_db")
        functions = dict((f.name, f) for f in spark.catalog.listFunctions())
        functionsDefault = dict((f.name, f) for f in spark.catalog.listFunctions("default"))
-        self.assertTrue(len(functions) > 200)
-        self.assertTrue("+" in functions)
-        self.assertTrue("like" in functions)
-        self.assertTrue("month" in functions)
-        self.assertTrue("to_unix_timestamp" in functions)
-        self.assertTrue("current_database" in functions)
-        self.assertEquals(functions["+"], Function(
-            name="+",
-            description=None,
-            className="org.apache.spark.sql.catalyst.expressions.Add",
-            isTemporary=True))
+        self.assertEquals(len(functions), 0)

--- End diff --

Seems we need some unit tests to check if we can show user-defined temp functions in `listFunctions`.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13504 Merged build finished. Test PASSed.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13504 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59981/ Test PASSed.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13504 **[Test build #59981 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59981/consoleFull)** for PR 13504 at commit [`a3a6898`](https://github.com/apache/spark/commit/a3a68989bb31f14e74cbfc3532e089a4de070605).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13461: [SPARK-15721][ML] Make DefaultParamsReadable, DefaultPar...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/13461 LGTM
[GitHub] spark issue #13403: [SPARK-15660][CORE] RDD and Dataset should show the cons...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13403 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59979/ Test PASSed.
[GitHub] spark issue #13403: [SPARK-15660][CORE] RDD and Dataset should show the cons...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13403 Merged build finished. Test PASSed.
[GitHub] spark issue #13403: [SPARK-15660][CORE] RDD and Dataset should show the cons...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13403 **[Test build #59979 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59979/consoleFull)** for PR 13403 at commit [`801`](https://github.com/apache/spark/commit/801244d24b8b8f250dac34b21b1ea3245981). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13442: [SPARK-15654][SQL] Check if all the input files are spli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13442 **[Test build #59982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59982/consoleFull)** for PR 13442 at commit [`d46bfdf`](https://github.com/apache/spark/commit/d46bfdf01b6af86f0b74dd482f46a31d7cc7c632).
[GitHub] spark issue #13444: [SPARK-15530][SQL] Set #parallelism for file listing in ...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/13444 @yhuai ping
[GitHub] spark issue #13486: [SPARK-15743][SQL] Prevent saving with all-column partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13486 Merged build finished. Test PASSed.
[GitHub] spark issue #13486: [SPARK-15743][SQL] Prevent saving with all-column partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13486 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59977/ Test PASSed.
[GitHub] spark issue #13486: [SPARK-15743][SQL] Prevent saving with all-column partit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13486 **[Test build #59977 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59977/consoleFull)** for PR 13486 at commit [`3fd3512`](https://github.com/apache/spark/commit/3fd351292a80f9d3dc59df68bc958439a8381424). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #12173: [SPARK-13792][SQL] Limit logging of bad records in CSVRe...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/12173 @falaki ping
[GitHub] spark issue #13436: [SPARK-15696][SQL] Improve `crosstab` to have a consiste...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13436 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59978/ Test PASSed.
[GitHub] spark issue #13436: [SPARK-15696][SQL] Improve `crosstab` to have a consiste...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13436 Merged build finished. Test PASSed.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13505 Merged build finished. Test PASSed.
[GitHub] spark issue #13436: [SPARK-15696][SQL] Improve `crosstab` to have a consiste...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13436 **[Test build #59978 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59978/consoleFull)** for PR 13436 at commit [`ec49b37`](https://github.com/apache/spark/commit/ec49b375c7f12d8a5f32c50d67291557ba262a5f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13505 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59976/ Test PASSed.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13505 **[Test build #59976 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59976/consoleFull)** for PR 13505 at commit [`38e8a99`](https://github.com/apache/spark/commit/38e8a9935e3ae0a166f7bcd3231bef50ef7ec71b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user ericl commented on the issue: https://github.com/apache/spark/pull/13505 Here's a flame graph of bindReferences dominating the CPU used for a 10k column query: [profile](https://github.com/apache/spark/files/298644/slow-bind-refs.svg.zip)
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user ericl commented on the issue: https://github.com/apache/spark/pull/13504 [hash code profile](https://github.com/apache/spark/files/298642/hashcode.svg.zip)
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13505 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59980/ Test FAILed.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13505 Merged build finished. Test FAILed.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13505 **[Test build #59980 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59980/consoleFull)** for PR 13505 at commit [`0b412b0`](https://github.com/apache/spark/commit/0b412b0069dee64c13daa836f43013380c9aa273). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` implicit class AttributeSeq(val attrs: Seq[Attribute]) `
[GitHub] spark issue #13500: [SPARK-15756] [SQL] Support command 'create table stored...
Github user lianhuiwang commented on the issue: https://github.com/apache/spark/pull/13500 @rxin I have updated PR description. Thanks.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/13505 @rxin, @ericl has some new benchmarks which operate on even wider schemas and which uncovered this bottleneck. Adding the caching of the map here resulted in a huge scalability improvement. Maybe @ericl can chime in with some flame graph charts here.
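[Editorial note for readers of this archive: the "caching of the map" described above replaces a repeated linear scan with a precomputed id-to-ordinal map. The following is an illustrative Python sketch of that general technique only — it is not Spark's actual implementation, which builds a `java.util.HashMap[ExprId, Int]` inside the Scala `AttributeSeq` class; the function names below are hypothetical.]

```python
def bind_references_quadratic(references, attributes):
    """O(N^2): one linear scan of `attributes` per reference."""
    return [attributes.index(ref) for ref in references]


def bind_references_linear(references, attributes):
    """O(N): build the ordinal map once, then each lookup is O(1).

    setdefault keeps the first occurrence of a duplicate attribute,
    mirroring the containsKey guard in the actual patch.
    """
    ordinal = {}
    for i, attr in enumerate(attributes):
        ordinal.setdefault(attr, i)
    return [ordinal[ref] for ref in references]
```

For a 10k-column schema the quadratic version performs on the order of 10^8 comparisons, while the map-based version does a single pass plus constant-time lookups, which is the scalability improvement described above.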
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/13504 @rxin, I don't have the specific performance numbers handy but in an optimizer stress-test benchmark run by @ericl, these hashCode calls accounted for roughly 40% of the total CPU time and this bottleneck was completely eliminated by the caching added by this patch.
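[Editorial note: the caching described above amounts to memoizing a recursively computed hash. A minimal Python sketch of the idea follows — illustrative only; Spark's actual patch caches the result as a lazy val on the Scala `Metadata` and `StructType` classes, and these simplified class bodies are not Spark's code.]

```python
class StructField:
    """A named field; hashing one field is cheap."""

    def __init__(self, name, dtype):
        self.name = name
        self.dtype = dtype

    def __hash__(self):
        return hash((self.name, self.dtype))


class StructType:
    """A schema whose hash is computed once and then reused.

    Without the cache, every hash lookup involving a wide schema
    re-hashes all of its fields, which is what showed up as ~40%
    of CPU time in the stress test described above.
    """

    def __init__(self, fields):
        self.fields = tuple(fields)
        self._cached_hash = None  # cache slot, filled on first use

    def __hash__(self):
        if self._cached_hash is None:  # compute lazily, exactly once
            self._cached_hash = hash(self.fields)
        return self._cached_hash
```

This is safe only because the schema is immutable after construction; a mutable structure could not cache its hash this way.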
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13504 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59974/ Test PASSed.
[GitHub] spark issue #13407: [SPARK-15665] [CORE] spark-submit --kill and --status ar...
Github user devaraj-kavali commented on the issue: https://github.com/apache/spark/pull/13407 Thanks @vanzin for review and merging.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13504 Merged build finished. Test PASSed.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13504 **[Test build #59974 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59974/consoleFull)** for PR 13504 at commit [`92c6c69`](https://github.com/apache/spark/commit/92c6c6994fac3cf36ea1974171f6018c36013ce0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13504 **[Test build #59981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59981/consoleFull)** for PR 13504 at commit [`a3a6898`](https://github.com/apache/spark/commit/a3a68989bb31f14e74cbfc3532e089a4de070605).
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65795007

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala ---
@@ -296,7 +296,7 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT
   /**
    * All the attributes that are used for this plan.
    */
-  lazy val allAttributes: Seq[Attribute] = children.flatMap(_.output)
+  lazy val allAttributes: AttributeSeq = children.flatMap(_.output)
--- End diff --

We should probably construct the AttributeSeq outside of the loop in the various projection operators, too, although that doesn't appear to be as serious a bottleneck yet.
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65794996

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala ---
@@ -86,11 +86,31 @@ package object expressions {
   /**
    * Helper functions for working with `Seq[Attribute]`.
    */
-  implicit class AttributeSeq(attrs: Seq[Attribute]) {
+  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
     /** Creates a StructType with a schema matching this `Seq[Attribute]`. */
     def toStructType: StructType = {
       StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable)))
     }
+
+    private lazy val inputArr = attrs.toArray
+
+    private lazy val inputToOrdinal = {
+      val map = new java.util.HashMap[ExprId, Int](inputArr.length * 2)
+      var index = 0
+      attrs.foreach { attr =>
+        if (!map.containsKey(attr.exprId)) {
+          map.put(attr.exprId, index)
+        }
+        index += 1
+      }
+      map
+    }
+
+    def apply(ordinal: Int): Attribute = inputArr(ordinal)
+
+    def getOrdinal(exprId: ExprId): Int = {
--- End diff --

I suppose this needs documentation.
[GitHub] spark issue #13248: [SPARK-15194] [ML] Add Python ML API for MultivariateGau...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13248 @praveendareddy21 Just made a first pass. Also please run PEP8 on your code
[GitHub] spark issue #13496: [SPARK-15753][SQL] Move Analyzer stuff to Analyzer from ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13496 cc @cloud-fan @yhuai
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794904

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
+
+__all__ = ['MultivariateGaussian']
+
+
+class MultivariateGaussian():
+    """
+    This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
+    In the event that the covariance matrix is singular, the density will be computed in a
+    reduced dimensional subspace under which the distribution is supported.
+    (see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]])
+
+    mu    The mean vector of the distribution
+    sigma The covariance matrix of the distribution
+
+    >>> mu = Vectors.dense([0.0, 0.0])
+    >>> sigma= DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0])
+    >>> x = Vectors.dense([1.0, 1.0])
+    >>> m = MultivariateGaussian(mu, sigma)
+    >>> m.pdf(x)
+    0.0682586811486
+    """
+
+    def __init__(self, mu, sigma):
+        """
+        __init__(self, mu, sigma)
+
+        mu    The mean vector of the distribution
+        sigma The covariance matrix of the distribution
+
+        mu and sigma must be instances of DenseVector and DenseMatrix respectively.
+        """
+        assert (isinstance(mu, DenseVector)), "mu must be a DenseVector Object"
+        assert (isinstance(sigma, DenseMatrix)), "sigma must be a DenseMatrix Object"
+
+        sigma_shape=sigma.toArray().shape
+        assert (sigma_shape[0]==sigma_shape[1]) , "Covariance matrix must be square"
+        assert (sigma_shape[0]==mu.size) , "Mean vector length must match covariance matrix size"
+
+        # initialize eagerly precomputed attributes
+        self.mu=mu
+
+        # storing sigma as numpy.ndarray
+        # furthur calculations are done ndarray only
+        self.sigma=sigma.toArray()
+
+        # initialize attributes to be computed later
+        self.prec_U = None
+        self.log_det_cov = None
+
+        # compute distribution dependent constants
+        self.__calculateCovarianceConstants()
+
+    def pdf(self,x):
+        """
+        Returns density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert (isinstance(x, Vector)), "x must be of Vector Type"
+        return float(self.__pdf(x))
+
+    def logpdf(self,x):
+        """
+        Returns the log-density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert (isinstance(x, Vector)), "x must be of Vector Type"
+        return float(self.__logpdf(x))
+
+    def __calculateCovarianceConstants(self):
+        """
+        Calculates distribution dependent components used for the density function
+        based on scipy multivariate library
+        refer https://github.com/scipy/scipy/blob/master/scipy/stats/_multivariate.py
+        tested with precision of 9 significant digits(refer testcase)
+        """
+        try :
+            # pre-processing input parameters
+            # throws ValueError with invalid inputs
+            self.dim, self.mu, self.sigma = self.__process_parameters(None, self.mu, self.sigma)
+
+            # return the eigenvalues and eigenvectors
+            # of a Hermitian or symmetric matrix.
+            # s = eigen values
+            # u = eigen vectors
+            s, u = np.linalg.eigh(self.sigma)
+
+            # Singular values are considered to be non-zero only if
+            # they exceed a tolerance based on machine precision, matrix size, and
+            # relation to
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794836

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
+
+__all__ = ['MultivariateGaussian']
+
+
+class MultivariateGaussian():
+    """
+    This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
+    In the event that the covariance matrix is singular, the density will be computed in a
+    reduced dimensional subspace under which the distribution is supported.
+    (see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]])
+
+    mu    The mean vector of the distribution
+    sigma The covariance matrix of the distribution
+
+    >>> mu = Vectors.dense([0.0, 0.0])
+    >>> sigma= DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0])
+    >>> x = Vectors.dense([1.0, 1.0])
+    >>> m = MultivariateGaussian(mu, sigma)
+    >>> m.pdf(x)
+    0.0682586811486
+    """
+
+    def __init__(self, mu, sigma):
+        """
+        __init__(self, mu, sigma)
+
+        mu    The mean vector of the distribution
+        sigma The covariance matrix of the distribution
+
+        mu and sigma must be instances of DenseVector and DenseMatrix respectively.
+        """
+        assert (isinstance(mu, DenseVector)), "mu must be a DenseVector Object"
+        assert (isinstance(sigma, DenseMatrix)), "sigma must be a DenseMatrix Object"
+
+        sigma_shape=sigma.toArray().shape
+        assert (sigma_shape[0]==sigma_shape[1]) , "Covariance matrix must be square"
+        assert (sigma_shape[0]==mu.size) , "Mean vector length must match covariance matrix size"
+
+        # initialize eagerly precomputed attributes
+        self.mu=mu
+
+        # storing sigma as numpy.ndarray
+        # furthur calculations are done ndarray only
+        self.sigma=sigma.toArray()
+
+        # initialize attributes to be computed later
+        self.prec_U = None
+        self.log_det_cov = None
+
+        # compute distribution dependent constants
+        self.__calculateCovarianceConstants()
+
+    def pdf(self,x):
+        """
+        Returns density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert (isinstance(x, Vector)), "x must be of Vector Type"
+        return float(self.__pdf(x))
+
+    def logpdf(self,x):
+        """
+        Returns the log-density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert (isinstance(x, Vector)), "x must be of Vector Type"
+        return float(self.__logpdf(x))
+
+    def __calculateCovarianceConstants(self):
+        """
+        Calculates distribution dependent components used for the density function
+        based on scipy multivariate library
+        refer https://github.com/scipy/scipy/blob/master/scipy/stats/_multivariate.py
+        tested with precision of 9 significant digits(refer testcase)
+        """
+        try :
+            # pre-processing input parameters
+            # throws ValueError with invalid inputs
+            self.dim, self.mu, self.sigma = self.__process_parameters(None, self.mu, self.sigma)
+
+            # return the eigenvalues and eigenvectors
+            # of a Hermitian or symmetric matrix.
+            # s = eigen values
+            # u = eigen vectors
+            s, u = np.linalg.eigh(self.sigma)
+
+            # Singular values are considered to be non-zero only if
+            # they exceed a tolerance based on machine precision, matrix size, and
+            # relation to
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794779

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
+
+__all__ = ['MultivariateGaussian']
+
+
+class MultivariateGaussian():
+    """
+    This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
+    In the event that the covariance matrix is singular, the density will be computed in a
+    reduced dimensional subspace under which the distribution is supported
+    (see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]]).
+
+    mu    The mean vector of the distribution
+    sigma The covariance matrix of the distribution
+
+    >>> mu = Vectors.dense([0.0, 0.0])
+    >>> sigma = DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0])
+    >>> x = Vectors.dense([1.0, 1.0])
+    >>> m = MultivariateGaussian(mu, sigma)
+    >>> m.pdf(x)
+    0.0682586811486
+    """
+
+    def __init__(self, mu, sigma):
+        """
+        __init__(self, mu, sigma)
+
+        mu    The mean vector of the distribution
+        sigma The covariance matrix of the distribution
+
+        mu and sigma must be instances of DenseVector and DenseMatrix respectively.
+        """
+        assert isinstance(mu, DenseVector), "mu must be a DenseVector object"
+        assert isinstance(sigma, DenseMatrix), "sigma must be a DenseMatrix object"
+
+        sigma_shape = sigma.toArray().shape
+        assert sigma_shape[0] == sigma_shape[1], "Covariance matrix must be square"
+        assert sigma_shape[0] == mu.size, "Mean vector length must match covariance matrix size"
+
+        # eagerly precomputed attributes
+        self.mu = mu
+
+        # sigma is stored as a numpy.ndarray; further calculations use the ndarray only
+        self.sigma = sigma.toArray()
+
+        # attributes to be computed later
+        self.prec_U = None
+        self.log_det_cov = None
+
+        # compute distribution-dependent constants
+        self.__calculateCovarianceConstants()
+
+    def pdf(self, x):
+        """
+        Returns the density of this multivariate Gaussian at the point given by Vector x.
+        """
+        assert isinstance(x, Vector), "x must be of Vector type"
+        return float(self.__pdf(x))
+
+    def logpdf(self, x):
+        """
+        Returns the log-density of this multivariate Gaussian at the point given by Vector x.
+        """
+        assert isinstance(x, Vector), "x must be of Vector type"
+        return float(self.__logpdf(x))
+
+    def __calculateCovarianceConstants(self):
+        """
+        Calculates the distribution-dependent components used by the density function,
+        based on the SciPy multivariate library; refer to
+        https://github.com/scipy/scipy/blob/master/scipy/stats/_multivariate.py
+        (tested with a precision of 9 significant digits; refer to the test case).
+        """
+        try:
+            # pre-process input parameters; raises ValueError on invalid inputs
+            self.dim, self.mu, self.sigma = self.__process_parameters(None, self.mu, self.sigma)

+            # eigenvalues (s) and eigenvectors (u) of the symmetric matrix sigma
+            s, u = np.linalg.eigh(self.sigma)

+            # Singular values are considered to be non-zero only if
+            # they exceed a tolerance based on machine precision, matrix size, and
+            # relation to
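For reference, the degenerate-case density that the quoted `__calculateCovarianceConstants` builds toward can be sketched in plain NumPy. This is an illustrative sketch of the eigendecomposition technique only, not the PR's actual helpers (`__process_parameters` and `__pdf` belong to the diff, not this sketch); it follows the MLlib convention of normalizing by the full dimension rather than the rank, which reproduces the doctest value in the quoted docstring.

```python
import numpy as np

def mvn_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma) at x; a singular sigma is handled by a
    pseudo-inverse built from its eigendecomposition (illustrative sketch)."""
    s, u = np.linalg.eigh(sigma)                    # sigma = u @ diag(s) @ u.T
    tol = np.finfo(float).eps * s.max() * len(s)    # machine-precision tolerance
    keep = s > tol                                  # eigenvalues treated as non-zero
    log_pdet = np.log(s[keep]).sum()                # log pseudo-determinant
    prec_U = u[:, keep] / np.sqrt(s[keep])          # prec_U @ prec_U.T == pinv(sigma)
    maha = np.square((x - mu) @ prec_U).sum()       # squared Mahalanobis distance
    return -0.5 * (len(mu) * np.log(2.0 * np.pi) + log_pdet + maha)

# The doctest case from the quoted docstring: singular sigma = [[1, 1], [1, 1]]
x = np.array([1.0, 1.0])
print(np.exp(mvn_logpdf(x, np.zeros(2), np.ones((2, 2)))))  # ≈ 0.0682586811486
```

Note that SciPy itself normalizes by the rank in the degenerate case, so a SciPy-based computation of this singular example would give a different value.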
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13503 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13503 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59970/ Test PASSed.
[GitHub] spark issue #13061: [SPARK-14279] [Build] Pick the spark version from pom
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13061 Merged build finished. Test PASSed.
[GitHub] spark issue #13061: [SPARK-14279] [Build] Pick the spark version from pom
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13061 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59972/ Test PASSed.
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13503 **[Test build #59970 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59970/consoleFull)** for PR 13503 at commit [`a051953`](https://github.com/apache/spark/commit/a05195341deb46cb2fd8521e7eb8065ca760ad7a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13061: [SPARK-14279] [Build] Pick the spark version from pom
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13061 **[Test build #59972 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59972/consoleFull)** for PR 13061 at commit [`11cef41`](https://github.com/apache/spark/commit/11cef4111fcc969cc08b3bb0e4b738f47da4aaf8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794427 --- Diff: python/pyspark/ml/stat/distribution.py ---
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65794395 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala --- @@ -296,7 +296,7 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT /** * All the attributes that are used for this plan. */ - lazy val allAttributes: Seq[Attribute] = children.flatMap(_.output) + lazy val allAttributes: AttributeSeq = children.flatMap(_.output) --- End diff -- @ericl and I found another layer of polynomial looping: in QueryPlan.cleanArgs we take every expression in the query plan and bind its references against `allAttributes`, which can be huge. If we turn this into an `AttributeSeq` once and build the map inside of that wrapper then we amortize that cost and remove this expensive loop.
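JoshRosen's suggestion (building the attribute-to-ordinal map once inside a wrapper, instead of re-scanning `allAttributes` for every expression bound) is a general memoization pattern. A minimal Python sketch of the idea follows; it is illustrative only, as the real code is Catalyst's Scala `AttributeSeq`:

```python
class AttributeSeq:
    """Wraps a sequence of attributes and lazily builds an attribute -> ordinal
    map, so repeated binding lookups cost O(1) instead of an O(n) scan each."""

    def __init__(self, attrs):
        self.attrs = list(attrs)
        self._index = None                      # built on first lookup, then reused

    def index_of(self, attr):
        if self._index is None:                 # one O(n) pass amortized over all lookups
            self._index = {a: i for i, a in enumerate(self.attrs)}
        return self._index.get(attr, -1)

# Binding m expressions against the same plan output now costs O(n + m)
# rather than O(n * m):
output = AttributeSeq(["id#1", "name#2", "age#3"])
assert output.index_of("name#2") == 1
assert output.index_of("missing#9") == -1       # unresolved attribute
```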
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794325 --- Diff: python/pyspark/ml/stat/distribution.py ---
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794293 --- Diff: python/pyspark/ml/stat/distribution.py ---
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13504 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59971/ Test PASSed.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13504 Merged build finished. Test PASSed.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13504 **[Test build #59971 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59971/consoleFull)** for PR 13504 at commit [`8061600`](https://github.com/apache/spark/commit/806160072f0744f5e700f78d5632ea237fbc6515). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794126 --- Diff: python/pyspark/ml/stat/distribution.py --- +def pdf(self,x): --- End diff -- Should we fall back to SciPy's multivariate normal if that is present?
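For comparison, the SciPy fallback MechCoder suggests would look roughly like the sketch below. It assumes SciPy is installed; whether to gate the fallback on SciPy's availability, and how to reconcile conventions in the singular case, is the open question.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
sigma = np.eye(2)
x = np.array([1.0, 1.0])

# Frozen distribution; pdf/logpdf evaluate the density at a point.
rv = multivariate_normal(mean=mu, cov=sigma)
print(rv.pdf(x))       # (2*pi)^-1 * exp(-1) ≈ 0.0585498 for the identity covariance
print(rv.logpdf(x))

# A singular covariance needs allow_singular=True; note that SciPy normalizes
# by the rank in the degenerate case, which can differ from the MLlib convention.
rv_sing = multivariate_normal(mean=mu, cov=np.ones((2, 2)), allow_singular=True)
print(rv_sing.pdf(x))
```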
[GitHub] spark issue #13472: [SPARK-15735] Allow specifying min time to run in microb...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13472 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59973/ Test FAILed.
[GitHub] spark issue #13472: [SPARK-15735] Allow specifying min time to run in microb...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13472 Merged build finished. Test FAILed.
[GitHub] spark issue #13472: [SPARK-15735] Allow specifying min time to run in microb...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13472 **[Test build #59973 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59973/consoleFull)** for PR 13472 at commit [`ead4a2c`](https://github.com/apache/spark/commit/ead4a2c9a5e9f225f475a40d0e4247c54a76830d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13505 **[Test build #59980 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59980/consoleFull)** for PR 13505 at commit [`0b412b0`](https://github.com/apache/spark/commit/0b412b0069dee64c13daa836f43013380c9aa273).
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794056

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
+
+__all__ = ['MultivariateGaussian']
+
+
+class MultivariateGaussian():
+    """
+    This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
+    In the event that the covariance matrix is singular, the density will be computed in a
+    reduced dimensional subspace under which the distribution is supported.
+    (see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]])
+
+    mu    The mean vector of the distribution
+    sigma The covariance matrix of the distribution
+
+    >>> mu = Vectors.dense([0.0, 0.0])
+    >>> sigma = DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0])
+    >>> x = Vectors.dense([1.0, 1.0])
+    >>> m = MultivariateGaussian(mu, sigma)
+    >>> m.pdf(x)
+    0.0682586811486
+    """
+
+    def __init__(self, mu, sigma):
+        """
+        __init__(self, mu, sigma)
+
+        mu    The mean vector of the distribution
+        sigma The covariance matrix of the distribution
+
+        mu and sigma must be instances of DenseVector and DenseMatrix respectively.
+        """
+        assert (isinstance(mu, DenseVector)), "mu must be a DenseVector Object"
+        assert (isinstance(sigma, DenseMatrix)), "sigma must be a DenseMatrix Object"
+
+        sigma_shape = sigma.toArray().shape
+        assert (sigma_shape[0] == sigma_shape[1]), "Covariance matrix must be square"
+        assert (sigma_shape[0] == mu.size), "Mean vector length must match covariance matrix size"
+
+        # initialize eagerly precomputed attributes
+        self.mu = mu
+
+        # storing sigma as numpy.ndarray
+        # further calculations are done on the ndarray only
+        self.sigma = sigma.toArray()
+
+        # initialize attributes to be computed later
+        self.prec_U = None
+        self.log_det_cov = None
+
+        # compute distribution dependent constants
+        self.__calculateCovarianceConstants()
+
+    def pdf(self, x):
+        """
+        Returns density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert (isinstance(x, Vector)), "x must be of Vector Type"
+        return float(self.__pdf(x))
+
+    def logpdf(self, x):
+        """
+        Returns the log-density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert (isinstance(x, Vector)), "x must be of Vector Type"
+        return float(self.__logpdf(x))
+
+    def __calculateCovarianceConstants(self):
+        """
+        Calculates distribution dependent components used for the density function,
+        based on the scipy multivariate library; refer to
+        https://github.com/scipy/scipy/blob/master/scipy/stats/_multivariate.py
+        tested with precision of 9 significant digits (refer testcase)
+        """
+        try:
+            # pre-processing input parameters
+            # throws ValueError with invalid inputs
+            self.dim, self.mu, self.sigma = self.__process_parameters(None, self.mu, self.sigma)
+
+            # return the eigenvalues and eigenvectors
+            # of a Hermitian or symmetric matrix.
+            # s = eigen values
+            # u = eigen vectors
+            s, u = np.linalg.eigh(self.sigma)
+
+            # Singular values are considered to be non-zero only if
+            # they exceed a tolerance based on machine precision, matrix size, and
+            # relation to
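The eigendecomposition approach under review follows scipy's `_multivariate.py`: eigendecompose the covariance, discard near-zero eigenvalues, and evaluate the density in the supported subspace. A minimal standalone sketch of that computation using NumPy directly (the function name and tolerance handling are illustrative assumptions, not the PySpark API under review):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density, tolerating a singular covariance.

    Sketch of the scipy-style approach discussed above: eigendecompose
    sigma, keep only eigenvalues above a machine-precision tolerance, and
    compute the density in the reduced-rank supported subspace.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    s, u = np.linalg.eigh(sigma)              # eigenvalues s, eigenvectors u
    eps = s.max() * len(s) * np.finfo(float).eps
    keep = s > eps                            # non-negligible eigenvalues only
    log_det = np.sum(np.log(s[keep]))         # log pseudo-determinant
    prec_U = u[:, keep] * (1.0 / np.sqrt(s[keep]))  # whitening factor

    dev = x - mu
    maha = np.sum(np.square(dev @ prec_U))    # squared Mahalanobis distance
    rank = int(keep.sum())
    return np.exp(-0.5 * (rank * np.log(2.0 * np.pi) + log_det + maha))
```

For a non-singular identity covariance at the mean this reduces to the familiar `1 / (2 * pi)` in two dimensions, while the singular `[[1, 1], [1, 1]]` case from the doctest still yields a finite density.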
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65793951

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
[same diff hunk as in the previous comment; the lines under discussion are:]
+        sigma_shape = sigma.toArray().shape
+        assert (sigma_shape[0] == sigma_shape[1]), "Covariance matrix must be square"
+        assert (sigma_shape[0] == mu.size), "Mean vector length must match covariance matrix size"
--- End diff --

You can use the `numRows`, `numCols`
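The reviewer's point is that the matrix already carries its dimensions (the doctest constructs it as `DenseMatrix(2, 2, ...)`), so converting to a NumPy array just to read `.shape` is wasted work. A sketch of the suggested check, using a lightweight stand-in class so it runs without a Spark installation (the stand-in mimics only the `numRows`/`numCols` attributes):

```python
class DenseMatrix:
    """Stand-in for pyspark.ml.linalg.DenseMatrix (attributes only)."""
    def __init__(self, numRows, numCols, values):
        self.numRows = numRows
        self.numCols = numCols
        self.values = values

def validate(mu_size, sigma):
    # Reviewer's suggestion: read dimensions directly instead of
    # calling sigma.toArray().shape.
    assert sigma.numRows == sigma.numCols, "Covariance matrix must be square"
    assert sigma.numRows == mu_size, "Mean vector length must match covariance matrix size"

validate(2, DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0]))  # passes silently
```

The behavior is identical to the shape-based asserts; only the dense-array materialization is avoided.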
[GitHub] spark issue #13506: [SPARK-15763][SQL] Support DELETE FILE command natively
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13506 Can one of the admins verify this patch?
[GitHub] spark pull request #13506: [SPARK-15763][SQL] Support DELETE FILE command na...
GitHub user kevinyu98 opened a pull request: https://github.com/apache/spark/pull/13506

[SPARK-15763][SQL] Support DELETE FILE command natively

## What changes were proposed in this pull request?

Hive supports these CLI commands to manage resources ([Hive Doc](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli)):

`ADD/DELETE (FILE(s)|JAR(s))`
`LIST (FILE(S) [filepath ...] | JAR(S) [jarpath ...])`

but Spark only supports two commands for now:

`ADD (FILE | JAR)`
`LIST (FILE(S) [filepath ...] | JAR(S) [jarpath ...])`

This PR adds the DELETE FILE command to Spark SQL; I will submit another PR for DELETE JAR(s).

`DELETE FILE `

## **Example:**

**DELETE FILE**

```
scala> spark.sql("add file /Users/qianyangyu/myfile.txt")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("add file /Users/qianyangyu/myfile2.txt")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("list file")
res2: org.apache.spark.sql.DataFrame = [Results: string]

scala> spark.sql("list file").show(false)
+----------------------------------+
|Results                           |
+----------------------------------+
|file:/Users/qianyangyu/myfile2.txt|
|file:/Users/qianyangyu/myfile.txt |
+----------------------------------+

scala> spark.sql("delete file /Users/qianyangyu/myfile.txt")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.sql("list file").show(false)
+----------------------------------+
|Results                           |
+----------------------------------+
|file:/Users/qianyangyu/myfile2.txt|
+----------------------------------+

scala> spark.sql("delete file /Users/qianyangyu/myfile2.txt")
res6: org.apache.spark.sql.DataFrame = []

scala> spark.sql("list file").show(false)
+-------+
|Results|
+-------+
+-------+
```

## How was this patch tested?

Add test cases in Spark-SQL SPARK-Shell and SparkContext suites.
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kevinyu98/spark spark-15763

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13506.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13506

commit 3b44c5978bd44db986621d3e8511e9165b66926b
Author: Kevin Yu  Date: 2016-04-20T18:06:30Z
adding testcase

commit 18b4a31c687b264b50aa5f5a74455956911f738a
Author: Kevin Yu  Date: 2016-04-22T21:48:00Z
Merge remote-tracking branch 'upstream/master'

commit 4f4d1c8f2801b1e662304ab2b33351173e71b427
Author: Kevin Yu  Date: 2016-04-23T16:50:19Z
Merge remote-tracking branch 'upstream/master'
get latest code from upstream

commit f5f0cbed1eb5754c04c36933b374c3b3d2ae4f4e
Author: Kevin Yu  Date: 2016-04-23T22:20:53Z
Merge remote-tracking branch 'upstream/master'
adding trim characters support

commit d8b2edbd13ee9a4f057bca7dcb0c0940e8e867b8
Author: Kevin Yu  Date: 2016-04-25T20:24:33Z
Merge remote-tracking branch 'upstream/master'
get latest code for pr12646

commit 196b6c66b0d55232f427c860c0e7c6876c216a67
Author: Kevin Yu  Date: 2016-04-25T23:45:57Z
Merge remote-tracking branch 'upstream/master'
merge latest code

commit f37a01e005f3e27ae2be056462d6eb6730933ba5
Author: Kevin Yu  Date: 2016-04-27T14:15:06Z
Merge remote-tracking branch 'upstream/master'
merge upstream/master

commit bb5b01fd3abeea1b03315eccf26762fcc23f80c0
Author: Kevin Yu  Date: 2016-04-30T23:49:31Z
Merge remote-tracking branch 'upstream/master'

commit bde5820a181cf84e0879038ad8c4cebac63c1e24
Author: Kevin Yu  Date: 2016-05-04T03:52:31Z
Merge remote-tracking branch 'upstream/master'

commit 5f7cd96d495f065cd04e8e4cc58461843e45bc8d
Author: Kevin Yu  Date: 2016-05-10T21:14:50Z
Merge remote-tracking branch 'upstream/master'

commit 893a49af0bfd153ccb59ba50b63a232660e0eada
Author: Kevin Yu  Date: 2016-05-13T18:20:39Z
Merge remote-tracking branch 'upstream/master'

commit 4bbe1fd4a3ebd50338ccbe07dc5887fe289cd53d
Author: Kevin Yu  Date: 2016-05-17T21:58:14Z
Merge remote-tracking branch 'upstream/master'

commit b2dd795e23c36cbbd022f07a10c0cf21c85eb421
Author: Kevin Yu  Date: 2016-05-18T06:37:13Z
Merge remote-tracking branch 'upstream/master'

commit 8c3e5da458dbff397ed60fcb68f2a46d87ab7ba4
Author: Kevin Yu
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65793865

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
[same diff hunk as in the earlier comments; the lines under discussion are:]
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
--- End diff --

This import should be moved above.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13505 **[Test build #59976 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59976/consoleFull)** for PR 13505 at commit [`38e8a99`](https://github.com/apache/spark/commit/38e8a9935e3ae0a166f7bcd3231bef50ef7ec71b).
[GitHub] spark issue #13403: [SPARK-15660][CORE] RDD and Dataset should show the cons...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13403 **[Test build #59979 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59979/consoleFull)** for PR 13403 at commit [`801`](https://github.com/apache/spark/commit/801244d24b8b8f250dac34b21b1ea3245981).
[GitHub] spark issue #13436: [SPARK-15696][SQL] Improve `crosstab` to have a consiste...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13436 **[Test build #59978 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59978/consoleFull)** for PR 13436 at commit [`ec49b37`](https://github.com/apache/spark/commit/ec49b375c7f12d8a5f32c50d67291557ba262a5f).
[GitHub] spark issue #13486: [SPARK-15743][SQL] Prevent saving with all-column partit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13486 **[Test build #59977 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59977/consoleFull)** for PR 13486 at commit [`3fd3512`](https://github.com/apache/spark/commit/3fd351292a80f9d3dc59df68bc958439a8381424).
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65793015

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala ---
@@ -84,17 +84,27 @@ object BindReferences extends Logging {
 expression: A,
 input: Seq[Attribute],
--- End diff --

I think we should add an overload which takes a sequence of expressions and binds all of their references. We should then replace the call sites in the various projection operators.
[GitHub] spark pull request #12938: [SPARK-15162][SPARK-15164][PySpark][DOCS][ML] upd...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/12938#discussion_r65792982

--- Diff: python/pyspark/ml/classification.py ---
@@ -183,7 +191,7 @@ def getThresholds(self):
 If :py:attr:`thresholds` is set, return its value. Otherwise, if
 :py:attr:`threshold` is set, return the equivalent thresholds for binary
 classification: (1-threshold, threshold).
-If neither are set, throw an error.
--- End diff --

I'd say yeah - cc @yanboliang thoughts?
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13505 **[Test build #59975 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59975/consoleFull)** for PR 13505 at commit [`6216e94`](https://github.com/apache/spark/commit/6216e944a1aef8dc67a446654c0799c8e9920b4f).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13505 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59975/ Test FAILed.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13505 Merged build finished. Test FAILed.
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65792855

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala ---
@@ -84,17 +84,27 @@ object BindReferences extends Logging {
 expression: A,
 input: Seq[Attribute],
--- End diff --

Actually, yeah: in `GenerateMutableProjection` we use the same InputSchema for every expression.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13505 **[Test build #59975 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59975/consoleFull)** for PR 13505 at commit [`6216e94`](https://github.com/apache/spark/commit/6216e944a1aef8dc67a446654c0799c8e9920b4f).
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65792692

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala ---
@@ -84,17 +84,27 @@ object BindReferences extends Logging {
 expression: A,
 input: Seq[Attribute],
--- End diff --

I wonder whether we can push the map construction up one level so that we can amortize its cost across multiple `bindReference` calls.
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/13505

[SPARK-15764][SQL] Replace N^2 loop in BindReferences

BindReferences contains an N^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because `input` can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n).

Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups. Perf. benchmarks to follow. /cc @ericl

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JoshRosen/spark bind-references-improvement

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13505.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13505

commit 6216e944a1aef8dc67a446654c0799c8e9920b4f
Author: Josh Rosen  Date: 2016-06-04T00:12:16Z
Replace N^2 loop in BindReferences.
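The core idea of the PR — build a map from expression id to ordinal once, then answer every reference lookup in O(1) instead of scanning `input` — can be sketched as follows. The actual patch is Scala in `BoundAttribute.scala`; this is an illustrative Python sketch with hypothetical names, not the Spark API:

```python
from collections import namedtuple

# Stand-in for a Catalyst attribute: an expression id plus nullability.
Attr = namedtuple("Attr", ["exprId", "nullable"])

def build_ordinal_map(input_attrs):
    """One up-front O(n) pass over the input schema.

    The construction cost is amortized across every subsequent
    bind_reference call, since one expression can contain many
    attribute references.
    """
    return {a.exprId: (i, a) for i, a in enumerate(input_attrs)}

def bind_reference(expr_id, ordinal_map):
    """O(1) lookup replacing the per-reference linear scan over `input`.

    Returns the ordinal and the attribute's nullability, mirroring the
    two pieces of information the original O(n) loop had to recover.
    """
    ordinal, attr = ordinal_map[expr_id]
    return ordinal, attr.nullable
```

Binding k references against a schema of n attributes then costs O(n + k) rather than O(n * k), which is exactly the amortization argument made in the PR description.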