[GitHub] spark issue #13508: [SPARK-15766][SparkR]:R should export is.nan
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13508 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13508: [SPARK-15766][SparkR]:R should export is.nan
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13508 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59984/ Test PASSed.
[GitHub] spark issue #13508: [SPARK-15766][SparkR]:R should export is.nan
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13508 **[Test build #59984 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59984/consoleFull)** for PR 13508 at commit [`da04a0d`](https://github.com/apache/spark/commit/da04a0d80578cc2d5d5a87a61ac2377df740c3ae).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #12258: [SPARK-14485][CORE] ignore task finished for exec...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/12258#discussion_r65797802

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---

@@ -343,17 +343,31 @@ private[spark] class TaskSchedulerImpl(
      }
      taskIdToTaskSetManager.get(tid) match {
        case Some(taskSet) =>
+          var executorId: String = null
          if (TaskState.isFinished(state)) {
            taskIdToTaskSetManager.remove(tid)
            taskIdToExecutorId.remove(tid).foreach { execId =>
+              executorId = execId
              if (executorIdToTaskCount.contains(execId)) {
                executorIdToTaskCount(execId) -= 1
              }
            }
          }
          if (state == TaskState.FINISHED) {
-            taskSet.removeRunningTask(tid)
-            taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
+            // In some cases the executor has already been removed by the driver for a
+            // heartbeat timeout, but before the executor is killed by the cluster, a task
+            // running on it finishes and reports a success state back to the driver.
+            // Such tasks should be ignored, because the tasks on this executor have
+            // already been re-queued by the driver. For more details, see SPARK-14485.
+            if (executorId.ne(null) && !executorIdToTaskCount.contains(executorId)) {
+              taskSet.removeRunningTask(tid)
+              logWarning(

--- End diff --

Yes, you are right. I will change it soon, thanks.
[GitHub] spark issue #13494: [SPARK-15752] [SQL] support optimization for metadata on...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13494 Can you try to write a design doc on this? Would be great to discuss the reasons why we might want this, the kind of queries that can be answered, corner cases, and how it should be implemented. Thanks.
[GitHub] spark pull request #12258: [SPARK-14485][CORE] ignore task finished for exec...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/12258#discussion_r6579

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---

@@ -343,17 +343,31 @@ private[spark] class TaskSchedulerImpl(
      }
      taskIdToTaskSetManager.get(tid) match {
        case Some(taskSet) =>
+          var executorId: String = null
          if (TaskState.isFinished(state)) {
            taskIdToTaskSetManager.remove(tid)
            taskIdToExecutorId.remove(tid).foreach { execId =>
+              executorId = execId
              if (executorIdToTaskCount.contains(execId)) {
                executorIdToTaskCount(execId) -= 1
              }
            }
          }
          if (state == TaskState.FINISHED) {
-            taskSet.removeRunningTask(tid)
-            taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
+            // In some cases the executor has already been removed by the driver for a
+            // heartbeat timeout, but before the executor is killed by the cluster, a task
+            // running on it finishes and reports a success state back to the driver.
+            // Such tasks should be ignored, because the tasks on this executor have
+            // already been re-queued by the driver. For more details, see SPARK-14485.
+            if (executorId.ne(null) && !executorIdToTaskCount.contains(executorId)) {
+              taskSet.removeRunningTask(tid)

--- End diff --

I see. I will fix it soon.
[GitHub] spark pull request #12258: [SPARK-14485][CORE] ignore task finished for exec...
Github user zhonghaihua commented on a diff in the pull request: https://github.com/apache/spark/pull/12258#discussion_r65797773

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---

@@ -343,17 +343,31 @@ private[spark] class TaskSchedulerImpl(
      }
      taskIdToTaskSetManager.get(tid) match {
        case Some(taskSet) =>
+          var executorId: String = null
          if (TaskState.isFinished(state)) {
            taskIdToTaskSetManager.remove(tid)
            taskIdToExecutorId.remove(tid).foreach { execId =>
+              executorId = execId
              if (executorIdToTaskCount.contains(execId)) {
                executorIdToTaskCount(execId) -= 1
              }
            }
          }
          if (state == TaskState.FINISHED) {
-            taskSet.removeRunningTask(tid)
-            taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
+            // In some cases the executor has already been removed by the driver for a
+            // heartbeat timeout, but before the executor is killed by the cluster, a task
+            // running on it finishes and reports a success state back to the driver.
+            // Such tasks should be ignored, because the tasks on this executor have
+            // already been re-queued by the driver. For more details, see SPARK-14485.
+            if (executorId.ne(null) && !executorIdToTaskCount.contains(executorId)) {

--- End diff --

Thanks for your comments. I will fix it soon.
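[Editor's note] The guard being reviewed in this thread can be sketched in isolation. The following is a standalone illustration, not Spark's actual `TaskSchedulerImpl`; the object and method names (`StatusUpdateDemo`, `shouldIgnore`) are invented for the demo. The idea: a task result is only treated as a success if its executor is still registered in `executorIdToTaskCount`; results from executors the driver has already removed (e.g. after a heartbeat timeout) are dropped, because those tasks were already re-queued.

```scala
// Standalone sketch of the SPARK-14485 guard (hypothetical names, not Spark code).
object StatusUpdateDemo {
  // Stands in for TaskSchedulerImpl.executorIdToTaskCount: only live executors appear here.
  val executorIdToTaskCount = scala.collection.mutable.Map("exec-1" -> 1)

  // A finished task is ignored when its executor is known but no longer registered.
  def shouldIgnore(executorId: String): Boolean =
    executorId != null && !executorIdToTaskCount.contains(executorId)

  def main(args: Array[String]): Unit = {
    assert(!shouldIgnore("exec-1"))   // live executor: enqueue the successful task
    assert(shouldIgnore("exec-gone")) // removed executor: drop the stale result
    assert(!shouldIgnore(null))       // state transition never recorded an executor
    println("guard behaves as expected")
  }
}
```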
[GitHub] spark issue #13508: [SPARK-15766][SparkR]:R should export is.nan
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13508 **[Test build #59984 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59984/consoleFull)** for PR 13508 at commit [`da04a0d`](https://github.com/apache/spark/commit/da04a0d80578cc2d5d5a87a61ac2377df740c3ae).
[GitHub] spark pull request #13500: [SPARK-15756] [SQL] Support command 'create table...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13500
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13505 hm probably shouldn't happen in this pr but i'm wondering if it'd make sense to generalize AttributeSeq and use it everywhere, rather than Seq[Attribute].
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65797603

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala ---

@@ -86,11 +86,31 @@ package object expressions {
  /**
   * Helper functions for working with `Seq[Attribute]`.
   */
-  implicit class AttributeSeq(attrs: Seq[Attribute]) {
+  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
    /** Creates a StructType with a schema matching this `Seq[Attribute]`. */
    def toStructType: StructType = {
      StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable)))
    }
+
+    private lazy val inputArr = attrs.toArray
+
+    private lazy val inputToOrdinal = {
+      val map = new java.util.HashMap[ExprId, Int](inputArr.length * 2)
+      var index = 0
+      attrs.foreach { attr =>
+        if (!map.containsKey(attr.exprId)) {
+          map.put(attr.exprId, index)
+        }
+        index += 1
+      }
+      map
+    }
+
+    def apply(ordinal: Int): Attribute = inputArr(ordinal)
+
+    def getOrdinal(exprId: ExprId): Int = {

--- End diff --

yup ...
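[Editor's note] The lookup structure in the diff above can be sketched with plain Scala collections. This is a simplified illustration, not Spark's actual `AttributeSeq` (the `Attribute` case class and `ordinalByExprId` name are invented here): an implicit class over `Seq[Attribute]` builds a hash map from expression id to ordinal once, turning each reference binding into an O(1) probe instead of an O(N) scan, which is what removes the N^2 loop. As in the diff, duplicate expression ids keep the first ordinal.

```scala
// Simplified sketch of hash-based ordinal lookup over Seq[Attribute] (not Spark code).
object AttributeSeqDemo {
  case class Attribute(exprId: Long, name: String)

  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
    // Built once per wrapper; groupBy preserves order, so head._2 is the FIRST index
    // for each exprId, matching the "first wins" behavior in the diff.
    private lazy val ordinalByExprId: Map[Long, Int] =
      attrs.zipWithIndex.groupBy(_._1.exprId).map { case (id, hits) => id -> hits.head._2 }

    def getOrdinal(exprId: Long): Int = ordinalByExprId.getOrElse(exprId, -1)
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(Attribute(10L, "a"), Attribute(20L, "b"), Attribute(10L, "a2"))
    assert(input.getOrdinal(20L) == 1)
    assert(input.getOrdinal(10L) == 0) // duplicate exprId keeps the first ordinal
    assert(input.getOrdinal(99L) == -1) // not bound
    println("ordinal lookups OK")
  }
}
```

Note that with an implicit class a fresh wrapper (and thus a fresh map) is created at each conversion; the real payoff, as the follow-up comment suggests, comes from passing an `AttributeSeq` around so the map is built once per input schema.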
[GitHub] spark issue #13500: [SPARK-15756] [SQL] Support command 'create table stored...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13500 Thanks - merging in master/2.0.
[GitHub] spark pull request #13508: [SPARK-15766][SparkR]:R should export is.nan
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/13508 [SPARK-15766][SparkR]:R should export is.nan

## What changes were proposed in this pull request?

When reviewing SPARK-15545, we found that is.nan is not exported; it should be. Add it to the NAMESPACE.

## How was this patch tested?

Manual tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangmiao1981/spark unused

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13508.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13508

commit da04a0d80578cc2d5d5a87a61ac2377df740c3ae
Author: wm...@hotmail.com
Date: 2016-06-04T05:15:20Z

    export is.nan
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13504 LGTM other than that small comment.
[GitHub] spark pull request #13504: [SPARK-15762][SQL] Cache Metadata & StructType ha...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13504#discussion_r65797565

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala ---

@@ -104,7 +104,8 @@ sealed class Metadata private[types] (private[types] val map: Map[String, Any])
    }
  }

-  override def hashCode: Int = Metadata.hash(this)
+  private lazy val _hashCode: Int = Metadata.hash(this)
+  override def hashCode: Int = _hashCode

--- End diff --

any reason why this is not just
```
override lazy val hashCode: Int = Metadata.hash(this)
```
?
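[Editor's note] The two forms in this review comment are behaviorally equivalent; `override lazy val hashCode` is just the shorter spelling, since a Scala `lazy val` may override a `def` and memoizes its body on first access. A minimal standalone sketch (hypothetical `Meta` class and `hashComputations` counter, not Spark's `Metadata`) showing that the hash is computed exactly once:

```scala
// Sketch of caching hashCode via `override lazy val` (illustrative, not Spark code).
object LazyHashDemo {
  var hashComputations = 0 // counts how often the "expensive" hash body runs

  class Meta(val map: Map[String, Any]) {
    // A lazy val can override def hashCode: computed on first access, cached after.
    override lazy val hashCode: Int = {
      hashComputations += 1 // stands in for the expensive Metadata.hash(this)
      map.hashCode()
    }
  }

  def main(args: Array[String]): Unit = {
    val m = new Meta(Map("a" -> 1))
    m.hashCode
    m.hashCode // second call reuses the cached value; the body does not run again
    assert(hashComputations == 1)
    println("hash body ran once")
  }
}
```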
[GitHub] spark pull request #13506: [SPARK-15763][SQL] Support DELETE FILE command na...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13506#discussion_r65797550

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---

@@ -1441,6 +1441,32 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  }

  /**
+   * Delete a file to be downloaded with this Spark job on every node.
+   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
+   * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
+   * use `SparkFiles.get(fileName)` to find its download location.
+   */
+  def deleteFile(path: String): Unit = {

--- End diff --

this is fairly confusing -- i'd assume this is actually deleting the path given.
[GitHub] spark issue #13507: [SPARK-15765][SQL][Streaming] Make continuous Parquet wr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13507 **[Test build #59983 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59983/consoleFull)** for PR 13507 at commit [`60a2c8e`](https://github.com/apache/spark/commit/60a2c8ee7c610a783e65b78ac21e25661b84f49d).
[GitHub] spark pull request #13507: [SPARK-15765][SQL][Streaming] Make continuous Par...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/13507 [SPARK-15765][SQL][Streaming] Make continuous Parquet writing consistent with non-continuous Parquet writing

## What changes were proposed in this pull request?

Currently there is some code duplication between continuous Parquet writing (as in Structured Streaming) and non-continuous batch writing; see [ParquetFileFormat#prepareWrite()](https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L68) and [ParquetFileFormat#ParquetOutputWriterFactory](https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L414). This may lead to inconsistent behavior when we change one piece of code but not the other. By extracting the common code, this patch fixes the inconsistency. As a result, Structured Streaming now also benefits from [SPARK-15719](https://github.com/apache/spark/pull/13455).

## How was this patch tested?

This is just code refactoring without any logic change, so it should be covered by existing suites.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lw-lin/spark parquet-conf-deduplicate

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13507.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13507

commit 60a2c8ee7c610a783e65b78ac21e25661b84f49d
Author: Liwei Lin
Date: 2016-06-03T14:31:56Z

    Make continuous writing consistent with non-consistent writing
[GitHub] spark issue #13442: [SPARK-15654][SQL] Check if all the input files are spli...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/13442 @rxin plz check this.
[GitHub] spark issue #13442: [SPARK-15654][SQL] Check if all the input files are spli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13442 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59982/ Test PASSed.
[GitHub] spark issue #13442: [SPARK-15654][SQL] Check if all the input files are spli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13442 Merged build finished. Test PASSed.
[GitHub] spark issue #13442: [SPARK-15654][SQL] Check if all the input files are spli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13442 **[Test build #59982 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59982/consoleFull)** for PR 13442 at commit [`d46bfdf`](https://github.com/apache/spark/commit/d46bfdf01b6af86f0b74dd482f46a31d7cc7c632).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #13413: [SPARK-15663][SQL] SparkSession.catalog.listFunct...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/13413#discussion_r65796528

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---

@@ -88,14 +106,6 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
    checkKeywordsExist(sql("describe functioN abcadf"), "Function: abcadf not found.")
  }

-  test("SPARK-14415: All functions should have own descriptions") {
-    for (f <- spark.sessionState.functionRegistry.listFunction()) {
-      if (!Seq("cube", "grouping", "grouping_id", "rollup", "window").contains(f)) {
-        checkKeywordsNotExist(sql(s"describe function `$f`"), "N/A.")
-      }
-    }
-  }
-

--- End diff --

Why was this test removed? We should check that these functions have their own descriptions.
[GitHub] spark pull request #13413: [SPARK-15663][SQL] SparkSession.catalog.listFunct...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/13413#discussion_r65796510

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---

@@ -58,15 +59,32 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
  test("show functions") {
    def getFunctions(pattern: String): Seq[Row] = {
-      StringUtils.filterPattern(spark.sessionState.functionRegistry.listFunction(), pattern)
+      StringUtils.filterPattern(
+        spark.sessionState.catalog.listFunctions("default").map(_.funcName), pattern)
        .map(Row(_))
    }
+
+    def createFunction(names: Seq[String]): Unit = {
+      names.foreach { name =>
+        spark.udf.register(name, (arg1: Int, arg2: String) => arg2 + arg1)
+      }
+    }
+
+    assert(sql("SHOW functions").collect().isEmpty)
+
+    createFunction(Seq("ilog", "logi", "logii", "logiii"))
+    createFunction(Seq("crc32i", "cubei", "cume_disti"))
+    createFunction(Seq("isize", "ispace"))
+    createFunction(Seq("to_datei", "date_addi", "current_datei"))
+
    checkAnswer(sql("SHOW functions"), getFunctions("*"))
+
    Seq("^c*", "*e$", "log*", "*date*").foreach { pattern =>
      // For the pattern part, only '*' and '|' are allowed as wildcards.
      // For '*', we need to replace it to '.*'.
      checkAnswer(sql(s"SHOW FUNCTIONS '$pattern'"), getFunctions(pattern))
    }
+

--- End diff --

Nit: Remove this line
[GitHub] spark pull request #13413: [SPARK-15663][SQL] SparkSession.catalog.listFunct...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/13413#discussion_r65796505

--- Diff: python/pyspark/sql/tests.py ---

@@ -1481,17 +1481,7 @@ def test_list_functions(self):
        spark.sql("CREATE DATABASE some_db")
        functions = dict((f.name, f) for f in spark.catalog.listFunctions())
        functionsDefault = dict((f.name, f) for f in spark.catalog.listFunctions("default"))
-        self.assertTrue(len(functions) > 200)
-        self.assertTrue("+" in functions)
-        self.assertTrue("like" in functions)
-        self.assertTrue("month" in functions)
-        self.assertTrue("to_unix_timestamp" in functions)
-        self.assertTrue("current_database" in functions)
-        self.assertEquals(functions["+"], Function(
-            name="+",
-            description=None,
-            className="org.apache.spark.sql.catalyst.expressions.Add",
-            isTemporary=True))
+        self.assertEquals(len(functions), 0)

--- End diff --

Seems we need some unit tests to check if we can show user-defined temp functions in `listFunctions`.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13504 Merged build finished. Test PASSed.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13504 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59981/ Test PASSed.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13504 **[Test build #59981 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59981/consoleFull)** for PR 13504 at commit [`a3a6898`](https://github.com/apache/spark/commit/a3a68989bb31f14e74cbfc3532e089a4de070605).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13461: [SPARK-15721][ML] Make DefaultParamsReadable, DefaultPar...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/13461 LGTM
[GitHub] spark issue #13403: [SPARK-15660][CORE] RDD and Dataset should show the cons...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13403 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59979/ Test PASSed.
[GitHub] spark issue #13403: [SPARK-15660][CORE] RDD and Dataset should show the cons...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13403 Merged build finished. Test PASSed.
[GitHub] spark issue #13403: [SPARK-15660][CORE] RDD and Dataset should show the cons...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13403 **[Test build #59979 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59979/consoleFull)** for PR 13403 at commit [`801`](https://github.com/apache/spark/commit/801244d24b8b8f250dac34b21b1ea3245981). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13442: [SPARK-15654][SQL] Check if all the input files are spli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13442 **[Test build #59982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59982/consoleFull)** for PR 13442 at commit [`d46bfdf`](https://github.com/apache/spark/commit/d46bfdf01b6af86f0b74dd482f46a31d7cc7c632).
[GitHub] spark issue #13444: [SPARK-15530][SQL] Set #parallelism for file listing in ...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/13444 @yhuai ping
[GitHub] spark issue #13486: [SPARK-15743][SQL] Prevent saving with all-column partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13486 Merged build finished. Test PASSed.
[GitHub] spark issue #13486: [SPARK-15743][SQL] Prevent saving with all-column partit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13486 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59977/ Test PASSed.
[GitHub] spark issue #13486: [SPARK-15743][SQL] Prevent saving with all-column partit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13486 **[Test build #59977 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59977/consoleFull)** for PR 13486 at commit [`3fd3512`](https://github.com/apache/spark/commit/3fd351292a80f9d3dc59df68bc958439a8381424). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #12173: [SPARK-13792][SQL] Limit logging of bad records in CSVRe...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/12173 @falaki ping
[GitHub] spark issue #13436: [SPARK-15696][SQL] Improve `crosstab` to have a consiste...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13436 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59978/ Test PASSed.
[GitHub] spark issue #13436: [SPARK-15696][SQL] Improve `crosstab` to have a consiste...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13436 Merged build finished. Test PASSed.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13505 Merged build finished. Test PASSed.
[GitHub] spark issue #13436: [SPARK-15696][SQL] Improve `crosstab` to have a consiste...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13436 **[Test build #59978 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59978/consoleFull)** for PR 13436 at commit [`ec49b37`](https://github.com/apache/spark/commit/ec49b375c7f12d8a5f32c50d67291557ba262a5f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13505 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59976/ Test PASSed.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13505 **[Test build #59976 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59976/consoleFull)** for PR 13505 at commit [`38e8a99`](https://github.com/apache/spark/commit/38e8a9935e3ae0a166f7bcd3231bef50ef7ec71b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user ericl commented on the issue: https://github.com/apache/spark/pull/13505 Here's a flame graph of bindReferences dominating the CPU used for a 10k column query: [profile](https://github.com/apache/spark/files/298644/slow-bind-refs.svg.zip)
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user ericl commented on the issue: https://github.com/apache/spark/pull/13504 [hash code profile](https://github.com/apache/spark/files/298642/hashcode.svg.zip)
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13505 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59980/ Test FAILed.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13505 Merged build finished. Test FAILed.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13505 **[Test build #59980 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59980/consoleFull)** for PR 13505 at commit [`0b412b0`](https://github.com/apache/spark/commit/0b412b0069dee64c13daa836f43013380c9aa273). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` implicit class AttributeSeq(val attrs: Seq[Attribute]) `
[GitHub] spark issue #13500: [SPARK-15756] [SQL] Support command 'create table stored...
Github user lianhuiwang commented on the issue: https://github.com/apache/spark/pull/13500 @rxin I have updated PR description. Thanks.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/13505 @rxin, @ericl has some new benchmarks which operate on even wider schemas and which uncovered this bottleneck. Adding the caching of the map here resulted in a huge scalability improvement. Maybe @ericl can chime in with some flame graph charts here.
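[Editorial note for readers of this archive: the "caching of the map" described above replaces a repeated linear scan with a precomputed id-to-ordinal map. The following is an illustrative Python sketch of that general technique only — it is not Spark's actual implementation, which builds a `java.util.HashMap[ExprId, Int]` inside the Scala `AttributeSeq` class; the function names below are hypothetical.]

```python
def bind_references_quadratic(references, attributes):
    """O(N^2): one linear scan of `attributes` per reference."""
    return [attributes.index(ref) for ref in references]


def bind_references_linear(references, attributes):
    """O(N): build the ordinal map once, then each lookup is O(1).

    setdefault keeps the first occurrence of a duplicate attribute,
    mirroring the containsKey guard in the actual patch.
    """
    ordinal = {}
    for i, attr in enumerate(attributes):
        ordinal.setdefault(attr, i)
    return [ordinal[ref] for ref in references]
```

For a 10k-column schema the quadratic version performs on the order of 10^8 comparisons, while the map-based version does a single pass plus constant-time lookups, which is the scalability improvement described above.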
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/13504 @rxin, I don't have the specific performance numbers handy but in an optimizer stress-test benchmark run by @ericl, these hashCode calls accounted for roughly 40% of the total CPU time and this bottleneck was completely eliminated by the caching added by this patch.
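[Editorial note: the caching described above amounts to memoizing a recursively computed hash. A minimal Python sketch of the idea follows — illustrative only; Spark's actual patch caches the result as a lazy val on the Scala `Metadata` and `StructType` classes, and these simplified class bodies are not Spark's code.]

```python
class StructField:
    """A named field; hashing one field is cheap."""

    def __init__(self, name, dtype):
        self.name = name
        self.dtype = dtype

    def __hash__(self):
        return hash((self.name, self.dtype))


class StructType:
    """A schema whose hash is computed once and then reused.

    Without the cache, every hash lookup involving a wide schema
    re-hashes all of its fields, which is what showed up as ~40%
    of CPU time in the stress test described above.
    """

    def __init__(self, fields):
        self.fields = tuple(fields)
        self._cached_hash = None  # cache slot, filled on first use

    def __hash__(self):
        if self._cached_hash is None:  # compute lazily, exactly once
            self._cached_hash = hash(self.fields)
        return self._cached_hash
```

This is safe only because the schema is immutable after construction; a mutable structure could not cache its hash this way.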
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13504 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59974/ Test PASSed.
[GitHub] spark issue #13407: [SPARK-15665] [CORE] spark-submit --kill and --status ar...
Github user devaraj-kavali commented on the issue: https://github.com/apache/spark/pull/13407 Thanks @vanzin for review and merging.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13504 Merged build finished. Test PASSed.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13504 **[Test build #59974 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59974/consoleFull)** for PR 13504 at commit [`92c6c69`](https://github.com/apache/spark/commit/92c6c6994fac3cf36ea1974171f6018c36013ce0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13504 **[Test build #59981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59981/consoleFull)** for PR 13504 at commit [`a3a6898`](https://github.com/apache/spark/commit/a3a68989bb31f14e74cbfc3532e089a4de070605).
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65795007

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala ---
@@ -296,7 +296,7 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT
   /**
    * All the attributes that are used for this plan.
    */
-  lazy val allAttributes: Seq[Attribute] = children.flatMap(_.output)
+  lazy val allAttributes: AttributeSeq = children.flatMap(_.output)
--- End diff --

We should probably construct the AttributeSeq outside of the loop in the various projection operators, too, although that doesn't appear to be as serious a bottleneck yet.
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65794996

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala ---
@@ -86,11 +86,31 @@ package object expressions {
   /**
    * Helper functions for working with `Seq[Attribute]`.
    */
-  implicit class AttributeSeq(attrs: Seq[Attribute]) {
+  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
     /** Creates a StructType with a schema matching this `Seq[Attribute]`. */
     def toStructType: StructType = {
       StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable)))
     }
+
+    private lazy val inputArr = attrs.toArray
+
+    private lazy val inputToOrdinal = {
+      val map = new java.util.HashMap[ExprId, Int](inputArr.length * 2)
+      var index = 0
+      attrs.foreach { attr =>
+        if (!map.containsKey(attr.exprId)) {
+          map.put(attr.exprId, index)
+        }
+        index += 1
+      }
+      map
+    }
+
+    def apply(ordinal: Int): Attribute = inputArr(ordinal)
+
+    def getOrdinal(exprId: ExprId): Int = {
--- End diff --

I suppose this needs documentation.
[GitHub] spark issue #13248: [SPARK-15194] [ML] Add Python ML API for MultivariateGau...
Github user MechCoder commented on the issue: https://github.com/apache/spark/pull/13248 @praveendareddy21 Just made a first pass. Also please run PEP8 on your code
[GitHub] spark issue #13496: [SPARK-15753][SQL] Move Analyzer stuff to Analyzer from ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13496 cc @cloud-fan @yhuai
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794904

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
+
+__all__ = ['MultivariateGaussian']
+
+
+class MultivariateGaussian():
+    """
+    This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
+    In the event that the covariance matrix is singular, the density will be computed in a
+    reduced dimensional subspace under which the distribution is supported.
+    (see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]])
+
+    mu    The mean vector of the distribution
+    sigma The covariance matrix of the distribution
+
+    >>> mu = Vectors.dense([0.0, 0.0])
+    >>> sigma= DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0])
+    >>> x = Vectors.dense([1.0, 1.0])
+    >>> m = MultivariateGaussian(mu, sigma)
+    >>> m.pdf(x)
+    0.0682586811486
+    """
+
+    def __init__(self, mu, sigma):
+        """
+        __init__(self, mu, sigma)
+
+        mu    The mean vector of the distribution
+        sigma The covariance matrix of the distribution
+
+        mu and sigma must be instances of DenseVector and DenseMatrix respectively.
+        """
+        assert (isinstance(mu, DenseVector)), "mu must be a DenseVector Object"
+        assert (isinstance(sigma, DenseMatrix)), "sigma must be a DenseMatrix Object"
+
+        sigma_shape=sigma.toArray().shape
+        assert (sigma_shape[0]==sigma_shape[1]) , "Covariance matrix must be square"
+        assert (sigma_shape[0]==mu.size) , "Mean vector length must match covariance matrix size"
+
+        # initialize eagerly precomputed attributes
+        self.mu=mu
+
+        # storing sigma as numpy.ndarray
+        # furthur calculations are done ndarray only
+        self.sigma=sigma.toArray()
+
+        # initialize attributes to be computed later
+        self.prec_U = None
+        self.log_det_cov = None
+
+        # compute distribution dependent constants
+        self.__calculateCovarianceConstants()
+
+    def pdf(self,x):
+        """
+        Returns density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert (isinstance(x, Vector)), "x must be of Vector Type"
+        return float(self.__pdf(x))
+
+    def logpdf(self,x):
+        """
+        Returns the log-density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert (isinstance(x, Vector)), "x must be of Vector Type"
+        return float(self.__logpdf(x))
+
+    def __calculateCovarianceConstants(self):
+        """
+        Calculates distribution dependent components used for the density function
+        based on scipy multivariate library
+        refer https://github.com/scipy/scipy/blob/master/scipy/stats/_multivariate.py
+        tested with precision of 9 significant digits(refer testcase)
+        """
+        try :
+            # pre-processing input parameters
+            # throws ValueError with invalid inputs
+            self.dim, self.mu, self.sigma = self.__process_parameters(None, self.mu, self.sigma)
+
+            # return the eigenvalues and eigenvectors
+            # of a Hermitian or symmetric matrix.
+            # s = eigen values
+            # u = eigen vectors
+            s, u = np.linalg.eigh(self.sigma)
+
+            # Singular values are considered to be non-zero only if
+            # they exceed a tolerance based on machine precision, matrix size, and
+            # relation to
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794836

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
+
+__all__ = ['MultivariateGaussian']
+
+
+class MultivariateGaussian():
+    """
+    This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
+    In the event that the covariance matrix is singular, the density will be computed in a
+    reduced dimensional subspace under which the distribution is supported.
+    (see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]])
+
+    mu    The mean vector of the distribution
+    sigma The covariance matrix of the distribution
+
+    >>> mu = Vectors.dense([0.0, 0.0])
+    >>> sigma= DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0])
+    >>> x = Vectors.dense([1.0, 1.0])
+    >>> m = MultivariateGaussian(mu, sigma)
+    >>> m.pdf(x)
+    0.0682586811486
+    """
+
+    def __init__(self, mu, sigma):
+        """
+        __init__(self, mu, sigma)
+
+        mu    The mean vector of the distribution
+        sigma The covariance matrix of the distribution
+
+        mu and sigma must be instances of DenseVector and DenseMatrix respectively.
+        """
+        assert (isinstance(mu, DenseVector)), "mu must be a DenseVector Object"
+        assert (isinstance(sigma, DenseMatrix)), "sigma must be a DenseMatrix Object"
+
+        sigma_shape=sigma.toArray().shape
+        assert (sigma_shape[0]==sigma_shape[1]) , "Covariance matrix must be square"
+        assert (sigma_shape[0]==mu.size) , "Mean vector length must match covariance matrix size"
+
+        # initialize eagerly precomputed attributes
+        self.mu=mu
+
+        # storing sigma as numpy.ndarray
+        # furthur calculations are done ndarray only
+        self.sigma=sigma.toArray()
+
+        # initialize attributes to be computed later
+        self.prec_U = None
+        self.log_det_cov = None
+
+        # compute distribution dependent constants
+        self.__calculateCovarianceConstants()
+
+    def pdf(self,x):
+        """
+        Returns density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert (isinstance(x, Vector)), "x must be of Vector Type"
+        return float(self.__pdf(x))
+
+    def logpdf(self,x):
+        """
+        Returns the log-density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert (isinstance(x, Vector)), "x must be of Vector Type"
+        return float(self.__logpdf(x))
+
+    def __calculateCovarianceConstants(self):
+        """
+        Calculates distribution dependent components used for the density function
+        based on scipy multivariate library
+        refer https://github.com/scipy/scipy/blob/master/scipy/stats/_multivariate.py
+        tested with precision of 9 significant digits(refer testcase)
+        """
+        try :
+            # pre-processing input parameters
+            # throws ValueError with invalid inputs
+            self.dim, self.mu, self.sigma = self.__process_parameters(None, self.mu, self.sigma)
+
+            # return the eigenvalues and eigenvectors
+            # of a Hermitian or symmetric matrix.
+            # s = eigen values
+            # u = eigen vectors
+            s, u = np.linalg.eigh(self.sigma)
+
+            # Singular values are considered to be non-zero only if
+            # they exceed a tolerance based on machine precision, matrix size, and
+            # relation to
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794779

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
+
+__all__ = ['MultivariateGaussian']
+
+
+class MultivariateGaussian():
+    """
+    This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
+    In the event that the covariance matrix is singular, the density will be computed in a
+    reduced dimensional subspace under which the distribution is supported
+    (see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]]).
+
+    mu    The mean vector of the distribution
+    sigma The covariance matrix of the distribution
+
+    >>> mu = Vectors.dense([0.0, 0.0])
+    >>> sigma = DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0])
+    >>> x = Vectors.dense([1.0, 1.0])
+    >>> m = MultivariateGaussian(mu, sigma)
+    >>> m.pdf(x)
+    0.0682586811486
+    """
+
+    def __init__(self, mu, sigma):
+        """
+        __init__(self, mu, sigma)
+
+        mu    The mean vector of the distribution
+        sigma The covariance matrix of the distribution
+
+        mu and sigma must be instances of DenseVector and DenseMatrix respectively.
+        """
+        assert isinstance(mu, DenseVector), "mu must be a DenseVector object"
+        assert isinstance(sigma, DenseMatrix), "sigma must be a DenseMatrix object"
+
+        sigma_shape = sigma.toArray().shape
+        assert sigma_shape[0] == sigma_shape[1], "Covariance matrix must be square"
+        assert sigma_shape[0] == mu.size, "Mean vector length must match covariance matrix size"
+
+        # eagerly precomputed attributes
+        self.mu = mu
+
+        # sigma is stored as a numpy.ndarray; further calculations use the ndarray only
+        self.sigma = sigma.toArray()
+
+        # attributes to be computed later
+        self.prec_U = None
+        self.log_det_cov = None
+
+        # compute distribution-dependent constants
+        self.__calculateCovarianceConstants()
+
+    def pdf(self, x):
+        """
+        Returns the density of this multivariate Gaussian at the point given by Vector x.
+        """
+        assert isinstance(x, Vector), "x must be of Vector type"
+        return float(self.__pdf(x))
+
+    def logpdf(self, x):
+        """
+        Returns the log-density of this multivariate Gaussian at the point given by Vector x.
+        """
+        assert isinstance(x, Vector), "x must be of Vector type"
+        return float(self.__logpdf(x))
+
+    def __calculateCovarianceConstants(self):
+        """
+        Calculates the distribution-dependent components used by the density function,
+        based on the SciPy multivariate library; refer to
+        https://github.com/scipy/scipy/blob/master/scipy/stats/_multivariate.py
+        (tested with a precision of 9 significant digits; refer to the test case).
+        """
+        try:
+            # pre-process input parameters; raises ValueError on invalid inputs
+            self.dim, self.mu, self.sigma = self.__process_parameters(None, self.mu, self.sigma)

+            # eigenvalues (s) and eigenvectors (u) of the symmetric matrix sigma
+            s, u = np.linalg.eigh(self.sigma)

+            # Singular values are considered to be non-zero only if
+            # they exceed a tolerance based on machine precision, matrix size, and
+            # relation to
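For reference, the degenerate-case density that the quoted `__calculateCovarianceConstants` builds toward can be sketched in plain NumPy. This is an illustrative sketch of the eigendecomposition technique only, not the PR's actual helpers (`__process_parameters` and `__pdf` belong to the diff, not this sketch); it follows the MLlib convention of normalizing by the full dimension rather than the rank, which reproduces the doctest value in the quoted docstring.

```python
import numpy as np

def mvn_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma) at x; a singular sigma is handled by a
    pseudo-inverse built from its eigendecomposition (illustrative sketch)."""
    s, u = np.linalg.eigh(sigma)                    # sigma = u @ diag(s) @ u.T
    tol = np.finfo(float).eps * s.max() * len(s)    # machine-precision tolerance
    keep = s > tol                                  # eigenvalues treated as non-zero
    log_pdet = np.log(s[keep]).sum()                # log pseudo-determinant
    prec_U = u[:, keep] / np.sqrt(s[keep])          # prec_U @ prec_U.T == pinv(sigma)
    maha = np.square((x - mu) @ prec_U).sum()       # squared Mahalanobis distance
    return -0.5 * (len(mu) * np.log(2.0 * np.pi) + log_pdet + maha)

# The doctest case from the quoted docstring: singular sigma = [[1, 1], [1, 1]]
x = np.array([1.0, 1.0])
print(np.exp(mvn_logpdf(x, np.zeros(2), np.ones((2, 2)))))  # ≈ 0.0682586811486
```

Note that SciPy itself normalizes by the rank in the degenerate case, so a SciPy-based computation of this singular example would give a different value.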
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13503 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13503 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59970/ Test PASSed.
[GitHub] spark issue #13061: [SPARK-14279] [Build] Pick the spark version from pom
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13061 Merged build finished. Test PASSed.
[GitHub] spark issue #13061: [SPARK-14279] [Build] Pick the spark version from pom
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13061 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59972/ Test PASSed.
[GitHub] spark issue #13503: [SPARK-15761] [MLlib] [PySpark] Load ipython when defaul...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13503 **[Test build #59970 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59970/consoleFull)** for PR 13503 at commit [`a051953`](https://github.com/apache/spark/commit/a05195341deb46cb2fd8521e7eb8065ca760ad7a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13061: [SPARK-14279] [Build] Pick the spark version from pom
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13061 **[Test build #59972 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59972/consoleFull)** for PR 13061 at commit [`11cef41`](https://github.com/apache/spark/commit/11cef4111fcc969cc08b3bb0e4b738f47da4aaf8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794427 --- Diff: python/pyspark/ml/stat/distribution.py ---
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65794395 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala --- @@ -296,7 +296,7 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT /** * All the attributes that are used for this plan. */ - lazy val allAttributes: Seq[Attribute] = children.flatMap(_.output) + lazy val allAttributes: AttributeSeq = children.flatMap(_.output) --- End diff -- @ericl and I found another layer of polynomial looping: in QueryPlan.cleanArgs we take every expression in the query plan and bind its references against `allAttributes`, which can be huge. If we turn this into an `AttributeSeq` once and build the map inside of that wrapper then we amortize that cost and remove this expensive loop.
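JoshRosen's suggestion (building the attribute-to-ordinal map once inside a wrapper, instead of re-scanning `allAttributes` for every expression bound) is a general memoization pattern. A minimal Python sketch of the idea follows; it is illustrative only, as the real code is Catalyst's Scala `AttributeSeq`:

```python
class AttributeSeq:
    """Wraps a sequence of attributes and lazily builds an attribute -> ordinal
    map, so repeated binding lookups cost O(1) instead of an O(n) scan each."""

    def __init__(self, attrs):
        self.attrs = list(attrs)
        self._index = None                      # built on first lookup, then reused

    def index_of(self, attr):
        if self._index is None:                 # one O(n) pass amortized over all lookups
            self._index = {a: i for i, a in enumerate(self.attrs)}
        return self._index.get(attr, -1)

# Binding m expressions against the same plan output now costs O(n + m)
# rather than O(n * m):
output = AttributeSeq(["id#1", "name#2", "age#3"])
assert output.index_of("name#2") == 1
assert output.index_of("missing#9") == -1       # unresolved attribute
```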
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794325 --- Diff: python/pyspark/ml/stat/distribution.py ---
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794293 --- Diff: python/pyspark/ml/stat/distribution.py ---
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13504 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59971/ Test PASSed.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13504 Merged build finished. Test PASSed.
[GitHub] spark issue #13504: [SPARK-15762][SQL] Cache Metadata & StructType hashCodes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13504 **[Test build #59971 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59971/consoleFull)** for PR 13504 at commit [`8061600`](https://github.com/apache/spark/commit/806160072f0744f5e700f78d5632ea237fbc6515). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794126 --- Diff: python/pyspark/ml/stat/distribution.py --- +def pdf(self,x): --- End diff -- Should we fall back to SciPy's multivariate normal if that is present?
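For comparison, the SciPy fallback MechCoder suggests would look roughly like the sketch below. It assumes SciPy is installed; whether to gate the fallback on SciPy's availability, and how to reconcile conventions in the singular case, is the open question.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
sigma = np.eye(2)
x = np.array([1.0, 1.0])

# Frozen distribution; pdf/logpdf evaluate the density at a point.
rv = multivariate_normal(mean=mu, cov=sigma)
print(rv.pdf(x))       # (2*pi)^-1 * exp(-1) ≈ 0.0585498 for the identity covariance
print(rv.logpdf(x))

# A singular covariance needs allow_singular=True; note that SciPy normalizes
# by the rank in the degenerate case, which can differ from the MLlib convention.
rv_sing = multivariate_normal(mean=mu, cov=np.ones((2, 2)), allow_singular=True)
print(rv_sing.pdf(x))
```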
[GitHub] spark issue #13472: [SPARK-15735] Allow specifying min time to run in microb...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13472 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59973/ Test FAILed.
[GitHub] spark issue #13472: [SPARK-15735] Allow specifying min time to run in microb...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13472 Merged build finished. Test FAILed.
[GitHub] spark issue #13472: [SPARK-15735] Allow specifying min time to run in microb...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13472 **[Test build #59973 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59973/consoleFull)** for PR 13472 at commit [`ead4a2c`](https://github.com/apache/spark/commit/ead4a2c9a5e9f225f475a40d0e4247c54a76830d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13505 **[Test build #59980 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59980/consoleFull)** for PR 13505 at commit [`0b412b0`](https://github.com/apache/spark/commit/0b412b0069dee64c13daa836f43013380c9aa273).
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65794056

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
+
+__all__ = ['MultivariateGaussian']
+
+
+class MultivariateGaussian():
+    """
+    This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
+    In the event that the covariance matrix is singular, the density will be computed in a
+    reduced dimensional subspace under which the distribution is supported.
+    (see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]])
+
+    mu    The mean vector of the distribution
+    sigma The covariance matrix of the distribution
+
+    >>> mu = Vectors.dense([0.0, 0.0])
+    >>> sigma = DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0])
+    >>> x = Vectors.dense([1.0, 1.0])
+    >>> m = MultivariateGaussian(mu, sigma)
+    >>> m.pdf(x)
+    0.0682586811486
+    """
+
+    def __init__(self, mu, sigma):
+        """
+        __init__(self, mu, sigma)
+
+        mu    The mean vector of the distribution
+        sigma The covariance matrix of the distribution
+
+        mu and sigma must be instances of DenseVector and DenseMatrix respectively.
+        """
+        assert (isinstance(mu, DenseVector)), "mu must be a DenseVector Object"
+        assert (isinstance(sigma, DenseMatrix)), "sigma must be a DenseMatrix Object"
+
+        sigma_shape = sigma.toArray().shape
+        assert (sigma_shape[0] == sigma_shape[1]), "Covariance matrix must be square"
+        assert (sigma_shape[0] == mu.size), "Mean vector length must match covariance matrix size"
+
+        # initialize eagerly precomputed attributes
+        self.mu = mu
+
+        # storing sigma as numpy.ndarray
+        # further calculations are done on the ndarray only
+        self.sigma = sigma.toArray()
+
+        # initialize attributes to be computed later
+        self.prec_U = None
+        self.log_det_cov = None
+
+        # compute distribution dependent constants
+        self.__calculateCovarianceConstants()
+
+    def pdf(self, x):
+        """
+        Returns density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert (isinstance(x, Vector)), "x must be of Vector Type"
+        return float(self.__pdf(x))
+
+    def logpdf(self, x):
+        """
+        Returns the log-density of this multivariate Gaussian at a point given by Vector x
+        """
+        assert (isinstance(x, Vector)), "x must be of Vector Type"
+        return float(self.__logpdf(x))
+
+    def __calculateCovarianceConstants(self):
+        """
+        Calculates distribution dependent components used for the density function,
+        based on the scipy multivariate library; refer to
+        https://github.com/scipy/scipy/blob/master/scipy/stats/_multivariate.py
+        tested with precision of 9 significant digits (refer testcase)
+        """
+        try:
+            # pre-processing input parameters
+            # throws ValueError with invalid inputs
+            self.dim, self.mu, self.sigma = self.__process_parameters(None, self.mu, self.sigma)
+
+            # return the eigenvalues and eigenvectors
+            # of a Hermitian or symmetric matrix.
+            # s = eigen values
+            # u = eigen vectors
+            s, u = np.linalg.eigh(self.sigma)
+
+            # Singular values are considered to be non-zero only if
+            # they exceed a tolerance based on machine precision, matrix size, and
+            # relation to
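The eigendecomposition approach under review follows scipy's `_multivariate.py`: eigendecompose the covariance, discard near-zero eigenvalues, and evaluate the density in the supported subspace. A minimal standalone sketch of that computation using NumPy directly (the function name and tolerance handling are illustrative assumptions, not the PySpark API under review):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density, tolerating a singular covariance.

    Sketch of the scipy-style approach discussed above: eigendecompose
    sigma, keep only eigenvalues above a machine-precision tolerance, and
    compute the density in the reduced-rank supported subspace.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    s, u = np.linalg.eigh(sigma)              # eigenvalues s, eigenvectors u
    eps = s.max() * len(s) * np.finfo(float).eps
    keep = s > eps                            # non-negligible eigenvalues only
    log_det = np.sum(np.log(s[keep]))         # log pseudo-determinant
    prec_U = u[:, keep] * (1.0 / np.sqrt(s[keep]))  # whitening factor

    dev = x - mu
    maha = np.sum(np.square(dev @ prec_U))    # squared Mahalanobis distance
    rank = int(keep.sum())
    return np.exp(-0.5 * (rank * np.log(2.0 * np.pi) + log_det + maha))
```

For a non-singular identity covariance at the mean this reduces to the familiar `1 / (2 * pi)` in two dimensions, while the singular `[[1, 1], [1, 1]]` case from the doctest still yields a finite density.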
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65793951

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
[same diff hunk as in the previous comment; the lines under discussion are:]
+        sigma_shape = sigma.toArray().shape
+        assert (sigma_shape[0] == sigma_shape[1]), "Covariance matrix must be square"
+        assert (sigma_shape[0] == mu.size), "Mean vector length must match covariance matrix size"
--- End diff --

You can use the `numRows`, `numCols`
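The reviewer's point is that the matrix already carries its dimensions (the doctest constructs it as `DenseMatrix(2, 2, ...)`), so converting to a NumPy array just to read `.shape` is wasted work. A sketch of the suggested check, using a lightweight stand-in class so it runs without a Spark installation (the stand-in mimics only the `numRows`/`numCols` attributes):

```python
class DenseMatrix:
    """Stand-in for pyspark.ml.linalg.DenseMatrix (attributes only)."""
    def __init__(self, numRows, numCols, values):
        self.numRows = numRows
        self.numCols = numCols
        self.values = values

def validate(mu_size, sigma):
    # Reviewer's suggestion: read dimensions directly instead of
    # calling sigma.toArray().shape.
    assert sigma.numRows == sigma.numCols, "Covariance matrix must be square"
    assert sigma.numRows == mu_size, "Mean vector length must match covariance matrix size"

validate(2, DenseMatrix(2, 2, [1.0, 1.0, 1.0, 1.0]))  # passes silently
```

The behavior is identical to the shape-based asserts; only the dense-array materialization is avoided.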
[GitHub] spark issue #13506: [SPARK-15763][SQL] Support DELETE FILE command natively
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13506 Can one of the admins verify this patch?
[GitHub] spark pull request #13506: [SPARK-15763][SQL] Support DELETE FILE command na...
GitHub user kevinyu98 opened a pull request: https://github.com/apache/spark/pull/13506

[SPARK-15763][SQL] Support DELETE FILE command natively

## What changes were proposed in this pull request?

Hive supports these CLI commands to manage resources ([Hive Doc](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli)):

`ADD/DELETE (FILE(s)|JAR(s))`
`LIST (FILE(S) [filepath ...] | JAR(S) [jarpath ...])`

but Spark only supports two commands for now:

`ADD (FILE | JAR)`
`LIST (FILE(S) [filepath ...] | JAR(S) [jarpath ...])`

This PR adds the DELETE FILE command to Spark SQL; I will submit another PR for DELETE JAR(s).

`DELETE FILE `

## **Example:**

**DELETE FILE**

```
scala> spark.sql("add file /Users/qianyangyu/myfile.txt")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("add file /Users/qianyangyu/myfile2.txt")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("list file")
res2: org.apache.spark.sql.DataFrame = [Results: string]

scala> spark.sql("list file").show(false)
+----------------------------------+
|Results                           |
+----------------------------------+
|file:/Users/qianyangyu/myfile2.txt|
|file:/Users/qianyangyu/myfile.txt |
+----------------------------------+

scala> spark.sql("delete file /Users/qianyangyu/myfile.txt")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.sql("list file").show(false)
+----------------------------------+
|Results                           |
+----------------------------------+
|file:/Users/qianyangyu/myfile2.txt|
+----------------------------------+

scala> spark.sql("delete file /Users/qianyangyu/myfile2.txt")
res6: org.apache.spark.sql.DataFrame = []

scala> spark.sql("list file").show(false)
+-------+
|Results|
+-------+
+-------+
```

## How was this patch tested?

Add test cases in Spark-SQL SPARK-Shell and SparkContext suites.
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kevinyu98/spark spark-15763

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13506.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13506

commit 3b44c5978bd44db986621d3e8511e9165b66926b
Author: Kevin Yu  Date: 2016-04-20T18:06:30Z
adding testcase

commit 18b4a31c687b264b50aa5f5a74455956911f738a
Author: Kevin Yu  Date: 2016-04-22T21:48:00Z
Merge remote-tracking branch 'upstream/master'

commit 4f4d1c8f2801b1e662304ab2b33351173e71b427
Author: Kevin Yu  Date: 2016-04-23T16:50:19Z
Merge remote-tracking branch 'upstream/master'
get latest code from upstream

commit f5f0cbed1eb5754c04c36933b374c3b3d2ae4f4e
Author: Kevin Yu  Date: 2016-04-23T22:20:53Z
Merge remote-tracking branch 'upstream/master'
adding trim characters support

commit d8b2edbd13ee9a4f057bca7dcb0c0940e8e867b8
Author: Kevin Yu  Date: 2016-04-25T20:24:33Z
Merge remote-tracking branch 'upstream/master'
get latest code for pr12646

commit 196b6c66b0d55232f427c860c0e7c6876c216a67
Author: Kevin Yu  Date: 2016-04-25T23:45:57Z
Merge remote-tracking branch 'upstream/master'
merge latest code

commit f37a01e005f3e27ae2be056462d6eb6730933ba5
Author: Kevin Yu  Date: 2016-04-27T14:15:06Z
Merge remote-tracking branch 'upstream/master'
merge upstream/master

commit bb5b01fd3abeea1b03315eccf26762fcc23f80c0
Author: Kevin Yu  Date: 2016-04-30T23:49:31Z
Merge remote-tracking branch 'upstream/master'

commit bde5820a181cf84e0879038ad8c4cebac63c1e24
Author: Kevin Yu  Date: 2016-05-04T03:52:31Z
Merge remote-tracking branch 'upstream/master'

commit 5f7cd96d495f065cd04e8e4cc58461843e45bc8d
Author: Kevin Yu  Date: 2016-05-10T21:14:50Z
Merge remote-tracking branch 'upstream/master'

commit 893a49af0bfd153ccb59ba50b63a232660e0eada
Author: Kevin Yu  Date: 2016-05-13T18:20:39Z
Merge remote-tracking branch 'upstream/master'

commit 4bbe1fd4a3ebd50338ccbe07dc5887fe289cd53d
Author: Kevin Yu  Date: 2016-05-17T21:58:14Z
Merge remote-tracking branch 'upstream/master'

commit b2dd795e23c36cbbd022f07a10c0cf21c85eb421
Author: Kevin Yu  Date: 2016-05-18T06:37:13Z
Merge remote-tracking branch 'upstream/master'

commit 8c3e5da458dbff397ed60fcb68f2a46d87ab7ba4
Author: Kevin Yu
[GitHub] spark pull request #13248: [SPARK-15194] [ML] Add Python ML API for Multivar...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13248#discussion_r65793865

--- Diff: python/pyspark/ml/stat/distribution.py ---
@@ -0,0 +1,267 @@
[same diff hunk as in the earlier comments; the lines under discussion are:]
+from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
+import numpy as np
--- End diff --

This import should be moved above.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13505 **[Test build #59976 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59976/consoleFull)** for PR 13505 at commit [`38e8a99`](https://github.com/apache/spark/commit/38e8a9935e3ae0a166f7bcd3231bef50ef7ec71b).
[GitHub] spark issue #13403: [SPARK-15660][CORE] RDD and Dataset should show the cons...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13403 **[Test build #59979 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59979/consoleFull)** for PR 13403 at commit [`801`](https://github.com/apache/spark/commit/801244d24b8b8f250dac34b21b1ea3245981).
[GitHub] spark issue #13436: [SPARK-15696][SQL] Improve `crosstab` to have a consiste...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13436 **[Test build #59978 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59978/consoleFull)** for PR 13436 at commit [`ec49b37`](https://github.com/apache/spark/commit/ec49b375c7f12d8a5f32c50d67291557ba262a5f).
[GitHub] spark issue #13486: [SPARK-15743][SQL] Prevent saving with all-column partit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13486 **[Test build #59977 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59977/consoleFull)** for PR 13486 at commit [`3fd3512`](https://github.com/apache/spark/commit/3fd351292a80f9d3dc59df68bc958439a8381424).
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65793015

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala ---
@@ -84,17 +84,27 @@ object BindReferences extends Logging {
 expression: A,
 input: Seq[Attribute],
--- End diff --

I think we should add an overload which takes a sequence of expressions and binds all of their references. We should then replace the call sites in the various projection operators.
[GitHub] spark pull request #12938: [SPARK-15162][SPARK-15164][PySpark][DOCS][ML] upd...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/12938#discussion_r65792982

--- Diff: python/pyspark/ml/classification.py ---
@@ -183,7 +191,7 @@ def getThresholds(self):
 If :py:attr:`thresholds` is set, return its value. Otherwise, if
 :py:attr:`threshold` is set, return the equivalent thresholds for binary
 classification: (1-threshold, threshold).
-If neither are set, throw an error.
--- End diff --

I'd say yeah - cc @yanboliang thoughts?
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13505 **[Test build #59975 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59975/consoleFull)** for PR 13505 at commit [`6216e94`](https://github.com/apache/spark/commit/6216e944a1aef8dc67a446654c0799c8e9920b4f).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13505 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59975/ Test FAILed.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13505 Merged build finished. Test FAILed.
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65792855

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala ---
@@ -84,17 +84,27 @@ object BindReferences extends Logging {
 expression: A,
 input: Seq[Attribute],
--- End diff --

Actually, yeah: in `GenerateMutableProjection` we use the same InputSchema for every expression.
[GitHub] spark issue #13505: [SPARK-15764][SQL] Replace N^2 loop in BindReferences
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13505 **[Test build #59975 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59975/consoleFull)** for PR 13505 at commit [`6216e94`](https://github.com/apache/spark/commit/6216e944a1aef8dc67a446654c0799c8e9920b4f).
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65792692

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala ---
@@ -84,17 +84,27 @@ object BindReferences extends Logging {
 expression: A,
 input: Seq[Attribute],
--- End diff --

I wonder whether we can push the map construction up one level so that we can amortize its cost across multiple `bindReference` calls.
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/13505

[SPARK-15764][SQL] Replace N^2 loop in BindReferences

BindReferences contains an N^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because `input` can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n).

Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups. Perf. benchmarks to follow. /cc @ericl

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JoshRosen/spark bind-references-improvement

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13505.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13505

commit 6216e944a1aef8dc67a446654c0799c8e9920b4f
Author: Josh Rosen  Date: 2016-06-04T00:12:16Z
Replace N^2 loop in BindReferences.
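The core idea of the PR — build a map from expression id to ordinal once, then answer every reference lookup in O(1) instead of scanning `input` — can be sketched as follows. The actual patch is Scala in `BoundAttribute.scala`; this is an illustrative Python sketch with hypothetical names, not the Spark API:

```python
from collections import namedtuple

# Stand-in for a Catalyst attribute: an expression id plus nullability.
Attr = namedtuple("Attr", ["exprId", "nullable"])

def build_ordinal_map(input_attrs):
    """One up-front O(n) pass over the input schema.

    The construction cost is amortized across every subsequent
    bind_reference call, since one expression can contain many
    attribute references.
    """
    return {a.exprId: (i, a) for i, a in enumerate(input_attrs)}

def bind_reference(expr_id, ordinal_map):
    """O(1) lookup replacing the per-reference linear scan over `input`.

    Returns the ordinal and the attribute's nullability, mirroring the
    two pieces of information the original O(n) loop had to recover.
    """
    ordinal, attr = ordinal_map[expr_id]
    return ordinal, attr.nullable
```

Binding k references against a schema of n attributes then costs O(n + k) rather than O(n * k), which is exactly the amortization argument made in the PR description.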