[GitHub] spark issue #13762: [SPARK-14926] [ML] OneVsRest labelMetadata uses incorrec...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/13762 I think we should close this PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15052 I get that, but if it's always true, then there was no problem to begin with. That's what the code seems to think right now. I haven't looked at the code much but that's the question -- are you sure the files are non-empty in some scenario?
[GitHub] spark issue #14638: [SPARK-11374][SQL] Support `skip.header.line.count` opti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14638 **[Test build #65251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65251/consoleFull)** for PR 14638 at commit [`257708a`](https://github.com/apache/spark/commit/257708af5a5c449781f57fe43d44df656eb75ba9).
[GitHub] spark issue #14623: [SPARK-17044][SQL] Make test files for window functions ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14623 **[Test build #65252 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65252/consoleFull)** for PR 14623 at commit [`04fe12d`](https://github.com/apache/spark/commit/04fe12dcc7749dc97a13683440a437ced5a71345).
[GitHub] spark issue #14527: [SPARK-16938][SQL] `drop/dropDuplicate` should handle th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14527 **[Test build #65248 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65248/consoleFull)** for PR 14527 at commit [`67ea924`](https://github.com/apache/spark/commit/67ea92448463c559ef9650eead719f69b5f3b51b).
[GitHub] spark issue #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final ...
Github user a-roberts commented on the issue: https://github.com/apache/spark/pull/14961 Sean, yep, I've had trouble reproducing it too; I kicked off a bunch of builds over the weekend, including one using Hadoop 2.3, which was my initial theory (the only difference between our testing environments apart from the options I mention below). I'll add ``` static { System.setProperty("io.netty.recycler.maxCapacity", "0"); } ``` in TransportConf, then build and test locally before updating this. FWIW, I use these Java options for testing as our boxes have limited memory: -Xss2048k **-Dspark.buffer.pageSize=1048576** -Xmx4g
[GitHub] spark issue #14116: [SPARK-16452][SQL] Support basic INFORMATION_SCHEMA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14116 **[Test build #65250 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65250/consoleFull)** for PR 14116 at commit [`c531025`](https://github.com/apache/spark/commit/c5310252d891486c6380a015ae43050a43faae61).
[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14426 **[Test build #65249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65249/consoleFull)** for PR 14426 at commit [`47d98e7`](https://github.com/apache/spark/commit/47d98e7e4f6e8985fd3b543d5e44d30ee381f604).
[GitHub] spark issue #15046: [SPARK-17492] [SQL] Fix Reading Cataloged Data Sources w...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15046 ah good catch! But adding a new flag looks a little tricky; let me think if there is a better way to fix it
[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/15052 @srowen No. It does not matter whether the file is empty or not: if the file is empty, `getsize()` just returns 0, and this should be OK.
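For reference, the behavior djvulee is relying on here is standard: `os.path.getsize()` simply returns 0 for an empty file, so summing spill-file sizes is safe even when a spill file happens to be empty. A minimal standalone check (file names here are illustrative, not Spark's):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # An empty spill file: getsize() reports 0 bytes, no error.
    empty = os.path.join(d, "spill_empty")
    open(empty, "w").close()

    # A non-empty spill file: getsize() reports the written length.
    nonempty = os.path.join(d, "spill_data")
    with open(nonempty, "w") as f:
        f.write("x" * 1024)

    print(os.path.getsize(empty))     # 0
    print(os.path.getsize(nonempty))  # 1024
```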
[GitHub] spark issue #15023: Backport [SPARK-5847] Allow for configuring MetricsSyste...
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/15023 I'm aware that features are not generally back-ported. The point is, for us this is a bug, preventing a deployment in production. We thus back-ported the fix internally and now propose to share it with the community as the work is already done.
[GitHub] spark issue #15056: [SPARK-17503][Core] Fix memory leak in Memory store when...
Github user clockfly commented on the issue: https://github.com/apache/spark/pull/15056 @yhuai
[GitHub] spark issue #15040: [WIP] [SPARK-17487] [SQL] Configurable bucketing info ex...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15040 `BucketingInfoExtractor` may be too flexible a concept; we only need a boolean flag to indicate whether it's Spark native bucketing or Hive bucketing, and I'm not sure how soon we will need to support bucketed tables from other systems.
[GitHub] spark issue #14995: [Test Only][SPARK-6235][CORE]Address various 2G limits
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14995 **[Test build #65247 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65247/consoleFull)** for PR 14995 at commit [`b31fbcd`](https://github.com/apache/spark/commit/b31fbcdb7c93b1badb0e67a509dee17445659b16).
[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15052 Is the idea that the file may be non-empty when written? There is at least one more instance of this call, but maybe the file is known to be empty before.
[GitHub] spark pull request #15047: [SPARK-17495] [SQL] Add Hash capability semantica...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15047#discussion_r78331887

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveHash.scala ---
```
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.Platform
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Simulates Hive's hashing function at
+ * org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils#hashcode() in Hive
+ *
+ * We should use this hash function for both shuffle and bucket of Hive tables, so that
+ * we can guarantee shuffle and bucketing have same data distribution
+ *
+ * TODO: Support Decimal and date related types
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(a1, a2, ...) - Returns a hash value of the arguments.")
+case class HiveHash(children: Seq[Expression], seed: Int) extends HashExpression[Int] {
+  def this(arguments: Seq[Expression]) = this(arguments, 42)
+
+  override def dataType: DataType = IntegerType
+
+  override def prettyName: String = "hive-hash"
+
+  override protected def hasherClassName: String = classOf[HiveHash].getName
+
+  override protected def computeHash(value: Any, dataType: DataType, seed: Int): Int = {
+    HiveHashFunction.hash(value, dataType, seed).toInt
+  }
+}
+
+object HiveHashFunction extends InterpretedHashFunction {
+  override protected def hashInt(i: Int, seed: Long): Long = {
+    HiveHasher.hashInt(i, seed.toInt)
+  }
+
+  override protected def hashLong(l: Long, seed: Long): Long = {
+    HiveHasher.hashLong(l, seed.toInt)
+  }
+
+  override protected def hashUnsafeBytes(base: AnyRef, offset: Long, len: Int, seed: Long): Long = {
+    HiveHasher.hashUnsafeBytes(base, offset, len, seed.toInt)
+  }
+
+  override def hash(value: Any, dataType: DataType, seed: Long): Long = {
+    value match {
+      case s: UTF8String =>
+        val bytes = s.getBytes
+        var result: Int = 0
+        var i = 0
+        while (i < bytes.length) {
+          result = (result * 31) + bytes(i).toInt
+          i += 1
+        }
+        result
+
+      case array: ArrayData =>
+        val elementType = dataType match {
+          case udt: UserDefinedType[_] => udt.sqlType.asInstanceOf[ArrayType].elementType
```
--- End diff --

the caller of `hash` guarantees the value matches the data type. So in this branch, if the value is `ArrayData`, the data type must be `ArrayType` or a UDT of `ArrayType`
[GitHub] spark pull request #15048: [SPARK-17409] [SQL] Do Not Optimize Query in CTAS...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15048#discussion_r78331097 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala --- @@ -68,7 +68,7 @@ class ResolveDataSource(sparkSession: SparkSession) extends Rule[LogicalPlan] { /** * Preprocess some DDL plans, e.g. [[CreateTable]], to do some normalization and checking. --- End diff -- we should update the comment to say that this rule will also analyze the query (we may also want to update the rule name)
[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/15052 @srowen I updated the PR to use an incremental approach for updating the DiskBytesSpilled metric.
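The incremental approach djvulee describes can be sketched as follows. This is an illustrative toy, not Spark's actual internals: the class and method names are hypothetical, but the core idea is the one-character fix of accumulating spill sizes into the metric (`+=`) instead of overwriting it (`=`), so bytes from earlier spill files are not lost.

```python
import os
import tempfile

class Shuffler:
    """Hypothetical stand-in for a PySpark shuffler that spills to disk."""

    def __init__(self):
        self.disk_bytes_spilled = 0

    def spill(self, path, data):
        with open(path, "wb") as f:
            f.write(data)
        # Increment, rather than assign, so every spill file is counted,
        # not just the most recent one.
        self.disk_bytes_spilled += os.path.getsize(path)

with tempfile.TemporaryDirectory() as d:
    s = Shuffler()
    s.spill(os.path.join(d, "spill0"), b"a" * 100)
    s.spill(os.path.join(d, "spill1"), b"b" * 50)
    print(s.disk_bytes_spilled)  # 150
```

With plain assignment the metric would end up at 50 here, silently dropping the first spill.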
[GitHub] spark pull request #14988: [SPARK-17425][SQL] Override sameResult in HiveTab...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14988#discussion_r78330099

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala ---
```
@@ -164,4 +164,28 @@ case class HiveTableScanExec(
   }
 
   override def output: Seq[Attribute] = attributes
+
+  override def sameResult(plan: SparkPlan): Boolean = plan match {
+    case other: HiveTableScanExec =>
+      val thisRequestedAttributes = requestedAttributes.map(cleanExpression)
+      val otherRequestedAttributes = other.requestedAttributes.map(cleanExpression)
+
+      val result = partitionPruningPred == other.partitionPruningPred &&
+        relation.sameResult(other.relation) &&
+        thisRequestedAttributes.zip(otherRequestedAttributes)
+          .forall(p => p._1.semanticEquals(p._2))
+      result
+    case _ => false
+  }
+
+  private def cleanExpression(e: Attribute): Expression = e match {
+    case a: AttributeReference =>
+      // As the root of the expression, Alias will always take an arbitrary exprId, we need
```
--- End diff --

this comment doesn't match the code. Can you explain more about why the default `cleanExpression` doesn't work?
[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/15052 @srowen you are right, I will correct it soon.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13513 **[Test build #65246 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65246/consoleFull)** for PR 13513 at commit [`31340b5`](https://github.com/apache/spark/commit/31340b58ffa7c46c2d9666569d5694bb23cc6144).
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13513 **[Test build #65245 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65245/consoleFull)** for PR 13513 at commit [`f179349`](https://github.com/apache/spark/commit/f1793498a9625dc8d31039cd8e9a684611dddf23).
[GitHub] spark issue #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spark vers...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15055 Merged build finished. Test PASSed.
[GitHub] spark issue #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spark vers...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15055 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65241/ Test PASSed.
[GitHub] spark issue #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spark vers...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15055 **[Test build #65241 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65241/consoleFull)** for PR 15055 at commit [`72a87b0`](https://github.com/apache/spark/commit/72a87b0f8a386ff53903620c3cddee6744c73a96). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/13513 @zsxwing, thanks a lot for your comments. I did several refactorings: 1. Abstracted and consolidated `FileStreamSinkLog` and `FileStreamSourceLog`; they now share the same code path for compaction. 2. Changed `FileStreamSourceLog` to use JSON format instead of binary encoding, to add compatibility and flexibility for future extension. 3. Improved the logic for fetching all metadata logs: now, if a compact log exists, only the compact log is scanned. Please help to review again, thanks a lot.
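Point 3 above can be sketched with a small toy, which is illustrative only and not Spark's actual code: once a compacted batch file is present, a reader only needs the latest compact file plus the batches after it, instead of scanning every batch file. The file-naming convention here (`N` for a batch, `N.compact` for a compacted batch) is an assumption for the sketch.

```python
def files_to_scan(batch_files):
    """Given ordered metadata-log file names like ['0', '1', '2.compact', '3'],
    return the minimal set a reader must scan."""
    last_compact = None
    for i, name in enumerate(batch_files):
        if name.endswith(".compact"):
            last_compact = i
    if last_compact is None:
        return batch_files             # no compaction yet: scan everything
    # The compact file already folds in all earlier batches,
    # so only it and the later batches are needed.
    return batch_files[last_compact:]

print(files_to_scan(["0", "1", "2.compact", "3"]))  # ['2.compact', '3']
```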
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13513 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65244/ Test FAILed.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13513 **[Test build #65244 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65244/consoleFull)** for PR 13513 at commit [`c2aad87`](https://github.com/apache/spark/commit/c2aad87ba012c41a0f4ef6290401e6789f2c9ed6). * This patch **fails to build**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class FileEntry(path: String, timestamp: Timestamp, action: String = ADD_ACTION)` * ` class FileStreamSourceLog(sparkSession: SparkSession, path: String)`
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13513 Merged build finished. Test FAILed.
[GitHub] spark issue #15053: [Doc] improve python API docstrings
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15053 Merged build finished. Test FAILed.
[GitHub] spark issue #15053: [Doc] improve python API docstrings
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15053 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65242/ Test FAILed.
[GitHub] spark issue #15053: [Doc] improve python API docstrings
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15053 **[Test build #65242 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65242/consoleFull)** for PR 15053 at commit [`52240bc`](https://github.com/apache/spark/commit/52240bcf8df42dd454e874ce7640d7040c5cdad9). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13513 **[Test build #65244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65244/consoleFull)** for PR 13513 at commit [`c2aad87`](https://github.com/apache/spark/commit/c2aad87ba012c41a0f4ef6290401e6789f2c9ed6).
[GitHub] spark issue #15056: [SPARK-17503][Core] Fix memory leak in Memory store when...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15056 **[Test build #65243 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65243/consoleFull)** for PR 15056 at commit [`a9a4a8b`](https://github.com/apache/spark/commit/a9a4a8b23afc64d7e2d7426b92013442308a8ea3).
[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/12819 @zhengruifeng I saw that your implementation switches the training process from an RDD operation to a Dataset operation with a UDAF. I think we should do some performance testing to verify that there is no performance degradation. Otherwise, we can still use the RDD operation in this PR and update it to use Datasets later. Thanks!
[GitHub] spark pull request #15056: [SPARK-17503][Core] Fix memory leak in Memory sto...
GitHub user clockfly opened a pull request: https://github.com/apache/spark/pull/15056 [SPARK-17503][Core] Fix memory leak in Memory store when unable to cache the whole RDD

## What changes were proposed in this pull request?

The memory store may throw an OutOfMemoryError when trying to cache a super big RDD that cannot fit in memory.

```
scala> sc.parallelize(1 to 1000, 5).map(_ => new Array[Long](1000)).cache().count
java.lang.OutOfMemoryError: Java heap space
  at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24)
  at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
  at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
  at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
  at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
```

Spark's MemoryStore uses a SizeTrackingVector as a temporary unrolling buffer to store all input values it has read so far before transferring the values to the cache. The problem is that when the input RDD is too big for caching, the temporary unrolling buffer (the SizeTrackingVector) is not garbage collected in time.
As the SizeTrackingVector can occupy all available storage memory, it may cause the executor JVM to run out of memory quickly.

## How was this patch tested?

Unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/clockfly/spark memory_store_leak

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15056.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15056

commit a9a4a8b23afc64d7e2d7426b92013442308a8ea3
Author: Sean Zhong
Date: 2016-09-12T07:12:48Z

SPARK-17503: Fix memory leak in Memory store when unable to cache the whole RDD
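Conceptually, the fix is to make the consuming iterator drop its reference to the unrolling buffer as soon as the buffered portion has been read, so the buffer becomes eligible for garbage collection while the rest of the partition is still being consumed. A minimal plain-Scala sketch of that idea (illustrative names, not Spark's actual `PartiallyUnrolledIterator` implementation):

```scala
// Sketch: an iterator over a partially-unrolled partition. `unrolled` is the
// in-memory buffer that failed to fit; `rest` is the remainder of the input.
// Once `unrolled` is exhausted, we null out the reference so the buffer can
// be garbage collected instead of pinning storage memory for the whole scan.
class BufferReleasingIterator[T](unrolled: Iterator[T], rest: Iterator[T]) extends Iterator[T] {
  // Held in a var so the reference can be cleared after exhaustion.
  private var unrolledIter: Iterator[T] = unrolled

  override def hasNext: Boolean = {
    if (unrolledIter != null && unrolledIter.hasNext) {
      true
    } else {
      // Drop the reference: the unrolling buffer is now collectible.
      unrolledIter = null
      rest.hasNext
    }
  }

  override def next(): T =
    if (unrolledIter != null && unrolledIter.hasNext) unrolledIter.next() else rest.next()
}
```

The iterator behaves like a plain concatenation of the two inputs; the only difference from `Iterator.++` is the explicit release of the buffer reference.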
[GitHub] spark issue #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Statements ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15054 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65240/ Test FAILed.
[GitHub] spark issue #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Statements ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15054 Merged build finished. Test FAILed.
[GitHub] spark issue #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Statements ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15054 **[Test build #65240 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65240/consoleFull)** for PR 15054 at commit [`cc47c3e`](https://github.com/apache/spark/commit/cc47c3eb07a7628faa4277dd87c2837d01c4f175).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/12819#discussion_r78325688

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---

```diff
@@ -109,10 +120,51 @@ class NaiveBayes @Since("1.5.0") (
         s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
     }
-    val oldDataset: RDD[OldLabeledPoint] =
-      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
-    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
-    NaiveBayesModel.fromOld(oldModel, this)
+    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
+
+    val wvsum = new WeightedVectorSum($(modelType), numFeatures)
+
+    val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol))
+
+    val aggregated =
+      dataset.select(col($(labelCol)).cast(DoubleType).as("label"), w.as("weight"),
+        col($(featuresCol)).as("features"))
+        .groupBy(col($(labelCol)))
+        .agg(sum(col("weight")), wvsum(col("weight"), col("features")))
+        .collect().map { row =>
+          (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2).toDense))
+        }.sortBy(_._1)
```

--- End diff --

Do you have any performance tests for switching to the Dataset-based operation?
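The aggregation in the hunk above can be sketched in plain Scala (no Spark) to show what is computed per label: the sum of instance weights and the sum of weighted feature vectors, sorted by label. Names and types here are illustrative, not the PR's actual code:

```scala
// Sketch of the per-label aggregation behind the Dataset/UDAF code:
// for each label, sum the weights and the weight-scaled feature vectors.
def aggregateByLabel(
    instances: Seq[(Double, Double, Array[Double])] // (label, weight, features)
): Seq[(Double, (Double, Array[Double]))] = {
  instances
    .groupBy(_._1)
    .map { case (label, rows) =>
      val weightSum = rows.map(_._2).sum
      // Scale each feature vector by its weight, then sum element-wise.
      val featureSum = rows
        .map { case (_, w, f) => f.map(_ * w) }
        .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      (label, (weightSum, featureSum))
    }
    .toSeq
    .sortBy(_._1)
}
```

A Dataset `groupBy`/`agg` performs the same computation distributed across partitions, which is why the review asks whether the switch costs anything relative to the old RDD path.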
[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/12819#discussion_r78325579

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---

```diff
@@ -98,7 +99,17 @@ class NaiveBayes @Since("1.5.0") (
    */
   @Since("1.5.0")
   def setModelType(value: String): this.type = set(modelType, value)
-  setDefault(modelType -> OldNaiveBayes.Multinomial)
+  setDefault(modelType -> NaiveBayes.Multinomial)
+
+  /**
+   * Whether to over-/under-sample training instances according to the given weights in weightCol.
+   * If empty, all instances are treated equally (weight 1.0).
+   * Default is empty, so all instances have weight one.
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+  setDefault(weightCol -> "")
```

--- End diff --

It's not necessary to set a default for ```weightCol```. You can refer to the other places where ```weightCol``` is used.
[GitHub] spark issue #15000: [SPARK-17437] Add uiWebUrl to JavaSparkContext and pyspa...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15000 My only hesitation about this is that this property really only exists to print it in the shell. Is there a good use case for it otherwise? I know it's minor but want to make sure we're not just doing this for parity.
[GitHub] spark issue #15053: [Doc] improve python API docstrings
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15053 **[Test build #65242 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65242/consoleFull)** for PR 15053 at commit [`52240bc`](https://github.com/apache/spark/commit/52240bcf8df42dd454e874ce7640d7040c5cdad9).
[GitHub] spark issue #15053: [Doc] improve python API docstrings
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15053 Jenkins test this please
[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15052 Given how DiskBytesSpilled is used, and still used in other parts of the code, this doesn't look correct. It seems to be a global that is always incremented. Here you reset the value in certain cases, effectively.
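The concern above is that a metric meant to be cumulative must be incremented by each spill's delta, never overwritten with the latest value. A tiny Scala sketch of the distinction, with illustrative names (not Spark's actual PySpark shuffle code):

```scala
// Sketch of a cumulative spill metric. The metric is a process-wide total,
// so every spill event must add to it.
object Metrics {
  var diskBytesSpilled: Long = 0L
}

def recordSpill(spilledFileBytes: Long): Unit = {
  // Correct: accumulate across spills.
  Metrics.diskBytesSpilled += spilledFileBytes
  // Incorrect (the pattern the review warns against): overwriting the global
  // with the size of the most recent spill file, e.g.
  //   Metrics.diskBytesSpilled = spilledFileBytes
  // which silently discards all earlier spills.
}
```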
[GitHub] spark issue #15011: [SPARK-17122][SQL]support drop current database
Github user adrian-wang commented on the issue: https://github.com/apache/spark/pull/15011 @hvanhovell I have checked with Hive and MySQL; they both support dropping the current database. Asking the user to switch to another database before dropping the current one is not enough: if multiple users are connected to the same metastore, someone else may be using a database even when you are not. What's more, if you want to drop a database but lack the privilege to access databases created by other users, you will always leave one empty database behind. In Spark's implementation, we ensure the database exists before we do anything, so dropping the current database is OK. This is also the approach other systems adopt.
[GitHub] spark issue #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spark vers...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15055 **[Test build #65241 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65241/consoleFull)** for PR 15055 at commit [`72a87b0`](https://github.com/apache/spark/commit/72a87b0f8a386ff53903620c3cddee6744c73a96).
[GitHub] spark pull request #15055: [SPARK-17462][MLLIB]use VersionUtils to parse Spa...
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/15055 [SPARK-17462][MLLIB] Use VersionUtils to parse Spark version strings

## What changes were proposed in this pull request?

Several places in MLlib use custom regexes or other approaches to parse Spark versions; those should be fixed to use VersionUtils. This PR replaces the custom regexes with VersionUtils to get Spark version numbers.

## How was this patch tested?

Existing tests.

Signed-off-by: VinceShieh

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VinceShieh/spark SPARK-17462

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15055.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15055

commit 72a87b0f8a386ff53903620c3cddee6744c73a96
Author: VinceShieh
Date: 2016-09-12T06:37:42Z

[SPARK-17462][MLLIB] Use VersionUtils to parse Spark version strings

Several places in MLlib use custom regexes or other approaches; those should be fixed to use VersionUtils.

Signed-off-by: VinceShieh
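For context, the kind of major/minor extraction that VersionUtils centralizes looks roughly like the sketch below. This is an illustrative reimplementation, not Spark's actual `VersionUtils`:

```scala
// Sketch: parse "major.minor" out of a Spark version string such as
// "2.0.1" or "2.1.0-SNAPSHOT". Centralizing this avoids each MLlib
// persistence path hand-rolling its own regex.
def majorMinorVersion(sparkVersion: String): (Int, Int) = {
  val pattern = """^(\d+)\.(\d+)(\..*)?$""".r
  sparkVersion match {
    case pattern(major, minor, _) => (major.toInt, minor.toInt)
    case _ =>
      throw new IllegalArgumentException(s"Cannot parse Spark version: $sparkVersion")
  }
}
```

Callers that only need to branch on the release line (e.g. "saved before 2.0") can then compare the returned tuple instead of pattern-matching the raw string.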
[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78321887

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---

```diff
@@ -460,33 +577,74 @@ class LogisticRegression @Since("1.2.0") (
            as a result, no scaling is needed.
          */
         val rawCoefficients = state.x.toArray.clone()
-        var i = 0
-        while (i < numFeatures) {
-          rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 0.0 }
-          i += 1
-        }
+        val coefficientArray = Array.tabulate(numCoefficientSets * numFeatures) { i =>
+          // flatIndex will loop though rawCoefficients, and skip the intercept terms.
+          val flatIndex = if ($(fitIntercept)) i + i / numFeatures else i
+          val featureIndex = i % numFeatures
+          if (featuresStd(featureIndex) != 0.0) {
+            rawCoefficients(flatIndex) / featuresStd(featureIndex)
+          } else {
+            0.0
+          }
+        }
+        val coefficientMatrix =
+          new DenseMatrix(numCoefficientSets, numFeatures, coefficientArray, isTransposed = true)
+
+        if ($(regParam) == 0.0 && isMultinomial) {
+          /*
+            When no regularization is applied, the coefficients lack identifiability because
+            we do not use a pivot class. We can add any constant value to the coefficients and
+            get the same likelihood. So here, we choose the mean centered coefficients for
+            reproducibility. This method follows the approach in glmnet, described here:
+
+            Friedman, et al. "Regularization Paths for Generalized Linear Models via
+              Coordinate Descent," https://core.ac.uk/download/files/153/6287975.pdf
+           */
+          val coefficientMean = coefficientMatrix.values.sum / coefficientMatrix.values.length
+          coefficientMatrix.update(_ - coefficientMean)
+        }
-        bcFeaturesStd.destroy(blocking = false)
-        if ($(fitIntercept)) {
-          (Vectors.dense(rawCoefficients.dropRight(1)).compressed, rawCoefficients.last,
-            arrayBuilder.result())
+        val interceptsArray: Array[Double] = if ($(fitIntercept)) {
+          Array.tabulate(numCoefficientSets) { i =>
+            val coefIndex = (i + 1) * numFeaturesPlusIntercept - 1
+            rawCoefficients(coefIndex)
+          }
+        } else {
+          Array[Double]()
+        }
+        /*
+          The intercepts are never regularized, so we always center the mean.
+         */
+        val interceptVector = if (interceptsArray.nonEmpty && isMultinomial) {
+          val interceptMean = interceptsArray.sum / numClasses
+          interceptsArray.indices.foreach { i => interceptsArray(i) -= interceptMean }
+          Vectors.dense(interceptsArray)
+        } else if (interceptsArray.length == 1) {
+          Vectors.dense(interceptsArray)
         } else {
-          (Vectors.dense(rawCoefficients).compressed, 0.0, arrayBuilder.result())
+          Vectors.sparse(numCoefficientSets, Seq())
         }
+        (coefficientMatrix, interceptVector, arrayBuilder.result())
```

--- End diff --

Should we implement `coefficientMatrix.compressed` before merging this? Otherwise, this could be a regression.
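The mean-centering step discussed in this hunk is simple to state on its own: when no pivot class and no regularization are used, adding any constant to all coefficients leaves the likelihood unchanged, so the mean-centered solution is chosen for reproducibility (following glmnet). A plain-Scala sketch of just that step:

```scala
// Sketch: center coefficient values around zero by subtracting their mean.
// With an unregularized multinomial model and no pivot class, this picks one
// representative from the family of equivalent solutions.
def centerCoefficients(values: Array[Double]): Array[Double] = {
  val mean = values.sum / values.length
  values.map(_ - mean)
}
```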
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13758 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65239/ Test FAILed.
[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78321247

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---

```diff
@@ -323,32 +382,33 @@ class LogisticRegression @Since("1.2.0") (
     instr.logNumClasses(numClasses)
     instr.logNumFeatures(numFeatures)
 
-    val (coefficients, intercept, objectiveHistory) = {
+    val (coefficientMatrix, interceptVector, objectiveHistory) = {
       if (numInvalid != 0) {
         val msg = s"Classification labels should be in [0 to ${numClasses - 1}]. " +
           s"Found $numInvalid invalid labels."
         logError(msg)
         throw new SparkException(msg)
       }
 
-      val isConstantLabel = histogram.count(_ != 0) == 1
+      val isConstantLabel = histogram.count(_ != 0.0) == 1
 
-      if (numClasses > 2) {
-        val msg = s"LogisticRegression with ElasticNet in ML package only supports " +
-          s"binary classification. Found $numClasses in the input dataset. Consider using " +
-          s"MultinomialLogisticRegression instead."
-        logError(msg)
-        throw new SparkException(msg)
-      } else if ($(fitIntercept) && numClasses == 2 && isConstantLabel) {
-        logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
-          s"zeros and the intercept will be positive infinity; as a result, " +
-          s"training is not needed.")
-        (Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
-      } else if ($(fitIntercept) && numClasses == 1) {
-        logWarning(s"All labels are zero and fitIntercept=true, so the coefficients will be " +
-          s"zeros and the intercept will be negative infinity; as a result, " +
-          s"training is not needed.")
-        (Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+      if ($(fitIntercept) && isConstantLabel) {
+        logWarning(s"All labels are the same value and fitIntercept=true, so the coefficients " +
+          s"will be zeros. Training is not needed.")
+        val constantLabelIndex = Vectors.dense(histogram).argmax
+        val coefMatrix = if (numFeatures < numCoefficientSets) {
```

--- End diff --

Add a comment on this, with a TODO to consolidate the sparse-matrix logic.
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13758 Merged build finished. Test FAILed.
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13758 **[Test build #65239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65239/consoleFull)** for PR 13758 at commit [`deb363a`](https://github.com/apache/spark/commit/deb363afba6b8b3d2bd82b230ec132eb637c43c6).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class GenericArrayData(val array: Array[Any],`
[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78321146

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---

```diff
@@ -311,8 +350,28 @@ class LogisticRegression @Since("1.2.0") (
     val histogram = labelSummarizer.histogram
     val numInvalid = labelSummarizer.countInvalid
-    val numClasses = histogram.length
     val numFeatures = summarizer.mean.size
+    val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures
+
+    val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match {
+      case Some(n: Int) =>
+        require(n >= histogram.length, s"Specified number of classes $n was " +
+          s"less than the number of unique labels ${histogram.length}.")
+        n
+      case None => histogram.length
+    }
+
+    val isBinaryClassification = numClasses == 1 || numClasses == 2
+    val isMultinomial = $(family) match {
+      case "binomial" =>
+        require(isBinaryClassification, s"Binomial family only supports 1 or 2 " +
+          s"outcome classes but found $numClasses.")
+        false
+      case "multinomial" => true
+      case "auto" => !isBinaryClassification
+      case other => throw new IllegalArgumentException(s"Unsupported family: $other")
+    }
```

--- End diff --

BTW, I think `isBinaryClassification` is not needed since it is not being used at all.
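The numClasses resolution in the hunk above, extracted into a standalone sketch (illustrative names, not the PR's code): prefer the class count from label metadata when present, requiring it to cover every label actually observed; otherwise fall back to the histogram length.

```scala
// Sketch: resolve the number of classes from optional label metadata,
// validated against the number of distinct labels seen in the data.
def resolveNumClasses(metadataNumClasses: Option[Int], histogramLength: Int): Int =
  metadataNumClasses match {
    case Some(n) =>
      require(n >= histogramLength,
        s"Specified number of classes $n was less than the number of unique labels $histogramLength.")
      n
    case None => histogramLength
  }
```

This matters for the binomial/multinomial dispatch that follows: metadata can declare more classes than appear in a given sample, and the declared count is what the family check should see.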
[GitHub] spark issue #11729: [SPARK-13073] [MLib] [WIP] creating R like summary for l...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/11729 gentle ping @mbaddar1
[GitHub] spark issue #11079: [SPARK-13197][SQL] When trying to select from the data f...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/11079 +1 for not a problem.
[GitHub] spark issue #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Statements ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15054 **[Test build #65240 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65240/consoleFull)** for PR 15054 at commit [`cc47c3e`](https://github.com/apache/spark/commit/cc47c3eb07a7628faa4277dd87c2837d01c4f175).
[GitHub] spark pull request #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Stat...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/15054 [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Statements on Temporary Views [WIP]

### What changes were proposed in this pull request?

- When the permanent tables/views do not exist but the temporary view exists, the expected error for partition-related ALTER TABLE commands should be `NoSuchTableException`. However, they always report a confusing error message. For example,
```
Partition spec is invalid. The spec (a, b) must match the partition spec () defined in table '`testview`';
```
- When the permanent tables/views do not exist but the temporary view exists, the expected error for `ALTER TABLE ... UNSET TBLPROPERTIES` should also be `NoSuchTableException`. However, it reports a missing table property instead. For example,
```
Attempted to unset non-existent property 'p' in table '`testView`';
```

TODO: Add more test cases for DDL processing over temporary views.

### How was this patch tested?

Added multiple test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark tempViewDDL

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15054.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15054

commit cc47c3eb07a7628faa4277dd87c2837d01c4f175
Author: gatorsmile
Date: 2016-09-12T06:08:17Z

fix
[GitHub] spark issue #15020: Spark 2.0 error in Intellij
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15020 ping @bigdatatraining