[GitHub] spark issue #16785: [SPARK-19443][SQL] The function to generate constraints ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16785 **[Test build #72303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72303/testReport)** for PR 16785 at commit [`b4e514a`](https://github.com/apache/spark/commit/b4e514ade7ea478055db448bbf66f7a88caf3a86). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16775 For the issue reported on the mailing list, I found the root cause of the significant difference between 1.6 and the current branch. The fix is in #16785. However, I think this patch is still useful, so I will keep it open for a while for reviewers.
[GitHub] spark pull request #16785: [SPARK-19443][SQL] The function to generate const...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/16785

[SPARK-19443][SQL] The function to generate constraints takes too long when the query plan grows continuously

## What changes were proposed in this pull request?

This issue was originally reported and discussed at http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tc20803.html#a20821

When running an ML `Pipeline` with many stages, the `Dataset` is updated iteratively, and fit and transform take longer to finish as the query plan grows continuously. The benchmark below reproduces this. In particular, the time spent preparing the optimized plan in the current branch (74294 ms) is much higher than in 1.6 (292 ms).

Most of that time is spent generating the query plan's constraints during a few optimization rules. `getAliasedConstraints` was found to account for most of the running time.

This patch rewrites `getAliasedConstraints`. After this patch, the time to prepare the optimized plan drops from 74294 ms to 2573 ms.

### Benchmark

Run the following code locally.

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

    val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))

    val indexers = df.columns.tail.map(c => new StringIndexer()
      .setInputCol(c)
      .setOutputCol(s"${c}_indexed")
      .setHandleInvalid("skip"))

    val encoders = indexers.map(indexer => new OneHotEncoder()
      .setInputCol(indexer.getOutputCol)
      .setOutputCol(s"${indexer.getOutputCol}_encoded")
      .setDropLast(true))

    val stages: Array[PipelineStage] = indexers ++ encoders
    val pipeline = new Pipeline().setStages(stages)

    pipeline.fit(df).transform(df).show

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 improve-constraints-generation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16785.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16785

commit b4e514ade7ea478055db448bbf66f7a88caf3a86
Author: Liang-Chi Hsieh
Date: 2017-02-03T07:08:47Z

    Improve the code to generate constraints.
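The figures quoted in this pull request are wall-clock times for preparing the optimized plan. A small helper such as the one below could be used to collect comparable numbers. This is a minimal sketch, not the code behind the reported measurements: `PlanTiming` and `timeMs` are hypothetical names, and the commented usage assumes a spark-shell session in which `df` is the benchmark DataFrame.

```scala
object PlanTiming {
  /** Runs `body` once and returns its result with the elapsed wall-clock time in milliseconds. */
  def timeMs[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}

// Assumed usage inside spark-shell (not executed here):
//   val (_, ms) = PlanTiming.timeMs(df.queryExecution.optimizedPlan)
//   println(s"optimized plan prepared in $ms ms")
```

A single run includes JVM warm-up effects, so repeated runs would give a steadier comparison between branches.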
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99286995

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---

    @@ -117,6 +134,34 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           data: _*)
       }
    +  test("coalesce, custom") {
    +
    +    val maxSplitSize = 512
    +    // Similar to the implementation of `test("custom RDD coalescer")` from [[RDDSuite]] we first
    +    // write out to disk, to ensure that our splits are in fact [[FileSplit]] instances.
    +    val data = (1 to 1000).map(i => ClassData(i.toString, i))
    +    data.toDS().repartition(10).write.format("csv").save(path.toString)
    +
    +    val ds = spark.read.format("csv").load(path.toString).as[ClassData]
    +    val coalescedDataSet =
    +      ds.coalesce(2, partitionCoalescer = Option(new SizeBasedCoalescer(maxSplitSize)))
    +
    +    assert(coalescedDataSet.rdd.partitions.length <= 10)
    +
    +    var totalPartitionCount = 0L
    +    coalescedDataSet.rdd.partitions.foreach(partition => {
    +      var splitSizeSum = 0L
    +      partition.asInstanceOf[CoalescedRDDPartition].parents.foreach(partition => {
    +        val split = partition.asInstanceOf[HadoopPartition].inputSplit.value.asInstanceOf[FileSplit]
    +        splitSizeSum += split.getLength
    +        totalPartitionCount += 1
    +      })
    +      assert(splitSizeSum <= maxSplitSize)
    +    })
    +    assert(totalPartitionCount == 10)
    +

--- End diff --

Nit: Remove this empty line.
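The test in this diff asserts two things: every coalesced partition's splits sum to at most `maxSplitSize`, and no parent split is lost. The sketch below illustrates that invariant with plain numbers. It is a toy model only, not Spark's `SizeBasedCoalescer`; `SizeGrouping` and `groupBySize` are hypothetical names and the split sizes are made up.

```scala
object SizeGrouping {
  /**
   * Greedily packs consecutive split sizes into groups whose total stays
   * at or below `maxSize`. A split that alone exceeds `maxSize` still gets
   * its own group, so the bound only holds when every split fits.
   */
  def groupBySize(splitSizes: Seq[Long], maxSize: Long): Seq[Seq[Long]] = {
    val groups = scala.collection.mutable.ArrayBuffer.empty[Vector[Long]]
    var current = Vector.empty[Long]
    var currentSum = 0L
    for (size <- splitSizes) {
      // Close the current group when adding this split would exceed the limit.
      if (current.nonEmpty && currentSum + size > maxSize) {
        groups += current
        current = Vector.empty[Long]
        currentSum = 0L
      }
      current = current :+ size
      currentSum += size
    }
    if (current.nonEmpty) groups += current
    groups.toList
  }
}
```

With sizes `200, 200, 200, 100, 400, 50` and a limit of `512`, every group total stays within the limit and flattening the groups gives back the original sequence, which mirrors the two assertions in the test.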
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99286957

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---

    @@ -17,24 +17,41 @@
     package org.apache.spark.sql
    -import java.io.{Externalizable, ObjectInput, ObjectOutput}
    +import java.io.{Externalizable, File, ObjectInput, ObjectOutput}
     import java.sql.{Date, Timestamp}
    +import org.apache.hadoop.mapred.FileSplit
    +import org.scalatest.BeforeAndAfter
    +
    +import org.apache.spark.rdd.{CoalescedRDDPartition, HadoopPartition, SizeBasedCoalescer}
     import org.apache.spark.sql.catalyst.encoders.{OuterScopes, RowEncoder}
     import org.apache.spark.sql.catalyst.util.sideBySide
    -import org.apache.spark.sql.execution.{LogicalRDD, RDDScanExec, SortExec}
    +import org.apache.spark.sql.execution.{LogicalRDD, RDDScanExec}
     import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, ShuffleExchange}
     import org.apache.spark.sql.execution.streaming.MemoryStream
     import org.apache.spark.sql.functions._
     import org.apache.spark.sql.test.SharedSQLContext
     import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
     case class TestDataPoint(x: Int, y: Double, s: String, t: TestDataPoint2)
     case class TestDataPoint2(x: Int, s: String)
    -class DatasetSuite extends QueryTest with SharedSQLContext {
    +class DatasetSuite extends QueryTest with SharedSQLContext with BeforeAndAfter {
       import testImplicits._
    +  private var path: File = null
    +
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    path = Utils.createTempDir()
    +    path.delete()
    +  }
    +
    +  after {
    +    Utils.deleteRecursively(path)
    +  }

--- End diff --

No need to do it, if you use `withTempPath`. [This](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L247-L265) is an example
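The review comment above suggests replacing the `beforeAll`/`after` pair with `withTempPath`. The general idea is a loan pattern: the helper creates a path, lends it to the test body, and cleans up afterwards. The sketch below shows that pattern using only the JDK. It is an illustration under stated assumptions, not Spark's implementation; `TempPathSketch`, `withTempPathSketch` and `deleteRecursively` are hypothetical names.

```scala
import java.io.File
import java.nio.file.Files

object TempPathSketch {
  /** Deletes a file or a whole directory tree; a stand-in for a recursive-delete utility. */
  private def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) {
      Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    }
    f.delete()
  }

  /**
   * Lends `body` a fresh path that does not exist yet, then removes whatever
   * was written there, even if `body` throws.
   */
  def withTempPathSketch(body: File => Unit): Unit = {
    val path = Files.createTempDirectory("sketch").toFile
    path.delete() // the body (for example a writer) is expected to create it
    try body(path) finally deleteRecursively(path)
  }
}
```

Because setup and cleanup live in one place, a test written this way needs no mutable `path` field and no `BeforeAndAfter` mix-in.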
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99286805

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---

    @@ -117,6 +134,34 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           data: _*)
       }
    +  test("coalesce, custom") {
    +
    +    val maxSplitSize = 512
    +    // Similar to the implementation of `test("custom RDD coalescer")` from [[RDDSuite]] we first
    +    // write out to disk, to ensure that our splits are in fact [[FileSplit]] instances.
    +    val data = (1 to 1000).map(i => ClassData(i.toString, i))
    +    data.toDS().repartition(10).write.format("csv").save(path.toString)

--- End diff --

Use `withTempPath` to generate the path?
[GitHub] spark pull request #16779: [SPARK-19437] Rectify spark executor id in Heartb...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16779
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99286475

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---

    @@ -117,6 +134,34 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           data: _*)
       }
    +  test("coalesce, custom") {
    +
    +    val maxSplitSize = 512
    +    // Similar to the implementation of `test("custom RDD coalescer")` from [[RDDSuite]] we first
    +    // write out to disk, to ensure that our splits are in fact [[FileSplit]] instances.
    +    val data = (1 to 1000).map(i => ClassData(i.toString, i))
    +    data.toDS().repartition(10).write.format("csv").save(path.toString)
    +
    +    val ds = spark.read.format("csv").load(path.toString).as[ClassData]

--- End diff --

```
cannot resolve '`a`' given input columns: [_c0, _c1];
```
[GitHub] spark issue #16779: [SPARK-19437] Rectify spark executor id in HeartbeatRece...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/16779 Thanks! Merging to master.
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99286218

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ---

    @@ -497,7 +496,9 @@ case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan {
      * if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
      * the 100 new partitions will claim 10 of the current partitions.
      */
    -case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecNode {
    +case class CoalesceExec(numPartitions: Int, child: SparkPlan,
    +partitionCoalescer: Option[PartitionCoalescer]
    +  ) extends UnaryExecNode {

--- End diff --

```
case class CoalesceExec(
    numPartitions: Int,
    child: SparkPlan,
    partitionCoalescer: Option[PartitionCoalescer]) extends UnaryExecNode {
```
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99286066

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ---

    @@ -19,9 +19,8 @@ package org.apache.spark.sql.execution
     import scala.concurrent.{ExecutionContext, Future}
     import scala.concurrent.duration.Duration
    -

--- End diff --

Add it back?
[GitHub] spark issue #16784: [SPARK-19382][ML]:Test sparse vectors in LinearSVCSuite
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16784 **[Test build #72302 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72302/testReport)** for PR 16784 at commit [`fc1f7d1`](https://github.com/apache/spark/commit/fc1f7d10134638dfe5130eb19784852207acebd5).
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99285902

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---

    @@ -823,6 +825,17 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
     }
     /**
    + * Returns a new RDD that has exactly `numPartitions` partitions.

--- End diff --

This description is not right.
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99285925

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---

    @@ -823,6 +825,17 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
     }
     /**
    + * Returns a new RDD that has exactly `numPartitions` partitions.
    + */
    +case class CoalesceLogical(numPartitions: Int, partitionCoalescer: Option[PartitionCoalescer],
    +child: LogicalPlan)
    +  extends UnaryNode {

--- End diff --

```Scala
case class PartitionCoalesce(
    numPartitions: Int,
    partitionCoalescer: Option[PartitionCoalescer],
    child: LogicalPlan) extends UnaryNode {
```
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99285876

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---

    @@ -823,6 +825,17 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
     }
     /**
    + * Returns a new RDD that has exactly `numPartitions` partitions.
    + */
    +case class CoalesceLogical(numPartitions: Int, partitionCoalescer: Option[PartitionCoalescer],

--- End diff --

The name still looks inconsistent with the others. How about `PartitionCoalesce`?
[GitHub] spark pull request #16784: [SPARK-19382][ML]:Test sparse vectors in LinearSV...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/16784

[SPARK-19382][ML]:Test sparse vectors in LinearSVCSuite

## What changes were proposed in this pull request?

Add unit tests for SparseVector. We can't add a test case that mixes DenseVector and SparseVector, as discussed in JIRA 19382.

    def merge(other: MultivariateOnlineSummarizer): this.type = {
      if (this.totalWeightSum != 0.0 && other.totalWeightSum != 0.0) {
        require(n == other.n, s"Dimensions mismatch when merging with another summarizer. " +
          s"Expecting $n but got ${other.n}.")

## How was this patch tested?

Unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangmiao1981/spark bk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16784.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16784

commit 85336d1ecec425906356c09b5aa347288f7282bc
Author: wm...@hotmail.com
Date: 2017-01-31T23:10:09Z

    unit test backup

commit fc1f7d10134638dfe5130eb19784852207acebd5
Author: wm...@hotmail.com
Date: 2017-02-03T07:06:55Z

    add SparseVector test
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99285447

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---

    @@ -17,7 +17,9 @@
     package org.apache.spark.sql.catalyst.plans.logical
    +import org.apache.spark.rdd.PartitionCoalescer
     import org.apache.spark.sql.catalyst.{CatalystConf, TableIdentifier}
    +import scala.collection.mutable.ArrayBuffer

--- End diff --

Useless?
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99284849

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ---

    @@ -497,7 +496,9 @@ case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan {
      * if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
      * the 100 new partitions will claim 10 of the current partitions.
      */
    -case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecNode {
    +case class CoalesceExec(numPartitions: Int, child: SparkPlan,
    +partitionCoalescer: Option[PartitionCoalescer]
    +  ) extends UnaryExecNode {

--- End diff --

The same indent issue here.
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99284809

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---

    @@ -2437,9 +2435,12 @@ class Dataset[T] private[sql](
        * @group typedrel
        * @since 1.6.0
        */
    -  def coalesce(numPartitions: Int): Dataset[T] = withTypedPlan {
    -    Repartition(numPartitions, shuffle = false, logicalPlan)
    -  }
    +  def coalesce(numPartitions: Int, partitionCoalescer: Option[PartitionCoalescer]): Dataset[T] =
    +    withTypedPlan {
    +      CoalesceLogical(numPartitions, partitionCoalescer, logicalPlan)
    +    }
    +
    +  def coalesce(numPartitions: Int): Dataset[T] = coalesce(numPartitions, None)

--- End diff --

Please also add a function description, as we did for the other functions in Dataset.scala.
[GitHub] spark issue #16765: [SPARK-19425][SQL] Make ExtractEquiJoinKeys support UDT ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16765 LGTM
[GitHub] spark pull request #16777: [SPARK-19435][SQL] Type coercion between ArrayTyp...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16777#discussion_r99283834

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---

    @@ -101,24 +101,13 @@ object TypeCoercion {
         case _ => None
       }
    -  /** Similar to [[findTightestCommonType]], but can promote all the way to StringType. */
    -  def findTightestCommonTypeToString(left: DataType, right: DataType): Option[DataType] = {
    -    findTightestCommonTypeOfTwo(left, right).orElse((left, right) match {
    -      case (StringType, t2: AtomicType) if t2 != BinaryType && t2 != BooleanType => Some(StringType)
    -      case (t1: AtomicType, StringType) if t1 != BinaryType && t1 != BooleanType => Some(StringType)
    -      case _ => None
    -    })
    -  }
    -
       /**
    -   * Find the tightest common type of a set of types by continuously applying
    -   * `findTightestCommonTypeOfTwo` on these types.
    +   * Promotes all the way to StringType.
        */
    -  private def findTightestCommonType(types: Seq[DataType]): Option[DataType] = {

--- End diff --

It becomes harder for reviewers to read this PR. Could you submit a separate PR for code cleaning? Thanks!
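The removed `findTightestCommonTypeToString` first tries the two-type rule and, failing that, promotes to the string type when one side is a string and the other is an atomic type other than binary or boolean. The sketch below reproduces that fallback shape on a toy type hierarchy so it can run without Spark. Everything in it is hypothetical: `ToyCoercion` and the `Toy*` types stand in for Catalyst's `DataType` classes, and `tightestOfTwo` is deliberately simplified to "identical types only".

```scala
object ToyCoercion {
  sealed trait ToyType
  sealed trait ToyAtomic extends ToyType
  case object ToyString extends ToyAtomic
  case object ToyInt extends ToyAtomic
  case object ToyBinary extends ToyAtomic
  case object ToyBoolean extends ToyAtomic
  case object ToyArray extends ToyType // non-atomic placeholder

  /** Simplified stand-in: only identical types share a tightest common type. */
  def tightestOfTwo(left: ToyType, right: ToyType): Option[ToyType] =
    if (left == right) Some(left) else None

  /** Tries the two-type rule first, then falls back to promoting to the string type. */
  def tightestToString(left: ToyType, right: ToyType): Option[ToyType] =
    tightestOfTwo(left, right).orElse((left, right) match {
      case (ToyString, t2: ToyAtomic) if t2 != ToyBinary && t2 != ToyBoolean => Some(ToyString)
      case (t1: ToyAtomic, ToyString) if t1 != ToyBinary && t1 != ToyBoolean => Some(ToyString)
      case _ => None
    })
}
```

In this toy model a string paired with an integer promotes to string, while a string paired with binary, boolean, or a non-atomic type yields `None`.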
[GitHub] spark issue #16138: [SPARK-16609] Add to_date/to_timestamp with format funct...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16138 **[Test build #72301 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72301/testReport)** for PR 16138 at commit [`a2d0221`](https://github.com/apache/spark/commit/a2d0221501eebc18e8520a58e1e1cd6bd80a02c9).
[GitHub] spark pull request #16138: [SPARK-16609] Add to_date/to_timestamp with forma...
Github user anabranch commented on a diff in the pull request: https://github.com/apache/spark/pull/16138#discussion_r99283708

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala ---

    @@ -1047,6 +1048,64 @@ case class ToDate(child: Expression) extends UnaryExpression with ImplicitCastIn
     }
     /**
    + * Parses a column to a date based on the given format.
    + */
    +// scalastyle:off line.size.limit
    +@ExpressionDescription(
    +  usage = "_FUNC_(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input.",
    +  extended = """
    +    Examples:
    +      > SELECT _FUNC_('2016-12-31', '-MM-dd');
    +       2016-12-31
    +  """)
    +// scalastyle:on line.size.limit
    +case class ParseToDate(left: Expression, format: Expression, child: Expression)
    +  extends RuntimeReplaceable {
    +
    +  def this(left: Expression, format: Expression) = {
    +    this(left, format, Cast(Cast(new UnixTimestamp(left, format), TimestampType), DateType))
    +  }
    +
    +  def this(left: Expression) = {
    +    // RuntimeReplaceable forces the signature, the second value
    +    // is ignored completely
    +    this(left, Literal(""), ToDate(left))
    +  }
    +
    +  override def flatArguments: Iterator[Any] = Iterator(left, format)
    +  override def sql: String = s"$prettyName(${left.sql}, ${format.sql})"

--- End diff --

Fixed!
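The usage string in this diff describes the intended behaviour: parse a date string with a format and return null on invalid input. The sketch below shows the same contract outside Spark, with `None` playing the role of SQL null. It is an assumption-laden illustration, not the expression's implementation: `ParseToDateSketch` and `parseToDate` are hypothetical names, the pattern `yyyy-MM-dd` is only an example, and `java.time` pattern rules may differ from those Spark applies.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

object ParseToDateSketch {
  /** Parses `dateStr` with `pattern`; yields None where the SQL function would yield null. */
  def parseToDate(dateStr: String, pattern: String): Option[LocalDate] =
    Try(LocalDate.parse(dateStr, DateTimeFormatter.ofPattern(pattern))).toOption
}
```

Wrapping the parse in `Try` keeps malformed input, and malformed patterns, from throwing, which matches the "returns null with invalid input" wording in the usage text.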
[GitHub] spark issue #16779: [SPARK-19437] Rectify spark executor id in HeartbeatRece...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16779 Merged build finished. Test PASSed.
[GitHub] spark issue #16779: [SPARK-19437] Rectify spark executor id in HeartbeatRece...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16779 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72297/ Test PASSed.
[GitHub] spark issue #16779: [SPARK-19437] Rectify spark executor id in HeartbeatRece...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16779 **[Test build #72297 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72297/testReport)** for PR 16779 at commit [`a9bc3f4`](https://github.com/apache/spark/commit/a9bc3f47b9cd08f309c00c159bc0e1e6a6c6e763). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @seth @imatiach-msft Let me know if any other changes are needed. Thanks very much for your review!
[GitHub] spark issue #16765: [SPARK-19425][SQL] Make ExtractEquiJoinKeys support UDT ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16765 @gatorsmile Updated. Thanks.
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16740 Merged build finished. Test PASSed.
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16740 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72299/ Test PASSed.
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16740 **[Test build #72299 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72299/testReport)** for PR 16740 at commit [`b57af08`](https://github.com/apache/spark/commit/b57af08f792a59438452a3cef070e16ef51316b5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16783: [SPARK-19441] [SQL] Remove IN type coercion from Promote...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16783 **[Test build #72300 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72300/testReport)** for PR 16783 at commit [`127a114`](https://github.com/apache/spark/commit/127a114801197e0927a5484a9fdb7b8ee93db22b).
[GitHub] spark pull request #16783: [SPARK-19441] [SQL] Remove IN type coercion from ...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/16783 [SPARK-19441] [SQL] Remove IN type coercion from PromoteStrings
### What changes were proposed in this pull request?
The removed code is not reachable, because `InConversion` already resolves the type coercion issues.
### How was this patch tested?
N/A
You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark typeCoercionIn
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16783.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16783
commit bd2d1e7ac3e995331c1eea2630c32c3c4f32 Author: gatorsmile Date: 2017-02-03T04:40:43Z Merge remote-tracking branch 'upstream/master' into typeCoercionIn
commit 127a114801197e0927a5484a9fdb7b8ee93db22b Author: gatorsmile Date: 2017-02-03T04:50:24Z fix.
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16740 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72298/ Test PASSed.
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16740 Merged build finished. Test PASSed.
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16740 **[Test build #72298 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72298/testReport)** for PR 16740 at commit [`931f7ec`](https://github.com/apache/spark/commit/931f7ecceff7a0cb0c1870af7e69d38454078c52). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14702: [SPARK-15694] Implement ScriptTransformation in sql/core...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14702 I will try to review it in the next few days. Thanks for working on it!
[GitHub] spark issue #16765: [SPARK-19425][SQL] Make df.except work for UDT
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16765 Could you update the PR description and title? This PR fixes three scenarios:
- `except` on two Datasets with UDT
- `intersect` on two Datasets with UDT
- `Join` with the join conditions using `<=>` on UDT columns
[GitHub] spark pull request #16722: [SPARK-9478][ML][MLlib] Add sample weights to dec...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16722#discussion_r99279786 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala --- @@ -351,6 +370,36 @@ class DecisionTreeClassifierSuite dt.fit(df) } + test("training with sample weights") { +val df = linearMulticlassDataset +val numClasses = 3 +val predEquals = (x: Double, y: Double) => x == y +// (impurity, maxDepth) +val testParams = Seq( + ("gini", 10), + ("entropy", 10), + ("gini", 5) +) +for ((impurity, maxDepth) <- testParams) { + val estimator = new DecisionTreeClassifier() +.setMaxDepth(maxDepth) +.setSeed(seed) +.setMinWeightFractionPerNode(0.049) --- End diff -- maybe also add a test to validate that an invalid minWeightFraction throws an exception
[GitHub] spark pull request #16722: [SPARK-9478][ML][MLlib] Add sample weights to dec...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16722#discussion_r99279066 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/ImpuritySuite.scala --- @@ -18,23 +18,62 @@ package org.apache.spark.mllib.tree import org.apache.spark.SparkFunSuite -import org.apache.spark.mllib.tree.impurity.{EntropyAggregator, GiniAggregator} +import org.apache.spark.ml.util.TestingUtils._ +import org.apache.spark.mllib.tree.impurity._ /** * Test suites for [[GiniAggregator]] and [[EntropyAggregator]]. */ class ImpuritySuite extends SparkFunSuite { + + private val seed = 42 + test("Gini impurity does not support negative labels") { val gini = new GiniAggregator(2) intercept[IllegalArgumentException] { - gini.update(Array(0.0, 1.0, 2.0), 0, -1, 0.0) + gini.update(Array(0.0, 1.0, 2.0), 0, -1, 3, 0.0) } } test("Entropy does not support negative labels") { val entropy = new EntropyAggregator(2) intercept[IllegalArgumentException] { - entropy.update(Array(0.0, 1.0, 2.0), 0, -1, 0.0) + entropy.update(Array(0.0, 1.0, 2.0), 0, -1, 3, 0.0) +} + } + + test("Classification impurities are insensitive to scaling") { +val rng = new scala.util.Random(seed) +val weightedCounts = Array.fill(5)(rng.nextDouble()) +val smallWeightedCounts = weightedCounts.map(_ * 0.0001) +val largeWeightedCounts = weightedCounts.map(_ * 1) +Seq(Gini, Entropy).foreach { impurity => + val impurity1 = impurity.calculate(weightedCounts, weightedCounts.sum) + assert(impurity.calculate(smallWeightedCounts, smallWeightedCounts.sum) +~== impurity1 relTol 0.005) + assert(impurity.calculate(largeWeightedCounts, largeWeightedCounts.sum) +~== impurity1 relTol 0.005) } } + test("Regression impurities are insensitive to scaling") { +def computeStats(samples: Seq[Double], weights: Seq[Double]): (Double, Double, Double) = { + samples.zip(weights).foldLeft((0.0, 0.0, 0.0)) { case ((wn, wy, wyy), (y, w)) => +(wn + w, wy + w * y, wyy + w * y * y) + } +} +val rng = new 
scala.util.Random(seed) +val samples = Array.fill(10)(rng.nextDouble()) +val _weights = Array.fill(10)(rng.nextDouble()) +val smallWeights = _weights.map(_ * 0.0001) +val largeWeights = _weights.map(_ * 1) +val (count, sum, sumSquared) = computeStats(samples, _weights) +Seq(Variance).foreach { impurity => + val impurity1 = impurity.calculate(count, sum, sumSquared) + val (smallCount, smallSum, smallSumSquared) = computeStats(samples, smallWeights) + val (largeCount, largeSum, largeSumSquared) = computeStats(samples, largeWeights) + assert(impurity.calculate(smallCount, smallSum, smallSumSquared) ~== impurity1 relTol 0.005) + assert(impurity.calculate(largeCount, largeSum, largeSumSquared) ~== impurity1 relTol 0.005) --- End diff -- these are really nice tests
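The scaling property these tests check can be sketched without Spark. The following is a minimal illustration, not the code under review: `gini`, `entropy` and `relClose` are hypothetical helpers written from the textbook definitions, and the large scale factor of 1e4 is an assumption (the quoted diff shows `_ * 1`).

```scala
object ImpurityScalingSketch {
  // Gini impurity from (possibly weighted) class counts: 1 - sum_k (c_k / total)^2
  def gini(counts: Array[Double], total: Double): Double =
    if (total == 0.0) 0.0
    else 1.0 - counts.map(c => (c / total) * (c / total)).sum

  // Entropy in bits: -sum_k p_k * log2(p_k), skipping empty classes
  def entropy(counts: Array[Double], total: Double): Double =
    if (total == 0.0) 0.0
    else counts.filter(_ > 0.0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2.0)
    }.sum

  // Relative-tolerance comparison, a plain stand-in for `~== ... relTol`
  def relClose(a: Double, b: Double, relTol: Double): Boolean =
    math.abs(a - b) <= relTol * math.max(math.abs(a), math.abs(b))

  def main(args: Array[String]): Unit = {
    val counts = Array(0.2, 0.5, 0.3, 0.9, 0.1)
    // Both impurities depend only on the proportions c_k / total,
    // so multiplying every count by the same factor should not change them.
    for (scale <- Seq(1e-4, 1.0, 1e4)) {
      val scaled = counts.map(_ * scale)
      assert(relClose(gini(scaled, scaled.sum), gini(counts, counts.sum), 0.005))
      assert(relClose(entropy(scaled, scaled.sum), entropy(counts, counts.sum), 0.005))
    }
    println("impurities unchanged under scaling")
  }
}
```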
[GitHub] spark pull request #16722: [SPARK-9478][ML][MLlib] Add sample weights to dec...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16722#discussion_r99278975 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala --- @@ -281,10 +283,26 @@ object MLTestingUtils extends SparkFunSuite { estimator: E with HasWeightCol, modelEquals: (M, M) => Unit): Unit = { estimator.set(estimator.weightCol, "weight") -val models = Seq(0.001, 1.0, 1000.0).map { w => +val models = Seq(0.01, 1.0, 1000.0).map { w => --- End diff -- was there a specific reason to change the weight here?
[GitHub] spark pull request #16722: [SPARK-9478][ML][MLlib] Add sample weights to dec...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16722#discussion_r99278910 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/TreeTests.scala --- @@ -124,8 +129,8 @@ private[ml] object TreeTests extends SparkFunSuite { * make mistakes such as creating loops of Nodes. */ private def checkEqual(a: Node, b: Node): Unit = { -assert(a.prediction === b.prediction) -assert(a.impurity === b.impurity) +assert(a.prediction ~== b.prediction absTol 1e-8) +assert(a.impurity ~== b.impurity absTol 1e-8) --- End diff -- can the tolerances be moved to a constant?
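One way to read the request about tolerances, sketched in plain Scala rather than with Spark's `TestingUtils`: the names `NodeAbsTol` and `approxEqual` are hypothetical, chosen only for illustration.

```scala
object ToleranceSketch {
  // Single named tolerance shared by every node comparison (hypothetical name)
  val NodeAbsTol: Double = 1e-8

  // Absolute-tolerance comparison, a plain stand-in for `~== ... absTol`
  def approxEqual(a: Double, b: Double, absTol: Double = NodeAbsTol): Boolean =
    math.abs(a - b) <= absTol

  def main(args: Array[String]): Unit = {
    assert(approxEqual(0.5, 0.5 + 1e-10))
    assert(!approxEqual(0.5, 0.5 + 1e-6))
    println("tolerance checks passed")
  }
}
```

With the value defined once, a later change to the tolerance touches one line instead of every assertion.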
[GitHub] spark pull request #16138: [SPARK-16609] Add to_date/to_timestamp with forma...
Github user anabranch commented on a diff in the pull request: https://github.com/apache/spark/pull/16138#discussion_r99278789 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -1177,6 +1177,9 @@ test_that("column functions", { c17 <- cov(c, c1) + cov("c", "c1") + covar_samp(c, c1) + covar_samp("c", "c1") c18 <- covar_pop(c, c1) + covar_pop("c", "c1") c19 <- spark_partition_id() + c20 <- to_timestamp(c) + trim(c) + unbase64(c) + unhex(c) + upper(c) + c21 <- to_timestamp(c, "") + trim(c) + unbase64(c) + unhex(c) + upper(c) + c22 <- to_date(c, "") + trim(c) + unbase64(c) + unhex(c) + upper(c) --- End diff -- fixed
[GitHub] spark pull request #16138: [SPARK-16609] Add to_date/to_timestamp with forma...
Github user anabranch commented on a diff in the pull request: https://github.com/apache/spark/pull/16138#discussion_r99278746 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1047,6 +1048,64 @@ case class ToDate(child: Expression) extends UnaryExpression with ImplicitCastIn } /** + * Parses a column to a date based on the given format. + */ +// scalastyle:off line.size.limit +@ExpressionDescription( + usage = "_FUNC_(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input.", + extended = """ +Examples: + > SELECT _FUNC_('2016-12-31', 'yyyy-MM-dd'); + 2016-12-31 + """) +// scalastyle:on line.size.limit +case class ParseToDate(left: Expression, format: Expression, child: Expression) + extends RuntimeReplaceable { + + def this(left: Expression, format: Expression) = { +this(left, format, Cast(Cast(new UnixTimestamp(left, format), TimestampType), DateType)) + } + + def this(left: Expression) = { +// RuntimeReplaceable forces the signature, the second value +// is ignored completely +this(left, Literal(""), ToDate(left)) + } + + override def flatArguments: Iterator[Any] = Iterator(left, format) + override def sql: String = s"$prettyName(${left.sql}, ${format.sql})" + + override def prettyName: String = "to_date" + override def dataType: DataType = DateType --- End diff -- Removed.
[GitHub] spark pull request #16138: [SPARK-16609] Add to_date/to_timestamp with forma...
Github user anabranch commented on a diff in the pull request: https://github.com/apache/spark/pull/16138#discussion_r99278738 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1047,6 +1048,64 @@ case class ToDate(child: Expression) extends UnaryExpression with ImplicitCastIn } /** + * Parses a column to a date based on the given format. + */ +// scalastyle:off line.size.limit +@ExpressionDescription( + usage = "_FUNC_(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input.", + extended = """ +Examples: + > SELECT _FUNC_('2016-12-31', 'yyyy-MM-dd'); + 2016-12-31 + """) +// scalastyle:on line.size.limit +case class ParseToDate(left: Expression, format: Expression, child: Expression) --- End diff -- I don't really understand this feedback. This is how I saw other `RuntimeReplaceable` expressions created.
[GitHub] spark pull request #16138: [SPARK-16609] Add to_date/to_timestamp with forma...
Github user anabranch commented on a diff in the pull request: https://github.com/apache/spark/pull/16138#discussion_r99278624 --- Diff: R/pkg/R/functions.R --- @@ -1746,7 +1750,7 @@ setMethod("toRadians", #' to_date(df$c) #' to_date(df$c, 'yyyy-MM-dd') #' } -#' @note to_date(Column, format) since 2.2.0 +#' @note to_date(Column) since 1.5 --- End diff -- fixed.
[GitHub] spark pull request #16722: [SPARK-9478][ML][MLlib] Add sample weights to dec...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16722#discussion_r99278201 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala --- @@ -58,6 +62,20 @@ class DecisionTreeClassifierSuite categoricalDataPointsForMulticlassForOrderedFeaturesRDD = sc.parallelize( OldDecisionTreeSuite.generateCategoricalDataPointsForMulticlassForOrderedFeatures()) .map(_.asML) +linearMulticlassDataset = { + val nPoints = 100 + val coefficients = Array( +-0.57997, 0.912083, -0.371077, +-0.16624, -0.84355, -0.048509) + + val xMean = Array(5.843, 3.057) + val xVariance = Array(0.6856, 0.1899) + + val testData = LogisticRegressionSuite.generateMultinomialLogisticInput( +coefficients, xMean, xVariance, addIntercept = true, nPoints, 42) --- End diff -- pass in seed instead of 42 here (at the end)
[GitHub] spark pull request #16722: [SPARK-9478][ML][MLlib] Add sample weights to dec...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16722#discussion_r99278110 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Variance.scala --- @@ -70,17 +70,24 @@ object Variance extends Impurity { * Note: Instances of this class do not hold the data; they operate on views of the data. */ private[spark] class VarianceAggregator() - extends ImpurityAggregator(statsSize = 3) with Serializable { + extends ImpurityAggregator(statsSize = 4) with Serializable { /** * Update stats for one (node, feature, bin) with the given label. * @param allStats Flat stats array, with stats for this (node, feature, bin) contiguous. * @param offsetStart index of stats for this (node, feature, bin). */ - def update(allStats: Array[Double], offset: Int, label: Double, instanceWeight: Double): Unit = { + def update( + allStats: Array[Double], + offset: Int, + label: Double, + numSamples: Int, + sampleWeight: Double): Unit = { +val instanceWeight = numSamples * sampleWeight allStats(offset) += instanceWeight allStats(offset + 1) += instanceWeight * label allStats(offset + 2) += instanceWeight * label * label +allStats(offset + 3) += numSamples --- End diff -- could the statistics that this computes be added to either the class documentation or this method (the former preferred)
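For readers following the request to document the statistics, here is a standalone sketch of the four slots the quoted `update` maintains. The `update` body mirrors the diff; the `variance` helper and the object name are assumptions added for illustration and are not taken from the PR.

```scala
object VarianceStatsSketch {
  // Per (node, feature, bin) layout, as read from the quoted diff:
  //   offset + 0: sum of instance weights          (sum of w)
  //   offset + 1: weighted sum of labels           (sum of w * y)
  //   offset + 2: weighted sum of squared labels   (sum of w * y * y)
  //   offset + 3: unweighted number of samples
  val StatsSize = 4

  def update(
      allStats: Array[Double],
      offset: Int,
      label: Double,
      numSamples: Int,
      sampleWeight: Double): Unit = {
    val instanceWeight = numSamples * sampleWeight
    allStats(offset) += instanceWeight
    allStats(offset + 1) += instanceWeight * label
    allStats(offset + 2) += instanceWeight * label * label
    allStats(offset + 3) += numSamples
  }

  // Assumed helper: weighted variance E[y^2] - E[y]^2 from the first three slots
  def variance(allStats: Array[Double], offset: Int): Double = {
    val w = allStats(offset)
    if (w == 0.0) 0.0
    else {
      val mean = allStats(offset + 1) / w
      allStats(offset + 2) / w - mean * mean
    }
  }

  def main(args: Array[String]): Unit = {
    val stats = new Array[Double](StatsSize)
    update(stats, 0, label = 1.0, numSamples = 1, sampleWeight = 2.0)
    update(stats, 0, label = 3.0, numSamples = 2, sampleWeight = 0.5)
    println(stats.mkString(", "))
    println(variance(stats, 0))
  }
}
```

Keeping the raw sample count in a fourth slot separates "how many rows landed here" from "how much weight landed here", which the weighted sums alone cannot recover.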
[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...
Github user actuaryzhang commented on a diff in the pull request: https://github.com/apache/spark/pull/16740#discussion_r99277921 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -743,6 +744,48 @@ class GeneralizedLinearRegressionSuite } } + test("generalized linear regression: intercept only") { +/* + R code: + y <- c(17, 19, 23, 29) + w <- c(1, 2, 3, 4) + model1 <- glm(y ~ 1, family = poisson) + model2 <- glm(y ~ 1, family = poisson, weights = w) + as.vector(c(coef(model1), coef(model2))) + [1] 3.091042 3.178054 + */ + +val dataset = Seq( + Instance(17.0, 1.0, Vectors.zeros(0)), + Instance(19.0, 2.0, Vectors.zeros(0)), + Instance(23.0, 3.0, Vectors.zeros(0)), + Instance(29.0, 4.0, Vectors.zeros(0)) +).toDF() + +val expected = Seq(3.091, 3.178) + +import GeneralizedLinearRegression._ + +var idx = 0 +for (useWeight <- Seq(false, true)) { + val trainer = new GeneralizedLinearRegression().setFamily("poisson") + if (useWeight) trainer.setWeightCol("weight") + val model = trainer.fit(dataset) + val actual = model.intercept + assert(actual ~== expected(idx) absTol 1E-3, "Model mismatch: intercept only GLM with " + +s"useWeight = $useWeight.") + assert(model.coefficients === new DenseVector(Array.empty[Double])) + + idx += 1 +} + +// throw exception for empty model +val trainer = new GeneralizedLinearRegression().setFitIntercept(false) +intercept[SparkException] { --- End diff -- Done
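The expected values in the quoted test can be cross-checked by hand. For an intercept-only Poisson model with the default log link, the maximum-likelihood intercept is the log of the (weighted) mean response. The sketch below is plain Scala with a hypothetical `interceptOnly` helper; it does not call Spark or R.

```scala
object InterceptOnlyPoissonSketch {
  // Intercept-only Poisson GLM with log link: exp(intercept) = sum(w * y) / sum(w)
  def interceptOnly(y: Seq[Double], w: Seq[Double]): Double = {
    require(y.length == w.length && w.sum > 0.0, "need matching, non-degenerate weights")
    val weightedMean = y.zip(w).map { case (yi, wi) => yi * wi }.sum / w.sum
    math.log(weightedMean)
  }

  def main(args: Array[String]): Unit = {
    val y = Seq(17.0, 19.0, 23.0, 29.0)
    val w = Seq(1.0, 2.0, 3.0, 4.0)
    // Unweighted mean is 88 / 4 = 22; weighted mean is 240 / 10 = 24
    println(interceptOnly(y, Seq.fill(y.length)(1.0)))
    println(interceptOnly(y, w))
  }
}
```

The two results, log(22) and log(24), agree with the R output `3.091042 3.178054` quoted in the test comment to the printed precision.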
[GitHub] spark pull request #16722: [SPARK-9478][ML][MLlib] Add sample weights to dec...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/16722#discussion_r99277799 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala --- @@ -80,23 +80,29 @@ object Gini extends Impurity { * @param numClasses Number of classes for label. */ private[spark] class GiniAggregator(numClasses: Int) - extends ImpurityAggregator(numClasses) with Serializable { + extends ImpurityAggregator(numClasses + 1) with Serializable { /** * Update stats for one (node, feature, bin) with the given label. * @param allStats Flat stats array, with stats for this (node, feature, bin) contiguous. * @param offsetStart index of stats for this (node, feature, bin). */ - def update(allStats: Array[Double], offset: Int, label: Double, instanceWeight: Double): Unit = { -if (label >= statsSize) { + def update( + allStats: Array[Double], + offset: Int, + label: Double, + numSamples: Int, + sampleWeight: Double): Unit = { +if (label >= numClasses) { throw new IllegalArgumentException(s"GiniAggregator given label $label" + --- End diff -- not related to this code review, but it seems a bit strange that each of these ImpurityAggregators has the same checks/bounds for label. I would have preferred the abstract base class to implement these instead, although it is nice to have a more specific error message
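The design point raised here, shared label checks that still produce a specific message, might look roughly like the following. This is a hypothetical sketch: `LabelCheckedAggregator`, `checkLabel` and the two subclasses are illustrative names and are not part of Spark's `ImpurityAggregator` hierarchy.

```scala
// Base class owns the bounds checks; subclasses only supply their name for the message.
abstract class LabelCheckedAggregator(val numClasses: Int) {
  protected def aggregatorName: String

  def checkLabel(label: Double): Unit = {
    if (label >= numClasses) {
      throw new IllegalArgumentException(
        s"$aggregatorName given label $label but requires label < numClasses (= $numClasses).")
    }
    if (label < 0) {
      throw new IllegalArgumentException(
        s"$aggregatorName given label $label but requires label to be non-negative.")
    }
  }
}

class GiniLikeAggregator(numClasses: Int) extends LabelCheckedAggregator(numClasses) {
  protected def aggregatorName: String = "GiniLikeAggregator"
}

class EntropyLikeAggregator(numClasses: Int) extends LabelCheckedAggregator(numClasses) {
  protected def aggregatorName: String = "EntropyLikeAggregator"
}

object LabelCheckSketch {
  def main(args: Array[String]): Unit = {
    val gini = new GiniLikeAggregator(2)
    gini.checkLabel(1.0) // within bounds, no exception
    try gini.checkLabel(-1.0)
    catch { case e: IllegalArgumentException => println(e.getMessage) }
  }
}
```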
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16740 **[Test build #72299 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72299/testReport)** for PR 16740 at commit [`b57af08`](https://github.com/apache/spark/commit/b57af08f792a59438452a3cef070e16ef51316b5).
[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16740#discussion_r99276472

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -743,6 +744,48 @@ class GeneralizedLinearRegressionSuite
         }
       }
    +  test("generalized linear regression: intercept only") {
    +    /*
    +      R code:
    +      y <- c(17, 19, 23, 29)
    +      w <- c(1, 2, 3, 4)
    +      model1 <- glm(y ~ 1, family = poisson)
    +      model2 <- glm(y ~ 1, family = poisson, weights = w)
    +      as.vector(c(coef(model1), coef(model2)))
    +      [1] 3.091042 3.178054
    +     */
    +
    +    val dataset = Seq(
    +      Instance(17.0, 1.0, Vectors.zeros(0)),
    +      Instance(19.0, 2.0, Vectors.zeros(0)),
    +      Instance(23.0, 3.0, Vectors.zeros(0)),
    +      Instance(29.0, 4.0, Vectors.zeros(0))
    +    ).toDF()
    +
    +    val expected = Seq(3.091, 3.178)
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +    for (useWeight <- Seq(false, true)) {
    +      val trainer = new GeneralizedLinearRegression().setFamily("poisson")
    +      if (useWeight) trainer.setWeightCol("weight")
    +      val model = trainer.fit(dataset)
    +      val actual = model.intercept
    +      assert(actual ~== expected(idx) absTol 1E-3, "Model mismatch: intercept only GLM with " +
    +        s"useWeight = $useWeight.")
    +      assert(model.coefficients === new DenseVector(Array.empty[Double]))
    +
    +      idx += 1
    +    }
    +
    +    // throw exception for empty model
    +    val trainer = new GeneralizedLinearRegression().setFitIntercept(false)
    +    intercept[SparkException] {
    --- End diff --

    thank you for adding the test, could you also please wrap it in withClue to verify the message contents, eg:
    withClue("Specified model is empty with neither intercept nor feature") {
      intercept[SparkException] {
        trainer.fit(dataset)
      }
    }
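The R reference values quoted in the test comment can be sanity-checked by hand. For an intercept-only Poisson GLM with the log link, the maximum-likelihood intercept is the log of the (weighted) mean response, so the two numbers should be log(22) and log(24). The following is a minimal sketch of that check in plain Scala; the object and method names are hypothetical and are not part of the PR.

```scala
// Closed-form check for an intercept-only Poisson GLM with log link:
// the fitted intercept equals log(sum(w * y) / sum(w)).
// Hypothetical helper, not code from the pull request.
object InterceptOnlyPoisson {
  def intercept(y: Seq[Double], w: Seq[Double]): Double = {
    require(y.nonEmpty && y.length == w.length, "y and w must be non-empty and of equal length")
    val weightedSum = y.zip(w).map { case (yi, wi) => yi * wi }.sum
    math.log(weightedSum / w.sum)
  }

  def main(args: Array[String]): Unit = {
    val y = Seq(17.0, 19.0, 23.0, 29.0)
    // Unweighted case: log(88 / 4) = log(22)
    val unweighted = intercept(y, Seq.fill(y.length)(1.0))
    // Weighted case: log((17 + 38 + 69 + 116) / 10) = log(24)
    val weighted = intercept(y, Seq(1.0, 2.0, 3.0, 4.0))
    println(s"$unweighted $weighted")
  }
}
```

Both values agree with the R output in the test comment to within 1e-6, which is well inside the `absTol 1E-3` used by the test.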
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16740 **[Test build #72298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72298/testReport)** for PR 16740 at commit [`931f7ec`](https://github.com/apache/spark/commit/931f7ecceff7a0cb0c1870af7e69d38454078c52).
[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16740#discussion_r99276315

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -335,6 +335,11 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
           throw new SparkException(msg)
         }
    +    if (numFeatures == 0 && !$(fitIntercept)) {
    +      val msg = "Specified model is empty with neither intercept nor feature."
    +      throw new SparkException(msg)
    --- End diff --

    @imatiach-msft Test added. Thanks!
[GitHub] spark pull request #16722: [SPARK-9478][ML][MLlib] Add sample weights to dec...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16722#discussion_r99275923

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Entropy.scala ---
    @@ -83,23 +83,29 @@ object Entropy extends Impurity {
        * @param numClasses  Number of classes for label.
        */
       private[spark] class EntropyAggregator(numClasses: Int)
    -    extends ImpurityAggregator(numClasses) with Serializable {
    +    extends ImpurityAggregator(numClasses + 1) with Serializable {
    --- End diff --

    I guess it is because the number of "stats" increases by one since we are adding the weight, if I understand correctly
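One way to read the guess above is that the flat stats array keeps one weighted count per class plus a single extra trailing slot. The sketch below illustrates that reading only; the layout, the object name, and the choice to store the unweighted sample count in the last slot are assumptions for illustration and are not confirmed by the diff shown in this thread.

```scala
// Hypothetical layout (an assumption, not taken from the PR):
// slots [offset, offset + numClasses) hold weighted class counts,
// and the last slot holds the unweighted number of samples.
object WeightedStatsSketch {
  def update(
      allStats: Array[Double],
      offset: Int,
      statsSize: Int,
      label: Double,
      numSamples: Int,
      sampleWeight: Double): Unit = {
    val numClasses = statsSize - 1
    require(label >= 0 && label < numClasses, s"label $label is out of range")
    // Weighted contribution to the class slot.
    allStats(offset + label.toInt) += numSamples * sampleWeight
    // Raw sample count in the extra trailing slot.
    allStats(offset + statsSize - 1) += numSamples
  }
}
```

Under this assumed layout, an aggregator for `numClasses` classes needs `numClasses + 1` slots per (node, feature, bin), which would match the constructor argument in the diff.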
[GitHub] spark pull request #16722: [SPARK-9478][ML][MLlib] Add sample weights to dec...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16722#discussion_r99275700

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
    @@ -79,7 +79,12 @@ private[spark] abstract class ImpurityAggregator(val statsSize: Int) extends Ser
        * @param allStats  Flat stats array, with stats for this (node, feature, bin) contiguous.
        * @param offsetStart  index of stats for this (node, feature, bin).
        */
    -  def update(allStats: Array[Double], offset: Int, label: Double, instanceWeight: Double): Unit
    +  def update(
    +      allStats: Array[Double],
    +      offset: Int,
    +      label: Double,
    +      numSamples: Int,
    +      sampleWeight: Double): Unit
    --- End diff --

    should the numSamples/sampleWeight be added to the doc here?
[GitHub] spark pull request #16722: [SPARK-9478][ML][MLlib] Add sample weights to dec...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16722#discussion_r99274459

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Entropy.scala ---
    @@ -83,23 +83,29 @@ object Entropy extends Impurity {
        * @param numClasses  Number of classes for label.
        */
       private[spark] class EntropyAggregator(numClasses: Int)
    -    extends ImpurityAggregator(numClasses) with Serializable {
    +    extends ImpurityAggregator(numClasses + 1) with Serializable {
    --- End diff --

    sorry, trying to follow this part of the code, why do we pass (numClasses + 1) to the impurityAggregator?
[GitHub] spark pull request #16733: [SPARK-19392][SQL] Fix the bug that throws an exc...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16733#discussion_r99273860

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala ---
    @@ -29,7 +29,12 @@ private case object OracleDialect extends JdbcDialect {
       override def getCatalystType(
    --- End diff --

    okay, I'll do!
[GitHub] spark pull request #16733: [SPARK-19392][SQL] Fix the bug that throws an exc...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16733#discussion_r99273809

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala ---
    @@ -29,7 +29,12 @@ private case object OracleDialect extends JdbcDialect {
       override def getCatalystType(
    --- End diff --

    Can you ask him to retry it in Spark 2.1? I am not sure whether he is using Apache Spark or the Spark released by other vendors.
[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16740#discussion_r99273642

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -335,6 +335,11 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
           throw new SparkException(msg)
         }
    +    if (numFeatures == 0 && !$(fitIntercept)) {
    +      val msg = "Specified model is empty with neither intercept nor feature."
    +      throw new SparkException(msg)
    --- End diff --

    suggestion: please add a test to validate this case
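A test for this guard would typically check both that it fires and that the message is the expected one. The sketch below shows the idea in plain Scala so that it runs without Spark or ScalaTest on the classpath; `validate` is a hypothetical stand-in for the check in the diff, and `IllegalArgumentException` stands in for `SparkException`.

```scala
import scala.util.{Failure, Success, Try}

object EmptyModelGuardSketch {
  // Hypothetical stand-in for the guard in the diff. IllegalArgumentException
  // is used only so the sketch is self-contained.
  def validate(numFeatures: Int, fitIntercept: Boolean): Unit = {
    if (numFeatures == 0 && !fitIntercept) {
      throw new IllegalArgumentException(
        "Specified model is empty with neither intercept nor feature.")
    }
  }

  // Returns the exception message if the guard fires, None otherwise.
  def failureMessage(numFeatures: Int, fitIntercept: Boolean): Option[String] =
    Try(validate(numFeatures, fitIntercept)) match {
      case Failure(e: IllegalArgumentException) => Some(e.getMessage)
      case Failure(other) => throw other
      case Success(_) => None
    }
}
```

In a ScalaTest suite the same check is usually written by keeping the exception that `intercept` returns and asserting on its `getMessage`.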
[GitHub] spark issue #16765: [SPARK-19425][SQL] Make df.except work for UDT
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16765 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72295/ Test PASSed.
[GitHub] spark issue #16765: [SPARK-19425][SQL] Make df.except work for UDT
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16765 Merged build finished. Test PASSed.
[GitHub] spark issue #16765: [SPARK-19425][SQL] Make df.except work for UDT
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16765 **[Test build #72295 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72295/testReport)** for PR 16765 at commit [`ac3c3bf`](https://github.com/apache/spark/commit/ac3c3bfa270dda077bf89db926c38b9946c4738e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #16779: [SPARK-19437] Rectify spark executor id in HeartbeatRece...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16779 **[Test build #72297 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72297/testReport)** for PR 16779 at commit [`a9bc3f4`](https://github.com/apache/spark/commit/a9bc3f47b9cd08f309c00c159bc0e1e6a6c6e763).
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16740 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72296/ Test PASSed.
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16740 Merged build finished. Test PASSed.
[GitHub] spark issue #16779: [SPARK-19437] Rectify spark executor id in HeartbeatRece...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/16779 ok to test
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16740 **[Test build #72296 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72296/testReport)** for PR 16740 at commit [`3a0a2af`](https://github.com/apache/spark/commit/3a0a2aff5a7b09cb0e1db7ec2e756e55b561eace).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16739 : ) This might be caused by the optimizer rule `CollapseRepartition`. Can you output the plan by `explain(true)`?
[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/15435

    @sethah If I merge MulticlassLogisticRegressionSummary into LogisticRegressionSummary, then, according to the hierarchy as currently designed, it becomes:

        class LogisticRegressionSummary extends MulticlassSummary with LogisticRegressionSummary
        class LogisticRegressionTrainingSummary extends LogisticRegressionSummary with

    **Note that LogisticRegressionTrainingSummary must now become a class, not a trait: once MulticlassLogisticRegressionSummary is merged into LogisticRegressionSummary, it has to be a class.**

    Now consider `BinaryLogisticRegressionSummary`:

        class BinaryLogisticRegressionSummary extends LogisticRegressionSummary
        class BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary

    **A new problem occurs: BinaryLogisticRegressionTrainingSummary cannot extend LogisticRegressionTrainingSummary, because `LogisticRegressionTrainingSummary` has changed into a class, not a trait.**

    **That BinaryLogisticRegressionTrainingSummary cannot extend LogisticRegressionTrainingSummary causes further API breakage, such as in `def summary`.**

    So these problems are troublesome, because they cause so much API breakage.
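The constraint behind this objection is Scala's single-class-inheritance rule: a class can extend at most one class but can mix in any number of traits. The sketch below illustrates the rule with deliberately simplified, hypothetical names; it is not the actual Spark summary hierarchy.

```scala
// Hypothetical names; this only illustrates the trait-versus-class constraint.
object SummaryHierarchySketch {
  trait Summary { def numClasses: Int }
  trait TrainingSummary extends Summary { def objectiveHistory: Array[Double] }

  class BinarySummary extends Summary { val numClasses: Int = 2 }

  // Works: one superclass (BinarySummary) plus a mixed-in trait (TrainingSummary).
  class BinaryTrainingSummary(val objectiveHistory: Array[Double])
    extends BinarySummary with TrainingSummary

  // If TrainingSummary were declared as a class instead of a trait, the
  // definition above would no longer compile, because `with` accepts only
  // traits and BinaryTrainingSummary already has a superclass.
}
```

This is why turning a training-summary trait into a class would force the binary training summary to give up one of its two parents.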
[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16740#discussion_r99269075

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -743,6 +743,55 @@ class GeneralizedLinearRegressionSuite
         }
       }
    +  test("generalized linear regression: intercept only") {
    +    /*
    +      R code:
    +      y <- c(17, 19, 23, 29)
    +      w <- c(1, 2, 3, 4)
    +      model1 <- glm(y ~ 1, family = poisson)
    +      model2 <- glm(y ~ 1, family = poisson, weights = w)
    +      as.vector(c(coef(model1), coef(model2)))
    +      [1] 3.091042 3.178054
    +     */
    +
    +    val dataset = Seq(
    +      Instance(17.0, 1.0, Vectors.zeros(0)),
    +      Instance(19.0, 2.0, Vectors.zeros(0)),
    +      Instance(23.0, 3.0, Vectors.zeros(0)),
    +      Instance(29.0, 4.0, Vectors.zeros(0))
    +    ).toDF()
    +
    +    val expected = Seq(3.091, 3.178)
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +    for (useWeight <- Seq(false, true)) {
    +      val trainer = new GeneralizedLinearRegression().setFamily("poisson")
    +        .setLinkPredictionCol("linkPrediction")
    +      if (useWeight) trainer.setWeightCol("weight")
    +      val model = trainer.fit(dataset)
    +      val actual = model.intercept
    +      assert(actual ~== expected(idx) absTol 1E-3, "Model mismatch: intercept only GLM with " +
    +        s"useWeight = $useWeight.")
    +      assert(model.coefficients === new DenseVector(Array.empty[Double]))
    +
    +      val familyLink = FamilyAndLink(trainer)
    +      model.transform(dataset).select("features", "prediction", "linkPrediction").collect()
    +        .foreach {
    +          case Row(features: DenseVector, prediction1: Double, linkPrediction1: Double) =>
    +            val eta = BLAS.dot(features, model.coefficients) + model.intercept
    +            val prediction2 = familyLink.fitted(eta)
    --- End diff --

    That was fast! :)
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16740 **[Test build #72296 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72296/testReport)** for PR 16740 at commit [`3a0a2af`](https://github.com/apache/spark/commit/3a0a2aff5a7b09cb0e1db7ec2e756e55b561eace).
[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16740#discussion_r99269006

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -743,6 +743,55 @@ class GeneralizedLinearRegressionSuite
         }
       }
    +  test("generalized linear regression: intercept only") {
    +    /*
    +      R code:
    +      y <- c(17, 19, 23, 29)
    +      w <- c(1, 2, 3, 4)
    +      model1 <- glm(y ~ 1, family = poisson)
    +      model2 <- glm(y ~ 1, family = poisson, weights = w)
    +      as.vector(c(coef(model1), coef(model2)))
    +      [1] 3.091042 3.178054
    +     */
    +
    +    val dataset = Seq(
    +      Instance(17.0, 1.0, Vectors.zeros(0)),
    +      Instance(19.0, 2.0, Vectors.zeros(0)),
    +      Instance(23.0, 3.0, Vectors.zeros(0)),
    +      Instance(29.0, 4.0, Vectors.zeros(0))
    +    ).toDF()
    +
    +    val expected = Seq(3.091, 3.178)
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +    for (useWeight <- Seq(false, true)) {
    +      val trainer = new GeneralizedLinearRegression().setFamily("poisson")
    +        .setLinkPredictionCol("linkPrediction")
    +      if (useWeight) trainer.setWeightCol("weight")
    +      val model = trainer.fit(dataset)
    +      val actual = model.intercept
    +      assert(actual ~== expected(idx) absTol 1E-3, "Model mismatch: intercept only GLM with " +
    +        s"useWeight = $useWeight.")
    +      assert(model.coefficients === new DenseVector(Array.empty[Double]))
    +
    +      val familyLink = FamilyAndLink(trainer)
    +      model.transform(dataset).select("features", "prediction", "linkPrediction").collect()
    +        .foreach {
    +          case Row(features: DenseVector, prediction1: Double, linkPrediction1: Double) =>
    +            val eta = BLAS.dot(features, model.coefficients) + model.intercept
    +            val prediction2 = familyLink.fitted(eta)
    --- End diff --

    @sethah Agree. Removed this. Thanks!
[GitHub] spark issue #16779: [SPARK-19437] Rectify spark executor id in HeartbeatRece...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/16779 @zsxwing Thanks a lot for reviewing this. Not sure why the test doesn't start automatically.
[GitHub] spark pull request #16740: [SPARK-19400][ML] Allow GLM to handle intercept o...
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16740#discussion_r99268773

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -743,6 +743,55 @@ class GeneralizedLinearRegressionSuite
         }
       }
    +  test("generalized linear regression: intercept only") {
    +    /*
    +      R code:
    +      y <- c(17, 19, 23, 29)
    +      w <- c(1, 2, 3, 4)
    +      model1 <- glm(y ~ 1, family = poisson)
    +      model2 <- glm(y ~ 1, family = poisson, weights = w)
    +      as.vector(c(coef(model1), coef(model2)))
    +      [1] 3.091042 3.178054
    +     */
    +
    +    val dataset = Seq(
    +      Instance(17.0, 1.0, Vectors.zeros(0)),
    +      Instance(19.0, 2.0, Vectors.zeros(0)),
    +      Instance(23.0, 3.0, Vectors.zeros(0)),
    +      Instance(29.0, 4.0, Vectors.zeros(0))
    +    ).toDF()
    +
    +    val expected = Seq(3.091, 3.178)
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +    for (useWeight <- Seq(false, true)) {
    +      val trainer = new GeneralizedLinearRegression().setFamily("poisson")
    +        .setLinkPredictionCol("linkPrediction")
    +      if (useWeight) trainer.setWeightCol("weight")
    +      val model = trainer.fit(dataset)
    +      val actual = model.intercept
    +      assert(actual ~== expected(idx) absTol 1E-3, "Model mismatch: intercept only GLM with " +
    +        s"useWeight = $useWeight.")
    +      assert(model.coefficients === new DenseVector(Array.empty[Double]))
    +
    +      val familyLink = FamilyAndLink(trainer)
    +      model.transform(dataset).select("features", "prediction", "linkPrediction").collect()
    +        .foreach {
    +          case Row(features: DenseVector, prediction1: Double, linkPrediction1: Double) =>
    +            val eta = BLAS.dot(features, model.coefficients) + model.intercept
    +            val prediction2 = familyLink.fitted(eta)
    --- End diff --

    I don't think we need to test this. This is essentially checking the correctness of the prediction mechanism, regardless of the "intercept-only" part. The prediction mechanism is tested elsewhere.
[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15435 sethah About this issue: Why is there a one-to-one overlap between MulticlassClassificationSummary and LogisticRegressionSummary, and MulticlassLogisticRegressionSummary inherits from them both? If I merge MulticlassLogisticRegressionSummary into LogisticRegressionSummary to remove the one-to-one overlap between MulticlassClassificationSummary and LogisticRegressionSummary, it will cause **more API breakage**, because BinaryLogisticRegressionTrainingSummary will then be unable to extend LogisticRegressionTrainingSummary, and it will break other public API such as `def summary`. You can try modifying it and compiling the code to see this problem. Maybe there is a better way, but I haven't thought of one.
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16740 @srowen would you please take a look and merge this if all is good? Thanks.
[GitHub] spark pull request #16780: [SPARK-19438] Both reading and updating executorD...
Github user jinxing64 closed the pull request at: https://github.com/apache/spark/pull/16780
[GitHub] spark issue #16780: [SPARK-19438] Both reading and updating executorDataMap ...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/16780 Thanks a lot for looking into this~ @zsxwing You are right. My understanding about this is incorrect. `CoarseGrainedSchedulerBackend: DriverEndpoint` is a `ThreadSafeRpcEndpoint`, thus concurrent message processing is disabled. I'll close this PR.
[GitHub] spark issue #15264: [SPARK-17477][SQL] SparkSQL cannot handle schema evoluti...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15264 Yeap, I will try to get back to that after finishing up a few issues I am currently working on. I just realised that it'd take a bit of time for me to proceed (as I noticed we need a more careful touch for it). Please feel free to take it over if anyone is interested in it. Otherwise, let me try to proceed even if it takes a while.
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16607#discussion_r99263532

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -302,16 +302,36 @@ class Word2VecModel private[ml] (
     @Since("1.6.0")
     object Word2VecModel extends MLReadable[Word2VecModel] {
    +  private case class Data(word: String, vector: Array[Float])
    +
       private[Word2VecModel]
       class Word2VecModelWriter(instance: Word2VecModel) extends MLWriter {
    -    private case class Data(wordIndex: Map[String, Int], wordVectors: Seq[Float])
    -
         override protected def saveImpl(path: String): Unit = {
           DefaultParamsWriter.saveMetadata(instance, path, sc)
    -      val data = Data(instance.wordVectors.wordIndex, instance.wordVectors.wordVectors.toSeq)
    +
    +      val wordVectors = instance.wordVectors.getVectors
    +      val dataArray = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }.toArray
    --- End diff --

    No need to convert back to an Array
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16607#discussion_r99263525

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -320,14 +340,29 @@ object Word2VecModel extends MLReadable[Word2VecModel] {
         private val className = classOf[Word2VecModel].getName

         override def load(path: String): Word2VecModel = {
    +      val spark = sparkSession
    +      import spark.implicits._
    +
           val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
    +      val (major, minor) = VersionUtils.majorMinorVersion(metadata.sparkVersion)
    +
           val dataPath = new Path(path, "data").toString
    -      val data = sparkSession.read.parquet(dataPath)
    -        .select("wordIndex", "wordVectors")
    -        .head()
    -      val wordIndex = data.getAs[Map[String, Int]](0)
    -      val wordVectors = data.getAs[Seq[Float]](1).toArray
    -      val oldModel = new feature.Word2VecModel(wordIndex, wordVectors)
    +
    +      val oldModel = if (major.toInt < 2 || (major.toInt == 2 && minor.toInt < 2)) {
    --- End diff --

    major, minor are already Ints
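A small sketch of the version check without the redundant `.toInt` calls may help here. The `majorMinor` helper below is a hypothetical stand-in for `VersionUtils.majorMinorVersion`, written only so the example runs without Spark; the `VersionCheck` object and `usesLegacyFormat` name are likewise illustrative, and the 2.2 cutoff is taken from the condition in the diff.

```scala
object VersionCheck {
  // Hypothetical stand-in for VersionUtils.majorMinorVersion: "2.1.0" becomes (2, 1).
  def majorMinor(version: String): (Int, Int) = {
    val Pattern = """^(\d+)\.(\d+)(\..*)?$""".r
    version match {
      case Pattern(major, minor, _) => (major.toInt, minor.toInt)
      case _ => throw new IllegalArgumentException(s"Cannot parse version: $version")
    }
  }

  // The tuple elements are already Ints, so they can be compared directly.
  def usesLegacyFormat(version: String): Boolean = {
    val (major, minor) = majorMinor(version)
    major < 2 || (major == 2 && minor < 2)
  }
}
```

With this shape, a model saved by 2.1.x would take the old loading path and one saved by 2.2.0 or later would take the new one.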
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16607#discussion_r99259617

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -18,10 +18,9 @@
     package org.apache.spark.ml.feature

     import org.apache.hadoop.fs.Path
    -
    --- End diff --

    Keep newline between non-spark and spark imports
[GitHub] spark issue #16686: [SPARK-18682][SS] Batch Source for Kafka
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16686 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72294/ Test PASSed.
[GitHub] spark issue #16686: [SPARK-18682][SS] Batch Source for Kafka
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16686 **[Test build #72294 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72294/testReport)** for PR 16686 at commit [`5b48fc6`](https://github.com/apache/spark/commit/5b48fc65ac08e8ed4a09edd0d346990d40d042e0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16686: [SPARK-18682][SS] Batch Source for Kafka
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16686 Merged build finished. Test PASSed.
[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r99263111

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -743,6 +743,84 @@ class GeneralizedLinearRegressionSuite
         }
       }

    +  test("generalized linear regression with offset") {
    +    /*
    +      R code:
    +      library(statmod)
    +      df <- as.data.frame(matrix(c(
    +        1.0, 1.0, 2.0, 0.0, 5.0,
    +        2.0, 2.0, 0.5, 1.0, 2.0,
    +        1.0, 3.0, 1.0, 2.0, 1.0,
    +        2.0, 4.0, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
    +      families <- list(gaussian, poisson, Gamma, tweedie(1.5))
    +      f1 <- V1 ~ -1 + V4 + V5
    +      f2 <- V1 ~ V4 + V5
    +      for (f in c(f1, f2)) {
    +        for (fam in families) {
    +          model <- glm(f, df, family = fam, weights = V2, offset = V3)
    +          print(as.vector(coef(model)))
    +        }
    +      }
    +
    +      [1] 0.535040431 0.005390836
    +      [1] 0.1968355 -0.2061711
    +      [1] 0.307996 -0.153579
    +      [1] 0.32166185 -0.09698986
    +      [1] -0.880 0.7342857 0.1714286
    +      [1] -1.9991044 0.7247511 0.1424392
    +      [1] -0.27378146 0.31599396 -0.06204946
    +      [1] -0.17118812 0.31200361 -0.02541656
    +    */
    +    val dataset = Seq(
    +      OffsetInstance(1.0, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
    +      OffsetInstance(2.0, 2.0, 0.5, Vectors.dense(1.0, 2.0)),
    +      OffsetInstance(1.0, 3.0, 1.0, Vectors.dense(2.0, 1.0)),
    +      OffsetInstance(2.0, 4.0, 0.0, Vectors.dense(3.0, 3.0))
    +    ).toDF()
    +
    +    val expected = Seq(
    +      Vectors.dense(0.0, 0.535040431, 0.005390836),
    +      Vectors.dense(0.0, 0.1968355, -0.2061711),
    +      Vectors.dense(0.0, 0.307996, -0.153579),
    +      Vectors.dense(0.0, 0.32166185, -0.09698986),
    +      Vectors.dense(-0.88, 0.7342857, 0.1714286),
    +      Vectors.dense(-1.9991044, 0.7247511, 0.1424392),
    +      Vectors.dense(-0.27378146, 0.31599396, -0.06204946),
    +      Vectors.dense(-0.17118812, 0.31200361, -0.02541656))
    +
    +    import GeneralizedLinearRegression._
    +
    +    var idx = 0
    +    for (fitIntercept <- Seq(false, true)) {
    +      for (family <- Seq("gaussian", "poisson", "gamma", "tweedie")) {
    +        var trainer = new GeneralizedLinearRegression().setFamily(family)
    +          .setFitIntercept(fitIntercept).setOffsetCol("offset")
    +          .setWeightCol("weight").setLinkPredictionCol("linkPrediction")
    +        if (family == "tweedie") trainer = trainer.setVariancePower(1.5)
    +        val model = trainer.fit(dataset)
    +        val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
    +        assert(actual ~= expected(idx) absTol 1e-4, s"Model mismatch: GLM with family = $family," +
    --- End diff --

    We need to be checking more than just the coefficients. For example, the computation of the null deviance does not match R, since the null model computation does not consider the offsets. Actually, I think we ought to just incorporate offsets into all of the other tests, which will make sure offsets are exhaustively tested. This has been done before e.g. https://github.com/apache/spark/pull/15488, and it _is_ a real pain, but it's probably the best way. I'd be open to other arguments though.
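To make the null-deviance point concrete, here is a rough sketch restricted to the Gaussian family with identity link, where the intercept-only model has a closed form: the weighted mean of `label - offset`. The `Obs` type, the `OffsetNullModel` object, and its function names are hypothetical and not part of Spark; the four rows reuse the label, weight, and offset columns of the test dataset above. Other families would need an iterative fit, so this is only an illustration of why ignoring offsets changes the null model.

```scala
object OffsetNullModel {
  // Hypothetical row type holding the (label, weight, offset) columns of the test data.
  final case class Obs(label: Double, weight: Double, offset: Double)

  val rows: Seq[Obs] = Seq(
    Obs(1.0, 1.0, 2.0),
    Obs(2.0, 2.0, 0.5),
    Obs(1.0, 3.0, 1.0),
    Obs(2.0, 4.0, 0.0)
  )

  // Intercept-only Gaussian fit: weighted mean of (label - offset), or of label when offsets are ignored.
  def nullIntercept(data: Seq[Obs], useOffset: Boolean): Double = {
    val weightedSum = data.map(r => r.weight * (r.label - (if (useOffset) r.offset else 0.0))).sum
    weightedSum / data.map(_.weight).sum
  }

  // Gaussian null deviance: weighted residual sum of squares of the intercept-only model.
  def nullDeviance(data: Seq[Obs], useOffset: Boolean): Double = {
    val intercept = nullIntercept(data, useOffset)
    data.map { r =>
      val mu = intercept + (if (useOffset) r.offset else 0.0)
      r.weight * math.pow(r.label - mu, 2)
    }.sum
  }
}
```

On these four rows the two variants give different intercepts and different deviances, which is the kind of mismatch a coefficients-only assertion would not catch.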
[GitHub] spark issue #14702: [SPARK-15694] Implement ScriptTransformation in sql/core...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/14702 Can anyone please review this PR?
[GitHub] spark issue #16740: [SPARK-19400][ML] Allow GLM to handle intercept only mod...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/16740 Ok, yeah, let's go with this fix now then - seems both R and statsmodels fit to compute the null model. Thanks for following up on that!
[GitHub] spark issue #16765: [SPARK-19425][SQL] Make df.except work for UDT
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16765 **[Test build #72295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72295/testReport)** for PR 16765 at commit [`ac3c3bf`](https://github.com/apache/spark/commit/ac3c3bfa270dda077bf89db926c38b9946c4738e).
[GitHub] spark pull request #16733: [SPARK-19392][SQL] Fix the bug that throws an exc...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16733#discussion_r99261637

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala ---
    @@ -29,7 +29,12 @@ private case object OracleDialect extends JdbcDialect {
       override def getCatalystType(
    --- End diff --

    I looked over the previous releases, though, and it seems `scale` is always set there. So, I'm not sure why this exception happens in the report. What do you think? Is it okay to close this?
[GitHub] spark issue #16686: [SPARK-18682][SS] Batch Source for Kafka
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16686 **[Test build #72294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72294/testReport)** for PR 16686 at commit [`5b48fc6`](https://github.com/apache/spark/commit/5b48fc65ac08e8ed4a09edd0d346990d40d042e0).
[GitHub] spark issue #16607: [SPARK-19247][ML] Save large word2vec models
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16607 Merged build finished. Test FAILed.
[GitHub] spark issue #16607: [SPARK-19247][ML] Save large word2vec models
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16607 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72293/ Test FAILed.
[GitHub] spark issue #16607: [SPARK-19247][ML] Save large word2vec models
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16607 **[Test build #72293 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72293/testReport)** for PR 16607 at commit [`9b5e928`](https://github.com/apache/spark/commit/9b5e9288699012b2e5d9b347191fd3d141b31d7d). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16607: [SPARK-19247][ML] Save large word2vec models
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16607 **[Test build #72293 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72293/testReport)** for PR 16607 at commit [`9b5e928`](https://github.com/apache/spark/commit/9b5e9288699012b2e5d9b347191fd3d141b31d7d).
[GitHub] spark issue #16607: [SPARK-19247][ML] Save large word2vec models
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/16607 ok to test
[GitHub] spark issue #16607: [SPARK-19247][ML] Save large word2vec models
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/16607 Sorry for the delay; will take a look now!
[GitHub] spark issue #14412: [SPARK-15355] [CORE] Proactive block replication
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14412 Merged build finished. Test PASSed.
[GitHub] spark issue #14412: [SPARK-15355] [CORE] Proactive block replication
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14412 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72291/ Test PASSed.