[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23127 **[Test build #99275 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99275/testReport)** for PR 23127 at commit [`170073e`](https://github.com/apache/spark/commit/170073efd4d752a436bc18c4c0d2c0fc755d2d37). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23124 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23123: [SPARK-26153][ML] GBT & RandomForest avoid unnece...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23123 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23138: [SPARK-23356][SQL][TEST] add new test cases for a + 1,a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23138 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23088: [SPARK-26119][CORE][WEBUI]Task summary table shou...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23088#discussion_r236236422 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala --- @@ -222,29 +223,20 @@ private[spark] class AppStatusStore( val indices = quantiles.map { q => math.min((q * count).toLong, count - 1) } def scanTasks(index: String)(fn: TaskDataWrapper => Long): IndexedSeq[Double] = { - Utils.tryWithResource( -store.view(classOf[TaskDataWrapper]) - .parent(stageKey) - .index(index) - .first(0L) - .closeableIterator() - ) { it => -var last = Double.NaN -var currentIdx = -1L -indices.map { idx => - if (idx == currentIdx) { -last - } else { -val diff = idx - currentIdx -currentIdx = idx -if (it.skip(diff - 1)) { - last = fn(it.next()).toDouble - last -} else { - Double.NaN -} - } -}.toIndexedSeq + val quantileTasks = store.view(classOf[TaskDataWrapper]) --- End diff -- CC @vanzin for the proposed new method. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23138: [SPARK-23356][SQL][TEST] add new test cases for a + 1,a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23138 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99272/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21412: [SPARK-18805][DStream] Avoid StackOverflowError while ge...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21412 Agree, this PR should be closed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23144 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23124#discussion_r236275822 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -546,33 +546,29 @@ case class MapConcat(children: Seq[Expression]) extends ComplexTypeMergingExpres override def nullable: Boolean = children.exists(_.nullable) + private lazy val mapBuilder = new ArrayBasedMapBuilder(dataType.keyType, dataType.valueType) + override def eval(input: InternalRow): Any = { -val maps = children.map(_.eval(input)) +val maps = children.map(_.eval(input).asInstanceOf[MapData]).toArray --- End diff -- `toArray` is O(N) but we do it only once. If accessing by index is O(N), the total time complexity is O(N ^ 2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21412: [SPARK-18805][DStream] Avoid StackOverflowError w...
Github user chermenin closed the pull request at: https://github.com/apache/spark/pull/21412 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21688: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21688 **[Test build #4442 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4442/testReport)** for PR 21688 at commit [`8f6efbd`](https://github.com/apache/spark/commit/8f6efbda4ad6b7fc4c86f4a2ae43e49847243826). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23126 Is there a simple test case you can add to cover that too? that would really prove this change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...
Github user juliuszsompolski commented on the issue: https://github.com/apache/spark/pull/23127 @cloud-fan @rednaxelafx Actually, the input to a codegen stage can be an internal row so I can't make the inputRDD be `RDD[UnsafeRow], but the output needs to be UnsafeRow. Doing it like `InputAdapter` did actually make it just pass-through output the internal row. For `InputAdapter`, there always is some parent operator to consume it, and create an unsafe projection in whatever it does, and then the output UnsafeRows. But for an `RDDScanExec` or `RowDataSourceScanExec` could be alone in a WholeStageCodegenExec, and then just doing `${consume(ctx, null, row)}` made it pass-through output the InternalRow from input. WDYT about how I patched it up? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Ok, I will do it later --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23142: [SPARK-26170][SS] Add missing metrics in FlatMapGroupsWi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23142 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99268/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23130#discussion_r236226380 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -388,7 +388,7 @@ case class FileSourceScanExec( logInfo(s"Planning with ${bucketSpec.numBuckets} buckets") val filesGroupedToBuckets = selectedPartitions.flatMap { p => -p.files.map { f => +p.files.filter(_.getLen > 0).map { f => --- End diff -- I personally prefer filter + map as it's shorter and clearer. I don't know if one is faster; two transformations vs having to return Some/None. For a Dataset operation I'd favor one operation, but this is just local Scala code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23139: [SPARK-25860][SPARK-26107] [FOLLOW-UP] Rule ReplaceNullW...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23139 **[Test build #99270 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99270/testReport)** for PR 23139 at commit [`e416810`](https://github.com/apache/spark/commit/e41681096867cbc6d2556da83ce733092d6df841). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23130#discussion_r236233074 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -388,7 +388,7 @@ case class FileSourceScanExec( logInfo(s"Planning with ${bucketSpec.numBuckets} buckets") val filesGroupedToBuckets = selectedPartitions.flatMap { p => -p.files.map { f => +p.files.filter(_.getLen > 0).map { f => --- End diff -- It's non-critical path in terms of performance. Should be okay. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23143: [SPARK-24762][SQL][Followup] Enable Option of Pro...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23143#discussion_r236267692 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala --- @@ -251,19 +251,15 @@ case class ExpressionEncoder[T]( */ def isSerializedAsStruct: Boolean = objSerializer.dataType.isInstanceOf[StructType] - /** - * Returns true if the type `T` is an `Option` type. - */ - def isOptionType: Boolean = classOf[Option[_]].isAssignableFrom(clsTag.runtimeClass) - /** * If the type `T` is serialized as a struct, when it is encoded to a Spark SQL row, fields in * the struct are naturally mapped to top-level columns in a row. In other words, the serialized * struct is flattened to row. But in case of the `T` is also an `Option` type, it can't be * flattened to top-level row, because in Spark SQL top-level row can't be null. This method * returns true if `T` is serialized as struct and is not `Option` type. */ - def isSerializedAsStructForTopLevel: Boolean = isSerializedAsStruct && !isOptionType + def isSerializedAsStructForTopLevel: Boolean = isSerializedAsStruct && --- End diff -- nit: since it cross lines, I'd prefer ``` def xxx = { xxx } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21688: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21688 Test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Um, the unit test in spark indeed cover both case. But there is function closeToZero to handle accuracy problem, so.. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23127 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23123: [SPARK-26153][ML] GBT & RandomForest avoid unnecessary `...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23123 Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23138: [SPARK-23356][SQL][TEST] add new test cases for a + 1,a ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23138 **[Test build #99272 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99272/testReport)** for PR 23138 at commit [`471d114`](https://github.com/apache/spark/commit/471d1144d41f767b3227d78b663eaa79efef738c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23143: [SPARK-24762][SQL][Followup] Enable Option of Product en...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23143 **[Test build #99273 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99273/testReport)** for PR 23143 at commit [`f6bfe3d`](https://github.com/apache/spark/commit/f6bfe3d66774c5b8a61123dd79d4d3d49464df5f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23144 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5357/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23144: [SPARK-26172][ML][WIP] Unify String Params' case-insensi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23144 **[Test build #99274 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99274/testReport)** for PR 23144 at commit [`e55244a`](https://github.com/apache/spark/commit/e55244afa41e959f99c02f1afbe916fc7d1ffec3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23127 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5358/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23139: [SPARK-25860][SPARK-26107] [FOLLOW-UP] Rule ReplaceNullW...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23139 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23139: [SPARK-25860][SPARK-26107] [FOLLOW-UP] Rule ReplaceNullW...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23139 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99270/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22991: [SPARK-25989][ML] OneVsRestModel handle empty out...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22991#discussion_r236230624 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala --- @@ -209,6 +215,9 @@ final class OneVsRestModel private[ml] ( newDataset.unpersist() } +var outputColNames = Seq.empty[String] --- End diff -- Maybe 'predictionColumns' ? These aren't the only output columns. You could make this a mutable val too, but whatever. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22163 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99271/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22163 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 It would be better, update the commit --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23124: [SPARK-25829][SQL] remove duplicated map keys wit...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23124#discussion_r236282401 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala --- @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types.{AtomicType, CalendarIntervalType, DataType, MapType} + +/** + * A builder of [[ArrayBasedMapData]], which fails if a null map key is detected, and removes + * duplicated map keys w.r.t. the last wins policy. + */ +class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable { + assert(!keyType.existsRecursively(_.isInstanceOf[MapType]), "key of map cannot be/contain map") + + private lazy val keyToIndex = keyType match { +case _: AtomicType | _: CalendarIntervalType => mutable.HashMap.empty[Any, Int] +case _ => + // for complex types, use interpreted ordering to be able to compare unsafe data with safe + // data, e.g. UnsafeRow vs GenericInternalRow. + mutable.TreeMap.empty[Any, Int](TypeUtils.getInterpretedOrdering(keyType)) + } + + // TODO: specialize it + private lazy val keys = mutable.ArrayBuffer.empty[Any] + private lazy val values = mutable.ArrayBuffer.empty[Any] + + private lazy val keyGetter = InternalRow.getAccessor(keyType) + private lazy val valueGetter = InternalRow.getAccessor(valueType) + + def reset(): Unit = { +keyToIndex.clear() +keys.clear() +values.clear() + } + + def put(key: Any, value: Any): Unit = { +if (key == null) { + throw new RuntimeException("Cannot use null as map key.") +} + +val maybeExistingIdx = keyToIndex.get(key) +if (maybeExistingIdx.isDefined) { + // Overwrite the previous value, as the policy is last wins. + values(maybeExistingIdx.get) = value +} else { + keyToIndex.put(key, values.length) + keys.append(key) + values.append(value) +} + } + + // write a 2-field row, the first field is key and the second field is value. + def put(entry: InternalRow): Unit = { +if (entry.isNullAt(0)) { + throw new RuntimeException("Cannot use null as map key.") +} +put(keyGetter(entry, 0), valueGetter(entry, 1)) + } + + def putAll(keyArray: Array[Any], valueArray: Array[Any]): Unit = { +if (keyArray.length != valueArray.length) { + throw new RuntimeException( +"The key array and value array of MapData must have the same length.") +} + +var i = 0 +while (i < keyArray.length) { + put(keyArray(i), valueArray(i)) + i += 1 +} + } + + def putAll(keyArray: ArrayData, valueArray: ArrayData): Unit = { +if (keyArray.numElements() != valueArray.numElements()) { + throw new RuntimeException( +"The key array and value array of MapData must have the same length.") +} + +var i = 0 +while (i < keyArray.numElements()) { + put(keyGetter(keyArray, i), valueGetter(valueArray, i)) + i += 1 +} + } + + def build(): ArrayBasedMapData = { +new ArrayBasedMapData(new GenericArrayData(keys.toArray), new GenericArrayData(values.toArray)) + } + + def from(keyArray: ArrayData, valueArray: ArrayData): ArrayBasedMapData = { +assert(keyToIndex.isEmpty, "'from' can only be called with a fresh GenericMapBuilder.") +putAll(keyArray, valueArray) --- End diff -- no we can't, as we still need to detect duplicated keys. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23126 **[Test build #4441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4441/testReport)** for PR 23126 at commit [`6fca901`](https://github.com/apache/spark/commit/6fca901dfdad230fa7a4e1079a3f48be826276a3). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23143: [SPARK-24762][SQL][Followup] Enable Option of Pro...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23143#discussion_r236295606 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala --- @@ -251,19 +251,15 @@ case class ExpressionEncoder[T]( */ def isSerializedAsStruct: Boolean = objSerializer.dataType.isInstanceOf[StructType] - /** - * Returns true if the type `T` is an `Option` type. - */ - def isOptionType: Boolean = classOf[Option[_]].isAssignableFrom(clsTag.runtimeClass) - /** * If the type `T` is serialized as a struct, when it is encoded to a Spark SQL row, fields in * the struct are naturally mapped to top-level columns in a row. In other words, the serialized * struct is flattened to row. But in case of the `T` is also an `Option` type, it can't be * flattened to top-level row, because in Spark SQL top-level row can't be null. This method * returns true if `T` is serialized as struct and is not `Option` type. */ - def isSerializedAsStructForTopLevel: Boolean = isSerializedAsStruct && !isOptionType + def isSerializedAsStructForTopLevel: Boolean = isSerializedAsStruct && --- End diff -- ok. fixed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23145: [MINOR][Docs] "a R interpreter" -> "an R interpreter"
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23145 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23117: [WIP][SPARK-7721][INFRA] Run and generate test coverage ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23117 Hey @shaneknapp, have you found some time to take a look for this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23055 Let me get this in in few days if there are no more comments. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23138: [SPARK-23356][SQL][TEST] add new test cases for a...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23138#discussion_r236334056 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SetOperationSuite.scala --- @@ -196,4 +196,31 @@ class SetOperationSuite extends PlanTest { )) comparePlans(expectedPlan, rewrittenPlan) } + + test("SPARK-23356 union: expressions with number in project list are pushed down") { --- End diff -- `expressions with number` -> `expressions with literal` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23143: [SPARK-24762][SQL][Followup] Enable Option of Product en...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23143 **[Test build #99277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99277/testReport)** for PR 23143 at commit [`f3f0507`](https://github.com/apache/spark/commit/f3f0507f6797c571ce66086d9e3b7fcbfd234c79). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23145: [MINOR][Docs] "a R interpreter" -> "an R interpreter"
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23145 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23145: [MINOR][Docs] "a R interpreter" -> "an R interpreter"
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23145 It's okay but mind taking another look the documentation and fix other typos while we are here? I'm pretty sure there are more. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23055 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23055 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5361/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23145: [MINOR][Docs] "a R interpreter" -> "an R interpreter"
Github user kjmrknsn commented on the issue: https://github.com/apache/spark/pull/23145 OK, I'll check the whole documentation later. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23146: [SPARK-26173] [MLlib] Prior regularization for Logistic ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23146 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23146: [SPARK-26173] [MLlib] Prior regularization for Logistic ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23146 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23127: [SPARK-26159] Codegen for LocalTableScanExec and ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23127#discussion_r236332511 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -406,14 +415,62 @@ trait BlockingOperatorWithCodegen extends CodegenSupport { override def limitNotReachedChecks: Seq[String] = Nil } +/** + * Leaf codegen node reading from a single RDD. + */ +trait InputRDDCodegen extends CodegenSupport { + + def inputRDD: RDD[InternalRow] + + // If the input is an RDD of InternalRow which are potentially not UnsafeRow, + // and there is no parent to consume it, it needs an UnsafeProjection. + protected val createUnsafeProjection: Boolean = (parent == null) + + override def inputRDDs(): Seq[RDD[InternalRow]] = { +inputRDD :: Nil + } + + override def doProduce(ctx: CodegenContext): String = { --- End diff -- can you highlight the difference between this one and the previous `RowDataSourceScanExec.doProduce`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23141: [SPARK-26021][SQL][followup] add test for special floati...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/23141 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23145: [MINOR][Docs] "a R interpreter" -> "an R interpre...
GitHub user kjmrknsn opened a pull request: https://github.com/apache/spark/pull/23145 [MINOR][Docs] "a R interpreter" -> "an R interpreter" ## What changes were proposed in this pull request? This PR changes the phrase of `a R interpreter` to `an R interpreter` on `docs/index.md`. ## How was this patch tested? NA You can merge this pull request into a Git repository by running: $ git pull https://github.com/kjmrknsn/spark docUpdate Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23145.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23145 commit 9f2f60b3b80b1ae63c9df38cbc3d5bc1010d4123 Author: Keiji Yoshida Date: 2018-11-26T15:29:16Z [MINOR][Docs] "a R interpreter" -> "an R interpreter" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23145: [MINOR][Docs] "a R interpreter" -> "an R interpreter"
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23145 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23055 **[Test build #99278 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99278/testReport)** for PR 23055 at commit [`fd92a4e`](https://github.com/apache/spark/commit/fd92a4e1cee9f666d7ee6f9c9fcb45367c8132a8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22939 Hey @felixcheung, have you found some time to take a look for this please? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23055 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23146: [SPARK-26173] [MLlib] Prior regularization for Lo...
GitHub user elfausto opened a pull request: https://github.com/apache/spark/pull/23146 [SPARK-26173] [MLlib] Prior regularization for Logistic Regression Implementation of [SPARK-26173](https://issues.apache.org/jira/browse/SPARK-26173). Unit tests have been added to the modified classes. You can merge this pull request into a Git repository by running: $ git pull https://github.com/affectv/spark SPARK-26173 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23146.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23146 commit d8c35bb897cc0e2248810bfc7208e642deb5c55d Author: Facundo Bellosi Date: 2018-11-26T15:31:26Z SPARK-26173. Prior regularization for Logistic Regression. commit 52918b7bd30f8fdccce450d7f41e34f8bedf32ad Author: Facundo Bellosi Date: 2018-11-26T15:50:11Z SPARK-26173. Prior regularization for Logistic Regression. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23141: [SPARK-26021][SQL][followup] add test for special floati...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23141 **[Test build #99279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99279/testReport)** for PR 23141 at commit [`7c590bc`](https://github.com/apache/spark/commit/7c590bcd239afb32537d607ee8b2190f4e4db5a3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23146: [SPARK-26173] [MLlib] Prior regularization for Logistic ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23146 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23141: [SPARK-26021][SQL][followup] add test for special floati...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23141 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23127: [SPARK-26159] Codegen for LocalTableScanExec and ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23127#discussion_r236332786 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -406,14 +415,62 @@ trait BlockingOperatorWithCodegen extends CodegenSupport { override def limitNotReachedChecks: Seq[String] = Nil } +/** + * Leaf codegen node reading from a single RDD. + */ +trait InputRDDCodegen extends CodegenSupport { + + def inputRDD: RDD[InternalRow] + + // If the input is an RDD of InternalRow which are potentially not UnsafeRow, + // and there is no parent to consume it, it needs an UnsafeProjection. + protected val createUnsafeProjection: Boolean = (parent == null) + + override def inputRDDs(): Seq[RDD[InternalRow]] = { +inputRDD :: Nil + } + + override def doProduce(ctx: CodegenContext): String = { --- End diff -- the extra `if createUnsafeProjection` check? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23124 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5359/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23124 **[Test build #99276 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99276/testReport)** for PR 23124 at commit [`b2bfb33`](https://github.com/apache/spark/commit/b2bfb3353f89be03fc6b12e4e8dd0899e3510a09). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23143: [SPARK-24762][SQL][Followup] Enable Option of Product en...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23143 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23143: [SPARK-24762][SQL][Followup] Enable Option of Product en...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23143 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5360/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23141: [SPARK-26021][SQL][followup] add test for special floati...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23141 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5362/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23127: [SPARK-26159] Codegen for LocalTableScanExec and ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23127#discussion_r236333530 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -406,14 +415,62 @@ trait BlockingOperatorWithCodegen extends CodegenSupport { override def limitNotReachedChecks: Seq[String] = Nil } +/** + * Leaf codegen node reading from a single RDD. + */ +trait InputRDDCodegen extends CodegenSupport { + + def inputRDD: RDD[InternalRow] + + // If the input is an RDD of InternalRow which are potentially not UnsafeRow, + // and there is no parent to consume it, it needs an UnsafeProjection. + protected val createUnsafeProjection: Boolean = (parent == null) + + override def inputRDDs(): Seq[RDD[InternalRow]] = { +inputRDD :: Nil + } + + override def doProduce(ctx: CodegenContext): String = { +// Inline mutable state since an InputRDDCodegen is used once in a task for WholeStageCodegen +val input = ctx.addMutableState("scala.collection.Iterator", "input", v => s"$v = inputs[0];", + forceInline = true) +val row = ctx.freshName("row") + +val outputVars = if (createUnsafeProjection) { + // creating the vars will make the parent consume add an unsafe projection. + ctx.INPUT_ROW = row + ctx.currentVars = null + output.zipWithIndex.map { case (a, i) => +BoundReference(i, a.dataType, a.nullable).genCode(ctx) + } +} else { + null +} + +val numOutputRowsCode = if (metrics.contains("numOutputRows")) { --- End diff -- how about `updateNumOutputRowsMetrics` instead of `numOutputRowsCode`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23102: [SPARK-26137][CORE] Use Java system property "fil...
Github user markpavey commented on a diff in the pull request: https://github.com/apache/spark/pull/23102#discussion_r236340228 --- Diff: core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala --- @@ -65,7 +65,7 @@ private[deploy] object DependencyUtils extends Logging { .map { resolveGlobPaths(_, hadoopConf) .split(",") - .filterNot(_.contains(userJar.split("/").last)) + .filterNot(_.contains(userJar.split(System.getProperty("file.separator")).last)) --- End diff -- I have updated to use File.separator instead of System.getProperty("file.separator"). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23127 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23127 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5369/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23127 **[Test build #99286 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99286/testReport)** for PR 23127 at commit [`5d94c42`](https://github.com/apache/spark/commit/5d94c4274d468f3873db0827c88e1461f0125b58). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21688: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21688 +1 , going to merge to master There are a few followup jiras on this. 1) make the timeline visualization better: https://issues.apache.org/jira/browse/SPARK-26130 2) improve search functionalit: https://issues.apache.org/jira/browse/SPARK-25719 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23058: [SPARK-25905][CORE] When getting a remote block, avoid f...
Github user squito commented on the issue: https://github.com/apache/spark/pull/23058 lgtm I looked more into the lifecycle of the buffers and when they get `disposed`, and it looks fine to me. (In fact I think there is no need for the `dispose` in the first place, as hinted at here: https://github.com/apache/spark/pull/22511#issuecomment-424429691) I also checked about whether we should buffer the input stream, but `dataDeserializeStream` already does that. @wypoon one thing, can you update the testing section of the pr description to mention the coverage you found in the existing unit tests? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23109: [SPARK-26069][TESTS][FOLLOWUP]Add another possible error...
Github user squito commented on the issue: https://github.com/apache/spark/pull/23109 lte review, but lgtm anyway --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21688: [SPARK-21809] : Change Stage Page to use datatabl...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21688 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23088: [SPARK-26119][CORE][WEBUI]Task summary table shou...
Github user shahidki31 commented on a diff in the pull request: https://github.com/apache/spark/pull/23088#discussion_r236409557 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala --- @@ -222,29 +223,20 @@ private[spark] class AppStatusStore( val indices = quantiles.map { q => math.min((q * count).toLong, count - 1) } def scanTasks(index: String)(fn: TaskDataWrapper => Long): IndexedSeq[Double] = { - Utils.tryWithResource( -store.view(classOf[TaskDataWrapper]) - .parent(stageKey) - .index(index) - .first(0L) - .closeableIterator() - ) { it => -var last = Double.NaN -var currentIdx = -1L -indices.map { idx => - if (idx == currentIdx) { -last - } else { -val diff = idx - currentIdx -currentIdx = idx -if (it.skip(diff - 1)) { - last = fn(it.next()).toDouble - last -} else { - Double.NaN -} - } -}.toIndexedSeq + val quantileTasks = store.view(classOf[TaskDataWrapper]) --- End diff -- Thank you @vanzin for the review. I will check the time with large number of tasks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23088: [SPARK-26119][CORE][WEBUI]Task summary table should cont...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23088 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5370/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23111: [SPARK-26148][PYTHON][TESTS] Increases default paralleli...
Github user squito commented on the issue: https://github.com/apache/spark/pull/23111 wow, thats great! glad there is a big speedup. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23088: [SPARK-26119][CORE][WEBUI]Task summary table shou...
Github user shahidki31 commented on a diff in the pull request: https://github.com/apache/spark/pull/23088#discussion_r236409661 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala --- @@ -222,29 +223,20 @@ private[spark] class AppStatusStore( val indices = quantiles.map { q => math.min((q * count).toLong, count - 1) } def scanTasks(index: String)(fn: TaskDataWrapper => Long): IndexedSeq[Double] = { - Utils.tryWithResource( -store.view(classOf[TaskDataWrapper]) - .parent(stageKey) - .index(index) - .first(0L) - .closeableIterator() - ) { it => -var last = Double.NaN -var currentIdx = -1L -indices.map { idx => - if (idx == currentIdx) { -last - } else { -val diff = idx - currentIdx -currentIdx = idx -if (it.skip(diff - 1)) { - last = fn(it.next()).toDouble - last -} else { - Double.NaN -} - } -}.toIndexedSeq + val quantileTasks = store.view(classOf[TaskDataWrapper]) +.parent(stageKey) +.index(index) +.first(0L) +.asScala +.filter(_.status == "SUCCESS") // Filter "SUCCESS" tasks +.zipWithIndex +.filter(x => indices.contains(x._2)) --- End diff -- Thanks. I have modified --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23088: [SPARK-26119][CORE][WEBUI]Task summary table should cont...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23088 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/23100#discussion_r236410306 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala --- @@ -17,126 +17,512 @@ package org.apache.spark.ml.feature +import org.apache.hadoop.fs.Path --- End diff -- I guess once the commits of the history are squashed into one, it will still like this without better history. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/23100#discussion_r236410750 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala --- @@ -17,126 +17,512 @@ package org.apache.spark.ml.feature +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException import org.apache.spark.annotation.Since -import org.apache.spark.ml.Transformer +import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vectors import org.apache.spark.ml.param._ -import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols} import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset} -import org.apache.spark.sql.functions.{col, udf} -import org.apache.spark.sql.types.{DoubleType, NumericType, StructType} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, lit, udf} +import org.apache.spark.sql.types.{DoubleType, StructField, StructType} + +/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */ +private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data during transform(). + * Options are 'keep' (invalid data presented as an extra categorical feature) or + * 'error' (throw an error). + * Note that this Param is only used during transform; during fitting, invalid data + * will result in an error. + * Default: "error" + * @group param + */ + @Since("2.3.0") --- End diff -- As we discussed previously, it's a new class. Should we make it as `3.0`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23148: [SPARK-26177] Automated formatting for Scala code
GitHub user koeninger opened a pull request: https://github.com/apache/spark/pull/23148 [SPARK-26177] Automated formatting for Scala code ## What changes were proposed in this pull request? Add a maven plugin and wrapper script at ./dev/scalafmt to use scalafmt to format files that differ from git master. Intention is for contributors to be able to use this to automate fixing code style, not to include it in build pipeline yet. If this PR is accepted, I'd make a different PR to update the code style section of https://spark.apache.org/contributing.html to mention the script ## How was this patch tested? Manually tested by modifying a few files and running the script, then checking that scalastyle still passed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/koeninger/spark-1 scalafmt Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23148.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23148 commit 5df854a340f5e9e2fe7ba60194b4891235420369 Author: cody koeninger Date: 2018-11-21T20:32:57Z WIP on scalafmt config commit 57684272b60ff1213c957541f1db3f9d7aba1543 Author: cody koeninger Date: 2018-11-21T21:22:15Z mvn plugin, e.g. ./build/mvn mvn-scalafmt_2.12:format commit bd7baeca577bf9b519fe028d1e831fb7193e7af9 Author: cody koeninger Date: 2018-11-22T17:07:06Z shell wrapper for mvn scalafmt plugin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23148: [SPARK-26177] Automated formatting for Scala code
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23148 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99288/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23148: [SPARK-26177] Automated formatting for Scala code
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23148 **[Test build #99288 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99288/testReport)** for PR 23148 at commit [`bd7baec`](https://github.com/apache/spark/commit/bd7baeca577bf9b519fe028d1e831fb7193e7af9). * This patch **fails RAT tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23148: [SPARK-26177] Automated formatting for Scala code
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23148 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23141: [SPARK-26021][SQL][followup] add test for special floati...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23141 **[Test build #99279 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99279/testReport)** for PR 23141 at commit [`7c590bc`](https://github.com/apache/spark/commit/7c590bcd239afb32537d607ee8b2190f4e4db5a3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23141: [SPARK-26021][SQL][followup] add test for special floati...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23141 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99279/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23141: [SPARK-26021][SQL][followup] add test for special floati...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23141 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23088: [SPARK-26119][CORE][WEBUI]Task summary table shou...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/23088#discussion_r236398604 --- Diff: core/src/test/scala/org/apache/spark/status/AppStatusStoreSuite.scala --- @@ -95,10 +123,18 @@ class AppStatusStoreSuite extends SparkFunSuite { private def newTaskData(i: Int): TaskDataWrapper = { new TaskDataWrapper( - i, i, i, i, i, i, i.toString, i.toString, i.toString, i.toString, false, Nil, None, + i, i, i, i, i, i, i.toString, i.toString, "SUCCESS", i.toString, false, Nil, None, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, stageId, attemptId) } + private def failedTaskData(i: Int): TaskDataWrapper = { --- End diff -- better to have the status be a parameter to the existing method (with a default value) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23088: [SPARK-26119][CORE][WEBUI]Task summary table shou...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23088#discussion_r236405278 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala --- @@ -222,29 +223,20 @@ private[spark] class AppStatusStore( val indices = quantiles.map { q => math.min((q * count).toLong, count - 1) } def scanTasks(index: String)(fn: TaskDataWrapper => Long): IndexedSeq[Double] = { - Utils.tryWithResource( -store.view(classOf[TaskDataWrapper]) - .parent(stageKey) - .index(index) - .first(0L) - .closeableIterator() - ) { it => -var last = Double.NaN -var currentIdx = -1L -indices.map { idx => - if (idx == currentIdx) { -last - } else { -val diff = idx - currentIdx -currentIdx = idx -if (it.skip(diff - 1)) { - last = fn(it.next()).toDouble - last -} else { - Double.NaN -} - } -}.toIndexedSeq + val quantileTasks = store.view(classOf[TaskDataWrapper]) +.parent(stageKey) +.index(index) +.first(0L) +.asScala +.filter(_.status == "SUCCESS") // Filter "SUCCESS" tasks +.zipWithIndex +.filter(x => indices.contains(x._2)) --- End diff -- I don't feel strongly about it but I have written it as above where a new block isn't needed. I don't think it makes any runtime difference. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23055 **[Test build #99278 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99278/testReport)** for PR 23055 at commit [`fd92a4e`](https://github.com/apache/spark/commit/fd92a4e1cee9f666d7ee6f9c9fcb45367c8132a8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23088: [SPARK-26119][CORE][WEBUI]Task summary table shou...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/23088#discussion_r236405732 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala --- @@ -222,29 +223,20 @@ private[spark] class AppStatusStore( val indices = quantiles.map { q => math.min((q * count).toLong, count - 1) } def scanTasks(index: String)(fn: TaskDataWrapper => Long): IndexedSeq[Double] = { - Utils.tryWithResource( -store.view(classOf[TaskDataWrapper]) - .parent(stageKey) - .index(index) - .first(0L) - .closeableIterator() - ) { it => -var last = Double.NaN -var currentIdx = -1L -indices.map { idx => - if (idx == currentIdx) { -last - } else { -val diff = idx - currentIdx -currentIdx = idx -if (it.skip(diff - 1)) { - last = fn(it.next()).toDouble - last -} else { - Double.NaN -} - } -}.toIndexedSeq + val quantileTasks = store.view(classOf[TaskDataWrapper]) +.parent(stageKey) +.index(index) +.first(0L) +.asScala +.filter(_.status == "SUCCESS") // Filter "SUCCESS" tasks +.zipWithIndex +.filter(x => indices.contains(x._2)) --- End diff -- It doesn't make a runtime different. That's why this is a style nit. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23055 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99278/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23055 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23088: [SPARK-26119][CORE][WEBUI]Task summary table should cont...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23088 **[Test build #99287 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99287/testReport)** for PR 23088 at commit [`b29a150`](https://github.com/apache/spark/commit/b29a150f6d28065f5fde2b4c40f66ea4683bac69). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23111: [SPARK-26148][PYTHON][TESTS] Increases default paralleli...
Github user squito commented on the issue: https://github.com/apache/spark/pull/23111 we might need to be careful that this doesn't un-intentionally overload the jenkins workers so that we end up hitting more timeouts from too many things running concurrently (I dunno how isolated the workers are) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23088: [SPARK-26119][CORE][WEBUI]Task summary table shou...
Github user shahidki31 commented on a diff in the pull request: https://github.com/apache/spark/pull/23088#discussion_r236409746 --- Diff: core/src/test/scala/org/apache/spark/status/AppStatusStoreSuite.scala --- @@ -95,10 +123,18 @@ class AppStatusStoreSuite extends SparkFunSuite { private def newTaskData(i: Int): TaskDataWrapper = { new TaskDataWrapper( - i, i, i, i, i, i, i.toString, i.toString, i.toString, i.toString, false, Nil, None, + i, i, i, i, i, i, i.toString, i.toString, "SUCCESS", i.toString, false, Nil, None, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, stageId, attemptId) } + private def failedTaskData(i: Int): TaskDataWrapper = { --- End diff -- Thanks. done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23088: [SPARK-26119][CORE][WEBUI]Task summary table should cont...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23088 **[Test build #99289 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99289/testReport)** for PR 23088 at commit [`c81fcf3`](https://github.com/apache/spark/commit/c81fcf3a18339852a681f118129d4d4234adae5a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23088: [SPARK-26119][CORE][WEBUI]Task summary table should cont...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23088 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5372/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org