[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14035#discussion_r69390132 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala --- @@ -134,15 +135,14 @@ class GBTClassifierSuite extends SparkFunSuite with MLlibTestSparkContext */ test("Fitting without numClasses in metadata") { -val df: DataFrame = spark.createDataFrame(TreeTests.featureImportanceData(sc)) +val df: DataFrame = TreeTests.featureImportanceData(sc).toDF() val gbt = new GBTClassifier().setMaxDepth(1).setMaxIter(1) gbt.fit(df) --- End diff -- Wonder why this line is separate not part of 139? Any reason? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14035#discussion_r69390117 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/ClassifierSuite.scala --- @@ -71,8 +71,7 @@ class ClassifierSuite extends SparkFunSuite with MLlibTestSparkContext { test("getNumClasses") { def getTestData(labels: Seq[Double]): DataFrame = { --- End diff -- repeated. What about Moving it outside `test` methods? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14020: [SPARK-16349][sql] Fall back to isolated class lo...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14020#discussion_r69382827 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -264,7 +270,7 @@ private[hive] class IsolatedClientLoader( throw new ClassNotFoundException( s"$cnf when creating Hive client using classpath: ${execJars.mkString(", ")}\n" + "Please make sure that jars for your version of hive and hadoop are included in the " + --- End diff -- Just a nitpick...should 'hive' be Hive as the line above + Hadoop? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14026: [SPARK-13569][STREAMING][KAFKA] pattern based top...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14026#discussion_r69382788 --- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/ConsumerStrategy.scala --- @@ -79,8 +81,71 @@ private case class Subscribe[K, V]( def onStart(currentOffsets: ju.Map[TopicPartition, jl.Long]): Consumer[K, V] = { val consumer = new KafkaConsumer[K, V](kafkaParams) consumer.subscribe(topics) -if (currentOffsets.isEmpty) { - offsets.asScala.foreach { case (topicPartition, offset) => +val toSeek = if (currentOffsets.isEmpty) { + offsets +} else { + currentOffsets +} +if (!toSeek.isEmpty) { + // work around KAFKA-3370 when reset is none + val aor = kafkaParams.get(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG) + val shouldSuppress = aor != null && aor.asInstanceOf[String].toUpperCase == "NONE" + try { +consumer.poll(0) + } catch { +case x: NoOffsetForPartitionException if shouldSuppress => + // silence exception + } + toSeek.asScala.foreach { case (topicPartition, offset) => + consumer.seek(topicPartition, offset) --- End diff -- 4 chars for indent? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14031#discussion_r69382719 --- Diff: project/SparkBuild.scala --- @@ -723,8 +723,8 @@ object Unidoc { .map(_.filterNot(_.getCanonicalPath.contains("org/apache/hadoop"))) }, -// Javadoc options: create a window title, and group key packages on index page -javacOptions in doc := Seq( +// Javadoc options: create a window title --- End diff -- Do we really need that line? It's in the git history at the very least and JIRA. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14030: [SPARK-16350][SQL] Fix support for incremental pl...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14030#discussion_r69382676 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ForeachSinkSuite.scala --- @@ -35,35 +35,109 @@ class ForeachSinkSuite extends StreamTest with SharedSQLContext with BeforeAndAf sqlContext.streams.active.foreach(_.stop()) } - test("foreach") { + test("foreach() with `append` output mode") { withTempDir { checkpointDir => val input = MemoryStream[Int] val query = input.toDS().repartition(2).writeStream .option("checkpointLocation", checkpointDir.getCanonicalPath) +.outputMode("append") .foreach(new TestForeachWriter()) .start() + + // -- batch 0 --- input.addData(1, 2, 3, 4) query.processAllAvailable() - val expectedEventsForPartition0 = Seq( + var expectedEventsForPartition0 = Seq( ForeachSinkSuite.Open(partition = 0, version = 0), ForeachSinkSuite.Process(value = 1), ForeachSinkSuite.Process(value = 3), ForeachSinkSuite.Close(None) ) - val expectedEventsForPartition1 = Seq( + var expectedEventsForPartition1 = Seq( ForeachSinkSuite.Open(partition = 1, version = 0), ForeachSinkSuite.Process(value = 2), ForeachSinkSuite.Process(value = 4), ForeachSinkSuite.Close(None) ) - val allEvents = ForeachSinkSuite.allEvents() + var allEvents = ForeachSinkSuite.allEvents() + assert(allEvents.size === 2) + assert { +allEvents === Seq(expectedEventsForPartition0, expectedEventsForPartition1) || + allEvents === Seq(expectedEventsForPartition1, expectedEventsForPartition0) + } + + ForeachSinkSuite.clear() + + // -- batch 1 --- + input.addData(5, 6, 7, 8) + query.processAllAvailable() + + expectedEventsForPartition0 = Seq( +ForeachSinkSuite.Open(partition = 0, version = 1), +ForeachSinkSuite.Process(value = 5), +ForeachSinkSuite.Process(value = 7), +ForeachSinkSuite.Close(None) + ) + expectedEventsForPartition1 = Seq( +ForeachSinkSuite.Open(partition = 1, version = 1), +ForeachSinkSuite.Process(value = 6), +ForeachSinkSuite.Process(value = 8), +ForeachSinkSuite.Close(None) + ) + + allEvents = ForeachSinkSuite.allEvents() assert(allEvents.size === 2) assert { allEvents === Seq(expectedEventsForPartition0, expectedEventsForPartition1) || allEvents === Seq(expectedEventsForPartition1, expectedEventsForPartition0) } + + query.stop() +} + } + + test("foreach() with `complete` output mode") { +withTempDir { checkpointDir => + val input = MemoryStream[Int] + + val query = input.toDS() +.groupBy().count().as[Long].map(_.toInt) +.writeStream +.option("checkpointLocation", checkpointDir.getCanonicalPath) +.outputMode("complete") --- End diff -- Are really output modes strings? No enums or similar more type-safe values? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14030: [SPARK-16350][SQL] Fix support for incremental pl...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14030#discussion_r69382669 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ForeachSinkSuite.scala --- @@ -35,35 +35,109 @@ class ForeachSinkSuite extends StreamTest with SharedSQLContext with BeforeAndAf sqlContext.streams.active.foreach(_.stop()) } - test("foreach") { + test("foreach() with `append` output mode") { withTempDir { checkpointDir => val input = MemoryStream[Int] val query = input.toDS().repartition(2).writeStream .option("checkpointLocation", checkpointDir.getCanonicalPath) +.outputMode("append") .foreach(new TestForeachWriter()) .start() + + // -- batch 0 --- input.addData(1, 2, 3, 4) query.processAllAvailable() - val expectedEventsForPartition0 = Seq( + var expectedEventsForPartition0 = Seq( ForeachSinkSuite.Open(partition = 0, version = 0), ForeachSinkSuite.Process(value = 1), ForeachSinkSuite.Process(value = 3), ForeachSinkSuite.Close(None) ) - val expectedEventsForPartition1 = Seq( + var expectedEventsForPartition1 = Seq( ForeachSinkSuite.Open(partition = 1, version = 0), ForeachSinkSuite.Process(value = 2), ForeachSinkSuite.Process(value = 4), ForeachSinkSuite.Close(None) ) - val allEvents = ForeachSinkSuite.allEvents() + var allEvents = ForeachSinkSuite.allEvents() + assert(allEvents.size === 2) + assert { +allEvents === Seq(expectedEventsForPartition0, expectedEventsForPartition1) || + allEvents === Seq(expectedEventsForPartition1, expectedEventsForPartition0) + } + + ForeachSinkSuite.clear() + + // -- batch 1 --- + input.addData(5, 6, 7, 8) + query.processAllAvailable() + + expectedEventsForPartition0 = Seq( +ForeachSinkSuite.Open(partition = 0, version = 1), +ForeachSinkSuite.Process(value = 5), +ForeachSinkSuite.Process(value = 7), +ForeachSinkSuite.Close(None) + ) + expectedEventsForPartition1 = Seq( +ForeachSinkSuite.Open(partition = 1, version = 1), +ForeachSinkSuite.Process(value = 6), +ForeachSinkSuite.Process(value = 8), +ForeachSinkSuite.Close(None) + ) + + allEvents = ForeachSinkSuite.allEvents() assert(allEvents.size === 2) assert { allEvents === Seq(expectedEventsForPartition0, expectedEventsForPartition1) || allEvents === Seq(expectedEventsForPartition1, expectedEventsForPartition0) --- End diff -- Same as above --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14030: [SPARK-16350][SQL] Fix support for incremental pl...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14030#discussion_r69382667 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ForeachSinkSuite.scala --- @@ -35,35 +35,109 @@ class ForeachSinkSuite extends StreamTest with SharedSQLContext with BeforeAndAf sqlContext.streams.active.foreach(_.stop()) } - test("foreach") { + test("foreach() with `append` output mode") { withTempDir { checkpointDir => val input = MemoryStream[Int] val query = input.toDS().repartition(2).writeStream .option("checkpointLocation", checkpointDir.getCanonicalPath) +.outputMode("append") .foreach(new TestForeachWriter()) .start() + + // -- batch 0 --- input.addData(1, 2, 3, 4) query.processAllAvailable() - val expectedEventsForPartition0 = Seq( + var expectedEventsForPartition0 = Seq( ForeachSinkSuite.Open(partition = 0, version = 0), ForeachSinkSuite.Process(value = 1), ForeachSinkSuite.Process(value = 3), ForeachSinkSuite.Close(None) ) - val expectedEventsForPartition1 = Seq( + var expectedEventsForPartition1 = Seq( ForeachSinkSuite.Open(partition = 1, version = 0), ForeachSinkSuite.Process(value = 2), ForeachSinkSuite.Process(value = 4), ForeachSinkSuite.Close(None) ) - val allEvents = ForeachSinkSuite.allEvents() + var allEvents = ForeachSinkSuite.allEvents() + assert(allEvents.size === 2) + assert { +allEvents === Seq(expectedEventsForPartition0, expectedEventsForPartition1) || + allEvents === Seq(expectedEventsForPartition1, expectedEventsForPartition0) --- End diff -- `should contain theSameElementsAs`? See http://www.scalatest.org/user_guide/using_matchers#workingWithAggregations --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13909#discussion_r68506637 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala --- @@ -26,6 +26,20 @@ import org.apache.spark.sql.test.SharedSQLContext class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext { import testImplicits._ + test("primitive type on array") { +val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v") +val resDF = df.selectExpr("Array(v + 2, v + 3)") +checkAnswer(resDF, + Seq(Row(Array(3, 4)), Row(Array(4, 5 + } + + test("primitive type and null on array") { +val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v") +val resDF = df.selectExpr("Array(v + 2, null, v + 3)") +checkAnswer(resDF, + Seq(Row(Array(3, null, 4)), Row(Array(4, null, 5 + } + test("UDF on struct") { val f = udf((a: String) => a) --- End diff -- `val f = udf[String, String](identity)` ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13909#discussion_r68506607 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala --- @@ -26,6 +26,20 @@ import org.apache.spark.sql.test.SharedSQLContext class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext { import testImplicits._ + test("primitive type on array") { +val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v") --- End diff -- `Seq(1, 2).toDF("v")` ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13909#discussion_r68506564 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala --- @@ -51,27 +51,52 @@ case class CreateArray(children: Seq[Expression]) extends Expression { override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { val arrayClass = classOf[GenericArrayData].getName val values = ctx.freshName("values") -ctx.addMutableState("Object[]", values, s"this.$values = null;") - -ev.copy(code = s""" - final boolean ${ev.isNull} = false; - this.$values = new Object[${children.size}];""" + - ctx.splitExpressions( -ctx.INPUT_ROW, -children.zipWithIndex.map { case (e, i) => - val eval = e.genCode(ctx) - eval.code + s""" +val dt = dataType match { + case a @ ArrayType(et, _) => et +} +val isPrimitive = ctx.isPrimitiveType(dt) +val evals = children.map(e => e.genCode(ctx)) +val allNonNull = evals.find(_.isNull != "false").isEmpty --- End diff -- `evals.forall(_.isNull == "true")` ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13909#discussion_r68506515 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala --- @@ -51,27 +51,52 @@ case class CreateArray(children: Seq[Expression]) extends Expression { override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { val arrayClass = classOf[GenericArrayData].getName val values = ctx.freshName("values") -ctx.addMutableState("Object[]", values, s"this.$values = null;") - -ev.copy(code = s""" - final boolean ${ev.isNull} = false; - this.$values = new Object[${children.size}];""" + - ctx.splitExpressions( -ctx.INPUT_ROW, -children.zipWithIndex.map { case (e, i) => - val eval = e.genCode(ctx) - eval.code + s""" +val dt = dataType match { --- End diff -- This can be shorter (simpler?) with the pattern matching on assignment "trick", i.e. ``` val ArrayType(dt, _) = dataType ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13912: [SPARK-16216][SQL] CSV data source does not write...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13912#discussion_r68506371 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala --- @@ -195,18 +202,50 @@ private[sql] class CsvOutputWriter( private var records: Long = 0L private val csvWriter = new LineCsvWriter(params, dataSchema.fieldNames.toSeq) - private def rowToString(row: Seq[Any]): Seq[String] = row.map { field => -if (field != null) { - field.toString -} else { - params.nullValue + private def rowToString(row: InternalRow): Seq[String] = { +var i = 0 +val values = new Array[String](row.numFields) +while (i < row.numFields) { + if (!row.isNullAt(i)) { +values(i) = fieldsConverters(i).apply(row, i) + } else { +values(i) = params.nullValue + } + i += 1 +} +values + } + + private def makeConverter(dataType: DataType): ValueConverter = { +dataType match { --- End diff -- Why do you place `dataType match` on a separate line? One less `{` if at line 219 ;-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13912: [SPARK-16216][SQL] CSV data source does not write...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13912#discussion_r68506352 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala --- @@ -195,18 +202,50 @@ private[sql] class CsvOutputWriter( private var records: Long = 0L private val csvWriter = new LineCsvWriter(params, dataSchema.fieldNames.toSeq) - private def rowToString(row: Seq[Any]): Seq[String] = row.map { field => -if (field != null) { - field.toString -} else { - params.nullValue + private def rowToString(row: InternalRow): Seq[String] = { +var i = 0 +val values = new Array[String](row.numFields) +while (i < row.numFields) { --- End diff -- Please use `values.indices` and then `map` or `foreach` to make it more functional (and hopefully readable). With the change, you'll link `values` to indices (not relying on `row.numFields` used twice). BTW, do we need `values` to be initialized first before adding elements? I'd vote for `foldLeft` as a better alternative, e.g. ``` def appendRowValue(arr: Array[String], i: Int) = { // do the if here arr } (0 to row.numFields).foldLeft(Array[String]()) { case (arr, i) => arr } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13654: [SPARK-15868] [Web UI] Executors table in Executo...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13654#discussion_r66961307 --- Diff: core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala --- @@ -69,13 +73,13 @@ private[ui] class ExecutorsPage( } val execInfo = activeExecutorInfo ++ deadExecutorInfo -val execInfoSorted = execInfo.sortBy(_.id) +val execInfoSorted = execInfo.sortWith((a, b) => strToInt(a.id) > strToInt(b.id)) --- End diff -- I think I'd use `Ordering` trait instead as it's not very easy to spot what the ordering is. ``` case class Info(id: String) val infos = Seq(Info("5"), Info("driver"), Info("1"), Info("10"), Info("500")) implicit val onString = Ordering[Int].on((s: String) => util.Try(s.toInt).getOrElse(-1)) infos.sortBy(_.id) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13555#discussion_r66235235 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala --- @@ -625,6 +625,22 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext } } } + + test("SPARK-15804: write out the metadata to parquet file") { +val data = (1, "abc") ::(2, "helloabcde") :: Nil +val df = spark.createDataFrame(data).toDF("a", "b") +val md = new MetadataBuilder().putString("key", "value").build() +val dfWithmeta = df.select(Column("a"), Column("b").as("b", md)) --- End diff -- Use `df.select('a, 'b.as("b", md))` instead. It's more compact and "modern". Our tests should promote the latest API. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13555#discussion_r66235290 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala --- @@ -625,6 +625,22 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext } } } + + test("SPARK-15804: write out the metadata to parquet file") { +val data = (1, "abc") ::(2, "helloabcde") :: Nil +val df = spark.createDataFrame(data).toDF("a", "b") +val md = new MetadataBuilder().putString("key", "value").build() +val dfWithmeta = df.select(Column("a"), Column("b").as("b", md)) + +withTempPath { dir => + val path = s"${dir.getCanonicalPath}/data" + dfWithmeta.write.parquet(path) + + readParquetFile(path) { dfwithmeta2 => --- End diff -- Replace `dfwithmeta2` to `df` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13555#discussion_r66234892 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala --- @@ -625,6 +625,22 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext } } } + + test("SPARK-15804: write out the metadata to parquet file") { +val data = (1, "abc") ::(2, "helloabcde") :: Nil +val df = spark.createDataFrame(data).toDF("a", "b") --- End diff -- Merge the lines 630 and 631 to `Seq((1,"abc"),(2,"hello")).toDF("a", "b")` instead --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13513: [SPARK-15698][SQL][Streaming] Add the ability to ...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13513#discussion_r65825524 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -529,7 +529,28 @@ object SQLConf { .internal() .doc("How long in milliseconds a file is guaranteed to be visible for all readers.") .timeConf(TimeUnit.MILLISECONDS) - .createWithDefault(60 * 1000L) // 10 minutes + .createWithDefault(60 * 10 * 1000L) // 10 minutes + + val FILE_SOURCE_LOG_DELETION = SQLConfigBuilder("spark.sql.streaming.fileSource.log.deletion") +.internal() +.doc("Whether to delete the expired log files in file stream source.") +.booleanConf +.createWithDefault(true) + + val FILE_SOURCE_LOG_COMPACT_INTERVAL = +SQLConfigBuilder("spark.sql.streaming.fileSource.log.compactInterval") + .internal() + .doc("Number of log files after which all the previous files " + +"are compacted into the next log file.") + .intConf + .createWithDefault(10) + + val FILE_SOURCE_LOG_CLEANUP_DELAY = +SQLConfigBuilder("spark.sql.streaming.fileSource.log.cleanupDelay") + .internal() + .doc("How long in milliseconds a file is guaranteed to be visible for all readers.") + .timeConf(TimeUnit.MILLISECONDS) + .createWithDefault(60 * 10 * 1000L) // 10 minutes --- End diff -- A nitpick but think it'd be easier to "decode" - `10 * 60 * 1000L`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13513: [SPARK-15698][SQL][Streaming] Add the ability to ...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13513#discussion_r65825474 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -129,3 +131,86 @@ class FileStreamSource( override def toString: String = s"FileStreamSource[$qualifiedBasePath]" } + +class FileStreamSourceLog(sparkSession: SparkSession, path: String) + extends HDFSMetadataLog[Seq[String]](sparkSession, path) { + + // Configurations about metadata compaction + private val compactInterval = sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL) + require(compactInterval > 0, +s"Please set ${SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key} (was $compactInterval) to a " + + s"positive value.") + + private val fileCleanupDelayMs = sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY) + + private val isDeletingExpiredLog = sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_DELETION) + + private var compactBatchId: Long = -1L + + private def isCompactionBatch(batchId: Long, compactInterval: Long): Boolean = { +batchId % compactInterval == 0 + } + + override def add(batchId: Long, metadata: Seq[String]): Boolean = { +if (isCompactionBatch(batchId, compactInterval)) { + compactMetadataLog(batchId - 1) +} + +super.add(batchId, metadata) + } + + private def compactMetadataLog(batchId: Long): Unit = { +// read out compact metadata and merge with new metadata. +val batches = super.get(Some(compactBatchId), Some(batchId)) +val totalMetadata = batches.flatMap(_._2) +if (totalMetadata.isEmpty) { + return +} + +// Remove old compact metadata file and rewrite. +val renamedPath = new Path(path, s".${batchId.toString}-${UUID.randomUUID.toString}.tmp") +fileManager.rename(batchIdToPath(batchId), renamedPath) + +var isSuccess = false +try { + isSuccess = super.add(batchId, totalMetadata) +} catch { + case NonFatal(e) => isSuccess = false --- End diff -- Why are you setting `isSuccess` to `false` since it's `false` already? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13513: [SPARK-15698][SQL][Streaming] Add the ability to ...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13513#discussion_r65825480 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -129,3 +131,86 @@ class FileStreamSource( override def toString: String = s"FileStreamSource[$qualifiedBasePath]" } + +class FileStreamSourceLog(sparkSession: SparkSession, path: String) + extends HDFSMetadataLog[Seq[String]](sparkSession, path) { + + // Configurations about metadata compaction + private val compactInterval = sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL) + require(compactInterval > 0, +s"Please set ${SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key} (was $compactInterval) to a " + + s"positive value.") + + private val fileCleanupDelayMs = sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY) + + private val isDeletingExpiredLog = sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_DELETION) + + private var compactBatchId: Long = -1L + + private def isCompactionBatch(batchId: Long, compactInterval: Long): Boolean = { +batchId % compactInterval == 0 + } + + override def add(batchId: Long, metadata: Seq[String]): Boolean = { +if (isCompactionBatch(batchId, compactInterval)) { + compactMetadataLog(batchId - 1) +} + +super.add(batchId, metadata) + } + + private def compactMetadataLog(batchId: Long): Unit = { +// read out compact metadata and merge with new metadata. +val batches = super.get(Some(compactBatchId), Some(batchId)) +val totalMetadata = batches.flatMap(_._2) +if (totalMetadata.isEmpty) { + return +} + +// Remove old compact metadata file and rewrite. +val renamedPath = new Path(path, s".${batchId.toString}-${UUID.randomUUID.toString}.tmp") +fileManager.rename(batchIdToPath(batchId), renamedPath) + +var isSuccess = false +try { + isSuccess = super.add(batchId, totalMetadata) +} catch { + case NonFatal(e) => isSuccess = false +} finally { + if (!isSuccess) { +// Rollback to the previous status if compaction is failed. --- End diff -- s/status/state ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13513: [SPARK-15698][SQL][Streaming] Add the ability to ...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13513#discussion_r65825440 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -129,3 +131,86 @@ class FileStreamSource( override def toString: String = s"FileStreamSource[$qualifiedBasePath]" } + +class FileStreamSourceLog(sparkSession: SparkSession, path: String) + extends HDFSMetadataLog[Seq[String]](sparkSession, path) { + + // Configurations about metadata compaction + private val compactInterval = sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL) + require(compactInterval > 0, +s"Please set ${SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key} (was $compactInterval) to a " + --- End diff -- I'd move `(was $compactInterval)` at the end of the message. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [CORE][DOC][MINOR] Remove incorrect scaladoc
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/13432 [CORE][DOC][MINOR] Remove incorrect scaladoc ## What changes were proposed in this pull request? It removes a sentence about `SparkListener` being internal and may change in the future yet the class is `@DeveloperApi` that may or may not say the same. If it does not, it's clearly incorrect. If it says what `@DeveloperApi` is for, it's a duplication (and/or would require other `@DeveloperApi` to have it, too). ## How was this patch tested? manual build You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark SparkListener Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13432.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13432 commit 306e60ebc135ccaceed24209e04d11a0929027f2 Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-06-01T06:14:47Z [CORE][DOC][MINOR] Remove incorrect scaladoc --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15662][SQL] Add since annotation for cl...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13406#discussion_r65140246 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala --- @@ -41,6 +49,16 @@ class Database( } +/** + * A table in Spark, as returned by the `listTables` method in [[Catalog]]. --- End diff -- Use `[[Catalog#listTables]]` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15662][SQL] Add since annotation for cl...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13406#discussion_r65140296 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala --- @@ -83,9 +112,19 @@ class Column( } -// TODO(andrew): should we include the database here? +/** + * A user-defined function in Spark, as returned by `listFunctions` method in [[Catalog]]. --- End diff -- Use `[[Catalog#listFunctions]]` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15662][SQL] Add since annotation for cl...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13406#discussion_r65140165 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala --- @@ -25,6 +25,14 @@ import org.apache.spark.sql.catalyst.DefinedByConstructorParams // Note: all classes here are expected to be wrapped in Datasets and so must extend // DefinedByConstructorParams for the catalog to be able to create encoders for them. +/** + * A database in Spark, as returned by the `listDatabases` method defined in [[Catalog]]. --- End diff -- Use `[[Catalog#listDatabases]]` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15647] [SQL] Fix Boundary Cases in Opti...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13392#discussion_r65136606 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -937,8 +937,12 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper { */ case class OptimizeCodegen(conf: CatalystConf) extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions { -case e @ CaseWhen(branches, _) if branches.size < conf.maxCaseBranchesForCodegen => - e.toCodegen() +case e: CaseWhen if canCodeGen(e) => e.toCodegen() --- End diff -- Sorry for nitpicking, but could you use `canCodegen` instead (to follow the name of the method to call)? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15647] [SQL] Fix Boundary Cases in Opti...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13392#discussion_r65008333 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -937,7 +937,8 @@ object SimplifyConditionals extends Rule[LogicalPlan] with PredicateHelper { */ case class OptimizeCodegen(conf: CatalystConf) extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions { -case e @ CaseWhen(branches, _) if branches.size < conf.maxCaseBranchesForCodegen => +case e @ CaseWhen(branches, elseBranch) +if branches.size + elseBranch.size <= conf.maxCaseBranchesForCodegen => --- End diff -- Reading the case takes a while and and I think it'd greatly benefit from introducing a local `def` - a predicate - for the condition (I can't figure out a name for this, sorry) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [CORE][MINOR][DOC] Removing incorrect scaladoc
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/13384 [CORE][MINOR][DOC] Removing incorrect scaladoc ## What changes were proposed in this pull request? I don't think the method will ever throw an exception so removing a false comment. Sorry @srowen and @rxin again -- I simply couldn't resist. I wholeheartedly support merging the change with a bigger one (and trashing this PR). ## How was this patch tested? Manual build You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark blockinfomanager Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13384.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13384 commit 7c029d61b1927534c442d2892679d7c6e682cf7e Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-05-28T21:14:57Z [CORE][MINOR][DOC] Removing incorrect scaladoc --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [CORE][DOC][MINOR] typos + links
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/13383 [CORE][DOC][MINOR] typos + links ## What changes were proposed in this pull request? A very tiny change to javadoc (which I don't mind if gets merged with a bigger change). I've just found it annoying and couldn't resist proposing a pull request. Sorry @srowen and @rxin. ## How was this patch tested? Manual build You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark memory-consumer Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13383.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13383 commit 93cc32ecda3c10c53e79872a4130913213f0ec67 Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-05-28T21:08:14Z [CORE][DOC][MINOR] typos + links --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [CORE][SQL][MINOR] Scaladoc fixes + string int...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/13329#issuecomment-70008 Thanks @rxin and @srowen for your help and patience! I'll close the pull request. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [CORE][SQL][MINOR] Scaladoc fixes + string int...
Github user jaceklaskowski closed the pull request at: https://github.com/apache/spark/pull/13329 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [CORE][SQL][MINOR] Scaladoc fixes + string int...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/13329#issuecomment-42718 How am I supposed to read this? Do you want me to...forget about the changes? All of them or just some? Which one would you accept since @srowen said: "Some of this is OK"? It _appears_ there's _some_ merit in the change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [CORE][SQL][MINOR] Scaladoc fixes + string int...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/13329#issuecomment-222140528 The issue with `nonEmpty` is that you could easily miss the negation (and that's why Scala offers `nonEmpty`). I don't think it's the final solution, but certainly believe it's far better for future readers to know what's going on in these lines. I believe the changes improve readability (but can happily revert _some_ if you point me at the places that need this). It's just me to believe that by doing these small changes the code becomes more readable. I spent enough time with it to think it needs so (more often than I'm proposing). Please guide me to learn your coding style. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [CORE][SQL][MINOR] Scaladoc fixes + string int...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13329#discussion_r64900321 --- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnScheduler.scala --- @@ -31,9 +31,8 @@ private[spark] class YarnScheduler(sc: SparkContext) extends TaskSchedulerImpl(s Logger.getLogger(classOf[RackResolver]).setLevel(Level.WARN) } - // By default, rack is unknown override def getRackForHost(hostPort: String): Option[String] = { -val host = Utils.parseHostPort(hostPort)._1 +val (host, _) = Utils.parseHostPort(hostPort) --- End diff -- It is when you agree that you could easily (?) miss `_1` at the very end. I do agree and I did miss it few times while reviewing that piece of code. Opinions may vary and I can happily revert this change if requested (I need your advice to learn your coding style). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [CORE][SQL][MINOR] Scaladoc fixes + string int...
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/13329 [CORE][SQL][MINOR] Scaladoc fixes + string interpolation ## What changes were proposed in this pull request? Scaladoc fixes + string interpolation for logging ## How was this patch tested? local build + manual testing (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark minor-fixes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13329.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13329 commit 2e794dcbf69af5e6d399e3d07cec002bbd573bc9 Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-05-26T17:45:26Z [CORE][SQL][MINOR] Scaladoc fixes + string interpolation --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15458][SQL][STREAMING] Disable schema i...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13238#discussion_r64127555 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala --- @@ -140,6 +140,18 @@ class FileStreamSourceSuite extends FileStreamSourceTest with SharedSQLContext { import testImplicits._ + private def withSchemaInference(f: => Unit): Unit = { +withSQLConf(("spark.sql.streaming.allowSchemaInference", "true")) { f } --- End diff -- Why do you use `{ f }` not `(f)`? Also, I've seen `"spark.sql.streaming.allowSchemaInference" -> "true"` used in the past. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15458][SQL][STREAMING] Disable schema i...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13238#discussion_r64127527 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -186,6 +187,14 @@ case class DataSource( val path = caseInsensitiveOptions.getOrElse("path", { throw new IllegalArgumentException("'path' is not specified") }) +if (!sparkSession.conf.get(SQLConf.STREAMING_SCHEMA_INFERENCE) && --- End diff -- This and the other 2 lines: could you please define a local `val` and name it in a way that tells what the `if` is supposed to guard? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15425][SQL] Disallow cartesian joins by...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13209#discussion_r63987980 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -348,6 +348,11 @@ object SQLConf { .booleanConf .createWithDefault(true) + val CARTESIAN_PRODUCT_ENABLED = SQLConfigBuilder("spark.sql.join.cartesian.enabled") +.doc("When false, we will throw an error if a query contains a cartesian product") +.booleanConf +.createWithDefault(false) + val ORDER_BY_ORDINAL = SQLConfigBuilder("spark.sql.orderByOrdinal") .doc("When true, the ordinal numbers are treated as the position in the select list. " + "When false, the ordinal numbers in order/sort By clause are ignored.") --- End diff -- Is there any reason for `By` uppercase? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15428][SQL] Disable multiple streaming ...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13210#discussion_r63987729 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala --- @@ -55,10 +55,19 @@ object UnsupportedOperationChecker { case _: InsertIntoTable => throwError("InsertIntoTable is not supported with streaming DataFrames/Datasets") -case Aggregate(_, _, child) if child.isStreaming && outputMode == Append => - throwError( -"Aggregations are not supported on streaming DataFrames/Datasets in " + - "Append output mode. Consider changing output mode to Update.") +case Aggregate(_, _, child) if child.isStreaming => + if (outputMode == Append) { +throwError( + "Aggregations are not supported on streaming DataFrames/Datasets in " + +"Append output mode. Consider changing output mode to Update.") + } + val moreStreamingAggregates = child.find { +case Aggregate(_, _, grandchild) if grandchild.isStreaming => true +case _ => false + } + if (moreStreamingAggregates.nonEmpty) { --- End diff -- What is the type of `moreStreamingAggregates`? Option? What do you think about `foreach`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15428][SQL] Disable multiple streaming ...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13210#discussion_r63987696 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala --- @@ -55,10 +55,19 @@ object UnsupportedOperationChecker { case _: InsertIntoTable => throwError("InsertIntoTable is not supported with streaming DataFrames/Datasets") -case Aggregate(_, _, child) if child.isStreaming && outputMode == Append => - throwError( -"Aggregations are not supported on streaming DataFrames/Datasets in " + - "Append output mode. Consider changing output mode to Update.") +case Aggregate(_, _, child) if child.isStreaming => + if (outputMode == Append) { +throwError( + "Aggregations are not supported on streaming DataFrames/Datasets in " + +"Append output mode. Consider changing output mode to Update.") --- End diff -- I'd suggest changing "Append" to using the Append value and Update too, so when you refactor the code it'd be easier to catch. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15428][SQL] Disable multiple streaming ...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13210#discussion_r63987667 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala --- @@ -55,10 +55,19 @@ object UnsupportedOperationChecker { case _: InsertIntoTable => throwError("InsertIntoTable is not supported with streaming DataFrames/Datasets") -case Aggregate(_, _, child) if child.isStreaming && outputMode == Append => - throwError( -"Aggregations are not supported on streaming DataFrames/Datasets in " + - "Append output mode. Consider changing output mode to Update.") +case Aggregate(_, _, child) if child.isStreaming => + if (outputMode == Append) { +throwError( + "Aggregations are not supported on streaming DataFrames/Datasets in " + +"Append output mode. Consider changing output mode to Update.") + } + val moreStreamingAggregates = child.find { +case Aggregate(_, _, grandchild) if grandchild.isStreaming => true +case _ => false + } + if (moreStreamingAggregates.nonEmpty) { +throwError("Multiple streaming aggregations are not supported DataFrames/Datasets") --- End diff -- "are not supported DataFrames/Datasets"? It appears as "in" or "for" might be missing? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15067] [YARN] YARN executors are launch...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/12985#issuecomment-217727591 What do you think about a unit test for this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15152][DOC][MINOR] Scaladoc and Code st...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12928#discussion_r62170912 --- Diff: core/src/main/scala/org/apache/spark/scheduler/Pool.scala --- @@ -98,13 +97,12 @@ private[spark] class Pool( } override def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager] = { -var sortedTaskSetQueue = new ArrayBuffer[TaskSetManager] -val sortedSchedulableQueue = - schedulableQueue.asScala.toSeq.sortWith(taskSetSchedulingAlgorithm.comparator) -for (schedulable <- sortedSchedulableQueue) { - sortedTaskSetQueue ++= schedulable.getSortedTaskSetQueue -} -sortedTaskSetQueue +schedulableQueue --- End diff -- You're right! I have overlooked that. I'll revert the change. Sorry. (Hmmm, how did it work then?! #confused) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15152][DOC][MINOR] Scaladoc and Code st...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12928#discussion_r62170253 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala --- @@ -168,7 +168,7 @@ private[spark] class Client( val appContext = createApplicationSubmissionContext(newApp, containerContext) // Finally, submit and monitor the application - logInfo(s"Submitting application ${appId.getId} to ResourceManager") + logInfo(s"Submitting application $appId to ResourceManager") --- End diff -- It's inconsistent with the other log messages where `appId` is printed out instead. It then makes tracing the calls slightly harder - cf. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L180-L182 and https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L174-L175 and few other places. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15152][DOC][MINOR] Scaladoc and Code st...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12928#discussion_r62169812 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala --- @@ -155,10 +155,10 @@ private[spark] class Client( // Get a new application from our RM val newApp = yarnClient.createApplication() - val newAppResponse = newApp.getNewApplicationResponse() - appId = newAppResponse.getApplicationId() + val newAppResponse = newApp.getNewApplicationResponse --- End diff -- Yes, there are, but IDEA kept bugging me all the time about calling getters with `()` as if mutation/computation happened. I'll revert it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15152][DOC][MINOR] Scaladoc and Code st...
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/12928 [SPARK-15152][DOC][MINOR] Scaladoc and Code style Improvements ## What changes were proposed in this pull request? Minor doc and code style fixes ## How was this patch tested? local build You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark SPARK-15152 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12928.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12928 commit a3449278bc04b1e40cf782779066d58df899ea06 Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-05-05T10:00:28Z [DOC][MINOR] SPARK-15152 Scaladoc and Code style Improvements --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14938][ML] replace RDD.map with Dataset...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/12718#issuecomment-215550190 Other than the few places where you could use symbols not string literals LGTM. Excellent job! Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14938][ML] replace RDD.map with Dataset...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12718#discussion_r61494448 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -79,11 +79,12 @@ final class ChiSqSelector(override val uid: String) @Since("2.0.0") override def fit(dataset: Dataset[_]): ChiSqSelectorModel = { +val sqlContext = dataset.sqlContext +import sqlContext.implicits._ + transformSchema(dataset.schema, logging = true) -val input = dataset.select($(labelCol), $(featuresCol)).rdd.map { - case Row(label: Double, features: Vector) => -LabeledPoint(label, features) -} +val input = dataset.select(col($(labelCol)).cast(DoubleType).as("label"), --- End diff -- Sorry, couldn't resist :) I'd change `"label"` to be a symbol `'label`. Not very widely used, but think it deserves its place in the code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR][DOCS] Minor typo fixes
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/12469#issuecomment-214638031 @srowen Reverted the line removal and rebased with master. Mind merging it to the repo? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14889][Spark Core] scala.MatchError: NO...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12666#discussion_r60991107 --- Diff: core/src/main/scala/org/apache/spark/scheduler/Pool.scala --- @@ -55,6 +55,8 @@ private[spark] class Pool( new FairSchedulingAlgorithm() case SchedulingMode.FIFO => new FIFOSchedulingAlgorithm() + case _ => +throw new RuntimeException(s"The scheduler mode $schedulingMode is not supported by Spark.") --- End diff -- In [TaskSchedulerImpl](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L118) they use: "Unrecognized spark.scheduler.mode: $schedulingModeConf" so I'd change it to "Unsupported spark.scheduler.mode: $schedulingMode". I'm also not sure about the other places like https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L133-L138. I know that the change would not let it be executed with `NONE` or unsupported scheduling modes, but when Pool changes...just thinking aloud. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [DOCS][MINOR] Screenshot + minor fixes to impr...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/12569#issuecomment-213819194 Could this PR have a bit of @srowen's attention? :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR][DOCS] Minor typo fixes
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12469#discussion_r60727194 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala --- @@ -80,10 +80,8 @@ class HiveContext private[hive]( protected[sql] override def sharedState: HiveSharedState = { sparkSession.sharedState.asInstanceOf[HiveSharedState] } - --- End diff -- You're correct @srowen the line was there, but the file had changed since your first review of my *minor* change so I had to rebase. While rebasing and fixing the merge conflict I noticed the two lines and remove one. Since I was fixing the merge conflict I introduced the additional change (as it should've been noticed and fixed before). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR][DOCS] Minor typo fixes
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12469#discussion_r60726392 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala --- @@ -80,10 +80,8 @@ class HiveContext private[hive]( protected[sql] override def sharedState: HiveSharedState = { sparkSession.sharedState.asInstanceOf[HiveSharedState] } - --- End diff -- It was because I touched the file (not the line) and while I was rebasing I noticed the offending line. When I was at it I decided to remove it since I was fixing the merge conflict. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [DOCS][MINOR] Screenshot + minor fixes to impr...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/12569#issuecomment-213387430 Title has changed to incorporate @HyukjinKwon's suggestion. Please review @srowen. I intentionally have not included "in the programming guide" due to "[DOCS]". Is this ok now? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR][DOCS] Minor typo fixes
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12469#discussion_r60725372 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala --- @@ -80,10 +80,8 @@ class HiveContext private[hive]( protected[sql] override def sharedState: HiveSharedState = { sparkSession.sharedState.asInstanceOf[HiveSharedState] } - --- End diff -- In this case, it caused the last merge conflict so when I was rebasing I noticed the line and removed. I can revert the change if that's what you want. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [DOCS][MINOR] Accumulators
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/12569#issuecomment-213275339 @HyukjinKwon In that case I'd ask for the alternative as I currently have no idea how to make it clearer (it wasn't me to say "the title is not clear" :)) What do you miss in the title? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR][DOCS] Minor typo fixes
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/12469#issuecomment-213118039 I fixed the issue with the two failing tests due to the error message having been changed. Jenkins retest this please :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR][DOCS] Minor typo fixes
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/12469#issuecomment-212920619 Done @srowen. Thanks a lot for your patience. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [DOCS][MINOR] Accumulators
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12569#discussion_r60577867 --- Diff: docs/programming-guide.md --- @@ -1328,12 +1328,18 @@ value of the broadcast variable (e.g. if the variable is shipped to a new node l Accumulators are variables that are only "added" to through an associative and commutative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers -can add support for new types. If accumulators are created with a name, they will be +can add support for new types. + +If accumulators are created with a name, they will be displayed in Spark's UI. This can be useful for understanding the progress of running stages (NOTE: this is not yet supported in Python). + --- End diff -- I copied it from another file https://github.com/apache/spark/blob/master/docs/cluster-overview.md as I didn't know how to include images. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [DOCS][MINOR] Accumulators
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/12569 [DOCS][MINOR] Accumulators ## What changes were proposed in this pull request? Added screenshot + minor fixes to improve reading ## How was this patch tested? Manual You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark docs-accumulators Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12569.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12569 commit 8cdc916256c792b212e0b52593fbfd5e0f1b7c8b Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-04-21T08:49:18Z [DOCS][MINOR] Accumulators --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR][DOCS] Minor typo fixes
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/12469#issuecomment-212316780 @srowen Mind reviewing and possibly merging afterwards? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR][ML] Use Datasets (to improve internal ...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/11915#issuecomment-211605171 Sorry, no. Got distracted and had no time for it. Let me work on it later today (it's 00:21 already). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Minor typo fixes
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12469#discussion_r60092825 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala --- @@ -46,8 +46,7 @@ import org.apache.spark.sql.SQLContext * files in a directory always shows the latest files. */ class HDFSMetadataLog[T: ClassTag](sqlContext: SQLContext, path: String) - extends MetadataLog[T] - with Logging { + extends MetadataLog[T] with Logging { --- End diff -- Against the code style, don't you think? How to proceed with similar cases? Please guide. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Minor typo fixes
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12469#discussion_r60092677 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -453,8 +453,8 @@ class Dataset[T] private[sql]( * Returns true if this [[Dataset]] contains one or more sources that continuously * return data as it arrives. A [[Dataset]] that reads data from a streaming source * must be executed as a [[ContinuousQuery]] using the `startStream()` method in - * [[DataFrameWriter]]. Methods that return a single answer, (e.g., `count()` or - * `collect()`) will throw an [[AnalysisException]] when there is a streaming + * [[DataFrameWriter]]. Methods that return a single answer, e.g. `count()` or --- End diff -- See http://dictionary.cambridge.org/dictionary/english/e-g. There should only be a comma before e.g. not after it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Minor typo fixes
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12469#discussion_r60092525 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala --- @@ -38,8 +38,6 @@ import org.apache.spark.sql.types._ /** * Represents a numeric vector, whose index type is Int and value type is Double. - * - * Note: Users should not implement this interface. --- End diff -- That's my point with the change. If uses should not implement this interface, we should offer them alternatives. Do you know any? I'm happy to add it to the scaladoc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Minor typo fixes
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12469#discussion_r60092387 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -438,7 +438,6 @@ class LinearRegressionModel private[ml] ( } } - --- End diff -- How to fix them? I've seen few commits with such a change so I thought this one would be enough to include this too. Please guide as I've got few other similar cases and don't know how to proceed with small but irritating cases. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Minor typo fixes
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/12469 Minor typo fixes ## What changes were proposed in this pull request? Minor typo fixes (too minor to deserve separate a JIRA) ## How was this patch tested? local build You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark minor-typo-fixes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12469.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12469 commit 13bcfbbe03f9c78512547d6b69d13b945a163e65 Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-04-18T11:46:56Z Minor typo fixes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14288][SQL] Memory Sink for streaming
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12119#discussion_r59296064 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -275,23 +277,64 @@ final class DataFrameWriter private[sql](df: DataFrame) { * @since 2.0.0 */ def startStream(): ContinuousQuery = { -val dataSource = - DataSource( -df.sqlContext, -className = source, -options = extraOptions.toMap, -partitionColumns = normalizedParCols.getOrElse(Nil)) - -val queryName = extraOptions.getOrElse("queryName", StreamExecution.nextName) -val checkpointLocation = extraOptions.getOrElse("checkpointLocation", { - new Path(df.sqlContext.conf.checkpointLocation, queryName).toUri.toString -}) -df.sqlContext.sessionState.continuousQueryManager.startQuery( - queryName, - checkpointLocation, - df, - dataSource.createSink(), - trigger) +if (source == "memory") { + val queryName = +extraOptions.getOrElse( + "queryName", throw new AnalysisException("queryName must be specified for memory sink")) + val checkpointLocation = extraOptions.get("checkpointLocation").map { userSpecified => +new Path(userSpecified).toUri.toString + }.orElse { +val checkpointConfig: Option[String] = + df.sqlContext.conf.getConf( +SQLConf.CHECKPOINT_LOCATION, +None) + +checkpointConfig.map { location => + new Path(location, queryName).toUri.toString +} + }.getOrElse { +Utils.createTempDir(namePrefix = "memory.stream").getCanonicalPath + } + + // If offsets have already been created, we trying to resume a query. + val checkpointPath = new Path(checkpointLocation, "offsets") + val fs = checkpointPath.getFileSystem(df.sqlContext.sparkContext.hadoopConfiguration) + if (fs.exists(checkpointPath)) { +throw new AnalysisException( + s"Unable to resume query written to memory sink. Delete $checkpointPath to start over.") + } else { +checkpointPath.toUri.toString --- End diff -- @marmbrus I think it's a dead code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14500] [ML] Accept Dataset[_] instead o...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12274#discussion_r59124622 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/DecisionTreeModel.scala --- @@ -32,7 +32,7 @@ import org.apache.spark.mllib.tree.configuration.{Algo, FeatureType} import org.apache.spark.mllib.tree.configuration.Algo._ import org.apache.spark.mllib.util.{Loader, Saveable} import org.apache.spark.rdd.RDD -import org.apache.spark.sql.{DataFrame, Row, SQLContext} +import org.apache.spark.sql.{DataFrame, Dataset, Row, SQLContext} --- End diff -- Again, no other change in the file and you need the import? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14500] [ML] Accept Dataset[_] instead o...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12274#discussion_r59124613 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala --- @@ -23,7 +23,7 @@ import org.json4s.jackson.JsonMethods._ import org.apache.spark.SparkContext import org.apache.spark.mllib.linalg.Vector import org.apache.spark.mllib.util.Loader -import org.apache.spark.sql.{DataFrame, Row, SQLContext} +import org.apache.spark.sql.{DataFrame, Dataset, Row, SQLContext} --- End diff -- I'm surprised you need this change without any other changes in the file? How could that be? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14500] [ML] Accept Dataset[_] instead o...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12274#discussion_r59124592 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -171,18 +171,18 @@ abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType, * @param dataset input dataset * @return transformed dataset with [[predictionCol]] of type [[Double]] */ - override def transform(dataset: DataFrame): DataFrame = { + override def transform(dataset: Dataset[_]): DataFrame = { --- End diff -- What about the return type? If I want to chain transformers by `andThen`, i.e. `(tok.transform _).andThen(hashTF.transform)`, won't `DataFrame` be an issue? Why do we keep `DataFrame` as the return type? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14500] [ML] Accept Dataset[_] instead o...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/12274#discussion_r59124562 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala --- @@ -124,7 +124,7 @@ class Pipeline @Since("1.4.0") ( * @return fitted pipeline */ @Since("1.2.0") --- End diff -- What about `@Since` then? It's a different method, isn't it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [HOTFIX][SPARK-14402] Fix ExpressionDescriptio...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/12192#issuecomment-207175670 Please please merge it as soon as possible as I'm suffering from not having it in master every time I do the build :( --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [HOTFIX][SPARK-14402] Fix ExpressionDescriptio...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/12192#issuecomment-206095697 Is this really the patch to cause the issue or is this more an interim hiccup? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [HOTFIX][SPARK-14402] Fix ExpressionDescriptio...
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/12192 [HOTFIX][SPARK-14402] Fix ExpressionDescription annotation ## What changes were proposed in this pull request? Fix for the error introduced in https://github.com/apache/spark/commit/c59abad052b7beec4ef550049413e95578e545be: ``` /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:626: error: annotation argument needs to be a constant; found: "_FUNC_(str) - ".+("Returns str, with the first letter of each word in uppercase, all other letters in ").+("lowercase. Words are delimited by white space.") "Returns str, with the first letter of each word in uppercase, all other letters in " + ^ ``` ## How was this patch tested? Local build You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark SPARK-14402-HOTFIX Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12192.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12192 commit 218bf51fb0132738daf0d544bfcde076af22f444 Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-04-06T01:36:19Z [HOTFIX][SPARK-14402] Fix ExpressionDescription annotation --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR] Typo fixes
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/11802#discussion_r57455819 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala --- @@ -246,9 +246,7 @@ class StreamingContext private[streaming] ( checkpointDir != null } - private[streaming] def initialCheckpoint: Checkpoint = { -if (isCheckpointPresent) _cp else null - } + private[streaming] def initialCheckpoint: Checkpoint = _cp --- End diff -- Sorry @srowen to keep you waiting. As to the line, look at `isCheckpointPresent` and see that it does what the `if...else` block does, i.e. checks whether `_cp` is `null` and if it is, it returns `false` so `null`. It's a false check here and hence an easy fix (and that's why I proposed it with the other changes). I can revert it and report another change if you want. Let me know. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR][ML] Use Datasets (to improve internal ...
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/11915 [MINOR][ML] Use Datasets (to improve internal implementation) ## What changes were proposed in this pull request? Change the current implementation to use Datasets (not the "old-school" `map` over `Row`s) NOTE: I'm sending the pull request to discuss whether my understanding of the change is correct or not and if approved (after discussions and further changes possibly) learn the right way. ## How was this patch tested? Local build You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark sparkml-predictor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11915.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11915 commit 74897af4b8b4ba126790377fb6d79bbe96364ec0 Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-03-23T07:56:56Z [MINOR][ML] Use Datasets (to improve internal implementation) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR] Typo fixes
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/11802 [MINOR] Typo fixes ## What changes were proposed in this pull request? Typo fixes. No functional changes. ## How was this patch tested? Built the sources and ran with samples. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark typo-fixes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11802.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11802 commit c870a8cacbf0e62aab83bae56018f9a89f9e6fa0 Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-03-17T23:10:53Z [MINOR] Typo fixes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR] Typo fixes
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/11802#issuecomment-198593089 Thanks @srowen @thunterdb for review! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR] Typo fixes
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/11802#discussion_r56738903 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala --- @@ -277,7 +277,7 @@ abstract class DStream[T: ClassTag] ( logInfo(s"Slide time = $slideDuration") logInfo(s"Storage level = ${storageLevel.description}") logInfo(s"Checkpoint interval = $checkpointDuration") -logInfo(s"Remember duration = $rememberDuration") +logInfo(s"Remember interval = $rememberDuration") --- End diff -- It's consistent with "Checkpoint interval" above and the general impression that it's described as "interval" while called "Duration" in the code (due to the type). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13825][CORE] Upgrade to Scala 2.11.8
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/11681#issuecomment-196929219 @srowen Mind triggering a test? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13825][CORE] Upgrade to Scala 2.11.8
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/11681 [SPARK-13825][CORE] Upgrade to Scala 2.11.8 ## What changes were proposed in this pull request? Upgrade to 2.11.8 (from the current 2.11.7) ## How was this patch tested? A manual build You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark SPARK-13825-scala-2_11_8 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11681.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11681 commit 5ae9816ce17ba2ce9b708a68e8a1fd2540024872 Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-03-12T22:05:24Z [SPARK-13825][CORE] Upgrade to Scala 2.11.8 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13776][WebUI]Limit the max number of ac...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/11615#discussion_r55883199 --- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala --- @@ -270,9 +270,25 @@ private[spark] object JettyUtils extends Logging { gzipHandlers.foreach(collection.addHandler) connectors.foreach(_.setHost(hostName)) + // As each acceptor and each selector will use one thread, the number of threads should at + // least be the number of acceptors and selectors plus 1. (See SPARK-13776) + var minThreads = 1 + connectors.foreach { c => +// Currently we only use "SelectChannelConnector" +val connector = c.asInstanceOf[SelectChannelConnector] +// Limit the max acceptor number to 8 so that we don't waste a lot of threads +connector.setAcceptors(math.min(connector.getAcceptors, 8)) +// The number of selectors always equals to the number of acceptors +minThreads += connector.getAcceptors * 2 + } server.setConnectors(connectors.toArray) val pool = new QueuedThreadPool + if (serverName.nonEmpty) { +pool.setName(serverName) + } + pool.setMaxThreads(math.max(pool.getMaxThreads, minThreads)) + pool.setMinThreads(math.min(pool.getMinThreads, pool.getMaxThreads)) --- End diff -- So, could getMinThreads be greater than getMaxThreads ever? Why are you doing the line? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13776][WebUI]Limit the max number of ac...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/11615#discussion_r55869059 --- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala --- @@ -270,9 +270,25 @@ private[spark] object JettyUtils extends Logging { gzipHandlers.foreach(collection.addHandler) connectors.foreach(_.setHost(hostName)) + // As each acceptor and each selector will use one thread, the number of threads should at + // least be the number of acceptors and selectors plus 1. (See SPARK-13776) + var minThreads = 1 + connectors.foreach { c => +// Currently we only use "SelectChannelConnector" +val connector = c.asInstanceOf[SelectChannelConnector] +// Limit the max acceptor number to 8 so that we don't waste a lot of threads +connector.setAcceptors(math.min(connector.getAcceptors, 8)) +// The number of selectors always equals to the number of acceptors +minThreads += connector.getAcceptors * 2 + } server.setConnectors(connectors.toArray) val pool = new QueuedThreadPool + if (serverName.nonEmpty) { +pool.setName(serverName) + } + pool.setMaxThreads(math.max(pool.getMaxThreads, minThreads)) + pool.setMinThreads(math.min(pool.getMinThreads, pool.getMaxThreads)) --- End diff -- When could `getMinThreads` be greater than `getMaxThreads`? Why don't you use `minThreads` here, too? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13776][WebUI]Limit the max number of ac...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/11615#discussion_r55868740 --- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala --- @@ -270,9 +270,25 @@ private[spark] object JettyUtils extends Logging { gzipHandlers.foreach(collection.addHandler) connectors.foreach(_.setHost(hostName)) + // As each acceptor and each selector will use one thread, the number of threads should at + // least be the number of acceptors and selectors plus 1. (See SPARK-13776) + var minThreads = 1 + connectors.foreach { c => --- End diff -- Use `collect` or "cast" using pattern matching on type. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13727] [SQL] SparkConf.contains does no...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/11568#discussion_r55408473 --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala --- @@ -351,7 +351,16 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging { def getAppId: String = get("spark.app.id") /** Does the configuration contain a given parameter? */ - def contains(key: String): Boolean = settings.containsKey(key) + def contains(key: String): Boolean = { +if (settings.containsKey(key)) { + true +} else { + // try to find the settings in the alternatives + configsWithAlternatives.get(key).flatMap { alts => +alts.collectFirst { case alt if contains(alt.key) => true } + }.isDefined --- End diff -- It's far too complicated. Would `configsWithAlternatives.get("one").contains(...)` work here? What is this supposed to do? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13727] [SQL] SparkConf.contains does no...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/11568#discussion_r55406003 --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala --- @@ -351,7 +351,16 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging { def getAppId: String = get("spark.app.id") /** Does the configuration contain a given parameter? */ - def contains(key: String): Boolean = settings.containsKey(key) + def contains(key: String): Boolean = { +if (settings.containsKey(key)) { --- End diff -- It always bothers me when I see `if (true) true`. I think `settings.containsKey(key) || ...` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12458][SQL] Add ExpressionDescription t...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/10428#discussion_r55347052 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -126,6 +144,13 @@ case class Hour(child: Expression) extends UnaryExpression with ImplicitCastInpu } } +/* + * Returns the minute componemnt of the string/timestamp/interval + */ +@ExpressionDescription( + usage = "_FUNC_(timestamp_param) - Returns the minute componemnt of the timestamp value.", --- End diff -- s/the minute componemnt/the minute component --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11535][ML] handling empty string in Str...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/11575#discussion_r55345145 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -178,6 +178,20 @@ class StringIndexerSuite } } + test("StringIndexer on column with empty string values") { +val data = sc.parallelize(Seq((0, "a"), (1, ""), (2, "c"), (3, "a"), (4, "a"), (5, "c")), 2) +val df = sqlContext.createDataFrame(data).toDF("id", "label") --- End diff -- These two lines could be written as: ``` Seq((0, "a"), (1, ""), (2, "c"), (3, "a"), (4, "a"), (5, "c")).toDF("id", "label") ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [STREAMING][DOCS] Fixes and code improvements ...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/11201#issuecomment-186008514 @srowen Mind having a look? I'd appreciate. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [STREAMING][DOCS] Fixes and code improvements ...
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/11201 [STREAMING][DOCS] Fixes and code improvements for checkpointing You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark docs-streaming-checkpointing Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11201.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11201 commit 3eb999f04eadcf603ecb551f9bf78dc776b80dbe Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-02-14T14:12:11Z [STREAMING][DOCS] Fixes and code improvements for checkpointing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Fix for [SPARK-12854][SQL] Implement complex t...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/10946#issuecomment-177873394 @JoshRosen @rxin a friendly reminder to merge the change (or close it if irrelevant) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Fix for [SPARK-12854][SQL] Implement complex t...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/10946#issuecomment-176115262 I'd appreciate having it merged since the current master is broken without the fix - checked early morning today. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Fix for [SPARK-12854][SQL] Implement complex t...
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/10946 Fix for [SPARK-12854][SQL] Implement complex types support in Columna⦠â¦rBatch Fixes build for Scala 2.11. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark SPARK-12854-fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10946.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10946 commit c13743b3a9214f61df1eac6b16e216f2a9c0cc7c Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-01-27T09:12:49Z Fix for [SPARK-12854][SQL] Implement complex types support in ColumnarBatch --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [STREAMING][MINOR] Scaladoc + logs
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/10878 [STREAMING][MINOR] Scaladoc + logs Found while doing code review You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark streaming-scaladoc-logs-tiny-fixes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10878.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10878 commit 6af94bee120c62fe5b50d2cd7345ed96f70d2ceb Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-01-22T10:41:31Z [STREAMING][MINOR] Scaladoc + logs --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [STREAMING][MINOR] Typo fixes
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/10698 [STREAMING][MINOR] Typo fixes You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark streaming-kafka-typo-fixes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10698.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10698 commit c9f7e085d92ca3d1b4d51b6fd29d57eb5987d153 Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-01-11T12:28:49Z [STREAMING][MINOR] Typo fixes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12736][CORE][DEPLOY] Standalone Master ...
Github user jaceklaskowski commented on the pull request: https://github.com/apache/spark/pull/10674#issuecomment-170238716 Yes, it does. I'm using the latest revision + the change. It's a serious issue since standalone Master cannot be started as of today. I do not know how it's supposed to have been fixed, but that's exactly what helped to resolve the issue. Any help appreciated to make it better. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12736][CORE][DEPLOY] Standalone Master ...
GitHub user jaceklaskowski opened a pull request: https://github.com/apache/spark/pull/10674 [SPARK-12736][CORE][DEPLOY] Standalone Master cannot be started due t⦠â¦o NoClassDefFoundError: org/spark-project/guava/collect/Maps /cc @srowen @rxin You can merge this pull request into a Git repository by running: $ git pull https://github.com/jaceklaskowski/spark SPARK-12736 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10674.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10674 commit 16c228d9227731e1228e75b419b06bcf29e8765c Author: Jacek Laskowski <ja...@japila.pl> Date: 2016-01-09T11:34:07Z [SPARK-12736][CORE][DEPLOY] Standalone Master cannot be started due to NoClassDefFoundError: org/spark-project/guava/collect/Maps --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12736][CORE][DEPLOY] Standalone Master ...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/10674#discussion_r49266731 --- Diff: network/common/pom.xml --- @@ -55,6 +55,7 @@ com.google.guava guava + compile --- End diff -- Thought so, but the line has fixed standalone Master to start (after the line got removed in https://github.com/apache/spark/commit/659fd9d04b988d48960eac4f352ca37066f43f5c). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org