[GitHub] spark issue #20490: [SPARK-23323][SQL]: Support commit coordinator for DataS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20490 Merged build finished. Test PASSed.
[GitHub] spark issue #20549: SPARK-18844[MLLIB] Add more binary classification metric...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20549 Can one of the admins verify this patch?
[GitHub] spark issue #20490: [SPARK-23323][SQL]: Support commit coordinator for DataS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20490 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/722/ Test PASSed.
[GitHub] spark issue #20490: [SPARK-23323][SQL]: Support commit coordinator for DataS...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20490 @cloud-fan, please have another look. I fixed the problems you spotted. I haven't added support for the streaming side. It is different enough that I think we should do it in a follow-up. More details are on the thread.
[GitHub] spark pull request #20490: [SPARK-23323][SQL]: Support commit coordinator fo...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20490#discussion_r167011220

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java ---
@@ -78,10 +78,11 @@ default void onDataWriterCommit(WriterCommitMessage message) {}
  * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
  * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
  *
- * Note that, one partition may have multiple committed data writers because of speculative tasks.
- * Spark will pick the first successful one and get its commit message. Implementations should be
- * aware of this and handle it correctly, e.g., have a coordinator to make sure only one data
- * writer can commit, or have a way to clean up the data of already-committed writers.
+ * Note that speculative execution may cause multiple tasks to run for a partition. By default,
+ * Spark uses the OutputCommitCoordinator to allow only one attempt to commit.
+ * {@link DataWriterFactory} implementations can disable this behavior. If disabled, multiple
--- End diff --

I clarified this and added a note about how to do it.
[GitHub] spark pull request #20490: [SPARK-23323][SQL]: Support commit coordinator fo...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20490#discussion_r167011250

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java ---
@@ -78,10 +78,11 @@ default void onDataWriterCommit(WriterCommitMessage message) {}
  * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
  * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
  *
- * Note that, one partition may have multiple committed data writers because of speculative tasks.
- * Spark will pick the first successful one and get its commit message. Implementations should be
- * aware of this and handle it correctly, e.g., have a coordinator to make sure only one data
- * writer can commit, or have a way to clean up the data of already-committed writers.
+ * Note that speculative execution may cause multiple tasks to run for a partition. By default,
+ * Spark uses the OutputCommitCoordinator to allow only one attempt to commit.
--- End diff --

Fixed.
[GitHub] spark pull request #20490: [SPARK-23323][SQL]: Support commit coordinator fo...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20490#discussion_r167011291

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java ---
@@ -32,6 +32,16 @@
 @InterfaceStability.Evolving
 public interface DataWriterFactory extends Serializable {
 
+  /**
+   * Returns whether Spark should use the OutputCommitCoordinator to ensure that only one attempt
+   * for each task commits.
+   *
+   * @return true if commit coordinator should be used, false otherwise.
+   */
+  default boolean useCommitCoordinator() {
--- End diff --

I moved it. Good idea.
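[Editor's note] For context on the mechanism under discussion, a minimal sketch of a source opting out of coordination by overriding the default method shown in the diff above. The class name is hypothetical, and per the comment above ("I moved it") the method's final location may differ from what the diff shows:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.writer.{DataWriter, DataWriterFactory}

// Hypothetical factory whose writers commit idempotently (e.g. each attempt
// writes to a deterministic, overwritable location), so Spark's commit
// coordination can be safely disabled.
class IdempotentWriterFactory extends DataWriterFactory[InternalRow] {
  override def useCommitCoordinator(): Boolean = false

  override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[InternalRow] = {
    // return a writer whose commit is safe to run more than once
    ???
  }
}
```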
[GitHub] spark issue #20549: SPARK-18844[MLLIB] Add more binary classification metric...
Github user sandecho commented on the issue: https://github.com/apache/spark/pull/20549 [SPARK-18844.zip](https://github.com/apache/spark/files/1708136/SPARK-18844.zip)
[GitHub] spark pull request #20549: SPARK-18844[MLLIB] Add more binary classification...
GitHub user sandecho reopened a pull request:

https://github.com/apache/spark/pull/20549

SPARK-18844[MLLIB] Add more binary classification metrics to BinaryClassificationMetrics

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sandecho/spark new_branch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20549.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #20549

commit 9f33d677586043fe7c75ac1930c51c138f281a49
Author: Sandeep Kumar Choudhary
Date: 2018-02-08T16:49:13Z

Add more binary classification metrics to BinaryClassificationMetrics

commit d7144f63a99e575d5c996fd7919bdbe44266620f
Author: Sandeep Kumar Choudhary
Date: 2018-02-08T17:20:52Z

SPARK-18844
[GitHub] spark issue #20521: [SPARK-22977][SQL] fix web UI SQL tab for CTAS
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20521 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87219/ Test FAILed.
[GitHub] spark issue #20521: [SPARK-22977][SQL] fix web UI SQL tab for CTAS
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20521 Merged build finished. Test FAILed.
[GitHub] spark issue #20546: [SPARK-20659][Core] Removing sc.getExecutorStorageStatus...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20546 Merged build finished. Test FAILed.
[GitHub] spark issue #20521: [SPARK-22977][SQL] fix web UI SQL tab for CTAS
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20521

**[Test build #87219 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87219/testReport)** for PR 20521 at commit [`87fd406`](https://github.com/apache/spark/commit/87fd406813d5bc263d07b8aa87a87d6b360133de).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20546: [SPARK-20659][Core] Removing sc.getExecutorStorageStatus...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20546 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87216/ Test FAILed.
[GitHub] spark pull request #20490: [SPARK-23323][SQL]: Support commit coordinator fo...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20490#discussion_r167009143

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala ---
@@ -117,20 +118,43 @@ object DataWritingSparkTask extends Logging {
       writeTask: DataWriterFactory[InternalRow],
       context: TaskContext,
       iter: Iterator[InternalRow]): WriterCommitMessage = {
-    val dataWriter = writeTask.createDataWriter(context.partitionId(), context.attemptNumber())
+    val stageId = context.stageId()
+    val partId = context.partitionId()
+    val attemptId = context.attemptNumber()
+    val dataWriter = writeTask.createDataWriter(partId, attemptId)
 
     // write the data and commit this writer.
     Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
       iter.foreach(dataWriter.write)
-      logInfo(s"Writer for partition ${context.partitionId()} is committing.")
-      val msg = dataWriter.commit()
-      logInfo(s"Writer for partition ${context.partitionId()} committed.")
+
+      val msg = if (writeTask.useCommitCoordinator) {
--- End diff --

Is it a good idea to add streaming to this commit? The changes differ significantly. It isn't clear how commit coordination happens for streaming writes. The OutputCommitCoordinator's `canCommit` method takes stage, partition, and attempt ids, not epochs. Either the other components aren't ready to have commit coordination, or I'm not familiar enough with how it is done for streaming. I think we can keep the two separate, and I'm happy to open a follow-up issue for the streaming side.
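[Editor's note] For reference, a condensed sketch of the batch-side coordination the diff above introduces, including the deny branch the quoted hunk truncates. `writeTask`, `dataWriter`, `stageId`, `partId`, and `attemptId` come from the surrounding task body; this is an outline of the control flow, not the exact merged code:

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.executor.CommitDeniedException

val msg = if (writeTask.useCommitCoordinator) {
  // ask the driver-side OutputCommitCoordinator whether this attempt may commit
  val coordinator = SparkEnv.get.outputCommitCoordinator
  if (coordinator.canCommit(stageId, partId, attemptId)) {
    dataWriter.commit()
  } else {
    // another attempt for this partition was already authorized to commit
    throw new CommitDeniedException(
      s"attempt $attemptId denied to commit for stage $stageId, partition $partId",
      stageId, partId, attemptId)
  }
} else {
  // source opted out: commit unconditionally, as before this change
  dataWriter.commit()
}
```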
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20537

**[Test build #87223 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87223/testReport)** for PR 20537 at commit [`2c1a258`](https://github.com/apache/spark/commit/2c1a2582c04a5b9cb7d011892343ca0a07ddb854).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20537 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87223/ Test PASSed.
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20537 Merged build finished. Test PASSed.
[GitHub] spark issue #20546: [SPARK-20659][Core] Removing sc.getExecutorStorageStatus...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20546

**[Test build #87216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87216/testReport)** for PR 20546 at commit [`3f5dba6`](https://github.com/apache/spark/commit/3f5dba62cfa8b07d602639e9f88f3fd8cec92831).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20549: Add more binary classification metrics to BinaryC...
Github user sandecho closed the pull request at: https://github.com/apache/spark/pull/20549
[GitHub] spark pull request #20549: Add more binary classification metrics to BinaryC...
GitHub user sandecho reopened a pull request:

https://github.com/apache/spark/pull/20549

Add more binary classification metrics to BinaryClassificationMetrics

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sandecho/spark new_branch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20549.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #20549

commit 9f33d677586043fe7c75ac1930c51c138f281a49
Author: Sandeep Kumar Choudhary
Date: 2018-02-08T16:49:13Z

Add more binary classification metrics to BinaryClassificationMetrics

commit d7144f63a99e575d5c996fd7919bdbe44266620f
Author: Sandeep Kumar Choudhary
Date: 2018-02-08T17:20:52Z

SPARK-18844
[GitHub] spark issue #20549: Add more binary classification metrics to BinaryClassifi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20549 Can one of the admins verify this patch?
[GitHub] spark pull request #20549: Add more binary classification metrics to BinaryC...
Github user sandecho closed the pull request at: https://github.com/apache/spark/pull/20549
[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r167005865

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -599,8 +598,11 @@ class ALSSuite
       (ex, act) =>
         ex.userFactors.first().getSeq[Float](1) === act.userFactors.first.getSeq[Float](1)
     } {
       (ex, act, _) =>
-        ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
-          act.transform(_: DataFrame).select("prediction").first.getDouble(0) absTol 1e-6
+        testTransformerByGlobalCheckFunc[Float](_: DataFrame, act, "prediction") {
+          case actRows: Seq[Row] =>
+            ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
+              actRows(0).getDouble(0) absTol 1e-6
+        }
--- End diff --

Whoa, I didn't check the original functionality in such depth. This is really dead code in the runtime environment. Fixed.
[GitHub] spark pull request #20549: Add more binary classification metrics to BinaryC...
GitHub user sandecho opened a pull request:

https://github.com/apache/spark/pull/20549

Add more binary classification metrics to BinaryClassificationMetrics

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sandecho/spark new_branch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20549.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #20549

commit 9f33d677586043fe7c75ac1930c51c138f281a49
Author: Sandeep Kumar Choudhary
Date: 2018-02-08T16:49:13Z

Add more binary classification metrics to BinaryClassificationMetrics
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r167005107

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,17 +17,130 @@
 package org.apache.spark.sql.execution.datasources.v2
 
+import java.util.UUID
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.spark.sql.{AnalysisException, SaveMode}
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
-import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
-import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
+import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport}
+import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics}
+import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
+import org.apache.spark.sql.types.StructType
 
 case class DataSourceV2Relation(
-    output: Seq[AttributeReference],
-    reader: DataSourceReader)
-  extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder {
+    source: DataSourceV2,
+    options: Map[String, String],
+    projection: Option[Seq[AttributeReference]] = None,
+    filters: Option[Seq[Expression]] = None,
+    userSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation {
+
+  override def simpleString: String = {
+    s"DataSourceV2Relation(source=$sourceName, " +
+      s"schema=[${output.map(a => s"$a ${a.dataType.simpleString}").mkString(", ")}], " +
+      s"filters=[${pushedFilters.mkString(", ")}], options=$options)"
+  }
+
+  override lazy val schema: StructType = reader.readSchema()
+
+  override lazy val output: Seq[AttributeReference] = {
+    projection match {
+      case Some(attrs) =>
+        // use the projection attributes to avoid assigning new ids. fields that are not projected
+        // will be assigned new ids, which is okay because they are not projected.
+        val attrMap = attrs.map(a => a.name -> a).toMap
+        schema.map(f => attrMap.getOrElse(f.name,
+          AttributeReference(f.name, f.dataType, f.nullable, f.metadata)()))
+      case _ =>
+        schema.toAttributes
+    }
+  }
+
+  private lazy val v2Options: DataSourceOptions = {
+    // ensure path and table options are set correctly
+    val updatedOptions = new mutable.HashMap[String, String]
+    updatedOptions ++= options
+
+    new DataSourceOptions(options.asJava)
+  }
+
+  private val sourceName: String = {
+    source match {
+      case registered: DataSourceRegister =>
+        registered.shortName()
+      case _ =>
+        source.getClass.getSimpleName
+    }
+  }
+
+  lazy val (
+      reader: DataSourceReader,
+      unsupportedFilters: Seq[Expression],
+      pushedFilters: Seq[Expression]) = {
+    val newReader = userSchema match {
+      case Some(s) =>
+        asReadSupportWithSchema.createReader(s, v2Options)
+      case _ =>
+        asReadSupport.createReader(v2Options)
+    }
+
+    projection.foreach { attrs =>
+      DataSourceV2Relation.pushRequiredColumns(newReader, attrs.toStructType)
+    }
+
+    val (remainingFilters, pushedFilters) = filters match {
+      case Some(filterSeq) =>
+        DataSourceV2Relation.pushFilters(newReader, filterSeq)
+      case _ =>
+        (Nil, Nil)
+    }
+
+    (newReader, remainingFilters, pushedFilters)
+  }
 
-  override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2Relation]
+  def writer(dfSchema: StructType, mode: SaveMode): Option[DataSourceWriter] = {
--- End diff --

If you want this removed to get this commit in, please say so.
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r167004846

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,17 +17,130 @@
 package org.apache.spark.sql.execution.datasources.v2
 
+import java.util.UUID
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.spark.sql.{AnalysisException, SaveMode}
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
-import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
-import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
+import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport}
+import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics}
+import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
+import org.apache.spark.sql.types.StructType
 
 case class DataSourceV2Relation(
-    output: Seq[AttributeReference],
-    reader: DataSourceReader)
-  extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder {
+    source: DataSourceV2,
+    options: Map[String, String],
+    projection: Option[Seq[AttributeReference]] = None,
+    filters: Option[Seq[Expression]] = None,
+    userSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation {
+
+  override def simpleString: String = {
+    s"DataSourceV2Relation(source=$sourceName, " +
+      s"schema=[${output.map(a => s"$a ${a.dataType.simpleString}").mkString(", ")}], " +
+      s"filters=[${pushedFilters.mkString(", ")}], options=$options)"
+  }
+
+  override lazy val schema: StructType = reader.readSchema()
+
+  override lazy val output: Seq[AttributeReference] = {
--- End diff --

That commit was not reverted when I rebased. The test is still present and passing: https://github.com/apache/spark/blob/181946d1f1c5889661544830a77bd23c4b4f685a/sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala#L320-L336
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r167001183

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,17 +17,130 @@
 package org.apache.spark.sql.execution.datasources.v2
 
+import java.util.UUID
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.spark.sql.{AnalysisException, SaveMode}
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
-import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
-import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
+import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport}
+import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics}
+import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
+import org.apache.spark.sql.types.StructType
 
 case class DataSourceV2Relation(
-    output: Seq[AttributeReference],
-    reader: DataSourceReader)
-  extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder {
+    source: DataSourceV2,
+    options: Map[String, String],
+    projection: Option[Seq[AttributeReference]] = None,
+    filters: Option[Seq[Expression]] = None,
+    userSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation {
+
+  override def simpleString: String = {
+    s"DataSourceV2Relation(source=$sourceName, " +
+      s"schema=[${output.map(a => s"$a ${a.dataType.simpleString}").mkString(", ")}], " +
+      s"filters=[${pushedFilters.mkString(", ")}], options=$options)"
+  }
+
+  override lazy val schema: StructType = reader.readSchema()
+
+  override lazy val output: Seq[AttributeReference] = {
+    projection match {
+      case Some(attrs) =>
+        // use the projection attributes to avoid assigning new ids. fields that are not projected
+        // will be assigned new ids, which is okay because they are not projected.
+        val attrMap = attrs.map(a => a.name -> a).toMap
+        schema.map(f => attrMap.getOrElse(f.name,
+          AttributeReference(f.name, f.dataType, f.nullable, f.metadata)()))
+      case _ =>
+        schema.toAttributes
+    }
+  }
+
+  private lazy val v2Options: DataSourceOptions = {
+    // ensure path and table options are set correctly
+    val updatedOptions = new mutable.HashMap[String, String]
+    updatedOptions ++= options
+
+    new DataSourceOptions(options.asJava)
--- End diff --

Also, keep in mind that this is a lazy val. It is only referenced when creating a reader or writer.
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20537 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87222/ Test FAILed.
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20537

**[Test build #87222 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87222/testReport)** for PR 20537 at commit [`f6b5d28`](https://github.com/apache/spark/commit/f6b5d2868c3ca7c8c2cc2bfb6e7a06ce7c01998c).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20537 Merged build finished. Test FAILed.
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20537 **[Test build #87223 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87223/testReport)** for PR 20537 at commit [`2c1a258`](https://github.com/apache/spark/commit/2c1a2582c04a5b9cb7d011892343ca0a07ddb854).
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20537 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/721/ Test PASSed.
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20537 Merged build finished. Test PASSed.
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r166998163

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,17 +17,130 @@
 package org.apache.spark.sql.execution.datasources.v2
 
+import java.util.UUID
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.spark.sql.{AnalysisException, SaveMode}
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
-import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
-import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
+import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport}
+import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics}
+import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
+import org.apache.spark.sql.types.StructType
 
 case class DataSourceV2Relation(
-    output: Seq[AttributeReference],
-    reader: DataSourceReader)
-  extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder {
+    source: DataSourceV2,
+    options: Map[String, String],
+    projection: Option[Seq[AttributeReference]] = None,
+    filters: Option[Seq[Expression]] = None,
+    userSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation {
+
+  override def simpleString: String = {
+    s"DataSourceV2Relation(source=$sourceName, " +
+      s"schema=[${output.map(a => s"$a ${a.dataType.simpleString}").mkString(", ")}], " +
+      s"filters=[${pushedFilters.mkString(", ")}], options=$options)"
+  }
+
+  override lazy val schema: StructType = reader.readSchema()
+
+  override lazy val output: Seq[AttributeReference] = {
--- End diff --

I'll have a look. I didn't realize you'd committed that one already.
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r166997697

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,17 +17,130 @@
 package org.apache.spark.sql.execution.datasources.v2
 
+import java.util.UUID
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.spark.sql.{AnalysisException, SaveMode}
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
-import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
-import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
+import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport}
+import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics}
+import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
+import org.apache.spark.sql.types.StructType
 
 case class DataSourceV2Relation(
-    output: Seq[AttributeReference],
-    reader: DataSourceReader)
-  extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder {
+    source: DataSourceV2,
+    options: Map[String, String],
+    projection: Option[Seq[AttributeReference]] = None,
+    filters: Option[Seq[Expression]] = None,
+    userSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation {
+
+  override def simpleString: String = {
+    s"DataSourceV2Relation(source=$sourceName, " +
+      s"schema=[${output.map(a => s"$a ${a.dataType.simpleString}").mkString(", ")}], " +
+      s"filters=[${pushedFilters.mkString(", ")}], options=$options)"
+  }
+
+  override lazy val schema: StructType = reader.readSchema()
+
+  override lazy val output: Seq[AttributeReference] = {
+    projection match {
+      case Some(attrs) =>
+        // use the projection attributes to avoid assigning new ids. fields that are not projected
+        // will be assigned new ids, which is okay because they are not projected.
+        val attrMap = attrs.map(a => a.name -> a).toMap
+        schema.map(f => attrMap.getOrElse(f.name,
+          AttributeReference(f.name, f.dataType, f.nullable, f.metadata)()))
+      case _ =>
+        schema.toAttributes
+    }
+  }
+
+  private lazy val v2Options: DataSourceOptions = {
+    // ensure path and table options are set correctly
+    val updatedOptions = new mutable.HashMap[String, String]
+    updatedOptions ++= options
+
+    new DataSourceOptions(options.asJava)
+  }
+
+  private val sourceName: String = {
+    source match {
+      case registered: DataSourceRegister =>
+        registered.shortName()
+      case _ =>
+        source.getClass.getSimpleName
+    }
+  }
+
+  lazy val (
+      reader: DataSourceReader,
+      unsupportedFilters: Seq[Expression],
+      pushedFilters: Seq[Expression]) = {
+    val newReader = userSchema match {
+      case Some(s) =>
+        asReadSupportWithSchema.createReader(s, v2Options)
+      case _ =>
+        asReadSupport.createReader(v2Options)
+    }
+
+    projection.foreach { attrs =>
+      DataSourceV2Relation.pushRequiredColumns(newReader, attrs.toStructType)
+    }
+
+    val (remainingFilters, pushedFilters) = filters match {
+      case Some(filterSeq) =>
+        DataSourceV2Relation.pushFilters(newReader, filterSeq)
+      case _ =>
+        (Nil, Nil)
+    }
+
+    (newReader, remainingFilters, pushedFilters)
+  }
 
-  override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2Relation]
+  def writer(dfSchema: StructType, mode: SaveMode): Option[DataSourceWriter] = {
--- End diff --

No, it isn't. But a relation should be able to return a writer. This is going to be needed as we improve the logical plans used by v2.
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r166997241

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,17 +17,130 @@
 package org.apache.spark.sql.execution.datasources.v2
 
+import java.util.UUID
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.spark.sql.{AnalysisException, SaveMode}
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
-import org.apache.spark.sql.catalyst.expressions.AttributeReference
-import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
-import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
+import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport}
+import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics}
+import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
+import org.apache.spark.sql.types.StructType
 
 case class DataSourceV2Relation(
-    output: Seq[AttributeReference],
-    reader: DataSourceReader)
-  extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder {
+    source: DataSourceV2,
+    options: Map[String, String],
+    projection: Option[Seq[AttributeReference]] = None,
+    filters: Option[Seq[Expression]] = None,
+    userSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation {
+
+  override def simpleString: String = {
+    s"DataSourceV2Relation(source=$sourceName, " +
+      s"schema=[${output.map(a => s"$a ${a.dataType.simpleString}").mkString(", ")}], " +
+      s"filters=[${pushedFilters.mkString(", ")}], options=$options)"
+  }
+
+  override lazy val schema: StructType = reader.readSchema()
+
+  override lazy val output: Seq[AttributeReference] = {
+    projection match {
+      case Some(attrs) =>
+        // use the projection attributes to avoid assigning new ids. fields that are not projected
+        // will be assigned new ids, which is okay because they are not projected.
+        val attrMap = attrs.map(a => a.name -> a).toMap
+        schema.map(f => attrMap.getOrElse(f.name,
+          AttributeReference(f.name, f.dataType, f.nullable, f.metadata)()))
+      case _ =>
+        schema.toAttributes
+    }
+  }
+
+  private lazy val v2Options: DataSourceOptions = {
+    // ensure path and table options are set correctly
+    val updatedOptions = new mutable.HashMap[String, String]
+    updatedOptions ++= options
+
+    new DataSourceOptions(options.asJava)
--- End diff --

As we've already discussed at length, I think it is a bad idea to create `DataSourceOptions` and pass it to the relation.
[GitHub] spark issue #20525: [SPARK-23271[SQL] Parquet output contains only _SUCCESS ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/20525 @gatorsmile Thanks. I will create a doc PR and address it.
[GitHub] spark pull request #20490: [SPARK-23323][SQL]: Support commit coordinator fo...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20490#discussion_r166995080

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java ---
@@ -78,10 +78,11 @@ default void onDataWriterCommit(WriterCommitMessage message) {}
  * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
  * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
  *
- * Note that, one partition may have multiple committed data writers because of speculative tasks.
- * Spark will pick the first successful one and get its commit message. Implementations should be
- * aware of this and handle it correctly, e.g., have a coordinator to make sure only one data
- * writer can commit, or have a way to clean up the data of already-committed writers.
+ * Note that speculative execution may cause multiple tasks to run for a partition. By default,
+ * Spark uses the OutputCommitCoordinator to allow only one attempt to commit.
+ * {@link DataWriterFactory} implementations can disable this behavior. If disabled, multiple
--- End diff --

It says that already: "DataWriterFactory implementations can disable this behavior."
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20537

**[Test build #87221 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87221/testReport)** for PR 20537 at commit [`304666a`](https://github.com/apache/spark/commit/304666ad089d497d666de25476955da52aae5395).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20537 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87221/ Test FAILed.
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20537 Merged build finished. Test FAILed.
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20537 **[Test build #87222 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87222/testReport)** for PR 20537 at commit [`f6b5d28`](https://github.com/apache/spark/commit/f6b5d2868c3ca7c8c2cc2bfb6e7a06ce7c01998c).
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20537 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/720/ Test PASSed.
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20537 Merged build finished. Test PASSed.
[GitHub] spark issue #20408: [SPARK-23189][Core][Web UI] Reflect stage level blacklis...
Github user squito commented on the issue: https://github.com/apache/spark/pull/20408 Just a quick note -- I realized I was confused about one part of the inner workings of the history server, which I want to confirm before I merge this, but I got sick and now have a bit of a backlog. I'll get back to this next week. Sorry for the delay.
[GitHub] spark issue #20546: [SPARK-20659][Core] Remove StorageStatus, or make it pri...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20546 This won't go into 2.3. Also, please don't copy & paste the bug title in your PR. Explain what you're doing instead. The current title does not explain what the change does.
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20537 **[Test build #87221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87221/testReport)** for PR 20537 at commit [`304666a`](https://github.com/apache/spark/commit/304666ad089d497d666de25476955da52aae5395).
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20537 Merged build finished. Test PASSed.
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20537 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/719/ Test PASSed.
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Merged build finished. Test FAILed.
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20477

**[Test build #87220 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87220/testReport)** for PR 20477 at commit [`4bff16d`](https://github.com/apache/spark/commit/4bff16deb6debda55ae7bdcbe56fbc527ff73d23).

* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87220/ Test FAILed.
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20477 **[Test build #87220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87220/testReport)** for PR 20477 at commit [`4bff16d`](https://github.com/apache/spark/commit/4bff16deb6debda55ae7bdcbe56fbc527ff73d23).
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/718/ Test PASSed.
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Merged build finished. Test PASSed.
[GitHub] spark issue #20521: [SPARK-22977][SQL] fix web UI SQL tab for CTAS
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20521 **[Test build #87219 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87219/testReport)** for PR 20521 at commit [`87fd406`](https://github.com/apache/spark/commit/87fd406813d5bc263d07b8aa87a87d6b360133de).
[GitHub] spark issue #20521: [SPARK-22977][SQL] fix web UI SQL tab for CTAS
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20521 Merged build finished. Test PASSed.
[GitHub] spark issue #20521: [SPARK-22977][SQL] fix web UI SQL tab for CTAS
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20521 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/717/ Test PASSed.
[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20537#discussion_r166975718

--- Diff: python/pyspark/sql/types.py ---
@@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone):
     # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
     if is_datetime64_dtype(s.dtype):
         tz = timezone or 'tzlocal()'
-        return s.dt.tz_localize(tz).dt.tz_convert('UTC')
+        """
+        tz_localize with ambiguous=False has the same behavior of pytz.localize
+        >>> import datetime
+        >>> import pandas as pd
+        >>> import pytz
+        >>>
+        >>> t = datetime.datetime(2015, 11, 1, 1, 23, 24)
+        >>> ts = pd.Series([t])
+        >>> tz = pytz.timezone('America/New_York')
+        >>>
+        >>> ts.dt.tz_localize(tz, ambiguous=False)
+        >>> 0   2015-11-01 01:23:24-05:00
+        >>> dtype: datetime64[ns, America/New_York]
+        >>>
+        >>> ts.dt.tz_localize(tz, ambiguous=True)
+        >>> 0   2015-11-01 01:23:24-04:00
+        >>> dtype: datetime64[ns, America/New_York]
+        >>>
+        >>> str(tz.localize(t))
+        >>> '2015-11-01 01:23:24-05:00'
+        """
+        return s.dt.tz_localize(tz, ambiguous=False).dt.tz_convert('UTC')
--- End diff --

Yes, will create a new one for `pandas_udf`. It seems `ambiguous=False` is undocumented in the method doc. @jreback, can you please confirm this usage is correct?
[GitHub] spark pull request #19077: [SPARK-21860][core]Improve memory reuse for heap ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19077
[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20537#discussion_r166974650

--- Diff: python/pyspark/sql/types.py ---
@@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone):
     # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
     if is_datetime64_dtype(s.dtype):
         tz = timezone or 'tzlocal()'
-        return s.dt.tz_localize(tz).dt.tz_convert('UTC')
+        """
+        tz_localize with ambiguous=False has the same behavior of pytz.localize
+        >>> import datetime
+        >>> import pandas as pd
+        >>> import pytz
+        >>>
+        >>> t = datetime.datetime(2015, 11, 1, 1, 23, 24)
+        >>> ts = pd.Series([t])
+        >>> tz = pytz.timezone('America/New_York')
+        >>>
+        >>> ts.dt.tz_localize(tz, ambiguous=False)
+        >>> 0   2015-11-01 01:23:24-05:00
+        >>> dtype: datetime64[ns, America/New_York]
+        >>>
+        >>> ts.dt.tz_localize(tz, ambiguous=True)
+        >>> 0   2015-11-01 01:23:24-04:00
+        >>> dtype: datetime64[ns, America/New_York]
+        >>>
+        >>> str(tz.localize(t))
+        >>> '2015-11-01 01:23:24-05:00'
--- End diff --

Yeah, let me clean up the format...
[GitHub] spark pull request #20537: [SPARK-23314][PYTHON] Add ambiguous=False when lo...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20537#discussion_r166974367

--- Diff: python/pyspark/sql/types.py ---
@@ -1730,7 +1730,28 @@ def _check_series_convert_timestamps_internal(s, timezone):
     # TODO: handle nested timestamps, such as ArrayType(TimestampType())?
     if is_datetime64_dtype(s.dtype):
         tz = timezone or 'tzlocal()'
-        return s.dt.tz_localize(tz).dt.tz_convert('UTC')
+        """
+        tz_localize with ambiguous=False has the same behavior of pytz.localize
--- End diff --

Oh, definitely not doctest... Let me change it to comments.
[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19077 thanks, merging to master!
[GitHub] spark issue #20548: [SPARK-23316][SQL] AnalysisException after max iteration...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20548 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/716/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19431: [SPARK-18580] [DStreams] [external/kafka-0-10][external/...
Github user gaborgsomogyi commented on the issue: https://github.com/apache/spark/pull/19431 I mean the difference between `test("use backpressure.initialRate with backpressure")` and `test("backpressure.initialRate should honor maxRatePerPartition")` is 3 numbers. Wrapping the common code into one function and making 2 function calls would be better. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
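In other words, parameterize the shared setup. A hypothetical sketch, in Python for brevity (the real tests are Scala, and only the three numbers vary; the 150 in the second call follows from a 300 msg/s cap over a 0.5 s batch):

```python
def run_initial_rate_test(max_rate_per_partition, initial_rate, expected_messages):
    # Hypothetical shared helper: build the conf with the two rates, start
    # the direct stream, and assert on maxMessagesPerPartition. The common
    # setup is elided here; see the Scala tests quoted later in this thread.
    pass


def test_use_initial_rate_with_backpressure():
    run_initial_rate_test(max_rate_per_partition=1000, initial_rate=500,
                          expected_messages=250)


def test_initial_rate_honors_max_rate_per_partition():
    run_initial_rate_test(max_rate_per_partition=300, initial_rate=1000,
                          expected_messages=150)
```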
[GitHub] spark issue #20548: [SPARK-23316][SQL] AnalysisException after max iteration...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20548 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20548: [SPARK-23316][SQL] AnalysisException after max iteration...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20548 **[Test build #87218 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87218/testReport)** for PR 20548 at commit [`367c70b`](https://github.com/apache/spark/commit/367c70bd3aa9cf82358462deb624b7634567f0c9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20548: [SPARK-23316][SQL] AnalysisException after max it...
GitHub user bogdanrdc opened a pull request: https://github.com/apache/spark/pull/20548 [SPARK-23316][SQL] AnalysisException after max iteration reached for IN query ## What changes were proposed in this pull request? Added a flag ignoreNullability to DataType.equalsStructurally. The previous semantics correspond to ignoreNullability=false. When ignoreNullability=true, equalsStructurally ignores the nullability of contained types (map key types, value types, array element types, structure field types). In.checkInputTypes calls equalsStructurally to check whether the child types match. They should match regardless of nullability (which is just a hint), so it is now called with ignoreNullability=true. ## How was this patch tested? New test in SubquerySuite. You can merge this pull request into a Git repository by running: $ git pull https://github.com/bogdanrdc/spark SPARK-23316 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20548.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20548 commit 367c70bd3aa9cf82358462deb624b7634567f0c9 Author: Bogdan Raducanu Date: 2018-02-08T15:19:34Z fix + test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
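To make the proposed flag concrete, here is a hypothetical Python sketch of the semantics described above (the real implementation is Scala in `DataType.equalsStructurally`; the types and names below are invented for illustration only):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StructField:
    dtype: object
    nullable: bool = True


@dataclass
class StructType:
    fields: List[StructField] = field(default_factory=list)


def equals_structurally(a, b, ignore_nullability=False):
    # Structures match when their fields match pairwise; with
    # ignore_nullability=True a nullable/non-nullable mismatch is tolerated,
    # since nullability is just a hint.
    if isinstance(a, StructType) and isinstance(b, StructType):
        return len(a.fields) == len(b.fields) and all(
            (ignore_nullability or f1.nullable == f2.nullable)
            and equals_structurally(f1.dtype, f2.dtype, ignore_nullability)
            for f1, f2 in zip(a.fields, b.fields)
        )
    return a == b  # leaf types: plain equality


# What In.checkInputTypes needs: the same shape, nullability ignored.
left = StructType([StructField("int", nullable=False)])
right = StructType([StructField("int", nullable=True)])
assert not equals_structurally(left, right)
assert equals_structurally(left, right, ignore_nullability=True)
```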
[GitHub] spark pull request #20547: [SPARK-23316][SQL] AnalysisException after max it...
Github user bogdanrdc closed the pull request at: https://github.com/apache/spark/pull/20547 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20545: [SPARK-23359][SQL] Adds an alias 'names' of 'fieldNames'...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20545 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87215/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20545: [SPARK-23359][SQL] Adds an alias 'names' of 'fieldNames'...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20545 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20547: [SPARK-23316][SQL] AnalysisException after max iteration...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20547 **[Test build #87217 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87217/testReport)** for PR 20547 at commit [`79e2593`](https://github.com/apache/spark/commit/79e2593b90ce33788e012ee28fc4cbd3bf6e4264). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20547: [SPARK-23316][SQL] AnalysisException after max iteration...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20547 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/715/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20545: [SPARK-23359][SQL] Adds an alias 'names' of 'fieldNames'...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20545 **[Test build #87215 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87215/testReport)** for PR 20545 at commit [`664a62c`](https://github.com/apache/spark/commit/664a62c7da9ba5da2007d40ef9c157f7e82938c5). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20547: [SPARK-23316][SQL] AnalysisException after max iteration...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20547 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19431: [SPARK-18580] [DStreams] [external/kafka-0-10][external/...
Github user akonopko commented on the issue: https://github.com/apache/spark/pull/19431 > Related the doc I thought it's kafka specific but it's not so fine like that Yes, it was implemented only in Kafka Streams, but the doc doesn't limit the usage of this parameter to Kafka. > good to merge the common functionalities Not sure I understood you correctly here. You mean in the tests? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20547: [SPARK-23316][SQL] AnalysisException after max it...
GitHub user bogdanrdc opened a pull request: https://github.com/apache/spark/pull/20547 [SPARK-23316][SQL] AnalysisException after max iteration reached for IN query ## What changes were proposed in this pull request? Added flag ignoreNullability to DataType.equalsStructurally. The previous semantic is for ignoreNullability=false. When ignoreNullability=true equalsStructurally ignores nullability of contained types (map key types, value types, array element types, structure field types). In.checkInputTypes calls equalsStructurally to check if the children types match. They should match regardless of nullability (which is just a hint), so it is now called with ignoreNullability=true. ## How was this patch tested? New test in SubquerySuite. You can merge this pull request into a Git repository by running: $ git pull https://github.com/bogdanrdc/spark SPARK-23316 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20547.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20547 commit 03a4281751e02acd2b97ceff6cf8e1621e83eb93 Author: Bogdan Raducanu Date: 2017-04-20T10:59:49Z fix + test commit 72cf1d117890abe45aa30c6b91a7e2c527fc4969 Author: Bogdan Raducanu Date: 2017-04-20T11:01:40Z reverted mistake commit commit 2c96a8d65059db3b808e05241b870ccd17937095 Author: Bogdan Raducanu Date: 2017-05-12T15:24:57Z erge remote-tracking branch 'upstream/master' commit fa11b0b97b38bb98b599a8edf1d43e01b067a926 Author: Bogdan Raducanu Date: 2017-05-23T11:28:34Z Merge remote-tracking branch 'upstream/master' commit 21ad3aa4468b58aa4e552e2922e1bceda61097f7 Author: Bogdan Raducanu Date: 2017-05-23T11:28:57Z Merge remote-tracking branch 'upstream/master' commit 7f78cce9d6869371e2e28ce5d9fc4766d7dbc3de Author: Bogdan Raducanu Date: 2017-06-06T11:19:35Z Merge remote-tracking branch 'upstream/master' commit eea2e5d466558aa3f2f6232024e8150dc246ba8a Author: Bogdan Raducanu Date: 2017-06-07T10:35:37Z Merge remote-tracking branch 'upstream/master' commit b30788eac9df5e76863393826230481e23e52550 Author: Bogdan Raducanu Date: 2017-06-16T10:49:48Z Merge remote-tracking branch 'upstream/master' commit 38a0347e3079a3e56d70b77f5e25994497eabe41 Author: Bogdan Raducanu Date: 2017-06-27T10:33:10Z Merge remote-tracking branch 'upstream/master' commit 1057abe6262353093ccf9b75ed24ed54fdfc0095 Author: Bogdan Raducanu Date: 2017-06-28T12:35:12Z Merge remote-tracking branch 'upstream/master' commit 3f7bf43fab830c7cc6473b654ca290b23a9886be Author: Bogdan Raducanu Date: 2017-07-07T11:08:05Z Merge remote-tracking branch 'upstream/master' commit 0a51b0f8640236da4054a25ca50bb8d19ba73b70 Author: Bogdan Raducanu Date: 2017-07-09T11:08:41Z Merge remote-tracking branch 'upstream/master' commit 3d11dca76380c7c53710141114aa768b1477b893 Author: Bogdan Raducanu Date: 2017-07-10T10:18:46Z Merge remote-tracking branch 'upstream/master' commit 015c84b6beb29c0b275b01f7774a2b7d8aa8d180 Author: Bogdan Raducanu Date: 2017-07-12T10:21:48Z Merge remote-tracking branch 'upstream/master' commit 597ddf2265149427f796eed1a43539b13e1516d9 Author: Bogdan Raducanu Date: 2017-08-03T12:45:36Z Merge remote-tracking branch 'upstream/master' commit 38549ba22681ba6622c4f1cd9c1b97592c5a34a5 Author: Bogdan Raducanu Date: 2017-08-04T09:17:09Z Merge remote-tracking branch 'upstream/master' commit 89b86c0d1e4616eae2902da79be308e0573ab5e3 Author: Bogdan Raducanu Date: 2017-09-10T19:23:22Z Merge remote-tracking branch 
'upstream/master' commit edd1fbf107501bc9a0bdbf4f712577b9fe1fd3f6 Author: Bogdan Raducanu Date: 2017-10-30T14:18:09Z Merge remote-tracking branch 'upstream/master' commit 6ead465cb00fc36869d152693a4cd1318fa005b9 Author: Bogdan Raducanu Date: 2017-12-21T15:17:45Z Merge remote-tracking branch 'upstream/master' commit d83a3adfba5d790270214887f615ab95ba50a2f9 Author: Bogdan Raducanu Date: 2018-01-10T11:53:31Z Merge remote-tracking branch 'upstream/master' commit 4b79fd683102137401ed4e77ee351439c0d254b5 Author: Bogdan Raducanu Date: 2018-01-11T04:01:12Z Merge remote-tracking branch 'upstream/master' commit ffa5debdd3b6a7e0bd70d01e640f048883b23440 Author: Bogdan Raducanu Date: 2018-01-18T12:44:12Z Merge remote-tracking branch 'upstream/master' commit aec35c98a4ea0db2cee52785e51130243cdc0b61 Author: Bogdan Raducanu Date: 2018-01-24T17:23:37Z Merge remote-tracking branch 'upstream/master' commit 87d8649777d7481beca7a04f195e04bab5b059a0 Author: Bogdan Raducanu Date: 2018-02-02T11:02:31Z Merge remote-tracking branch 'upstream/master' commit 4d6b42a3a7cbc7254f7fa159a7d04b898c8b65e6 Author: Bogdan Raducanu Date: 2018-02-08T14:57:57Z fix and test commit
[GitHub] spark issue #20544: [SPARK-23358][CORE]When the number of partitions is grea...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20544 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20544: [SPARK-23358][CORE]When the number of partitions is grea...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20544 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87213/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20544: [SPARK-23358][CORE]When the number of partitions is grea...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20544 **[Test build #87213 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87213/testReport)** for PR 20544 at commit [`9ad0bfd`](https://github.com/apache/spark/commit/9ad0bfdc00373a9732338e17ca2dfa05b0c28cfb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19431: [SPARK-18580] [DStreams] [external/kafka-0-10][ex...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19431#discussion_r166952412 --- Diff: external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala --- @@ -91,9 +91,16 @@ class DirectKafkaInputDStream[ private val maxRateLimitPerPartition: Long = context.sparkContext.getConf.getLong( "spark.streaming.kafka.maxRatePerPartition", 0) + private val initialRate = context.sparkContext.getConf.getLong( +"spark.streaming.backpressure.initialRate", 0) + protected[streaming] def maxMessagesPerPartition( offsets: Map[TopicAndPartition, Long]): Option[Map[TopicAndPartition, Long]] = { -val estimatedRateLimit = rateController.map(_.getLatestRate()) + +val estimatedRateLimit = rateController.map(x => { + val lr = x.getLatestRate() + if (lr > 0) lr else initialRate --- End diff -- Same applies here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
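The control flow under discussion is small enough to sketch. A hypothetical Python rendering of the fallback and of how it interacts with the per-partition cap (names invented; the real code is the Scala diff above, and the numbers mirror the tests quoted later in this thread):

```python
def estimated_rate_limit(latest_rate, initial_rate):
    # Fall back to spark.streaming.backpressure.initialRate while the rate
    # controller has no estimate yet (getLatestRate() is non-positive).
    return latest_rate if latest_rate > 0 else initial_rate


def max_messages_per_partition(rate, max_rate_per_partition, batch_seconds):
    # The backpressure rate is still capped by
    # spark.streaming.kafka.maxRatePerPartition when that is set (> 0).
    effective = min(rate, max_rate_per_partition) if max_rate_per_partition > 0 else rate
    return int(effective * batch_seconds)


# First batch, 500 ms interval: initialRate=500 under maxRatePerPartition=1000
# allows 250 messages; initialRate=1000 capped at 300 allows only 150.
assert max_messages_per_partition(estimated_rate_limit(0, 500), 1000, 0.5) == 250
assert max_messages_per_partition(estimated_rate_limit(0, 1000), 300, 0.5) == 150
```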
[GitHub] spark pull request #19431: [SPARK-18580] [DStreams] [external/kafka-0-10][ex...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19431#discussion_r166953420 --- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala --- @@ -551,6 +551,76 @@ class DirectKafkaStreamSuite Map(new TopicPartition(topic, 0) -> 5L, new TopicPartition(topic, 1) -> 10L)) } + test("use backpressure.initialRate with backpressure") { +val topic = "backpressureInitialRate" +val kafkaParams = getKafkaParams("auto.offset.reset" -> "earliest") +val sparkConf = new SparkConf() + // Safe, even with streaming, because we're using the direct API. + // Using 1 core is useful to make the test more predictable. + .setMaster("local[1]") + .setAppName(this.getClass.getSimpleName) + .set("spark.streaming.backpressure.enabled", "true") + .set("spark.streaming.kafka.maxRatePerPartition", "1000") + .set("spark.streaming.backpressure.initialRate", "500") + +val messages = Map("foo" -> 5000) +kafkaTestUtils.sendMessages(topic, messages) + +ssc = new StreamingContext(sparkConf, Milliseconds(500)) + +val kafkaStream = withClue("Error creating direct stream") { + new DirectKafkaInputDStream[String, String]( +ssc, +preferredHosts, +ConsumerStrategies.Subscribe[String, String](List(topic), kafkaParams.asScala), +new DefaultPerPartitionConfig(sparkConf) + ) +} +kafkaStream.start() + +val input = Map(new TopicPartition(topic, 0) -> 1000L) + + assert(kafkaStream.maxMessagesPerPartition(input).flatMap(_.headOption).contains( --- End diff -- Just caught this could be a bit simplified: > assert(kafkaStream.maxMessagesPerPartition(input).get == > Map(new TopicPartition(topic, 0) -> 250)) // we run for half a second --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19431: [SPARK-18580] [DStreams] [external/kafka-0-10][ex...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19431#discussion_r166953742 --- Diff: external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala --- @@ -387,6 +387,89 @@ class DirectKafkaStreamSuite Map(TopicAndPartition(topic, 0) -> 10L, TopicAndPartition(topic, 1) -> 10L)) } + test("use backpressure.initialRate with backpressure") { +val topic = "backpressureInitialRate" +val topicPartitions = Set(TopicAndPartition(topic, 0)) +kafkaTestUtils.createTopic(topic, 1) +val kafkaParams = Map( + "metadata.broker.list" -> kafkaTestUtils.brokerAddress, + "auto.offset.reset" -> "smallest" +) + +val sparkConf = new SparkConf() + // Safe, even with streaming, because we're using the direct API. + // Using 1 core is useful to make the test more predictable. + .setMaster("local[1]") + .setAppName(this.getClass.getSimpleName) + .set("spark.streaming.backpressure.enabled", "true") + .set("spark.streaming.backpressure.initialRate", "500") + +val messages = Map("foo" -> 5000) +kafkaTestUtils.sendMessages(topic, messages) + +ssc = new StreamingContext(sparkConf, Milliseconds(500)) + +val kafkaStream = withClue("Error creating direct stream") { + val kc = new KafkaCluster(kafkaParams) + val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message) + val m = kc.getEarliestLeaderOffsets(topicPartitions) +.fold(e => Map.empty[TopicAndPartition, Long], m => m.mapValues(lo => lo.offset)) + + new DirectKafkaInputDStream[String, String, StringDecoder, StringDecoder, (String, String)]( +ssc, kafkaParams, m, messageHandler) +} +kafkaStream.start() + +val input = Map(new TopicAndPartition(topic, 0) -> 1000L) + + assert(kafkaStream.maxMessagesPerPartition(input).flatMap(_.headOption).contains( + new TopicAndPartition(topic, 0) -> 250)) // we run for half a second + +kafkaStream.stop() + } + + test("backpressure.initialRate should honor maxRatePerPartition") { +val topic = "backpressureInitialRate2" +val topicPartitions = Set(TopicAndPartition(topic, 0)) +kafkaTestUtils.createTopic(topic, 1) +val kafkaParams = Map( + "metadata.broker.list" -> kafkaTestUtils.brokerAddress, + "auto.offset.reset" -> "smallest" +) + +val sparkConf = new SparkConf() + // Safe, even with streaming, because we're using the direct API. + // Using 1 core is useful to make the test more predictable. + .setMaster("local[1]") + .setAppName(this.getClass.getSimpleName) + .set("spark.streaming.backpressure.enabled", "true") + .set("spark.streaming.kafka.maxRatePerPartition", "300") + .set("spark.streaming.backpressure.initialRate", "1000") + +val messages = Map("foo" -> 5000) +kafkaTestUtils.sendMessages(topic, messages) + +ssc = new StreamingContext(sparkConf, Milliseconds(500)) + +val kafkaStream = withClue("Error creating direct stream") { + val kc = new KafkaCluster(kafkaParams) + val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message) + val m = kc.getEarliestLeaderOffsets(topicPartitions) +.fold(e => Map.empty[TopicAndPartition, Long], m => m.mapValues(lo => lo.offset)) + + new DirectKafkaInputDStream[String, String, StringDecoder, StringDecoder, (String, String)]( +ssc, kafkaParams, m, messageHandler) +} +kafkaStream.start() + +val input = Map(new TopicAndPartition(topic, 0) -> 1000L) + + assert(kafkaStream.maxMessagesPerPartition(input).flatMap(_.headOption).contains( --- End diff -- Same simplification here. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19431: [SPARK-18580] [DStreams] [external/kafka-0-10][ex...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19431#discussion_r166953894 --- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala --- @@ -551,6 +551,76 @@ class DirectKafkaStreamSuite Map(new TopicPartition(topic, 0) -> 5L, new TopicPartition(topic, 1) -> 10L)) } + test("use backpressure.initialRate with backpressure") { +val topic = "backpressureInitialRate" +val kafkaParams = getKafkaParams("auto.offset.reset" -> "earliest") +val sparkConf = new SparkConf() + // Safe, even with streaming, because we're using the direct API. + // Using 1 core is useful to make the test more predictable. + .setMaster("local[1]") + .setAppName(this.getClass.getSimpleName) + .set("spark.streaming.backpressure.enabled", "true") + .set("spark.streaming.kafka.maxRatePerPartition", "1000") + .set("spark.streaming.backpressure.initialRate", "500") + +val messages = Map("foo" -> 5000) +kafkaTestUtils.sendMessages(topic, messages) + +ssc = new StreamingContext(sparkConf, Milliseconds(500)) + +val kafkaStream = withClue("Error creating direct stream") { + new DirectKafkaInputDStream[String, String]( +ssc, +preferredHosts, +ConsumerStrategies.Subscribe[String, String](List(topic), kafkaParams.asScala), +new DefaultPerPartitionConfig(sparkConf) + ) +} +kafkaStream.start() + +val input = Map(new TopicPartition(topic, 0) -> 1000L) + + assert(kafkaStream.maxMessagesPerPartition(input).flatMap(_.headOption).contains( + new TopicPartition(topic, 0) -> 250)) // we run for half a second + +kafkaStream.stop() + } + + test("backpressure.initialRate should honor maxRatePerPartition") { +val topic = "backpressureInitialRate" +val kafkaParams = getKafkaParams("auto.offset.reset" -> "earliest") +val sparkConf = new SparkConf() + // Safe, even with streaming, because we're using the direct API. + // Using 1 core is useful to make the test more predictable. + .setMaster("local[1]") + .setAppName(this.getClass.getSimpleName) + .set("spark.streaming.backpressure.enabled", "true") + .set("spark.streaming.kafka.maxRatePerPartition", "300") + .set("spark.streaming.backpressure.initialRate", "1000") + +val messages = Map("foo" -> 5000) +kafkaTestUtils.sendMessages(topic, messages) + +ssc = new StreamingContext(sparkConf, Milliseconds(500)) + +val kafkaStream = withClue("Error creating direct stream") { + new DirectKafkaInputDStream[String, String]( +ssc, +preferredHosts, +ConsumerStrategies.Subscribe[String, String](List(topic), kafkaParams.asScala), +new DefaultPerPartitionConfig(sparkConf) + ) +} +kafkaStream.start() + +val input = Map(new TopicPartition(topic, 0) -> 1000L) + + assert(kafkaStream.maxMessagesPerPartition(input).flatMap(_.headOption).contains( --- End diff -- Same simplification here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19431: [SPARK-18580] [DStreams] [external/kafka-0-10][ex...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19431#discussion_r166953800 --- Diff: external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala --- @@ -387,6 +387,89 @@ class DirectKafkaStreamSuite Map(TopicAndPartition(topic, 0) -> 10L, TopicAndPartition(topic, 1) -> 10L)) } + test("use backpressure.initialRate with backpressure") { +val topic = "backpressureInitialRate" +val topicPartitions = Set(TopicAndPartition(topic, 0)) +kafkaTestUtils.createTopic(topic, 1) +val kafkaParams = Map( + "metadata.broker.list" -> kafkaTestUtils.brokerAddress, + "auto.offset.reset" -> "smallest" +) + +val sparkConf = new SparkConf() + // Safe, even with streaming, because we're using the direct API. + // Using 1 core is useful to make the test more predictable. + .setMaster("local[1]") + .setAppName(this.getClass.getSimpleName) + .set("spark.streaming.backpressure.enabled", "true") + .set("spark.streaming.backpressure.initialRate", "500") + +val messages = Map("foo" -> 5000) +kafkaTestUtils.sendMessages(topic, messages) + +ssc = new StreamingContext(sparkConf, Milliseconds(500)) + +val kafkaStream = withClue("Error creating direct stream") { + val kc = new KafkaCluster(kafkaParams) + val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message) + val m = kc.getEarliestLeaderOffsets(topicPartitions) +.fold(e => Map.empty[TopicAndPartition, Long], m => m.mapValues(lo => lo.offset)) + + new DirectKafkaInputDStream[String, String, StringDecoder, StringDecoder, (String, String)]( +ssc, kafkaParams, m, messageHandler) +} +kafkaStream.start() + +val input = Map(new TopicAndPartition(topic, 0) -> 1000L) + + assert(kafkaStream.maxMessagesPerPartition(input).flatMap(_.headOption).contains( --- End diff -- Same simplification here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19431: [SPARK-18580] [DStreams] [external/kafka-0-10][ex...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19431#discussion_r166952263 --- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala --- @@ -126,7 +129,10 @@ private[spark] class DirectKafkaInputDStream[K, V]( protected[streaming] def maxMessagesPerPartition( offsets: Map[TopicPartition, Long]): Option[Map[TopicPartition, Long]] = { -val estimatedRateLimit = rateController.map(_.getLatestRate()) +val estimatedRateLimit = rateController.map(x => { + val lr = x.getLatestRate() + if (lr > 0) lr else initialRate --- End diff -- Thanks for the info. My concern was the `LatestRate = 0` case, where the limit can be lost. In the meantime I have taken a look at the `PIDRateEstimator`, which could not produce a 0 rate because of this: ` val newRate = (latestRate - proportional * error - integral * historicalError - derivative * dError).max(minRate) ` and minRate is limited: ` require( minRate > 0, s"Minimum rate in PIDRateEstimator should be > 0") ` I'm fine with this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
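The two quoted snippets combine into one short invariant. A hypothetical Python sketch (the real code is `PIDRateEstimator` in Spark Streaming):

```python
def compute_new_rate(latest_rate, error, historical_error, d_error,
                     proportional, integral, derivative, min_rate):
    # Mirrors the quoted Scala: the .max(minRate) floor, together with the
    # constructor requirement minRate > 0, means the estimator can never
    # publish a rate of 0 -- so the `if (lr > 0)` fallback above only fires
    # before the first estimate exists.
    assert min_rate > 0, "Minimum rate in PIDRateEstimator should be > 0"
    new_rate = (latest_rate
                - proportional * error
                - integral * historical_error
                - derivative * d_error)
    return max(new_rate, min_rate)
```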
[GitHub] spark issue #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure param to K...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20520 cc @BryanCutler @holdenk @jkbradley @srowen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20546: [SPARK-20659][Core] Remove StorageStatus, or make it pri...
Github user attilapiros commented on the issue: https://github.com/apache/spark/pull/20546 If this change goes into the 2.3 branch, then MimaExcludes.scala should be changed accordingly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20503: [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for R...
Github user ashashwat commented on the issue: https://github.com/apache/spark/pull/20503 @HyukjinKwon Should I add more tests covering Unicode? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20546: [SPARK-20659][Core] Remove StorageStatus, or make it pri...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20546 **[Test build #87216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87216/testReport)** for PR 20546 at commit [`3f5dba6`](https://github.com/apache/spark/commit/3f5dba62cfa8b07d602639e9f88f3fd8cec92831). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20546: [SPARK-20659][Core] Remove StorageStatus, or make it pri...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20546 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20546: [SPARK-20659][Core] Remove StorageStatus, or make...
GitHub user attilapiros opened a pull request: https://github.com/apache/spark/pull/20546 [SPARK-20659][Core] Remove StorageStatus, or make it private. ## What changes were proposed in this pull request? In this PR, StorageStatus is made private and simplified a bit; moreover, the SparkContext.getExecutorStorageStatus method is removed. The reason for keeping StorageStatus is that it is used by SparkContext.getRDDStorageInfo. Instead of the SparkContext.getExecutorStorageStatus method, executor infos are extended with additional memory metrics such as usedOnHeapStorageMemory, usedOffHeapStorageMemory, totalOnHeapStorageMemory, totalOffHeapStorageMemory. ## How was this patch tested? By running existing unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/attilapiros/spark SPARK-20659 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20546.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20546 commit 3f5dba62cfa8b07d602639e9f88f3fd8cec92831 Author: “attilapiros” Date: 2018-02-07T13:41:31Z initial version --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20531 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87214/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20531 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20531 **[Test build #87214 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87214/testReport)** for PR 20531 at commit [`36617e4`](https://github.com/apache/spark/commit/36617e4bd864e0fbca5c617d009de45a8231a5d6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19077 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19077 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87212/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19077 **[Test build #87212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87212/testReport)** for PR 19077 at commit [`78ede3f`](https://github.com/apache/spark/commit/78ede3fae243c5379eaea1c86584e200ce697c19). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org