[GitHub] spark pull request #20525: [SPARK-23271[SQL] Parquet output contains only _S...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20525#discussion_r167159659 --- Diff: docs/sql-programming-guide.md --- @@ -1930,6 +1930,9 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Literal values used in SQL operations are converted to DECIMAL with the exact precision and scale needed by them. - The configuration `spark.sql.decimalOperations.allowPrecisionLoss` has been introduced. It defaults to `true`, which means the new behavior described here; if set to `false`, Spark uses previous rules, ie. it doesn't adjust the needed scale to represent the values and it returns NULL if an exact representation of the value is not possible. + - Since Spark 2.3, writing an empty dataframe (a dataframe with 0 partitions) in parquet or orc format, creates a format specific metadata only file. In prior versions the metadata only file was not created. As a result, subsequent attempt to read from this directory fails with AnalysisException while inferring schema of the file. For example : df.write.format("parquet").save("outDir") --- End diff -- yea the above 2 changes are good! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20516: [SPARK-23343][CORE][TEST] Increase the exception ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20516#discussion_r167159549 --- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala --- @@ -59,6 +59,7 @@ abstract class SparkFunSuite protected val enableAutoThreadAudit = true protected override def beforeAll(): Unit = { +System.setProperty("spark.testing", "true") --- End diff -- if we are already doing this, let's make it more explicit that we should remove `./project/SparkBuild.scala:795: javaOptions in Test += "-Dspark.testing=1"` and set `spark.testing` in `SparkFunSuite.beforeAll`.
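A sketch of what the suggested change could look like in `SparkFunSuite` (illustrative only, assuming the suite's existing `beforeAll`/`afterAll` structure; not the PR's final code):

```scala
import org.scalatest.{BeforeAndAfterAll, FunSuite}

abstract class SparkFunSuite extends FunSuite with BeforeAndAfterAll {

  protected override def beforeAll(): Unit = {
    // Set spark.testing here instead of relying on the SBT-only
    // `javaOptions in Test += "-Dspark.testing=1"` in SparkBuild.scala,
    // so the flag is also present when tests are launched from an IDE.
    System.setProperty("spark.testing", "true")
    super.beforeAll()
  }

  protected override def afterAll(): Unit = {
    try {
      // Avoid leaking the property into anything else running in this JVM.
      System.clearProperty("spark.testing")
    } finally {
      super.afterAll()
    }
  }
}
```

Setting the property in `beforeAll` keeps the behavior identical between SBT, Maven, and IDE test runs, which is the inconsistency discussed in this thread.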
[GitHub] spark pull request #20525: [SPARK-23271[SQL] Parquet output contains only _S...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/20525#discussion_r167159534 --- Diff: docs/sql-programming-guide.md --- @@ -1930,6 +1930,9 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Literal values used in SQL operations are converted to DECIMAL with the exact precision and scale needed by them. - The configuration `spark.sql.decimalOperations.allowPrecisionLoss` has been introduced. It defaults to `true`, which means the new behavior described here; if set to `false`, Spark uses previous rules, ie. it doesn't adjust the needed scale to represent the values and it returns NULL if an exact representation of the value is not possible. + - Since Spark 2.3, writing an empty dataframe (a dataframe with 0 partitions) in parquet or orc format, creates a format specific metadata only file. In prior versions the metadata only file was not created. As a result, subsequent attempt to read from this directory fails with AnalysisException while inferring schema of the file. For example : df.write.format("parquet").save("outDir") --- End diff -- even -> even if ? self-described -> self-describing ? @cloud-fan Nicely written. Thanks. Let me know if you are OK with the above two changes?
[GitHub] spark pull request #20449: [SPARK-23040][CORE]: Returns interruptible iterat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20449#discussion_r167158719 --- Diff: core/src/main/scala/org/apache/spark/shuffle/BlockStoreShuffleReader.scala --- @@ -104,9 +104,16 @@ private[spark] class BlockStoreShuffleReader[K, C]( context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled) context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled) context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes) +// Use completion callback to stop sorter if task was cancelled. --- End diff -- `if task is completed(either finished or canceled)` ---
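For context, the comment being reworded describes a pattern like the following (a sketch; `context` and `sorter` are the surrounding `BlockStoreShuffleReader` values, not new names):

```scala
// Use completion callback to stop sorter if task is completed
// (either finished or canceled). This releases the sorter's memory
// and deletes its spill files even when the returned iterator is
// abandoned partway through.
context.addTaskCompletionListener { _ =>
  sorter.stop()
}
```

Registering the callback on task completion, rather than only on cancellation, is what the suggested wording captures.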
[GitHub] spark issue #20525: [SPARK-23271[SQL] Parquet output contains only _SUCCESS ...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20525 @tdas @brkyvz Do we still need the fix for 0-partition DataFrame in Structured Streaming after this change? ---
[GitHub] spark pull request #20525: [SPARK-23271[SQL] Parquet output contains only _S...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20525#discussion_r167158557 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala --- @@ -301,7 +301,6 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be intercept[AnalysisException] { spark.range(10).write.format("csv").mode("overwrite").partitionBy("id").save(path) } - spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path) --- End diff -- How does it fail? If it's a runtime error, we should fail earlier, during analysis. This is worth a new JIRA.
[GitHub] spark pull request #20516: [SPARK-23343][CORE][TEST] Increase the exception ...
Github user heary-cao commented on a diff in the pull request: https://github.com/apache/spark/pull/20516#discussion_r167158512 --- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala --- @@ -59,6 +59,7 @@ abstract class SparkFunSuite protected val enableAutoThreadAudit = true protected override def beforeAll(): Unit = { +System.setProperty("spark.testing", "true") --- End diff -- My debugging tool is IDEA; I think the IDE is not relevant to how the property is set. This is similar to HiveSparkSubmitSuite, RPackageUtilsSuite, and SparkSubmitSuite, which also manually add System.setProperty("spark.testing", "true"). Of course, when I run the test case with Maven (using the command line), it passes. Thanks.
[GitHub] spark pull request #20525: [SPARK-23271[SQL] Parquet output contains only _S...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20525#discussion_r167158389 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.datasources import org.apache.spark.sql.{QueryTest, Row} import org.apache.spark.sql.test.SharedSQLContext +import org.apache.spark.sql.types.{StringType, StructField, StructType} --- End diff -- please remove it ---
[GitHub] spark pull request #20525: [SPARK-23271[SQL] Parquet output contains only _S...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20525#discussion_r167158260 --- Diff: docs/sql-programming-guide.md --- @@ -1930,6 +1930,9 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Literal values used in SQL operations are converted to DECIMAL with the exact precision and scale needed by them. - The configuration `spark.sql.decimalOperations.allowPrecisionLoss` has been introduced. It defaults to `true`, which means the new behavior described here; if set to `false`, Spark uses previous rules, ie. it doesn't adjust the needed scale to represent the values and it returns NULL if an exact representation of the value is not possible. + - Since Spark 2.3, writing an empty dataframe (a dataframe with 0 partitions) in parquet or orc format, creates a format specific metadata only file. In prior versions the metadata only file was not created. As a result, subsequent attempt to read from this directory fails with AnalysisException while inferring schema of the file. For example : df.write.format("parquet").save("outDir") --- End diff -- `Since Spark 2.3, writing an empty dataframe to a directory launches at least one write task, even physically the dataframe has no partition. This introduces a small behavior change that for self-described file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.` ---
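The behavior change being documented can be reproduced with a snippet along these lines (a sketch; `outDir` is a placeholder path and `spark` is an active `SparkSession`):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(StructField("id", IntegerType) :: Nil)
// An emptyRDD has 0 partitions, so this dataframe has no partitions at all.
val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

df.write.format("parquet").save("outDir")

// Since Spark 2.3 the write above produces a metadata-only Parquet file,
// so this read can still infer the schema; in prior versions it threw
// AnalysisException because the directory contained only _SUCCESS.
spark.read.parquet("outDir").printSchema()
```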
[GitHub] spark pull request #20516: [SPARK-23343][CORE][TEST] Increase the exception ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20516#discussion_r167156814 --- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala --- @@ -59,6 +59,7 @@ abstract class SparkFunSuite protected val enableAutoThreadAudit = true protected override def beforeAll(): Unit = { +System.setProperty("spark.testing", "true") --- End diff -- Sorry, let me make the question clearer. Why do we need this if `./project/SparkBuild.scala:795: javaOptions in Test += "-Dspark.testing=1"` works?
[GitHub] spark issue #20555: [SPARK-23366] Improve hot reading path in ReadAheadInput...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20555 Merged build finished. Test PASSed. ---
[GitHub] spark issue #20555: [SPARK-23366] Improve hot reading path in ReadAheadInput...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20555 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87243/ Test PASSed. ---
[GitHub] spark issue #20516: [SPARK-23343][CORE][TEST] Increase the exception test fo...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/20516 When I run the test case with Maven (using the command line), it passes. Thanks. Then, should we add System.setProperty("spark.testing", "true") in SparkFunSuite to solve the IDE test tool problem? This is similar to HiveSparkSubmitSuite, RPackageUtilsSuite, and SparkSubmitSuite. Thanks.
[GitHub] spark issue #20555: [SPARK-23366] Improve hot reading path in ReadAheadInput...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20555 **[Test build #87243 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87243/testReport)** for PR 20555 at commit [`b26ffce`](https://github.com/apache/spark/commit/b26ffce6780078dbc38bff658e1ef7e9c56c3dd8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. ---
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87247/ Test FAILed. ---
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Merged build finished. Test FAILed. ---
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20477 **[Test build #87247 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87247/testReport)** for PR 20477 at commit [`0cc0600`](https://github.com/apache/spark/commit/0cc0600b8f6f3a46189ae38850835f34b57bd945). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. ---
[GitHub] spark issue #20490: [SPARK-23323][SQL]: Support commit coordinator for DataS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20490 Merged build finished. Test PASSed. ---
[GitHub] spark issue #20490: [SPARK-23323][SQL]: Support commit coordinator for DataS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20490 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87244/ Test PASSed. ---
[GitHub] spark issue #20490: [SPARK-23323][SQL]: Support commit coordinator for DataS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20490 **[Test build #87244 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87244/testReport)** for PR 20490 at commit [`e9964ca`](https://github.com/apache/spark/commit/e9964ca2fc831819662056210db594f613bce5d0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. ---
[GitHub] spark pull request #20501: [SPARK-22430][Docs] Unknown tag warnings when bui...
Github user rekhajoshm closed the pull request at: https://github.com/apache/spark/pull/20501 ---
[GitHub] spark issue #20501: [SPARK-22430][Docs] Unknown tag warnings when building R...
Github user rekhajoshm commented on the issue: https://github.com/apache/spark/pull/20501 Ack. Thanks for the update @felixcheung @srowen. Closing this.
[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20499 Yup, I should fix the guide for 2.2 anyway :-) Will open a backport tonight KST. ---
[GitHub] spark issue #20525: [SPARK-23271[SQL] Parquet output contains only _SUCCESS ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20525 **[Test build #87250 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87250/testReport)** for PR 20525 at commit [`30e5aa5`](https://github.com/apache/spark/commit/30e5aa50a5bb01f18eab134a206d72a73e501baf). ---
[GitHub] spark issue #20525: [SPARK-23271[SQL] Parquet output contains only _SUCCESS ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20525 Merged build finished. Test PASSed. ---
[GitHub] spark issue #20525: [SPARK-23271[SQL] Parquet output contains only _SUCCESS ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20525 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/743/ Test PASSed. ---
[GitHub] spark issue #20537: [SPARK-23314][PYTHON] Add ambiguous=False when localizin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20537 LGTM too ---
[GitHub] spark pull request #20551: [SPARK-23271][DOC] Document the empty dataframe w...
Github user dilipbiswal closed the pull request at: https://github.com/apache/spark/pull/20551 ---
[GitHub] spark issue #20449: [SPARK-23040][CORE]: Returns interruptible iterator for ...
Github user advancedxy commented on the issue: https://github.com/apache/spark/pull/20449 @jerryshao @cloud-fan I have updated my code. Do you have any other concerns? ---
[GitHub] spark issue #20525: [SPARK-23271[SQL] Parquet output contains only _SUCCESS ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/20525 @cloud-fan @gatorsmile Done. ---
[GitHub] spark pull request #20378: [SPARK-11222][Build][Python] Python document styl...
Github user rekhajoshm closed the pull request at: https://github.com/apache/spark/pull/20378 ---
[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...
Github user rekhajoshm commented on the issue: https://github.com/apache/spark/pull/20378 @HyukjinKwon @holdenk @ueshin @viirya @icexelloss @felixcheung @BryanCutler and @MrBago - This was one of the possible approaches that I was running by you. I have proposed another approach at #20556 with features as below: - Use a sphinx-like check; run only if pydocstyle is installed on the machine/Jenkins - Use pydocstyle rather than the single-file pep257.py - Verify that the latest pydocstyle 2.1.1 is in use, to ensure the latest doc checks are executed - Support ignore (inclusion/exclusion) features via tox.ini - Be a non-breaking change and allow updating docstyle to the standard at an easy pace. Closing this. Thanks!
[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...
Github user rekhajoshm commented on the issue: https://github.com/apache/spark/pull/20378 @HyukjinKwon Identifying docstyle failures does not help much, as it is not straightforward to exclude them in this version.
[GitHub] spark issue #20525: [SPARK-23271[SQL] Parquet output contains only _SUCCESS ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20525 No, we can't merge 2 PRs together. Please pick one of your PRs and put all the changes there, thanks!
[GitHub] spark issue #20556: [SPARK-23367][Build] Include python document style check...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20556 **[Test build #87249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87249/testReport)** for PR 20556 at commit [`ee14cf7`](https://github.com/apache/spark/commit/ee14cf708603bd904505a110c0ca5d3607d5cdb8). ---
[GitHub] spark issue #20556: [SPARK-23367][Build] Include python document style check...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20556 Merged build finished. Test PASSed. ---
[GitHub] spark issue #20556: [SPARK-23367][Build] Include python document style check...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20556 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/742/ Test PASSed. ---
[GitHub] spark issue #20525: [SPARK-23271[SQL] Parquet output contains only _SUCCESS ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/20525 @cloud-fan Actually, I had already created the doc PR in the morning using the same JIRA number. Wenchen, if we want to have both changes in the same commit, will we be able to do that when we merge the patch? If not, please let me know, and I will close that PR and move the change over to this branch.
[GitHub] spark pull request #20378: [SPARK-11222][Build][Python] Python document styl...
Github user rekhajoshm commented on a diff in the pull request: https://github.com/apache/spark/pull/20378#discussion_r167148657 --- Diff: dev/lint-python --- @@ -83,6 +84,53 @@ else rm "$PEP8_REPORT_PATH" fi + Python Document Style Checks + +# Get PYDOCSTYLE at runtime so that we don't rely on it being installed on the build server. +# Using pep257.py which is the single file version of pydocstyle. +PYDOCSTYLE_VERSION="0.2.1" --- End diff -- As called out earlier, this was the single-file Python doc style checker; the latest version does not have a single-file checker that can be included.
[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20499 ---
[GitHub] spark issue #20499: [SPARK-23328][PYTHON] Disallow default value None in na....
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20499 Thanks, merging to master/2.3! Can you send a new PR for 2.2? It conflicts...
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r167147910 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,17 +17,130 @@ package org.apache.spark.sql.execution.datasources.v2 +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation -import org.apache.spark.sql.catalyst.expressions.AttributeReference -import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} -import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport} +import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics} +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -output: Seq[AttributeReference], -reader: DataSourceReader) - extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +projection: Option[Seq[AttributeReference]] = None, +filters: Option[Seq[Expression]] = None, +userSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation { + + override def simpleString: String = { +s"DataSourceV2Relation(source=$sourceName, " + + s"schema=[${output.map(a => s"$a ${a.dataType.simpleString}").mkString(", ")}], " + + s"filters=[${pushedFilters.mkString(", ")}], options=$options)" + } + + override lazy val schema: StructType = reader.readSchema() + + override lazy val output: Seq[AttributeReference] = { --- End diff -- I pulled your code and played with it. So your PR does fix the bug, but in a hacky way. Let me explain what happened. 1. `QueryPlan.canonicalized` is called, and every expression in `DataSourceV2Relation` is canonicalized, including `DataSourceV2Relation.projection`. This means the attributes in `projection` are all renamed to "none". 2. `DataSourceV2Relation.output` is called, which triggers the creation of the reader and applies filter push-down and column pruning. Note that because all attributes are renamed to "none", we are actually pushing invalid filters and columns to data sources. 3. `reader.schema` and `projection` are lined up to get the actual output. Because all names are "none", it works. However, step 2 is pretty dangerous: Spark doesn't define the behavior of pushing invalid filters and columns, especially what `reader.schema` should return after invalid columns are pushed down. I prefer my original fix, which puts `output` in `DataSourceV2Relation`'s constructor parameters and updates it when doing column pruning in `PushDownOperatorsToDataSource`.
[GitHub] spark issue #20557: [SPARK-23364][SQL]'desc table' command in spark-sql add ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20557 Can one of the admins verify this patch? ---
[GitHub] spark pull request #20557: [SPARK-23364][SQL]'desc table' command in spark-s...
GitHub user guoxiaolongzte opened a pull request: https://github.com/apache/spark/pull/20557 [SPARK-23364][SQL]'desc table' command in spark-sql add column head display ## What changes were proposed in this pull request? Using the 'desc partition_table' command in the spark-sql client, I think it should add a column header display. Add 'col_name', 'data_type', 'comment' column headers. fix before: ![2](https://user-images.githubusercontent.com/26266482/36013945-283fea8c-0da2-11e8-8265-63d816dabd9b.png) fix after: ![1](https://user-images.githubusercontent.com/26266482/36013954-3252fd7a-0da2-11e8-8e63-3b586f238072.png) ## How was this patch tested? manual tests Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/guoxiaolongzte/spark SPARK-23364 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20557.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20557 commit 5699c0dc2810a4500f0ee34414b77b80afd0e9c1 Author: guoxiaolong Date: 2018-02-09T06:00:40Z [SPARK-23364][SQL]'desc table' command in spark-sql add column head display
[GitHub] spark issue #20359: [SPARK-23186][SQL] Initialize DriverManager first before...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20359 Thank you for merging, @cloud-fan. And thank you again, @HyukjinKwon, @gatorsmile, and @srowen! ---
[GitHub] spark issue #20556: [SPARK-23367][Build] Include python document style check...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20556 Merged build finished. Test FAILed. ---
[GitHub] spark issue #20556: [SPARK-23367][Build] Include python document style check...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20556 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87248/ Test FAILed. ---
[GitHub] spark issue #20556: [SPARK-23367][Build] Include python document style check...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20556 **[Test build #87248 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87248/testReport)** for PR 20556 at commit [`85ca69d`](https://github.com/apache/spark/commit/85ca69de956cd3255eee5c51e830b9aa8f451308). * This patch **fails RAT tests**. * This patch merges cleanly. * This patch adds no public classes. ---
[GitHub] spark issue #20556: [SPARK-23367][Build] Include python document style check...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20556 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/741/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20556: [SPARK-23367][Build] Include python document style check...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20556 **[Test build #87248 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87248/testReport)** for PR 20556 at commit [`85ca69d`](https://github.com/apache/spark/commit/85ca69de956cd3255eee5c51e830b9aa8f451308). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20556: [SPARK-23367][Build] Include python document style check...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20556 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20532: [SPARK-23353][CORE] Allow ExecutorMetricsUpdate events t...
Github user squito commented on the issue: https://github.com/apache/spark/pull/20532 I can see why you want this sometimes, but I'm trying to figure out if it's really valuable for users in general. You could always add a custom listener to log this info. It would go into a separate file, not the standard event log file, which means you'd have a little more work to do to stitch them together. OTOH that could be a good thing, as it means the history server wouldn't have to parse those extra lines.
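The "custom listener" route suggested above boils down to: subscribe to executor metrics updates and append them to a side log, keeping the standard event log clean. A self-contained mock of that idea — the event and listener types below stand in for Spark's `SparkListener` API (in `org.apache.spark.scheduler`) and are not the real interfaces:

```scala
import scala.collection.mutable.ArrayBuffer

// Mock stand-in for SparkListenerExecutorMetricsUpdate; the real event
// carries task-level accumulator updates per executor.
case class ExecutorMetricsUpdate(execId: String, jvmHeapUsed: Long)

// Mock stand-in for a SparkListener subclass. In a real listener this
// buffer would be a writer on a separate file, so the history server
// never has to parse these high-frequency lines.
class MetricsFileListener {
  val lines = ArrayBuffer.empty[String]
  def onExecutorMetricsUpdate(e: ExecutorMetricsUpdate): Unit =
    lines += s"${e.execId},${e.jvmHeapUsed}"
}

val listener = new MetricsFileListener
listener.onExecutorMetricsUpdate(ExecutorMetricsUpdate("exec-1", 512L))
```

Stitching the side log back together with the event log is the extra work the comment mentions, but it keeps the event log parser fast.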
[GitHub] spark pull request #20556: [SPARK-23367][Build] Include python document styl...
GitHub user rekhajoshm opened a pull request: https://github.com/apache/spark/pull/20556 [SPARK-23367][Build] Include python document style checking ## What changes were proposed in this pull request? Include python document style checking. This PR includes the pydocstyle checking if pydocstyle is installed, similar to the sphinx checking. It takes care of exclusion/inclusion of explicit document error codes via tox.ini. Currently all error codes are ignored so this is a non-breaking change. ## How was this patch tested? ./dev/run-tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/rekhajoshm/spark SPARK-23367 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20556.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20556 commit e3677c9fa9697e0d34f9df52442085a6a481c9e9 Author: Rekha Joshi Date: 2015-05-05T23:10:08Z Merge pull request #1 from apache/master Pulling functionality from apache spark commit 106fd8eee8f6a6f7c67cfc64f57c1161f76d8f75 Author: Rekha Joshi Date: 2015-05-08T21:49:09Z Merge pull request #2 from apache/master pull latest from apache spark commit 0be142d6becba7c09c6eba0b8ea1efe83d649e8c Author: Rekha Joshi Date: 2015-06-22T00:08:08Z Merge pull request #3 from apache/master Pulling functionality from apache spark commit 6c6ee12fd733e3f9902e10faf92ccb78211245e3 Author: Rekha Joshi Date: 2015-09-17T01:03:09Z Merge pull request #4 from apache/master Pulling functionality from apache spark commit b123c601e459d1ad17511fd91dd304032154882a Author: Rekha Joshi Date: 2015-11-25T18:50:32Z Merge pull request #5 from apache/master pull request from apache/master commit c73c32aadd6066e631956923725a48d98a18777e Author: Rekha Joshi Date: 2016-03-18T19:13:51Z Merge pull request #6 from apache/master pull latest from apache spark commit 7dbf7320057978526635bed09dabc8cf8657a28a
Author: Rekha Joshi Date: 2016-04-05T20:26:40Z Merge pull request #8 from apache/master pull latest from apache spark commit 5e9d71827f8e2e4d07027281b80e4e073e7fecd1 Author: Rekha Joshi Date: 2017-05-01T23:00:30Z Merge pull request #9 from apache/master Pull apache spark commit 63d99b3ce5f222d7126133170a373591f0ac67dd Author: Rekha Joshi Date: 2017-09-30T22:26:44Z Merge pull request #10 from apache/master pull latest apache spark commit a7fc787466b71784ff86f9694f617db0f1042da8 Author: Rekha Joshi Date: 2018-01-21T00:17:58Z Merge pull request #11 from apache/master Apache spark pull latest commit 3a2d45377ed4397de802badd764bc2588cfd275b Author: Rekha Joshi Date: 2018-02-09T04:55:12Z Merge pull request #12 from apache/master Apache spark latest pull commit 85ca69de956cd3255eee5c51e830b9aa8f451308 Author: rjoshi2 Date: 2018-02-09T05:54:03Z [SPARK-23367][Build] Include python document style checking
[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...
Github user ivoson commented on a diff in the pull request: https://github.com/apache/spark/pull/20244#discussion_r167145734 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2399,6 +2424,121 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi } } + /** + * In this test, we simulate a scenario in which concurrent jobs use the same + * rdd, which is marked for checkpointing: + * Job one has already finished the spark job, and starts the process of doCheckpoint; + * Job two is submitted, and submitMissingTasks is called. + * In submitMissingTasks, if taskSerialization is called before doCheckpoint is done, + * while the part calculated from stage.rdd.partitions is called after doCheckpoint is done, + * we may get a ClassCastException when executing the task, because some rdds will do + * a Partition cast. + * + * With this test case, we just want to indicate that we should do taskSerialization and + * the part calculation in submitMissingTasks with the same rdd checkpoint status. + */ + test("SPARK-23053: avoid ClassCastException in concurrent execution with checkpoint") { --- End diff -- hi @squito , it's fine. The pr and jira have been updated. Thanks for your patience and review.
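The race described in that test comment is a check-then-act problem: the partitions and the serialized task binary must both be taken from the same checkpoint state. A self-contained toy model of the fix — all types here are invented for illustration; the real logic lives in `DAGScheduler.submitMissingTasks`:

```scala
// Toy model: an RDD whose partition representation flips once checkpointing
// completes, as happens when an RDD's data is replaced by its checkpoint.
class ToyRdd {
  @volatile private var checkpointed = false
  def doCheckpoint(): Unit = { checkpointed = true }
  // The string stands in for the concrete Partition subclass of each partition.
  def partitions: Seq[String] =
    if (checkpointed) Seq("CheckpointRDDPartition") else Seq("ParallelCollectionPartition")
}

// The fix: read the checkpoint-dependent state under one lock so task
// serialization and `stage.rdd.partitions` always agree; interleaving a
// concurrent doCheckpoint between the two reads is what caused the
// ClassCastException.
def submitMissingTasks(rdd: ToyRdd): (Seq[String], Seq[String]) = rdd.synchronized {
  val parts = rdd.partitions      // partitions used to build the tasks
  val captured = rdd.partitions   // stands in for partitions captured in taskBinary
  (parts, captured)
}

val rdd = new ToyRdd
val pair = submitMissingTasks(rdd)
```

The invariant the real test asserts is the same: both views of the partitions come from a single checkpoint status.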
[GitHub] spark issue #20554: [SPARK-23362][SS] Migrate Kafka Microbatch source to v2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20554 Merged build finished. Test PASSed.
[GitHub] spark issue #20554: [SPARK-23362][SS] Migrate Kafka Microbatch source to v2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20554 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87239/ Test PASSed.
[GitHub] spark issue #20554: [SPARK-23362][SS] Migrate Kafka Microbatch source to v2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20554 **[Test build #87239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87239/testReport)** for PR 20554 at commit [`05c9d20`](https://github.com/apache/spark/commit/05c9d20da4361d631d8839bd4a45e4966964afa0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20554: [SPARK-23362][SS] Migrate Kafka Microbatch source to v2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20554 Build finished. Test PASSed.
[GitHub] spark issue #20554: [SPARK-23362][SS] Migrate Kafka Microbatch source to v2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20554 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87238/ Test PASSed.
[GitHub] spark issue #20554: [SPARK-23362][SS] Migrate Kafka Microbatch source to v2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20554 **[Test build #87238 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87238/testReport)** for PR 20554 at commit [`3ed2a50`](https://github.com/apache/spark/commit/3ed2a509276194214875f39e1e18d8093155c54c). * This patch passes all tests. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20477 **[Test build #87247 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87247/testReport)** for PR 20477 at commit [`0cc0600`](https://github.com/apache/spark/commit/0cc0600b8f6f3a46189ae38850835f34b57bd945).
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/740/ Test PASSed.
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Merged build finished. Test PASSed.
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20477 retest this please
[GitHub] spark issue #20516: [SPARK-23343][CORE][TEST] Increase the exception test fo...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20516 Can you try with SBT (using the command line)? Usually we don't trust the test results from an IDE.
[GitHub] spark issue #20525: [SPARK-23271[SQL] Parquet output contains only _SUCCESS ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20525 I think it's better to have the doc change in the same PR; then it's clearer which patch caused the behavior change.
[GitHub] spark issue #20516: [SPARK-23343][CORE][TEST] Increase the exception test fo...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/20516 Sure. Operating environment: the IDEA test runner. Test case: test("can bind to a specific port"). Test code: val maxRetries = portMaxRetries(conf); println("maxRetries:" + maxRetries). Run result: maxRetries: 16. If and only if we add System.setProperty("spark.testing", "true") in SparkFunSuite, the run result is maxRetries: 100. Thanks.
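The 16-vs-100 results above reflect Spark's retry policy: many port retries under tests (so parallel suites don't collide on ports), a small default otherwise. A simplified, self-contained model of that logic — the real implementation is `Utils.portMaxRetries`, and the `Map`-based signature here is invented for the sketch:

```scala
// Simplified model of Spark's port-retry policy: when `spark.testing` is
// set, use a large retry count (100) so concurrent test suites can find
// free ports; otherwise use the configured value or the default of 16.
def portMaxRetries(conf: Map[String, String]): Int = {
  val explicit = conf.get("spark.port.maxRetries").map(_.toInt)
  if (conf.contains("spark.testing")) explicit.getOrElse(100)
  else explicit.getOrElse(16)
}

val defaultRetries = portMaxRetries(Map.empty)
val testingRetries = portMaxRetries(Map("spark.testing" -> "true"))
```

This is why the test only sees 100 retries once `spark.testing` is set as a system property — exactly the observation in the comment above.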
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20387 > We've added a resolution rule from UnresolvedRelation to DataSourceV2Relation that uses our implementation. UnresolvedRelation needs to pass its TableIdentifier to the v2 relation, which is why I added this. I've been thinking about this a little more. This is actually an existing problem for file-based data sources. The solution is, when converting an unresolved relation to a data source relation, to add some new options to the existing data source options before passing them to the data source relation. See `FindDataSourceTable.readDataSourceTable` for how we handle the path option.
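The approach described here — fold extra entries into the options map before constructing the relation — is plain map manipulation. A hedged sketch; the helper name and the `database`/`table` keys are illustrative, not necessarily the keys Spark actually uses (file sources use a `path` option, as the referenced `FindDataSourceTable.readDataSourceTable` does):

```scala
// Hypothetical: merge catalog identity into user-supplied data source
// options before building the relation, mirroring how the path option is
// injected for file-based sources during resolution.
def withTableOptions(
    userOptions: Map[String, String],
    database: String,
    table: String): Map[String, String] = {
  // Identity keys are appended last, so they override any user-supplied
  // duplicates -- the catalog is authoritative about which table this is.
  userOptions ++ Map("database" -> database, "table" -> table)
}

val opts = withTableOptions(Map("path" -> "/data/t"), "db1", "t1")
```

The relation then only ever sees a complete options map, so no per-source special casing is needed at resolution time.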
[GitHub] spark issue #20545: [SPARK-23359][SQL] Adds an alias 'names' of 'fieldNames'...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20545 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/739/ Test PASSed.
[GitHub] spark issue #20545: [SPARK-23359][SQL] Adds an alias 'names' of 'fieldNames'...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20545 **[Test build #87246 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87246/testReport)** for PR 20545 at commit [`664a62c`](https://github.com/apache/spark/commit/664a62c7da9ba5da2007d40ef9c157f7e82938c5).
[GitHub] spark issue #20545: [SPARK-23359][SQL] Adds an alias 'names' of 'fieldNames'...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20545 Merged build finished. Test PASSed.
[GitHub] spark issue #20545: [SPARK-23359][SQL] Adds an alias 'names' of 'fieldNames'...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20545 retest this please.
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r167142433 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,17 +17,130 @@ package org.apache.spark.sql.execution.datasources.v2 +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation -import org.apache.spark.sql.catalyst.expressions.AttributeReference -import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} -import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport} +import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics} +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -output: Seq[AttributeReference], -reader: DataSourceReader) - extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +projection: Option[Seq[AttributeReference]] = None, +filters: Option[Seq[Expression]] = None, +userSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation { + + override def simpleString: String = { +s"DataSourceV2Relation(source=$sourceName, " + + s"schema=[${output.map(a => s"$a 
${a.dataType.simpleString}").mkString(", ")}], " + + s"filters=[${pushedFilters.mkString(", ")}], options=$options)" + } + + override lazy val schema: StructType = reader.readSchema() + + override lazy val output: Seq[AttributeReference] = { +projection match { + case Some(attrs) => +// use the projection attributes to avoid assigning new ids. fields that are not projected +// will be assigned new ids, which is okay because they are not projected. +val attrMap = attrs.map(a => a.name -> a).toMap +schema.map(f => attrMap.getOrElse(f.name, + AttributeReference(f.name, f.dataType, f.nullable, f.metadata)())) + case _ => +schema.toAttributes +} + } + + private lazy val v2Options: DataSourceOptions = { +// ensure path and table options are set correctly +val updatedOptions = new mutable.HashMap[String, String] +updatedOptions ++= options + +new DataSourceOptions(options.asJava) --- End diff -- We all agree that duplicating the logic of creating `DataSourceOptions` in many places is a bad idea. Currently there are 2 proposals: 1. Have a central place to take care the data source v2 resolution logic, including option creating. This is the approach of data source v1, i.e. the class `DataSource`. 2. Similar to proposal 1, but make `DataSourceV2Relation` the central place. For now we don't know which one is better, it depends on how data source v2 evolves in the future. At this point of time, I think we should pick the simplest approach, which is passing the `DataSourceOptions` to `DataSourceV2Relation`. Then we just need a one-line change in `DataFrameReader`, and don't need to add `v2Options` here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
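The "one central place" proposal in the review above amounts to a single construction point that every code path uses to turn a plain options map into the case-insensitive options object. A self-contained sketch — `CaseInsensitiveOptions` below is an invented stand-in for Spark's `DataSourceOptions` (in `org.apache.spark.sql.sources.v2`), not the real class:

```scala
// Stand-in for DataSourceOptions: case-insensitive lookup over a String map.
final class CaseInsensitiveOptions private (raw: Map[String, String]) {
  private val lower = raw.map { case (k, v) => k.toLowerCase -> v }
  def get(key: String): Option[String] = lower.get(key.toLowerCase)
}

object CaseInsensitiveOptions {
  // Single construction point: DataFrameReader, the relation, and the
  // writer path would all go through here, so normalization (and any
  // path/table fix-ups) is never duplicated across call sites.
  def apply(raw: Map[String, String]): CaseInsensitiveOptions =
    new CaseInsensitiveOptions(raw)
}

val o = CaseInsensitiveOptions(Map("Path" -> "/tmp/out"))
```

Passing the already-built options object into the relation (the one-line `DataFrameReader` change suggested above) is the simplest way to get this today; a dedicated resolution class like v1's `DataSource` remains an option if v2 grows more logic.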
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87242/ Test FAILed.
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20477 Merged build finished. Test FAILed.
[GitHub] spark issue #20477: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20477 **[Test build #87242 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87242/testReport)** for PR 20477 at commit [`0cc0600`](https://github.com/apache/spark/commit/0cc0600b8f6f3a46189ae38850835f34b57bd945). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20552: [SPARK-23099][SS] Migrate foreach sink to DataSourceV2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20552 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87241/ Test FAILed.
[GitHub] spark issue #20552: [SPARK-23099][SS] Migrate foreach sink to DataSourceV2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20552 Merged build finished. Test FAILed.
[GitHub] spark issue #20552: [SPARK-23099][SS] Migrate foreach sink to DataSourceV2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20552 **[Test build #87241 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport)** for PR 20552 at commit [`a33a35c`](https://github.com/apache/spark/commit/a33a35ccbae7350519a3faf8d5d3d6f35692feb3). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20303: [SPARK-23128][SQL] A new approach to do adaptive executi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20303 Merged build finished. Test PASSed.
[GitHub] spark issue #20303: [SPARK-23128][SQL] A new approach to do adaptive executi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20303 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87236/ Test PASSed.
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r167141001 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,17 +17,130 @@ package org.apache.spark.sql.execution.datasources.v2 +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation -import org.apache.spark.sql.catalyst.expressions.AttributeReference -import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} -import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport} +import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics} +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -output: Seq[AttributeReference], -reader: DataSourceReader) - extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +projection: Option[Seq[AttributeReference]] = None, +filters: Option[Seq[Expression]] = None, +userSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation { + + override def simpleString: String = { +s"DataSourceV2Relation(source=$sourceName, " + + s"schema=[${output.map(a => s"$a 
${a.dataType.simpleString}").mkString(", ")}], " + + s"filters=[${pushedFilters.mkString(", ")}], options=$options)" + } + + override lazy val schema: StructType = reader.readSchema() + + override lazy val output: Seq[AttributeReference] = { +projection match { + case Some(attrs) => +// use the projection attributes to avoid assigning new ids. fields that are not projected +// will be assigned new ids, which is okay because they are not projected. +val attrMap = attrs.map(a => a.name -> a).toMap +schema.map(f => attrMap.getOrElse(f.name, + AttributeReference(f.name, f.dataType, f.nullable, f.metadata)())) + case _ => +schema.toAttributes +} + } + + private lazy val v2Options: DataSourceOptions = { +// ensure path and table options are set correctly +val updatedOptions = new mutable.HashMap[String, String] +updatedOptions ++= options + +new DataSourceOptions(options.asJava) + } + + private val sourceName: String = { +source match { + case registered: DataSourceRegister => +registered.shortName() + case _ => +source.getClass.getSimpleName +} + } + + lazy val ( + reader: DataSourceReader, + unsupportedFilters: Seq[Expression], + pushedFilters: Seq[Expression]) = { +val newReader = userSchema match { + case Some(s) => +asReadSupportWithSchema.createReader(s, v2Options) + case _ => +asReadSupport.createReader(v2Options) +} + +projection.foreach { attrs => + DataSourceV2Relation.pushRequiredColumns(newReader, attrs.toStructType) +} + +val (remainingFilters, pushedFilters) = filters match { + case Some(filterSeq) => +DataSourceV2Relation.pushFilters(newReader, filterSeq) + case _ => +(Nil, Nil) +} + +(newReader, remainingFilters, pushedFilters) + } - override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2Relation] + def writer(dfSchema: StructType, mode: SaveMode): Option[DataSourceWriter] = { --- End diff -- I think we should avoid adding unused code that is needed in the future. 
The streaming data source v2 was a bad example, and you already pointed it out. I hope we don't make the same mistake in the future.
[GitHub] spark issue #20303: [SPARK-23128][SQL] A new approach to do adaptive executi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20303 **[Test build #87236 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87236/testReport)** for PR 20303 at commit [`603c6d5`](https://github.com/apache/spark/commit/603c6d58ae9a72f8202236682c78cd48a9bb320e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20521: [SPARK-22977][SQL] fix web UI SQL tab for CTAS
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20521 **[Test build #87245 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87245/testReport)** for PR 20521 at commit [`6bc913f`](https://github.com/apache/spark/commit/6bc913f71bab6a7d5f04dfa465e1e67951489dc6).
[GitHub] spark issue #20521: [SPARK-22977][SQL] fix web UI SQL tab for CTAS
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20521 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/738/ Test PASSed.
[GitHub] spark issue #20521: [SPARK-22977][SQL] fix web UI SQL tab for CTAS
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20521 Merged build finished. Test PASSed.
[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20541 Merged build finished. Test PASSed.
[GitHub] spark pull request #20521: [SPARK-22977][SQL] fix web UI SQL tab for CTAS
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20521#discussion_r167140801 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveExplainSuite.scala --- @@ -128,32 +128,6 @@ class HiveExplainSuite extends QueryTest with SQLTestUtils with TestHiveSingleto "src") } - test("SPARK-17409: The EXPLAIN output of CTAS only shows the analyzed plan") { --- End diff -- This is kind of a "bad" test. The bug was that we optimized the CTAS input query twice, but here we are testing whether the EXPLAIN result of CTAS contains only the analyzed query, which is specific to how we fixed that bug at the time.
[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20541 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87237/ Test PASSed.
[GitHub] spark issue #20541: [SPARK-23356][SQL]Pushes Project to both sides of Union ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20541 **[Test build #87237 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87237/testReport)** for PR 20541 at commit [`4f5d46b`](https://github.com/apache/spark/commit/4f5d46baca612caaa882cbabb3b35665e9c7ed8b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20359: [SPARK-23186][SQL] Initialize DriverManager first...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20359 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20359: [SPARK-23186][SQL] Initialize DriverManager first before...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20359 thanks, merging to master/2.3!
[GitHub] spark issue #20516: [SPARK-23343][CORE][TEST] Increase the exception test fo...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20516 are you sure `./project/SparkBuild.scala:795: javaOptions in Test += "-Dspark.testing=1"` only affects the non-test code path? If so, we have a lot of places to fix.
[GitHub] spark issue #20545: [SPARK-23359][SQL] Adds an alias 'names' of 'fieldNames'...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20545 Merged build finished. Test FAILed.
[GitHub] spark issue #20545: [SPARK-23359][SQL] Adds an alias 'names' of 'fieldNames'...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20545 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87240/
[GitHub] spark issue #20545: [SPARK-23359][SQL] Adds an alias 'names' of 'fieldNames'...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20545 **[Test build #87240 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87240/testReport)** for PR 20545 at commit [`664a62c`](https://github.com/apache/spark/commit/664a62c7da9ba5da2007d40ef9c157f7e82938c5). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/20244#discussion_r167138603 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2399,6 +2424,121 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi } } + /** + * In this test, we simulate a scenario where concurrent jobs use the same + * RDD, which is marked for checkpointing: + * Job one has already finished its Spark job and starts the doCheckpoint process; + * Job two is submitted, and submitMissingTasks is called. + * In submitMissingTasks, if task serialization happens before doCheckpoint is done, + * while the partitions computed from stage.rdd.partitions are read after doCheckpoint is done, + * we may get a ClassCastException when executing the task, because some RDDs cast their partitions. + * + * This test case is meant to show that submitMissingTasks should perform task serialization + * and the partition computation against the same RDD checkpoint state. + */ + test("SPARK-23053: avoid ClassCastException in concurrent execution with checkpoint") { --- End diff -- hi @ivoson -- I haven't come up with a better way to test this, so I think for now you should (1) change the PR to *only* include the changes to the DAGScheduler (also undo the `protected[spark]` changes elsewhere) and (2) put this repro on the JIRA, as it's pretty good for showing what's going on. If we come up with a way to test it, we can always do that later on. Thanks, and sorry for the back and forth.
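The race described in the test's doc comment can be summarized with a sketch. All names here are illustrative, not the actual DAGScheduler internals, and the "fix direction" at the end is a sketch of the idea, not the merged patch:

```scala
// Illustrative sketch of the SPARK-23053 race -- NOT real DAGScheduler code.
//
// Thread A (job one, finished):     Thread B (job two, submitMissingTasks):
//   rdd.doCheckpoint()                val taskBinary = serialize(rdd) // pre-checkpoint RDD
//     ... swaps rdd's partitions      ... A's doCheckpoint completes ...
//     to checkpoint partitions ...    val parts = rdd.partitions      // post-checkpoint partitions
//
// The launched tasks deserialize the pre-checkpoint RDD but are handed
// post-checkpoint partitions; when compute() casts its Partition back to
// the pre-checkpoint subclass, it throws ClassCastException.
//
// Fix direction (sketch only): read both pieces of state under one
// consistent view of the RDD's checkpoint status, e.g.
//   val (taskBinaryBytes, parts) = stage.rdd.synchronized {
//     (serializeTaskBinary(stage), stage.rdd.partitions)
//   }
```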
[GitHub] spark pull request #20516: [SPARK-23343][CORE][TEST] Increase the exception ...
Github user heary-cao commented on a diff in the pull request: https://github.com/apache/spark/pull/20516#discussion_r167137766 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -,7 +,7 @@ private[spark] object Utils extends Logging { */ def portMaxRetries(conf: SparkConf): Int = { val maxRetries = conf.getOption("spark.port.maxRetries").map(_.toInt) -if (conf.contains("spark.testing")) { +if (isTesting || conf.contains("spark.testing")) { --- End diff -- Sorry, my understanding may have been one-sided. This is not only called from tests. When we want the default value of `spark.port.maxRetries` to be 100, we still need `spark.testing` to be set, either in the SparkConf or as the test-mode flag set in Spark unit tests, so I added the `isTesting` check here. Thanks.
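To make the two "test mode" signals in this exchange concrete, here is a simplified sketch of the logic under discussion. It follows the diff above, but it is not the verbatim `Utils` source; the non-test default of 16 is illustrative:

```scala
// Simplified sketch of the portMaxRetries logic -- not the verbatim source.
// Test mode can be signaled two ways: the -Dspark.testing system property
// (set by the build for test JVMs) or "spark.testing" in the SparkConf.
def isTesting: Boolean = sys.props.contains("spark.testing")

def portMaxRetries(conf: SparkConf): Int = {
  val maxRetries = conf.getOption("spark.port.maxRetries").map(_.toInt)
  if (isTesting || conf.contains("spark.testing")) {
    // In test mode, default to 100 retries to reduce port-collision flakiness.
    maxRetries.getOrElse(100)
  } else {
    maxRetries.getOrElse(16) // illustrative non-test default
  }
}
```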
[GitHub] spark pull request #20490: [SPARK-23323][SQL]: Support commit coordinator fo...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/20490#discussion_r167137165 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java --- @@ -62,6 +62,16 @@ */ DataWriterFactory createWriterFactory(); + /** + * Returns whether Spark should use the commit coordinator to ensure that only one attempt for --- End diff -- This is actually not a guarantee, is it?
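For context, the method under review has roughly this shape. This is a Scala sketch of an interface that in Spark is Java (`DataSourceWriter`); the trait name and comment wording here are illustrative:

```scala
// Sketch of the proposed API shape -- the actual interface is Java, in
// org.apache.spark.sql.sources.v2.writer.DataSourceWriter.
trait CommitCoordinationSketch {
  // If true, each task asks the driver-side commit coordinator for
  // permission before committing, so at most one attempt per partition is
  // *authorized* to commit. As the review comment points out, this is
  // best-effort arbitration rather than a hard guarantee: sinks that need
  // exactly-once behavior should still make their commits idempotent.
  def useCommitCoordinator: Boolean = true
}
```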