[GitHub] [spark] HyukjinKwon commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-14 Thread GitBox


HyukjinKwon commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r470939924



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
##
@@ -1450,10 +1450,20 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
   val ds = sampledTestData.coalesce(1)
   ds.write.text(path.getAbsolutePath)
 
-  val readback = spark.read
+  val readback1 = spark.read
 .option("inferSchema", true).option("samplingRatio", 0.1)
 .csv(path.getCanonicalPath)
-  assert(readback.schema == new StructType().add("_c0", IntegerType))
+  assert(readback1.schema == new StructType().add("_c0", IntegerType))
+
+  // SPARK-32621: During infer, "path" option gets added again to the paths that have already

Review comment:
   I think you can use `withClue`, e.g.:
   
   ```scala
   withClue("SPARK-32621: blah blah") {
 val readback2 ...
   }
   ```
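
   A fleshed-out version of this sketch might look like the following (assuming `readback2` mirrors `readback1` above but reads through the "path" option; the clue text is illustrative):

   ```scala
   withClue("SPARK-32621: 'path' option should not re-add already-resolved paths: ") {
     val readback2 = spark.read
       .option("inferSchema", true).option("samplingRatio", 0.1)
       .option("path", path.getCanonicalPath)
       .format("csv").load()
     assert(readback2.schema == new StructType().add("_c0", IntegerType))
   }
   ```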








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29436: [SPARK-32620][SQL] Reset the numPartitions metric when DPP is enabled

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29436:
URL: https://github.com/apache/spark/pull/29436#issuecomment-674350284










[GitHub] [spark] AmplabJenkins commented on pull request #29436: [SPARK-32620][SQL] Reset the numPartitions metric when DPP is enabled

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29436:
URL: https://github.com/apache/spark/pull/29436#issuecomment-674350284










[GitHub] [spark] SparkQA removed a comment on pull request #29436: [SPARK-32620][SQL] Reset the numPartitions metric when DPP is enabled

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29436:
URL: https://github.com/apache/spark/pull/29436#issuecomment-674324895


   **[Test build #127468 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127468/testReport)** for PR 29436 at commit [`18cac6a`](https://github.com/apache/spark/commit/18cac6a9f0bf4a6d449393f1ee84004623b3c893).






[GitHub] [spark] SparkQA commented on pull request #29436: [SPARK-32620][SQL] Reset the numPartitions metric when DPP is enabled

2020-08-14 Thread GitBox


SparkQA commented on pull request #29436:
URL: https://github.com/apache/spark/pull/29436#issuecomment-674350117


   **[Test build #127468 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127468/testReport)** for PR 29436 at commit [`18cac6a`](https://github.com/apache/spark/commit/18cac6a9f0bf4a6d449393f1ee84004623b3c893).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] viirya commented on pull request #29427: [SPARK-32622][SQL][TEST] Add case-sensitivity test for ORC predicate pushdown

2020-08-14 Thread GitBox


viirya commented on pull request #29427:
URL: https://github.com/apache/spark/pull/29427#issuecomment-674348958


   Updated the JIRA ticket number.






[GitHub] [spark] SparkQA commented on pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression

2020-08-14 Thread GitBox


SparkQA commented on pull request #29270:
URL: https://github.com/apache/spark/pull/29270#issuecomment-674337032


   **[Test build #127470 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127470/testReport)** for PR 29270 at commit [`5b223fa`](https://github.com/apache/spark/commit/5b223fa2b11c23e4409ce4e6dd9a7da3686409f0).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674336822


   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127467/






[GitHub] [spark] AmplabJenkins commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674336817










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674336817


   Merged build finished. Test PASSed.






[GitHub] [spark] SparkQA removed a comment on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674316955


   **[Test build #127467 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127467/testReport)** for PR 29342 at commit [`cf04e2f`](https://github.com/apache/spark/commit/cf04e2f9edeb0364ffed180d49b64d5b6969ef36).






[GitHub] [spark] SparkQA commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


SparkQA commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674336578


   **[Test build #127467 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127467/testReport)** for PR 29342 at commit [`cf04e2f`](https://github.com/apache/spark/commit/cf04e2f9edeb0364ffed180d49b64d5b6969ef36).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins commented on pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29270:
URL: https://github.com/apache/spark/pull/29270#issuecomment-674336364










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29270:
URL: https://github.com/apache/spark/pull/29270#issuecomment-674336364










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29437:
URL: https://github.com/apache/spark/pull/29437#issuecomment-674332647










[GitHub] [spark] AmplabJenkins commented on pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29437:
URL: https://github.com/apache/spark/pull/29437#issuecomment-674332647










[GitHub] [spark] SparkQA commented on pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-14 Thread GitBox


SparkQA commented on pull request #29437:
URL: https://github.com/apache/spark/pull/29437#issuecomment-674332543


   **[Test build #127469 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127469/testReport)** for PR 29437 at commit [`23f2efe`](https://github.com/apache/spark/commit/23f2efebe452fa2e2e74825d2433dc1f24f6733d).






[GitHub] [spark] imback82 opened a new pull request #29437: 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-14 Thread GitBox


imback82 opened a new pull request #29437:
URL: https://github.com/apache/spark/pull/29437


   
   
   ### What changes were proposed in this pull request?
   
   When the CSV/JSON datasources infer schema (e.g., in `def inferSchema(files: Seq[FileStatus])`), they create a `TextFileFormat` datasource and use the `files` along with the original options. `files` in `inferSchema` may already have been deduced from the "path" option if that option was present, so this can cause issues (e.g., reading more data, listing the path again) because the "path" option is **added** to the `files` a second time; see the sketch below.
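
   A hypothetical sketch of the idea (illustrative only; not necessarily the PR's exact change):

   ```scala
   // `files` was already resolved from the "path" option, so strip the option
   // before handing options to the inner TextFileFormat reader; otherwise the
   // path is appended to `files` and listed/read a second time.
   private def optionsForInference(options: Map[String, String]): Map[String, String] =
     options - "path"
   ```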
   
   ### Why are the changes needed?
   
   The current behavior can cause the following issue:
   ```scala
   class TestFileFilter extends PathFilter {
 override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
   }
   
   val path = "/tmp"
   val df = spark.range(2)
   df.write.json(path + "/p=1")
   df.write.json(path + "/p=2")
   
   val extraOptions = Map(
 "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
 "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
   )
   
   // This works fine.
   assert(spark.read.options(extraOptions).json(path).count == 2)
   
   // The following, with the "path" option, fails with:
   // assertion failed: Conflicting directory structures detected. Suspicious paths
   //   file:/tmp
   //   file:/tmp/p=1
   assert(spark.read.options(extraOptions).format("json").option("path", path).load.count() === 2)
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, the above failure no longer happens, and you get a consistent experience whether you use `spark.read.csv(path)` or `spark.read.format("csv").option("path", path).load`.
   
   ### How was this patch tested?
   
   Updated existing tests.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674328932










[GitHub] [spark] AmplabJenkins commented on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674328932










[GitHub] [spark] SparkQA removed a comment on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674296775


   **[Test build #127466 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127466/testReport)** for PR 29430 at commit [`5b1b9b3`](https://github.com/apache/spark/commit/5b1b9b39eb612cbf9ec67efd4e364adafcff66c4).






[GitHub] [spark] SparkQA commented on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


SparkQA commented on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674328817


   **[Test build #127466 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127466/testReport)** for PR 29430 at commit [`5b1b9b3`](https://github.com/apache/spark/commit/5b1b9b39eb612cbf9ec67efd4e364adafcff66c4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29436: [SPARK-32620][SQL] Reset the numPartitions metric when DPP is enabled

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29436:
URL: https://github.com/apache/spark/pull/29436#issuecomment-674325021










[GitHub] [spark] AmplabJenkins commented on pull request #29436: [SPARK-32620][SQL] Reset the numPartitions metric when DPP is enabled

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29436:
URL: https://github.com/apache/spark/pull/29436#issuecomment-674325021










[GitHub] [spark] SparkQA commented on pull request #29436: [SPARK-32620][SQL] Reset the numPartitions metric when DPP is enabled

2020-08-14 Thread GitBox


SparkQA commented on pull request #29436:
URL: https://github.com/apache/spark/pull/29436#issuecomment-674324895


   **[Test build #127468 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127468/testReport)** for PR 29436 at commit [`18cac6a`](https://github.com/apache/spark/commit/18cac6a9f0bf4a6d449393f1ee84004623b3c893).






[GitHub] [spark] wangyum opened a new pull request #29436: [SPARK-32620][SQL] Reset the numPartitions metric when DPP is enabled

2020-08-14 Thread GitBox


wangyum opened a new pull request #29436:
URL: https://github.com/apache/spark/pull/29436


   ### What changes were proposed in this pull request?
   
   This PR resets the `numPartitions` metric when DPP is enabled.
   
   
   ### Why are the changes needed?
   
   Fixes an incorrect `numPartitions` metric value when DPP is enabled.
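
   A hedged sketch of what "resetting" means here (illustrative; `reportPrunedPartitions` and `numPartitionsMetric` are assumed names, not the PR's code):

   ```scala
   import org.apache.spark.sql.execution.metric.SQLMetric

   def reportPrunedPartitions(numPartitionsMetric: SQLMetric, partitionCount: Int): Unit = {
     // set() overwrites the previous value, whereas add() would accumulate, so a
     // DPP-triggered re-listing reports the pruned count instead of the sum of both runs.
     numPartitionsMetric.set(partitionCount)
   }
   ```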
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   Unit tests and manual testing
   
   For [this test case](https://github.com/apache/spark/blob/18cac6a9f0bf4a6d449393f1ee84004623b3c893/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala#L252-L280).
   
   Before this pr:
   
![image](https://user-images.githubusercontent.com/5399861/90301798-9310b480-ded4-11ea-9294-49bcaba46f83.png)
   
   After this pr:
   
![image](https://user-images.githubusercontent.com/5399861/90301709-0fef5e80-ded4-11ea-942d-4d45d1dd15bc.png)
   
   






[GitHub] [spark] agrawaldevesh commented on pull request #28818: [WIP][SPARK-31198][CORE] Use graceful decommissioning as part of dynamic scaling

2020-08-14 Thread GitBox


agrawaldevesh commented on pull request #28818:
URL: https://github.com/apache/spark/pull/28818#issuecomment-674318466


   @holdenk can this PR be abandoned/closed now since this is finally in?






[GitHub] [spark] SparkQA commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


SparkQA commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674316955


   **[Test build #127467 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127467/testReport)** for PR 29342 at commit [`cf04e2f`](https://github.com/apache/spark/commit/cf04e2f9edeb0364ffed180d49b64d5b6969ef36).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674315733










[GitHub] [spark] AmplabJenkins commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674315733










[GitHub] [spark] c21 commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


c21 commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470908654



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##
@@ -71,8 +85,215 @@ case class ShuffledHashJoinExec(
 val numOutputRows = longMetric("numOutputRows")
 streamedPlan.execute().zipPartitions(buildPlan.execute()) { (streamIter, buildIter) =>
   val hashed = buildHashedRelation(buildIter)
-  join(streamIter, hashed, numOutputRows)
+  joinType match {
+case FullOuter => fullOuterJoin(streamIter, hashed, numOutputRows)
+case _ => join(streamIter, hashed, numOutputRows)
+  }
+}
+  }
+
+  private def fullOuterJoin(
+  streamIter: Iterator[InternalRow],
+  hashedRelation: HashedRelation,
+  numOutputRows: SQLMetric): Iterator[InternalRow] = {
+val joinKeys = streamSideKeyGenerator()
+val joinRow = new JoinedRow
+val (joinRowWithStream, joinRowWithBuild) = {
+  buildSide match {
+case BuildLeft => (joinRow.withRight _, joinRow.withLeft _)
+case BuildRight => (joinRow.withLeft _, joinRow.withRight _)
+  }
+}
+val buildNullRow = new GenericInternalRow(buildOutput.length)
+val streamNullRow = new GenericInternalRow(streamedOutput.length)
+lazy val streamNullJoinRowWithBuild = {
+  buildSide match {
+case BuildLeft =>
+  joinRow.withRight(streamNullRow)
+  joinRow.withLeft _
+case BuildRight =>
+  joinRow.withLeft(streamNullRow)
+  joinRow.withRight _
+  }
+}
+
+val iter = if (hashedRelation.keyIsUnique) {
+  fullOuterJoinWithUniqueKey(streamIter, hashedRelation, joinKeys, joinRowWithStream,
+    joinRowWithBuild, streamNullJoinRowWithBuild, buildNullRow, streamNullRow)
+} else {
+  fullOuterJoinWithNonUniqueKey(streamIter, hashedRelation, joinKeys, joinRowWithStream,
+    joinRowWithBuild, streamNullJoinRowWithBuild, buildNullRow, streamNullRow)
 }
+
+val resultProj = UnsafeProjection.create(output, output)
+iter.map { r =>
+  numOutputRows += 1
+  resultProj(r)
+}
+  }
+
+  /**
+   * Full outer shuffled hash join with unique join keys:
+   * 1. Process rows from stream side by looking up hash relation.
+   *    Mark the matched rows from build side as looked up.
+   *    A `BitSet` is used to track matched rows with key index.
+   * 2. Process rows from build side by iterating hash relation.
+   *    Filter out rows from build side being matched already,
+   *    by checking key index from `BitSet`.
+   */
+  private def fullOuterJoinWithUniqueKey(
+  streamIter: Iterator[InternalRow],
+  hashedRelation: HashedRelation,
+  joinKeys: UnsafeProjection,
+  joinRowWithStream: InternalRow => JoinedRow,
+  joinRowWithBuild: InternalRow => JoinedRow,
+  streamNullJoinRowWithBuild: => InternalRow => JoinedRow,
+  buildNullRow: GenericInternalRow,
+  streamNullRow: GenericInternalRow): Iterator[InternalRow] = {
+val matchedKeys = new BitSet(hashedRelation.maxNumKeysIndex)
+
+// Process stream side with looking up hash relation
+val streamResultIter = streamIter.map { srow =>
+  joinRowWithStream(srow)
+  val keys = joinKeys(srow)
+  if (keys.anyNull) {
+joinRowWithBuild(buildNullRow)
+  } else {
+val matched = hashedRelation.getValueWithKeyIndex(keys)
+if (matched != null) {
+  val keyIndex = matched.getKeyIndex
+  val buildRow = matched.getValue
+  val joinRow = joinRowWithBuild(buildRow)
+  if (boundCondition(joinRow)) {
+matchedKeys.set(keyIndex)
+joinRow
+  } else {
+joinRowWithBuild(buildNullRow)
+  }
+} else {
+  joinRowWithBuild(buildNullRow)
+}
+  }
+}
+
+// Process build side with filtering out the matched rows
+val buildResultIter = hashedRelation.valuesWithKeyIndex().flatMap {
+  valueRowWithKeyIndex =>
+val keyIndex = valueRowWithKeyIndex.getKeyIndex
+val isMatched = matchedKeys.get(keyIndex)
+if (!isMatched) {
+  val buildRow = valueRowWithKeyIndex.getValue
+  Some(streamNullJoinRowWithBuild(buildRow))
+} else {
+  None
+}
+}
+
+streamResultIter ++ buildResultIter
+  }
+
+  /**
+   * Full outer shuffled hash join with unique join keys:

Review comment:
   @cloud-fan - my bad, updated.

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
##
@@ -66,6 +66,30 @@ private[execution] sealed trait HashedRelation extends KnownSizeEstimation {
 throw new UnsupportedOperationException
   }
 
+  /**
+   * Returns an iterator for key index and matched rows.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getWithKeyIndex(key: InternalRow): Iterator[ValueRowWithKeyIndex]

[GitHub] [spark] huaxingao commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


huaxingao commented on a change in pull request #29396:
URL: https://github.com/apache/spark/pull/29396#discussion_r470898996



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScan.scala
##
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.v2.jdbc
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Row, SQLContext}
+import org.apache.spark.sql.connector.read.V1Scan
+import org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation
+import org.apache.spark.sql.sources.{BaseRelation, Filter, TableScan}
+import org.apache.spark.sql.types.StructType
+
+case class JDBCScan(
+relation: JDBCRelation,
+prunedSchema: StructType,
+pushedFilters: Array[Filter]) extends V1Scan {

Review comment:
   Using ```V1Scan``` here for now because we are still calling V1 
```JDBCRDD.scanTable``` underneath. Migrating step by step.

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.v2.jdbc
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters, SupportsPushDownRequiredColumns}
+import org.apache.spark.sql.execution.datasources.PartitioningUtils
+import org.apache.spark.sql.execution.datasources.jdbc.{JDBCOptions, JDBCRDD, JDBCRelation}
+import org.apache.spark.sql.jdbc.JdbcDialects
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+
+case class JDBCScanBuilder(
+session: SparkSession,
+schema: StructType,
+jdbcOptions: JDBCOptions)
+  extends ScanBuilder with SupportsPushDownFilters with SupportsPushDownRequiredColumns {
+
+  private val isCaseSensitive = session.sessionState.conf.caseSensitiveAnalysis
+
+  private var pushedFilter = Array.empty[Filter]
+
+  private var prunedSchema = schema
+
+  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
+if (jdbcOptions.pushDownPredicate) {
+  val dialect = JdbcDialects.get(jdbcOptions.url)
+  val (pushed, unSupported) = filters.partition(JDBCRDD.compileFilter(_, dialect).isDefined)
+  this.pushedFilter = pushed
+  unSupported

Review comment:
   Thanks for your comment. I agree that it's safer to return the original filters, but it seems to me that we want to push filters down to the underlying datasource for better performance, so I guess we don't want to return the original filters and re-evaluate the ones that have already been evaluated in the datasource.

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
[GitHub] [spark] viirya commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


viirya commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470898482



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
##
@@ -66,6 +66,30 @@ private[execution] sealed trait HashedRelation extends KnownSizeEstimation {
 throw new UnsupportedOperationException
   }
 
+  /**
+   * Returns an iterator for key index and matched rows.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getWithKeyIndex(key: InternalRow): Iterator[ValueRowWithKeyIndex]
+
+  /**
+   * Returns key index and matched single row.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getValueWithKeyIndex(key: InternalRow): ValueRowWithKeyIndex

Review comment:
   Can you add a comment saying this is for the unique-key case?
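
   One possible wording for the requested comment (illustrative; not necessarily the PR's final text):

   ```scala
   /**
    * Returns the key index and the single matched row.
    * Only used when the join keys are unique, i.e. at most one build-side row
    * matches a given key.
    *
    * Returns null if there are no matched rows.
    */
   def getValueWithKeyIndex(key: InternalRow): ValueRowWithKeyIndex
   ```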








[GitHub] [spark] mingjialiu commented on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


mingjialiu commented on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674302449


   > I think it might make sense to add a test with a self join between the two dfs yields the correct results to mirror the issue observed
   
   The observed issue cannot necessarily be mirrored that way. It happens only when 1) an exchange exists in the optimized physical plan, and 2) the reuse-exchange rule is applied.
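
   A hedged sketch of a query shape that meets both conditions (illustrative; assumes `spark.implicits._` is in scope, and is not necessarily what the final test does):

   ```scala
   val df = spark.read.format(classOf[AdvancedDataSourceV2].getName).load()
   val agg = df.groupBy($"i").count()              // 1. aggregation puts an Exchange in the plan
   val joined = agg.as("a").join(agg.as("b"), "i") // 2. identical subtrees let ReuseExchange fire
   joined.explain()                                // physical plan should show a ReusedExchange
   ```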






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674297234










[GitHub] [spark] AmplabJenkins commented on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674297234










[GitHub] [spark] SparkQA commented on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


SparkQA commented on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674296775


   **[Test build #127466 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127466/testReport)** for PR 29430 at commit [`5b1b9b3`](https://github.com/apache/spark/commit/5b1b9b39eb612cbf9ec67efd4e364adafcff66c4).






[GitHub] [spark] agrawaldevesh commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


agrawaldevesh commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r470890206



##
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##
@@ -294,10 +294,13 @@ private[spark] class CoarseGrainedExecutorBackend(
 override def run(): Unit = {
   var lastTaskRunningTime = System.nanoTime()
   val sleep_time = 1000 // 1s
-
+  val initialSleepMillis = env.conf.getInt(

Review comment:
   sure








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


agrawaldevesh commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r470890156



##
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##
@@ -323,6 +326,7 @@ private[spark] class CoarseGrainedExecutorBackend(
   // move forward.
   lastTaskRunningTime = System.nanoTime()
 }
+Thread.sleep(sleep_time)

Review comment:
   Yeah. No semantic change. We are still by default waiting for sleep_time 
the first time and the last time around (it is an infinite while loop that can 
only exit via an `exit(1)` -- via process death). I just wanted the first sleep 
interval to be configurable for testing. But no production change to the 
shutdown thread. 
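
   A sketch of the shape described here (the actual config key is truncated in the quoted diff, so the key below is a placeholder):

   ```scala
   val sleep_time = 1000 // steady-state wait between checks (1s), unchanged in production
   val initialSleepMillis = env.conf.getInt(
     "spark.test.decommission.initial.sleep.millis" /* placeholder key */, sleep_time)
   Thread.sleep(initialSleepMillis) // only the first wait is configurable, to speed up tests
   while (true) {
     // ... check whether tasks are still running; may call exit(1) to end the process ...
     Thread.sleep(sleep_time)       // subsequent waits keep the production default
   }
   ```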








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


agrawaldevesh commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r470889735



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -2012,7 +2013,8 @@ private[spark] class DAGScheduler(
   execId: String,
   fileLost: Boolean,
   hostToUnregisterOutputs: Option[String],
-  maybeEpoch: Option[Long] = None): Unit = {
+  maybeEpoch: Option[Long] = None,
+  ignoreShuffleVersion: Boolean = false): Unit = {

Review comment:
   Yeah, I will make it something like `ignoreShuffleFileLostEpoch` to be even more explicit.








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


agrawaldevesh commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r470889570



##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
##
@@ -1027,7 +1036,15 @@ private[spark] class TaskSchedulerImpl(
   }
 }
 
-executorsPendingDecommission -= executorId
+
+val decomInfo = executorsPendingDecommission.get(executorId)
+if (decomInfo.isDefined) {
+  val rememberSeconds =
+conf.getInt("spark.decommissioningRememberAfterRemoval.seconds", 60)
+  val gcSecond = TimeUnit.MILLISECONDS.toSeconds(clock.getTimeMillis()) + 
rememberSeconds
+  decommissioningExecutorsToGc.computeIfAbsent(gcSecond, _ => 
mutable.ArrayBuffer.empty) +=
+executorId
+}

Review comment:
   Hmm, no. The removal code only shares the piece about `TimeUnit.MILLISECONDS.toSeconds(clock.getTimeMillis())`; it does not share the rest of the code.








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


agrawaldevesh commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r470889645



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -2022,16 +2024,25 @@ private[spark] class DAGScheduler(
   blockManagerMaster.removeExecutor(execId)
   clearCacheLocs()
 }
-if (fileLost &&
-(!shuffleFileLostEpoch.contains(execId) || shuffleFileLostEpoch(execId) < currentEpoch)) {
-  shuffleFileLostEpoch(execId) = currentEpoch
-  hostToUnregisterOutputs match {
-case Some(host) =>
-  logInfo(s"Shuffle files lost for host: $host (epoch $currentEpoch)")
-  mapOutputTracker.removeOutputsOnHost(host)
-case None =>
-  logInfo(s"Shuffle files lost for executor: $execId (epoch 
$currentEpoch)")
-  mapOutputTracker.removeOutputsOnExecutor(execId)
+if (fileLost) {

Review comment:
   Will do. Good idea. 








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


agrawaldevesh commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r470889426



##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
##
@@ -136,7 +137,9 @@ private[spark] class TaskSchedulerImpl(
   // IDs of the tasks running on each executor
   private val executorIdToRunningTaskIds = new HashMap[String, HashSet[Long]]
 
-  private val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
+  val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
+  // map of second to list of executors to clear from the above map
+  val decommissioningExecutorsToGc = new util.TreeMap[Long, mutable.ArrayBuffer[String]]()

Review comment:
   Sure. Any structure that lets me GC by time will do. I just wanted something lightweight and custom to this use case.
   
   I expect the treemap to contain no more than 60 seconds' worth of entries, since things are keyed by the second and they are also cleaned up on every check. The check happens on every executor loss and fetch failure. But yeah, it is possible that if there are no failures then the entries could just sit there :-P.
   
   I will change it to a Cache. Good idea.
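
   A minimal sketch of the Cache-based alternative agreed on here (assuming Guava, which Spark already ships; `executorId` and `decomInfo` are the values from the diff above):

   ```scala
   import java.util.concurrent.TimeUnit
   import com.google.common.cache.CacheBuilder
   import org.apache.spark.scheduler.ExecutorDecommissionInfo

   // Entries expire ~60s after insertion, so no manual time-keyed GC pass is needed.
   val decommissionedExecutors = CacheBuilder.newBuilder()
     .expireAfterWrite(60, TimeUnit.SECONDS) // plays the role of rememberSeconds
     .build[String, ExecutorDecommissionInfo]()

   decommissionedExecutors.put(executorId, decomInfo)
   Option(decommissionedExecutors.getIfPresent(executorId)) // None once the entry expires
   ```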








[GitHub] [spark] mingjialiu commented on a change in pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


mingjialiu commented on a change in pull request #29430:
URL: https://github.com/apache/spark/pull/29430#discussion_r470889123



##
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##
@@ -371,6 +371,25 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
   }
 }
   }
+
+  test("SPARK-32609: DataSourceV2 with different pushedfilters should be 
different") {
+def getScanExec(query: DataFrame): DataSourceV2ScanExec = {
+  query.queryExecution.executedPlan.collect {
+case d: DataSourceV2ScanExec => d
+  }.head
+}
+
+Seq(classOf[AdvancedDataSourceV2], classOf[JavaAdvancedDataSourceV2]).foreach { cls =>
+  withClue(cls.getName) {
+val df = spark.read.format(cls.getName).load()
+val q1 = df.select('i).filter('i > 6)
+val q2 = df.select('i).filter('i > 5)

Review comment:
   Done








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


agrawaldevesh commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r470888548



##
File path: core/src/test/scala/org/apache/spark/deploy/DecommissionWorkerSuite.scala
##
@@ -212,22 +226,27 @@ class DecommissionWorkerSuite
   override def handleRootTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
 val taskInfo = taskEnd.taskInfo
if (taskInfo.executorId == executorToDecom && taskInfo.attemptNumber == 0 &&
-  taskEnd.stageAttemptId == 0) {
+  taskEnd.stageAttemptId == 0 && taskEnd.stageId == 0) {
   decommissionWorkerOnMaster(workerToDecom,
 "decommission worker after task on it is done")
 }
   }
 }
-TestUtils.withListener(sc, listener) { _ =>
+withListener(sc, listener) { _ =>
   val jobResult = sc.parallelize(1 to 2, 2).mapPartitionsWithIndex((_, _) => {
 val executorId = SparkEnv.get.executorId
-val sleepTimeSeconds = if (executorId == executorToDecom) 10 else 1
-Thread.sleep(sleepTimeSeconds * 1000L)
+val context = TaskContext.get()
+if (context.attemptNumber() == 0 && context.stageAttemptNumber() == 0) {
+  val sleepTimeSeconds = if (executorId == executorToDecom) 10 else 1
+  Thread.sleep(sleepTimeSeconds * 1000L)
+}

Review comment:
   Exactly. Got tired of waiting for the test to run and trying to cut 
slack.








[GitHub] [spark] agrawaldevesh commented on pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


agrawaldevesh commented on pull request #29422:
URL: https://github.com/apache/spark/pull/29422#issuecomment-674294907


   > Thank you for taking the time to resolve this and make such a clear 
writeup of the root cause. From an in-production not-in-test question: if the 
executor exits we also want to eagerly clean up everything and resubmit right?
   
   Yes for sure. That will happen on its own. I haven't really changed that 
behavior. I have only changed the way fetch failures are handled (stemming from 
a decommissioned host). And the way they lead to a rerun is that 
`org.apache.spark.scheduler.DAGScheduler#resubmitFailedStages` gets invoked on 
a fetch failure asynchronously. The driver will then figure out what stages are 
missing map outputs and rerun them in topological order. 
   
   When an executor exits, it will normally clean up just its own shuffle data (it does not know that its peer executors on the same host will soon be dying as well). It's the incrementing of the `shuffleFileLostEpoch` as part of this cleanup that prevents future cleanups when a fetch failure is observed.






[GitHub] [spark] rohitmishr1484 commented on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


rohitmishr1484 commented on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674294866


   @HyukjinKwon,
   
   Thanks for your helpful comment. I have made the suggested changes, but if you still find something that requires modification, please let me know and I will update it.
   
   Thanks!!!






[GitHub] [spark] c21 commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


c21 commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674293854


   @agrawaldevesh - thanks for the pointer on measuring metrics for the query, and for the script. I will take a look. I also plan to backport the PR to our internal fork. We don't have this feature internally; I added it here inspired by discussion with @cloud-fan and Bart Samwell (who told me the Delta engine will have this feature as well). I will run our production queries in a more serious way and report CPU metrics back here. But to set expectations: it may take a couple of days, and it is not a blocker for merging this PR. Thanks.






[GitHub] [spark] mridulm commented on pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-08-14 Thread GitBox


mridulm commented on pull request #28939:
URL: https://github.com/apache/spark/pull/28939#issuecomment-674293037


   Thanks @sarutak !






[GitHub] [spark] asfgit closed pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-08-14 Thread GitBox


asfgit closed pull request #28939:
URL: https://github.com/apache/spark/pull/28939


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


c21 commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470884973



##
File path: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
##
@@ -1188,4 +1188,53 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlanHelper
         classOf[BroadcastNestedLoopJoinExec]))
     }
   }
+
+  test("SPARK-32399: Full outer shuffled hash join") {
+    val inputDFs = Seq(
+      // Test unique join key
+      (spark.range(10).selectExpr("id as k1"),
+        spark.range(30).selectExpr("id as k2"),
+        $"k1" === $"k2"),
+      // Test non-unique join key
+      (spark.range(10).selectExpr("id % 5 as k1"),
+        spark.range(30).selectExpr("id % 5 as k2"),
+        $"k1" === $"k2"),
+      // Test string join key
+      (spark.range(10).selectExpr("cast(id * 3 as string) as k1"),
+        spark.range(30).selectExpr("cast(id as string) as k2"),
+        $"k1" === $"k2"),
+      // Test build side at right
+      (spark.range(30).selectExpr("cast(id / 3 as string) as k1"),
+        spark.range(10).selectExpr("cast(id as string) as k2"),
+        $"k1" === $"k2"),
+      // Test NULL join key
+      (spark.range(10).map(i => if (i % 2 == 0) i else null).selectExpr("value as k1"),
+        spark.range(30).map(i => if (i % 4 == 0) i else null).selectExpr("value as k2"),
+        $"k1" === $"k2"),
+      // Test multiple join keys
+      (spark.range(10).map(i => if (i % 2 == 0) i else null).selectExpr(
Review comment:
   @agrawaldevesh - I updated line 1217 to `spark.range(30).map(i => if (i % 3 == 0) i else null)` as you suggested. Am I missing anything here?
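   
   For reference, a small sketch of what the NULL-key inputs produce under a full outer join, assuming a local `SparkSession` in `spark` with `spark.implicits._` in scope:
   
   ```scala
   // SQL NULLs never compare equal to each other, so every row with a null
   // key comes out of the full outer join unmatched, null-padded on the other side.
   val left = spark.range(10).map(i => if (i % 2 == 0) i else null).selectExpr("value as k1")
   val right = spark.range(30).map(i => if (i % 3 == 0) i else null).selectExpr("value as k2")
   left.join(right, $"k1" === $"k2", "full_outer").show()
   ```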





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


c21 commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470884722



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##
@@ -71,8 +85,215 @@ case class ShuffledHashJoinExec(
     val numOutputRows = longMetric("numOutputRows")
     streamedPlan.execute().zipPartitions(buildPlan.execute()) { (streamIter, buildIter) =>
       val hashed = buildHashedRelation(buildIter)
-      join(streamIter, hashed, numOutputRows)
+      joinType match {
+        case FullOuter => fullOuterJoin(streamIter, hashed, numOutputRows)
+        case _ => join(streamIter, hashed, numOutputRows)
+      }
+    }
+  }
+
+  private def fullOuterJoin(
+      streamIter: Iterator[InternalRow],
+      hashedRelation: HashedRelation,
+      numOutputRows: SQLMetric): Iterator[InternalRow] = {
+    val joinKeys = streamSideKeyGenerator()
+    val joinRow = new JoinedRow
+    val (joinRowWithStream, joinRowWithBuild) = {
+      buildSide match {
+        case BuildLeft => (joinRow.withRight _, joinRow.withLeft _)
+        case BuildRight => (joinRow.withLeft _, joinRow.withRight _)
+      }
+    }
+    val buildNullRow = new GenericInternalRow(buildOutput.length)
+    val streamNullRow = new GenericInternalRow(streamedOutput.length)
+    lazy val streamNullJoinRowWithBuild = {
+      buildSide match {
+        case BuildLeft =>
+          joinRow.withRight(streamNullRow)
+          joinRow.withLeft _
+        case BuildRight =>
+          joinRow.withLeft(streamNullRow)
+          joinRow.withRight _
+      }
+    }
+
+    val iter = if (hashedRelation.keyIsUnique) {
+      fullOuterJoinWithUniqueKey(streamIter, hashedRelation, joinKeys, joinRowWithStream,
+        joinRowWithBuild, streamNullJoinRowWithBuild, buildNullRow, streamNullRow)
+    } else {
+      fullOuterJoinWithNonUniqueKey(streamIter, hashedRelation, joinKeys, joinRowWithStream,
+        joinRowWithBuild, streamNullJoinRowWithBuild, buildNullRow, streamNullRow)
     }
+
+    val resultProj = UnsafeProjection.create(output, output)
+    iter.map { r =>
+      numOutputRows += 1
+      resultProj(r)
+    }
+  }
+
+  /**
+   * Full outer shuffled hash join with unique join keys:
+   * 1. Process rows from stream side by looking up hash relation.
+   *    Mark the matched rows from build side as looked up.
+   *    A `BitSet` is used to track matched rows with key index.
+   * 2. Process rows from build side by iterating hash relation.
+   *    Filter out rows from build side being matched already,
+   *    by checking key index from `BitSet`.
+   */
+  private def fullOuterJoinWithUniqueKey(
+      streamIter: Iterator[InternalRow],
+      hashedRelation: HashedRelation,
+      joinKeys: UnsafeProjection,
+      joinRowWithStream: InternalRow => JoinedRow,
+      joinRowWithBuild: InternalRow => JoinedRow,
+      streamNullJoinRowWithBuild: => InternalRow => JoinedRow,
+      buildNullRow: GenericInternalRow,
+      streamNullRow: GenericInternalRow): Iterator[InternalRow] = {
+    val matchedKeys = new BitSet(hashedRelation.maxNumKeysIndex)
+
+    // Process stream side with looking up hash relation
+    val streamResultIter = streamIter.map { srow =>
+      joinRowWithStream(srow)
+      val keys = joinKeys(srow)
+      if (keys.anyNull) {
+        joinRowWithBuild(buildNullRow)
+      } else {
+        val matched = hashedRelation.getValueWithKeyIndex(keys)
+        if (matched != null) {
+          val keyIndex = matched.getKeyIndex
+          val buildRow = matched.getValue
+          val joinRow = joinRowWithBuild(buildRow)
+          if (boundCondition(joinRow)) {
+            matchedKeys.set(keyIndex)
+            joinRow
+          } else {
+            joinRowWithBuild(buildNullRow)
+          }
+        } else {
+          joinRowWithBuild(buildNullRow)
+        }
+      }
+    }
+
+    // Process build side with filtering out the matched rows
+    val buildResultIter = hashedRelation.valuesWithKeyIndex().flatMap {
+      valueRowWithKeyIndex =>
+        val keyIndex = valueRowWithKeyIndex.getKeyIndex
+        val isMatched = matchedKeys.get(keyIndex)
+        if (!isMatched) {
+          val buildRow = valueRowWithKeyIndex.getValue
+          Some(streamNullJoinRowWithBuild(buildRow))
+        } else {
+          None
+        }
+    }
+
+    streamResultIter ++ buildResultIter
+  }
+
+  /**
+   * Full outer shuffled hash join with non-unique join keys:
+   * 1. Process rows from stream side by looking up hash relation.
+   *    Mark the matched rows from build side as looked up.
+   *    A `HashSet[Long]` is used to track matched rows with
+   *    key index (Int) and value index (Int) together.
+   * 2. Process rows from build side by iterating hash relation.
+   *    Filter out rows from build side being matched already,
+   *    by checking key index and value index from `HashSet`.
+   *
+   * The "value index" is defined as the index of

[GitHub] [spark] SparkQA removed a comment on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674274381


   **[Test build #127465 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127465/testReport)**
 for PR 29410 at commit 
[`5df88b4`](https://github.com/apache/spark/commit/5df88b4cd441116cb023d932a8023cd8ef068307).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674282565







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674282565







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


SparkQA commented on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674282273


   **[Test build #127465 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127465/testReport)**
 for PR 29410 at commit 
[`5df88b4`](https://github.com/apache/spark/commit/5df88b4cd441116cb023d932a8023cd8ef068307).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674274794







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674274794







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


SparkQA commented on pull request #29410:
URL: https://github.com/apache/spark/pull/29410#issuecomment-674274381


   **[Test build #127465 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127465/testReport)**
 for PR 29410 at commit 
[`5df88b4`](https://github.com/apache/spark/commit/5df88b4cd441116cb023d932a8023cd8ef068307).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rohitmishr1484 commented on a change in pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


rohitmishr1484 commented on a change in pull request #29410:
URL: https://github.com/apache/spark/pull/29410#discussion_r470867265



##
File path: python/docs/source/getting_started/installation.rst
##
@@ -0,0 +1,62 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+============
+Installation
+============
+
+Using Conda
+~~~~~~~~~~~
+PySpark installation using `Conda `_ can be performed using the below command::
+
+    conda install -c conda-forge pyspark
+
+Using PyPI
+~~~~~~~~~~
+PySpark installation using `PyPI `_::
+
+    pip install pyspark
+
+Official release channel
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Different flavors of PySpark are available in `the official release channel `__.
+Any suitable version can be downloaded and extracted as below::
+
+    tar xzvf spark-3.0.0-bin-hadoop2.7.tgz
+
+An important step is to ensure the ``SPARK_HOME`` environment variable points to the directory where the code has been extracted. The next step is to properly define ``PYTHONPATH`` such that it can find PySpark and Py4J under ``$SPARK_HOME/python/lib``::
+
+    cd spark-3.0.0-bin-hadoop2.7
+
+    export SPARK_HOME=`pwd`
+
+    export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
+
+Installing from source
+~~~~~~~~~~~~~~~~~~~~~~
+
+To install PySpark from source, refer to `Building Spark `__.
+
+* Steps for defining ``PYTHONPATH`` are the same as described in the `Official release channel` section above.
+
+
+Dependencies
+~~~~~~~~~~~~
+* Using PySpark requires the Spark JARs.
+* At its core, PySpark depends on Py4J, but some additional sub-packages have their own extra requirements for some features (including NumPy, pandas, and PyArrow).

Review comment:
   Added this





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rohitmishr1484 commented on a change in pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


rohitmishr1484 commented on a change in pull request #29410:
URL: https://github.com/apache/spark/pull/29410#discussion_r470866980



##
File path: python/docs/source/getting_started/installation.rst
##
@@ -0,0 +1,62 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+============

Review comment:
   Done

##
File path: python/docs/source/getting_started/installation.rst
##
@@ -0,0 +1,62 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+============
+Installation
+============
+
+Using Conda
+~~~~~~~~~~~

Review comment:
   Done

##
File path: python/docs/source/getting_started/installation.rst
##
@@ -0,0 +1,62 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+============
+Installation
+============
+
+Using Conda

Review comment:
   Added this line

##
File path: python/docs/source/getting_started/installation.rst
##
@@ -0,0 +1,62 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+============
+Installation
+============
+
+Using Conda
+~~~~~~~~~~~
+PySpark installation using `Conda `_ can be performed using the below command::
+
+    conda install -c conda-forge pyspark
+
+Using PyPI
+~~~~~~~~~~
+PySpark installation using `PyPI `_::
+
+    pip install pyspark
+
+Official release channel
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Different flavors of PySpark are available in `the official release channel `__.
+Any suitable version can be downloaded and extracted as below::
+
+    tar xzvf spark-3.0.0-bin-hadoop2.7.tgz
+
+An important step is to ensure the ``SPARK_HOME`` environment variable points to the directory where the code has been extracted. The next step is to properly define ``PYTHONPATH`` such that 

[GitHub] [spark] rohitmishr1484 commented on a change in pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


rohitmishr1484 commented on a change in pull request #29410:
URL: https://github.com/apache/spark/pull/29410#discussion_r470866807



##
File path: python/docs/source/getting_started/index.rst
##
@@ -20,3 +20,13 @@
 Getting Started
 ===
 
+**PySpark** is the Python API for Spark.

Review comment:
   Done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rohitmishr1484 commented on a change in pull request #29410: [SPARK-32180][PYTHON][DOCS] Installation page in Getting Started in PySpark documentation

2020-08-14 Thread GitBox


rohitmishr1484 commented on a change in pull request #29410:
URL: https://github.com/apache/spark/pull/29410#discussion_r470866885



##
File path: python/docs/source/getting_started/index.rst
##
@@ -20,3 +20,13 @@
 Getting Started
 ===
 
+**PySpark** is the Python API for Spark.
+
+This page lists an overview of the basic steps required to set up and get started with PySpark.
+
+.. toctree::
+   :maxdepth: 2
+
+   installation
+   package_overview

Review comment:
   Done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


c21 commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470855625



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
##
@@ -66,6 +66,30 @@ private[execution] sealed trait HashedRelation extends KnownSizeEstimation {
     throw new UnsupportedOperationException
   }
 
+  /**
+   * Returns an iterator for key index and matched rows.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getWithKeyIndex(key: InternalRow): Iterator[ValueRowWithKeyIndex]
+
+  /**
+   * Returns key index and matched single row.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getValueWithKeyIndex(key: InternalRow): ValueRowWithKeyIndex

Review comment:
   @viirya - yes. Similar to the definition of `def getValue(key: InternalRow): InternalRow` above. I added a comment saying `Returns key index and matched single row.` to be consistent with the comment on `getValue`. Hope this is clear enough.
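   
   A short sketch of how a caller might use the two lookups, assuming some `rel: HashedRelation` and a join-key row `key`; `process` is a hypothetical callback, not part of the API:
   
   ```scala
   // Unique keys: at most one match, returned directly (or null).
   val single = rel.getValueWithKeyIndex(key)
   if (single != null) {
     process(single.getKeyIndex, single.getValue) // `process` is hypothetical
   }
   
   // Non-unique keys: all matches come back through an iterator (or null).
   val matches = rel.getWithKeyIndex(key)
   if (matches != null) {
     matches.foreach(v => process(v.getKeyIndex, v.getValue))
   }
   ```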





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


c21 commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470854938



##
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##
@@ -428,6 +428,68 @@ public MapIterator destructiveIterator() {
 return new MapIterator(numValues, new Location(), true);
   }
 
+  /**
+   * Iterator for the entries of this map. This is to first iterate over key index array
+   * `longArray` then accessing values in `dataPages`. NOTE: this is different from `MapIterator`
+   * in the sense that key index is preserved here
+   * (See `UnsafeHashedRelation` for example of usage).
+   */
+  public final class MapIteratorWithKeyIndex implements Iterator<Location> {

Review comment:
   @viirya - you can think of the `Location` returned by `MapIteratorWithKeyIndex.next()` as indirectly exposing the `keyIndex`. I don't have a strong preference here; @cloud-fan WDYT?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


viirya commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470845507



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
##
@@ -66,6 +66,30 @@ private[execution] sealed trait HashedRelation extends KnownSizeEstimation {
     throw new UnsupportedOperationException
   }
 
+  /**
+   * Returns an iterator for key index and matched rows.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getWithKeyIndex(key: InternalRow): Iterator[ValueRowWithKeyIndex]
+
+  /**
+   * Returns key index and matched single row.
+   *
+   * Returns null if there are no matched rows.
+   */
+  def getValueWithKeyIndex(key: InternalRow): ValueRowWithKeyIndex

Review comment:
   Is this for unique key case only?

##
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##
@@ -428,6 +428,68 @@ public MapIterator destructiveIterator() {
 return new MapIterator(numValues, new Location(), true);
   }
 
+  /**
+   * Iterator for the entries of this map. This is to first iterate over key index array
+   * `longArray` then accessing values in `dataPages`. NOTE: this is different from `MapIterator`
+   * in the sense that key index is preserved here
+   * (See `UnsafeHashedRelation` for example of usage).
+   */
+  public final class MapIteratorWithKeyIndex implements Iterator<Location> {

Review comment:
   Looks like `keyIndex` is not exposed outside this map iterator? Then maybe call it `MapIteratorPreserveKeyIndex`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] emkornfield commented on a change in pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


emkornfield commented on a change in pull request #29430:
URL: https://github.com/apache/spark/pull/29430#discussion_r470845244



##
File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2Suite.scala
##
@@ -371,6 +371,25 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-32609: DataSourceV2 with different pushedfilters should be different") {
+    def getScanExec(query: DataFrame): DataSourceV2ScanExec = {
+      query.queryExecution.executedPlan.collect {
+        case d: DataSourceV2ScanExec => d
+      }.head
+    }
+
+    Seq(classOf[AdvancedDataSourceV2], classOf[JavaAdvancedDataSourceV2]).foreach { cls =>
+      withClue(cls.getName) {
+        val df = spark.read.format(cls.getName).load()
+        val q1 = df.select('i).filter('i > 6)
+        val q2 = df.select('i).filter('i > 5)

Review comment:
   I think it might also be good to verify that two dataframes with the same filter compare as equal (i.e., we don't break exchange reuse).
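   
   A sketch of what that extra assertion might look like, reusing the `df` and `getScanExec` helpers from the quoted test (names assumed from the diff above):
   
   ```scala
   // Two scans with identical pushed filters should compare equal, so the
   // planner can still deduplicate them via exchange reuse.
   val q3 = df.select('i).filter('i > 6)
   val q4 = df.select('i).filter('i > 6)
   assert(getScanExec(q3) == getScanExec(q4),
     "scans with the same pushed filters should be equal")
   ```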





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29435: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29435:
URL: https://github.com/apache/spark/pull/29435#issuecomment-674241724


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29435: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29435:
URL: https://github.com/apache/spark/pull/29435#issuecomment-674243886


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29435: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29435:
URL: https://github.com/apache/spark/pull/29435#issuecomment-674241724


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mingjialiu opened a new pull request #29435: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


mingjialiu opened a new pull request #29435:
URL: https://github.com/apache/spark/pull/29435


   
   
   ### What changes were proposed in this pull request?
   Copy to the master branch the unit test added for branch-2.4 (https://github.com/apache/spark/pull/29430).
   
   
   ### Why are the changes needed?
   The unit test passes on the master branch, indicating that the issue reported in https://issues.apache.org/jira/browse/SPARK-32609 is already fixed there. This unit test is added to catch possible future regressions.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   no.
   
   
   
   ### How was this patch tested?
   sbt test run
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-674239618







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-674239618







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


SparkQA commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-674238955


   **[Test build #127463 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127463/testReport)**
 for PR 29396 at commit 
[`a80c544`](https://github.com/apache/spark/commit/a80c5441a01c8c516f4fa3288dfddf090ae2b060).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-674125075


   **[Test build #127463 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127463/testReport)**
 for PR 29396 at commit 
[`a80c544`](https://github.com/apache/spark/commit/a80c5441a01c8c516f4fa3288dfddf090ae2b060).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mingjialiu commented on pull request #29430: [SPARK-32609] Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread GitBox


mingjialiu commented on pull request #29430:
URL: https://github.com/apache/spark/pull/29430#issuecomment-674230500


   > This is against 2.4. Could you also check whether the master branch still has such an issue?
   
   I cannot reproduce the issue on the master branch (3.0) in a unit test.
   Besides, filter comparison is included in `BatchScanExec.equals`, which does `this.batch == other.batch`.
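   
   To illustrate why equality matters here (an illustrative sketch, not Spark's actual classes): exchange reuse substitutes a plan subtree with an earlier one that compares equal, so a scan's `equals` has to cover its pushed filters:
   
   ```scala
   // Case-class equality covers every field, including the pushed filters, so
   // scans that differ only in their filters are not conflated by reuse.
   case class FakeScan(table: String, pushedFilters: Seq[String])
   
   val a = FakeScan("t", Seq("i > 5"))
   val b = FakeScan("t", Seq("i > 6"))
   assert(a != b)                            // different filters => no incorrect reuse
   assert(a == FakeScan("t", Seq("i > 5")))  // identical scans stay reusable
   ```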



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674230204







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674230204







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674160102


   **[Test build #127464 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127464/testReport)**
 for PR 29434 at commit 
[`d71064e`](https://github.com/apache/spark/commit/d71064e50fe39454e5720c59d2bfdc5c9079831c).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


SparkQA commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674229394


   **[Test build #127464 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127464/testReport)**
 for PR 29434 at commit 
[`d71064e`](https://github.com/apache/spark/commit/d71064e50fe39454e5720c59d2bfdc5c9079831c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29428: [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29428:
URL: https://github.com/apache/spark/pull/29428#issuecomment-674220724







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29428: [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29428:
URL: https://github.com/apache/spark/pull/29428#issuecomment-674220724







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29428: [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29428:
URL: https://github.com/apache/spark/pull/29428#issuecomment-674101424


   **[Test build #127462 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127462/testReport)**
 for PR 29428 at commit 
[`0a6c574`](https://github.com/apache/spark/commit/0a6c5743a8808b55f399e3298116a0e92bd72d0d).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29428: [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value

2020-08-14 Thread GitBox


SparkQA commented on pull request #29428:
URL: https://github.com/apache/spark/pull/29428#issuecomment-674217964


   **[Test build #127462 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127462/testReport)**
 for PR 29428 at commit 
[`0a6c574`](https://github.com/apache/spark/commit/0a6c5743a8808b55f399e3298116a0e92bd72d0d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674210880







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674210880







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674088406


   **[Test build #127461 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127461/testReport)**
 for PR 29431 at commit 
[`09ef17b`](https://github.com/apache/spark/commit/09ef17b914060fe107d61bfd2af1ee69003a6fee).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


SparkQA commented on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674210062


   **[Test build #127461 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127461/testReport)**
 for PR 29431 at commit 
[`09ef17b`](https://github.com/apache/spark/commit/09ef17b914060fe107d61bfd2af1ee69003a6fee).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on a change in pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


srowen commented on a change in pull request #29434:
URL: https://github.com/apache/spark/pull/29434#discussion_r470793047



##
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinCostBasedReorderSuite.scala
##
@@ -351,6 +351,18 @@ class StarJoinCostBasedReorderSuite extends PlanTest with StatsEstimationTestBase
         .join(t5.join(t6, Inner, Some(nameToAttr("t5_c2") === nameToAttr("t6_c2"))), Inner,
           Some(nameToAttr("d2_c2") === nameToAttr("t5_c1")))
         .select(outputsOf(d1, t3, t4, f1, d2, t5, t6, d3, t1, t2): _*)
+    } else {

Review comment:
   Why does this differ? We'd rather not allow different user-visible behavior.

##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala
##
@@ -314,7 +314,7 @@ trait Row extends Serializable {
*
* @throws ClassCastException when data type does not match.
*/
-  def getSeq[T](i: Int): Seq[T] = getAs[Seq[T]](i)
+  def getSeq[T](i: Int): scala.collection.Seq[T] = getAs[scala.collection.Seq[T]](i)

Review comment:
   Can you explain this fix a bit more? I don't doubt it, but do we need to 
promise a `scala.collection.Seq`?
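   
   For context on the question (my understanding of the 2.13 change, not something stated in this thread): the default `Seq` alias moved from `scala.collection.Seq` in 2.12 to `scala.collection.immutable.Seq` in 2.13, which is presumably why the signature is widened:
   
   ```scala
   val arr = Array(1, 2, 3)
   
   // Both 2.12 and 2.13: Array wraps implicitly into a scala.collection.Seq.
   val s: scala.collection.Seq[Int] = arr
   
   // 2.12: Seq is scala.collection.Seq, so `val t: Seq[Int] = arr` compiles.
   // 2.13: Seq is scala.collection.immutable.Seq, so an explicit conversion
   // is needed instead:
   val t: Seq[Int] = arr.toSeq   // portable: yields an immutable Seq on 2.13
   ```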





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674201907







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


AmplabJenkins commented on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674201907







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


SparkQA removed a comment on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674085136


   **[Test build #127460 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127460/testReport)** for PR 29431 at commit [`98c8be3`](https://github.com/apache/spark/commit/98c8be31530d7d146746869acf4cadf0c5d495ee).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread GitBox


SparkQA commented on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674201047


   **[Test build #127460 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127460/testReport)** for PR 29431 at commit [`98c8be3`](https://github.com/apache/spark/commit/98c8be31530d7d146746869acf4cadf0c5d495ee).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


holdenk commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r470765386



##
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##
@@ -323,6 +326,7 @@ private[spark] class CoarseGrainedExecutorBackend(
   // move forward.
   lastTaskRunningTime = System.nanoTime()
 }
+Thread.sleep(sleep_time)

Review comment:
   This was moved so the initial sleep time didn't have sleep_time added to it on the first pass through, right? Nothing else?
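
   (A minimal sketch of the loop shape implied by the diff; the `initialSleepMillis` default and the exit check are illustrative assumptions, not the exact PR code:)

   ```scala
   object DecommissionPollSketch {
     val sleep_time = 1000                           // 1s, as in the PR
     val initialSleepMillis = 10                     // assumed test-configurable initial wait
     def noTasksRunningLongEnough(): Boolean = true  // hypothetical exit condition

     def main(args: Array[String]): Unit = {
       // Sleep once for the configurable initial delay, then check before
       // sleeping on each later pass, so the first check is not delayed by
       // an extra sleep_time.
       Thread.sleep(initialSleepMillis)
       var done = false
       while (!done) {
         done = noTasksRunningLongEnough()
         if (!done) Thread.sleep(sleep_time)
       }
     }
   }
   ```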

##
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##
@@ -294,10 +294,13 @@ private[spark] class CoarseGrainedExecutorBackend(
 override def run(): Unit = {
   var lastTaskRunningTime = System.nanoTime()
   val sleep_time = 1000 // 1s
-
+  val initialSleepMillis = env.conf.getInt(

Review comment:
   Maybe just add a comment here that this is for testing only.

##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -2012,7 +2013,8 @@ private[spark] class DAGScheduler(
   execId: String,
   fileLost: Boolean,
   hostToUnregisterOutputs: Option[String],
-  maybeEpoch: Option[Long] = None): Unit = {
+  maybeEpoch: Option[Long] = None,
+  ignoreShuffleVersion: Boolean = false): Unit = {

Review comment:
   Please add this to the Javadoc. Also, I'm not completely sure about the name of the variable.

##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
##
@@ -1027,7 +1036,15 @@ private[spark] class TaskSchedulerImpl(
   }
 }
 
-executorsPendingDecommission -= executorId
+
+val decomInfo = executorsPendingDecommission.get(executorId)
+if (decomInfo.isDefined) {
+  val rememberSeconds =
+    conf.getInt("spark.decommissioningRememberAfterRemoval.seconds", 60)
+  val gcSecond = TimeUnit.MILLISECONDS.toSeconds(clock.getTimeMillis()) + rememberSeconds
+  decommissioningExecutorsToGc.computeIfAbsent(gcSecond, _ => mutable.ArrayBuffer.empty) +=
+    executorId
+}

Review comment:
   Seems like repeated logic.
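
   (One possible shape for factoring out the repetition, sketched from the quoted hunk; it relies on the fields shown above and is an illustration, not code from the PR:)

   ```scala
   // Hypothetical helper: schedule an executor for removal from the
   // decommissioning bookkeeping after a grace period, so both call sites
   // share one implementation.
   private def scheduleDecommissionGc(executorId: String): Unit = {
     val rememberSeconds =
       conf.getInt("spark.decommissioningRememberAfterRemoval.seconds", 60)
     val gcSecond = TimeUnit.MILLISECONDS.toSeconds(clock.getTimeMillis()) + rememberSeconds
     decommissioningExecutorsToGc.computeIfAbsent(gcSecond, _ => mutable.ArrayBuffer.empty) +=
       executorId
   }
   ```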

##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -2022,16 +2024,25 @@ private[spark] class DAGScheduler(
   blockManagerMaster.removeExecutor(execId)
   clearCacheLocs()
 }
-if (fileLost &&
-(!shuffleFileLostEpoch.contains(execId) || shuffleFileLostEpoch(execId) < currentEpoch)) {
-  shuffleFileLostEpoch(execId) = currentEpoch
-  hostToUnregisterOutputs match {
-case Some(host) =>
-  logInfo(s"Shuffle files lost for host: $host (epoch $currentEpoch)")
-  mapOutputTracker.removeOutputsOnHost(host)
-case None =>
-  logInfo(s"Shuffle files lost for executor: $execId (epoch 
$currentEpoch)")
-  mapOutputTracker.removeOutputsOnExecutor(execId)
+if (fileLost) {

Review comment:
   Can we have a comment here clarifying the reasoning behind this logic?

##
File path: core/src/test/scala/org/apache/spark/deploy/DecommissionWorkerSuite.scala
##
@@ -212,22 +226,27 @@ class DecommissionWorkerSuite
   override def handleRootTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
 val taskInfo = taskEnd.taskInfo
if (taskInfo.executorId == executorToDecom && taskInfo.attemptNumber == 0 &&
-  taskEnd.stageAttemptId == 0) {
+  taskEnd.stageAttemptId == 0 && taskEnd.stageId == 0) {
   decommissionWorkerOnMaster(workerToDecom,
 "decommission worker after task on it is done")
 }
   }
 }
-TestUtils.withListener(sc, listener) { _ =>
+withListener(sc, listener) { _ =>
   val jobResult = sc.parallelize(1 to 2, 2).mapPartitionsWithIndex((_, _) => {
 val executorId = SparkEnv.get.executorId
-val sleepTimeSeconds = if (executorId == executorToDecom) 10 else 1
-Thread.sleep(sleepTimeSeconds * 1000L)
+val context = TaskContext.get()
+if (context.attemptNumber() == 0 && context.stageAttemptNumber() == 0) {
+  val sleepTimeSeconds = if (executorId == executorToDecom) 10 else 1
+  Thread.sleep(sleepTimeSeconds * 1000L)
+}

Review comment:
   I assume this is for speed-up, right?

##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
##
@@ -136,7 +137,9 @@ private[spark] class TaskSchedulerImpl(
   // IDs of the tasks running on each executor
   private val executorIdToRunningTaskIds = new HashMap[String, HashSet[Long]]
 
-  private val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
+  val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
+  // map of second to list of executors to garbage-collect at that second

[GitHub] [spark] holdenk commented on pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-14 Thread GitBox


holdenk commented on pull request #29422:
URL: https://github.com/apache/spark/pull/29422#issuecomment-674186603


   Thank you for taking the time to resolve this and write up the root cause so clearly. A question about production (not test) behavior: if the executor exits, we also want to eagerly clean up everything and resubmit, right?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on a change in pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-14 Thread GitBox


c21 commented on a change in pull request #29342:
URL: https://github.com/apache/spark/pull/29342#discussion_r470753712



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala
##
@@ -71,8 +85,215 @@ case class ShuffledHashJoinExec(
 val numOutputRows = longMetric("numOutputRows")
 streamedPlan.execute().zipPartitions(buildPlan.execute()) { (streamIter, buildIter) =>
   val hashed = buildHashedRelation(buildIter)
-  join(streamIter, hashed, numOutputRows)
+  joinType match {
+case FullOuter => fullOuterJoin(streamIter, hashed, numOutputRows)
+case _ => join(streamIter, hashed, numOutputRows)
+  }
+}
+  }
+
+  private def fullOuterJoin(
+  streamIter: Iterator[InternalRow],
+  hashedRelation: HashedRelation,
+  numOutputRows: SQLMetric): Iterator[InternalRow] = {
+val joinKeys = streamSideKeyGenerator()
+val joinRow = new JoinedRow
+val (joinRowWithStream, joinRowWithBuild) = {
+  buildSide match {
+case BuildLeft => (joinRow.withRight _, joinRow.withLeft _)
+case BuildRight => (joinRow.withLeft _, joinRow.withRight _)
+  }
+}
+val buildNullRow = new GenericInternalRow(buildOutput.length)
+val streamNullRow = new GenericInternalRow(streamedOutput.length)
+lazy val streamNullJoinRowWithBuild = {

Review comment:
   @cloud-fan - I think with `lazy val` I am addressing [this comment from @viirya](https://github.com/apache/spark/pull/29342#discussion_r467252553): set the stream-side NULL row only once, not per row, because every such output row carries the same stream-side NULL row. If this were a `val` rather than a `lazy val`, `joinRow.withRight(streamNullRow)` would be eagerly evaluated here, which is not right, as `joinRow` is reused later.
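
   (A standalone sketch of the hazard, with a toy stand-in for Spark's mutable `JoinedRow`; illustrative only, not the PR code:)

   ```scala
   object LazyJoinRowDemo {
     // Toy stand-in for Spark's mutable, reused JoinedRow.
     final class JoinedRow {
       private var left: String = _
       private var right: String = _
       def withLeft(l: String): JoinedRow = { left = l; this }
       def withRight(r: String): JoinedRow = { right = r; this }
       override def toString: String = s"($left, $right)"
     }

     def main(args: Array[String]): Unit = {
       val joinRow = new JoinedRow
       // Deferred: nothing is mutated until the first build-side-only row
       // actually needs the stream-side NULL, and it is set exactly once.
       lazy val streamNullJoinRow = joinRow.withRight("STREAM_NULL")
       // The matched-row phase reuses joinRow freely in the meantime; an
       // eager `val` above would already have been clobbered by this.
       joinRow.withLeft("build").withRight("stream")
       // First use evaluates the lazy val now, after the reuse above.
       println(streamNullJoinRow.withLeft("unmatchedBuild")) // (unmatchedBuild, STREAM_NULL)
     }
   }
   ```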





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-14 Thread GitBox


viirya commented on a change in pull request #29396:
URL: https://github.com/apache/spark/pull/29396#discussion_r470736186



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCWriteBuilder.scala
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.v2.jdbc
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.connector.write._
+import org.apache.spark.sql.execution.datasources.jdbc.{JdbcOptionsInWrite, JdbcUtils}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.InsertableRelation
+import org.apache.spark.sql.types.StructType
+
+case class JDBCWriteBuilder(schema: StructType, options: JdbcOptionsInWrite) extends V1WriteBuilder
+  with SupportsTruncate {
+
+  private var isTruncate = false
+
+  override def truncate(): WriteBuilder = {
+isTruncate = true
+this
+  }
+
+  override def buildForV1Write(): InsertableRelation = new InsertableRelation {
+override def insert(data: DataFrame, overwrite: Boolean): Unit = {

Review comment:
   `overwrite` should be true when `isTruncate` is true, right? Can you add an assert here?
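
   (A minimal standalone sketch of the suggested assertion; the method shape and names mirror the builder's fields but are hypothetical:)

   ```scala
   object TruncateInvariantSketch {
     def insert(isTruncate: Boolean, overwrite: Boolean): Unit = {
       // Truncating the existing table only makes sense as an overwrite, so
       // fail fast if the planner ever passes an inconsistent combination.
       assert(!isTruncate || overwrite,
         s"truncate mode requires overwrite = true (got overwrite = $overwrite)")
       // ... the actual JDBC write would happen here ...
     }

     def main(args: Array[String]): Unit = {
       insert(isTruncate = true, overwrite = true)   // ok
       insert(isTruncate = false, overwrite = false) // ok: plain append
     }
   }
   ```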





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-14 Thread GitBox


AmplabJenkins removed a comment on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674160674







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   3   4   5   >