[GitHub] [spark] huaxingao commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


huaxingao commented on a change in pull request #29396:
URL: https://github.com/apache/spark/pull/29396#discussion_r468343334



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
##
@@ -64,11 +64,14 @@ class DataSourceV2Strategy(session: SparkSession) extends 
Strategy with Predicat
   }
   val rdd = v1Relation.buildScan()
   val unsafeRowRDD = DataSourceStrategy.toCatalystRDD(v1Relation, output, 
rdd)
-  val originalOutputNames = relation.table.schema().map(_.name)
-  val requiredColumnsIndex = 
output.map(_.name).map(originalOutputNames.indexOf)
+  // `RowDataSourceScanExec` requires the full output instead of the scan 
output after column
+  // pruning. However, when we reach here following the v2 code path, we 
don't have the full
+  // output anymore. `RowDataSourceScanExec.fullOutput` is actually 
meaningless so here we just
+  // pass the pruned output.
+  // TODO: remove `RowDataSourceScanExec.fullOutput`.

Review comment:
   I will open the subtasks tomorrow. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-10 Thread GitBox


c21 commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-671742514


   @cloud-fan, @agrawaldevesh, @maropu and @viirya -
   
   I took a more closer look inside `BytesToBytesMap.java`, and found it would 
probably be hard / hacky to get key index when iterating all values of map. 
   
   After reading the stream side of join, we need to iterate all rows on build 
side `BytesToBytesMap` to output build side rows not having a match. To iterate 
all rows on the map, `BytesToBytesMap` provides a method 
[`iterator()`](https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java#L412).
 The method is to iterate through all data (key-value pair) in each memory page 
of `dataPages`. The approach only reads through data pages, and does not 
interact with the key index array `longArray` at all. So we could not get key 
index efficiently here.
   
   A workaround would be for every returned key-value pair inside 
`MapIterator`, we call `lookup(key, ...)` again on map to get key index. But we 
need a probing for every row on build side which seems to be obviously 
inefficient.
   
   How do you guys think? Thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29276:
URL: https://github.com/apache/spark/pull/29276#issuecomment-671741732


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127304/
   Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage

2020-08-10 Thread GitBox


SparkQA removed a comment on pull request #29276:
URL: https://github.com/apache/spark/pull/29276#issuecomment-671696614


   **[Test build #127304 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127304/testReport)**
 for PR 29276 at commit 
[`e1e53ef`](https://github.com/apache/spark/commit/e1e53ef258013df2f85573de4542d36105ef852b).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29276:
URL: https://github.com/apache/spark/pull/29276#issuecomment-671741724


   Merged build finished. Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #29276:
URL: https://github.com/apache/spark/pull/29276#issuecomment-671741724







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage

2020-08-10 Thread GitBox


SparkQA commented on pull request #29276:
URL: https://github.com/apache/spark/pull/29276#issuecomment-671741096


   **[Test build #127304 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127304/testReport)**
 for PR 29276 at commit 
[`e1e53ef`](https://github.com/apache/spark/commit/e1e53ef258013df2f85573de4542d36105ef852b).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 edited a comment on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-10 Thread GitBox


c21 edited a comment on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-671723199


   ~~@cloud-fan - sorry if I miss anything, could you elaborate more of~~
   
   ~~> We don't need to get the value index. We can calculate it by ourselves~~
   
   ~~How do we calculate value index based on value itself?~~
   
   @cloud-fan - I got it, you mean value index to be index of row in Iterator.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread GitBox


SparkQA removed a comment on pull request #29402:
URL: https://github.com/apache/spark/pull/29402#issuecomment-671727996


   **[Test build #127307 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127307/testReport)**
 for PR 29402 at commit 
[`3cb7891`](https://github.com/apache/spark/commit/3cb789155ac0fe60d91f11784f97876ac9928852).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29402:
URL: https://github.com/apache/spark/pull/29402#issuecomment-671730938







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #29402:
URL: https://github.com/apache/spark/pull/29402#issuecomment-671730938







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread GitBox


SparkQA commented on pull request #29402:
URL: https://github.com/apache/spark/pull/29402#issuecomment-671730864


   **[Test build #127307 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127307/testReport)**
 for PR 29402 at commit 
[`3cb7891`](https://github.com/apache/spark/commit/3cb789155ac0fe60d91f11784f97876ac9928852).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


imback82 commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468330472



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala
##
@@ -395,7 +395,7 @@ class DataFrameReaderWriterSuite extends QueryTest with 
SharedSparkSession with
   .load("/test")
 
 assert(LastOptions.parameters("intOpt") == "56")
-assert(LastOptions.parameters("path") == "/test")
+assert(!LastOptions.parameters.contains("path"))

Review comment:
   Yes, now it makes sense. I will update it, thanks!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


imback82 commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468330253



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource {
 sparkSession,
 paths = paths,
 className = classOf[TextFileFormat].getName,
-options = options.parameters
+options = options.parameters - "path"

Review comment:
   Yes, without this, few tests that involve "inferring" fail.
   
   Basically, when you do `spark.read.csv("/tmp")`, "path" option is populated 
since there is only one path parameter in `load()`. Now, when `infer` is 
called, `inputPaths: Seq[FileStatus]` is a result of listing files and there is 
the "path" option that is being carried over, so `infer` will try to read more 
data than needed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread GitBox


HyukjinKwon commented on pull request #29402:
URL: https://github.com/apache/spark/pull/29402#issuecomment-671729038


   cc @viirya can you take a quick look when you're available?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29402:
URL: https://github.com/apache/spark/pull/29402#issuecomment-671728329







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #29402:
URL: https://github.com/apache/spark/pull/29402#issuecomment-671728329







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread GitBox


SparkQA commented on pull request #29402:
URL: https://github.com/apache/spark/pull/29402#issuecomment-671727996


   **[Test build #127307 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127307/testReport)**
 for PR 29402 at commit 
[`3cb7891`](https://github.com/apache/spark/commit/3cb789155ac0fe60d91f11784f97876ac9928852).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon opened a new pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread GitBox


HyukjinKwon opened a new pull request #29402:
URL: https://github.com/apache/spark/pull/29402


   ### What changes were proposed in this pull request?
   
   This PR proposes to `include` `_images` and `_sources` directories in Jekyll 
build.
   
   For `_images` directory,
   After SPARK-31851, now we add some images to use within the pages built by 
Sphinx. It copies and images into `_images` directory. Later, when Jekyll 
builds, the underscore directories are ignored by default which ends up with 
missing image in the main doc.
   
   Before:
   ![Screen Shot 2020-08-11 at 1 52 46 
PM](https://user-images.githubusercontent.com/6477701/89859104-2e571080-dbdb-11ea-817c-c04bbcd4088e.png)
   
   After:
   ![Screen Shot 2020-08-11 at 1 49 00 
PM](https://user-images.githubusercontent.com/6477701/89859105-30b96a80-dbdb-11ea-85c6-8a135eddf613.png)
   
   
   For `_sources` directory,
   Please refer 
[here](https://github.com/sphinx-contrib/sphinx-pretty-searchresults#source-links)
 and 
[here](https://www.sphinx-doc.org/en/master/usage/configuration.html#confval-html_copy_source).
 They are generated by default and used by default in the documentations by 
Sphinx, and we should better include them.
   
   ### Why are the changes needed?
   
   To show the images correctly in PySpark documentation.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, only in unreleased branches.
   
   ### How was this patch tested?
   
   Manually tested via:
   
   ```bash
   SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
   ```
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671725265







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671725265







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


SparkQA removed a comment on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671657035


   **[Test build #127297 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127297/testReport)**
 for PR 29396 at commit 
[`79f336c`](https://github.com/apache/spark/commit/79f336c5fcd8f703887dfa5e7ffb7f64eaff3a9f).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


cloud-fan commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468326123



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala
##
@@ -395,7 +395,7 @@ class DataFrameReaderWriterSuite extends QueryTest with 
SharedSparkSession with
   .load("/test")
 
 assert(LastOptions.parameters("intOpt") == "56")
-assert(LastOptions.parameters("path") == "/test")
+assert(!LastOptions.parameters.contains("path"))

Review comment:
   Sorry I missed one thing. For DS v1, `...load(path)` will generate a 
`path` option, which is visible to v1 sources, and we can't break it. It's 
strange that `...load(path1, path2)` will completely ignore the paths and don't 
pass paths to v1 sources, but there is nothing we can do here.
   
   We need to update `DataFrameReader.loadV1Source` to generate a `path` option 
if `paths.length == 1`, to keep the previous behavior.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


SparkQA commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671724764


   **[Test build #127297 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127297/testReport)**
 for PR 29396 at commit 
[`79f336c`](https://github.com/apache/spark/commit/79f336c5fcd8f703887dfa5e7ffb7f64eaff3a9f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


cloud-fan commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468326123



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala
##
@@ -395,7 +395,7 @@ class DataFrameReaderWriterSuite extends QueryTest with 
SharedSparkSession with
   .load("/test")
 
 assert(LastOptions.parameters("intOpt") == "56")
-assert(LastOptions.parameters("path") == "/test")
+assert(!LastOptions.parameters.contains("path"))

Review comment:
   Sorry I missed one thing. For DS v1, `...load(path)` will generate a 
`path` option, which is visible to v1 sources, and we can't break it. It's 
strange that `...load(path1, path2)` will completely ignore the paths and don't 
pass paths to v1 sources, but there is nothing we can do here.
   
   We need to update `DataFrameReader.loadV1Source` to generate a `path` option 
if `paths.length == 1` (and set paths to Nil), to keep the previous behavior.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-10 Thread GitBox


c21 commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-671723199


   @cloud-fan - sorry if I miss anything, could you elaborate more of
   
   > We don't need to get the value index. We can calculate it by ourselves
   
   How do we calculate value index based on value itself?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


cloud-fan commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468324590



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource {
 sparkSession,
 paths = paths,
 className = classOf[TextFileFormat].getName,
-options = options.parameters
+options = options.parameters - "path"

Review comment:
   This is a good catch! BTW does it have to be fixed in this PR? The check 
is added in `DataFrameReader.load` and `DataSource.resolveRelation` should be 
fine.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


cloud-fan commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468324072



##
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##
@@ -245,12 +245,19 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
 "read files of Hive data source directly.")
 }
 
+if (extraOptions.contains("path") && paths.nonEmpty) {

Review comment:
   Then we can make sure `path` and `paths` options wouldn't co-exist for 
DS v2, and you can make the change in 
https://github.com/apache/spark/pull/29328/files#r468295791





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29367: [SPARK-31198][CORE] Use graceful decommissioning as part of dynamic scaling

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29367:
URL: https://github.com/apache/spark/pull/29367#issuecomment-671721882







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


imback82 commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468286947



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource {
 sparkSession,
 paths = paths,
 className = classOf[TextFileFormat].getName,
-options = options.parameters
+options = options.parameters - "path"

Review comment:
   This is called during `infer` and "path" option gets added to 
`inputPaths` (if the option was originally defined), thus this needs to be 
removed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


imback82 commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468295791



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource {
 sparkSession,
 paths = paths,
 className = classOf[TextFileFormat].getName,
-options = options.parameters
+options = options.parameters - "path"

Review comment:
   I also thought about update the following
   
https://github.com/apache/spark/blob/8659ec554fe1993fcaf2ec4d1c4745205bddec7d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L575
   
   to
   
   ```
   val allPaths = 
caseInsensitiveOptions.get("path").map(Seq(_)).getOrElse(paths)
   ```
   , but it seemed a risky change (e.g., there is no good place for assert, 
etc.).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


cloud-fan commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468323610



##
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##
@@ -245,12 +245,19 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
 "read files of Hive data source directly.")
 }
 
+if (extraOptions.contains("path") && paths.nonEmpty) {

Review comment:
   how about we forbid `paths` option as well for consistency?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29367: [SPARK-31198][CORE] Use graceful decommissioning as part of dynamic scaling

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #29367:
URL: https://github.com/apache/spark/pull/29367#issuecomment-671721882







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29367: [SPARK-31198][CORE] Use graceful decommissioning as part of dynamic scaling

2020-08-10 Thread GitBox


SparkQA commented on pull request #29367:
URL: https://github.com/apache/spark/pull/29367#issuecomment-671721570


   **[Test build #127306 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127306/testReport)**
 for PR 29367 at commit 
[`a099152`](https://github.com/apache/spark/commit/a099152577214a531d8a57a8f08771e30e2e449a).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


imback82 commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468323154



##
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##
@@ -245,6 +246,10 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
 "read files of Hive data source directly.")
 }
 

Review comment:
   Ah OK. Thanks.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


cloud-fan commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468323149



##
File path: docs/sql-migration-guide.md
##
@@ -36,6 +36,8 @@ license: |
 
   - In Spark 3.1, NULL elements of structures, arrays and maps are converted 
to "null" in casting them to strings. In Spark 3.0 or earlier, NULL elements 
are converted to empty strings. To restore the behavior before Spark 3.1, you 
can set `spark.sql.legacy.castComplexTypesToString.enabled` to `true`.
 
+  - In Spark 3.1, when loading a dataframe, `path` option cannot coexist with 
`load()`'s path parameters. For example, 
`spark.read.format("csv").option("path", "/tmp").load("/tmp2")` or 
`spark.read.option("path", "/tmp").csv("/tmp2")` will throw 
`org.apache.spark.sql.AnalysisException`.

Review comment:
   Let's also mention the legacy behavior in 3.0 and below.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


cloud-fan commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468322696



##
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##
@@ -245,6 +246,10 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
 "read files of Hive data source directly.")
 }
 

Review comment:
   parquet is a builtin source, so we can make it work anyway. `def 
load(path: String)` is general for other external sources, and we should 
guarantee backward compatibility here.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-10 Thread GitBox


cloud-fan commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-671719927


   A few more thoughts:
   1. For `keyIsUnique` code path, we know it's one key one value, I think we 
can still use bitset.
   2. We don't need to get the value index. We can calculate it by ourselves 
since `get(key)` returns an `Iterator[InternalRow]`. e.g. `values.zipWithIndex`.
   
   I would expect to see a `def getWithKeyIndex(key): (Int, 
Iterator[InternalRow])`. We can even combine key index and value index into a 
long for better performance.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


imback82 commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468320706



##
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##
@@ -245,6 +246,10 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
 "read files of Hive data source directly.")
 }
 

Review comment:
   Hmm, `def parquet(path: String)` has been calling `def load(paths: 
String*)`, not `def load(path: String)`, and it seems to have worked fine so 
far (other than the inconsistency this PR addresses), no?
   
   Btw, we shouldn't remove `def load(path: String)` because it breaks the 
binary compatibility.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29389: [SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29389:
URL: https://github.com/apache/spark/pull/29389#issuecomment-671718149







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29389: [SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #29389:
URL: https://github.com/apache/spark/pull/29389#issuecomment-671718149







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] leanken commented on pull request #29389: [SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys

2020-08-10 Thread GitBox


leanken commented on pull request #29389:
URL: https://github.com/apache/spark/pull/29389#issuecomment-671718124


   @cloud-fan Test passed, ready to merge.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29389: [SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys

2020-08-10 Thread GitBox


SparkQA removed a comment on pull request #29389:
URL: https://github.com/apache/spark/pull/29389#issuecomment-671646391


   **[Test build #127296 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127296/testReport)**
 for PR 29389 at commit 
[`5309b66`](https://github.com/apache/spark/commit/5309b66b23af2ffe7999edbd8f0c11f26628805d).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29389: [SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys

2020-08-10 Thread GitBox


SparkQA commented on pull request #29389:
URL: https://github.com/apache/spark/pull/29389#issuecomment-671717620


   **[Test build #127296 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127296/testReport)**
 for PR 29389 at commit 
[`5309b66`](https://github.com/apache/spark/commit/5309b66b23af2ffe7999edbd8f0c11f26628805d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671716088







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671716088







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


SparkQA commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671715831


   **[Test build #127305 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127305/testReport)**
 for PR 29396 at commit 
[`3ef5414`](https://github.com/apache/spark/commit/3ef54141d4834ebbe8a7bddea57117e73034ba06).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

2020-08-10 Thread GitBox


gatorsmile commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468306894



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
##
@@ -160,8 +152,42 @@ class HDFSMetadataLog[T <: AnyRef : 
ClassTag](sparkSession: SparkSession, path:
 IOUtils.closeQuietly(input)
   }
 } else {
-  logDebug(s"Unable to find batch $batchMetadataFile")
-  None
+  throw new FileNotFoundException(s"Unable to find batch 
$batchMetadataFile")
+}
+  }
+
+  /**
+   * Store the metadata for the specified batchId and return `true` if 
successful. This method
+   * fills the content of metadata via executing function. If the function 
throws exception,

Review comment:
   Nit: exception -> an exception





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671704714







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671704714







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


SparkQA removed a comment on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671694607


   **[Test build #127303 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127303/testReport)**
 for PR 28991 at commit 
[`da76e1a`](https://github.com/apache/spark/commit/da76e1a7ba25ca96495cbb2ce56377983ee9f74f).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


SparkQA commented on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671704534


   **[Test build #127303 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127303/testReport)**
 for PR 28991 at commit 
[`da76e1a`](https://github.com/apache/spark/commit/da76e1a7ba25ca96495cbb2ce56377983ee9f74f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

2020-08-10 Thread GitBox


gatorsmile commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468306410



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##
@@ -97,18 +97,13 @@ class FileStreamSinkLog(
 s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was 
$defaultCompactInterval) " +
   "to a positive value.")
 
-  override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
-val deletedFiles = logs.filter(_.action == 
FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
-if (deletedFiles.isEmpty) {
-  logs
-} else {
-  logs.filter(f => !deletedFiles.contains(f.path))
-}
-  }
+  override def shouldRetain(log: SinkFileStatus): Boolean = true
 }
 
 object FileStreamSinkLog {
   val VERSION = 1
+  // TODO: This action hasn't been used from the introduction. We should just 
remove this.

Review comment:
   Add a JIRA number in the comment?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

2020-08-10 Thread GitBox


gatorsmile commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468305462



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##
@@ -132,12 +130,18 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : 
ClassTag](
 }
   }
 
+  def serializeEntry(entry: T, out: OutputStream): Unit = {
+out.write(Serialization.write(entry).getBytes(UTF_8))
+  }
+
+  def deserializeEntry(line: String): T = Serialization.read[T](line)

Review comment:
   private?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

2020-08-10 Thread GitBox


gatorsmile commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468305424



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##
@@ -132,12 +130,18 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : 
ClassTag](
 }
   }
 
+  def serializeEntry(entry: T, out: OutputStream): Unit = {

Review comment:
   private?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


cloud-fan commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468305334



##
File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
##
@@ -245,6 +246,10 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
 "read files of Hive data source directly.")
 }
 

Review comment:
   It's required, because
   ```
   def load(path: String): DataFrame = {
 // force invocation of `load(...varargs...)`
 option("path", path).load(Seq.empty: _*)
   }
   ```
   
   If we remove this method and always call `def load(paths: String*)`, we 
should make sure the behavior doesn't change, and we still generate "path" 
option if only one path is given.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

2020-08-10 Thread GitBox


gatorsmile commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468305223



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##
@@ -173,37 +177,64 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : 
ClassTag](
   override def purge(thresholdBatchId: Long): Unit = throw new 
UnsupportedOperationException(
 s"Cannot purge as it might break internal state.")
 
+  /**
+   * Apply function on all entries in the specific batch. The method will throw
+   * FileNotFoundException if the metadata log file doesn't exist.
+   *
+   * NOTE: This doesn't fail early on corruption. The caller should handle the 
exception
+   * properly and make sure the logic is not affected by failing in the middle.

Review comment:
   How to ensure all the callers handle it properly? Do we need to 
introduce a test to cover it?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

2020-08-10 Thread GitBox


gatorsmile commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468303420



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##
@@ -173,37 +177,64 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : 
ClassTag](
   override def purge(thresholdBatchId: Long): Unit = throw new 
UnsupportedOperationException(
 s"Cannot purge as it might break internal state.")
 
+  /**
+   * Apply function on all entries in the specific batch. The method will throw
+   * FileNotFoundException if the metadata log file doesn't exist.
+   *
+   * NOTE: This doesn't fail early on corruption. The caller should handle the 
exception
+   * properly and make sure the logic is not affected by failing in the middle.

Review comment:
   If the caller does not handle the exception properly, what is the 
consequence? Do we have a test case to cover it?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #29031: [SPARK-32216][SQL] Remove redundant ProjectExec

2020-08-10 Thread GitBox


cloud-fan closed pull request #29031:
URL: https://github.com/apache/spark/pull/29031


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #29031: [SPARK-32216][SQL] Remove redundant ProjectExec

2020-08-10 Thread GitBox


cloud-fan commented on pull request #29031:
URL: https://github.com/apache/spark/pull/29031#issuecomment-671699494


   thanks, merging to master!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29276:
URL: https://github.com/apache/spark/pull/29276#issuecomment-671696968







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #29276:
URL: https://github.com/apache/spark/pull/29276#issuecomment-671696968







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage

2020-08-10 Thread GitBox


SparkQA commented on pull request #29276:
URL: https://github.com/apache/spark/pull/29276#issuecomment-671696614


   **[Test build #127304 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127304/testReport)**
 for PR 29276 at commit 
[`e1e53ef`](https://github.com/apache/spark/commit/e1e53ef258013df2f85573de4542d36105ef852b).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671694990







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671695012


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127301/
   Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671695006


   Merged build finished. Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29369: [SPARK-32540][SQL] Eliminate the filter clause in aggregate

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29369:
URL: https://github.com/apache/spark/pull/29369#issuecomment-671694921







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


SparkQA removed a comment on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671694555


   **[Test build #127301 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127301/testReport)**
 for PR 29396 at commit 
[`fb48c86`](https://github.com/apache/spark/commit/fb48c860d25ad4b8efcf03025dc904c976738f07).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671694990







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29369: [SPARK-32540][SQL] Eliminate the filter clause in aggregate

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #29369:
URL: https://github.com/apache/spark/pull/29369#issuecomment-671694921







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


SparkQA commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671694997


   **[Test build #127301 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127301/testReport)**
 for PR 29396 at commit 
[`fb48c86`](https://github.com/apache/spark/commit/fb48c860d25ad4b8efcf03025dc904c976738f07).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671695006







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29369: [SPARK-32540][SQL] Eliminate the filter clause in aggregate

2020-08-10 Thread GitBox


SparkQA commented on pull request #29369:
URL: https://github.com/apache/spark/pull/29369#issuecomment-671694575


   **[Test build #127302 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127302/testReport)**
 for PR 29369 at commit 
[`e4b4c5a`](https://github.com/apache/spark/commit/e4b4c5ac25b880fa4adc37bca82282ab34874f36).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


SparkQA commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671694555


   **[Test build #127301 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127301/testReport)**
 for PR 29396 at commit 
[`fb48c86`](https://github.com/apache/spark/commit/fb48c860d25ad4b8efcf03025dc904c976738f07).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


SparkQA commented on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671694607


   **[Test build #127303 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127303/testReport)**
 for PR 28991 at commit 
[`da76e1a`](https://github.com/apache/spark/commit/da76e1a7ba25ca96495cbb2ce56377983ee9f74f).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671692831


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127300/
   Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


imback82 commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468295791



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource {
 sparkSession,
 paths = paths,
 className = classOf[TextFileFormat].getName,
-options = options.parameters
+options = options.parameters - "path"

Review comment:
   I also thought about update the following
   
https://github.com/apache/spark/blob/8659ec554fe1993fcaf2ec4d1c4745205bddec7d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L575
   
   to
   
   ```
   val allPaths = 
caseInsensitiveOptions.get("path").map(Seq(_)).getOrElse(paths)
   ```
   , but it looked like a risky change (e.g., there is no good place for 
assert, etc.).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671692976







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


huaxingao commented on a change in pull request #29396:
URL: https://github.com/apache/spark/pull/29396#discussion_r468295410



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala
##
@@ -107,6 +111,41 @@ class JDBCTableCatalogSuite extends QueryTest with 
SharedSparkSession {
 }
   }
 
+  test("simple scan") {
+checkAnswer(sql("SELECT name, id FROM h2.test.people"), Seq(Row("fred", 
1), Row("mary", 2)))

Review comment:
   fixed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671692976







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


SparkQA removed a comment on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671690512


   **[Test build #127300 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127300/testReport)**
 for PR 28991 at commit 
[`4ca75ba`](https://github.com/apache/spark/commit/4ca75baea102fb1c2e7467ed1093d4c0520a5245).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


huaxingao commented on a change in pull request #29396:
URL: https://github.com/apache/spark/pull/29396#discussion_r468295245



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala
##
@@ -34,4 +38,16 @@ case class JDBCTable(ident: Identifier, schema: StructType, 
jdbcOptions: JDBCOpt
 capabilities.add(BATCH_READ)
 capabilities
   }
+
+  override def newScanBuilder(options: CaseInsensitiveStringMap): 
JDBCScanBuilder = {
+val mergedOptions = new JDBCOptions(
+  jdbcOptions.parameters.originalMap ++ 
options.asCaseSensitiveMap().asScala)
+new JDBCScanBuilder(SparkSession.active, schema, mergedOptions)

Review comment:
   removed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


huaxingao commented on a change in pull request #29396:
URL: https://github.com/apache/spark/pull/29396#discussion_r468295282



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCWriteBuilder.scala
##
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.v2.jdbc
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.connector.write._
+import org.apache.spark.sql.execution.datasources.jdbc.{JdbcOptionsInWrite, 
JdbcUtils}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.InsertableRelation
+import org.apache.spark.sql.types.StructType
+
+

Review comment:
   removed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671692824


   Merged build finished. Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


huaxingao commented on a change in pull request #29396:
URL: https://github.com/apache/spark/pull/29396#discussion_r468295197



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala
##
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.v2.jdbc
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, 
SupportsPushDownFilters, SupportsPushDownRequiredColumns}
+import org.apache.spark.sql.execution.datasources.jdbc.{JDBCOptions, JDBCRDD, 
JDBCRelation}
+import org.apache.spark.sql.jdbc.JdbcDialects
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+
+case class JDBCScanBuilder(
+session: SparkSession,
+schema: StructType,
+jdbcOptions: JDBCOptions)
+  extends ScanBuilder with SupportsPushDownFilters with 
SupportsPushDownRequiredColumns {
+
+  private var pushedFilter = Array.empty[Filter]
+
+  private var prunedSchema = schema
+
+  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
+if (jdbcOptions.pushDownPredicate) {
+  val dialect = JdbcDialects.get(jdbcOptions.url)
+  val (pushed, unSupported) = filters.partition(JDBCRDD.compileFilter(_, 
dialect).isDefined)
+  this.pushedFilter = pushed
+  unSupported
+} else {
+  filters
+}
+  }
+
+  override def pushedFilters(): Array[Filter] = pushedFilter
+
+  override def pruneColumns(requiredSchema: StructType): Unit = {
+// JDBC doesn't support nested column pruning.
+val requiredCols = requiredSchema.map(_.name)
+prunedSchema = StructType(schema.fields.filter(f => 
requiredCols.contains(f.name)))

Review comment:
   I followed the way in FileScanBuider and fixed this





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


SparkQA commented on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671692811


   **[Test build #127300 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127300/testReport)**
 for PR 28991 at commit 
[`4ca75ba`](https://github.com/apache/spark/commit/4ca75baea102fb1c2e7467ed1093d4c0520a5245).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671692824







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread GitBox


huaxingao commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-671692513


   > Could you add tests for write since you supported WriteBuilder in this PR.
   I will take JDBCV2Suite from @cloud-fan 's 
https://github.com/apache/spark/pull/27345. I will ignore the tests that don't 
work for now.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671691009







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29401: [SPARK-32400][SQL] Improve test coverage of HiveScriptTransformationExec

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29401:
URL: https://github.com/apache/spark/pull/29401#issuecomment-671690941







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671691009







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29401: [SPARK-32400][SQL] Improve test coverage of HiveScriptTransformationExec

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #29401:
URL: https://github.com/apache/spark/pull/29401#issuecomment-671690941







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver

2020-08-10 Thread GitBox


SparkQA commented on pull request #28991:
URL: https://github.com/apache/spark/pull/28991#issuecomment-671690512


   **[Test build #127300 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127300/testReport)**
 for PR 28991 at commit 
[`4ca75ba`](https://github.com/apache/spark/commit/4ca75baea102fb1c2e7467ed1093d4c0520a5245).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29401: [SPARK-32400][SQL] Improve test coverage of HiveScriptTransformationExec

2020-08-10 Thread GitBox


SparkQA commented on pull request #29401:
URL: https://github.com/apache/spark/pull/29401#issuecomment-671690473


   **[Test build #127299 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127299/testReport)**
 for PR 29401 at commit 
[`b9776bd`](https://github.com/apache/spark/commit/b9776bde0d73274b814dde713143b5286c5449ea).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu opened a new pull request #29401: [SPARK-32400][SQL] Improve test coverage of HiveScriptTransformationExec

2020-08-10 Thread GitBox


AngersZh opened a new pull request #29401:
URL: https://github.com/apache/spark/pull/29401


   ### What changes were proposed in this pull request?
   
   1. Extract common test case (no serde) to BasicScriptTransformationExecSuite
   2. Add more test case for no serde mode about supported data type and 
behavior in `BasicScriptTransformationExecSuite`
   3. Add more test case for hive serde mode about supported type and behavior 
in `HiveScriptTransformationExecSuite`
   
   
   ### Why are the changes needed?
   Improve test coverage of Script Transformation
   
   
   ### Does this PR introduce _any_ user-facing change?
   NO
   
   
   ### How was this patch tested?
   Added UT



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


SparkQA commented on pull request #29328:
URL: https://github.com/apache/spark/pull/29328#issuecomment-671685821


   **[Test build #127298 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127298/testReport)**
 for PR 29328 at commit 
[`d344eb9`](https://github.com/apache/spark/commit/d344eb946d921c0d180069daac4f15438c452af4).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


AmplabJenkins removed a comment on pull request #29328:
URL: https://github.com/apache/spark/pull/29328#issuecomment-671684138







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


AmplabJenkins commented on pull request #29328:
URL: https://github.com/apache/spark/pull/29328#issuecomment-671684138







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters

2020-08-10 Thread GitBox


imback82 commented on a change in pull request #29328:
URL: https://github.com/apache/spark/pull/29328#discussion_r468286947



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource {
 sparkSession,
 paths = paths,
 className = classOf[TextFileFormat].getName,
-options = options.parameters
+options = options.parameters - "path"

Review comment:
   This is called during `infer` and "path" option gets added to 
`inputPaths`, thus this needs to be removed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on pull request #28490: [SPARK-31670][SQL]Resolve Struct Field in Grouping Aggregate with same ExprId

2020-08-10 Thread GitBox


AngersZh commented on pull request #28490:
URL: https://github.com/apache/spark/pull/28490#issuecomment-671682512


   Any update for this? cc @cloud-fan @viirya @HyukjinKwon 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   3   4   5   6   >