[GitHub] [spark] huaxingao commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
huaxingao commented on a change in pull request #29396: URL: https://github.com/apache/spark/pull/29396#discussion_r468343334 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ## @@ -64,11 +64,14 @@ class DataSourceV2Strategy(session: SparkSession) extends Strategy with Predicat } val rdd = v1Relation.buildScan() val unsafeRowRDD = DataSourceStrategy.toCatalystRDD(v1Relation, output, rdd) - val originalOutputNames = relation.table.schema().map(_.name) - val requiredColumnsIndex = output.map(_.name).map(originalOutputNames.indexOf) + // `RowDataSourceScanExec` requires the full output instead of the scan output after column + // pruning. However, when we reach here following the v2 code path, we don't have the full + // output anymore. `RowDataSourceScanExec.fullOutput` is actually meaningless so here we just + // pass the pruned output. + // TODO: remove `RowDataSourceScanExec.fullOutput`. Review comment: I will open the subtasks tomorrow. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] c21 commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join
c21 commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-671742514 @cloud-fan, @agrawaldevesh, @maropu and @viirya - I took a more closer look inside `BytesToBytesMap.java`, and found it would probably be hard / hacky to get key index when iterating all values of map. After reading the stream side of join, we need to iterate all rows on build side `BytesToBytesMap` to output build side rows not having a match. To iterate all rows on the map, `BytesToBytesMap` provides a method [`iterator()`](https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java#L412). The method is to iterate through all data (key-value pair) in each memory page of `dataPages`. The approach only reads through data pages, and does not interact with the key index array `longArray` at all. So we could not get key index efficiently here. A workaround would be for every returned key-value pair inside `MapIterator`, we call `lookup(key, ...)` again on map to get key index. But we need a probing for every row on build side which seems to be obviously inefficient. How do you guys think? Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage
AmplabJenkins removed a comment on pull request #29276: URL: https://github.com/apache/spark/pull/29276#issuecomment-671741732 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127304/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage
SparkQA removed a comment on pull request #29276: URL: https://github.com/apache/spark/pull/29276#issuecomment-671696614 **[Test build #127304 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127304/testReport)** for PR 29276 at commit [`e1e53ef`](https://github.com/apache/spark/commit/e1e53ef258013df2f85573de4542d36105ef852b). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage
AmplabJenkins removed a comment on pull request #29276: URL: https://github.com/apache/spark/pull/29276#issuecomment-671741724 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage
AmplabJenkins commented on pull request #29276: URL: https://github.com/apache/spark/pull/29276#issuecomment-671741724 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage
SparkQA commented on pull request #29276: URL: https://github.com/apache/spark/pull/29276#issuecomment-671741096 **[Test build #127304 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127304/testReport)** for PR 29276 at commit [`e1e53ef`](https://github.com/apache/spark/commit/e1e53ef258013df2f85573de4542d36105ef852b). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] c21 edited a comment on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join
c21 edited a comment on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-671723199 ~~@cloud-fan - sorry if I miss anything, could you elaborate more of~~ ~~> We don't need to get the value index. We can calculate it by ourselves~~ ~~How do we calculate value index based on value itself?~~ @cloud-fan - I got it, you mean value index to be index of row in Iterator. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build
SparkQA removed a comment on pull request #29402: URL: https://github.com/apache/spark/pull/29402#issuecomment-671727996 **[Test build #127307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127307/testReport)** for PR 29402 at commit [`3cb7891`](https://github.com/apache/spark/commit/3cb789155ac0fe60d91f11784f97876ac9928852). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build
AmplabJenkins removed a comment on pull request #29402: URL: https://github.com/apache/spark/pull/29402#issuecomment-671730938 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build
AmplabJenkins commented on pull request #29402: URL: https://github.com/apache/spark/pull/29402#issuecomment-671730938 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build
SparkQA commented on pull request #29402: URL: https://github.com/apache/spark/pull/29402#issuecomment-671730864 **[Test build #127307 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127307/testReport)** for PR 29402 at commit [`3cb7891`](https://github.com/apache/spark/commit/3cb789155ac0fe60d91f11784f97876ac9928852). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
imback82 commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468330472 ## File path: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ## @@ -395,7 +395,7 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSparkSession with .load("/test") assert(LastOptions.parameters("intOpt") == "56") -assert(LastOptions.parameters("path") == "/test") +assert(!LastOptions.parameters.contains("path")) Review comment: Yes, now it makes sense. I will update it, thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
imback82 commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468330253 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ## @@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource { sparkSession, paths = paths, className = classOf[TextFileFormat].getName, -options = options.parameters +options = options.parameters - "path" Review comment: Yes, without this, few tests that involve "inferring" fail. Basically, when you do `spark.read.csv("/tmp")`, "path" option is populated since there is only one path parameter in `load()`. Now, when `infer` is called, `inputPaths: Seq[FileStatus]` is a result of listing files and there is the "path" option that is being carried over, so `infer` will try to read more data than needed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build
HyukjinKwon commented on pull request #29402: URL: https://github.com/apache/spark/pull/29402#issuecomment-671729038 cc @viirya can you take a quick look when you're available? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build
AmplabJenkins removed a comment on pull request #29402: URL: https://github.com/apache/spark/pull/29402#issuecomment-671728329 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build
AmplabJenkins commented on pull request #29402: URL: https://github.com/apache/spark/pull/29402#issuecomment-671728329 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build
SparkQA commented on pull request #29402: URL: https://github.com/apache/spark/pull/29402#issuecomment-671727996 **[Test build #127307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127307/testReport)** for PR 29402 at commit [`3cb7891`](https://github.com/apache/spark/commit/3cb789155ac0fe60d91f11784f97876ac9928852). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon opened a new pull request #29402: [SPARK-32584][PYTHON][DOCS] Exclude _images and _sources that are generated by Sphinx in Jekyll build
HyukjinKwon opened a new pull request #29402: URL: https://github.com/apache/spark/pull/29402 ### What changes were proposed in this pull request? This PR proposes to `include` `_images` and `_sources` directories in Jekyll build. For `_images` directory, After SPARK-31851, now we add some images to use within the pages built by Sphinx. It copies and images into `_images` directory. Later, when Jekyll builds, the underscore directories are ignored by default which ends up with missing image in the main doc. Before: ![Screen Shot 2020-08-11 at 1 52 46 PM](https://user-images.githubusercontent.com/6477701/89859104-2e571080-dbdb-11ea-817c-c04bbcd4088e.png) After: ![Screen Shot 2020-08-11 at 1 49 00 PM](https://user-images.githubusercontent.com/6477701/89859105-30b96a80-dbdb-11ea-85c6-8a135eddf613.png) For `_sources` directory, Please refer [here](https://github.com/sphinx-contrib/sphinx-pretty-searchresults#source-links) and [here](https://www.sphinx-doc.org/en/master/usage/configuration.html#confval-html_copy_source). They are generated by default and used by default in the documentations by Sphinx, and we should better include them. ### Why are the changes needed? To show the images correctly in PySpark documentation. ### Does this PR introduce _any_ user-facing change? No, only in unreleased branches. ### How was this patch tested? Manually tested via: ```bash SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
AmplabJenkins removed a comment on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671725265 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
AmplabJenkins commented on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671725265 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
SparkQA removed a comment on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671657035 **[Test build #127297 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127297/testReport)** for PR 29396 at commit [`79f336c`](https://github.com/apache/spark/commit/79f336c5fcd8f703887dfa5e7ffb7f64eaff3a9f). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
cloud-fan commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468326123 ## File path: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ## @@ -395,7 +395,7 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSparkSession with .load("/test") assert(LastOptions.parameters("intOpt") == "56") -assert(LastOptions.parameters("path") == "/test") +assert(!LastOptions.parameters.contains("path")) Review comment: Sorry I missed one thing. For DS v1, `...load(path)` will generate a `path` option, which is visible to v1 sources, and we can't break it. It's strange that `...load(path1, path2)` will completely ignore the paths and don't pass paths to v1 sources, but there is nothing we can do here. We need to update `DataFrameReader.loadV1Source` to generate a `path` option if `paths.length == 1`, to keep the previous behavior. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
SparkQA commented on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671724764 **[Test build #127297 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127297/testReport)** for PR 29396 at commit [`79f336c`](https://github.com/apache/spark/commit/79f336c5fcd8f703887dfa5e7ffb7f64eaff3a9f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
cloud-fan commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468326123 ## File path: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ## @@ -395,7 +395,7 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSparkSession with .load("/test") assert(LastOptions.parameters("intOpt") == "56") -assert(LastOptions.parameters("path") == "/test") +assert(!LastOptions.parameters.contains("path")) Review comment: Sorry I missed one thing. For DS v1, `...load(path)` will generate a `path` option, which is visible to v1 sources, and we can't break it. It's strange that `...load(path1, path2)` will completely ignore the paths and don't pass paths to v1 sources, but there is nothing we can do here. We need to update `DataFrameReader.loadV1Source` to generate a `path` option if `paths.length == 1` (and set paths to Nil), to keep the previous behavior. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] c21 commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join
c21 commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-671723199 @cloud-fan - sorry if I miss anything, could you elaborate more of > We don't need to get the value index. We can calculate it by ourselves How do we calculate value index based on value itself? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
cloud-fan commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468324590 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ## @@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource { sparkSession, paths = paths, className = classOf[TextFileFormat].getName, -options = options.parameters +options = options.parameters - "path" Review comment: This is a good catch! BTW does it have to be fixed in this PR? The check is added in `DataFrameReader.load` and `DataSource.resolveRelation` should be fine. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
cloud-fan commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468324072 ## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ## @@ -245,12 +245,19 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { "read files of Hive data source directly.") } +if (extraOptions.contains("path") && paths.nonEmpty) { Review comment: Then we can make sure `path` and `paths` options wouldn't co-exist for DS v2, and you can make the change in https://github.com/apache/spark/pull/29328/files#r468295791 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29367: [SPARK-31198][CORE] Use graceful decommissioning as part of dynamic scaling
AmplabJenkins removed a comment on pull request #29367: URL: https://github.com/apache/spark/pull/29367#issuecomment-671721882 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
imback82 commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468286947 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ## @@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource { sparkSession, paths = paths, className = classOf[TextFileFormat].getName, -options = options.parameters +options = options.parameters - "path" Review comment: This is called during `infer` and "path" option gets added to `inputPaths` (if the option was originally defined), thus this needs to be removed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
imback82 commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468295791 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ## @@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource { sparkSession, paths = paths, className = classOf[TextFileFormat].getName, -options = options.parameters +options = options.parameters - "path" Review comment: I also thought about update the following https://github.com/apache/spark/blob/8659ec554fe1993fcaf2ec4d1c4745205bddec7d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L575 to ``` val allPaths = caseInsensitiveOptions.get("path").map(Seq(_)).getOrElse(paths) ``` , but it seemed a risky change (e.g., there is no good place for assert, etc.). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
cloud-fan commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468323610 ## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ## @@ -245,12 +245,19 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { "read files of Hive data source directly.") } +if (extraOptions.contains("path") && paths.nonEmpty) { Review comment: how about we forbid `paths` option as well for consistency? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29367: [SPARK-31198][CORE] Use graceful decommissioning as part of dynamic scaling
AmplabJenkins commented on pull request #29367: URL: https://github.com/apache/spark/pull/29367#issuecomment-671721882 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29367: [SPARK-31198][CORE] Use graceful decommissioning as part of dynamic scaling
SparkQA commented on pull request #29367: URL: https://github.com/apache/spark/pull/29367#issuecomment-671721570 **[Test build #127306 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127306/testReport)** for PR 29367 at commit [`a099152`](https://github.com/apache/spark/commit/a099152577214a531d8a57a8f08771e30e2e449a). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
imback82 commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468323154 ## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ## @@ -245,6 +246,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { "read files of Hive data source directly.") } Review comment: Ah OK. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
cloud-fan commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468323149 ## File path: docs/sql-migration-guide.md ## @@ -36,6 +36,8 @@ license: | - In Spark 3.1, NULL elements of structures, arrays and maps are converted to "null" in casting them to strings. In Spark 3.0 or earlier, NULL elements are converted to empty strings. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.castComplexTypesToString.enabled` to `true`. + - In Spark 3.1, when loading a dataframe, `path` option cannot coexist with `load()`'s path parameters. For example, `spark.read.format("csv").option("path", "/tmp").load("/tmp2")` or `spark.read.option("path", "/tmp").csv("/tmp2")` will throw `org.apache.spark.sql.AnalysisException`. Review comment: Let's also mention the legacy behavior in 3.0 and below. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
cloud-fan commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468322696 ## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ## @@ -245,6 +246,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { "read files of Hive data source directly.") } Review comment: parquet is a builtin source, so we can make it work anyway. `def load(path: String)` is general for other external sources, and we should guarantee backward compatibility here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join
cloud-fan commented on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-671719927 A few more thoughts: 1. For `keyIsUnique` code path, we know it's one key one value, I think we can still use bitset. 2. We don't need to get the value index. We can calculate it by ourselves since `get(key)` returns an `Iterator[InternalRow]`. e.g. `values.zipWithIndex`. I would expect to see a `def getWithKeyIndex(key): (Int, Iterator[InternalRow])`. We can even combine key index and value index into a long for better performance. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
imback82 commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468320706 ## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ## @@ -245,6 +246,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { "read files of Hive data source directly.") } Review comment: Hmm, `def parquet(path: String)` has been calling `def load(paths: String*)`, not `def load(path: String)`, and it seems to have worked fine so far (other than the inconsistency this PR addresses), no? Btw, we shouldn't remove `def load(path: String)` because it breaks the binary compatibility. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29389: [SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys
AmplabJenkins removed a comment on pull request #29389: URL: https://github.com/apache/spark/pull/29389#issuecomment-671718149 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29389: [SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys
AmplabJenkins commented on pull request #29389: URL: https://github.com/apache/spark/pull/29389#issuecomment-671718149 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] leanken commented on pull request #29389: [SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys
leanken commented on pull request #29389: URL: https://github.com/apache/spark/pull/29389#issuecomment-671718124 @cloud-fan Test passed, ready to merge. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29389: [SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys
SparkQA removed a comment on pull request #29389: URL: https://github.com/apache/spark/pull/29389#issuecomment-671646391 **[Test build #127296 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127296/testReport)** for PR 29389 at commit [`5309b66`](https://github.com/apache/spark/commit/5309b66b23af2ffe7999edbd8f0c11f26628805d). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29389: [SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys
SparkQA commented on pull request #29389: URL: https://github.com/apache/spark/pull/29389#issuecomment-671717620 **[Test build #127296 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127296/testReport)** for PR 29389 at commit [`5309b66`](https://github.com/apache/spark/commit/5309b66b23af2ffe7999edbd8f0c11f26628805d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
AmplabJenkins removed a comment on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671716088 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
AmplabJenkins commented on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671716088 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
SparkQA commented on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671715831 **[Test build #127305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127305/testReport)** for PR 29396 at commit [`3ef5414`](https://github.com/apache/spark/commit/3ef54141d4834ebbe8a7bddea57117e73034ba06). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue
gatorsmile commented on a change in pull request #28904: URL: https://github.com/apache/spark/pull/28904#discussion_r468306894 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala ## @@ -160,8 +152,42 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path: IOUtils.closeQuietly(input) } } else { - logDebug(s"Unable to find batch $batchMetadataFile") - None + throw new FileNotFoundException(s"Unable to find batch $batchMetadataFile") +} + } + + /** + * Store the metadata for the specified batchId and return `true` if successful. This method + * fills the content of metadata via executing function. If the function throws exception, Review comment: Nit: exception -> an exception This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
AmplabJenkins removed a comment on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671704714 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
AmplabJenkins commented on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671704714 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
SparkQA removed a comment on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671694607 **[Test build #127303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127303/testReport)** for PR 28991 at commit [`da76e1a`](https://github.com/apache/spark/commit/da76e1a7ba25ca96495cbb2ce56377983ee9f74f). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
SparkQA commented on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671704534 **[Test build #127303 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127303/testReport)** for PR 28991 at commit [`da76e1a`](https://github.com/apache/spark/commit/da76e1a7ba25ca96495cbb2ce56377983ee9f74f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue
gatorsmile commented on a change in pull request #28904: URL: https://github.com/apache/spark/pull/28904#discussion_r468306410 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala ## @@ -97,18 +97,13 @@ class FileStreamSinkLog( s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was $defaultCompactInterval) " + "to a positive value.") - override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = { -val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet -if (deletedFiles.isEmpty) { - logs -} else { - logs.filter(f => !deletedFiles.contains(f.path)) -} - } + override def shouldRetain(log: SinkFileStatus): Boolean = true } object FileStreamSinkLog { val VERSION = 1 + // TODO: This action hasn't been used from the introduction. We should just remove this. Review comment: Add a JIRA number in the comment? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue
gatorsmile commented on a change in pull request #28904: URL: https://github.com/apache/spark/pull/28904#discussion_r468305462 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala ## @@ -132,12 +130,18 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag]( } } + def serializeEntry(entry: T, out: OutputStream): Unit = { +out.write(Serialization.write(entry).getBytes(UTF_8)) + } + + def deserializeEntry(line: String): T = Serialization.read[T](line) Review comment: private? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue
gatorsmile commented on a change in pull request #28904: URL: https://github.com/apache/spark/pull/28904#discussion_r468305424 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala ## @@ -132,12 +130,18 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag]( } } + def serializeEntry(entry: T, out: OutputStream): Unit = { Review comment: private? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
cloud-fan commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468305334 ## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ## @@ -245,6 +246,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { "read files of Hive data source directly.") } Review comment: It's required, because ``` def load(path: String): DataFrame = { // force invocation of `load(...varargs...)` option("path", path).load(Seq.empty: _*) } ``` If we remove this method and always call `def load(paths: String*)`, we should make sure the behavior doesn't change, and we still generate "path" option if only one path is given. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue
gatorsmile commented on a change in pull request #28904: URL: https://github.com/apache/spark/pull/28904#discussion_r468305223 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala ## @@ -173,37 +177,64 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag]( override def purge(thresholdBatchId: Long): Unit = throw new UnsupportedOperationException( s"Cannot purge as it might break internal state.") + /** + * Apply function on all entries in the specific batch. The method will throw + * FileNotFoundException if the metadata log file doesn't exist. + * + * NOTE: This doesn't fail early on corruption. The caller should handle the exception + * properly and make sure the logic is not affected by failing in the middle. Review comment: How to ensure all the callers handle it properly? Do we need to introduce a test to cover it? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue
gatorsmile commented on a change in pull request #28904: URL: https://github.com/apache/spark/pull/28904#discussion_r468303420 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala ## @@ -173,37 +177,64 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag]( override def purge(thresholdBatchId: Long): Unit = throw new UnsupportedOperationException( s"Cannot purge as it might break internal state.") + /** + * Apply function on all entries in the specific batch. The method will throw + * FileNotFoundException if the metadata log file doesn't exist. + * + * NOTE: This doesn't fail early on corruption. The caller should handle the exception + * properly and make sure the logic is not affected by failing in the middle. Review comment: If the caller does not handle the exception properly, what is the consequence? Do we have a test case to cover it? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #29031: [SPARK-32216][SQL] Remove redundant ProjectExec
cloud-fan closed pull request #29031: URL: https://github.com/apache/spark/pull/29031 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #29031: [SPARK-32216][SQL] Remove redundant ProjectExec
cloud-fan commented on pull request #29031: URL: https://github.com/apache/spark/pull/29031#issuecomment-671699494 thanks, merging to master! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage
AmplabJenkins removed a comment on pull request #29276: URL: https://github.com/apache/spark/pull/29276#issuecomment-671696968 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage
AmplabJenkins commented on pull request #29276: URL: https://github.com/apache/spark/pull/29276#issuecomment-671696968 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29276: [SPARK-32470][CORE] Remove task result size check for shuffle map stage
SparkQA commented on pull request #29276: URL: https://github.com/apache/spark/pull/29276#issuecomment-671696614 **[Test build #127304 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127304/testReport)** for PR 29276 at commit [`e1e53ef`](https://github.com/apache/spark/commit/e1e53ef258013df2f85573de4542d36105ef852b). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
AmplabJenkins removed a comment on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671694990 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
AmplabJenkins removed a comment on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671695012 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127301/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
AmplabJenkins removed a comment on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671695006 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29369: [SPARK-32540][SQL] Eliminate the filter clause in aggregate
AmplabJenkins removed a comment on pull request #29369: URL: https://github.com/apache/spark/pull/29369#issuecomment-671694921 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
SparkQA removed a comment on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671694555 **[Test build #127301 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127301/testReport)** for PR 29396 at commit [`fb48c86`](https://github.com/apache/spark/commit/fb48c860d25ad4b8efcf03025dc904c976738f07). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
AmplabJenkins commented on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671694990 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29369: [SPARK-32540][SQL] Eliminate the filter clause in aggregate
AmplabJenkins commented on pull request #29369: URL: https://github.com/apache/spark/pull/29369#issuecomment-671694921 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
SparkQA commented on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671694997 **[Test build #127301 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127301/testReport)** for PR 29396 at commit [`fb48c86`](https://github.com/apache/spark/commit/fb48c860d25ad4b8efcf03025dc904c976738f07). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
AmplabJenkins commented on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671695006 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29369: [SPARK-32540][SQL] Eliminate the filter clause in aggregate
SparkQA commented on pull request #29369: URL: https://github.com/apache/spark/pull/29369#issuecomment-671694575 **[Test build #127302 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127302/testReport)** for PR 29369 at commit [`e4b4c5a`](https://github.com/apache/spark/commit/e4b4c5ac25b880fa4adc37bca82282ab34874f36). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
SparkQA commented on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671694555 **[Test build #127301 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127301/testReport)** for PR 29396 at commit [`fb48c86`](https://github.com/apache/spark/commit/fb48c860d25ad4b8efcf03025dc904c976738f07). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
SparkQA commented on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671694607 **[Test build #127303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127303/testReport)** for PR 28991 at commit [`da76e1a`](https://github.com/apache/spark/commit/da76e1a7ba25ca96495cbb2ce56377983ee9f74f). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
AmplabJenkins removed a comment on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671692831 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127300/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
imback82 commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468295791 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ## @@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource { sparkSession, paths = paths, className = classOf[TextFileFormat].getName, -options = options.parameters +options = options.parameters - "path" Review comment: I also thought about update the following https://github.com/apache/spark/blob/8659ec554fe1993fcaf2ec4d1c4745205bddec7d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L575 to ``` val allPaths = caseInsensitiveOptions.get("path").map(Seq(_)).getOrElse(paths) ``` , but it looked like a risky change (e.g., there is no good place for assert, etc.). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
AmplabJenkins removed a comment on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671692976 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
huaxingao commented on a change in pull request #29396: URL: https://github.com/apache/spark/pull/29396#discussion_r468295410 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala ## @@ -107,6 +111,41 @@ class JDBCTableCatalogSuite extends QueryTest with SharedSparkSession { } } + test("simple scan") { +checkAnswer(sql("SELECT name, id FROM h2.test.people"), Seq(Row("fred", 1), Row("mary", 2))) Review comment: fixed This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
AmplabJenkins commented on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671692976 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
SparkQA removed a comment on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671690512 **[Test build #127300 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127300/testReport)** for PR 28991 at commit [`4ca75ba`](https://github.com/apache/spark/commit/4ca75baea102fb1c2e7467ed1093d4c0520a5245). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
huaxingao commented on a change in pull request #29396: URL: https://github.com/apache/spark/pull/29396#discussion_r468295245 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala ## @@ -34,4 +38,16 @@ case class JDBCTable(ident: Identifier, schema: StructType, jdbcOptions: JDBCOpt capabilities.add(BATCH_READ) capabilities } + + override def newScanBuilder(options: CaseInsensitiveStringMap): JDBCScanBuilder = { +val mergedOptions = new JDBCOptions( + jdbcOptions.parameters.originalMap ++ options.asCaseSensitiveMap().asScala) +new JDBCScanBuilder(SparkSession.active, schema, mergedOptions) Review comment: removed This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
huaxingao commented on a change in pull request #29396: URL: https://github.com/apache/spark/pull/29396#discussion_r468295282 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCWriteBuilder.scala ## @@ -0,0 +1,47 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.datasources.v2.jdbc + +import org.apache.spark.sql._ +import org.apache.spark.sql.connector.write._ +import org.apache.spark.sql.execution.datasources.jdbc.{JdbcOptionsInWrite, JdbcUtils} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.sources.InsertableRelation +import org.apache.spark.sql.types.StructType + + Review comment: removed This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
AmplabJenkins removed a comment on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671692824 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
huaxingao commented on a change in pull request #29396: URL: https://github.com/apache/spark/pull/29396#discussion_r468295197 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala ## @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.datasources.v2.jdbc + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters, SupportsPushDownRequiredColumns} +import org.apache.spark.sql.execution.datasources.jdbc.{JDBCOptions, JDBCRDD, JDBCRelation} +import org.apache.spark.sql.jdbc.JdbcDialects +import org.apache.spark.sql.sources.Filter +import org.apache.spark.sql.types.StructType + +case class JDBCScanBuilder( +session: SparkSession, +schema: StructType, +jdbcOptions: JDBCOptions) + extends ScanBuilder with SupportsPushDownFilters with SupportsPushDownRequiredColumns { + + private var pushedFilter = Array.empty[Filter] + + private var prunedSchema = schema + + override def pushFilters(filters: Array[Filter]): Array[Filter] = { +if (jdbcOptions.pushDownPredicate) { + val dialect = JdbcDialects.get(jdbcOptions.url) + val (pushed, unSupported) = filters.partition(JDBCRDD.compileFilter(_, dialect).isDefined) + this.pushedFilter = pushed + unSupported +} else { + filters +} + } + + override def pushedFilters(): Array[Filter] = pushedFilter + + override def pruneColumns(requiredSchema: StructType): Unit = { +// JDBC doesn't support nested column pruning. +val requiredCols = requiredSchema.map(_.name) +prunedSchema = StructType(schema.fields.filter(f => requiredCols.contains(f.name))) Review comment: I followed the way in FileScanBuider and fixed this This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
SparkQA commented on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671692811 **[Test build #127300 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127300/testReport)** for PR 28991 at commit [`4ca75ba`](https://github.com/apache/spark/commit/4ca75baea102fb1c2e7467ed1093d4c0520a5245). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
AmplabJenkins commented on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671692824 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
huaxingao commented on pull request #29396: URL: https://github.com/apache/spark/pull/29396#issuecomment-671692513 > Could you add tests for write since you supported WriteBuilder in this PR. I will take JDBCV2Suite from @cloud-fan 's https://github.com/apache/spark/pull/27345. I will ignore the tests that don't work for now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
AmplabJenkins removed a comment on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671691009 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29401: [SPARK-32400][SQL] Improve test coverage of HiveScriptTransformationExec
AmplabJenkins removed a comment on pull request #29401: URL: https://github.com/apache/spark/pull/29401#issuecomment-671690941 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
AmplabJenkins commented on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671691009 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29401: [SPARK-32400][SQL] Improve test coverage of HiveScriptTransformationExec
AmplabJenkins commented on pull request #29401: URL: https://github.com/apache/spark/pull/29401#issuecomment-671690941 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #28991: [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver
SparkQA commented on pull request #28991: URL: https://github.com/apache/spark/pull/28991#issuecomment-671690512 **[Test build #127300 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127300/testReport)** for PR 28991 at commit [`4ca75ba`](https://github.com/apache/spark/commit/4ca75baea102fb1c2e7467ed1093d4c0520a5245). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29401: [SPARK-32400][SQL] Improve test coverage of HiveScriptTransformationExec
SparkQA commented on pull request #29401: URL: https://github.com/apache/spark/pull/29401#issuecomment-671690473 **[Test build #127299 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127299/testReport)** for PR 29401 at commit [`b9776bd`](https://github.com/apache/spark/commit/b9776bde0d73274b814dde713143b5286c5449ea). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu opened a new pull request #29401: [SPARK-32400][SQL] Improve test coverage of HiveScriptTransformationExec
AngersZh opened a new pull request #29401: URL: https://github.com/apache/spark/pull/29401 ### What changes were proposed in this pull request? 1. Extract common test case (no serde) to BasicScriptTransformationExecSuite 2. Add more test case for no serde mode about supported data type and behavior in `BasicScriptTransformationExecSuite` 3. Add more test case for hive serde mode about supported type and behavior in `HiveScriptTransformationExecSuite` ### Why are the changes needed? Improve test coverage of Script Transformation ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Added UT This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
SparkQA commented on pull request #29328: URL: https://github.com/apache/spark/pull/29328#issuecomment-671685821 **[Test build #127298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127298/testReport)** for PR 29328 at commit [`d344eb9`](https://github.com/apache/spark/commit/d344eb946d921c0d180069daac4f15438c452af4). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
AmplabJenkins removed a comment on pull request #29328: URL: https://github.com/apache/spark/pull/29328#issuecomment-671684138 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
AmplabJenkins commented on pull request #29328: URL: https://github.com/apache/spark/pull/29328#issuecomment-671684138 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] imback82 commented on a change in pull request #29328: [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters
imback82 commented on a change in pull request #29328: URL: https://github.com/apache/spark/pull/29328#discussion_r468286947 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala ## @@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource { sparkSession, paths = paths, className = classOf[TextFileFormat].getName, -options = options.parameters +options = options.parameters - "path" Review comment: This is called during `infer` and "path" option gets added to `inputPaths`, thus this needs to be removed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #28490: [SPARK-31670][SQL]Resolve Struct Field in Grouping Aggregate with same ExprId
AngersZh commented on pull request #28490: URL: https://github.com/apache/spark/pull/28490#issuecomment-671682512 Any update for this? cc @cloud-fan @viirya @HyukjinKwon This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org