[GitHub] spark issue #22503: [SPARK-25493] [SQL] Fix multiline crlf
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22503 Also, please fix the PR title to be more descriptive. For instance, `[SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22503: [SPARK-25493] [SQL] Fix multiline crlf
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22503 **[Test build #96485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96485/testReport)** for PR 22503 at commit [`2f349d7`](https://github.com/apache/spark/commit/2f349d7a779cd8f347b73ec59e2f4216450075f1).
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22326 Merged build finished. Test FAILed.
[GitHub] spark pull request #22503: [SPARK-25493] [SQL] Fix multiline crlf
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22503#discussion_r219688971

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -212,6 +212,7 @@ class CSVOptions(
         settings.setEmptyValue(emptyValueInRead)
         settings.setMaxCharsPerColumn(maxCharsPerColumn)
         settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
    +    settings.setLineSeparatorDetectionEnabled(true)
    --- End diff --

    Yup, I would rather enable this only for multiline mode. Also, please describe what this configuration does in the PR description.
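To make the review comment concrete, here is a minimal Python sketch of what line-separator auto-detection can look like and of gating it on multiline mode. It is an illustration only: `detect_line_separator` and `parser_settings` are hypothetical names, and the counting heuristic is an assumption, not the algorithm used by the CSV parser library behind `setLineSeparatorDetectionEnabled`.

```python
def detect_line_separator(sample: str, default: str = "\n") -> str:
    """Guess the line separator from a sample of the input.

    Hypothetical heuristic: pick the most frequent of CRLF, lone LF and lone CR.
    """
    crlf = sample.count("\r\n")
    lone_lf = sample.count("\n") - crlf   # LFs that are not part of a CRLF pair
    lone_cr = sample.count("\r") - crlf   # CRs that are not part of a CRLF pair
    counts = {"\r\n": crlf, "\n": lone_lf, "\r": lone_cr}
    best = max(counts, key=counts.get)
    # Fall back to the default when the sample contains no separator at all.
    return best if counts[best] > 0 else default


def parser_settings(multi_line: bool) -> dict:
    """Enable detection only in multiline mode, as the reviewer suggests."""
    return {"line_separator_detection": multi_line}
```

With this shape, per-line reading keeps its existing behaviour, and only multiline parsing pays for (and benefits from) detection.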
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22326 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96482/ Test FAILed.
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22326 **[Test build #96482 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96482/testReport)** for PR 22326 at commit [`caf6f94`](https://github.com/apache/spark/commit/caf6f94b980e877f02c57b9647bae7df5d4e16ae).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #22503: [SPARK-25493] [SQL] Fix multiline crlf
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22503 ok to test
[GitHub] spark issue #22529: [SPARK-25460][BRANCH-2.4][SS] DataSourceV2: SS sources d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22529 Merged build finished. Test PASSed.
[GitHub] spark issue #22529: [SPARK-25460][BRANCH-2.4][SS] DataSourceV2: SS sources d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22529 **[Test build #96484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96484/testReport)** for PR 22529 at commit [`b080b0d`](https://github.com/apache/spark/commit/b080b0d7cb018f739afd578c7952d5f23d3375e2).
[GitHub] spark issue #22529: [SPARK-25460][BRANCH-2.4][SS] DataSourceV2: SS sources d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22529 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3386/ Test PASSed.
[GitHub] spark issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21632 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96483/ Test FAILed.
[GitHub] spark issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21632 **[Test build #96483 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96483/testReport)** for PR 21632 at commit [`f0cb95f`](https://github.com/apache/spark/commit/f0cb95f6fd95b1819a028bdd674ea5f7c3a2e754).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21632 Merged build finished. Test FAILed.
[GitHub] spark issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21632 **[Test build #96483 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96483/testReport)** for PR 21632 at commit [`f0cb95f`](https://github.com/apache/spark/commit/f0cb95f6fd95b1819a028bdd674ea5f7c3a2e754).
[GitHub] spark issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21632 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3385/ Test PASSed.
[GitHub] spark issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21632 Merged build finished. Test PASSed.
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22326 **[Test build #96482 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96482/testReport)** for PR 22326 at commit [`caf6f94`](https://github.com/apache/spark/commit/caf6f94b980e877f02c57b9647bae7df5d4e16ae).
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22326 Merged build finished. Test PASSed.
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22326 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3384/ Test PASSed.
[GitHub] spark pull request #22480: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tes...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22480
[GitHub] spark issue #22480: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests fail...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22480 Thanks, @cloud-fan, @BryanCutler and @holdenk
[GitHub] spark issue #22480: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests fail...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22480 Merged only to master, since I assume we are more likely to hit these test failures on the master branch specifically.
[GitHub] spark issue #22529: [SPARK-25460][BRANCH-2.4][SS] DataSourceV2: SS sources d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22529 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96480/ Test FAILed.
[GitHub] spark issue #22529: [SPARK-25460][BRANCH-2.4][SS] DataSourceV2: SS sources d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22529 **[Test build #96480 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96480/testReport)** for PR 22529 at commit [`b6f8880`](https://github.com/apache/spark/commit/b6f8880ad6bdbcb721ca0863502ec4b6c85b162c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #22529: [SPARK-25460][BRANCH-2.4][SS] DataSourceV2: SS sources d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22529 Merged build finished. Test FAILed.
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22316#discussion_r219686833

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
    @@ -416,7 +426,7 @@ class RelationalGroupedDataset protected[sql](
         new RelationalGroupedDataset(
           df,
           groupingExprs,
    -      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(Literal.apply)))
    +      RelationalGroupedDataset.PivotType(pivotColumn.expr, values.map(lit(_).expr)))
    --- End diff --

    That's true in general, but specifically, is the decimal precision more correct?
[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22227 **[Test build #96481 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96481/testReport)** for PR 22227 at commit [`5c8f487`](https://github.com/apache/spark/commit/5c8f48715748bdeda703761fba6a4d1828a19985).
[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22227 ok to test
[GitHub] spark issue #22529: [SPARK-25460][BRANCH-2.4][SS] DataSourceV2: SS sources d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22529 Merged build finished. Test PASSed.
[GitHub] spark issue #22529: [SPARK-25460][BRANCH-2.4][SS] DataSourceV2: SS sources d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22529 **[Test build #96480 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96480/testReport)** for PR 22529 at commit [`b6f8880`](https://github.com/apache/spark/commit/b6f8880ad6bdbcb721ca0863502ec4b6c85b162c).
[GitHub] spark issue #22529: [SPARK-25460][BRANCH-2.4][SS] DataSourceV2: SS sources d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22529 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3383/ Test PASSed.
[GitHub] spark issue #22462: [SPARK-25460][SS] DataSourceV2: SS sources do not respec...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22462 The conflicts look like mainly renaming. I opened a backport - https://github.com/apache/spark/pull/22529
[GitHub] spark pull request #22529: [SPARK-25460][BRANCH-2.4][SS] DataSourceV2: SS so...
GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/22529

    [SPARK-25460][BRANCH-2.4][SS] DataSourceV2: SS sources do not respect SessionConfigSupport

    ## What changes were proposed in this pull request?

    This PR proposes to backport SPARK-25460 to branch-2.4:

    This PR proposes to respect `SessionConfigSupport` in SS datasources as well. Currently these are only respected in batch sources:

    https://github.com/apache/spark/blob/e06da95cd9423f55cdb154a2778b0bddf7be984c/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L198-L203

    https://github.com/apache/spark/blob/e06da95cd9423f55cdb154a2778b0bddf7be984c/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L249

    If a developer makes a datasource V2 that supports both structured streaming and batch jobs, batch jobs respect a specific configuration, let's say, a URL to connect to and fetch data from (which end users might not be aware of); however, structured streaming ends up not supporting this (the configuration has to be set explicitly in the options).

    ## How was this patch tested?

    Unit tests were added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-25460-backport

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22529.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22529

commit b6f8880ad6bdbcb721ca0863502ec4b6c85b162c
Author: hyukjinkwon
Date: 2018-09-20T12:22:55Z

    [SPARK-25460][SS] DataSourceV2: SS sources do not respect SessionConfigSupport

    This PR proposes to respect `SessionConfigSupport` in SS datasources as well. Currently these are only respected in batch sources:

    https://github.com/apache/spark/blob/e06da95cd9423f55cdb154a2778b0bddf7be984c/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L198-L203

    https://github.com/apache/spark/blob/e06da95cd9423f55cdb154a2778b0bddf7be984c/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L249

    If a developer makes a datasource V2 that supports both structured streaming and batch jobs, batch jobs respect a specific configuration, let's say, a URL to connect to and fetch data from (which end users might not be aware of); however, structured streaming ends up not supporting this (the configuration has to be set explicitly in the options).

    Unit tests were added.

    Closes #22462 from HyukjinKwon/SPARK-25460.

    Authored-by: hyukjinkwon
    Signed-off-by: Wenchen Fan
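The behaviour described in this PR can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Spark's implementation: the function names are hypothetical, and the `spark.datasource.<keyPrefix>.` naming of session-level keys is assumed here for illustration.

```python
def extract_session_configs(key_prefix: str, session_conf: dict) -> dict:
    """Collect session-level settings that belong to one data source.

    Assumption: such settings are stored as "spark.datasource.<key_prefix>.<name>".
    The returned dict is keyed by the bare <name>.
    """
    prefix = f"spark.datasource.{key_prefix}."
    return {
        key[len(prefix):]: value
        for key, value in session_conf.items()
        if key.startswith(prefix)
    }


def effective_options(key_prefix: str, session_conf: dict, explicit_options: dict) -> dict:
    """Merge session configs with per-query options; explicit options win."""
    merged = extract_session_configs(key_prefix, session_conf)
    merged.update(explicit_options)
    return merged
```

The point of the PR is that streaming reads and writes should go through the same merge as batch ones, so a source-wide setting such as a connection URL does not have to be repeated in every streaming query's options.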
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18544 **[Test build #96479 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96479/testReport)** for PR 18544 at commit [`623b282`](https://github.com/apache/spark/commit/623b282b2edf872cb4e4bd93e27837ac567854e1).
[GitHub] spark issue #21747: [SPARK-24165][SQL][branch-2.3] Fixing conditional expres...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21747 Can one of the admins verify this patch?
[GitHub] spark pull request #22523: [MINOR][PYSPARK] Always Close the tempFile in _se...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22523
[GitHub] spark issue #22523: [MINOR][PYSPARK] Always Close the tempFile in _serialize...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22523 and branch-2.4.
[GitHub] spark pull request #22523: [MINOR][PYSPARK] Always Close the tempFile in _se...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22523#discussion_r219686544

    --- Diff: python/pyspark/context.py ---
    @@ -537,8 +537,10 @@ def _serialize_to_jvm(self, data, serializer, reader_func, createRDDServer):
             # parallelize from there.
             tempFile = NamedTemporaryFile(delete=False, dir=self._temp_dir)
    --- End diff --

    Actually, we'd better use a context manager:

    ```python
    with NamedTemporaryFile(delete=False, dir=self._temp_dir) as tempfile:
        ...
    ```

    but not a big deal. LGTM
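A self-contained sketch of the context-manager suggestion follows. The `write_payload` helper is hypothetical and only stands in for the part of `_serialize_to_jvm` that writes the temp file; the point is that the `with` block closes the file even if the write raises, while `delete=False` leaves the file on disk for a later reader.

```python
import os
from tempfile import NamedTemporaryFile, TemporaryDirectory


def write_payload(data: bytes, temp_dir: str) -> str:
    """Write data to a named temp file and return its path.

    The file is always closed on leaving the `with` block, including on error.
    Because delete=False, the caller is responsible for removing the file.
    """
    with NamedTemporaryFile(delete=False, dir=temp_dir) as temp_file:
        temp_file.write(data)
    return temp_file.name


if __name__ == "__main__":
    with TemporaryDirectory() as temp_dir:
        path = write_payload(b"serialized rows", temp_dir)
        try:
            with open(path, "rb") as handle:
                print(len(handle.read()), "bytes written to", path)
        finally:
            os.unlink(path)  # explicit cleanup, since delete=False
```

Compared with a manual `try`/`finally` around `tempFile.close()`, this keeps the close and the write in one visible scope.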
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18544 **[Test build #96478 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96478/testReport)** for PR 18544 at commit [`53dc155`](https://github.com/apache/spark/commit/53dc1558ecdb64623d004e615a6000745989ceed).
[GitHub] spark pull request #22527: [SPARK-17952][SQL] Nested Java beans support in c...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22527#discussion_r219686433

    --- Diff: sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java ---
    @@ -171,7 +184,12 @@ void validateDataFrameWithBeans(Bean bean, Dataset df) {
           schema.apply("d"));
         Assert.assertEquals(new StructField("e", DataTypes.createDecimalType(38,0), true, Metadata.empty()), schema.apply("e"));
    -    Row first = df.select("a", "b", "c", "d", "e").first();
    +    Assert.assertEquals(new StructField("f",
    +      DataTypes.createStructType(Collections.singletonList(new StructField(
    +        "a", IntegerType$.MODULE$, false, Metadata.empty()))),
    +        true, Metadata.empty()),
    +        schema.apply("f"));
    --- End diff --

    should be double spaced.
[GitHub] spark pull request #22527: [SPARK-17952][SQL] Nested Java beans support in c...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22527#discussion_r219686429

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -1100,13 +1101,24 @@ object SQLContext {
           attrs: Seq[AttributeReference]): Iterator[InternalRow] = {
         val extractors =
           JavaTypeInference.getJavaBeanReadableProperties(beanClass).map(_.getReadMethod)
    -    val methodsToConverts = extractors.zip(attrs).map { case (e, attr) =>
    -      (e, CatalystTypeConverters.createToCatalystConverter(attr.dataType))
    +    val methodsToTypes = extractors.zip(attrs).map { case (e, attr) =>
    +      (e, attr.dataType)
    +    }
    +    def invoke(element: Any)(tuple: (Method, DataType)): Any = tuple match {
    +      case (e, structType: StructType) =>
    +        val value = e.invoke(element)
    +        val nestedExtractors = JavaTypeInference.getJavaBeanReadableProperties(value.getClass)
    +            .map(desc => desc.getName -> desc.getReadMethod)
    +            .toMap
    +        new GenericInternalRow(structType.map(nestedProperty =>
    +          invoke(value)(nestedExtractors(nestedProperty.name) -> nestedProperty.dataType)
    +        ).toArray)
    --- End diff --

    Why should we use a map here while we don't need it for the root bean?
[GitHub] spark issue #22517: Branch 2.3 how can i fix error use Pyspark
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22517 @lovezeropython please close this.
[GitHub] spark pull request #22528: [SPARK-25513][SQL] Read zipped CSV and JSON
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22528#discussion_r219686326

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala ---
    @@ -41,7 +42,12 @@ object CodecStreams {
         getDecompressionCodec(config, file)
           .map(codec => codec.createInputStream(inputStream))
    -      .getOrElse(inputStream)
    +      .orElse {
    +        if (file.getName.toLowerCase.endsWith(".zip")) {
    +          val zip = new ZipArchiveInputStream(inputStream)
    +          if (zip.getNextEntry != null) Some(zip) else None
    +        } else None
    +      }.getOrElse(inputStream)
    --- End diff --

    @MaxGekk, I understand that we can support zipped files, but isn't it difficult to extend this support to non-multiline modes as well? Basically, deflate is the same codec, and I wonder whether we really should allow zip specifically in multiline mode for CSV / JSON, with a clear restriction (single file). Please correct me if I misunderstood.
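The fallback in this diff (use the first entry of a `.zip` file, otherwise read the raw stream) can be modelled with a short Python sketch. `open_maybe_zipped` is a hypothetical name, and the sketch assumes the whole file is available in memory as bytes; unlike the streaming `ZipArchiveInputStream` in the diff, Python's `zipfile` needs a seekable source.

```python
import io
import zipfile
from typing import IO


def open_maybe_zipped(name: str, raw: bytes) -> IO[bytes]:
    """Return a readable byte stream for a file's contents.

    For a name ending in ".zip", the stream reads the first archive entry
    (the single-file restriction raised in the review). An archive without
    entries, or any other file name, falls back to the raw bytes.
    """
    if name.lower().endswith(".zip"):
        archive = zipfile.ZipFile(io.BytesIO(raw))
        entries = archive.namelist()
        if entries:
            return archive.open(entries[0])
    return io.BytesIO(raw)
```

Because the stream covers one whole entry and cannot be split, this only fits a mode where a file is read by a single task from start to end, which is why the review ties the feature to multiline mode.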
[GitHub] spark issue #22489: [SPARK-25425][SQL][BACKPORT-2.3] Extra options should ov...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22489 I've considered this for 2.3.3 since the 2.3.2 RC6 vote had already started. For now, I'm waiting for the result of the vote.
[GitHub] spark issue #22513: [SPARK-25499][TEST]Refactor BenchmarkBase and Benchmark
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22513 +1, late LGTM.
[GitHub] spark issue #22509: [SPARK-25384][SQL] Clarify fromJsonForceNullableSchema w...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22509 Sorry for missing this deprecation.
[GitHub] spark pull request #22499: [SPARK-25489][ML][TEST] Refactor UDTSerialization...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22499#discussion_r219685155

    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala ---
    @@ -18,52 +18,52 @@
     package org.apache.spark.mllib.linalg

     import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    -import org.apache.spark.util.Benchmark
    +import org.apache.spark.util.{Benchmark, BenchmarkBase => FileBenchmarkBase}

     /**
      * Serialization benchmark for VectorUDT.
    + * To run this benchmark:
    + * 1. without sbt: bin/spark-submit --class
    --- End diff --

    +1 for fixing the docs to pass Jenkins. Also, could you rebase this PR to resolve conflicts, @seancxmao?
[GitHub] spark issue #22528: [SPARK-25513][SQL] Read zipped CSV and JSON
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22528 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96477/ Test PASSed.
[GitHub] spark issue #22528: [SPARK-25513][SQL] Read zipped CSV and JSON
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22528 Merged build finished. Test PASSed.
[GitHub] spark issue #22528: [SPARK-25513][SQL] Read zipped CSV and JSON
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22528 **[Test build #96477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96477/testReport)** for PR 22528 at commit [`ec8ba0d`](https://github.com/apache/spark/commit/ec8ba0da6a29efb7f4dfeccb7cb68c2085c6890f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #22407: [SPARK-25416][SQL] ArrayPosition function may return inc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22407 Merged build finished. Test PASSed.
[GitHub] spark issue #22407: [SPARK-25416][SQL] ArrayPosition function may return inc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22407 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96475/ Test PASSed.
[GitHub] spark issue #22407: [SPARK-25416][SQL] ArrayPosition function may return inc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22407 **[Test build #96475 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96475/testReport)** for PR 22407 at commit [`55d4b95`](https://github.com/apache/spark/commit/55d4b950951892f3a239f960feadbe1a25198659).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #22522: [SPARK-25510][TEST] Create new trait replace Benc...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/22522
[GitHub] spark pull request #22484: [SPARK-25476][TEST] Refactor AggregateBenchmark t...
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22484#discussion_r219683925

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala ---
    @@ -34,621 +34,508 @@ import org.apache.spark.unsafe.map.BytesToBytesMap

     /**
      * Benchmark to measure performance for aggregate primitives.
    - * To run this:
    - *  build/sbt "sql/test-only *benchmark.AggregateBenchmark"
    - *
    - * Benchmarks in this file are skipped in normal builds.
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt: bin/spark-submit --class
    + *   2. build/sbt "sql/test:runMain "
    + *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
    + *      Results will be written to "benchmarks/AggregateBenchmark-results.txt".
    + * }}}
      */
    -class AggregateBenchmark extends BenchmarkWithCodegen {
    +object AggregateBenchmark extends RunBenchmarkWithCodegen {

    -  ignore("aggregate without grouping") {
    -    val N = 500L << 22
    -    val benchmark = new Benchmark("agg without grouping", N)
    -    runBenchmark("agg w/o group", N) {
    -      sparkSession.range(N).selectExpr("sum(id)").collect()
    +  override def benchmark(): Unit = {
    +    runBenchmark("aggregate without grouping") {
    +      val N = 500L << 22
    +      runBenchmark("agg w/o group", N) {
    --- End diff --

    Yes. Do you have a suggested name?
[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...
Github user erikerlandson commented on the issue: https://github.com/apache/spark/pull/13440 I think targeting 3.0 with a refactor makes the most sense. There's no way to do this without making small breaking changes, but slightly larger changes could clean up the design. `ImpurityCalculator` can subsume `Impurity`, and a more general rethinking of gain and impurity can be accommodated too.
[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/13440 Yeah I take your point that the trait Impurity already defines two methods, only one of which is implemented for each of the subclasses. It's already a funky design that probably should have been generalized differently. I think a rewrite for Spark 3 would be worthwhile, personally. I'm also not quite sure of the difference between the Impurity and ImpurityCalculator classes; it seems like Impurity should fold into ImpurityCalculator. Is the single method we really want to define something like `computeInformationGain(ImpurityCalculator, ImpurityCalculator)`? Even the new method you've added is not directly computing info gain, nor were the existing ones in Impurity. But that's the thing we need an abstraction for over several implementations, it seems. Well, I think either this gets a bigger redesign in 3.0, or we try to get it into 2.5 and accept some API changes. I think I lean towards a bolder breaking change to fix it up in 3.0, unless there's a pressing need for this metric. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22528: [SPARK-25513][SQL] Read zipped CSV and JSON
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22528 **[Test build #96477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96477/testReport)** for PR 22528 at commit [`ec8ba0d`](https://github.com/apache/spark/commit/ec8ba0da6a29efb7f4dfeccb7cb68c2085c6890f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22528: [SPARK-25513][SQL] Read zipped CSV and JSON
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/22528 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22528: [SPARK-25513][SQL] Read zipped CSV and JSON
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22528 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22528: [SPARK-25513][SQL] Read zipped CSV and JSON
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22528 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96476/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22528: [SPARK-25513][SQL] Read zipped CSV and JSON
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22528 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22528: [SPARK-25513][SQL] Read zipped CSV and JSON
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/22528 [SPARK-25513][SQL] Read zipped CSV and JSON ## What changes were proposed in this pull request? In this PR, I propose to support reading zip archives that contain **one** CSV or JSON file in multi-line mode. ## How was this patch tested? Added tests for CSV and JSON in which the zip archives are created by a Java library. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 read-zipped-csv-json Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22528.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22528 commit a926d277e0cecb4d2d66e6500a68e656da6e1d2f Author: Maxim Gekk Date: 2018-09-22T19:49:44Z Support zip archives commit 29716248b1ef504ab828c6b8af8ac78f1013923a Author: Maxim Gekk Date: 2018-09-22T19:49:59Z Add test for zipped CSV files commit 149e452d17cffecb024c29771dc05322295ba437 Author: Maxim Gekk Date: 2018-09-22T19:52:18Z Fix imports commit 1dff39eb7e06435551ab7ba0d0443b106e60e4b6 Author: Maxim Gekk Date: 2018-09-22T19:57:10Z Added a test for zipped JSON commit 09dff81b34600c05a3b30a135c32e9dcd40e5bae Author: Maxim Gekk Date: 2018-09-22T19:58:56Z Refactoring of the CSV test commit 5fda51a3505437c4a32f146940a908cd1557bbf5 Author: Maxim Gekk Date: 2018-09-22T20:02:37Z Make extension checking case agnostic --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
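For readers unfamiliar with the idea, here is a minimal sketch of reading the single entry of a zip archive as text, using only `java.util.zip` from the JDK. It is not the code from the PR: the object name `ZipSingleEntrySketch` and both helper methods are hypothetical, and the archive is built in memory purely so the example is self-contained.

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

// Hypothetical helper object; not part of Spark or of PR 22528.
object ZipSingleEntrySketch {

  /** Builds an in-memory zip archive with exactly one entry (stand-in for a file on disk). */
  def zipOf(entryName: String, content: String): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val zip = new ZipOutputStream(buffer)
    zip.putNextEntry(new ZipEntry(entryName))
    zip.write(content.getBytes(StandardCharsets.UTF_8))
    zip.closeEntry()
    zip.close()
    buffer.toByteArray
  }

  /** Reads the first entry of the archive as UTF-8 text; None if the archive has no entries. */
  def readFirstEntry(in: InputStream): Option[String] = {
    val zip = new ZipInputStream(in)
    try {
      Option(zip.getNextEntry).map { _ =>
        val out = new ByteArrayOutputStream()
        val chunk = new Array[Byte](4096)
        var n = zip.read(chunk)
        while (n != -1) { // -1 marks the end of the current entry
          out.write(chunk, 0, n)
          n = zip.read(chunk)
        }
        new String(out.toByteArray, StandardCharsets.UTF_8)
      }
    } finally zip.close()
  }
}
```

Because the whole entry is read as a single stream, a quoted CSV field that spans several lines stays intact, which is presumably why the proposal is limited to multi-line mode.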
[GitHub] spark issue #22484: [SPARK-25476][TEST] Refactor AggregateBenchmark to use m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22484 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96473/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22484: [SPARK-25476][TEST] Refactor AggregateBenchmark to use m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22484 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22484: [SPARK-25476][TEST] Refactor AggregateBenchmark to use m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22484 **[Test build #96473 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96473/testReport)** for PR 22484 at commit [`42230b6`](https://github.com/apache/spark/commit/42230b6e3edb731eb69b3b8800805805e2234d10). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait RunBenchmarkWithCodegen extends BenchmarkBase ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22326 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96474/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22326 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22326 **[Test build #96474 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96474/testReport)** for PR 22326 at commit [`e7c1aee`](https://github.com/apache/spark/commit/e7c1aeecff433ecdd272a9e2a85567d438152722). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class HandlePythonUDFInJoinCondition(conf: SQLConf) extends Rule[LogicalPlan] ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22407: [SPARK-25416][SQL] ArrayPosition function may return inc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22407 **[Test build #96475 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96475/testReport)** for PR 22407 at commit [`55d4b95`](https://github.com/apache/spark/commit/55d4b950951892f3a239f960feadbe1a25198659). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22407: [SPARK-25416][SQL] ArrayPosition function may return inc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22407 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3382/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22527: [SPARK-17952][SQL] Nested Java beans support in createDa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22527 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22407: [SPARK-25416][SQL] ArrayPosition function may return inc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22407 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22527: [SPARK-17952][SQL] Nested Java beans support in c...
GitHub user michalsenkyr opened a pull request: https://github.com/apache/spark/pull/22527 [SPARK-17952][SQL] Nested Java beans support in createDataFrame ## What changes were proposed in this pull request? When constructing a DataFrame from a Java bean, using nested beans throws an error despite [documentation](http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection) stating otherwise. This PR aims to add that support. This PR does not yet add nested beans support in array or List fields. This can be added later or in another PR. ## How was this patch tested? Nested bean was added to the appropriate unit test. Also manually tested in Spark shell on code emulating the referenced JIRA: ``` scala> import scala.beans.BeanProperty import scala.beans.BeanProperty scala> class SubCategory(@BeanProperty var id: String, @BeanProperty var name: String) extends Serializable defined class SubCategory scala> class Category(@BeanProperty var id: String, @BeanProperty var subCategory: SubCategory) extends Serializable defined class Category scala> import scala.collection.JavaConverters._ import scala.collection.JavaConverters._ scala> spark.createDataFrame(Seq(new Category("s-111", new SubCategory("sc-111", "Sub-1"))).asJava, classOf[Category]) java.lang.IllegalArgumentException: The value (SubCategory@65130cf2) of the type (SubCategory) cannot be converted to struct at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396) at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1108) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1108) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1108) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at scala.collection.Iterator$class.toStream(Iterator.scala:1320) at scala.collection.AbstractIterator.toStream(Iterator.scala:1334) at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298) at scala.collection.AbstractIterator.toSeq(Iterator.scala:1334) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:423) ... 
51 elided ``` New behavior:
```
scala> spark.createDataFrame(Seq(new Category("s-111", new SubCategory("sc-111", "Sub-1"))).asJava, classOf[Category])
res0: org.apache.spark.sql.DataFrame = [id: string, subCategory: struct]

scala> res0.show()
+-----+---------------+
|   id|    subCategory|
+-----+---------------+
|s-111|[sc-111, Sub-1]|
+-----+---------------+
```
You can merge this pull request into a Git repository by running: $ git pull https://github.com/michalsenkyr/spark SPARK-17952 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22527.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22527 commit ccea758b069c4622e9b1f71b92167c81cfcd81b8 Author: Michal Senkyr Date: 2018-09-22T18:25:36Z Add nested Java beans support to SQLContext.beansToRow --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
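The general idea of converting a nested bean can be sketched without Spark: walk the bean's readable properties with `java.beans.Introspector` and recurse whenever a property value is itself a bean. This is only an illustration under stated assumptions, not the change made in the PR. The object `NestedBeanSketch`, the `isLeaf` rule, and the use of plain `Seq[Any]` in place of Spark rows are all hypothetical, and properties are sorted by name here just to make the output order deterministic.

```scala
import java.beans.Introspector
import scala.beans.BeanProperty

// Hypothetical sketch; Spark's real conversion produces rows, not Seq[Any].
object NestedBeanSketch {
  class SubCategory(@BeanProperty var id: String, @BeanProperty var name: String)
  class Category(@BeanProperty var id: String, @BeanProperty var subCategory: SubCategory)

  // Assumption: only these values are treated as leaves; everything else is a nested bean.
  private def isLeaf(v: Any): Boolean =
    v == null || v.isInstanceOf[String] ||
      v.isInstanceOf[java.lang.Number] || v.isInstanceOf[java.lang.Boolean]

  /** Converts a bean into a sequence of property values, recursing into nested beans. */
  def beanToSeq(bean: AnyRef): Seq[Any] = {
    val props = Introspector.getBeanInfo(bean.getClass).getPropertyDescriptors
      .filter(p => p.getReadMethod != null && p.getName != "class") // skip getClass
      .sortBy(_.getName)
    props.toSeq.map { p =>
      p.getReadMethod.invoke(bean) match {
        case v if isLeaf(v) => v
        case nested => beanToSeq(nested) // nested bean becomes a nested sequence
      }
    }
  }
}
```

With the `Category` instance from the shell session above, `beanToSeq` yields the `id` value followed by a nested sequence for `subCategory`, mirroring the struct column in the new output.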
[GitHub] spark issue #21816: [SPARK-24794][CORE] Driver launched through rest should ...
Github user bsikander commented on the issue: https://github.com/apache/spark/pull/21816 Could someone please have a look at this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22467 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22522: [SPARK-25510][TEST] Create new trait replace BenchmarkWi...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22522 @wangyum I have left my comment in https://github.com/apache/spark/pull/22484 . Also, should we close this one and move to https://github.com/apache/spark/pull/22484 ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22484: [SPARK-25476][TEST] Refactor AggregateBenchmark t...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22484#discussion_r219676161 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala --- @@ -34,621 +34,508 @@ import org.apache.spark.unsafe.map.BytesToBytesMap /** * Benchmark to measure performance for aggregate primitives. - * To run this: - * build/sbt "sql/test-only *benchmark.AggregateBenchmark" - * - * Benchmarks in this file are skipped in normal builds. + * To run this benchmark: + * {{{ + * 1. without sbt: bin/spark-submit --class + * 2. build/sbt "sql/test:runMain " + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain " + * Results will be written to "benchmarks/AggregateBenchmark-results.txt". + * }}} */ -class AggregateBenchmark extends BenchmarkWithCodegen { +object AggregateBenchmark extends RunBenchmarkWithCodegen { - ignore("aggregate without grouping") { -val N = 500L << 22 -val benchmark = new Benchmark("agg without grouping", N) -runBenchmark("agg w/o group", N) { - sparkSession.range(N).selectExpr("sum(id)").collect() + override def benchmark(): Unit = { +runBenchmark("aggregate without grouping") { + val N = 500L << 22 + runBenchmark("agg w/o group", N) { --- End diff -- The `runBenchmark` here is different from the one on line 48, but they have the same name. One of them should be renamed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22326 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22326 **[Test build #96474 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96474/testReport)** for PR 22326 at commit [`e7c1aee`](https://github.com/apache/spark/commit/e7c1aeecff433ecdd272a9e2a85567d438152722). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/22326 @cloud-fan Many thanks for your offline guidance. As we discussed, I reimplemented this by adding a new rule `HandlePythonUDFInJoinCondition` to the Analyzer and reverted all the earlier changes in `Optimizer`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22326 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3381/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22326: [SPARK-25314][SQL] Fix Python UDF accessing attri...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/22326#discussion_r219675105 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -995,7 +995,8 @@ class Dataset[T] private[sql]( // After the cloning, left and right side will have distinct expression ids. val plan = withPlan( Join(logicalPlan, right.logicalPlan, JoinType(joinType), Some(joinExprs.expr))) - .queryExecution.analyzed.asInstanceOf[Join] + .queryExecution.analyzed +val joinPlan = plan.collectFirst { case j: Join => j }.get --- End diff -- For reviewers: we need this change because the rule `HandlePythonUDFInJoinCondition` breaks the assumption that the analyzed join plan is always a bare `Join`. Once the rule for handling Python UDFs is in place, the analyzer may add a Filter or Project node on top of the Join. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
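The point about `collectFirst` can be illustrated with a toy plan tree. This is a simplified sketch, not Catalyst's actual classes: the `Plan`, `Relation`, `Join`, and `Filter` types below are hypothetical stand-ins, and the `collectFirst` here is a minimal pre-order search written for the example.

```scala
// Toy stand-ins for logical plan nodes; these are NOT Spark's classes.
object CollectFirstSketch {
  sealed trait Plan {
    def children: Seq[Plan]

    /** Pre-order search: returns the first node (this one or a descendant) matched by pf. */
    def collectFirst[B](pf: PartialFunction[Plan, B]): Option[B] =
      if (pf.isDefinedAt(this)) Some(pf(this))
      else children.iterator.map(_.collectFirst(pf)).collectFirst { case Some(b) => b }
  }

  case class Relation(name: String) extends Plan {
    def children: Seq[Plan] = Nil
  }
  case class Join(left: Plan, right: Plan) extends Plan {
    def children: Seq[Plan] = Seq(left, right)
  }
  case class Filter(condition: String, child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
  }

  val join: Join = Join(Relation("a"), Relation("b"))

  // After a rewrite rule runs, the root is a Filter wrapping the Join,
  // so casting the root to Join would fail with a ClassCastException.
  val analyzed: Plan = Filter("udf(a.x, b.y)", join)

  // Searching the tree still finds the Join regardless of what sits above it.
  val found: Option[Join] = analyzed.collectFirst { case j: Join => j }
}
```

The trade-off is that the search returns the first `Join` it meets, so it relies on the wrapped join being the topmost one in the tree.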
[GitHub] spark issue #22484: [SPARK-25476][TEST] Refactor AggregateBenchmark to use m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22484 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3380/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22522: [SPARK-25510][TEST] Create new trait replace Benc...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22522#discussion_r219674900 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/RunBenchmarkWithCodegen.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import org.apache.spark.benchmark.Benchmark +import org.apache.spark.sql.SparkSession + +/** + * Common base trait for micro benchmarks that are supposed to run standalone (i.e. not together + * with other benchmarks). + */ +private[benchmark] trait RunBenchmarkWithCodegen { --- End diff -- Done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22484: [SPARK-25476][TEST] Refactor AggregateBenchmark to use m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22484 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22522: [SPARK-25510][TEST] Create new trait replace BenchmarkWi...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22522 Thanks @cloud-fan. I have migrated [`AggregateBenchmark`](https://github.com/apache/spark/pull/22484/files) to use the new trait. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22484: [SPARK-25476][TEST] Refactor AggregateBenchmark to use m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22484 **[Test build #96473 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96473/testReport)** for PR 22484 at commit [`42230b6`](https://github.com/apache/spark/commit/42230b6e3edb731eb69b3b8800805805e2234d10). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19868 Sure, updated. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22525: [SPARK-25503][WEBUI] Total task message in stage page is...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/22525 cc @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22526: [SPARK-25502][WEBUI]Empty Page when page number exceeds ...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/22526 cc @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22526: [SPARK-25502]Empty Page when page number exceeds the rea...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22526 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22526: [SPARK-25502]Empty Page when page number exceeds ...
GitHub user shahidki31 opened a pull request: https://github.com/apache/spark/pull/22526 [SPARK-25502]Empty Page when page number exceeds the reatinedTask size. ## What changes were proposed in this pull request? Test steps: 1) bin/spark-shell --conf spark.ui.retainedTasks=200 2) val rdd = sc.parallelize(1 to 1000, 1000) 3) rdd.count The Stage tab in the UI will display 10 pages with 100 tasks per page. But the number of retained tasks is only 200, so every page from the 3rd onwards displays nothing. We have to calculate the total number of pages based on the number of tasks that need to be displayed in the UI. **Before the change:** ![empty_4](https://user-images.githubusercontent.com/23054875/45918251-b1650580-bea1-11e8-90d3-7e0d491981a2.jpg) **After the change:** ![empty_3](https://user-images.githubusercontent.com/23054875/45918257-c2ae1200-bea1-11e8-960f-dfbdb4a90ae7.jpg) ## How was this patch tested? Manually tested Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/shahidki31/spark SPARK-25502 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22526.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22526 commit 6204cbe46b99cc6d897dbcebec81e89b369d58d2 Author: Shahid Date: 2018-09-22T14:07:22Z [SPARK-25502]Empty Page when page number exceeds the reatinedTask size. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
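The arithmetic behind the fix can be sketched as follows. This is an illustration of the page-count calculation described above, not the code in the PR; the object `PaginationSketch` and its method and parameter names are hypothetical.

```scala
// Hypothetical sketch of the page-count arithmetic; not the Spark UI code.
object PaginationSketch {

  /** Pages needed when only the retained tasks can actually be displayed. */
  def totalPages(totalTasks: Int, retainedTasks: Int, pageSize: Int): Int = {
    require(pageSize > 0, "pageSize must be positive")
    val displayable = math.min(totalTasks, retainedTasks)
    (displayable + pageSize - 1) / pageSize // ceiling division
  }

  /** Page count when the retention limit is ignored (the behavior being fixed). */
  def naivePages(totalTasks: Int, pageSize: Int): Int = {
    require(pageSize > 0, "pageSize must be positive")
    (totalTasks + pageSize - 1) / pageSize
  }
}
```

For the test steps above (1000 tasks, 200 retained, 100 per page), the naive count is 10 pages while the count based on retained tasks is 2, which matches the observation that pages 3 to 10 are empty.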
[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22518 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22518: [SPARK-25482][SQL] ReuseSubquery can be useless when the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22518 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96472/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org