[GitHub] spark issue #19767: [SPARK-22543][SQL] fix java 64kb compile error for deepl...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19767 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19728 Merged build finished. Test PASSed.
[GitHub] spark issue #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19728 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83983/ Test PASSed.
[GitHub] spark issue #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19728 **[Test build #83983 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83983/testReport)** for PR 19728 at commit [`b9ca4ff`](https://github.com/apache/spark/commit/b9ca4ff8cefadcf02bd6c03bad53429da042f700). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19082 Merged build finished. Test PASSed.
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19082 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83982/ Test PASSed.
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19082 **[Test build #83982 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83982/testReport)** for PR 19082 at commit [`4c2816b`](https://github.com/apache/spark/commit/4c2816bd6513dc3ab2568584ecc8da6d1547fbf6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19767: [SPARK-22543][SQL] fix java 64kb compile error fo...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19767#discussion_r151830994

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -105,6 +105,41 @@ abstract class Expression extends TreeNode[Expression] {
     val isNull = ctx.freshName("isNull")
     val value = ctx.freshName("value")
     val ve = doGenCode(ctx, ExprCode("", isNull, value))
+
+    // TODO: support whole stage codegen too
+    if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
+      val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") {
+        val globalIsNull = ctx.freshName("globalIsNull")
+        ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")
+        val localIsNull = ve.isNull
+        ve.isNull = globalIsNull
+        s"$globalIsNull = $localIsNull;"
+      } else {
+        ""
+      }
+
+      val setValue = {
+        val globalValue = ctx.freshName("globalValue")
+        ctx.addMutableState(
+          ctx.javaType(dataType), globalValue, s"$globalValue = ${ctx.defaultValue(dataType)};")
+        val localValue = ve.value
+        ve.value = globalValue
+        s"$globalValue = $localValue;"
+      }
+
+      val funcName = ctx.freshName(nodeName)
+      val funcFullName = ctx.addNewFunction(funcName,
+        s"""
+           |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
+           |  ${ve.code.trim}
+           |  $setValue
--- End diff --

Would it be a bad idea to prepare a utility class to store the (value, isNull) pair for these splitting cases? I feel class fields are valuable resources.
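The field-promotion trick being reviewed here is easier to see in miniature. Below is a hedged Java sketch of the *shape* of the generated code: the expression body moves into a private method (to stay under the JVM's 64KB-per-method bytecode limit), and its local `value`/`isNull` results are promoted to instance fields so the caller can read them after the call. The class, names, and the toy expression are illustrative, not Spark's actual codegen output.

```java
// Sketch of the generated-code shape discussed above. Names are hypothetical.
class GeneratedExpr {
    // Promoted from method locals to fields so the results survive the call
    // into the split-out method (this is what addMutableState produces).
    private boolean globalIsNull = false;
    private int globalValue = 0;

    // The (potentially very large) expression body, split into its own method.
    // Here it is a trivial stand-in: doubles non-negative inputs, null otherwise.
    private void evalExpr(int input) {
        boolean localIsNull = (input < 0);
        int localValue = localIsNull ? 0 : input * 2;
        globalIsNull = localIsNull;   // the generated "$globalIsNull = $localIsNull;"
        globalValue = localValue;     // the generated "$globalValue = $localValue;"
    }

    // Caller reads the promoted fields after invoking the split-out method.
    int eval(int input) {
        evalExpr(input);
        return globalIsNull ? -1 : globalValue;
    }
}
```

The reviewer's point is visible here: every split expression burns two instance fields, which is why a shared (value, isNull) holder class could be attractive.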
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r151830301

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala ---
@@ -87,4 +95,109 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
         Row(Seq(2, 3 } }
+
+  test("parquet timestamp conversion") {
+    // Make a table with one parquet file written by impala, and one parquet file written by spark.
+    // We should only adjust the timestamps in the impala file, and only if the conf is set
+    val impalaFile = "test-data/impala_timestamp.parq"
+
+    // here are the timestamps in the impala file, as they were saved by impala
+    val impalaFileData =
+      Seq(
+        "2001-01-01 01:01:01",
+        "2002-02-02 02:02:02",
+        "2003-03-03 03:03:03"
+      ).map(java.sql.Timestamp.valueOf)
+    val impalaPath = Thread.currentThread().getContextClassLoader.getResource(impalaFile)
+      .toURI.getPath
+    withTempPath { tableDir =>
+      val ts = Seq(
+        "2004-04-04 04:04:04",
+        "2005-05-05 05:05:05",
+        "2006-06-06 06:06:06"
+      ).map { s => java.sql.Timestamp.valueOf(s) }
+      import testImplicits._
+      // match the column names of the file from impala
+      val df = spark.createDataset(ts).toDF().repartition(1).withColumnRenamed("value", "ts")
+      df.write.parquet(tableDir.getAbsolutePath)
+      FileUtils.copyFile(new File(impalaPath), new File(tableDir, "part-1.parq"))
+
+      Seq(false, true).foreach { int96TimestampConversion =>
+        Seq(false, true).foreach { vectorized =>
+          withSQLConf(
+            (SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key,
+              SQLConf.ParquetOutputTimestampType.INT96.toString),
+            (SQLConf.PARQUET_INT96_TIMESTAMP_CONVERSION.key, int96TimestampConversion.toString()),
+            (SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, vectorized.toString())
+          ) {
+            val readBack = spark.read.parquet(tableDir.getAbsolutePath).collect()
+            assert(readBack.size === 6)
+            // if we apply the conversion, we'll get the "right" values, as saved by impala in the
+            // original file. Otherwise, they're off by the local timezone offset, set to
+            // America/Los_Angeles in tests
+            val impalaExpectations = if (int96TimestampConversion) {
+              impalaFileData
+            } else {
+              impalaFileData.map { ts =>
+                DateTimeUtils.toJavaTimestamp(DateTimeUtils.convertTz(
+                  DateTimeUtils.fromJavaTimestamp(ts),
+                  DateTimeUtils.TimeZoneUTC,
+                  DateTimeUtils.getTimeZone(conf.sessionLocalTimeZone)))
+              }
+            }
+            val fullExpectations = (ts ++ impalaExpectations).map(_.toString).sorted.toArray
+            val actual = readBack.map(_.getTimestamp(0).toString).sorted
+            withClue(s"applyConversion = $int96TimestampConversion; vectorized = $vectorized") {
+              assert(fullExpectations === actual)
+
+              // Now test that the behavior is still correct even with a filter which could get
+              // pushed down into parquet. We don't need extra handling for pushed down
+              // predicates because (a) in ParquetFilters, we ignore TimestampType and (b) parquet
--- End diff --

How about adding an assertion or a comment to ParquetFilters that it would be unsafe to add timestamp support?
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r151830046

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -355,17 +362,33 @@ class ParquetFileFormat
         fileSplit.getLocations, null)

+      val sharedConf = broadcastedHadoopConf.value.value
+      // PARQUET_INT96_TIMESTAMP_CONVERSION says to apply timezone conversions to int96 timestamps
+      // *only* if the file was created by something other than "parquet-mr", so check the actual
+      // writer here for this file. We have to do this per-file, as each file in the table may
+      // have different writers.
+      def isCreatedByParquetMr(): Boolean = {
+        val footer = ParquetFileReader.readFooter(sharedConf, fileSplit.getPath, SKIP_ROW_GROUPS)
--- End diff --

Does it make more sense to have VectorizedParquetRecordReader() do this in initialize() rather than here? There doesn't seem to be a way to share footer metadata across record readers, which would be helpful to avoid calling readFooter() multiple times.
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r151830623

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java ---
@@ -430,9 +439,11 @@ private void readBinaryBatch(int rowId, int num, WritableColumnVector column) {
     } else if (column.dataType() == DataTypes.TimestampType) {
       for (int i = 0; i < num; i++) {
         if (defColumn.readInteger() == maxDefLevel) {
-          column.putLong(rowId + i,
-            // Read 12 bytes for INT96
-            ParquetRowConverter.binaryToSQLTimestamp(data.readBinary(12)));
+          // Read 12 bytes for INT96
+          long rawTime = ParquetRowConverter.binaryToSQLTimestamp(data.readBinary(12));
+          long adjTime =
+            convertTz == null ? rawTime : DateTimeUtils.convertTz(rawTime, convertTz, UTC);
--- End diff --

It might be worth hoisting the conditional here as high as possible:

    if (convertTz == null) {
      for (int i = 0; i < num; i++) { /* etc */ }
    } else {
      for (int i = 0; i < num; i++) { /* etc */ }
    }

In this case the code duplication might be worth it to avoid a branch on every read.
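The hoisting suggested here is classic loop unswitching: test the loop-invariant condition once and run a branch-free inner loop on each side. A minimal Java sketch under assumed types, with plain `long[]` batches and a stand-in fixed-offset conversion instead of Spark's `DateTimeUtils.convertTz`:

```java
// Hypothetical stand-in for the batched timestamp read. `tzOffset` being null
// models convertTz == null (no conversion needed).
class TimestampBatchReader {
    static void readBatch(long[] raw, long[] out, Long tzOffset) {
        if (tzOffset == null) {
            // No conversion: plain copy, no per-element branch.
            for (int i = 0; i < raw.length; i++) {
                out[i] = raw[i];
            }
        } else {
            // Conversion applied to every element; the branch cost is paid once
            // per batch instead of once per row. The duplicated loop body is the
            // price of removing the branch from the hot path.
            long off = tzOffset;
            for (int i = 0; i < raw.length; i++) {
                out[i] = raw[i] + off;
            }
        }
    }
}
```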
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r151830543

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java ---
@@ -298,7 +304,10 @@ private void decodeDictionaryIds(
         // TODO: Convert dictionary of Binaries to dictionary of Longs
         if (!column.isNullAt(i)) {
           Binary v = dictionary.decodeToBinary(dictionaryIds.getDictId(i));
-          column.putLong(i, ParquetRowConverter.binaryToSQLTimestamp(v));
+          long rawTime = ParquetRowConverter.binaryToSQLTimestamp(v);
+          long adjTime =
--- End diff --

Is it practical to consider caching the decoded and converted dictionary values?
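The caching idea floated here amounts to: convert each distinct dictionary entry once, then per-row decoding becomes an array lookup by dictionary id. The types and the offset-based conversion below are stand-ins, not Spark's or Parquet's actual dictionary API.

```java
// Hypothetical sketch: precompute the timezone-converted dictionary, then
// fill the column by id lookup instead of re-converting per row.
class DictionaryCache {
    // Convert once per distinct dictionary value (the expensive step).
    static long[] convertDictionary(long[] rawDict, long tzOffsetMicros) {
        long[] converted = new long[rawDict.length];
        for (int i = 0; i < rawDict.length; i++) {
            converted[i] = rawDict[i] + tzOffsetMicros;  // stand-in conversion
        }
        return converted;
    }

    // Per-row work is now just an indexed read of the cached dictionary.
    static void fillColumn(int[] dictIds, long[] convertedDict, long[] out) {
        for (int i = 0; i < dictIds.length; i++) {
            out[i] = convertedDict[dictIds[i]];
        }
    }
}
```

This pays off when the dictionary is much smaller than the row count, which is exactly when Parquet chooses dictionary encoding.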
[GitHub] spark issue #19778: [SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19778 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83980/ Test PASSed.
[GitHub] spark issue #19778: [SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19778 Merged build finished. Test PASSed.
[GitHub] spark issue #19778: [SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19778 **[Test build #83980 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83980/testReport)** for PR 19778 at commit [`feecef0`](https://github.com/apache/spark/commit/feecef0995d55a90c3cac8c4fc5c3680319b3def). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19778: [SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19778 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83979/ Test PASSed.
[GitHub] spark issue #19778: [SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19778 Merged build finished. Test PASSed.
[GitHub] spark issue #19778: [SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19778 **[Test build #83979 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83979/testReport)** for PR 19778 at commit [`06b103e`](https://github.com/apache/spark/commit/06b103e556756d567ae45626a9a62b8ffb777e36). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19777: [SPARK-22549][SQL] Fix 64KB JVM bytecode limit problem w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19777 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83978/ Test PASSed.
[GitHub] spark issue #19777: [SPARK-22549][SQL] Fix 64KB JVM bytecode limit problem w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19777 Merged build finished. Test PASSed.
[GitHub] spark issue #19777: [SPARK-22549][SQL] Fix 64KB JVM bytecode limit problem w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19777 **[Test build #83978 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83978/testReport)** for PR 19777 at commit [`1714c88`](https://github.com/apache/spark/commit/1714c883d193f96e43c19e72a96c15c23d5fc193). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/17436#discussion_r151828392

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -140,6 +140,13 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

+  val COLUMN_VECTOR_OFFHEAP_ENABLED =
+    buildConf("spark.sql.columnVector.offheap.enable")
+      .internal()
+      .doc("When true, use OffHeapColumnVector in ColumnarBatch.")
+      .booleanConf
+      .createWithDefault(true)
--- End diff --

Sorry, this was test code in my local environment.
[GitHub] spark issue #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19728 **[Test build #83983 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83983/testReport)** for PR 19728 at commit [`b9ca4ff`](https://github.com/apache/spark/commit/b9ca4ff8cefadcf02bd6c03bad53429da042f700).
[GitHub] spark issue #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem w...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19728 Sure done. #19777 for `concat_ws`. #19778 for `elt`
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19082 **[Test build #83982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83982/testReport)** for PR 19082 at commit [`4c2816b`](https://github.com/apache/spark/commit/4c2816bd6513dc3ab2568584ecc8da6d1547fbf6).
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19082 Merged build finished. Test FAILed.
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19082 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83981/ Test FAILed.
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19082 **[Test build #83981 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83981/testReport)** for PR 19082 at commit [`28b47e5`](https://github.com/apache/spark/commit/28b47e52bb262edaf73ad2aa44a2db68eb3a9847). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19082 **[Test build #83981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83981/testReport)** for PR 19082 at commit [`28b47e5`](https://github.com/apache/spark/commit/28b47e52bb262edaf73ad2aa44a2db68eb3a9847).
[GitHub] spark issue #19778: [SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19778 **[Test build #83980 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83980/testReport)** for PR 19778 at commit [`feecef0`](https://github.com/apache/spark/commit/feecef0995d55a90c3cac8c4fc5c3680319b3def).
[GitHub] spark issue #19778: [SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19778 **[Test build #83979 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83979/testReport)** for PR 19778 at commit [`06b103e`](https://github.com/apache/spark/commit/06b103e556756d567ae45626a9a62b8ffb777e36).
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r151826928

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -689,6 +689,11 @@ case class DescribeColumnCommand(
       buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL"))
       buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL"))
       buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL"))
+      buffer ++= cs.flatMap(_.histogram.map { hist =>
+        val header = Row("histogram", s"height: ${hist.height}, num_of_bins: ${hist.bins.length}")
+        Seq(header) ++ hist.bins.map(bin =>
+          Row("", s"lower_bound: ${bin.lo}, upper_bound: ${bin.hi}, distinct_count: ${bin.ndv}"))
--- End diff --

Add a comment that the empty column is for indentation and better readability?
[GitHub] spark pull request #19778: [SPARK-22550][SQL] Fix 64KB JVM bytecode limit pr...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/19778

[SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem with elt

## What changes were proposed in this pull request?
This PR changes `elt` code generation to place the generated code for the argument expressions into separate methods when their total size could be large. This resolves the 64KB limit problem for `elt` with many arguments.

## How was this patch tested?
Added new test cases to `StringExpressionsSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kiszk/spark SPARK-22550

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19778.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19778

commit 06b103e556756d567ae45626a9a62b8ffb777e36
Author: Kazuaki Ishizaki
Date: 2017-11-18T03:07:31Z

    initial commit
[GitHub] spark issue #19777: [SPARK-22549][SQL] Fix 64KB JVM bytecode limit problem w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19777 **[Test build #83978 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83978/testReport)** for PR 19777 at commit [`1714c88`](https://github.com/apache/spark/commit/1714c883d193f96e43c19e72a96c15c23d5fc193).
[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19774 Seems good to me. ping @gatorsmile @cloud-fan @HyukjinKwon
[GitHub] spark pull request #19777: [SPARK-22549][SQL] Fix 64KB JVM bytecode limit pr...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/19777

[SPARK-22549][SQL] Fix 64KB JVM bytecode limit problem with concat_ws

## What changes were proposed in this pull request?
This PR changes `concat_ws` code generation to place the generated code for the argument expressions into separate methods when their total size could be large. This resolves the 64KB limit problem for `concat_ws` with many arguments.

## How was this patch tested?
Added new test cases to `StringExpressionsSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kiszk/spark SPARK-22549

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19777.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19777

commit 1714c883d193f96e43c19e72a96c15c23d5fc193
Author: Kazuaki Ishizaki
Date: 2017-11-18T02:40:00Z

    initial commit
[GitHub] spark issue #19760: [SPARK-22533][core] Handle deprecated names in ConfigEnt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19760 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83977/ Test PASSed.
[GitHub] spark issue #19760: [SPARK-22533][core] Handle deprecated names in ConfigEnt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19760 Merged build finished. Test PASSed.
[GitHub] spark issue #19760: [SPARK-22533][core] Handle deprecated names in ConfigEnt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19760 **[Test build #83977 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83977/testReport)** for PR 19760 at commit [`8afc56a`](https://github.com/apache/spark/commit/8afc56a825fdb4bce8aac683348bbd4223a29acc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19776: [SPARK-22548][SQL] Incorrect nested AND expression pushe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19776 Can one of the admins verify this patch?
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
GitHub user jliwork opened a pull request: https://github.com/apache/spark/pull/19776

[SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDBC data source

## What changes were proposed in this pull request?
Let's say I have a nested AND expression shown below, where p2 can not be pushed down:

    (p1 AND p2) OR p3

In the current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only and p2 is simply lost. This issue occurs with the JDBC data source and is similar to SPARK-12218 for Parquet. When we have an AND nested below another expression, we should either push both legs or nothing. Note that 1) the current Spark code always splits a conjunctive predicate before it determines whether a predicate can be pushed down; 2) the same translation method is also called by Data Source V2.

## How was this patch tested?
Added new unit test cases to JDBCSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jliwork/spark spark-22548

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19776.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19776

commit 58de88c21210d469b8ef14b1f23764c31ca5651e
Author: Jia Li
Date: 2017-11-18T01:15:01Z

    [SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDBC data source
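The "push both legs or nothing" rule described above can be illustrated with a toy translator. The types below are illustrative, not Spark's actual `Filter` API; the point is that `And` (and `Or`) only produce output when both legs translate, so `(p1 AND p2) OR p3` with an untranslatable `p2` pushes nothing, rather than the unsound `p1 OR p3`.

```java
import java.util.Optional;

// Hypothetical predicate tree; `supported` marks whether a leaf can be
// pushed to the data source.
abstract class Expr {}
class Leaf extends Expr {
    final String name; final boolean supported;
    Leaf(String name, boolean supported) { this.name = name; this.supported = supported; }
}
class And extends Expr { final Expr l, r; And(Expr l, Expr r) { this.l = l; this.r = r; } }
class Or extends Expr { final Expr l, r; Or(Expr l, Expr r) { this.l = l; this.r = r; } }

class Translator {
    // Returns the pushed-down filter text, or empty if the predicate cannot
    // be pushed. Crucially, And requires BOTH legs: returning just one leg
    // would be unsound under a parent OR, which is the bug in question.
    static Optional<String> translate(Expr e) {
        if (e instanceof Leaf) {
            Leaf leaf = (Leaf) e;
            return leaf.supported ? Optional.of(leaf.name) : Optional.empty();
        } else if (e instanceof And) {
            And a = (And) e;
            Optional<String> l = translate(a.l), r = translate(a.r);
            return (l.isPresent() && r.isPresent())
                ? Optional.of("(" + l.get() + " AND " + r.get() + ")")
                : Optional.empty();  // both legs or nothing
        } else if (e instanceof Or) {
            Or o = (Or) e;
            Optional<String> l = translate(o.l), r = translate(o.r);
            return (l.isPresent() && r.isPresent())
                ? Optional.of("(" + l.get() + " OR " + r.get() + ")")
                : Optional.empty();  // OR already required both legs pre-fix
        }
        return Optional.empty();
    }
}
```

With an untranslatable `p2`, `translate((p1 AND p2) OR p3)` yields empty, so the filter is evaluated by Spark instead of being pushed down incorrectly.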
[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19390#discussion_r151821865

--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala ---
@@ -228,24 +254,15 @@ trait MesosSchedulerUtils extends Logging {
     (attr.getName, attr.getText.getValue.split(',').toSet)
   }

-  /** Build a Mesos resource protobuf object */
-  protected def createResource(resourceName: String, quantity: Double): Protos.Resource = {
-    Resource.newBuilder()
-      .setName(resourceName)
-      .setType(Value.Type.SCALAR)
-      .setScalar(Value.Scalar.newBuilder().setValue(quantity).build())
-      .build()
-  }
-
   /**
    * Converts the attributes from the resource offer into a Map of name to Attribute Value
    * The attribute values are the mesos attribute types and they are
    *
    * @param offerAttributes the attributes offered
    * @return
    */
-  protected def toAttributeMap(offerAttributes: JList[Attribute]): Map[String, GeneratedMessage] = {
+  protected def toAttributeMap(offerAttributes: JList[Attribute])
+    : Map[String, GeneratedMessageV3] = {
--- End diff --

The convention is to double-indent everything that is part of the method signature, so that it's easy to visually separate it from the method body.
[GitHub] spark pull request #19772: [SPARK-22538][ML] SQLTransformer should not unper...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19772#discussion_r151815120

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala ---
@@ -70,7 +70,8 @@ class SQLTransformer @Since("1.6.0") (@Since("1.6.0") override val uid: String)
     dataset.createOrReplaceTempView(tableName)
     val realStatement = $(statement).replace(tableIdentifier, tableName)
     val result = dataset.sparkSession.sql(realStatement)
-    dataset.sparkSession.catalog.dropTempView(tableName)
--- End diff --

Hmm, but the behavior is explicitly defined at `Catalog`. This should be a public API. We may not easily change it.
[GitHub] spark pull request #19771: [SPARK-22544][SS]FileStreamSource should use its ...
Github user zsxwing closed the pull request at: https://github.com/apache/spark/pull/19771
[GitHub] spark issue #19771: [SPARK-22544][SS]FileStreamSource should use its own had...
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/19771 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19771: [SPARK-22544][SS]FileStreamSource should use its own had...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19771 Thanks! Merging to master and 2.2. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19772: [SPARK-22538][ML] SQLTransformer should not unper...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19772#discussion_r151813095

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala ---
    @@ -70,7 +70,8 @@ class SQLTransformer @Since("1.6.0") (@Since("1.6.0") override val uid: String)
         dataset.createOrReplaceTempView(tableName)
         val realStatement = $(statement).replace(tableIdentifier, tableName)
         val result = dataset.sparkSession.sql(realStatement)
    -    dataset.sparkSession.catalog.dropTempView(tableName)
    --- End diff --

    Yes. I also think it doesn't make a lot of sense.
[GitHub] spark issue #19760: [SPARK-22533][core] Handle deprecated names in ConfigEnt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19760 **[Test build #83977 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83977/testReport)** for PR 19760 at commit [`8afc56a`](https://github.com/apache/spark/commit/8afc56a825fdb4bce8aac683348bbd4223a29acc). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19388: [SPARK-22162] Executors and the driver should use consis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19388 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19388: [SPARK-22162] Executors and the driver should use consis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19388 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83976/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19388: [SPARK-22162] Executors and the driver should use consis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19388 **[Test build #83976 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83976/testReport)** for PR 19388 at commit [`500c73c`](https://github.com/apache/spark/commit/500c73cc96290efe0194e371ab84e0cda863347d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class StageInfo extends Serializable ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19763: [SPARK-22537][core] Aggregation of map output sta...
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19763#discussion_r151801786

    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -472,17 +474,36 @@ private[spark] class MapOutputTrackerMaster(
         shuffleStatuses.get(shuffleId).map(_.findMissingPartitions())
       }

    +  /**
    +   * Try to equally divide Range(0, num) to divisor slices
    +   */
    +  def equallyDivide(num: Int, divisor: Int): Iterator[Seq[Int]] = {
    +    assert(divisor > 0, "Divisor should be positive")
    +    val (each, remain) = (num / divisor, num % divisor)
    +    val (smaller, bigger) = (0 until num).splitAt((divisor-remain) * each)
    +    if (each != 0) {
    +      smaller.grouped(each) ++ bigger.grouped(each + 1)
    +    } else {
    +      bigger.grouped(each + 1)
    +    }
    +  }
    +
       /**
        * Return statistics about all of the outputs for a given shuffle.
        */
       def getStatistics(dep: ShuffleDependency[_, _, _]): MapOutputStatistics = {
         shuffleStatuses(dep.shuffleId).withMapStatuses { statuses =>
           val totalSizes = new Array[Long](dep.partitioner.numPartitions)
    -      for (s <- statuses) {
    -        for (i <- 0 until totalSizes.length) {
    -          totalSizes(i) += s.getSizeForBlock(i)
    +      val parallelism = conf.getInt("spark.adaptive.map.statistics.cores", 8)
    +
    +      val mapStatusSubmitTasks = equallyDivide(totalSizes.length, parallelism).map {
    --- End diff --

    Doing this is not cheap. I would add a config and only run this in multiple threads when `#mapper * #shuffle_partitions` is large.
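The `equallyDivide` helper quoted in the diff above is self-contained and can be exercised on its own. The sketch below copies it verbatim and adds a usage line; for 10 reduce partitions split across 3 slices it yields contiguous index ranges of sizes 3, 3, and 4:

```scala
// Copied from the diff above: split Range(0, num) into at most `divisor`
// contiguous slices whose sizes differ by at most one.
def equallyDivide(num: Int, divisor: Int): Iterator[Seq[Int]] = {
  assert(divisor > 0, "Divisor should be positive")
  val (each, remain) = (num / divisor, num % divisor)
  val (smaller, bigger) = (0 until num).splitAt((divisor - remain) * each)
  if (each != 0) {
    // `remain` slices get one extra element each
    smaller.grouped(each) ++ bigger.grouped(each + 1)
  } else {
    // fewer elements than slices: every slice holds a single index
    bigger.grouped(each + 1)
  }
}

val slices = equallyDivide(10, 3).map(_.toList).toList
```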
[GitHub] spark pull request #19763: [SPARK-22537][core] Aggregation of map output sta...
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19763#discussion_r151801570

    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -473,16 +477,41 @@ private[spark] class MapOutputTrackerMaster(
       }

       /**
    +   * Try to equally divide Range(0, num) to divisor slices
    +   */
    +  def equallyDivide(num: Int, divisor: Int): Iterator[Seq[Int]] = {
    +    assert(divisor > 0, "Divisor should be positive")
    +    val (each, remain) = (num / divisor, num % divisor)
    +    val (smaller, bigger) = (0 until num).splitAt((divisor-remain) * each)
    +    if (each != 0) {
    +      smaller.grouped(each) ++ bigger.grouped(each + 1)
    +    } else {
    +      bigger.grouped(each + 1)
    +    }
    +  }
    +
    +  /**
        * Return statistics about all of the outputs for a given shuffle.
        */
       def getStatistics(dep: ShuffleDependency[_, _, _]): MapOutputStatistics = {
         shuffleStatuses(dep.shuffleId).withMapStatuses { statuses =>
           val totalSizes = new Array[Long](dep.partitioner.numPartitions)
    -      for (s <- statuses) {
    -        for (i <- 0 until totalSizes.length) {
    -          totalSizes(i) += s.getSizeForBlock(i)
    -        }
    +      val mapStatusSubmitTasks = ArrayBuffer[Future[_]]()
    +      var taskSlices = parallelism
    +
    +      equallyDivide(totalSizes.length, taskSlices).foreach {
    +        reduceIds =>
    +          mapStatusSubmitTasks += threadPoolMapStats.submit(
    +            new Runnable {
    +              override def run(): Unit = {
    +                for (s <- statuses; i <- reduceIds) {
    +                  totalSizes(i) += s.getSizeForBlock(i)
    +                }
    +              }
    +            }
    +          )
       }
    +      mapStatusSubmitTasks.foreach(_.get())
    --- End diff --

    Don't use `scala.concurrent.ExecutionContext.Implicits.global`. You need to create a thread pool.
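The advice above — use a dedicated pool rather than `scala.concurrent.ExecutionContext.Implicits.global` — can be sketched as follows. This is a minimal, hypothetical illustration, not Spark's actual code: the pool size and the toy per-slice work are assumptions; each task owns a disjoint slice of indices, so no two threads write the same cell:

```scala
import java.util.concurrent.{Executors, Future => JFuture}
import scala.collection.mutable.ArrayBuffer

val totals = new Array[Long](4)
// A dedicated fixed-size pool (size is illustrative); nothing here depends on
// the JVM-global implicit execution context.
val statsPool = Executors.newFixedThreadPool(2)
val tasks = ArrayBuffer[JFuture[_]]()
for (slice <- Seq(0 to 1, 2 to 3)) {
  tasks += statsPool.submit(new Runnable {
    // Stand-in for summing block sizes: bump each index in this slice once.
    override def run(): Unit = slice.foreach(i => totals(i) += 1)
  })
}
tasks.foreach(_.get()) // block until every slice is done before reading totals
statsPool.shutdown()
```

Because the slices partition the index range, the writes need no locking; the `get()` calls establish the happens-before edge that makes `totals` safe to read afterwards.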
[GitHub] spark issue #19388: [SPARK-22162] Executors and the driver should use consis...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19388 High level question: do you need to create the commit id based on the rdd + stage ids, or could it be something else (like a monotonically increasing value, or a uuid)? That would probably simplify the code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19751 Not retesting since there will probably be feedback and the failure seems unrelated, so better just wait. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "spark.m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17436 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83973/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "spark.m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17436 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "spark.m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17436 **[Test build #83973 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83973/testReport)** for PR 17436 at commit [`ac2a7da`](https://github.com/apache/spark/commit/ac2a7da48790e632ba681d56e573f5cee764c464). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19751 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19751 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83975/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19751 **[Test build #83975 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83975/testReport)** for PR 19751 at commit [`e09a376`](https://github.com/apache/spark/commit/e09a376edc6707e564e9f89a202c5347f37ad738). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19728 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83974/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19728 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19728 **[Test build #83974 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83974/testReport)** for PR 19728 at commit [`22019b1`](https://github.com/apache/spark/commit/22019b13b4fa2e8a20d4507d156a34f91e80a07d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19388: [SPARK-22162] Executors and the driver should use consis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19388 **[Test build #83976 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83976/testReport)** for PR 19388 at commit [`500c73c`](https://github.com/apache/spark/commit/500c73cc96290efe0194e371ab84e0cda863347d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "spark.m...
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17436

    LGTM except a few minor comments; please update the PR title and description, thanks!
[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17436#discussion_r151765263

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadBenchmark.scala ---
    @@ -260,6 +261,7 @@ object ParquetReadBenchmark {
       def stringWithNullsScanBenchmark(values: Int, fractionOfNulls: Double): Unit = {
         withTempPath { dir =>
           withTempTable("t1", "tempTable") {
    +        val enableOffHeapColumnVector = spark.sqlContext.conf.offHeapColumnVectorEnabled
    --- End diff --

    nit: spark.sessionState.conf.offHeapColumnVectorEnabled
[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17436#discussion_r151764870

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -62,7 +69,11 @@ case class InMemoryTableScanExec(
       private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = {
         val rowCount = cachedColumnarBatch.numRows
    -    val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema)
    +    val columnVectors = if (!conf.offHeapColumnVectorEnabled) {
    --- End diff --

    only enable it when `TaskContext.get != null`?
[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17436#discussion_r151764498

    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java ---
    @@ -101,9 +101,13 @@
       private boolean returnColumnarBatch;

       /**
    -   * The default config on whether columnarBatch should be offheap.
    +   * The config on whether columnarBatch should be offheap.
    --- End diff --

    nit: the memory mode of the columnarBatch
[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17436#discussion_r151764317

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -140,6 +140,13 @@ object SQLConf {
         .booleanConf
         .createWithDefault(true)

    +  val COLUMN_VECTOR_OFFHEAP_ENABLED =
    +    buildConf("spark.sql.columnVector.offheap.enable")
    +      .internal()
    +      .doc("When true, use OffHeapColumnVector in ColumnarBatch.")
    +      .booleanConf
    +      .createWithDefault(true)
    --- End diff --

    hey let's not change the existing behavior.
[GitHub] spark issue #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem w...
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19728

    can you split it into 3 PRs? The approaches for these 3 expressions are quite different.
[GitHub] spark issue #19775: [SPARK-22343][core] Add support for publishing Spark met...
Github user erikerlandson commented on the issue: https://github.com/apache/spark/pull/19775 @matyix thanks for re-submitting! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19751 **[Test build #83975 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83975/testReport)** for PR 19751 at commit [`e09a376`](https://github.com/apache/spark/commit/e09a376edc6707e564e9f89a202c5347f37ad738). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19751 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19774 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19774 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83972/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19760: [SPARK-22533][core] Handle deprecated names in ConfigEnt...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19760 > my future plan is to move config related stuff to a new maven module Sounds good. I'll try to update this before leaving on vacation, but may not be able to... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19774 **[Test build #83972 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83972/testReport)** for PR 19774 at commit [`9bfa80c`](https://github.com/apache/spark/commit/9bfa80cca04a3b00e0fc2b02beb45c56f2058a34). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19257 Thank you for the decision, @cloud-fan . It's great to see the progress on this! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19769 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19769 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83971/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19769 **[Test build #83971 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83971/testReport)** for PR 19769 at commit [`9bb4cf0`](https://github.com/apache/spark/commit/9bb4cf0514dddc005b90ddb17a22d3b05be929e5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19728 **[Test build #83974 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83974/testReport)** for PR 19728 at commit [`22019b1`](https://github.com/apache/spark/commit/22019b13b4fa2e8a20d4507d156a34f91e80a07d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "spark.m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17436 **[Test build #83973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83973/testReport)** for PR 17436 at commit [`ac2a7da`](https://github.com/apache/spark/commit/ac2a7da48790e632ba681d56e573f5cee764c464). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19730: [SPARK-22500][SQL] Fix 64KB JVM bytecode limit problem w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19730 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83970/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19730: [SPARK-22500][SQL] Fix 64KB JVM bytecode limit problem w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19730 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19730: [SPARK-22500][SQL] Fix 64KB JVM bytecode limit problem w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19730 **[Test build #83970 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83970/testReport)** for PR 19730 at commit [`83fef40`](https://github.com/apache/spark/commit/83fef403b92a96a13421901d161a0df5e6a6d7b3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19725: [DO NOT REVIEW][SPARK-22042] [SQL] Insert shuffle...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19725#discussion_r151741374

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/InjectPlaceholderExchange.scala ---
    @@ -0,0 +1,69 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.exchange
    +
    +import org.apache.spark.sql.catalyst.expressions.SortOrder
    +import org.apache.spark.sql.catalyst.plans.physical._
    +import org.apache.spark.sql.catalyst.rules.Rule
    +import org.apache.spark.sql.execution.SparkPlan
    +import org.apache.spark.sql.internal.SQLConf
    +
    +case class InjectPlaceholderExchange(conf: SQLConf) extends Rule[SparkPlan] {
    +  private def defaultNumPreShufflePartitions: Int = conf.numShufflePartitions
    +
    +  /**
    +   * Given a required distribution, returns a partitioning that satisfies that distribution.
    +   * @param requiredDistribution The distribution that is required by the operator
    +   * @param numPartitions Used when the distribution doesn't require a specific number of partitions
    +   */
    +  private def createPartitioning(requiredDistribution: Distribution,
    +      numPartitions: Int): Partitioning = {
    +    requiredDistribution match {
    +      case AllTuples => SinglePartition
    +      case ClusteredDistribution(clustering, desiredPartitions) =>
    +        HashPartitioning(clustering, desiredPartitions.getOrElse(numPartitions))
    +      case OrderedDistribution(ordering) => RangePartitioning(ordering, numPartitions)
    +      case dist => sys.error(s"Do not know how to satisfy distribution $dist")
    +    }
    +  }
    +
    +  def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
    +    case operator @ ShuffleExchangeExec(partitioning, child, _) =>
    +      child.children match {
    +        case ShuffleExchangeExec(childPartitioning, baseChild, _)::Nil =>
    --- End diff --

    No white spaces around `::` intended?
[GitHub] spark issue #19717: [SPARK-18278] [Submission] Spark on Kubernetes - basic s...
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/19717 @vanzin @jiangxb1987 @mridulm PTAL. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19774#discussion_r151741011

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
    @@ -689,6 +689,11 @@ case class DescribeColumnCommand(
           buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL"))
           buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL"))
           buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL"))
    +      buffer ++= cs.flatMap(_.histogram.map { hist =>
    --- End diff --

    Thanks for your feedback. IMHO I am not sure that it would be easier to read. But if someone else agrees with you I can change it, thanks.
[GitHub] spark pull request #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit pr...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19728#discussion_r151741098

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala ---
    @@ -125,19 +138,43 @@ case class ConcatWs(children: Seq[Expression])
         if (children.forall(_.dataType == StringType)) {
           // All children are strings. In that case we can construct a fixed size array.
           val evals = children.map(_.genCode(ctx))
    -
    -      val inputs = evals.map { eval =>
    -        s"${eval.isNull} ? (UTF8String) null : ${eval.value}"
    -      }.mkString(", ")
    -
    -      ev.copy(evals.map(_.code).mkString("\n") + s"""
    -        UTF8String ${ev.value} = UTF8String.concatWs($inputs);
    +      val separator = evals.head
    +      val strings = evals.tail
    +      val argNums = strings.length
    +      val args = ctx.freshName("args")
    +
    +      val inputs = strings.zipWithIndex.map { case (eval, index) =>
    +        if (eval.isNull != "true") {
    +          s"""
    +            ${eval.code}
    +            if (!${eval.isNull}) {
    +              $args[$index] = ${eval.value};
    +            }
    +          """
    +        } else {
    +          ""
    +        }
    +      }
    +      val codes = s"${separator.code}\n" +
    +        (if (ctx.INPUT_ROW != null && ctx.currentVars == null) {
    +          ctx.splitExpressions(inputs, "valueConcatWs",
    +            ("InternalRow", ctx.INPUT_ROW) :: ("UTF8String[]", args) :: Nil)
    +        } else {
    +          inputs.mkString("\n")
    +        })
    +      ev.copy(s"""
    +        UTF8String[] $args = new UTF8String[$argNums];
    +        $codes
    +        UTF8String ${ev.value} = UTF8String.concatWs(${separator.value}, $args);
    --- End diff --

    nit:
    ```
    val code = if (ctx.INPUT_ROW != null && ctx.currentVars == null) ...
    ev.copy(code = s"""
      UTF8String[] $args = new UTF8String[$argNums];
      ${separator.code}
      ...
    """)
    ```
[GitHub] spark pull request #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit pr...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19728#discussion_r151740628

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala ---
    @@ -125,19 +138,43 @@ case class ConcatWs(children: Seq[Expression])
         if (children.forall(_.dataType == StringType)) {
           // All children are strings. In that case we can construct a fixed size array.
           val evals = children.map(_.genCode(ctx))
    -
    -      val inputs = evals.map { eval =>
    -        s"${eval.isNull} ? (UTF8String) null : ${eval.value}"
    -      }.mkString(", ")
    -
    -      ev.copy(evals.map(_.code).mkString("\n") + s"""
    -        UTF8String ${ev.value} = UTF8String.concatWs($inputs);
    +      val separator = evals.head
    +      val strings = evals.tail
    +      val argNums = strings.length
    --- End diff --

    nit: `numArgs`
[GitHub] spark pull request #19773: [SPARK-22546][SQL] Supporting for changing column...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19773#discussion_r151739773

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -318,16 +318,28 @@ case class AlterTableChangeColumnCommand(
             s"'${newColumn.name}' with type '${newColumn.dataType}'")
         }

    +    val changeSchema = originColumn.dataType != newColumn.dataType
    --- End diff --

    What do you think about renaming the val to `typeChanged`?
[GitHub] spark pull request #19773: [SPARK-22546][SQL] Supporting for changing column...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19773#discussion_r151740225

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -1459,6 +1459,11 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
         // Ensure that change column will preserve other metadata fields.
         sql("ALTER TABLE dbx.tab1 CHANGE COLUMN col1 col1 INT COMMENT 'this is col1'")
         assert(getMetadata("col1").getString("key") == "value")
    +
    +    // Ensure that change column type take effect
    --- End diff --

    s/change/changing + s/take/takes
[GitHub] spark pull request #19773: [SPARK-22546][SQL] Supporting for changing column...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19773#discussion_r151739604

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -318,16 +318,28 @@ case class AlterTableChangeColumnCommand(
             s"'${newColumn.name}' with type '${newColumn.dataType}'")
         }
    +    val changeSchema = originColumn.dataType != newColumn.dataType
         val newSchema = table.schema.fields.map { field =>
           if (field.name == originColumn.name) {
    -        // Create a new column from the origin column with the new comment.
    -        addComment(field, newColumn.getComment)
    +        var newField = field
    --- End diff --

    I'd recommend getting rid of this `var` and rewriting the code as follows:
    ```
    val newField = newColumn.getComment.map(...).getOrElse(field)
    ```
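The `map(...).getOrElse(field)` idiom the reviewer suggests replaces mutable reassignment with a single immutable binding: apply the transformation only when the `Option` holds a value, otherwise keep the original. A minimal sketch, where `Field` and `withComment` are illustrative stand-ins for Spark's `StructField` helpers, not the real API:

```scala
// Stand-in for a schema field whose comment can be updated immutably.
case class Field(name: String, comment: Option[String]) {
  def withComment(c: String): Field = copy(comment = Some(c))
}

// If a new comment was supplied, produce an updated field; otherwise
// return the field unchanged -- no `var` needed.
def updateField(field: Field, newComment: Option[String]): Field =
  newComment.map(field.withComment).getOrElse(field)
```

This keeps the "no change" path explicit and makes the binding a `val`, which is the point of the review comment.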
[GitHub] spark pull request #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit pr...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19728#discussion_r151738992

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala ---
    @@ -63,15 +63,28 @@ case class Concat(children: Seq[Expression]) extends Expression with ImplicitCas
       override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
         val evals = children.map(_.genCode(ctx))
    -    val inputs = evals.map { eval =>
    -      s"${eval.isNull} ? null : ${eval.value}"
    -    }.mkString(", ")
    -    ev.copy(evals.map(_.code).mkString("\n") + s"""
    -      boolean ${ev.isNull} = false;
    -      UTF8String ${ev.value} = UTF8String.concat($inputs);
    -      if (${ev.value} == null) {
    -        ${ev.isNull} = true;
    -      }
    +    val numArgs = evals.length
    --- End diff --

    nit: we can inline it
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19774#discussion_r151737674

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
    @@ -689,6 +689,11 @@ case class DescribeColumnCommand(
           buffer += Row("distinct_count", cs.map(_.distinctCount.toString).getOrElse("NULL"))
           buffer += Row("avg_col_len", cs.map(_.avgLen.toString).getOrElse("NULL"))
           buffer += Row("max_col_len", cs.map(_.maxLen.toString).getOrElse("NULL"))
    +      buffer ++= cs.flatMap(_.histogram.map { hist =>
    --- End diff --

    I'm pretty sure that a for-comprehension would make the code easier to read.
    ```scala
    for {
      c <- cs
      hist <- c.histogram
      ...
    } yield ...
    ```
    Let me know if you need help with that.
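The for-comprehension suggested above is sugar for the same `flatMap`/`map` chain: each `<-` unwraps one `Option`, and the whole expression yields `None` if any layer is empty. A self-contained sketch, where `ColStats` and `Histogram` are illustrative stand-ins for Spark's catalog statistics types:

```scala
// Stand-ins for column statistics carrying an optional histogram.
case class Histogram(height: Double)
case class ColStats(histogram: Option[Histogram])

// Nested Option access reads as one flat block: if either `cs` or the
// histogram is absent, the result is None with no explicit null checks.
def histogramRow(cs: Option[ColStats]): Option[String] =
  for {
    c    <- cs
    hist <- c.histogram
  } yield s"histogram height: ${hist.height}"
```

Compared with `cs.flatMap(_.histogram.map { hist => ... })`, the comprehension scales better as more optional layers are added, which is likely why the reviewer prefers it here.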
[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19747

    Test PASSed. Refer to this link for build results (access rights to CI server needed):
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83969/
    Test PASSed.
[GitHub] spark pull request #19772: [SPARK-22538][ML] SQLTransformer should not unper...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19772
[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19747

    Merged build finished. Test PASSed.