[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21537 I think we can set up a place (mailing list or JIRA) to discuss further details of the IR design, as suggested by @HyukjinKwon. This can be a joint effort among interested parties. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22119 @tgravescs @dongjoon-hyun Thanks for the explanation. We should add a configuration instead of reverting.
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20637 **[Test build #94878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94878/testReport)** for PR 20637 at commit [`99731ca`](https://github.com/apache/spark/commit/99731ca058f0d8946397530aff76d3c55fa93162).
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20637 Jenkins, retest this please.
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20637 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2263/
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20637 Merged build finished. Test PASSed.
[GitHub] spark pull request #22126: [SPARK-23938][SQL][FOLLOW-UP][TEST] Nullabilities...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22126
[GitHub] spark issue #22130: [SPARK-25137][Spark Shell] NumberFormatException` when s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22130 **[Test build #94877 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94877/testReport)** for PR 22130 at commit [`8ab5f87`](https://github.com/apache/spark/commit/8ab5f879843e74bec43ceada1027d1d5818e40da).
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22121 @gengliangwang Could you also post the screenshot in your PR description?
[GitHub] spark issue #22130: [SPARK-25137][Spark Shell] NumberFormatException` when s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22130 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2262/
[GitHub] spark issue #22130: [SPARK-25137][Spark Shell] NumberFormatException` when s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22130 Merged build finished. Test PASSed.
[GitHub] spark issue #22130: [SPARK-25137][Spark Shell] NumberFormatException` when s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22130 Merged build finished. Test PASSed.
[GitHub] spark issue #22130: [SPARK-25137][Spark Shell] NumberFormatException` when s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22130 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2261/
[GitHub] spark issue #22130: [SPARK-25137][Spark Shell] NumberFormatException` when s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22130 **[Test build #94876 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94876/testReport)** for PR 22130 at commit [`d00929f`](https://github.com/apache/spark/commit/d00929f28b2523869252d67fefc04297aadc5af6). * This patch **fails build dependency tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22130: [SPARK-25137][Spark Shell] NumberFormatException` when s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22130 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94876/
[GitHub] spark issue #22130: [SPARK-25137][Spark Shell] NumberFormatException` when s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22130 Merged build finished. Test FAILed.
[GitHub] spark issue #22126: [SPARK-23938][SQL][FOLLOW-UP][TEST] Nullabilities of val...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22126 Thanks! merging to master.
[GitHub] spark issue #22130: [SPARK-25137][Spark Shell] NumberFormatException` when s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22130 **[Test build #94876 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94876/testReport)** for PR 22130 at commit [`d00929f`](https://github.com/apache/spark/commit/d00929f28b2523869252d67fefc04297aadc5af6).
[GitHub] spark pull request #22130: [SPARK-25137][Spark Shell] NumberFormatException`...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/22130 [SPARK-25137][Spark Shell] `NumberFormatException` when starting spark-shell from Mac terminal

## What changes were proposed in this pull request?

When starting spark-shell from a Mac terminal, the following exception is thrown:

    [ERROR] Failed to construct terminal; falling back to unsupported
    java.lang.NumberFormatException: For input string: "0x100"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.valueOf(Integer.java:766)
        at jline.internal.InfoCmp.parseInfoCmp(InfoCmp.java:59)
        at jline.UnixTerminal.parseInfoCmp(UnixTerminal.java:242)
        at jline.UnixTerminal.<init>(UnixTerminal.java:65)
        at jline.UnixTerminal.<init>(UnixTerminal.java:50)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at java.lang.Class.newInstance(Class.java:442)
        at jline.TerminalFactory.getFlavor(TerminalFactory.java:211)

This issue is due to a JLine defect (https://github.com/jline/jline2/issues/281), which is fixed in JLine 2.14.4. Bumping the JLine version in Spark to 2.14.4 or later fixes the issue.

## How was this patch tested?

No new unit/automation test was added. After the upgrade to the latest JLine version (2.14.6), spark-shell features were tested manually.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark br_UpgradeJLineVersion Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22130.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22130 commit d00929f28b2523869252d67fefc04297aadc5af6 Author: Vinod KC Date: 2018-08-17T04:10:18Z Upgrade JLine to 2.14.6
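The stack trace above shows the root cause of the JLine defect: terminfo capability values such as "0x100" are hex literals, but `Integer.valueOf`/`Integer.parseInt` only accept decimal strings. A standalone sketch (not the actual jline code) of the failing parse, and of `Integer.decode`, which does understand `0x` prefixes:

```java
public class HexParseDemo {
    public static void main(String[] args) {
        // Decimal-radix parse rejects a hex literal, exactly as in the
        // jline.internal.InfoCmp stack trace above.
        boolean threw = false;
        try {
            Integer.parseInt("0x100");
        } catch (NumberFormatException e) {
            threw = true;
        }
        System.out.println("parseInt(\"0x100\") threw: " + threw); // true

        // Integer.decode handles 0x/0/# prefixes, so the same string parses.
        System.out.println(Integer.decode("0x100")); // 256
    }
}
```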
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]AVRO data source guide
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22121 @cloud-fan @gatorsmile
[GitHub] spark issue #22079: [SPARK-23207][SPARK-22905][SQL][BACKPORT-2.2] Shuffle+Re...
Github user bersprockets commented on the issue: https://github.com/apache/spark/pull/22079 @jiangxb1987 gentle ping.
[GitHub] spark issue #22126: [SPARK-23938][SQL][FOLLOW-UP][TEST] Nullabilities of val...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22126 Merged build finished. Test PASSed.
[GitHub] spark issue #22126: [SPARK-23938][SQL][FOLLOW-UP][TEST] Nullabilities of val...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22126 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94871/
[GitHub] spark issue #22126: [SPARK-23938][SQL][FOLLOW-UP][TEST] Nullabilities of val...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22126 **[Test build #94871 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94871/testReport)** for PR 22126 at commit [`45d044c`](https://github.com/apache/spark/commit/45d044c42fd8b785c734a920f4b557ca469a5212). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22120: [SPARK-25131]Event logs missing applicationAttemptId for...
Github user ajithme commented on the issue: https://github.com/apache/spark/pull/22120 @vanzin I agree it is a trivial change; I just wanted the output to be consistent with yarn cluster mode. This is not only for event logs: for a custom SparkListener it may be confusing that in onApplicationStart the appId is empty in client mode but an actual number in cluster mode. That is where the effect can be seen.
[GitHub] spark issue #22123: [SPARK-25134][SQL] Csv column pruning with checking of h...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22123 **[Test build #94875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94875/testReport)** for PR 22123 at commit [`09c986c`](https://github.com/apache/spark/commit/09c986c7e9586346255ba7631db83f2f88fe1625).
[GitHub] spark issue #22123: [SPARK-25134][SQL] Csv column pruning with checking of h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22123 Merged build finished. Test PASSed.
[GitHub] spark issue #21794: [SPARK-24834][CORE] use java comparison for float and do...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21794 I think we'd have to close this due to the behavior change, but would merge an optimization of the existing behavior.
[GitHub] spark issue #22123: [SPARK-25134][SQL] Csv column pruning with checking of h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22123 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2260/
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r210801081

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1603,6 +1603,44 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
         .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
     }

+  test("SPARK-25134: check header on parsing of dataset with projection and column pruning") {
+    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
+      withTempPath { path =>
+        val dir = path.getAbsolutePath
+        Seq(("a", "b")).toDF("columnA", "columnB").write
+          .format("csv")
+          .option("header", true)
+          .save(dir)
+        checkAnswer(spark.read
+          .format("csv")
+          .option("header", true)
+          .option("enforceSchema", false)
+          .load(dir)
+          .select("columnA"),
+          Row("a"))
+      }
+    }
+  }
+
+  test("SPARK-25134: check header on parsing of dataset with projection and no column pruning") {
+    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "false") {
--- End diff --

ok will remove
[GitHub] spark pull request #21961: Spark 20597
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/21961#discussion_r210800762

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala ---
@@ -231,7 +231,13 @@ private[kafka010] class KafkaSourceProvider extends DataSourceRegister
       parameters: Map[String, String],
       partitionColumns: Seq[String],
       outputMode: OutputMode): Sink = {
-    val defaultTopic = parameters.get(TOPIC_OPTION_KEY).map(_.trim)
+    // Picks the default topic name from the "path" entry in the "parameters" Map
+    // if no topic key is present in the Map.
+    val defaultTopic = parameters.get(TOPIC_OPTION_KEY) match {
--- End diff --

Isn't this simpler as something like

```
val defaultTopic = parameters.get(TOPIC_OPTION_KEY).orElse(parameters.get(PATH_OPTION_KEY)).map(_.trim)
```
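The fallback being discussed (use the explicit topic option if present, otherwise fall back to the path option, trimming whitespace either way) can be sketched outside Spark as plain JVM code. The key names below are hypothetical stand-ins for `TOPIC_OPTION_KEY` and `PATH_OPTION_KEY`; this is an illustration of the lookup logic, not the KafkaSourceProvider implementation:

```java
import java.util.Map;
import java.util.Optional;

public class TopicFallback {
    // Hypothetical stand-ins for the option keys in KafkaSourceProvider.
    static final String TOPIC_KEY = "topic";
    static final String PATH_KEY = "path";

    // Returns the explicit topic if set, otherwise the "path" value,
    // trimmed; empty if neither option is present.
    static Optional<String> defaultTopic(Map<String, String> parameters) {
        return Optional.ofNullable(parameters.get(TOPIC_KEY))
                .or(() -> Optional.ofNullable(parameters.get(PATH_KEY)))
                .map(String::trim);
    }

    public static void main(String[] args) {
        System.out.println(defaultTopic(Map.of(PATH_KEY, " events "))); // Optional[events]
        System.out.println(defaultTopic(Map.of()));                     // Optional.empty
    }
}
```

The `orElse`-style chain keeps both branches as `Optional`, which is what makes it simpler than a `match` on the first lookup.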
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21860 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94872/
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21860 Merged build finished. Test PASSed.
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 @kiszk The initial prototype or proof of concept can be in any personal branch. When we merge it to the master branch, we still need to separate it from the current codegen and make it configurable. After the release, users can choose which one to use. When the new IR is stable, we can then consider deprecating the current one. This is mainly for product stability; we need to follow a similar principle for any big project. @viirya @mgaido91 Let us first focus on the new IR design and prototype.
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21860 **[Test build #94872 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94872/testReport)** for PR 21860 at commit [`3aa4e6d`](https://github.com/apache/spark/commit/3aa4e6d2c4ebd330898feb75af7b7fb36f512ea7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21868#discussion_r210799950

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -425,12 +426,44 @@ case class FileSourceScanExec(
       fsRelation: HadoopFsRelation): RDD[InternalRow] = {
     val defaultMaxSplitBytes = fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
-    val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
+    var openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
     val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
     val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
     val bytesPerCore = totalBytes / defaultParallelism
-    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
+    var maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
+
+    if (fsRelation.sparkSession.sessionState.conf.isParquetSizeAdaptiveEnabled &&
+      (fsRelation.fileFormat.isInstanceOf[ParquetSource] ||
+        fsRelation.fileFormat.isInstanceOf[OrcFileFormat])) {
+      if (relation.dataSchema.map(_.dataType).forall(dataType =>
+        dataType.isInstanceOf[CalendarIntervalType] || dataType.isInstanceOf[StructType]
+          || dataType.isInstanceOf[MapType] || dataType.isInstanceOf[NullType]
+          || dataType.isInstanceOf[AtomicType] || dataType.isInstanceOf[ArrayType])) {
+
+        def getTypeLength(dataType: DataType): Int = {
+          if (dataType.isInstanceOf[StructType]) {
+            fsRelation.sparkSession.sessionState.conf.parquetStructTypeLength
+          } else if (dataType.isInstanceOf[ArrayType]) {
+            fsRelation.sparkSession.sessionState.conf.parquetArrayTypeLength
+          } else if (dataType.isInstanceOf[MapType]) {
+            fsRelation.sparkSession.sessionState.conf.parquetMapTypeLength
+          } else {
+            dataType.defaultSize
+          }
+        }
+
+        val selectedColumnSize = requiredSchema.map(_.dataType).map(getTypeLength(_))
+          .reduceOption(_ + _).getOrElse(StringType.defaultSize)
+        val totalColumnSize = relation.dataSchema.map(_.dataType).map(getTypeLength(_))
+          .reduceOption(_ + _).getOrElse(StringType.defaultSize)
--- End diff --

I think his point is that the estimation is very rough, and I agree. Partly for that reason, I am less sure whether we should go ahead.
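As far as the excerpt shows, the patch scales the split size by the ratio of total to selected column width, so that a narrow projection over a wide columnar file still yields roughly `maxPartitionBytes` of selected data per task. A minimal sketch of that arithmetic, with hypothetical column widths and an assumed upper bound (neither value is taken from the patch):

```java
public class SplitSizeSketch {
    // Scales maxSplitBytes by totalColumnSize / selectedColumnSize. The cap is
    // an assumption added here so a pathological ratio cannot produce an
    // absurdly large split; the actual patch may bound it differently.
    static long adaptiveSplitBytes(long maxSplitBytes, long totalColumnSize,
                                   long selectedColumnSize) {
        long scaled = maxSplitBytes * totalColumnSize / selectedColumnSize;
        long cap = 1024L * 1024 * 1024; // assumed 1 GB upper bound
        return Math.min(scaled, cap);
    }

    public static void main(String[] args) {
        long defaultMax = 128L * 1024 * 1024; // 128 MB default split size
        // Selecting 8 bytes per row out of an 80-byte-per-row schema:
        // the split grows 10x, so each task still reads ~128 MB of
        // selected-column data.
        System.out.println(adaptiveSplitBytes(defaultMax, 80, 8));
    }
}
```

The roughness the reviewers object to is visible here: `totalColumnSize` and `selectedColumnSize` come from fixed per-type defaults, not from actual on-disk column sizes.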
[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21868#discussion_r210799970

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -459,6 +460,29 @@ object SQLConf {
     .intConf
     .createWithDefault(4096)

+  val IS_PARQUET_PARTITION_ADAPTIVE_ENABLED = buildConf("spark.sql.parquet.adaptiveFileSplit")
+    .doc("For columnar file format (e.g., Parquet), it's possible that only few (not all) " +
+      "columns are needed. So, it's better to make sure that the total size of the selected " +
+      "columns is about 128 MB ")
+    .booleanConf
+    .createWithDefault(false)
+
+  val PARQUET_STRUCT_LENGTH = buildConf("spark.sql.parquet.struct.length")
+    .doc("Set the default size of struct column")
+    .intConf
+    .createWithDefault(StringType.defaultSize)
+
+  val PARQUET_MAP_LENGTH = buildConf("spark.sql.parquet.map.length")
--- End diff --

Yeah, I was thinking that.
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21537 If we continue improving the current codegen framework, I think it is good to have a design doc reviewed by the community. If we decide to adopt the IR design and get rid of the string-based framework, do we still need a design doc for the current codegen improvement, or can we focus on the IR design doc?
[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21868#discussion_r210799891

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -459,6 +460,29 @@ object SQLConf {
     .intConf
     .createWithDefault(4096)

+  val IS_PARQUET_PARTITION_ADAPTIVE_ENABLED = buildConf("spark.sql.parquet.adaptiveFileSplit")
+    .doc("For columnar file format (e.g., Parquet), it's possible that only few (not all) " +
+      "columns are needed. So, it's better to make sure that the total size of the selected " +
+      "columns is about 128 MB ")
+    .booleanConf
+    .createWithDefault(false)
+
+  val PARQUET_STRUCT_LENGTH = buildConf("spark.sql.parquet.struct.length")
+    .doc("Set the default size of struct column")
+    .intConf
+    .createWithDefault(StringType.defaultSize)
+
+  val PARQUET_MAP_LENGTH = buildConf("spark.sql.parquet.map.length")
--- End diff --

I wouldn't do this. It makes things more complicated, and I would just set a bigger number for `maxPartitionBytes`.
[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21868#discussion_r210799770

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -459,6 +460,29 @@ object SQLConf {
     .intConf
     .createWithDefault(4096)

+  val IS_PARQUET_PARTITION_ADAPTIVE_ENABLED = buildConf("spark.sql.parquet.adaptiveFileSplit")
+    .doc("For columnar file format (e.g., Parquet), it's possible that only few (not all) " +
+      "columns are needed. So, it's better to make sure that the total size of the selected " +
+      "columns is about 128 MB "
--- End diff --

This does not sound like it describes what the configuration actually does.
[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21868#discussion_r210799731

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -459,6 +460,29 @@ object SQLConf {
     .intConf
     .createWithDefault(4096)

+  val IS_PARQUET_PARTITION_ADAPTIVE_ENABLED = buildConf("spark.sql.parquet.adaptiveFileSplit")
+    .doc("For columnar file format (e.g., Parquet), it's possible that only few (not all) " +
--- End diff --

I would avoid the contraction `it's`.
[GitHub] spark pull request #22110: [SPARK-25122][SQL] Deduplication of supports equa...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22110
[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21868#discussion_r210799600

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -459,6 +460,29 @@ object SQLConf {
     .intConf
     .createWithDefault(4096)

+  val IS_PARQUET_PARTITION_ADAPTIVE_ENABLED = buildConf("spark.sql.parquet.adaptiveFileSplit")
--- End diff --

This configuration doesn't look specific to parquet anymore.
[GitHub] spark issue #21868: [SPARK-24906][SQL] Adaptively enlarge split / partition ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21868 @habren, BTW, just for clarification: you can set a bigger value for `spark.sql.files.maxPartitionBytes` explicitly, and that resolves your issue. This one is to handle it dynamically, right?
[GitHub] spark pull request #22124: [SPARK-25135][SQL] Insert datasource table may al...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22124#discussion_r210799343

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala ---
@@ -490,7 +490,8 @@ object DDLPreprocessingUtils {
       case (expected, actual) =>
         if (expected.dataType.sameType(actual.dataType) &&
           expected.name == actual.name &&
-          expected.metadata == actual.metadata) {
+          expected.metadata == actual.metadata &&
+          expected.exprId.id == actual.exprId.id) {
--- End diff --

why does this fix the problem?
[GitHub] spark issue #22110: [SPARK-25122][SQL] Deduplication of supports equals code
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22110 thanks, merging to master!
[GitHub] spark issue #22116: [DOCS]Update configuration.md
Github user KraFusion commented on the issue: https://github.com/apache/spark/pull/22116 @srowen Thanks! Yes, my bad. Next time I will bundle the changes (better yet, I will look for the same issue elsewhere in the docs), and I'll use a better title.
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21537 It's a good point that we should have a design doc for this codegen infrastructure improvement, since it's very critical to Spark, and we should have it reviewed by the community. There were some discussions on the PRs and JIRAs, but they didn't happen on the dev list. This is something we should do next. At this stage, I think it's too late to revert anything related to the codegen improvement. So many codegen templates have been touched that I think reverting is riskier. But we should hold it for now until the design doc is reviewed by the community on the dev list.
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 Actually, we can extend the solution later, and I've mentioned it in my PR description. Basically there are 3 kinds of closures:
1. totally random
2. always outputs the same data set, in a random order
3. always outputs the same data sequence (same order)

Spark is able to handle closure 1; the cost is that whenever a fetch failure happens and a map task gets retried, Spark needs to roll back all the succeeding stages and retry them, because their input has changed. `zip` falls into this category, but due to time constraints, I think it's ok to document it and fix it later. For closure 2, Spark can treat it as closure 3 if the shuffle partitioner is order insensitive, like a range/hash partitioner. This means that when a map task gets retried, it will produce the same data for the reducers, so we don't need to roll back all the succeeding stages. However, if the shuffle partitioner is order sensitive, like round-robin, Spark has to treat it like closure 1 and roll back all the succeeding stages if a map task gets retried. Closure 3 is already handled well by the current Spark. In this PR, I assume all the RDDs' computing functions are closure 3, so that we don't have a performance regression. The only exception is shuffled RDD, which outputs data in a random order because of the remote block fetching. In the future, we can extend `RDD#isIdempotent` to an enum indicating the 3 closure types, and change the `FetchFailed` handling logic accordingly.
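The order-sensitivity distinction above can be illustrated outside Spark: hash partitioning assigns a record by its value, so partition contents do not change when the input order changes, while round-robin assigns by position, so a reordered input generally lands in different partitions. A minimal plain-Java sketch (not Spark API):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PartitionerOrderDemo {
    // Hash partitioning: the target partition depends only on the value.
    static List<Set<Integer>> hashPartition(List<Integer> data, int parts) {
        List<Set<Integer>> out = new ArrayList<>();
        for (int i = 0; i < parts; i++) out.add(new HashSet<>());
        for (int v : data) out.get(Math.floorMod(v, parts)).add(v);
        return out;
    }

    // Round-robin partitioning: the target partition depends on the position.
    static List<Set<Integer>> roundRobin(List<Integer> data, int parts) {
        List<Set<Integer>> out = new ArrayList<>();
        for (int i = 0; i < parts; i++) out.add(new HashSet<>());
        for (int i = 0; i < data.size(); i++) out.get(i % parts).add(data.get(i));
        return out;
    }

    public static void main(String[] args) {
        List<Integer> a = List.of(1, 2, 3, 4);
        List<Integer> b = List.of(4, 3, 2, 1); // same data set, different order
        // Hash partitions match across the reordering; round-robin ones do not.
        System.out.println(hashPartition(a, 2).equals(hashPartition(b, 2))); // true
        System.out.println(roundRobin(a, 2).equals(roundRobin(b, 2)));       // false
    }
}
```

This is why a retried map task feeding a round-robin shuffle can send different records to each reducer, forcing the rollback described above, while a hash-partitioned shuffle over the same (reordered) data does not.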
[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21950 **[Test build #94874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94874/testReport)** for PR 21950 at commit [`3a65edf`](https://github.com/apache/spark/commit/3a65edf0e07f3beb6d6dd4dcb16e76ea7210c5e9).
[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21950 retest this please
[GitHub] spark pull request #22045: [SPARK-23940][SQL] Add transform_values SQL funct...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22045
[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...
Github user habren commented on a diff in the pull request: https://github.com/apache/spark/pull/21868#discussion_r210793717
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -425,12 +426,44 @@ case class FileSourceScanExec(
       fsRelation: HadoopFsRelation): RDD[InternalRow] = {
     val defaultMaxSplitBytes = fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
-    val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
+    var openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
     val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
     val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
     val bytesPerCore = totalBytes / defaultParallelism
-    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
+    var maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
+
+    if (fsRelation.sparkSession.sessionState.conf.isParquetSizeAdaptiveEnabled &&
+      (fsRelation.fileFormat.isInstanceOf[ParquetSource] ||
+        fsRelation.fileFormat.isInstanceOf[OrcFileFormat])) {
+      if (relation.dataSchema.map(_.dataType).forall(dataType =>
+        dataType.isInstanceOf[CalendarIntervalType] || dataType.isInstanceOf[StructType]
+          || dataType.isInstanceOf[MapType] || dataType.isInstanceOf[NullType]
+          || dataType.isInstanceOf[AtomicType] || dataType.isInstanceOf[ArrayType])) {
+
+        def getTypeLength(dataType: DataType): Int = {
+          if (dataType.isInstanceOf[StructType]) {
+            fsRelation.sparkSession.sessionState.conf.parquetStructTypeLength
+          } else if (dataType.isInstanceOf[ArrayType]) {
+            fsRelation.sparkSession.sessionState.conf.parquetArrayTypeLength
+          } else if (dataType.isInstanceOf[MapType]) {
+            fsRelation.sparkSession.sessionState.conf.parquetMapTypeLength
+          } else {
+            dataType.defaultSize
+          }
+        }
+
+        val selectedColumnSize = requiredSchema.map(_.dataType).map(getTypeLength(_))
+          .reduceOption(_ + _).getOrElse(StringType.defaultSize)
+        val totalColumnSize = relation.dataSchema.map(_.dataType).map(getTypeLength(_))
+          .reduceOption(_ + _).getOrElse(StringType.defaultSize)
--- End diff --
@gatorsmile The target of this change is not to make it easier for users to set the partition size. Instead, when the user sets the partition size, this change tries its best to make the actual read size close to the value the user set. Without this change, when the user sets the partition size to 128MB, the actual read size may be 1MB or even smaller because of column pruning.
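The excerpt ends before the computed column sizes are applied, but the intended calculation can be sketched as follows (a hypothetical Python reconstruction under the assumption that the split size is scaled by totalColumnSize / selectedColumnSize; this is not the PR's actual Scala code):

```python
def adaptive_max_split_bytes(default_max_split_bytes, open_cost_in_bytes,
                             total_bytes, default_parallelism,
                             selected_column_size, total_column_size):
    """Enlarge the split size by the fraction of bytes column pruning discards.

    Idea: if only selected/total of the row width is actually read, a 128MB
    split yields far less than 128MB of I/O, so the split is scaled up by
    total/selected to keep the read size near what the user configured.
    """
    bytes_per_core = total_bytes // max(default_parallelism, 1)
    # Baseline formula from FileSourceScanExec before the patch.
    max_split = min(default_max_split_bytes, max(open_cost_in_bytes, bytes_per_core))
    ratio = total_column_size / selected_column_size
    return int(max_split * ratio)
```

For example, reading 1 small column out of a wide schema (ratio 128) would scale a 128MB baseline split to ~16GB of on-disk footprint, of which only ~128MB is actually fetched after pruning.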
[GitHub] spark issue #22045: [SPARK-23940][SQL] Add transform_values SQL function
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22045 Thanks! merging to master.
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21990 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94873/ Test PASSed.
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21990 **[Test build #94873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94873/testReport)** for PR 21990 at commit [`0eea205`](https://github.com/apache/spark/commit/0eea205ca0591c68975412873b34393f6bf19437). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class SparkExtensionsTest(unittest.TestCase, SQLTestUtils):`
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21990 Merged build finished. Test PASSed.
[GitHub] spark issue #22122: [SPARK-24665][PySpark][FollowUp] Use SQLConf in PySpark ...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/22122 Thanks.
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs To understand better, are you suggesting that we do not support any api and/or user closure which depends on input order? If yes, that would break not just repartition + shuffle, but also other publicly exposed api in spark core and (my guess) non-trivial aspects of mllib. Or is it that we support repartition and possibly a few other high priority cases (sampling in mllib, for example) and not support the rest? My (unproven) contention is that a solution for repartition + shuffle would be a general solution (or very close to it): it would then work for all other cases with suitable modifications as required. By "expand solution to cover all later", I was referring to leveraging whatever we build for repartition in the other use cases - for example, setting appropriate parameters, etc. - in the interest of time.
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r210790581
--- Diff: python/pyspark/sql/tests.py ---
@@ -3563,6 +3563,51 @@ def test_query_execution_listener_on_collect_with_arrow(self):
             "The callback from the query execution listener should be called after 'toPandas'")
 
+class SparkExtensionsTest(unittest.TestCase, SQLTestUtils):
+    # These tests are separate because it uses 'spark.sql.extensions' which is
+    # static and immutable. This can't be set or unset, for example, via `spark.conf`.
+
+    @classmethod
+    def setUpClass(cls):
+        import glob
+        from pyspark.find_spark_home import _find_spark_home
+
+        SPARK_HOME = _find_spark_home()
+        filename_pattern = (
+            "sql/core/target/scala-*/test-classes/org/apache/spark/sql/"
+            "SparkSessionExtensionSuite.class")
+        if not glob.glob(os.path.join(SPARK_HOME, filename_pattern)):
+            raise unittest.SkipTest(
+                "'org.apache.spark.sql.SparkSessionExtensionSuite.' is not "
+                "available. Will skip the related tests.")
+
+        # Note that 'spark.sql.extensions' is a static immutable configuration.
+        cls.spark = SparkSession.builder \
+            .master("local[4]") \
+            .appName(cls.__name__) \
+            .config(
+                "spark.sql.extensions",
+                "org.apache.spark.sql.MyExtensions") \
+            .getOrCreate()
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.spark.stop()
+
+    def tearDown(self):
+        self.spark._jvm.OnSuccessCall.clear()
--- End diff --
This wouldn't be needed since I did this for testing if the callback is called or not in the PR pointed out.
[GitHub] spark pull request #22122: [SPARK-24665][PySpark][FollowUp] Use SQLConf in P...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22122
[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21990#discussion_r210790531
--- Diff: python/pyspark/sql/tests.py ---
@@ -3563,6 +3563,51 @@ def test_query_execution_listener_on_collect_with_arrow(self):
             "The callback from the query execution listener should be called after 'toPandas'")
 
+class SparkExtensionsTest(unittest.TestCase, SQLTestUtils):
+    # These tests are separate because it uses 'spark.sql.extensions' which is
+    # static and immutable. This can't be set or unset, for example, via `spark.conf`.
+
+    @classmethod
+    def setUpClass(cls):
+        import glob
+        from pyspark.find_spark_home import _find_spark_home
+
+        SPARK_HOME = _find_spark_home()
+        filename_pattern = (
+            "sql/core/target/scala-*/test-classes/org/apache/spark/sql/"
+            "SparkSessionExtensionSuite.class")
+        if not glob.glob(os.path.join(SPARK_HOME, filename_pattern)):
+            raise unittest.SkipTest(
+                "'org.apache.spark.sql.SparkSessionExtensionSuite.' is not "
+                "available. Will skip the related tests.")
+
+        # Note that 'spark.sql.extensions' is a static immutable configuration.
+        cls.spark = SparkSession.builder \
+            .master("local[4]") \
+            .appName(cls.__name__) \
+            .config(
+                "spark.sql.extensions",
+                "org.apache.spark.sql.MyExtensions") \
--- End diff --
@RussellSpitzer, I think you should push the `MyExtensions` scala side code too.
[GitHub] spark issue #22114: [SPARK-24938][Core] Prevent Netty from using onheap memo...
Github user NiharS commented on the issue: https://github.com/apache/spark/pull/22114 They pass on my machine :(
[GitHub] spark issue #22122: [SPARK-24665][PySpark][FollowUp] Use SQLConf in PySpark ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22122 Merged to master.
[GitHub] spark issue #22122: [SPARK-24665][PySpark][FollowUp] Use SQLConf in PySpark ...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/22122 ``` Are they all instances to fix? ``` @HyukjinKwon Yep, I grepped all `conf.get("spark.sql.xxx")` calls to make sure of this. The only remaining hard-coded config is the StaticSQLConf `spark.sql.catalogImplementation` in session.py; it can't be managed by SQLConf.
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21990 **[Test build #94873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94873/testReport)** for PR 21990 at commit [`0eea205`](https://github.com/apache/spark/commit/0eea205ca0591c68975412873b34393f6bf19437).
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r210788916
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1603,6 +1603,44 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
       .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
   }
 
+  test("SPARK-25134: check header on parsing of dataset with projection and column pruning") {
+    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
+      withTempPath { path =>
+        val dir = path.getAbsolutePath
+        Seq(("a", "b")).toDF("columnA", "columnB").write
+          .format("csv")
+          .option("header", true)
+          .save(dir)
+        checkAnswer(spark.read
+          .format("csv")
+          .option("header", true)
+          .option("enforceSchema", false)
+          .load(dir)
+          .select("columnA"),
+          Row("a"))
+      }
+    }
+  }
+
+  test("SPARK-25134: check header on parsing of dataset with projection and no column pruning") {
+    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "false") {
--- End diff --
I think the `false` case test can be removed.
[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21950 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94869/ Test FAILed.
[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21950 Merged build finished. Test FAILed.
[GitHub] spark issue #21950: [SPARK-24914][SQL][WIP] Add configuration to avoid OOM d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21950 **[Test build #94869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94869/testReport)** for PR 21950 at commit [`3a65edf`](https://github.com/apache/spark/commit/3a65edf0e07f3beb6d6dd4dcb16e76ea7210c5e9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22098: [SPARK-24886][INFRA] Fix the testing script to increase ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22098 haha, it's more than 4 years ago .. if we are unsure about the env, let me just push this in.
[GitHub] spark pull request #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22104#discussion_r210786895
--- Diff: python/pyspark/sql/tests.py ---
@@ -3367,6 +3367,33 @@ def test_ignore_column_of_all_nulls(self):
         finally:
             shutil.rmtree(path)
 
+    # SPARK-24721
+    def test_datasource_with_udf_filter_lit_input(self):
+        from pyspark.sql.functions import udf, lit, col
+
+        path = tempfile.mkdtemp()
+        shutil.rmtree(path)
+        try:
+            self.spark.range(1).write.mode("overwrite").format('csv').save(path)
+            filesource_df = self.spark.read.csv(path)
+            datasource_df = self.spark.read \
+                .format("org.apache.spark.sql.sources.SimpleScanSource") \
+                .option('from', 0).option('to', 1).load()
+            datasource_v2_df = self.spark.read \
+                .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") \
--- End diff --
This wouldn't work if the test classes are not compiled. I think we had better make a separate test suite that skips the tests if the test classes are not present.
[GitHub] spark issue #22114: [SPARK-24938][Core] Prevent Netty from using onheap memo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22114 Merged build finished. Test FAILed.
[GitHub] spark issue #22114: [SPARK-24938][Core] Prevent Netty from using onheap memo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22114 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94867/ Test FAILed.
[GitHub] spark issue #22114: [SPARK-24938][Core] Prevent Netty from using onheap memo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22114 **[Test build #94867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94867/testReport)** for PR 22114 at commit [`c2f9ed1`](https://github.com/apache/spark/commit/c2f9ed10776842ffe0746fcc89b157675fa6c455). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21868#discussion_r210785081
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -459,6 +458,29 @@ object SQLConf {
     .intConf
     .createWithDefault(4096)
 
+  val IS_PARQUET_PARTITION_ADAPTIVE_ENABLED = buildConf("spark.sql.parquet.adaptiveFileSplit")
+    .doc("For columnar file format (e.g., Parquet), it's possible that only few (not all) " +
+      "columns are needed. So, it's better to make sure that the total size of the selected " +
+      "columns is about 128 MB "
+    )
+    .booleanConf
+    .createWithDefault(false)
+
+  val PARQUET_STRUCT_LENGTH = buildConf("spark.sql.parquet.struct.length")
+    .doc("Set the default size of struct column")
+    .intConf
+    .createWithDefault(StringType.defaultSize)
--- End diff --
And do these configs assume that different storage formats use the same size?
[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21868#discussion_r210779310
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -459,6 +458,29 @@ object SQLConf {
     .intConf
     .createWithDefault(4096)
 
+  val IS_PARQUET_PARTITION_ADAPTIVE_ENABLED = buildConf("spark.sql.parquet.adaptiveFileSplit")
+    .doc("For columnar file format (e.g., Parquet), it's possible that only few (not all) " +
+      "columns are needed. So, it's better to make sure that the total size of the selected " +
+      "columns is about 128 MB "
+    )
+    .booleanConf
+    .createWithDefault(false)
+
+  val PARQUET_STRUCT_LENGTH = buildConf("spark.sql.parquet.struct.length")
+    .doc("Set the default size of struct column")
+    .intConf
+    .createWithDefault(StringType.defaultSize)
+
+  val PARQUET_MAP_LENGTH = buildConf("spark.sql.parquet.map.length")
+    .doc("Set the default size of map column")
+    .intConf
+    .createWithDefault(StringType.defaultSize)
+
+  val PARQUET_ARRAY_LENGTH = buildConf("spark.sql.parquet.array.length")
+    .doc("Set the default size of array column")
+    .intConf
+    .createWithDefault(StringType.defaultSize)
--- End diff --
This feature introduces quite a few configs; my concern is that it is hard for end users to set them.
[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21868#discussion_r210765335
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -25,17 +25,16 @@ import java.util.zip.Deflater
 import scala.collection.JavaConverters._
 import scala.collection.immutable
 import scala.util.matching.Regex
-
--- End diff --
Please don't remove these blank lines. Can you revert this?
[GitHub] spark issue #21819: [SPARK-24863][SS] Report Kafka offset lag as a custom me...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21819 Let me leave this open for a few days in case someone has more comments on this.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21320 I said either way works fine. It doesn't matter which way we go. It would be better to close one of them if the approach is the same and both PRs are active.
[GitHub] spark issue #22125: [DOCS] Fix cloud-integration.md Typo
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22125 @KraFusion Sorry, I overlooked another PR.
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21537 I don't see a notable risk here. That's just to avoid string interpolation, which makes the code less error-prone, which was discussed already, and the code change is small. I hope we can move the other discussions to other threads like JIRA or the mailing list so that people can see them. It's actually quite difficult for me to find such discussions.
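For context, the difference between raw string interpolation and block-based codegen discussed in this thread can be illustrated with a toy model (a hypothetical Python sketch with invented names; Spark's real implementation is the Scala block/`ExprValue` machinery in the PR, not this code):

```python
# With plain string interpolation, operand names are spliced into opaque
# strings and nothing can later inspect which values a snippet uses. A block
# keeps the template and its operands separate, so generated code stays
# analyzable and malformed splices can be validated.
class ExprValue:
    def __init__(self, name, java_type):
        self.name = name
        self.java_type = java_type

class CodeBlock:
    def __init__(self, template, inputs):
        self.template = template  # e.g. "int {0} = {1} + 1;"
        self.inputs = inputs      # ExprValue operands, tracked explicitly

    @property
    def expr_values(self):
        # Unlike a flat interpolated string, the operands remain queryable.
        return list(self.inputs)

    def code(self):
        return self.template.format(*(v.name for v in self.inputs))

out = ExprValue("value_1", "int")
inp = ExprValue("value_0", "int")
block = CodeBlock("int {0} = {1} + 1;", [out, inp])
```

Here `block.code()` renders the final Java snippet, while `block.expr_values` still exposes the operands for later analysis, which is the property the blocks-based approach buys over interpolation.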
[GitHub] spark pull request #22129: just testing pyarrow changes
GitHub user shaneknapp opened a pull request: https://github.com/apache/spark/pull/22129 just testing pyarrow changes ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/shaneknapp/spark pyarrow-test Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22129.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22129 commit df646e860f97c09f7fea9a80a058025bd3edac57 Author: shane knapp Date: 2018-08-17T00:48:47Z just testing pyarrow changes
[GitHub] spark pull request #22129: just testing pyarrow changes
Github user shaneknapp closed the pull request at: https://github.com/apache/spark/pull/22129
[GitHub] spark issue #21221: [SPARK-23429][CORE] Add executor memory metrics to heart...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21221 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94865/ Test PASSed.
[GitHub] spark issue #21221: [SPARK-23429][CORE] Add executor memory metrics to heart...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21221 Merged build finished. Test PASSed.
[GitHub] spark issue #21221: [SPARK-23429][CORE] Add executor memory metrics to heart...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21221 **[Test build #94865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94865/testReport)** for PR 21221 at commit [`2897281`](https://github.com/apache/spark/commit/2897281a384d25556609a17be21f926cb5d68dd6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21860 **[Test build #94872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94872/testReport)** for PR 21860 at commit [`3aa4e6d`](https://github.com/apache/spark/commit/3aa4e6d2c4ebd330898feb75af7b7fb36f512ea7).
[GitHub] spark issue #22048: [SPARK-25108][SQL] Fix the show method to display the wi...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22048 Thank you for creating a JIRA entry and for putting up the result. The test case is not available yet.
[GitHub] spark issue #22126: [SPARK-23938][SQL][FOLLOW-UP][TEST] Nullabilities of val...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22126 Merged build finished. Test PASSed.
[GitHub] spark issue #22126: [SPARK-23938][SQL][FOLLOW-UP][TEST] Nullabilities of val...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22126 **[Test build #94871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94871/testReport)** for PR 22126 at commit [`45d044c`](https://github.com/apache/spark/commit/45d044c42fd8b785c734a920f4b557ca469a5212).
[GitHub] spark issue #22126: [SPARK-23938][SQL][FOLLOW-UP][TEST] Nullabilities of val...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22126 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2259/ Test PASSed.
[GitHub] spark issue #22045: [SPARK-23940][SQL] Add transform_values SQL function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22045 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94864/ Test PASSed.
[GitHub] spark issue #22045: [SPARK-23940][SQL] Add transform_values SQL function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22045 Merged build finished. Test PASSed.
[GitHub] spark issue #22045: [SPARK-23940][SQL] Add transform_values SQL function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22045 **[Test build #94864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94864/testReport)** for PR 22045 at commit [`3382e1a`](https://github.com/apache/spark/commit/3382e1a5396c8e5a94802d92a7106eacf627617c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21990 **[Test build #94870 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94870/testReport)** for PR 21990 at commit [`d5c37b7`](https://github.com/apache/spark/commit/d5c37b732f1948d8240bd8de33a080ac5db03571). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class SparkExtensionsTest(unittest.TestCase, SQLTestUtils):`
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21990 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94870/ Test FAILed.
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21990 Merged build finished. Test FAILed.
[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21990 **[Test build #94870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94870/testReport)** for PR 21990 at commit [`d5c37b7`](https://github.com/apache/spark/commit/d5c37b732f1948d8240bd8de33a080ac5db03571).
[GitHub] spark pull request #22116: [DOCS]Update configuration.md
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22116