[jira] [Updated] (SPARK-30400) Test failure in SQL module on ppc64le
[ https://issues.apache.org/jira/browse/SPARK-30400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] AK97 updated SPARK-30400:
- Shepherd: Yin Huai

> Test failure in SQL module on ppc64le
> -------------------------------------
> Key: SPARK-30400
> URL: https://issues.apache.org/jira/browse/SPARK-30400
> Project: Spark
> Issue Type: Bug
> Components: SQL, Tests
> Affects Versions: 2.4.0
> Environment: os: rhel 7.6
> arch: ppc64le
> Reporter: AK97
> Priority: Major
>
> I have been trying to build Apache Spark on rhel_7.6/ppc64le; however, the test cases in the SQL module fail with the following errors:
> {code}
> - CREATE TABLE USING AS SELECT based on the file without write permission *** FAILED ***
>   Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown (CreateTableAsSelectSuite.scala:92)
> - create a table, drop it and create another one with the same name *** FAILED ***
>   org.apache.spark.sql.AnalysisException: Table default.jsonTable already exists. You need to drop it first.;
>   at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:159)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
>   at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
>   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> {code}
> I would like some help understanding the cause of these failures. I am running the build on a high-end VM with good connectivity.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30400) Test failure in SQL module on ppc64le
[ https://issues.apache.org/jira/browse/SPARK-30400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] AK97 updated SPARK-30400:
- Shepherd: (was: Yin Huai)
[jira] [Updated] (SPARK-30400) Test failure in SQL module on ppc64le
[ https://issues.apache.org/jira/browse/SPARK-30400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] AK97 updated SPARK-30400:
- Shepherd: Yin Huai
- Environment: os: rhel 7.6 arch: ppc64le (was: os: rhel 7.6 arch: ppc64le)
[jira] [Commented] (SPARK-30400) Test failure in SQL module on ppc64le
[ https://issues.apache.org/jira/browse/SPARK-30400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012542#comment-17012542 ] AK97 commented on SPARK-30400:
Any leads would be appreciated.
[jira] [Created] (SPARK-30481) Integrate event log compactor into Spark History Server
Jungtaek Lim created SPARK-30481:
Summary: Integrate event log compactor into Spark History Server
Key: SPARK-30481
URL: https://issues.apache.org/jira/browse/SPARK-30481
Project: Spark
Issue Type: Sub-task
Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jungtaek Lim

This issue tracks the effort to compact old event log files (and clean up after compaction) without breaking the compatibility guarantee. It depends on SPARK-29779 and SPARK-30479, and focuses on integrating the event log compactor into the Spark History Server and enabling its configuration.
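As a generic illustration of the selection step such a compactor needs, the sketch below splits rolling event-log files into a batch to compact and a tail to retain. It is a hypothetical helper, not Spark's code; the name `select_files_to_compact` and the `max_files_to_retain` knob are assumptions modeled on the rolling event-log design.

```python
# Hypothetical sketch of one piece of an event-log compactor: given rolling
# event-log files tagged with increasing indices, keep the newest
# `max_files_to_retain` files as-is and hand the older ones to compaction.

def select_files_to_compact(indexed_files, max_files_to_retain=2):
    """indexed_files: list of (index, filename). Returns (to_compact, to_retain)."""
    ordered = [name for _, name in sorted(indexed_files)]
    if len(ordered) <= max_files_to_retain:
        return [], ordered
    cut = len(ordered) - max_files_to_retain
    return ordered[:cut], ordered[cut:]

to_compact, to_retain = select_files_to_compact(
    [(1, "events_1"), (2, "events_2"), (3, "events_3"), (4, "events_4")])
```

The compactor would then replay the `to_compact` files, apply its filters, and write one compacted file, leaving the retained tail untouched.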
[jira] [Resolved] (SPARK-30480) Pyspark test "test_memory_limit" fails consistently
[ https://issues.apache.org/jira/browse/SPARK-30480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30480.
Fix Version/s: 3.0.0
Resolution: Fixed
Fixed in [https://github.com/apache/spark/pull/27162]

> Pyspark test "test_memory_limit" fails consistently
> Key: SPARK-30480
> URL: https://issues.apache.org/jira/browse/SPARK-30480
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.0.0
> Reporter: Jungtaek Lim
> Priority: Major
> Fix For: 3.0.0
>
> I'm seeing consistent pyspark test failures on multiple PRs ([#26955|https://github.com/apache/spark/pull/26955], [#26201|https://github.com/apache/spark/pull/26201], [#27064|https://github.com/apache/spark/pull/27064]), and they all failed from "test_memory_limit".
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116422/testReport]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116438/testReport]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116429/testReport]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116366/testReport]
[jira] [Closed] (SPARK-29776) rpad and lpad should return NULL when padstring parameter is empty
[ https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-29776.

> rpad and lpad should return NULL when padstring parameter is empty
> Key: SPARK-29776
> URL: https://issues.apache.org/jira/browse/SPARK-29776
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Major
>
> As per the rpad definition:
> rpad(str, len, pad) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. *If the pad string is empty, the return value is NULL.*
> Below is an example. In Spark:
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
> +----------------+
> | rpad(hi, 5, )  |
> +----------------+
> | hi             |
> +----------------+
> {code}
> It should return NULL as per the definition.
> Hive's behavior is correct: as per the definition, it returns NULL when pad is an empty string.
> INFO : Concurrency mode is disabled, not creating a lock manager
> {code}
> +-------+
> | _c0   |
> +-------+
> | NULL  |
> +-------+
> {code}
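For reference, the NULL-on-empty-pad semantics the report asks for can be sketched in plain Python. This is only an illustration of the expected behavior, not Spark's or Hive's implementation.

```python
# Sketch of rpad with the semantics described in the issue: right-pad `s`
# with `pad` up to `length`; an empty pad string yields None (SQL NULL).

def rpad(s, length, pad):
    if s is None or pad is None:
        return None
    if len(s) >= length:
        return s[:length]          # longer input is truncated to `length`
    if pad == "":
        return None                # the behavior the issue says is correct
    repeats = (length - len(s) + len(pad) - 1) // len(pad)
    return (s + pad * repeats)[:length]
```

With these semantics, `rpad('hi', 5, '')` yields NULL rather than the unpadded input.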
[jira] [Created] (SPARK-30480) Pyspark test "test_memory_limit" fails consistently
Jungtaek Lim created SPARK-30480:
Summary: Pyspark test "test_memory_limit" fails consistently
Key: SPARK-30480
URL: https://issues.apache.org/jira/browse/SPARK-30480
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.0.0
Reporter: Jungtaek Lim

I'm seeing consistent pyspark test failures on multiple PRs ([#26955|https://github.com/apache/spark/pull/26955], [#26201|https://github.com/apache/spark/pull/26201], [#27064|https://github.com/apache/spark/pull/27064]), and they all failed from "test_memory_limit".
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116422/testReport]
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116438/testReport]
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116429/testReport]
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116366/testReport]
[jira] [Updated] (SPARK-27686) Update migration guide
[ https://issues.apache.org/jira/browse/SPARK-27686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27686:
Attachment: hive-1.2.1-lib.tgz

> Update migration guide
> Key: SPARK-27686
> URL: https://issues.apache.org/jira/browse/SPARK-27686
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Priority: Minor
> Attachments: hive-1.2.1-lib.tgz
>
> The built-in Hive 2.3 fixes the following issues:
> * HIVE-6727: Table level stats for external tables are set incorrectly.
> * HIVE-15653: Some ALTER TABLE commands drop table stats.
> * SPARK-12014: Spark SQL query containing semicolon is broken in Beeline.
> * SPARK-25193: insert overwrite doesn't throw exception when drop old data fails.
> * SPARK-25919: Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned.
> * SPARK-26332: Spark sql write orc table on viewFS throws exception.
> * SPARK-26437: Decimal data becomes bigint to query, unable to query.
> We need to update the migration guide.
[jira] [Created] (SPARK-30479) Apply compaction of event log to SQL events
Jungtaek Lim created SPARK-30479:
Summary: Apply compaction of event log to SQL events
Key: SPARK-30479
URL: https://issues.apache.org/jira/browse/SPARK-30479
Project: Spark
Issue Type: Sub-task
Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jungtaek Lim

This issue tracks the effort to compact old event logs (and clean up after compaction) without breaking the compatibility guarantee. It depends on SPARK-29779 and focuses on dealing with SQL events.
[jira] [Updated] (SPARK-29779) Compact old event log files and clean up
[ https://issues.apache.org/jira/browse/SPARK-29779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-29779:
Description:
This issue tracks the effort to compact old event log files (and clean up after compaction) without breaking the compatibility guarantee.
Please note that this issue leaves the functionality below to future JIRA issues, as the patch for SPARK-29779 was too large and we decided to break it down:
* apply filter in SQL events
* integrate compaction into FsHistoryProvider
* documentation about the new configuration

was: This issue is to track the effort on compacting old event logs (and cleaning up after compaction) without breaking the compatibility guarantee.
[jira] [Updated] (SPARK-30477) More KeyValueGroupedDataset methods should be composable
[ https://issues.apache.org/jira/browse/SPARK-30477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Jones updated SPARK-30477:
Description:
Right now many `KeyValueGroupedDataset` methods do not return a `KeyValueGroupedDataset`. In some cases this means we have to do multiple `groupByKey`s in order to express certain patterns.

Setup
{code:scala}
def f: T => K
def g: U => K
def h: V => K

val ds1: Dataset[T] = ???
val ds2: Dataset[U] = ???
val ds3: Dataset[V] = ???

val kvDs1: KeyValueGroupedDataset[K, T] = ds1.groupByKey(f)
val kvDs2: KeyValueGroupedDataset[K, U] = ds2.groupByKey(g)
val kvDs3: KeyValueGroupedDataset[K, V] = ds3.groupByKey(h)
{code}

Example one: combining multiple cogrouped Datasets.
{code:scala}
// Current
kvDs1
  .cogroup(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: X)
  .groupByKey((x: X) => ???: K)
  .cogroup(kvDs3)((k: K, it1: Iterator[X], it2: Iterator[Y]) => ???: Z)

// Wanted
trait KeyValueGroupedDataset[K, T] {
  def coGroupKeyValueGroupedDataset[U, X](r: KeyValueGroupedDataset[K, U])(f: (K, Iterator[T], Iterator[U]) => X): KeyValueGroupedDataset[K, X]
}

kvDs1
  .coGroupKeyValueGroupedDataset(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: X)
  .coGroupKeyValueGroupedDataset(kvDs3)((k: K, it1: Iterator[X], it2: Iterator[Y]) => ???: Z)
{code}

Example two: combining a reduceGroups with a cogroup.
{code:scala}
// Current
val newDs1: Dataset[X] = kvDs1
  .reduceGroups((l: T, r: T) => ???: T)
  .groupByKey { case (k, _) => k }
  .mapValues { case (_, v) => v }
  .cogroup(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: X)

// Wanted
trait KeyValueGroupedDataset[K, V] {
  def reduceGroupsKeyValueGroupedDataset(f: (V, V) => V): KeyValueGroupedDataset[K, V]
}

val newDs2: Dataset[X] = kvDs1
  .reduceGroupsKeyValueGroupedDataset((l: T, r: T) => ???: T)
  .cogroup(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: X)
{code}

In both cases not only are the ergonomics better, Spark will be better able to optimize the code.

For almost every method of `KeyValueGroupedDataset` we should have a matching method that returns a `KeyValueGroupedDataset`. We can also add a `.toDs` method which converts a `KeyValueGroupedDataset[K, V]` to a `Dataset[(K, V)]`.

was: (the same description, differing only in formatting)

> More KeyValueGroupedDataset methods should be composable
> Key: SPARK-30477
> URL: https://issues.apache.org/jira/browse/SPARK-30477
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.4
> Reporter: Paul Jones
> Priority: Major
>
> Right now many `KeyValueGroupedDataset` methods do not return a `KeyValueGroupedDataset`. In some cases this means we have to do multiple `groupByKey`s in order to express certain patterns.
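The composability being requested can be illustrated with a toy keyed collection in Python. The names are hypothetical and purely show the chaining idea; this is not the proposed Spark API.

```python
# Toy keyed dataset whose grouped operations return another keyed dataset,
# so cogroup/reduce steps chain without an extra re-grouping pass.

from collections import defaultdict
from functools import reduce

class Keyed:
    def __init__(self, pairs):
        groups = defaultdict(list)
        for k, v in pairs:
            groups[k].append(v)
        self.groups = dict(groups)

    def cogroup_keyed(self, other, f):
        # analogous to the proposed coGroupKeyValueGroupedDataset
        keys = set(self.groups) | set(other.groups)
        return Keyed([(k, f(k, self.groups.get(k, []), other.groups.get(k, [])))
                      for k in keys])

    def reduce_groups_keyed(self, f):
        # analogous to the proposed reduceGroupsKeyValueGroupedDataset
        return Keyed([(k, reduce(f, vs)) for k, vs in self.groups.items()])

    def to_pairs(self):
        return sorted(self.groups.items())

summed = Keyed([("x", 1), ("x", 2), ("y", 3)]).reduce_groups_keyed(lambda l, r: l + r)
joined = summed.cogroup_keyed(Keyed([("x", 10)]),
                              lambda k, l, r: sum(l) + sum(r))
```

Because each step hands back a keyed structure, the reduce-then-cogroup pipeline needs no intermediate `groupByKey`, which is exactly the ergonomic (and optimizer-friendly) win the issue describes.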
[jira] [Updated] (SPARK-27686) Update migration guide
[ https://issues.apache.org/jira/browse/SPARK-27686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27686:
Description:
The built-in Hive 2.3 fixes the following issues:
* HIVE-6727: Table level stats for external tables are set incorrectly.
* HIVE-15653: Some ALTER TABLE commands drop table stats.
* SPARK-12014: Spark SQL query containing semicolon is broken in Beeline.
* SPARK-25193: insert overwrite doesn't throw exception when drop old data fails.
* SPARK-25919: Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned.
* SPARK-26332: Spark sql write orc table on viewFS throws exception.
* SPARK-26437: Decimal data becomes bigint to query, unable to query.
We need to update the migration guide.

was: the same description, plus the sentence "Please note that this is only fixed in `hadoop-3.2` binary distribution."
[jira] [Updated] (SPARK-27686) Update migration guide
[ https://issues.apache.org/jira/browse/SPARK-27686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27686:
Parent Issue: SPARK-30034 (was: SPARK-23710)
[jira] [Updated] (SPARK-30474) Writing data to parquet with dynamic partitionOverwriteMode should not do the folder rename in commitjob stage
[ https://issues.apache.org/jira/browse/SPARK-30474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zaisheng Dai updated SPARK-30474:
Description:
In the current Spark implementation, if you set
{code:java}
spark.sql.sources.partitionOverwriteMode=dynamic
{code}
even with
{code:java}
mapreduce.fileoutputcommitter.algorithm.version=2
{code}
it still renames the partition folders *sequentially* in the commitJob stage, as shown here:
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188]
[https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184]
This is very slow on cloud storage. Should we commit the data similarly to FileOutputCommitter v2?

was: (the same description)

> Writing data to parquet with dynamic partitionOverwriteMode should not do the folder rename in commitjob stage
> Key: SPARK-30474
> URL: https://issues.apache.org/jira/browse/SPARK-30474
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 2.3.4, 2.4.4
> Reporter: Zaisheng Dai
> Priority: Minor
[jira] [Updated] (SPARK-30474) Writing data to parquet with dynamic partitionOverwriteMode should not do the folder rename in commitjob stage
[ https://issues.apache.org/jira/browse/SPARK-30474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zaisheng Dai updated SPARK-30474:
Summary: Writing data to parquet with dynamic partitionOverwriteMode should not do the folder rename in commitjob stage (was: Writing data to parquet with dynamic partition should not be done in commit job stage)
[jira] [Commented] (SPARK-27686) Update migration guide
[ https://issues.apache.org/jira/browse/SPARK-27686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012434#comment-17012434 ] Dongjoon Hyun commented on SPARK-27686:
Hi, [~yumwang]. Can we have this document?
[jira] [Commented] (SPARK-30441) Improve the memory usage in StronglyConnectedComponents
[ https://issues.apache.org/jira/browse/SPARK-30441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012432#comment-17012432 ] Dongjoon Hyun commented on SPARK-30441: --- Hi, [~jmzhou]. Please don't set `Fixed Version`. We use that when the committers merge the PRs. - https://spark.apache.org/contributing.html Also, `New Feature` and `Improvement` should have the version of the `master` branch because the Apache Spark community backports only bug fixes. > Improve the memory usage in StronglyConnectedComponents > --- > > Key: SPARK-30441 > URL: https://issues.apache.org/jira/browse/SPARK-30441 > Project: Spark > Issue Type: Improvement > Components: GraphX > Affects Versions: 3.0.0 > Reporter: jiamuzhou > Priority: Major > Attachments: figure1.png, figure2.png > > > StronglyConnectedComponents consumes a lot of memory (see figure1.png), because the Graph/RDD is not marked as non-persistent in a timely manner during the iterative process, which may lead to failures on large graphs. > In order to improve memory usage, it is very important to mark the Graph/RDD as non-persistent promptly. The current code marks only 'sccGraph' as non-persistent, but not 'sccWorkGraph' in the degree and Pregel steps. > I have prepared an optimized code proposal (see my fork: [https://github.com/jmzhoulab/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala]) > The storage after optimization is shown in figure2.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
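The unpersist pattern the ticket asks for can be sketched abstractly. The toy Python below is not GraphX code — `ToyCache` is an invented stand-in for Spark's block manager — but it shows the bookkeeping: each iteration's working graph is cached, and the previous one is released as soon as its successor is materialized, so at most one intermediate stays resident instead of all of them.

```python
class ToyCache:
    """Stand-in for Spark's block manager: tracks which intermediate
    results are currently held in memory. (Invented for illustration.)"""
    def __init__(self):
        self.live = set()

    def persist(self, name):
        self.live.add(name)

    def unpersist(self, name):
        self.live.discard(name)

cache = ToyCache()
prev = None
for i in range(5):
    cur = f"sccWorkGraph_{i}"
    cache.persist(cur)         # materialize this iteration's working graph
    if prev is not None:
        cache.unpersist(prev)  # release the previous one promptly
    prev = cur
# After the loop only the latest intermediate remains cached.
```

Without the `unpersist` call, `cache.live` would grow by one entry per iteration — the growth pattern figure1.png reportedly shows for the cached `sccWorkGraph` lineage.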
[jira] [Updated] (SPARK-30441) Improve the memory usage in StronglyConnectedComponents
[ https://issues.apache.org/jira/browse/SPARK-30441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30441: -- Target Version/s: (was: 3.0.0) > Improve the memory usage in StronglyConnectedComponents > --- > > Key: SPARK-30441 > URL: https://issues.apache.org/jira/browse/SPARK-30441 > Project: Spark > Issue Type: Improvement > Components: GraphX > Affects Versions: 2.1.0, 2.3.0, 2.4.0, 2.4.4 > Reporter: jiamuzhou > Priority: Major > Attachments: figure1.png, figure2.png > > > StronglyConnectedComponents consumes a lot of memory (see figure1.png), because the Graph/RDD is not marked as non-persistent in a timely manner during the iterative process, which may lead to failures on large graphs. > In order to improve memory usage, it is very important to mark the Graph/RDD as non-persistent promptly. The current code marks only 'sccGraph' as non-persistent, but not 'sccWorkGraph' in the degree and Pregel steps. > I have prepared an optimized code proposal (see my fork: [https://github.com/jmzhoulab/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala]) > The storage after optimization is shown in figure2.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30441) Improve the memory usage in StronglyConnectedComponents
[ https://issues.apache.org/jira/browse/SPARK-30441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30441: -- Flags: (was: Important) > Improve the memory usage in StronglyConnectedComponents > --- > > Key: SPARK-30441 > URL: https://issues.apache.org/jira/browse/SPARK-30441 > Project: Spark > Issue Type: Improvement > Components: GraphX > Affects Versions: 2.1.0, 2.3.0, 2.4.0, 2.4.4 > Reporter: jiamuzhou > Priority: Major > Attachments: figure1.png, figure2.png > > > StronglyConnectedComponents consumes a lot of memory (see figure1.png), because the Graph/RDD is not marked as non-persistent in a timely manner during the iterative process, which may lead to failures on large graphs. > In order to improve memory usage, it is very important to mark the Graph/RDD as non-persistent promptly. The current code marks only 'sccGraph' as non-persistent, but not 'sccWorkGraph' in the degree and Pregel steps. > I have prepared an optimized code proposal (see my fork: [https://github.com/jmzhoulab/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala]) > The storage after optimization is shown in figure2.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30441) Improve the memory usage in StronglyConnectedComponents
[ https://issues.apache.org/jira/browse/SPARK-30441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30441: -- Affects Version/s: (was: 2.4.4) (was: 2.4.0) (was: 2.3.0) (was: 2.1.0) 3.0.0 > Improve the memory usage in StronglyConnectedComponents > --- > > Key: SPARK-30441 > URL: https://issues.apache.org/jira/browse/SPARK-30441 > Project: Spark > Issue Type: Improvement > Components: GraphX > Affects Versions: 3.0.0 > Reporter: jiamuzhou > Priority: Major > Attachments: figure1.png, figure2.png > > > StronglyConnectedComponents consumes a lot of memory (see figure1.png), because the Graph/RDD is not marked as non-persistent in a timely manner during the iterative process, which may lead to failures on large graphs. > In order to improve memory usage, it is very important to mark the Graph/RDD as non-persistent promptly. The current code marks only 'sccGraph' as non-persistent, but not 'sccWorkGraph' in the degree and Pregel steps. > I have prepared an optimized code proposal (see my fork: [https://github.com/jmzhoulab/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala]) > The storage after optimization is shown in figure2.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30441) Improve the memory usage in StronglyConnectedComponents
[ https://issues.apache.org/jira/browse/SPARK-30441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30441: -- Fix Version/s: (was: 3.0.0) > Improve the memory usage in StronglyConnectedComponents > --- > > Key: SPARK-30441 > URL: https://issues.apache.org/jira/browse/SPARK-30441 > Project: Spark > Issue Type: Improvement > Components: GraphX > Affects Versions: 2.1.0, 2.3.0, 2.4.0, 2.4.4 > Reporter: jiamuzhou > Priority: Major > Attachments: figure1.png, figure2.png > > > StronglyConnectedComponents consumes a lot of memory (see figure1.png), because the Graph/RDD is not marked as non-persistent in a timely manner during the iterative process, which may lead to failures on large graphs. > In order to improve memory usage, it is very important to mark the Graph/RDD as non-persistent promptly. The current code marks only 'sccGraph' as non-persistent, but not 'sccWorkGraph' in the degree and Pregel steps. > I have prepared an optimized code proposal (see my fork: [https://github.com/jmzhoulab/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala]) > The storage after optimization is shown in figure2.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30296) Dataset diffing transformation
[ https://issues.apache.org/jira/browse/SPARK-30296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012430#comment-17012430 ] Dongjoon Hyun commented on SPARK-30296: --- Hi, [~EnricoMi]. Please don't set `Fixed Version`. We set that when the committers merge the PRs. Also, `New Feature` should have the version of the `master` branch, 3.0.0 (as of today), because the Apache Spark community has a policy which allows backporting bug-fixes only. - https://spark.apache.org/contributing.html > Dataset diffing transformation > -- > > Key: SPARK-30296 > URL: https://issues.apache.org/jira/browse/SPARK-30296 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 3.0.0 > Reporter: Enrico Minack > Priority: Major > > Evolving Spark code needs frequent regression testing to prove it still produces identical results, or if changes are expected, to investigate those changes. Diffing the Datasets of two code paths provides confidence. > Diffing small schemata is easy, but with a wide schema the Spark query becomes laborious and error-prone. With a single proven and tested method, diffing becomes easier and a more reliable operation. As a Dataset transformation, you get this operation first-hand with your Dataset API. > This has proven to be useful for interactive Spark as well as deployed production code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30296) Dataset diffing transformation
[ https://issues.apache.org/jira/browse/SPARK-30296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30296: -- Affects Version/s: (was: 2.4.4) 3.0.0 > Dataset diffing transformation > -- > > Key: SPARK-30296 > URL: https://issues.apache.org/jira/browse/SPARK-30296 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 3.0.0 > Reporter: Enrico Minack > Priority: Major > > Evolving Spark code needs frequent regression testing to prove it still produces identical results, or if changes are expected, to investigate those changes. Diffing the Datasets of two code paths provides confidence. > Diffing small schemata is easy, but with a wide schema the Spark query becomes laborious and error-prone. With a single proven and tested method, diffing becomes easier and a more reliable operation. As a Dataset transformation, you get this operation first-hand with your Dataset API. > This has proven to be useful for interactive Spark as well as deployed production code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30296) Dataset diffing transformation
[ https://issues.apache.org/jira/browse/SPARK-30296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30296: -- Fix Version/s: (was: 3.0.0) > Dataset diffing transformation > -- > > Key: SPARK-30296 > URL: https://issues.apache.org/jira/browse/SPARK-30296 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.4.4 > Reporter: Enrico Minack > Priority: Major > > Evolving Spark code needs frequent regression testing to prove it still produces identical results, or if changes are expected, to investigate those changes. Diffing the Datasets of two code paths provides confidence. > Diffing small schemata is easy, but with a wide schema the Spark query becomes laborious and error-prone. With a single proven and tested method, diffing becomes easier and a more reliable operation. As a Dataset transformation, you get this operation first-hand with your Dataset API. > This has proven to be useful for interactive Spark as well as deployed production code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25017) Add test suite for ContextBarrierState
[ https://issues.apache.org/jira/browse/SPARK-25017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25017: -- Target Version/s: (was: 3.0.0) > Add test suite for ContextBarrierState > -- > > Key: SPARK-25017 > URL: https://issues.apache.org/jira/browse/SPARK-25017 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Xingbo Jiang >Priority: Major > > We shall be able to add unit test to ContextBarrierState with a mocked > RpcCallContext. Currently it's only covered by end-to-end test in > `BarrierTaskContextSuite` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30478) update memory package doc
SongXun created SPARK-30478: --- Summary: update memory package doc Key: SPARK-30478 URL: https://issues.apache.org/jira/browse/SPARK-30478 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: SongXun Since Spark 2.0, storage memory can also use off-heap memory. The package doc should be updated accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30131) Add array_median function
[ https://issues.apache.org/jira/browse/SPARK-30131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30131: -- Fix Version/s: (was: 3.0.0) > Add array_median function > - > > Key: SPARK-30131 > URL: https://issues.apache.org/jira/browse/SPARK-30131 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.4.4 > Reporter: Alexander Hagerf > Priority: Minor > > It is known that there isn't any exact median function in Spark SQL, and this might be a difficult problem to solve efficiently. However, finding the median of an array should be a simple task, and something that users can utilize when collecting numeric values to a list or set. > This can already be achieved by sorting and choosing an element, but that can get cumbersome, and if a fully tested function is provided in the API, I think it can save some headaches for many. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
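The semantics the ticket asks for — sort the array and pick the middle element(s) — can be sketched in a few lines. The Python below is only an illustration of the proposed behavior: `array_median` is not an actual Spark function, and the even-length averaging rule shown here is one plausible choice, not something the ticket specifies.

```python
def array_median(xs):
    """Median of a list of numbers: sort, then pick the middle element,
    averaging the two middle elements for even-length input.
    (Sketch of the proposed array_median semantics; hypothetical API.)"""
    if not xs:
        return None  # empty array -> NULL in SQL terms
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(array_median([3.0, 1.0, 2.0]))       # 2.0
print(array_median([4.0, 1.0, 3.0, 2.0]))  # 2.5
```

This is exactly the "sorting and choosing an element" workaround the description mentions, packaged as a single function — the point of the proposal is that a built-in, tested version would spare every user from rewriting it.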
[jira] [Updated] (SPARK-30131) Add array_median function
[ https://issues.apache.org/jira/browse/SPARK-30131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30131: -- Target Version/s: (was: 2.4.4) > Add array_median function > - > > Key: SPARK-30131 > URL: https://issues.apache.org/jira/browse/SPARK-30131 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.4.4 > Reporter: Alexander Hagerf > Priority: Minor > Fix For: 3.0.0 > > > It is known that there isn't any exact median function in Spark SQL, and this might be a difficult problem to solve efficiently. However, finding the median of an array should be a simple task, and something that users can utilize when collecting numeric values to a list or set. > This can already be achieved by sorting and choosing an element, but that can get cumbersome, and if a fully tested function is provided in the API, I think it can save some headaches for many. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30034) Use Apache Hive 2.3 dependency by default
[ https://issues.apache.org/jira/browse/SPARK-30034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30034. --- Fix Version/s: 3.0.0 Resolution: Done > Use Apache Hive 2.3 dependency by default > - > > Key: SPARK-30034 > URL: https://issues.apache.org/jira/browse/SPARK-30034 > Project: Spark > Issue Type: Umbrella > Components: Build, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > Labels: release-notes > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29988. --- Fix Version/s: 3.0.0 Resolution: Fixed Thank you. It looks like it's working. I'll monitor them. > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra > Affects Versions: 3.0.0 > Reporter: Dongjoon Hyun > Assignee: Shane Knapp > Priority: Major > Fix For: 3.0.0 > > Attachments: Screen Shot 2020-01-09 at 1.59.25 PM.png > > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > For now, I added the above `-Phive-1.2` and `-Phive-2.3` to the Jenkins jobs manually. (This should be added to the SCM of the AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs at Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30477) More KeyValueGroupedDataset methods should be composable
Paul Jones created SPARK-30477: -- Summary: More KeyValueGroupedDataset methods should be composable Key: SPARK-30477 URL: https://issues.apache.org/jira/browse/SPARK-30477 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: Paul Jones Right now many `KeyValueGroupedDataset` methods do not return a `KeyValueGroupedDataset`. In some cases this means we have to do multiple `groupByKey`s in order to express certain patterns. Setup:
{code:scala}
def f: T => K
def g: U => K
def h: V => K

val ds1: Dataset[T] = ???
val ds2: Dataset[U] = ???
val ds3: Dataset[V] = ???

val kvDs1: KeyValueGroupedDataset[K, T] = ds1.groupByKey(f)
val kvDs2: KeyValueGroupedDataset[K, U] = ds2.groupByKey(g)
val kvDs3: KeyValueGroupedDataset[K, V] = ds3.groupByKey(h)
{code}
Example one: combining multiple cogrouped Datasets.
{code:scala}
// Current: each cogroup returns a Dataset, so we must re-group before cogrouping again
kvDs1
  .cogroup(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: Iterator[X])
  .groupByKey((x: X) => ???: K)
  .cogroup(kvDs3)((k: K, it1: Iterator[X], it2: Iterator[V]) => ???: Iterator[Z])

// Wanted
trait KeyValueGroupedDataset[K, T] {
  def coGroupKeyValueGroupedDataset[U, X](
      other: KeyValueGroupedDataset[K, U])(
      f: (K, Iterator[T], Iterator[U]) => TraversableOnce[X]): KeyValueGroupedDataset[K, X]
}

kvDs1
  .coGroupKeyValueGroupedDataset(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: Iterator[X])
  .coGroupKeyValueGroupedDataset(kvDs3)((k: K, it1: Iterator[X], it2: Iterator[V]) => ???: Iterator[Z])
{code}
Example two: combining a reduceGroups with a cogroup.
{code:scala}
// Current: reduceGroups returns a Dataset[(K, T)], forcing another groupByKey
val newDs1: Dataset[X] = kvDs1
  .reduceGroups((l: T, r: T) => ???: T)
  .groupByKey { case (k, _) => k }
  .mapValues { case (_, v) => v }
  .cogroup(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: Iterator[X])

// Wanted
trait KeyValueGroupedDataset[K, T] {
  def reduceGroupsKeyValueGroupedDataset(f: (T, T) => T): KeyValueGroupedDataset[K, T]
}

val newDs2: Dataset[X] = kvDs1
  .reduceGroupsKeyValueGroupedDataset((l: T, r: T) => ???: T)
  .cogroup(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: Iterator[X])
{code}
In both cases not only are the ergonomics better, Spark will be better able to optimize the code. For almost every method of `KeyValueGroupedDataset` we should have a matching method that returns a `KeyValueGroupedDataset`. We can also add a `.toDs` method which converts a `KeyValueGroupedDataset[K, V]` to a `Dataset[(K, V)]`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
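The composability argument can be seen in a language-neutral toy model. In the Python sketch below (illustration only, not the Spark API — `group_by_key` and `cogroup` are invented helpers), a cogroup whose result is itself keyed composes directly with another cogroup, with no intervening re-grouping, which is exactly what `KeyValueGroupedDataset`-returning methods would enable.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Toy groupByKey: list of (k, v) pairs -> dict of k -> [v, ...]."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def cogroup(left, right):
    """Toy cogroup of two grouped maps. The result is itself keyed,
    so a further cogroup composes directly -- no re-grouping needed,
    which is the composability the ticket asks for."""
    return {k: (left.get(k, []), right.get(k, []))
            for k in set(left) | set(right)}

a = group_by_key([("x", 1), ("y", 2)])
b = group_by_key([("x", 10)])
ab = cogroup(a, b)   # still keyed: could feed another cogroup directly
```

In Spark the distinction matters more than in this toy: re-grouping a `Dataset` discards the knowledge that the data is already partitioned by key, so a keyed result type also lets the optimizer skip a shuffle.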
[jira] [Commented] (SPARK-28396) Add PathCatalog for data source V2
[ https://issues.apache.org/jira/browse/SPARK-28396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012381#comment-17012381 ] Gengliang Wang commented on SPARK-28396: [~jerrychenhf] they are still handled by V1 implementation > Add PathCatalog for data source V2 > -- > > Key: SPARK-28396 > URL: https://issues.apache.org/jira/browse/SPARK-28396 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Add PathCatalog for data source V2, so that: > 1. We can convert SaveMode in DataFrameWriter into catalog table operations, > instead of supporting SaveMode in file source V2. > 2. Support create-table SQL statements like "CREATE TABLE orc.'path'" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30476) NullPointException when Insert data to hive mongo external table by spark-sql
XiongCheng created SPARK-30476: -- Summary: NullPointerException when inserting data into a Hive Mongo external table via spark-sql Key: SPARK-30476 URL: https://issues.apache.org/jira/browse/SPARK-30476 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Environment: mongo-hadoop: 2.0.2 spark-version: 2.4.3 scala-version: 2.11 hive-version: 1.2.1 hadoop-version: 2.6.0 Reporter: XiongCheng I executed the following SQL, but got an NPE. result_data_mongo is a MongoDB Hive external table.
{code:java}
insert into result_data_mongo values("15","15","15",15,"15",15,15,15,15,15,15,15,15,15,15);
{code}
NPE detail:
{code:java}
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
 at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
 at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)
 at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
 at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
 at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
 at com.mongodb.hadoop.output.MongoOutputCommitter.getTaskAttemptPath(MongoOutputCommitter.java:264)
 at com.mongodb.hadoop.output.MongoRecordWriter.<init>(MongoRecordWriter.java:59)
 at com.mongodb.hadoop.hive.output.HiveMongoOutputFormat$HiveMongoRecordWriter.<init>(HiveMongoOutputFormat.java:80)
 at com.mongodb.hadoop.hive.output.HiveMongoOutputFormat.getHiveRecordWriter(HiveMongoOutputFormat.java:52)
 at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
 at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
 ... 15 more
{code}
I know mongo-hadoop uses the incorrect key to get the TaskAttemptID, so I modified the source code of mongo-hadoop to read the correct properties ('mapreduce.task.id' and 'mapreduce.task.attempt.id'), but I still can't get the values. I found that these parameters are stored in Spark's TaskAttemptContext, but the TaskAttemptContext is not passed into HiveOutputWriter. Is this a design flaw? Here are the two key points: mongo-hadoop: [https://github.com/mongodb/mongo-hadoop/blob/cdcd0f15503f2d1c5a1a2e3941711d850d1e427b/hive/src/main/java/com/mongodb/hadoop/hive/output/HiveMongoOutputFormat.java#L80] spark-hive: [https://github.com/apache/spark/blob/7c7d7f6a878b02ece881266ee538f3e1443aa8c1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala#L103] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30439) support NOT NULL in column data type
[ https://issues.apache.org/jira/browse/SPARK-30439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30439. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27110 [https://github.com/apache/spark/pull/27110] > support NOT NULL in column data type > > > Key: SPARK-30439 > URL: https://issues.apache.org/jira/browse/SPARK-30439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30416) Log a warning for deprecated SQL config in `set()` and `unset()`
[ https://issues.apache.org/jira/browse/SPARK-30416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30416. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27092 [https://github.com/apache/spark/pull/27092] > Log a warning for deprecated SQL config in `set()` and `unset()` > > > Key: SPARK-30416 > URL: https://issues.apache.org/jira/browse/SPARK-30416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > - Gather deprecated SQL configs and add extra info - when a config was > deprecated and why > - Output warning about deprecated SQL config in set() and unset() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
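The behavior this ticket adds can be sketched as follows. The Python below is a toy model only: the deprecation table and the config key in it are invented for illustration, and the real change lives in Spark's SQL config handling, not in code like this.

```python
import warnings

# Hypothetical deprecation registry: key -> (deprecated since, reason).
# The entry below is made up for illustration.
DEPRECATED_SQL_CONFIGS = {
    "spark.sql.example.legacyFlag": ("3.0.0", "superseded by a newer option"),
}

def set_conf(conf, key, value):
    """Set a config entry, emitting a warning when the key is deprecated --
    the behaviour the ticket adds to set()/unset(). (Sketch only; not the
    actual SQLConf implementation.)"""
    if key in DEPRECATED_SQL_CONFIGS:
        since, reason = DEPRECATED_SQL_CONFIGS[key]
        warnings.warn(
            f"SQL config '{key}' is deprecated since Spark {since}: {reason}")
    conf[key] = value  # the setting still takes effect; we only warn

conf = {}
set_conf(conf, "spark.sql.example.legacyFlag", "true")
```

The design point mirrors the ticket's two bullets: the registry carries "when and why" alongside each deprecated key, and the warning fires in both `set()` and (symmetrically) an `unset()` helper, while leaving behavior otherwise unchanged.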
[jira] [Assigned] (SPARK-30416) Log a warning for deprecated SQL config in `set()` and `unset()`
[ https://issues.apache.org/jira/browse/SPARK-30416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30416: Assignee: Maxim Gekk > Log a warning for deprecated SQL config in `set()` and `unset()` > > > Key: SPARK-30416 > URL: https://issues.apache.org/jira/browse/SPARK-30416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > - Gather deprecated SQL configs and add extra info - when a config was > deprecated and why > - Output warning about deprecated SQL config in set() and unset() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-30468: - Description: Currently data columns are displayed on one line by the show create table command; when the table has many columns (to make things even worse, columns may have long names or comments), the displayed result is really hard to read. To improve readability, we could print each column on a separate line. Note that other systems like Hive/MySQL also display columns this way. Also, for data columns, table properties and options, we'd better put the right parenthesis at the end of the last column/property/option instead of on a separate line. As a result, before the change: {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet OPTIONS ( `bar` '2', `foo` '1' ) TBLPROPERTIES ( 'a' = 'x', 'b' = 'y' ) {noformat} after the change: {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` ( `col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet OPTIONS ( `bar` '2', `foo` '1') TBLPROPERTIES ( 'a' = 'x', 'b' = 'y') {noformat} was: Currently data columns are displayed in one line for show create table command, when the table has many columns (to make things even worse, columns may have long names or comments), the displayed result is really hard to read. E.g. 
{noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} To improve readability, we should print each column in a separate line. {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` ( `col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} > Use multiple lines to display data columns for show create table command > > > Key: SPARK-30468 > URL: https://issues.apache.org/jira/browse/SPARK-30468 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Priority: Minor > > Currently data columns are displayed in one line for show create table > command, when the table has many columns (to make things even worse, columns > may have long names or comments), the displayed result is really hard to read. > To improve readability, we could print each column in a separate line. Note > that other systems like Hive/MySQL also display in this way. > Also, for data columns, table properties and options, we'd better put the > right parenthesis to the end of the last column/property/option, instead of > occupying a separate line. 
> As a result, before the change: > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT > 'This is comment for column 3') > USING parquet > OPTIONS ( > `bar` '2', > `foo` '1' > ) > TBLPROPERTIES ( > 'a' = 'x', > 'b' = 'y' > ) > {noformat} > after the change: > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` ( > `col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', > `col3` DOUBLE COMMENT 'This is comment for column 3') > USING parquet > OPTIONS ( > `bar` '2', > `foo` '1') > TBLPROPERTIES ( > 'a' = 'x', > 'b' = 'y') > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
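The proposed layout is straightforward to prototype. The toy Python formatter below is an illustration only — not Spark's actual SHOW CREATE TABLE code — and shows the two points of the ticket: one column per line, with the closing parenthesis attached to the last column rather than given its own line.

```python
def render_columns(cols):
    """Render column definitions one per line, attaching the closing
    parenthesis to the last column instead of putting it on its own line,
    as the ticket proposes. (Toy formatter, not Spark's implementation.)"""
    body = ",\n".join(f"  `{name}` {typ} COMMENT '{comment}'"
                      for name, typ, comment in cols)
    return f"CREATE TABLE `test_table` (\n{body})\nUSING parquet"

ddl = render_columns([
    ("col1", "INT", "This is comment for column 1"),
    ("col2", "STRING", "This is comment for column 2"),
])
print(ddl)
```

The same join-then-append-parenthesis trick applies unchanged to the OPTIONS and TBLPROPERTIES sections shown in the before/after examples.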
[jira] [Commented] (SPARK-28396) Add PathCatalog for data source V2
[ https://issues.apache.org/jira/browse/SPARK-28396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012336#comment-17012336 ] Haifeng Chen commented on SPARK-28396: -- [~Gengliang.Wang] Gengliang, I am trying to understand how Hive catalog tables are connected to the data source V2 API in the current implementation. Just to check with you: in the current Spark 3.0 implementation, have Hive catalog tables or Thrift server catalog tables already gone through the data source V2 implementation, or are they still handled by the V1 implementation? > Add PathCatalog for data source V2 > -- > > Key: SPARK-28396 > URL: https://issues.apache.org/jira/browse/SPARK-28396 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Add PathCatalog for data source V2, so that: > 1. We can convert SaveMode in DataFrameWriter into catalog table operations, > instead of supporting SaveMode in file source V2. > 2. Support create-table SQL statements like "CREATE TABLE orc.'path'" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24714) AnalysisSuite should use ClassTag to check the runtime instance
[ https://issues.apache.org/jira/browse/SPARK-24714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-24714. -- Resolution: Won't Fix > AnalysisSuite should use ClassTag to check the runtime instance > --- > > Key: SPARK-24714 > URL: https://issues.apache.org/jira/browse/SPARK-24714 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chia-Ping Tsai >Priority: Minor > >
> {code:java}
> test("SPARK-22614 RepartitionByExpression partitioning") {
>   def checkPartitioning[T <: Partitioning](numPartitions: Int, exprs: Expression*): Unit = {
>     val partitioning = RepartitionByExpression(exprs, testRelation2, numPartitions).partitioning
>     assert(partitioning.isInstanceOf[T]) // always true because of type erasure
>   }
> {code}
> Spark supports Scala 2.10 and 2.11, so it is OK to introduce ClassTag to > correct the type check. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24714) AnalysisSuite should use ClassTag to check the runtime instance
[ https://issues.apache.org/jira/browse/SPARK-24714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012323#comment-17012323 ] Takeshi Yamamuro commented on SPARK-24714: -- I'll close this because the corresponding PR is inactive. If necessary, please reopen it. > AnalysisSuite should use ClassTag to check the runtime instance > --- > > Key: SPARK-24714 > URL: https://issues.apache.org/jira/browse/SPARK-24714 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chia-Ping Tsai >Priority: Minor > >
> {code:java}
> test("SPARK-22614 RepartitionByExpression partitioning") {
>   def checkPartitioning[T <: Partitioning](numPartitions: Int, exprs: Expression*): Unit = {
>     val partitioning = RepartitionByExpression(exprs, testRelation2, numPartitions).partitioning
>     assert(partitioning.isInstanceOf[T]) // always true because of type erasure
>   }
> {code}
> Spark supports Scala 2.10 and 2.11, so it is OK to introduce ClassTag to > correct the type check. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
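The erasure problem the suite suffers from, and the ClassTag fix this ticket proposes, can be demonstrated in plain Scala with no Spark dependency (the names below are illustrative, not from the Spark codebase):

```scala
import scala.reflect.ClassTag

object ErasureDemo {
  // T is erased at runtime, so this degenerates to isInstanceOf[Object]
  // and succeeds for any non-null value; the compiler emits an
  // "unchecked" warning here. This is why the suite's assertion can
  // never fail.
  def erasedCheck[T](x: Any): Boolean = x.isInstanceOf[T]

  // A ClassTag carries the runtime class, so the check is real.
  def taggedCheck[T](x: Any)(implicit tag: ClassTag[T]): Boolean =
    tag.runtimeClass.isInstance(x)
}
```

`erasedCheck[String](42)` returns true even though 42 is not a String, while `taggedCheck[String](42)` correctly returns false; rewriting `checkPartitioning` with a `ClassTag[T]` context bound is the analogous fix.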
[jira] [Updated] (SPARK-30475) File source V2: Push data filters for file listing
[ https://issues.apache.org/jira/browse/SPARK-30475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guy Khazma updated SPARK-30475: --- Description: Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which added support for partition pruning in File source V2. We should also pass the {{dataFilters}} to the {{listFiles}} method. Datasources such as {{csv}} and {{json}} do not implement the {{SupportsPushDownFilters}} trait. In order to support data skipping uniformly for all file-based data sources, one can override the {{listFiles}} method in a {{FileIndex}} implementation and use the {{dataFilters}} and {{partitionFilters}} to consult external metadata and prune the list of files. was: Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which added support for partition pruning in File source V2. We should also pass the {{dataFilters}} to the {{listFiles method.}} Datasources such as {{csv}} and {{json}} do not implement the {{SupportsPushDownFilters}} trait. In order to support data skipping uniformly for all file based data sources, one can override the {{listFiles}} method in a {{FileIndex}} implementation, which consults external metadata and prunes the list of files. > File source V2: Push data filters for file listing > -- > > Key: SPARK-30475 > URL: https://issues.apache.org/jira/browse/SPARK-30475 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Guy Khazma >Priority: Major > Fix For: 3.0.0 > > > Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which > added support for partition pruning in File source V2. > We should also pass the {{dataFilters}} to the {{listFiles}} method. > Datasources such as {{csv}} and {{json}} do not implement the > {{SupportsPushDownFilters}} trait. In order to support data skipping > uniformly for all file-based data sources, one can override the {{listFiles}} > method in a {{FileIndex}} implementation and use the {{dataFilters}} and > {{partitionFilters}} to consult external metadata and prune the list of files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
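A Spark-free sketch of the data-skipping idea described above. The `listFiles` name follows the ticket, but everything else here is hypothetical: the toy models the external metadata as per-file min/max statistics for a single column, and a pushed-down data filter as a `[lo, hi]` range.

```scala
// Hypothetical per-file metadata, e.g. min/max statistics for one column.
final case class FileStats(path: String, min: Int, max: Int)

object SkippingFileIndex {
  // Keep only files whose [min, max] range can intersect the filter's
  // [lo, hi] range; every other file is pruned before any read happens.
  def listFiles(files: Seq[FileStats], lo: Int, hi: Int): Seq[String] =
    files.filter(f => f.max >= lo && f.min <= hi).map(_.path)
}
```

The point of the ticket is that a `FileIndex` override gets both `dataFilters` and `partitionFilters` at listing time, so this pruning can happen uniformly even for sources like csv/json that do not implement `SupportsPushDownFilters`.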
[jira] [Commented] (SPARK-30475) File source V2: Push data filters for file listing
[ https://issues.apache.org/jira/browse/SPARK-30475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012298#comment-17012298 ] Guy Khazma commented on SPARK-30475: PR https://github.com/apache/spark/pull/27157 > File source V2: Push data filters for file listing > -- > > Key: SPARK-30475 > URL: https://issues.apache.org/jira/browse/SPARK-30475 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Guy Khazma >Priority: Major > Fix For: 3.0.0 > > > Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which > added support for partition pruning in File source V2. > We should also pass the {{dataFilters}} to the {{listFiles method.}} > Datasources such as {{csv}} and {{json}} do not implement the > {{SupportsPushDownFilters}} trait. In order to support data skipping > uniformly for all file based data sources, one can override the {{listFiles}} > method in a {{FileIndex}} implementation, which consults external metadata > and prunes the list of files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30475) File source V2: Push data filters for file listing
[ https://issues.apache.org/jira/browse/SPARK-30475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guy Khazma updated SPARK-30475: --- External issue URL: https://github.com/apache/spark/pull/27157 > File source V2: Push data filters for file listing > -- > > Key: SPARK-30475 > URL: https://issues.apache.org/jira/browse/SPARK-30475 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Guy Khazma >Priority: Major > Fix For: 3.0.0 > > > Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which > added support for partition pruning in File source V2. > We should also pass the {{dataFilters}} to the {{listFiles method.}} > Datasources such as {{csv}} and {{json}} do not implement the > {{SupportsPushDownFilters}} trait. In order to support data skipping > uniformly for all file based data sources, one can override the {{listFiles}} > method in a {{FileIndex}} implementation, which consults external metadata > and prunes the list of files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30475) File source V2: Push data filters for file listing
[ https://issues.apache.org/jira/browse/SPARK-30475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guy Khazma updated SPARK-30475: --- External issue URL: (was: https://github.com/apache/spark/pull/27157) > File source V2: Push data filters for file listing > -- > > Key: SPARK-30475 > URL: https://issues.apache.org/jira/browse/SPARK-30475 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Guy Khazma >Priority: Major > Fix For: 3.0.0 > > > Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which > added support for partition pruning in File source V2. > We should also pass the {{dataFilters}} to the {{listFiles method.}} > Datasources such as {{csv}} and {{json}} do not implement the > {{SupportsPushDownFilters}} trait. In order to support data skipping > uniformly for all file based data sources, one can override the {{listFiles}} > method in a {{FileIndex}} implementation, which consults external metadata > and prunes the list of files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30475) File source V2: Push data filters for file listing
Guy Khazma created SPARK-30475: -- Summary: File source V2: Push data filters for file listing Key: SPARK-30475 URL: https://issues.apache.org/jira/browse/SPARK-30475 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Guy Khazma Fix For: 3.0.0 Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which added support for partition pruning in File source V2. We should also pass the {{dataFilters}} to the {{listFiles}} method. Datasources such as {{csv}} and {{json}} do not implement the {{SupportsPushDownFilters}} trait. In order to support data skipping uniformly for all file-based data sources, one can override the {{listFiles}} method in a {{FileIndex}} implementation, which consults external metadata and prunes the list of files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp updated SPARK-29988: Attachment: Screen Shot 2020-01-09 at 1.59.25 PM.png > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > Attachments: Screen Shot 2020-01-09 at 1.59.25 PM.png > > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > For now, I added the above `-Phive-1.2` and `-Phive-2.3` to Jenkins > manually. (This should be added to the SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is in preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs in Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012261#comment-17012261 ] Shane Knapp commented on SPARK-29988: - it's hard to tell but i disabled the old jobs and all the new ones are running. > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > Attachments: Screen Shot 2020-01-09 at 1.59.25 PM.png > > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > For now, I added the above `-Phive-1.2` and `-Phive-2.3` to Jenkins > manually. (This should be added to the SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is in preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs in Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012260#comment-17012260 ] Shane Knapp commented on SPARK-29988: - done! !Screen Shot 2020-01-09 at 1.59.25 PM.png! > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > Attachments: Screen Shot 2020-01-09 at 1.59.25 PM.png > > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > For now, I added the above `-Phive-1.2` and `-Phive-2.3` to Jenkins > manually. (This should be added to the SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is in preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs in Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012246#comment-17012246 ] Shane Knapp commented on SPARK-29988: - ok, after banging my head against jenkins job builder, i finally got it to work. deploying now. > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > For now, I added the above `-Phive-1.2` and `-Phive-2.3` to Jenkins > manually. (This should be added to the SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is in preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs in Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012229#comment-17012229 ] Everett Rush commented on SPARK-27249: -- [~nafshartous] Hi Nick, I would like to have a "MultiColumnTransformer" class in Spark, which I should be able to subclass. I would like the API to be similar to UnaryTransformer: I provide a transformation function and a new schema, and then Spark handles the encoding back to a DataFrame and optimizes the computation however it can.
{code:java}
class ExampleMulticolumn(override val uid: String, envVars: Map[String, String])
  extends MultiColumnTransformer[ExampleMulticolumn] with HasInputCol with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("exampleMulticolumn"), Map())

  // developer provides the new schema for the DataFrame
  val newSchema: StructType

  override protected def transformFunc: Iterator[Row] => Iterator[Row] = {
    iter => {
      // connect to the database
      // iterate over the rows in the partition
      val new_iter = iter.map { row =>
        // do some computation
        row
      }
      new_iter
    }
  }

  override def copy(extra: ParamMap): ExampleMulticolumn = defaultCopy(extra)
}
{code}
> Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (i.e. UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize repeatedly in a UDF, such as a database connection. 
> > Design: > An abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row]. > NB: This parallels the UnaryTransformer createTransformFunc method. > > When developers subclass this transformer, they can provide their own schema > for the output Row, in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively, the developer can set > the output DataType and output column name. Then the PartitionTransformer class will > create a new schema and a row encoder, and execute the transformation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
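The partition-level hook proposed above can be sketched without any Spark dependency. Everything here is illustrative: `Row` is a stand-in for Spark's `Row`, and `PartitionTransform` is not a real Spark class. The point is that a transform shaped as `Iterator[Row] => Iterator[Row]` lets per-partition resources (for example a database connection) be initialized once per partition rather than once per row.

```scala
// Stand-in for Spark's Row type; illustrative only.
final case class Row(values: Seq[Any])

object PartitionTransform {
  // Mirrors the proposed Iterator[Row] => Iterator[Row] shape: expensive
  // setup happens once, before the partition's rows are iterated.
  def transformFunc(perRow: Row => Row): Iterator[Row] => Iterator[Row] =
    iter => {
      // e.g. open a database connection here, once per partition
      iter.map(perRow)
    }
}
```

This parallels `UnaryTransformer.createTransformFunc`, except the function closes over the whole partition iterator instead of a single value.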
[jira] [Comment Edited] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010756#comment-17010756 ] Nick Afshartous edited comment on SPARK-27249 at 1/9/20 7:49 PM: - I could try and look into this. Could someone validate that this feature is still needed? [~enrush] It would also be helpful if you could provide a code example illustrating how the {{PartitionTransformer}} would be used. was (Author: nafshartous): I could try and look into this. Could someone validate that this feature is still needed ? > Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (i.e. UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize > repeatedly in a UDF, such as a database connection. > > Design: > An abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row]. > NB: This parallels the UnaryTransformer createTransformFunc method. > > When developers subclass this transformer, they can provide their own schema > for the output Row, in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively, the developer can set > the output DataType and output column name. Then the PartitionTransformer class will > create a new schema and a row encoder, and execute the transformation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30459) Fix ignoreMissingFiles/ignoreCorruptFiles in DSv2
[ https://issues.apache.org/jira/browse/SPARK-30459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-30459. Resolution: Fixed This issue is resolved in https://github.com/apache/spark/pull/27136 > Fix ignoreMissingFiles/ignoreCorruptFiles in DSv2 > - > > Key: SPARK-30459 > URL: https://issues.apache.org/jira/browse/SPARK-30459 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > ignoreMissingFiles/ignoreCorruptFiles in DSv2 behaves incorrectly compared to > DSv1: it stops immediately once it finds a missing or corrupt file, while > DSv1 skips it and continues to read the next files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30459) Fix ignoreMissingFiles/ignoreCorruptFiles in DSv2
[ https://issues.apache.org/jira/browse/SPARK-30459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-30459: --- Issue Type: Bug (was: Improvement) > Fix ignoreMissingFiles/ignoreCorruptFiles in DSv2 > - > > Key: SPARK-30459 > URL: https://issues.apache.org/jira/browse/SPARK-30459 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > ignoreMissingFiles/ignoreCorruptFiles in DSv2 behaves incorrectly compared to > DSv1: it stops immediately once it finds a missing or corrupt file, while > DSv1 skips it and continues to read the next files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30459) Fix ignoreMissingFiles/ignoreCorruptFiles in DSv2
[ https://issues.apache.org/jira/browse/SPARK-30459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-30459: -- Assignee: wuyi > Fix ignoreMissingFiles/ignoreCorruptFiles in DSv2 > - > > Key: SPARK-30459 > URL: https://issues.apache.org/jira/browse/SPARK-30459 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > ignoreMissingFiles/ignoreCorruptFiles in DSv2 behaves incorrectly compared to > DSv1: it stops immediately once it finds a missing or corrupt file, while > DSv1 skips it and continues to read the next files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29219) DataSourceV2: Support all SaveModes in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-29219. - Fix Version/s: 3.0.0 Resolution: Done Resolved by [https://github.com/apache/spark/pull/26913] > DataSourceV2: Support all SaveModes in DataFrameWriter.save > --- > > Key: SPARK-29219 > URL: https://issues.apache.org/jira/browse/SPARK-29219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > We currently don't support all save modes in DataFrameWriter.save, as the > TableProvider interface allows for the reading/writing of data, but not for > the creation of tables. We created a catalog API to support the > creation/dropping/checking existence of tables, but DataFrameWriter.save > doesn't necessarily use a catalog, for example when writing to a path-based > table. > For this case, we propose a new interface that will allow TableProviders to > extract an Identifier and a Catalog from a bundle of > CaseInsensitiveStringOptions. This information can then be used to check the > existence of a table and support all save modes. If a Catalog is not > defined, then the behavior is to use the spark_catalog (or the configured session > catalog) to perform the check. > > The interface can look like:
> {code:java}
> trait CatalogOptions {
>   def extractCatalog(options: StringMap): String
>   def extractIdentifier(options: StringMap): Identifier
> }
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29219) DataSourceV2: Support all SaveModes in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-29219: --- Assignee: Burak Yavuz > DataSourceV2: Support all SaveModes in DataFrameWriter.save > --- > > Key: SPARK-29219 > URL: https://issues.apache.org/jira/browse/SPARK-29219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > > We currently don't support all save modes in DataFrameWriter.save, as the > TableProvider interface allows for the reading/writing of data, but not for > the creation of tables. We created a catalog API to support the > creation/dropping/checking existence of tables, but DataFrameWriter.save > doesn't necessarily use a catalog, for example when writing to a path-based > table. > For this case, we propose a new interface that will allow TableProviders to > extract an Identifier and a Catalog from a bundle of > CaseInsensitiveStringOptions. This information can then be used to check the > existence of a table and support all save modes. If a Catalog is not > defined, then the behavior is to use the spark_catalog (or the configured session > catalog) to perform the check. > > The interface can look like:
> {code:java}
> trait CatalogOptions {
>   def extractCatalog(options: StringMap): String
>   def extractIdentifier(options: StringMap): Identifier
> }
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
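A minimal sketch of the proposed interface. Assumptions are labeled: a plain `Map[String, String]` stands in for Spark's `CaseInsensitiveStringMap`, the identifier is modeled as a `String`, and `PathBasedOptions` is a hypothetical provider, not a Spark class; only the trait shape follows the ticket.

```scala
object CatalogOptionsSketch {
  // Stand-in for CaseInsensitiveStringMap (illustrative only).
  type StringMap = Map[String, String]

  // Trait shape follows the ticket; the identifier is simplified to String.
  trait CatalogOptions {
    def extractCatalog(options: StringMap): String
    def extractIdentifier(options: StringMap): String
  }

  // Hypothetical provider: when the options carry no explicit catalog,
  // fall back to the session catalog, as the ticket describes.
  object PathBasedOptions extends CatalogOptions {
    def extractCatalog(options: StringMap): String =
      options.getOrElse("catalog", "spark_catalog")
    def extractIdentifier(options: StringMap): String =
      options("path")
  }
}
```

With a catalog and identifier in hand, `DataFrameWriter.save` can check table existence first and therefore honor every `SaveMode`, even for path-based tables.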
[jira] [Updated] (SPARK-30474) Writing data to parquet with dynamic partition should not be done in commit job stage
[ https://issues.apache.org/jira/browse/SPARK-30474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zaisheng Dai updated SPARK-30474: - Description: In the current Spark implementation, if you set
{code:java}
spark.sql.sources.partitionOverwriteMode=dynamic
{code}
even with
{code:java}
mapreduce.fileoutputcommitter.algorithm.version=2
{code}
it would still rename the partition folders *sequentially* in the commitJob stage, as shown here: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184] This is very slow on cloud storage. Should we commit the data the way FileOutputCommitter v2 does? was: In the current spark implementation if you set spark.sql.sources.partitionOverwriteMode=dynamic, even with mapreduce.fileoutputcommitter.algorithm.version=2, it would still rename the partition folder *sequentially* in commitJob stage as shown here: [|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184] This is very slow in cloud storage. We should commit the data similar to FileOutputCommitter v2? 
> Writing data to parquet with dynamic partition should not be done in commit > job stage > - > > Key: SPARK-30474 > URL: https://issues.apache.org/jira/browse/SPARK-30474 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.4, 2.4.4 >Reporter: Zaisheng Dai >Priority: Minor > > In the current Spark implementation, if you set
> {code:java}
> spark.sql.sources.partitionOverwriteMode=dynamic
> {code}
> even with
> {code:java}
> mapreduce.fileoutputcommitter.algorithm.version=2
> {code}
> it would still rename the partition folders *sequentially* in the commitJob stage, > as shown here: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] > > [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184] > > This is very slow on cloud storage. Should we commit the data the way > FileOutputCommitter v2 does? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30474) Writing data to parquet with dynamic partition should not be done in commit job stage
[ https://issues.apache.org/jira/browse/SPARK-30474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zaisheng Dai updated SPARK-30474: - Description: In the current spark implementation if you set spark.sql.sources.partitionOverwriteMode=dynamic, even with mapreduce.fileoutputcommitter.algorithm.version=2, it would still rename the partition folder *sequentially* in commitJob stage as shown here: [|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184] This is very slow in cloud storage. We should commit the data similar to FileOutputCommitter v2? was: In the current spark implementation if you set spark.sql.sources.partitionOverwriteMode=dynamic, even with mapreduce.fileoutputcommitter.algorithm.version=2, it would still rename the partition folder *sequentially* in commitJob stage as shown here: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] This is very slow in cloud storage. We should commit the data similar to FileOutputCommitter v2? 
> Writing data to parquet with dynamic partition should not be done in commit > job stage > - > > Key: SPARK-30474 > URL: https://issues.apache.org/jira/browse/SPARK-30474 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.4, 2.4.4 >Reporter: Zaisheng Dai >Priority: Minor > > In the current spark implementation if you set > spark.sql.sources.partitionOverwriteMode=dynamic, even with > mapreduce.fileoutputcommitter.algorithm.version=2, it would still rename the > partition folder *sequentially* in commitJob stage as shown here: > [|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] > > [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184] > > This is very slow in cloud storage. We should commit the data similar to > FileOutputCommitter v2?
[jira] [Created] (SPARK-30474) Writing data to parquet with dynamic partition should not be done in commit job stage
Zaisheng Dai created SPARK-30474: Summary: Writing data to parquet with dynamic partition should not be done in commit job stage Key: SPARK-30474 URL: https://issues.apache.org/jira/browse/SPARK-30474 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 2.4.4, 2.3.4 Reporter: Zaisheng Dai In the current Spark implementation, if you set spark.sql.sources.partitionOverwriteMode=dynamic, even with mapreduce.fileoutputcommitter.algorithm.version=2, it still renames the partition folders *sequentially* in the commitJob stage, as shown here: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] This is very slow on cloud storage. We should commit the data the same way FileOutputCommitter v2 does.
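The cost model behind the report can be sketched outside Spark. The following is a minimal Python sketch, not Spark's HadoopMapReduceCommitProtocol: it moves staged partition directories to their final locations on a thread pool, illustrating how a v2-style commit could overlap the per-partition renames whose round-trip latency dominates commitJob on cloud object stores. The `commit_partitions` helper and its signature are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def commit_partitions(pairs, max_workers=8):
    """Move each staged partition directory to its final location.

    pairs: list of (staging_path, final_path) tuples. Running the renames
    on a thread pool overlaps their latency instead of paying it once per
    partition sequentially, which is the slowness the issue describes.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # os.replace is an atomic rename on a local POSIX filesystem; on an
        # object store each "rename" is really a copy+delete, so the gain
        # from overlapping them grows with the number of partitions.
        list(pool.map(lambda p: os.replace(p[0], p[1]), pairs))
```

On local disk the difference is negligible; the sketch only shows the structure of a parallel commit, not a drop-in replacement for Spark's commit protocol.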
[jira] [Updated] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-30467: --- Description: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create a SparkContext because communication between the Spark Worker and the Spark Master is not possible if we configure *spark.network.crypto.enabled true*. *Error logs:* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). *fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE* JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found. was: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create Spark Context because communication between Spark Worker and Spark Master is not possible If we configured *spark.network.crypto.enabled true*. *Error logs :* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found.
Expected to find core file with name "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" JVMDUMP030W Cannot write dump to file /bin/javacore.20200109.064150.283.0002.txt: Permission denied JVMDUMP032I JVM requested Java dump using '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt JVMDUMP032I JVM requested Snap dump using '/bin/Snap.20200109.064150.283.0003.trc' in response to an event JVMDUMP030W Cannot write dump to file /bin/Snap.20200109.064150.283.0003.trc: Permission denied JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc JVMDUMP030W Cannot write dump to file /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied JVMDUMP007I JVM Requesting JIT dump using '/tmp/jitdump.20200109.064150.283.0004.dmp' JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp JVMDUMP013I Processed dump event "abort", detail "". was: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create Spark Context because communication between Spark Worker and Spark Master is not possible If we configured *spark.network.crypto.enabled true*. *Error logs :* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found. 
Expected to find core file with name "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" JVMDUMP030W Cannot write dump to file /bin/javacore.20200109.064150.283.0002.txt: Permission denied JVMDUMP032I JVM requested Java dump using '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt JVMDUMP032I JVM requested Snap dump using '/bin/Snap.20200109.064150.283.0003.trc' in response to an event JVMDUMP030W Cannot write dump to file /bin/Snap.20200109.064150.283.0003.trc: Permission denied JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc JVMDUMP030W Cannot write dump to file /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied JVMDUMP007I JVM Requesting JIT dump using '/tmp/jitdump.20200109.064150.283.0004.dmp' JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp JVMDUMP013I Processed dump event "abort", detail "". > On Federal Information Processing Standard (FIPS) enabled cluster, Spark > Workers are not able to connect to Remote Master. > -- > > Key: SPARK-30467 > URL: https://issues.apache.org/jira/browse/SPARK-30467 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.3.4, 2.4.4 >Reporter: SHOBHIT SHUKLA >Priority: Blocker > Labels: security > > On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark > Workers are not able to create Spark Context because communication between > Spark Worker and Spark Master is not possible If we configured > *spark.network.crypto.enabled true*. > *Error logs :* > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > *fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST > FAILURE* > JVMDU
[jira] [Updated] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-30467: --- Priority: Blocker (was: Major) > On Federal Information Processing Standard (FIPS) enabled cluster, Spark > Workers are not able to connect to Remote Master. > -- > > Key: SPARK-30467 > URL: https://issues.apache.org/jira/browse/SPARK-30467 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.3.4, 2.4.4 >Reporter: SHOBHIT SHUKLA >Priority: Blocker > Labels: security > > On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark > Workers are not able to create Spark Context because communication between > Spark Worker and Spark Master is not possible If we configured > *spark.network.crypto.enabled true*. > *Error logs :* > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST > FAILURE > JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - > please wait. > JVMDUMP032I JVM requested System dump using > '/bin/core.20200109.064150.283.0001.dmp' in response to an event > JVMDUMP030W Cannot write dump to > file/bin/core.20200109.064150.283.0001.dmp: Permission denied > JVMDUMP012E Error in System dump: The core file created by child process with > pid = 375 was not found. 
Expected to find core file with name > "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" > JVMDUMP030W Cannot write dump to file > /bin/javacore.20200109.064150.283.0002.txt: Permission denied > JVMDUMP032I JVM requested Java dump using > '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event > JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt > JVMDUMP032I JVM requested Snap dump using > '/bin/Snap.20200109.064150.283.0003.trc' in response to an event > JVMDUMP030W Cannot write dump to file > /bin/Snap.20200109.064150.283.0003.trc: Permission denied > JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc > JVMDUMP030W Cannot write dump to file > /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied > JVMDUMP007I JVM Requesting JIT dump using > '/tmp/jitdump.20200109.064150.283.0004.dmp' > JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp > JVMDUMP013I Processed dump event "abort", detail "".
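For reference, the setting named in the report enables Spark's AES-based RPC encryption, which is implemented with Apache Commons Crypto on top of OpenSSL, the library whose FIPS self-test aborts in the logs above. A spark-defaults.conf fragment reconstructing the reporter's setup as described (shown for context, not as a recommendation; `spark.network.crypto.enabled` requires `spark.authenticate`):

```
# spark-defaults.conf (fragment)
spark.authenticate              true
# AES-based RPC encryption between Spark processes; backed by
# commons-crypto -> OpenSSL, which fails its FIPS self-test here
spark.network.crypto.enabled    true
```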
[jira] [Updated] (SPARK-30473) PySpark enum subclass crashes when used inside UDF
[ https://issues.apache.org/jira/browse/SPARK-30473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Härtwig updated SPARK-30473: Description: PySpark enum subclass crashes when used inside a UDF. Example: {code:java} from enum import Enum class Direction(Enum): NORTH = 0 SOUTH = 1 {code} Working: {code:java} Direction.NORTH{code} Crashing: {code:java} @udf def fn(a): Direction.NORTH return "" df.withColumn("test", fn("a")){code} Stacktrace: {noformat} SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length return self.loads(obj) File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads return pickle.loads(obj, encoding=encoding) File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ enum_members = {k: classdict[k] for k in classdict._member_names} AttributeError: 'dict' object has no attribute '_member_names'{noformat} I suspect the problem is in *python/pyspark/cloudpickle.py*. On line 586 in the function *_save_dynamic_enum*, the attribute *_member_names* is removed from the enum. Yet, this attribute is required by the *Enum* class. This results in all Enum subclasses crashing. was: PySpark enum subclass crashes when used inside a UDF. 
Example: {code:java} from enum import Enum class Direction(Enum): NORTH = 0 SOUTH = 1 {code} Working: {code:java} Direction.NORTH{code} Crashing: {code:java} @udf def fn(a): Direction.NORTH return "" df.withColumn("test", fn("a")){code} Stacktrace: {noformat} SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length return self.loads(obj) File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads return pickle.loads(obj, encoding=encoding) File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ enum_members = {k: classdict[k] for k in classdict._member_names} AttributeError: 'dict' object has no attribute '_member_names'{noformat} I suspect the problem is in `python/pyspark/cloudpickle.py`. On line 586 in the function `_save_dynamic_enum`, the attribute `_member_names` is removed from the enum. Yet, this attribute is required by the `Enum` class and Enum subclasses will crash. > PySpark enum subclass crashes when used inside UDF > -- > > Key: SPARK-30473 > URL: https://issues.apache.org/jira/browse/SPARK-30473 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 > Environment: Databricks Runtime 6.2 (includes Apache Spark 2.4.4, > Scala 2.11) >Reporter: Max Härtwig >Priority: Major > > PySpark enum subclass crashes when used inside a UDF. 
> > Example: > {code:java} > from enum import Enum > class Direction(Enum): > NORTH = 0 > SOUTH = 1 > {code} > > Working: > {code:java} > Direction.NORTH{code} > > Crashing: > {code:java} > @udf > def fn(a): > Direction.NORTH > return "" > df.withColumn("test", fn("a")){code} > > Stacktrace: > {noformat} > SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed > 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, > 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: > Traceback (most recent call last): > File "/databricks/spark/python/pyspark/serializers.py", line 182, in > _read_with_length return self.loads(obj) > File "/databricks/spark/python/pyspark/serializers.py", line 695, in > loads return pickle.loads(obj, encoding=encoding) > File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ > enum_members = {k: classdict[k] for k in classdict._member_names} > AttributeError: 'dict' object has no attribute '_member_names'{noformat} > > I suspect the problem is in *python/pyspark/cloudpickle.py*. On line 586 in > the function *_save_dynamic_enum*, the attribute *_member_names* is removed > from the enum. Yet, this attribute is required by the *Enum* class. This > results in all Enum subclasses crashing.
[jira] [Updated] (SPARK-30473) PySpark enum subclass crashes when used inside UDF
[ https://issues.apache.org/jira/browse/SPARK-30473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Härtwig updated SPARK-30473: Description: PySpark enum subclass crashes when used inside a UDF. Example: {code:java} from enum import Enum class Direction(Enum): NORTH = 0 SOUTH = 1 {code} Working: {code:java} Direction.NORTH{code} Crashing: {code:java} @udf def fn(a): Direction.NORTH return "" df.withColumn("test", fn("a")){code} Stacktrace: {noformat} SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length return self.loads(obj) File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads return pickle.loads(obj, encoding=encoding) File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ enum_members = {k: classdict[k] for k in classdict._member_names} AttributeError: 'dict' object has no attribute '_member_names'{noformat} I suspect the problem is in `python/pyspark/cloudpickle.py`. On line 586 in the function `_save_dynamic_enum`, the attribute `_member_names` is removed from the enum. Yet, this attribute is required by the `Enum` class and Enum subclasses will crash. was: PySpark enum subclass crashes when used inside a UDF. 
Example: {code:java} from enum import Enum class Direction(Enum): NORTH = 0 SOUTH = 1 {code} Working: {code:java} Direction.NORTH{code} Crashing: {code:java} @udf def fn(a): Direction.NORTH return "" df.withColumn("test", fn("a")){code} Stacktrace: {noformat} SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length return self.loads(obj) File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads return pickle.loads(obj, encoding=encoding) File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ enum_members = {k: classdict[k] for k in classdict._member_names} AttributeError: 'dict' object has no attribute '_member_names'{noformat} I suspect the problem is in `python/pyspark/cloudpickle.py`. On line 586 in the function `_save_dynamic_enum`, the attribute `_member_names` is removed from the enum. Yet, this attribute is required by the `Enum` class and Enum subclasses will crash. > PySpark enum subclass crashes when used inside UDF > -- > > Key: SPARK-30473 > URL: https://issues.apache.org/jira/browse/SPARK-30473 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 > Environment: Databricks Runtime 6.2 (includes Apache Spark 2.4.4, > Scala 2.11) >Reporter: Max Härtwig >Priority: Major > > PySpark enum subclass crashes when used inside a UDF. 
> > Example: > {code:java} > from enum import Enum > class Direction(Enum): > NORTH = 0 > SOUTH = 1 > {code} > > Working: > {code:java} > Direction.NORTH{code} > > Crashing: > {code:java} > @udf > def fn(a): > Direction.NORTH > return "" > df.withColumn("test", fn("a")){code} > > Stacktrace: > {noformat} > SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed > 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, > 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: > Traceback (most recent call last): > File "/databricks/spark/python/pyspark/serializers.py", line 182, in > _read_with_length return self.loads(obj) > File "/databricks/spark/python/pyspark/serializers.py", line 695, in > loads return pickle.loads(obj, encoding=encoding) > File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ > enum_members = {k: classdict[k] for k in classdict._member_names} > AttributeError: 'dict' object has no attribute '_member_names'{noformat} > > I suspect the problem is in `python/pyspark/cloudpickle.py`. On line 586 in > the function `_save_dynamic_enum`, the attribute `_member_names` is removed > from the enum. Yet, this attribute is required by the `Enum` class and Enum > subclasses will crash.
[jira] [Created] (SPARK-30473) PySpark enum subclass crashes when used inside UDF
Max Härtwig created SPARK-30473: --- Summary: PySpark enum subclass crashes when used inside UDF Key: SPARK-30473 URL: https://issues.apache.org/jira/browse/SPARK-30473 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Environment: Databricks Runtime 6.2 (includes Apache Spark 2.4.4, Scala 2.11) Reporter: Max Härtwig PySpark enum subclass crashes when used inside a UDF. Example: {code:java} from enum import Enum class Direction(Enum): NORTH = 0 SOUTH = 1 {code} Working: {code:java} Direction.NORTH{code} Crashing: {code:java} @udf def fn(a): Direction.NORTH return "" df.withColumn("test", fn("a")){code} Stacktrace: {noformat} SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length return self.loads(obj) File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads return pickle.loads(obj, encoding=encoding) File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ enum_members = {k: classdict[k] for k in classdict._member_names} AttributeError: 'dict' object has no attribute '_member_names'{noformat} I suspect the problem is in `python/pyspark/cloudpickle.py`. On line 586 in the function `_save_dynamic_enum`, the attribute `_member_names` is removed from the enum. Yet, this attribute is required by the `Enum` class and Enum subclasses will crash.
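The suspected mechanism can be reproduced without Spark or cloudpickle: Enum creation goes through `EnumMeta.__new__`, which reads `classdict._member_names`, an attribute that exists only on the special `_EnumDict` mapping the metaclass normally prepares. Rebuilding an Enum from a plain dict, roughly what a pickler would do after stripping `_member_names`, fails exactly as in the stacktrace. This is a standalone sketch; the cloudpickle line number and function name are as stated in the report, not verified here.

```python
from enum import Enum

# A plain dict standing in for the class dict a pickler reconstructs
# after dropping the _member_names bookkeeping attribute.
plain_classdict = {"NORTH": 0, "SOUTH": 1}

try:
    # type(Enum) is the Enum metaclass (EnumMeta / EnumType); calling it
    # with a plain dict skips __prepare__, so no _EnumDict is involved.
    type(Enum)("Direction", (Enum,), plain_classdict)
    error = None
except AttributeError as exc:
    error = exc

print(error)  # e.g. 'dict' object has no attribute '_member_names'
```

The same access appears at `enum.py` line 152 in the reporter's Python 3.7 traceback, which is consistent with the hypothesis that the pickled class dict lost `_member_names` before reconstruction.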
[jira] [Updated] (SPARK-30472) [SQL] ANSI SQL: Throw exception on format invalid and overflow when casting String to IntegerType.
[ https://issues.apache.org/jira/browse/SPARK-30472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-30472: Summary: [SQL] ANSI SQL: Throw exception on format invalid and overflow when casting String to IntegerType. (was: ANSI SQL: Cast String to Integer Type, throw exception on format invalid and overflow.) > [SQL] ANSI SQL: Throw exception on format invalid and overflow when casting > String to IntegerType. > -- > > Key: SPARK-30472 > URL: https://issues.apache.org/jira/browse/SPARK-30472 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: feiwang >Priority: Minor
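The behavior this sub-task proposes can be sketched in a few lines: a strict string-to-integer cast that raises on malformed input (instead of silently producing null) and raises on values outside the 32-bit IntegerType range (instead of wrapping). This is a Python sketch of the semantics only; the function name is hypothetical, and Spark's actual implementation lives in its Cast expression.

```python
# 32-bit IntegerType bounds, as in Spark SQL
INT_MIN, INT_MAX = -(2 ** 31), 2 ** 31 - 1

def ansi_cast_string_to_int(s: str) -> int:
    """Strict (ANSI-style) cast of a string to a 32-bit integer:
    raise on invalid format, raise on overflow, never return null."""
    try:
        value = int(s.strip())  # ValueError if not a valid integer literal
    except ValueError:
        raise ValueError("invalid input syntax for type int: '%s'" % s)
    if not (INT_MIN <= value <= INT_MAX):
        raise OverflowError("casting '%s' to int causes overflow" % s)
    return value
```

Non-ANSI Spark would instead yield null for `'12abc'` and a wrapped value for out-of-range input; the point of the sub-task is to surface both cases as errors.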
[jira] [Updated] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-30467: --- Description: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create a SparkContext because communication between the Spark Worker and the Spark Master is not possible if we configure *spark.network.crypto.enabled true*. *Error logs:* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found. was: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create Spark Context because communication between worker and master is not possible If we configured *spark.network.crypto.enabled true*. *Error logs :* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found.
Expected to find core file with name "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" JVMDUMP030W Cannot write dump to file /bin/javacore.20200109.064150.283.0002.txt: Permission denied JVMDUMP032I JVM requested Java dump using '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt JVMDUMP032I JVM requested Snap dump using '/bin/Snap.20200109.064150.283.0003.trc' in response to an event JVMDUMP030W Cannot write dump to file /bin/Snap.20200109.064150.283.0003.trc: Permission denied JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc JVMDUMP030W Cannot write dump to file /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied JVMDUMP007I JVM Requesting JIT dump using '/tmp/jitdump.20200109.064150.283.0004.dmp' JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp JVMDUMP013I Processed dump event "abort", detail "". was: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create Spark Context because communication between worker and master is not possible If we configured *spark.network.crypto.enabled true*. *Error logs :* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found. 
Expected to find core file with name "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" JVMDUMP030W Cannot write dump to file /bin/javacore.20200109.064150.283.0002.txt: Permission denied JVMDUMP032I JVM requested Java dump using '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt JVMDUMP032I JVM requested Snap dump using '/bin/Snap.20200109.064150.283.0003.trc' in response to an event JVMDUMP030W Cannot write dump to file /bin/Snap.20200109.064150.283.0003.trc: Permission denied JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc JVMDUMP030W Cannot write dump to file /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied JVMDUMP007I JVM Requesting JIT dump using '/tmp/jitdump.20200109.064150.283.0004.dmp' JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp JVMDUMP013I Processed dump event "abort", detail "". > On Federal Information Processing Standard (FIPS) enabled cluster, Spark > Workers are not able to connect to Remote Master. > -- > > Key: SPARK-30467 > URL: https://issues.apache.org/jira/browse/SPARK-30467 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.3.4, 2.4.4 >Reporter: SHOBHIT SHUKLA >Priority: Major > Labels: security > > On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark > Workers are not able to create Spark Context because communication between > Spark Worker and Spark Master is not possible If we configured > *spark.network.crypto.enabled true*. > *Error logs :* > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST > FAILURE > JVMDUMP039I Processing
[jira] [Updated] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-30467: --- Description: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create a SparkContext because communication between the worker and the master is not possible if we configure *spark.network.crypto.enabled true*. *Error logs:* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found. Expected to find core file with name "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" JVMDUMP030W Cannot write dump to file /bin/javacore.20200109.064150.283.0002.txt: Permission denied JVMDUMP032I JVM requested Java dump using '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt JVMDUMP032I JVM requested Snap dump using '/bin/Snap.20200109.064150.283.0003.trc' in response to an event JVMDUMP030W Cannot write dump to file /bin/Snap.20200109.064150.283.0003.trc: Permission denied JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc JVMDUMP030W Cannot write dump to file /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied JVMDUMP007I JVM Requesting JIT dump using '/tmp/jitdump.20200109.064150.283.0004.dmp' JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp JVMDUMP013I Processed dump event "abort", detail "". 
was: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create Spark Context If we configured *spark.network.crypto.enabled true*. *Error logs :* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found. Expected to find core file with name "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" JVMDUMP030W Cannot write dump to file /bin/javacore.20200109.064150.283.0002.txt: Permission denied JVMDUMP032I JVM requested Java dump using '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt JVMDUMP032I JVM requested Snap dump using '/bin/Snap.20200109.064150.283.0003.trc' in response to an event JVMDUMP030W Cannot write dump to file /bin/Snap.20200109.064150.283.0003.trc: Permission denied JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc JVMDUMP030W Cannot write dump to file /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied JVMDUMP007I JVM Requesting JIT dump using '/tmp/jitdump.20200109.064150.283.0004.dmp' JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp JVMDUMP013I Processed dump event "abort", detail "". > On Federal Information Processing Standard (FIPS) enabled cluster, Spark > Workers are not able to connect to Remote Master. 
> -- > > Key: SPARK-30467 > URL: https://issues.apache.org/jira/browse/SPARK-30467 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.3.4, 2.4.4 >Reporter: SHOBHIT SHUKLA >Priority: Major > Labels: security > > On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark > Workers are not able to create Spark Context because communication between > worker and master is not possible If we configured > *spark.network.crypto.enabled true*. > *Error logs :* > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST > FAILURE > JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - > please wait. > JVMDUMP032I JVM
[jira] [Updated] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-30467: --- Summary: On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master. (was: On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Master.) > On Federal Information Processing Standard (FIPS) enabled cluster, Spark > Workers are not able to connect to Remote Master. > -- > > Key: SPARK-30467 > URL: https://issues.apache.org/jira/browse/SPARK-30467 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.3.4, 2.4.4 >Reporter: SHOBHIT SHUKLA >Priority: Major > Labels: security > > On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark > Workers are not able to create Spark Context If we configured > *spark.network.crypto.enabled true*. > *Error logs :* > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST > FAILURE > JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - > please wait. > JVMDUMP032I JVM requested System dump using > '/bin/core.20200109.064150.283.0001.dmp' in response to an event > JVMDUMP030W Cannot write dump to > file/bin/core.20200109.064150.283.0001.dmp: Permission denied > JVMDUMP012E Error in System dump: The core file created by child process with > pid = 375 was not found. 
Expected to find core file with name > "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" > JVMDUMP030W Cannot write dump to file > /bin/javacore.20200109.064150.283.0002.txt: Permission denied > JVMDUMP032I JVM requested Java dump using > '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event > JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt > JVMDUMP032I JVM requested Snap dump using > '/bin/Snap.20200109.064150.283.0003.trc' in response to an event > JVMDUMP030W Cannot write dump to file > /bin/Snap.20200109.064150.283.0003.trc: Permission denied > JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc > JVMDUMP030W Cannot write dump to file > /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied > JVMDUMP007I JVM Requesting JIT dump using > '/tmp/jitdump.20200109.064150.283.0004.dmp' > JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp > JVMDUMP013I Processed dump event "abort", detail "". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30452) Add predict and numFeatures in Python IsotonicRegressionModel
[ https://issues.apache.org/jira/browse/SPARK-30452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30452. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27122 [https://github.com/apache/spark/pull/27122] > Add predict and numFeatures in Python IsotonicRegressionModel > - > > Key: SPARK-30452 > URL: https://issues.apache.org/jira/browse/SPARK-30452 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Trivial > Fix For: 3.0.0 > > > Since IsotonicRegressionModel doesn't extend JavaPredictionModel, predict and > numFeatures need to be added explicitly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
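For context, `predict` on an isotonic regression model is piecewise-linear interpolation between the fitted (boundary, prediction) pairs. A minimal sketch of that behavior (the function name and plain-list representation are illustrative, not the PySpark API being added here):

```python
from bisect import bisect_right

def isotonic_predict(boundaries, predictions, x):
    """Piecewise-linear interpolation over fitted (boundary, prediction) pairs.
    `boundaries` must be sorted ascending; values outside the fitted range
    are clamped to the first/last prediction."""
    if x <= boundaries[0]:
        return predictions[0]
    if x >= boundaries[-1]:
        return predictions[-1]
    # Find the segment [boundaries[i-1], boundaries[i]] containing x.
    i = bisect_right(boundaries, x)
    x0, x1 = boundaries[i - 1], boundaries[i]
    y0, y1 = predictions[i - 1], predictions[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

A call such as `isotonic_predict([1.0, 3.0, 5.0], [2.0, 4.0, 4.0], 2.0)` interpolates halfway along the first segment.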
[jira] [Assigned] (SPARK-30452) Add predict and numFeatures in Python IsotonicRegressionModel
[ https://issues.apache.org/jira/browse/SPARK-30452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30452: Assignee: Huaxin Gao > Add predict and numFeatures in Python IsotonicRegressionModel > - > > Key: SPARK-30452 > URL: https://issues.apache.org/jira/browse/SPARK-30452 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Trivial > > Since IsotonicRegressionModel doesn't extend JavaPredictionModel, predict and > numFeatures need to be added explicitly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011941#comment-17011941 ] Tobias Hermann commented on SPARK-30421: It allows you to write code that should break but does not. Then, later, somebody does an innocent and totally valid refactoring, and suddenly the code is broken, and this poor person goes crazy. ;) > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not but instead works without error, as if the column "bar" > would exist. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30472) ANSI SQL: Cast String to Integer Type, throw exception on format invalid and overflow.
feiwang created SPARK-30472: --- Summary: ANSI SQL: Cast String to Integer Type, throw exception on format invalid and overflow. Key: SPARK-30472 URL: https://issues.apache.org/jira/browse/SPARK-30472 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: feiwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
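The ANSI behavior this sub-task asks for can be sketched in plain Python (the function name and error messages are hypothetical; Spark's actual implementation lives in its Cast expression): reject malformed input and 32-bit overflow with exceptions instead of silently producing null.

```python
def ansi_cast_to_int(s: str) -> int:
    """ANSI-style cast of a string to a 32-bit integer: raise on malformed
    input or overflow rather than returning null/truncating."""
    INT_MIN, INT_MAX = -2**31, 2**31 - 1
    t = s.strip()
    try:
        v = int(t)
    except ValueError:
        raise ValueError(f"invalid input syntax for type int: '{s}'") from None
    if not (INT_MIN <= v <= INT_MAX):
        raise OverflowError(f"casting '{s}' to INT causes overflow")
    return v
```

So `'2147483648'`, which is one past Int.MaxValue, raises instead of being coerced.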
[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011930#comment-17011930 ] Wenchen Fan commented on SPARK-30421: - but it will not break anything, right? It just gives more chances to let your query compile. > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not but instead works without error, as if the column "bar" > would exist. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
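The check the reporter expects can be modeled with a toy resolver (a sketch with made-up names, not Spark's analyzer): a filter may only reference columns present in the plan's output, and referencing a dropped column should fail analysis.

```python
def resolve_filter(available_columns, filter_column):
    """Toy model of the expected analyzer check: reject filters over
    columns that are not in the plan's output."""
    if filter_column not in available_columns:
        raise ValueError(
            f"cannot resolve '`{filter_column}`' given input columns: "
            f"[{', '.join(available_columns)}]")
    return filter_column
```

Under this model, filtering on `bar` after `drop("bar")` (so only `foo` remains) raises the AnalysisException-style error quoted in the ticket.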
[jira] [Issue Comment Deleted] (SPARK-28317) Built-in Mathematical Functions: SCALE
[ https://issues.apache.org/jira/browse/SPARK-28317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Bonar updated SPARK-28317: --- Comment: was deleted (was: HI [~shivuson...@gmail.com]! Have you made any progress on the issue?) > Built-in Mathematical Functions: SCALE > -- > > Key: SPARK-28317 > URL: https://issues.apache.org/jira/browse/SPARK-28317 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Function||Return Type||Description||Example||Result|| > |{{scale(}}{{numeric}}{{)}}|{{integer}}|scale of the argument (the number of > decimal digits in the fractional part)|{{scale(8.41)}}|{{2}}| > https://www.postgresql.org/docs/11/functions-math.html#FUNCTIONS-MATH-FUNC-TABLE -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30428) File source V2: support partition pruning
[ https://issues.apache.org/jira/browse/SPARK-30428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30428. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27112 [https://github.com/apache/spark/pull/27112] > File source V2: support partition pruning > - > > Key: SPARK-30428 > URL: https://issues.apache.org/jira/browse/SPARK-30428 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30471) Fix issue when compare string and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-30471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-30471: Description: When comparing a StringType and an IntegerType: '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType). Now the result of findCommonTypeForBinaryComparison(StringType, IntegerType) is IntegerType. But the value of the string may exceed Int.MaxValue, and then the result is corrupted. For example: {code:java} // Some comments here CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS STRING)) AS ta(id); SELECT * FROM ta WHERE id > 0; // result is null {code} was: When comparing a StringType and an IntegerType: '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType). Now the result of findCommonTypeForBinaryComparison(StringType, IntegerType) is IntegerType. But the value of the string may exceed Int.MaxValue, and then the result is corrupted. For example: {code:java} // Some comments here CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS STRING)) AS ta(id); SELECT * FROM ta WHERE id > 0; {code} > Fix issue when compare string and IntegerType > - > > Key: SPARK-30471 > URL: https://issues.apache.org/jira/browse/SPARK-30471 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: feiwang >Priority: Major > > When comparing a StringType and an IntegerType: > '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType). > Now the result of findCommonTypeForBinaryComparison(StringType, IntegerType) > is IntegerType. > But the value of the string may exceed Int.MaxValue, and then the result is > corrupted. 
> For example: > {code:java} > // Some comments here > CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS > STRING)) AS ta(id); > SELECT * FROM ta WHERE id > 0; // result is null > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
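The corruption can be reproduced outside Spark with a toy model of the coercion (function names are illustrative): narrowing the string side to IntegerType makes the overflowing value null, so the predicate silently matches nothing, while a wider common type keeps the comparison correct.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def cast_to_int32_or_null(s):
    """Non-ANSI style cast: values outside the 32-bit range become None,
    mimicking SQL null."""
    v = int(s)
    return v if INT32_MIN <= v <= INT32_MAX else None

def compare_as_int32(s, n):
    """Model of the current behavior: coerce the string side to IntegerType
    before comparing; null propagates through the comparison."""
    c = cast_to_int32_or_null(s)
    return None if c is None else c > n

def compare_widened(s, n):
    """Model of a safer common type: compare without narrowing to 32 bits."""
    return int(s) > n
```

Here `compare_as_int32('2147483648', 0)` yields null (the row is dropped, matching the "result is null" in the example), while `compare_widened` correctly returns true.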
[jira] [Updated] (SPARK-30471) Fix issue when compare string and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-30471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-30471: Description: When we comparing a String Type and IntegerType: '2147483648'(StringType, which exceed Int.MaxValue) > 0(IntegerType). Now the result of findCommonTypeForBinaryComparison(StringType, IntegerType) is IntegerType. But the value of string may exceed Int.MaxValue, then the result is corruputed. For example: {code:java} // Some comments here CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS STRING)) AS ta(id); SELECT * FROM ta WHERE id > 0; {code} > Fix issue when compare string and IntegerType > - > > Key: SPARK-30471 > URL: https://issues.apache.org/jira/browse/SPARK-30471 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: feiwang >Priority: Major > > When we comparing a String Type and IntegerType: > '2147483648'(StringType, which exceed Int.MaxValue) > 0(IntegerType). > Now the result of findCommonTypeForBinaryComparison(StringType, IntegerType) > is IntegerType. > But the value of string may exceed Int.MaxValue, then the result is > corruputed. > For example: > {code:java} > // Some comments here > CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS > STRING)) AS ta(id); > SELECT * FROM ta WHERE id > 0; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011732#comment-17011732 ] Rafat commented on SPARK-10816: --- Same question as above: is there an SLA for this feature? Thanks > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf, Session > Window Support For Structure Streaming.pdf > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
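For readers new to the ticket, gap-based event-time sessionization amounts to: sort events by timestamp and start a new session whenever the gap since the previous event exceeds the session timeout. A minimal batch sketch (illustrative only; the attached design docs propose doing this incrementally with state and watermarks in Structured Streaming):

```python
def sessionize(event_times, gap):
    """Group event timestamps into sessions: a new session starts when the
    gap since the previous event exceeds `gap`."""
    sessions = []
    current = []
    last = None
    for t in sorted(event_times):
        if last is not None and t - last > gap:
            sessions.append(current)  # close the previous session
            current = []
        current.append(t)
        last = t
    if current:
        sessions.append(current)
    return sessions
```

For example, with a gap of 5, events at times 1, 2, 10, 11 split into two sessions.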
[jira] [Created] (SPARK-30471) Fix issue when compare string and IntegerType
feiwang created SPARK-30471: --- Summary: Fix issue when compare string and IntegerType Key: SPARK-30471 URL: https://issues.apache.org/jira/browse/SPARK-30471 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: feiwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30470) Uncache table in tempViews if needed on session closed
liupengcheng created SPARK-30470: Summary: Uncache table in tempViews if needed on session closed Key: SPARK-30470 URL: https://issues.apache.org/jira/browse/SPARK-30470 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.2 Reporter: liupengcheng Currently, Spark will not clean up cached tables in tempViews produced by sql like the following `CACHE TABLE table1 as SELECT ` There is a risk that `uncache table` is never called because the session closed unexpectedly, or the user closed it manually. These temp views are then lost, and we cannot visit them in another session, but the cached plan still exists in the `CacheManager`. Moreover, the leaks may cause the failure of subsequent queries; one failure we encountered in our production environment is as below: {code:java} Caused by: java.io.FileNotFoundException: File does not exist: /user//xx/data__db60e76d_91b8_42f3_909d_5c68692ecdd4 It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. 
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:131) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0.scan_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) {code} The above exception happens when user update the data of the table, but spark still use the old cached plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
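The proposed cleanup can be sketched as a per-session registry of cached temp views that is drained when the session closes (class and method names here are hypothetical, not Spark's internals):

```python
class CacheManager:
    """Stand-in for the shared cache of query plans."""
    def __init__(self):
        self.cached = set()

    def cache(self, name):
        self.cached.add(name)

    def uncache(self, name):
        self.cached.discard(name)


class Session:
    """Sketch of the proposal: track which temp views this session cached
    and uncache them when the session closes."""
    def __init__(self, cache_manager):
        self._cm = cache_manager
        self._session_cached_views = set()

    def cache_temp_view(self, name):
        self._cm.cache(name)
        self._session_cached_views.add(name)

    def close(self):
        # Without this step, the cached plans leak after the session is gone.
        for name in self._session_cached_views:
            self._cm.uncache(name)
        self._session_cached_views.clear()
```

Whether session close happens cleanly or via a shutdown hook, the point is that the shared `CacheManager` no longer retains entries for temp views that nobody can reference.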
[jira] [Updated] (SPARK-30469) Partition columns should not be involved when calculating sizeInBytes of Project logical plan
[ https://issues.apache.org/jira/browse/SPARK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30469: -- Description: When getting the statistics of a Project logical plan, if CBO not enabled, Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate the size in bytes, which will compute the ratio of the row size of the project plan and its child plan. And the row size is computed based on the output attributes (columns). Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition columns of hive table as well, which is not reasonable, because partition columns actually does not account for sizeInBytes. This may make the sizeInBytes not accurate. was: When getting the statistics of a Project logical plan, if CBO not enabled, Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate the size in bytes, which will compute the ratio of the row size of the project plan and its child plan. And the row size is computed based on the output attributes (columns). Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition columns of hive table as well, which is not reasonable, because hive partition column actually does not account for sizeInBytes. This may make the sizeInBytes not accurate. > Partition columns should not be involved when calculating sizeInBytes of > Project logical plan > - > > Key: SPARK-30469 > URL: https://issues.apache.org/jira/browse/SPARK-30469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > When getting the statistics of a Project logical plan, if CBO not enabled, > Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate > the size in bytes, which will compute the ratio of the row size of the > project plan and its child plan. > And the row size is computed based on the output attributes (columns). 
> Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition > columns of hive table as well, which is not reasonable, because partition > columns actually does not account for sizeInBytes. > This may make the sizeInBytes not accurate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30469) Hive Partition columns should not be involved when calculating sizeInBytes of Project logical plan
[ https://issues.apache.org/jira/browse/SPARK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30469: -- Description: When getting the statistics of a Project logical plan, if CBO not enabled, Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate the size in bytes, which will compute the ratio of the row size of the project plan and its child plan. And the row size is computed based on the output attributes (columns). Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition columns of hive table as well, which is not reasonable, because hive partition column actually does not account for sizeInBytes. This may make the sizeInBytes not accurate. was: When getting the statistics of a Project logical plan, if CBO not enabled, Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate the size in bytes, which will compute the ratio of the row size of the project plan and its child plan. And the row size is computed based on the out attributes (columns). Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition columns of hive table as well, which is not reasonable, because hive partition column actually does not account for sizeInBytes. This may make the sizeInBytes not accurate. > Hive Partition columns should not be involved when calculating sizeInBytes of > Project logical plan > -- > > Key: SPARK-30469 > URL: https://issues.apache.org/jira/browse/SPARK-30469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > When getting the statistics of a Project logical plan, if CBO not enabled, > Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate > the size in bytes, which will compute the ratio of the row size of the > project plan and its child plan. > And the row size is computed based on the output attributes (columns). 
> Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition > columns of hive table as well, which is not reasonable, because hive > partition column actually does not account for sizeInBytes. > This may make the sizeInBytes not accurate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30469) Partition columns should not be involved when calculating sizeInBytes of Project logical plan
[ https://issues.apache.org/jira/browse/SPARK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30469: -- Summary: Partition columns should not be involved when calculating sizeInBytes of Project logical plan (was: Hive Partition columns should not be involved when calculating sizeInBytes of Project logical plan) > Partition columns should not be involved when calculating sizeInBytes of > Project logical plan > - > > Key: SPARK-30469 > URL: https://issues.apache.org/jira/browse/SPARK-30469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > When getting the statistics of a Project logical plan, if CBO not enabled, > Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate > the size in bytes, which will compute the ratio of the row size of the > project plan and its child plan. > And the row size is computed based on the output attributes (columns). > Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition > columns of hive table as well, which is not reasonable, because hive > partition column actually does not account for sizeInBytes. > This may make the sizeInBytes not accurate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30469) Hive Partition columns should not be involved when calculating sizeInBytes of Project logical plan
Hu Fuwang created SPARK-30469: - Summary: Hive Partition columns should not be involved when calculating sizeInBytes of Project logical plan Key: SPARK-30469 URL: https://issues.apache.org/jira/browse/SPARK-30469 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Hu Fuwang When getting the statistics of a Project logical plan, if CBO is not enabled, Spark calls SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate the size in bytes, which computes the ratio of the row size of the Project plan and its child plan. And the row size is computed based on the output attributes (columns). Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involves partition columns of Hive tables as well, which is not reasonable because Hive partition columns do not actually account for sizeInBytes. This may make the sizeInBytes inaccurate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-30468: - Description: Currently data columns are displayed in one line for show create table command, when the table has many columns (to make things even worse, columns may have long names or comments), the displayed result is really hard to read. E.g. {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} To improve readability, we should print each column in a separate line. {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` ( `col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} was: Currently data columns are displayed in one line for show create table command, when the table has many columns, and columns may have long names or comments, the displayed result is really hard to read. E.g. {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} To improve readability, we should print each column in a separate line. 
{noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` ( `col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} > Use multiple lines to display data columns for show create table command > > > Key: SPARK-30468 > URL: https://issues.apache.org/jira/browse/SPARK-30468 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Priority: Minor > > Currently data columns are displayed in one line for show create table > command, when the table has many columns (to make things even worse, columns > may have long names or comments), the displayed result is really hard to > read. E.g. > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT > 'This is comment for column 3') > USING parquet > {noformat} > To improve readability, we should print each column in a separate line. > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` ( > `col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', > `col3` DOUBLE COMMENT 'This is comment for column 3') > USING parquet > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
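The proposed layout is easy to sketch: emit one column definition per line, comma-separated (a toy formatter with hypothetical names, not Spark's actual SHOW CREATE TABLE implementation):

```python
def format_create_table(table, columns, using):
    """Render a CREATE TABLE statement with one column definition per line.
    `columns` is a list of (name, data_type, comment) tuples; comment may be
    None when the column has no comment."""
    lines = []
    for name, dtype, comment in columns:
        col = f"  `{name}` {dtype}"
        if comment:
            col += f" COMMENT '{comment}'"
        lines.append(col)
    return (f"CREATE TABLE `{table}` (\n"
            + ",\n".join(lines)
            + f")\nUSING {using}")
```

Each column lands on its own indented line, matching the improved output shown in the ticket.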
[jira] [Updated] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-30468: - Description: Currently data columns are displayed in one line for show create table command, when the table has many columns, and columns may have long names or comments, the displayed result is really hard to read. E.g. {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} To improve readability, we should print each column in a separate line. {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` ( `col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} was: Currently data columns are displayed in one line for show create table command, when the table has many columns, and columns may have long names or comments, the displayed result is really hard to read. E.g. {{{noformat}}} spark-sql> show create table test_table; CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {{{noformat}}} To improve readability, we should print each column in a separate line. 
> Use multiple lines to display data columns for show create table command > > > Key: SPARK-30468 > URL: https://issues.apache.org/jira/browse/SPARK-30468 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Priority: Minor > > Currently data columns are displayed in one line for show create table > command, when the table has many columns, and columns may have long names or > comments, the displayed result is really hard to read. E.g. > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT > 'This is comment for column 3') > USING parquet > {noformat} > To improve readability, we should print each column in a separate line. > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` ( > `col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', > `col3` DOUBLE COMMENT 'This is comment for column 3') > USING parquet > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhenhua Wang updated SPARK-30468:
---------------------------------
    Description: 
Currently data columns are displayed in one line for the show create table command. When the table has many columns, and the columns have long names or comments, the displayed result is really hard to read. E.g.

{{{noformat}}}
spark-sql> show create table test_table;
CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3')
USING parquet
{{{noformat}}}

To improve readability, we should print each column on a separate line.

  was:
Currently data columns are displayed in one line for the show create table command. When the table has many columns, and the columns have long names or comments, the displayed result is really hard to read. E.g.

```
spark-sql> show create table test_table;
CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3')
USING parquet
```

To improve readability, we should print each column on a separate line.

> Use multiple lines to display data columns for show create table command
> -------------------------------------------------------------------------
>
>                 Key: SPARK-30468
>                 URL: https://issues.apache.org/jira/browse/SPARK-30468
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Zhenhua Wang
>            Priority: Minor
>
> Currently data columns are displayed in one line for the show create table
> command. When the table has many columns, and the columns have long names or
> comments, the displayed result is really hard to read. E.g.
> {{{noformat}}}
> spark-sql> show create table test_table;
> CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1',
> `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT
> 'This is comment for column 3')
> USING parquet
> {{{noformat}}}
> To improve readability, we should print each column on a separate line.
[jira] [Updated] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhenhua Wang updated SPARK-30468:
---------------------------------
    Description: 
Currently data columns are displayed in one line for the show create table command. When the table has many columns, and the columns have long names or comments, the displayed result is really hard to read. E.g.

```
spark-sql> show create table test_table;
CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3')
USING parquet
```

To improve readability, we should print each column on a separate line.

  was:
Currently data columns are displayed in one line for show create table command, when the table has many columns, and even worse, colu

> Use multiple lines to display data columns for show create table command
> -------------------------------------------------------------------------
>
>                 Key: SPARK-30468
>                 URL: https://issues.apache.org/jira/browse/SPARK-30468
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Zhenhua Wang
>            Priority: Minor
>
> Currently data columns are displayed in one line for the show create table
> command. When the table has many columns, and the columns have long names or
> comments, the displayed result is really hard to read. E.g.
> ```
> spark-sql> show create table test_table;
> CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1',
> `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT
> 'This is comment for column 3')
> USING parquet
> ```
> To improve readability, we should print each column on a separate line.
[jira] [Updated] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhenhua Wang updated SPARK-30468:
---------------------------------
    Description: 
Currently data columns are displayed in one line for show create table command, when the table has many columns, and even worse, colu

> Use multiple lines to display data columns for show create table command
> -------------------------------------------------------------------------
>
>                 Key: SPARK-30468
>                 URL: https://issues.apache.org/jira/browse/SPARK-30468
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Zhenhua Wang
>            Priority: Minor
>
> Currently data columns are displayed in one line for show create table
> command, when the table has many columns, and even worse, colu
[jira] [Created] (SPARK-30468) Use multiple lines to display data columns for show create table command
Zhenhua Wang created SPARK-30468:
------------------------------------

             Summary: Use multiple lines to display data columns for show create table command
                 Key: SPARK-30468
                 URL: https://issues.apache.org/jira/browse/SPARK-30468
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Zhenhua Wang
[jira] [Commented] (SPARK-28883) Fix a flaky test: ThriftServerQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-28883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011533#comment-17011533 ]

Jungtaek Lim commented on SPARK-28883:
--------------------------------------
Would SPARK-30345 be a complement to this? Or does this issue cover more cases?

> Fix a flaky test: ThriftServerQueryTestSuite
> --------------------------------------------
>
>                 Key: SPARK-28883
>                 URL: https://issues.apache.org/jira/browse/SPARK-28883
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109764/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/
> (2 failures)
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109768/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/
> (4 failures)
> Error message:
> {noformat}
> java.sql.SQLException: Could not open client transport with JDBC Uri:
> jdbc:hive2://localhost:14431: java.net.ConnectException: Connection refused
> (Connection refused)
> {noformat}
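The `Connection refused` error above is a classic race: the test tries to open a JDBC connection before the Thrift server socket is listening. One common mitigation is to retry the connection with a short backoff. The sketch below is a generic retry helper under stated assumptions: the helper name, the `Supplier` signature, and the simulated flaky operation are all illustrative, and this is not the fix the Spark test suite actually adopted.

```java
import java.util.function.Supplier;

public class RetryConnect {
    // Runs op up to `attempts` times, sleeping `sleepMillis` between failures.
    // Rethrows the last failure if every attempt fails.
    static <T> T withRetries(int attempts, long sleepMillis, Supplier<T> op) {
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        if (last != null) throw last;
        throw new IllegalArgumentException("attempts must be > 0");
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky server: refuses the first two connection attempts,
        // then "accepts" (stands in for DriverManager.getConnection on the
        // hive2 JDBC URL from the error message).
        String result = withRetries(5, 10, () -> {
            if (++calls[0] < 3) throw new RuntimeException("Connection refused");
            return "connected";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints: connected after 3 attempts
    }
}
```

In a real test fixture the retried operation would be the JDBC connect itself, and the retry budget should comfortably exceed the server's worst-case startup time on a loaded CI machine.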