[jira] [Assigned] (SPARK-44280) Add convertJavaTimestampToTimestamp in JDBCDialect API
[ https://issues.apache.org/jira/browse/SPARK-44280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-44280: --- Assignee: Mingkang Li > Add convertJavaTimestampToTimestamp in JDBCDialect API > -- > > Key: SPARK-44280 > URL: https://issues.apache.org/jira/browse/SPARK-44280 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Mingkang Li >Assignee: Mingkang Li >Priority: Major > Fix For: 3.5.0 > > > A new method, {{convertJavaTimestampToTimestamp}}, is introduced to the > JDBCDialect API, giving JDBC dialects the ability to override the default > Java timestamp conversion behavior. This enhancement is particularly > beneficial for databases such as PostgreSQL, which use special values to > represent positive and negative infinity timestamps. > The pre-existing default conversion can overflow on these special values > (i.e., the executor would crash when selecting a column that contains > infinity timestamps in PostgreSQL). By overriding this new method, dialects > can avoid such failures, enabling more robust timestamp value conversions > across various JDBC-based connectors. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
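The overflow described above can be sketched in plain Java. This is a hypothetical illustration, not the actual Spark patch: it assumes, as the PostgreSQL JDBC driver's PGStatement constants do, that 'infinity' and '-infinity' surface as java.sql.Timestamp values whose getTime() is Long.MAX_VALUE / Long.MIN_VALUE, and clamps them instead of letting the millisecond-to-microsecond multiplication overflow.

```java
import java.sql.Timestamp;

public class TimestampClamp {
    // Sentinel millisecond values assumed for 'infinity' and '-infinity'
    // (matching the PostgreSQL driver's PGStatement constants).
    static final long POSITIVE_INFINITY_MILLIS = Long.MAX_VALUE;
    static final long NEGATIVE_INFINITY_MILLIS = Long.MIN_VALUE;

    /**
     * Convert a java.sql.Timestamp to microseconds since the epoch,
     * clamping the infinity sentinels; multiplying them by 1000 would
     * overflow a long, which is the crash described in the issue.
     */
    public static long convertJavaTimestampToTimestamp(Timestamp t) {
        long millis = t.getTime();
        if (millis == POSITIVE_INFINITY_MILLIS) return Long.MAX_VALUE;
        if (millis == NEGATIVE_INFINITY_MILLIS) return Long.MIN_VALUE;
        return Math.multiplyExact(millis, 1000L); // throws instead of silently wrapping
    }
}
```

A dialect-specific override along these lines lets infinity rows survive the read path instead of killing the executor.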
[jira] [Resolved] (SPARK-44280) Add convertJavaTimestampToTimestamp in JDBCDialect API
[ https://issues.apache.org/jira/browse/SPARK-44280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44280. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41843 [https://github.com/apache/spark/pull/41843] > Add convertJavaTimestampToTimestamp in JDBCDialect API > -- > > Key: SPARK-44280 > URL: https://issues.apache.org/jira/browse/SPARK-44280 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Mingkang Li >Priority: Major > Fix For: 3.5.0 > > > A new method, {{convertJavaTimestampToTimestamp}}, is introduced to the > JDBCDialect API, giving JDBC dialects the ability to override the default > Java timestamp conversion behavior. This enhancement is particularly > beneficial for databases such as PostgreSQL, which use special values to > represent positive and negative infinity timestamps. > The pre-existing default conversion can overflow on these special values > (i.e., the executor would crash when selecting a column that contains > infinity timestamps in PostgreSQL). By overriding this new method, dialects > can avoid such failures, enabling more robust timestamp value conversions > across various JDBC-based connectors.
[jira] [Updated] (SPARK-44627) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows produces wrong data
[ https://issues.apache.org/jira/browse/SPARK-44627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Zhao updated SPARK-44627: - Description: When the ResultSet contains a timestamp column whose value is null but the column is defined as NOT NULL, the row it generates reuses the value of the same column from the previous row. In MySQL, if a datetime column is defined as NOT NULL and stores the zero date '0000-00-00 00:00:00', the zeroDateTimeBehavior connection property makes the driver return null. table definition: CREATE TABLE `test_timestamp` ( `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key id', `unbind_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', PRIMARY KEY (`id`) ) example: the value of resultSet 1, 2023-01-01 12:00:00 2, null the value of row 1, 2023-01-01 12:00:00 2, 2023-01-01 12:00:00 was: When the ResultSet contains a timestamp column whose value is null, the row it generates reuses the value of the same column from the previous row. In MySQL, if a datetime column is defined as NOT NULL and stores the zero date '0000-00-00 00:00:00', the zeroDateTimeBehavior connection property makes the driver return null. table definition: CREATE TABLE `test_timestamp` ( `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key id', `unbind_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', PRIMARY KEY (`id`) ) example: the value of resultSet 1, 2023-01-01 12:00:00 2, null the value of row 1, 2023-01-01 12:00:00 2, 2023-01-01 12:00:00 > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows > produces wrong data > - > > Key: SPARK-44627 > URL: https://issues.apache.org/jira/browse/SPARK-44627 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 3.3.1 >Reporter: Min Zhao >Priority: Minor > Attachments: image-2023-08-02-14-01-54-447.png > > > When the ResultSet contains a timestamp column whose value is null but the > column is defined as NOT NULL, the row it generates reuses the value of the > same column from the previous row. > > In MySQL, if a datetime column is defined as NOT NULL and stores the zero > date '0000-00-00 00:00:00', the zeroDateTimeBehavior connection property > makes the driver return null. > table definition: > CREATE TABLE `test_timestamp` ( > `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key id', > `unbind_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', > PRIMARY KEY (`id`) > ) > example: > the value of resultSet > 1, 2023-01-01 12:00:00 > 2, null > > the value of row > 1, 2023-01-01 12:00:00 > 2, 2023-01-01 12:00:00 >
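The failure mode reported in this issue (a reused mutable row buffer where a null cell only flips the null flag and leaves the previous row's value in place) can be reproduced with a small stand-alone sketch. The class and method names here are hypothetical and only mimic the pattern, not Spark's actual JdbcUtils code.

```java
import java.util.ArrayList;
import java.util.List;

public class StaleValueDemo {
    /** A tiny mutable row buffer, reused across ResultSet rows. */
    static class MutableRow {
        String value;
        boolean isNull;
    }

    /**
     * Buggy reader: a null cell only sets the isNull flag; the stale value
     * from the previous row is what a flag-ignoring consumer then sees.
     */
    public static List<String> readBuggy(String[] column) {
        MutableRow row = new MutableRow(); // one buffer, reused for every row
        List<String> out = new ArrayList<>();
        for (String cell : column) {
            if (cell == null) {
                row.isNull = true;         // BUG: row.value is not cleared
            } else {
                row.isNull = false;
                row.value = cell;
            }
            out.add(row.value);            // consumer reads value, not the flag
        }
        return out;
    }

    /** Fixed reader: materialize null whenever the null flag is set. */
    public static List<String> readFixed(String[] column) {
        MutableRow row = new MutableRow();
        List<String> out = new ArrayList<>();
        for (String cell : column) {
            row.isNull = (cell == null);
            if (!row.isNull) {
                row.value = cell;
            }
            out.add(row.isNull ? null : row.value);
        }
        return out;
    }
}
```

With the column values from the issue's example, the buggy reader returns the first row's timestamp again for the null second row, while the fixed reader returns null.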
[jira] [Updated] (SPARK-44572) Clean up unused installers ASAP
[ https://issues.apache.org/jira/browse/SPARK-44572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-44572: -- Summary: Clean up unused installers ASAP (was: Clean up unused installer ASAP) > Clean up unused installers ASAP > --- > > Key: SPARK-44572 > URL: https://issues.apache.org/jira/browse/SPARK-44572 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Updated] (SPARK-44572) Clean up unused installer ASAP
[ https://issues.apache.org/jira/browse/SPARK-44572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-44572: -- Summary: Clean up unused installer ASAP (was: Clean up unused files ASAP) > Clean up unused installer ASAP > -- > > Key: SPARK-44572 > URL: https://issues.apache.org/jira/browse/SPARK-44572 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Updated] (SPARK-43043) Improve the performance of MapOutputTracker.updateMapOutput
[ https://issues.apache.org/jira/browse/SPARK-43043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-43043: - Fix Version/s: 3.5.0 (was: 3.4.1) > Improve the performance of MapOutputTracker.updateMapOutput > --- > > Key: SPARK-43043 > URL: https://issues.apache.org/jira/browse/SPARK-43043 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.2 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Major > Fix For: 3.5.0 > > > Inside of MapOutputTracker, there is a line of code which does a linear find > through a mapStatuses collection: > https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L167 > (plus a similar search a few lines down at > https://github.com/apache/spark/blob/cb48c0e48eeff2b7b51176d0241491300e5aad6f/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L174) > This scan is necessary because we only know the mapId of the updated status > and not its mapPartitionId. > We perform this scan once per migrated block, so if a large proportion of all > blocks in the map are migrated then we get O(n^2) total runtime across all of > the calls. > I think we might be able to fix this by extending ShuffleStatus to have an > OpenHashMap mapping from mapId to mapPartitionId.
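The suggested fix can be sketched as follows: alongside the per-partition statuses, maintain a side map from mapId to partition index so each update is an O(1) lookup instead of an O(n) scan. Class and field names are illustrative, not Spark's actual ShuffleStatus internals (which the issue suggests would use an OpenHashMap).

```java
import java.util.HashMap;
import java.util.Map;

public class MapIdIndex {
    private final long[] mapIds;                       // mapId stored at each partition slot
    private final Map<Long, Integer> mapIdToPartition = new HashMap<>();

    public MapIdIndex(long[] mapIds) {
        this.mapIds = mapIds;
        for (int i = 0; i < mapIds.length; i++) {
            mapIdToPartition.put(mapIds[i], i);        // built once, O(n)
        }
    }

    /** O(n) linear scan, as in the code before the change. */
    public int findByScan(long mapId) {
        for (int i = 0; i < mapIds.length; i++) {
            if (mapIds[i] == mapId) return i;
        }
        return -1;
    }

    /** O(1) lookup via the side index; avoids O(n^2) across many migrations. */
    public int findByIndex(long mapId) {
        return mapIdToPartition.getOrDefault(mapId, -1);
    }
}
```

Both lookups return the same partition index; only the cost per migrated block differs.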
[jira] [Assigned] (SPARK-44630) Revert SPARK-43043 Improve the performance of MapOutputTracker.updateMapOutput
[ https://issues.apache.org/jira/browse/SPARK-44630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44630: Assignee: Dongjoon Hyun > Revert SPARK-43043 Improve the performance of MapOutputTracker.updateMapOutput > -- > > Key: SPARK-44630 > URL: https://issues.apache.org/jira/browse/SPARK-44630 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major >
[jira] [Resolved] (SPARK-44630) Revert SPARK-43043 Improve the performance of MapOutputTracker.updateMapOutput
[ https://issues.apache.org/jira/browse/SPARK-44630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44630. -- Fix Version/s: 3.4.2 Resolution: Fixed Issue resolved by pull request 42285 [https://github.com/apache/spark/pull/42285] > Revert SPARK-43043 Improve the performance of MapOutputTracker.updateMapOutput > -- > > Key: SPARK-44630 > URL: https://issues.apache.org/jira/browse/SPARK-44630 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.2 > >
[jira] [Updated] (SPARK-44627) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows produces wrong data
[ https://issues.apache.org/jira/browse/SPARK-44627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Zhao updated SPARK-44627: - Priority: Minor (was: Major) > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows > produces wrong data > - > > Key: SPARK-44627 > URL: https://issues.apache.org/jira/browse/SPARK-44627 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 3.3.1 >Reporter: Min Zhao >Priority: Minor > Attachments: image-2023-08-02-14-01-54-447.png > > > When the ResultSet contains a timestamp column whose value is null, the row > it generates reuses the value of the same column from the previous row. > > In MySQL, if a datetime column is defined as NOT NULL and stores the zero > date '0000-00-00 00:00:00', the zeroDateTimeBehavior connection property > makes the driver return null. > table definition: > CREATE TABLE `test_timestamp` ( > `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key id', > `unbind_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', > PRIMARY KEY (`id`) > ) > example: > the value of resultSet > 1, 2023-01-01 12:00:00 > 2, null > > the value of row > 1, 2023-01-01 12:00:00 > 2, 2023-01-01 12:00:00 >
[jira] [Commented] (SPARK-44627) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows produces wrong data
[ https://issues.apache.org/jira/browse/SPARK-44627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750115#comment-17750115 ] Min Zhao commented on SPARK-44627: -- !image-2023-08-02-14-01-54-447.png! It only updates isNull to true, but the value stays the same as in the last row. > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows > produces wrong data > - > > Key: SPARK-44627 > URL: https://issues.apache.org/jira/browse/SPARK-44627 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 3.3.1 >Reporter: Min Zhao >Priority: Major > Attachments: image-2023-08-02-14-01-54-447.png > > > When the ResultSet contains a timestamp column whose value is null, the row > it generates reuses the value of the same column from the previous row. > > In MySQL, if a datetime column is defined as NOT NULL and stores the zero > date '0000-00-00 00:00:00', the zeroDateTimeBehavior connection property > makes the driver return null. > table definition: > CREATE TABLE `test_timestamp` ( > `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key id', > `unbind_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', > PRIMARY KEY (`id`) > ) > example: > the value of resultSet > 1, 2023-01-01 12:00:00 > 2, null > > the value of row > 1, 2023-01-01 12:00:00 > 2, 2023-01-01 12:00:00 >
[jira] [Updated] (SPARK-44627) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows produces wrong data
[ https://issues.apache.org/jira/browse/SPARK-44627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Zhao updated SPARK-44627: - Description: When the ResultSet contains a timestamp column whose value is null, the row it generates reuses the value of the same column from the previous row. In MySQL, if a datetime column is defined as NOT NULL and stores the zero date '0000-00-00 00:00:00', the zeroDateTimeBehavior connection property makes the driver return null. table definition: CREATE TABLE `test_timestamp` ( `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key id', `unbind_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', PRIMARY KEY (`id`) ) example: the value of resultSet 1, 2023-01-01 12:00:00 2, null the value of row 1, 2023-01-01 12:00:00 2, 2023-01-01 12:00:00 was: When the ResultSet contains a timestamp column whose value is null, the row it generates reuses the value of the same column from the previous row. In MySQL, if a datetime column is defined as NOT NULL and stores the zero date '0000-00-00 00:00:00', the zeroDateTimeBehavior connection property makes the driver return null. table definition: CREATE TABLE `test_timestamp` ( `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key id', `unbind_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', PRIMARY KEY (`id`) ) example: the value of resultSet 1, 2023-01-01 12:00:00 2, null the value of row 1, 2023-01-01 12:00:00 2, 2023-01-01 12:00:00 > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows > produces wrong data > - > > Key: SPARK-44627 > URL: https://issues.apache.org/jira/browse/SPARK-44627 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 3.3.1 >Reporter: Min Zhao >Priority: Major > Attachments: image-2023-08-02-14-01-54-447.png > > > When the ResultSet contains a timestamp column whose value is null, the row > it generates reuses the value of the same column from the previous row. > > In MySQL, if a datetime column is defined as NOT NULL and stores the zero > date '0000-00-00 00:00:00', the zeroDateTimeBehavior connection property > makes the driver return null. > table definition: > CREATE TABLE `test_timestamp` ( > `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key id', > `unbind_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', > PRIMARY KEY (`id`) > ) > example: > the value of resultSet > 1, 2023-01-01 12:00:00 > 2, null > > the value of row > 1, 2023-01-01 12:00:00 > 2, 2023-01-01 12:00:00 >
[jira] [Updated] (SPARK-44627) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows produces wrong data
[ https://issues.apache.org/jira/browse/SPARK-44627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Zhao updated SPARK-44627: - Attachment: image-2023-08-02-14-01-54-447.png > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows > produces wrong data > - > > Key: SPARK-44627 > URL: https://issues.apache.org/jira/browse/SPARK-44627 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 3.3.1 >Reporter: Min Zhao >Priority: Major > Attachments: image-2023-08-02-14-01-54-447.png > > > When the ResultSet contains a timestamp column whose value is null, the row > it generates reuses the value of the same column from the previous row. > > In MySQL, if a datetime column is defined as NOT NULL and stores the zero > date '0000-00-00 00:00:00', the zeroDateTimeBehavior connection property > makes the driver return null. > table definition: > CREATE TABLE `test_timestamp` ( > `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key id', > `unbind_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', > PRIMARY KEY (`id`) > ) > example: > the value of resultSet > 1, 2023-01-01 12:00:00 > 2, null > > the value of row > 1, 2023-01-01 12:00:00 > 2, 2023-01-01 12:00:00 >
[jira] [Updated] (SPARK-44627) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows produces wrong data
[ https://issues.apache.org/jira/browse/SPARK-44627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Zhao updated SPARK-44627: - Description: When the ResultSet contains a timestamp column whose value is null, the row it generates reuses the value of the same column from the previous row. In MySQL, if a datetime column is defined as NOT NULL and stores the zero date '0000-00-00 00:00:00', the zeroDateTimeBehavior connection property makes the driver return null. table definition: CREATE TABLE `test_timestamp` ( `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key id', `unbind_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', PRIMARY KEY (`id`) ) example: the value of resultSet 1, 2023-01-01 12:00:00 2, null the value of row 1, 2023-01-01 12:00:00 2, 2023-01-01 12:00:00 was: When the ResultSet contains a timestamp column whose value is null, the row it generates reuses the value of the same column from the previous row. example: the value of resultSet 1, 2023-01-01 12:00:00 2, null the value of row 1, 2023-01-01 12:00:00 2, 2023-01-01 12:00:00 > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows > produces wrong data > - > > Key: SPARK-44627 > URL: https://issues.apache.org/jira/browse/SPARK-44627 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 3.3.1 >Reporter: Min Zhao >Priority: Major > Attachments: image-2023-08-02-14-01-54-447.png > > > When the ResultSet contains a timestamp column whose value is null, the row > it generates reuses the value of the same column from the previous row. > > In MySQL, if a datetime column is defined as NOT NULL and stores the zero > date '0000-00-00 00:00:00', the zeroDateTimeBehavior connection property > makes the driver return null. > table definition: > CREATE TABLE `test_timestamp` ( > `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key id', > `unbind_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', > PRIMARY KEY (`id`) > ) > example: > the value of resultSet > 1, 2023-01-01 12:00:00 > 2, null > > the value of row > 1, 2023-01-01 12:00:00 > 2, 2023-01-01 12:00:00 >
[jira] [Resolved] (SPARK-44555) Use checkError() to check Exception in command Suite & assign some error class names
[ https://issues.apache.org/jira/browse/SPARK-44555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-44555. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42169 [https://github.com/apache/spark/pull/42169] > Use checkError() to check Exception in command Suite & assign some error > class names > > > Key: SPARK-44555 > URL: https://issues.apache.org/jira/browse/SPARK-44555 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0, 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0, 4.0.0 > >
[jira] [Assigned] (SPARK-44555) Use checkError() to check Exception in command Suite & assign some error class names
[ https://issues.apache.org/jira/browse/SPARK-44555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-44555: Assignee: BingKun Pan > Use checkError() to check Exception in command Suite & assign some error > class names > > > Key: SPARK-44555 > URL: https://issues.apache.org/jira/browse/SPARK-44555 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0, 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor >
[jira] [Created] (SPARK-44632) DiskBlockManager should check and be able to handle stale directories
Kent Yao created SPARK-44632: Summary: DiskBlockManager should check and be able to handle stale directories Key: SPARK-44632 URL: https://issues.apache.org/jira/browse/SPARK-44632 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.1, 3.5.0 Reporter: Kent Yao The subDir in the memory cache could be stale, for example, after a damaged disk repair or replacement. This dir could be accessed subsequently by others. Especially, `filename` generated by `RDDBlockId` is unchanged between task retries, so it probably attempts to access the same subDir repeatedly. Therefore, it is necessary to check if the subDir exists. If it is stale and the hardware has been recovered without data and directories, we will recreate the subDir to prevent FileNotFoundException during writing.
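The proposed check can be sketched as a small stand-alone class (names are hypothetical, not DiskBlockManager's real fields): before handing out a cached subdirectory, verify it still exists on disk and recreate it if it vanished, so a later write does not fail with FileNotFoundException.

```java
import java.io.File;

public class SubDirCache {
    private final File[] subDirs; // cached subdirectories; entries can go stale

    public SubDirCache(File root, int count) {
        subDirs = new File[count];
        for (int i = 0; i < count; i++) {
            subDirs[i] = new File(root, String.format("%02x", i));
        }
    }

    /**
     * Return the subdirectory for a slot. The cached entry can be stale after
     * a disk repair or replacement, so re-check existence and recreate the
     * directory instead of letting a later write fail.
     */
    public File getSubDir(int slot) {
        File dir = subDirs[slot];
        if (!dir.exists() && !dir.mkdirs()) {
            throw new IllegalStateException("Failed to create " + dir);
        }
        return dir;
    }
}
```

Re-checking on every access trades a cheap stat call for robustness against a repaired disk coming back empty.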
[jira] [Created] (SPARK-44631) Remove session-based directory when the isolated session cache is evicted
Hyukjin Kwon created SPARK-44631: Summary: Remove session-based directory when the isolated session cache is evicted Key: SPARK-44631 URL: https://issues.apache.org/jira/browse/SPARK-44631 Project: Spark Issue Type: Task Components: Connect Affects Versions: 3.5.0 Reporter: Hyukjin Kwon SPARK-44078 added the cache for isolated sessions, and SPARK-44348 added the session-based directory for isolation. When the isolated session cache is evicted, we should remove the session-based directory so it doesn't fail when the same session is used, see also https://github.com/apache/spark/pull/41625#discussion_r1251427466
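The cleanup described above can be sketched with a size-bounded LRU cache whose eviction hook deletes the per-session directory. The class and layout are hypothetical, not Spark Connect's actual session cache; they only illustrate tying directory removal to cache eviction.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

public class SessionDirCache {
    private final File baseDir;
    private final Map<String, File> cache;

    public SessionDirCache(File base, final int maxSessions) {
        this.baseDir = base;
        // Access-ordered LRU map; the eviction hook deletes the evicted
        // session's directory so a later reuse of the id starts clean.
        this.cache = new LinkedHashMap<String, File>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, File> eldest) {
                if (size() > maxSessions) {
                    deleteRecursively(eldest.getValue());
                    return true;
                }
                return false;
            }
        };
    }

    /** Get or create the per-session scratch directory. */
    public File getOrCreate(String sessionId) {
        File dir = cache.get(sessionId);
        if (dir == null) {
            dir = new File(baseDir, sessionId);
            dir.mkdirs();
            cache.put(sessionId, dir); // put() drives the eviction hook
        }
        return dir;
    }

    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        f.delete();
    }
}
```

The key design point is that the directory's lifetime is owned by the cache entry, so eviction and cleanup cannot drift apart.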
[jira] [Updated] (SPARK-44588) Migrated shuffle blocks are encrypted multiple times when io.encryption is enabled
[ https://issues.apache.org/jira/browse/SPARK-44588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44588: -- Fix Version/s: 3.3.3 > Migrated shuffle blocks are encrypted multiple times when io.encryption is > enabled > --- > > Key: SPARK-44588 > URL: https://issues.apache.org/jira/browse/SPARK-44588 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, > 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1 >Reporter: Henry Mai >Assignee: Henry Mai >Priority: Critical > Fix For: 3.3.3, 3.4.2, 3.5.0 > > > Shuffle blocks upon migration are wrapped for encryption again when being > written out to a file on the receiver side. > > Pull request to fix this: https://github.com/apache/spark/pull/42214 > > Details: > Sender/Read side: > BlockManagerDecommissioner:run() > blocks = bm.migratableResolver.getMigrationBlocks() > *dataFile = IndexShuffleBlockResolver:getDataFile(...)* > buffer = FileSegmentManagedBuffer(..., dataFile) > *^ This reads straight from disk without decryption* > blocks.foreach((blockId, buffer) => > bm.blockTransferService.uploadBlockSync(..., buffer, ...)) > -> uploadBlockSync() -> uploadBlock(..., buffer, ...) > -> client.uploadStream(UploadBlockStream, buffer, ...) > - Notice that there is no decryption here on the sender/read side. > Receiver/Write side: > NettyBlockRpcServer:receiveStream() <--- This is the UploadBlockStream handler > putBlockDataAsStream() > migratableResolver.putShuffleBlockAsStream() > *-> file = IndexShuffleBlockResolver:getDataFile(...)* > -> tmpFile = (file + .extension) > *-> Creates an encrypting writable channel to a tmpFile using > serializerManager.wrapStream()* > -> onData() writes the data into the channel > -> onComplete() renames the tmpFile to the file > - Notice: > * Both getMigrationBlocks()[read] and putShuffleBlockAsStream()[write] > target IndexShuffleBlockResolver:getDataFile() > * The read path does not decrypt but the write path encrypts. > * As a thought exercise: if this cycle happens more than once (where this > receiver is now a sender), even if we assume that the shuffle blocks are > initially unencrypted*, the bytes in the file will just have more and more > layers of encryption applied to them each time they get migrated. > * *In practice, the shuffle blocks are encrypted on disk to begin with; this > is just a thought exercise
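The "layers of encryption" argument above can be made concrete with a toy byte-shift cipher standing in for serializerManager.wrapStream() (an assumption of this sketch, not Spark's real cipher): the sender reads raw bytes with no decryption and the receiver writes through an encrypting channel, so each migration adds one more layer.

```java
public class DoubleEncryptDemo {
    // Toy stand-in for an encrypting stream wrapper: shift each byte by +1.
    static byte[] encrypt(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) out[i] = (byte) (in[i] + 1);
        return out;
    }

    static byte[] decrypt(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) out[i] = (byte) (in[i] - 1);
        return out;
    }

    /**
     * One block migration as described in the issue: the sender reads the
     * on-disk bytes as-is (no decrypt) and the receiver writes them through
     * an encrypting channel, adding another layer.
     */
    static byte[] migrate(byte[] onDiskBytes) {
        return encrypt(onDiskBytes);
    }
}
```

After one migration of an already-encrypted file, a single decrypt no longer recovers the plaintext; it takes two, one per accumulated layer.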
[jira] [Updated] (SPARK-44588) Migrated shuffle blocks are encrypted multiple times when io.encryption is enabled
[ https://issues.apache.org/jira/browse/SPARK-44588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44588: -- Fix Version/s: 3.4.2 > Migrated shuffle blocks are encrypted multiple times when io.encryption is > enabled > --- > > Key: SPARK-44588 > URL: https://issues.apache.org/jira/browse/SPARK-44588 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, > 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1 >Reporter: Henry Mai >Assignee: Henry Mai >Priority: Critical > Fix For: 3.4.2, 3.5.0 > > > Shuffle blocks upon migration are wrapped for encryption again when being > written out to a file on the receiver side. > > Pull request to fix this: https://github.com/apache/spark/pull/42214 > > Details: > Sender/Read side: > BlockManagerDecommissioner:run() > blocks = bm.migratableResolver.getMigrationBlocks() > *dataFile = IndexShuffleBlockResolver:getDataFile(...)* > buffer = FileSegmentManagedBuffer(..., dataFile) > *^ This reads straight from disk without decryption* > blocks.foreach((blockId, buffer) => > bm.blockTransferService.uploadBlockSync(..., buffer, ...)) > -> uploadBlockSync() -> uploadBlock(..., buffer, ...) > -> client.uploadStream(UploadBlockStream, buffer, ...) > - Notice that there is no decryption here on the sender/read side. > Receiver/Write side: > NettyBlockRpcServer:receiveStream() <--- This is the UploadBlockStream handler > putBlockDataAsStream() > migratableResolver.putShuffleBlockAsStream() > *-> file = IndexShuffleBlockResolver:getDataFile(...)* > -> tmpFile = (file + .extension) > *-> Creates an encrypting writable channel to a tmpFile using > serializerManager.wrapStream()* > -> onData() writes the data into the channel > -> onComplete() renames the tmpFile to the file > - Notice: > * Both getMigrationBlocks()[read] and putShuffleBlockAsStream()[write] > target IndexShuffleBlockResolver:getDataFile() > * The read path does not decrypt but the write path encrypts. > * As a thought exercise: if this cycle happens more than once (where this > receiver is now a sender), even if we assume that the shuffle blocks are > initially unencrypted*, the bytes in the file will just have more and more > layers of encryption applied to them each time they get migrated. > * *In practice, the shuffle blocks are encrypted on disk to begin with; this > is just a thought exercise
[jira] [Updated] (SPARK-44600) Make `repl` module daily test pass
[ https://issues.apache.org/jira/browse/SPARK-44600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-44600: - Description: [https://github.com/apache/spark/actions/runs/5727123477/job/15518895421]
{code:java}
- SPARK-15236: use Hive catalog *** FAILED ***
  isContain was true Interpreter output contained 'Exception':
  Welcome to
        ____              __
       / __/__  ___ _____/ /__
      _\ \/ _ \/ _ `/ __/  '_/
     /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
        /_/

  Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_372)
  Type in expressions to have them evaluated.
  Type :help for more information.

  scala>
  scala> java.lang.NoClassDefFoundError: org/sparkproject/guava/cache/CacheBuilder
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:197)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog$lzycompute(BaseSessionStateBuilder.scala:153)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog(BaseSessionStateBuilder.scala:152)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog$lzycompute(BaseSessionStateBuilder.scala:166)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog(BaseSessionStateBuilder.scala:166)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager$lzycompute(BaseSessionStateBuilder.scala:168)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager(BaseSessionStateBuilder.scala:168)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$1.<init>(BaseSessionStateBuilder.scala:185)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer(BaseSessionStateBuilder.scala:185)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$build$2(BaseSessionStateBuilder.scala:374)
    at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:92)
    at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)
    at org.apache.spark.sql.SparkSession.$anonfun$sql$4(SparkSession.scala:691)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:682)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:713)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:744)
    ... 100 elided
  Caused by: java.lang.ClassNotFoundException: org.sparkproject.guava.cache.CacheBuilder
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 130 more

  scala> |
  scala> :quit (ReplSuite.scala:83)
{code}
> Make `repl` module daily test pass > -- > > Key: SPARK-44600 > URL: https://issues.apache.org/jira/browse/SPARK-44600 > Project: Spark > Issue Type: Sub-task > Components: Tests > Affects Versions: 4.0.0 > Reporter: Yang Jie > Priority: Major > > [https://github.com/apache/spark/actions/runs/5727123477/job/15518895421] > > {code:java} > - SPARK-15236: use Hive catalog *** FAILED *** > isContain was true Interpreter output contained 'Exception': > Welcome to
[jira] [Resolved] (SPARK-44607) Remove unused function `containsNestedColumn` from `Filter`
[ https://issues.apache.org/jira/browse/SPARK-44607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44607. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42239 [https://github.com/apache/spark/pull/42239] > Remove unused function `containsNestedColumn` from `Filter` > --- > > Key: SPARK-44607 > URL: https://issues.apache.org/jira/browse/SPARK-44607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44607) Remove unused function `containsNestedColumn` from `Filter`
[ https://issues.apache.org/jira/browse/SPARK-44607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44607: Assignee: Yang Jie > Remove unused function `containsNestedColumn` from `Filter` > --- > > Key: SPARK-44607 > URL: https://issues.apache.org/jira/browse/SPARK-44607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44630) Revert SPARK-43043 Improve the performance of MapOutputTracker.updateMapOutput
Dongjoon Hyun created SPARK-44630: - Summary: Revert SPARK-43043 Improve the performance of MapOutputTracker.updateMapOutput Key: SPARK-44630 URL: https://issues.apache.org/jira/browse/SPARK-44630 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.1 Reporter: Dongjoon Hyun
[jira] [Created] (SPARK-44629) Publish PySpark Test Guidelines webpage
Amanda Liu created SPARK-44629: -- Summary: Publish PySpark Test Guidelines webpage Key: SPARK-44629 URL: https://issues.apache.org/jira/browse/SPARK-44629 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu
[jira] [Updated] (SPARK-43241) MultiIndex.append not checking names for equality
[ https://issues.apache.org/jira/browse/SPARK-43241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43241: Affects Version/s: 4.0.0 (was: 3.5.0) > MultiIndex.append not checking names for equality > - > > Key: SPARK-43241 > URL: https://issues.apache.org/jira/browse/SPARK-43241 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > To match the behavior with pandas: > https://github.com/pandas-dev/pandas/pull/48288 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
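[Editor's note] The pandas change SPARK-43241 tracks can be sketched with plain pandas (an illustration of the upstream behavior, not the pandas-on-Spark code path; the values here are made up):

```python
import pandas as pd

# Two MultiIndexes whose level names disagree.
mi1 = pd.MultiIndex.from_tuples([(1, "a")], names=["x", "y"])
mi2 = pd.MultiIndex.from_tuples([(2, "b")], names=["p", "q"])

# The tuples are concatenated in every pandas version; since the
# referenced pandas PR (48288), mismatched names are additionally
# reset to None instead of silently keeping mi1's names.
out = mi1.append(mi2)
```

The pandas-on-Spark fix is about matching that name-checking behavior.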
[jira] [Updated] (SPARK-42621) Add `inclusive` parameter for date_range
[ https://issues.apache.org/jira/browse/SPARK-42621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42621: Affects Version/s: 4.0.0 (was: 3.5.0) > Add `inclusive` parameter for date_range > > > Key: SPARK-42621 > URL: https://issues.apache.org/jira/browse/SPARK-42621 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > See https://github.com/pandas-dev/pandas/issues/40245 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
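[Editor's note] A minimal sketch of the `inclusive` parameter being ported (pure pandas, available there since 1.4; the dates are arbitrary):

```python
import pandas as pd

# inclusive controls which endpoints of the range are kept:
# "both" (default), "left", "right", or "neither".
left_closed = pd.date_range("2023-01-01", "2023-01-04", inclusive="left")
both_closed = pd.date_range("2023-01-01", "2023-01-04", inclusive="both")
```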
[jira] [Updated] (SPARK-42620) Add `inclusive` parameter for (DataFrame|Series).between_time
[ https://issues.apache.org/jira/browse/SPARK-42620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42620: Affects Version/s: 4.0.0 (was: 3.5.0) > Add `inclusive` parameter for (DataFrame|Series).between_time > - > > Key: SPARK-42620 > URL: https://issues.apache.org/jira/browse/SPARK-42620 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > See https://github.com/pandas-dev/pandas/pull/43248 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
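[Editor's note] The `between_time` parameter in question, shown with plain pandas (pandas >= 1.4, where `inclusive` replaced `include_start`/`include_end`; the timestamps are made up):

```python
import pandas as pd

idx = pd.DatetimeIndex(["2023-01-01 09:00", "2023-01-01 10:00",
                        "2023-01-01 11:00", "2023-01-01 12:00"])
df = pd.DataFrame({"v": range(4)}, index=idx)

# Keep the start boundary, drop the end boundary.
subset = df.between_time("09:00", "11:00", inclusive="left")
```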
[jira] [Updated] (SPARK-43194) PySpark 3.4.0 cannot convert timestamp-typed objects to pandas with pandas 2.0
[ https://issues.apache.org/jira/browse/SPARK-43194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43194: Affects Version/s: 4.0.0 (was: 3.4.0) > PySpark 3.4.0 cannot convert timestamp-typed objects to pandas with pandas 2.0 > -- > > Key: SPARK-43194 > URL: https://issues.apache.org/jira/browse/SPARK-43194 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 > Environment: {code} > In [4]: import pandas as pd > In [5]: pd.__version__ > Out[5]: '2.0.0' > In [6]: import pyspark as ps > In [7]: ps.__version__ > Out[7]: '3.4.0' > {code} >Reporter: Phillip Cloud >Priority: Major > > {code} > In [1]: from pyspark.sql import SparkSession > In [2]: session = SparkSession.builder.appName("test").getOrCreate() > 23/04/19 09:21:42 WARN Utils: Your hostname, albatross resolves to a loopback > address: 127.0.0.2; using 192.168.1.170 instead (on interface enp5s0) > 23/04/19 09:21:42 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 23/04/19 09:21:42 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > In [3]: session.sql("select now()").toPandas() > {code} > Results in: > {code} > ... > TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass > e.g. 'datetime64[ns]' instead. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
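[Editor's note] The root cause is a pandas 2.0 change, reproducible without Spark (a minimal sketch; the timestamp is arbitrary):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2023-04-19 09:21:42"]))

# pandas 2.x raises TypeError for the unit-less target dtype that the
# error message complains about; pandas 1.x still accepted it.
try:
    s.astype("datetime64")
    unitless_accepted = True
except TypeError:
    unitless_accepted = False

# Spelling out the unit works on both 1.x and 2.x, which is the
# direction of the fix suggested by the error message itself.
converted = s.astype("datetime64[ns]")
```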
[jira] [Updated] (SPARK-42619) Add `show_counts` parameter for DataFrame.info
[ https://issues.apache.org/jira/browse/SPARK-42619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42619: Affects Version/s: 4.0.0 (was: 3.5.0) > Add `show_counts` parameter for DataFrame.info > -- > > Key: SPARK-42619 > URL: https://issues.apache.org/jira/browse/SPARK-42619 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > See https://github.com/pandas-dev/pandas/pull/37999 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
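[Editor's note] The parameter being ported, shown with plain pandas (pandas >= 1.2, where `show_counts` replaced `null_counts`; the frame is made up):

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, None, 3]})

buf = io.StringIO()
# show_counts=True forces the "non-null" count column, even for frames
# wide enough that pandas would otherwise omit it.
df.info(buf=buf, show_counts=True)
report = buf.getvalue()
```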
[jira] [Updated] (SPARK-42617) Support `isocalendar`
[ https://issues.apache.org/jira/browse/SPARK-42617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42617: Affects Version/s: 4.0.0 (was: 3.5.0) > Support `isocalendar` > - > > Key: SPARK-42617 > URL: https://issues.apache.org/jira/browse/SPARK-42617 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > We should support `isocalendar` to match pandas behavior > (https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.Series.dt.isocalendar.html) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
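[Editor's note] The pandas behavior to match (pure pandas, `Series.dt.isocalendar` exists since 1.1; the date is arbitrary):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2023-01-02"]))  # a Monday, ISO week 1

# Returns a DataFrame with UInt32 columns: year, week, day.
iso = s.dt.isocalendar()
```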
[jira] [Updated] (SPARK-43271) Match behavior with DataFrame.reindex with specifying `index`.
[ https://issues.apache.org/jira/browse/SPARK-43271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43271: Affects Version/s: 4.0.0 (was: 3.5.0) > Match behavior with DataFrame.reindex with specifying `index`. > -- > > Key: SPARK-43271 > URL: https://issues.apache.org/jira/browse/SPARK-43271 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Re-enable pandas 2.0.0 test in DataFrameTests.test_reindex in proper way. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
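[Editor's note] The pandas behavior the test compares against, in miniature (an illustration only, not the pandas-on-Spark code under test):

```python
import pandas as pd

df = pd.DataFrame({"v": [1, 2]}, index=["a", "b"])

# reindex with an explicit `index` keeps matching labels and fills the
# rest with NaN (which also upcasts the column to float).
out = df.reindex(index=["b", "c"])
```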
[jira] [Updated] (SPARK-43451) Enable RollingTests.test_rolling_count for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43451: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable RollingTests.test_rolling_count for pandas 2.0.0. > > > Key: SPARK-43451 > URL: https://issues.apache.org/jira/browse/SPARK-43451 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable RollingTests.test_rolling_count for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43282) Investigate DataFrame.sort_values with pandas behavior.
[ https://issues.apache.org/jira/browse/SPARK-43282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43282: Affects Version/s: 4.0.0 (was: 3.5.0) > Investigate DataFrame.sort_values with pandas behavior. > --- > > Key: SPARK-43282 > URL: https://issues.apache.org/jira/browse/SPARK-43282 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > {code:java} > import pandas as pd > pdf = pd.DataFrame( > { > "a": pd.Categorical([1, 2, 3, 1, 2, 3]), > "b": pd.Categorical( > ["b", "a", "c", "c", "b", "a"], categories=["c", "b", "d", "a"] > ), > }, > ) > pdf.groupby("a").apply(lambda x: x).sort_values(["a"]) > Traceback (most recent call last): > ... > ValueError: 'a' is both an index level and a column label, which is > ambiguous. {code} > We should investigate this issue whether this is intended behavior or just > bug in pandas. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
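[Editor's note] The ambiguity in the traceback can be triggered directly, without groupby (a minimal sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})
df.index = pd.Index([0, 1, 2], name="a")  # index level shares the column name

# sort_values refuses a key that is both an index level and a column.
try:
    df.sort_values("a")
    ambiguous = False
except ValueError:
    ambiguous = True

# Dropping (or renaming) the index removes the ambiguity.
resolved = df.reset_index(drop=True).sort_values("a")
```

This suggests the groupby/apply result in the report simply ends up with an index level named after the grouping column, which is what needs investigating.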
[jira] [Updated] (SPARK-43245) Fix DatetimeIndex.microsecond to return 'int32' instead of 'int64' type of Index.
[ https://issues.apache.org/jira/browse/SPARK-43245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43245: Affects Version/s: 4.0.0 (was: 3.5.0) > Fix DatetimeIndex.microsecond to return 'int32' instead of 'int64' type of > Index. > - > > Key: SPARK-43245 > URL: https://issues.apache.org/jira/browse/SPARK-43245 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#index-can-now-hold-numpy-numeric-dtypes -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
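[Editor's note] The attribute in question, in plain pandas (the timestamp is arbitrary):

```python
import pandas as pd

idx = pd.DatetimeIndex(["2023-01-01 00:00:00.000123"])

# An Index of integers; pandas 2.x returns it as int32 (indexes can now
# hold any NumPy numeric dtype), pandas 1.x returned int64.
micros = idx.microsecond
```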
[jira] [Updated] (SPARK-43433) Match `GroupBy.nth` behavior with new pandas behavior
[ https://issues.apache.org/jira/browse/SPARK-43433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43433: Affects Version/s: 4.0.0 (was: 3.5.0) > Match `GroupBy.nth` behavior with new pandas behavior > - > > Key: SPARK-43433 > URL: https://issues.apache.org/jira/browse/SPARK-43433 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Match behavior with > https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#dataframegroupby-nth-and-seriesgroupby-nth-now-behave-as-filtrations -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
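[Editor's note] The pandas 2.0 change to match, sketched with plain pandas (made-up data):

```python
import pandas as pd

df = pd.DataFrame({"g": [1, 1, 2], "v": [10, 20, 30]})

# pandas 2.x: nth is a filtration -- it returns the selected rows with
# their original index and all columns. pandas 1.x instead returned a
# frame indexed by the group key.
first_rows = df.groupby("g").nth(0)
```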
[jira] [Updated] (SPARK-43432) Fix `min_periods` for Rolling to work same as pandas
[ https://issues.apache.org/jira/browse/SPARK-43432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43432: Affects Version/s: 4.0.0 (was: 3.5.0) > Fix `min_periods` for Rolling to work same as pandas > - > > Key: SPARK-43432 > URL: https://issues.apache.org/jira/browse/SPARK-43432 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Fix `min_periods` for Rolling to work same as pandas > https://github.com/pandas-dev/pandas/issues/31302 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
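[Editor's note] What `min_periods` does in pandas, for reference (a minimal sketch with arbitrary values; the fix is about making pandas-on-Spark honor the same semantics):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# min_periods=1: windows with fewer than `window` observations still
# produce a value instead of NaN.
partial = s.rolling(window=2, min_periods=1).sum()

# Default min_periods equals the window size, so the first result is NaN.
full = s.rolling(window=2).sum()
```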
[jira] [Updated] (SPARK-43291) Match behavior for DataFrame.cov on string DataFrame
[ https://issues.apache.org/jira/browse/SPARK-43291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43291: Affects Version/s: 4.0.0 (was: 3.5.0) > Match behavior for DataFrame.cov on string DataFrame > > > Key: SPARK-43291 > URL: https://issues.apache.org/jira/browse/SPARK-43291 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Should enable test below: > {code:java} > pdf = pd.DataFrame([("1", "2"), ("0", "3"), ("2", "0"), ("1", "1")], > columns=["a", "b"]) > psdf = ps.from_pandas(pdf) > self.assert_eq(pdf.cov(), psdf.cov()) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43295) Make DataFrameGroupBy.sum support for string type columns
[ https://issues.apache.org/jira/browse/SPARK-43295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43295: Affects Version/s: 4.0.0 (was: 3.5.0) > Make DataFrameGroupBy.sum support for string type columns > - > > Key: SPARK-43295 > URL: https://issues.apache.org/jira/browse/SPARK-43295 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > From pandas 2.0.0, DataFrameGroupBy.sum also works for string type columns: > {code:java} > >>> psdf > A B C D > 0 1 3.1 a True > 1 2 4.1 b False > 2 1 4.1 b False > 3 2 3.1 a True > >>> psdf.groupby("A").sum().sort_index() > B D > A > 1 7.2 1 > 2 7.2 1 > >>> psdf.to_pandas().groupby("A").sum().sort_index() > B C D > A > 1 7.2 ab 1 > 2 7.2 ba 1 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
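[Editor's note] The string-sum behavior to support, isolated in plain pandas (selecting the column explicitly sidesteps the version-dependent `numeric_only` default; data is made up):

```python
import pandas as pd

pdf = pd.DataFrame({"A": [1, 2, 1, 2], "C": ["a", "b", "b", "a"]})

# Summing an object/string column concatenates values within each group,
# in row order: group 1 -> "a"+"b", group 2 -> "b"+"a".
per_group = pdf.groupby("A")["C"].sum().sort_index()
```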
[jira] [Created] (SPARK-44628) Clear some unused codes in "***Errors" and extract some common logic
BingKun Pan created SPARK-44628: --- Summary: Clear some unused codes in "***Errors" and extract some common logic Key: SPARK-44628 URL: https://issues.apache.org/jira/browse/SPARK-44628 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: BingKun Pan
[jira] [Updated] (SPARK-43460) Enable OpsOnDiffFramesGroupByTests.test_groupby_different_lengths for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43460: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable OpsOnDiffFramesGroupByTests.test_groupby_different_lengths for pandas > 2.0.0. > --- > > Key: SPARK-43460 > URL: https://issues.apache.org/jira/browse/SPARK-43460 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable OpsOnDiffFramesGroupByTests.test_groupby_different_lengths for pandas > 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43453) Enable OpsOnDiffFramesEnabledTests.test_concat_column_axis for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43453: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable OpsOnDiffFramesEnabledTests.test_concat_column_axis for pandas 2.0.0. > > > Key: SPARK-43453 > URL: https://issues.apache.org/jira/browse/SPARK-43453 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable OpsOnDiffFramesEnabledTests.test_concat_column_axis for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43459) Enable OpsOnDiffFramesGroupByTests.test_groupby_multiindex_columns for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43459: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable OpsOnDiffFramesGroupByTests.test_groupby_multiindex_columns for pandas > 2.0.0. > > > Key: SPARK-43459 > URL: https://issues.apache.org/jira/browse/SPARK-43459 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable OpsOnDiffFramesGroupByTests.test_groupby_multiindex_columns for pandas > 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43476) Enable SeriesStringTests.test_string_replace for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43476: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable SeriesStringTests.test_string_replace for pandas 2.0.0. > -- > > Key: SPARK-43476 > URL: https://issues.apache.org/jira/browse/SPARK-43476 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable SeriesStringTests.test_string_replace for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43458) Enable SeriesConversionTests.test_to_latex for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43458: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable SeriesConversionTests.test_to_latex for pandas 2.0.0. > > > Key: SPARK-43458 > URL: https://issues.apache.org/jira/browse/SPARK-43458 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable SeriesConversionTests.test_to_latex for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43452) Enable RollingTests.test_groupby_rolling_count for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43452: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable RollingTests.test_groupby_rolling_count for pandas 2.0.0. > > > Key: SPARK-43452 > URL: https://issues.apache.org/jira/browse/SPARK-43452 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable RollingTests.test_groupby_rolling_count for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43462) Enable SeriesDateTimeTests.test_date_subtraction for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43462: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable SeriesDateTimeTests.test_date_subtraction for pandas 2.0.0. > -- > > Key: SPARK-43462 > URL: https://issues.apache.org/jira/browse/SPARK-43462 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable SeriesDateTimeTests.test_date_subtraction for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43477) Enable SeriesStringTests.test_string_rsplit for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43477: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable SeriesStringTests.test_string_rsplit for pandas 2.0.0. > - > > Key: SPARK-43477 > URL: https://issues.apache.org/jira/browse/SPARK-43477 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable SeriesStringTests.test_string_rsplit for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43497) Enable StatsTests.test_cov_corr_meta for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43497: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable StatsTests.test_cov_corr_meta for pandas 2.0.0. > -- > > Key: SPARK-43497 > URL: https://issues.apache.org/jira/browse/SPARK-43497 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable StatsTests.test_cov_corr_meta for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43498) Enable StatsTests.test_axis_on_dataframe for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43498: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable StatsTests.test_axis_on_dataframe for pandas 2.0.0. > -- > > Key: SPARK-43498 > URL: https://issues.apache.org/jira/browse/SPARK-43498 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable StatsTests.test_axis_on_dataframe for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43478) Enable SeriesStringTests.test_string_split for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43478: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable SeriesStringTests.test_string_split for pandas 2.0.0. > > > Key: SPARK-43478 > URL: https://issues.apache.org/jira/browse/SPARK-43478 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable SeriesStringTests.test_string_split for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43506) Enable ArrowTests.test_toPandas_empty_columns for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43506: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable ArrowTests.test_toPandas_empty_columns for pandas 2.0.0. > --- > > Key: SPARK-43506 > URL: https://issues.apache.org/jira/browse/SPARK-43506 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable ArrowTests.test_toPandas_empty_columns for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43499) Enable StatsTests.test_stat_functions_with_no_numeric_columns for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43499: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable StatsTests.test_stat_functions_with_no_numeric_columns for pandas > 2.0.0. > --- > > Key: SPARK-43499 > URL: https://issues.apache.org/jira/browse/SPARK-43499 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable StatsTests.test_stat_functions_with_no_numeric_columns for pandas > 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43561) Enable DataFrameConversionTests.test_to_latex for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43561: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable DataFrameConversionTests.test_to_latex for pandas 2.0.0. > --- > > Key: SPARK-43561 > URL: https://issues.apache.org/jira/browse/SPARK-43561 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable DataFrameConversionTests.test_to_latex for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43562) Enable DataFrameTests.test_append for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43562: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable DataFrameTests.test_append for pandas 2.0.0. > --- > > Key: SPARK-43562 > URL: https://issues.apache.org/jira/browse/SPARK-43562 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable DataFrameTests.test_append for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43533) Enable MultiIndex test for IndexesTests.test_difference
[ https://issues.apache.org/jira/browse/SPARK-43533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43533: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable MultiIndex test for IndexesTests.test_difference > --- > > Key: SPARK-43533 > URL: https://issues.apache.org/jira/browse/SPARK-43533 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable MultiIndex test for IndexesTests.test_difference -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43563) Enable CsvTests.test_read_csv_with_squeeze for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43563: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable CsvTests.test_read_csv_with_squeeze for pandas 2.0.0. > > > Key: SPARK-43563 > URL: https://issues.apache.org/jira/browse/SPARK-43563 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable CsvTests.test_read_csv_with_squeeze for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43570) Enable DateOpsTests.test_rsub for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43570: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable DateOpsTests.test_rsub for pandas 2.0.0. > --- > > Key: SPARK-43570 > URL: https://issues.apache.org/jira/browse/SPARK-43570 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable DateOpsTests.test_rsub for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43608) Enable IndexesTests.test_union for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43608: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable IndexesTests.test_union for pandas 2.0.0. > > > Key: SPARK-43608 > URL: https://issues.apache.org/jira/browse/SPARK-43608 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable IndexesTests.test_union for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43705) Enable TimedeltaIndexTests.test_properties for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43705: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable TimedeltaIndexTests.test_properties for pandas 2.0.0. > > > Key: SPARK-43705 > URL: https://issues.apache.org/jira/browse/SPARK-43705 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable TimedeltaIndexTests.test_properties for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43644) Enable DatetimeIndexTests.test_indexer_between_time for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43644: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable DatetimeIndexTests.test_indexer_between_time for pandas 2.0.0. > - > > Key: SPARK-43644 > URL: https://issues.apache.org/jira/browse/SPARK-43644 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable DatetimeIndexTests.test_indexer_between_time for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43606) Enable IndexesTests.test_index_basic for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43606: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable IndexesTests.test_index_basic for pandas 2.0.0. > -- > > Key: SPARK-43606 > URL: https://issues.apache.org/jira/browse/SPARK-43606 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable IndexesTests.test_index_basic for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43567) Enable CategoricalIndexTests.test_factorize for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43567: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable CategoricalIndexTests.test_factorize for pandas 2.0.0. > - > > Key: SPARK-43567 > URL: https://issues.apache.org/jira/browse/SPARK-43567 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable CategoricalIndexTests.test_factorize for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43571) Enable DateOpsTests.test_sub for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43571: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable DateOpsTests.test_sub for pandas 2.0.0. > -- > > Key: SPARK-43571 > URL: https://issues.apache.org/jira/browse/SPARK-43571 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable DateOpsTests.test_sub for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43607) Enable IndexesTests.test_intersection for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43607: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable IndexesTests.test_intersection for pandas 2.0.0. > --- > > Key: SPARK-43607 > URL: https://issues.apache.org/jira/browse/SPARK-43607 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable IndexesTests.test_intersection for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43568) Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43568: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0. > - > > Key: SPARK-43568 > URL: https://issues.apache.org/jira/browse/SPARK-43568 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable CategoricalIndexTests.test_categories_setter for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43633) Enable CategoricalIndexTests.test_remove_categories for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43633: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable CategoricalIndexTests.test_remove_categories for pandas 2.0.0. > - > > Key: SPARK-43633 > URL: https://issues.apache.org/jira/browse/SPARK-43633 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > Enable CategoricalIndexTests.test_remove_categories for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43811) Enable DataFrameTests.test_reindex for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43811: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable DataFrameTests.test_reindex for pandas 2.0.0. > > > Key: SPARK-43811 > URL: https://issues.apache.org/jira/browse/SPARK-43811 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43869) Enable GroupBySlowTests for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43869: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable GroupBySlowTests for pandas 2.0.0. > - > > Key: SPARK-43869 > URL: https://issues.apache.org/jira/browse/SPARK-43869 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > test list: > * test_value_counts > * test_split_apply_combine_on_series -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43709) Enable NamespaceTests.test_date_range for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43709: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable NamespaceTests.test_date_range for pandas 2.0.0. > --- > > Key: SPARK-43709 > URL: https://issues.apache.org/jira/browse/SPARK-43709 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43812) Enable DataFrameTests.test_all for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43812: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable DataFrameTests.test_all for pandas 2.0.0. > > > Key: SPARK-43812 > URL: https://issues.apache.org/jira/browse/SPARK-43812 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43871) Enable SeriesDateTimeTests for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43871: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable SeriesDateTimeTests for pandas 2.0.0. > > > Key: SPARK-43871 > URL: https://issues.apache.org/jira/browse/SPARK-43871 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > test list: > * test_day > * test_dayofweek > * test_dayofyear > * test_days_in_month > * test_daysinmonth > * test_hour > * test_microsecond > * test_minute > * test_month > * test_quarter > * test_second > * test_wrrkday > * test_year -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43872) Enable DataFramePlotMatplotlibTests for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43872: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable DataFramePlotMatplotlibTests for pandas 2.0.0. > - > > Key: SPARK-43872 > URL: https://issues.apache.org/jira/browse/SPARK-43872 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > test list: > * test_area_plot > * test_area_plot_stacked_false > * test_area_plot_y > * test_bar_plot > * test_bar_with_x_y > * test_barh_plot_with_x_y > * test_barh_plot > * test_line_plot > * test_pie_plot > * test_scatter_plot > * test_hist_plot > * test_kde_plot -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43873) Enable DataFrameSlowTests for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43873: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable DataFrameSlowTests for pandas 2.0.0. > --- > > Key: SPARK-43873 > URL: https://issues.apache.org/jira/browse/SPARK-43873 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > test list: > * test_describe > * test_between_time > * test_product > * test_iteritems > * test_mad > * test_cov > * test_quantile -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43870) Enable SeriesTests for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43870: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable SeriesTests for pandas 2.0.0. > > > Key: SPARK-43870 > URL: https://issues.apache.org/jira/browse/SPARK-43870 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > test list: > * test_value_counts > * test_append > * test_astype > * test_between > * test_mad > * test_quantile > * test_rank > * test_between_time > * test_iteritems > * test_product > * test_factorize -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
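Several entries in this list are pandas 2.0.0 API removals rather than behavior changes; for example, `Series.iteritems` and `DataFrame.iteritems` were removed in 2.0.0 with `items()` as the drop-in replacement (and `mad` was removed outright). A minimal sketch of the `iteritems` migration:

```python
import pandas as pd

# pandas 2.0.0 removed Series.iteritems/DataFrame.iteritems; items() is a
# drop-in replacement that already existed on pandas 1.x.
s = pd.Series([10, 20], index=["x", "y"])
pairs = list(s.items())

print(pairs)  # [('x', 10), ('y', 20)]
```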
[jira] [Updated] (SPARK-43874) Enable GroupByTests for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43874: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable GroupByTests for pandas 2.0.0. > - > > Key: SPARK-43874 > URL: https://issues.apache.org/jira/browse/SPARK-43874 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > test list: > * test_prod > * test_nth > * test_mad > * test_basic_stat_funcs > * test_groupby_multiindex_columns > * test_apply_without_shortcut > * test_mean > * test_apply -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43875) Enable CategoricalTests for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43875: Affects Version/s: 4.0.0 (was: 3.5.0) > Enable CategoricalTests for pandas 2.0.0. > - > > Key: SPARK-43875 > URL: https://issues.apache.org/jira/browse/SPARK-43875 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > > test list: > * test_factorize > * test_as_ordered_unordered > * test_categories_setter > * test_remove_categories > * test_groupby_apply_without_shortcut -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44624) Spark Connect reattachable Execute when initial ExecutePlan didn't reach server
[ https://issues.apache.org/jira/browse/SPARK-44624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliusz Sompolski updated SPARK-44624: -- Epic Link: SPARK-43754 > Spark Connect reattachable Execute when initial ExecutePlan didn't reach > server > --- > > Key: SPARK-44624 > URL: https://issues.apache.org/jira/browse/SPARK-44624 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0, 4.0.0 >Reporter: Juliusz Sompolski >Priority: Major > > If the ExecutePlan never reached the server, a ReattachExecute will fail with > INVALID_HANDLE.OPERATION_NOT_FOUND. In that case, we could try to send > ExecutePlan again. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44624) Spark Connect reattachable Execute when initial ExecutePlan didn't reach server
[ https://issues.apache.org/jira/browse/SPARK-44624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliusz Sompolski updated SPARK-44624: -- Description: If the ExecutePlan never reached the server, a ReattachExecute will fail with INVALID_HANDLE.OPERATION_NOT_FOUND. In that case, we could try to send ExecutePlan again. (was: Even though we empirically observed that the error is thrown only from the first next() or hasNext() of the response StreamObserver, wrap the initial call in retries as well to not depend on that, in case it's just a quirk that's not fully dependable.) > Spark Connect reattachable Execute when initial ExecutePlan didn't reach > server > --- > > Key: SPARK-44624 > URL: https://issues.apache.org/jira/browse/SPARK-44624 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0, 4.0.0 >Reporter: Juliusz Sompolski >Priority: Major > > If the ExecutePlan never reached the server, a ReattachExecute will fail with > INVALID_HANDLE.OPERATION_NOT_FOUND. In that case, we could try to send > ExecutePlan again. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
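The proposed client behavior can be sketched as follows; this is an illustrative Python stand-in, not the actual Spark Connect client code, and the exception and method names here are hypothetical:

```python
class OperationNotFound(Exception):
    """Hypothetical stand-in for the INVALID_HANDLE.OPERATION_NOT_FOUND error."""

def reattach_or_reexecute(client, plan, operation_id):
    """Reattach to a running execution; if the server never saw the
    original ExecutePlan, resend it instead of failing the query."""
    try:
        return client.reattach_execute(operation_id)
    except OperationNotFound:
        # The initial ExecutePlan never reached the server, so there is
        # nothing to reattach to; it is safe to start the execution again.
        return client.execute_plan(plan, operation_id)
```

With a client stub whose reattach raises `OperationNotFound`, the helper transparently falls back to resending the plan.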
[jira] [Updated] (SPARK-44624) Spark Connect reattachable Execute when initial ExecutePlan didn't reach server
[ https://issues.apache.org/jira/browse/SPARK-44624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliusz Sompolski updated SPARK-44624: -- Summary: Spark Connect reattachable Execute when initial ExecutePlan didn't reach server (was: Wrap retries around initial streaming GRPC call in connect) > Spark Connect reattachable Execute when initial ExecutePlan didn't reach > server > --- > > Key: SPARK-44624 > URL: https://issues.apache.org/jira/browse/SPARK-44624 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0, 4.0.0 >Reporter: Juliusz Sompolski >Priority: Major > > Even though we empirically observed that the error is thrown only from the first > next() or hasNext() of the response StreamObserver, wrap the initial call in > retries as well to not depend on that, in case it's just a quirk that's not > fully dependable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44627) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows produces wrong data
Min Zhao created SPARK-44627: Summary: org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#resultSetToRows produces wrong data Key: SPARK-44627 URL: https://issues.apache.org/jira/browse/SPARK-44627 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.1, 2.3.2 Reporter: Min Zhao When the ResultSet contains a timestamp column whose value is null, the row it generates reuses the value of the same column from the previous row. Example: values in the ResultSet: 1, 2023-01-01 12:00:00 2, null values in the produced rows: 1, 2023-01-01 12:00:00 2, 2023-01-01 12:00:00 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
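The reported failure mode is the classic stale-row-buffer pattern: a mutable row is reused across iterations and a null column is skipped rather than cleared. The sketch below is an illustrative Python stand-in for the Scala code in `JdbcUtils#resultSetToRows`, not the actual implementation:

```python
def rows_buggy(result_set):
    """Reuses a mutable row buffer but only assigns non-null timestamps,
    so a null leaks the previous row's value (the reported symptom)."""
    row = [None, None]
    out = []
    for rec_id, ts in result_set:
        row[0] = rec_id
        if ts is not None:  # bug: a null leaves the stale value in place
            row[1] = ts
        out.append(tuple(row))
    return out

def rows_fixed(result_set):
    """Always assigns the column, so a null really stays null."""
    row = [None, None]
    out = []
    for rec_id, ts in result_set:
        row[0], row[1] = rec_id, ts
        out.append(tuple(row))
    return out

data = [(1, "2023-01-01 12:00:00"), (2, None)]
print(rows_buggy(data))  # second row wrongly repeats the first timestamp
print(rows_fixed(data))  # second row correctly carries None
```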
[jira] [Resolved] (SPARK-42941) Add support for streaming listener in Python
[ https://issues.apache.org/jira/browse/SPARK-42941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42941. -- Resolution: Fixed Issue resolved by pull request 42250 [https://github.com/apache/spark/pull/42250] > Add support for streaming listener in Python > > > Key: SPARK-42941 > URL: https://issues.apache.org/jira/browse/SPARK-42941 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Assignee: Wei Liu >Priority: Major > Fix For: 3.5.0 > > > Add support of streaming listener in Python. > This likely requires a design doc to hash out the details. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42730) Update Spark Standalone Mode - Starting a Cluster Manually
[ https://issues.apache.org/jira/browse/SPARK-42730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750044#comment-17750044 ] Hyukjin Kwon commented on SPARK-42730: -- Please go ahead. Refs: [https://spark.apache.org/contributing.html] , [https://spark.apache.org/developer-tools.html] > Update Spark Standalone Mode - Starting a Cluster Manually > -- > > Key: SPARK-42730 > URL: https://issues.apache.org/jira/browse/SPARK-42730 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://spark.apache.org/docs/latest/spark-standalone.html > Add start-connect-server.sh to this list and cover Spark Connect sessions - > other changes needed here. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44218) Customize diff log in assertDataFrameEqual error message format
[ https://issues.apache.org/jira/browse/SPARK-44218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44218: Assignee: Amanda Liu > Customize diff log in assertDataFrameEqual error message format > --- > > Key: SPARK-44218 > URL: https://issues.apache.org/jira/browse/SPARK-44218 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Assignee: Amanda Liu >Priority: Major > > SPIP: > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44218) Customize diff log in assertDataFrameEqual error message format
[ https://issues.apache.org/jira/browse/SPARK-44218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44218. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42196 [https://github.com/apache/spark/pull/42196] > Customize diff log in assertDataFrameEqual error message format > --- > > Key: SPARK-44218 > URL: https://issues.apache.org/jira/browse/SPARK-44218 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Assignee: Amanda Liu >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > SPIP: > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44626) Followup on streaming query termination when client session is timed out for Spark Connect
Bo Gao created SPARK-44626: -- Summary: Followup on streaming query termination when client session is timed out for Spark Connect Key: SPARK-44626 URL: https://issues.apache.org/jira/browse/SPARK-44626 Project: Spark Issue Type: Task Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Bo Gao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44588) Migrated shuffle blocks are encrypted multiple times when io.encryption is enabled
[ https://issues.apache.org/jira/browse/SPARK-44588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44588: - Assignee: Henry Mai > Migrated shuffle blocks are encrypted multiple times when io.encryption is > enabled > --- > > Key: SPARK-44588 > URL: https://issues.apache.org/jira/browse/SPARK-44588 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, > 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1 >Reporter: Henry Mai >Assignee: Henry Mai >Priority: Critical > Fix For: 3.5.0 > > > Shuffle blocks upon migration are wrapped for encryption again when being > written out to a file on the receiver side. > > Pull request to fix this: https://github.com/apache/spark/pull/42214 > > Details: > Sender/Read side: > BlockManagerDecommissioner:run() > blocks = bm.migratableResolver.getMigrationBlocks() > *dataFile = IndexShuffleBlockResolver:getDataFile(...)* > buffer = FileSegmentManagedBuffer(..., dataFile) > *^ This reads straight from disk without decryption* > blocks.foreach((blockId, buffer) => > bm.blockTransferService.uploadBlockSync(..., buffer, ...)) > -> uploadBlockSync() -> uploadBlock(..., buffer, ...) > -> client.uploadStream(UploadBlockStream, buffer, ...) > - Notice that there is no decryption here on the sender/read side. > Receiver/Write side: > NettyBlockRpcServer:receiveStream() <--- This is the UploadBlockStream handler > putBlockDataAsStream() > migratableResolver.putShuffleBlockAsStream() > *-> file = IndexShuffleBlockResolver:getDataFile(...)* > -> tmpFile = (file + . 
extension) > *-> Creates an encrypting writable channel to a tmpFile using > serializerManager.wrapStream()* > -> onData() writes the data into the channel > -> onComplete() renames the tmpFile to the file > - Notice: > * Both getMigrationBlocks()[read] and putShuffleBlockAsStream()[write] > target IndexShuffleBlockResolver:getDataFile() > * The read path does not decrypt but the write path encrypts. > * As a thought exercise: if this cycle happens more than once (where this > receiver is now a sender) even if we assume that the shuffle blocks are > initially unencrypted*, then bytes in the file will just have more and more > layers of encryption applied to it each time it gets migrated. > * *In practice, the shuffle blocks are encrypted on disk to begin with, this > is just a thought exercise -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
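The layering effect described in the thought exercise can be shown with a toy cipher (a byte-shift stand-in for Spark's io.encryption, purely illustrative):

```python
def encrypt(data: bytes, key: int = 7) -> bytes:
    # Toy stand-in for the encrypting channel: shift every byte by the key.
    return bytes((b + key) % 256 for b in data)

def decrypt(data: bytes, key: int = 7) -> bytes:
    return bytes((b - key) % 256 for b in data)

plaintext = b"shuffle block"
on_disk = encrypt(plaintext)   # blocks are already encrypted at rest
migrated = encrypt(on_disk)    # receiver wraps again without decrypting

# After one migration, a single decryption no longer recovers the data;
# each further migration would add yet another layer.
assert decrypt(on_disk) == plaintext
assert decrypt(migrated) != plaintext
assert decrypt(decrypt(migrated)) == plaintext
```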
[jira] [Resolved] (SPARK-44588) Migrated shuffle blocks are encrypted multiple times when io.encryption is enabled
[ https://issues.apache.org/jira/browse/SPARK-44588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44588. --- Fix Version/s: 3.5.0 Resolution: Fixed > Migrated shuffle blocks are encrypted multiple times when io.encryption is > enabled > --- > > Key: SPARK-44588 > URL: https://issues.apache.org/jira/browse/SPARK-44588 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, > 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1 >Reporter: Henry Mai >Priority: Critical > Fix For: 3.5.0 > > > Shuffle blocks upon migration are wrapped for encryption again when being > written out to a file on the receiver side. > > Pull request to fix this: https://github.com/apache/spark/pull/42214 > > Details: > Sender/Read side: > BlockManagerDecommissioner:run() > blocks = bm.migratableResolver.getMigrationBlocks() > *dataFile = IndexShuffleBlockResolver:getDataFile(...)* > buffer = FileSegmentManagedBuffer(..., dataFile) > *^ This reads straight from disk without decryption* > blocks.foreach((blockId, buffer) => > bm.blockTransferService.uploadBlockSync(..., buffer, ...)) > -> uploadBlockSync() -> uploadBlock(..., buffer, ...) > -> client.uploadStream(UploadBlockStream, buffer, ...) > - Notice that there is no decryption here on the sender/read side. > Receiver/Write side: > NettyBlockRpcServer:receiveStream() <--- This is the UploadBlockStream handler > putBlockDataAsStream() > migratableResolver.putShuffleBlockAsStream() > *-> file = IndexShuffleBlockResolver:getDataFile(...)* > -> tmpFile = (file + . 
extension) > *-> Creates an encrypting writable channel to a tmpFile using > serializerManager.wrapStream()* > -> onData() writes the data into the channel > -> onComplete() renames the tmpFile to the file > - Notice: > * Both getMigrationBlocks()[read] and putShuffleBlockAsStream()[write] > target IndexShuffleBlockResolver:getDataFile() > * The read path does not decrypt but the write path encrypts. > * As a thought exercise: if this cycle happens more than once (where this > receiver is now a sender) even if we assume that the shuffle blocks are > initially unencrypted*, then bytes in the file will just have more and more > layers of encryption applied to it each time it gets migrated. > * *In practice, the shuffle blocks are encrypted on disk to begin with, this > is just a thought exercise -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44563) Upgrade Apache Arrow to 13.0.0
[ https://issues.apache.org/jira/browse/SPARK-44563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44563. --- Resolution: Duplicate > Upgrade Apache Arrow to 13.0.0 > -- > > Key: SPARK-44563 > URL: https://issues.apache.org/jira/browse/SPARK-44563 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-44563) Upgrade Apache Arrow to 13.0.0
[ https://issues.apache.org/jira/browse/SPARK-44563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-44563. - > Upgrade Apache Arrow to 13.0.0 > -- > > Key: SPARK-44563 > URL: https://issues.apache.org/jira/browse/SPARK-44563 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44625) Spark Connect clean up abandoned executions
Juliusz Sompolski created SPARK-44625: - Summary: Spark Connect clean up abandoned executions Key: SPARK-44625 URL: https://issues.apache.org/jira/browse/SPARK-44625 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0, 4.0.0 Reporter: Juliusz Sompolski With reattachable executions, some executions might get orphaned when ReattachExecute and ReleaseExecute never come. Add a mechanism to track such executions and clean them up. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44601) Make `hive-thriftserver` module daily test pass
[ https://issues.apache.org/jira/browse/SPARK-44601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44601: Assignee: Yang Jie > Make `hive-thriftserver` module daily test pass > --- > > Key: SPARK-44601 > URL: https://issues.apache.org/jira/browse/SPARK-44601 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > [https://github.com/LuciferYang/spark/actions/runs/5694334367/job/15435297305] > > {code:java} > *** RUN ABORTED *** > 20159 java.lang.NoClassDefFoundError: > org/codehaus/jackson/map/type/TypeFactory > 20160 at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > 20161 at java.lang.Class.forName0(Native Method) > 20162 at java.lang.Class.forName(Class.java:348) > 20163 at > org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClassInternal(GenericUDFBridge.java:142) > 20164 at > org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClass(GenericUDFBridge.java:132) > 20165 at > org.apache.hadoop.hive.ql.exec.FunctionInfo.getFunctionClass(FunctionInfo.java:151) > 20166 at > org.apache.hadoop.hive.ql.exec.Registry.addFunction(Registry.java:519) > 20167 at > org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:163) > 20168 at > org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:154) > 20169 at > org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:147) > 20170 ... 
> 20171 Cause: java.lang.ClassNotFoundException: > org.codehaus.jackson.map.type.TypeFactory > 20172 at java.net.URLClassLoader.findClass(URLClassLoader.java:387) > 20173 at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > 20174 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > 20175 at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > 20176 at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > 20177 at java.lang.Class.forName0(Native Method) > 20178 at java.lang.Class.forName(Class.java:348) > 20179 at > org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClassInternal(GenericUDFBridge.java:142) > 20180 at > org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClass(GenericUDFBridge.java:132) > 20181 at > org.apache.hadoop.hive.ql.exec.FunctionInfo.getFunctionClass(FunctionInfo.java:151) > 20182 ... {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44601) Make `hive-thriftserver` module daily test pass
[ https://issues.apache.org/jira/browse/SPARK-44601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44601. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42260 [https://github.com/apache/spark/pull/42260] > Make `hive-thriftserver` module daily test pass > --- > > Key: SPARK-44601 > URL: https://issues.apache.org/jira/browse/SPARK-44601 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > [https://github.com/LuciferYang/spark/actions/runs/5694334367/job/15435297305] > > {code:java} > *** RUN ABORTED *** > 20159 java.lang.NoClassDefFoundError: > org/codehaus/jackson/map/type/TypeFactory > 20160 at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > 20161 at java.lang.Class.forName0(Native Method) > 20162 at java.lang.Class.forName(Class.java:348) > 20163 at > org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClassInternal(GenericUDFBridge.java:142) > 20164 at > org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClass(GenericUDFBridge.java:132) > 20165 at > org.apache.hadoop.hive.ql.exec.FunctionInfo.getFunctionClass(FunctionInfo.java:151) > 20166 at > org.apache.hadoop.hive.ql.exec.Registry.addFunction(Registry.java:519) > 20167 at > org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:163) > 20168 at > org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:154) > 20169 at > org.apache.hadoop.hive.ql.exec.Registry.registerUDF(Registry.java:147) > 20170 ... 
> 20171 Cause: java.lang.ClassNotFoundException: > org.codehaus.jackson.map.type.TypeFactory > 20172 at java.net.URLClassLoader.findClass(URLClassLoader.java:387) > 20173 at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > 20174 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > 20175 at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > 20176 at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > 20177 at java.lang.Class.forName0(Native Method) > 20178 at java.lang.Class.forName(Class.java:348) > 20179 at > org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClassInternal(GenericUDFBridge.java:142) > 20180 at > org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.getUdfClass(GenericUDFBridge.java:132) > 20181 at > org.apache.hadoop.hive.ql.exec.FunctionInfo.getFunctionClass(FunctionInfo.java:151) > 20182 ... {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44624) Wrap retries around initial streaming GRPC call in connect
Juliusz Sompolski created SPARK-44624: - Summary: Wrap retries around initial streaming GRPC call in connect Key: SPARK-44624 URL: https://issues.apache.org/jira/browse/SPARK-44624 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0, 4.0.0 Reporter: Juliusz Sompolski Even though we have empirically observed that an error is thrown only from the first next() or hasNext() of the response StreamObserver, wrap the initial call in retries as well, so we do not depend on behavior that may be just a quirk and not fully dependable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
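The change described above is essentially "put the first RPC inside the same retry loop as the response iteration." A minimal sketch of such a wrapper follows; the helper name, attempt count, and backoff are hypothetical, not Spark Connect's actual implementation, which also classifies which gRPC status codes are retryable:

```java
import java.util.concurrent.Callable;

// Hypothetical retry wrapper (names are illustrative, not Spark Connect's):
// the point is that the initial call itself runs inside the loop, so a
// failure before the first next()/hasNext() is also retried.
public class RetrySketch {
    static <T> T withRetries(Callable<T> call, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();           // initial RPC happens inside the loop
            } catch (Exception e) {
                last = e;
                // A real client would check retryability here and use
                // exponential backoff with jitter instead of a fixed sleep.
                Thread.sleep(10L * attempt);
            }
        }
        throw last;                           // attempts exhausted
    }
}
```

Usage: `withRetries(() -> stub.executePlan(request), 5)` would then cover both the call setup and any error surfaced on the first response message.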
[jira] [Assigned] (SPARK-44480) Add option for thread pool to perform maintenance for RocksDB/HDFS State Store Providers
[ https://issues.apache.org/jira/browse/SPARK-44480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-44480: Assignee: Eric Marnadi > Add option for thread pool to perform maintenance for RocksDB/HDFS State > Store Providers > > > Key: SPARK-44480 > URL: https://issues.apache.org/jira/browse/SPARK-44480 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Eric Marnadi >Assignee: Eric Marnadi >Priority: Major > > Maintenance tasks on StateStore were being done by a single background thread, > which is prone to straggling. With this change, the single background thread > instead schedules maintenance tasks onto a thread pool. > Introduce > {{spark.sql.streaming.stateStore.enableStateStoreMaintenanceThreadPool}} > config so that the user can enable a thread pool for maintenance manually. > Introduce {{spark.sql.streaming.stateStore.numStateStoreMaintenanceThreads}} > config so the thread pool size is configurable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44480) Add option for thread pool to perform maintenance for RocksDB/HDFS State Store Providers
[ https://issues.apache.org/jira/browse/SPARK-44480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-44480. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42066 [https://github.com/apache/spark/pull/42066] > Add option for thread pool to perform maintenance for RocksDB/HDFS State > Store Providers > > > Key: SPARK-44480 > URL: https://issues.apache.org/jira/browse/SPARK-44480 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Eric Marnadi >Assignee: Eric Marnadi >Priority: Major > Fix For: 4.0.0 > > > Maintenance tasks on StateStore were being done by a single background thread, > which is prone to straggling. With this change, the single background thread > instead schedules maintenance tasks onto a thread pool. > Introduce > {{spark.sql.streaming.stateStore.enableStateStoreMaintenanceThreadPool}} > config so that the user can enable a thread pool for maintenance manually. > Introduce {{spark.sql.streaming.stateStore.numStateStoreMaintenanceThreads}} > config so the thread pool size is configurable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
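The design in SPARK-44480 can be sketched as a scheduler that submits each provider's maintenance task to a fixed-size pool instead of running them serially on one thread. The class and method names below are illustrative, not Spark's; the point is only that one straggling provider can no longer delay all the others:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// Hypothetical sketch of pooled state-store maintenance (names are not
// Spark's). The pool size corresponds to the idea behind
// spark.sql.streaming.stateStore.numStateStoreMaintenanceThreads.
public class MaintenancePoolSketch {
    private final ExecutorService maintenancePool;

    MaintenancePoolSketch(int numThreads) {
        this.maintenancePool = Executors.newFixedThreadPool(numThreads);
    }

    // Submit one maintenance task per state store provider; tasks run
    // concurrently, so a slow one does not block the rest.
    List<Future<?>> runMaintenance(List<Runnable> providerTasks) {
        return providerTasks.stream()
            .map(maintenancePool::submit)
            .collect(Collectors.toList());
    }

    void shutdown() {
        maintenancePool.shutdown();
    }
}
```

With the feature flag off, Spark's prior behavior (everything on the single background thread) is preserved; the pool is opt-in via the `enableStateStoreMaintenanceThreadPool` config quoted above.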
[jira] [Assigned] (SPARK-44623) Upgrade commons-lang3 to 3.13.0
[ https://issues.apache.org/jira/browse/SPARK-44623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44623: - Assignee: Dongjoon Hyun > Upgrade commons-lang3 to 3.13.0 > --- > > Key: SPARK-44623 > URL: https://issues.apache.org/jira/browse/SPARK-44623 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44623) Upgrade commons-lang3 to 3.13.0
[ https://issues.apache.org/jira/browse/SPARK-44623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44623. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42269 [https://github.com/apache/spark/pull/42269] > Upgrade commons-lang3 to 3.13.0 > --- > > Key: SPARK-44623 > URL: https://issues.apache.org/jira/browse/SPARK-44623 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29497) Cannot assign instance of java.lang.invoke.SerializedLambda to field
[ https://issues.apache.org/jira/browse/SPARK-29497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749991#comment-17749991 ] Herman van Hövell commented on SPARK-29497: --- I have added a check for this to Spark Connect. If someone is brave enough they can do the same thing for other UDFs. > Cannot assign instance of java.lang.invoke.SerializedLambda to field > > > Key: SPARK-29497 > URL: https://issues.apache.org/jira/browse/SPARK-29497 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3, 3.0.1, 3.2.0 > Environment: Spark 2.4.3 Scala 2.12 > Spark 3.2.0 Scala 2.13.5 (Java 11.0.12) >Reporter: Rob Russo >Priority: Major > > Note this is for scala 2.12: > There seems to be an issue in spark with serializing a udf that is created > from a function assigned to a class member that references another function > assigned to a class member. This is similar to > https://issues.apache.org/jira/browse/SPARK-25047 but it looks like the > resolution has an issue with this case. After trimming it down to the base > issue I came up with the following to reproduce: > > > {code:java} > object TestLambdaShell extends Serializable { > val hello: String => String = s => s"hello $s!" 
> val lambdaTest: String => String = hello( _ ) > def functionTest: String => String = hello( _ ) > } > val hello = udf( TestLambdaShell.hello ) > val functionTest = udf( TestLambdaShell.functionTest ) > val lambdaTest = udf( TestLambdaShell.lambdaTest ) > sc.parallelize(Seq("world"),1).toDF("test").select(hello($"test")).show(1) > sc.parallelize(Seq("world"),1).toDF("test").select(functionTest($"test")).show(1) > sc.parallelize(Seq("world"),1).toDF("test").select(lambdaTest($"test")).show(1) > {code} > > All of which works except the last line which results in an exception on the > executors: > > {code:java} > Caused by: java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > $$$82b5b23cea489b2712a1db46c77e458w$TestLambdaShell$.lambdaTest of type > scala.Function1 in instance of > $$$82b5b23cea489b2712a1db46c77e458w$TestLambdaShell$ > at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133) > at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2251) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) > at 
java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.r
[jira] [Resolved] (SPARK-44613) Add Encoders.scala to Spark Connect Scala Client
[ https://issues.apache.org/jira/browse/SPARK-44613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44613. --- Fix Version/s: 3.5.0 Resolution: Fixed > Add Encoders.scala to Spark Connect Scala Client > > > Key: SPARK-44613 > URL: https://issues.apache.org/jira/browse/SPARK-44613 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44616) Hive Generic UDF support no longer supports short-circuiting of argument evaluation
[ https://issues.apache.org/jira/browse/SPARK-44616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated SPARK-44616: --- Description: PR [https://github.com/apache/spark/pull/39555] changed DeferredObject to no longer contain a function, and instead contains a value. This removes the deferred evaluation capability and means that HiveGenericUDF implementations can no longer short-circuit the evaluation of their arguments, which could be a performance issue for some users. Here is a relevant javadoc comment from the Hive source for DeferredObject: {code:java} /** * A Defered Object allows us to do lazy-evaluation and short-circuiting. * GenericUDF use DeferedObject to pass arguments. */ public static interface DeferredObject { {code} was: PR https://github.com/apache/spark/pull/39555 changed DeferredObject to no longer contain a function, and instead contains a value. This removes the deferred evaluation capability and means that HiveGenericUDF implementations can no longer short-circuit the evaluation of their arguments, which could be a performance issue for some users. Here is a relevant javadoc comment from the Hive source for DeferredObject: {{{ /** * A Defered Object allows us to do lazy-evaluation and short-circuiting. * GenericUDF use DeferedObject to pass arguments. */ public static interface DeferredObject { }}} > Hive Generic UDF support no longer supports short-circuiting of argument > evaluation > --- > > Key: SPARK-44616 > URL: https://issues.apache.org/jira/browse/SPARK-44616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Andy Grove >Priority: Major > > PR [https://github.com/apache/spark/pull/39555] changed DeferredObject to no > longer contain a function, and instead contains a value. 
This removes the > deferred evaluation capability and means that HiveGenericUDF implementations > can no longer short-circuit the evaluation of their arguments, which could be > a performance issue for some users. > Here is a relevant javadoc comment from the Hive source for DeferredObject: > {code:java} > /** >* A Defered Object allows us to do lazy-evaluation and short-circuiting. >* GenericUDF use DeferedObject to pass arguments. >*/ > public static interface DeferredObject { > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
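The short-circuiting that SPARK-44616 says was lost can be illustrated with a small sketch. These are not Hive's actual classes; the `Deferred` interface and `nvl` function below are invented stand-ins showing why a deferred argument (wrapping a function) differs from an eager one (holding a value): with deferral, an argument the UDF never asks for is never evaluated.

```java
import java.util.function.Supplier;

// Illustrative contrast (not Hive's real DeferredObject): a deferred argument
// wraps a Supplier, so evaluation happens only if the UDF calls get().
// A value-holding argument has already paid the evaluation cost up front.
public class DeferredSketch {
    interface Deferred {
        Object get();
    }

    // Lazy: defers evaluation until (and unless) get() is called.
    static Deferred lazy(Supplier<Object> eval) {
        return eval::get;
    }

    // A short-circuiting UDF in the style of NVL: evaluates its second
    // argument only when the first one is null.
    static Object nvl(Deferred first, Deferred second) {
        Object v = first.get();
        return v != null ? v : second.get();
    }
}
```

If `Deferred` instead held a precomputed value, as in the change the issue describes, both arguments would have been evaluated before `nvl` ran, defeating the lazy-evaluation contract quoted from Hive's javadoc above.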