[jira] [Assigned] (SPARK-42530) Remove Hadoop 2 from PySpark installation guide
[ https://issues.apache.org/jira/browse/SPARK-42530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42530: Assignee: Apache Spark > Remove Hadoop 2 from PySpark installation guide > --- > > Key: SPARK-42530 > URL: https://issues.apache.org/jira/browse/SPARK-42530 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40822) Use stable derived-column-alias algorithm, suitable for CREATE VIEW
[ https://issues.apache.org/jira/browse/SPARK-40822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692308#comment-17692308 ] Apache Spark commented on SPARK-40822: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/40126 > Use stable derived-column-alias algorithm, suitable for CREATE VIEW > > > Key: SPARK-40822 > URL: https://issues.apache.org/jira/browse/SPARK-40822 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > Spark has the ability to derive column aliases for expressions if no alias was > provided by the user. > E.g. > CREATE TABLE T(c1 INT, c2 INT); > SELECT c1, `(c1 + 1)`, c3 FROM (SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T); > This is a valuable feature. However, the current implementation works by > pretty printing the expression from the logical plan. This has multiple > downsides: > * The derived names can be unintuitive. For example the brackets in `(c1 + > 1)`, or outright ugly, such as: > SELECT `substr(hello, 1, 2147483647)` FROM (SELECT substr('hello', 1)) AS T; > * We cannot guarantee stability across versions since the logical plan of an > expression may change. > The latter is a major reason why we cannot allow CREATE VIEW without a column > list except in "trivial" cases. > CREATE VIEW v AS SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T; > Not allowed to create a permanent view `spark_catalog`.`default`.`v` without > explicitly assigning an alias for expression (c1 + 1). > There are two ways we can go about fixing this: > # Stop deriving column aliases from the expression. Instead generate unique > names such as `_col_1` based on their position in the select list. This is > ugly and takes away the "nice" headers on result sets. > # Move the derivation of the name upstream. That is, instead of pretty > printing the logical plan we pretty print the lexer output, or a sanitized > version of the expression as typed. > The statement as typed is stable by definition. The lexer is stable because it > has no reason to change. And if it ever did, we have a better chance to manage > the change. > In this feature we propose the following semantics: > # If the column alias can be trivially derived (some of these can stack), do > so: > ** a (qualified) column reference => the unqualified column identifier > cat.sch.tab.col => col > ** A field reference => the field name > struct.field1.field2 => field2 > ** A cast(column AS type) => column > cast(col1 AS INT) => col1 > ** A map lookup with literal key => keyname > map.key => key > map['key'] => key > ** A parameterless function => unqualified function name > current_schema() => current_schema > # Take the lexer tokens of the expression, eliminate comments, and append > them. > foo(tab1.c1 + /* this is a plus*/ > 1) => `foo(tab1.c1+1)` > > Of course we want this change under a config. > If the config is set we can allow CREATE VIEW to exploit this and use the > derived expressions. > PS: The exact mechanics of formatting the name is very much debatable. > E.g. spaces between tokens, squeezing out comments, upper casing, preserving > quotes or double quotes...)
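The "trivial" derivation rules proposed in SPARK-40822 can be sketched in plain Python. This is an illustrative toy, not Spark's implementation: it pattern-matches a few expression shapes as strings, whereas the proposal would work on lexer tokens of the expression as typed.

```python
import re

def derive_alias(expr: str) -> str:
    """Toy sketch of the proposed 'trivial' alias-derivation rules.

    Mirrors the rules listed above (NOT Spark code): a qualified column
    reference keeps its last identifier, CAST keeps the column name, a map
    lookup with a literal key keeps the key, a parameterless function keeps
    its unqualified name, and everything else falls back to the expression
    as typed with whitespace squeezed out.
    """
    expr = expr.strip()
    # cast(column AS type) => column (rules can stack, so recurse)
    m = re.fullmatch(r"(?i)cast\((.+)\s+as\s+\w+\)", expr)
    if m:
        return derive_alias(m.group(1))
    # map['key'] => key
    m = re.fullmatch(r".+\['(\w+)'\]", expr)
    if m:
        return m.group(1)
    # parameterless function: current_schema() => current_schema
    m = re.fullmatch(r"(\w+)\(\)", expr)
    if m:
        return m.group(1)
    # qualified reference: cat.sch.tab.col => col (also covers struct fields)
    m = re.fullmatch(r"(?:\w+\.)+(\w+)", expr)
    if m:
        return m.group(1)
    # fallback: the expression as typed, whitespace squeezed out
    return re.sub(r"\s+", "", expr)

print(derive_alias("cat.sch.tab.col"))    # col
print(derive_alias("cast(col1 AS INT)"))  # col1
print(derive_alias("foo(tab1.c1 + 1)"))   # foo(tab1.c1+1)
```

The fallback branch is what makes the scheme stable across versions: it depends only on the statement as typed, never on the shape of the optimized logical plan.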
[jira] [Commented] (SPARK-42468) Implement agg by (String, String)*
[ https://issues.apache.org/jira/browse/SPARK-42468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692307#comment-17692307 ] Apache Spark commented on SPARK-42468: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/40125 > Implement agg by (String, String)* > -- > > Key: SPARK-42468 > URL: https://issues.apache.org/jira/browse/SPARK-42468 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > >
[jira] [Commented] (SPARK-37980) Extend METADATA column to support row indices for file based data sources
[ https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692170#comment-17692170 ] Apache Spark commented on SPARK-37980: -- User 'olaky' has created a pull request for this issue: https://github.com/apache/spark/pull/40124 > Extend METADATA column to support row indices for file based data sources > - > > Key: SPARK-37980 > URL: https://issues.apache.org/jira/browse/SPARK-37980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Prakhar Jain >Assignee: Ala Luszczak >Priority: Major > Fix For: 3.4.0 > > > Spark recently added hidden metadata column support for file-based > datasources as part of SPARK-37273. > We should extend it to support ROW_INDEX/ROW_POSITION as well. > > Meaning of ROW_POSITION: > ROW_INDEX/ROW_POSITION is basically the index of a row within a file. E.g. the 5th > row in the file will have ROW_INDEX 5. > > Use cases: > Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple > uniquely identifies a row in a table. This information can be used to mark rows, > e.g. by an indexer.
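The uniqueness argument behind (fileName, rowIndex) can be illustrated with a small, Spark-free sketch. The `_file` and `_row_index` names below are placeholders for illustration, not Spark's actual hidden metadata field names:

```python
def with_row_index(files):
    """Attach (file_name, row_index) metadata to every row.

    Pure-Python illustration of the idea in SPARK-37980: a row's position
    within its file, combined with the file name, uniquely identifies the
    row across the whole table. Not Spark code -- Spark would surface this
    through the hidden metadata column for file-based sources.
    """
    out = []
    for name, rows in files.items():
        for idx, row in enumerate(rows):
            out.append({**row, "_file": name, "_row_index": idx})
    return out

rows = with_row_index({
    "part-0.parquet": [{"v": 10}, {"v": 20}],
    "part-1.parquet": [{"v": 30}],
})
# Every (file, row_index) pair is distinct, so the pair works as a row key.
keys = {(r["_file"], r["_row_index"]) for r in rows}
assert len(keys) == len(rows)
```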
[jira] [Commented] (SPARK-42272) Use available ephemeral port for Spark Connect server in testing
[ https://issues.apache.org/jira/browse/SPARK-42272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692143#comment-17692143 ] Apache Spark commented on SPARK-42272: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/40123 > Use available ephemeral port for Spark Connect server in testing > > > Key: SPARK-42272 > URL: https://issues.apache.org/jira/browse/SPARK-42272 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Currently Spark Connect tests cannot run in parallel, and require setting the > parallelism to 1: > {code} > python/run-tests --module pyspark-connect --parallelism 1 > {code} > The main reason is that the port being used is hardcoded to the > default 15002. We should instead search for an available port and use it.
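The "search for an available port" approach is usually implemented with the standard trick of binding to port 0 and reading back the port the kernel assigned. This is a generic sketch of that technique, not the actual PySpark test helper:

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an available ephemeral port.

    Binding to port 0 makes the kernel pick a currently free port; we read
    it back with getsockname() and release the socket so the server under
    test can bind to it. Generic illustration, not PySpark code.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = find_free_port()
print(port)  # some free ephemeral port, rather than the hardcoded 15002
```

Note the small race window between closing the probe socket and the server binding the port; for parallel test runs that window is normally acceptable, and it is far better than every worker contending for the same fixed port.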
[jira] [Commented] (SPARK-42349) Support pandas cogroup with multiple df
[ https://issues.apache.org/jira/browse/SPARK-42349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692129#comment-17692129 ] Apache Spark commented on SPARK-42349: -- User 'santosh-d3vpl3x' has created a pull request for this issue: https://github.com/apache/spark/pull/40122 > Support pandas cogroup with multiple df > --- > > Key: SPARK-42349 > URL: https://issues.apache.org/jira/browse/SPARK-42349 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.1 >Reporter: Santosh Pingale >Priority: Trivial > > Currently pyspark supports `cogroup.applyInPandas` with only 2 dataframes. The > improvement request is to support multiple dataframes with variable arity.
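The requested variable-arity semantics can be modeled without Spark at all. The sketch below uses made-up names and plain dict rows purely for illustration; the real PySpark API is `cogroup(...).applyInPandas`, which today accepts exactly two grouped DataFrames:

```python
from collections import defaultdict

def cogroup_apply(apply_fn, *datasets, key="id"):
    """Cogroup an arbitrary number of keyed datasets and apply a function.

    Plain-Python model of the requested behavior: rows from every dataset
    are grouped by `key`, and for each key apply_fn receives one group per
    dataset (possibly empty). Illustration only -- not the PySpark API.
    """
    grouped = [defaultdict(list) for _ in datasets]
    for groups, rows in zip(grouped, datasets):
        for row in rows:
            groups[row[key]].append(row)
    # the key space is the union across all datasets, like a full cogroup
    keys = sorted(set().union(*(g.keys() for g in grouped)))
    return [apply_fn(k, *(g[k] for g in grouped)) for k in keys]

a = [{"id": 1, "x": 1}, {"id": 2, "x": 2}]
b = [{"id": 1, "y": 10}]
c = [{"id": 2, "z": 100}, {"id": 2, "z": 200}]
sizes = cogroup_apply(lambda k, *groups: (k, [len(g) for g in groups]), a, b, c)
print(sizes)  # [(1, [1, 1, 0]), (2, [1, 0, 2])]
```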
[jira] [Assigned] (SPARK-42528) Optimize PercentileHeap
[ https://issues.apache.org/jira/browse/SPARK-42528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42528: Assignee: Apache Spark (was: Alkis Evlogimenos) > Optimize PercentileHeap > --- > > Key: SPARK-42528 > URL: https://issues.apache.org/jira/browse/SPARK-42528 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Alkis Evlogimenos >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > > It is not fast enough when used inside the scheduler for estimations, which > slows down the scheduling rate and, as a result, query execution time.
[jira] [Assigned] (SPARK-42528) Optimize PercentileHeap
[ https://issues.apache.org/jira/browse/SPARK-42528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42528: Assignee: Alkis Evlogimenos (was: Apache Spark) > Optimize PercentileHeap > --- > > Key: SPARK-42528 > URL: https://issues.apache.org/jira/browse/SPARK-42528 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Alkis Evlogimenos >Assignee: Alkis Evlogimenos >Priority: Major > Fix For: 3.4.0 > > > It is not fast enough when used inside the scheduler for estimations, which > slows down the scheduling rate and, as a result, query execution time.
[jira] [Commented] (SPARK-42528) Optimize PercentileHeap
[ https://issues.apache.org/jira/browse/SPARK-42528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692109#comment-17692109 ] Apache Spark commented on SPARK-42528: -- User 'alkis' has created a pull request for this issue: https://github.com/apache/spark/pull/40121 > Optimize PercentileHeap > --- > > Key: SPARK-42528 > URL: https://issues.apache.org/jira/browse/SPARK-42528 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Alkis Evlogimenos >Assignee: Alkis Evlogimenos >Priority: Major > Fix For: 3.4.0 > > > It is not fast enough when used inside the scheduler for estimations, which > slows down the scheduling rate and, as a result, query execution time.
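A common way to get a running percentile with O(log n) inserts and O(1) reads is a two-heap structure: a max-heap for the values at or below the percentile and a min-heap for the values above it. The sketch below illustrates that general technique; it is not Spark's PercentileHeap implementation:

```python
import heapq
import math

class PercentileHeap:
    """Running percentile via two heaps (illustrative sketch, not Spark's).

    lo is a max-heap (stored as negated values) holding the smallest
    ceil(percentage * n) values, so the tracked percentile is always its
    root: insert() costs O(log n), percentile() costs O(1), which is the
    property that matters on the scheduler's hot path.
    """

    def __init__(self, percentage=0.5):
        self.percentage = percentage
        self.lo = []  # max-heap via negation: values <= percentile
        self.hi = []  # min-heap: values above the percentile

    def insert(self, v):
        if self.lo and v > -self.lo[0]:
            heapq.heappush(self.hi, v)
        else:
            heapq.heappush(self.lo, -v)
        # rebalance so len(lo) == ceil(percentage * total)
        total = len(self.lo) + len(self.hi)
        target = max(1, math.ceil(self.percentage * total))
        while len(self.lo) > target:
            heapq.heappush(self.hi, -heapq.heappop(self.lo))
        while len(self.lo) < target:
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def percentile(self):
        return -self.lo[0]

ph = PercentileHeap(0.5)
for v in [5, 1, 3, 2, 4]:
    ph.insert(v)
print(ph.percentile())  # 3
```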
[jira] [Assigned] (SPARK-42527) Scala Client add Window functions
[ https://issues.apache.org/jira/browse/SPARK-42527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42527: Assignee: (was: Apache Spark) > Scala Client add Window functions > - > > Key: SPARK-42527 > URL: https://issues.apache.org/jira/browse/SPARK-42527 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major >
[jira] [Assigned] (SPARK-42527) Scala Client add Window functions
[ https://issues.apache.org/jira/browse/SPARK-42527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42527: Assignee: Apache Spark > Scala Client add Window functions > - > > Key: SPARK-42527 > URL: https://issues.apache.org/jira/browse/SPARK-42527 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-42527) Scala Client add Window functions
[ https://issues.apache.org/jira/browse/SPARK-42527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692026#comment-17692026 ] Apache Spark commented on SPARK-42527: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40120 > Scala Client add Window functions > - > > Key: SPARK-42527 > URL: https://issues.apache.org/jira/browse/SPARK-42527 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major >
[jira] [Assigned] (SPARK-42526) Add Classifier.getNumClasses back
[ https://issues.apache.org/jira/browse/SPARK-42526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42526: Assignee: (was: Apache Spark) > Add Classifier.getNumClasses back > - > > Key: SPARK-42526 > URL: https://issues.apache.org/jira/browse/SPARK-42526 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor >
[jira] [Assigned] (SPARK-42526) Add Classifier.getNumClasses back
[ https://issues.apache.org/jira/browse/SPARK-42526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42526: Assignee: Apache Spark > Add Classifier.getNumClasses back > - > > Key: SPARK-42526 > URL: https://issues.apache.org/jira/browse/SPARK-42526 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Minor >
[jira] [Commented] (SPARK-42526) Add Classifier.getNumClasses back
[ https://issues.apache.org/jira/browse/SPARK-42526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692018#comment-17692018 ] Apache Spark commented on SPARK-42526: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40119 > Add Classifier.getNumClasses back > - > > Key: SPARK-42526 > URL: https://issues.apache.org/jira/browse/SPARK-42526 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor >
[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691973#comment-17691973 ] Apache Spark commented on SPARK-26365: -- User 'zwangsheng' has created a pull request for this issue: https://github.com/apache/spark/pull/40118 > spark-submit for k8s cluster doesn't propagate exit code > > > Key: SPARK-26365 > URL: https://issues.apache.org/jira/browse/SPARK-26365 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Spark Submit >Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0 >Reporter: Oscar Bonilla >Priority: Major > Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, > spark-3.0.0-raise-exception-k8s-failure.patch > > > When launching apps using spark-submit in a Kubernetes cluster, if the Spark > application fails (returns exit code = 1, for example), spark-submit will > still exit gracefully and return exit code = 0. > This is problematic, since there's no way to know whether there's been a problem > with the Spark application.
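The expected behavior, propagating the child's exit code instead of always returning 0, can be sketched generically. This illustrates the principle with Python's subprocess module; it is not spark-submit's actual launcher code:

```python
import subprocess
import sys

def run_and_propagate(cmd):
    """Run a child process and surface its exit code to the caller.

    Sketch of the behavior the report asks of spark-submit in k8s cluster
    mode: if the launched application fails, the launcher should fail with
    the same code instead of swallowing it and exiting 0.
    """
    result = subprocess.run(cmd)
    return result.returncode

# A child that exits with code 3; the caller can now sys.exit(code)
code = run_and_propagate([sys.executable, "-c", "import sys; sys.exit(3)"])
print(code)  # 3
```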
[jira] [Assigned] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26365: Assignee: (was: Apache Spark) > spark-submit for k8s cluster doesn't propagate exit code > > > Key: SPARK-26365 > URL: https://issues.apache.org/jira/browse/SPARK-26365 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Spark Submit >Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0 >Reporter: Oscar Bonilla >Priority: Major > Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, > spark-3.0.0-raise-exception-k8s-failure.patch > > > When launching apps using spark-submit in a Kubernetes cluster, if the Spark > application fails (returns exit code = 1, for example), spark-submit will > still exit gracefully and return exit code = 0. > This is problematic, since there's no way to know whether there's been a problem > with the Spark application.
[jira] [Assigned] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26365: Assignee: Apache Spark > spark-submit for k8s cluster doesn't propagate exit code > > > Key: SPARK-26365 > URL: https://issues.apache.org/jira/browse/SPARK-26365 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Spark Submit >Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0 >Reporter: Oscar Bonilla >Assignee: Apache Spark >Priority: Major > Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, > spark-3.0.0-raise-exception-k8s-failure.patch > > > When launching apps using spark-submit in a Kubernetes cluster, if the Spark > application fails (returns exit code = 1, for example), spark-submit will > still exit gracefully and return exit code = 0. > This is problematic, since there's no way to know whether there's been a problem > with the Spark application.
[jira] [Commented] (SPARK-42427) Conv should return an error if the internal conversion overflows
[ https://issues.apache.org/jira/browse/SPARK-42427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691954#comment-17691954 ] Apache Spark commented on SPARK-42427: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/40117 > Conv should return an error if the internal conversion overflows > > > Key: SPARK-42427 > URL: https://issues.apache.org/jira/browse/SPARK-42427 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.4.0 > >
[jira] [Commented] (SPARK-41391) The output column name of `groupBy.agg(count_distinct)` is incorrect
[ https://issues.apache.org/jira/browse/SPARK-41391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691930#comment-17691930 ] Apache Spark commented on SPARK-41391: -- User 'ritikam2' has created a pull request for this issue: https://github.com/apache/spark/pull/40116 > The output column name of `groupBy.agg(count_distinct)` is incorrect > > > Key: SPARK-41391 > URL: https://issues.apache.org/jira/browse/SPARK-41391 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > scala> val df = spark.range(1, 10).withColumn("value", lit(1)) > df: org.apache.spark.sql.DataFrame = [id: bigint, value: int] > scala> df.createOrReplaceTempView("table") > scala> df.groupBy("id").agg(count_distinct($"value")) > res1: org.apache.spark.sql.DataFrame = [id: bigint, count(value): bigint] > scala> spark.sql(" SELECT id, COUNT(DISTINCT value) FROM table GROUP BY id ") > res2: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT value): > bigint] > scala> df.groupBy("id").agg(count_distinct($"*")) > res3: org.apache.spark.sql.DataFrame = [id: bigint, count(unresolvedstar()): > bigint] > scala> spark.sql(" SELECT id, COUNT(DISTINCT *) FROM table GROUP BY id ") > res4: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT id, > value): bigint]
[jira] [Assigned] (SPARK-42525) collapse two adjacent windows with the same partition/order in subquery
[ https://issues.apache.org/jira/browse/SPARK-42525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42525: Assignee: Apache Spark > collapse two adjacent windows with the same partition/order in subquery > --- > > Key: SPARK-42525 > URL: https://issues.apache.org/jira/browse/SPARK-42525 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: zhuml >Assignee: Apache Spark >Priority: Major > > Extend the CollapseWindow rule to collapse Window nodes when one window is in a > subquery. > > {code:java} > select a, b, c, row_number() over (partition by a order by b) as d from > ( select a, b, rank() over (partition by a order by b) as c from t1) t2 > == Optimized Logical Plan == > before > Window [row_number() windowspecdefinition(a#11, b#12 ASC NULLS FIRST, > specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS > d#26], [a#11], [b#12 ASC NULLS FIRST] > +- Window [rank(b#12) windowspecdefinition(a#11, b#12 ASC NULLS FIRST, > specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS > c#25], [a#11], [b#12 ASC NULLS FIRST] > +- InMemoryRelation [a#11, b#12], StorageLevel(disk, memory, deserialized, > 1 replicas) > +- *(1) Project [_1#6 AS a#11, _2#7 AS b#12] > +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, > scala.Tuple2, true]))._1 AS _1#6, knownnotnull(assertnotnull(input[0, > scala.Tuple2, true]))._2 AS _2#7] > +- *(1) MapElements > org.apache.spark.sql.DataFrameSuite$$Lambda$1517/1628848368@3a479fda, obj#5: > scala.Tuple2 > +- *(1) DeserializeToObject staticinvoke(class > java.lang.Long, ObjectType(class java.lang.Long), valueOf, id#0L, true, > false, true), obj#4: java.lang.Long > +- *(1) Range (0, 10, step=1, splits=2) > after > Window [rank(b#12) windowspecdefinition(a#11, b#12 ASC NULLS FIRST, > specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS > c#25, row_number() windowspecdefinition(a#11, b#12 
ASC NULLS FIRST, > specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS > d#26], [a#11], [b#12 ASC NULLS FIRST] > +- InMemoryRelation [a#11, b#12], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- *(1) Project [_1#6 AS a#11, _2#7 AS b#12] > +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, > scala.Tuple2, true]))._1 AS _1#6, knownnotnull(assertnotnull(input[0, > scala.Tuple2, true]))._2 AS _2#7] > +- *(1) MapElements > org.apache.spark.sql.DataFrameSuite$$Lambda$1518/1928028672@4d7a64ca, obj#5: > scala.Tuple2 > +- *(1) DeserializeToObject staticinvoke(class java.lang.Long, > ObjectType(class java.lang.Long), valueOf, id#0L, true, false, true), obj#4: > java.lang.Long > +- *(1) Range (0, 10, step=1, splits=2){code} > >
[jira] [Commented] (SPARK-42525) collapse two adjacent windows with the same partition/order in subquery
[ https://issues.apache.org/jira/browse/SPARK-42525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691904#comment-17691904 ] Apache Spark commented on SPARK-42525: -- User 'zml1206' has created a pull request for this issue: https://github.com/apache/spark/pull/40115 > collapse two adjacent windows with the same partition/order in subquery > --- > > Key: SPARK-42525 > URL: https://issues.apache.org/jira/browse/SPARK-42525 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: zhuml >Priority: Major > > Extend the CollapseWindow rule to collapse Window nodes when one window is in a > subquery. > > {code:java} > select a, b, c, row_number() over (partition by a order by b) as d from > ( select a, b, rank() over (partition by a order by b) as c from t1) t2 > == Optimized Logical Plan == > before > Window [row_number() windowspecdefinition(a#11, b#12 ASC NULLS FIRST, > specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS > d#26], [a#11], [b#12 ASC NULLS FIRST] > +- Window [rank(b#12) windowspecdefinition(a#11, b#12 ASC NULLS FIRST, > specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS > c#25], [a#11], [b#12 ASC NULLS FIRST] > +- InMemoryRelation [a#11, b#12], StorageLevel(disk, memory, deserialized, > 1 replicas) > +- *(1) Project [_1#6 AS a#11, _2#7 AS b#12] > +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, > scala.Tuple2, true]))._1 AS _1#6, knownnotnull(assertnotnull(input[0, > scala.Tuple2, true]))._2 AS _2#7] > +- *(1) MapElements > org.apache.spark.sql.DataFrameSuite$$Lambda$1517/1628848368@3a479fda, obj#5: > scala.Tuple2 > +- *(1) DeserializeToObject staticinvoke(class > java.lang.Long, ObjectType(class java.lang.Long), valueOf, id#0L, true, > false, true), obj#4: java.lang.Long > +- *(1) Range (0, 10, step=1, splits=2) > after > Window [rank(b#12) windowspecdefinition(a#11, b#12 ASC NULLS FIRST, > 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS > c#25, row_number() windowspecdefinition(a#11, b#12 ASC NULLS FIRST, > specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS > d#26], [a#11], [b#12 ASC NULLS FIRST] > +- InMemoryRelation [a#11, b#12], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- *(1) Project [_1#6 AS a#11, _2#7 AS b#12] > +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, > scala.Tuple2, true]))._1 AS _1#6, knownnotnull(assertnotnull(input[0, > scala.Tuple2, true]))._2 AS _2#7] > +- *(1) MapElements > org.apache.spark.sql.DataFrameSuite$$Lambda$1518/1928028672@4d7a64ca, obj#5: > scala.Tuple2 > +- *(1) DeserializeToObject staticinvoke(class java.lang.Long, > ObjectType(class java.lang.Long), valueOf, id#0L, true, false, true), obj#4: > java.lang.Long > +- *(1) Range (0, 10, step=1, splits=2){code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
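As an editor's illustration of the optimization requested above, the merge of two adjacent Window nodes sharing the same partition/order can be sketched on a toy plan model (this is not Spark's actual CollapseWindow rule, which additionally checks that the outer window expressions are independent of the inner ones):

```python
from dataclasses import dataclass

@dataclass
class Window:
    # toy stand-in for Spark's Window logical plan node, not the real API
    exprs: list       # window expressions computed by this node
    partition: tuple  # PARTITION BY keys
    order: tuple      # ORDER BY keys
    child: object = None

def collapse_windows(plan):
    """Merge a Window into its child Window when both share partition/order."""
    if isinstance(plan, Window) and isinstance(plan.child, Window):
        child = collapse_windows(plan.child)
        if (plan.partition, plan.order) == (child.partition, child.order):
            # one Window node now evaluates both expression lists
            return Window(child.exprs + plan.exprs, plan.partition, plan.order, child.child)
        return Window(plan.exprs, plan.partition, plan.order, child)
    return plan

# mirrors the JIRA example: rank() in the subquery, row_number() outside
inner = Window(["rank(b) AS c"], ("a",), ("b",), child="scan t1")
outer = Window(["row_number() AS d"], ("a",), ("b",), inner)
merged = collapse_windows(outer)
```

After the rewrite, a single Window node computes both `c` and `d` over the shared partition/order, matching the "after" plan quoted above.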
[jira] [Assigned] (SPARK-42525) collapse two adjacent windows with the same partition/order in subquery
[ https://issues.apache.org/jira/browse/SPARK-42525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42525: Assignee: (was: Apache Spark) > collapse two adjacent windows with the same partition/order in subquery > (issue description and before/after plan identical to the first SPARK-42525 message above)
[jira] [Commented] (SPARK-42525) collapse two adjacent windows with the same partition/order in subquery
[ https://issues.apache.org/jira/browse/SPARK-42525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691905#comment-17691905 ] Apache Spark commented on SPARK-42525: -- User 'zml1206' has created a pull request for this issue: https://github.com/apache/spark/pull/40115 > collapse two adjacent windows with the same partition/order in subquery > (issue description and before/after plan identical to the first SPARK-42525 message above)
[jira] [Assigned] (SPARK-42513) Push down topK through join
[ https://issues.apache.org/jira/browse/SPARK-42513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42513: Assignee: (was: Apache Spark) > Push down topK through join > --- > > Key: SPARK-42513 > URL: https://issues.apache.org/jira/browse/SPARK-42513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: after-UI.png, before-UI.png > > > {code:scala} > spark.range(1).selectExpr("id % 1 as a", "id as > b").write.saveAsTable("t1") > spark.range(1).selectExpr("id % 1 as x", "id as > y").write.saveAsTable("t2") > sql("select * from t1 left join t2 on a = x order by b limit 5").collect() > spark.sql("set > spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.LimitPushDown") > sql("select * from t1 left join t2 on a = x order by b limit 5").collect() > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42513) Push down topK through join
[ https://issues.apache.org/jira/browse/SPARK-42513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42513: Assignee: Apache Spark > Push down topK through join > (issue description and repro identical to the first SPARK-42513 message above)
[jira] [Commented] (SPARK-42513) Push down topK through join
[ https://issues.apache.org/jira/browse/SPARK-42513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691896#comment-17691896 ] Apache Spark commented on SPARK-42513: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/40114 > Push down topK through join > (issue description and repro identical to the first SPARK-42513 message above)
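The intuition behind pushing a top-K (ORDER BY ... LIMIT k) through a LEFT OUTER join, as SPARK-42513 proposes, is that every left row survives the join at least once, so ordering by a left-side column lets the top-K run on the left input first. A pure-Python sketch of that equivalence (toy data and a naive nested-loop join, not Spark's LimitPushDown implementation):

```python
import heapq

def left_outer_join(left, right):
    # naive LEFT OUTER JOIN on left.a == right.x; unmatched rows padded with None
    out = []
    for a, b in left:
        matches = [(x, y) for x, y in right if x == a]
        if matches:
            out.extend((a, b, x, y) for x, y in matches)
        else:
            out.append((a, b, None, None))
    return out

def topk_by_b(rows, k):
    # "ORDER BY b LIMIT k"; works for both (a, b) and (a, b, x, y) rows
    return heapq.nsmallest(k, rows, key=lambda r: r[1])

left = [(i % 3, 9 - i) for i in range(10)]   # columns (a, b)
right = [(0, 100), (0, 101), (2, 200)]       # columns (x, y)
k = 5

plain = topk_by_b(left_outer_join(left, right), k)                  # topK after the join
pushed = topk_by_b(left_outer_join(topk_by_b(left, k), right), k)   # topK pushed to the left side
```

Both plans produce the same k rows, but the pushed variant joins only k left rows, which is the saving the before/after UI screenshots attached to the issue illustrate.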
[jira] [Commented] (SPARK-42509) WindowGroupLimitExec supports codegen
[ https://issues.apache.org/jira/browse/SPARK-42509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691876#comment-17691876 ] Apache Spark commented on SPARK-42509: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40113 > WindowGroupLimitExec supports codegen > - > > Key: SPARK-42509 > URL: https://issues.apache.org/jira/browse/SPARK-42509 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42509) WindowGroupLimitExec supports codegen
[ https://issues.apache.org/jira/browse/SPARK-42509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42509: Assignee: (was: Apache Spark) > WindowGroupLimitExec supports codegen > - > > Key: SPARK-42509 > URL: https://issues.apache.org/jira/browse/SPARK-42509 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41933) Provide local mode that automatically starts the server
[ https://issues.apache.org/jira/browse/SPARK-41933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691875#comment-17691875 ] Apache Spark commented on SPARK-41933: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/40112 > Provide local mode that automatically starts the server > --- > > Key: SPARK-41933 > URL: https://issues.apache.org/jira/browse/SPARK-41933 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Currently the Spark Connect server has to be started manually which is > troublesome for end users and developers to try Spark Connect out. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42509) WindowGroupLimitExec supports codegen
[ https://issues.apache.org/jira/browse/SPARK-42509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42509: Assignee: Apache Spark > WindowGroupLimitExec supports codegen > - > > Key: SPARK-42509 > URL: https://issues.apache.org/jira/browse/SPARK-42509 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42524) Upgrade numpy and pandas in the release Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-42524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42524: Assignee: (was: Apache Spark) > Upgrade numpy and pandas in the release Dockerfile > -- > > Key: SPARK-42524 > URL: https://issues.apache.org/jira/browse/SPARK-42524 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Xinrong Meng >Priority: Major > > Otherwise, errors are raised as shown below when building release docs. > {code} > ImportError: Warning: Latest version of pandas (1.5.3) is required to > generate the documentation; however, your version was 1.1.5 > ImportError: this version of pandas is incompatible with numpy < 1.20.3 > your numpy version is 1.19.4. > Please upgrade numpy to >= 1.20.3 to use this pandas version > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42524) Upgrade numpy and pandas in the release Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-42524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42524: Assignee: Apache Spark > Upgrade numpy and pandas in the release Dockerfile > (issue description identical to the first SPARK-42524 message above)
[jira] [Commented] (SPARK-42524) Upgrade numpy and pandas in the release Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-42524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691867#comment-17691867 ] Apache Spark commented on SPARK-42524: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/40111 > Upgrade numpy and pandas in the release Dockerfile > (issue description identical to the first SPARK-42524 message above)
[jira] [Commented] (SPARK-41775) Implement training functions as input
[ https://issues.apache.org/jira/browse/SPARK-41775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691836#comment-17691836 ] Apache Spark commented on SPARK-41775: -- User 'rithwik-db' has created a pull request for this issue: https://github.com/apache/spark/pull/40110 > Implement training functions as input > - > > Key: SPARK-41775 > URL: https://issues.apache.org/jira/browse/SPARK-41775 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Assignee: Rithwik Ediga Lakhamsani >Priority: Major > Fix For: 3.4.0 > > > Sidenote: make formatting updates described in > https://github.com/apache/spark/pull/39188 > > Currently, `Distributor().run(...)` takes only files as input. Now we will > add additional functionality to take in functions as well. This requires > the following process on each task in the executor nodes: > 1. Take the input function and args and pickle them > 2. Create a temp train.py file that looks like > {code:python} > import cloudpickle > import os > if __name__ == "__main__": > with open(f"{tempdir}/train_input.pkl", "rb") as f: > train, args = cloudpickle.load(f) > output = train(*args) > if output and os.environ.get("RANK", "") == "0": # this is for > partitionId == 0 > with open(f"{tempdir}/train_output.pkl", "wb") as f: > cloudpickle.dump(output, f) {code} > 3. Run that train.py file with `torchrun` > 4. Check whether `train_output.pkl` has been created on the process with > partitionId == 0; if it has, deserialize it and return that output through `.collect()`
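The four steps above can be simulated end to end with only the standard library (an editor's sketch: stdlib `pickle` stands in for `cloudpickle`, so the function must be importable — `operator.add` here — and a plain Python interpreter stands in for `torchrun`; the path and env handling are illustrative, not PySpark's actual code):

```python
import operator, os, pickle, subprocess, sys, tempfile, textwrap

tmp = tempfile.mkdtemp()

# 1. pickle the training function and its args on the driver
with open(os.path.join(tmp, "train_input.pkl"), "wb") as f:
    pickle.dump((operator.add, (3, 4)), f)

# 2. materialize a temp train.py mirroring the sketch in the issue
script = textwrap.dedent(f"""
    import os, pickle
    with open(os.path.join({tmp!r}, "train_input.pkl"), "rb") as f:
        train, args = pickle.load(f)
    output = train(*args)
    if output is not None and os.environ.get("RANK", "") == "0":  # partitionId == 0
        with open(os.path.join({tmp!r}, "train_output.pkl"), "wb") as f:
            pickle.dump(output, f)
""")
train_py = os.path.join(tmp, "train.py")
with open(train_py, "w") as f:
    f.write(script)

# 3. run it (torchrun in the real design) ...
subprocess.run([sys.executable, train_py], check=True,
               env={**os.environ, "RANK": "0"})

# 4. ... and deserialize the output written by rank 0
with open(os.path.join(tmp, "train_output.pkl"), "rb") as f:
    result = pickle.load(f)
```

The reason the real design needs `cloudpickle` rather than stdlib `pickle` is exactly the restriction worked around here: plain `pickle` serializes a function as a module reference, while `cloudpickle` serializes arbitrary user-defined functions by value.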
[jira] [Assigned] (SPARK-42522) Fix DataFrameWriterV2 to find the default source
[ https://issues.apache.org/jira/browse/SPARK-42522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42522: Assignee: Apache Spark > Fix DataFrameWriterV2 to find the default source > > > Key: SPARK-42522 > URL: https://issues.apache.org/jira/browse/SPARK-42522 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > {code:python} > df.writeTo("test_table").create() > {code} > throws: > {noformat} > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (org.apache.spark.SparkClassNotFoundException) [DATA_SOURCE_NOT_FOUND] Failed > to find the data source: . Please find packages at > `https://spark.apache.org/third-party-projects.html`. > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42522) Fix DataFrameWriterV2 to find the default source
[ https://issues.apache.org/jira/browse/SPARK-42522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691801#comment-17691801 ] Apache Spark commented on SPARK-42522: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40109 > Fix DataFrameWriterV2 to find the default source > (issue description identical to the first SPARK-42522 message above)
[jira] [Assigned] (SPARK-42522) Fix DataFrameWriterV2 to find the default source
[ https://issues.apache.org/jira/browse/SPARK-42522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42522: Assignee: (was: Apache Spark) > Fix DataFrameWriterV2 to find the default source > (issue description identical to the first SPARK-42522 message above)
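A minimal model of the SPARK-42522 failure mode and its fix: when the writer supplies no source name, the lookup should fall back to the session default instead of searching for an empty string — note the error message above shows a blank name after "Failed to find the data source:". (Toy registry and names chosen for illustration; this is not Spark's actual `DataSource.lookupDataSource`.)

```python
# hypothetical registry; real Spark resolves data source classes via ServiceLoader
REGISTRY = {"parquet": "ParquetDataSource", "json": "JsonDataSource"}
DEFAULT_SOURCE = "parquet"  # plays the role of spark.sql.sources.default

def lookup_data_source(name):
    # the fix: treat None/"" as "use the configured default"
    effective = name or DEFAULT_SOURCE
    if effective not in REGISTRY:
        raise LookupError(
            f"[DATA_SOURCE_NOT_FOUND] Failed to find the data source: {effective}")
    return REGISTRY[effective]
```

Without the `name or DEFAULT_SOURCE` fallback, a writer that never called `.using(...)` would look up `""` and hit the DATA_SOURCE_NOT_FOUND path quoted in the issue.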
[jira] [Assigned] (SPARK-42521) Add NULL values for INSERT commands with user-specified lists of fewer columns than the target table
[ https://issues.apache.org/jira/browse/SPARK-42521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42521: Assignee: (was: Apache Spark) > Add NULL values for INSERT commands with user-specified lists of fewer > columns than the target table > > > Key: SPARK-42521 > URL: https://issues.apache.org/jira/browse/SPARK-42521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42521) Add NULL values for INSERT commands with user-specified lists of fewer columns than the target table
[ https://issues.apache.org/jira/browse/SPARK-42521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42521: Assignee: Apache Spark > Add NULL values for INSERT commands with user-specified lists of fewer > columns than the target table > > > Key: SPARK-42521 > URL: https://issues.apache.org/jira/browse/SPARK-42521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42521) Add NULL values for INSERT commands with user-specified lists of fewer columns than the target table
[ https://issues.apache.org/jira/browse/SPARK-42521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691786#comment-17691786 ] Apache Spark commented on SPARK-42521: -- User 'dtenedor' has created a pull request for this issue: https://github.com/apache/spark/pull/40108 > Add NULL values for INSERT commands with user-specified lists of fewer > columns than the target table > > > Key: SPARK-42521 > URL: https://issues.apache.org/jira/browse/SPARK-42521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
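The behavior SPARK-42521 requests can be modeled in a few lines: columns missing from the user-specified INSERT column list are filled with NULL (None) in target-table order (editor's toy model, not Spark's analyzer logic):

```python
def insert_with_column_list(table_cols, user_cols, values):
    # pad unmentioned target columns with NULL (None), preserving table order
    row = dict(zip(user_cols, values))
    return tuple(row.get(col) for col in table_cols)

# INSERT INTO t (c1, c3) VALUES (1, 'x') against table t(c1, c2, c3)
padded = insert_with_column_list(["c1", "c2", "c3"], ["c1", "c3"], [1, "x"])
```

Here `c2` was not mentioned by the user, so the resulting row carries NULL in that position.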
[jira] [Commented] (SPARK-42520) Spark Connect Scala Client: Window
[ https://issues.apache.org/jira/browse/SPARK-42520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691774#comment-17691774 ] Apache Spark commented on SPARK-42520: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/40107 > Spark Connect Scala Client: Window > -- > > Key: SPARK-42520 > URL: https://issues.apache.org/jira/browse/SPARK-42520 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42520) Spark Connect Scala Client: Window
[ https://issues.apache.org/jira/browse/SPARK-42520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42520: Assignee: Apache Spark (was: Rui Wang) > Spark Connect Scala Client: Window > -- > > Key: SPARK-42520 > URL: https://issues.apache.org/jira/browse/SPARK-42520 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42520) Spark Connect Scala Client: Window
[ https://issues.apache.org/jira/browse/SPARK-42520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42520: Assignee: Rui Wang (was: Apache Spark) > Spark Connect Scala Client: Window > -- > > Key: SPARK-42520 > URL: https://issues.apache.org/jira/browse/SPARK-42520 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42518) Scala client Write API V2
[ https://issues.apache.org/jira/browse/SPARK-42518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42518: Assignee: (was: Apache Spark) > Scala client Write API V2 > - > > Key: SPARK-42518 > URL: https://issues.apache.org/jira/browse/SPARK-42518 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Priority: Major > > Impl the Dataset#writeTo method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42518) Scala client Write API V2
[ https://issues.apache.org/jira/browse/SPARK-42518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691758#comment-17691758 ] Apache Spark commented on SPARK-42518: -- User 'zhenlineo' has created a pull request for this issue: https://github.com/apache/spark/pull/40075 > Scala client Write API V2 > - > > Key: SPARK-42518 > URL: https://issues.apache.org/jira/browse/SPARK-42518 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Priority: Major > > Impl the Dataset#writeTo method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42518) Scala client Write API V2

[ https://issues.apache.org/jira/browse/SPARK-42518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42518:
------------------------------------

    Assignee: Apache Spark

> Scala client Write API V2
> -------------------------
>
> Key: SPARK-42518
> URL: https://issues.apache.org/jira/browse/SPARK-42518
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Zhen Li
> Assignee: Apache Spark
> Priority: Major
>
> Implement the Dataset#writeTo method.
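The Dataset#writeTo method tracked above follows the DataFrameWriterV2 builder pattern: a chainable writer object that records options and then creates or appends to a table. As a rough illustration of that pattern (a pure-Python sketch with hypothetical names, not Spark's actual implementation):

```python
class WriterV2:
    """Minimal sketch of a DataFrameWriterV2-style builder (hypothetical)."""

    def __init__(self, rows, table):
        self._rows = rows
        self._table = table
        self._partitioning = []

    def partitionedBy(self, *cols):
        # Record partitioning columns; return self so calls can be chained.
        self._partitioning = list(cols)
        return self

    def create(self, catalog):
        # "Create" the table in an in-memory catalog dict, failing if it exists.
        if self._table in catalog:
            raise ValueError(f"table {self._table} already exists")
        catalog[self._table] = {"rows": list(self._rows),
                                "partitioning": self._partitioning}
        return catalog[self._table]


catalog = {}
t = WriterV2([{"key": 1, "value": "a"}], "testcat.t").partitionedBy("key").create(catalog)
```

The chained call mirrors the shape of `df.writeTo("testcat.t").partitionedBy(...).create()` in the Dataset API.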
[jira] [Commented] (SPARK-42002) Implement DataFrameWriterV2 (ReadwriterV2Tests)

[ https://issues.apache.org/jira/browse/SPARK-42002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691721#comment-17691721 ]

Apache Spark commented on SPARK-42002:
--------------------------------------

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/40106

> Implement DataFrameWriterV2 (ReadwriterV2Tests)
> -----------------------------------------------
>
> Key: SPARK-42002
> URL: https://issues.apache.org/jira/browse/SPARK-42002
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Assignee: Sandeep Singh
> Priority: Major
> Fix For: 3.4.0
>
> {code}
> pyspark/sql/tests/test_readwriter.py:182 (ReadwriterV2ParityTests.test_api)
> self = <ReadwriterV2ParityTests testMethod=test_api>
>
>     def test_api(self):
>         df = self.df
> >       writer = df.writeTo("testcat.t")
>
> ../test_readwriter.py:185:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> self = DataFrame[key: bigint, value: string], args = ('testcat.t',), kwargs = {}
>
>     def writeTo(self, *args: Any, **kwargs: Any) -> None:
> >       raise NotImplementedError("writeTo() is not implemented.")
> E       NotImplementedError: writeTo() is not implemented.
>
> ../../connect/dataframe.py:1529: NotImplementedError
> {code}
[jira] [Commented] (SPARK-42516) Non-captured session time zone in view creation

[ https://issues.apache.org/jira/browse/SPARK-42516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691715#comment-17691715 ]

Apache Spark commented on SPARK-42516:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/40103

> Non-captured session time zone in view creation
> -----------------------------------------------
>
> Key: SPARK-42516
> URL: https://issues.apache.org/jira/browse/SPARK-42516
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
>
> The session time zone config is captured only when it is set explicitly, but if it is not, the view is instantiated with the current settings. That might confuse users, for instance:
> Set the session time zone explicitly before view creation:
> {code:java}
> TODO
> {code}
> Set the same time zone implicitly as the JVM time zone, and the default value of the SQL config spark.sql.session.timeZone:
> {code:java}
> TODO
> {code}
[jira] [Assigned] (SPARK-42516) Non-captured session time zone in view creation

[ https://issues.apache.org/jira/browse/SPARK-42516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42516:
------------------------------------

    Assignee: Max Gekk  (was: Apache Spark)

> Non-captured session time zone in view creation
> -----------------------------------------------
>
> Key: SPARK-42516
> URL: https://issues.apache.org/jira/browse/SPARK-42516
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
>
> The session time zone config is captured only when it is set explicitly, but if it is not, the view is instantiated with the current settings. That might confuse users, for instance:
> Set the session time zone explicitly before view creation:
> {code:java}
> TODO
> {code}
> Set the same time zone implicitly as the JVM time zone, and the default value of the SQL config spark.sql.session.timeZone:
> {code:java}
> TODO
> {code}
[jira] [Assigned] (SPARK-42516) Non-captured session time zone in view creation

[ https://issues.apache.org/jira/browse/SPARK-42516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42516:
------------------------------------

    Assignee: Apache Spark  (was: Max Gekk)

> Non-captured session time zone in view creation
> -----------------------------------------------
>
> Key: SPARK-42516
> URL: https://issues.apache.org/jira/browse/SPARK-42516
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Assignee: Apache Spark
> Priority: Major
>
> The session time zone config is captured only when it is set explicitly, but if it is not, the view is instantiated with the current settings. That might confuse users, for instance:
> Set the session time zone explicitly before view creation:
> {code:java}
> TODO
> {code}
> Set the same time zone implicitly as the JVM time zone, and the default value of the SQL config spark.sql.session.timeZone:
> {code:java}
> TODO
> {code}
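The behavior described in SPARK-42516 is that only *explicitly set* configs get captured into the view's stored properties, so a view created under an implicit (JVM-default) time zone later resolves against whatever the session happens to use. A small pure-Python sketch of that distinction (all names hypothetical, not Spark's code):

```python
DEFAULT_TZ = "America/Los_Angeles"  # stands in for the JVM default time zone

class Session:
    def __init__(self):
        self.explicit = {}  # only configs the user set explicitly

    def set(self, key, value):
        self.explicit[key] = value

    def get(self, key, default):
        return self.explicit.get(key, default)

def create_view(session):
    # Only explicitly set configs are captured into the view's properties.
    return {"captured": dict(session.explicit)}

def view_time_zone(view, session):
    # A view with no captured value falls back to the *current* session setting.
    return view["captured"].get("spark.sql.session.timeZone",
                                session.get("spark.sql.session.timeZone", DEFAULT_TZ))

s = Session()
v_implicit = create_view(s)                         # time zone only implicit: not captured
s.set("spark.sql.session.timeZone", "UTC")
v_explicit = create_view(s)                         # time zone explicit: captured
s.set("spark.sql.session.timeZone", "Asia/Seoul")   # session later changes
```

After the session changes, `v_explicit` still resolves to "UTC" while `v_implicit` silently follows the new session value, which is the confusing asymmetry the ticket targets.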
[jira] [Assigned] (SPARK-42514) Scala Client add partition transforms functions

[ https://issues.apache.org/jira/browse/SPARK-42514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42514:
------------------------------------

    Assignee: (was: Apache Spark)

> Scala Client add partition transforms functions
> -----------------------------------------------
>
> Key: SPARK-42514
> URL: https://issues.apache.org/jira/browse/SPARK-42514
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Yang Jie
> Priority: Major
[jira] [Commented] (SPARK-42514) Scala Client add partition transforms functions

[ https://issues.apache.org/jira/browse/SPARK-42514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691692#comment-17691692 ]

Apache Spark commented on SPARK-42514:
--------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40105

> Scala Client add partition transforms functions
> -----------------------------------------------
>
> Key: SPARK-42514
> URL: https://issues.apache.org/jira/browse/SPARK-42514
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Yang Jie
> Priority: Major
[jira] [Assigned] (SPARK-42514) Scala Client add partition transforms functions

[ https://issues.apache.org/jira/browse/SPARK-42514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42514:
------------------------------------

    Assignee: Apache Spark

> Scala Client add partition transforms functions
> -----------------------------------------------
>
> Key: SPARK-42514
> URL: https://issues.apache.org/jira/browse/SPARK-42514
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Yang Jie
> Assignee: Apache Spark
> Priority: Major
[jira] [Assigned] (SPARK-42510) Implement `DataFrame.mapInPandas`

[ https://issues.apache.org/jira/browse/SPARK-42510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42510:
------------------------------------

    Assignee: Apache Spark

> Implement `DataFrame.mapInPandas`
> ---------------------------------
>
> Key: SPARK-42510
> URL: https://issues.apache.org/jira/browse/SPARK-42510
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Assignee: Apache Spark
> Priority: Major
>
> Implement `DataFrame.mapInPandas`
[jira] [Commented] (SPARK-42510) Implement `DataFrame.mapInPandas`

[ https://issues.apache.org/jira/browse/SPARK-42510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691549#comment-17691549 ]

Apache Spark commented on SPARK-42510:
--------------------------------------

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40104

> Implement `DataFrame.mapInPandas`
> ---------------------------------
>
> Key: SPARK-42510
> URL: https://issues.apache.org/jira/browse/SPARK-42510
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Priority: Major
>
> Implement `DataFrame.mapInPandas`
[jira] [Assigned] (SPARK-42510) Implement `DataFrame.mapInPandas`

[ https://issues.apache.org/jira/browse/SPARK-42510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42510:
------------------------------------

    Assignee: (was: Apache Spark)

> Implement `DataFrame.mapInPandas`
> ---------------------------------
>
> Key: SPARK-42510
> URL: https://issues.apache.org/jira/browse/SPARK-42510
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 3.4.0
> Reporter: Xinrong Meng
> Priority: Major
>
> Implement `DataFrame.mapInPandas`
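`DataFrame.mapInPandas` passes the user function an iterator of pandas batches and expects back an iterator of batches, which Spark concatenates into the result. The contract can be sketched without pandas, using lists of dicts as stand-in batches (a hedged illustration of the semantics, not the Connect implementation):

```python
from itertools import chain

def map_in_batches(batches, func):
    """Apply `func` to an iterator of batches. `func` yields zero or more
    output batches; their rows are concatenated in order into one result."""
    return list(chain.from_iterable(func(iter(batches))))

def keep_even_keys(batch_iter):
    # The user function consumes batches lazily and yields filtered batches.
    for batch in batch_iter:
        yield [row for row in batch if row["key"] % 2 == 0]

batches = [[{"key": 1}, {"key": 2}], [{"key": 4}]]
result = map_in_batches(batches, keep_even_keys)  # [{'key': 2}, {'key': 4}]
```

In real usage the batches are `pandas.DataFrame`s and a result schema string is supplied, e.g. `df.mapInPandas(func, "key long")`.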
[jira] [Assigned] (SPARK-42508) Extract the common .ml classes to `mllib-common`

[ https://issues.apache.org/jira/browse/SPARK-42508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42508:
------------------------------------

    Assignee: (was: Apache Spark)

> Extract the common .ml classes to `mllib-common`
> ------------------------------------------------
>
> Key: SPARK-42508
> URL: https://issues.apache.org/jira/browse/SPARK-42508
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, ML
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Priority: Major
[jira] [Commented] (SPARK-42508) Extract the common .ml classes to `mllib-common`

[ https://issues.apache.org/jira/browse/SPARK-42508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691445#comment-17691445 ]

Apache Spark commented on SPARK-42508:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40097

> Extract the common .ml classes to `mllib-common`
> ------------------------------------------------
>
> Key: SPARK-42508
> URL: https://issues.apache.org/jira/browse/SPARK-42508
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, ML
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Priority: Major
[jira] [Assigned] (SPARK-42508) Extract the common .ml classes to `mllib-common`

[ https://issues.apache.org/jira/browse/SPARK-42508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42508:
------------------------------------

    Assignee: Apache Spark

> Extract the common .ml classes to `mllib-common`
> ------------------------------------------------
>
> Key: SPARK-42508
> URL: https://issues.apache.org/jira/browse/SPARK-42508
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, ML
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Apache Spark
> Priority: Major
[jira] [Commented] (SPARK-42507) Simplify ORC schema merging conflict error check

[ https://issues.apache.org/jira/browse/SPARK-42507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691420#comment-17691420 ]

Apache Spark commented on SPARK-42507:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40101

> Simplify ORC schema merging conflict error check
> ------------------------------------------------
>
> Key: SPARK-42507
> URL: https://issues.apache.org/jira/browse/SPARK-42507
> Project: Spark
> Issue Type: Test
> Components: SQL, Tests
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Priority: Major
[jira] [Assigned] (SPARK-42507) Simplify ORC schema merging conflict error check

[ https://issues.apache.org/jira/browse/SPARK-42507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42507:
------------------------------------

    Assignee: (was: Apache Spark)

> Simplify ORC schema merging conflict error check
> ------------------------------------------------
>
> Key: SPARK-42507
> URL: https://issues.apache.org/jira/browse/SPARK-42507
> Project: Spark
> Issue Type: Test
> Components: SQL, Tests
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Priority: Major
[jira] [Assigned] (SPARK-42507) Simplify ORC schema merging conflict error check

[ https://issues.apache.org/jira/browse/SPARK-42507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42507:
------------------------------------

    Assignee: Apache Spark

> Simplify ORC schema merging conflict error check
> ------------------------------------------------
>
> Key: SPARK-42507
> URL: https://issues.apache.org/jira/browse/SPARK-42507
> Project: Spark
> Issue Type: Test
> Components: SQL, Tests
> Affects Versions: 3.4.0
> Reporter: Dongjoon Hyun
> Assignee: Apache Spark
> Priority: Major
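The error check being simplified in SPARK-42507 guards schema merging: reading multiple ORC files fails when two files disagree on a column's type. The shape of such a conflict check can be sketched in a few lines (a hypothetical simplification over `{column: type}` dicts, not Spark's actual code):

```python
def merge_schemas(left, right):
    """Merge two {column: type} schemas, raising on a type conflict."""
    merged = dict(left)
    for col, typ in right.items():
        if col in merged and merged[col] != typ:
            # The merge error names the column and the two incompatible types.
            raise ValueError(
                f"Failed to merge incompatible data types for column `{col}`: "
                f"{merged[col]} and {typ}")
        merged[col] = typ
    return merged

merged = merge_schemas({"a": "int"}, {"b": "string"})  # {'a': 'int', 'b': 'string'}
```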
[jira] [Assigned] (SPARK-42506) Fix Sort's maxRowsPerPartition if maxRows does not exist

[ https://issues.apache.org/jira/browse/SPARK-42506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42506:
------------------------------------

    Assignee: (was: Apache Spark)

> Fix Sort's maxRowsPerPartition if maxRows does not exist
> --------------------------------------------------------
>
> Key: SPARK-42506
> URL: https://issues.apache.org/jira/browse/SPARK-42506
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Yuming Wang
> Priority: Major
[jira] [Assigned] (SPARK-42506) Fix Sort's maxRowsPerPartition if maxRows does not exist

[ https://issues.apache.org/jira/browse/SPARK-42506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42506:
------------------------------------

    Assignee: Apache Spark

> Fix Sort's maxRowsPerPartition if maxRows does not exist
> --------------------------------------------------------
>
> Key: SPARK-42506
> URL: https://issues.apache.org/jira/browse/SPARK-42506
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Yuming Wang
> Assignee: Apache Spark
> Priority: Major
[jira] [Commented] (SPARK-42506) Fix Sort's maxRowsPerPartition if maxRows does not exist

[ https://issues.apache.org/jira/browse/SPARK-42506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691414#comment-17691414 ]

Apache Spark commented on SPARK-42506:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/40100

> Fix Sort's maxRowsPerPartition if maxRows does not exist
> --------------------------------------------------------
>
> Key: SPARK-42506
> URL: https://issues.apache.org/jira/browse/SPARK-42506
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Yuming Wang
> Priority: Major
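The issue title concerns how a Sort node derives its per-partition row bound when the child's total row count (`maxRows`) is unknown. One plausible reading, sketched here purely as an illustration (this is an assumption about the intended semantics, not the actual patch):

```python
def sort_max_rows_per_partition(is_global, child_max_rows, child_max_rows_per_partition):
    """Derive a Sort node's maxRowsPerPartition (None = unknown)."""
    if is_global:
        # A global sort may range-partition all rows into one partition, so only
        # the child's *total* bound, if known, bounds a single output partition.
        return child_max_rows
    # A sort within partitions keeps the child's partitioning, so the child's
    # per-partition bound still applies even when maxRows is unknown.
    return child_max_rows_per_partition
```

The fix addresses the case where `maxRows` does not exist, so the bound should fall back rather than silently disappear.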
[jira] [Assigned] (SPARK-42504) NestedColumnAliasing support pruning adjacent projects

[ https://issues.apache.org/jira/browse/SPARK-42504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42504:
------------------------------------

    Assignee: Apache Spark

> NestedColumnAliasing support pruning adjacent projects
> ------------------------------------------------------
>
> Key: SPARK-42504
> URL: https://issues.apache.org/jira/browse/SPARK-42504
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: XiDuo You
> Assignee: Apache Spark
> Priority: Major
>
> CollapseProject won't combine adjacent Projects into one when, e.g., a non-cheap expression is accessed more than once by the upper Project. As a result, adjacent Project nodes can appear that NestedColumnAliasing does not support pruning.
[jira] [Commented] (SPARK-42504) NestedColumnAliasing support pruning adjacent projects

[ https://issues.apache.org/jira/browse/SPARK-42504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691354#comment-17691354 ]

Apache Spark commented on SPARK-42504:
--------------------------------------

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/40098

> NestedColumnAliasing support pruning adjacent projects
> ------------------------------------------------------
>
> Key: SPARK-42504
> URL: https://issues.apache.org/jira/browse/SPARK-42504
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: XiDuo You
> Priority: Major
>
> CollapseProject won't combine adjacent Projects into one when, e.g., a non-cheap expression is accessed more than once by the upper Project. As a result, adjacent Project nodes can appear that NestedColumnAliasing does not support pruning.
[jira] [Assigned] (SPARK-42504) NestedColumnAliasing support pruning adjacent projects

[ https://issues.apache.org/jira/browse/SPARK-42504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42504:
------------------------------------

    Assignee: (was: Apache Spark)

> NestedColumnAliasing support pruning adjacent projects
> ------------------------------------------------------
>
> Key: SPARK-42504
> URL: https://issues.apache.org/jira/browse/SPARK-42504
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: XiDuo You
> Priority: Major
>
> CollapseProject won't combine adjacent Projects into one when, e.g., a non-cheap expression is accessed more than once by the upper Project. As a result, adjacent Project nodes can appear that NestedColumnAliasing does not support pruning.
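The reason CollapseProject keeps the Projects separate is that inlining a non-cheap lower expression into a Project that references it twice would duplicate the work. That reference-count check can be sketched as (a hypothetical simplification, not the optimizer rule itself):

```python
from collections import Counter

def can_collapse(upper_refs, lower_aliases, cheap):
    """upper_refs: names referenced by the upper Project (with repetition).
    lower_aliases: names the lower Project defines.
    cheap: aliases whose defining expression is cheap to duplicate."""
    counts = Counter(upper_refs)
    # Collapsing is safe only if every non-cheap lower alias is used at most once.
    return all(counts[a] <= 1 or a in cheap for a in lower_aliases)
```

When `can_collapse` is false, the adjacent Projects survive, and the ticket's change is to let NestedColumnAliasing prune nested fields through that two-Project pattern anyway.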
[jira] [Commented] (SPARK-41823) DataFrame.join creating ambiguous column names

[ https://issues.apache.org/jira/browse/SPARK-41823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691290#comment-17691290 ]

Apache Spark commented on SPARK-41823:
--------------------------------------

User 'grundprinzip' has created a pull request for this issue:
https://github.com/apache/spark/pull/40094

> DataFrame.join creating ambiguous column names
> ----------------------------------------------
>
> Key: SPARK-41823
> URL: https://issues.apache.org/jira/browse/SPARK-41823
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Ruifeng Zheng
> Priority: Major
>
> {code:java}
> File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 254, in pyspark.sql.connect.dataframe.DataFrame.drop
> Failed example:
>     df.join(df2, df.name == df2.name, 'inner').drop('name').show()
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest ...>", line 1, in <module>
>         df.join(df2, df.name == df2.name, 'inner').drop('name').show()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 534, in show
>         print(self._show_string(n, truncate, vertical))
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 423, in _show_string
>         ).toPandas()
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", line 1031, in toPandas
>         return self._session.client.to_pandas(query)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 413, in to_pandas
>         return self._execute_and_fetch(req)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch
>         self._handle_error(rpc_error)
>       File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", line 619, in _handle_error
>         raise SparkConnectAnalysisException(
>     pyspark.sql.connect.client.SparkConnectAnalysisException: [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could be: [`name`, `name`].
>     Plan:
> {code}
[jira] [Commented] (SPARK-41812) DataFrame.join: ambiguous column
[ https://issues.apache.org/jira/browse/SPARK-41812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691287#comment-17691287 ] Apache Spark commented on SPARK-41812: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/40094 > DataFrame.join: ambiguous column > > > Key: SPARK-41812 > URL: https://issues.apache.org/jira/browse/SPARK-41812 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > {code} > File "/.../spark/python/pyspark/sql/connect/column.py", line 106, in > pyspark.sql.connect.column.Column.eqNullSafe > Failed example: > df1.join(df2, df1["value"] == df2["value"]).count() > Exception raised: > Traceback (most recent call last): > File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line > 1336, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df1.join(df2, df1["value"] == df2["value"]).count() > File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 151, in > count > pdd = self.agg(_invoke_function("count", lit(1))).toPandas() > File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1031, > in toPandas > return self._session.client.to_pandas(query) > File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in > to_pandas > return self._execute_and_fetch(req) > File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in > _execute_and_fetch > self._handle_error(rpc_error) > File "/.../spark/python/pyspark/sql/connect/client.py", line 619, in > _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [AMBIGUOUS_REFERENCE] Reference `value` is ambiguous, could be: [`value`, > `value`]. 
> {code}
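The AMBIGUOUS_REFERENCE failure above happens because the bare name `value` matches a column on both sides of the join. A minimal pure-Python sketch of this flavor of name resolution (the `resolve` helper and the schemas are invented for illustration; this is not Spark's analyzer code) shows why the bare name fails while a qualified one succeeds:

```python
def resolve(name, schemas):
    """Resolve a column reference against (alias, columns) input schemas.

    A bare name matching columns in more than one input is ambiguous,
    mirroring Spark's [AMBIGUOUS_REFERENCE] analysis error; a name
    qualified with the input's alias resolves uniquely.
    """
    if "." in name:
        alias, col = name.split(".", 1)
        matches = [(a, c) for a, cols in schemas for c in cols
                   if a == alias and c == col]
    else:
        matches = [(a, c) for a, cols in schemas for c in cols if c == name]
    if len(matches) > 1:
        raise ValueError(f"Reference `{name}` is ambiguous, "
                         f"could be: {[c for _, c in matches]}")
    if not matches:
        raise KeyError(f"Column `{name}` not found")
    return matches[0]

# Two join inputs that both expose a `value` column.
schemas = [("df1", ["id", "value"]), ("df2", ["value", "extra"])]
```

In PySpark the usual workarounds follow the same idea: alias the inputs (e.g. `df1.alias("a")`) or rename the clashing column before joining so every reference is unambiguous.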
[jira] [Commented] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691136#comment-17691136 ]

Apache Spark commented on SPARK-42500:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/40093

> ConstantPropagation support more cases
> --------------------------------------
>
>            Key: SPARK-42500
>            URL: https://issues.apache.org/jira/browse/SPARK-42500
>        Project: Spark
>     Issue Type: Improvement
>     Components: SQL
> Affects Versions: 3.5.0
>       Reporter: Yuming Wang
>       Priority: Major
[jira] [Assigned] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42500:
------------------------------------

    Assignee: Apache Spark

> ConstantPropagation support more cases
> --------------------------------------
>
>            Key: SPARK-42500
>            URL: https://issues.apache.org/jira/browse/SPARK-42500
>        Project: Spark
>     Issue Type: Improvement
>     Components: SQL
> Affects Versions: 3.5.0
>       Reporter: Yuming Wang
>       Assignee: Apache Spark
>       Priority: Major
[jira] [Assigned] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42500:
------------------------------------

    Assignee: (was: Apache Spark)

> ConstantPropagation support more cases
> --------------------------------------
>
>            Key: SPARK-42500
>            URL: https://issues.apache.org/jira/browse/SPARK-42500
>        Project: Spark
>     Issue Type: Improvement
>     Components: SQL
> Affects Versions: 3.5.0
>       Reporter: Yuming Wang
>       Priority: Major
[jira] [Commented] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691135#comment-17691135 ]

Apache Spark commented on SPARK-42500:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/40093

> ConstantPropagation support more cases
> --------------------------------------
>
>            Key: SPARK-42500
>            URL: https://issues.apache.org/jira/browse/SPARK-42500
>        Project: Spark
>     Issue Type: Improvement
>     Components: SQL
> Affects Versions: 3.5.0
>       Reporter: Yuming Wang
>       Priority: Major
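For context, the ConstantPropagation optimizer rule substitutes columns that a conjunct pins to a literal into sibling predicates, so e.g. `WHERE a = 1 AND b = a` becomes `WHERE a = 1 AND b = 1`. A toy sketch of the idea over plain tuples (the triple representation is invented for illustration; Spark's rule operates on Catalyst expression trees):

```python
def propagate_constants(conjuncts):
    """Propagate column = literal bindings through AND-ed predicates.

    Each predicate is a (lhs, op, rhs) triple whose operands are column
    names (str) or integer literals. When one conjunct pins a column to
    a literal, occurrences of that column on the right-hand side of the
    other conjuncts are replaced with the literal.
    """
    bindings = {lhs: rhs for lhs, op, rhs in conjuncts
                if op == "=" and isinstance(lhs, str) and isinstance(rhs, int)}
    rewritten = []
    for lhs, op, rhs in conjuncts:
        if isinstance(rhs, str) and rhs in bindings:
            rhs = bindings[rhs]  # e.g. b = a  ->  b = 1
        rewritten.append((lhs, op, rhs))
    return rewritten
```

"Support more cases" in the ticket title refers to extending which predicate shapes the real rule can rewrite; the sketch above only handles the simplest equality-to-literal form.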
[jira] [Commented] (SPARK-42498) reduce spark connect service retry time
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691056#comment-17691056 ]

Apache Spark commented on SPARK-42498:
--------------------------------------

User 'nija-at' has created a pull request for this issue:
https://github.com/apache/spark/pull/40066

> reduce spark connect service retry time
> ---------------------------------------
>
>            Key: SPARK-42498
>            URL: https://issues.apache.org/jira/browse/SPARK-42498
>        Project: Spark
>     Issue Type: Improvement
>     Components: Connect
> Affects Versions: 3.3.2
>       Reporter: Niranjan Jayakar
>       Priority: Major
>
> https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411
>
> Currently, 15 retries under the current backoff strategy keep the client in
> the retry loop for ~400 seconds in the worst case. This means applications and
> users of the Spark Connect client can hang for over 6 minutes with no response.
[jira] [Assigned] (SPARK-42498) reduce spark connect service retry time
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42498:
------------------------------------

    Assignee: Apache Spark

> reduce spark connect service retry time
> ---------------------------------------
>
>            Key: SPARK-42498
>            URL: https://issues.apache.org/jira/browse/SPARK-42498
>        Project: Spark
>     Issue Type: Improvement
>     Components: Connect
> Affects Versions: 3.3.2
>       Reporter: Niranjan Jayakar
>       Assignee: Apache Spark
>       Priority: Major
>
> https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411
>
> Currently, 15 retries under the current backoff strategy keep the client in
> the retry loop for ~400 seconds in the worst case. This means applications and
> users of the Spark Connect client can hang for over 6 minutes with no response.
[jira] [Assigned] (SPARK-42498) reduce spark connect service retry time
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42498:
------------------------------------

    Assignee: (was: Apache Spark)

> reduce spark connect service retry time
> ---------------------------------------
>
>            Key: SPARK-42498
>            URL: https://issues.apache.org/jira/browse/SPARK-42498
>        Project: Spark
>     Issue Type: Improvement
>     Components: Connect
> Affects Versions: 3.3.2
>       Reporter: Niranjan Jayakar
>       Priority: Major
>
> https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411
>
> Currently, 15 retries under the current backoff strategy keep the client in
> the retry loop for ~400 seconds in the worst case. This means applications and
> users of the Spark Connect client can hang for over 6 minutes with no response.
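The ~400 second figure comes from summing a capped exponential backoff across all retry attempts. A sketch of that worst-case arithmetic (the initial delay, multiplier, and cap below are assumed values for illustration, not the client's actual defaults):

```python
def worst_case_wait_s(retries, initial_ms=50, multiplier=4.0, cap_ms=30_000):
    """Total worst-case sleep, in seconds, across `retries` attempts of
    capped exponential backoff: each attempt waits min(backoff, cap),
    and the backoff grows by `multiplier` after every attempt."""
    total_ms, backoff = 0.0, float(initial_ms)
    for _ in range(retries):
        total_ms += min(backoff, cap_ms)
        backoff *= multiplier
    return total_ms / 1000.0
```

With these assumed parameters, 15 retries sleep for roughly 317 s before counting any RPC time, the same order of magnitude as the ~400 s the ticket reports; lowering the retry count, the multiplier, or the cap shrinks the bound directly.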
[jira] [Assigned] (SPARK-42475) Getting Started: Live Notebook for Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42475:
------------------------------------

    Assignee: (was: Apache Spark)

> Getting Started: Live Notebook for Spark Connect
> ------------------------------------------------
>
>            Key: SPARK-42475
>            URL: https://issues.apache.org/jira/browse/SPARK-42475
>        Project: Spark
>     Issue Type: Sub-task
>     Components: Connect
> Affects Versions: 3.4.0
>       Reporter: Haejoon Lee
>       Priority: Major
>
> It would be great to have a Live Notebook for Spark Connect in the [Getting
> Started|https://spark.apache.org/docs/latest/api/python/getting_started/index.html]
> section to help users get started with Spark Connect quickly.
[jira] [Commented] (SPARK-42475) Getting Started: Live Notebook for Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691016#comment-17691016 ]

Apache Spark commented on SPARK-42475:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/40092

> Getting Started: Live Notebook for Spark Connect
> ------------------------------------------------
>
>            Key: SPARK-42475
>            URL: https://issues.apache.org/jira/browse/SPARK-42475
>        Project: Spark
>     Issue Type: Sub-task
>     Components: Connect
> Affects Versions: 3.4.0
>       Reporter: Haejoon Lee
>       Priority: Major
>
> It would be great to have a Live Notebook for Spark Connect in the [Getting
> Started|https://spark.apache.org/docs/latest/api/python/getting_started/index.html]
> section to help users get started with Spark Connect quickly.
[jira] [Assigned] (SPARK-42475) Getting Started: Live Notebook for Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42475:
------------------------------------

    Assignee: Apache Spark

> Getting Started: Live Notebook for Spark Connect
> ------------------------------------------------
>
>            Key: SPARK-42475
>            URL: https://issues.apache.org/jira/browse/SPARK-42475
>        Project: Spark
>     Issue Type: Sub-task
>     Components: Connect
> Affects Versions: 3.4.0
>       Reporter: Haejoon Lee
>       Assignee: Apache Spark
>       Priority: Major
>
> It would be great to have a Live Notebook for Spark Connect in the [Getting
> Started|https://spark.apache.org/docs/latest/api/python/getting_started/index.html]
> section to help users get started with Spark Connect quickly.
[jira] [Assigned] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec
[ https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41952:
------------------------------------

    Assignee: Apache Spark

> Upgrade Parquet to fix off-heap memory leaks in Zstd codec
> ----------------------------------------------------------
>
>            Key: SPARK-41952
>            URL: https://issues.apache.org/jira/browse/SPARK-41952
>        Project: Spark
>     Issue Type: Bug
>     Components: Input/Output
> Affects Versions: 3.1.3, 3.3.1, 3.2.3
>       Reporter: Alexey Kudinkin
>       Assignee: Apache Spark
>       Priority: Critical
>
> Recently, a native memory leak was discovered in Parquet when it uses the Zstd
> decompressor from the luben/zstd-jni library (PARQUET-2160).
> This is very problematic, to the point where we can't use Parquet with Zstd:
> pervasive OOMs take down our executors and disrupt our jobs.
> Luckily, a fix addressing this has already landed in Parquet:
> https://github.com/apache/parquet-mr/pull/982
>
> Now we just need to make sure that:
> # An updated version of Parquet is released in a timely manner
> # Spark is upgraded to this new version in the upcoming release
[jira] [Commented] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec
[ https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691015#comment-17691015 ]

Apache Spark commented on SPARK-41952:
--------------------------------------

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/40091

> Upgrade Parquet to fix off-heap memory leaks in Zstd codec
> ----------------------------------------------------------
>
>            Key: SPARK-41952
>            URL: https://issues.apache.org/jira/browse/SPARK-41952
>        Project: Spark
>     Issue Type: Bug
>     Components: Input/Output
> Affects Versions: 3.1.3, 3.3.1, 3.2.3
>       Reporter: Alexey Kudinkin
>       Priority: Critical
>
> Recently, a native memory leak was discovered in Parquet when it uses the Zstd
> decompressor from the luben/zstd-jni library (PARQUET-2160).
> This is very problematic, to the point where we can't use Parquet with Zstd:
> pervasive OOMs take down our executors and disrupt our jobs.
> Luckily, a fix addressing this has already landed in Parquet:
> https://github.com/apache/parquet-mr/pull/982
>
> Now we just need to make sure that:
> # An updated version of Parquet is released in a timely manner
> # Spark is upgraded to this new version in the upcoming release
[jira] [Assigned] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec
[ https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41952:
------------------------------------

    Assignee: (was: Apache Spark)

> Upgrade Parquet to fix off-heap memory leaks in Zstd codec
> ----------------------------------------------------------
>
>            Key: SPARK-41952
>            URL: https://issues.apache.org/jira/browse/SPARK-41952
>        Project: Spark
>     Issue Type: Bug
>     Components: Input/Output
> Affects Versions: 3.1.3, 3.3.1, 3.2.3
>       Reporter: Alexey Kudinkin
>       Priority: Critical
>
> Recently, a native memory leak was discovered in Parquet when it uses the Zstd
> decompressor from the luben/zstd-jni library (PARQUET-2160).
> This is very problematic, to the point where we can't use Parquet with Zstd:
> pervasive OOMs take down our executors and disrupt our jobs.
> Luckily, a fix addressing this has already landed in Parquet:
> https://github.com/apache/parquet-mr/pull/982
>
> Now we just need to make sure that:
> # An updated version of Parquet is released in a timely manner
> # Spark is upgraded to this new version in the upcoming release
[jira] [Assigned] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8
[ https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41741:
------------------------------------

    Assignee: (was: Apache Spark)

> [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8
> --------------------------------------------------------------------------------
>
>            Key: SPARK-41741
>            URL: https://issues.apache.org/jira/browse/SPARK-41741
>        Project: Spark
>     Issue Type: Bug
>     Components: SQL
> Affects Versions: 3.4.0
>       Reporter: Jiale He
>       Priority: Major
>    Attachments: image-2022-12-28-18-00-00-861.png,
> image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png,
> image-2023-01-09-18-27-53-479.png,
> part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet
>
> Hello ~
>
> I found a problem, and there are two ways to work around it.
>
> The Parquet filter is pushed down. When querying with a LIKE '***%' statement,
> if the system default encoding is not UTF-8, it may cause an error.
>
> There are two ways to bypass this problem, as far as I know:
> 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8"
> 2. spark.sql.parquet.filterPushdown.string.startsWith=false
>
> The following is the information to reproduce this problem.
> The parquet sample file is in the attachment.
> {code:java}
> spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp")
> spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code}
> !image-2022-12-28-18-00-00-861.png|width=879,height=430!
>
> !image-2022-12-28-18-00-21-586.png|width=799,height=731!
>
> I think the correct code should be:
> {code:java}
> private val strToBinary =
>   Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code}
[jira] [Assigned] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8
[ https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41741:
------------------------------------

    Assignee: Apache Spark

> [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8
> --------------------------------------------------------------------------------
>
>            Key: SPARK-41741
>            URL: https://issues.apache.org/jira/browse/SPARK-41741
>        Project: Spark
>     Issue Type: Bug
>     Components: SQL
> Affects Versions: 3.4.0
>       Reporter: Jiale He
>       Assignee: Apache Spark
>       Priority: Major
>    Attachments: image-2022-12-28-18-00-00-861.png,
> image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png,
> image-2023-01-09-18-27-53-479.png,
> part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet
>
> Hello ~
>
> I found a problem, and there are two ways to work around it.
>
> The Parquet filter is pushed down. When querying with a LIKE '***%' statement,
> if the system default encoding is not UTF-8, it may cause an error.
>
> There are two ways to bypass this problem, as far as I know:
> 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8"
> 2. spark.sql.parquet.filterPushdown.string.startsWith=false
>
> The following is the information to reproduce this problem.
> The parquet sample file is in the attachment.
> {code:java}
> spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp")
> spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code}
> !image-2022-12-28-18-00-00-861.png|width=879,height=430!
>
> !image-2022-12-28-18-00-21-586.png|width=799,height=731!
>
> I think the correct code should be:
> {code:java}
> private val strToBinary =
>   Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code}
[jira] [Commented] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8
[ https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691010#comment-17691010 ]

Apache Spark commented on SPARK-41741:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/40090

> [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8
> --------------------------------------------------------------------------------
>
>            Key: SPARK-41741
>            URL: https://issues.apache.org/jira/browse/SPARK-41741
>        Project: Spark
>     Issue Type: Bug
>     Components: SQL
> Affects Versions: 3.4.0
>       Reporter: Jiale He
>       Priority: Major
>    Attachments: image-2022-12-28-18-00-00-861.png,
> image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png,
> image-2023-01-09-18-27-53-479.png,
> part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet
>
> Hello ~
>
> I found a problem, and there are two ways to work around it.
>
> The Parquet filter is pushed down. When querying with a LIKE '***%' statement,
> if the system default encoding is not UTF-8, it may cause an error.
>
> There are two ways to bypass this problem, as far as I know:
> 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8"
> 2. spark.sql.parquet.filterPushdown.string.startsWith=false
>
> The following is the information to reproduce this problem.
> The parquet sample file is in the attachment.
> {code:java}
> spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp")
> spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code}
> !image-2022-12-28-18-00-00-861.png|width=879,height=430!
>
> !image-2022-12-28-18-00-21-586.png|width=799,height=731!
>
> I think the correct code should be:
> {code:java}
> private val strToBinary =
>   Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code}
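The root cause is that Java's `String.getBytes()` without an explicit charset uses the platform default encoding, so the prefix bytes pushed into the Parquet filter only match the file's UTF-8 data when that default happens to be UTF-8. A short Python illustration of the byte-level mismatch (GBK stands in here for an arbitrary non-UTF-8 platform default):

```python
prefix = "啦啦乐乐"  # the LIKE prefix from the reproducer above

# Parquet stores strings as UTF-8, so a byte-wise startsWith comparison
# must be computed against the UTF-8 encoding of the prefix.
utf8_bytes = prefix.encode("utf-8")

# Encoding the same prefix with a non-UTF-8 default yields different
# bytes, so the pushed-down comparison silently misses matching rows.
gbk_bytes = prefix.encode("gbk")
```

This is why both workarounds help: pinning `-Dfile.encoding=UTF-8` makes the default encoding match the data, and disabling the startsWith pushdown avoids the byte comparison entirely, while the proposed fix passes `StandardCharsets.UTF_8` explicitly.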
[jira] [Commented] (SPARK-42495) Scala Client: Add 2nd batch of functions
[ https://issues.apache.org/jira/browse/SPARK-42495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691001#comment-17691001 ]

Apache Spark commented on SPARK-42495:
--------------------------------------

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/40089

> Scala Client: Add 2nd batch of functions
> ----------------------------------------
>
>            Key: SPARK-42495
>            URL: https://issues.apache.org/jira/browse/SPARK-42495
>        Project: Spark
>     Issue Type: Task
>     Components: Connect
> Affects Versions: 3.4.0
>       Reporter: Herman van Hövell
>       Assignee: Herman van Hövell
>       Priority: Major