[jira] [Assigned] (SPARK-37904) Improve RebalancePartitions in rules of Optimizer
[ https://issues.apache.org/jira/browse/SPARK-37904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37904: Assignee: Apache Spark > Improve RebalancePartitions in rules of Optimizer > > Key: SPARK-37904 > URL: https://issues.apache.org/jira/browse/SPARK-37904 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.3.0 > Reporter: XiDuo You > Assignee: Apache Spark > Priority: Major > > After SPARK-37267, we support optimizing rebalance partitions anywhere in the plan rather than only at the root node. So it makes sense to also let `RebalancePartitions` participate in all rules of the Optimizer, as `Repartition` and `RepartitionByExpression` do. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37904) Improve RebalancePartitions in rules of Optimizer
[ https://issues.apache.org/jira/browse/SPARK-37904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476006#comment-17476006 ] Apache Spark commented on SPARK-37904: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/35208
[jira] [Assigned] (SPARK-37904) Improve RebalancePartitions in rules of Optimizer
[ https://issues.apache.org/jira/browse/SPARK-37904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37904: Assignee: (was: Apache Spark)
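The idea behind SPARK-37904 can be illustrated with a toy optimizer sketch (plain Python with hypothetical node names, not Spark's actual Catalyst classes or rules): a rule that already collapses directly nested Repartition nodes is extended to treat RebalancePartitions the same way, since only the outermost shuffle decides the final partitioning.

```python
from dataclasses import dataclass

# Toy plan nodes (hypothetical names; not Spark's real Catalyst classes).
@dataclass
class Scan:
    table: str

@dataclass
class Repartition:
    child: object

@dataclass
class RebalancePartitions:
    child: object

# Node types the rule treats as "pure shuffles" that can be collapsed.
SHUFFLE_NODES = (Repartition, RebalancePartitions)

def collapse_shuffles(plan):
    """Remove a shuffle node whose child is also a shuffle node:
    only the outermost shuffle decides the final partitioning."""
    if isinstance(plan, SHUFFLE_NODES):
        child = collapse_shuffles(plan.child)
        # Skip over any directly nested shuffle node.
        while isinstance(child, SHUFFLE_NODES):
            child = child.child
        return type(plan)(child)
    return plan
```

With this rule, `RebalancePartitions(Repartition(Scan("t")))` collapses to `RebalancePartitions(Scan("t"))`, the same simplification `Repartition`-only plans already get.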
[jira] [Assigned] (SPARK-36967) Report accurate shuffle block size if its skewed
[ https://issues.apache.org/jira/browse/SPARK-36967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros reassigned SPARK-36967: -- Assignee: Wan Kun (was: Apache Spark) > Report accurate shuffle block size if its skewed > > Key: SPARK-36967 > URL: https://issues.apache.org/jira/browse/SPARK-36967 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.3.0 > Reporter: Wan Kun > Assignee: Wan Kun > Priority: Major > Fix For: 3.3.0 > > Attachments: map_status.png, map_status2.png > > > Currently a map task reports an accurate shuffle block size only if the block size is greater than "spark.shuffle.accurateBlockThreshold" (100M by default). But if there are a large number of map tasks and their shuffle block sizes are all smaller than "spark.shuffle.accurateBlockThreshold", data skew may go unrecognized. > For example, with 1 map task and 1 reduce task, where each map task creates 50M shuffle blocks for reduce 0 and 10K shuffle blocks for the remaining reduce tasks, reduce 0 is data-skewed, but the stats of this plan do not show it. > !map_status2.png! > I think we need to judge at runtime whether a shuffle block is huge and needs to be reported accurately.
[jira] [Resolved] (SPARK-36967) Report accurate shuffle block size if its skewed
[ https://issues.apache.org/jira/browse/SPARK-36967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros resolved SPARK-36967. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34234 [https://github.com/apache/spark/pull/34234]
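The mechanism behind SPARK-36967 can be sketched with a simplified model of shuffle block size reporting (plain Python; `compress_block_sizes` is a hypothetical helper, not Spark's actual MapStatus code): blocks above the accuracy threshold keep their exact size, while the rest are summarized by the average of the small blocks, so skew among blocks that are all below the threshold is lost.

```python
def compress_block_sizes(sizes, accurate_threshold=100 * 1024 * 1024):
    """Simplified model of shuffle block size reporting: blocks above
    the threshold keep their exact size; the rest are summarized by
    the average of the small blocks (skew among them is lost)."""
    small = [s for s in sizes if s <= accurate_threshold]
    avg_small = sum(small) // len(small) if small else 0
    return [s if s > accurate_threshold else avg_small for s in sizes]
```

With one 50M block among many 10K blocks, every block is below the default 100M threshold, so after compression all blocks report the same average and the skewed block is invisible, which is exactly the situation the issue describes.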
[jira] [Assigned] (SPARK-37907) StaticInvoke should support ConstantFolding
[ https://issues.apache.org/jira/browse/SPARK-37907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37907: Assignee: (was: Apache Spark) > StaticInvoke should support ConstantFolding > > Key: SPARK-37907 > URL: https://issues.apache.org/jira/browse/SPARK-37907 > Project: Spark > Issue Type: Task > Components: SQL > Affects Versions: 3.2.0 > Reporter: angerszhu > Priority: Major > > StaticInvoke does not implement `foldable`; it should support it.
[jira] [Assigned] (SPARK-37907) StaticInvoke should support ConstantFolding
[ https://issues.apache.org/jira/browse/SPARK-37907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37907: Assignee: Apache Spark
[jira] [Commented] (SPARK-37907) StaticInvoke should support ConstantFolding
[ https://issues.apache.org/jira/browse/SPARK-37907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475994#comment-17475994 ] Apache Spark commented on SPARK-37907: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/35207
[jira] [Updated] (SPARK-37873) SQL Syntax links are broken
[ https://issues.apache.org/jira/browse/SPARK-37873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Ott updated SPARK-37873: - Attachment: Screenshot 2022-01-14 at 08.07.24.png > SQL Syntax links are broken > > Key: SPARK-37873 > URL: https://issues.apache.org/jira/browse/SPARK-37873 > Project: Spark > Issue Type: Bug > Components: Documentation > Affects Versions: 3.2.0 > Reporter: Alex Ott > Priority: Major > Attachments: Screenshot 2022-01-14 at 08.07.24.png > > > SQL Syntax links at [https://spark.apache.org/docs/latest/sql-ref.html] are broken
[jira] [Commented] (SPARK-37873) SQL Syntax links are broken
[ https://issues.apache.org/jira/browse/SPARK-37873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475993#comment-17475993 ] Alex Ott commented on SPARK-37873: -- If you click on any of: * [DDL Statements|https://spark.apache.org/docs/latest/sql-ref-syntax-ddl.html] * [DML Statements|https://spark.apache.org/docs/latest/sql-ref-syntax-dml.html] * [Data Retrieval Statements|https://spark.apache.org/docs/latest/sql-ref-syntax-qry.html] * [Auxiliary Statements|https://spark.apache.org/docs/latest/sql-ref-syntax-aux.html] it shows a "file not found" error (see image) !Screenshot 2022-01-14 at 08.07.24.png!
[jira] [Commented] (SPARK-37907) StaticInvoke should support ConstantFolding
[ https://issues.apache.org/jira/browse/SPARK-37907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475980#comment-17475980 ] angerszhu commented on SPARK-37907: --- Will raise a PR soon
[jira] [Created] (SPARK-37907) StaticInvoke should support ConstantFolding
angerszhu created SPARK-37907: - Summary: StaticInvoke should support ConstantFolding Key: SPARK-37907 URL: https://issues.apache.org/jira/browse/SPARK-37907 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu StaticInvoke does not implement `foldable`; it should support it.
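What "support ConstantFolding" means here can be sketched with a toy expression tree (plain Python with hypothetical class names, not Spark's Catalyst): a StaticInvoke-like node is foldable exactly when all its arguments are foldable, and a constant-folding pass can then evaluate it once at optimization time and replace it with a literal.

```python
from dataclasses import dataclass

# Toy expression tree (hypothetical names; not Spark's Catalyst classes).
@dataclass
class Literal:
    value: object
    foldable = True

@dataclass
class Column:
    name: str
    foldable = False

@dataclass
class StaticCall:
    """Models a StaticInvoke-like call of a deterministic function."""
    fn: object
    args: list

    @property
    def foldable(self):
        # Foldable when every argument is foldable: the call can be
        # evaluated once at optimization time instead of per row.
        return all(a.foldable for a in self.args)

def constant_fold(expr):
    """Replace any foldable call with a Literal of its value."""
    if isinstance(expr, StaticCall):
        expr = StaticCall(expr.fn, [constant_fold(a) for a in expr.args])
        if expr.foldable:
            return Literal(expr.fn(*[a.value for a in expr.args]))
    return expr
```

A call over only literals folds to a literal; a call that references a column stays in the plan, which is the behavior the issue asks `StaticInvoke` to expose.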
[jira] [Commented] (SPARK-37906) spark-sql should not pass last simple comment to backend
[ https://issues.apache.org/jira/browse/SPARK-37906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475974#comment-17475974 ] Apache Spark commented on SPARK-37906: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/35206 > spark-sql should not pass last simple comment to backend > > Key: SPARK-37906 > URL: https://issues.apache.org/jira/browse/SPARK-37906 > Project: Spark > Issue Type: Task > Components: SQL > Affects Versions: 3.2.0 > Reporter: angerszhu > Priority: Major > > spark-sql should not pass the last simple comment to the backend, e.g. > ``` > SELECT 1; -- comment > ```
[jira] [Assigned] (SPARK-37906) spark-sql should not pass last simple comment to backend
[ https://issues.apache.org/jira/browse/SPARK-37906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37906: Assignee: Apache Spark
[jira] [Assigned] (SPARK-37906) spark-sql should not pass last simple comment to backend
[ https://issues.apache.org/jira/browse/SPARK-37906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37906: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-37873) SQL Syntax links are broken
[ https://issues.apache.org/jira/browse/SPARK-37873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475969#comment-17475969 ] Hyukjin Kwon commented on SPARK-37873: -- [~alexott] which syntax is broken?
[jira] [Commented] (SPARK-37872) [SQL] Some classes are move from org.codehaus.janino:janino to org.codehaus.janino:common-compiler after version 3.1.x
[ https://issues.apache.org/jira/browse/SPARK-37872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475971#comment-17475971 ] Hyukjin Kwon commented on SPARK-37872: -- Spark 2.4 is EOL. Is it still valid for Spark 3+? > [SQL] Some classes are move from org.codehaus.janino:janino to org.codehaus.janino:common-compiler after version 3.1.x > > Key: SPARK-37872 > URL: https://issues.apache.org/jira/browse/SPARK-37872 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.5 > Reporter: Jin Shen > Priority: Major > > Here is the code: > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L32] > ByteArrayClassLoader and InternalCompilerException were moved to org.codehaus.janino:commons-compiler: > [https://github.com/janino-compiler/janino/blob/3.1.6/commons-compiler/src/main/java/org/codehaus/commons/compiler/util/reflect/ByteArrayClassLoader.java] > [https://github.com/janino-compiler/janino/blob/3.1.6/commons-compiler/src/main/java/org/codehaus/commons/compiler/InternalCompilerException.java] > The last working version of janino is 3.0.16, but it is out of date. > Can we change this and upgrade to newer versions of janino and commons-compiler?
[jira] [Resolved] (SPARK-37874) Link to Pandas UDF documentation is broken
[ https://issues.apache.org/jira/browse/SPARK-37874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37874. -- Resolution: Fixed > Link to Pandas UDF documentation is broken > > Key: SPARK-37874 > URL: https://issues.apache.org/jira/browse/SPARK-37874 > Project: Spark > Issue Type: Bug > Components: Documentation > Affects Versions: 3.2.0 > Reporter: Alex Ott > Priority: Major > > Link at [https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html] is broken
[jira] [Commented] (SPARK-37874) Link to Pandas UDF documentation is broken
[ https://issues.apache.org/jira/browse/SPARK-37874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475968#comment-17475968 ] Hyukjin Kwon commented on SPARK-37874: -- Fixed in https://github.com/apache/spark/pull/34475
[jira] [Updated] (SPARK-37874) Link to Pandas UDF documentation is broken
[ https://issues.apache.org/jira/browse/SPARK-37874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37874: - Fix Version/s: 3.2.1, 3.3.0
[jira] [Commented] (SPARK-37882) pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values
[ https://issues.apache.org/jira/browse/SPARK-37882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475967#comment-17475967 ] Hyukjin Kwon commented on SPARK-37882: -- [~mattvan83] mind providing self-contained reproducer? > pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values > - > > Key: SPARK-37882 > URL: https://issues.apache.org/jira/browse/SPARK-37882 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 > Environment: Ubuntu 18.04 >Reporter: Matthieu Vanhoutte >Priority: Major > > Hello, > When trying to convert a pandas dataframe > {code:java} > ss_corpus_dataframe{code} > (containing one column with two-dimensional numpy array) into a > pandas-on-spark dataframe with the following code: > {code:java} > df = ps.from_pandas(ss_corpus_dataframe){code} > I got the following error: > {code:java} > Traceback (most recent call last): > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", > line 375, in run_asgi > result = await app(self.scope, self.receive, self.send) > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", > line 75, in __call__ > return await self.app(scope, receive, send) > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/uvicorn/middleware/message_logger.py", > line 82, in __call__ > raise exc from None > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/uvicorn/middleware/message_logger.py", > line 78, in __call__ > await self.app(scope, inner_receive, inner_send) > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/fastapi/applications.py", > line 208, in __call__ > await super().__call__(scope, receive, send) > File > 
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/applications.py", > line 112, in __call__ > await self.middleware_stack(scope, receive, send) > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/middleware/errors.py", > line 181, in __call__ > raise exc > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/middleware/errors.py", > line 159, in __call__ > await self.app(scope, receive, _send) > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/exceptions.py", > line 82, in __call__ > raise exc > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/exceptions.py", > line 71, in __call__ > await self.app(scope, receive, sender) > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/routing.py", > line 656, in __call__ > await route.handle(scope, receive, send) > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/routing.py", > line 259, in handle > await self.app(scope, receive, send) > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/routing.py", > line 61, in app > response = await func(request) > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/fastapi/routing.py", > line 226, in app > raw_response = await run_endpoint_function( > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/fastapi/routing.py", > line 159, in run_endpoint_function > return await dependant.call(**values) > File "./app/routers/semantic_searches.py", line 60, in > create_semantic_search > date_time_sem_search, clean_query, output_dict, error_code = await > apply_semantic_search_async(query=query, > 
api_sent_embed_url=settings.api_sent_embed_address, > ss_corpus_dataframe=ss_corpus_dataframe.dataframe, id_matrices=id_matrices, > top_k=75, similarity_score_thresh=0.5) > File "./app/backend/semantic_search/sts_tf_semantic_search.py", line 134, > in apply_semantic_search_async > df = ps.from_pandas(ss_corpus_dataframe) > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/pyspark/pandas/namespace.py", > line 143, in from_pandas > return DataFrame(pobj) > File > "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/pyspark/pandas/frame.py", > line 520, in __init__ > internal = InternalFrame.from_pandas(pd
[jira] [Resolved] (SPARK-37883) log4j update to 2.17.1 in spark-core 3.2
[ https://issues.apache.org/jira/browse/SPARK-37883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37883. -- Resolution: Won't Fix > log4j update to 2.17.1 in spark-core 3.2 > > Key: SPARK-37883 > URL: https://issues.apache.org/jira/browse/SPARK-37883 > Project: Spark > Issue Type: Bug > Components: Security, Spark Core > Affects Versions: 3.2.0 > Reporter: Setu Agrawal > Priority: Major > Labels: https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.12/3.2.0 > > We are using the spark-core jar file, as below: > libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.0" > According to the Maven repository it uses an older log4j version, which needs to be updated to the latest (2.17.1) to fix a security vulnerability. Please help us: how can we get an updated version of spark-core that uses the latest log4j? > Thanks
[jira] [Commented] (SPARK-37883) log4j update to 2.17.1 in spark-core 3.2
[ https://issues.apache.org/jira/browse/SPARK-37883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475966#comment-17475966 ] Hyukjin Kwon commented on SPARK-37883: -- You should upgrade to the latest Spark 3.3 when it's released.
[jira] [Commented] (SPARK-37883) log4j update to 2.17.1 in spark-core 3.2
[ https://issues.apache.org/jira/browse/SPARK-37883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475965#comment-17475965 ] Hyukjin Kwon commented on SPARK-37883: -- We upgraded log4j in the latest master. Old Spark 3.2 uses log4j 1, which by default is virtually unaffected by the security issue.
[jira] [Updated] (SPARK-37883) log4j update to 2.17.1 in spark-core 3.2
[ https://issues.apache.org/jira/browse/SPARK-37883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37883: - Priority: Major (was: Blocker)
[jira] [Created] (SPARK-37906) spark-sql should not pass last simple comment to backend
angerszhu created SPARK-37906: - Summary: spark-sql should not pass last simple comment to backend Key: SPARK-37906 URL: https://issues.apache.org/jira/browse/SPARK-37906 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu spark-sql should not pass the last simple comment to the backend, e.g. ``` SELECT 1; -- comment ```
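The fix described above amounts to trimming the trailing `--` line comment before the statement is sent to the backend. A minimal sketch (plain Python; `strip_trailing_comment` is a hypothetical helper, not the code in the actual PR) also has to ignore `--` sequences inside single-quoted literals, which is what makes the naive string-split approach wrong:

```python
def strip_trailing_comment(sql: str) -> str:
    """Drop a trailing '-- ...' line comment from a SQL statement,
    while ignoring '--' sequences inside single-quoted literals.
    Simplified sketch: real SQL lexing also handles bracketed
    comments, escapes, and quoted identifiers."""
    in_quote = False
    i = 0
    while i < len(sql):
        c = sql[i]
        if c == "'":
            in_quote = not in_quote
        elif not in_quote and sql.startswith("--", i):
            # Comment runs to the end of the line; keep what follows.
            end = sql.find("\n", i)
            rest = "" if end == -1 else sql[end:]
            return (sql[:i] + rest).rstrip()
        i += 1
    return sql.rstrip()
```

With this, `SELECT 1; -- comment` becomes `SELECT 1;`, while `SELECT '--not a comment'` is left untouched.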
[jira] [Assigned] (SPARK-37905) Make `merge_spark_pr.py` set primary author from the first commit in case of ties
[ https://issues.apache.org/jira/browse/SPARK-37905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37905: - Assignee: Dongjoon Hyun > Make `merge_spark_pr.py` set primary author from the first commit in case of ties > > Key: SPARK-37905 > URL: https://issues.apache.org/jira/browse/SPARK-37905 > Project: Spark > Issue Type: Bug > Components: Project Infra > Affects Versions: 3.3.0 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun > Priority: Major
[jira] [Resolved] (SPARK-37905) Make `merge_spark_pr.py` set primary author from the first commit in case of ties
[ https://issues.apache.org/jira/browse/SPARK-37905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37905. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35205 [https://github.com/apache/spark/pull/35205]
[jira] [Assigned] (SPARK-37905) Make `merge_spark_pr.py` set primary author from the first commit in case of ties
[ https://issues.apache.org/jira/browse/SPARK-37905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37905: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-37905) Make `merge_spark_pr.py` set primary author from the first commit in case of ties
[ https://issues.apache.org/jira/browse/SPARK-37905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37905: Assignee: Apache Spark > Make `merge_spark_pr.py` set primary author from the first commit in case of > ties > - > > Key: SPARK-37905 > URL: https://issues.apache.org/jira/browse/SPARK-37905 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37905) Make `merge_spark_pr.py` set primary author from the first commit in case of ties
[ https://issues.apache.org/jira/browse/SPARK-37905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475923#comment-17475923 ] Apache Spark commented on SPARK-37905: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/35205 > Make `merge_spark_pr.py` set primary author from the first commit in case of > ties > - > > Key: SPARK-37905 > URL: https://issues.apache.org/jira/browse/SPARK-37905 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37905) Make `merge_spark_pr.py` set primary author from the first commit in case of ties
[ https://issues.apache.org/jira/browse/SPARK-37905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37905: -- Summary: Make `merge_spark_pr.py` set primary author from the first commit in case of ties (was: Fix `merge_spark_pr.py` to consider the first commit author as the primary author in case of ties) > Make `merge_spark_pr.py` set primary author from the first commit in case of > ties > - > > Key: SPARK-37905 > URL: https://issues.apache.org/jira/browse/SPARK-37905 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37905) Fix `merge_spark_pr.py` to consider the first commit author as the primary author in case of ties
Dongjoon Hyun created SPARK-37905: - Summary: Fix `merge_spark_pr.py` to consider the first commit author as the primary author in case of ties Key: SPARK-37905 URL: https://issues.apache.org/jira/browse/SPARK-37905 Project: Spark Issue Type: Bug Components: Project Infra Affects Versions: 3.3.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
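The tie-breaking behavior in the summary can be sketched in Python (a hypothetical mirror of the selection logic, not the actual `merge_spark_pr.py` code):

```python
from collections import Counter

def primary_author(commit_authors):
    # most-frequent author wins; on a tie, the author of the first commit
    # in the list is preferred (the behavior this issue asks for)
    counts = Counter(commit_authors)
    most = max(counts.values())
    for author in commit_authors:  # scan in commit order to break ties
        if counts[author] == most:
            return author

# alice and bob each have two commits; alice committed first, so she wins
assert primary_author(["alice", "bob", "bob", "alice"]) == "alice"
```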
[jira] [Assigned] (SPARK-37880) Upgrade Scala to 2.13.8
[ https://issues.apache.org/jira/browse/SPARK-37880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37880: - Assignee: Yang Jie > Upgrade Scala to 2.13.8 > --- > > Key: SPARK-37880 > URL: https://issues.apache.org/jira/browse/SPARK-37880 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > Scala 2.13.8 has already been tagged: > [https://github.com/scala/scala/releases/tag/v2.13.8] > > https://contributors.scala-lang.org/t/scala-2-13-8-release-planning/5487 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37880) Upgrade Scala to 2.13.8
[ https://issues.apache.org/jira/browse/SPARK-37880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37880. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35181 [https://github.com/apache/spark/pull/35181] > Upgrade Scala to 2.13.8 > --- > > Key: SPARK-37880 > URL: https://issues.apache.org/jira/browse/SPARK-37880 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.3.0 > > > Scala 2.13.8 has already been tagged: > [https://github.com/scala/scala/releases/tag/v2.13.8] > > https://contributors.scala-lang.org/t/scala-2-13-8-release-planning/5487 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37878) Migrate SHOW CREATE TABLE to use v2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37878: Assignee: (was: Apache Spark) > Migrate SHOW CREATE TABLE to use v2 command by default > -- > > Key: SPARK-37878 > URL: https://issues.apache.org/jira/browse/SPARK-37878 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Major > Fix For: 3.3.0 > > > Migrate SHOW CREATE TABLE to use v2 command by default -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37878) Migrate SHOW CREATE TABLE to use v2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475911#comment-17475911 ] Apache Spark commented on SPARK-37878: -- User 'Peng-Lei' has created a pull request for this issue: https://github.com/apache/spark/pull/35204 > Migrate SHOW CREATE TABLE to use v2 command by default > -- > > Key: SPARK-37878 > URL: https://issues.apache.org/jira/browse/SPARK-37878 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Major > Fix For: 3.3.0 > > > Migrate SHOW CREATE TABLE to use v2 command by default -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37878) Migrate SHOW CREATE TABLE to use v2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37878: Assignee: Apache Spark > Migrate SHOW CREATE TABLE to use v2 command by default > -- > > Key: SPARK-37878 > URL: https://issues.apache.org/jira/browse/SPARK-37878 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Assignee: Apache Spark >Priority: Major > Fix For: 3.3.0 > > > Migrate SHOW CREATE TABLE to use v2 command by default -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37893) Fix flaky test: AdaptiveQueryExecSuite with Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-37893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37893. --- Fix Version/s: 3.3.0 Assignee: Yang Jie Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/35190 > Fix flaky test: AdaptiveQueryExecSuite with Scala 2.13 > -- > > Key: SPARK-37893 > URL: https://issues.apache.org/jira/browse/SPARK-37893 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > > When running `AdaptiveQueryExecSuite` under Maven with Scala 2.13, the following > exception occurs with a very small probability: > {code:java} > AdaptiveQueryExecSuite > - Logging plan changes for AQE *** FAILED *** > java.util.ConcurrentModificationException: mutation occurred during > iteration > at > scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43) > at > scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47) > at > scala.collection.StrictOptimizedIterableOps.filterImpl(StrictOptimizedIterableOps.scala:225) > at > scala.collection.StrictOptimizedIterableOps.filterImpl$(StrictOptimizedIterableOps.scala:222) > at scala.collection.mutable.ArrayBuffer.filterImpl(ArrayBuffer.scala:43) > at > scala.collection.StrictOptimizedIterableOps.filterNot(StrictOptimizedIterableOps.scala:220) > at > scala.collection.StrictOptimizedIterableOps.filterNot$(StrictOptimizedIterableOps.scala:220) > at scala.collection.mutable.ArrayBuffer.filterNot(ArrayBuffer.scala:43) > at > org.apache.spark.SparkFunSuite$LogAppender.loggingEvents(SparkFunSuite.scala:288) > at > org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.$anonfun$new$152(AdaptiveQueryExecSuite.scal{code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
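The quoted failure — a collection mutated while it is being iterated — and the copy-before-iterate remedy can be illustrated with a small Python analogue (the actual fix lives in Scala's `SparkFunSuite.LogAppender`; this sketch only shows the hazard class, using CPython's equivalent runtime check on dicts):

```python
# Mutate a mapping while iterating it: CPython detects this at runtime,
# much as Scala 2.13's MutationTracker does in the stack trace above.
d = {"a": 1, "b": 2}
mutated_during_iteration = False
try:
    for key in d:
        d["c"] = 3  # mutation during iteration
except RuntimeError:
    mutated_during_iteration = True

# The remedy mirrors the shape of the SPARK-37893 fix: iterate a snapshot.
for key in list(d):
    d[key + "_copy"] = d[key]
```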
[jira] [Assigned] (SPARK-37859) SQL tables created with JDBC with Spark 3.1 are not readable with 3.2
[ https://issues.apache.org/jira/browse/SPARK-37859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37859: --- Assignee: Karen Feng > SQL tables created with JDBC with Spark 3.1 are not readable with 3.2 > - > > Key: SPARK-37859 > URL: https://issues.apache.org/jira/browse/SPARK-37859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Assignee: Karen Feng >Priority: Major > > In > https://github.com/apache/spark/blob/bd24b4884b804fc85a083f82b864823851d5980c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L312, > a new metadata field is added during reading. As we do a full comparison of > the user-provided schema and the actual schema in > https://github.com/apache/spark/blob/bd24b4884b804fc85a083f82b864823851d5980c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L356, > resolution fails if a table created with Spark 3.1 is read with Spark 3.2. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37859) SQL tables created with JDBC with Spark 3.1 are not readable with 3.2
[ https://issues.apache.org/jira/browse/SPARK-37859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37859. - Fix Version/s: 3.3.0 3.2.1 Resolution: Fixed Issue resolved by pull request 35158 [https://github.com/apache/spark/pull/35158] > SQL tables created with JDBC with Spark 3.1 are not readable with 3.2 > - > > Key: SPARK-37859 > URL: https://issues.apache.org/jira/browse/SPARK-37859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Assignee: Karen Feng >Priority: Major > Fix For: 3.3.0, 3.2.1 > > > In > https://github.com/apache/spark/blob/bd24b4884b804fc85a083f82b864823851d5980c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L312, > a new metadata field is added during reading. As we do a full comparison of > the user-provided schema and the actual schema in > https://github.com/apache/spark/blob/bd24b4884b804fc85a083f82b864823851d5980c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L356, > resolution fails if a table created with Spark 3.1 is read with Spark 3.2. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
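The root cause described above is a strict equality check that also compares per-field metadata. A toy Python sketch (plain dicts standing in for Spark's `StructType`; field names are hypothetical) shows why comparing only name and type would tolerate the extra metadata:

```python
def fields_compatible(expected, actual):
    # compare only field name and type; strict equality would also compare
    # the metadata attached by the newer reader and therefore fail
    def key(fields):
        return [(f["name"], f["type"]) for f in fields]
    return key(expected) == key(actual)

# schema recorded by Spark 3.1 vs. schema produced by the 3.2 JDBC reader
saved = [{"name": "id", "type": "long", "metadata": {}}]
read = [{"name": "id", "type": "long", "metadata": {"scale": 0}}]

assert saved != read                   # strict comparison: mismatch
assert fields_compatible(saved, read)  # name/type comparison: compatible
```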
[jira] [Created] (SPARK-37904) Improve RebalancePartitions in rules of Optimizer
XiDuo You created SPARK-37904: - Summary: Improve RebalancePartitions in rules of Optimizer Key: SPARK-37904 URL: https://issues.apache.org/jira/browse/SPARK-37904 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: XiDuo You After SPARK-37267, we support optimizing rebalance partitions anywhere in the plan rather than only at the root node. So it makes sense to also let `RebalancePartitions` work in all Optimizer rules, as `Repartition` and `RepartitionByExpression` do. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
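As a toy illustration of what letting a node participate in Optimizer rules means, the sketch below collapses a redundant shuffle beneath a rebalance node (a hypothetical miniature of rules like CollapseRepartition; Spark's real rules operate on Catalyst plans, not tuples):

```python
def collapse_shuffles(plan):
    # plan: nested tuples (op, child); an outer shuffle-introducing op
    # makes an adjacent inner one redundant, so keep only the outermost
    shuffles = {"rebalance", "repartition"}
    op, child = plan
    if op in shuffles and isinstance(child, tuple) and child[0] in shuffles:
        return collapse_shuffles((op, child[1]))
    if isinstance(child, tuple):
        return (op, collapse_shuffles(child))
    return plan

plan = ("rebalance", ("repartition", ("scan", "t")))
assert collapse_shuffles(plan) == ("rebalance", ("scan", "t"))
```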
[jira] [Commented] (SPARK-37886) Use ComparisonTestBase to reduce redundant test code
[ https://issues.apache.org/jira/browse/SPARK-37886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475902#comment-17475902 ] Apache Spark commented on SPARK-37886: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/35203 > Use ComparisonTestBase to reduce redundant test code > > > Key: SPARK-37886 > URL: https://issues.apache.org/jira/browse/SPARK-37886 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major > > Many test cases use the same logic to convert a pdf to a psdf; we can use > ComparisonTestBase as the parent class to reduce this redundancy. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27442) ParquetFileFormat fails to read column named with invalid characters
[ https://issues.apache.org/jira/browse/SPARK-27442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475894#comment-17475894 ] Wenchen Fan commented on SPARK-27442: - I think we should fix this. It's OK for Spark to forbid special chars in the column name, but when we read existing parquet files, there is no point in forbidding it on the Spark side. [~angerszhuuu] can you take a look? Thanks! > ParquetFileFormat fails to read column named with invalid characters > > > Key: SPARK-27442 > URL: https://issues.apache.org/jira/browse/SPARK-27442 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0, 2.4.1 >Reporter: Jan Vršovský >Priority: Minor > > When reading a parquet file whose column names contain characters considered invalid, the > reader fails with an exception: > Name: org.apache.spark.sql.AnalysisException > Message: Attribute name "..." contains invalid character(s) among " > ,;{}()\n\t=". Please use alias to rename it. > Spark should not be able to write such files, but it should be able to read > them (and allow the user to correct them). However, possible workarounds (such as > using an alias to rename the column, or forcing another schema) do not work, > since the check is done on the input. > (Possible fix: remove the superfluous > {{ParquetWriteSupport.setSchema(requiredSchema, hadoopConf)}} from > {{buildReaderWithPartitionValues}} ?) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
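The validation behind that error message can be mirrored in a few lines of Python (a sketch of the documented character rule only, not Spark's actual `ParquetSchemaConverter` code):

```python
# the characters the error message lists as invalid in column names
INVALID_CHARS = set(' ,;{}()\n\t=')

def check_field_name(name):
    # return the offending characters, sorted; empty means acceptable
    return sorted(c for c in set(name) if c in INVALID_CHARS)

assert check_field_name("valid_col") == []
assert check_field_name("bad col;") == [" ", ";"]
```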
[jira] [Commented] (SPARK-37479) Migrate DROP NAMESPACE to use V2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475889#comment-17475889 ] Apache Spark commented on SPARK-37479: -- User 'dchvn' has created a pull request for this issue: https://github.com/apache/spark/pull/35202 > Migrate DROP NAMESPACE to use V2 command by default > --- > > Key: SPARK-37479 > URL: https://issues.apache.org/jira/browse/SPARK-37479 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37479) Migrate DROP NAMESPACE to use V2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37479: Assignee: Apache Spark > Migrate DROP NAMESPACE to use V2 command by default > --- > > Key: SPARK-37479 > URL: https://issues.apache.org/jira/browse/SPARK-37479 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37479) Migrate DROP NAMESPACE to use V2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37479: Assignee: (was: Apache Spark) > Migrate DROP NAMESPACE to use V2 command by default > --- > > Key: SPARK-37479 > URL: https://issues.apache.org/jira/browse/SPARK-37479 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37479) Migrate DROP NAMESPACE to use V2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475888#comment-17475888 ] Apache Spark commented on SPARK-37479: -- User 'dchvn' has created a pull request for this issue: https://github.com/apache/spark/pull/35202 > Migrate DROP NAMESPACE to use V2 command by default > --- > > Key: SPARK-37479 > URL: https://issues.apache.org/jira/browse/SPARK-37479 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36885) Inline type hints for python/pyspark/sql/dataframe.py
[ https://issues.apache.org/jira/browse/SPARK-36885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475883#comment-17475883 ] Apache Spark commented on SPARK-36885: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/35201 > Inline type hints for python/pyspark/sql/dataframe.py > - > > Key: SPARK-36885 > URL: https://issues.apache.org/jira/browse/SPARK-36885 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > Fix For: 3.3.0 > > > Inline type hints for python/pyspark/sql/dataframe.py from Inline type hints > for python/pyspark/sql/dataframe.pyi. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37903) Replace string_typehints with get_type_hints.
[ https://issues.apache.org/jira/browse/SPARK-37903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37903: Assignee: Takuya Ueshin > Replace string_typehints with get_type_hints. > - > > Key: SPARK-37903 > URL: https://issues.apache.org/jira/browse/SPARK-37903 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > > Currently we have a hacky way to resolve type hints written as strings, but > we can use {{get_type_hints}} instead. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37903) Replace string_typehints with get_type_hints.
[ https://issues.apache.org/jira/browse/SPARK-37903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37903. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35200 [https://github.com/apache/spark/pull/35200] > Replace string_typehints with get_type_hints. > - > > Key: SPARK-37903 > URL: https://issues.apache.org/jira/browse/SPARK-37903 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.3.0 > > > Currently we have a hacky way to resolve type hints written as strings, but > we can use {{get_type_hints}} instead. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37154) Inline type hints for python/pyspark/rdd.py
[ https://issues.apache.org/jira/browse/SPARK-37154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17460243#comment-17460243 ] Byron Hsu edited comment on SPARK-37154 at 1/14/22, 12:41 AM: -- I am looking into this one was (Author: byronhsu): I am looking into this one > Inline type hints for python/pyspark/rdd.py > --- > > Key: SPARK-37154 > URL: https://issues.apache.org/jira/browse/SPARK-37154 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-37154) Inline type hints for python/pyspark/rdd.py
[ https://issues.apache.org/jira/browse/SPARK-37154 ] Byron Hsu deleted comment on SPARK-37154: --- was (Author: byronhsu): I am looking into this one > Inline type hints for python/pyspark/rdd.py > --- > > Key: SPARK-37154 > URL: https://issues.apache.org/jira/browse/SPARK-37154 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37903) Replace string_typehints with get_type_hints.
[ https://issues.apache.org/jira/browse/SPARK-37903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475858#comment-17475858 ] Apache Spark commented on SPARK-37903: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/35200 > Replace string_typehints with get_type_hints. > - > > Key: SPARK-37903 > URL: https://issues.apache.org/jira/browse/SPARK-37903 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > Currently we have a hacky way to resolve type hints written as strings, but > we can use {{get_type_hints}} instead. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37903) Replace string_typehints with get_type_hints.
[ https://issues.apache.org/jira/browse/SPARK-37903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475857#comment-17475857 ] Apache Spark commented on SPARK-37903: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/35200 > Replace string_typehints with get_type_hints. > - > > Key: SPARK-37903 > URL: https://issues.apache.org/jira/browse/SPARK-37903 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > Currently we have a hacky way to resolve type hints written as strings, but > we can use {{get_type_hints}} instead. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37903) Replace string_typehints with get_type_hints.
[ https://issues.apache.org/jira/browse/SPARK-37903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37903: Assignee: Apache Spark > Replace string_typehints with get_type_hints. > - > > Key: SPARK-37903 > URL: https://issues.apache.org/jira/browse/SPARK-37903 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > Currently we have a hacky way to resolve type hints written as strings, but > we can use {{get_type_hints}} instead. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37903) Replace string_typehints with get_type_hints.
[ https://issues.apache.org/jira/browse/SPARK-37903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37903: Assignee: (was: Apache Spark) > Replace string_typehints with get_type_hints. > - > > Key: SPARK-37903 > URL: https://issues.apache.org/jira/browse/SPARK-37903 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > Currently we have a hacky way to resolve type hints written as strings, but > we can use {{get_type_hints}} instead. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37879) Show test report in GitHub Actions builds from PRs
[ https://issues.apache.org/jira/browse/SPARK-37879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37879. -- Resolution: Fixed Issue resolved by pull request 35193 [https://github.com/apache/spark/pull/35193] > Show test report in GitHub Actions builds from PRs > -- > > Key: SPARK-37879 > URL: https://issues.apache.org/jira/browse/SPARK-37879 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > > Currently, the test report like > https://github.com/apache/spark/runs/4788468586 is not directly shown in > workflow runs in the link provided in PRs, e.g.) > https://github.com/yaooqinn/spark/actions/runs/1687326379 > We should make the test report visible. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37879) Show test report in GitHub Actions builds from PRs
[ https://issues.apache.org/jira/browse/SPARK-37879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37879: Assignee: Hyukjin Kwon > Show test report in GitHub Actions builds from PRs > -- > > Key: SPARK-37879 > URL: https://issues.apache.org/jira/browse/SPARK-37879 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > > Currently, the test report like > https://github.com/apache/spark/runs/4788468586 is not directly shown in > workflow runs in the link provided in PRs, e.g.) > https://github.com/yaooqinn/spark/actions/runs/1687326379 > We should make the test report visible. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37903) Replace string_typehints with get_type_hints.
Takuya Ueshin created SPARK-37903: - Summary: Replace string_typehints with get_type_hints. Key: SPARK-37903 URL: https://issues.apache.org/jira/browse/SPARK-37903 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Takuya Ueshin Currently we have a hacky way to resolve type hints written as strings, but we can use {{get_type_hints}} instead. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
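The standard-library replacement proposed above is straightforward; a minimal demonstration of `typing.get_type_hints` resolving type hints written as strings:

```python
from typing import get_type_hints

def add(a: "int", b: "int") -> "int":  # type hints written as strings
    return a + b

# get_type_hints evaluates the strings in the function's namespace,
# so no hand-rolled string resolution is needed
hints = get_type_hints(add)
assert hints == {"a": int, "b": int, "return": int}
```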
[jira] [Resolved] (SPARK-37887) PySpark shell sets log level to INFO by default
[ https://issues.apache.org/jira/browse/SPARK-37887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37887. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35198 [https://github.com/apache/spark/pull/35198] > PySpark shell sets log level to INFO by default > > > Key: SPARK-37887 > URL: https://issues.apache.org/jira/browse/SPARK-37887 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Spark Shell >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.3.0 > > > {code} > ./bin/pyspark > {code} > {code} > Python 3.9.5 (default, May 18 2021, 12:31:01) > [Clang 10.0.0 ] :: Anaconda, Inc. on darwin > Type "help", "copyright", "credits" or "license" for more information. > 22/01/13 10:28:15 INFO HiveConf: Found configuration file null > 22/01/13 10:28:15 INFO SparkContext: Running Spark version 3.3.0-SNAPSHOT > ... > >>> spark.range(10) > 22/01/13 10:31:48 INFO SharedState: Setting hive.metastore.warehouse.dir > ('null') to the value of spark.sql.warehouse.dir. > ... > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37900) Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager
[ https://issues.apache.org/jira/browse/SPARK-37900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37900. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35195 [https://github.com/apache/spark/pull/35195] > Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager > > > Key: SPARK-37900 > URL: https://issues.apache.org/jira/browse/SPARK-37900 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37900) Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager
[ https://issues.apache.org/jira/browse/SPARK-37900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37900: - Assignee: Dongjoon Hyun > Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager > > > Key: SPARK-37900 > URL: https://issues.apache.org/jira/browse/SPARK-37900 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37902) Update annotations to resolve issues detected with mypy==0.931
[ https://issues.apache.org/jira/browse/SPARK-37902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37902: Assignee: (was: Apache Spark) > Update annotations to resolve issues detected with mypy==0.931 > -- > > Key: SPARK-37902 > URL: https://issues.apache.org/jira/browse/SPARK-37902 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > The following new issues are detected when type checked with {{mypy==0.931}} > {code} > python/pyspark/pandas/base.py:879: error: "Sequence[Any]" has no attribute > "tolist" [attr-defined] > python/pyspark/sql/tests/test_pandas_udf_typehints_with_future_annotations.py:268: > error: Incompatible return value type (got "floating[Any]", expected > "float") [return-value] > python/pyspark/sql/tests/test_pandas_udf_typehints.py:265: error: > Incompatible return value type (got "floating[Any]", expected "float") > [return-value] > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37902) Update annotations to resolve issues detected with mypy==0.931
[ https://issues.apache.org/jira/browse/SPARK-37902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37902: Assignee: Apache Spark > Update annotations to resolve issues detected with mypy==0.931 > -- > > Key: SPARK-37902 > URL: https://issues.apache.org/jira/browse/SPARK-37902 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Minor > > The following new issues are detected when type checked with {{mypy==0.931}} > {code} > python/pyspark/pandas/base.py:879: error: "Sequence[Any]" has no attribute > "tolist" [attr-defined] > python/pyspark/sql/tests/test_pandas_udf_typehints_with_future_annotations.py:268: > error: Incompatible return value type (got "floating[Any]", expected > "float") [return-value] > python/pyspark/sql/tests/test_pandas_udf_typehints.py:265: error: > Incompatible return value type (got "floating[Any]", expected "float") > [return-value] > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37902) Update annotations to resolve issues detected with mypy==0.931
[ https://issues.apache.org/jira/browse/SPARK-37902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475767#comment-17475767 ] Apache Spark commented on SPARK-37902: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/35199 > Update annotations to resolve issues detected with mypy==0.931 > -- > > Key: SPARK-37902 > URL: https://issues.apache.org/jira/browse/SPARK-37902 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > The following new issues are detected when type checked with {{mypy==0.931}} > {code} > python/pyspark/pandas/base.py:879: error: "Sequence[Any]" has no attribute > "tolist" [attr-defined] > python/pyspark/sql/tests/test_pandas_udf_typehints_with_future_annotations.py:268: > error: Incompatible return value type (got "floating[Any]", expected > "float") [return-value] > python/pyspark/sql/tests/test_pandas_udf_typehints.py:265: error: > Incompatible return value type (got "floating[Any]", expected "float") > [return-value] > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36404) Support nested columns in ORC vectorized reader for data source v2
[ https://issues.apache.org/jira/browse/SPARK-36404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36404: -- Labels: releasenotes (was: ) > Support nested columns in ORC vectorized reader for data source v2 > -- > > Key: SPARK-36404 > URL: https://issues.apache.org/jira/browse/SPARK-36404 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Labels: releasenotes > Fix For: 3.3.0 > > > We added support of nested columns in ORC vectorized reader for data source > v1. Data source v2 and v1 both use same underlying implementation for > vectorized reader (OrcColumnVector), so we can support data source v2 as well. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36649) Support Trigger.AvailableNow on Kafka data source
[ https://issues.apache.org/jira/browse/SPARK-36649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475761#comment-17475761 ] Boyang Jerry Peng commented on SPARK-36649: --- I'm working on it > Support Trigger.AvailableNow on Kafka data source > - > > Key: SPARK-36649 > URL: https://issues.apache.org/jira/browse/SPARK-36649 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jungtaek Lim >Priority: Major > > SPARK-36533 introduces a new trigger Trigger.AvailableNow, but only > introduces the new functionality to the file stream source. Given that Kafka > data source is one of the major data sources used in streaming queries, > we should make Kafka data source support this. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37902) Update annotations to resolve issues detected with mypy==0.931
Maciej Szymkiewicz created SPARK-37902: -- Summary: Update annotations to resolve issues detected with mypy==0.931 Key: SPARK-37902 URL: https://issues.apache.org/jira/browse/SPARK-37902 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Maciej Szymkiewicz The following new issues are detected when type checked with {{mypy==0.931}} {code} python/pyspark/pandas/base.py:879: error: "Sequence[Any]" has no attribute "tolist" [attr-defined] python/pyspark/sql/tests/test_pandas_udf_typehints_with_future_annotations.py:268: error: Incompatible return value type (got "floating[Any]", expected "float") [return-value] python/pyspark/sql/tests/test_pandas_udf_typehints.py:265: error: Incompatible return value type (got "floating[Any]", expected "float") [return-value] {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
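As background on the {{[return-value]}} errors quoted above: numpy reductions such as np.mean are typed as returning numpy.floating[Any], which mypy does not accept where the builtin float is annotated. A minimal, hypothetical sketch of the usual fix pattern (not the actual SPARK-37902 patch) is to wrap the numpy scalar in float():

```python
import numpy as np

def mean_of(values: list[float]) -> float:
    # Under numpy's type stubs, np.mean(...) returns np.floating[Any],
    # which mypy does not treat as a subtype of the builtin float and
    # reports: Incompatible return value type (got "floating[Any]",
    # expected "float")  [return-value]
    # Wrapping the result in float() satisfies the annotation and is a
    # no-op behavior-wise (the scalar value is unchanged).
    return float(np.mean(values))

result = mean_of([1.0, 2.0, 3.0])
```

The same wrapping works for any numpy scalar flowing into a builtin-typed annotation.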
[jira] [Assigned] (SPARK-37887) PySpark shell sets log level to INFO by default
[ https://issues.apache.org/jira/browse/SPARK-37887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37887: Assignee: Apache Spark (was: L. C. Hsieh) > PySpark shell sets log level to INFO by default > > > Key: SPARK-37887 > URL: https://issues.apache.org/jira/browse/SPARK-37887 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Spark Shell >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > {code} > ./bin/pyspark > {code} > {code} > Python 3.9.5 (default, May 18 2021, 12:31:01) > [Clang 10.0.0 ] :: Anaconda, Inc. on darwin > Type "help", "copyright", "credits" or "license" for more information. > 22/01/13 10:28:15 INFO HiveConf: Found configuration file null > 22/01/13 10:28:15 INFO SparkContext: Running Spark version 3.3.0-SNAPSHOT > ... > >>> spark.range(10) > 22/01/13 10:31:48 INFO SharedState: Setting hive.metastore.warehouse.dir > ('null') to the value of spark.sql.warehouse.dir. > ... > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37887) PySpark shell sets log level to INFO by default
[ https://issues.apache.org/jira/browse/SPARK-37887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37887: Assignee: L. C. Hsieh (was: Apache Spark) > PySpark shell sets log level to INFO by default > > > Key: SPARK-37887 > URL: https://issues.apache.org/jira/browse/SPARK-37887 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Spark Shell >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: L. C. Hsieh >Priority: Major > > {code} > ./bin/pyspark > {code} > {code} > Python 3.9.5 (default, May 18 2021, 12:31:01) > [Clang 10.0.0 ] :: Anaconda, Inc. on darwin > Type "help", "copyright", "credits" or "license" for more information. > 22/01/13 10:28:15 INFO HiveConf: Found configuration file null > 22/01/13 10:28:15 INFO SparkContext: Running Spark version 3.3.0-SNAPSHOT > ... > >>> spark.range(10) > 22/01/13 10:31:48 INFO SharedState: Setting hive.metastore.warehouse.dir > ('null') to the value of spark.sql.warehouse.dir. > ... > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37887) PySpark shell sets log level to INFO by default
[ https://issues.apache.org/jira/browse/SPARK-37887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475740#comment-17475740 ] Apache Spark commented on SPARK-37887: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/35198 > PySpark shell sets log level to INFO by default > > > Key: SPARK-37887 > URL: https://issues.apache.org/jira/browse/SPARK-37887 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Spark Shell >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: L. C. Hsieh >Priority: Major > > {code} > ./bin/pyspark > {code} > {code} > Python 3.9.5 (default, May 18 2021, 12:31:01) > [Clang 10.0.0 ] :: Anaconda, Inc. on darwin > Type "help", "copyright", "credits" or "license" for more information. > 22/01/13 10:28:15 INFO HiveConf: Found configuration file null > 22/01/13 10:28:15 INFO SparkContext: Running Spark version 3.3.0-SNAPSHOT > ... > >>> spark.range(10) > 22/01/13 10:31:48 INFO SharedState: Setting hive.metastore.warehouse.dir > ('null') to the value of spark.sql.warehouse.dir. > ... > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37900) Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager
[ https://issues.apache.org/jira/browse/SPARK-37900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37900: -- Component/s: Spark Core (was: Kubernetes) > Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager > > > Key: SPARK-37900 > URL: https://issues.apache.org/jira/browse/SPARK-37900 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37887) PySpark shell sets log level to INFO by default
[ https://issues.apache.org/jira/browse/SPARK-37887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475713#comment-17475713 ] L. C. Hsieh commented on SPARK-37887: - I know the root cause. I will submit a PR later. > PySpark shell sets log level to INFO by default > > > Key: SPARK-37887 > URL: https://issues.apache.org/jira/browse/SPARK-37887 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Spark Shell >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > ./bin/pyspark > {code} > {code} > Python 3.9.5 (default, May 18 2021, 12:31:01) > [Clang 10.0.0 ] :: Anaconda, Inc. on darwin > Type "help", "copyright", "credits" or "license" for more information. > 22/01/13 10:28:15 INFO HiveConf: Found configuration file null > 22/01/13 10:28:15 INFO SparkContext: Running Spark version 3.3.0-SNAPSHOT > ... > >>> spark.range(10) > 22/01/13 10:31:48 INFO SharedState: Setting hive.metastore.warehouse.dir > ('null') to the value of spark.sql.warehouse.dir. > ... > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37887) PySpark shell sets log level to INFO by default
[ https://issues.apache.org/jira/browse/SPARK-37887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-37887: --- Assignee: L. C. Hsieh > PySpark shell sets log level to INFO by default > > > Key: SPARK-37887 > URL: https://issues.apache.org/jira/browse/SPARK-37887 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Spark Shell >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: L. C. Hsieh >Priority: Major > > {code} > ./bin/pyspark > {code} > {code} > Python 3.9.5 (default, May 18 2021, 12:31:01) > [Clang 10.0.0 ] :: Anaconda, Inc. on darwin > Type "help", "copyright", "credits" or "license" for more information. > 22/01/13 10:28:15 INFO HiveConf: Found configuration file null > 22/01/13 10:28:15 INFO SparkContext: Running Spark version 3.3.0-SNAPSHOT > ... > >>> spark.range(10) > 22/01/13 10:31:48 INFO SharedState: Setting hive.metastore.warehouse.dir > ('null') to the value of spark.sql.warehouse.dir. > ... > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37901) Upgrade Netty from 4.1.72 to 4.1.73
[ https://issues.apache.org/jira/browse/SPARK-37901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37901: Assignee: (was: Apache Spark) > Upgrade Netty from 4.1.72 to 4.1.73 > --- > > Key: SPARK-37901 > URL: https://issues.apache.org/jira/browse/SPARK-37901 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: David Christle >Priority: Minor > > Netty has a new release that upgrades log4j to 2.17.1. Although I didn't find > obvious dependence on log4j via netty in my search of Spark's codebase, it > would be good to pick up this specific version. The version Spark currently > depends on is 4.1.72, which depends on log4j 2.15. Several CVEs have been > fixed in log4j between 2.15 and 2.17.1. > Besides this dependency update, several minor bugfixes have been made in this > release. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37901) Upgrade Netty from 4.1.72 to 4.1.73
[ https://issues.apache.org/jira/browse/SPARK-37901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475672#comment-17475672 ] Apache Spark commented on SPARK-37901: -- User 'dchristle' has created a pull request for this issue: https://github.com/apache/spark/pull/35196 > Upgrade Netty from 4.1.72 to 4.1.73 > --- > > Key: SPARK-37901 > URL: https://issues.apache.org/jira/browse/SPARK-37901 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: David Christle >Priority: Minor > > Netty has a new release that upgrades log4j to 2.17.1. Although I didn't find > obvious dependence on log4j via netty in my search of Spark's codebase, it > would be good to pick up this specific version. The version Spark currently > depends on is 4.1.72, which depends on log4j 2.15. Several CVEs have been > fixed in log4j between 2.15 and 2.17.1. > Besides this dependency update, several minor bugfixes have been made in this > release. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37901) Upgrade Netty from 4.1.72 to 4.1.73
[ https://issues.apache.org/jira/browse/SPARK-37901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37901: Assignee: Apache Spark > Upgrade Netty from 4.1.72 to 4.1.73 > --- > > Key: SPARK-37901 > URL: https://issues.apache.org/jira/browse/SPARK-37901 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: David Christle >Assignee: Apache Spark >Priority: Minor > > Netty has a new release that upgrades log4j to 2.17.1. Although I didn't find > obvious dependence on log4j via netty in my search of Spark's codebase, it > would be good to pick up this specific version. The version Spark currently > depends on is 4.1.72, which depends on log4j 2.15. Several CVEs have been > fixed in log4j between 2.15 and 2.17.1. > Besides this dependency update, several minor bugfixes have been made in this > release. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37901) Upgrade Netty from 4.1.72 to 4.1.73
David Christle created SPARK-37901: -- Summary: Upgrade Netty from 4.1.72 to 4.1.73 Key: SPARK-37901 URL: https://issues.apache.org/jira/browse/SPARK-37901 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: David Christle Netty has a new release that upgrades log4j to 2.17.1. Although I didn't find obvious dependence on log4j via netty in my search of Spark's codebase, it would be good to pick up this specific version. The version Spark currently depends on is 4.1.72, which depends on log4j 2.15. Several CVEs have been fixed in log4j between 2.15 and 2.17.1. Besides this dependency update, several minor bugfixes have been made in this release. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37900) Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager
[ https://issues.apache.org/jira/browse/SPARK-37900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475650#comment-17475650 ] Apache Spark commented on SPARK-37900: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/35195 > Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager > > > Key: SPARK-37900 > URL: https://issues.apache.org/jira/browse/SPARK-37900 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37900) Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager
[ https://issues.apache.org/jira/browse/SPARK-37900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475643#comment-17475643 ] Apache Spark commented on SPARK-37900: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/35195 > Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager > > > Key: SPARK-37900 > URL: https://issues.apache.org/jira/browse/SPARK-37900 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37900) Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager
[ https://issues.apache.org/jira/browse/SPARK-37900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37900: Assignee: (was: Apache Spark) > Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager > > > Key: SPARK-37900 > URL: https://issues.apache.org/jira/browse/SPARK-37900 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37900) Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager
[ https://issues.apache.org/jira/browse/SPARK-37900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37900: Assignee: Apache Spark > Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager > > > Key: SPARK-37900 > URL: https://issues.apache.org/jira/browse/SPARK-37900 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37900) Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager
Dongjoon Hyun created SPARK-37900: - Summary: Use SparkMasterRegex.KUBERNETES_REGEX in SecurityManager Key: SPARK-37900 URL: https://issues.apache.org/jira/browse/SPARK-37900 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.3.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37864) Support Parquet v2 data page RLE encoding (for Boolean Values) for the vectorized path
[ https://issues.apache.org/jira/browse/SPARK-37864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37864. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35163 [https://github.com/apache/spark/pull/35163] > Support Parquet v2 data page RLE encoding (for Boolean Values) for the > vectorized path > -- > > Key: SPARK-37864 > URL: https://issues.apache.org/jira/browse/SPARK-37864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.3.0 > > > Parquet v2 data pages write boolean values using RLE encoding; when reading > v2 boolean values, the reader currently throws exceptions like the following: > > {code:java} > Caused by: java.lang.UnsupportedOperationException: Unsupported encoding: RLE > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.getValuesReader(VectorizedColumnReader.java:305) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:277) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:344) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:48) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:250) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:237) > ~[classes/:?] > at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192) > ~[parquet-column-1.12.2.jar:1.12.2] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:237) > ~[classes/:?]
> at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:173) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:311) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:209) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:298) > ~[classes/:?] > ... 19 more {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37864) Support Parquet v2 data page RLE encoding (for Boolean Values) for the vectorized path
[ https://issues.apache.org/jira/browse/SPARK-37864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37864: Assignee: Yang Jie > Support Parquet v2 data page RLE encoding (for Boolean Values) for the > vectorized path > -- > > Key: SPARK-37864 > URL: https://issues.apache.org/jira/browse/SPARK-37864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > Parquet v2 data pages write boolean values using RLE encoding; when reading > v2 boolean values, the reader currently throws exceptions like the following: > > {code:java} > Caused by: java.lang.UnsupportedOperationException: Unsupported encoding: RLE > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.getValuesReader(VectorizedColumnReader.java:305) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:277) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:344) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:48) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:250) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:237) > ~[classes/:?] > at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192) > ~[parquet-column-1.12.2.jar:1.12.2] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:237) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:173) > ~[classes/:?]
> at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:311) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:209) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) > ~[classes/:?] > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:298) > ~[classes/:?] > ... 19 more {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
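For context on the exception above: Parquet v2 data pages encode boolean values with run-length encoding, and the vectorized reader had no RLE values reader for that path. The sketch below illustrates only the general idea of run-length decoding; Parquet's actual on-disk format is an RLE/bit-packed hybrid with length-prefixed headers, so this is a simplified illustration, not the reader's implementation.

```python
def rle_decode_booleans(runs):
    """Expand (run_length, value) pairs into a flat list of booleans.

    Simplified illustration of run-length encoding: repeated values are
    stored once with a count. Parquet's real encoding is an
    RLE/bit-packed hybrid, not this plain pair format.
    """
    out = []
    for run_length, value in runs:
        out.extend([value] * run_length)
    return out

# Ten trues followed by three falses compress to just two runs.
decoded = rle_decode_booleans([(10, True), (3, False)])
```

Boolean columns compress extremely well this way, which is why the v2 writer prefers RLE for them.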
[jira] [Commented] (SPARK-37479) Migrate DROP NAMESPACE to use V2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475542#comment-17475542 ] Terry Kim commented on SPARK-37479: --- OK, thanks! > Migrate DROP NAMESPACE to use V2 command by default > --- > > Key: SPARK-37479 > URL: https://issues.apache.org/jira/browse/SPARK-37479 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37899) EliminateInnerJoin to support convert inner join to left semi join
[ https://issues.apache.org/jira/browse/SPARK-37899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475525#comment-17475525 ] Apache Spark commented on SPARK-37899: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/35194 > EliminateInnerJoin to support convert inner join to left semi join > -- > > Key: SPARK-37899 > URL: https://issues.apache.org/jira/browse/SPARK-37899 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37899) EliminateInnerJoin to support convert inner join to left semi join
[ https://issues.apache.org/jira/browse/SPARK-37899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37899: Assignee: (was: Apache Spark) > EliminateInnerJoin to support convert inner join to left semi join > -- > > Key: SPARK-37899 > URL: https://issues.apache.org/jira/browse/SPARK-37899 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37899) EliminateInnerJoin to support convert inner join to left semi join
[ https://issues.apache.org/jira/browse/SPARK-37899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37899: Assignee: Apache Spark > EliminateInnerJoin to support convert inner join to left semi join > -- > > Key: SPARK-37899 > URL: https://issues.apache.org/jira/browse/SPARK-37899 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major >
[jira] [Updated] (SPARK-37899) EliminateInnerJoin to support convert inner join to left semi join
[ https://issues.apache.org/jira/browse/SPARK-37899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-37899: Summary: EliminateInnerJoin to support convert inner join to left semi join (was: EliminateInnerJoin) > EliminateInnerJoin to support convert inner join to left semi join > -- > > Key: SPARK-37899 > URL: https://issues.apache.org/jira/browse/SPARK-37899 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major >
[jira] [Created] (SPARK-37899) EliminateInnerJoin
Yuming Wang created SPARK-37899: --- Summary: EliminateInnerJoin Key: SPARK-37899 URL: https://issues.apache.org/jira/browse/SPARK-37899 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Yuming Wang
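A toy illustration of the rewrite named in the summary (plain Scala collections and hypothetical data, not the actual Catalyst rule): when a query keeps only the left side's columns and de-duplicates its output, an inner join used as a pure existence test collapses into a left semi join.

```scala
// Toy relations: t1 is the left side, t2 is a right side holding only join keys.
val t1 = Seq(1 -> "a", 2 -> "b", 3 -> "c")
val t2 = Seq(1, 1, 3) // note the duplicate key

// Inner join on the key, then de-duplicate the left-side output columns.
val innerThenDistinct =
  (for ((k, v) <- t1; k2 <- t2 if k == k2) yield (k, v)).distinct

// Left semi join: keep a left row iff a matching key exists on the right.
val leftSemi = t1.filter { case (k, _) => t2.contains(k) }

assert(innerThenDistinct == leftSemi) // both yield Seq((1, "a"), (3, "c"))
```

The semi-join form never materializes the duplicated matches, which is why such a rewrite can pay off when the right side contains duplicate join keys.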
[jira] [Commented] (SPARK-37895) Error while joining two tables with non-english field names
[ https://issues.apache.org/jira/browse/SPARK-37895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475361#comment-17475361 ] Wenchen Fan commented on SPARK-37895: - [~beliefer] can you help to fix it? also cc [~huaxingao] > Error while joining two tables with non-english field names > --- > > Key: SPARK-37895 > URL: https://issues.apache.org/jira/browse/SPARK-37895 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Marina Krasilnikova >Priority: Minor > > While trying to join two tables with non-english field names in postgresql > with query like > "select view1.`Имя1` , view1.`Имя2`, view2.`Имя3` from view1 left join view2 > on view1.`Имя2`=view2.`Имя4`" > we get an error which says that there is no field "`Имя4`" (field name is > surrounded by backticks). > It appears that to get the data from the second table it constructs query like > SELECT "Имя3","Имя4" FROM "public"."tab2" WHERE ("`Имя4`" IS NOT NULL) > and these backticks are redundant in WHERE clause.
[jira] [Commented] (SPARK-37895) Error while joining two tables with non-english field names
[ https://issues.apache.org/jira/browse/SPARK-37895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475360#comment-17475360 ] Wenchen Fan commented on SPARK-37895: - This bug is only in JDBC v2. In the v2 code path, we always enable nested columns in filter pushdown, and the column name in the predicate follows SQL style, which may have quotes. In the long term, this problem can be fixed by using v2 filters, which have native support for nested columns, so that we don't need to encode nested columns into a single string and introduce quotes. For now, I think we should fix the v1 filter pushdown code path in JDBC v2, which is `JDBCScanBuilder.pushFilters`. > Error while joining two tables with non-english field names > --- > > Key: SPARK-37895 > URL: https://issues.apache.org/jira/browse/SPARK-37895 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Marina Krasilnikova >Priority: Minor > > While trying to join two tables with non-english field names in postgresql > with query like > "select view1.`Имя1` , view1.`Имя2`, view2.`Имя3` from view1 left join view2 > on view1.`Имя2`=view2.`Имя4`" > we get an error which says that there is no field "`Имя4`" (field name is > surrounded by backticks). > It appears that to get the data from the second table it constructs query like > SELECT "Имя3","Имя4" FROM "public"."tab2" WHERE ("`Имя4`" IS NOT NULL) > and these backticks are redundant in WHERE clause.
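The proposed fix direction could look roughly like the following sketch (helper names are hypothetical, not the actual `JDBCScanBuilder.pushFilters` change): unquote the backtick-quoted name that arrives with the pushed-down filter, then let the JDBC dialect apply its own identifier quoting when building the WHERE clause.

```scala
// Hypothetical sketch of the unquoting step; not Spark's actual code.
// A v1 pushed-down filter may carry a backtick-quoted name like `Имя4`.
def unquote(name: String): String =
  if (name.length >= 2 && name.startsWith("`") && name.endsWith("`"))
    name.substring(1, name.length - 1).replace("``", "`")
  else name

// A PostgreSQL-style dialect quotes identifiers with double quotes.
def dialectQuote(name: String): String =
  "\"" + name.replace("\"", "\"\"") + "\""

// The WHERE clause then references the column the way the database expects,
// instead of WHERE ("`Имя4`" IS NOT NULL).
assert(dialectQuote(unquote("`Имя4`")) == "\"Имя4\"")
assert(dialectQuote(unquote("Имя3")) == "\"Имя3\"")
```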
[jira] [Updated] (SPARK-37898) Error reading old dates when AQE is enabled in Spark 3.1. Works when AQE is disabled
[ https://issues.apache.org/jira/browse/SPARK-37898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaspar Muñoz updated SPARK-37898: -- Description: Hi guys, I was testing a Spark job that failed and ran into behaviour that is not consistent across Spark versions. I reduced my code so it can be reproduced easily in a plain spark-shell. Note: the code logic probably does not make sense :) The following snippet: - Works with Spark 3.1.2 and 3.1.3-rc when AQE is disabled - Fails with Spark 3.1.2 and 3.1.3-rc when AQE is enabled - Always works with Spark 3.2.0 {code:java} import org.apache.spark.sql.functions._ import org.apache.spark.sql.expressions.Window spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY") spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY") val dataset = spark.read.parquet("/tmp/parquet-output") val window = Window.orderBy(dataset.col("date").desc) val resultDataset = dataset.withColumn("rankedFilterOverPartition", dense_rank().over(window)).filter("rankedFilterOverPartition = 1").drop("rankedFilterOverPartition") println(resultDataset.rdd.getNumPartitions){code} I previously wrote the data at /tmp/parquet-output with this snippet on Spark 2.2: {code:java} import spark.implicits._ import java.sql.Timestamp import org.apache.spark.sql.functions._ case class MyCustomClass(id_col: Int, date: String, timestamp_col: java.sql.Timestamp) val dataset = Seq(MyCustomClass(1, "0001-01-01", Timestamp.valueOf("1000-01-01 10:00:00")), MyCustomClass(2, "0001-01-01", Timestamp.valueOf("1000-01-01 10:00:00"))).toDF dataset.select($"id_col", $"date".cast("date"), $"timestamp_col").write.mode("overwrite").parquet("/tmp/parquet-output"){code} The error is: {code:java} scala> println(resultDataset.rdd.getNumPartitions) 22/01/13 13:45:16 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation. 
22/01/13 13:45:16 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation. 22/01/13 13:45:17 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is. at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:147) at org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala){code} Is it possible to fix this for the 3.1 branch? Regards
[jira] [Updated] (SPARK-37898) Error reading old dates when AQE is enabled in Spark 3.1. Works when AQE is disabled
[ https://issues.apache.org/jira/browse/SPARK-37898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaspar Muñoz updated SPARK-37898: -- Description: Hi guys, I was testing a Spark job that failed and ran into behaviour that is not consistent across Spark versions. I reduced my code so it can be reproduced easily in a plain spark-shell. Note: the code logic probably does not make sense :) The following snippet: - Works with Spark 3.1.2 and 3.1.3-rc when AQE is disabled - Fails with Spark 3.1.2 and 3.1.3-rc when AQE is enabled - Always works with Spark 3.2.0 {code:java} import org.apache.spark.sql.functions._ import org.apache.spark.sql.expressions.Window spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY") spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY") val dataset = spark.read.parquet("/tmp/parquet-output") val window = Window.orderBy(dataset.col("date").desc) val resultDataset = dataset.withColumn("rankedFilterOverPartition", dense_rank().over(window)).filter("rankedFilterOverPartition = 1").drop("rankedFilterOverPartition") println(resultDataset.rdd.getNumPartitions){code} I previously wrote the data at /tmp/parquet-output with this snippet on Spark 2.2: {code:java} import spark.implicits._ import java.sql.Timestamp import org.apache.spark.sql.functions._ case class MyCustomClass(id_col: Int, date: String, timestamp_col: java.sql.Timestamp) val dataset = Seq(MyCustomClass(1, "0001-01-01", Timestamp.valueOf("1000-01-01 10:00:00")), MyCustomClass(2, "0001-01-01", Timestamp.valueOf("1000-01-01 10:00:00"))).toDF dataset.select($"id_col", $"date".cast("date"), $"timestamp_col").write.mode("overwrite").parquet("/tmp/parquet-output"){code} The error is: {code:java} scala> println(resultDataset.rdd.getNumPartitions) 22/01/13 13:45:16 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation. 
22/01/13 13:45:16 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation. 22/01/13 13:45:17 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is. at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:147) at org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala){code} Regards
[jira] [Created] (SPARK-37898) Error reading old dates when AQE is enabled in Spark 3.1. Works when AQE is disabled
Gaspar Muñoz created SPARK-37898: - Summary: Error reading old dates when AQE is enabled in Spark 3.1. Works when AQE is disabled Key: SPARK-37898 URL: https://issues.apache.org/jira/browse/SPARK-37898 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2 Reporter: Gaspar Muñoz Hi guys, I was testing a Spark job that failed and ran into behaviour that is not consistent across Spark versions. I reduced my code so it can be reproduced easily in a plain spark-shell. The following snippet: - Works with Spark 3.1.2 and 3.1.3-rc when AQE is disabled - Fails with Spark 3.1.2 and 3.1.3-rc when AQE is enabled - Always works with Spark 3.2.0 {code:java} import org.apache.spark.sql.functions._ import org.apache.spark.sql.expressions.Window spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY") spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY") val dataset = spark.read.parquet("/tmp/parquet-output") val window = Window.orderBy(dataset.col("date").desc) val resultDataset = dataset.withColumn("rankedFilterOverPartition", dense_rank().over(window)).filter("rankedFilterOverPartition = 1").drop("rankedFilterOverPartition") println(resultDataset.rdd.getNumPartitions){code} I previously wrote the data at /tmp/parquet-output with this snippet on Spark 2.2: {code:java} import spark.implicits._ import java.sql.Timestamp import org.apache.spark.sql.functions._ case class MyCustomClass(id_col: Int, date: String, timestamp_col: java.sql.Timestamp) val dataset = Seq(MyCustomClass(1, "0001-01-01", Timestamp.valueOf("1000-01-01 10:00:00")), MyCustomClass(2, "0001-01-01", Timestamp.valueOf("1000-01-01 10:00:00"))).toDF dataset.select($"id_col", $"date".cast("date"), $"timestamp_col").write.mode("overwrite").parquet("/tmp/parquet-output"){code} The error is: {code:java} scala> println(resultDataset.rdd.getNumPartitions) 22/01/13 13:45:16 WARN WindowExec: No Partition Defined for Window operation! 
Moving all data to a single partition, this can cause serious performance degradation. 22/01/13 13:45:16 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation. 22/01/13 13:45:17 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is. at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:147) at org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala){code} Regards
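The ambiguity the exception warns about can be reproduced without Spark, because it is a property of the two calendars themselves: `java.util.GregorianCalendar` defaults to the hybrid Julian/Gregorian calendar (cutover 1582-10-15) that Spark 2.x and legacy Hive wrote with, while `java.time` uses the proleptic Gregorian calendar that Spark 3.x reads with. A small illustrative Scala sketch:

```scala
import java.util.{GregorianCalendar, TimeZone}
import java.time.{LocalDate, ZoneOffset}

// Hybrid calendar (Julian before 1582-10-15), as used by Spark 2.x writers.
val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
hybrid.clear()
hybrid.set(1000, 0, 1) // Calendar months are 0-based: this is 1000-01-01

// Proleptic Gregorian calendar, as used by Spark 3.x readers.
val proleptic = LocalDate.of(1000, 1, 1)
  .atStartOfDay(ZoneOffset.UTC).toInstant.toEpochMilli

// The same nominal date denotes different instants under the two calendars,
// which is why datetimeRebaseModeInRead must be set to LEGACY (rebase) or
// CORRECTED (read as-is) when old files are involved.
assert(hybrid.getTimeInMillis != proleptic)
```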
[jira] [Updated] (SPARK-37897) Filter with subexpression elimination may cause query failed
[ https://issues.apache.org/jira/browse/SPARK-37897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiahua updated SPARK-37897: - Description: The following test fails; the root cause is that the execution order of the filter predicates changed after subexpression elimination. So I think we should preserve the predicates' execution order after subexpression elimination. {code:java} test("filter with subexpression elimination may cause query failed.") { withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "false")) { val df = Seq(-1, 1, 2).toDF("c1") // register the `plusOne` udf; the function fails if the input is not a positive number. spark.sqlContext.udf.register("plusOne", (n: Int) => { if (n >= 0) n + 1 else throw new SparkException("Must be positive number.") }) val result = df.filter("c1 >= 0 and plusOne(c1) > 1 and plusOne(c1) < 3").collect() assert(result.size === 1) } } Caused by: org.apache.spark.SparkException: Must be positive number. at org.apache.spark.sql.DataFrameSuite.$anonfun$new$3(DataFrameSuite.scala:67) at scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23) ... 20 more{code} https://github.com/apache/spark/blob/0e186e8a19926f91810f3eaf174611b71e598de6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratePredicate.scala#L63 !image-2022-01-13-20-22-09-055.png! > Filter with subexpression elimination may cause query failed > > > Key: SPARK-37897 > URL: https://issues.apache.org/jira/browse/SPARK-37897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: hujiahua >Priority: Major > Attachments: image-2022-01-13-20-22-09-055.png > > > > The following test results will fail, the root cause was that the execution > order of filter predicates had changed after subexpression elimination. So I > think we should keep predicates execution order after subexpression > elimination. > {code:java} > test("filter with subexpression elimination may cause query failed.") { > withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "false")) { > val df = Seq(-1, 1, 2).toDF("c1") > //register `plusOne` udf, and the function will failed if input was not a > positive number. > spark.sqlContext.udf.register("plusOne", > (n: Int) => { if (n >= 0) n + 1 else throw new SparkException("Must be > positive number.") }) > val result = df.filter("c1 >= 0 and plusOne(c1) > 1 and plusOne(c1) < > 3").collect() > assert(result.size === 1) > } > } > Caused by: org.apache.spark.SparkException: Must be positive number. > at > org.apache.spark.sql.DataFrameSuite.$anonfun$new$3(DataFrameSuite.scala:67) > at > scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23) > ... 
> 20 more{code} > > https://github.com/apache/spark/blob/0e186e8a19926f91810f3eaf174611b71e598de6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratePredicate.scala#L63 > !image-2022-01-13-20-22-09-055.png! > >
[jira] [Updated] (SPARK-37897) Filter with subexpression elimination may cause query failed
[ https://issues.apache.org/jira/browse/SPARK-37897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiahua updated SPARK-37897: - Attachment: image-2022-01-13-20-22-09-055.png > Filter with subexpression elimination may cause query failed > > > Key: SPARK-37897 > URL: https://issues.apache.org/jira/browse/SPARK-37897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: hujiahua >Priority: Major > Attachments: image-2022-01-13-20-22-09-055.png > > > > The following test results will fail, the root cause was that the execution > order of filter predicates had changed after subexpression elimination. So I > think we should keep predicates execution order after subexpression > elimination. > {code:java} > test("filter with subexpression elimination may cause query failed.") { > withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "false")) { > val df = Seq(-1, 1, 2).toDF("c1") > //register `plusOne` udf, and the function will failed if input was not a > positive number. > spark.sqlContext.udf.register("plusOne", > (n: Int) => { if (n >= 0) n + 1 else throw new SparkException("Must be > positive number.") }) > val result = df.filter("c1 >= 0 and plusOne(c1) > 1 and plusOne(c1) < > 3").collect() > assert(result.size === 1) > } > } > Caused by: org.apache.spark.SparkException: Must be positive number. > at > org.apache.spark.sql.DataFrameSuite.$anonfun$new$3(DataFrameSuite.scala:67) > at > scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23) > ... 20 more{code} > >
[jira] [Updated] (SPARK-37897) Filter with subexpression elimination may cause query failed
[ https://issues.apache.org/jira/browse/SPARK-37897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiahua updated SPARK-37897: - Description: The following test fails; the root cause is that the execution order of the filter predicates changed after subexpression elimination. So I think we should preserve the predicates' execution order after subexpression elimination. {code:java} test("filter with subexpression elimination may cause query failed.") { withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "false")) { val df = Seq(-1, 1, 2).toDF("c1") // register the `plusOne` udf; the function fails if the input is not a positive number. spark.sqlContext.udf.register("plusOne", (n: Int) => { if (n >= 0) n + 1 else throw new SparkException("Must be positive number.") }) val result = df.filter("c1 >= 0 and plusOne(c1) > 1 and plusOne(c1) < 3").collect() assert(result.size === 1) } } Caused by: org.apache.spark.SparkException: Must be positive number. at org.apache.spark.sql.DataFrameSuite.$anonfun$new$3(DataFrameSuite.scala:67) at scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23) ... 20 more{code} > Filter with subexpression elimination may cause query failed > > > Key: SPARK-37897 > URL: https://issues.apache.org/jira/browse/SPARK-37897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: hujiahua >Priority: Major > > > The following test results will fail, the root cause was that the execution > order of filter predicates had changed after subexpression elimination. So I > think we should keep predicates execution order after subexpression > elimination. > {code:java} > test("filter with subexpression elimination may cause query failed.") { > withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "false")) { > val df = Seq(-1, 1, 2).toDF("c1") > //register `plusOne` udf, and the function will failed if input was not a > positive number. > spark.sqlContext.udf.register("plusOne", > (n: Int) => { if (n >= 0) n + 1 else throw new SparkException("Must be > positive number.") }) > val result = df.filter("c1 >= 0 and plusOne(c1) > 1 and plusOne(c1) < > 3").collect() > assert(result.size === 1) > } > } > Caused by: org.apache.spark.SparkException: Must be positive number. > at > org.apache.spark.sql.DataFrameSuite.$anonfun$new$3(DataFrameSuite.scala:67) > at > scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23) > ... 20 more{code} > >
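The hazard described in the report can be mimicked without Spark's codegen (plain-Scala stand-ins with hypothetical names, not `GeneratePredicate` itself): with left-to-right short-circuit evaluation the `c1 >= 0` guard protects the throwing predicate, but hoisting the shared `plusOne(c1)` subexpression ahead of the guard surfaces the exception.

```scala
// plusOne throws for negative input, mirroring the udf in the report.
def plusOne(n: Int): Int =
  if (n >= 0) n + 1 else throw new RuntimeException("Must be positive number.")

val rows = Seq(-1, 1, 2)

// Written order with short-circuiting: the c1 >= 0 guard runs first,
// so plusOne never sees -1, and only row 1 survives the filter.
val ok = rows.filter(c1 => c1 >= 0 && plusOne(c1) > 1 && plusOne(c1) < 3)
assert(ok == Seq(1))

// If subexpression elimination hoists the shared plusOne(c1) ahead of
// the guard, the -1 row reaches plusOne and the whole filter blows up.
def hoisted(c1: Int): Boolean = {
  val p = plusOne(c1) // evaluated before the guard
  c1 >= 0 && p > 1 && p < 3
}
val failed =
  try { rows.filter(hoisted); false }
  catch { case _: RuntimeException => true }
assert(failed)
```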