[jira] [Commented] (SPARK-44795) CodeGenCache should be ClassLoader specific
[ https://issues.apache.org/jira/browse/SPARK-44795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754852#comment-17754852 ] GridGain Integration commented on SPARK-44795: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/42508 > CodeGenCache should be ClassLoader specific > --- > > Key: SPARK-44795 > URL: https://issues.apache.org/jira/browse/SPARK-44795 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 3.5.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Blocker > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
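The idea behind SPARK-44795 can be illustrated with a small sketch (this is not Spark's actual implementation, which is in Scala; all names here are hypothetical): a code-generation cache keyed only by source text would let sessions with different class loaders share compiled classes, so the fix is to key the cache by the loader as well.

```python
# Illustrative sketch only: a codegen cache partitioned per class loader,
# so two sessions with different loaders never share compiled entries.
# Keys use the loader object's identity (id()), not equality; in a real
# implementation the loader must be kept alive or a weak reference used.
from collections import defaultdict

class PerLoaderCodeGenCache:
    def __init__(self):
        # One inner cache (source -> compiled artifact) per loader.
        self._caches = defaultdict(dict)

    def get_or_compile(self, loader, source, compile_fn):
        cache = self._caches[id(loader)]
        if source not in cache:
            cache[source] = compile_fn(source)
        return cache[source]

class FakeLoader:
    """Stand-in for a session-specific class loader."""

loader_a, loader_b = FakeLoader(), FakeLoader()
cache = PerLoaderCodeGenCache()
compile_fn = lambda s: eval("lambda x: " + s)

f1 = cache.get_or_compile(loader_a, "x + 1", compile_fn)
f2 = cache.get_or_compile(loader_a, "x + 1", compile_fn)  # cache hit
f3 = cache.get_or_compile(loader_b, "x + 1", compile_fn)  # separate entry
assert f1 is f2        # same loader: shared
assert f1 is not f3    # different loader: isolated
```

The design choice: isolating entries per loader trades some recompilation for correctness, since a class compiled against one loader's classpath may be wrong (or unloadable) under another.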
[jira] [Created] (SPARK-44824) There is content overlap in `ammoniteOut` used in ReplE2ESuite.
Yang Jie created SPARK-44824: Summary: There is content overlap in `ammoniteOut` used in ReplE2ESuite. Key: SPARK-44824 URL: https://issues.apache.org/jira/browse/SPARK-44824 Project: Spark Issue Type: Improvement Components: Connect, Tests Affects Versions: 3.5.0, 4.0.0 Reporter: Yang Jie
[jira] [Created] (SPARK-44823) Update black to 23.7.0 and fix erroneous check
BingKun Pan created SPARK-44823: --- Summary: Update black to 23.7.0 and fix erroneous check Key: SPARK-44823 URL: https://issues.apache.org/jira/browse/SPARK-44823 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: BingKun Pan
[jira] [Resolved] (SPARK-44809) Remove unused custom metrics for RocksDB state store provider
[ https://issues.apache.org/jira/browse/SPARK-44809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-44809. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42491 [https://github.com/apache/spark/pull/42491] > Remove unused custom metrics for RocksDB state store provider > - > > Key: SPARK-44809 > URL: https://issues.apache.org/jira/browse/SPARK-44809 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.5.1 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Fix For: 4.0.0 > >
[jira] [Updated] (SPARK-44564) Refine the documents with LLM
[ https://issues.apache.org/jira/browse/SPARK-44564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-44564: -- Attachment: docstr_prompt_only.py > Refine the documents with LLM > - > > Key: SPARK-44564 > URL: https://issues.apache.org/jira/browse/SPARK-44564 > Project: Spark > Issue Type: Umbrella > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Attachments: docstr_prompt_only.py > > > Let's first focus on the Documents of *PySpark DataFrame APIs*. > *1*, Choose a subset of DF APIs > Since the review bandwidth is limited, we recommend each PR contains at least > 5 APIs; > *2*, For each API, copy-paste the function (including function signature, doc > string) to an LLM, and ask it to refine it with a prompt (e.g. the attached > prompt); you can of course use/design your own prompt. > For prompt engineering, you can refer to this [Best > practices|https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api] > > *3*, Note that the LLM is not 100% reliable, the generated doc string may > still contain some mistakes, e.g. > * The example code cannot run > * The example results are incorrect > * The example code doesn't reflect the example title > * The description uses the wrong version, or adds a 'Raises' section for a non-existent > exception > * The lint can be broken > * ... > we need to fix them before sending a PR. > We can try different prompts, choose the good parts and combine them into the > new doc string.
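Step 2 of the workflow above can be sketched as assembling the text to paste into the LLM from a function's signature and current doc string. The prompt wording below is hypothetical; the ticket's attached docstr_prompt_only.py holds the real prompt.

```python
# Sketch of step 2: build the copy-paste payload for the LLM from a
# function's signature and existing doc string.
import inspect

def build_refine_request(func, prompt):
    signature = f"def {func.__name__}{inspect.signature(func)}:"
    docstring = inspect.getdoc(func) or "(no doc string yet)"
    return f'{prompt}\n\n{signature}\n    """{docstring}"""'

# A toy API standing in for a PySpark DataFrame method:
def head(df, n=5):
    """Return the first n rows."""
    return df[:n]

request = build_refine_request(head, "Improve this PySpark doc string:")
assert "def head(df, n=5):" in request
assert "Return the first n rows." in request
```

As step 3 notes, whatever the model returns must still be reviewed by hand (run the examples, check the lint) before it goes into a PR.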
[jira] [Updated] (SPARK-44564) Refine the documents with LLM
[ https://issues.apache.org/jira/browse/SPARK-44564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-44564: -- Attachment: (was: docstr_prompt.py) > Refine the documents with LLM > - > > Key: SPARK-44564 > URL: https://issues.apache.org/jira/browse/SPARK-44564 > Project: Spark > Issue Type: Umbrella > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > > Let's first focus on the Documents of *PySpark DataFrame APIs*. > *1*, Choose a subset of DF APIs > Since the review bandwidth is limited, we recommend each PR contains at least > 5 APIs; > *2*, For each API, copy-paste the function (including function signature, doc > string) to an LLM, and ask it to refine it with a prompt (e.g. the attached > prompt); you can of course use/design your own prompt. > For prompt engineering, you can refer to this [Best > practices|https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api] > > *3*, Note that the LLM is not 100% reliable, the generated doc string may > still contain some mistakes, e.g. > * The example code cannot run > * The example results are incorrect > * The example code doesn't reflect the example title > * The description uses the wrong version, or adds a 'Raises' section for a non-existent > exception > * The lint can be broken > * ... > we need to fix them before sending a PR. > We can try different prompts, choose the good parts and combine them into the > new doc string.
[jira] [Created] (SPARK-44822) Make Python UDTFs by default non-deterministic
Allison Wang created SPARK-44822: Summary: Make Python UDTFs by default non-deterministic Key: SPARK-44822 URL: https://issues.apache.org/jira/browse/SPARK-44822 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Allison Wang Change the default determinism of Python UDTFs to be false.
[jira] [Created] (SPARK-44821) Upgrade `kubernetes-client` to 6.8.1
Dongjoon Hyun created SPARK-44821: - Summary: Upgrade `kubernetes-client` to 6.8.1 Key: SPARK-44821 URL: https://issues.apache.org/jira/browse/SPARK-44821 Project: Spark Issue Type: Dependency upgrade Components: Build Affects Versions: 4.0.0 Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-44820) Switch languages consistently across docs for all code snippets
[ https://issues.apache.org/jira/browse/SPARK-44820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44820: - Description: When a user chooses a different language for a code snippet, all code snippets on that page should switch to the chosen language. This was the behavior for, for example, Spark 2.0 doc: [https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html] But it was broken for later docs, for example the Spark 3.4.1 doc: [https://spark.apache.org/docs/latest/quick-start.html] We should fix this behavior change and possibly add test cases to prevent future regressions. was: When a user chooses a different language for a code snippet, all code snippets on that page should switch to the chosen language. This was the behavior for, for example, Spark 2.0 doc: [https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html] But it was broken for later docs, for example the Spark 3.4.1 doc: [https://spark.apache.org/docs/latest/quick-start.html] We should fix this behavior change. > Switch languages consistently across docs for all code snippets > --- > > Key: SPARK-44820 > URL: https://issues.apache.org/jira/browse/SPARK-44820 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.4.1, 3.5.0 >Reporter: Allison Wang >Priority: Major > > When a user chooses a different language for a code snippet, all code > snippets on that page should switch to the chosen language. This was the > behavior for, for example, Spark 2.0 doc: > [https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html] > But it was broken for later docs, for example the Spark 3.4.1 doc: > [https://spark.apache.org/docs/latest/quick-start.html] > We should fix this behavior change and possibly add test cases to prevent > future regressions. 
[jira] [Updated] (SPARK-44819) Make Python the first language in all Spark code snippet
[ https://issues.apache.org/jira/browse/SPARK-44819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44819: - Attachment: Screenshot 2023-08-15 at 11.59.11.png > Make Python the first language in all Spark code snippet > > > Key: SPARK-44819 > URL: https://issues.apache.org/jira/browse/SPARK-44819 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > Attachments: Screenshot 2023-08-15 at 11.59.11.png > > > Currently, the first and default language for all code snippets is Scala. We > should make Python the first language for all the code snippets.
[jira] [Updated] (SPARK-44819) Make Python the first language in all Spark code snippet
[ https://issues.apache.org/jira/browse/SPARK-44819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44819: - Description: Currently, the first and default language for all code snippets is Scala. For instance: https://spark.apache.org/docs/latest/quick-start.html We should make Python the first language for all the code snippets. was: Currently, the first and default language for all code snippets is Scala. We should make Python the first language for all the code snippets. > Make Python the first language in all Spark code snippet > > > Key: SPARK-44819 > URL: https://issues.apache.org/jira/browse/SPARK-44819 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > Attachments: Screenshot 2023-08-15 at 11.59.11.png > > > Currently, the first and default language for all code snippets is Scala. For > instance: https://spark.apache.org/docs/latest/quick-start.html > We should make Python the first language for all the code snippets.
[jira] [Updated] (SPARK-44819) Make Python the first language in all Spark code snippet
[ https://issues.apache.org/jira/browse/SPARK-44819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44819: - Description: Currently, the first and default language for all code snippets is Scala. We should make Python the first language for all the code snippets. was: Currently, the first and default language for all code snippets is Scala. We should make Python the first language for all the code snippets. !image-2023-08-15-11-51-57-683.png|width=658,height=188! > Make Python the first language in all Spark code snippet > > > Key: SPARK-44819 > URL: https://issues.apache.org/jira/browse/SPARK-44819 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > Attachments: Screenshot 2023-08-15 at 11.59.11.png > > > Currently, the first and default language for all code snippets is Scala. We > should make Python the first language for all the code snippets.
[jira] [Created] (SPARK-44820) Switch languages consistently across docs for all code snippets
Allison Wang created SPARK-44820: Summary: Switch languages consistently across docs for all code snippets Key: SPARK-44820 URL: https://issues.apache.org/jira/browse/SPARK-44820 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.4.1, 3.5.0 Reporter: Allison Wang When a user chooses a different language for a code snippet, all code snippets on that page should switch to the chosen language. This was the behavior for, for example, the Spark 2.0 doc: [https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html] But it was broken for later docs, for example the Spark 3.4.1 doc: [https://spark.apache.org/docs/latest/quick-start.html] We should fix this behavior change.
[jira] [Created] (SPARK-44819) Make Python the first language in all Spark code snippet
Allison Wang created SPARK-44819: Summary: Make Python the first language in all Spark code snippet Key: SPARK-44819 URL: https://issues.apache.org/jira/browse/SPARK-44819 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.5.0 Reporter: Allison Wang Currently, the first and default language for all code snippets is Scala. We should make Python the first language for all the code snippets. !image-2023-08-15-11-51-57-683.png|width=658,height=188!
[jira] [Commented] (SPARK-44818) Fix race for pending interrupt issued before taskThread is initialized
[ https://issues.apache.org/jira/browse/SPARK-44818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754729#comment-17754729 ] Anish Shrigondekar commented on SPARK-44818: PR here - [https://github.com/apache/spark/pull/42504] cc - [~kabhwan] , [~jiangxb1987] > Fix race for pending interrupt issued before taskThread is initialized > -- > > Key: SPARK-44818 > URL: https://issues.apache.org/jira/browse/SPARK-44818 > Project: Spark > Issue Type: Task > Components: Spark Core, Structured Streaming >Affects Versions: 3.5.1 >Reporter: Anish Shrigondekar >Priority: Major > > Fix race for pending interrupt issued before taskThread is initialized
[jira] [Created] (SPARK-44818) Fix race for pending interrupt issued before taskThread is initialized
Anish Shrigondekar created SPARK-44818: -- Summary: Fix race for pending interrupt issued before taskThread is initialized Key: SPARK-44818 URL: https://issues.apache.org/jira/browse/SPARK-44818 Project: Spark Issue Type: Task Components: Spark Core, Structured Streaming Affects Versions: 3.5.1 Reporter: Anish Shrigondekar Fix race for pending interrupt issued before taskThread is initialized
[jira] [Resolved] (SPARK-42664) Support bloomFilter for DataFrameStatFunctions
[ https://issues.apache.org/jira/browse/SPARK-42664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42664. --- Fix Version/s: 3.5.0 Assignee: Yang Jie Resolution: Fixed > Support bloomFilter for DataFrameStatFunctions > -- > > Key: SPARK-42664 > URL: https://issues.apache.org/jira/browse/SPARK-42664 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > >
[jira] [Resolved] (SPARK-44794) Propagate ArtifactSet to stream execution thread
[ https://issues.apache.org/jira/browse/SPARK-44794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44794. --- Fix Version/s: 3.5.0 Resolution: Fixed > Propagate ArtifactSet to stream execution thread > > > Key: SPARK-44794 > URL: https://issues.apache.org/jira/browse/SPARK-44794 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Blocker > Fix For: 3.5.0 > >
[jira] [Resolved] (SPARK-44803) Replace `publish` with `publishOrSkip` in SparkBuild to eliminate warnings
[ https://issues.apache.org/jira/browse/SPARK-44803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44803. --- Fix Version/s: 4.0.0 Assignee: BingKun Pan Resolution: Fixed > Replace `publish` with `publishOrSkip` in SparkBuild to eliminate warnings > -- > > Key: SPARK-44803 > URL: https://issues.apache.org/jira/browse/SPARK-44803 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 4.0.0 > >
[jira] [Commented] (SPARK-44124) Upgrade AWS SDK to v2
[ https://issues.apache.org/jira/browse/SPARK-44124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754698#comment-17754698 ] Steve Loughran commented on SPARK-44124: will need to make sure any classloaders set up to pass down com.amazonaws to children (e.g. the Hive classloader) now pass down software.amazon > Upgrade AWS SDK to v2 > - > > Key: SPARK-44124 > URL: https://issues.apache.org/jira/browse/SPARK-44124 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Commented] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754694#comment-17754694 ] Rakesh Raushan commented on SPARK-44817: [~cloud_fan] [~gurwls223] [~maxgekk] What are your thoughts on this? If this looks promising, I can work on raising a PR for this. > Incremental Stats Collection > > > Key: SPARK-44817 > URL: https://issues.apache.org/jira/browse/SPARK-44817 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Rakesh Raushan >Priority: Major > > Spark's Cost-Based Optimizer depends on table and column statistics. > After every DML query, table and column stats are invalidated if > auto-update of stats collection is not turned on. To keep stats updated we > need to run the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very > expensive. It is not feasible to run this command after every DML query. > Instead, we can incrementally update the stats during each DML query run > itself. This way our table and column stats would stay fresh at all times > and CBO benefits can be applied. Initially, we can update only table-level > stats and gradually start updating column-level stats as well. > *Pros:* > 1. Optimizes queries over tables that are updated frequently. > 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE > STATISTICS` for updating stats.
[jira] [Updated] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-44817: --- Description: Spark's Cost-Based Optimizer depends on table and column statistics. After every DML query, table and column stats are invalidated if auto-update of stats collection is not turned on. To keep stats updated we need to run the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not feasible to run this command after every DML query. Instead, we can incrementally update the stats during each DML query run itself. This way our table and column stats would stay fresh at all times and CBO benefits can be applied. Initially, we can update only table-level stats and gradually start updating column-level stats as well. *Pros:* 1. Optimizes queries over tables that are updated frequently. 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE STATISTICS` for updating stats. was: Spark's Cost-Based Optimizer depends on table and column statistics. After every DML query, table and column stats are invalidated if auto-update of stats collection is not turned on. To keep stats updated we need to run the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not feasible to run this command after every DML query. Instead, we can incrementally update the stats during each DML query run itself. This way our table and column stats would stay fresh at all times and CBO benefits can be applied. *Pros:* 1. Optimizes queries over tables that are updated frequently. 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE STATISTICS` for updating stats.
> Incremental Stats Collection > > > Key: SPARK-44817 > URL: https://issues.apache.org/jira/browse/SPARK-44817 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Rakesh Raushan >Priority: Major > > Spark's Cost-Based Optimizer depends on table and column statistics. > After every DML query, table and column stats are invalidated if > auto-update of stats collection is not turned on. To keep stats updated we > need to run the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very > expensive. It is not feasible to run this command after every DML query. > Instead, we can incrementally update the stats during each DML query run > itself. This way our table and column stats would stay fresh at all times > and CBO benefits can be applied. Initially, we can update only table-level > stats and gradually start updating column-level stats as well. > *Pros:* > 1. Optimizes queries over tables that are updated frequently. > 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE > STATISTICS` for updating stats.
[jira] [Created] (SPARK-44817) Incremental Stats Collection
Rakesh Raushan created SPARK-44817: -- Summary: Incremental Stats Collection Key: SPARK-44817 URL: https://issues.apache.org/jira/browse/SPARK-44817 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Rakesh Raushan Spark's Cost-Based Optimizer depends on table and column statistics. After every DML query, table and column stats are invalidated if auto-update of stats collection is not turned on. To keep stats updated we need to run the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not feasible to run this command after every DML query. Instead, we can incrementally update the stats during each DML query run itself. This way our table and column stats would stay fresh at all times and CBO benefits can be applied. *Pros:* 1. Optimizes queries over tables that are updated frequently. 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE STATISTICS` for updating stats.
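The proposal above can be sketched as follows (a hypothetical illustration, not Spark code): instead of rescanning the table with `ANALYZE TABLE COMPUTE STATISTICS`, each DML operation folds its delta into the stored table-level stats.

```python
# Sketch of incremental table-level stats maintenance: each DML run
# applies its known delta instead of triggering a full rescan.
from dataclasses import dataclass

@dataclass
class TableStats:
    row_count: int = 0
    size_in_bytes: int = 0

def apply_dml_delta(stats, rows_delta, bytes_delta):
    """Fold an INSERT/UPDATE/DELETE delta into the stored stats."""
    return TableStats(
        row_count=stats.row_count + rows_delta,
        size_in_bytes=stats.size_in_bytes + bytes_delta,
    )

stats = TableStats(row_count=1_000, size_in_bytes=64_000)
stats = apply_dml_delta(stats, rows_delta=+200, bytes_delta=+12_800)  # INSERT
stats = apply_dml_delta(stats, rows_delta=-50, bytes_delta=-3_200)    # DELETE
assert stats.row_count == 1_150
assert stats.size_in_bytes == 73_600
```

Table-level counters compose cleanly like this; column-level stats (min/max, NDV) are harder to maintain incrementally, which is presumably why the ticket defers them.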
[jira] [Commented] (SPARK-44806) Separate connect-client-jvm-internal
[ https://issues.apache.org/jira/browse/SPARK-44806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754684#comment-17754684 ] Hudson commented on SPARK-44806: User 'juliuszsompolski' has created a pull request for this issue: https://github.com/apache/spark/pull/42501 > Separate connect-client-jvm-internal > > > Key: SPARK-44806 > URL: https://issues.apache.org/jira/browse/SPARK-44806 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Juliusz Sompolski >Priority: Major >
[jira] [Created] (SPARK-44816) Cryptic error message when UDF associated class is not found
Niranjan Jayakar created SPARK-44816: Summary: Cryptic error message when UDF associated class is not found Key: SPARK-44816 URL: https://issues.apache.org/jira/browse/SPARK-44816 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Niranjan Jayakar When a Dataset API is used that either requires or is modeled as a UDF, the class defining the UDF/function should be uploaded to the service first using the `addArtifact()` API. When this is not done, an error is thrown. However, this error message is cryptic and is not clear about the problem. Improve this error message to make it clear that an expected class was not found.
[jira] [Updated] (SPARK-44803) Replace `publish` with `publishOrSkip` in SparkBuild to eliminate warnings
[ https://issues.apache.org/jira/browse/SPARK-44803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-44803: Summary: Replace `publish` with `publishOrSkip` in SparkBuild to eliminate warnings (was: Replace `publishOrSkip` with `publish` in SparkBuild to eliminate warnings) > Replace `publish` with `publishOrSkip` in SparkBuild to eliminate warnings > -- > > Key: SPARK-44803 > URL: https://issues.apache.org/jira/browse/SPARK-44803 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor >
[jira] [Created] (SPARK-44815) Cache Schema of DF
Martin Grund created SPARK-44815: Summary: Cache Schema of DF Key: SPARK-44815 URL: https://issues.apache.org/jira/browse/SPARK-44815 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Martin Grund
[jira] [Created] (SPARK-44814) Test to trigger protobuf 4.23.3 crash
Martin Grund created SPARK-44814: Summary: Test to trigger protobuf 4.23.3 crash Key: SPARK-44814 URL: https://issues.apache.org/jira/browse/SPARK-44814 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Martin Grund
[jira] [Assigned] (SPARK-44718) High On-heap memory usage is detected while doing parquet-file reading with Off-Heap memory mode enabled on spark
[ https://issues.apache.org/jira/browse/SPARK-44718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-44718: --- Assignee: Zamil Majdy > High On-heap memory usage is detected while doing parquet-file reading with > Off-Heap memory mode enabled on spark > - > > Key: SPARK-44718 > URL: https://issues.apache.org/jira/browse/SPARK-44718 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.1 >Reporter: Zamil Majdy >Assignee: Zamil Majdy >Priority: Major > Fix For: 4.0.0 > > > I see high on-heap memory usage while doing parquet file > reading when the off-heap memory mode is enabled. This is caused by the > memory mode for the vectorized reader's column vectors being configured by a > different flag, whose default is always On-Heap. > Conf to reproduce the issue: > {{spark.memory.offHeap.size 100}} > {{spark.memory.offHeap.enabled true}} > Enabling only these configurations will not change the memory mode used for > parquet reading by the vectorized reader to Off-Heap. > > Proposed PR: https://github.com/apache/spark/pull/42394
[jira] [Resolved] (SPARK-44718) High On-heap memory usage is detected while doing parquet-file reading with Off-Heap memory mode enabled on spark
[ https://issues.apache.org/jira/browse/SPARK-44718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44718. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42394 [https://github.com/apache/spark/pull/42394] > High On-heap memory usage is detected while doing parquet-file reading with > Off-Heap memory mode enabled on spark > - > > Key: SPARK-44718 > URL: https://issues.apache.org/jira/browse/SPARK-44718 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.1 >Reporter: Zamil Majdy >Priority: Major > Fix For: 4.0.0 > > > I see high on-heap memory usage while doing parquet file > reading when the off-heap memory mode is enabled. This is caused by the > memory mode for the vectorized reader's column vectors being configured by a > different flag, whose default is always On-Heap. > Conf to reproduce the issue: > {{spark.memory.offHeap.size 100}} > {{spark.memory.offHeap.enabled true}} > Enabling only these configurations will not change the memory mode used for > parquet reading by the vectorized reader to Off-Heap. > > Proposed PR: https://github.com/apache/spark/pull/42394
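The repro configuration in the ticket enables off-heap execution memory only; the vectorized reader's column vectors are governed by a separate flag. A sketch of a `spark-defaults.conf` fragment (the last flag name is taken from Spark's internal config and should be verified against your version):

```
spark.memory.offHeap.enabled            true
spark.memory.offHeap.size               100
# Without this, the vectorized Parquet reader can still allocate on-heap:
spark.sql.columnVector.offheap.enabled  true
```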
[jira] [Updated] (SPARK-44564) Refine the documents with LLM
[ https://issues.apache.org/jira/browse/SPARK-44564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-44564:
----------------------------------

Description:
Let's first focus on the documents of *PySpark DataFrame APIs*.

*1*, Choose a subset of DF APIs
Since the review bandwidth is limited, we recommend that each PR contain at least 5 APIs;

*2*, For each API, copy-paste the function (including the function signature and doc string) into an LLM, and ask it with a prompt (e.g. the attached prompt); you can of course use/design your own prompt.
For prompt engineering, you can refer to these [Best practices|https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api]

*3*, Note that the LLM is not 100% reliable; the generated doc string may still contain mistakes, e.g.
* The example code cannot run
* The example results are incorrect
* The example code doesn't reflect the example title
* The description uses the wrong version, or adds a 'Raises' section for a non-existent exception
* The lint can be broken
* ...
We need to fix these before sending a PR.
We can try different prompts, choose the good parts, and combine them into the new doc string.

was: (the previous description, identical except that it additionally recommended leveraging *GPT-4* instead of GPT-3.5, since the former generates better results)

> Refine the documents with LLM
> -----------------------------
>
> Key: SPARK-44564
> URL: https://issues.apache.org/jira/browse/SPARK-44564
> Project: Spark
> Issue Type: Umbrella
> Components: Documentation
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
> Attachments: docstr_prompt.py
>
> (the updated description above, quoted)
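Several of the failure modes listed in SPARK-44564 ("example code cannot run", "example results are incorrect") can be caught mechanically before sending a PR. A hedged sketch using the standard-library `doctest` module; the function and its docstring below are hypothetical, not a real PySpark API:

```python
import doctest

def refined_func():
    """Hypothetical LLM-refined docstring under review.

    Examples
    --------
    >>> 1 + 1
    2
    """

# Parse the examples out of the docstring and run them; a failure means an
# example either raised or printed something other than what the doc claims.
parser = doctest.DocTestParser()
test = parser.get_doctest(refined_func.__doc__, {}, "refined_func", None, None)
result = doctest.DocTestRunner(verbose=False).run(test)
print(result.failed, result.attempted)  # 0 1
```

This catches runtime and output mismatches; the remaining items on the list (misleading titles, wrong versions, lint) still need a human or a linter pass.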
[jira] [Commented] (SPARK-44782) Adjust Pull Request Template to incorporate the ASF Generative Tooling Guidance recommendations
[ https://issues.apache.org/jira/browse/SPARK-44782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754516#comment-17754516 ]

ASF GitHub Bot commented on SPARK-44782:
----------------------------------------

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/42469

> Adjust Pull Request Template to incorporate the ASF Generative Tooling
> Guidance recommendations
> ----------------------------------------------------------------------
>
> Key: SPARK-44782
> URL: https://issues.apache.org/jira/browse/SPARK-44782
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Affects Versions: 3.3.2, 3.4.1
> Reporter: Maciej Szymkiewicz
> Priority: Major
>
> The recently released [ASF Generative Tooling
> Guidance|https://www.apache.org/legal/generative-tooling.html] recommends
> keeping track of the generative AI tools used to author patches:
> ??When providing contributions authored using generative AI tooling, a
> recommended practice is for contributors to indicate the tooling used to
> create the contribution. This should be included as a token in the source
> control commit message, for example including the phrase “Generated-by: ”.
> This allows for future release tooling to be considered that pulls this
> content into a machine parsable Tooling-Provenance file.??
> We should adjust the PR template accordingly.
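The guidance quoted in SPARK-44782 suggests a machine-parsable token in commit messages. A minimal sketch of how release tooling could extract such tokens; the function name and exact message format are illustrative, since the guidance does not prescribe a parser:

```python
def generated_by_trailers(commit_message: str) -> list:
    # Collect tool names from "Generated-by:" trailer lines, the token
    # format recommended by the ASF Generative Tooling Guidance.
    return [line.split(":", 1)[1].strip()
            for line in commit_message.splitlines()
            if line.startswith("Generated-by:")]

msg = "Fix typo in docs\n\nGenerated-by: ExampleLLM v1\n"
print(generated_by_trailers(msg))  # ['ExampleLLM v1']
print(generated_by_trailers("No trailer here\n"))  # []
```

A PR template checkbox plus a trailer of this shape would let a future Tooling-Provenance file be assembled automatically from the commit history.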
[jira] [Commented] (SPARK-44806) Separate connect-client-jvm-internal
[ https://issues.apache.org/jira/browse/SPARK-44806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754500#comment-17754500 ]

ASF GitHub Bot commented on SPARK-44806:
----------------------------------------

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/42441

> Separate connect-client-jvm-internal
> ------------------------------------
>
> Key: SPARK-44806
> URL: https://issues.apache.org/jira/browse/SPARK-44806
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Juliusz Sompolski
> Priority: Major
[jira] [Commented] (SPARK-44782) Adjust Pull Request Template to incorporate the ASF Generative Tooling Guidance recommendations
[ https://issues.apache.org/jira/browse/SPARK-44782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754482#comment-17754482 ]

Maciej Szymkiewicz commented on SPARK-44782:
--------------------------------------------

Created a pull request for this issue:
https://github.com/apache/spark/pull/42469

> Adjust Pull Request Template to incorporate the ASF Generative Tooling
> Guidance recommendations
> ----------------------------------------------------------------------
>
> Key: SPARK-44782
> URL: https://issues.apache.org/jira/browse/SPARK-44782
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Affects Versions: 3.3.2, 3.4.1
> Reporter: Maciej Szymkiewicz
> Priority: Major
>
> The recently released [ASF Generative Tooling
> Guidance|https://www.apache.org/legal/generative-tooling.html] recommends
> keeping track of the generative AI tools used to author patches:
> ??When providing contributions authored using generative AI tooling, a
> recommended practice is for contributors to indicate the tooling used to
> create the contribution. This should be included as a token in the source
> control commit message, for example including the phrase “Generated-by: ”.
> This allows for future release tooling to be considered that pulls this
> content into a machine parsable Tooling-Provenance file.??
> We should adjust the PR template accordingly.
[jira] [Created] (SPARK-44813) The JIRA Python misses our assignee when it searches user again
Kent Yao created SPARK-44813:
-----------------------------

Summary: The JIRA Python misses our assignee when it searches user again
Key: SPARK-44813
URL: https://issues.apache.org/jira/browse/SPARK-44813
Project: Spark
Issue Type: Bug
Components: Project Infra
Affects Versions: 4.0.0
Reporter: Kent Yao

{code:python}
>>> assignee = asf_jira.user("yao")
>>> "SPARK-44801"
'SPARK-44801'
>>> asf_jira.assign_issue(issue.key, assignee.name)
response text = {"errorMessages":[],"errors":{"assignee":"User 'airhot' cannot be assigned issues."}}
{code}
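The transcript in SPARK-44813 shows `asf_jira.user("yao")` resolving to the unrelated user 'airhot'. A hedged sketch of the kind of exact-match guard a merge script could apply before calling `assign_issue`; the `User` stand-in and helper are illustrative, not the jira library's API:

```python
from collections import namedtuple

# Stand-in for the user objects a jira search returns (illustrative).
User = namedtuple("User", "name displayName")

def pick_exact_user(candidates, wanted_name):
    # Accept only an exact username match, so a fuzzy search for "yao"
    # can never silently resolve to someone like 'airhot'.
    return next((u for u in candidates if u.name == wanted_name), None)

candidates = [User("airhot", "Air Hot"), User("yao", "Kent Yao")]
print(pick_exact_user(candidates, "yao").displayName)  # Kent Yao
print(pick_exact_user([User("airhot", "Air Hot")], "yao"))  # None
```

Returning `None` instead of a wrong match lets the script fail loudly rather than assign the issue to the wrong account.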
[jira] [Assigned] (SPARK-44801) SQL Page does not capture failed queries in analyzer
[ https://issues.apache.org/jira/browse/SPARK-44801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao reassigned SPARK-44801:
--------------------------------

Assignee: Kent Yao (was: Kent Yao 2)

> SQL Page does not capture failed queries in analyzer
> ----------------------------------------------------
>
> Key: SPARK-44801
> URL: https://issues.apache.org/jira/browse/SPARK-44801
> Project: Spark
> Issue Type: Bug
> Components: SQL, Web UI
> Affects Versions: 3.2.4, 3.3.2, 3.4.1, 3.5.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
[jira] [Assigned] (SPARK-44801) SQL Page does not capture failed queries in analyzer
[ https://issues.apache.org/jira/browse/SPARK-44801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao reassigned SPARK-44801:
--------------------------------

Assignee: (was: Kent Yao 2)

> SQL Page does not capture failed queries in analyzer
> ----------------------------------------------------
>
> Key: SPARK-44801
> URL: https://issues.apache.org/jira/browse/SPARK-44801
> Project: Spark
> Issue Type: Bug
> Components: SQL, Web UI
> Affects Versions: 3.2.4, 3.3.2, 3.4.1, 3.5.0
> Reporter: Kent Yao
> Priority: Major
[jira] [Assigned] (SPARK-44801) SQL Page does not capture failed queries in analyzer
[ https://issues.apache.org/jira/browse/SPARK-44801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao reassigned SPARK-44801:
--------------------------------

Assignee: Kent Yao 2

> SQL Page does not capture failed queries in analyzer
> ----------------------------------------------------
>
> Key: SPARK-44801
> URL: https://issues.apache.org/jira/browse/SPARK-44801
> Project: Spark
> Issue Type: Bug
> Components: SQL, Web UI
> Affects Versions: 3.2.4, 3.3.2, 3.4.1, 3.5.0
> Reporter: Kent Yao
> Assignee: Kent Yao 2
> Priority: Major
[jira] [Commented] (SPARK-44782) Adjust Pull Request Template to incorporate the ASF Generative Tooling Guidance recommendations
[ https://issues.apache.org/jira/browse/SPARK-44782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754429#comment-17754429 ]

Xiao Li commented on SPARK-44782:
---------------------------------

+1 We should update the PR template.

> Adjust Pull Request Template to incorporate the ASF Generative Tooling
> Guidance recommendations
> ----------------------------------------------------------------------
>
> Key: SPARK-44782
> URL: https://issues.apache.org/jira/browse/SPARK-44782
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Affects Versions: 3.3.2, 3.4.1
> Reporter: Maciej Szymkiewicz
> Priority: Major
>
> The recently released [ASF Generative Tooling
> Guidance|https://www.apache.org/legal/generative-tooling.html] recommends
> keeping track of the generative AI tools used to author patches:
> ??When providing contributions authored using generative AI tooling, a
> recommended practice is for contributors to indicate the tooling used to
> create the contribution. This should be included as a token in the source
> control commit message, for example including the phrase “Generated-by: ”.
> This allows for future release tooling to be considered that pulls this
> content into a machine parsable Tooling-Provenance file.??
> We should adjust the PR template accordingly.