[jira] [Resolved] (SPARK-42724) Upgrade buf to v1.15.1
[ https://issues.apache.org/jira/browse/SPARK-42724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-42724. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40348 [https://github.com/apache/spark/pull/40348] > Upgrade buf to v1.15.1 > -- > > Key: SPARK-42724 > URL: https://issues.apache.org/jira/browse/SPARK-42724 > Project: Spark > Issue Type: Sub-task > Components: Build, Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0
[jira] [Assigned] (SPARK-42724) Upgrade buf to v1.15.1
[ https://issues.apache.org/jira/browse/SPARK-42724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-42724: - Assignee: BingKun Pan > Upgrade buf to v1.15.1 > -- > > Key: SPARK-42724 > URL: https://issues.apache.org/jira/browse/SPARK-42724 > Project: Spark > Issue Type: Sub-task > Components: Build, Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor
[jira] [Commented] (SPARK-42626) Add Destructive Iterator for SparkResult
[ https://issues.apache.org/jira/browse/SPARK-42626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698220#comment-17698220 ] Tengfei Huang commented on SPARK-42626: --- I will take a look! Thanks > Add Destructive Iterator for SparkResult > > > Key: SPARK-42626 > URL: https://issues.apache.org/jira/browse/SPARK-42626 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Add a destructive iterator to SparkResult. Instead of keeping everything in > memory for the lifetime of the SparkResult object, clean it up as soon as we > know we are done with it. We can use this for Dataset.toLocalIterator.
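A minimal Scala sketch of the destructive-iterator pattern the description asks for, assuming a simple in-memory buffer (the class is hypothetical, not the actual SparkResult change):

{code:java}
import scala.collection.mutable.ArrayBuffer

// Sketch: an iterator that releases each buffered element as soon as it has
// been handed to the caller, so the backing buffer does not pin the whole
// result in memory for the lifetime of the owning object.
class DestructiveIterator[T >: Null <: AnyRef](buffer: ArrayBuffer[T]) extends Iterator[T] {
  private var pos = 0
  override def hasNext: Boolean = pos < buffer.length
  override def next(): T = {
    val elem = buffer(pos)
    buffer(pos) = null // drop the reference so it can be garbage collected
    pos += 1
    elem
  }
}
{code}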
[jira] [Commented] (SPARK-42554) Spark Connect Scala Client
[ https://issues.apache.org/jira/browse/SPARK-42554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698216#comment-17698216 ] Yang Jie commented on SPARK-42554: -- Friendly ping [~ivoson] , [~hvanhovell] has created some tickets related to Spark Connect here; feel free to pick them up if you are interested ~ > Spark Connect Scala Client > -- > > Key: SPARK-42554 > URL: https://issues.apache.org/jira/browse/SPARK-42554 > Project: Spark > Issue Type: Epic > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > This is the EPIC to track all the work for the Spark Connect Scala Client.
[jira] [Assigned] (SPARK-42689) Allow ShuffleDriverComponent to declare if shuffle data is reliably stored
[ https://issues.apache.org/jira/browse/SPARK-42689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-42689: --- Assignee: Mridul Muralidharan > Allow ShuffleDriverComponent to declare if shuffle data is reliably stored > -- > > Key: SPARK-42689 > URL: https://issues.apache.org/jira/browse/SPARK-42689 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Mridul Muralidharan >Assignee: Mridul Muralidharan >Priority: Major > > Currently, if there is an executor node loss, we assume the shuffle data on > that node is also lost. This is not necessarily the case if there is a > shuffle component managing the shuffle data and reliably maintaining it (for > example, in a distributed filesystem or in a disaggregated shuffle cluster). > Downstream projects have patches to Apache Spark in order to work around this > issue; for example, Apache Celeborn has > [this|https://github.com/apache/incubator-celeborn/blob/main/assets/spark-patch/RSS_RDA_spark3.patch].
[jira] [Resolved] (SPARK-42689) Allow ShuffleDriverComponent to declare if shuffle data is reliably stored
[ https://issues.apache.org/jira/browse/SPARK-42689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-42689. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40307 [https://github.com/apache/spark/pull/40307] > Allow ShuffleDriverComponent to declare if shuffle data is reliably stored > -- > > Key: SPARK-42689 > URL: https://issues.apache.org/jira/browse/SPARK-42689 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Mridul Muralidharan >Assignee: Mridul Muralidharan >Priority: Major > Fix For: 3.5.0 > > > Currently, if there is an executor node loss, we assume the shuffle data on > that node is also lost. This is not necessarily the case if there is a > shuffle component managing the shuffle data and reliably maintaining it (for > example, in a distributed filesystem or in a disaggregated shuffle cluster). > Downstream projects have patches to Apache Spark in order to work around this > issue; for example, Apache Celeborn has > [this|https://github.com/apache/incubator-celeborn/blob/main/assets/spark-patch/RSS_RDA_spark3.patch].
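A hedged sketch of what a disaggregated shuffle implementation could do with this, assuming the flag added here is a method like supportsReliableStorage() on ShuffleDriverComponents (the class RemoteShuffleDriverComponents is made up):

{code:java}
import java.util.{Collections, Map => JMap}
import org.apache.spark.shuffle.api.ShuffleDriverComponents

// Hypothetical driver components for a remote shuffle service: shuffle blocks
// live in external storage, so losing an executor does not lose shuffle data
// and the scheduler need not rerun the map stages.
class RemoteShuffleDriverComponents extends ShuffleDriverComponents {
  override def initializeApplication(): JMap[String, String] = Collections.emptyMap()
  override def cleanupApplication(): Unit = ()
  // Assumed API from this ticket: declare the shuffle data as reliably stored.
  override def supportsReliableStorage(): Boolean = true
}
{code}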
[jira] [Assigned] (SPARK-42690) Implement CSV/JSON parsing functions
[ https://issues.apache.org/jira/browse/SPARK-42690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-42690: - Assignee: Yang Jie

> Implement CSV/JSON parsing functions
> ------------------------------------
>
> Key: SPARK-42690
> URL: https://issues.apache.org/jira/browse/SPARK-42690
> Project: Spark
> Issue Type: New Feature
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Herman van Hövell
> Assignee: Yang Jie
> Priority: Major
>
> Implement the following two methods in DataFrameReader:
>
> {code:java}
> /**
>  * Loads a `Dataset[String]` storing JSON objects (<a href="http://jsonlines.org/">JSON Lines
>  * text format or newline-delimited JSON</a>) and returns the result as a `DataFrame`.
>  *
>  * Unless the schema is specified using the `schema` function, this function goes through the
>  * input once to determine the input schema.
>  *
>  * @param jsonDataset input Dataset with one JSON object per record
>  * @since 3.4.0
>  */
> def json(jsonDataset: Dataset[String]): DataFrame
>
> /**
>  * Loads a `Dataset[String]` storing CSV rows and returns the result as a `DataFrame`.
>  *
>  * If the schema is not specified using the `schema` function and the `inferSchema` option is
>  * enabled, this function goes through the input once to determine the input schema.
>  *
>  * If the schema is not specified using the `schema` function and the `inferSchema` option is
>  * disabled, it determines the columns as string types and reads only the first line to
>  * determine the names and the number of fields.
>  *
>  * If `enforceSchema` is set to `false`, only the CSV header in the first line is checked to
>  * conform to the specified or inferred schema.
>  *
>  * @note if the `header` option is set to `true` when calling this API, all lines same as
>  * the header will be removed if they exist.
>  *
>  * @param csvDataset input Dataset with one CSV row per record
>  * @since 3.4.0
>  */
> def csv(csvDataset: Dataset[String]): DataFrame
> {code}
>
> For this we need a new message. We cannot use Project because we don't know the schema upfront.
>
> {code:java}
> message Parse {
>   // (Required) Input relation to Parse. The input is expected to have a single text column.
>   Relation input = 1;
>
>   // (Required) The expected format of the text.
>   ParseFormat format = 2;
>
>   enum ParseFormat {
>     PARSE_FORMAT_UNSPECIFIED = 0;
>     PARSE_FORMAT_CSV = 1;
>     PARSE_FORMAT_JSON = 2;
>   }
> }
> {code}
[jira] [Resolved] (SPARK-42690) Implement CSV/JSON parsing functions
[ https://issues.apache.org/jira/browse/SPARK-42690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-42690. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40332 [https://github.com/apache/spark/pull/40332]

> Implement CSV/JSON parsing functions
> ------------------------------------
>
> Key: SPARK-42690
> URL: https://issues.apache.org/jira/browse/SPARK-42690
> Project: Spark
> Issue Type: New Feature
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Herman van Hövell
> Assignee: Yang Jie
> Priority: Major
> Fix For: 3.4.0
>
> Implement the following two methods in DataFrameReader:
>
> {code:java}
> /**
>  * Loads a `Dataset[String]` storing JSON objects (<a href="http://jsonlines.org/">JSON Lines
>  * text format or newline-delimited JSON</a>) and returns the result as a `DataFrame`.
>  *
>  * Unless the schema is specified using the `schema` function, this function goes through the
>  * input once to determine the input schema.
>  *
>  * @param jsonDataset input Dataset with one JSON object per record
>  * @since 3.4.0
>  */
> def json(jsonDataset: Dataset[String]): DataFrame
>
> /**
>  * Loads a `Dataset[String]` storing CSV rows and returns the result as a `DataFrame`.
>  *
>  * If the schema is not specified using the `schema` function and the `inferSchema` option is
>  * enabled, this function goes through the input once to determine the input schema.
>  *
>  * If the schema is not specified using the `schema` function and the `inferSchema` option is
>  * disabled, it determines the columns as string types and reads only the first line to
>  * determine the names and the number of fields.
>  *
>  * If `enforceSchema` is set to `false`, only the CSV header in the first line is checked to
>  * conform to the specified or inferred schema.
>  *
>  * @note if the `header` option is set to `true` when calling this API, all lines same as
>  * the header will be removed if they exist.
>  *
>  * @param csvDataset input Dataset with one CSV row per record
>  * @since 3.4.0
>  */
> def csv(csvDataset: Dataset[String]): DataFrame
> {code}
>
> For this we need a new message. We cannot use Project because we don't know the schema upfront.
>
> {code:java}
> message Parse {
>   // (Required) Input relation to Parse. The input is expected to have a single text column.
>   Relation input = 1;
>
>   // (Required) The expected format of the text.
>   ParseFormat format = 2;
>
>   enum ParseFormat {
>     PARSE_FORMAT_UNSPECIFIED = 0;
>     PARSE_FORMAT_CSV = 1;
>     PARSE_FORMAT_JSON = 2;
>   }
> }
> {code}
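For context, a short usage sketch of the two methods, mirroring the existing DataFrameReader API (the sample data is made up):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parse-example").getOrCreate()
import spark.implicits._

// Parse a Dataset[String] of JSON Lines records into a DataFrame.
val jsonDs = Seq("""{"name":"alice","age":1}""", """{"name":"bob","age":2}""").toDS()
val jsonDf = spark.read.json(jsonDs)

// Parse a Dataset[String] of CSV rows; the first row is treated as the header.
val csvDs = Seq("name,age", "alice,1", "bob,2").toDS()
val csvDf = spark.read.option("header", "true").csv(csvDs)
{code}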
[jira] [Commented] (SPARK-42725) Make LiteralExpression support array
[ https://issues.apache.org/jira/browse/SPARK-42725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698207#comment-17698207 ] Apache Spark commented on SPARK-42725: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40349 > Make LiteralExpression support array > > > Key: SPARK-42725 > URL: https://issues.apache.org/jira/browse/SPARK-42725 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major
[jira] [Assigned] (SPARK-42725) Make LiteralExpression support array
[ https://issues.apache.org/jira/browse/SPARK-42725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42725: Assignee: (was: Apache Spark) > Make LiteralExpression support array > > > Key: SPARK-42725 > URL: https://issues.apache.org/jira/browse/SPARK-42725 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major
[jira] [Assigned] (SPARK-42725) Make LiteralExpression support array
[ https://issues.apache.org/jira/browse/SPARK-42725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42725: Assignee: Apache Spark > Make LiteralExpression support array > > > Key: SPARK-42725 > URL: https://issues.apache.org/jira/browse/SPARK-42725 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major
[jira] [Commented] (SPARK-42725) Make LiteralExpression support array
[ https://issues.apache.org/jira/browse/SPARK-42725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698206#comment-17698206 ] Apache Spark commented on SPARK-42725: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40349 > Make LiteralExpression support array > > > Key: SPARK-42725 > URL: https://issues.apache.org/jira/browse/SPARK-42725 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major
[jira] [Created] (SPARK-42725) Make LiteralExpression support array
Ruifeng Zheng created SPARK-42725: - Summary: Make LiteralExpression support array Key: SPARK-42725 URL: https://issues.apache.org/jira/browse/SPARK-42725 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Resolved] (SPARK-42701) Add the try_aes_decrypt() function
[ https://issues.apache.org/jira/browse/SPARK-42701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-42701. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40340 [https://github.com/apache/spark/pull/40340] > Add the try_aes_decrypt() function > -- > > Key: SPARK-42701 > URL: https://issues.apache.org/jira/browse/SPARK-42701 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: starter > Fix For: 3.5.0 > > > Add the new function try_aes_decrypt(). The function aes_decrypt() fails with an > exception when it encounters a column value that it cannot decrypt. So, if a > column contains both bad and good input, it is impossible to decrypt even the good > input.
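A hedged illustration of the intended difference, driving the SQL functions from Scala given a SparkSession `spark` (the key and the bad ciphertext are made-up literals; AES keys must be 16, 24, or 32 bytes):

{code:java}
// Round-trip succeeds for decryptable input.
spark.sql(
  "SELECT cast(try_aes_decrypt(aes_encrypt('Spark', '0000111122223333'), '0000111122223333') AS STRING)"
).show() // -> Spark

// For a value that cannot be decrypted, try_aes_decrypt is expected to return
// NULL, whereas aes_decrypt would fail the whole query with an exception.
spark.sql("SELECT try_aes_decrypt(X'DEADBEEF', '0000111122223333')").show() // -> NULL
{code}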
[jira] [Assigned] (SPARK-42717) Upgrade mysql-connector-java from 8.0.31 to 8.0.32
[ https://issues.apache.org/jira/browse/SPARK-42717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-42717: Assignee: BingKun Pan > Upgrade mysql-connector-java from 8.0.31 to 8.0.32 > -- > > Key: SPARK-42717 > URL: https://issues.apache.org/jira/browse/SPARK-42717 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor
[jira] [Resolved] (SPARK-42717) Upgrade mysql-connector-java from 8.0.31 to 8.0.32
[ https://issues.apache.org/jira/browse/SPARK-42717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-42717. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40335 [https://github.com/apache/spark/pull/40335] > Upgrade mysql-connector-java from 8.0.31 to 8.0.32 > -- > > Key: SPARK-42717 > URL: https://issues.apache.org/jira/browse/SPARK-42717 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0
[jira] [Resolved] (SPARK-42697) /api/v1/applications returns 0 for duration
[ https://issues.apache.org/jira/browse/SPARK-42697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-42697. -- Fix Version/s: 3.3.3 3.2.4 3.4.0 Resolution: Fixed Issue resolved by pull request 40313 [https://github.com/apache/spark/pull/40313] > /api/v1/applications returns 0 for duration > -- > > Key: SPARK-42697 > URL: https://issues.apache.org/jira/browse/SPARK-42697 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.3, 3.2.3, 3.3.2, 3.4.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.3.3, 3.2.4, 3.4.0 > > > which should be the total uptime
[jira] [Assigned] (SPARK-42697) /api/v1/applications returns 0 for duration
[ https://issues.apache.org/jira/browse/SPARK-42697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-42697: Assignee: Kent Yao > /api/v1/applications returns 0 for duration > -- > > Key: SPARK-42697 > URL: https://issues.apache.org/jira/browse/SPARK-42697 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.3, 3.2.3, 3.3.2, 3.4.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > which should be the total uptime
[jira] [Commented] (SPARK-42703) How to use Fair Scheduler Pools
[ https://issues.apache.org/jira/browse/SPARK-42703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698163#comment-17698163 ] LiJie2023 commented on SPARK-42703: --- Sorry, this "SPARK-42703" was also submitted by me. I haven't got the correct answer yet.

李杰 leedd1...@163.com replied to the original message:

[ https://issues.apache.org/jira/browse/SPARK-42703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42703. - Resolution: Invalid

> How to use Fair Scheduler Pools
> -------------------------------
>
> Key: SPARK-42703
> URL: https://issues.apache.org/jira/browse/SPARK-42703
> Project: Spark
> Issue Type: Question
> Components: Scheduler
> Affects Versions: 3.2.3
> Reporter: LiJie2023
> Priority: Major
> Attachments: image-2023-03-08-09-53-35-867.png
>
> I have two questions to ask:
> # I wrote a demo referring to the official website, but it didn't meet my expectations. I don't know if there was a problem with my writing. I hope that when I use the following fairscheduler.xml, pool1 always performs tasks before pool2.
> # What is the relationship between "spark.scheduler.mode" and {{schedulingMode}} in fairscheduler.xml?
>
> {code:java}
> import java.text.SimpleDateFormat
> import java.util.Date
> import java.util.concurrent.TimeUnit
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.rdd.RDD
>
> object MultiJobTest {
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf()
>     conf.setAppName("test-pool").setMaster("local[1]")
>     conf.set("spark.scheduler.mode", "FAIR")
>     conf.set("spark.scheduler.allocation.file", "file:///D:/tmp/input/fairscheduler.xml")
>     val sparkContext = new SparkContext(conf)
>     val data: RDD[String] = sparkContext.textFile("file:///D:/tmp/input/input.txt")
>     val rdd = data.flatMap(_.split(","))
>       .map(x => (x(0), x(0)))
>     new Thread(() => {
>       sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
>       rdd.foreachAsync(x => {
>         println("1==start==" + new SimpleDateFormat("HH:mm:ss").format(new Date()))
>         Thread.sleep(1)
>         println("1==end==" + new SimpleDateFormat("HH:mm:ss").format(new Date()))
>       })
>     }).start()
>     new Thread(() => {
>       sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
>       rdd.foreachAsync(x => {
>         println("2==start==" + new SimpleDateFormat("HH:mm:ss").format(new Date()))
>         Thread.sleep(1)
>         println("2==end==" + new SimpleDateFormat("HH:mm:ss").format(new Date()))
>       })
>     }).start()
>     TimeUnit.MINUTES.sleep(2)
>     sparkContext.stop()
>   }
> }
> {code}
>
> fairscheduler.xml
>
> {code:java}
> <?xml version="1.0"?>
> <allocations>
>   <pool name="pool1">
>     <schedulingMode>FAIR</schedulingMode>
>     <weight>100</weight>
>     <minShare>0</minShare>
>   </pool>
>   <pool name="pool2">
>     <schedulingMode>FAIR</schedulingMode>
>     <weight>1</weight>
>     <minShare>0</minShare>
>   </pool>
> </allocations>
> {code}
>
> input.txt
>
> {code:java}
> aa bb
> {code}
[jira] [Resolved] (SPARK-42723) Support parsing data type json "timestamp_ltz" as TimestampType
[ https://issues.apache.org/jira/browse/SPARK-42723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-42723. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40345 [https://github.com/apache/spark/pull/40345] > Support parsing data type json "timestamp_ltz" as TimestampType > -- > > Key: SPARK-42723 > URL: https://issues.apache.org/jira/browse/SPARK-42723 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.4.0
[jira] [Resolved] (SPARK-42703) How to use Fair Scheduler Pools
[ https://issues.apache.org/jira/browse/SPARK-42703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42703. -- Resolution: Invalid

> How to use Fair Scheduler Pools
> -------------------------------
>
> Key: SPARK-42703
> URL: https://issues.apache.org/jira/browse/SPARK-42703
> Project: Spark
> Issue Type: Question
> Components: Scheduler
> Affects Versions: 3.2.3
> Reporter: LiJie2023
> Priority: Major
> Attachments: image-2023-03-08-09-53-35-867.png
>
> I have two questions to ask:
> # I wrote a demo referring to the official website, but it didn't meet my expectations. I don't know if there was a problem with my writing. I hope that when I use the following fairscheduler.xml, pool1 always performs tasks before pool2.
> # What is the relationship between "spark.scheduler.mode" and {{schedulingMode}} in fairscheduler.xml?
>
> {code:java}
> import java.text.SimpleDateFormat
> import java.util.Date
> import java.util.concurrent.TimeUnit
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.rdd.RDD
>
> object MultiJobTest {
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf()
>     conf.setAppName("test-pool").setMaster("local[1]")
>     conf.set("spark.scheduler.mode", "FAIR")
>     conf.set("spark.scheduler.allocation.file", "file:///D:/tmp/input/fairscheduler.xml")
>     val sparkContext = new SparkContext(conf)
>     val data: RDD[String] = sparkContext.textFile("file:///D:/tmp/input/input.txt")
>     val rdd = data.flatMap(_.split(","))
>       .map(x => (x(0), x(0)))
>     new Thread(() => {
>       sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
>       rdd.foreachAsync(x => {
>         println("1==start==" + new SimpleDateFormat("HH:mm:ss").format(new Date()))
>         Thread.sleep(1)
>         println("1==end==" + new SimpleDateFormat("HH:mm:ss").format(new Date()))
>       })
>     }).start()
>     new Thread(() => {
>       sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
>       rdd.foreachAsync(x => {
>         println("2==start==" + new SimpleDateFormat("HH:mm:ss").format(new Date()))
>         Thread.sleep(1)
>         println("2==end==" + new SimpleDateFormat("HH:mm:ss").format(new Date()))
>       })
>     }).start()
>     TimeUnit.MINUTES.sleep(2)
>     sparkContext.stop()
>   }
> }
> {code}
>
> fairscheduler.xml
>
> {code:java}
> <?xml version="1.0"?>
> <allocations>
>   <pool name="pool1">
>     <schedulingMode>FAIR</schedulingMode>
>     <weight>100</weight>
>     <minShare>0</minShare>
>   </pool>
>   <pool name="pool2">
>     <schedulingMode>FAIR</schedulingMode>
>     <weight>1</weight>
>     <minShare>0</minShare>
>   </pool>
> </allocations>
> {code}
>
> input.txt
>
> {code:java}
> aa bb
> {code}
[jira] [Updated] (SPARK-42497) Support of pandas API on Spark for Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42497: - Summary: Support of pandas API on Spark for Spark Connect (was: Support of pandas API on Spark for Spark Connect.) > Support of pandas API on Spark for Spark Connect > > > Key: SPARK-42497 > URL: https://issues.apache.org/jira/browse/SPARK-42497 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > We should enable `pandas API on Spark` on Spark Connect.
[jira] [Commented] (SPARK-42711) build/sbt usage error messages about java-home
[ https://issues.apache.org/jira/browse/SPARK-42711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698155#comment-17698155 ] Apache Spark commented on SPARK-42711: -- User 'liang3zy22' has created a pull request for this issue: https://github.com/apache/spark/pull/40347 > build/sbt usage error messages about java-home > -- > > Key: SPARK-42711 > URL: https://issues.apache.org/jira/browse/SPARK-42711 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.2 >Reporter: Liang Yan >Priority: Minor > > The build/sbt tool's usage information about java-home is wrong: > # java version (default: java from PATH, currently $(java -version 2>&1 | > grep version)) > -java-home alternate JAVA_HOME
[jira] [Assigned] (SPARK-42711) build/sbt usage error messages about java-home
[ https://issues.apache.org/jira/browse/SPARK-42711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42711: Assignee: (was: Apache Spark) > build/sbt usage error messages about java-home > -- > > Key: SPARK-42711 > URL: https://issues.apache.org/jira/browse/SPARK-42711 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.2 >Reporter: Liang Yan >Priority: Minor > > The build/sbt tool's usage information about java-home is wrong: > # java version (default: java from PATH, currently $(java -version 2>&1 | > grep version)) > -java-home alternate JAVA_HOME
[jira] [Assigned] (SPARK-42711) build/sbt usage error messages about java-home
[ https://issues.apache.org/jira/browse/SPARK-42711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42711: Assignee: Apache Spark > build/sbt usage error messages about java-home > -- > > Key: SPARK-42711 > URL: https://issues.apache.org/jira/browse/SPARK-42711 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.2 >Reporter: Liang Yan >Assignee: Apache Spark >Priority: Minor > > The build/sbt tool's usage information about java-home is wrong: > # java version (default: java from PATH, currently $(java -version 2>&1 | > grep version)) > -java-home alternate JAVA_HOME
[jira] [Assigned] (SPARK-42724) Upgrade buf to v1.15.1
[ https://issues.apache.org/jira/browse/SPARK-42724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42724: Assignee: (was: Apache Spark) > Upgrade buf to v1.15.1 > -- > > Key: SPARK-42724 > URL: https://issues.apache.org/jira/browse/SPARK-42724 > Project: Spark > Issue Type: Sub-task > Components: Build, Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Priority: Minor
[jira] [Commented] (SPARK-42724) Upgrade buf to v1.15.1
[ https://issues.apache.org/jira/browse/SPARK-42724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698156#comment-17698156 ] Apache Spark commented on SPARK-42724: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40348 > Upgrade buf to v1.15.1 > -- > > Key: SPARK-42724 > URL: https://issues.apache.org/jira/browse/SPARK-42724 > Project: Spark > Issue Type: Sub-task > Components: Build, Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Priority: Minor
[jira] [Assigned] (SPARK-42724) Upgrade buf to v1.15.1
[ https://issues.apache.org/jira/browse/SPARK-42724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42724: Assignee: Apache Spark > Upgrade buf to v1.15.1 > -- > > Key: SPARK-42724 > URL: https://issues.apache.org/jira/browse/SPARK-42724 > Project: Spark > Issue Type: Sub-task > Components: Build, Connect >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor
[jira] [Created] (SPARK-42724) Upgrade buf to v1.15.1
BingKun Pan created SPARK-42724: --- Summary: Upgrade buf to v1.15.1 Key: SPARK-42724 URL: https://issues.apache.org/jira/browse/SPARK-42724 Project: Spark Issue Type: Sub-task Components: Build, Connect Affects Versions: 3.4.1 Reporter: BingKun Pan
[jira] [Updated] (SPARK-42643) Register Java (aggregate) user-defined functions
[ https://issues.apache.org/jira/browse/SPARK-42643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42643: - Parent: SPARK-41661 Issue Type: Sub-task (was: Improvement) > Register Java (aggregate) user-defined functions > > > Key: SPARK-42643 > URL: https://issues.apache.org/jira/browse/SPARK-42643 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `spark.udf.registerJavaFunction`.
[jira] [Resolved] (SPARK-42722) Python Connect def schema() should not cache the schema
[ https://issues.apache.org/jira/browse/SPARK-42722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42722. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40343 [https://github.com/apache/spark/pull/40343] > Python Connect def schema() should not cache the schema > > > Key: SPARK-42722 > URL: https://issues.apache.org/jira/browse/SPARK-42722 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0
[jira] [Updated] (SPARK-42480) Improve the performance of drop partitions
[ https://issues.apache.org/jira/browse/SPARK-42480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-42480: - Fix Version/s: 3.4.0 > Improve the performance of drop partitions > -- > > Key: SPARK-42480 > URL: https://issues.apache.org/jira/browse/SPARK-42480 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: Wechar >Assignee: Wechar >Priority: Major > Fix For: 3.4.0, 3.5.0 > > > Currently, to drop the matching partitions, Spark will first get all matching > Partition objects from the Hive metastore, and just use the partition values of > these Partition objects. > We can get the matching partition names instead of the partition objects for > the following reasons: > 1. we can also get partition values through a partition name (like a=1/b=2) > 2. the byte size of a partition name is much smaller than a Partition object, > which will help improve the performance of drop partitions.
[jira] [Updated] (SPARK-42480) Improve the performance of drop partitions
[ https://issues.apache.org/jira/browse/SPARK-42480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-42480: - Fix Version/s: (was: 3.5.0) > Improve the performance of drop partitions > -- > > Key: SPARK-42480 > URL: https://issues.apache.org/jira/browse/SPARK-42480 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: Wechar >Assignee: Wechar >Priority: Major > Fix For: 3.4.0 > > > Currently, to drop the matching partitions, Spark will first get all matching > Partition objects from the Hive metastore, and just use the partition values of > these Partition objects. > We can get the matching partition names instead of the partition objects for > the following reasons: > 1. we can also get partition values through a partition name (like a=1/b=2) > 2. the byte size of a partition name is much smaller than a Partition object, > which will help improve the performance of drop partitions.
[jira] [Assigned] (SPARK-42480) Improve the performance of drop partitions
[ https://issues.apache.org/jira/browse/SPARK-42480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-42480: Assignee: Wechar > Improve the performance of drop partitions > -- > > Key: SPARK-42480 > URL: https://issues.apache.org/jira/browse/SPARK-42480 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: Wechar >Assignee: Wechar >Priority: Major > > Currently, to drop the matching partitions, Spark will first get all matching > Partition objects from the Hive metastore, and just use the partition values of > these Partition objects. > We can get the matching partition names instead of the partition objects for > the following reasons: > 1. we can also get partition values through a partition name (like a=1/b=2) > 2. the byte size of a partition name is much smaller than a Partition object, > which will help improve the performance of drop partitions.
[jira] [Resolved] (SPARK-42480) Improve the performance of drop partitions
[ https://issues.apache.org/jira/browse/SPARK-42480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-42480. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40069 [https://github.com/apache/spark/pull/40069] > Improve the performance of drop partitions > -- > > Key: SPARK-42480 > URL: https://issues.apache.org/jira/browse/SPARK-42480 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: Wechar >Assignee: Wechar >Priority: Major > Fix For: 3.5.0 > > > Currently, to drop the matching partitions, Spark will first get all matching > Partition objects from the Hive metastore, and just use the partition values of > these Partition objects. > We can get the matching partition names instead of the partition objects for > the following reasons: > 1. we can also get partition values through a partition name (like a=1/b=2) > 2. the byte size of a partition name is much smaller than a Partition object, > which will help improve the performance of drop partitions.
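For intuition on point 1, a hedged sketch (a hypothetical helper, not the PR's code) of recovering partition values from a partition name of the form a=1/b=2, ignoring Hive's escaping of special characters:

{code:java}
// Split a Hive partition name like "a=1/b=2" into column -> value pairs.
// This is all the drop-partitions path needs, and a name is much smaller
// than a full Partition object fetched from the metastore.
def partitionValues(partitionName: String): Map[String, String] =
  partitionName.split("/").map { kv =>
    val Array(col, value) = kv.split("=", 2)
    col -> value
  }.toMap

// partitionValues("a=1/b=2") == Map("a" -> "1", "b" -> "2")
{code}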
[jira] [Commented] (SPARK-42667) Spark Connect: newSession API
[ https://issues.apache.org/jira/browse/SPARK-42667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698119#comment-17698119 ] Apache Spark commented on SPARK-42667: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/40346 > Spark Connect: newSession API > - > > Key: SPARK-42667 > URL: https://issues.apache.org/jira/browse/SPARK-42667 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.1 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.1
[jira] [Commented] (SPARK-42723) Support parsing data type json "timestamp_ltz" as TimestampType
[ https://issues.apache.org/jira/browse/SPARK-42723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698096#comment-17698096 ] Apache Spark commented on SPARK-42723: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/40345 > Support parsing data type json "timestamp_ltz" as TimestampType > -- > > Key: SPARK-42723 > URL: https://issues.apache.org/jira/browse/SPARK-42723 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major
[jira] [Assigned] (SPARK-42723) Support parsing data type json "timestamp_ltz" as TimestampType
[ https://issues.apache.org/jira/browse/SPARK-42723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42723: Assignee: Apache Spark (was: Gengliang Wang) > Support parsing data type json "timestamp_ltz" as TimestampType > -- > > Key: SPARK-42723 > URL: https://issues.apache.org/jira/browse/SPARK-42723 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major
[jira] [Assigned] (SPARK-42723) Support parsing data type json "timestamp_ltz" as TimestampType
[ https://issues.apache.org/jira/browse/SPARK-42723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42723: Assignee: Gengliang Wang (was: Apache Spark) > Support parsing data type json "timestamp_ltz" as TimestampType > -- > > Key: SPARK-42723 > URL: https://issues.apache.org/jira/browse/SPARK-42723 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major
[jira] [Commented] (SPARK-42723) Support parsing data type json "timestamp_ltz" as TimestampType
[ https://issues.apache.org/jira/browse/SPARK-42723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698095#comment-17698095 ] Apache Spark commented on SPARK-42723: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/40345 > Support parsing data type json "timestamp_ltz" as TimestampType > -- > > Key: SPARK-42723 > URL: https://issues.apache.org/jira/browse/SPARK-42723 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major
[jira] [Created] (SPARK-42723) Support parsing data type json "timestamp_ltz" as TimestampType
Gengliang Wang created SPARK-42723: -- Summary: Support parsing data type json "timestamp_ltz" as TimestampType Key: SPARK-42723 URL: https://issues.apache.org/jira/browse/SPARK-42723 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang Assignee: Gengliang Wang
[jira] [Commented] (SPARK-42656) Spark Connect Scala Client Shell Script
[ https://issues.apache.org/jira/browse/SPARK-42656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698081#comment-17698081 ] Apache Spark commented on SPARK-42656: -- User 'zhenlineo' has created a pull request for this issue: https://github.com/apache/spark/pull/40344 > Spark Connect Scala Client Shell Script > --- > > Key: SPARK-42656 > URL: https://issues.apache.org/jira/browse/SPARK-42656 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Assignee: Zhen Li >Priority: Major > Fix For: 3.4.0 > > > Adding a shell script to run the Scala client in a Scala REPL, to allow users to > connect to Spark Connect.
[jira] [Commented] (SPARK-42722) Python Connect def schema() should not cache the schema
[ https://issues.apache.org/jira/browse/SPARK-42722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698073#comment-17698073 ] Apache Spark commented on SPARK-42722: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/40343 > Python Connect def schema() should not cache the schema > > > Key: SPARK-42722 > URL: https://issues.apache.org/jira/browse/SPARK-42722 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major
[jira] [Assigned] (SPARK-42722) Python Connect def schema() should not cache the schema
[ https://issues.apache.org/jira/browse/SPARK-42722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42722: Assignee: Apache Spark (was: Rui Wang) > Python Connect def schema() should not cache the schema > > > Key: SPARK-42722 > URL: https://issues.apache.org/jira/browse/SPARK-42722 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major
[jira] [Commented] (SPARK-42722) Python Connect def schema() should not cache the schema
[ https://issues.apache.org/jira/browse/SPARK-42722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698072#comment-17698072 ] Apache Spark commented on SPARK-42722: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/40343 > Python Connect def schema() should not cache the schema > > > Key: SPARK-42722 > URL: https://issues.apache.org/jira/browse/SPARK-42722 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major
[jira] [Assigned] (SPARK-42722) Python Connect def schema() should not cache the schema
[ https://issues.apache.org/jira/browse/SPARK-42722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42722: Assignee: Rui Wang (was: Apache Spark) > Python Connect def schema() should not cache the schema > > > Key: SPARK-42722 > URL: https://issues.apache.org/jira/browse/SPARK-42722 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major
[jira] [Created] (SPARK-42722) Python Connect def schema() should not cache the schema
Rui Wang created SPARK-42722: Summary: Python Connect def schema() should not cache the schema Key: SPARK-42722 URL: https://issues.apache.org/jira/browse/SPARK-42722 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang Assignee: Rui Wang
[jira] [Commented] (SPARK-42721) Add an Interceptor to log RPCs in connect-server
[ https://issues.apache.org/jira/browse/SPARK-42721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698057#comment-17698057 ] Apache Spark commented on SPARK-42721: -- User 'rangadi' has created a pull request for this issue: https://github.com/apache/spark/pull/40342 > Add an Interceptor to log RPCs in connect-server > > > Key: SPARK-42721 > URL: https://issues.apache.org/jira/browse/SPARK-42721 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Priority: Major > Fix For: 3.5.0 > > > It would be useful to be able to log RPCs to the connect server during > development. It makes it simpler to see the flow of messages.
[jira] [Assigned] (SPARK-42721) Add an Interceptor to log RPCs in connect-server
[ https://issues.apache.org/jira/browse/SPARK-42721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42721: Assignee: (was: Apache Spark) > Add an Interceptor to log RPCs in connect-server > > > Key: SPARK-42721 > URL: https://issues.apache.org/jira/browse/SPARK-42721 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Priority: Major > Fix For: 3.5.0 > > > It would be useful to be able to log RPCs to the connect server during > development. It makes it simpler to see the flow of messages.
[jira] [Assigned] (SPARK-42721) Add an Interceptor to log RPCs in connect-server
[ https://issues.apache.org/jira/browse/SPARK-42721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42721: Assignee: Apache Spark > Add an Interceptor to log RPCs in connect-server > > > Key: SPARK-42721 > URL: https://issues.apache.org/jira/browse/SPARK-42721 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Assignee: Apache Spark >Priority: Major > Fix For: 3.5.0 > > > It would be useful to be able to log RPCs to the connect server during > development. It makes it simpler to see the flow of messages.
[jira] [Commented] (SPARK-42721) Add an Interceptor to log RPCs in connect-server
[ https://issues.apache.org/jira/browse/SPARK-42721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698055#comment-17698055 ] Apache Spark commented on SPARK-42721: -- User 'rangadi' has created a pull request for this issue: https://github.com/apache/spark/pull/40342 > Add an Interceptor to log RPCs in connect-server > > > Key: SPARK-42721 > URL: https://issues.apache.org/jira/browse/SPARK-42721 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Priority: Major > Fix For: 3.5.0 > > > It would be useful to be able to log RPCs to the connect server during > development. It makes it simpler to see the flow of messages.
[jira] [Created] (SPARK-42721) Add an Interceptor to log RPCs in connect-server
Raghu Angadi created SPARK-42721: Summary: Add an Interceptor to log RPCs in connect-server Key: SPARK-42721 URL: https://issues.apache.org/jira/browse/SPARK-42721 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Raghu Angadi Fix For: 3.5.0 It would be useful to be able to log RPCs to the connect server during development. It makes it simpler to see the flow of messages.
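A hedged sketch of one way to do this with a plain gRPC ServerInterceptor (not necessarily what the linked PR implements):

{code:java}
import io.grpc.{ForwardingServerCallListener, Metadata, ServerCall, ServerCallHandler, ServerInterceptor}

// Logs every inbound RPC message together with the fully qualified method
// name; such an interceptor can be registered on the gRPC server builder
// during development to see the flow of messages.
class LoggingInterceptor extends ServerInterceptor {
  override def interceptCall[ReqT, RespT](
      call: ServerCall[ReqT, RespT],
      headers: Metadata,
      next: ServerCallHandler[ReqT, RespT]): ServerCall.Listener[ReqT] = {
    val method = call.getMethodDescriptor.getFullMethodName
    val listener = next.startCall(call, headers)
    new ForwardingServerCallListener.SimpleForwardingServerCallListener[ReqT](listener) {
      override def onMessage(message: ReqT): Unit = {
        println(s"RPC $method received: $message") // log the proto message
        super.onMessage(message)
      }
    }
  }
}
{code}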
[jira] [Updated] (SPARK-42689) Allow ShuffleDriverComponent to declare if shuffle data is reliably stored
[ https://issues.apache.org/jira/browse/SPARK-42689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42689: -- Affects Version/s: 3.5.0 (was: 3.1.0) (was: 3.2.0) (was: 3.3.0) (was: 3.4.0) > Allow ShuffleDriverComponent to declare if shuffle data is reliably stored > -- > > Key: SPARK-42689 > URL: https://issues.apache.org/jira/browse/SPARK-42689 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Mridul Muralidharan >Priority: Major > > Currently, if there is an executor node loss, we assume the shuffle data on > that node is also lost. This is not necessarily the case if there is a > shuffle component managing the shuffle data and reliably maintaining it (for > example, in a distributed filesystem or in a disaggregated shuffle cluster). > Downstream projects have patches to Apache Spark in order to work around this > issue; for example, Apache Celeborn has > [this|https://github.com/apache/incubator-celeborn/blob/main/assets/spark-patch/RSS_RDA_spark3.patch].
[jira] [Resolved] (SPARK-42709) Do not rely on __file__
[ https://issues.apache.org/jira/browse/SPARK-42709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42709. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40328 [https://github.com/apache/spark/pull/40328] > Do not rely on __file__ > --- > > Key: SPARK-42709 > URL: https://issues.apache.org/jira/browse/SPARK-42709 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > We have a lot of places using __file__, which is actually optional. We > shouldn't rely on them.
[jira] [Assigned] (SPARK-42709) Do not rely on __file__
[ https://issues.apache.org/jira/browse/SPARK-42709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42709: - Assignee: Hyukjin Kwon > Do not rely on __file__ > --- > > Key: SPARK-42709 > URL: https://issues.apache.org/jira/browse/SPARK-42709 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > We have a lot of places using __file__, which is actually optional. We > shouldn't rely on them.
[jira] [Commented] (SPARK-42715) NegativeArraySizeException caused by too much data read from ORC file
[ https://issues.apache.org/jira/browse/SPARK-42715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697999#comment-17697999 ] Apache Spark commented on SPARK-42715: -- User 'chong0929' has created a pull request for this issue: https://github.com/apache/spark/pull/40341 > NegativeArraySizeException caused by too much data read from ORC file > --- > > Key: SPARK-42715 > URL: https://issues.apache.org/jira/browse/SPARK-42715 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: XiaoLong Wu >Priority: Minor > > Should we provide a friendlier exception message about how to avoid this > exception? For example, when we catch this exception, we could tell the user to > reduce the value of spark.sql.orc.columnarReaderBatchSize. > In the current version, batch reading of ORC files goes through > OrcColumnarBatchReader.nextBatch(), which depends on > [ORC|https://github.com/apache/orc] (version: 1.8.2) to complete the data copy; the > relevant ORC code is as follows: > {code:java} > private static byte[] commonReadByteArrays(InStream stream, IntegerReader > lengths, > LongColumnVector scratchlcv, > BytesColumnVector result, final int batchSize) throws IOException { > // Read lengths > scratchlcv.isRepeating = result.isRepeating; > scratchlcv.noNulls = result.noNulls; > scratchlcv.isNull = result.isNull; // Notice we are replacing the isNull > vector here... > lengths.nextVector(scratchlcv, scratchlcv.vector, batchSize); > int totalLength = 0; > if (!scratchlcv.isRepeating) { > for (int i = 0; i < batchSize; i++) { > if (!scratchlcv.isNull[i]) { > totalLength += (int) scratchlcv.vector[i]; > } > } > } else { > if (!scratchlcv.isNull[0]) { > totalLength = (int) (batchSize * scratchlcv.vector[0]); > } > } > // Read all the strings for this batch > byte[] allBytes = new byte[totalLength]; > int offset = 0; > int len = totalLength; > while (len > 0) { > int bytesRead = stream.read(allBytes, offset, len); > if (bytesRead < 0) { > throw new EOFException("Can't finish byte read from " + stream); > } > len -= bytesRead; > offset += bytesRead; > } > return allBytes; > } {code} > As shown above, totalLength accumulates the data size from the long length > values in scratchlcv.vector. If the total data size exceeds max_int, the int > arithmetic overflows, and the resulting negative array size triggers the following exception: > {code:java} > Caused by: java.lang.NegativeArraySizeException > at > org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1998) > at > org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2021) > at > org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2119) > at > org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1962) > at > org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:197) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274) > ... 20 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
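One conceivable hardening, sketched below, is to accumulate the length as a long and fail fast with an actionable message before the allocation; this continues the quoted method above and mirrors the config-name suggestion from the report, so it is a sketch rather than the actual ORC patch:

{code:java}
// Sketch: overflow-safe variant of the length accumulation shown above.
long totalLength = 0;
for (int i = 0; i < batchSize; i++) {
  if (!scratchlcv.isNull[i]) {
    totalLength += scratchlcv.vector[i]; // accumulate in 64 bits, no cast
  }
}
if (totalLength > Integer.MAX_VALUE) {
  throw new IOException("Batch byte size " + totalLength
      + " exceeds Integer.MAX_VALUE; consider reducing"
      + " spark.sql.orc.columnarReaderBatchSize");
}
byte[] allBytes = new byte[(int) totalLength];
{code}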
[jira] [Assigned] (SPARK-42715) NegativeArraySizeException caused by too much data read from ORC file
[ https://issues.apache.org/jira/browse/SPARK-42715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42715: Assignee: Apache Spark > NegativeArraySizeException caused by too much data read from ORC file > --- > > Key: SPARK-42715 > URL: https://issues.apache.org/jira/browse/SPARK-42715 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: XiaoLong Wu >Assignee: Apache Spark >Priority: Minor > > Should we provide a friendlier exception message about how to avoid this > exception? For example, when we catch this exception, we could tell the user to > reduce the value of spark.sql.orc.columnarReaderBatchSize. > In the current version, batch reading of ORC files goes through > OrcColumnarBatchReader.nextBatch(), which depends on > [ORC|https://github.com/apache/orc] (version: 1.8.2) to complete the data copy; the > relevant ORC code is as follows: > {code:java} > private static byte[] commonReadByteArrays(InStream stream, IntegerReader > lengths, > LongColumnVector scratchlcv, > BytesColumnVector result, final int batchSize) throws IOException { > // Read lengths > scratchlcv.isRepeating = result.isRepeating; > scratchlcv.noNulls = result.noNulls; > scratchlcv.isNull = result.isNull; // Notice we are replacing the isNull > vector here... > lengths.nextVector(scratchlcv, scratchlcv.vector, batchSize); > int totalLength = 0; > if (!scratchlcv.isRepeating) { > for (int i = 0; i < batchSize; i++) { > if (!scratchlcv.isNull[i]) { > totalLength += (int) scratchlcv.vector[i]; > } > } > } else { > if (!scratchlcv.isNull[0]) { > totalLength = (int) (batchSize * scratchlcv.vector[0]); > } > } > // Read all the strings for this batch > byte[] allBytes = new byte[totalLength]; > int offset = 0; > int len = totalLength; > while (len > 0) { > int bytesRead = stream.read(allBytes, offset, len); > if (bytesRead < 0) { > throw new EOFException("Can't finish byte read from " + stream); > } > len -= bytesRead; > offset += bytesRead; > } > return allBytes; > } {code} > As shown above, totalLength accumulates the data size from the long length > values in scratchlcv.vector. If the total data size exceeds max_int, the int > arithmetic overflows, and the resulting negative array size triggers the following exception: > {code:java} > Caused by: java.lang.NegativeArraySizeException > at > org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1998) > at > org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2021) > at > org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2119) > at > org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1962) > at > org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:197) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274) > ... 20 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42715) NegativeArraySizeException caused by too much data read from ORC file
[ https://issues.apache.org/jira/browse/SPARK-42715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42715: Assignee: (was: Apache Spark) > NegativeArraySizeException caused by too much data read from ORC file > --- > > Key: SPARK-42715 > URL: https://issues.apache.org/jira/browse/SPARK-42715 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: XiaoLong Wu >Priority: Minor > > Should we provide a friendlier exception message about how to avoid this > exception? For example, when we catch this exception, we could tell the user to > reduce the value of spark.sql.orc.columnarReaderBatchSize. > In the current version, batch reading of ORC files goes through > OrcColumnarBatchReader.nextBatch(), which depends on > [ORC|https://github.com/apache/orc] (version: 1.8.2) to complete the data copy; the > relevant ORC code is as follows: > {code:java} > private static byte[] commonReadByteArrays(InStream stream, IntegerReader > lengths, > LongColumnVector scratchlcv, > BytesColumnVector result, final int batchSize) throws IOException { > // Read lengths > scratchlcv.isRepeating = result.isRepeating; > scratchlcv.noNulls = result.noNulls; > scratchlcv.isNull = result.isNull; // Notice we are replacing the isNull > vector here... > lengths.nextVector(scratchlcv, scratchlcv.vector, batchSize); > int totalLength = 0; > if (!scratchlcv.isRepeating) { > for (int i = 0; i < batchSize; i++) { > if (!scratchlcv.isNull[i]) { > totalLength += (int) scratchlcv.vector[i]; > } > } > } else { > if (!scratchlcv.isNull[0]) { > totalLength = (int) (batchSize * scratchlcv.vector[0]); > } > } > // Read all the strings for this batch > byte[] allBytes = new byte[totalLength]; > int offset = 0; > int len = totalLength; > while (len > 0) { > int bytesRead = stream.read(allBytes, offset, len); > if (bytesRead < 0) { > throw new EOFException("Can't finish byte read from " + stream); > } > len -= bytesRead; > offset += bytesRead; > } > return allBytes; > } {code} > As shown above, totalLength accumulates the data size from the long length > values in scratchlcv.vector. If the total data size exceeds max_int, the int > arithmetic overflows, and the resulting negative array size triggers the following exception: > {code:java} > Caused by: java.lang.NegativeArraySizeException > at > org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1998) > at > org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2021) > at > org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2119) > at > org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1962) > at > org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:197) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274) > ... 20 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42720) Refactor the withSequenceColumn
Haejoon Lee created SPARK-42720: --- Summary: Refactor the withSequenceColumn Key: SPARK-42720 URL: https://issues.apache.org/jira/browse/SPARK-42720 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42701) Add the try_aes_decrypt() function
[ https://issues.apache.org/jira/browse/SPARK-42701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42701: Assignee: Max Gekk (was: Apache Spark) > Add the try_aes_decrypt() function > -- > > Key: SPARK-42701 > URL: https://issues.apache.org/jira/browse/SPARK-42701 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: starter > > Add new function try_aes_decrypt(). The function aes_decrypt() fails with an > exception when it encounters a column value that it cannot decrypt. So, if a > column contains both bad and good input, it is impossible to decrypt even the good > input. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
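To illustrate the intended difference, a small usage sketch; the two-argument try_aes_decrypt(expr, key) signature is assumed here to match the other try_* variants, and the literal key and data are made up:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TryAesDecryptSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("try_aes_decrypt sketch").master("local[*]").getOrCreate();

    // One row decrypts cleanly, the other is not valid ciphertext.
    // aes_decrypt would fail the whole query on the bad row; try_aes_decrypt
    // is expected to return NULL for it and still decrypt the good row.
    Dataset<Row> df = spark.sql(
        "SELECT col, try_aes_decrypt(unbase64(col), '0000111122223333') AS decrypted "
            + "FROM VALUES "
            + "  (base64(aes_encrypt('good input', '0000111122223333'))), "
            + "  ('bm90LWEtY2lwaGVydGV4dA==') AS t(col)");
    df.show(false);
    spark.stop();
  }
}
{code}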
[jira] [Assigned] (SPARK-42701) Add the try_aes_decrypt() function
[ https://issues.apache.org/jira/browse/SPARK-42701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42701: Assignee: Apache Spark (was: Max Gekk) > Add the try_aes_decrypt() function > -- > > Key: SPARK-42701 > URL: https://issues.apache.org/jira/browse/SPARK-42701 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > Labels: starter > > Add new function try_aes_decrypt(). The function aes_decrypt() fails with an > exception when it encounters a column value that it cannot decrypt. So, if a > column contains both bad and good input, it is impossible to decrypt even the good > input. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42701) Add the try_aes_decrypt() function
[ https://issues.apache.org/jira/browse/SPARK-42701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697923#comment-17697923 ] Apache Spark commented on SPARK-42701: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/40340 > Add the try_aes_decrypt() function > -- > > Key: SPARK-42701 > URL: https://issues.apache.org/jira/browse/SPARK-42701 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: starter > > Add new function try_aes_decrypt(). The function aes_decrypt() fails with an > exception when it encounters a column value that it cannot decrypt. So, if a > column contains both bad and good input, it is impossible to decrypt even the good > input. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42701) Add the try_aes_decrypt() function
[ https://issues.apache.org/jira/browse/SPARK-42701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-42701: Assignee: Max Gekk > Add the try_aes_decrypt() function > -- > > Key: SPARK-42701 > URL: https://issues.apache.org/jira/browse/SPARK-42701 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: starter > > Add new function try_aes_decrypt(). The function aes_decrypt() fails with an > exception when it encounters a column value that it cannot decrypt. So, if a > column contains both bad and good input, it is impossible to decrypt even the good > input. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42701) Add the try_aes_decrypt() function
[ https://issues.apache.org/jira/browse/SPARK-42701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697903#comment-17697903 ] Max Gekk commented on SPARK-42701: -- I am working on this. > Add the try_aes_decrypt() function > -- > > Key: SPARK-42701 > URL: https://issues.apache.org/jira/browse/SPARK-42701 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: starter > > Add new function try_aes_decrypt(). The function aes_decrypt() fails with an > exception when it encounters a column value that it cannot decrypt. So, if a > column contains both bad and good input, it is impossible to decrypt even the good > input. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42719) `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`
[ https://issues.apache.org/jira/browse/SPARK-42719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42719: Assignee: (was: Apache Spark) > `MapOutputTracker#getMapLocation` should respect > `spark.shuffle.reduceLocality.enabled` > > > Key: SPARK-42719 > URL: https://issues.apache.org/jira/browse/SPARK-42719 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: He Qi >Priority: Major > > Discussed at [https://github.com/apache/spark/pull/40307] > {{getPreferredLocations}} in {{ShuffledRowRDD}} should return {{Nil}} at the > very beginning in case {{spark.shuffle.reduceLocality.enabled = false}} > (conceptually). > This logic is pushed into MapOutputTracker though - and > {{getPreferredLocationsForShuffle}} honors > {{spark.shuffle.reduceLocality.enabled}} - but {{getMapLocation}} does not. > So the fix would be to change {{getMapLocation}} to honor the parameter. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
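The shape of the proposed fix, sketched as a guard clause; MapOutputTracker itself is Scala, so this Java rendering and the class, field, and helper names around the guard are purely illustrative:

{code:java}
import java.util.Collections;
import java.util.List;

// Illustrative only: the early-out that getMapLocation should gain, matching
// what getPreferredLocationsForShuffle already does.
class MapLocationSketch {
  private final boolean shuffleLocalityEnabled; // spark.shuffle.reduceLocality.enabled

  MapLocationSketch(boolean shuffleLocalityEnabled) {
    this.shuffleLocalityEnabled = shuffleLocalityEnabled;
  }

  List<String> getMapLocation(int shuffleId, int startMapIndex, int endMapIndex) {
    if (!shuffleLocalityEnabled) {
      return Collections.emptyList(); // no preferred locations when locality is disabled
    }
    return lookupMapOutputLocations(shuffleId, startMapIndex, endMapIndex);
  }

  private List<String> lookupMapOutputLocations(int shuffleId, int start, int end) {
    return Collections.emptyList(); // placeholder for the real map-status lookup
  }
}
{code}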
[jira] [Commented] (SPARK-42719) `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`
[ https://issues.apache.org/jira/browse/SPARK-42719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697891#comment-17697891 ] Apache Spark commented on SPARK-42719: -- User 'jerqi' has created a pull request for this issue: https://github.com/apache/spark/pull/40339 > `MapOutputTracker#getMapLocation` should respect > `spark.shuffle.reduceLocality.enabled` > > > Key: SPARK-42719 > URL: https://issues.apache.org/jira/browse/SPARK-42719 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: He Qi >Priority: Major > > Discussed at [https://github.com/apache/spark/pull/40307] > {{getPreferredLocations}} in {{ShuffledRowRDD}} should return {{Nil}} at the > very beginning in case {{spark.shuffle.reduceLocality.enabled = false}} > (conceptually). > This logic is pushed into MapOutputTracker though - and > {{getPreferredLocationsForShuffle}} honors > {{spark.shuffle.reduceLocality.enabled}} - but {{getMapLocation}} does not. > So the fix would be to change {{getMapLocation}} to honor the parameter. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42719) `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`
[ https://issues.apache.org/jira/browse/SPARK-42719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42719: Assignee: Apache Spark > `MapOutputTracker#getMapLocation` should respect > `spark.shuffle.reduceLocality.enabled` > > > Key: SPARK-42719 > URL: https://issues.apache.org/jira/browse/SPARK-42719 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: He Qi >Assignee: Apache Spark >Priority: Major > > Discussed at [https://github.com/apache/spark/pull/40307] > {{getPreferredLocations}} in {{ShuffledRowRDD}} should return {{Nil}} at the > very beginning in case {{spark.shuffle.reduceLocality.enabled = false}} > (conceptually). > This logic is pushed into MapOutputTracker though - and > {{getPreferredLocationsForShuffle}} honors > {{spark.shuffle.reduceLocality.enabled}} - but {{getMapLocation}} does not. > So the fix would be to change {{getMapLocation}} to honor the parameter. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42684) v2 catalog should not allow column default value by default
[ https://issues.apache.org/jira/browse/SPARK-42684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42684: --- Assignee: Wenchen Fan > v2 catalog should not allow column default value by default > --- > > Key: SPARK-42684 > URL: https://issues.apache.org/jira/browse/SPARK-42684 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42684) v2 catalog should not allow column default value by default
[ https://issues.apache.org/jira/browse/SPARK-42684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42684. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40299 [https://github.com/apache/spark/pull/40299] > v2 catalog should not allow column default value by default > --- > > Key: SPARK-42684 > URL: https://issues.apache.org/jira/browse/SPARK-42684 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42719) `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`
[ https://issues.apache.org/jira/browse/SPARK-42719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Qi updated SPARK-42719: -- Summary: `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled` (was: `MapOutputTracker#getPreferredLocations` should respect `spark.shuffle.reduceLocality.enabled`) > `MapOutputTracker#getMapLocation` should respect > `spark.shuffle.reduceLocality.enabled` > > > Key: SPARK-42719 > URL: https://issues.apache.org/jira/browse/SPARK-42719 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: He Qi >Priority: Major > > Discussed at [https://github.com/apache/spark/pull/40307] > {{getPreferredLocations}} in {{ShuffledRowRDD}} should return {{Nil}} at the > very beginning in case {{spark.shuffle.reduceLocality.enabled = false}} > (conceptually). > This logic is pushed into MapOutputTracker though - and > {{getPreferredLocationsForShuffle}} honors > {{spark.shuffle.reduceLocality.enabled}} - but {{getMapLocation}} does not. > So the fix would be to change {{getMapLocation}} to honor the parameter. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42719) `Map#getPreferredLocations` should respect `spark.shuffle.reduceLocality.enabled`
[ https://issues.apache.org/jira/browse/SPARK-42719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Qi updated SPARK-42719: -- Summary: `Map#getPreferredLocations` should respect `spark.shuffle.reduceLocality.enabled` (was: `ShuffledRowRdd#getPreferredLocations` should respect to `spark.shuffle.reduceLocality.enabled`) > `Map#getPreferredLocations` should respect > `spark.shuffle.reduceLocality.enabled` > -- > > Key: SPARK-42719 > URL: https://issues.apache.org/jira/browse/SPARK-42719 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: He Qi >Priority: Major > > Discussed at [https://github.com/apache/spark/pull/40307] > {{getPreferredLocations}} in {{ShuffledRowRDD}} should return {{Nil}} at the > very beginning in case {{spark.shuffle.reduceLocality.enabled = false}} > (conceptually). > This logic is pushed into MapOutputTracker though - and > {{getPreferredLocationsForShuffle}} honors > {{spark.shuffle.reduceLocality.enabled}} - but {{getMapLocation}} does not. > So the fix would be to change {{getMapLocation}} to honor the parameter. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42719) `MapOutputTracker#getPreferredLocations` should respect `spark.shuffle.reduceLocality.enabled`
[ https://issues.apache.org/jira/browse/SPARK-42719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Qi updated SPARK-42719: -- Summary: `MapOutputTracker#getPreferredLocations` should respect `spark.shuffle.reduceLocality.enabled` (was: `Map#getPreferredLocations` should respect `spark.shuffle.reduceLocality.enabled`) > `MapOutputTracker#getPreferredLocations` should respect > `spark.shuffle.reduceLocality.enabled` > --- > > Key: SPARK-42719 > URL: https://issues.apache.org/jira/browse/SPARK-42719 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: He Qi >Priority: Major > > Discussed at [https://github.com/apache/spark/pull/40307] > {{getPreferredLocations}} in {{ShuffledRowRDD}} should return {{Nil}} at the > very beginning in case {{spark.shuffle.reduceLocality.enabled = false}} > (conceptually). > This logic is pushed into MapOutputTracker though - and > {{getPreferredLocationsForShuffle}} honors > {{spark.shuffle.reduceLocality.enabled}} - but {{getMapLocation}} does not. > So the fix would be to change {{getMapLocation}} to honor the parameter. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42718) Upgrade rocksdbjni to 7.10.2
[ https://issues.apache.org/jira/browse/SPARK-42718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42718: Assignee: Apache Spark > Upgrade rocksdbjni to 7.10.2 > > > Key: SPARK-42718 > URL: https://issues.apache.org/jira/browse/SPARK-42718 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > https://github.com/facebook/rocksdb/releases/tag/v7.10.2 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42718) Upgrade rocksdbjni to 7.10.2
[ https://issues.apache.org/jira/browse/SPARK-42718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42718: Assignee: (was: Apache Spark) > Upgrade rocksdbjni to 7.10.2 > > > Key: SPARK-42718 > URL: https://issues.apache.org/jira/browse/SPARK-42718 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major > > https://github.com/facebook/rocksdb/releases/tag/v7.10.2 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42718) Upgrade rocksdbjni to 7.10.2
[ https://issues.apache.org/jira/browse/SPARK-42718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697875#comment-17697875 ] Apache Spark commented on SPARK-42718: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40337 > Upgrade rocksdbjni to 7.10.2 > > > Key: SPARK-42718 > URL: https://issues.apache.org/jira/browse/SPARK-42718 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major > > https://github.com/facebook/rocksdb/releases/tag/v7.10.2 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42719) `ShuffledRowRdd#getPreferredLocations` should respect to `spark.shuffle.reduceLocality.enabled`
[ https://issues.apache.org/jira/browse/SPARK-42719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Qi updated SPARK-42719: -- Description: Discussed at [https://github.com/apache/spark/pull/40307] {{getPreferredLocations}} in {{ShuffledRowRDD}} should return {{Nil}} at the very beginning in case {{spark.shuffle.reduceLocality.enabled = false}} (conceptually). This logic is pushed into MapOutputTracker though - and {{getPreferredLocationsForShuffle}} honors {{spark.shuffle.reduceLocality.enabled}} - but {{getMapLocation}} does not. So the fix would be to change {{getMapLocation}} to honor the parameter. > `ShuffledRowRdd#getPreferredLocations` should respect to > `spark.shuffle.reduceLocality.enabled` > --- > > Key: SPARK-42719 > URL: https://issues.apache.org/jira/browse/SPARK-42719 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: He Qi >Priority: Major > > Discussed at [https://github.com/apache/spark/pull/40307] > {{getPreferredLocations}} in {{ShuffledRowRDD}} should return {{Nil}} at the > very beginning in case {{spark.shuffle.reduceLocality.enabled = false}} > (conceptually). > This logic is pushed into MapOutputTracker though - and > {{getPreferredLocationsForShuffle}} honors > {{spark.shuffle.reduceLocality.enabled}} - but {{getMapLocation}} does not. > So the fix would be to change {{getMapLocation}} to honor the parameter. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42719) `ShuffledRowRdd#getPreferredLocations` should respect to `spark.shuffle.reduceLocality.enabled`
He Qi created SPARK-42719: - Summary: `ShuffledRowRdd#getPreferredLocations` should respect to `spark.shuffle.reduceLocality.enabled` Key: SPARK-42719 URL: https://issues.apache.org/jira/browse/SPARK-42719 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.0 Reporter: He Qi -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42706) Document the Spark SQL error classes in user-facing documentation.
[ https://issues.apache.org/jira/browse/SPARK-42706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42706: Summary: Document the Spark SQL error classes in user-facing documentation. (was: List the error class to user-facing documentation.) > Document the Spark SQL error classes in user-facing documentation. > -- > > Key: SPARK-42706 > URL: https://issues.apache.org/jira/browse/SPARK-42706 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We need to have an error class list in the user-facing documentation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42706) List the error class to user-facing documentation.
[ https://issues.apache.org/jira/browse/SPARK-42706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42706: Assignee: (was: Apache Spark) > List the error class to user-facing documentation. > -- > > Key: SPARK-42706 > URL: https://issues.apache.org/jira/browse/SPARK-42706 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We need to have an error class list in the user-facing documentation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42706) List the error class to user-facing documentation.
[ https://issues.apache.org/jira/browse/SPARK-42706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42706: Assignee: Apache Spark > List the error class to user-facing documentation. > -- > > Key: SPARK-42706 > URL: https://issues.apache.org/jira/browse/SPARK-42706 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > We need to have an error class list in the user-facing documentation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42706) List the error class to user-facing documentation.
[ https://issues.apache.org/jira/browse/SPARK-42706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697863#comment-17697863 ] Apache Spark commented on SPARK-42706: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/40336 > List the error class to user-facing documentation. > -- > > Key: SPARK-42706 > URL: https://issues.apache.org/jira/browse/SPARK-42706 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We need to have an error class list in the user-facing documentation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42718) Upgrade rocksdbjni to 7.10.2
Yang Jie created SPARK-42718: Summary: Upgrade rocksdbjni to 7.10.2 Key: SPARK-42718 URL: https://issues.apache.org/jira/browse/SPARK-42718 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: Yang Jie https://github.com/facebook/rocksdb/releases/tag/v7.10.2 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42717) Upgrade mysql-connector-java from 8.0.31 to 8.0.32
[ https://issues.apache.org/jira/browse/SPARK-42717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697850#comment-17697850 ] Apache Spark commented on SPARK-42717: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40335 > Upgrade mysql-connector-java from 8.0.31 to 8.0.32 > -- > > Key: SPARK-42717 > URL: https://issues.apache.org/jira/browse/SPARK-42717 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42717) Upgrade mysql-connector-java from 8.0.31 to 8.0.32
[ https://issues.apache.org/jira/browse/SPARK-42717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697851#comment-17697851 ] Apache Spark commented on SPARK-42717: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40335 > Upgrade mysql-connector-java from 8.0.31 to 8.0.32 > -- > > Key: SPARK-42717 > URL: https://issues.apache.org/jira/browse/SPARK-42717 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42717) Upgrade mysql-connector-java from 8.0.31 to 8.0.32
[ https://issues.apache.org/jira/browse/SPARK-42717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42717: Assignee: Apache Spark > Upgrade mysql-connector-java from 8.0.31 to 8.0.32 > -- > > Key: SPARK-42717 > URL: https://issues.apache.org/jira/browse/SPARK-42717 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42717) Upgrade mysql-connector-java from 8.0.31 to 8.0.32
[ https://issues.apache.org/jira/browse/SPARK-42717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42717: Assignee: (was: Apache Spark) > Upgrade mysql-connector-java from 8.0.31 to 8.0.32 > -- > > Key: SPARK-42717 > URL: https://issues.apache.org/jira/browse/SPARK-42717 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42717) Upgrade mysql-connector-java from 8.0.31 to 8.0.32
BingKun Pan created SPARK-42717: --- Summary: Upgrade mysql-connector-java from 8.0.31 to 8.0.32 Key: SPARK-42717 URL: https://issues.apache.org/jira/browse/SPARK-42717 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42716) DataSourceV2 cannot report KeyGroupedPartitioning with multiple keys per partition
[ https://issues.apache.org/jira/browse/SPARK-42716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697847#comment-17697847 ] Apache Spark commented on SPARK-42716: -- User 'EnricoMi' has created a pull request for this issue: https://github.com/apache/spark/pull/40334 > DataSourceV2 cannot report KeyGroupedPartitioning with multiple keys per > partition > -- > > Key: SPARK-42716 > URL: https://issues.apache.org/jira/browse/SPARK-42716 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.4.0, 3.4.1 >Reporter: Enrico Minack >Priority: Major > > From Spark 3.0.0 until 3.2.3, a DataSourceV2 could report its partitioning as > {{KeyGroupedPartitioning}} via {{SupportsReportPartitioning}}, even if > multiple keys belong to a partition. > With SPARK-37377, only if all partitions implement {{HasPartitionKey}}, the > partition information reported through {{SupportsReportPartitioning}} is > considered by catalyst. But this limits the number of keys per partition to 1. > Spark should continue to support the more general situation of > {{KeyGroupedPartitioning}} with multiple keys per partition, like > {{HashPartitioning}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
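For reference, a minimal sketch of a DSv2 scan reporting key-grouped output partitioning; the connector class and column names are hypothetical, while the interfaces are the existing public API:

{code:java}
import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.SupportsReportPartitioning;
import org.apache.spark.sql.connector.read.partitioning.KeyGroupedPartitioning;
import org.apache.spark.sql.connector.read.partitioning.Partitioning;

// Hypothetical scan whose input partitions keep rows grouped by a "region" key.
public abstract class RegionGroupedScan implements Scan, SupportsReportPartitioning {

  @Override
  public Partitioning outputPartitioning() {
    // Declares that rows are clustered by the grouping key. Since SPARK-37377
    // this is only honored when every InputPartition also implements
    // HasPartitionKey, i.e. at most one key per partition -- the limitation
    // this ticket is about.
    Expression[] keys = new Expression[] { Expressions.identity("region") };
    return new KeyGroupedPartitioning(keys, 8 /* numPartitions */);
  }
}
{code}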
[jira] [Assigned] (SPARK-42716) DataSourceV2 cannot report KeyGroupedPartitioning with multiple keys per partition
[ https://issues.apache.org/jira/browse/SPARK-42716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42716: Assignee: (was: Apache Spark) > DataSourceV2 cannot report KeyGroupedPartitioning with multiple keys per > partition > -- > > Key: SPARK-42716 > URL: https://issues.apache.org/jira/browse/SPARK-42716 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.4.0, 3.4.1 >Reporter: Enrico Minack >Priority: Major > > From Spark 3.0.0 until 3.2.3, a DataSourceV2 could report its partitioning as > {{KeyGroupedPartitioning}} via {{SupportsReportPartitioning}}, even if > multiple keys belong to a partition. > With SPARK-37377, only if all partitions implement {{HasPartitionKey}}, the > partition information reported through {{SupportsReportPartitioning}} is > considered by catalyst. But this limits the number of keys per partition to 1. > Spark should continue to support the more general situation of > {{KeyGroupedPartitioning}} with multiple keys per partition, like > {{HashPartitioning}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42716) DataSourceV2 cannot report KeyGroupedPartitioning with multiple keys per partition
[ https://issues.apache.org/jira/browse/SPARK-42716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42716: Assignee: Apache Spark > DataSourceV2 cannot report KeyGroupedPartitioning with multiple keys per > partition > -- > > Key: SPARK-42716 > URL: https://issues.apache.org/jira/browse/SPARK-42716 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.4.0, 3.4.1 >Reporter: Enrico Minack >Assignee: Apache Spark >Priority: Major > > From Spark 3.0.0 until 3.2.3, a DataSourceV2 could report its partitioning as > {{KeyGroupedPartitioning}} via {{SupportsReportPartitioning}}, even if > multiple keys belong to a partition. > With SPARK-37377, only if all partitions implement {{HasPartitionKey}}, the > partition information reported through {{SupportsReportPartitioning}} is > considered by catalyst. But this limits the number of keys per partition to 1. > Spark should continue to support the more general situation of > {{KeyGroupedPartitioning}} with multiple keys per partition, like > {{HashPartitioning}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42716) DataSourceV2 cannot report KeyGroupedPartitioning with multiple keys per partition
[ https://issues.apache.org/jira/browse/SPARK-42716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697845#comment-17697845 ] Apache Spark commented on SPARK-42716: -- User 'EnricoMi' has created a pull request for this issue: https://github.com/apache/spark/pull/40334 > DataSourceV2 cannot report KeyGroupedPartitioning with multiple keys per > partition > -- > > Key: SPARK-42716 > URL: https://issues.apache.org/jira/browse/SPARK-42716 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.4.0, 3.4.1 >Reporter: Enrico Minack >Priority: Major > > From Spark 3.0.0 until 3.2.3, a DataSourceV2 could report its partitioning as > {{KeyGroupedPartitioning}} via {{SupportsReportPartitioning}}, even if > multiple keys belong to a partition. > With SPARK-37377, only if all partitions implement {{HasPartitionKey}}, the > partition information reported through {{SupportsReportPartitioning}} is > considered by catalyst. But this limits the number of keys per partition to 1. > Spark should continue to support the more general situation of > {{KeyGroupedPartitioning}} with multiple keys per partition, like > {{HashPartitioning}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42716) DataSourceV2 cannot report KeyGroupedPartitioning with multiple keys per partition
Enrico Minack created SPARK-42716: - Summary: DataSourceV2 cannot report KeyGroupedPartitioning with multiple keys per partition Key: SPARK-42716 URL: https://issues.apache.org/jira/browse/SPARK-42716 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.2, 3.3.1, 3.3.0, 3.4.0, 3.4.1 Reporter: Enrico Minack From Spark 3.0.0 until 3.2.3, a DataSourceV2 could report its partitioning as {{KeyGroupedPartitioning}} via {{SupportsReportPartitioning}}, even if multiple keys belong to a partition. With SPARK-37377, only if all partitions implement {{HasPartitionKey}}, the partition information reported through {{SupportsReportPartitioning}} is considered by catalyst. But this limits the number of keys per partition to 1. Spark should continue to support the more general situation of {{KeyGroupedPartitioning}} with multiple keys per partition, like {{HashPartitioning}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42713) Add '__getattr__' and '__getitem__' of DataFrame and Column to API reference
[ https://issues.apache.org/jira/browse/SPARK-42713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42713. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40331 [https://github.com/apache/spark/pull/40331] > Add '__getattr__' and '__getitem__' of DataFrame and Column to API reference > > > Key: SPARK-42713 > URL: https://issues.apache.org/jira/browse/SPARK-42713 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42713) Add '__getattr__' and '__getitem__' of DataFrame and Column to API reference
[ https://issues.apache.org/jira/browse/SPARK-42713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42713: Assignee: Ruifeng Zheng > Add '__getattr__' and '__getitem__' of DataFrame and Column to API reference > > > Key: SPARK-42713 > URL: https://issues.apache.org/jira/browse/SPARK-42713 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42266) Local mode should work with IPython
[ https://issues.apache.org/jira/browse/SPARK-42266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42266. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40327 [https://github.com/apache/spark/pull/40327] > Local mode should work with IPython > --- > > Key: SPARK-42266 > URL: https://issues.apache.org/jira/browse/SPARK-42266 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > {code:java} > (spark_dev) ➜ spark git:(master) bin/pyspark --remote "local[*]" > Python 3.9.15 (main, Nov 24 2022, 08:28:41) > Type 'copyright', 'credits' or 'license' for more information > IPython 8.9.0 -- An enhanced Interactive Python. Type '?' for help. > /Users/ruifeng.zheng/Dev/spark/python/pyspark/shell.py:45: UserWarning: > Failed to initialize Spark session. > warnings.warn("Failed to initialize Spark session.") > Traceback (most recent call last): > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/shell.py", line 40, in > > spark = SparkSession.builder.getOrCreate() > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/session.py", line > 429, in getOrCreate > from pyspark.sql.connect.session import SparkSession as RemoteSparkSession > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/__init__.py", line > 21, in > from pyspark.sql.connect.dataframe import DataFrame # noqa: F401 > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", > line 35, in > import pandas > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/__init__.py", > line 29, in > from pyspark.pandas.missing.general_functions import > MissingPandasLikeGeneralFunctions > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/__init__.py", > line 34, in > require_minimum_pandas_version() > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/utils.py", > line 37, in require_minimum_pandas_version > if LooseVersion(pandas.__version__) < > LooseVersion(minimum_pandas_version): > AttributeError: partially initialized module 'pandas' has no attribute > '__version__' (most likely due to a circular import) > [TerminalIPythonApp] WARNING | Unknown error in handling PYTHONSTARTUP file > /Users/ruifeng.zheng/Dev/spark//python/pyspark/shell.py: > --- > AttributeErrorTraceback (most recent call last) > File ~/Dev/spark/python/pyspark/shell.py:40 > 38 try: > 39 # Creates pyspark.sql.connect.SparkSession. 
> ---> 40 spark = SparkSession.builder.getOrCreate() > 41 except Exception: > File ~/Dev/spark/python/pyspark/sql/session.py:429, in > SparkSession.Builder.getOrCreate(self) > 428 with SparkContext._lock: > --> 429 from pyspark.sql.connect.session import SparkSession as > RemoteSparkSession > 431 if ( > 432 SparkContext._active_spark_context is None > 433 and SparkSession._instantiatedSession is None > 434 ): > File ~/Dev/spark/python/pyspark/sql/connect/__init__.py:21 > 18 """Currently Spark Connect is very experimental and the APIs to > interact with > 19 Spark through this API are can be changed at any time without > warning.""" > ---> 21 from pyspark.sql.connect.dataframe import DataFrame # noqa: F401 > 22 from pyspark.sql.pandas.utils import ( > 23 require_minimum_pandas_version, > 24 require_minimum_pyarrow_version, > 25 require_minimum_grpc_version, > 26 ) > File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:35 > 34 import random > ---> 35 import pandas > 36 import json > File ~/Dev/spark/python/pyspark/pandas/__init__.py:29 > 27 from typing import Any > ---> 29 from pyspark.pandas.missing.general_functions import > MissingPandasLikeGeneralFunctions > 30 from pyspark.pandas.missing.scalars import MissingPandasLikeScalars > File ~/Dev/spark/python/pyspark/pandas/__init__.py:34 > 33 try: > ---> 34 require_minimum_pandas_version() > 35 require_minimum_pyarrow_version() > File ~/Dev/spark/python/pyspark/sql/pandas/utils.py:37, in > require_minimum_pandas_version() > 34 raise ImportError( > 35 "Pandas >= %s must be installed; however, " "it was not > found." % minimum_pandas_version > 36 ) from raised_error > ---> 37 if LooseVersion(pandas.__version__) < > LooseVersion(minimum_pandas_version): > 38 raise ImportError( > 39 "Pandas >= %s must be installed; however, " > 40 "your version was %s." %
[jira] [Assigned] (SPARK-42266) Local mode should work with IPython
[ https://issues.apache.org/jira/browse/SPARK-42266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42266: Assignee: Hyukjin Kwon > Local mode should work with IPython > --- > > Key: SPARK-42266 > URL: https://issues.apache.org/jira/browse/SPARK-42266 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Hyukjin Kwon >Priority: Major > > {code:java} > (spark_dev) ➜ spark git:(master) bin/pyspark --remote "local[*]" > Python 3.9.15 (main, Nov 24 2022, 08:28:41) > Type 'copyright', 'credits' or 'license' for more information > IPython 8.9.0 -- An enhanced Interactive Python. Type '?' for help. > /Users/ruifeng.zheng/Dev/spark/python/pyspark/shell.py:45: UserWarning: > Failed to initialize Spark session. > warnings.warn("Failed to initialize Spark session.") > Traceback (most recent call last): > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/shell.py", line 40, in > > spark = SparkSession.builder.getOrCreate() > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/session.py", line > 429, in getOrCreate > from pyspark.sql.connect.session import SparkSession as RemoteSparkSession > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/__init__.py", line > 21, in > from pyspark.sql.connect.dataframe import DataFrame # noqa: F401 > File > "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", > line 35, in > import pandas > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/__init__.py", > line 29, in > from pyspark.pandas.missing.general_functions import > MissingPandasLikeGeneralFunctions > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/__init__.py", > line 34, in > require_minimum_pandas_version() > File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/utils.py", > line 37, in require_minimum_pandas_version > if LooseVersion(pandas.__version__) < > LooseVersion(minimum_pandas_version): > AttributeError: partially initialized module 'pandas' has no attribute > '__version__' (most likely due to a circular import) > [TerminalIPythonApp] WARNING | Unknown error in handling PYTHONSTARTUP file > /Users/ruifeng.zheng/Dev/spark//python/pyspark/shell.py: > --- > AttributeErrorTraceback (most recent call last) > File ~/Dev/spark/python/pyspark/shell.py:40 > 38 try: > 39 # Creates pyspark.sql.connect.SparkSession. 
> ---> 40 spark = SparkSession.builder.getOrCreate() > 41 except Exception: > File ~/Dev/spark/python/pyspark/sql/session.py:429, in > SparkSession.Builder.getOrCreate(self) > 428 with SparkContext._lock: > --> 429 from pyspark.sql.connect.session import SparkSession as > RemoteSparkSession > 431 if ( > 432 SparkContext._active_spark_context is None > 433 and SparkSession._instantiatedSession is None > 434 ): > File ~/Dev/spark/python/pyspark/sql/connect/__init__.py:21 > 18 """Currently Spark Connect is very experimental and the APIs to > interact with > 19 Spark through this API are can be changed at any time without > warning.""" > ---> 21 from pyspark.sql.connect.dataframe import DataFrame # noqa: F401 > 22 from pyspark.sql.pandas.utils import ( > 23 require_minimum_pandas_version, > 24 require_minimum_pyarrow_version, > 25 require_minimum_grpc_version, > 26 ) > File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:35 > 34 import random > ---> 35 import pandas > 36 import json > File ~/Dev/spark/python/pyspark/pandas/__init__.py:29 > 27 from typing import Any > ---> 29 from pyspark.pandas.missing.general_functions import > MissingPandasLikeGeneralFunctions > 30 from pyspark.pandas.missing.scalars import MissingPandasLikeScalars > File ~/Dev/spark/python/pyspark/pandas/__init__.py:34 > 33 try: > ---> 34 require_minimum_pandas_version() > 35 require_minimum_pyarrow_version() > File ~/Dev/spark/python/pyspark/sql/pandas/utils.py:37, in > require_minimum_pandas_version() > 34 raise ImportError( > 35 "Pandas >= %s must be installed; however, " "it was not > found." % minimum_pandas_version > 36 ) from raised_error > ---> 37 if LooseVersion(pandas.__version__) < > LooseVersion(minimum_pandas_version): > 38 raise ImportError( > 39 "Pandas >= %s must be installed; however, " > 40 "your version was %s." % (minimum_pandas_version, > pandas.__version__) > 41 ) > AttributeError: partially initialized module 'pandas' has no attribute >
[jira] [Resolved] (SPARK-42712) Improve docstring of mapInPandas and mapInArrow
[ https://issues.apache.org/jira/browse/SPARK-42712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42712. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40330 [https://github.com/apache/spark/pull/40330] > Improve docstring of mapInPandas and mapInArrow > --- > > Key: SPARK-42712 > URL: https://issues.apache.org/jira/browse/SPARK-42712 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > We'd better call out that they are not scalar. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42712) Improve docstring of mapInPandas and mapInArrow
[ https://issues.apache.org/jira/browse/SPARK-42712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42712: Assignee: Xinrong Meng > Improve docstring of mapInPandas and mapInArrow > --- > > Key: SPARK-42712 > URL: https://issues.apache.org/jira/browse/SPARK-42712 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > We'd better call out that they are not scalar. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
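For context on the "not scalar" wording above, a short hedged sketch of the contract the improved docstrings call out: the mapped function consumes and produces iterators of pandas DataFrame batches rather than scalar values (the data and schema here are illustrative):

{code:python}
from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(8)

# The function takes an iterator of pandas DataFrames and yields pandas
# DataFrames; each output batch may contain any number of rows, so this
# is not a row-by-row (scalar) UDF.
def keep_even(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in batches:
        yield pdf[pdf["id"] % 2 == 0]

df.mapInPandas(keep_even, schema="id long").show()
{code}

mapInArrow follows the same shape, with pyarrow.RecordBatch in place of pandas.DataFrame.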
[jira] [Created] (SPARK-42715) NegativeArraySizeException caused by reading too much data from an ORC file
XiaoLong Wu created SPARK-42715: --- Summary: NegativeArraySizeException caused by reading too much data from an ORC file Key: SPARK-42715 URL: https://issues.apache.org/jira/browse/SPARK-42715 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.2 Reporter: XiaoLong Wu

Should we provide a friendlier exception message explaining how to avoid this error? For example, when we catch this exception, we could tell the user to reduce the value of spark.sql.orc.columnarReaderBatchSize. In the current version, batch reading of ORC files goes through OrcColumnarBatchReader.nextBatch(), which depends on [ORC|https://github.com/apache/orc] (version 1.8.2) to perform the data copy. The relevant ORC code is as follows:

{code:java}
private static byte[] commonReadByteArrays(InStream stream, IntegerReader lengths,
    LongColumnVector scratchlcv,
    BytesColumnVector result, final int batchSize) throws IOException {
  // Read lengths
  scratchlcv.isRepeating = result.isRepeating;
  scratchlcv.noNulls = result.noNulls;
  scratchlcv.isNull = result.isNull;  // Notice we are replacing the isNull vector here...
  lengths.nextVector(scratchlcv, scratchlcv.vector, batchSize);
  int totalLength = 0;
  if (!scratchlcv.isRepeating) {
    for (int i = 0; i < batchSize; i++) {
      if (!scratchlcv.isNull[i]) {
        totalLength += (int) scratchlcv.vector[i];
      }
    }
  } else {
    if (!scratchlcv.isNull[0]) {
      totalLength = (int) (batchSize * scratchlcv.vector[0]);
    }
  }

  // Read all the strings for this batch
  byte[] allBytes = new byte[totalLength];
  int offset = 0;
  int len = totalLength;
  while (len > 0) {
    int bytesRead = stream.read(allBytes, offset, len);
    if (bytesRead < 0) {
      throw new EOFException("Can't finish byte read from " + stream);
    }
    len -= bytesRead;
    offset += bytesRead;
  }

  return allBytes;
}
{code}

As shown above, totalLength is an int that accumulates the batch's total byte size (the per-row long lengths are cast to int and summed). If the total exceeds Integer.MAX_VALUE, the accumulation overflows and totalLength becomes negative, so the allocation new byte[totalLength] throws the following exception:

{code:java}
Caused by: java.lang.NegativeArraySizeException
  at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1998)
  at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2021)
  at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2119)
  at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1962)
  at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
  at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
  at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
  at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
  at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:197)
  at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
  ... 20 more
{code}

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
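Until a friendlier message exists, the workaround suggested in the report can be applied directly. A hedged sketch (the input path is hypothetical, and 1024 is just an illustrative value below the 4096-row default of spark.sql.orc.columnarReaderBatchSize):

{code:python}
from pyspark.sql import SparkSession

# Fewer rows per batch means fewer string bytes summed into totalLength,
# which keeps the per-batch total below Integer.MAX_VALUE.
spark = (
    SparkSession.builder
    .config("spark.sql.orc.columnarReaderBatchSize", "1024")
    .getOrCreate()
)

df = spark.read.orc("/data/wide_strings.orc")  # hypothetical input file
df.show()
{code}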
[jira] (SPARK-42623) parameter markers not blocked in DDL
[ https://issues.apache.org/jira/browse/SPARK-42623 ] ming95 deleted comment on SPARK-42623: was (Author: zing): i can try to fix this issue > parameter markers not blocked in DDL > > > Key: SPARK-42623 > URL: https://issues.apache.org/jira/browse/SPARK-42623 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > The parameterized query code does not block DDL statements from referencing > parameter markers. > E.g.: > > {code:java} > scala> spark.sql(sqlText = "CREATE VIEW v1 AS SELECT current_timestamp() + > :later as stamp, :x * :x AS square", args = Map("later" -> "INTERVAL'3' > HOUR", "x" -> "15.0")).show() > ++ > || > ++ > ++ > {code} > It appears we have some protection that fails us when the view is invoked: > > {code:java} > scala> spark.sql(sqlText = "SELECT * FROM v1", args = Map("later" -> > "INTERVAL'3' HOUR", "x" -> "15.0")).show() > org.apache.spark.sql.AnalysisException: [UNBOUND_SQL_PARAMETER] Found the > unbound parameter: `later`. Please, fix `args` and provide a mapping of the > parameter to a SQL literal.; line 1 pos 29 > {code} > Right now I think the affected statements are: > * DEFAULT definition > * VIEW definition > but any other future standard expression is also at risk, such as SQL > Functions, or GENERATED COLUMN. > CREATE TABLE AS is debatable, since it executes the query at definition > time only. > For simplicity I propose to block the feature from ANY DDL statement (CREATE, > ALTER). > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
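The same behavior can be sketched against the PySpark parameterized-SQL API added in 3.4 (the view name and literal values are illustrative):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parameter markers work as intended in a plain query: each args value is
# parsed as a SQL literal and substituted for its marker.
spark.sql("SELECT :x * :x AS square", args={"x": "15.0"}).show()

# Per the report, markers are not rejected in DDL: the CREATE VIEW below
# succeeds, and the failure surfaces only when the view is later read,
# as an UNBOUND_SQL_PARAMETER AnalysisException.
spark.sql(
    "CREATE OR REPLACE VIEW v1 AS SELECT :x * :x AS square",
    args={"x": "15.0"},
)
spark.sql("SELECT * FROM v1").show()  # raises [UNBOUND_SQL_PARAMETER]
{code}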