[jira] [Commented] (SPARK-40100) Add Int128 type
[ https://issues.apache.org/jira/browse/SPARK-40100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580084#comment-17580084 ] jiaan.geng commented on SPARK-40100: I'm working on it. > Add Int128 type > --- > > Key: SPARK-40100 > URL: https://issues.apache.org/jira/browse/SPARK-40100 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Extend Catalyst's type system with a new type: > Int128Type, which represents the Int128 type -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40100) Add Int128 type
jiaan.geng created SPARK-40100: -- Summary: Add Int128 type Key: SPARK-40100 URL: https://issues.apache.org/jira/browse/SPARK-40100 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: jiaan.geng Extend Catalyst's type system with a new type: Int128Type, which represents the Int128 type -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40099) Merge adjacent CaseWhen branches if their values are the same
Yuming Wang created SPARK-40099: --- Summary: Merge adjacent CaseWhen branches if their values are the same Key: SPARK-40099 URL: https://issues.apache.org/jira/browse/SPARK-40099 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang For example: {code:sql} CASE WHEN f1.buyer_id IS NOT NULL THEN 1 WHEN f2.buyer_id IS NOT NULL THEN 1 ELSE 0 END {code} The expected result: {code:sql} CASE WHEN f1.buyer_id IS NOT NULL or f2.buyer_id IS NOT NULL THEN 1 ELSE 0 END {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
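A minimal sketch of what such a rewrite could look like over Catalyst's CaseWhen branches, assuming the existing Or expression and semanticEquals helper; the mergeAdjacentBranches function is made up for illustration and is not the rule proposed in the ticket:

{code:java}
import org.apache.spark.sql.catalyst.expressions.{CaseWhen, Expression, Or}

// Collapse neighbouring WHEN branches whose values are semantically equal by OR-ing
// their conditions, e.g. WHEN c1 THEN 1 WHEN c2 THEN 1 => WHEN c1 OR c2 THEN 1.
def mergeAdjacentBranches(cw: CaseWhen): CaseWhen = {
  val merged = cw.branches.foldLeft(Vector.empty[(Expression, Expression)]) { (acc, branch) =>
    acc.lastOption match {
      case Some((prevCond, prevValue)) if prevValue.semanticEquals(branch._2) =>
        acc.init :+ ((Or(prevCond, branch._1), prevValue))
      case _ =>
        acc :+ branch
    }
  }
  CaseWhen(merged, cw.elseValue)
}
{code}

Applied to the example above, the two branches returning 1 would fold into a single branch guarded by f1.buyer_id IS NOT NULL OR f2.buyer_id IS NOT NULL.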
[jira] [Resolved] (SPARK-40095) sc.uiWebUrl should not throw exception when webui is disabled
[ https://issues.apache.org/jira/browse/SPARK-40095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40095. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37530 [https://github.com/apache/spark/pull/37530] > sc.uiWebUrl should not throw exception when webui is disabled > - > > Key: SPARK-40095 > URL: https://issues.apache.org/jira/browse/SPARK-40095 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40098) Format error messages in the Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-40098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40098: Assignee: Max Gekk (was: Apache Spark) > Format error messages in the Thrift Server > -- > > Key: SPARK-40098 > URL: https://issues.apache.org/jira/browse/SPARK-40098 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > # Introduce a config to control the format of error messages: plain text and > JSON > # Modify the Thrift Server to output errors from Spark SQL according to the > config -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40098) Format error messages in the Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-40098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40098: Assignee: Apache Spark (was: Max Gekk) > Format error messages in the Thrift Server > -- > > Key: SPARK-40098 > URL: https://issues.apache.org/jira/browse/SPARK-40098 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > # Introduce a config to control the format of error messages: plain text and > JSON > # Modify the Thrift Server to output errors from Spark SQL according to the > config -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40098) Format error messages in the Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-40098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580074#comment-17580074 ] Apache Spark commented on SPARK-40098: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37520 > Format error messages in the Thrift Server > -- > > Key: SPARK-40098 > URL: https://issues.apache.org/jira/browse/SPARK-40098 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > # Introduce a config to control the format of error messages: plain text and > JSON > # Modify the Thrift Server to output errors from Spark SQL according to the > config -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40098) Format error messages in the Thrift Server
Max Gekk created SPARK-40098: Summary: Format error messages in the Thrift Server Key: SPARK-40098 URL: https://issues.apache.org/jira/browse/SPARK-40098 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk # Introduce a config to control the format of error messages: plain text and JSON # Modify the Thrift Server to output errors from Spark SQL according to the config -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
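To illustrate the two output shapes, here is a toy formatter; the format values, field names, and function below are hypothetical and are not the config or JSON schema this ticket introduces:

{code:java}
// Hypothetical sketch: render an error either as plain text or as a JSON object,
// selected by a configuration value like the one proposed above.
sealed trait ErrorMessageFormat
case object PlainText extends ErrorMessageFormat
case object Json extends ErrorMessageFormat

def formatError(errorClass: String, message: String, fmt: ErrorMessageFormat): String = fmt match {
  case PlainText => s"[$errorClass] $message"
  case Json      => s"""{"errorClass": "$errorClass", "message": "$message"}"""
}

// formatError("DIVIDE_BY_ZERO", "Division by zero", Json)
// => {"errorClass": "DIVIDE_BY_ZERO", "message": "Division by zero"}
{code}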
[jira] [Commented] (SPARK-40063) pyspark.pandas .apply() changing rows ordering
[ https://issues.apache.org/jira/browse/SPARK-40063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580073#comment-17580073 ] Hyukjin Kwon commented on SPARK-40063: -- Just to clarify, does that only happen in a single column, or does it happen in other columns too? Spark itself doesn't guarantee the order of rows. If you need to keep the natural order, you could try to enable {{compute.ordered_head}} and see what you get. > pyspark.pandas .apply() changing rows ordering > -- > > Key: SPARK-40063 > URL: https://issues.apache.org/jira/browse/SPARK-40063 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.3.0 > Environment: Databricks Runtime 11.1 >Reporter: Marcelo Rossini Castro >Priority: Minor > Labels: Pandas, PySpark > > When using the apply function to apply a function to a DataFrame column, it > ends up mixing the column's rows ordering. > A command like this: > {code:java} > def example_func(df_col): > return df_col ** 2 > df['col_to_apply_function'] = df.apply(lambda row: > example_func(row['col_to_apply_function']), axis=1) {code} > A workaround is to assign the results to a new column instead of the same > one, but if the old column is dropped, the same error is produced. > Setting one column as index also didn't work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40097) Support Int128 type
[ https://issues.apache.org/jira/browse/SPARK-40097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-40097: --- Description: Spark SQL today supports the Decimal data type. The implementation of Spark Decimal holds a BigDecimal or Long value. Spark Decimal provides some operators like +, -, *, /, % and so on. These operators rely heavily on the computational power of BigDecimal or Long itself. For ease of understanding, take + as an example. The implementation is shown below. {code:java} def + (that: Decimal): Decimal = { if (decimalVal.eq(null) && that.decimalVal.eq(null) && scale == that.scale) { Decimal(longVal + that.longVal, Math.max(precision, that.precision) + 1, scale) } else { Decimal(toBigDecimal.bigDecimal.add(that.toBigDecimal.bigDecimal)) } } {code} We can see that the + of Long will be called if Spark Decimal holds a Long value. Otherwise, the add of BigDecimal will be called if Spark Decimal holds a BigDecimal value. The other operators of Spark Decimal adopt a similar approach. Furthermore, the code shown above calls Decimal.apply to construct a new instance of Spark Decimal. As we know, the add operator of BigDecimal constructs a new instance of BigDecimal. So, if we call the + operator of a Spark Decimal which holds a Long value, Spark will construct a new instance of Decimal. Otherwise, Spark will construct a new instance of BigDecimal and a new instance of Decimal. Through rough analysis, we know: 1. The computational power of Spark Decimal may depend on BigDecimal. 2. The calculation operators of Spark Decimal create a lot of new instances of Decimal and may create a lot of new instances of BigDecimal. If a large table has a field called 'colA whose type is Spark Decimal, the execution of SUM('colA) will involve the creation of a large number of Spark Decimal instances and BigDecimal instances. These Spark Decimal instances and BigDecimal instances will lead to frequent garbage collection. Int128 is a high-performance data type about 2X~10X more efficient than Spark Decimal for typical operations. It uses a finite (128-bit) precision and can handle up to decimal(38, X). The implementation of Int128 just uses two Long values to represent the high and low bits of the 128 bits respectively. Int128 is lighter than Spark Decimal and reduces the cost of new() and garbage collection. h3. Milestone 1 – Spark Decimal equivalency (The new type Int128 meets or exceeds all functionality of the existing SQL Decimal): * Add a new DataType implementation for Int128. * Support Int128 in Dataset/UDF. * Int128 literals * Int128 arithmetic (e.g. Int128 + Int128, Int128 - Decimal) * Decimal or Math functions/operators: POWER, LOG, Round, etc * Cast to and from Int128, cast String/Decimal to Int128, cast Int128 to string (pretty printing)/Decimal, with SQL syntax to specify the types * Support sorting Int128. h3. Milestone 2 – Persistence: * Ability to create tables of type Int128 * Ability to write to common file formats such as Parquet and JSON. * INSERT, SELECT, UPDATE, MERGE * Discovery h3. Milestone 3 – Client support * JDBC support * Hive Thrift server h3. Milestone 4 – PySpark and Spark R integration * Python UDF can take and return Int128 * DataFrame support > Support Int128 type > --- > > Key: SPARK-40097 > URL: https://issues.apache.org/jira/browse/SPARK-40097 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Spark SQL today supports the Decimal data type.
The implementation of Spark > Decimal holds a BigDecimal or Long value. Spark Decimal provides some > operators like +, -, *, /, % and so on. These operators rely heavily on the > computational power of BigDecimal or Long itself. For ease of understanding, > take the + as an example. The implementation shows below. > {code:java} > def + (that: Decimal): Decimal = { > if (decimalVal.eq(null) && that.decimalVal.eq(null) && scale == > that.scale) { > Decimal(longVal + that.longVal, Math.max(precision, that.precision) + > 1, scale) > } else { > Decimal(toBigDecimal.bigDecimal.add(that.toBigDecimal.bigDecimal)) > } > } > {code} > We can see the + of Long will be called if Spark Decimal holds a Long value. > Otherwise, the add of BigDecimal will be called if Spark Decimal holds a > BigDecimal value. The other operators of Spark Decimal adopt the similar way. > Furthermore, the code shown above calls Decimal.apply to construct a new > instance of Spark Decimal. As we know, the add operator of BigDecimal > constructed a new instance of BigDecimal. So, if we call the + operator of > Spark Decimal who holds a Long value, Spark will construct a new instance of > Decimal. Otherwise, Spark wi
[jira] [Updated] (SPARK-40097) Support Int128 type
[ https://issues.apache.org/jira/browse/SPARK-40097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-40097: --- Attachment: Benchmark for performance comparison between Int128 and Spark decimal.pdf > Support Int128 type > --- > > Key: SPARK-40097 > URL: https://issues.apache.org/jira/browse/SPARK-40097 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > Attachments: Benchmark for performance comparison between Int128 and > Spark decimal.pdf > > > Spark SQL today supports the Decimal data type. The implementation of Spark > Decimal holds a BigDecimal or Long value. Spark Decimal provides some > operators like +, -, *, /, % and so on. These operators rely heavily on the > computational power of BigDecimal or Long itself. For ease of understanding, > take the + as an example. The implementation shows below. > {code:java} > def + (that: Decimal): Decimal = { > if (decimalVal.eq(null) && that.decimalVal.eq(null) && scale == > that.scale) { > Decimal(longVal + that.longVal, Math.max(precision, that.precision) + > 1, scale) > } else { > Decimal(toBigDecimal.bigDecimal.add(that.toBigDecimal.bigDecimal)) > } > } > {code} > We can see the + of Long will be called if Spark Decimal holds a Long value. > Otherwise, the add of BigDecimal will be called if Spark Decimal holds a > BigDecimal value. The other operators of Spark Decimal adopt the similar way. > Furthermore, the code shown above calls Decimal.apply to construct a new > instance of Spark Decimal. As we know, the add operator of BigDecimal > constructed a new instance of BigDecimal. So, if we call the + operator of > Spark Decimal who holds a Long value, Spark will construct a new instance of > Decimal. Otherwise, Spark will construct a new instance of BigDecimal and a > new instance of Decimal. > Through rough analysis, we know: > 1. The computational power of Spark Decimal may depend on BigDecimal. > 2. The calculation operators of Spark Decimal create a lot of new instances > of Decimal and may create a lot of new instances of BigDecimal. > If a large table has a field called 'colA whose type is Spark Decimal, the > execution of SUM('colA) will involve the creation of a large number of Spark > Decimal instances and BigDecimal instances. These Spark Decimal instances and > BigDecimal instances will lead to garbage collection frequently. > Int128 is a high-performance data type about 2X~10X more efficient than Spark > Decimal for typical operations. It uses a finite (128 bit) precision and can > handle up to decimal(38, X). The implementation of Int128 just uses two Long > values to represent the high and low bits of 128 bits respectively. Int128 is > lighter than Spark Decimal, reduces the cost of new() and garbage collection. > h3. Milestone 1 – Spark Decimal equivalency ( The new type Int128 meets or > exceeds all function of the existing SQL Decimal): > * Add a new DataType implementation for Int128. > * Support Int128 in Dataset/UDF. > * Int128 literals > * Int128 arithmetic(e.g. Int128 + Int128, Int128 - Decimal) > * Decimal or Math functions/operators: POWER, LOG, Round, etc > * Cast to and from Int128, cast String/Decimal to Int128, cast Int128 to > string (pretty printing)/Decimal, with the * * SQL syntax to specify the types > * Support sorting Int128. > h3. Milestone 2 – Persistence: > * Ability to create tables of type Int128 > * Ability to write to common file formats such as Parquet and JSON. 
> * INSERT, SELECT, UPDATE, MERGE > * Discovery > h3. Milestone 3 – Client support > * JDBC support > * Hive Thrift server > h3. Milestone 4 – PySpark and Spark R integration > * Python UDF can take and return Int128 > * DataFrame support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
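To make the two-Long representation described above concrete, here is a minimal sketch (not Spark's actual implementation) of 128-bit addition with carry propagation, which avoids allocating a BigDecimal:

{code:java}
// Sketch only: a signed 128-bit value stored as two Longs (high and low 64 bits).
// Addition wraps the low word and carries into the high word, with no BigDecimal allocation.
final case class Int128(high: Long, low: Long) {
  def +(that: Int128): Int128 = {
    val newLow = this.low + that.low
    // A carry out of the low 64 bits happened iff the unsigned sum is smaller than an operand.
    val carry = if (java.lang.Long.compareUnsigned(newLow, this.low) < 0) 1L else 0L
    Int128(this.high + that.high + carry, newLow)
  }
}

// Int128(0L, 5L) + Int128(0L, 7L) == Int128(0L, 12L)
{code}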
[jira] [Commented] (SPARK-40082) DAGScheduler may not schedule new stage in condition of push-based shuffle enabled
[ https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580070#comment-17580070 ] Min Shen commented on SPARK-40082: -- [~csingh] [~mridul] I want to bring your attention to this ticket. This seems to be an issue that we previously saw. Does upstream already have the fix for this? > DAGScheduler may not schedule new stage in condition of push-based shuffle > enabled > -- > > Key: SPARK-40082 > URL: https://issues.apache.org/jira/browse/SPARK-40082 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.1.1 >Reporter: Penglei Shi >Priority: Major > Attachments: missParentStages.png, shuffleMergeFinalized.png, > submitMissingTasks.png > > > When push-based shuffle is enabled and speculative tasks > exist, a shuffleMapStage will be resubmitted once a fetchFailed occurs, > and its parent stages will be resubmitted first, which will cost some > time to compute. Before the shuffleMapStage is resubmitted, all of its > speculative tasks succeed and register map output, but speculative task > success events cannot trigger shuffleMergeFinalized because this stage > has been removed from runningStages. > Then this stage is resubmitted, but the speculative tasks have already registered map > output and there are no missing tasks to compute, so resubmitting the stage will > also not trigger shuffleMergeFinalized. Eventually this stage's > _shuffleMergedFinalized remains false. > Then AQE will submit the next stages, which depend on this shuffleMapStage > that hit fetchFailed. In getMissingParentStages, this stage will be > marked as missing and will be resubmitted, but the next stages are added to > waitingStages only after this stage finishes, so the next stages will not be > submitted even though this stage's resubmission has finished. > I have only encountered this a few times in my production environment and it is difficult to > reproduce. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40091) is rocksdbjni required if not using streaming?
[ https://issues.apache.org/jira/browse/SPARK-40091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40091. -- Resolution: Duplicate > is rocksdbjni required if not using streaming? > -- > > Key: SPARK-40091 > URL: https://issues.apache.org/jira/browse/SPARK-40091 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 3.3.0 >Reporter: t oo >Priority: Major > > my docker file is very big with pyspark > can i remove dis file below if i don't use 'spark streaming'? > 35M > /usr/local/lib/python3.10/site-packages/pyspark/jars/rocksdbjni-6.20.3.jar -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40092) is breeze required if not using ML?
[ https://issues.apache.org/jira/browse/SPARK-40092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40092. -- Resolution: Duplicate > is breeze required if not using ML? > --- > > Key: SPARK-40092 > URL: https://issues.apache.org/jira/browse/SPARK-40092 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 3.3.0 >Reporter: t oo >Priority: Major > > my docker file is very big with pyspark > can i remove dis files below if i don't use 'ML'? > 14M > /usr/local/lib/python3.10/site-packages/pyspark/jars/breeze_2.12-1.2.jar > 5.9M > /usr/local/lib/python3.10/site-packages/pyspark/jars/spark-mllib_2.12-3.3.0.jar -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40093) is kubernetes jar required if not using that executor?
[ https://issues.apache.org/jira/browse/SPARK-40093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40093. -- Resolution: Won't Fix > is kubernetes jar required if not using that executor? > -- > > Key: SPARK-40093 > URL: https://issues.apache.org/jira/browse/SPARK-40093 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 3.3.0 >Reporter: t oo >Priority: Major > > my docker file is very big with pyspark > can i remove dis files below if i don't use 'kubernetes executor'? > 11M total > 4.0M > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-core-5.12.2.jar > 840K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-client-5.12.2.jar > 760K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-admissionregistration-5.12.2.jar > 704K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-apiextensions-5.12.2.jar > 640K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-autoscaling-5.12.2.jar > 528K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-extensions-5.12.2.jar > 516K > /usr/local/lib/python3.10/site-packages/pyspark/jars/spark-kubernetes_2.12-3.3.0.jar > 456K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-networking-5.12.2.jar > 436K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-apps-5.12.2.jar > 364K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-storageclass-5.12.2.jar > 336K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-policy-5.12.2.jar > 264K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-flowcontrol-5.12.2.jar > 244K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-batch-5.12.2.jar > 192K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-discovery-5.12.2.jar > 176K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-rbac-5.12.2.jar > 160K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-node-5.12.2.jar > 144K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-certificates-5.12.2.jar > 104K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-events-5.12.2.jar > 80K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-metrics-5.12.2.jar > 68K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-scheduling-5.12.2.jar > 48K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-coordination-5.12.2.jar > 20K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-common-5.12.2.jar -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40093) is kubernetes jar required if not using that executor?
[ https://issues.apache.org/jira/browse/SPARK-40093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580069#comment-17580069 ] Hyukjin Kwon commented on SPARK-40093: -- [~toopt4] the profile and release distributions are matched with our regular Spark release. I think it would be better to raise a discussion on the mailing list before filing it as a JIRA. Short answer: I don't think we'll do this because such releases with exclusions are over-complicated, while the benefit (the size of the release) is rather trivial > is kubernetes jar required if not using that executor? > -- > > Key: SPARK-40093 > URL: https://issues.apache.org/jira/browse/SPARK-40093 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 3.3.0 >Reporter: t oo >Priority: Major > > my docker file is very big with pyspark > can i remove dis files below if i don't use 'kubernetes executor'? > 11M total > 4.0M > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-core-5.12.2.jar > 840K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-client-5.12.2.jar > 760K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-admissionregistration-5.12.2.jar > 704K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-apiextensions-5.12.2.jar > 640K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-autoscaling-5.12.2.jar > 528K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-extensions-5.12.2.jar > 516K > /usr/local/lib/python3.10/site-packages/pyspark/jars/spark-kubernetes_2.12-3.3.0.jar > 456K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-networking-5.12.2.jar > 436K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-apps-5.12.2.jar > 364K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-storageclass-5.12.2.jar > 336K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-policy-5.12.2.jar > 264K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-flowcontrol-5.12.2.jar > 244K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-batch-5.12.2.jar > 192K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-discovery-5.12.2.jar > 176K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-rbac-5.12.2.jar > 160K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-node-5.12.2.jar > 144K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-certificates-5.12.2.jar > 104K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-events-5.12.2.jar > 80K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-metrics-5.12.2.jar > 68K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-scheduling-5.12.2.jar > 48K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-coordination-5.12.2.jar > 20K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-common-5.12.2.jar -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40097) Support Int128 type
jiaan.geng created SPARK-40097: -- Summary: Support Int128 type Key: SPARK-40097 URL: https://issues.apache.org/jira/browse/SPARK-40097 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.4.0 Reporter: jiaan.geng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40095) sc.uiWebUrl should not throw exception when webui is disabled
[ https://issues.apache.org/jira/browse/SPARK-40095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40095: Assignee: Ruifeng Zheng > sc.uiWebUrl should not throw exception when webui is disabled > - > > Key: SPARK-40095 > URL: https://issues.apache.org/jira/browse/SPARK-40095 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40085) use INTERNAL_ERROR error class instead of IllegalStateException to indicate bugs
[ https://issues.apache.org/jira/browse/SPARK-40085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40085. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37524 [https://github.com/apache/spark/pull/37524] > use INTERNAL_ERROR error class instead of IllegalStateException to indicate > bugs > > > Key: SPARK-40085 > URL: https://issues.apache.org/jira/browse/SPARK-40085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40085) use INTERNAL_ERROR error class instead of IllegalStateException to indicate bugs
[ https://issues.apache.org/jira/browse/SPARK-40085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40085: Assignee: Wenchen Fan > use INTERNAL_ERROR error class instead of IllegalStateException to indicate > bugs > > > Key: SPARK-40085 > URL: https://issues.apache.org/jira/browse/SPARK-40085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40096) Finalize shuffle merge slow due to connection creation fails
[ https://issues.apache.org/jira/browse/SPARK-40096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40096: Assignee: (was: Apache Spark) > Finalize shuffle merge slow due to connection creation fails > > > Key: SPARK-40096 > URL: https://issues.apache.org/jira/browse/SPARK-40096 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Major > > *How to reproduce this issue* > * Enable push based shuffle > * Remove some merger nodes before sending finalize RPCs > * Driver try to connect those merger shuffle services and send finalize RPC > one by one, each connection creation will timeout after > SPARK_NETWORK_IO_CONNECTIONCREATIONTIMEOUT_KEY (120s by default) > > We can send these RPCs in *shuffleMergeFinalizeScheduler* thread pool and > handle the connection creation exception -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40096) Finalize shuffle merge slow due to connection creation fails
[ https://issues.apache.org/jira/browse/SPARK-40096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40096: Assignee: Apache Spark > Finalize shuffle merge slow due to connection creation fails > > > Key: SPARK-40096 > URL: https://issues.apache.org/jira/browse/SPARK-40096 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Major > > *How to reproduce this issue* > * Enable push based shuffle > * Remove some merger nodes before sending finalize RPCs > * Driver try to connect those merger shuffle services and send finalize RPC > one by one, each connection creation will timeout after > SPARK_NETWORK_IO_CONNECTIONCREATIONTIMEOUT_KEY (120s by default) > > We can send these RPCs in *shuffleMergeFinalizeScheduler* thread pool and > handle the connection creation exception -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40096) Finalize shuffle merge slow due to connection creation fails
[ https://issues.apache.org/jira/browse/SPARK-40096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580059#comment-17580059 ] Apache Spark commented on SPARK-40096: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/37533 > Finalize shuffle merge slow due to connection creation fails > > > Key: SPARK-40096 > URL: https://issues.apache.org/jira/browse/SPARK-40096 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Major > > *How to reproduce this issue* > * Enable push based shuffle > * Remove some merger nodes before sending finalize RPCs > * Driver try to connect those merger shuffle services and send finalize RPC > one by one, each connection creation will timeout after > SPARK_NETWORK_IO_CONNECTIONCREATIONTIMEOUT_KEY (120s by default) > > We can send these RPCs in *shuffleMergeFinalizeScheduler* thread pool and > handle the connection creation exception -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40096) Finalize shuffle merge slow due to connection creation fails
Wan Kun created SPARK-40096: --- Summary: Finalize shuffle merge slow due to connection creation fails Key: SPARK-40096 URL: https://issues.apache.org/jira/browse/SPARK-40096 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: Wan Kun *How to reproduce this issue* * Enable push-based shuffle * Remove some merger nodes before sending finalize RPCs * The driver tries to connect to those merger shuffle services and sends finalize RPCs one by one; each connection creation will time out after SPARK_NETWORK_IO_CONNECTIONCREATIONTIMEOUT_KEY (120s by default) We can send these RPCs in the *shuffleMergeFinalizeScheduler* thread pool and handle the connection creation exception -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
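A rough sketch of that approach: fan the finalize RPCs out on a thread pool and handle per-merger connection failures, rather than paying the 120s connection timeout serially. The MergerLocation type and sendFinalizeRpc call below are placeholders, not Spark's actual shuffle API:

{code:java}
import java.util.concurrent.{Executors, TimeUnit}
import scala.util.control.NonFatal

// Placeholders standing in for the external shuffle service locations and the finalize RPC.
case class MergerLocation(host: String, port: Int)
def sendFinalizeRpc(loc: MergerLocation): Unit = {
  // open a connection to the merger shuffle service and send the finalize message
}

// Send the finalize RPCs from a thread pool so one unreachable merger does not
// block the others, and swallow connection creation failures per merger.
def finalizeAll(mergers: Seq[MergerLocation]): Unit = {
  val pool = Executors.newFixedThreadPool(8) // pool size is an arbitrary choice for this sketch
  mergers.foreach { loc =>
    pool.execute(() => {
      try sendFinalizeRpc(loc)
      catch { case NonFatal(e) => println(s"Skipping unreachable merger $loc: ${e.getMessage}") }
    })
  }
  pool.shutdown()
  // Bound the overall wait once, instead of waiting up to 120s for each merger in turn.
  pool.awaitTermination(120, TimeUnit.SECONDS)
}
{code}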
[jira] [Commented] (SPARK-39989) Support estimate column statistics if it is foldable expression
[ https://issues.apache.org/jira/browse/SPARK-39989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580046#comment-17580046 ] Apache Spark commented on SPARK-39989: -- User 'linhongliu-db' has created a pull request for this issue: https://github.com/apache/spark/pull/37532 > Support estimate column statistics if it is foldable expression > --- > > Key: SPARK-39989 > URL: https://issues.apache.org/jira/browse/SPARK-39989 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39989) Support estimate column statistics if it is foldable expression
[ https://issues.apache.org/jira/browse/SPARK-39989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580045#comment-17580045 ] Apache Spark commented on SPARK-39989: -- User 'linhongliu-db' has created a pull request for this issue: https://github.com/apache/spark/pull/37532 > Support estimate column statistics if it is foldable expression > --- > > Key: SPARK-39989 > URL: https://issues.apache.org/jira/browse/SPARK-39989 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38334) Implement support for DEFAULT values for columns in tables
[ https://issues.apache.org/jira/browse/SPARK-38334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580028#comment-17580028 ] Daniel commented on SPARK-38334: I think it is OK to declare that this feature is implemented in Apache Spark now. Marking this as fixed. > Implement support for DEFAULT values for columns in tables > --- > > Key: SPARK-38334 > URL: https://issues.apache.org/jira/browse/SPARK-38334 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > > This story tracks the implementation of DEFAULT values for columns in tables. > CREATE TABLE and ALTER TABLE invocations will support setting column default > values for future operations. Following INSERT, UPDATE, MERGE statements may > then reference the value using the DEFAULT keyword as needed. > Examples: > {code:sql} > CREATE TABLE T(a INT, b INT NOT NULL); > -- The default default is NULL > INSERT INTO T VALUES (DEFAULT, 0); > INSERT INTO T(b) VALUES (1); > SELECT * FROM T; > (NULL, 0) > (NULL, 1) > -- Adding a default to a table with rows, sets the values for the > -- existing rows (exist default) and new rows (current default). > ALTER TABLE T ADD COLUMN c INT DEFAULT 5; > INSERT INTO T VALUES (1, 2, DEFAULT); > SELECT * FROM T; > (NULL, 0, 5) > (NULL, 1, 5) > (1, 2, 5) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38334) Implement support for DEFAULT values for columns in tables
[ https://issues.apache.org/jira/browse/SPARK-38334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580028#comment-17580028 ] Daniel edited comment on SPARK-38334 at 8/16/22 3:48 AM: - I think it is OK to declare that this feature is implemented in Apache Spark now. Marking this as fixed. We can always reopen (or file another ticket) if we would like to support other data sources. was (Author: JIRAUSER285772): I think it is OK to declare that this feature is implemented in Apache Spark now. Marking this as fixed. > Implement support for DEFAULT values for columns in tables > --- > > Key: SPARK-38334 > URL: https://issues.apache.org/jira/browse/SPARK-38334 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > Fix For: 3.4.0 > > > This story tracks the implementation of DEFAULT values for columns in tables. > CREATE TABLE and ALTER TABLE invocations will support setting column default > values for future operations. Following INSERT, UPDATE, MERGE statements may > then reference the value using the DEFAULT keyword as needed. > Examples: > {code:sql} > CREATE TABLE T(a INT, b INT NOT NULL); > -- The default default is NULL > INSERT INTO T VALUES (DEFAULT, 0); > INSERT INTO T(b) VALUES (1); > SELECT * FROM T; > (NULL, 0) > (NULL, 1) > -- Adding a default to a table with rows, sets the values for the > -- existing rows (exist default) and new rows (current default). > ALTER TABLE T ADD COLUMN c INT DEFAULT 5; > INSERT INTO T VALUES (1, 2, DEFAULT); > SELECT * FROM T; > (NULL, 0, 5) > (NULL, 1, 5) > (1, 2, 5) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38334) Implement support for DEFAULT values for columns in tables
[ https://issues.apache.org/jira/browse/SPARK-38334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel resolved SPARK-38334. Fix Version/s: 3.4.0 Resolution: Fixed > Implement support for DEFAULT values for columns in tables > --- > > Key: SPARK-38334 > URL: https://issues.apache.org/jira/browse/SPARK-38334 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > Fix For: 3.4.0 > > > This story tracks the implementation of DEFAULT values for columns in tables. > CREATE TABLE and ALTER TABLE invocations will support setting column default > values for future operations. Following INSERT, UPDATE, MERGE statements may > then reference the value using the DEFAULT keyword as needed. > Examples: > {code:sql} > CREATE TABLE T(a INT, b INT NOT NULL); > -- The default default is NULL > INSERT INTO T VALUES (DEFAULT, 0); > INSERT INTO T(b) VALUES (1); > SELECT * FROM T; > (NULL, 0) > (NULL, 1) > -- Adding a default to a table with rows, sets the values for the > -- existing rows (exist default) and new rows (current default). > ALTER TABLE T ADD COLUMN c INT DEFAULT 5; > INSERT INTO T VALUES (1, 2, DEFAULT); > SELECT * FROM T; > (NULL, 0, 5) > (NULL, 1, 5) > (1, 2, 5) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40095) sc.uiWebUrl should not throw exception when webui is disabled
[ https://issues.apache.org/jira/browse/SPARK-40095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580022#comment-17580022 ] Apache Spark commented on SPARK-40095: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37530 > sc.uiWebUrl should not throw exception when webui is disabled > - > > Key: SPARK-40095 > URL: https://issues.apache.org/jira/browse/SPARK-40095 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40095) sc.uiWebUrl should not throw exception when webui is disabled
[ https://issues.apache.org/jira/browse/SPARK-40095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40095: Assignee: (was: Apache Spark) > sc.uiWebUrl should not throw exception when webui is disabled > - > > Key: SPARK-40095 > URL: https://issues.apache.org/jira/browse/SPARK-40095 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40095) sc.uiWebUrl should not throw exception when webui is disabled
[ https://issues.apache.org/jira/browse/SPARK-40095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40095: Assignee: Apache Spark > sc.uiWebUrl should not throw exception when webui is disabled > - > > Key: SPARK-40095 > URL: https://issues.apache.org/jira/browse/SPARK-40095 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40095) sc.uiWebUrl should not throw exception when webui is disabled
[ https://issues.apache.org/jira/browse/SPARK-40095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580021#comment-17580021 ] Apache Spark commented on SPARK-40095: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37530 > sc.uiWebUrl should not throw exception when webui is disabled > - > > Key: SPARK-40095 > URL: https://issues.apache.org/jira/browse/SPARK-40095 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40095) sc.uiWebUrl should not throw exception when webui is disabled
Ruifeng Zheng created SPARK-40095: - Summary: sc.uiWebUrl should not throw exception when webui is disabled Key: SPARK-40095 URL: https://issues.apache.org/jira/browse/SPARK-40095 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40000) Add config to toggle whether to automatically add default values for INSERTs without user-specified fields
[ https://issues.apache.org/jira/browse/SPARK-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-4. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37430 [https://github.com/apache/spark/pull/37430] > Add config to toggle whether to automatically add default values for INSERTs > without user-specified fields > -- > > Key: SPARK-4 > URL: https://issues.apache.org/jira/browse/SPARK-4 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40000) Add config to toggle whether to automatically add default values for INSERTs without user-specified fields
[ https://issues.apache.org/jira/browse/SPARK-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-4: -- Assignee: Daniel > Add config to toggle whether to automatically add default values for INSERTs > without user-specified fields > -- > > Key: SPARK-4 > URL: https://issues.apache.org/jira/browse/SPARK-4 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40094) Send TaskEnd event when task failed with NotSerializableException or TaskOutputFileAlreadyExistException to release executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-40094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580003#comment-17580003 ] Apache Spark commented on SPARK-40094: -- User 'wangshengjie123' has created a pull request for this issue: https://github.com/apache/spark/pull/37528 > Send TaskEnd event when task failed with NotSerializableException or > TaskOutputFileAlreadyExistException to release executors for dynamic > allocation > -- > > Key: SPARK-40094 > URL: https://issues.apache.org/jira/browse/SPARK-40094 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: wangshengjie >Priority: Major > > We found if task failed with NotSerializableException or > TaskOutputFileAlreadyExistException, wont send TaskEnd event, and this will > cause dynamic allocation not release executor normally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40094) Send TaskEnd event when task failed with NotSerializableException or TaskOutputFileAlreadyExistException to release executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-40094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40094: Assignee: (was: Apache Spark) > Send TaskEnd event when task failed with NotSerializableException or > TaskOutputFileAlreadyExistException to release executors for dynamic > allocation > -- > > Key: SPARK-40094 > URL: https://issues.apache.org/jira/browse/SPARK-40094 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: wangshengjie >Priority: Major > > We found if task failed with NotSerializableException or > TaskOutputFileAlreadyExistException, wont send TaskEnd event, and this will > cause dynamic allocation not release executor normally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40094) Send TaskEnd event when task failed with NotSerializableException or TaskOutputFileAlreadyExistException to release executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-40094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40094: Assignee: Apache Spark > Send TaskEnd event when task failed with NotSerializableException or > TaskOutputFileAlreadyExistException to release executors for dynamic > allocation > -- > > Key: SPARK-40094 > URL: https://issues.apache.org/jira/browse/SPARK-40094 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: wangshengjie >Assignee: Apache Spark >Priority: Major > > We found if task failed with NotSerializableException or > TaskOutputFileAlreadyExistException, wont send TaskEnd event, and this will > cause dynamic allocation not release executor normally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40094) Send TaskEnd event when task failed with NotSerializableException or TaskOutputFileAlreadyExistException to release executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-40094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580002#comment-17580002 ] Apache Spark commented on SPARK-40094: -- User 'wangshengjie123' has created a pull request for this issue: https://github.com/apache/spark/pull/37528 > Send TaskEnd event when task failed with NotSerializableException or > TaskOutputFileAlreadyExistException to release executors for dynamic > allocation > -- > > Key: SPARK-40094 > URL: https://issues.apache.org/jira/browse/SPARK-40094 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: wangshengjie >Priority: Major > > We found if task failed with NotSerializableException or > TaskOutputFileAlreadyExistException, wont send TaskEnd event, and this will > cause dynamic allocation not release executor normally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40094) Send TaskEnd event when task failed with NotSerializableException or TaskOutputFileAlreadyExistException to release executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-40094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1757#comment-1757 ] wangshengjie commented on SPARK-40094: -- I'm working on this, a pr will be submitted later, thanks. > Send TaskEnd event when task failed with NotSerializableException or > TaskOutputFileAlreadyExistException to release executors for dynamic > allocation > -- > > Key: SPARK-40094 > URL: https://issues.apache.org/jira/browse/SPARK-40094 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: wangshengjie >Priority: Major > > We found if task failed with NotSerializableException or > TaskOutputFileAlreadyExistException, wont send TaskEnd event, and this will > cause dynamic allocation not release executor normally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40094) Send TaskEnd event when task failed with NotSerializableException or TaskOutputFileAlreadyExistException to release executors for dynamic allocation
wangshengjie created SPARK-40094: Summary: Send TaskEnd event when task failed with NotSerializableException or TaskOutputFileAlreadyExistException to release executors for dynamic allocation Key: SPARK-40094 URL: https://issues.apache.org/jira/browse/SPARK-40094 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: wangshengjie We found that if a task fails with NotSerializableException or TaskOutputFileAlreadyExistException, Spark won't send a TaskEnd event, and this will cause dynamic allocation to not release executors normally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40093) is kubernetes jar required if not using that executor?
t oo created SPARK-40093: Summary: is kubernetes jar required if not using that executor? Key: SPARK-40093 URL: https://issues.apache.org/jira/browse/SPARK-40093 Project: Spark Issue Type: Question Components: Deploy Affects Versions: 3.3.0 Reporter: t oo my docker file is very big with pyspark can i remove dis files below if i don't use 'ML'? 14M /usr/local/lib/python3.10/site-packages/pyspark/jars/breeze_2.12-1.2.jar 5.9M /usr/local/lib/python3.10/site-packages/pyspark/jars/spark-mllib_2.12-3.3.0.jar -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40093) is kubernetes jar required if not using that executor?
[ https://issues.apache.org/jira/browse/SPARK-40093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] t oo updated SPARK-40093: - Description: my docker file is very big with pyspark can i remove dis files below if i don't use 'kubernetes executor'? 11M total 4.0M /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-core-5.12.2.jar 840K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-client-5.12.2.jar 760K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-admissionregistration-5.12.2.jar 704K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-apiextensions-5.12.2.jar 640K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-autoscaling-5.12.2.jar 528K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-extensions-5.12.2.jar 516K /usr/local/lib/python3.10/site-packages/pyspark/jars/spark-kubernetes_2.12-3.3.0.jar 456K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-networking-5.12.2.jar 436K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-apps-5.12.2.jar 364K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-storageclass-5.12.2.jar 336K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-policy-5.12.2.jar 264K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-flowcontrol-5.12.2.jar 244K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-batch-5.12.2.jar 192K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-discovery-5.12.2.jar 176K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-rbac-5.12.2.jar 160K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-node-5.12.2.jar 144K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-certificates-5.12.2.jar 104K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-events-5.12.2.jar 80K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-metrics-5.12.2.jar 68K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-scheduling-5.12.2.jar 48K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-coordination-5.12.2.jar 20K /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-common-5.12.2.jar was: my docker file is very big with pyspark can i remove dis files below if i don't use 'ML'? 14M /usr/local/lib/python3.10/site-packages/pyspark/jars/breeze_2.12-1.2.jar 5.9M /usr/local/lib/python3.10/site-packages/pyspark/jars/spark-mllib_2.12-3.3.0.jar > is kubernetes jar required if not using that executor? > -- > > Key: SPARK-40093 > URL: https://issues.apache.org/jira/browse/SPARK-40093 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 3.3.0 >Reporter: t oo >Priority: Major > > my docker file is very big with pyspark > can i remove dis files below if i don't use 'kubernetes executor'? 
> 11M total > 4.0M > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-core-5.12.2.jar > 840K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-client-5.12.2.jar > 760K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-admissionregistration-5.12.2.jar > 704K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-apiextensions-5.12.2.jar > 640K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-autoscaling-5.12.2.jar > 528K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-extensions-5.12.2.jar > 516K > /usr/local/lib/python3.10/site-packages/pyspark/jars/spark-kubernetes_2.12-3.3.0.jar > 456K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-networking-5.12.2.jar > 436K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-apps-5.12.2.jar > 364K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-storageclass-5.12.2.jar > 336K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-policy-5.12.2.jar > 264K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-flowcontrol-5.12.2.jar > 244K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-batch-5.12.2.jar > 192K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-discovery-5.12.2.jar > 176K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-rbac-5.12.2.jar > 160K > /usr/local/lib/python3.10/site-packages/pyspark/jars/kubernetes-model-node-5.12.2.jar > 144K > /usr/local/lib/python3.10/site-
[jira] [Created] (SPARK-40092) is breeze required if not using ML?
t oo created SPARK-40092: Summary: is breeze required if not using ML? Key: SPARK-40092 URL: https://issues.apache.org/jira/browse/SPARK-40092 Project: Spark Issue Type: Question Components: Deploy Affects Versions: 3.3.0 Reporter: t oo my Docker file is very big with PySpark. Can I remove this file below if I don't use 'spark streaming'? 35M /usr/local/lib/python3.10/site-packages/pyspark/jars/rocksdbjni-6.20.3.jar -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40092) is breeze required if not using ML?
[ https://issues.apache.org/jira/browse/SPARK-40092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] t oo updated SPARK-40092: - Description: my docker file is very big with pyspark can i remove dis files below if i don't use 'ML'? 14M /usr/local/lib/python3.10/site-packages/pyspark/jars/breeze_2.12-1.2.jar 5.9M /usr/local/lib/python3.10/site-packages/pyspark/jars/spark-mllib_2.12-3.3.0.jar was: my docker file is very big with pyspark can i remove dis file below if i don't use 'spark streaming'? 35M /usr/local/lib/python3.10/site-packages/pyspark/jars/rocksdbjni-6.20.3.jar > is breeze required if not using ML? > --- > > Key: SPARK-40092 > URL: https://issues.apache.org/jira/browse/SPARK-40092 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 3.3.0 >Reporter: t oo >Priority: Major > > my docker file is very big with pyspark > can i remove dis files below if i don't use 'ML'? > 14M > /usr/local/lib/python3.10/site-packages/pyspark/jars/breeze_2.12-1.2.jar > 5.9M > /usr/local/lib/python3.10/site-packages/pyspark/jars/spark-mllib_2.12-3.3.0.jar -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40091) is rocksdbjni required if not using streaming?
t oo created SPARK-40091: Summary: is rocksdbjni required if not using streaming? Key: SPARK-40091 URL: https://issues.apache.org/jira/browse/SPARK-40091 Project: Spark Issue Type: Question Components: Deploy Affects Versions: 3.3.0 Reporter: t oo my Docker file is very big with PySpark. Can I remove this file below if I don't use 'spark streaming'? 35M /usr/local/lib/python3.10/site-packages/pyspark/jars/rocksdbjni-6.20.3.jar -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
[ https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579984#comment-17579984 ] XiDuo You commented on SPARK-40089: --- thank you [~revans2] for reporting the issue, I can reproduce it by: {code:java} SELECT cast(col1 as decimal(20,2)) as c FROM VALUES (99.50),(99.49),(1.11) ORDER BY c; -- output: 99.50 1.11 99.49{code} do you want send a pr to fix it ? > Doring of at least Decimal(20, 2) fails for some values near the max. > - > > Key: SPARK-40089 > URL: https://issues.apache.org/jira/browse/SPARK-40089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Robert Joseph Evans >Priority: Major > Attachments: input.parquet > > > I have been doing some testing with Decimal values for the RAPIDS Accelerator > for Apache Spark. I have been trying to add in new corner cases and when I > tried to enable the maximum supported value for a sort I started to get > failures. On closer inspection it looks like the CPU is sorting things > incorrectly. Specifically anything that is "99.50" or above > is placed as a chunk in the wrong location in the outputs. > In local mode with 12 tasks. > {code:java} > spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) > {code} > > Here you will notice that the last entry printed is > {{[99.49]}}, and {{[99.99]}} is near the top > near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40009) Add missing doc string info to DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-40009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40009. -- Fix Version/s: 3.4.0 Assignee: Khalid Mammadov Resolution: Fixed Resolved by https://github.com/apache/spark/pull/37441 > Add missing doc string info to DataFrame API > > > Key: SPARK-40009 > URL: https://issues.apache.org/jira/browse/SPARK-40009 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Khalid Mammadov >Assignee: Khalid Mammadov >Priority: Minor > Fix For: 3.4.0 > > > Some of the docstrings in Python DataFrame API is not complete, for example > some missing Parameters section or Return or Examples. It would help users if > we can provide these missing infos for all methods/functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40077) Make pyspark.context examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40077: Assignee: Ruifeng Zheng > Make pyspark.context examples self-contained > > > Key: SPARK-40077 > URL: https://issues.apache.org/jira/browse/SPARK-40077 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40077) Make pyspark.context examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40077. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37517 [https://github.com/apache/spark/pull/37517] > Make pyspark.context examples self-contained > > > Key: SPARK-40077 > URL: https://issues.apache.org/jira/browse/SPARK-40077 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37442) In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the table that is larger than 8GB: 8 GB" failure
[ https://issues.apache.org/jira/browse/SPARK-37442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579969#comment-17579969 ] EmmaYang commented on SPARK-37442: -- Hello, I exactly have this issue. and I am usign spark2.4 so my broadcast dataframe is built on top of files. and the oveall files size is > 12gb, but I only use the sub dataframe, and in STORAGE, it showed only 2.5 GB, but still give me the broadcast hit 8GB error so any workaround solution for it ? Thank you. : : : +- ResolvedHint (broadcast) : : : +- Filter isnotnull(invlv_pty_id#5078) : : : +- InMemoryRelation [invlv_pty_id#5078, invlv_pty_id#5078], StorageLevel(disk, memory, deserialized, 1 replicas) : : : +- *(1) Project [invlv_pty_id#5078, invlv_pty_id#5078] : : : +- *(1) FileScan csv [invlv_pty_id#5078] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://gftsdev/data/gfrrsnsd/standardization/hive/gfrrsnsd_standardization/trl_..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct : : +- ResolvedHint (broadcast) : : +- Filter isnotnull(invlv_pty_id#5078) : : +- InMemoryRelation [invlv_pty_id#5078, invlv_pty_id#5078], StorageLevel(disk, memory, deserialized, 1 replicas) : : +- *(1) Project [invlv_pty_id#5078, invlv_pty_id#5078] : : +- *(1) FileScan csv [invlv_pty_id#5078] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://gftsdev/data/gfrrsnsd/standardization/hive/gfrrsnsd_standardization/trl_..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct : +- ResolvedHint (broadcast) : +- Filter isnotnull(invlv_pty_id#5078) : +- InMemoryRelation [invlv_pty_id#5078, invlv_pty_id#5078], StorageLevel(disk, memory, deserialized, 1 replicas) : +- *(1) Project [invlv_pty_id#5078, invlv_pty_id#5078] : +- *(1) FileScan csv [invlv_pty_id#5078] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://gftsdev/data/gfrrsnsd/standardization/hive/gfrrsnsd_standardization/trl_..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct +- ResolvedHint (broadcast) +- Filter isnotnull(invlv_pty_id#5078) +- InMemoryRelation [invlv_pty_id#5078, invlv_pty_id#5078], StorageLevel(disk, memory, deserialized, 1 replicas) +- *(1) Project [invlv_pty_id#5078, invlv_pty_id#5078] +- *(1) FileScan csv [invlv_pty_id#5078] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://gftsdev/data/gfrrsnsd/standardization/hive/gfrrsnsd_standardization/trl_..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct > In AQE, wrong InMemoryRelation size estimation causes "Cannot broadcast the > table that is larger than 8GB: 8 GB" failure > > > Key: SPARK-37442 > URL: https://issues.apache.org/jira/browse/SPARK-37442 > Project: Spark > Issue Type: Sub-task > Components: Optimizer, SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: Michael Chen >Assignee: Michael Chen >Priority: Major > Fix For: 3.2.1, 3.3.0 > > > There is a period in time where an InMemoryRelation will have the cached > buffers loaded, but the statistics will be inaccurate (anywhere between 0 -> > size in bytes reported by accumulators). When AQE is enabled, it is possible > that join planning strategies will happen in this window. In this scenario, > join children sizes including InMemoryRelation are greatly underestimated and > a broadcast join can be planned when it shouldn't be. We have seen scenarios > where a broadcast join is planned with the builder size greater than 8GB > because at planning time, the optimizer believes the InMemoryRelation is 0 > bytes. 
> Here is an example test case where the broadcast threshold is being ignored. > It can mimic the 8GB error by increasing the size of the tables. > {code:java} > withSQLConf( > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true", > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "1048584") { > // Spark estimates a string column as 20 bytes so with 60k rows, these > relations should be > // estimated at ~120m bytes which is greater than the broadcast join > threshold > Seq.fill(6)("a").toDF("key") > .createOrReplaceTempView("temp") > Seq.fill(6)("b").toDF("key") > .createOrReplaceTempView("temp2") > Seq("a").toDF("key").createOrReplaceTempView("smallTemp") > spark.sql("SELECT key as newKey FROM temp").persist() > val query = > s""" > |SELECT t3.newKey > |FROM > | (SELECT t1.newKey > | FROM (SELECT key as newKey FROM temp) as t1 > |JOIN > |(SELECT key FROM smallTemp) as t2 > |ON t1.newKey = t2.key > | ) as t3 > | JOIN > | (SELECT key FROM temp2) as t4 > | ON t3.newKey = t4.key > |UNION > |SELECT t1.newKey >
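For readers hitting the same symptom before picking up the fix, a blunt mitigation (not part of this ticket, and it trades away broadcast joins entirely) is to disable automatic broadcast planning so that an underestimated InMemoryRelation can no longer be chosen as a broadcast side. A minimal sketch, assuming a running SparkSession named `spark`:
{code:scala}
// Workaround sketch, not a fix for the estimation bug itself:
// setting the threshold to -1 disables automatic broadcast joins,
// so an underestimated cached relation cannot be broadcast by mistake.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Explicit hints (broadcast(df) or /*+ BROADCAST(t) */) still force a broadcast,
// so they would also need to be removed from the affected query.
{code}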
[jira] [Resolved] (SPARK-40066) ANSI mode: always return null on invalid access to map column
[ https://issues.apache.org/jira/browse/SPARK-40066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-40066. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37503 [https://github.com/apache/spark/pull/37503] > ANSI mode: always return null on invalid access to map column > - > > Key: SPARK-40066 > URL: https://issues.apache.org/jira/browse/SPARK-40066 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.4.0 > > > Since https://github.com/apache/spark/pull/30386, Spark always throws an > error on invalid access to a map column. There is no such syntax in the ANSI > SQL standard since there is no Map type in it. There is a similar type > `multiset` which returns null on non-existing element access. > Also, I investigated PostgreSQL/Snowflake/BigQuery and all of them return > null when a map (JSON) key does not exist. > I suggest loosening this behavior. When users get the error, most of them > will just use `try_element_at()` to get the same behavior or just turn off the > ANSI SQL mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
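For context on the `try_element_at()` workaround mentioned in the description, here is a minimal sketch (assuming Spark 3.3+ with a SparkSession named `spark`; the map literal is only an illustration):
{code:scala}
// With ANSI mode on, element_at on a missing map key raises an error (before SPARK-40066),
// while try_element_at returns NULL instead.
spark.conf.set("spark.sql.ansi.enabled", "true")

spark.sql("SELECT try_element_at(map('a', 1), 'b') AS v").show()  // v = NULL
// spark.sql("SELECT element_at(map('a', 1), 'b')").show()        // would fail before this change
{code}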
[jira] [Resolved] (SPARK-40090) Upgrade to Py4J 0.10.9.7
[ https://issues.apache.org/jira/browse/SPARK-40090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-40090. -- Resolution: Duplicate > Upgrade to Py4J 0.10.9.7 > > > Key: SPARK-40090 > URL: https://issues.apache.org/jira/browse/SPARK-40090 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Upgrade to Py4J 0.10.9.7 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40090) Upgrade to Py4J 0.10.9.7
[ https://issues.apache.org/jira/browse/SPARK-40090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40090: Assignee: (was: Apache Spark) > Upgrade to Py4J 0.10.9.7 > > > Key: SPARK-40090 > URL: https://issues.apache.org/jira/browse/SPARK-40090 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Upgrade to Py4J 0.10.9.7 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40090) Upgrade to Py4J 0.10.9.7
[ https://issues.apache.org/jira/browse/SPARK-40090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579930#comment-17579930 ] Apache Spark commented on SPARK-40090: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/37527 > Upgrade to Py4J 0.10.9.7 > > > Key: SPARK-40090 > URL: https://issues.apache.org/jira/browse/SPARK-40090 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Upgrade to Py4J 0.10.9.7 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40090) Upgrade to Py4J 0.10.9.7
[ https://issues.apache.org/jira/browse/SPARK-40090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40090: Assignee: Apache Spark > Upgrade to Py4J 0.10.9.7 > > > Key: SPARK-40090 > URL: https://issues.apache.org/jira/browse/SPARK-40090 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Upgrade to Py4J 0.10.9.7 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40090) Upgrade to Py4J 0.10.9.7
[ https://issues.apache.org/jira/browse/SPARK-40090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579929#comment-17579929 ] Apache Spark commented on SPARK-40090: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/37527 > Upgrade to Py4J 0.10.9.7 > > > Key: SPARK-40090 > URL: https://issues.apache.org/jira/browse/SPARK-40090 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Upgrade to Py4J 0.10.9.7 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40090) Upgrade to Py4J 0.10.9.7
Xinrong Meng created SPARK-40090: Summary: Upgrade to Py4J 0.10.9.7 Key: SPARK-40090 URL: https://issues.apache.org/jira/browse/SPARK-40090 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Upgrade to Py4J 0.10.9.7 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
[ https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579897#comment-17579897 ] Robert Joseph Evans commented on SPARK-40089: - Looking at the code it appears that the prefix calculator has an overflow bug in it. {code} if (value.changePrecision(p, s)) value.toUnscaledLong else Long.MinValue {code} We are rounding up when changing the precision and when that happens we fall back to {{Long.MinValue}} a.k.a -9223372036854775808, which results in the failure. > Doring of at least Decimal(20, 2) fails for some values near the max. > - > > Key: SPARK-40089 > URL: https://issues.apache.org/jira/browse/SPARK-40089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Robert Joseph Evans >Priority: Major > Attachments: input.parquet > > > I have been doing some testing with Decimal values for the RAPIDS Accelerator > for Apache Spark. I have been trying to add in new corner cases and when I > tried to enable the maximum supported value for a sort I started to get > failures. On closer inspection it looks like the CPU is sorting things > incorrectly. Specifically anything that is "99.50" or above > is placed as a chunk in the wrong location in the outputs. > In local mode with 12 tasks. > {code:java} > spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) > {code} > > Here you will notice that the last entry printed is > {{[99.49]}}, and {{[99.99]}} is near the top > near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
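To make the described overflow concrete, here is a small sketch against org.apache.spark.sql.types.Decimal. The p = 18 / s = 0 parameters and the sample value are assumptions chosen for illustration (they mirror what the snippet above implies for Decimal(20, 2)); this is not code taken from Spark's SortPrefix:
{code:scala}
import org.apache.spark.sql.types.Decimal

// A Decimal(20, 2) value near the maximum, with a fractional part that rounds up (illustrative).
val value = Decimal(BigDecimal("999999999999999999.50"), 20, 2)

// The prefix computation squeezes the value into 18 digits (Decimal.MAX_LONG_DIGITS);
// for precision 20 / scale 2 that amounts to changePrecision(18, 0).
val p = 18
val s = 0
val prefix = if (value.changePrecision(p, s)) value.toUnscaledLong else Long.MinValue

// Rounding .50 up produces a 19-digit number, so changePrecision fails and the
// prefix collapses to Long.MinValue -- these rows then sort before everything else.
println(prefix)  // -9223372036854775808
{code}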
[jira] [Commented] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
[ https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579892#comment-17579892 ] Robert Joseph Evans commented on SPARK-40089: - It sure looks like it is related to the prefix calculator. I think it is overflowing some how. I added some debugging into 3.2.0 and I got back {code} 22/08/15 20:17:58 ERROR SortExec: PREFIX FOR 99.99 IS false -9223372036854775808 {code} The prefix should not be negative for non-negative values. > Doring of at least Decimal(20, 2) fails for some values near the max. > - > > Key: SPARK-40089 > URL: https://issues.apache.org/jira/browse/SPARK-40089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Robert Joseph Evans >Priority: Major > Attachments: input.parquet > > > I have been doing some testing with Decimal values for the RAPIDS Accelerator > for Apache Spark. I have been trying to add in new corner cases and when I > tried to enable the maximum supported value for a sort I started to get > failures. On closer inspection it looks like the CPU is sorting things > incorrectly. Specifically anything that is "99.50" or above > is placed as a chunk in the wrong location in the outputs. > In local mode with 12 tasks. > {code:java} > spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) > {code} > > Here you will notice that the last entry printed is > {{[99.49]}}, and {{[99.99]}} is near the top > near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
[ https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579887#comment-17579887 ] Robert Joseph Evans commented on SPARK-40089: - I have been trying to debug this and it does not look like it is related to the partitioner. I can run with a single shuffle partition and I get the same results. Not sure if the prefix calculation is doing this or what. > Doring of at least Decimal(20, 2) fails for some values near the max. > - > > Key: SPARK-40089 > URL: https://issues.apache.org/jira/browse/SPARK-40089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Robert Joseph Evans >Priority: Major > Attachments: input.parquet > > > I have been doing some testing with Decimal values for the RAPIDS Accelerator > for Apache Spark. I have been trying to add in new corner cases and when I > tried to enable the maximum supported value for a sort I started to get > failures. On closer inspection it looks like the CPU is sorting things > incorrectly. Specifically anything that is "99.50" or above > is placed as a chunk in the wrong location in the outputs. > In local mode with 12 tasks. > {code:java} > spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) > {code} > > Here you will notice that the last entry printed is > {{[99.49]}}, and {{[99.99]}} is near the top > near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
[ https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated SPARK-40089: Attachment: input.parquet > Doring of at least Decimal(20, 2) fails for some values near the max. > - > > Key: SPARK-40089 > URL: https://issues.apache.org/jira/browse/SPARK-40089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0, 3.4.0 >Reporter: Robert Joseph Evans >Priority: Major > Attachments: input.parquet > > > I have been doing some testing with Decimal values for the RAPIDS Accelerator > for Apache Spark. I have been trying to add in new corner cases and when I > tried to enable the maximum supported value for a sort I started to get > failures. On closer inspection it looks like the CPU is sorting things > incorrectly. Specifically anything that is "99.50" or above > is placed as a chunk in the wrong location in the outputs. > In local mode with 12 tasks. > {code:java} > spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) > {code} > > Here you will notice that the last entry printed is > {{[99.49]}}, and {{[99.99]}} is near the top > near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.
Robert Joseph Evans created SPARK-40089: --- Summary: Doring of at least Decimal(20, 2) fails for some values near the max. Key: SPARK-40089 URL: https://issues.apache.org/jira/browse/SPARK-40089 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0, 3.2.0, 3.4.0 Reporter: Robert Joseph Evans I have been doing some testing with Decimal values for the RAPIDS Accelerator for Apache Spark. I have been trying to add in new corner cases and when I tried to enable the maximum supported value for a sort I started to get failures. On closer inspection it looks like the CPU is sorting things incorrectly. Specifically anything that is "99.50" or above is placed as a chunk in the wrong location in the outputs. In local mode with 12 tasks. {code:java} spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println) {code} Here you will notice that the last entry printed is {{[99.49]}}, and {{[99.99]}} is near the top near {{[-99.99]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38888) Add `RocksDBProvider` similar to `LevelDBProvider`
[ https://issues.apache.org/jira/browse/SPARK-38888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579845#comment-17579845 ] Dongjoon Hyun commented on SPARK-38888: --- BTW, SPARK-35781 has more background. - LevelDB, RocksDB, BrotliCodec had known issues on Apple Silicon. - Only the RocksDB community managed to fix it recently, in RocksDB 7.0.3. - However, since it's a big change from RocksDB 6 to RocksDB 7, SPARK-38257 landed only in Apache Spark 3.4.0 in the community. However, I'm using RocksDB 7 in production internally. > Add `RocksDBProvider` similar to `LevelDBProvider` > -- > > Key: SPARK-38888 > URL: https://issues.apache.org/jira/browse/SPARK-38888 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > `LevelDBProvider` is used by `ExternalShuffleBlockResolver` and > `YarnShuffleService`, a corresponding `RocksDB` implementation should be added -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38888) Add `RocksDBProvider` similar to `LevelDBProvider`
[ https://issues.apache.org/jira/browse/SPARK-38888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579839#comment-17579839 ] Dongjoon Hyun commented on SPARK-38888: --- Yes, [~tgraves]. LevelDB is too ancient and has a compatibility issue on Apple Silicon, which is tracked by SPARK-35782. > Add `RocksDBProvider` similar to `LevelDBProvider` > -- > > Key: SPARK-38888 > URL: https://issues.apache.org/jira/browse/SPARK-38888 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > `LevelDBProvider` is used by `ExternalShuffleBlockResolver` and > `YarnShuffleService`, a corresponding `RocksDB` implementation should be added -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40064) Use V2 Filter in SupportsOverwrite
[ https://issues.apache.org/jira/browse/SPARK-40064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-40064. Fix Version/s: 3.4.0 Assignee: Huaxin Gao Resolution: Fixed > Use V2 Filter in SupportsOverwrite > -- > > Key: SPARK-40064 > URL: https://issues.apache.org/jira/browse/SPARK-40064 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.4.0 > > > Add V2 Filter support in SupportsOverwrite -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40088) Add SparkPlanWIthAQESuite
Kazuyuki Tanimura created SPARK-40088: - Summary: Add SparkPlanWIthAQESuite Key: SPARK-40088 URL: https://issues.apache.org/jira/browse/SPARK-40088 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 3.4.0 Reporter: Kazuyuki Tanimura Currently `SparkPlanSuite` assumes that AQE is always turned off. We should also test with AQE turned on -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
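A rough sketch of the idea (purely illustrative; the session setup and assertion are assumptions, not the actual suite code): run the same check twice, once with AQE off and once with AQE on.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.internal.SQLConf

// Toggle adaptive execution around the same query and compare results.
val spark = SparkSession.builder().master("local[2]").appName("aqe-sketch").getOrCreate()
Seq("false", "true").foreach { aqeEnabled =>
  spark.conf.set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, aqeEnabled)
  val counts = spark.range(100).repartition(5).groupBy(col("id") % 10).count().collect()
  assert(counts.length == 10, s"unexpected result with AQE=$aqeEnabled")
}
spark.stop()
{code}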
[jira] [Assigned] (SPARK-40087) Support multiple Column drop in R
[ https://issues.apache.org/jira/browse/SPARK-40087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40087: Assignee: Apache Spark > Support multiple Column drop in R > - > > Key: SPARK-40087 > URL: https://issues.apache.org/jira/browse/SPARK-40087 > Project: Spark > Issue Type: New Feature > Components: R >Affects Versions: 3.3.0 >Reporter: Santosh Pingale >Assignee: Apache Spark >Priority: Minor > > This is a followup on SPARK-39895. The PR previously attempted to adjust > implementation for R as well to match signatures but that part was removed > and we only focused on getting python implementation to behave correctly. > *{{Change supports following operations:}}* > {{df <- select(read.json(jsonPath), "name", "age")}} > {{df$age2 <- df$age}} > {{df1 <- drop(df, df$age, df$name)}} > {{expect_equal(columns(df1), c("age2"))}} > {{df1 <- drop(df, list(df$age, column("random")))}} > {{expect_equal(columns(df1), c("name", "age2"))}} > {{df1 <- drop(df, list(df$age, df$name))}} > {{expect_equal(columns(df1), c("age2"))}} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40087) Support multiple Column drop in R
[ https://issues.apache.org/jira/browse/SPARK-40087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40087: Assignee: (was: Apache Spark) > Support multiple Column drop in R > - > > Key: SPARK-40087 > URL: https://issues.apache.org/jira/browse/SPARK-40087 > Project: Spark > Issue Type: New Feature > Components: R >Affects Versions: 3.3.0 >Reporter: Santosh Pingale >Priority: Minor > > This is a followup on SPARK-39895. The PR previously attempted to adjust > implementation for R as well to match signatures but that part was removed > and we only focused on getting python implementation to behave correctly. > *{{Change supports following operations:}}* > {{df <- select(read.json(jsonPath), "name", "age")}} > {{df$age2 <- df$age}} > {{df1 <- drop(df, df$age, df$name)}} > {{expect_equal(columns(df1), c("age2"))}} > {{df1 <- drop(df, list(df$age, column("random")))}} > {{expect_equal(columns(df1), c("name", "age2"))}} > {{df1 <- drop(df, list(df$age, df$name))}} > {{expect_equal(columns(df1), c("age2"))}} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40087) Support multiple Column drop in R
[ https://issues.apache.org/jira/browse/SPARK-40087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579826#comment-17579826 ] Apache Spark commented on SPARK-40087: -- User 'santosh-d3vpl3x' has created a pull request for this issue: https://github.com/apache/spark/pull/37526 > Support multiple Column drop in R > - > > Key: SPARK-40087 > URL: https://issues.apache.org/jira/browse/SPARK-40087 > Project: Spark > Issue Type: New Feature > Components: R >Affects Versions: 3.3.0 >Reporter: Santosh Pingale >Priority: Minor > > This is a followup on SPARK-39895. The PR previously attempted to adjust > implementation for R as well to match signatures but that part was removed > and we only focused on getting python implementation to behave correctly. > *{{Change supports following operations:}}* > {{df <- select(read.json(jsonPath), "name", "age")}} > {{df$age2 <- df$age}} > {{df1 <- drop(df, df$age, df$name)}} > {{expect_equal(columns(df1), c("age2"))}} > {{df1 <- drop(df, list(df$age, column("random")))}} > {{expect_equal(columns(df1), c("name", "age2"))}} > {{df1 <- drop(df, list(df$age, df$name))}} > {{expect_equal(columns(df1), c("age2"))}} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40087) Support multiple Column drop in R
Santosh Pingale created SPARK-40087: --- Summary: Support multiple Column drop in R Key: SPARK-40087 URL: https://issues.apache.org/jira/browse/SPARK-40087 Project: Spark Issue Type: New Feature Components: R Affects Versions: 3.3.0 Reporter: Santosh Pingale This is a followup on SPARK-39895. The PR previously attempted to adjust implementation for R as well to match signatures but that part was removed and we only focused on getting python implementation to behave correctly. *{{Change supports following operations:}}* {{df <- select(read.json(jsonPath), "name", "age")}} {{df$age2 <- df$age}} {{df1 <- drop(df, df$age, df$name)}} {{expect_equal(columns(df1), c("age2"))}} {{df1 <- drop(df, list(df$age, column("random")))}} {{expect_equal(columns(df1), c("name", "age2"))}} {{df1 <- drop(df, list(df$age, df$name))}} {{expect_equal(columns(df1), c("age2"))}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40086) Improve AliasAwareOutputPartitioning to take all aliases into account
[ https://issues.apache.org/jira/browse/SPARK-40086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40086: Assignee: (was: Apache Spark) > Improve AliasAwareOutputPartitioning to take all aliases into account > - > > Key: SPARK-40086 > URL: https://issues.apache.org/jira/browse/SPARK-40086 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Peter Toth >Priority: Major > > Currently AliasAwareOutputPartitioning takes only the last alias by aliased > expressions into account. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40086) Improve AliasAwareOutputPartitioning to take all aliases into account
[ https://issues.apache.org/jira/browse/SPARK-40086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579814#comment-17579814 ] Apache Spark commented on SPARK-40086: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/37525 > Improve AliasAwareOutputPartitioning to take all aliases into account > - > > Key: SPARK-40086 > URL: https://issues.apache.org/jira/browse/SPARK-40086 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Peter Toth >Priority: Major > > Currently AliasAwareOutputPartitioning takes only the last alias by aliased > expressions into account. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40086) Improve AliasAwareOutputPartitioning to take all aliases into account
[ https://issues.apache.org/jira/browse/SPARK-40086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40086: Assignee: Apache Spark > Improve AliasAwareOutputPartitioning to take all aliases into account > - > > Key: SPARK-40086 > URL: https://issues.apache.org/jira/browse/SPARK-40086 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Peter Toth >Assignee: Apache Spark >Priority: Major > > Currently AliasAwareOutputPartitioning takes only the last alias by aliased > expressions into account. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40086) Improve AliasAwareOutputPartitioning to take all aliases into account
Peter Toth created SPARK-40086: -- Summary: Improve AliasAwareOutputPartitioning to take all aliases into account Key: SPARK-40086 URL: https://issues.apache.org/jira/browse/SPARK-40086 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Peter Toth Currently AliasAwareOutputPartitioning takes only the last alias by aliased expressions into account. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40085) use INTERNAL_ERROR error class instead of IllegalStateException to indicate bugs
[ https://issues.apache.org/jira/browse/SPARK-40085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579749#comment-17579749 ] Apache Spark commented on SPARK-40085: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/37524 > use INTERNAL_ERROR error class instead of IllegalStateException to indicate > bugs > > > Key: SPARK-40085 > URL: https://issues.apache.org/jira/browse/SPARK-40085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40085) use INTERNAL_ERROR error class instead of IllegalStateException to indicate bugs
[ https://issues.apache.org/jira/browse/SPARK-40085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40085: Assignee: Apache Spark > use INTERNAL_ERROR error class instead of IllegalStateException to indicate > bugs > > > Key: SPARK-40085 > URL: https://issues.apache.org/jira/browse/SPARK-40085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40085) use INTERNAL_ERROR error class instead of IllegalStateException to indicate bugs
[ https://issues.apache.org/jira/browse/SPARK-40085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40085: Assignee: (was: Apache Spark) > use INTERNAL_ERROR error class instead of IllegalStateException to indicate > bugs > > > Key: SPARK-40085 > URL: https://issues.apache.org/jira/browse/SPARK-40085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40085) use INTERNAL_ERROR error class instead of IllegalStateException to indicate bugs
[ https://issues.apache.org/jira/browse/SPARK-40085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579748#comment-17579748 ] Apache Spark commented on SPARK-40085: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/37524 > use INTERNAL_ERROR error class instead of IllegalStateException to indicate > bugs > > > Key: SPARK-40085 > URL: https://issues.apache.org/jira/browse/SPARK-40085 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40085) use INTERNAL_ERROR error class instead of IllegalStateException to indicate bugs
Wenchen Fan created SPARK-40085: --- Summary: use INTERNAL_ERROR error class instead of IllegalStateException to indicate bugs Key: SPARK-40085 URL: https://issues.apache.org/jira/browse/SPARK-40085 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38888) Add `RocksDBProvider` similar to `LevelDBProvider`
[ https://issues.apache.org/jira/browse/SPARK-38888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579738#comment-17579738 ] Thomas Graves commented on SPARK-38888: --- Just curious, does RocksDB give us some particular benefit - performance or compatibility? Is LevelDB not supported on Apple Silicon? Just curious, and it would be good to record why we are adding support. > Add `RocksDBProvider` similar to `LevelDBProvider` > -- > > Key: SPARK-38888 > URL: https://issues.apache.org/jira/browse/SPARK-38888 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > `LevelDBProvider` is used by `ExternalShuffleBlockResolver` and > `YarnShuffleService`, a corresponding `RocksDB` implementation should be added -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40058) Avoid filter twice in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-40058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-40058: Assignee: ZiyueGuan > Avoid filter twice in HadoopFSUtils > --- > > Key: SPARK-40058 > URL: https://issues.apache.org/jira/browse/SPARK-40058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: ZiyueGuan >Assignee: ZiyueGuan >Priority: Minor > Fix For: 3.4.0 > > > In HadoopFSUtils, listLeafFiles will apply the filter more than once in the recursive > method call. This can waste time when the filter logic is heavy. It would be > good to refactor this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40058) Avoid filter twice in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-40058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40058. -- Fix Version/s: 3.4.0 Resolution: Fixed Resolved by https://github.com/apache/spark/pull/37498 > Avoid filter twice in HadoopFSUtils > --- > > Key: SPARK-40058 > URL: https://issues.apache.org/jira/browse/SPARK-40058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: ZiyueGuan >Priority: Minor > Fix For: 3.4.0 > > > In HadoopFSUtils, listLeafFiles will apply the filter more than once in the recursive > method call. This can waste time when the filter logic is heavy. It would be > good to refactor this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
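The shape of the change, sketched with plain java.io because the real code works against Hadoop FileSystem APIs: apply the (possibly expensive) filter once, where leaf files are collected, instead of re-applying it to the concatenated results of each recursive call. The function below is a hypothetical simplification, not the actual HadoopFSUtils code:
{code:scala}
import java.io.File

// Recursively list leaf files, running the filter exactly once per file.
def listLeafFiles(dir: File, filter: File => Boolean): Seq[File] = {
  val entries = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty[File])
  val (dirs, files) = entries.partition(_.isDirectory)
  files.filter(filter) ++ dirs.flatMap(d => listLeafFiles(d, filter))
}
{code}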
[jira] [Resolved] (SPARK-40035) Avoid apply filter twice when listing files
[ https://issues.apache.org/jira/browse/SPARK-40035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40035. -- Resolution: Duplicate > Avoid apply filter twice when listing files > --- > > Key: SPARK-40035 > URL: https://issues.apache.org/jira/browse/SPARK-40035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: EdisonWang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39887) Expression transform error
[ https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-39887: --- Assignee: Peter Toth > Expression transform error > -- > > Key: SPARK-39887 > URL: https://issues.apache.org/jira/browse/SPARK-39887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0, 3.2.2 >Reporter: zhuml >Assignee: Peter Toth >Priority: Major > Fix For: 3.1.4 > > > {code:java} > spark.sql( > """ > |select to_date(a) a, to_date(b) b from > |(select a, a as b from > |(select to_date(a) a from > | values ('2020-02-01') as t1(a) > | group by to_date(a)) t3 > |union all > |select a, b from > |(select to_date(a) a, to_date(b) b from > |values ('2020-01-01','2020-01-02') as t1(a, b) > | group by to_date(a), to_date(b)) t4) t5 > |group by to_date(a), to_date(b) > |""".stripMargin).show(){code} > result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01) > expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39887) Expression transform error
[ https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-39887: Fix Version/s: 3.4.0 3.3.1 3.2.3 > Expression transform error > -- > > Key: SPARK-39887 > URL: https://issues.apache.org/jira/browse/SPARK-39887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0, 3.2.2 >Reporter: zhuml >Assignee: Peter Toth >Priority: Major > Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3 > > > {code:java} > spark.sql( > """ > |select to_date(a) a, to_date(b) b from > |(select a, a as b from > |(select to_date(a) a from > | values ('2020-02-01') as t1(a) > | group by to_date(a)) t3 > |union all > |select a, b from > |(select to_date(a) a, to_date(b) b from > |values ('2020-01-01','2020-01-02') as t1(a, b) > | group by to_date(a), to_date(b)) t4) t5 > |group by to_date(a), to_date(b) > |""".stripMargin).show(){code} > result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01) > expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39887) Expression transform error
[ https://issues.apache.org/jira/browse/SPARK-39887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-39887. - Fix Version/s: 3.1.4 Resolution: Fixed Issue resolved by pull request 37496 [https://github.com/apache/spark/pull/37496] > Expression transform error > -- > > Key: SPARK-39887 > URL: https://issues.apache.org/jira/browse/SPARK-39887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.3.0, 3.2.2 >Reporter: zhuml >Priority: Major > Fix For: 3.1.4 > > > {code:java} > spark.sql( > """ > |select to_date(a) a, to_date(b) b from > |(select a, a as b from > |(select to_date(a) a from > | values ('2020-02-01') as t1(a) > | group by to_date(a)) t3 > |union all > |select a, b from > |(select to_date(a) a, to_date(b) b from > |values ('2020-01-01','2020-01-02') as t1(a, b) > | group by to_date(a), to_date(b)) t4) t5 > |group by to_date(a), to_date(b) > |""".stripMargin).show(){code} > result is (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-01) > expected (2020-02-01, 2020-02-01), (2020-01-01, 2020-01-02) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39982) StructType.fromJson method missing documentation
[ https://issues.apache.org/jira/browse/SPARK-39982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-39982. -- Fix Version/s: 3.4.0 Resolution: Fixed Resolved by https://github.com/apache/spark/pull/37408 > StructType.fromJson method missing documentation > > > Key: SPARK-39982 > URL: https://issues.apache.org/jira/browse/SPARK-39982 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Khalid Mammadov >Assignee: Khalid Mammadov >Priority: Trivial > Fix For: 3.4.0 > > > StructType.fromJson method does not have any documentation. It would be good > to have one that explains how one can use it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39982) StructType.fromJson method missing documentation
[ https://issues.apache.org/jira/browse/SPARK-39982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-39982: Assignee: Khalid Mammadov > StructType.fromJson method missing documentation > > > Key: SPARK-39982 > URL: https://issues.apache.org/jira/browse/SPARK-39982 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Khalid Mammadov >Assignee: Khalid Mammadov >Priority: Trivial > > StructType.fromJson method does not have any documentation. It would be good > to have one that explains how one can use it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40019) Refactor comment of ArrayType
[ https://issues.apache.org/jira/browse/SPARK-40019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40019. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37453 [https://github.com/apache/spark/pull/37453] > Refactor comment of ArrayType > - > > Key: SPARK-40019 > URL: https://issues.apache.org/jira/browse/SPARK-40019 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0, 3.2.2 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.4.0 > > > Now the parameter `containsNull` of ArrayType/MapType is confusing; we need to > add a comment -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
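For readers unfamiliar with the flag, a short sketch of what it controls (illustrative field names only; not the wording of the proposed comment): `containsNull` / `valueContainsNull` describe whether the array elements or map values may be null, independently of whether the column itself is nullable.
{code:scala}
import org.apache.spark.sql.types._

// Array whose elements may be null vs. one whose elements are guaranteed non-null.
val nullableElements    = ArrayType(IntegerType, containsNull = true)
val nonNullableElements = ArrayType(IntegerType, containsNull = false)

// Map whose values are guaranteed non-null (map keys are never nullable in Spark SQL).
val mapNoNullValues = MapType(StringType, IntegerType, valueContainsNull = false)

// The column-level flag is separate: this column is never null,
// but its array may still contain null elements.
val schema = StructType(Seq(StructField("tags", nullableElements, nullable = false)))
{code}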
[jira] [Assigned] (SPARK-40019) Refactor comment of ArrayType
[ https://issues.apache.org/jira/browse/SPARK-40019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40019: --- Assignee: angerszhu > Refactor comment of ArrayType > - > > Key: SPARK-40019 > URL: https://issues.apache.org/jira/browse/SPARK-40019 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0, 3.2.2 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > Now the parameter `containsNull` of ArrayType/MapType is confusing; we need to > add a comment -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40073) Should Use `connector/${moduleName}` instead of `external/${moduleName}`
[ https://issues.apache.org/jira/browse/SPARK-40073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40073. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37512 [https://github.com/apache/spark/pull/37512] > Should Use `connector/${moduleName}` instead of `external/${moduleName}` > > > Key: SPARK-40073 > URL: https://issues.apache.org/jira/browse/SPARK-40073 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra, PySpark, Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > SPARK-38569 rename `external` top level dir to `connector`, but > `external/${moduleName}` is still used in documents instead of > `connector/${moduleName}` > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40083) Add shuffle index cache expire time policy to avoid unused continuous memory consumption
[ https://issues.apache.org/jira/browse/SPARK-40083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579692#comment-17579692 ] wangshengjie edited comment on SPARK-40083 at 8/15/22 1:03 PM: --- Maybe I should also add an expire policy for the push-based shuffle merge manager was (Author: wangshengjie): Maybe I should add an expire policy for the push-based shuffle merge manager > Add shuffle index cache expire time policy to avoid unused continuous memory > consumption > > > Key: SPARK-40083 > URL: https://issues.apache.org/jira/browse/SPARK-40083 > Project: Spark > Issue Type: Improvement > Components: Shuffle, YARN >Affects Versions: 3.3.0 >Reporter: wangshengjie >Priority: Major > > In our production environment, we found some applications that finished about > 2 days ago still had their cache in memory, so we could add a Guava cache > expiry policy to save memory. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40073) Should Use `connector/${moduleName}` instead of `external/${moduleName}`
[ https://issues.apache.org/jira/browse/SPARK-40073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40073: --- Assignee: Yang Jie > Should Use `connector/${moduleName}` instead of `external/${moduleName}` > > > Key: SPARK-40073 > URL: https://issues.apache.org/jira/browse/SPARK-40073 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra, PySpark, Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > SPARK-38569 renamed the `external` top-level dir to `connector`, but > `external/${moduleName}` is still used in documents instead of > `connector/${moduleName}` > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40083) Add shuffle index cache expire time policy to avoid unused continuous memory consumption
[ https://issues.apache.org/jira/browse/SPARK-40083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579692#comment-17579692 ] wangshengjie commented on SPARK-40083: -- Maybe I should add an expire policy for the push merge shuffle manager > Add shuffle index cache expire time policy to avoid unused continuous memory > consumption > > > Key: SPARK-40083 > URL: https://issues.apache.org/jira/browse/SPARK-40083 > Project: Spark > Issue Type: Improvement > Components: Shuffle, YARN >Affects Versions: 3.3.0 >Reporter: wangshengjie >Priority: Major > > In our production environment, we found some applications that finished > about 2 days ago still had their cache in memory, so we could add a Guava cache > expire time policy to save memory. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
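A sketch of what such an expiry policy could look like on the shuffle service side, using Guava's CacheBuilder. The cache name, key/value types, size bound, and the 2-day window below are illustrative assumptions, not the actual patch:
{code:scala}
import java.util.concurrent.TimeUnit
import com.google.common.cache.CacheBuilder

// Entries that have not been accessed for the configured period are evicted,
// so shuffle index data for long-finished applications does not stay in memory.
val indexCache = CacheBuilder.newBuilder()
  .maximumSize(1024)                      // hypothetical size bound
  .expireAfterAccess(2, TimeUnit.DAYS)    // hypothetical time-based eviction window
  .build[String, Array[Byte]]()
{code}
In practice the window would presumably be driven by a configuration key rather than hard-coded, so operators can tune it per cluster.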
[jira] [Resolved] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param
[ https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-39976. - Fix Version/s: 3.4.0 3.3.1 Assignee: angerszhu Resolution: Fixed > NULL check in ArrayIntersect adds extraneous null from first param > -- > > Key: SPARK-39976 > URL: https://issues.apache.org/jira/browse/SPARK-39976 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Navin Kumar >Assignee: angerszhu >Priority: Major > Labels: correctness > Fix For: 3.4.0, 3.3.1 > > > This is very likely a regression from SPARK-36829. > When using {{array_intersect(a, b)}}, if the first parameter contains a > {{NULL}} value and the second one does not, an extraneous {{NULL}} is present > in the output. This also leads to {{array_intersect(a, b) != > array_intersect(b, a)}}, which is incorrect as set intersection should be > commutative. > Example using PySpark: > {code:python} > >>> a = [1, 2, 3] > >>> b = [3, None, 5] > >>> data = [(a, b)] > >>> df = spark.sparkContext.parallelize(data).toDF(["a","b"]) > >>> df.show() > +---------+------------+ > |        a|           b| > +---------+------------+ > |[1, 2, 3]|[3, null, 5]| > +---------+------------+ > >>> df.selectExpr("array_intersect(a,b)").show() > +---------------------+ > |array_intersect(a, b)| > +---------------------+ > |                  [3]| > +---------------------+ > >>> df.selectExpr("array_intersect(b,a)").show() > +---------------------+ > |array_intersect(b, a)| > +---------------------+ > |            [3, null]| > +---------------------+ > {code} > Note that in the first case, {{a}} does not contain a {{NULL}}, and the final > output is correct: {{[3]}}. In the second case, {{b}} does contain {{NULL}} and is now > the first parameter, so the extraneous {{null}} appears in the output. > The same behavior occurs in Scala when writing to Parquet: > {code:scala} > scala> val a = Array[java.lang.Integer](1, 2, null, 4) > a: Array[Integer] = Array(1, 2, null, 4) > scala> val b = Array[java.lang.Integer](4, 5, 6, 7) > b: Array[Integer] = Array(4, 5, 6, 7) > scala> val df = Seq((a, b)).toDF("a","b") > df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>] > scala> df.write.parquet("/tmp/simple.parquet") > scala> val df = spark.read.parquet("/tmp/simple.parquet") > df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>] > scala> df.show() > +---------------+------------+ > |              a|           b| > +---------------+------------+ > |[1, 2, null, 4]|[4, 5, 6, 7]| > +---------------+------------+ > scala> df.selectExpr("array_intersect(a,b)").show() > +---------------------+ > |array_intersect(a, b)| > +---------------------+ > |            [null, 4]| > +---------------------+ > scala> df.selectExpr("array_intersect(b,a)").show() > +---------------------+ > |array_intersect(b, a)| > +---------------------+ > |                  [4]| > +---------------------+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
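To make the expected outcome of the fix explicit, here is a small spark-shell sketch reusing the data from the Scala example above; the column aliases are ours, not from the report. After the fix both argument orders should agree on {{[4]}}, whereas before it the first call returned {{[null, 4]}}:
{code:scala}
// Run in spark-shell, where `spark` and its implicits are predefined.
import org.apache.spark.sql.functions.array_intersect

val a = Array[java.lang.Integer](1, 2, null, 4)
val b = Array[java.lang.Integer](4, 5, 6, 7)
val df = Seq((a, b)).toDF("a", "b")

// With the fix (3.3.1 / 3.4.0), both columns should show [4];
// without it, a_b shows [null, 4].
df.select(
  array_intersect($"a", $"b").as("a_b"),
  array_intersect($"b", $"a").as("b_a")
).show()
{code}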
[jira] [Commented] (SPARK-40084) Upgrade Py4J from 0.10.9.5 to 0.10.9.7
[ https://issues.apache.org/jira/browse/SPARK-40084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579685#comment-17579685 ] Apache Spark commented on SPARK-40084: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/37523 > Upgrade Py4J from 0.10.9.5 to 0.10.9.7 > -- > > Key: SPARK-40084 > URL: https://issues.apache.org/jira/browse/SPARK-40084 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > * Java side: Add support for Java 11/17 > Release note: https://www.py4j.org/changelog.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40084) Upgrade Py4J from 0.10.9.5 to 0.10.9.7
[ https://issues.apache.org/jira/browse/SPARK-40084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40084: Assignee: (was: Apache Spark) > Upgrade Py4J from 0.10.9.5 to 0.10.9.7 > -- > > Key: SPARK-40084 > URL: https://issues.apache.org/jira/browse/SPARK-40084 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > * Java side: Add support for Java 11/17 > Release note: https://www.py4j.org/changelog.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40084) Upgrade Py4J from 0.10.9.5 to 0.10.9.7
[ https://issues.apache.org/jira/browse/SPARK-40084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40084: Assignee: Apache Spark > Upgrade Py4J from 0.10.9.5 to 0.10.9.7 > -- > > Key: SPARK-40084 > URL: https://issues.apache.org/jira/browse/SPARK-40084 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > Fix For: 3.4.0 > > > * Java side: Add support for Java 11/17 > Release note: https://www.py4j.org/changelog.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org