[jira] [Updated] (SPARK-37512) Support TimedeltaIndex creation (from Series/Index) and TimedeltaIndex.astype
[ https://issues.apache.org/jira/browse/SPARK-37512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-37512: - Summary: Support TimedeltaIndex creation (from Series/Index) and TimedeltaIndex.astype (was: Support TimedeltaIndex creation given a timedelta Series/Index) > Support TimedeltaIndex creation (from Series/Index) and TimedeltaIndex.astype > - > > Key: SPARK-37512 > URL: https://issues.apache.org/jira/browse/SPARK-37512 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > > To solve the issues below: > {code:java} > >>> idx = ps.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)]) > >>> idx > TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.02'], > dtype='timedelta64[ns]', freq=None) > >>> ps.TimedeltaIndex(idx) > Traceback (most recent call last): > ... > raise TypeError("astype can not be applied to %s." % self.pretty_name) > TypeError: astype can not be applied to timedelta. > {code} > > > > {code:java} > >>> s = ps.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20]) > >>> s > 10 1 days 00:00:00 > 20 0 days 00:00:00.02 > dtype: timedelta64[ns] > >>> ps.TimedeltaIndex(s) > Traceback (most recent call last): > ... > raise TypeError("astype can not be applied to %s." % self.pretty_name) > TypeError: astype can not be applied to timedelta. > {code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
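For comparison, the plain pandas behavior that this sub-task appears to target can be sketched as follows. This is a minimal sketch that assumes pandas is installed and uses `pd` in place of `ps`; the thread does not confirm that pandas API on Spark matches it exactly once the issue is fixed.

```python
from datetime import timedelta

import pandas as pd

# Build a TimedeltaIndex from Python timedeltas, as in the report above.
idx = pd.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)])

# Plain pandas accepts an existing TimedeltaIndex as input ...
from_index = pd.TimedeltaIndex(idx)

# ... and also a timedelta Series; the Series' own index (10, 20) is dropped.
s = pd.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20])
from_series = pd.TimedeltaIndex(s)

# astype to a 64-bit integer dtype exposes the underlying integer counts,
# expressed in the index's time unit.
as_int = idx.astype("int64")

print(from_index.equals(idx))
print(list(from_series) == list(idx))
```

Both constructor calls are the ones that raise `TypeError` in the report, so they make a natural regression check for the fix.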
[jira] [Commented] (SPARK-36396) Implement DataFrame.cov
[ https://issues.apache.org/jira/browse/SPARK-36396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452194#comment-17452194 ] Apache Spark commented on SPARK-36396: -- User 'dchvn' has created a pull request for this issue: https://github.com/apache/spark/pull/34778 > Implement DataFrame.cov > --- > > Key: SPARK-36396 > URL: https://issues.apache.org/jira/browse/SPARK-36396 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0
[jira] [Commented] (SPARK-37326) Support TimestampNTZ in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-37326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452178#comment-17452178 ] Apache Spark commented on SPARK-37326: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/34777 > Support TimestampNTZ in CSV data source > --- > > Key: SPARK-37326 > URL: https://issues.apache.org/jira/browse/SPARK-37326 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.3.0
[jira] [Commented] (SPARK-37512) Support TimedeltaIndex creation given a timedelta Series/Index
[ https://issues.apache.org/jira/browse/SPARK-37512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452175#comment-17452175 ] Apache Spark commented on SPARK-37512: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/34776 > Support TimedeltaIndex creation given a timedelta Series/Index > -- > > Key: SPARK-37512 > URL: https://issues.apache.org/jira/browse/SPARK-37512 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > > To solve the issues below: > {code:java} > >>> idx = ps.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)]) > >>> idx > TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.02'], > dtype='timedelta64[ns]', freq=None) > >>> ps.TimedeltaIndex(idx) > Traceback (most recent call last): > ... > raise TypeError("astype can not be applied to %s." % self.pretty_name) > TypeError: astype can not be applied to timedelta. > {code} > > > > {code:java} > >>> s = ps.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20]) > >>> s > 10 1 days 00:00:00 > 20 0 days 00:00:00.02 > dtype: timedelta64[ns] > >>> ps.TimedeltaIndex(s) > Traceback (most recent call last): > ... > raise TypeError("astype can not be applied to %s." % self.pretty_name) > TypeError: astype can not be applied to timedelta. > {code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37512) Support TimedeltaIndex creation given a timedelta Series/Index
[ https://issues.apache.org/jira/browse/SPARK-37512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37512: Assignee: (was: Apache Spark) > Support TimedeltaIndex creation given a timedelta Series/Index > -- > > Key: SPARK-37512 > URL: https://issues.apache.org/jira/browse/SPARK-37512 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > > To solve the issues below: > {code:java} > >>> idx = ps.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)]) > >>> idx > TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.02'], > dtype='timedelta64[ns]', freq=None) > >>> ps.TimedeltaIndex(idx) > Traceback (most recent call last): > ... > raise TypeError("astype can not be applied to %s." % self.pretty_name) > TypeError: astype can not be applied to timedelta. > {code} > > > > {code:java} > >>> s = ps.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20]) > >>> s > 10 1 days 00:00:00 > 20 0 days 00:00:00.02 > dtype: timedelta64[ns] > >>> ps.TimedeltaIndex(s) > Traceback (most recent call last): > ... > raise TypeError("astype can not be applied to %s." % self.pretty_name) > TypeError: astype can not be applied to timedelta. > {code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37512) Support TimedeltaIndex creation given a timedelta Series/Index
[ https://issues.apache.org/jira/browse/SPARK-37512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37512: Assignee: Apache Spark > Support TimedeltaIndex creation given a timedelta Series/Index > -- > > Key: SPARK-37512 > URL: https://issues.apache.org/jira/browse/SPARK-37512 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > > To solve the issues below: > {code:java} > >>> idx = ps.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)]) > >>> idx > TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.02'], > dtype='timedelta64[ns]', freq=None) > >>> ps.TimedeltaIndex(idx) > Traceback (most recent call last): > ... > raise TypeError("astype can not be applied to %s." % self.pretty_name) > TypeError: astype can not be applied to timedelta. > {code} > > > > {code:java} > >>> s = ps.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20]) > >>> s > 10 1 days 00:00:00 > 20 0 days 00:00:00.02 > dtype: timedelta64[ns] > >>> ps.TimedeltaIndex(s) > Traceback (most recent call last): > ... > raise TypeError("astype can not be applied to %s." % self.pretty_name) > TypeError: astype can not be applied to timedelta. > {code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37511) Introduce TimedeltaIndex to pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452172#comment-17452172 ] Apache Spark commented on SPARK-37511: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34775 > Introduce TimedeltaIndex to pandas API on Spark > --- > > Key: SPARK-37511 > URL: https://issues.apache.org/jira/browse/SPARK-37511 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Introduce TimedeltaIndex to pandas API on Spark. > Properties, functions, and basic operations of TimedeltaIndex will be > supported in follow-up PRs.
[jira] [Assigned] (SPARK-37516) Uses Python's standard string formatter for SQL API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-37516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37516: Assignee: Apache Spark > Uses Python's standard string formatter for SQL API in PySpark > -- > > Key: SPARK-37516 > URL: https://issues.apache.org/jira/browse/SPARK-37516 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > This is similar to SPARK-37436. It aims to add Python string formatter > support to the {{SparkSession.sql}} API.
[jira] [Commented] (SPARK-37516) Uses Python's standard string formatter for SQL API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-37516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452171#comment-17452171 ] Apache Spark commented on SPARK-37516: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34774 > Uses Python's standard string formatter for SQL API in PySpark > -- > > Key: SPARK-37516 > URL: https://issues.apache.org/jira/browse/SPARK-37516 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > This is similar to SPARK-37436. It aims to add Python string formatter > support to the {{SparkSession.sql}} API.
[jira] [Assigned] (SPARK-37516) Uses Python's standard string formatter for SQL API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-37516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37516: Assignee: (was: Apache Spark) > Uses Python's standard string formatter for SQL API in PySpark > -- > > Key: SPARK-37516 > URL: https://issues.apache.org/jira/browse/SPARK-37516 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > This is similar to SPARK-37436. It aims to add Python string formatter > support to the {{SparkSession.sql}} API.
[jira] [Created] (SPARK-37516) Uses Python's standard string formatter for SQL API in PySpark
Hyukjin Kwon created SPARK-37516: Summary: Uses Python's standard string formatter for SQL API in PySpark Key: SPARK-37516 URL: https://issues.apache.org/jira/browse/SPARK-37516 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon This is similar to SPARK-37436. It aims to add Python string formatter support to the {{SparkSession.sql}} API.
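To illustrate the general idea of driving SQL text through Python's standard string formatter, here is a minimal standard-library sketch. It is not the implementation from the pull request: `SQLLiteralFormatter` and `render_sql` are hypothetical names, the set of supported value types is an assumption, and doubling single quotes is only one possible escaping convention (it is not guaranteed to match Spark SQL's own rules, and it is no substitute for proper parameter binding).

```python
import string


class SQLLiteralFormatter(string.Formatter):
    """Hypothetical helper that renders Python values as SQL literals."""

    def format_field(self, value, format_spec):
        # bool must be checked before int, because bool is a subclass of int.
        if isinstance(value, bool):
            return "TRUE" if value else "FALSE"
        if isinstance(value, (int, float)):
            return format(value, format_spec)
        if isinstance(value, str):
            # Quote the string and double any embedded single quotes.
            return "'" + value.replace("'", "''") + "'"
        raise TypeError(f"unsupported value type: {type(value).__name__}")


def render_sql(query, **kwargs):
    """Fill named placeholders such as {bound1} using str.format syntax."""
    return SQLLiteralFormatter().format(query, **kwargs)


rendered = render_sql(
    "SELECT * FROM range(10) WHERE id > {bound1} AND name = {name}",
    bound1=7,
    name="O'Brien",
)
print(rendered)
```

Subclassing `string.Formatter` keeps the familiar `{name}` placeholder syntax while letting the caller decide how each value type is rendered.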
[jira] [Resolved] (SPARK-37494) Unify v1 and v2 options output of `SHOW CREATE TABLE` command
[ https://issues.apache.org/jira/browse/SPARK-37494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37494. - Resolution: Fixed Issue resolved by pull request 34753 [https://github.com/apache/spark/pull/34753] > Unify v1 and v2 options output of `SHOW CREATE TABLE` command > - > > Key: SPARK-37494 > URL: https://issues.apache.org/jira/browse/SPARK-37494 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Assignee: PengLei >Priority: Major > Fix For: 3.3.0
[jira] [Assigned] (SPARK-37494) Unify v1 and v2 options output of `SHOW CREATE TABLE` command
[ https://issues.apache.org/jira/browse/SPARK-37494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37494: --- Assignee: PengLei > Unify v1 and v2 options output of `SHOW CREATE TABLE` command > - > > Key: SPARK-37494 > URL: https://issues.apache.org/jira/browse/SPARK-37494 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Assignee: PengLei >Priority: Major > Fix For: 3.3.0
[jira] [Commented] (SPARK-33898) Support SHOW CREATE TABLE in v2
[ https://issues.apache.org/jira/browse/SPARK-33898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452164#comment-17452164 ] Apache Spark commented on SPARK-33898: -- User 'Peng-Lei' has created a pull request for this issue: https://github.com/apache/spark/pull/34773 > Support SHOW CREATE TABLE in v2 > --- > > Key: SPARK-33898 > URL: https://issues.apache.org/jira/browse/SPARK-33898 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: PengLei >Priority: Major > Fix For: 3.2.0
[jira] [Commented] (SPARK-37476) udaf doesnt work with nullable (or option of) case class result
[ https://issues.apache.org/jira/browse/SPARK-37476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452118#comment-17452118 ] koert kuipers commented on SPARK-37476: --- note that changing my case class to: {code:java} case class SumAndProduct(sum: java.lang.Double, product: java.lang.Double) {code} did not fix anything. same errors. > udaf doesnt work with nullable (or option of) case class result > > > Key: SPARK-37476 > URL: https://issues.apache.org/jira/browse/SPARK-37476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: spark master branch on nov 27 >Reporter: koert kuipers >Priority: Minor > > i have a need to have a dataframe aggregation return a nullable case class. > there seems to be no way to get this to work. the suggestion to wrap the > result in an option doesnt work either. > first attempt using nulls: > {code:java} > case class SumAndProduct(sum: Double, product: Double) > val sumAndProductAgg = new Aggregator[Double, SumAndProduct, SumAndProduct] { > def zero: SumAndProduct = null > def reduce(b: SumAndProduct, a: Double): SumAndProduct = > if (b == null) { > SumAndProduct(a, a) > } else { > SumAndProduct(b.sum + a, b.product * a) > } > def merge(b1: SumAndProduct, b2: SumAndProduct): SumAndProduct = > if (b1 == null) { > b2 > } else if (b2 == null) { > b1 > } else { > SumAndProduct(b1.sum + b2.sum, b1.product * b2.product) > } > def finish(r: SumAndProduct): SumAndProduct = r > def bufferEncoder: Encoder[SumAndProduct] = ExpressionEncoder() > def outputEncoder: Encoder[SumAndProduct] = ExpressionEncoder() > } > val df = Seq.empty[Double] > .toDF() > .select(udaf(sumAndProductAgg).apply(col("value"))) > df.printSchema() > df.show() > {code} > this gives: > {code:java} > root > |-- $anon$3(value): struct (nullable = true) > | |-- sum: double (nullable = false) > | |-- product: double (nullable = false) > 16:44:54.882 ERROR org.apache.spark.executor.Executor: 
Exception in task 0.0 > in stage 1491.0 (TID 1929) > java.lang.RuntimeException: Error while encoding: > java.lang.NullPointerException: Null value appeared in non-nullable field: > top level Product or row object > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). > knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, > true])).sum AS sum#20070 > knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, > true])).product AS product#20071 > at > org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1125) > {code} > i dont really understand the error, this result is not a top-level row object. > anyhow taking the advice to heart and using option we get to the second > attempt using options: > {code:java} > case class SumAndProduct(sum: Double, product: Double) > val sumAndProductAgg = new Aggregator[Double, Option[SumAndProduct], > Option[SumAndProduct]] { > def zero: Option[SumAndProduct] = None > def reduce(b: Option[SumAndProduct], a: Double): Option[SumAndProduct] = > b > .map{ b => SumAndProduct(b.sum + a, b.product * a) } > .orElse{ Option(SumAndProduct(a, a)) } > def merge(b1: Option[SumAndProduct], b2: Option[SumAndProduct]): > Option[SumAndProduct] = > b1.map{ b1 => > b2.map{ b2 => > SumAndProduct(b1.sum + b2.sum, b1.product * b2.product) > }.getOrElse(b1) > }.orElse(b2) > def finish(r: Option[SumAndProduct]): Option[SumAndProduct] = r > def bufferEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder() > def outputEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder() > } > val df = Seq.empty[Double] > .toDF() > .select(udaf(sumAndProductAgg).apply(col("value"))) > df.printSchema() > df.show() > {code} > this gives: > {code:java} > root > |-- $anon$4(value): struct (nullable = true) > | |-- sum: double (nullable = false) > | |-- 
product: double (nullable = false) > 16:44:54.998 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 > in stage 1493.0 (TID 1930) > java.lang.AssertionError: index (1) should < 1 > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:142) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:338) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$5(Aggregatio
[jira] [Assigned] (SPARK-37514) Remove workarounds due to older pandas
[ https://issues.apache.org/jira/browse/SPARK-37514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37514: Assignee: Takuya Ueshin > Remove workarounds due to older pandas > -- > > Key: SPARK-37514 > URL: https://issues.apache.org/jira/browse/SPARK-37514 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > > Now that we have upgraded the minimum version of pandas to {{1.0.5}}, > we can remove the workarounds that let pandas API on Spark run with older pandas.
[jira] [Resolved] (SPARK-37514) Remove workarounds due to older pandas
[ https://issues.apache.org/jira/browse/SPARK-37514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37514. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34772 [https://github.com/apache/spark/pull/34772] > Remove workarounds due to older pandas > -- > > Key: SPARK-37514 > URL: https://issues.apache.org/jira/browse/SPARK-37514 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.3.0 > > > Now that we have upgraded the minimum version of pandas to {{1.0.5}}, > we can remove the workarounds that let pandas API on Spark run with older pandas.
[jira] [Commented] (SPARK-37430) Inline type hints for python/pyspark/mllib/linalg/distributed.py
[ https://issues.apache.org/jira/browse/SPARK-37430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452114#comment-17452114 ] Jyoti Singh commented on SPARK-37430: - Sure, I’ll check the other one first. > Inline type hints for python/pyspark/mllib/linalg/distributed.py > > > Key: SPARK-37430 > URL: https://issues.apache.org/jira/browse/SPARK-37430 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/mllib/linalg/distributed.pyi to > python/pyspark/mllib/linalg/distributed.py
[jira] [Commented] (SPARK-37430) Inline type hints for python/pyspark/mllib/linalg/distributed.py
[ https://issues.apache.org/jira/browse/SPARK-37430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452110#comment-17452110 ] Maciej Szymkiewicz commented on SPARK-37430: Thank you [~jyoti_08], but please keep in mind that before we get deeper into SPARK-3739 we should fully resolve SPARK-37094. > Inline type hints for python/pyspark/mllib/linalg/distributed.py > > > Key: SPARK-37430 > URL: https://issues.apache.org/jira/browse/SPARK-37430 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/mllib/linalg/distributed.pyi to > python/pyspark/mllib/linalg/distributed.py
[jira] [Comment Edited] (SPARK-37430) Inline type hints for python/pyspark/mllib/linalg/distributed.py
[ https://issues.apache.org/jira/browse/SPARK-37430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452110#comment-17452110 ] Maciej Szymkiewicz edited comment on SPARK-37430 at 12/2/21, 1:35 AM: -- Thank you [~jyoti_08], but please keep in mind that before we get deeper into SPARK-37396 we should fully resolve SPARK-37094. was (Author: zero323): Thank you [~jyoti_08], but please keep in mind that before we get deeper into SPARK-3739 we should fully resolve SPARK-37094. > Inline type hints for python/pyspark/mllib/linalg/distributed.py > > > Key: SPARK-37430 > URL: https://issues.apache.org/jira/browse/SPARK-37430 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/mllib/linalg/distributed.pyi to > python/pyspark/mllib/linalg/distributed.py
[jira] [Commented] (SPARK-37430) Inline type hints for python/pyspark/mllib/linalg/distributed.py
[ https://issues.apache.org/jira/browse/SPARK-37430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452103#comment-17452103 ] Jyoti Singh commented on SPARK-37430: - I can start working on this issue. > Inline type hints for python/pyspark/mllib/linalg/distributed.py > > > Key: SPARK-37430 > URL: https://issues.apache.org/jira/browse/SPARK-37430 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/mllib/linalg/distributed.pyi to > python/pyspark/mllib/linalg/distributed.py
[jira] [Comment Edited] (SPARK-37476) udaf doesnt work with nullable (or option of) case class result
[ https://issues.apache.org/jira/browse/SPARK-37476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452101#comment-17452101 ]

koert kuipers edited comment on SPARK-37476 at 12/2/21, 12:58 AM:
--

i get that scala Double cannot be null. however i dont understand how this is relevant? my case class can be null, yet it fails when i try to return null for the case class. in so far i know a nullable struct is perfectly valid in spark?

was (Author: koert):
i get that scala Double cannot be null. however i dont understand how this is relevant? my case class can be null, yet it fails when i try to return null for the case class.

> udaf doesnt work with nullable (or option of) case class result
>
> Key: SPARK-37476
> URL: https://issues.apache.org/jira/browse/SPARK-37476
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Environment: spark master branch on nov 27
> Reporter: koert kuipers
> Priority: Minor
>
> i have a need to have a dataframe aggregation return a nullable case class.
> there seems to be no way to get this to work. the suggestion to wrap the
> result in an option doesnt work either.
> first attempt using nulls:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double)
> val sumAndProductAgg = new Aggregator[Double, SumAndProduct, SumAndProduct] {
>   def zero: SumAndProduct = null
>   def reduce(b: SumAndProduct, a: Double): SumAndProduct =
>     if (b == null) {
>       SumAndProduct(a, a)
>     } else {
>       SumAndProduct(b.sum + a, b.product * a)
>     }
>   def merge(b1: SumAndProduct, b2: SumAndProduct): SumAndProduct =
>     if (b1 == null) {
>       b2
>     } else if (b2 == null) {
>       b1
>     } else {
>       SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>     }
>   def finish(r: SumAndProduct): SumAndProduct = r
>   def bufferEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
>   def outputEncoder: Encoder[SumAndProduct] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> this gives:
> {code:java}
> root
>  |-- $anon$3(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.882 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1491.0 (TID 1929)
> java.lang.RuntimeException: Error while encoding: java.lang.NullPointerException: Null value appeared in non-nullable field:
> top level Product or row object
> If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, true])).sum AS sum#20070
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, true])).product AS product#20071
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1125)
> {code}
> i dont really understand the error, this result is not a top-level row object.
> anyhow taking the advice to heart and using option we get to the second
> attempt using options:
> {code:java}
> case class SumAndProduct(sum: Double, product: Double)
> val sumAndProductAgg = new Aggregator[Double, Option[SumAndProduct], Option[SumAndProduct]] {
>   def zero: Option[SumAndProduct] = None
>   def reduce(b: Option[SumAndProduct], a: Double): Option[SumAndProduct] =
>     b
>       .map{ b => SumAndProduct(b.sum + a, b.product * a) }
>       .orElse{ Option(SumAndProduct(a, a)) }
>   def merge(b1: Option[SumAndProduct], b2: Option[SumAndProduct]): Option[SumAndProduct] =
>     b1.map{ b1 =>
>       b2.map{ b2 =>
>         SumAndProduct(b1.sum + b2.sum, b1.product * b2.product)
>       }.getOrElse(b1)
>     }.orElse(b2)
>   def finish(r: Option[SumAndProduct]): Option[SumAndProduct] = r
>   def bufferEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
>   def outputEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder()
> }
> val df = Seq.empty[Double]
>   .toDF()
>   .select(udaf(sumAndProductAgg).apply(col("value")))
> df.printSchema()
> df.show()
> {code}
> this gives:
> {code:java}
> root
>  |-- $anon$4(value): struct (nullable = true)
>  |    |-- sum: double (nullable = false)
>  |    |-- product: double (nullable = false)
> 16:44:54.998 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1493.0 (TID 1930)
> java.lang.AssertionError: index (1) should < 1
>   at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:142)
>   a
[jira] [Commented] (SPARK-37476) udaf doesnt work with nullable (or option of) case class result
[ https://issues.apache.org/jira/browse/SPARK-37476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452101#comment-17452101 ] koert kuipers commented on SPARK-37476: --- i get that scala Double cannot be null. however i dont understand how this is relevant? my case class can be null, yet it fails when i try to return null for the case class. > udaf doesnt work with nullable (or option of) case class result > > > Key: SPARK-37476 > URL: https://issues.apache.org/jira/browse/SPARK-37476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: spark master branch on nov 27 >Reporter: koert kuipers >Priority: Minor > > i have a need to have a dataframe aggregation return a nullable case class. > there seems to be no way to get this to work. the suggestion to wrap the > result in an option doesnt work either. > first attempt using nulls: > {code:java} > case class SumAndProduct(sum: Double, product: Double) > val sumAndProductAgg = new Aggregator[Double, SumAndProduct, SumAndProduct] { > def zero: SumAndProduct = null > def reduce(b: SumAndProduct, a: Double): SumAndProduct = > if (b == null) { > SumAndProduct(a, a) > } else { > SumAndProduct(b.sum + a, b.product * a) > } > def merge(b1: SumAndProduct, b2: SumAndProduct): SumAndProduct = > if (b1 == null) { > b2 > } else if (b2 == null) { > b1 > } else { > SumAndProduct(b1.sum + b2.sum, b1.product * b2.product) > } > def finish(r: SumAndProduct): SumAndProduct = r > def bufferEncoder: Encoder[SumAndProduct] = ExpressionEncoder() > def outputEncoder: Encoder[SumAndProduct] = ExpressionEncoder() > } > val df = Seq.empty[Double] > .toDF() > .select(udaf(sumAndProductAgg).apply(col("value"))) > df.printSchema() > df.show() > {code} > this gives: > {code:java} > root > |-- $anon$3(value): struct (nullable = true) > | |-- sum: double (nullable = false) > | |-- product: double (nullable = false) > 16:44:54.882 ERROR org.apache.spark.executor.Executor: 
Exception in task 0.0 > in stage 1491.0 (TID 1929) > java.lang.RuntimeException: Error while encoding: > java.lang.NullPointerException: Null value appeared in non-nullable field: > top level Product or row object > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). > knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, > true])).sum AS sum#20070 > knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, > true])).product AS product#20071 > at > org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1125) > {code} > i dont really understand the error, this result is not a top-level row object. > anyhow taking the advice to heart and using option we get to the second > attempt using options: > {code:java} > case class SumAndProduct(sum: Double, product: Double) > val sumAndProductAgg = new Aggregator[Double, Option[SumAndProduct], > Option[SumAndProduct]] { > def zero: Option[SumAndProduct] = None > def reduce(b: Option[SumAndProduct], a: Double): Option[SumAndProduct] = > b > .map{ b => SumAndProduct(b.sum + a, b.product * a) } > .orElse{ Option(SumAndProduct(a, a)) } > def merge(b1: Option[SumAndProduct], b2: Option[SumAndProduct]): > Option[SumAndProduct] = > b1.map{ b1 => > b2.map{ b2 => > SumAndProduct(b1.sum + b2.sum, b1.product * b2.product) > }.getOrElse(b1) > }.orElse(b2) > def finish(r: Option[SumAndProduct]): Option[SumAndProduct] = r > def bufferEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder() > def outputEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder() > } > val df = Seq.empty[Double] > .toDF() > .select(udaf(sumAndProductAgg).apply(col("value"))) > df.printSchema() > df.show() > {code} > this gives: > {code:java} > root > |-- $anon$4(value): struct (nullable = true) > | |-- sum: double (nullable = false) > | |-- 
product: double (nullable = false) > 16:44:54.998 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 > in stage 1493.0 (TID 1930) > java.lang.AssertionError: index (1) should < 1 > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:142) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:338) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$5(Ag
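The behaviour the reporter is asking for can be written down as a small plain-Python model, independent of Spark: an aggregation whose buffer starts empty, so that an empty input yields no result at all rather than a struct with default fields. This is only a sketch of the intended semantics under that assumption. The function names `reduce_step`, `merge`, and `aggregate` are illustrative and are not Spark APIs, and the sketch says nothing about how Spark's encoders should represent the null.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Optional


@dataclass(frozen=True)
class SumAndProduct:
    sum: float
    product: float


def reduce_step(buffer: Optional[SumAndProduct], value: float) -> SumAndProduct:
    """Fold one input value into the buffer; None plays the role of `zero`."""
    if buffer is None:
        return SumAndProduct(value, value)
    return SumAndProduct(buffer.sum + value, buffer.product * value)


def merge(left: Optional[SumAndProduct],
          right: Optional[SumAndProduct]) -> Optional[SumAndProduct]:
    """Combine two partial buffers; an empty side leaves the other unchanged."""
    if left is None:
        return right
    if right is None:
        return left
    return SumAndProduct(left.sum + right.sum, left.product * right.product)


def aggregate(partitions: Iterable[Iterable[float]]) -> Optional[SumAndProduct]:
    """Reduce each partition on its own, then merge the partial results."""
    partials = [reduce(reduce_step, part, None) for part in partitions]
    return reduce(merge, partials, None)
```

In this model `aggregate([[]])` is `None`, which corresponds to the nullable struct result the two attempts in the issue are trying to obtain from `Seq.empty[Double]`.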
[jira] [Commented] (SPARK-37476) udaf doesnt work with nullable (or option of) case class result
[ https://issues.apache.org/jira/browse/SPARK-37476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452099#comment-17452099 ]

Hyukjin Kwon commented on SPARK-37476:
--

Awesome! In Scala, a Double cannot be null:

{code}
scala> val a: Double = null
<console>:22: error: an expression of type Null is ineligible for implicit conversion
       val a: Double = null
                       ^

scala> null.asInstanceOf[Double]
res1: Double = 0.0
{code}

That is probably why the initial reproducer failed.

> udaf doesnt work with nullable (or option of) case class result
>
> Key: SPARK-37476
> URL: https://issues.apache.org/jira/browse/SPARK-37476
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Environment: spark master branch on nov 27
> Reporter: koert kuipers
> Priority: Minor
>
> i have a need to have a dataframe aggregation return a nullable case class.
> there seems to be no way to get this to work. the suggestion to wrap the
> result in an option doesnt work either.
> first attempt using nulls: > {code:java} > case class SumAndProduct(sum: Double, product: Double) > val sumAndProductAgg = new Aggregator[Double, SumAndProduct, SumAndProduct] { > def zero: SumAndProduct = null > def reduce(b: SumAndProduct, a: Double): SumAndProduct = > if (b == null) { > SumAndProduct(a, a) > } else { > SumAndProduct(b.sum + a, b.product * a) > } > def merge(b1: SumAndProduct, b2: SumAndProduct): SumAndProduct = > if (b1 == null) { > b2 > } else if (b2 == null) { > b1 > } else { > SumAndProduct(b1.sum + b2.sum, b1.product * b2.product) > } > def finish(r: SumAndProduct): SumAndProduct = r > def bufferEncoder: Encoder[SumAndProduct] = ExpressionEncoder() > def outputEncoder: Encoder[SumAndProduct] = ExpressionEncoder() > } > val df = Seq.empty[Double] > .toDF() > .select(udaf(sumAndProductAgg).apply(col("value"))) > df.printSchema() > df.show() > {code} > this gives: > {code:java} > root > |-- $anon$3(value): struct (nullable = true) > | |-- sum: double (nullable = false) > | |-- product: double (nullable = false) > 16:44:54.882 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 > in stage 1491.0 (TID 1929) > java.lang.RuntimeException: Error while encoding: > java.lang.NullPointerException: Null value appeared in non-nullable field: > top level Product or row object > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). > knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, > true])).sum AS sum#20070 > knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, > true])).product AS product#20071 > at > org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1125) > {code} > i dont really understand the error, this result is not a top-level row object. 
> anyhow taking the advice to heart and using option we get to the second > attempt using options: > {code:java} > case class SumAndProduct(sum: Double, product: Double) > val sumAndProductAgg = new Aggregator[Double, Option[SumAndProduct], > Option[SumAndProduct]] { > def zero: Option[SumAndProduct] = None > def reduce(b: Option[SumAndProduct], a: Double): Option[SumAndProduct] = > b > .map{ b => SumAndProduct(b.sum + a, b.product * a) } > .orElse{ Option(SumAndProduct(a, a)) } > def merge(b1: Option[SumAndProduct], b2: Option[SumAndProduct]): > Option[SumAndProduct] = > b1.map{ b1 => > b2.map{ b2 => > SumAndProduct(b1.sum + b2.sum, b1.product * b2.product) > }.getOrElse(b1) > }.orElse(b2) > def finish(r: Option[SumAndProduct]): Option[SumAndProduct] = r > def bufferEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder() > def outputEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder() > } > val df = Seq.empty[Double] > .toDF() > .select(udaf(sumAndProductAgg).apply(col("value"))) > df.printSchema() > df.show() > {code} > this gives: > {code:java} > root > |-- $anon$4(value): struct (nullable = true) > | |-- sum: double (nullable = false) > | |-- product: double (nullable = false) > 16:44:54.998 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 > in stage 1493.0 (TID 1930) > java.lang.AssertionError: index (1) should < 1 > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:142) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:338) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$S
[jira] [Commented] (SPARK-37515) minRatePerPartition works as "max messages per partition per a batch" (it should be per seconds)
[ https://issues.apache.org/jira/browse/SPARK-37515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452097#comment-17452097 ]

Sungpeo Kook commented on SPARK-37515:
--

I couldn't find a proper JIRA component for Kafka Streaming.

> minRatePerPartition works as "max messages per partition per a batch" (it should be per seconds)
>
> Key: SPARK-37515
> URL: https://issues.apache.org/jira/browse/SPARK-37515
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.8, 3.2.0
> Reporter: Sungpeo Kook
> Assignee: Apache Spark
> Priority: Major
>
> {{maxRatePerPartition}} means "max messages per partition per second",
> but {{minRatePerPartition}} does not: it works as "max messages per partition per batch".
> This is a bug.
>
> {{minRatePerPartition}} should work as "min messages per partition per second".
> * [https://github.com/apache/spark/blob/f97de309792f382ae823894e978f7e54f34f1a29/docs/configuration.md?plain=1#L2878-L2886]
[jira] [Assigned] (SPARK-37515) minRatePerPartition works as "max messages per partition per a batch" (it should be per seconds)
[ https://issues.apache.org/jira/browse/SPARK-37515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37515:

Assignee: (was: Apache Spark)

> minRatePerPartition works as "max messages per partition per a batch" (it should be per seconds)
>
> Key: SPARK-37515
> URL: https://issues.apache.org/jira/browse/SPARK-37515
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.8, 3.2.0
> Reporter: Sungpeo Kook
> Priority: Major
>
> {{maxRatePerPartition}} means "max messages per partition per second",
> but {{minRatePerPartition}} does not: it works as "max messages per partition per batch".
> This is a bug.
>
> {{minRatePerPartition}} should work as "min messages per partition per second".
> * [https://github.com/apache/spark/blob/f97de309792f382ae823894e978f7e54f34f1a29/docs/configuration.md?plain=1#L2878-L2886]
[jira] [Assigned] (SPARK-37515) minRatePerPartition works as "max messages per partition per a batch" (it should be per seconds)
[ https://issues.apache.org/jira/browse/SPARK-37515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37515:

Assignee: Apache Spark

> minRatePerPartition works as "max messages per partition per a batch" (it should be per seconds)
>
> Key: SPARK-37515
> URL: https://issues.apache.org/jira/browse/SPARK-37515
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.8, 3.2.0
> Reporter: Sungpeo Kook
> Assignee: Apache Spark
> Priority: Major
>
> {{maxRatePerPartition}} means "max messages per partition per second",
> but {{minRatePerPartition}} does not: it works as "max messages per partition per batch".
> This is a bug.
>
> {{minRatePerPartition}} should work as "min messages per partition per second".
> * [https://github.com/apache/spark/blob/f97de309792f382ae823894e978f7e54f34f1a29/docs/configuration.md?plain=1#L2878-L2886]
[jira] [Created] (SPARK-37515) minRatePerPartition works as "max messages per partition per a batch" (it should be per seconds)
Sungpeo Kook created SPARK-37515:

Summary: minRatePerPartition works as "max messages per partition per a batch" (it should be per seconds)
Key: SPARK-37515
URL: https://issues.apache.org/jira/browse/SPARK-37515
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.2.0, 2.4.8
Reporter: Sungpeo Kook

{{maxRatePerPartition}} means "max messages per partition per second", but {{minRatePerPartition}} does not: it works as "max messages per partition per batch". This is a bug.

{{minRatePerPartition}} should work as "min messages per partition per second".

* [https://github.com/apache/spark/blob/f97de309792f382ae823894e978f7e54f34f1a29/docs/configuration.md?plain=1#L2878-L2886]
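To make the difference between the two readings concrete, here is a minimal sketch of the arithmetic. The helper name `per_batch_floor` and the example numbers are illustrative assumptions, not part of Spark. The sketch only restates the report's claim that a per-second setting should scale with the batch interval, while the reported behaviour applies the value once per batch.

```python
def per_batch_floor(min_rate: int, batch_interval_seconds: float, per_second: bool) -> int:
    """Smallest number of messages read from one partition in one batch.

    Hypothetical helper, not a Spark API: it only models the two readings
    of the setting discussed in this issue.
    """
    if per_second:
        # Requested reading: the rate is per second, so it scales with the batch length.
        return int(min_rate * batch_interval_seconds)
    # Reported behaviour: the value is applied once per batch, whatever its length.
    return min_rate


# Example: a setting of 100 with a 10-second batch interval.
requested = per_batch_floor(100, 10, per_second=True)
reported = per_batch_floor(100, 10, per_second=False)
print(requested, reported)
```

With a batch interval longer than one second the two readings diverge by a factor equal to the interval length, which is the gap the report describes.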
[jira] [Assigned] (SPARK-37514) Remove workarounds due to older pandas
[ https://issues.apache.org/jira/browse/SPARK-37514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37514:

Assignee: Apache Spark

> Remove workarounds due to older pandas
> --
>
> Key: SPARK-37514
> URL: https://issues.apache.org/jira/browse/SPARK-37514
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Takuya Ueshin
> Assignee: Apache Spark
> Priority: Major
>
> Now that we have upgraded the minimum version of pandas to {{1.0.5}},
> we can remove the workarounds that let pandas API on Spark run with older pandas.
[jira] [Assigned] (SPARK-37514) Remove workarounds due to older pandas
[ https://issues.apache.org/jira/browse/SPARK-37514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37514:

Assignee: (was: Apache Spark)

> Remove workarounds due to older pandas
> --
>
> Key: SPARK-37514
> URL: https://issues.apache.org/jira/browse/SPARK-37514
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Takuya Ueshin
> Priority: Major
>
> Now that we have upgraded the minimum version of pandas to {{1.0.5}},
> we can remove the workarounds that let pandas API on Spark run with older pandas.
[jira] [Commented] (SPARK-37514) Remove workarounds due to older pandas
[ https://issues.apache.org/jira/browse/SPARK-37514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452092#comment-17452092 ]

Apache Spark commented on SPARK-37514:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34772

> Remove workarounds due to older pandas
> --
>
> Key: SPARK-37514
> URL: https://issues.apache.org/jira/browse/SPARK-37514
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Takuya Ueshin
> Priority: Major
>
> Now that we have upgraded the minimum version of pandas to {{1.0.5}},
> we can remove the workarounds that let pandas API on Spark run with older pandas.
[jira] [Created] (SPARK-37514) Remove workarounds due to older pandas
Takuya Ueshin created SPARK-37514:

Summary: Remove workarounds due to older pandas
Key: SPARK-37514
URL: https://issues.apache.org/jira/browse/SPARK-37514
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 3.3.0
Reporter: Takuya Ueshin

Now that we have upgraded the minimum version of pandas to {{1.0.5}}, we can remove the workarounds that let pandas API on Spark run with older pandas.
[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe
[ https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452089#comment-17452089 ]

Sean R. Owen commented on SPARK-26589:
--

That approach ends up being about n log n and seems worse than just sorting, I think. Alternatively, bootstrap this kind of approach with an approximate median; it would converge faster.

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Jan Gorecki
> Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as a
> duplicate of it. The thing is that approximate quantile is a workaround for
> the lack of a median function. Thus I am filing this Feature Request for a
> proper, exact (not approximate) median function. I am aware of the
> difficulties that a distributed environment causes when trying to compute a
> median; nevertheless, I don't think those difficulties are a good enough
> reason to drop the `median` function from the scope of Spark. I am not asking
> for an efficient median but for an exact median.
[jira] [Assigned] (SPARK-37496) Migrate ReplaceTableAsSelectStatement to v2 command
[ https://issues.apache.org/jira/browse/SPARK-37496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Huaxin Gao reassigned SPARK-37496:
--

Assignee: Huaxin Gao

> Migrate ReplaceTableAsSelectStatement to v2 command
> ---
>
> Key: SPARK-37496
> URL: https://issues.apache.org/jira/browse/SPARK-37496
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Huaxin Gao
> Assignee: Huaxin Gao
> Priority: Major
[jira] [Resolved] (SPARK-37496) Migrate ReplaceTableAsSelectStatement to v2 command
[ https://issues.apache.org/jira/browse/SPARK-37496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Huaxin Gao resolved SPARK-37496.

Resolution: Fixed

> Migrate ReplaceTableAsSelectStatement to v2 command
> ---
>
> Key: SPARK-37496
> URL: https://issues.apache.org/jira/browse/SPARK-37496
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Huaxin Gao
> Assignee: Huaxin Gao
> Priority: Major
[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe
[ https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452081#comment-17452081 ]

Nicholas Chammas commented on SPARK-26589:
--

[~srowen] - I'll ask for help on the dev list if appropriate, but I'm wondering if you can give me some high level guidance here.

I have an outline of an approach to calculate the median that does not require sorting or shuffling the data. It's based on the approach I linked to in my previous comment (by Michael Harris). It does require, however, multiple passes over the data for the algorithm to converge on the median.

Here's a working sketch of the approach:

{code:python}
from pyspark.sql.functions import col


def spark_median(data):
    total_count = data.count()
    # An even count has two middle positions; the median is their average.
    if total_count % 2 == 0:
        target_positions = [total_count // 2, total_count // 2 + 1]
    else:
        target_positions = [total_count // 2 + 1]
    target_values = [
        kth_position(data, k, data_count=total_count)
        for k in target_positions
    ]
    return sum(target_values) / len(target_values)


def kth_position(data, k, data_count=None):
    if data_count is None:
        total_count = data.count()
    else:
        total_count = data_count
    if k > total_count or k < 1:
        return None
    while True:
        # This value, along with the following two counts, are the only data that need
        # to be shared across nodes.
        some_value = data.first()["id"]
        # These two counts can be performed together via an aggregator.
        larger_count = data.where(col("id") > some_value).count()
        equal_count = data.where(col("id") == some_value).count()
        # 1-based positions that some_value occupies in sorted order.
        value_positions = range(
            total_count - larger_count - equal_count + 1,
            total_count - larger_count + 1,
        )
        # print(some_value, total_count, k, value_positions)
        if k in value_positions:
            return some_value
        elif k >= value_positions.stop:
            k -= (value_positions.stop - 1)
            data = data.where(col("id") > some_value)
            total_count = larger_count
        elif k < value_positions.start:
            data = data.where(col("id") < some_value)
            total_count -= (larger_count + equal_count)
{code}

Of course, this needs to be converted into a Catalyst Expression, but the basic idea is expressed there.

I am looking at the definitions of [DeclarativeAggregate|https://github.com/apache/spark/blob/1dd0ca23f64acfc7a3dc697e19627a1b74012a2d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L381-L394] and [ImperativeAggregate|https://github.com/apache/spark/blob/1dd0ca23f64acfc7a3dc697e19627a1b74012a2d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L267-L285] and trying to find an existing expression to model after, but I don't think we have any existing aggregates that would work like this median, specifically where multiple passes over the data are required (in this case, to count elements matching different filters).

Do you have any advice on how to approach converting this into a Catalyst expression?

There is an [NthValue|https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L648-L675] window expression, but I don't think I can build on it to make my median expression since a) median shouldn't be limited to window expressions, and b) NthValue requires a complete sort of the data, which I want to avoid.

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Jan Gorecki
> Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as a
> duplicate of it. The thing is that approximate quantile is a workaround for
> the lack of a median function. Thus I am filing this Feature Request for a
> proper, exact (not approximate) median function. I am aware of the
> difficulties that a distributed environment causes when trying to compute a
> median; nevertheless, I don't think those difficulties are a good enough
> reason to drop the `median` function from the scope of Spark. I am not asking
> for an efficient median but for an exact median.
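The multi-pass counting approach in the comment above can be modelled on a plain Python list, which makes it easy to count passes and to try different ways of picking the comparison value. The sketch below is an illustration only, not the proposed Catalyst implementation. The names `kth_smallest`, `first_value`, `sampled_median`, and `median` are hypothetical, and `sampled_median` is a crude stand-in for the approximate-median bootstrap suggested in a reply on this issue.

```python
def kth_smallest(values, k, choose_pivot):
    """Return (k-th smallest value, number of passes), with k counted from 1."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    data = list(values)
    passes = 0
    while True:
        passes += 1
        pivot = choose_pivot(data)
        # One pass over the remaining data: split it around the pivot.
        smaller = [v for v in data if v < pivot]
        larger = [v for v in data if v > pivot]
        equal_count = len(data) - len(smaller) - len(larger)
        if k <= len(smaller):
            data = smaller
        elif k <= len(smaller) + equal_count:
            return pivot, passes
        else:
            # Skip everything at or below the pivot and keep searching above it.
            k -= len(smaller) + equal_count
            data = larger


def first_value(data):
    """Pivot choice used in the sketch above: whatever value comes first."""
    return data[0]


def sampled_median(data, sample_size=5):
    """Assumed stand-in for an approximate median: middle of an evenly spaced sample."""
    step = max(1, len(data) // sample_size)
    sample = sorted(data[::step])
    return sample[len(sample) // 2]


def median(values, choose_pivot):
    """Return (exact median, total passes) for a non-empty list."""
    n = len(values)
    positions = [n // 2, n // 2 + 1] if n % 2 == 0 else [n // 2 + 1]
    results = [kth_smallest(values, k, choose_pivot) for k in positions]
    total_passes = sum(p for _, p in results)
    return sum(v for v, _ in results) / len(results), total_passes
```

On already sorted input, `median(list(range(1, 102)), first_value)` removes only one element per pass, while the sampled pivot discards a large share of the data each time and needs far fewer passes. Both return the same exact median.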
[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe
[ https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451936#comment-17451936 ]

Nicholas Chammas commented on SPARK-26589:
--

Just for reference, Stack Overflow provides evidence that a proper median function has been in high demand for some time:

* [How can I calculate exact median with Apache Spark?|https://stackoverflow.com/q/28158729/877069] (14K views)
* [How to find median and quantiles using Spark|https://stackoverflow.com/q/31432843/877069] (117K views)
* [Median / quantiles within PySpark groupBy|https://stackoverflow.com/q/46845672/877069] (67K views)

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Jan Gorecki
> Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as a
> duplicate of it. The thing is that approximate quantile is a workaround for
> the lack of a median function. Thus I am filing this Feature Request for a
> proper, exact (not approximate) median function. I am aware of the
> difficulties that a distributed environment causes when trying to compute a
> median; nevertheless, I don't think those difficulties are a good enough
> reason to drop the `median` function from the scope of Spark. I am not asking
> for an efficient median but for an exact median.
[jira] [Updated] (SPARK-37480) Configurations in docs/running-on-kubernetes.md are not uptodate
[ https://issues.apache.org/jira/browse/SPARK-37480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-37480:
--

Affects Version/s: 3.2.0 (was: 3.3.0)

> Configurations in docs/running-on-kubernetes.md are not uptodate
>
> Key: SPARK-37480
> URL: https://issues.apache.org/jira/browse/SPARK-37480
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.2.0
> Reporter: Yikun Jiang
> Assignee: Yikun Jiang
> Priority: Minor
> Fix For: 3.2.1, 3.3.0
[jira] [Updated] (SPARK-37480) Configurations in docs/running-on-kubernetes.md are not uptodate
[ https://issues.apache.org/jira/browse/SPARK-37480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-37480:
--

Fix Version/s: 3.2.1

> Configurations in docs/running-on-kubernetes.md are not uptodate
>
> Key: SPARK-37480
> URL: https://issues.apache.org/jira/browse/SPARK-37480
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.3.0
> Reporter: Yikun Jiang
> Assignee: Yikun Jiang
> Priority: Minor
> Fix For: 3.2.1, 3.3.0
[jira] [Commented] (SPARK-37476) udaf doesnt work with nullable (or option of) case class result
[ https://issues.apache.org/jira/browse/SPARK-37476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451921#comment-17451921 ]

koert kuipers commented on SPARK-37476:
---

yes it works well

{code:java}
import java.lang.{Double => JDouble}

val sumAgg = new Aggregator[Double, JDouble, JDouble] {
  def zero: JDouble = null
  def reduce(b: JDouble, a: Double): JDouble =
    if (b == null) {
      a
    } else {
      b + a
    }
  def merge(b1: JDouble, b2: JDouble): JDouble =
    if (b1 == null) {
      b2
    } else if (b2 == null) {
      b1
    } else {
      b1 + b2
    }
  def finish(r: JDouble): JDouble = r
  def bufferEncoder: Encoder[JDouble] = ExpressionEncoder()
  def outputEncoder: Encoder[JDouble] = ExpressionEncoder()
}

val df = Seq.empty[Double]
  .toDF()
  .select(udaf(sumAgg).apply(col("value")))
df.printSchema()
root
 |-- $anon$3(value): double (nullable = true)
df.show()
+--------------+
|$anon$3(value)|
+--------------+
|          null|
+--------------+
{code}

it works with Option without issues too

> udaf doesnt work with nullable (or option of) case class result
>
> Key: SPARK-37476
> URL: https://issues.apache.org/jira/browse/SPARK-37476
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Environment: spark master branch on nov 27
> Reporter: koert kuipers
> Priority: Minor
>
> i have a need to have a dataframe aggregation return a nullable case class.
> there seems to be no way to get this to work. the suggestion to wrap the
> result in an option doesnt work either.
> first attempt using nulls: > {code:java} > case class SumAndProduct(sum: Double, product: Double) > val sumAndProductAgg = new Aggregator[Double, SumAndProduct, SumAndProduct] { > def zero: SumAndProduct = null > def reduce(b: SumAndProduct, a: Double): SumAndProduct = > if (b == null) { > SumAndProduct(a, a) > } else { > SumAndProduct(b.sum + a, b.product * a) > } > def merge(b1: SumAndProduct, b2: SumAndProduct): SumAndProduct = > if (b1 == null) { > b2 > } else if (b2 == null) { > b1 > } else { > SumAndProduct(b1.sum + b2.sum, b1.product * b2.product) > } > def finish(r: SumAndProduct): SumAndProduct = r > def bufferEncoder: Encoder[SumAndProduct] = ExpressionEncoder() > def outputEncoder: Encoder[SumAndProduct] = ExpressionEncoder() > } > val df = Seq.empty[Double] > .toDF() > .select(udaf(sumAndProductAgg).apply(col("value"))) > df.printSchema() > df.show() > {code} > this gives: > {code:java} > root > |-- $anon$3(value): struct (nullable = true) > | |-- sum: double (nullable = false) > | |-- product: double (nullable = false) > 16:44:54.882 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 > in stage 1491.0 (TID 1929) > java.lang.RuntimeException: Error while encoding: > java.lang.NullPointerException: Null value appeared in non-nullable field: > top level Product or row object > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). > knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, > true])).sum AS sum#20070 > knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, > true])).product AS product#20071 > at > org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1125) > {code} > i dont really understand the error, this result is not a top-level row object. 
> anyhow taking the advice to heart and using option we get to the second > attempt using options: > {code:java} > case class SumAndProduct(sum: Double, product: Double) > val sumAndProductAgg = new Aggregator[Double, Option[SumAndProduct], > Option[SumAndProduct]] { > def zero: Option[SumAndProduct] = None > def reduce(b: Option[SumAndProduct], a: Double): Option[SumAndProduct] = > b > .map{ b => SumAndProduct(b.sum + a, b.product * a) } > .orElse{ Option(SumAndProduct(a, a)) } > def merge(b1: Option[SumAndProduct], b2: Option[SumAndProduct]): > Option[SumAndProduct] = > b1.map{ b1 => > b2.map{ b2 => > SumAndProduct(b1.sum + b2.sum, b1.product * b2.product) > }.getOrElse(b1) > }.orElse(b2) > def finish(r: Option[SumAndProduct]): Option[SumAndProduct] = r > def bufferEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder() > def outputEncoder: Encoder[Option[SumAndProduct]] = ExpressionEncoder() > } > val df = Seq.empty[Double] > .toDF() > .select(udaf(sumAndProductAgg).apply(col("value"))) > df.printSchema() > df.show() > {code} > this gives: > {code:java} > root > |-- $anon$4(value
[jira] [Comment Edited] (SPARK-37476) udaf doesnt work with nullable (or option of) case class result
[ https://issues.apache.org/jira/browse/SPARK-37476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451921#comment-17451921 ] koert kuipers edited comment on SPARK-37476 at 12/1/21, 4:57 PM: - yes it works well {code:java} import java.lang.{Double => JDouble} val sumAgg = new Aggregator[Double, JDouble, JDouble] { def zero: JDouble = null def reduce(b: JDouble, a: Double): JDouble = if (b == null) { a } else { b + a } def merge(b1: JDouble, b2: JDouble): JDouble = if (b1 == null) { b2 } else if (b2 == null) { b1 } else { b1 + b2 } def finish(r: JDouble): JDouble = r def bufferEncoder: Encoder[JDouble] = ExpressionEncoder() def outputEncoder: Encoder[JDouble] = ExpressionEncoder() } val df = Seq.empty[Double] .toDF() .select(udaf(sumAgg).apply(col("value"))) df.printSchema() root |-- $anon$3(value): double (nullable = true) df.show() +--+ |$anon$3(value)| +--+ | null| +--+ {code} it works with Option without issues too was (Author: koert): yes it works well {code:java} import java.lang.{Double => JDouble} val sumAgg = new Aggregator[Double, JDouble, JDouble] { def zero: JDouble = null def reduce(b: JDouble, a: Double): JDouble = if (b == null) { a } else { b + a } def merge(b1: JDouble, b2: JDouble): JDouble = if (b1 == null) { b2 } else if (b2 == null) { b1 } else { b1 + b2 } def finish(r: JDouble): JDouble = r def bufferEncoder: Encoder[JDouble] = ExpressionEncoder() def outputEncoder: Encoder[JDouble] = ExpressionEncoder() } val df = Seq.empty[Double] .toDF() .select(udaf(sumAgg).apply(col("value"))) df.printSchema() root |-- $anon$3(value): double (nullable = true) df.show() +--+ |$anon$3(value)| +--+ | null| +--+ {code} it works with Option without issues too > udaf doesnt work with nullable (or option of) case class result > > > Key: SPARK-37476 > URL: https://issues.apache.org/jira/browse/SPARK-37476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: spark master branch on nov 
27 >Reporter: koert kuipers >Priority: Minor > > i have a need to have a dataframe aggregation return a nullable case class. > there seems to be no way to get this to work. the suggestion to wrap the > result in an option doesnt work either. > first attempt using nulls: > {code:java} > case class SumAndProduct(sum: Double, product: Double) > val sumAndProductAgg = new Aggregator[Double, SumAndProduct, SumAndProduct] { > def zero: SumAndProduct = null > def reduce(b: SumAndProduct, a: Double): SumAndProduct = > if (b == null) { > SumAndProduct(a, a) > } else { > SumAndProduct(b.sum + a, b.product * a) > } > def merge(b1: SumAndProduct, b2: SumAndProduct): SumAndProduct = > if (b1 == null) { > b2 > } else if (b2 == null) { > b1 > } else { > SumAndProduct(b1.sum + b2.sum, b1.product * b2.product) > } > def finish(r: SumAndProduct): SumAndProduct = r > def bufferEncoder: Encoder[SumAndProduct] = ExpressionEncoder() > def outputEncoder: Encoder[SumAndProduct] = ExpressionEncoder() > } > val df = Seq.empty[Double] > .toDF() > .select(udaf(sumAndProductAgg).apply(col("value"))) > df.printSchema() > df.show() > {code} > this gives: > {code:java} > root > |-- $anon$3(value): struct (nullable = true) > | |-- sum: double (nullable = false) > | |-- product: double (nullable = false) > 16:44:54.882 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 > in stage 1491.0 (TID 1929) > java.lang.RuntimeException: Error while encoding: > java.lang.NullPointerException: Null value appeared in non-nullable field: > top level Product or row object > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). 
> knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, > true])).sum AS sum#20070 > knownnotnull(assertnotnull(input[0, org.apache.spark.sql.SumAndProduct, > true])).product AS product#20071 > at > org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1125) > {code} > i dont really understand the error, this result is not a top-level row object. > anyhow taking the advice to heart and using option we get to the second > attempt using options: > {code:java} > case class SumAndProduct(sum: Double, product: Double) > val sumAndProductAgg = new Aggregator[Double, Opti
[jira] [Commented] (SPARK-37326) Support TimestampNTZ in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-37326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451900#comment-17451900 ] Apache Spark commented on SPARK-37326: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34771 > Support TimestampNTZ in CSV data source > --- > > Key: SPARK-37326 > URL: https://issues.apache.org/jira/browse/SPARK-37326 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Ivan Sadikov >Assignee: Ivan Sadikov >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37480) Configurations in docs/running-on-kubernetes.md are not uptodate
[ https://issues.apache.org/jira/browse/SPARK-37480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451882#comment-17451882 ] Apache Spark commented on SPARK-37480: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34770 > Configurations in docs/running-on-kubernetes.md are not uptodate > > > Key: SPARK-37480 > URL: https://issues.apache.org/jira/browse/SPARK-37480 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37501) CREATE/REPLACE TABLE should qualify location for v2 command
[ https://issues.apache.org/jira/browse/SPARK-37501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37501: --- Assignee: PengLei > CREATE/REPLACE TABLE should qualify location for v2 command > --- > > Key: SPARK-37501 > URL: https://issues.apache.org/jira/browse/SPARK-37501 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Assignee: PengLei >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37501) CREATE/REPLACE TABLE should qualify location for v2 command
[ https://issues.apache.org/jira/browse/SPARK-37501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37501. - Resolution: Fixed Issue resolved by pull request 34758 [https://github.com/apache/spark/pull/34758] > CREATE/REPLACE TABLE should qualify location for v2 command > --- > > Key: SPARK-37501 > URL: https://issues.apache.org/jira/browse/SPARK-37501 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Assignee: PengLei >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37463) Read/Write Timestamp ntz from/to Orc uses UTC time zone
[ https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451767#comment-17451767 ]

Apache Spark commented on SPARK-37463:
--------------------------------------

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34769

> Read/Write Timestamp ntz from/to Orc uses UTC time zone
> -------------------------------------------------------
>
>                 Key: SPARK-37463
>                 URL: https://issues.apache.org/jira/browse/SPARK-37463
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: jiaan.geng
>            Priority: Major
>
> Here is some example code:
> import java.util.TimeZone
> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
> sql("set spark.sql.session.timeZone=America/Los_Angeles")
> val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp '2021-06-01 00:00:00' ts")
> df.write.mode("overwrite").orc("ts_ntz_orc")
> df.write.mode("overwrite").parquet("ts_ntz_parquet")
> df.write.mode("overwrite").format("avro").save("ts_ntz_avro")
> val query = """
>   select 'orc', *
>   from `orc`.`ts_ntz_orc`
>   union all
>   select 'parquet', *
>   from `parquet`.`ts_ntz_parquet`
>   union all
>   select 'avro', *
>   from `avro`.`ts_ntz_avro`
> """
> val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
> for (tz <- tzs) {
>   TimeZone.setDefault(TimeZone.getTimeZone(tz))
>   sql(s"set spark.sql.session.timeZone=$tz")
>   println(s"Time zone is ${TimeZone.getDefault.getID}")
>   sql(query).show(false)
> }
> The output shown below looks strange: the ts_ntz value read from ORC changes with the session time zone.
> Time zone is America/Los_Angeles
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-06-01 00:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
> +-------+-------------------+-------------------+
> Time zone is UTC
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 17:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
> +-------+-------------------+-------------------+
> Time zone is Europe/Amsterdam
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 15:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
> +-------+-------------------+-------------------+

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
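For readers who want to check the numbers in the report, the following is a minimal plain-Python sketch (not Spark code) of the behaviour the Parquet and Avro rows show: a timestamp without time zone keeps its wall-clock value, while a timestamp with time zone is an instant that is re-rendered in the reader's zone. The fixed UTC offsets and the helper name `render` are assumptions made for this illustration; the offsets are the ones in effect on 2021-06-01.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets valid on 2021-06-01, used instead of a tz database.
OFFSETS = {
    "America/Los_Angeles": timezone(timedelta(hours=-7)),
    "UTC": timezone.utc,
    "Europe/Amsterdam": timezone(timedelta(hours=2)),
}
FMT = "%Y-%m-%d %H:%M:%S"


def render(writer_zone, reader_zone):
    """Return (ts_ntz, ts) as strings, as seen by a reader in reader_zone.

    Hypothetical helper: both values start from the literal
    '2021-06-01 00:00:00' written while the session zone was writer_zone.
    """
    wall = datetime(2021, 6, 1, 0, 0, 0)
    # Timestamp without time zone: a wall-clock value, never shifted.
    ts_ntz = wall
    # Timestamp with time zone: an instant, shown in the reader's zone.
    instant = wall.replace(tzinfo=OFFSETS[writer_zone])
    ts = instant.astimezone(OFFSETS[reader_zone]).replace(tzinfo=None)
    return ts_ntz.strftime(FMT), ts.strftime(FMT)


for zone in OFFSETS:
    print(zone, render("America/Los_Angeles", zone))
```

The pairs printed for UTC and Europe/Amsterdam match the parquet and avro rows in the report; the orc rows are the ones that deviate.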
[jira] [Commented] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451764#comment-17451764 ] Apache Spark commented on SPARK-11150: -- User 'weixiuli' has created a pull request for this issue: https://github.com/apache/spark/pull/34768 > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0 >Reporter: Younes >Assignee: Wei Xue >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > Attachments: image-2019-10-04-11-20-02-616.png > > > Implements dynamic partition pruning by adding a dynamic-partition-pruning > filter if there is a partitioned table and a filter on the dimension table. > The filter is then planned using a heuristic approach: > # As a broadcast relation if it is a broadcast hash join. The broadcast > relation will then be transformed into a reused broadcast exchange by the > {{ReuseExchange}} rule; or > # As a subquery duplicate if the estimated benefit of partition table scan > being saved is greater than the estimated cost of the extra scan of the > duplicated subquery; otherwise > # As a bypassed condition ({{true}}). > Below shows a basic example of DPP. > !image-2019-10-04-11-20-02-616.png|width=521,height=225! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
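The three-way planning heuristic in the description above can be sketched as a small decision function. This is only a plain-Python illustration of the order of the checks; the class `PruningCandidate`, its fields, and the returned labels are hypothetical names and do not correspond to Spark's internal API.

```python
from dataclasses import dataclass


@dataclass
class PruningCandidate:
    # All fields are hypothetical stand-ins for what the optimizer knows.
    is_broadcast_hash_join: bool
    estimated_scan_savings: float   # benefit of the partition scan being skipped
    estimated_subquery_cost: float  # cost of scanning the duplicated subquery


def plan_pruning_filter(candidate):
    """Pick how the dynamic-partition-pruning filter would be planned."""
    if candidate.is_broadcast_hash_join:
        # Case 1: reuse the broadcast relation of the join.
        return "reuse_broadcast"
    if candidate.estimated_scan_savings > candidate.estimated_subquery_cost:
        # Case 2: run the filtering side again as a duplicated subquery.
        return "duplicate_subquery"
    # Case 3: not worth it, the filter degenerates to the condition `true`.
    return "bypass_true"


print(plan_pruning_filter(PruningCandidate(False, 10.0, 50.0)))
```

The checks are ordered so that the cheapest option (reusing an existing broadcast) wins whenever it is available, and the extra subquery is only considered when its estimated cost is lower than the estimated savings.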
[jira] [Assigned] (SPARK-37460) ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
[ https://issues.apache.org/jira/browse/SPARK-37460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37460: --- Assignee: Yuto Akutsu > ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented > - > > Key: SPARK-37460 > URL: https://issues.apache.org/jira/browse/SPARK-37460 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL > Affects Versions: 3.2.0 > Reporter: Yuto Akutsu > Assignee: Yuto Akutsu > Priority: Minor > > The ALTER DATABASE ... SET LOCATION ... command should be documented in a > sql-ref-syntax-ddl-create-table page. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37460) ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
[ https://issues.apache.org/jira/browse/SPARK-37460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37460. - Fix Version/s: 3.3.0 3.2.1 Resolution: Fixed Issue resolved by pull request 34718 [https://github.com/apache/spark/pull/34718] > ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented > - > > Key: SPARK-37460 > URL: https://issues.apache.org/jira/browse/SPARK-37460 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL > Affects Versions: 3.2.0 > Reporter: Yuto Akutsu > Assignee: Yuto Akutsu > Priority: Minor > Fix For: 3.3.0, 3.2.1 > > > The ALTER DATABASE ... SET LOCATION ... command should be documented in a > sql-ref-syntax-ddl-create-table page. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37508) Add CONTAINS() function
[ https://issues.apache.org/jira/browse/SPARK-37508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-37508: Assignee: angerszhu > Add CONTAINS() function > --- > > Key: SPARK-37508 > URL: https://issues.apache.org/jira/browse/SPARK-37508 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 3.3.0 > Reporter: Max Gekk > Assignee: angerszhu > Priority: Major > > {{contains()}} is a common convenience function supported by a number of > database systems: > # [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr] > # [https://docs.snowflake.com/en/sql-reference/functions/contains.html] > Proposed syntax: > {code:java} > contains(haystack, needle) > return type: boolean {code} > It is semantically equivalent to {{haystack like '%needle%'}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37508) Add CONTAINS() function
[ https://issues.apache.org/jira/browse/SPARK-37508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-37508. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34761 [https://github.com/apache/spark/pull/34761] > Add CONTAINS() function > --- > > Key: SPARK-37508 > URL: https://issues.apache.org/jira/browse/SPARK-37508 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 3.3.0 > Reporter: Max Gekk > Assignee: angerszhu > Priority: Major > Fix For: 3.3.0 > > > {{contains()}} is a common convenience function supported by a number of > database systems: > # [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#contains_substr] > # [https://docs.snowflake.com/en/sql-reference/functions/contains.html] > Proposed syntax: > {code:java} > contains(haystack, needle) > return type: boolean {code} > It is semantically equivalent to {{haystack like '%needle%'}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
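The equivalence stated in the description can be explored with a small plain-Python model. This is a sketch, not the Spark implementation: `contains` and `like` below are hypothetical helpers, `like` supports only the `%` and `_` wildcards with no escape character, and returning `None` for a `None` argument is an assumption modelled on ordinary SQL NULL propagation.

```python
import re


def contains(haystack, needle):
    """Model of contains(haystack, needle): plain substring test."""
    if haystack is None or needle is None:
        return None  # assumption: NULL in, NULL out
    return needle in haystack


def like(value, pattern):
    """Model of `value LIKE pattern` with % (any run) and _ (any one char)."""
    if value is None or pattern is None:
        return None
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


# The two agree when the needle has no wildcard characters.
print(contains("Spark SQL", "SQL"), like("Spark SQL", "%" + "SQL" + "%"))
# They can differ when the needle itself contains % or _.
print(contains("abc", "a_c"), like("abc", "%" + "a_c" + "%"))
```

In this model the two forms agree for needles made of ordinary characters. A needle containing `%` or `_` is treated literally by `contains` but as a wildcard by the naive `like '%needle%'` rewrite, so the equivalence should be read with that caveat.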
[jira] [Assigned] (SPARK-37511) Introduce TimedeltaIndex to pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37511: Assignee: Xinrong Meng > Introduce TimedeltaIndex to pandas API on Spark > --- > > Key: SPARK-37511 > URL: https://issues.apache.org/jira/browse/SPARK-37511 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Introduce TimedeltaIndex to pandas API on Spark. > Properties, functions, and basic operations of TimedeltaIndex will be > supported in follow-up PRs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37511) Introduce TimedeltaIndex to pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37511. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34657 [https://github.com/apache/spark/pull/34657] > Introduce TimedeltaIndex to pandas API on Spark > --- > > Key: SPARK-37511 > URL: https://issues.apache.org/jira/browse/SPARK-37511 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Introduce TimedeltaIndex to pandas API on Spark. > Properties, functions, and basic operations of TimedeltaIndex will be > supported in follow-up PRs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451620#comment-17451620 ] zhangyangyang edited comment on SPARK-35332 at 12/1/21, 8:49 AM: - [~luxianghao] https://issues.apache.org/jira/browse/MAPREDUCE-6944, hello, please help me for the last comment. Thank you very much was (Author: yangyang): https://issues.apache.org/jira/browse/MAPREDUCE-6944, hello, please help me for the last comment. Thank you very much > Not Coalesce shuffle partitions when cache table > > > Key: SPARK-35332 > URL: https://issues.apache.org/jira/browse/SPARK-35332 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.1, 3.1.0, 3.1.1 > Environment: latest spark version >Reporter: Xianghao Lu >Assignee: XiDuo You >Priority: Major > Fix For: 3.2.0 > > Attachments: cacheTable.png > > > *How to reproduce the problem* > _linux shell command to prepare data:_ > for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > > data.text > _sql to reproduce the problem:_ > * create table data_table(id int, str string, num int) row format delimited > fields terminated by ','; > * load data local inpath '/path/to/data.text' into table data_table; > * CACHE TABLE test_cache_table AS > SELECT str > FROM > (SELECT id,str FROM data_table > )group by str; > Finally you will see a stage with 200 tasks and not coalesce shuffle > partitions, the problem will waste resource when data size is small. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451620#comment-17451620 ] zhangyangyang commented on SPARK-35332: --- https://issues.apache.org/jira/browse/MAPREDUCE-6944, hello, please help me for the last comment. Thank you very much > Not Coalesce shuffle partitions when cache table > > > Key: SPARK-35332 > URL: https://issues.apache.org/jira/browse/SPARK-35332 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.1, 3.1.0, 3.1.1 > Environment: latest spark version >Reporter: Xianghao Lu >Assignee: XiDuo You >Priority: Major > Fix For: 3.2.0 > > Attachments: cacheTable.png > > > *How to reproduce the problem* > _linux shell command to prepare data:_ > for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > > data.text > _sql to reproduce the problem:_ > * create table data_table(id int, str string, num int) row format delimited > fields terminated by ','; > * load data local inpath '/path/to/data.text' into table data_table; > * CACHE TABLE test_cache_table AS > SELECT str > FROM > (SELECT id,str FROM data_table > )group by str; > Finally you will see a stage with 200 tasks and not coalesce shuffle > partitions, the problem will waste resource when data size is small. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
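As a cross-check of the shell one-liner in the reproduction steps, here is a plain-Python sketch that builds the same 20 comma-separated rows. The function names `make_rows` and `write_rows` are hypothetical, and the output path is left to the caller.

```python
def make_rows(n=20):
    """Rows equivalent to: for i in $(seq n); do echo "$((i+10)),name$i,$((i*10))"; done"""
    return [f"{i + 10},name{i},{i * 10}" for i in range(1, n + 1)]


def write_rows(path, n=20):
    """Write the rows to `path`, one per line, like the shell redirect does."""
    with open(path, "w") as handle:
        for row in make_rows(n):
            handle.write(row + "\n")


print(make_rows()[:3])
```

The resulting file has the three columns (id, str, num) that the `create table data_table` statement in the report expects.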
[jira] [Commented] (SPARK-37461) yarn-client mode client's appid value is null
[ https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451619#comment-17451619 ] Apache Spark commented on SPARK-37461: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34767 > yarn-client mode client's appid value is null > - > > Key: SPARK-37461 > URL: https://issues.apache.org/jira/browse/SPARK-37461 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > Fix For: 3.3.0 > > > In yarn-client mode, *Client.appId* variable is not assigned, it is always > {*}null{*}, in cluster mode, this variable will be assigned to the true > value. In this patch, we assign true application id to `appId` too. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37512) Support TimedeltaIndex creation given a timedelta Series/Index
[ https://issues.apache.org/jira/browse/SPARK-37512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451610#comment-17451610 ] Xinrong Meng commented on SPARK-37512: -- I am working on this. > Support TimedeltaIndex creation given a timedelta Series/Index > -- > > Key: SPARK-37512 > URL: https://issues.apache.org/jira/browse/SPARK-37512 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > > To solve the issues below: > {code:java} > >>> idx = ps.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)]) > >>> idx > TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.02'], > dtype='timedelta64[ns]', freq=None) > >>> ps.TimedeltaIndex(idx) > Traceback (most recent call last): > ... > raise TypeError("astype can not be applied to %s." % self.pretty_name) > TypeError: astype can not be applied to timedelta. > {code} > > > > {code:java} > >>> s = ps.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20]) > >>> s > 10 1 days 00:00:00 > 20 0 days 00:00:00.02 > dtype: timedelta64[ns] > >>> ps.TimedeltaIndex(s) > Traceback (most recent call last): > ... > raise TypeError("astype can not be applied to %s." % self.pretty_name) > TypeError: astype can not be applied to timedelta. > {code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37513) date +/- interval with only day-time fields returns different data type between Spark3.2 and Spark3.1
[ https://issues.apache.org/jira/browse/SPARK-37513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37513. - Fix Version/s: 3.3.0 3.2.1 Resolution: Fixed Issue resolved by pull request 34766 [https://github.com/apache/spark/pull/34766] > date +/- interval with only day-time fields returns different data type > between Spark3.2 and Spark3.1 > - > > Key: SPARK-37513 > URL: https://issues.apache.org/jira/browse/SPARK-37513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.3.0, 3.2.1 > > > {code:java} > select date '2011-11-11' + interval 12 hours; > {code} > Previously returned the date type, now it returns the timestamp type. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
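The two result types mentioned in the description can be illustrated with a plain-Python analogy (this is not Spark code, and the helper names are made up for the illustration). Adding a 12-hour interval to a date either stays a date, with the sub-day part dropped, or is promoted to a timestamp.

```python
from datetime import date, datetime, time, timedelta


def add_keep_date(d, delta):
    """Result stays a date: Python's date arithmetic uses only delta.days."""
    return d + delta


def add_as_timestamp(d, delta):
    """Result is a timestamp: the date is first widened to midnight."""
    return datetime.combine(d, time.min) + delta


start = date(2011, 11, 11)
half_day = timedelta(hours=12)
print(add_keep_date(start, half_day))     # a date
print(add_as_timestamp(start, half_day))  # a datetime at 12:00 on the same day
```

A change between these two behaviours alters the schema of every query that uses the expression, which is why the issue treats it as a compatibility difference between releases.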
[jira] [Assigned] (SPARK-37513) date +/- interval with only day-time fields returns different data type between Spark3.2 and Spark3.1
[ https://issues.apache.org/jira/browse/SPARK-37513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37513: --- Assignee: jiaan.geng > date +/- interval with only day-time fields returns different data type > between Spark3.2 and Spark3.1 > - > > Key: SPARK-37513 > URL: https://issues.apache.org/jira/browse/SPARK-37513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > {code:java} > select date '2011-11-11' + interval 12 hours; > {code} > Previously returned the date type, now it returns the timestamp type. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37458) Remove unnecessary object serialization on foreachBatch
[ https://issues.apache.org/jira/browse/SPARK-37458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-37458:
Assignee: Jungtaek Lim

> Remove unnecessary object serialization on foreachBatch
>
> Key: SPARK-37458
> URL: https://issues.apache.org/jira/browse/SPARK-37458
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Major
>
> Currently, ForeachBatchSink uses ExternalRDD, converting RDD[InternalRow] to
> RDD[T], to provide a Dataset[T] to the user function. This adds
> SerializeFromObject to the plan, which is not actually required.
> We can use LogicalRDD instead to remove SerializeFromObject from the plan.
[jira] [Resolved] (SPARK-37458) Remove unnecessary object serialization on foreachBatch
[ https://issues.apache.org/jira/browse/SPARK-37458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-37458.
Fix Version/s: 3.3.0
Resolution: Fixed

Issue resolved by pull request 34706
[https://github.com/apache/spark/pull/34706]

> Remove unnecessary object serialization on foreachBatch
>
> Key: SPARK-37458
> URL: https://issues.apache.org/jira/browse/SPARK-37458
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Major
> Fix For: 3.3.0
>
> Currently, ForeachBatchSink uses ExternalRDD, converting RDD[InternalRow] to
> RDD[T], to provide a Dataset[T] to the user function. This adds
> SerializeFromObject to the plan, which is not actually required.
> We can use LogicalRDD instead to remove SerializeFromObject from the plan.
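Editorial note: the cost described in SPARK-37458 can be pictured with a toy model. The sketch below is plain Python and does not use Spark or reproduce its internals. `Event`, `object_path`, and `row_path` are hypothetical names, and tuples stand in for internal rows. It assumes only what the issue states: the object-based path converts internal rows to objects of type T, and a serialization step in the plan then maps them back, while the row-based path hands the internal rows over directly.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical user type standing in for T."""
    id: int
    name: str


def object_path(rows):
    """Toy object-based path: rows -> objects -> rows again."""
    conversions = 0
    objects = []
    for row in rows:
        objects.append(Event(*row))        # internal row to object
        conversions += 1
    out = []
    for obj in objects:
        out.append((obj.id, obj.name))     # object back to internal row
        conversions += 1
    return out, conversions


def row_path(rows):
    """Toy row-based path: the internal rows are passed through unchanged."""
    return list(rows), 0


rows = [(1, "a"), (2, "b")]
print(object_path(rows))
print(row_path(rows))
```

Both paths yield the same rows, but the object-based one performs two conversions per row that cancel each other out, which is the sense in which the issue calls the serialization unnecessary.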