[jira] [Commented] (SPARK-28605) Performance regression in SS's foreach
[ https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899790#comment-16899790 ]

Dongjoon Hyun commented on SPARK-28605:
---

Oh, thank you, [~zsxwing]!

> Performance regression in SS's foreach
> --
>
> Key: SPARK-28605
> URL: https://issues.apache.org/jira/browse/SPARK-28605
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
> Reporter: Shixiong Zhu
> Priority: Major
> Labels: regression
>
> When "ForeachWriter.open" returns "false", ForeachSink v1 will skip the whole
> partition without reading data. But in ForeachSink v2, due to an API
> limitation, it needs to read the whole partition even if all the data just
> gets dropped.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
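The v1-vs-v2 control flow the report describes can be sketched in plain Python. This is a hypothetical model for illustration only, not Spark's actual classes: in v1 the partition iterator is never consumed when `open` returns false, while in v2 every row is pulled from the source regardless.

```python
# Hypothetical plain-Python model of the two foreach code paths described
# above; the names and structure are illustrative, not Spark internals.

class DroppingWriter:
    """A writer whose open() declines the partition, as in the report."""
    def open(self, partition_id, epoch_id):
        return False

    def process(self, row):
        raise AssertionError("process() must not run for a dropped partition")

def run_partition_v1(writer, rows):
    # ForeachSink v1: open() is consulted first, so a False return skips
    # the partition without reading a single row.
    if not writer.open(partition_id=0, epoch_id=0):
        return 0  # rows read from the source
    n = 0
    for row in rows:
        writer.process(row)
        n += 1
    return n

def run_partition_v2(writer, rows):
    # ForeachSink v2: the writer path is fed row by row, so the whole
    # partition is read even when every row is ultimately dropped.
    opened = writer.open(partition_id=0, epoch_id=0)
    n = 0
    for row in rows:
        n += 1  # the row has already been read from the source
        if opened:
            writer.process(row)
    return n
```

With a writer that always declines in `open`, the v1 path reads zero rows while the v2 path still scans the full partition, which is the regression being reported.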
[jira] [Commented] (SPARK-28605) Performance regression in SS's foreach
[ https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899789#comment-16899789 ]

Shixiong Zhu commented on SPARK-28605:
--

By the way, this is not a critical regression. It's not easy to hit this issue.
[jira] [Commented] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
[ https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899788#comment-16899788 ]

Liang-Chi Hsieh commented on SPARK-28422:
-

Thanks [~dongjoon]!

> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
> Reporter: Li Jin
> Priority: Major
>
> {code:python}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
>     return v.max()
>
> df = spark.range(0, 100)
> spark.udf.register('max_udf', max_udf)
> df.createTempView('table')
>
> # A. This works
> df.agg(max_udf(df['id'])).show()
>
> # B. This doesn't work
> spark.sql("select max_udf(id) from table").show()
> {code}
>
> Query plan:
>
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
>
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
>
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
>
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
>
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
>
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
>
> == Physical Plan ==
> *(1) Project [max_udf(id#0L) AS max_udf(id)#136]
> +- *(1) Range (0, 100, step=1, splits=4)
> {code}
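Semantically, a GROUPED_AGG UDF maps a whole column to a single scalar, and with no GROUP BY clause there is one implicit global group: that is what case A computes, and what case B should compute too. A plain-Python model of those semantics (hypothetical and Spark-free, for illustration only):

```python
# Hypothetical plain-Python model of GROUPED_AGG semantics; the helper
# names grouped_agg/global_agg are illustrative, not a Spark API.

def max_udf(column):
    # Stand-in for the Series -> scalar UDF in the bug report.
    return max(column)

def grouped_agg(rows, key_fn, udf):
    # With GROUP BY: partition the rows, then apply the column->scalar
    # UDF once per group.
    groups = {}
    for row in rows:
        groups.setdefault(key_fn(row), []).append(row)
    return {key: udf(group) for key, group in groups.items()}

def global_agg(rows, udf):
    # Without GROUP BY: one implicit group covering every row.
    return udf(list(rows))
```

The bug is that the planner builds a `Project` instead of an aggregate for case B, i.e. the "one implicit group" step never happens for the SQL path.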
[jira] [Commented] (SPARK-28605) Performance regression in SS's foreach
[ https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899787#comment-16899787 ]

Shixiong Zhu commented on SPARK-28605:
--

This is a regression in all 2.4 branches. It's caused by SPARK-23099.
[jira] [Updated] (SPARK-28605) Performance regression in SS's foreach
[ https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-28605:
-

Affects Version/s: 2.4.0
                   2.4.1
                   2.4.2
[jira] [Commented] (SPARK-28605) Performance regression in SS's foreach
[ https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899784#comment-16899784 ]

Dongjoon Hyun commented on SPARK-28605:
---

Hi, [~zsxwing]. Is this a regression at 2.4.3?
[jira] [Commented] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
[ https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899781#comment-16899781 ]

Dongjoon Hyun commented on SPARK-28422:
---

Thank you for reporting, [~icexelloss]. And, thank you for making a PR, [~viirya]. Since this is not supported from 2.4.0, I updated the affected versions, too.
[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
[ https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-28422:
--

Affects Version/s: 2.4.0
[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
[ https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-28422:
--

Description: updated to tag the reproduction snippet as {code:python} and to add the previously missing "from pyspark.sql.functions import pandas_udf, PandasUDFType" import line; the reproduction steps and query plans are otherwise unchanged.
[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
[ https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-28422:
--

Affects Version/s: 2.4.1
[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
[ https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-28422:
--

Affects Version/s: 2.4.2
[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
[ https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-28422:
--

Description: updated to fix a typo in the reproduction ("df.udf.register" is now "spark.udf.register"); the rest of the description is unchanged.
[jira] [Comment Edited] (SPARK-28580) ANSI SQL: unique predicate
[ https://issues.apache.org/jira/browse/SPARK-28580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899773#comment-16899773 ]

Dongjoon Hyun edited comment on SPARK-28580 at 8/5/19 5:08 AM:
---

Hi, [~beliefer]. Please make a PR with a working implementation. Thanks!

was (Author: dongjoon):
Please make a PR with a working implementation.

> ANSI SQL: unique predicate
> --
>
> Key: SPARK-28580
> URL: https://issues.apache.org/jira/browse/SPARK-28580
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: jiaan.geng
> Priority: Major
>
> Format
> {code:java}
> <unique predicate> ::=
>   UNIQUE <table subquery>
> {code}
[jira] [Commented] (SPARK-28580) ANSI SQL: unique predicate
[ https://issues.apache.org/jira/browse/SPARK-28580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899773#comment-16899773 ]

Dongjoon Hyun commented on SPARK-28580:
---

Please make a PR with a working implementation.
[jira] [Resolved] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs
[ https://issues.apache.org/jira/browse/SPARK-27661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Burak Yavuz resolved SPARK-27661.
-

Resolution: Fixed
Fix Version/s: 3.0.0

Resolved by [https://github.com/apache/spark/pull/24560]

> Add SupportsNamespaces interface for v2 catalogs
> 
>
> Key: SPARK-27661
> URL: https://issues.apache.org/jira/browse/SPARK-27661
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Ryan Blue
> Assignee: Ryan Blue
> Priority: Major
> Fix For: 3.0.0
>
> Some catalogs support namespace operations, like creating or dropping
> namespaces. The v2 API should have a way to expose these operations to Spark.
[jira] [Assigned] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs
[ https://issues.apache.org/jira/browse/SPARK-27661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Burak Yavuz reassigned SPARK-27661:
---

Assignee: Ryan Blue
[jira] [Assigned] (SPARK-28616) Improve merge-spark-pr script to warn WIP PRs and strip trailing dots
[ https://issues.apache.org/jira/browse/SPARK-28616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-28616:
-

Assignee: Dongjoon Hyun

> Improve merge-spark-pr script to warn WIP PRs and strip trailing dots
> -
>
> Key: SPARK-28616
> URL: https://issues.apache.org/jira/browse/SPARK-28616
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Affects Versions: 3.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
>
> This issue aims to improve the `merge-spark-pr` script.
> - `[WIP]` is useful to show that a PR is not ready for merge. Apache
> Spark allows merging `WIP` PRs, but sometimes we accidentally forget to
> clean up the title of a completed PR. We had better warn once more during
> the merging stage and get a confirmation from the committers.
> - We have two kinds of PR titles in terms of the ending period. This issue
> aims to remove the trailing `dot`, since shorter is better in a PR title.
> Also, PR titles without a trailing `dot` are dominant in the commit logs.
> {code}
> $ git log --oneline | grep '[.]$' | wc -l
> 4090
> $ git log --oneline | grep '[^.]$' | wc -l
> 20747
> {code}
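The two checks the issue describes could look roughly like the following. This is a hypothetical Python sketch, not the actual merge_spark_pr.py code; the function name `check_title` is made up for illustration.

```python
# Hypothetical sketch of the two merge-script checks described above:
# warn when a title still carries a [WIP] tag, and strip a trailing dot.

import re

def check_title(title):
    warnings = []
    # Warn (rather than reject) on [WIP], so the committer can confirm
    # the PR really is complete before merging.
    if re.search(r"\[wip\]", title, re.IGNORECASE):
        warnings.append("PR title still contains [WIP]; confirm before merging")
    # Normalize the title: drop trailing whitespace and trailing dots.
    cleaned = title.rstrip().rstrip(".")
    return cleaned, warnings
```

A title like `[SPARK-28616][WIP] Improve merge script.` would come back with the dot stripped and one warning attached, while an already-clean title passes through untouched.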
[jira] [Resolved] (SPARK-28616) Improve merge-spark-pr script to warn WIP PRs and strip trailing dots
[ https://issues.apache.org/jira/browse/SPARK-28616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-28616.
---

Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25356
[https://github.com/apache/spark/pull/25356]
[jira] [Created] (SPARK-28617) Completely remove comments from the gold result file
Yuming Wang created SPARK-28617: --- Summary: Completely remove comments from the gold result file Key: SPARK-28617 URL: https://issues.apache.org/jira/browse/SPARK-28617 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28616) Improve merge-spark-pr script to warn WIP PRs and strip trailing dots
Dongjoon Hyun created SPARK-28616: - Summary: Improve merge-spark-pr script to warn WIP PRs and strip trailing dots Key: SPARK-28616 URL: https://issues.apache.org/jira/browse/SPARK-28616 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.0.0 Reporter: Dongjoon Hyun This issue aims to improve the `merge-spark-pr` script. - `[WIP]` is useful when we show that a PR is not ready for merge. Apache Spark allows merging `WIP` PRs. However, sometimes, we accidentally forget to clean up the title for the completed PRs. We had better warn once more during the merging stage and get a confirmation from the committers. - We have two kinds of PR title in terms of the ending period. This issue aims to remove the trailing `dot` since shorter is better in a PR title. Also, the PR titles without the trailing `dot` are dominant in the commit logs. {code} $ git log --oneline | grep '[.]$' | wc -l 4090 $ git log --oneline | grep '[^.]$' | wc -l 20747 {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
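The two proposed checks are easy to sketch. Below is an illustrative Python fragment (the real `merge-spark-pr` script's internals may differ; the function name and the `confirm` hook are hypothetical):

```python
import re

def check_and_clean_title(title, confirm=lambda msg: True):
    """Warn about a leftover [WIP] tag and strip a single trailing dot."""
    # Ask the committer for confirmation before merging a PR still marked WIP.
    if re.search(r"\[WIP\]", title, re.IGNORECASE):
        if not confirm("The PR title still contains [WIP]. Merge anyway?"):
            raise SystemExit("Aborted: clean up the title of the completed PR.")
    # Strip one trailing dot, matching the dominant style in the commit logs.
    return re.sub(r"\.$", "", title.strip())
```

With this sketch, a title ending in a period is returned without it, while a title still carrying `[WIP]` triggers the confirmation prompt first.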
[jira] [Created] (SPARK-28615) Add a guideline for dataframe functions to say the Column-signature variant is the default
Weichen Xu created SPARK-28615: -- Summary: Add a guideline for dataframe functions to say the Column-signature variant is the default Key: SPARK-28615 URL: https://issues.apache.org/jira/browse/SPARK-28615 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 2.4.3 Reporter: Weichen Xu Add a guideline for dataframe functions to say the Column-signature variant is the default, like: {code} These function APIs usually have methods with the Column signature only because they can support not only Column but also other types such as a native string. The other variants currently exist for historical reasons. {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
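The convention being documented can be illustrated with a tiny sketch of the underlying pattern (a toy stand-in, not Spark's actual `Column` class; all names are illustrative):

```python
class Column:
    """Toy stand-in for Spark's Column type; illustrative only."""
    def __init__(self, expr):
        self.expr = expr

def _to_column(value):
    # Accept either a Column or a native string naming a column; this is why
    # the public API only needs the Column-signature variant.
    return value if isinstance(value, Column) else Column(value)

def upper(col):
    # Hypothetical function following the documented convention.
    return Column(f"upper({_to_column(col).expr})")
```

Both `upper("name")` and `upper(Column("name"))` then build the same expression, which is the point the proposed guideline makes.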
[jira] [Resolved] (SPARK-28614) Do not remove leading write space in the golden result file
[ https://issues.apache.org/jira/browse/SPARK-28614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28614. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25351 [https://github.com/apache/spark/pull/25351] > Do not remove leading write space in the golden result file > --- > > Key: SPARK-28614 > URL: https://issues.apache.org/jira/browse/SPARK-28614 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > Do not remove leading write space in the golden result file. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28614) Do not remove leading write space in the golden result file
[ https://issues.apache.org/jira/browse/SPARK-28614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28614: - Assignee: Yuming Wang > Do not remove leading write space in the golden result file > --- > > Key: SPARK-28614 > URL: https://issues.apache.org/jira/browse/SPARK-28614 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Do not remove leading write space in the golden result file. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28604) Use log1p(x) instead of log(1+x) and expm1(x) instead of exp(x)-1
[ https://issues.apache.org/jira/browse/SPARK-28604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-28604. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25337 [https://github.com/apache/spark/pull/25337] > Use log1p(x) instead of log(1+x) and expm1(x) instead of exp(x)-1 > - > > Key: SPARK-28604 > URL: https://issues.apache.org/jira/browse/SPARK-28604 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.3 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 3.0.0 > > > Following https://issues.apache.org/jira/browse/SPARK-28519, I noticed that > there are a number of places in the MLlib code that may be, for example, > computing log(1+x) where x is very small. All math libraries specialize for > this case with log1p(x) for better accuracy. Same for expm1. It shouldn't > hurt to use log1p/expm1 where possible and in a few cases I think it will > improve accuracy, as sometimes x is for example a very small probability. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
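The accuracy argument is easy to demonstrate: for tiny x, the intermediate `1 + x` rounds to exactly 1.0 in double precision, so the naive forms lose the answer entirely (a standalone Python illustration, not MLlib code):

```python
import math

x = 1e-20  # e.g. a very small probability

# Naive forms: 1 + x rounds to exactly 1.0, and exp(x) rounds to (nearly) 1.0,
# so the small quantity of interest is lost completely.
naive_log = math.log(1 + x)
naive_exp = math.exp(x) - 1.0

# Specialized forms keep full relative accuracy for small x.
good_log = math.log1p(x)   # ~= x - x**2/2 for small x
good_exp = math.expm1(x)   # ~= x + x**2/2 for small x
```

Here `naive_log` is exactly 0.0 while `log1p` returns a value accurate to full precision, which is the improvement the change targets.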
[jira] [Created] (SPARK-28614) Do not remove leading write space in the golden result file
Yuming Wang created SPARK-28614: --- Summary: Do not remove leading write space in the golden result file Key: SPARK-28614 URL: https://issues.apache.org/jira/browse/SPARK-28614 Project: Spark Issue Type: Sub-task Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Yuming Wang Do not remove leading write space in the golden result file. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28527) Build a Test Framework for Thriftserver
[ https://issues.apache.org/jira/browse/SPARK-28527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899368#comment-16899368 ] Yuming Wang edited comment on SPARK-28527 at 8/4/19 5:23 PM: - When building this test framework, found one inconsistent behavior: [SPARK-28461] Thrift Server pad decimal numbers with trailing zeros to the scale of the column: was (Author: q79969786): When building this test framework, found one inconsistent behavior: [SPARK-28461] Thrift Server pad decimal numbers with trailing zeros to the scale of the column: !image-2019-08-03-13-58-06-106.png! > Build a Test Framework for Thriftserver > --- > > Key: SPARK-28527 > URL: https://issues.apache.org/jira/browse/SPARK-28527 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-28463 shows we need to improve > the overall test coverage for Thrift-server. > We can build a test framework that can directly re-run all the tests in > SQLQueryTestSuite via Thrift Server. This can help our test coverage of BI > use cases. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28527) Build a Test Framework for Thriftserver
[ https://issues.apache.org/jira/browse/SPARK-28527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28527: Attachment: (was: image-2019-08-03-13-58-06-106.png) > Build a Test Framework for Thriftserver > --- > > Key: SPARK-28527 > URL: https://issues.apache.org/jira/browse/SPARK-28527 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-28463 shows we need to improve > the overall test coverage for Thrift-server. > We can build a test framework that can directly re-run all the tests in > SQLQueryTestSuite via Thrift Server. This can help our test coverage of BI > use cases. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28471) Formatting dates with negative years
[ https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899657#comment-16899657 ] Yuming Wang commented on SPARK-28471: - Thank you [~maxgekk] I see. > Formatting dates with negative years > > > Key: SPARK-28471 > URL: https://issues.apache.org/jira/browse/SPARK-28471 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > While converting dates with negative years to strings, Spark skips era > sub-field by default. That's can confuse users since years from BC era are > mirrored to current era. For example: > {code} > spark-sql> select make_date(-44, 3, 15); > 0045-03-15 > {code} > Even negative years are out of supported range by the DATE type, it would be > nice to indicate the era for such dates. > PostgreSQL outputs the era for such inputs: > {code} > # select make_date(-44, 3, 15); >make_date > --- > 0044-03-15 BC > (1 row) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28609) Fix broken styles/links and make up-to-date
[ https://issues.apache.org/jira/browse/SPARK-28609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28609. --- Resolution: Fixed Fix Version/s: 2.4.4 3.0.0 Issue resolved by pull request 25345 [https://github.com/apache/spark/pull/25345] > Fix broken styles/links and make up-to-date > --- > > Key: SPARK-28609 > URL: https://issues.apache.org/jira/browse/SPARK-28609 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0, 2.4.4 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28609) Fix broken styles/links and make up-to-date
[ https://issues.apache.org/jira/browse/SPARK-28609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28609: - Assignee: Dongjoon Hyun > Fix broken styles/links and make up-to-date > --- > > Key: SPARK-28609 > URL: https://issues.apache.org/jira/browse/SPARK-28609 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28471) Formatting dates with negative years
[ https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899633#comment-16899633 ] Maxim Gekk edited comment on SPARK-28471 at 8/4/19 1:53 PM: [~yumwang] There are 2 equal methods to output negative years (for BC era): # With sign '-' # or add suffix `BC` The proposed PR outputs `-` for negative years. I have fixed this only Spark's formatters. In your example, you use standard formatter of *java.sql.Date*. For example, if you switch on Java 8 API: {code:java} scala> spark.conf.set("spark.sql.datetime.java8API.enabled", true) scala> val df = spark.sql("select make_date(-44, 3, 15)") df: org.apache.spark.sql.DataFrame = [make_date(-44, 3, 15): date] scala> val date = df.collect.head.getAs[java.time.LocalDate](0) date: java.time.LocalDate = -0044-03-15 {code} was (Author: maxgekk): [~yumwang] There are 2 equal methods to output negative years (for BC era): # With sign '-' # or add suffix `BC` The proposed PR outputs `-` for negative years. I have fixed this only Spark's formatters. In your example, you uses standard formatter of java.sql.Date. For example, if you switch on Java 8 API: {code:java} scala> spark.conf.set("spark.sql.datetime.java8API.enabled", true) scala> val df = spark.sql("select make_date(-44, 3, 15)") df: org.apache.spark.sql.DataFrame = [make_date(-44, 3, 15): date] scala> val date = df.collect.head.getAs[java.time.LocalDate](0) date: java.time.LocalDate = -0044-03-15 {code} > Formatting dates with negative years > > > Key: SPARK-28471 > URL: https://issues.apache.org/jira/browse/SPARK-28471 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > While converting dates with negative years to strings, Spark skips era > sub-field by default. That's can confuse users since years from BC era are > mirrored to current era. 
For example: > {code} > spark-sql> select make_date(-44, 3, 15); > 0045-03-15 > {code} > Even negative years are out of supported range by the DATE type, it would be > nice to indicate the era for such dates. > PostgreSQL outputs the era for such inputs: > {code} > # select make_date(-44, 3, 15); >make_date > --- > 0044-03-15 BC > (1 row) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28471) Formatting dates with negative years
[ https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899633#comment-16899633 ] Maxim Gekk commented on SPARK-28471: [~yumwang] There are two equivalent ways to output negative years (for the BC era): # with the sign '-' # or by adding the suffix `BC` The proposed PR outputs `-` for negative years. I have fixed this only in Spark's formatters. In your example, you use the standard formatter of java.sql.Date. For example, if you switch on the Java 8 API: {code:java} scala> spark.conf.set("spark.sql.datetime.java8API.enabled", true) scala> val df = spark.sql("select make_date(-44, 3, 15)") df: org.apache.spark.sql.DataFrame = [make_date(-44, 3, 15): date] scala> val date = df.collect.head.getAs[java.time.LocalDate](0) date: java.time.LocalDate = -0044-03-15 {code} > Formatting dates with negative years > > > Key: SPARK-28471 > URL: https://issues.apache.org/jira/browse/SPARK-28471 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > While converting dates with negative years to strings, Spark skips era > sub-field by default. That's can confuse users since years from BC era are > mirrored to current era. For example: > {code} > spark-sql> select make_date(-44, 3, 15); > 0045-03-15 > {code} > Even negative years are out of supported range by the DATE type, it would be > nice to indicate the era for such dates. > PostgreSQL outputs the era for such inputs: > {code} > # select make_date(-44, 3, 15); >make_date > --- > 0044-03-15 BC > (1 row) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
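The sign-based convention the fix adopts for Spark's own formatters can be sketched as follows (an illustrative Python helper, not Spark's actual formatter code):

```python
def format_date_signed(year, month, day):
    # Zero-pad the year to four digits and keep a leading '-' for
    # negative (BC-era) years, e.g. -44 -> "-0044".
    sign = "-" if year < 0 else ""
    return f"{sign}{abs(year):04d}-{month:02d}-{day:02d}"
```

Under this convention the date from `make_date(-44, 3, 15)` renders as `-0044-03-15`, matching the `java.time.LocalDate` output shown in the comment.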
[jira] [Commented] (SPARK-28471) Formatting dates with negative years
[ https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899630#comment-16899630 ] Yuming Wang commented on SPARK-28471: - Hi [~maxgekk], It seems the issue still exists: {code:scala} scala> val df = spark.sql("select make_date(-44, 3, 15)") df: org.apache.spark.sql.DataFrame = [make_date(-44, 3, 15): date] scala> df.show +-+ |make_date(-44, 3, 15)| +-+ | -0044-03-15| +-+ scala> df.collect.head.getDate(0) res1: java.sql.Date = 0045-03-17 {code} {code:sql} spark-sql> select make_date(-44, 3, 15); -0044-03-15 {code} > Formatting dates with negative years > > > Key: SPARK-28471 > URL: https://issues.apache.org/jira/browse/SPARK-28471 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > While converting dates with negative years to strings, Spark skips era > sub-field by default. That's can confuse users since years from BC era are > mirrored to current era. For example: > {code} > spark-sql> select make_date(-44, 3, 15); > 0045-03-15 > {code} > Even negative years are out of supported range by the DATE type, it would be > nice to indicate the era for such dates. > PostgreSQL outputs the era for such inputs: > {code} > # select make_date(-44, 3, 15); >make_date > --- > 0044-03-15 BC > (1 row) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28613) Spark SQL's collect() action judges only the size of the compressed RDD results, which is not accurate enough
[ https://issues.apache.org/jira/browse/SPARK-28613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-28613: -- Description: When we run the action DataFrame.collect(), for the configuration *spark.driver.maxResultSize*, when determining whether the returned data exceeds the limit, Spark uses the compressed byte array's size, which is not accurate. When we fetch data through the Spark Thrift Server without incremental collection, it fetches all data of the dataframe for each partition. For the returned data, the process is: # compress the data's byte array # package it as a ResultSet # return it to the driver and check it against *spark.driver.maxResultSize* # *decode (uncompress) the data as Array[Row]* The amount of compressed data differs significantly from the amount of uncompressed data; the difference in size can be more than tenfold. was:When we run action DataFrame.collect() , for the configuration *spark.driver.maxResultSize ,*when determine if the returned data exceeds the limit, it will use the compressed byte array's size, it it not > Spark SQL's collect() action judges only the size of the compressed RDD results, which is not accurate enough > -- > > Key: SPARK-28613 > URL: https://issues.apache.org/jira/browse/SPARK-28613 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major > > When we run the action DataFrame.collect(), for the configuration > *spark.driver.maxResultSize*, when determining whether the returned data exceeds > the limit, Spark uses the compressed byte array's size, which is not accurate. > When we fetch data through the Spark Thrift Server without incremental > collection, it fetches all data of the dataframe for each partition. 
> For the returned data, the process is: > # compress the data's byte array > # package it as a ResultSet > # return it to the driver and check it against *spark.driver.maxResultSize* > # *decode (uncompress) the data as Array[Row]* > The amount of compressed data differs significantly from the amount of > uncompressed data; the difference in size can be more than tenfold. > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
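The gap between compressed and decoded sizes is easy to observe. A standalone Python illustration (zlib stands in for Spark's block-level compression; the exact codec and ratio depend on the data):

```python
import zlib

# Result rows are often highly repetitive, so they compress extremely well.
rows = b"2019-08-04,some_repeated_column_value,1.0\n" * 100_000

compressed = zlib.compress(rows)
ratio = len(rows) / len(compressed)
# Checking spark.driver.maxResultSize against len(compressed) can therefore
# underestimate the decoded size by an order of magnitude or more.
```

For data like this the compression ratio easily exceeds 10x, which is the mismatch the reporter describes.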
[jira] [Updated] (SPARK-28613) Spark SQL's collect() action judges only the size of the compressed RDD results, which is not accurate enough
[ https://issues.apache.org/jira/browse/SPARK-28613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-28613: -- Description: When we run action DataFrame.collect() , for the configuration *spark.driver.maxResultSize ,*when determine if the returned data exceeds the limit, it will use the compressed byte array's size, it is not > Spark SQL's collect() action judges only the size of the compressed RDD results, which is not accurate enough > -- > > Key: SPARK-28613 > URL: https://issues.apache.org/jira/browse/SPARK-28613 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major > > When we run action DataFrame.collect() , for the configuration > *spark.driver.maxResultSize ,*when determine if the returned data exceeds > the limit, it will use the compressed byte array's size, it is not -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28613) Spark SQL's collect() action judges only the size of the compressed RDD results, which is not accurate enough
angerszhu created SPARK-28613: - Summary: Spark SQL's collect() action judges only the size of the compressed RDD results, which is not accurate enough Key: SPARK-28613 URL: https://issues.apache.org/jira/browse/SPARK-28613 Project: Spark Issue Type: Wish Components: SQL Affects Versions: 2.4.0 Reporter: angerszhu -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28611) Histogram's height is different
[ https://issues.apache.org/jira/browse/SPARK-28611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899585#comment-16899585 ] Marco Gaido commented on SPARK-28611: - Mmmh... that's weird! How can you get a different result than what was obtained on the CI? How are you running the code? > Histogram's height is different > -- > > Key: SPARK-28611 > URL: https://issues.apache.org/jira/browse/SPARK-28611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > CREATE TABLE desc_col_table (key int COMMENT 'column_comment') USING PARQUET; > -- Test output for histogram statistics > SET spark.sql.statistics.histogram.enabled=true; > SET spark.sql.statistics.histogram.numBins=2; > INSERT INTO desc_col_table values 1, 2, 3, 4; > ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key; > DESC EXTENDED desc_col_table key; > {code} > {noformat} > spark-sql> DESC EXTENDED desc_col_table key; > col_name key > data_type int > comment column_comment > min 1 > max 4 > num_nulls 0 > distinct_count4 > avg_col_len 4 > max_col_len 4 > histogram height: 4.0, num_of_bins: 2 > bin_0 lower_bound: 1.0, upper_bound: 2.0, distinct_count: 2 > bin_1 lower_bound: 2.0, upper_bound: 4.0, distinct_count: 2 > {noformat} > But our result is: > https://github.com/apache/spark/blob/v2.4.3/sql/core/src/test/resources/sql-tests/results/describe-table-column.sql.out#L231-L242 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files
[ https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899578#comment-16899578 ] Michael Gendelman commented on SPARK-24091: --- [~tmckay] We are experiencing the same issue and your solution looks great but I'm not sure how to implement it. You write: {quote}One way to handle this is allow a user-specified ConfigMap to be mounted at another location {quote} How can we mount a user-specified ConfigMap when we can't control the Pod specification? > Internally used ConfigMap prevents use of user-specified ConfigMaps carrying > Spark configs files > > > Key: SPARK-24091 > URL: https://issues.apache.org/jira/browse/SPARK-24091 > Project: Spark > Issue Type: Brainstorming > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > The recent PR [https://github.com/apache/spark/pull/20669] for removing the > init-container introduced a internally used ConfigMap carrying Spark > configuration properties in a file for the driver. This ConfigMap gets > mounted under {{$SPARK_HOME/conf}} and the environment variable > {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much > prevents users from mounting their own ConfigMaps that carry custom Spark > configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}} and > leaves users with only the option of building custom images. IMO, it is very > useful to support mounting user-specified ConfigMaps for custom Spark > configuration files. This worths further discussions. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28411) insertInto with overwrite inconsistent behaviour Python/Scala
[ https://issues.apache.org/jira/browse/SPARK-28411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28411: Labels: release-notes (was: ) > insertInto with overwrite inconsistent behaviour Python/Scala > - > > Key: SPARK-28411 > URL: https://issues.apache.org/jira/browse/SPARK-28411 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.1, 2.4.0 >Reporter: Maria Rebelka >Assignee: Huaxin Gao >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > The df.write.mode("overwrite").insertInto("table") has inconsistent behaviour > between Scala and Python. In Python, insertInto ignores "mode" parameter and > appends by default. Only when changing syntax to df.write.insertInto("table", > overwrite=True) we get expected behaviour. > This is a native Spark syntax, expected to be the same between languages... > Also, in other write methods, like saveAsTable or write.parquet "mode" seem > to be respected. > Reproduce, Python, ignore "overwrite": > {code:java} > df = spark.createDataFrame(sc.parallelize([(1, 2),(3,4)]),['i','j']) > # create the table and load data > df.write.saveAsTable("spark_overwrite_issue") > # insert overwrite, expected result - 2 rows > df.write.mode("overwrite").insertInto("spark_overwrite_issue") > spark.sql("select * from spark_overwrite_issue").count() > # result - 4 rows, insert appended data instead of overwrite{code} > Reproduce, Scala, works as expected: > {code:java} > val df = Seq((1, 2),(3,4)).toDF("i","j") > df.write.mode("overwrite").insertInto("spark_overwrite_issue") > spark.sql("select * from spark_overwrite_issue").count() > # result - 2 rows{code} > Tested on Spark 2.2.1 (EMR) and 2.4.0 (Databricks) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org