[jira] [Commented] (SPARK-29707) Make PartitionFilters and PushedFilters abbreviate configurable in metadata
[ https://issues.apache.org/jira/browse/SPARK-29707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964655#comment-16964655 ]

Hu Fuwang commented on SPARK-29707:
-----------------------------------

I am working on this.

> Make PartitionFilters and PushedFilters abbreviate configurable in metadata
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-29707
>                 URL: https://issues.apache.org/jira/browse/SPARK-29707
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>         Attachments: screenshot-1.png
>
> !screenshot-1.png!
> The abbreviated output loses some key information.
> Related code:
> https://github.com/apache/spark/blob/ec5d698d99634e5bb8fc7b0fa1c270dd67c129c8/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L58-L66

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
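The issue targets the fixed-length string cut applied to scan metadata in DataSourceScanExec. A minimal plain-Scala sketch of what a configuration-driven cut-off could look like; the config key name and helper names here are illustrative assumptions, not the actual patch:

{code:java}
object MetadataAbbreviation {
  // Assumed to be read from a SQLConf entry such as
  // "spark.sql.maxMetadataStringLength" (illustrative name),
  // replacing the hard-coded 100-character cap.
  def abbreviate(value: String, maxLength: Int): String =
    if (value.length <= maxLength) value
    else value.substring(0, maxLength - 3) + "..."

  // Renders metadata entries like "PushedFilters: [...]" with the
  // configurable limit applied per value.
  def formatMetadata(metadata: Map[String, String], maxLength: Int): Seq[String] =
    metadata.toSeq.sortBy(_._1).map { case (key, value) =>
      s"$key: ${abbreviate(value, maxLength)}"
    }
}
{code}

With a large enough limit, PartitionFilters and PushedFilters would no longer be truncated in the plan output shown in the screenshot.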
[jira] [Created] (SPARK-29710) Seeing offsets not resetting even when reset policy is configured explicitly
Shyam created SPARK-29710:
-----------------------------

             Summary: Seeing offsets not resetting even when reset policy is configured explicitly
                 Key: SPARK-29710
                 URL: https://issues.apache.org/jira/browse/SPARK-29710
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.4.1
         Environment: Windows 10, Eclipse Neon
            Reporter: Shyam

Even after setting *"auto.offset.reset" to "latest"*, I am getting the error below:

{code}
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {COMPANY_TRANSACTIONS_INBOUND-16=168}
	at org.apache.kafka.clients.consumer.internals.Fetcher.throwIfOffsetOutOfRange(Fetcher.java:348)
	at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:396)
	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:999)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937)
	at org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470)
	at org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361)
	at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251)
	at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234)
	at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
	at org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209)
	at org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234)
{code}

https://stackoverflow.com/questions/58653885/even-after-setting-auto-offset-reset-to-latest-getting-error-offsetoutofrang
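Worth noting for reports like this one: the Structured Streaming Kafka source manages offsets itself and does not honor the consumer's `auto.offset.reset`; offset behavior is controlled through source options instead. A hedged sketch, assuming a SparkSession named `spark` and the spark-sql-kafka-0-10 connector:

{code:java}
// Sketch: configure offsets via source options, not auto.offset.reset
// (which the Kafka source ignores). "spark" is an assumed SparkSession.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "COMPANY_TRANSACTIONS_INBOUND")
  .option("startingOffsets", "latest")  // takes the place of auto.offset.reset
  .option("failOnDataLoss", "false")    // tolerate out-of-range/expired offsets
  .load()
{code}

With `failOnDataLoss=false` the query logs a warning instead of failing when checkpointed offsets have aged out of the topic's retention.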
[jira] [Created] (SPARK-29709) structured streaming The offset in the checkpoint is suddenly reset to the earliest
test created SPARK-29709:
----------------------------

             Summary: structured streaming The offset in the checkpoint is suddenly reset to the earliest
                 Key: SPARK-29709
                 URL: https://issues.apache.org/jira/browse/SPARK-29709
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: test

In Structured Streaming, the offset stored in the checkpoint is suddenly reset to the earliest offset.
[jira] [Comment Edited] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964626#comment-16964626 ]

Terry Kim edited comment on SPARK-29630 at 11/1/19 5:50 AM:
------------------------------------------------------------

In the above example, the EXISTS clause becomes a `condition` of `Filter`. The current implementation is not exhaustive enough - e.g., it doesn't traverse Expression node for checking views. I will create a PR to address this.

was (Author: imback82): In the above example, the EXISTS clause becomes a `condition` of `Filter`. The current implementation is not exhaustive enough - e.g., it doesn't traverse Expression node, etc. I will create a PR to address this.

> Not allowed to create a permanent view by referencing a temporary view in EXISTS
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-29630
>                 URL: https://issues.apache.org/jira/browse/SPARK-29630
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Takeshi Yamamuro
>            Priority: Major
>
> {code}
> // In the master, the query below fails
> $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * FROM temp_table) t2;
> org.apache.spark.sql.AnalysisException
> Not allowed to create a permanent view `v7_temp` by referencing a temporary view `temp_table`;
> // In the master, the query below passed, but this should fail
> $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM temp_table);
> Passed
> {code}
[jira] [Commented] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964626#comment-16964626 ]

Terry Kim commented on SPARK-29630:
-----------------------------------

In the above example, the EXISTS clause becomes a `condition` of `Filter`. The current implementation is not exhaustive enough - e.g., it doesn't traverse Expression node, etc. I will create a PR to address this.
[jira] [Updated] (SPARK-29698) Support grouping function with multiple arguments
[ https://issues.apache.org/jira/browse/SPARK-29698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-29698:
-------------------------------------
    Description: 
In PgSQL, grouping() can have multiple arguments, but Spark grouping() must have a single argument (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala#L100);

{code:java}
postgres=# select a, b, grouping(a, b), sum(v), count(*), max(v)
postgres-# from gstest1 group by rollup (a,b);
 a | b | grouping | sum | count | max
---+---+----------+-----+-------+-----
   |   |        3 |     |     0 |
(1 row)
{code}

See a doc for the form:
https://www.postgresql.org/docs/12/functions-aggregate.html (Table 9.59. Grouping Operations)
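Until grouping() accepts multiple arguments, Spark's existing grouping_id() already returns the same combined bitmask as PgSQL's multi-argument grouping(). A hedged sketch against the gstest1 example above, assuming a SparkSession named `spark`:

{code:java}
// Sketch: grouping_id(a, b) gives the same bitmask PgSQL's grouping(a, b)
// would (bit per column, most significant first). "spark" is an assumed
// SparkSession and gstest1 an existing table.
spark.sql("""
  select a, b, grouping_id(a, b), sum(v), count(*), max(v)
  from gstest1 group by rollup (a, b)
""").show()
{code}

Supporting multi-argument grouping() would then mostly be a matter of mapping it onto the same bitmask computation.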
[jira] [Updated] (SPARK-29697) Support bit string types/literals
[ https://issues.apache.org/jira/browse/SPARK-29697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-29697:
-------------------------------------
    Description: 
In PgSQL, there are bit types and literals;

{code}
postgres=# create table b(b bit(4));
CREATE TABLE
postgres=# select b'0010';
 ?column?
----------
 0010
(1 row)
{code}

See a doc for the form:
https://www.postgresql.org/docs/current/datatype-bit.html
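As a stop-gap while Spark has no bit(n) type or b'...' literal, bit strings can be modeled as integers and converted at the edges. A plain-Scala sketch; the helper names are illustrative, not part of any Spark API:

{code:java}
object BitStrings {
  // Parse a PgSQL-style bit literal body such as "0010" into an Int.
  def bitLiteral(bits: String): Int = Integer.parseInt(bits, 2)

  // Render an Int back as a fixed-width bit string.
  def toBitString(v: Int, width: Int): String =
    v.toBinaryString.reverse.padTo(width, '0').reverse

  // bitLiteral("0010") == 2; toBitString(2, 4) == "0010"
}
{code}

A real implementation would instead add the literal form to the SQL parser and a corresponding data type, per the linked PgSQL documentation.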
[jira] [Updated] (SPARK-29700) Support nested grouping sets
[ https://issues.apache.org/jira/browse/SPARK-29700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-29700:
-------------------------------------
    Description: 
PgSQL can process nested grouping sets, but Spark cannot;

{code:java}
postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer);
postgres=# insert into gstest2 values
postgres-# (1, 1, 1, 1, 1, 1, 1, 1),
postgres-# (1, 1, 1, 1, 1, 1, 1, 2),
postgres-# (1, 1, 1, 1, 1, 1, 2, 2),
postgres-# (1, 1, 1, 1, 1, 2, 2, 2),
postgres-# (1, 1, 1, 1, 2, 2, 2, 2),
postgres-# (1, 1, 1, 2, 2, 2, 2, 2),
postgres-# (1, 1, 2, 2, 2, 2, 2, 2),
postgres-# (1, 2, 2, 2, 2, 2, 2, 2),
postgres-# (2, 2, 2, 2, 2, 2, 2, 2);
INSERT 0 9
postgres=# select sum(c) from gstest2
postgres-# group by grouping sets(grouping sets((a, b)))
postgres-# order by 1 desc;
 sum
-----
  16
   4
   4
(3 rows)
{code}

{code:java}
scala> sql("""
     | select sum(c) from gstest2
     | group by grouping sets(grouping sets((a, b)))
     | order by 1 desc
     | """).show
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'sets' expecting {')', ','}(line 3, pos 34)

== SQL ==

select sum(c) from gstest2
group by grouping sets(grouping sets((a, b)))
----------------------------------^^^
order by 1 desc

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
  ... 51 elided
{code}

See a doc for the form:
https://www.postgresql.org/docs/current/sql-select.html
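Per the SQL standard, nested grouping sets flatten into the union of their member sets, so the failing query above can be rewritten by hand into a form Spark already parses. A sketch, assuming a SparkSession named `spark`:

{code:java}
// Sketch: grouping sets(grouping sets((a, b))) flattens to
// grouping sets((a, b)), which Spark's parser accepts today.
// "spark" is an assumed SparkSession and gstest2 an existing table.
spark.sql("""
  select sum(c) from gstest2
  group by grouping sets((a, b))
  order by 1 desc
""").show()
{code}

A parser-level fix would perform the same flattening automatically while building the grouping-set list.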
[jira] [Updated] (SPARK-29705) Support more expressive forms in GroupingSets/Cube/Rollup
[ https://issues.apache.org/jira/browse/SPARK-29705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-29705:
-------------------------------------
    Description: 
See a doc for the form:
https://www.postgresql.org/docs/current/sql-select.html

{code:java}
postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer);
postgres=# insert into gstest2 values
postgres-# (1, 1, 1, 1, 1, 1, 1, 1),
postgres-# (1, 1, 1, 1, 1, 1, 1, 2),
postgres-# (1, 1, 1, 1, 1, 1, 2, 2),
postgres-# (1, 1, 1, 1, 1, 2, 2, 2),
postgres-# (1, 1, 1, 1, 2, 2, 2, 2),
postgres-# (1, 1, 1, 2, 2, 2, 2, 2),
postgres-# (1, 1, 2, 2, 2, 2, 2, 2),
postgres-# (1, 2, 2, 2, 2, 2, 2, 2),
postgres-# (2, 2, 2, 2, 2, 2, 2, 2);
INSERT 0 9
postgres=# select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d));
 a | b | grouping | sum | count | max
---+---+----------+-----+-------+-----
   |   |        3 |  24 |    18 |   2
 1 | 1 |        0 |   4 |     2 |   2
 1 | 2 |        0 |   4 |     2 |   2
 1 | 1 |        0 |   2 |     2 |   1
 2 | 2 |        0 |   4 |     2 |   2
 1 | 1 |        0 |  10 |    10 |   1
 1 | 2 |        0 |   4 |     2 |   2
 1 | 1 |        0 |  12 |    12 |   1
 1 | 1 |        0 |   4 |     2 |   2
 2 | 2 |        0 |   4 |     2 |   2
(10 rows)
{code}

{code:java}
scala> sql("""select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d))""").show
org.apache.spark.sql.AnalysisException: Invalid number of arguments for function grouping. Expected: 1; Found: 2; line 1 pos 13
  at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$8(FunctionRegistry.scala:614)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:598)
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1375)
  at org.apache.spark.sql.hive.HiveSessionCatalog.super$lookupFunction(HiveSessionCatalog.scala:132)
  at org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$lookupFunction0$2(HiveSessionCatalog.scala:132)
  at scala.util.Try$.apply(Try.scala:213)
{code}
[jira] [Updated] (SPARK-29704) Support the combinations of grouping operations
[ https://issues.apache.org/jira/browse/SPARK-29704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-29704:
-------------------------------------
    Description: 
PgSQL can accept a query below with the combinations of grouping operations, but Spark cannot;

{code:java}
postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer);
postgres=# insert into gstest2 values
postgres-# (1, 1, 1, 1, 1, 1, 1, 1),
postgres-# (1, 1, 1, 1, 1, 1, 1, 2),
postgres-# (1, 1, 1, 1, 1, 1, 2, 2),
postgres-# (1, 1, 1, 1, 1, 2, 2, 2),
postgres-# (1, 1, 1, 1, 2, 2, 2, 2),
postgres-# (1, 1, 1, 2, 2, 2, 2, 2),
postgres-# (1, 1, 2, 2, 2, 2, 2, 2),
postgres-# (1, 2, 2, 2, 2, 2, 2, 2),
postgres-# (2, 2, 2, 2, 2, 2, 2, 2);
INSERT 0 9
postgres=# select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d);
 a | b | c | d
---+---+---+---
 1 | 2 | 2 |
 1 | 1 | 2 |
 1 | 1 | 1 |
 2 | 2 | 2 |
 1 |   | 1 |
 2 |   | 2 |
 1 |   | 2 |
   |   | 2 |
   |   | 1 |
 1 | 2 |   | 2
 1 | 1 |   | 2
 1 | 1 |   | 1
 2 | 2 |   | 2
 1 |   |   | 1
 2 |   |   | 2
 1 |   |   | 2
   |   |   | 2
   |   |   | 1
(18 rows)
{code}

{code}
scala> sql("""select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)""").show
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'sets' expecting {<EOF>, ',', '.', '[', 'AND', 'BETWEEN', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUPING', 'HAVING', 'IN', 'INTERSECT', 'IS', 'LIKE', 'LIMIT', NOT, 'OR', 'ORDER', RLIKE, 'MINUS', 'SORT', 'UNION', 'WINDOW', 'WITH', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'} (line 1, pos 61)

== SQL ==

select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)
-------------------------------------------------------------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
  ... 47 elided
{code}

See a doc for the form:
https://www.postgresql.org/docs/current/sql-select.html
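As a workaround, rollup(a,b) is shorthand for grouping sets ((a,b),(a),()), and per the SQL standard, comma-separated grouping elements combine as a cross product, so the query above can be expanded by hand into a single grouping-sets list that Spark parses. A sketch, assuming a SparkSession named `spark`:

{code:java}
// Sketch: {(a,b),(a),()} x {(c),(d)} expands to the six sets below,
// which is the cross product PgSQL computes for
// "group by rollup(a,b),grouping sets(c,d)".
// "spark" is an assumed SparkSession and gstest2 an existing table.
spark.sql("""
  select a, b, c, d from gstest2
  group by grouping sets ((a,b,c),(a,b,d),(a,c),(a,d),(c),(d))
""").show()
{code}

Native support would perform this cross-product expansion in the parser or analyzer instead of requiring the manual rewrite.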
[jira] [Created] (SPARK-29708) Different answers in aggregates of multiple grouping sets
Takeshi Yamamuro created SPARK-29708:
----------------------------------------

             Summary: Different answers in aggregates of multiple grouping sets
                 Key: SPARK-29708
                 URL: https://issues.apache.org/jira/browse/SPARK-29708
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Takeshi Yamamuro

A query below with multiple grouping sets seems to have different answers between PgSQL and Spark;

{code:java}
postgres=# create table gstest4(id integer, v integer, unhashable_col bit(4), unsortable_col xid);
postgres=# insert into gstest4
postgres-# values (1,1,b'','1'), (2,2,b'0001','1'),
postgres-#        (3,4,b'0010','2'), (4,8,b'0011','2'),
postgres-#        (5,16,b'','2'), (6,32,b'0001','2'),
postgres-#        (7,64,b'0010','1'), (8,128,b'0011','1');
INSERT 0 8
postgres=# select unsortable_col, count(*)
postgres-# from gstest4 group by grouping sets ((unsortable_col),(unsortable_col))
postgres-# order by text(unsortable_col);
 unsortable_col | count
----------------+-------
 1              |     8
 1              |     8
 2              |     8
 2              |     8
(4 rows)
{code}

{code:java}
scala> sql("""create table gstest4(id integer, v integer, unhashable_col /* bit(4) */ byte, unsortable_col /* xid */ integer) using parquet""")

scala> sql("""
     | insert into gstest4
     | values (1,1,tinyint('0'),1), (2,2,tinyint('1'),1),
     |        (3,4,tinyint('2'),2), (4,8,tinyint('3'),2),
     |        (5,16,tinyint('0'),2), (6,32,tinyint('1'),2),
     |        (7,64,tinyint('2'),1), (8,128,tinyint('3'),1)
     | """)
res21: org.apache.spark.sql.DataFrame = []

scala> sql("""
     | select unsortable_col, count(*)
     | from gstest4 group by grouping sets ((unsortable_col),(unsortable_col))
     | order by string(unsortable_col)
     | """).show
+--------------+--------+
|unsortable_col|count(1)|
+--------------+--------+
|             1|       8|
|             2|       8|
+--------------+--------+
{code}
[jira] [Updated] (SPARK-29701) Different answers when empty input given in GROUPING SETS
[ https://issues.apache.org/jira/browse/SPARK-29701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-29701:
-------------------------------------
    Description: 
A query below with an empty input seems to have different answers between PgSQL and Spark;

{code:java}
postgres=# create table gstest_empty (a integer, b integer, v integer);
CREATE TABLE
postgres=# select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),());
 a | b | sum | count
---+---+-----+-------
   |   |     |     0
(1 row)
{code}

{code:java}
scala> sql("""select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),())""").show
+---+---+------+--------+
|  a|  b|sum(v)|count(1)|
+---+---+------+--------+
+---+---+------+--------+
{code}
[jira] [Created] (SPARK-29707) Make PartitionFilters and PushedFilters abbreviate configurable in metadata
Yuming Wang created SPARK-29707:
-----------------------------------

             Summary: Make PartitionFilters and PushedFilters abbreviate configurable in metadata
                 Key: SPARK-29707
                 URL: https://issues.apache.org/jira/browse/SPARK-29707
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Yuming Wang
         Attachments: screenshot-1.png

!image-2019-11-01-13-12-38-712.png!

The abbreviated output loses some key information.

Related code:
https://github.com/apache/spark/blob/ec5d698d99634e5bb8fc7b0fa1c270dd67c129c8/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L58-L66
[jira] [Updated] (SPARK-29707) Make PartitionFilters and PushedFilters abbreviate configurable in metadata
[ https://issues.apache.org/jira/browse/SPARK-29707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-29707:
--------------------------------
    Attachment: screenshot-1.png
[jira] [Updated] (SPARK-29707) Make PartitionFilters and PushedFilters abbreviate configurable in metadata
[ https://issues.apache.org/jira/browse/SPARK-29707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-29707:
--------------------------------
    Description: 
!screenshot-1.png!
The abbreviated output loses some key information.
Related code:
https://github.com/apache/spark/blob/ec5d698d99634e5bb8fc7b0fa1c270dd67c129c8/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L58-L66

was:
!image-2019-11-01-13-12-38-712.png!
It lost some key information.
Related code:
https://github.com/apache/spark/blob/ec5d698d99634e5bb8fc7b0fa1c270dd67c129c8/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L58-L66
[jira] [Created] (SPARK-29706) Support an empty grouping expression
Takeshi Yamamuro created SPARK-29706: Summary: Support an empty grouping expression Key: SPARK-29706 URL: https://issues.apache.org/jira/browse/SPARK-29706 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro PgSQL can accept a query below with an empty grouping expr, but Spark cannot; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select v.c, (select count(*) from gstest2 group by () having v.c) from (values (false),(true)) v(c) order by v.c; c | count ---+--- f | t |18 (2 rows) {code} {code:java} scala> sql("""select v.c, (select count(*) from gstest2 group by () having v.c) from (values (false),(true)) v(c) order by v.c""").show org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input '()'(line 1, pos 52) == SQL == select v.c, (select count(*) from gstest2 group by () having v.c) from (values (false),(true)) v(c) order by v.c ^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 
47 elided {code}
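Semantically, `group by ()` groups every row into a single global group, which is why the PgSQL subquery above behaves like a plain aggregate over the whole table. A small Python sketch of that semantics (illustration only, not Spark or PgSQL code):

```python
def group_rows(rows, key_columns):
    """Group rows (dicts) by the given key columns.

    An empty key list -- the `group by ()` case -- maps every row to
    the key (), so all rows land in one global group and aggregates
    run exactly once over the whole input.
    """
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)  # () when key_columns is empty
        groups.setdefault(key, []).append(row)
    return groups

rows = [{"c": 1}, {"c": 2}, {"c": 3}]
assert len(group_rows(rows, [])) == 1      # one global group
assert len(group_rows(rows, ["c"])) == 3   # one group per distinct c
```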
[jira] [Updated] (SPARK-29700) Support nested grouping sets
[ https://issues.apache.org/jira/browse/SPARK-29700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29700: - Description: PgSQL can process nested grouping sets, but Spark cannot; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select sum(c) from gstest2 postgres-# group by grouping sets(grouping sets((a, b))) postgres-# order by 1 desc; sum - 16 4 4 (3 rows) {code} {code:java} scala> sql(""" | select sum(c) from gstest2 | group by grouping sets(grouping sets((a, b))) | order by 1 desc | """).show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {')', ','}(line 3, pos 34) == SQL == select sum(c) from gstest2 group by grouping sets(grouping sets((a, b))) --^^^ order by 1 desc at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 
51 elided {code} was: PgSQL can process nested grouping sets, but Spark cannot; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); ERROR: relation "gstest2" already exists postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select sum(c) from gstest2 postgres-# group by grouping sets(grouping sets((a, b))) postgres-# order by 1 desc; sum - 16 4 4 (3 rows) {code} {code:java} scala> sql(""" | select sum(c) from gstest2 | group by grouping sets(grouping sets((a, b))) | order by 1 desc | """).show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {')', ','}(line 3, pos 34) == SQL == select sum(c) from gstest2 group by grouping sets(grouping sets((a, b))) --^^^ order by 1 desc at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 
51 elided {code} > Support nested grouping sets > > > Key: SPARK-29700 > URL: https://issues.apache.org/jira/browse/SPARK-29700 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > PgSQL can process nested grouping sets, but Spark cannot; > {code:java} > postgres=# create table gstest2 (a integer, b integer, c integer, d integer, > e integer, f integer, g integer, h integer); > postgres=# insert into gstest2 values > postgres-# (1, 1, 1, 1, 1, 1, 1, 1), > postgres-# (1, 1, 1, 1, 1, 1, 1, 2), > postgres-# (1, 1, 1, 1, 1, 1, 2, 2), > postgres-# (1, 1, 1, 1, 1, 2, 2, 2), > postgres-# (1, 1, 1, 1, 2, 2, 2, 2), > postgres-# (1, 1, 1, 2, 2, 2, 2, 2), > postgres-# (1, 1, 2, 2, 2, 2, 2, 2), > postgres-# (1, 2, 2, 2, 2, 2, 2, 2), > postgres-# (2, 2, 2, 2, 2, 2, 2, 2); > INSERT 0 9 > postgres=# select sum(c) from gstest2 > postgres-# group by grouping sets(grouping sets((a, b))) > postgres-# order
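The PgSQL semantics being requested flatten the nesting: `grouping sets(grouping sets((a, b)))` is equivalent to `grouping sets((a, b))`. A Python sketch of that flattening (the representation is made up for illustration; it is not Spark's parser or plan type):

```python
def flatten_grouping_sets(spec):
    """Flatten nested GROUPING SETS into a flat list of grouping sets.

    A spec is either a tuple of column names (one grouping set) or a
    list of specs (a GROUPING SETS(...) wrapper, possibly nested).
    """
    if isinstance(spec, tuple):
        return [spec]
    flat = []
    for inner in spec:
        flat.extend(flatten_grouping_sets(inner))
    return flat

# grouping sets(grouping sets((a, b))) reduces to the single set (a, b)
assert flatten_grouping_sets([[("a", "b")]]) == [("a", "b")]
# grouping sets((a), grouping sets((b), ())) reduces to (a), (b), ()
assert flatten_grouping_sets([("a",), [("b",), ()]]) == [("a",), ("b",), ()]
```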
[jira] [Updated] (SPARK-29704) Support the combinations of grouping operations
[ https://issues.apache.org/jira/browse/SPARK-29704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29704: - Description: PgSQL can accept a query below with the combinations of grouping operations, but Spark cannot; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d); a | b | c | d ---+---+---+--- 1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | | 1 | 2 | | 2 | 1 | | 2 | | | 2 | | | 1 | 1 | 2 | | 2 1 | 1 | | 2 1 | 1 | | 1 2 | 2 | | 2 1 | | | 1 2 | | | 2 1 | | | 2 | | | 2 | | | 1 (18 rows) {code} scala> sql("""select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)""").show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {, ',', '.', '[', 'AND', 'BETWEEN', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUPING', 'HAVING', 'IN', 'INTERSECT', 'IS', 'LIKE', 'LIMIT', NOT, 'OR', 'ORDER', RLIKE, 'MINUS', 'SORT', 'UNION', 'WINDOW', 'WITH', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'} (line 1, pos 61) == SQL == select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d) -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided {code:java} {code} was: PgSQL can accept a query below with the combinations of grouping operations, but Spark cannot; {code} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); ERROR: relation "gstest2" already exists postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d); a | b | c | d ---+---+---+--- 1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | | 1 | 2 | | 2 | 1 | | 2 | | | 2 | | | 1 | 1 | 2 | | 2 1 | 1 | | 2 1 | 1 | | 1 2 | 2 | | 2 1 | | | 1 2 | | | 2 1 | | | 2 | | | 2 | | | 1 (18 rows) {code} scala> sql("""select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)""").show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {, ',', '.', '[', 'AND', 'BETWEEN', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUPING', 'HAVING', 'IN', 'INTERSECT', 'IS', 'LIKE', 'LIMIT', NOT, 'OR', 'ORDER', RLIKE, 'MINUS', 'SORT', 'UNION', 'WINDOW', 'WITH', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'}(line 1, pos 61) == SQL == select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d) -^^^ at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided {code} {code} > Support the combinations of grouping operations > ---
[jira] [Updated] (SPARK-29704) Support the combinations of grouping operations
[ https://issues.apache.org/jira/browse/SPARK-29704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29704: - Description: PgSQL can accept a query below with the combinations of grouping operations, but Spark cannot; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d); a | b | c | d ---+---+---+--- 1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | | 1 | 2 | | 2 | 1 | | 2 | | | 2 | | | 1 | 1 | 2 | | 2 1 | 1 | | 2 1 | 1 | | 1 2 | 2 | | 2 1 | | | 1 2 | | | 2 1 | | | 2 | | | 2 | | | 1 (18 rows) {code} {code} scala> sql("""select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)""").show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {, ',', '.', '[', 'AND', 'BETWEEN', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUPING', 'HAVING', 'IN', 'INTERSECT', 'IS', 'LIKE', 'LIMIT', NOT, 'OR', 'ORDER', RLIKE, 'MINUS', 'SORT', 'UNION', 'WINDOW', 'WITH', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'} (line 1, pos 61) == SQL == select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d) -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided {code} was: PgSQL can accept a query below with the combinations of grouping operations, but Spark cannot; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d); a | b | c | d ---+---+---+--- 1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | | 1 | 2 | | 2 | 1 | | 2 | | | 2 | | | 1 | 1 | 2 | | 2 1 | 1 | | 2 1 | 1 | | 1 2 | 2 | | 2 1 | | | 1 2 | | | 2 1 | | | 2 | | | 2 | | | 1 (18 rows) {code} scala> sql("""select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)""").show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {, ',', '.', '[', 'AND', 'BETWEEN', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUPING', 'HAVING', 'IN', 'INTERSECT', 'IS', 'LIKE', 'LIMIT', NOT, 'OR', 'ORDER', RLIKE, 'MINUS', 'SORT', 'UNION', 'WINDOW', 'WITH', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'} (line 1, pos 61) == SQL == select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d) -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided {code:java} {code} > Support the combinations of grouping operations > --- > >
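In PgSQL, combining a ROLLUP with a GROUPING SETS item in one GROUP BY takes the cross product of the two lists of grouping sets, concatenating each pair. A Python sketch of that expansion (illustrative semantics only; the helper names are invented):

```python
from itertools import product

def rollup(*cols):
    """ROLLUP(a, b) expands to [(a, b), (a,), ()]."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def combine(*clauses):
    """Cross product of grouping clauses, concatenating each pick."""
    return [sum(pick, ()) for pick in product(*clauses)]

sets_cd = [("c",), ("d",)]  # grouping sets(c, d)
combined = combine(rollup("a", "b"), sets_cd)
# rollup yields 3 sets, grouping sets yields 2 -> 6 combined grouping sets
assert combined == [("a", "b", "c"), ("a", "b", "d"),
                    ("a", "c"), ("a", "d"), ("c",), ("d",)]
```

Those six grouping sets are what produce the 18-row PgSQL result above; Spark's parser rejects the syntax before any of this expansion can happen.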
[jira] [Updated] (SPARK-29705) Support more expressive forms in GroupingSets/Cube/Rollup
[ https://issues.apache.org/jira/browse/SPARK-29705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29705: - Description: {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d)); a | b | grouping | sum | count | max ---+---+--+-+---+- | |3 | 24 |18 | 2 1 | 1 |0 | 4 | 2 | 2 1 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 2 | 2 | 1 2 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 10 |10 | 1 1 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 12 |12 | 1 1 | 1 |0 | 4 | 2 | 2 2 | 2 |0 | 4 | 2 | 2 (10 rows) {code} {code:java} scala> sql("""select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d))""").show org.apache.spark.sql.AnalysisException: Invalid number of arguments for function grouping. 
Expected: 1; Found: 2; line 1 pos 13 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$8(FunctionRegistry.scala:614) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:598) at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1375) at org.apache.spark.sql.hive.HiveSessionCatalog.super$lookupFunction(HiveSessionCatalog.scala:132) at org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$lookupFunction0$2(HiveSessionCatalog.scala:132) at scala.util.Try$.apply(Try.scala:213) {code} was: {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); ERROR: relation "gstest2" already exists postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d)); a | b | grouping | sum | count | max ---+---+--+-+---+- | |3 | 24 |18 | 2 1 | 1 |0 | 4 | 2 | 2 1 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 2 | 2 | 1 2 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 10 |10 | 1 1 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 12 |12 | 1 1 | 1 |0 | 4 | 2 | 2 2 | 2 |0 | 4 | 2 | 2 (10 rows) {code} {code:java} scala> sql("""select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d))""").show org.apache.spark.sql.AnalysisException: Invalid number of arguments for function grouping. 
Expected: 1; Found: 2; line 1 pos 13 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$8(FunctionRegistry.scala:614) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:598) at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1375) at org.apache.spark.sql.hive.HiveSessionCatalog.super$lookupFunction(HiveSessionCatalog.scala:132) at org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$lookupFunction0$2(HiveSessionCatalog.scala:132) at scala.util.Try$.apply(Try.scala:213) {code} > Support more expressive forms in GroupingSets/Cube/Rollup > - > > Key: SPARK-29705 > URL: https://issues.apache.org/jira/browse/SPARK-29705 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code:java} > postgres=# create table gstest2 (
[jira] [Created] (SPARK-29705) Support more expressive forms in GroupingSets/Cube/Rollup
Takeshi Yamamuro created SPARK-29705: Summary: Support more expressive forms in GroupingSets/Cube/Rollup Key: SPARK-29705 URL: https://issues.apache.org/jira/browse/SPARK-29705 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); ERROR: relation "gstest2" already exists postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d)); a | b | grouping | sum | count | max ---+---+--+-+---+- | |3 | 24 |18 | 2 1 | 1 |0 | 4 | 2 | 2 1 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 2 | 2 | 1 2 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 10 |10 | 1 1 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 12 |12 | 1 1 | 1 |0 | 4 | 2 | 2 2 | 2 |0 | 4 | 2 | 2 (10 rows) {code} {code:java} scala> sql("""select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d))""").show org.apache.spark.sql.AnalysisException: Invalid number of arguments for function grouping. 
Expected: 1; Found: 2; line 1 pos 13 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$8(FunctionRegistry.scala:614) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:598) at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1375) at org.apache.spark.sql.hive.HiveSessionCatalog.super$lookupFunction(HiveSessionCatalog.scala:132) at org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$lookupFunction0$2(HiveSessionCatalog.scala:132) at scala.util.Try$.apply(Try.scala:213) {code}
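Part of what this ticket needs is PgSQL's multi-argument `grouping(a, b)`, which packs one bit per argument (1 when the column is aggregated away in the current grouping set) into an integer; that is why the all-NULL grand-total row above reports 3. A Python sketch of that bitmask (illustrative only):

```python
def grouping(grouped_columns, *args):
    """Return the GROUPING(...) bitmask for the current grouping set.

    Each argument contributes one bit, most significant first:
    1 if the column is *not* part of the current grouping set.
    """
    mask = 0
    for col in args:
        mask = (mask << 1) | (0 if col in grouped_columns else 1)
    return mask

# Grand-total row: neither a nor b is grouped, so grouping(a,b) == 0b11 == 3
assert grouping(set(), "a", "b") == 3
# Current set (a, b, c): both bits are 0
assert grouping({"a", "b", "c"}, "a", "b") == 0
# Current set (a,): only the b bit is set
assert grouping({"a"}, "a", "b") == 1
```

Spark's `grouping` currently accepts a single column, hence the "Expected: 1; Found: 2" analysis error above.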
[jira] [Assigned] (SPARK-24203) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-24203: --- Assignee: Nishchal Venkataramana > Make executor's bindAddress configurable > > > Key: SPARK-24203 > URL: https://issues.apache.org/jira/browse/SPARK-24203 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Lukas Majercak >Assignee: Nishchal Venkataramana >Priority: Major > Labels: bulk-closed >
[jira] [Created] (SPARK-29704) Support the combinations of grouping operations
Takeshi Yamamuro created SPARK-29704: Summary: Support the combinations of grouping operations Key: SPARK-29704 URL: https://issues.apache.org/jira/browse/SPARK-29704 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro PgSQL can accept a query below with the combinations of grouping operations, but Spark cannot; {code} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); ERROR: relation "gstest2" already exists postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d); a | b | c | d ---+---+---+--- 1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | | 1 | 2 | | 2 | 1 | | 2 | | | 2 | | | 1 | 1 | 2 | | 2 1 | 1 | | 2 1 | 1 | | 1 2 | 2 | | 2 1 | | | 1 2 | | | 2 1 | | | 2 | | | 2 | | | 1 (18 rows) {code} scala> sql("""select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)""").show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {, ',', '.', '[', 'AND', 'BETWEEN', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUPING', 'HAVING', 'IN', 'INTERSECT', 'IS', 'LIKE', 'LIMIT', NOT, 'OR', 'ORDER', RLIKE, 'MINUS', 'SORT', 'UNION', 'WINDOW', 'WITH', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'}(line 1, pos 61) == SQL == select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d) -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided {code} {code}
[jira] [Reopened] (SPARK-24203) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reopened SPARK-24203: - > Make executor's bindAddress configurable > > > Key: SPARK-24203 > URL: https://issues.apache.org/jira/browse/SPARK-24203 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Lukas Majercak >Priority: Major > Labels: bulk-closed >
[jira] [Commented] (SPARK-29670) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-29670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964599#comment-16964599 ] DB Tsai commented on SPARK-29670: - This is a duplication of SPARK-24203 > Make executor's bindAddress configurable > > > Key: SPARK-29670 > URL: https://issues.apache.org/jira/browse/SPARK-29670 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4 >Reporter: Nishchal Venkataramana >Priority: Major > Fix For: 3.0.0 > >
[jira] [Resolved] (SPARK-29670) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-29670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-29670. - Resolution: Duplicate > Make executor's bindAddress configurable > > > Key: SPARK-29670 > URL: https://issues.apache.org/jira/browse/SPARK-29670 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4 >Reporter: Nishchal Venkataramana >Priority: Major > Fix For: 3.0.0 > >
[jira] [Created] (SPARK-29703) Support grouping() in GROUP BY without GroupingSets/Cube/Rollup
Takeshi Yamamuro created SPARK-29703: Summary: Support grouping() in GROUP BY without GroupingSets/Cube/Rollup Key: SPARK-29703 URL: https://issues.apache.org/jira/browse/SPARK-29703 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro PgSQL can accept the query below that have grouping() in GROUP BY without GroupingSets/Cube/Rollup; {code:java} postgres=# CREATE TABLE onek (unique1 int, unique2 int, two int, four int, ten int, twenty int, hundred int, thousand int, twothousand int, fivethous int, tenthous int, odd int, even int, textu1 text, textu2 text, text4 text); CREATE TABLE postgres=# select ten, grouping(ten) from onek group by (ten) having grouping(ten) >= 0 order by 2,1; ten | grouping -+-- (0 rows) {code} {code:java} scala> sql("""select ten, grouping(ten) from onek group by (ten) having grouping(ten) >= 0 order by 2,1""").show() org.apache.spark.sql.AnalysisException: grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:47) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:46) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:122) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$findGroupingExprs$1.applyOrElse(Analyzer.scala:503) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$findGroupingExprs$1.applyOrElse(Analyzer.scala:497) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:228) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:224) at org.apache.spark.sql.catalyst.trees.TreeNode.collectFirst(TreeNode.scala:202) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$findGroupingExprs(Analyzer.scala:497) {code}
[jira] [Created] (SPARK-29702) Resolve group-by columns with functional dependencies
Takeshi Yamamuro created SPARK-29702: Summary: Resolve group-by columns with functional dependencies Key: SPARK-29702 URL: https://issues.apache.org/jira/browse/SPARK-29702 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro In PgSQL, functional dependencies affect grouping column resolution in an analyzer; {code:java} postgres=# \d gstest3 Table "public.gstest3" Column | Type | Collation | Nullable | Default +-+---+--+- a | integer | | | b | integer | | | c | integer | | | d | integer | | | postgres=# select a, d, grouping(a,b,c) from gstest3 group by grouping sets ((a,b), (a,c)); ERROR: column "gstest3.d" must appear in the GROUP BY clause or be used in an aggregate function LINE 1: select a, d, grouping(a,b,c) from gstest3 group by grouping ... ^ postgres=# alter table gstest3 add primary key (a); ALTER TABLE postgres=# select a, d, grouping(a,b,c) from gstest3 group by grouping sets ((a,b), (a,c)); a | d | grouping ---+---+-- 1 | 1 |1 2 | 2 |1 1 | 1 |2 2 | 2 |2 (4 rows) {code}
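The resolution rule PgSQL applies here: a non-grouped column is still legal in the SELECT list if it is functionally determined by grouped columns (once `a` is the primary key, grouping on `a` determines `d`). A Python sketch of that check, simplified to a single set of grouped columns (the real rule is evaluated per grouping set; all names are illustrative):

```python
def valid_select_columns(select_cols, grouped_cols, primary_keys):
    """Columns resolve if grouped, or functionally determined by a key.

    primary_keys: set of key columns. If every key column is grouped,
    all other columns of the table are functionally determined, so any
    selected column becomes legal.
    """
    keys_grouped = bool(primary_keys) and primary_keys <= set(grouped_cols)
    return [c for c in select_cols if c in grouped_cols or keys_grouped]

# Without a key, d cannot be resolved; with primary key (a) it can.
assert valid_select_columns(["a", "d"], {"a", "b"}, set()) == ["a"]
assert valid_select_columns(["a", "d"], {"a", "b"}, {"a"}) == ["a", "d"]
```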
[jira] [Created] (SPARK-29701) Different answers when empty input given in GROUPING SETS
Takeshi Yamamuro created SPARK-29701:

Summary: Different answers when empty input given in GROUPING SETS
Key: SPARK-29701
URL: https://issues.apache.org/jira/browse/SPARK-29701
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro

A query below with an empty input seems to have different answers between PgSQL and Spark;

{code:java}
postgres=# create table gstest_empty (a integer, b integer, v integer);
CREATE TABLE
postgres=# select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),());
 a | b | sum | count
---+---+-----+-------
   |   |     |     0
(1 row)
{code}

{code:java}
scala> sql("""select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),())""").show
+---+---+------+--------+
|  a|  b|sum(v)|count(1)|
+---+---+------+--------+
+---+---+------+--------+
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
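The behavior PgSQL implements here is that the empty grouping set `()` is a global aggregate, so it yields exactly one row even over zero input rows. A toy evaluator makes this concrete (illustrative code, not the Spark or PgSQL implementation):

```python
# Minimal GROUPING SETS evaluator: each grouping set partitions the rows by
# its key columns; the empty set () is the global aggregate and must produce
# one row (count 0) even when there are no input rows.
def grouping_sets(rows, sets):
    out = []
    for gs in sets:
        groups = {}
        for row in rows:
            groups.setdefault(tuple(row[c] for c in gs), []).append(row)
        if gs == ():
            groups.setdefault((), [])  # global aggregate always yields a row
        for key, grp in groups.items():
            out.append((dict(zip(gs, key)), len(grp)))
    return out

# Empty input, grouping sets ((a,b), ()): one all-NULL row with count(*) = 0,
# which is what PgSQL returns; Spark currently returns no rows at all.
assert grouping_sets([], [("a", "b"), ()]) == [({}, 0)]
```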
[jira] [Created] (SPARK-29700) Support nested grouping sets
Takeshi Yamamuro created SPARK-29700:

Summary: Support nested grouping sets
Key: SPARK-29700
URL: https://issues.apache.org/jira/browse/SPARK-29700
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro

PgSQL can process nested grouping sets, but Spark cannot;

{code:java}
postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer);
ERROR:  relation "gstest2" already exists
postgres=# insert into gstest2 values
postgres-# (1, 1, 1, 1, 1, 1, 1, 1),
postgres-# (1, 1, 1, 1, 1, 1, 1, 2),
postgres-# (1, 1, 1, 1, 1, 1, 2, 2),
postgres-# (1, 1, 1, 1, 1, 2, 2, 2),
postgres-# (1, 1, 1, 1, 2, 2, 2, 2),
postgres-# (1, 1, 1, 2, 2, 2, 2, 2),
postgres-# (1, 1, 2, 2, 2, 2, 2, 2),
postgres-# (1, 2, 2, 2, 2, 2, 2, 2),
postgres-# (2, 2, 2, 2, 2, 2, 2, 2);
INSERT 0 9
postgres=# select sum(c) from gstest2
postgres-# group by grouping sets(grouping sets((a, b)))
postgres-# order by 1 desc;
 sum
-----
  16
   4
   4
(3 rows)
{code}

{code:java}
scala> sql("""
     | select sum(c) from gstest2
     | group by grouping sets(grouping sets((a, b)))
     | order by 1 desc
     | """).show
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'sets' expecting {')', ','}(line 3, pos 34)

== SQL ==
select sum(c) from gstest2
group by grouping sets(grouping sets((a, b)))
----------------------------------^^^
order by 1 desc

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
  ... 51 elided
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
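PgSQL handles nesting by flattening: `grouping sets(grouping sets((a, b)))` is equivalent to `grouping sets((a, b))`, which is consistent with the 3-row result above (one group per distinct (a, b)). A sketch of that flattening step, with a made-up list encoding of the spec:

```python
# Encoding assumption: a spec is a list whose elements are either a tuple of
# columns (a plain grouping, e.g. ("a", "b")) or a list (a nested
# GROUPING SETS(...)). Flattening splices nested sets into the parent.
def flatten(spec):
    out = []
    for item in spec:
        if isinstance(item, list):      # nested GROUPING SETS(...)
            out.extend(flatten(item))
        else:                           # plain grouping
            out.append(item)
    return out

nested = [[("a", "b")]]   # grouping sets(grouping sets((a, b)))
assert flatten(nested) == [("a", "b")]  # same as grouping sets((a, b))
```

Supporting the syntax in Spark would therefore mostly be a parser change plus this flattening during analysis.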
[jira] [Created] (SPARK-29699) Different answers in nested aggregates with window functions
Takeshi Yamamuro created SPARK-29699:

Summary: Different answers in nested aggregates with window functions
Key: SPARK-29699
URL: https://issues.apache.org/jira/browse/SPARK-29699
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro

A nested aggregate below with a window function seems to have different answers in the `rsum` column between PgSQL and Spark;

{code:java}
postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer);
postgres=# insert into gstest2 values
postgres-# (1, 1, 1, 1, 1, 1, 1, 1),
postgres-# (1, 1, 1, 1, 1, 1, 1, 2),
postgres-# (1, 1, 1, 1, 1, 1, 2, 2),
postgres-# (1, 1, 1, 1, 1, 2, 2, 2),
postgres-# (1, 1, 1, 1, 2, 2, 2, 2),
postgres-# (1, 1, 1, 2, 2, 2, 2, 2),
postgres-# (1, 1, 2, 2, 2, 2, 2, 2),
postgres-# (1, 2, 2, 2, 2, 2, 2, 2),
postgres-# (2, 2, 2, 2, 2, 2, 2, 2);
INSERT 0 9
postgres=#
postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b;
 a | b | sum | rsum
---+---+-----+------
 1 | 1 |  16 |   16
 1 | 2 |   4 |   20
 1 |   |  20 |   40
 2 | 2 |   4 |   44
 2 |   |   4 |   48
   |   |  24 |   72
(6 rows)
{code}

{code:java}
scala> sql("""
     | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
     | from gstest2 group by rollup (a,b) order by rsum, a, b
     | """).show()
+----+----+------+----+
|   a|   b|sum(c)|rsum|
+----+----+------+----+
|null|null|    12|  12|
|   1|null|    10|  22|
|   1|   1|     8|  30|
|   1|   2|     2|  32|
|   2|null|     2|  34|
|   2|   2|     2|  36|
+----+----+------+----+
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
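Two things appear to differ here (this is an interpretation of the transcripts, not a confirmed diagnosis): the absolute sums differ because the PgSQL table already held rows from an earlier insert, and the rsum sequence differs because the window's `order by a,b` places the rollup's NULL subtotal rows differently. PgSQL sorts ascending with NULLS LAST by default, Spark with NULLS FIRST, so the running sum visits the grand-total row last in PgSQL but first in Spark (as in its output above, where the null,null row comes first). A sketch of the ordering effect, using PgSQL's doubled sums:

```python
# Rollup output rows (a, b, sum(c)) as PgSQL produced them.
rollup_rows = [
    (1, 1, 16), (1, 2, 4), (1, None, 20),
    (2, 2, 4), (2, None, 4), (None, None, 24),
]

def sort_key(row, nulls_first):
    # Order by (a, b); the boolean places NULLs first or last per column.
    key = []
    for v in row[:2]:
        is_null = v is None
        key.append((is_null != nulls_first, 0 if is_null else v))
    return key

def rsums(nulls_first):
    total, out = 0, []
    for _, _, s in sorted(rollup_rows, key=lambda r: sort_key(r, nulls_first)):
        total += s
        out.append(total)
    return out

assert rsums(nulls_first=False) == [16, 20, 40, 44, 48, 72]  # PgSQL's rsum
assert rsums(nulls_first=True) == [24, 44, 60, 64, 68, 72]   # NULLs-first order
```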
[jira] [Created] (SPARK-29698) Support grouping function with multiple arguments
Takeshi Yamamuro created SPARK-29698:

Summary: Support grouping function with multiple arguments
Key: SPARK-29698
URL: https://issues.apache.org/jira/browse/SPARK-29698
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro

In PgSQL, grouping() can have multiple arguments, but Spark's grouping() must have a single argument ([https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala#L100]);

{code:java}
postgres=# select a, b, grouping(a, b), sum(v), count(*), max(v)
postgres-# from gstest1 group by rollup (a,b);
 a | b | grouping | sum | count | max
---+---+----------+-----+-------+-----
   |   |        3 |     |     0 |
(1 row)
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
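Multi-argument grouping() is just a bit mask over the per-column markers, so it can be expressed in terms of the single-argument form Spark already has (Spark's grouping_id is a similar bit-mask function). A sketch of the semantics:

```python
# grouping(c1, c2, ...) sets bit i (most significant argument first) to 1
# when that column is rolled up in the current grouping.
def grouping_multi(grouped_cols, *cols):
    value = 0
    for col in cols:
        value = (value << 1) | (0 if col in grouped_cols else 1)
    return value

# The PgSQL row above is the rollup's grand total: neither a nor b is
# grouped, so grouping(a, b) = 0b11 = 3.
assert grouping_multi(set(), "a", "b") == 3
assert grouping_multi({"a"}, "a", "b") == 1      # only b rolled up
assert grouping_multi({"a", "b"}, "a", "b") == 0 # fully grouped row
```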
[jira] [Resolved] (SPARK-29686) LinearSVC should persist instances if needed
[ https://issues.apache.org/jira/browse/SPARK-29686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29686. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26344 [https://github.com/apache/spark/pull/26344] > LinearSVC should persist instances if needed > > > Key: SPARK-29686 > URL: https://issues.apache.org/jira/browse/SPARK-29686 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 3.0.0 > > > Current LinearSVC impl forgot to cache the input dataset. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
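The motivation for persisting the input is that an iterative optimizer re-reads the training instances on every pass; without caching, each pass recomputes the upstream pipeline. A toy analogy in Python (a counter stands in for the recomputation cost; this is not the Spark ML internals):

```python
# Each call to load_instances() models recomputing the upstream lineage.
compute_count = 0

def load_instances():
    global compute_count
    compute_count += 1
    return [(1.0, [0.5]), (0.0, [-0.5])]

# Uncached: 10 optimizer iterations recompute the input 10 times.
for _ in range(10):
    load_instances()
assert compute_count == 10

# "Persisted": materialize once, then iterate over the cached copy.
compute_count = 0
cached = load_instances()
for _ in range(10):
    _ = cached
assert compute_count == 1
```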
[jira] [Created] (SPARK-29697) Support bit string types/literals
Takeshi Yamamuro created SPARK-29697:

Summary: Support bit string types/literals
Key: SPARK-29697
URL: https://issues.apache.org/jira/browse/SPARK-29697
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro

In PgSQL, there are bit types and literals;

{code}
postgres=# create table b(b bit(4));
CREATE TABLE
postgres=# select b'0010';
 ?column?
----------
 0010
(1 row)
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
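A bit-string literal such as b'0010' carries both a value and a declared width, and prints back zero-padded to that width. A minimal model of parsing and printing it (illustrative helper, not Spark or PgSQL code):

```python
# Parse a literal of the form b'...' into (integer value, bit width).
def parse_bit_literal(text):
    assert text.startswith("b'") and text.endswith("'")
    bits = text[2:-1]
    assert set(bits) <= {"0", "1"}
    return int(bits, 2), len(bits)

value, width = parse_bit_literal("b'0010'")
assert (value, width) == (2, 4)
# Printing zero-pads to the declared width, matching the PgSQL output above.
assert format(value, "0{}b".format(width)) == "0010"
```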
[jira] [Comment Edited] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964583#comment-16964583 ] Takeshi Yamamuro edited comment on SPARK-27763 at 11/1/19 3:54 AM: --- Thanks for the check, Hyukjin! I've made PRs for limit.sql and groupingset.sql. Also, I'll check the left three tests within a few days. was (Author: maropu): Thanks for the check, Hyukjin! I've made PRs for limit.sql and groupingset.sql. I'll check the left three tests within a few days. > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964583#comment-16964583 ] Takeshi Yamamuro commented on SPARK-27763: -- Thanks for the check, Hyukjin! I've made PRs for limit.sql and groupingset.sql. I'll check the left three tests within a few days. > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29696) Add groupingsets.sql
Takeshi Yamamuro created SPARK-29696: Summary: Add groupingsets.sql Key: SPARK-29696 URL: https://issues.apache.org/jira/browse/SPARK-29696 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29676) ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-29676. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26350 [https://github.com/apache/spark/pull/26350] > ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands > > > Key: SPARK-29676 > URL: https://issues.apache.org/jira/browse/SPARK-29676 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29676) ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-29676: --- Assignee: Huaxin Gao > ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands > > > Key: SPARK-29676 > URL: https://issues.apache.org/jira/browse/SPARK-29676 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29664) Column.getItem behavior is not consistent with Scala version
[ https://issues.apache.org/jira/browse/SPARK-29664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29664. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26351 [https://github.com/apache/spark/pull/26351] > Column.getItem behavior is not consistent with Scala version > > > Key: SPARK-29664 > URL: https://issues.apache.org/jira/browse/SPARK-29664 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > In PySpark, Column.getItem's behavior is different from the Scala version. > For example, > In PySpark: > {code:python} > df = spark.range(2) > map_col = create_map(lit(0), lit(100), lit(1), lit(200)) > df.withColumn("mapped", map_col.getItem(col('id'))).show() > # +---+--+ > # | id|mapped| > # +---+--+ > # | 0| 100| > # | 1| 200| > # +---+--+ > {code} > In Scala: > {code:scala} > val df = spark.range(2) > val map_col = map(lit(0), lit(100), lit(1), lit(200)) > // The following getItem results in the following exception, which is the > right behavior: > // java.lang.RuntimeException: Unsupported literal type class > org.apache.spark.sql.Column id > // at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) > // at org.apache.spark.sql.Column.getItem(Column.scala:856) > // ... 49 elided > df.withColumn("mapped", map_col.getItem(col("id"))).show > // You have to use apply() to match with PySpark's behavior. > df.withColumn("mapped", map_col(col("id"))).show > // +---+--+ > // | id|mapped| > // +---+--+ > // | 0| 100| > // | 1| 200| > // +---+--+ > {code} > Looking at the code for Scala implementation, PySpark's behavior is incorrect > since the argument to getItem becomes `Literal`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
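The shape of the inconsistency can be reconstructed in a few lines (the classes below are stand-ins, not the real Spark API): Scala's Column.getItem wraps its argument in a Literal, which rejects a Column, whereas PySpark's getItem effectively behaves like apply() and accepts one.

```python
class Column:
    def __init__(self, name):
        self.name = name

def literal(value):
    # Models Literal.apply: only constants are accepted.
    if isinstance(value, Column):
        raise TypeError("Unsupported literal type: Column %s" % value.name)
    return value

def get_item_scala_style(key):
    return literal(key)  # Scala path: the key must be a constant

# Passing a Column raises, mirroring the Scala RuntimeException:
try:
    get_item_scala_style(Column("id"))
    raised = False
except TypeError:
    raised = True
assert raised
# Constants still work on both paths:
assert get_item_scala_style(0) == 0
```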
[jira] [Assigned] (SPARK-29664) Column.getItem behavior is not consistent with Scala version
[ https://issues.apache.org/jira/browse/SPARK-29664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29664: Assignee: Terry Kim > Column.getItem behavior is not consistent with Scala version > > > Key: SPARK-29664 > URL: https://issues.apache.org/jira/browse/SPARK-29664 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > In PySpark, Column.getItem's behavior is different from the Scala version. > For example, > In PySpark: > {code:python} > df = spark.range(2) > map_col = create_map(lit(0), lit(100), lit(1), lit(200)) > df.withColumn("mapped", map_col.getItem(col('id'))).show() > # +---+--+ > # | id|mapped| > # +---+--+ > # | 0| 100| > # | 1| 200| > # +---+--+ > {code} > In Scala: > {code:scala} > val df = spark.range(2) > val map_col = map(lit(0), lit(100), lit(1), lit(200)) > // The following getItem results in the following exception, which is the > right behavior: > // java.lang.RuntimeException: Unsupported literal type class > org.apache.spark.sql.Column id > // at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) > // at org.apache.spark.sql.Column.getItem(Column.scala:856) > // ... 49 elided > df.withColumn("mapped", map_col.getItem(col("id"))).show > // You have to use apply() to match with PySpark's behavior. > df.withColumn("mapped", map_col(col("id"))).show > // +---+--+ > // | id|mapped| > // +---+--+ > // | 0| 100| > // | 1| 200| > // +---+--+ > {code} > Looking at the code for Scala implementation, PySpark's behavior is incorrect > since the argument to getItem becomes `Literal`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29694) Execute UDF only once when there are multiple identical UDF usages
[ https://issues.apache.org/jira/browse/SPARK-29694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964571#comment-16964571 ]

Xuedong Luan commented on SPARK-29694:
--------------------------------------

Hi [~yumwang], I will work on this Jira.

> Execute UDF only once when there are multiple identical UDF usages
> ------------------------------------------------------------------
>
> Key: SPARK-29694
> URL: https://issues.apache.org/jira/browse/SPARK-29694
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Priority: Major
>
> Example:
> {code:sql}
> SELECT
>   CASE
>     WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
>     WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
>     WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
>     WHEN col2 IN (28,29) THEN 'Google'
>     WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
>     WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
>     WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL THEN 'Google'
>     WHEN udf1(col1, 'rd') LIKE '%google%' OR udf1(col1, 'rd') LIKE '%gmail%' THEN 'Google'
>     WHEN udf1(col1, 'rd') LIKE '%yahoo%' THEN 'Yahoo'
>     WHEN udf1(col1, 'rd') LIKE '%bing%' THEN 'Bing'
>     WHEN udf1(col1, 'rd') LIKE '%facebook%' THEN 'Facebook'
>     WHEN udf1(col1, 'rd') LIKE '%pinterest%' THEN 'Pinterest'
>     WHEN udf1(col1, 'rd') LIKE '%twitter%' OR udf1(col1, 'rd') LIKE '%t.co' THEN 'Twitter'
>     WHEN udf1(col1, 'rd') LIKE '%baidu%' THEN 'Baidu'
>     WHEN udf1(col1, 'rd') LIKE '%yandex%' THEN 'Yandex'
>     WHEN udf1(col1, 'rd') LIKE '%aol.%' THEN 'AOL'
>     WHEN udf1(col1, 'rd') LIKE '%ask.%' THEN 'Ask'
>     WHEN udf1(col1, 'rd') LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
>     WHEN udf1(col1, 'rd') LIKE '%t-online.de' THEN 'T-Online'
>     WHEN udf1(col1, 'rd') LIKE '%com-kleinanzeigen.%' OR udf1(col1, 'rd') LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
>     WHEN udf1(col1, 'rd') LIKE '%com.%' OR udf1(col1, 'rd') LIKE '%comdesc.%' THEN 'com'
>     WHEN udf1(col1, 'rd') LIKE '%paypal.%' THEN 'PayPal'
>     WHEN udf1(col1, 'rd') IS NULL THEN 'None'
>     ELSE 'Other'
>   END AS source_domain,
>   COUNT(*) AS cnt
> FROM
>   tbl s
> GROUP BY
>   1
> {code}
> We can rewrite it to:
> {code:sql}
> SELECT
>   CASE
>     WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
>     WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
>     WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
>     WHEN col2 IN (28,29) THEN 'Google'
>     WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
>     WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
>     WHEN col2 IN (10,16,18) AND col1 IS NULL THEN 'Google'
>     WHEN col1 LIKE '%google%' OR col1 LIKE '%gmail%' THEN 'Google'
>     WHEN col1 LIKE '%yahoo%' THEN 'Yahoo'
>     WHEN col1 LIKE '%bing%' THEN 'Bing'
>     WHEN col1 LIKE '%facebook%' THEN 'Facebook'
>     WHEN col1 LIKE '%pinterest%' THEN 'Pinterest'
>     WHEN col1 LIKE '%twitter%' OR col1 LIKE '%t.co' THEN 'Twitter'
>     WHEN col1 LIKE '%baidu%' THEN 'Baidu'
>     WHEN col1 LIKE '%yandex%' THEN 'Yandex'
>     WHEN col1 LIKE '%aol.%' THEN 'AOL'
>     WHEN col1 LIKE '%ask.%' THEN 'Ask'
>     WHEN col1 LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
>     WHEN col1 LIKE '%t-online.de' THEN 'T-Online'
>     WHEN col1 LIKE '%com-kleinanzeigen.%' OR col1 LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
>     WHEN col1 LIKE '%com.%' OR col1 LIKE '%comdesc.%' THEN 'com'
>     WHEN col1 LIKE '%paypal.%' THEN 'PayPal'
>     WHEN col1 IS NULL THEN 'None'
>     ELSE 'Other'
>   END AS source_domain,
>   COUNT(*) AS cnt
> FROM
>   (SELECT *, udf1(col1, 'rd') as col1 FROM tbl) s
> GROUP BY
>   1
> {code}
> It would be great if the optimizer framework could perform this rewrite automatically.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29657) Iterator spill supporting radix sort with null prefix
[ https://issues.apache.org/jira/browse/SPARK-29657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated SPARK-29657:
---------------------------
Issue Type: Bug (was: Improvement)

> Iterator spill supporting radix sort with null prefix
> -----------------------------------------------------
>
> Key: SPARK-29657
> URL: https://issues.apache.org/jira/browse/SPARK-29657
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: dzcxzl
> Priority: Trivial
>
> Under radix sort, when insertRecord is called with a null keyPrefix, the iterator returned by getSortedIterator is a ChainedIterator. ChainedIterator currently does not support spilling, so UnsafeExternalSorter ends up holding a large amount of execution memory; allocatePage then fails and throws SparkOutOfMemoryError: Unable to acquire xxx bytes of memory, got 0.
> The following is a log of an error we encountered in the production environment:
> [Executor task launch worker for task 66055] INFO TaskMemoryManager: Memory used in task 66055
> [Executor task launch worker for task 66055] INFO TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@39dd866e: 64.0 KB
> [Executor task launch worker for task 66055] INFO TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@74d17927: 4.6 GB
> [Executor task launch worker for task 66055] INFO TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@31478f9c: 61.0 MB
> [Executor task launch worker for task 66055] INFO TaskMemoryManager: 0 bytes of memory were used by task 66055 but are not associated with specific consumers
> [Executor task launch worker for task 66055] INFO TaskMemoryManager: 4962998749 bytes of memory are used for execution and 2218326 bytes of memory are used for storage
> [Executor task launch worker for task 66055] ERROR Executor: Exception in task 42.3 in stage 29.0 (TID 66055)
> SparkOutOfMemoryError: Unable to acquire 3436 bytes of memory, got 0

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29695) ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964559#comment-16964559 ] Huaxin Gao commented on SPARK-29695: I will work on this > ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands > > > Key: SPARK-29695 > URL: https://issues.apache.org/jira/browse/SPARK-29695 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29695) ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands
Huaxin Gao created SPARK-29695: -- Summary: ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands Key: SPARK-29695 URL: https://issues.apache.org/jira/browse/SPARK-29695 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Huaxin Gao ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29694) Execute UDF only once when there are multiple identical UDF usages
Yuming Wang created SPARK-29694:

Summary: Execute UDF only once when there are multiple identical UDF usages
Key: SPARK-29694
URL: https://issues.apache.org/jira/browse/SPARK-29694
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang

Example:

{code:sql}
SELECT
  CASE
    WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
    WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
    WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
    WHEN col2 IN (28,29) THEN 'Google'
    WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
    WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
    WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL THEN 'Google'
    WHEN udf1(col1, 'rd') LIKE '%google%' OR udf1(col1, 'rd') LIKE '%gmail%' THEN 'Google'
    WHEN udf1(col1, 'rd') LIKE '%yahoo%' THEN 'Yahoo'
    WHEN udf1(col1, 'rd') LIKE '%bing%' THEN 'Bing'
    WHEN udf1(col1, 'rd') LIKE '%facebook%' THEN 'Facebook'
    WHEN udf1(col1, 'rd') LIKE '%pinterest%' THEN 'Pinterest'
    WHEN udf1(col1, 'rd') LIKE '%twitter%' OR udf1(col1, 'rd') LIKE '%t.co' THEN 'Twitter'
    WHEN udf1(col1, 'rd') LIKE '%baidu%' THEN 'Baidu'
    WHEN udf1(col1, 'rd') LIKE '%yandex%' THEN 'Yandex'
    WHEN udf1(col1, 'rd') LIKE '%aol.%' THEN 'AOL'
    WHEN udf1(col1, 'rd') LIKE '%ask.%' THEN 'Ask'
    WHEN udf1(col1, 'rd') LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
    WHEN udf1(col1, 'rd') LIKE '%t-online.de' THEN 'T-Online'
    WHEN udf1(col1, 'rd') LIKE '%com-kleinanzeigen.%' OR udf1(col1, 'rd') LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
    WHEN udf1(col1, 'rd') LIKE '%com.%' OR udf1(col1, 'rd') LIKE '%comdesc.%' THEN 'com'
    WHEN udf1(col1, 'rd') LIKE '%paypal.%' THEN 'PayPal'
    WHEN udf1(col1, 'rd') IS NULL THEN 'None'
    ELSE 'Other'
  END AS source_domain,
  COUNT(*) AS cnt
FROM
  tbl s
GROUP BY
  1
{code}

We can rewrite it to:

{code:sql}
SELECT
  CASE
    WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
    WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
    WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
    WHEN col2 IN (28,29) THEN 'Google'
    WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
    WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
    WHEN col2 IN (10,16,18) AND col1 IS NULL THEN 'Google'
    WHEN col1 LIKE '%google%' OR col1 LIKE '%gmail%' THEN 'Google'
    WHEN col1 LIKE '%yahoo%' THEN 'Yahoo'
    WHEN col1 LIKE '%bing%' THEN 'Bing'
    WHEN col1 LIKE '%facebook%' THEN 'Facebook'
    WHEN col1 LIKE '%pinterest%' THEN 'Pinterest'
    WHEN col1 LIKE '%twitter%' OR col1 LIKE '%t.co' THEN 'Twitter'
    WHEN col1 LIKE '%baidu%' THEN 'Baidu'
    WHEN col1 LIKE '%yandex%' THEN 'Yandex'
    WHEN col1 LIKE '%aol.%' THEN 'AOL'
    WHEN col1 LIKE '%ask.%' THEN 'Ask'
    WHEN col1 LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
    WHEN col1 LIKE '%t-online.de' THEN 'T-Online'
    WHEN col1 LIKE '%com-kleinanzeigen.%' OR col1 LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
    WHEN col1 LIKE '%com.%' OR col1 LIKE '%comdesc.%' THEN 'com'
    WHEN col1 LIKE '%paypal.%' THEN 'PayPal'
    WHEN col1 IS NULL THEN 'None'
    ELSE 'Other'
  END AS source_domain,
  COUNT(*) AS cnt
FROM
  (SELECT *, udf1(col1, 'rd') as col1 FROM tbl) s
GROUP BY
  1
{code}

It would be great if the optimizer framework could perform this rewrite automatically.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
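The payoff of the proposed rewrite is common-subexpression elimination for the repeated udf1(col1, 'rd') call: it is evaluated once per row instead of once per CASE branch. A toy demonstration with a call counter (illustrative code, not the Catalyst rule):

```python
# `calls` stands in for the per-row cost of evaluating the UDF.
calls = 0

def udf1(value, mode):
    global calls
    calls += 1
    return (value or "") + ":" + mode

row_col1 = "ads.google.com"

# Naive plan: the identical call appears in many branches and runs each time.
naive = [udf1(row_col1, "rd") for _ in range(5)]
assert calls == 5

# Rewritten plan: compute once (the derived column), reuse it in every branch.
calls = 0
rd = udf1(row_col1, "rd")
rewritten = [rd for _ in range(5)]
assert calls == 1
assert naive == rewritten  # same results, one evaluation
```

Note this hoisting is only safe when the UDF is deterministic, which is presumably a precondition for any such optimizer rule.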
[jira] [Commented] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory
[ https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964529#comment-16964529 ] Dongjoon Hyun commented on SPARK-23643: --- This causes the UI test result difference between Apache Spark 3.0 and 2.4. > XORShiftRandom.hashSeed allocates unnecessary memory > > > Key: SPARK-23643 > URL: https://issues.apache.org/jira/browse/SPARK-23643 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > The hashSeed method allocates 64 bytes buffer and puts only 8 bytes of the > seed parameter into it. Other bytes are always zero and could be easily > excluded from hash calculation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
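The original report is about wasted work: hashSeed writes the 8-byte seed into a 64-byte buffer and hashes all 64 bytes, of which 56 are always zero. A Python stand-in for the byte layout (not the Scala code itself):

```python
import struct

seed = 0x5DEECE66D
packed = struct.pack(">q", seed)    # the 8 bytes that actually vary
padded = packed + b"\x00" * 56      # what the old 64-byte buffer held

assert len(packed) == 8 and len(padded) == 64
assert padded[8:] == b"\x00" * 56   # constant tail: contributes no entropy
# Hashing only `packed` covers every byte that can change with the seed,
# which is why the trailing zeros could be excluded from the calculation.
```

Because the bytes fed to the hash change, the derived seed changes too, which is why this fix altered expected results (and hence the release-notes label).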
[jira] [Updated] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory
[ https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23643: -- Priority: Major (was: Trivial) > XORShiftRandom.hashSeed allocates unnecessary memory > > > Key: SPARK-23643 > URL: https://issues.apache.org/jira/browse/SPARK-23643 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > The hashSeed method allocates 64 bytes buffer and puts only 8 bytes of the > seed parameter into it. Other bytes are always zero and could be easily > excluded from hash calculation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory
[ https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23643: -- Labels: release-notes (was: ) > XORShiftRandom.hashSeed allocates unnecessary memory > > > Key: SPARK-23643 > URL: https://issues.apache.org/jira/browse/SPARK-23643 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Trivial > Labels: release-notes > Fix For: 3.0.0 > > > The hashSeed method allocates 64 bytes buffer and puts only 8 bytes of the > seed parameter into it. Other bytes are always zero and could be easily > excluded from hash calculation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory
[ https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964528#comment-16964528 ] Dongjoon Hyun commented on SPARK-23643: --- I added `release-note` label because this changes the seed and expected result. cc [~jiangxb1987] and [~smilegator] > XORShiftRandom.hashSeed allocates unnecessary memory > > > Key: SPARK-23643 > URL: https://issues.apache.org/jira/browse/SPARK-23643 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Trivial > Labels: release-notes > Fix For: 3.0.0 > > > The hashSeed method allocates 64 bytes buffer and puts only 8 bytes of the > seed parameter into it. Other bytes are always zero and could be easily > excluded from hash calculation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
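The issue description says hashSeed fills a 64-byte buffer but only the first 8 bytes carry the seed; the remaining 56 bytes are constant zeros, so hashing them is wasted work. A minimal pure-Python sketch of that layout (using MD5 as a stand-in for Spark's actual MurmurHash3, purely for illustration) also shows why trimming the padding changes the produced hash values, which is why the fix carries a release-notes label:

```python
import hashlib
import struct

def hash_seed_padded(seed: int) -> str:
    # Mimics the wasteful layout: 8 seed bytes followed by 56 zero bytes.
    buf = struct.pack(">q", seed) + b"\x00" * 56
    return hashlib.md5(buf).hexdigest()

def hash_seed_compact(seed: int) -> str:
    # Hash only the 8 meaningful bytes of the seed.
    return hashlib.md5(struct.pack(">q", seed)).hexdigest()

# Padding bytes are identical for every seed, so they add no entropy,
# but removing them changes the hash output (hence the release note).
assert hash_seed_padded(42) != hash_seed_compact(42)
```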
[jira] [Commented] (SPARK-29693) Bucket map join if the one's bucket number is the multiple of the other
[ https://issues.apache.org/jira/browse/SPARK-29693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964491#comment-16964491 ] Yuming Wang commented on SPARK-29693: - cc [~gwang3] > Bucket map join if the one's bucket number is the multiple of the other > --- > > Key: SPARK-29693 > URL: https://issues.apache.org/jira/browse/SPARK-29693 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29693) Bucket map join if the one's bucket number is the multiple of the other
Yuming Wang created SPARK-29693: --- Summary: Bucket map join if the one's bucket number is the multiple of the other Key: SPARK-29693 URL: https://issues.apache.org/jira/browse/SPARK-29693 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29693) Bucket map join if the one's bucket number is the multiple of the other
[ https://issues.apache.org/jira/browse/SPARK-29693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964490#comment-16964490 ] Yuming Wang commented on SPARK-29693: - https://data-flair.training/blogs/bucket-map-join/ > Bucket map join if the one's bucket number is the multiple of the other > --- > > Key: SPARK-29693 > URL: https://issues.apache.org/jira/browse/SPARK-29693 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
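The proposed improvement hinges on a property of consistent hash bucketing: when one table's bucket count is a multiple of the other's, every bucket of the larger table maps to exactly one bucket of the smaller table, so buckets can still be joined pairwise without a shuffle. A small sketch (plain modulo hashing, not Spark's actual Murmur3 bucketing) illustrates the invariant:

```python
def bucket_id(key_hash: int, num_buckets: int) -> int:
    # Simplified bucketing: Spark hashes the bucket columns with Murmur3,
    # but any consistent hash taken mod num_buckets shows the same property.
    return key_hash % num_buckets

# With 8 buckets on one side and 4 on the other (8 is a multiple of 4),
# a row's 8-bucket id fully determines its 4-bucket id: b_small = b_big % 4.
for h in range(1000):
    assert bucket_id(h, 8) % 4 == bucket_id(h, 4)
```

So bucket b of the 8-bucket table only ever needs to be joined against bucket b % 4 of the 4-bucket table.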
[jira] [Commented] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964489#comment-16964489 ] Hyukjin Kwon commented on SPARK-29625: -- It needs investigation. Can you share the codes you ran? > Spark Structure Streaming Kafka Wrong Reset Offset twice > > > Key: SPARK-29625 > URL: https://issues.apache.org/jira/browse/SPARK-29625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Sandish Kumar HN >Priority: Major > > Spark Structure Streaming Kafka Reset Offset twice, once with right offsets > and second time with very old offsets > {code} > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-151 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-118 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-85 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 122677634. 
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-19 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 120504922.* > [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO ContextCleaner: Cleaned accumulator 810 > {code} > which is causing a Data loss issue. > {code} > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, > runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > java.lang.IllegalStateException: Partition topic-52's offset was changed from > 122677598 to 120504922, some data may have been missed. > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have > been lost because they are not available in Kafka any more; either the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out > by Kafka or the topic may have been deleted before all the data in the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was > processed. If you don't want your streaming query to fail on such cases, set > the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option > "failOnDataLoss" to "false". 
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:259) > [2019-10-28 19:27:40,351] \{bash_
[jira] [Commented] (SPARK-12806) Support SQL expressions extracting values from VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964481#comment-16964481 ] John Bauer commented on SPARK-12806: Also, when using PyArrow to convert a Spark DataFrame for use in a pandas_udf, as soon as a VectorUDT is encountered it reverts to a non-optimized conversion, losing much of the advantage of using PyArrow. > Support SQL expressions extracting values from VectorUDT > > > Key: SPARK-12806 > URL: https://issues.apache.org/jira/browse/SPARK-12806 > Project: Spark > Issue Type: Improvement > Components: MLlib, SQL >Affects Versions: 1.6.0 >Reporter: Feynman Liang >Priority: Major > Labels: bulk-closed > > Use cases exist where a specific index within a {{VectorUDT}} column of a > {{DataFrame}} is required. For example, we may be interested in extracting a > specific class probability from the {{probabilityCol}} of a > {{LogisticRegression}} to compute losses. However, if {{probability}} is a > column of {{df}} with type {{VectorUDT}}, the following code fails: > {code} > df.select("probability.0") > AnalysisException: u"Can't extract value from probability" > {code} > thrown from > {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala}}. > {{VectorUDT}} essentially wraps a {{StructType}}, hence one would expect it > to support value extraction Expressions in an analogous way. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12806) Support SQL expressions extracting values from VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964475#comment-16964475 ] John Bauer commented on SPARK-12806: This is still a problem. For example, classification models emit probability as a VectorUDT, which are unusable in PySpark. This makes constructing boosting/bagging algorithms or even just using them as additional features in a second model problematic. > Support SQL expressions extracting values from VectorUDT > > > Key: SPARK-12806 > URL: https://issues.apache.org/jira/browse/SPARK-12806 > Project: Spark > Issue Type: Improvement > Components: MLlib, SQL >Affects Versions: 1.6.0 >Reporter: Feynman Liang >Priority: Major > Labels: bulk-closed > > Use cases exist where a specific index within a {{VectorUDT}} column of a > {{DataFrame}} is required. For example, we may be interested in extracting a > specific class probability from the {{probabilityCol}} of a > {{LogisticRegression}} to compute losses. However, if {{probability}} is a > column of {{df}} with type {{VectorUDT}}, the following code fails: > {code} > df.select("probability.0") > AnalysisException: u"Can't extract value from probability" > {code} > thrown from > {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala}}. > {{VectorUDT}} essentially wraps a {{StructType}}, hence one would expect it > to support value extraction Expressions in an analogous way. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
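Until `probability.0`-style extraction is supported natively, a common workaround is to wrap a plain element extractor in a UDF. The extractor below is ordinary Python (it assumes only that the value is indexable, as pyspark.ml.linalg vectors are), so it is shown standalone; the commented lines sketch how one would register it in PySpark and are not exercised here:

```python
def element_at_vector(v, i):
    """Return element i of an ML-style vector as a float, or None if absent.

    Works on anything indexable the way pyspark.ml.linalg vectors are.
    """
    try:
        return float(v[i])
    except (IndexError, TypeError):
        return None

# Sketch of the PySpark registration (assumed usage, untested here):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import DoubleType
# prob_1 = udf(lambda v: element_at_vector(v, 1), DoubleType())
# df = df.withColumn("p1", prob_1("probability"))
```

Note the UDF route still pays the serialization cost the comments above describe; it is a workaround, not a fix.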
[jira] [Resolved] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-29687. -- Fix Version/s: 3.0.0 Assignee: ulysses you Resolution: Fixed Resolved by https://github.com/apache/spark/pull/26346 > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.0.0 > > > The JDBC metrics counter var is an Int type that may overflow. Change it to > Long type. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
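The overflow risk is easy to see by emulating JVM Int wraparound. Python's own ints are arbitrary precision, so ctypes is used here to model a 32-bit counter; this is an illustration of the failure mode, not Spark's code:

```python
import ctypes

def as_int32(n: int) -> int:
    # Emulate a JVM Int: wraps around at 2**31 instead of growing.
    return ctypes.c_int32(n).value

MAX_INT = 2**31 - 1
assert as_int32(MAX_INT) == MAX_INT
assert as_int32(MAX_INT + 1) == -(2**31)  # one more row and the counter goes negative

# A Long counter leaves ample headroom for realistic row counts:
assert MAX_INT + 1 < 2**63 - 1
```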
[jira] [Created] (SPARK-29692) SparkContext.defaultParallelism should reflect resource limits when resource limits are set
Bago Amirbekian created SPARK-29692: --- Summary: SparkContext.defaultParallelism should reflect resource limits when resource limits are set Key: SPARK-29692 URL: https://issues.apache.org/jira/browse/SPARK-29692 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Bago Amirbekian With the new GPU/FPGA support in Spark, defaultParallelism may not be computed correctly. Specifically, defaultParallelism may be much higher than the total possible concurrent tasks if, for example, workers have many more cores than GPUs. Steps to reproduce: Start a cluster with spark.executor.resource.gpu.amount < cores per executor. Set spark.task.resource.gpu.amount = 1. Keep cores per task as 1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
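An illustrative model of the mismatch (not Spark's actual scheduler code): the number of concurrently runnable tasks per executor is bounded by the scarcest resource, while a cores-only defaultParallelism overstates it.

```python
def tasks_per_executor(executor_cores: int, cores_per_task: int,
                       executor_gpus: int, gpus_per_task: float) -> int:
    # True concurrency is limited by the scarcest resource, not cores alone.
    by_cpu = executor_cores // cores_per_task
    by_gpu = int(executor_gpus // gpus_per_task) if gpus_per_task else by_cpu
    return min(by_cpu, by_gpu)

# Reproduction shape from the report: 8 cores, 1 core per task, but only
# 2 GPUs with spark.task.resource.gpu.amount = 1:
assert tasks_per_executor(8, 1, 2, 1) == 2  # actual concurrent tasks
assert 8 // 1 == 8                          # what cores alone would suggest
```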
[jira] [Updated] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Bauer updated SPARK-29691: --- Description: Estimator `fit` method is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. (The copy method that interacts with Java is actually implemented in Params.) For example, this prints Before: 0.8 After: 0.8 but After should be 0.75 {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam")) {code} was: Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. 
For example, this prints Before: 0.8 After: 0.8 but After should be 0.75 {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam")) {code} > Estimator fit method fails to copy params (in PySpark) > -- > > Key: SPARK-29691 > URL: https://issues.apache.org/jira/browse/SPARK-29691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: John Bauer >Priority: Minor > > Estimator `fit` method is supposed to copy a dictionary of params, > overwriting the estimator's previous values, before fitting the model. > However, the parameter values are not updated. This was observed in PySpark, > but may be present in the Java objects, as the PySpark code appears to be > functioning correctly. (The copy method that interacts with Java is > actually implemented in Params.) 
> For example, this prints > Before: 0.8 > After: 0.8 > but After should be 0.75 > {code:python} > from pyspark.ml.classification import LogisticRegression > # Load training data > training = spark \ > .read \ > .format("libsvm") \ > .load("data/mllib/sample_multiclass_classification_data.txt") > lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) > print("Before:", lr.getOrDefault("elasticNetParam")) > # Fit the model, but with an updated parameter setting: > lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) > print("After:", lr.getOrDefault("elasticNetParam")) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Bauer updated SPARK-29691: --- Description: Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. For example, this prints Before: 0.8 After: 0.8 but After should be 0.75 {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam")) {code} was: Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. 
For example, this prints {{Before: 0.8 After: 0.8}} but After should be 0.75 {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam")) {code} > Estimator fit method fails to copy params (in PySpark) > -- > > Key: SPARK-29691 > URL: https://issues.apache.org/jira/browse/SPARK-29691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: John Bauer >Priority: Minor > > Estimator `fit` method (implemented in Params) is supposed to copy a > dictionary of params, overwriting the estimator's previous values, before > fitting the model. However, the parameter values are not updated. This was > observed in PySpark, but may be present in the Java objects, as the PySpark > code appears to be functioning correctly. > For example, this prints > Before: 0.8 > After: 0.8 > but After should be 0.75 > {code:python} > from pyspark.ml.classification import LogisticRegression > # Load training data > training = spark \ > .read \ > .format("libsvm") \ > .load("data/mllib/sample_multiclass_classification_data.txt") > lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) > print("Before:", lr.getOrDefault("elasticNetParam")) > # Fit the model, but with an updated parameter setting: > lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) > print("After:", lr.getOrDefault("elasticNetParam")) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Bauer updated SPARK-29691: --- Description: Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. For example, this prints {{Before: 0.8 After: 0.8}} but After should be 0.75 {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam")) {code} was: Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. 
For example, this prints {{Before: 0.8 After: 0.8}} but After should be 0.75 {{from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam"))}} > Estimator fit method fails to copy params (in PySpark) > -- > > Key: SPARK-29691 > URL: https://issues.apache.org/jira/browse/SPARK-29691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: John Bauer >Priority: Minor > > Estimator `fit` method (implemented in Params) is supposed to copy a > dictionary of params, overwriting the estimator's previous values, before > fitting the model. However, the parameter values are not updated. This was > observed in PySpark, but may be present in the Java objects, as the PySpark > code appears to be functioning correctly. > For example, this prints > {{Before: 0.8 > After: 0.8}} > but After should be 0.75 > {code:python} > from pyspark.ml.classification import LogisticRegression > # Load training data > training = spark \ > .read \ > .format("libsvm") \ > .load("data/mllib/sample_multiclass_classification_data.txt") > lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) > print("Before:", lr.getOrDefault("elasticNetParam")) > # Fit the model, but with an updated parameter setting: > lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) > print("After:", lr.getOrDefault("elasticNetParam")) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
John Bauer created SPARK-29691: -- Summary: Estimator fit method fails to copy params (in PySpark) Key: SPARK-29691 URL: https://issues.apache.org/jira/browse/SPARK-29691 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Reporter: John Bauer Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. For example, this prints {{Before: 0.8 After: 0.8}} but After should be 0.75 {{from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam"))}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
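A possible explanation, hedged since the root cause is not confirmed in the report: PySpark expects the {{params}} dict to be keyed by Param objects (e.g. {{lr.elasticNetParam}}) rather than strings, so a string-keyed override may be silently dropped; and the documented contract is that {{fit}} applies overrides to a *copy*, so the estimator's own value printing 0.8 afterwards may even be expected. A minimal pure-Python model of that copy-then-fit contract (hypothetical, not pyspark's implementation):

```python
class Estimator:
    """Toy model of pyspark.ml's copy-then-fit contract (NOT pyspark code)."""

    def __init__(self, **values):
        self._paramMap = dict(values)

    def copy(self, extra=None):
        # Copy the estimator, overwriting params with `extra`.
        clone = Estimator(**self._paramMap)
        clone._paramMap.update(extra or {})
        return clone

    def fit(self, dataset, params=None):
        # The contract: overrides apply to a copy, leaving self untouched.
        est = self.copy(params) if params else self
        # Stand-in for training: return the params the "model" was fit with.
        return est._paramMap

lr = Estimator(elasticNetParam=0.8)
fitted = lr.fit(None, params={"elasticNetParam": 0.75})
assert fitted["elasticNetParam"] == 0.75       # override reaches the fit
assert lr._paramMap["elasticNetParam"] == 0.8  # estimator itself unchanged
```

Under this contract, the reported symptom to check is whether the *fitted model* saw 0.75, not whether the estimator's stored value changed.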
[jira] [Comment Edited] (SPARK-29415) Stage Level Sched: Add base ResourceProfile and Request classes
[ https://issues.apache.org/jira/browse/SPARK-29415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964360#comment-16964360 ] Thomas Graves edited comment on SPARK-29415 at 10/31/19 8:03 PM: - More details: TaskResourceRequest - this supports taking a resourceName and an amount (Double for fractional resources). It only supports cpus (spark.task.cpus) and accelerator resource types (spark.*.resource.[resourceName].*). So users can specify cpus and resources like GPUs and FPGAs. The accelerator-type resources match what we already have for configs in the accelerator-aware scheduling. [https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview] ExecutorResourceRequest - this supports specifying the requirements for the executors. It supports all the configs needed for accelerator-aware scheduling - {{spark.\{executor/driver}.resource.\{resourceName}.\{amount, discoveryScript, vendor} as well as heap memory, overhead memory, pyspark memory, and cores. In order to support memory types we added a "units" parameter into ExecutorResourceRequest. The other parameters resourceName, vendor, discoveryScript, amount all match the accelerator-aware scheduling parameters.}} ResourceProfile - this class takes in the executor and task requirements and holds them to be used by other components. For instance, we have to pass the executor resources into the cluster manager so it can ask for the proper containers. The requests also have to be passed into the executors when launched so they use the correct discovery script. The task requirements are used by the scheduler to assign tasks to proper containers. {{ We also have a ResourceProfile object that has an accessor to get the default ResourceProfile. This is the profile generated from the configs the user passes in when the Spark application is submitted.
So it will have the --executor-cores, memory, overhead memory, pyspark memory, and accelerator resources the user specified via --conf or a properties file on submit. The default profile will be used in a lot of places since the user may never specify another ResourceProfile and wants an easy way to access it.}} was (Author: tgraves): More details: TaskResourceRequest - this supports taking a resourceName and an amount (Double for fractional resources). It only supports cpus (spark.task.cpus) and accelerator resource types (spark.*.resource.[resourceName].*). So users can specify cpus and resources like GPUs and FPGAs. The accelerator-type resources match what we already have for configs in the accelerator-aware scheduling. [https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview] ExecutorResourceRequest - this supports specifying the requirements for the executors. It supports all the configs needed for accelerator-aware scheduling - {{spark.\{executor/driver}.resource.\{resourceName}.\{amount, discoveryScript, vendor} as well as heap memory, overhead memory, pyspark memory, and cores. In order to support memory types we added a "units" parameter into ExecutorResourceRequest. The other parameters resourceName, vendor, discoveryScript, amount all match the accelerator-aware scheduling parameters.}} ResourceProfile - this class takes in the executor and task requirements and holds them to be used by other components. For instance, we have to pass the executor resources into the cluster manager so it can ask for the proper containers. The requests also have to be passed into the executors when launched so they use the correct discovery script. The task requirements are used by the scheduler to assign tasks to proper containers. {{ We also have a ResourceProfile object that has an accessor to get the default ResourceProfile.
This is the profile generated from the configs the user passes in when the spark application is submitted. So it will have --executor-cores, memory, overhead memory, pyspark memory, accelerator resources the user all specified via --confs or properties file on submit.}} > Stage Level Sched: Add base ResourceProfile and Request classes > --- > > Key: SPARK-29415 > URL: https://issues.apache.org/jira/browse/SPARK-29415 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > this is just to add initial ResourceProfile, ExecutorResourceRequest and > taskResourceRequest classes that are used by the other parts of the code. > Initially we will have them private until we have other pieces in place. -- This message was sent by Atlassian Jir
[jira] [Comment Edited] (SPARK-29415) Stage Level Sched: Add base ResourceProfile and Request classes
[ https://issues.apache.org/jira/browse/SPARK-29415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964360#comment-16964360 ] Thomas Graves edited comment on SPARK-29415 at 10/31/19 7:59 PM: - More details: TaskResourceRequest - this supports taking a resourceName and an amount (Double for fractional resources). It only supports cpus (spark.task.cpus) and accelerator resource types (spark.*.resource.[resourceName].*). So users can specify cpus and resources like GPUs and FPGAs. The accelerator-type resources match what we already have for configs in the accelerator-aware scheduling. [https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview] ExecutorResourceRequest - this supports specifying the requirements for the executors. It supports all the configs needed for accelerator-aware scheduling - {{spark.\{executor/driver}.resource.\{resourceName}.\{amount, discoveryScript, vendor} as well as heap memory, overhead memory, pyspark memory, and cores. In order to support memory types we added a "units" parameter into ExecutorResourceRequest. The other parameters resourceName, vendor, discoveryScript, amount all match the accelerator-aware scheduling parameters.}} ResourceProfile - this class takes in the executor and task requirements and holds them to be used by other components. For instance, we have to pass the executor resources into the cluster manager so it can ask for the proper containers. The requests also have to be passed into the executors when launched so they use the correct discovery script. The task requirements are used by the scheduler to assign tasks to proper containers. {{ We also have a ResourceProfile object that has an accessor to get the default ResourceProfile. This is the profile generated from the configs the user passes in when the Spark application is submitted.
So it will have the --executor-cores, memory, overhead memory, pyspark memory, and accelerator resources the user specified via --conf or a properties file on submit.}} was (Author: tgraves): More details: TaskResourceRequest - this supports taking a resourceName and an amount (Double for fractional resources). It only supports cpus (spark.task.cpus) and accelerator resource types (spark.*.resource.[resourceName].*). So users can specify cpus and resources like GPUs and FPGAs. The accelerator-type resources match what we already have for configs in the accelerator-aware scheduling. [https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview] ExecutorResourceRequest - this supports specifying the requirements for the executors. It supports all the configs needed for accelerator-aware scheduling - {{spark.\{executor/driver}.resource.\{resourceName}.\{amount, discoveryScript, vendor} as well as heap memory, overhead memory, pyspark memory, and cores. In order to support memory types we added a "units" parameter into ExecutorResourceRequest. The other parameters resourceName, vendor, discoveryScript, amount all match the accelerator-aware scheduling parameters.}} ResourceProfile - this class takes in the executor and task requirements and holds them to be used by other components. For instance, we have to pass the executor resources into the cluster manager so it can ask for the proper containers. The requests also have to be passed into the executors when launched so they use the correct discovery script. The task requirements are used by the scheduler to assign tasks to proper containers. {{ We also have a ResourceProfile object that has an accessor to get the default ResourceProfile. This is the profile generated from the configs the user passes in when the Spark application is submitted.
So it will have --executor-cores, memory, overhead memory, pyspark memory, accelerator resources the user all specified via --confs or properties file on submit.}} > Stage Level Sched: Add base ResourceProfile and Request classes > --- > > Key: SPARK-29415 > URL: https://issues.apache.org/jira/browse/SPARK-29415 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > this is just to add initial ResourceProfile, ExecutorResourceRequest and > taskResourceRequest classes that are used by the other parts of the code. > Initially we will have them private until we have other pieces in place. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache
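The classes described in the comment above could look roughly like this. This is a hypothetical Python sketch of the data they carry, not Spark's actual Scala API; the field names beyond those mentioned in the comment (and the builder methods) are assumptions for illustration:

```python
# Hypothetical sketch of the request/profile classes described above --
# not Spark's actual API, just an illustration of the data they carry.
from dataclasses import dataclass, field

@dataclass
class TaskResourceRequest:
    resource_name: str   # e.g. "cpus" or an accelerator type like "gpu"
    amount: float        # a Double in the comment, so fractions are allowed

@dataclass
class ExecutorResourceRequest:
    resource_name: str          # e.g. "memory", "cores", "gpu"
    amount: int
    units: str = ""             # added to support memory sizes, per the comment
    discovery_script: str = ""  # matches accelerator-aware scheduling params
    vendor: str = ""

@dataclass
class ResourceProfile:
    """Holds executor and task requirements for other components to consume."""
    executor_resources: dict = field(default_factory=dict)
    task_resources: dict = field(default_factory=dict)

    def require_executor(self, req: ExecutorResourceRequest) -> "ResourceProfile":
        self.executor_resources[req.resource_name] = req
        return self

    def require_task(self, req: TaskResourceRequest) -> "ResourceProfile":
        self.task_resources[req.resource_name] = req
        return self

# A profile asking for executors with 2 GPUs and tasks needing 0.5 GPU each
# (the discovery-script path is a made-up placeholder).
profile = (ResourceProfile()
           .require_executor(ExecutorResourceRequest("gpu", 2,
                                                     discovery_script="/opt/find_gpus.sh"))
           .require_task(TaskResourceRequest("gpu", 0.5)))
```

The cluster manager would read `executor_resources` to size containers, while the scheduler would read `task_resources` to pack tasks onto them, mirroring the division of responsibilities described above.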
[jira] [Commented] (SPARK-29415) Stage Level Sched: Add base ResourceProfile and Request classes
[ https://issues.apache.org/jira/browse/SPARK-29415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964360#comment-16964360 ] Thomas Graves commented on SPARK-29415: --- More details: TaskResourceRequest - this supports taking a resourceName and an amount (Double for fractional resources). It only supports cpus (spark.task.cpus) and accelerator resource types (spark.*.resource.[resourceName].*), so a user can specify cpus and resources like GPUs and FPGAs. The accelerator-type resources match what we already have for configs in accelerator-aware scheduling. [https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview] ExecutorResourceRequest - this supports specifying the requirements for the executors. It supports all the configs needed for accelerator-aware scheduling - {{spark.\{executor/driver}.resource.\{resourceName}.\{amount, discoveryScript, vendor} as well as heap memory, overhead memory, pyspark memory, and cores. In order to support memory types we added a "units" parameter to ExecutorResourceRequest. The other parameters (resourceName, vendor, discoveryScript, amount) all match the accelerator-aware scheduling parameters.}} ResourceProfile - this class takes in the executor and task requirements and holds them to be used by other components. For instance, we have to pass the executor resources to the cluster manager so it can ask for the proper containers. The requests also have to be passed to the executors when launched so they use the correct discovery script. The task requirements are used by the scheduler to assign tasks to the proper containers. {{ We also have a ResourceProfile object that has an accessor to get the default ResourceProfile. This is the profile generated from the configs the user passes in when the spark application is submitted. 
So it will have --executor-cores, memory, overhead memory, pyspark memory, accelerator resources the user all specified via --confs or properties file on submit.}} > Stage Level Sched: Add base ResourceProfile and Request classes > --- > > Key: SPARK-29415 > URL: https://issues.apache.org/jira/browse/SPARK-29415 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > this is just to add initial ResourceProfile, ExecutorResourceRequest and > taskResourceRequest classes that are used by the other parts of the code. > Initially we will have them private until we have other pieces in place. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29673) upgrade jenkins pypy to PyPy3.6 v7.2.0
[ https://issues.apache.org/jira/browse/SPARK-29673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964359#comment-16964359 ] Shane Knapp commented on SPARK-29673: - pypy3.6-7.2.0-linux_x86_64-portable has been installed on the centos workers, and I'm testing with https://github.com/apache/spark/pull/26330. The Ubuntu workers will be updated later today. > upgrade jenkins pypy to PyPy3.6 v7.2.0 > -- > > Key: SPARK-29673 > URL: https://issues.apache.org/jira/browse/SPARK-29673 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22579) BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be implemented using streaming
[ https://issues.apache.org/jira/browse/SPARK-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964281#comment-16964281 ] Imran Rashid commented on SPARK-22579: -- Sorry I had not noticed this issue before. I agree that there is an inefficiency here: if you did this streaming you could pipeline fetching the data w/ computing on the data. The existing changes you point to solve the memory footprint, by fetching to disk, but don't actually pipeline the computation. That said, this isn't easy to fix. You need to touch a lot of core stuff in the network layers, and as you said it gets trickier with handling failures (you have to throw out all partial work in the current task). You'll probably still see a discrepancy between runtimes when running locally vs. remote. Best case, you'd get a 2x speedup with this change. In your use case, that would still be ~40 seconds vs. 4 minutes. > BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be > implemented using streaming > -- > > Key: SPARK-22579 > URL: https://issues.apache.org/jira/browse/SPARK-22579 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Affects Versions: 2.1.0 >Reporter: Eyal Farago >Priority: Major > > when an RDD partition is cached on an executor but the task requiring it is > running on another executor (process locality ANY), the cached partition is > fetched via BlockManager.getRemoteValues which delegates to > BlockManager.getRemoteBytes; both calls are blocking. > in my use case I had a 700GB RDD spread over 1000 partitions on a 6-node > cluster, cached to disk. Rough math shows that the average partition size is > 700MB. 
> Looking at the Spark UI it was obvious that tasks running with process locality > 'ANY' are much slower than local tasks (~40 seconds to 8-10 minutes ratio), I > was able to capture thread dumps of executors executing remote tasks and got > this stack trace: > {quote}Thread ID Thread Name Thread StateThread Locks > 1521 Executor task launch worker-1000WAITING > Lock(java.util.concurrent.ThreadPoolExecutor$Worker@196462978}) > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:190) > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190) > org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104) > org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:582) > org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:550) > org.apache.spark.storage.BlockManager.get(BlockManager.scala:638) > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:690) > org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > org.apache.spark.rdd.RDD.iterator(RDD.scala:287){quote} > Digging into the code showed that the block manager first fetches all bytes > (getRemoteBytes) and then wraps them with a deserialization stream; this has > several drawbacks: > 1. blocking: the requesting executor is blocked while the remote executor is > serving the block. > 2. potentially large memory footprint on the requesting executor; in my use case > 700MB of raw bytes stored in a ChunkedByteBuffer. > 3. inefficient: the requesting side usually doesn't need all values at once, as it > consumes the values via an iterator. > 4. potentially large memory footprint on the serving executor: in case the block > is cached in deserialized form the serving executor has to serialize it into > a ChunkedByteBuffer (BlockManager
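The fetch-then-deserialize pattern criticized in the drawbacks above, versus the streaming alternative the ticket asks for, can be illustrated with a toy sketch. This is plain Python over a fake chunked "network" (an iterator of byte chunks) and has nothing to do with Spark's actual network layer; it only shows why streaming lets consumption start before the whole block has arrived:

```python
# Toy illustration: buffering a whole remote block before deserializing
# vs. streaming records out as chunks arrive. The "network" is just an
# iterator of byte chunks; the record format is a made-up length-prefixed
# pickle encoding, not Spark's serialization.
import io
import pickle

def serialize_block(values):
    """Encode each value as a 4-byte big-endian length prefix + pickle payload."""
    buf = io.BytesIO()
    for v in values:
        payload = pickle.dumps(v)
        buf.write(len(payload).to_bytes(4, "big"))
        buf.write(payload)
    return buf.getvalue()

def stream_deserialize(chunks):
    """Pipelined behaviour: yield each record as soon as its bytes are complete,
    while later chunks are still in flight. Peak memory ~ one chunk + one record."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        # Decode every complete length-prefixed record currently in the buffer.
        while len(buf) >= 4 and len(buf) >= 4 + int.from_bytes(buf[:4], "big"):
            n = int.from_bytes(buf[:4], "big")
            yield pickle.loads(buf[4:4 + n])
            buf = buf[4 + n:]

def fetch_then_deserialize(chunks):
    """Current behaviour per the comment: materialize the full block first
    (peak memory ~ whole block), then wrap it in a deserialization stream."""
    data = b"".join(chunks)
    yield from stream_deserialize(iter([data]))

block = serialize_block(range(100))
# Split the block into small 16-byte "network" chunks.
chunks = [block[i:i + 16] for i in range(0, len(block), 16)]
assert list(stream_deserialize(iter(chunks))) == list(range(100))
```

Both paths produce the same records; the difference is that the streaming path could overlap fetching with the consumer's computation, which is exactly the pipelining discussed above.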
[jira] [Commented] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964249#comment-16964249 ] Sandish Kumar HN commented on SPARK-29625: -- [~hyukjin.kwon] It is happening randomly, so there is no way to reproduce the exact error again. The basic question is: why is Spark trying to reset the offset of the same partition twice? I hope that makes the problem clear. > Spark Structure Streaming Kafka Wrong Reset Offset twice > > > Key: SPARK-29625 > URL: https://issues.apache.org/jira/browse/SPARK-29625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Sandish Kumar HN >Priority: Major > > Spark Structure Streaming Kafka resets the offset twice, once with the right offsets > and a second time with very old offsets > {code} > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-151 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-118 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-85 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 122677634. 
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-19 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 120504922.* > [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO ContextCleaner: Cleaned accumulator 810 > {code} > which is causing a Data loss issue. > {code} > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, > runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > java.lang.IllegalStateException: Partition topic-52's offset was changed from > 122677598 to 120504922, some data may have been missed. > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have > been lost because they are not available in Kafka any more; either the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out > by Kafka or the topic may have been deleted before all the data in the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was > processed. If you don't want your streaming query to fail on such cases, set > the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option > "failOnDataLoss" to "false". 
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) > [2
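The error text in the log above points at the Kafka source's failOnDataLoss option as the escape hatch when offsets have been aged out. A minimal sketch of the reader options, collected as a plain dict so it is runnable without a Spark installation; the broker address and topic name are placeholders, and with a real session these would be passed via spark.readStream.format("kafka").options(**kafka_options):

```python
# Reader options for the Structured Streaming Kafka source. Option names are
# taken from the error message above and the Kafka integration guide; the
# broker address and topic are placeholder values for illustration only.
kafka_options = {
    "kafka.bootstrap.servers": "broker:9092",  # placeholder address
    "subscribe": "topic",                      # topic name from the log above
    "startingOffsets": "latest",
    # Don't fail the query when Kafka has already aged out the requested
    # offsets; the batch proceeds with whatever data is still available.
    # Note this masks genuine data loss, which may or may not be acceptable.
    "failOnDataLoss": "false",
}
```

Whether silently skipping the missing range is acceptable depends on the job; for this ticket the interesting question remains why the reset happened twice at all.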
[jira] [Commented] (SPARK-25923) SparkR UT Failure (checking CRAN incoming feasibility)
[ https://issues.apache.org/jira/browse/SPARK-25923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964247#comment-16964247 ] L. C. Hsieh commented on SPARK-25923: - Got reply back now. It should be fixed now. > SparkR UT Failure (checking CRAN incoming feasibility) > -- > > Key: SPARK-25923 > URL: https://issues.apache.org/jira/browse/SPARK-25923 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: L. C. Hsieh >Priority: Blocker > > Currently, the following SparkR error blocks PR builders. > {code:java} > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 26] do not match the length of object [0] > Execution halted > {code} > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98362/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98367/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98368/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4403/testReport/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25923) SparkR UT Failure (checking CRAN incoming feasibility)
[ https://issues.apache.org/jira/browse/SPARK-25923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964236#comment-16964236 ] L. C. Hsieh commented on SPARK-25923: - Noticed that and asked help from CRAN two hours ago. > SparkR UT Failure (checking CRAN incoming feasibility) > -- > > Key: SPARK-25923 > URL: https://issues.apache.org/jira/browse/SPARK-25923 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: L. C. Hsieh >Priority: Blocker > > Currently, the following SparkR error blocks PR builders. > {code:java} > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 26] do not match the length of object [0] > Execution halted > {code} > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98362/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98367/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98368/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4403/testReport/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25923) SparkR UT Failure (checking CRAN incoming feasibility)
[ https://issues.apache.org/jira/browse/SPARK-25923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964232#comment-16964232 ] Sean R. Owen commented on SPARK-25923: -- [~viirya][~hyukjin.kwon][~dongjoon] Looks like this is happening again -- I wonder if it has anything to do with the changes in master for 3.0 preview? https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4910/console {code} * checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : dims [product 24] do not match the length of object [0] {code} Is this something we can resolve on our side in any way or needs CRAN help? > SparkR UT Failure (checking CRAN incoming feasibility) > -- > > Key: SPARK-25923 > URL: https://issues.apache.org/jira/browse/SPARK-25923 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: L. C. Hsieh >Priority: Blocker > > Currently, the following SparkR error blocks PR builders. > {code:java} > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 26] do not match the length of object [0] > Execution halted > {code} > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98362/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98367/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98368/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4403/testReport/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29690) Spark Shell - Clear imports
[ https://issues.apache.org/jira/browse/SPARK-29690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dinesh updated SPARK-29690: --- Description: I'm facing the below problem with Spark Shell. So, in a shell session - # I imported the following - {color:#57d9a3}{{import scala.collection.immutable.HashMap}}{color} # Then I realized my mistake and imported the correct class - {color:#57d9a3}{{import java.util.HashMap}}{color} But now I get the following error on running my code - {color:#de350b}:34: error: reference to HashMap is ambiguous; it is imported twice in the same scope by import java.util.HashMap and import scala.collection.immutable.HashMap. val colMap = new HashMap[String, HashMap[String, String]](){color} I have a long-running Spark Shell session, i.e. I do not want to close and reopen my shell. So, is there a way I can clear previous imports and use the correct class? I know that we can also specify the fully qualified name like - {color:#57d9a3}{{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}}{color} But I'm looking for a way to clear an incorrectly loaded class. I thought the Spark shell picks up imports from history the same way the REPL does; that said, the previous HashMap should be shadowed by the new import statement. was: I'm facing the below problem with Spark Shell. So, in a shell session - # I imported the following - {color:#57d9a3}{{import scala.collection.immutable.HashMap}}{color} # Then I realized my mistake and imported the correct class - {color:#57d9a3}{{import java.util.HashMap}}{color} But now I get the following error on running my code - {color:#de350b}:34: error: reference to HashMap is ambiguous; it is imported twice in the same scope by import java.util.HashMap and import scala.collection.immutable.HashMap. val colMap = new HashMap[String, HashMap[String, String]](){color} I have a long-running Spark Shell session, i.e. I do not want to close and reopen my shell. 
So, is there a way I can clear previous imports and use the correct class? I know that we can also specify the fully qualified name like - {color:#57d9a3}{{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}}{color} But I'm looking for a way to clear an incorrectly loaded class. I thought the Spark shell picks up imports from history the same way the REPL does; that said, the previous HashMap should be shadowed by the new import statement. > Spark Shell - Clear imports > > > Key: SPARK-29690 > URL: https://issues.apache.org/jira/browse/SPARK-29690 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.2.0 >Reporter: dinesh >Priority: Major > > I'm facing the below problem with Spark Shell. So, in a shell session - > # I imported the following - {color:#57d9a3}{{import > scala.collection.immutable.HashMap}}{color} > # Then I realized my mistake and imported the correct class - > {color:#57d9a3}{{import java.util.HashMap}}{color} > But now I get the following error on running my code - > {color:#de350b}:34: error: reference to HashMap is ambiguous; it > is imported twice in the same scope by import java.util.HashMap and import > scala.collection.immutable.HashMap. val colMap = new HashMap[String, > HashMap[String, String]](){color} > I have a long-running Spark Shell session, i.e. I do not want to close and > reopen my shell. So, is there a way I can clear previous imports and use the > correct class? > I know that we can also specify the fully qualified name like - > {color:#57d9a3}{{val colMap = new java.util.HashMap[String, > java.util.HashMap[String, String]]()}}{color} > But I'm looking for a way to clear an incorrectly loaded class. > > I thought the Spark shell picks up imports from history the same way the REPL > does; that said, the previous HashMap should be shadowed by the new import statement. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29690) Spark Shell - Clear imports
[ https://issues.apache.org/jira/browse/SPARK-29690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dinesh updated SPARK-29690: --- Description: I 'm facing below problem with Spark Shell. So, in a shell session - # I imported following - {color:#57d9a3}{{import scala.collection.immutable.HashMap}}{color} # Then I realized my mistake and imported correct class - {color:#57d9a3}{{import java.util.HashMap}}{color} But, now I get following error on running my code - {color:#de350b}:34: error: reference to HashMap is ambiguous;it is imported twice in the same scope byimport java.util.HashMapand import scala.collection.immutable.HashMapval colMap = new HashMap[String, HashMap[String, String]](){color} if I have long running Spark Shell session i.e I do not want to close and reopen my shell. So, is there a way I can clear previous imports and use correct class? I know that we can also specify full qualified name like - {color:#57d9a3}{{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}}{color} But, 'm looking if there is a way to clear an incorrect loaded class? I thought spark shell picks imports from history the same way REPL does. That said, previous HashMap should be shadowed away with new import statement. {{}} was: I 'm facing below problem with Spark Shell. So, in a shell session - # I imported following - {{import scala.collection.immutable.HashMap}} # Then I realized my mistake and imported correct class - {{import java.util.HashMap}} But, now I get following error on running my code - {color:#de350b}:34: error: reference to HashMap is ambiguous;it is imported twice in the same scope byimport java.util.HashMapand import scala.collection.immutable.HashMapval colMap = new HashMap[String, HashMap[String, String]](){color} if I have long running Spark Shell session i.e I do not want to close and reopen my shell. So, is there a way I can clear previous imports and use correct class? 
I know that we can also specify full qualified name like - {color:#57d9a3}{{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}}{color} But, 'm looking if there is a way to clear an incorrect loaded class? I thought spark shell picks imports from history the same way REPL does. That said, previous HashMap should be shadowed away with new import statement. {{}} > Spark Shell - Clear imports > > > Key: SPARK-29690 > URL: https://issues.apache.org/jira/browse/SPARK-29690 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.2.0 >Reporter: dinesh >Priority: Major > > I 'm facing below problem with Spark Shell. So, in a shell session - > # I imported following - {color:#57d9a3}{{import > scala.collection.immutable.HashMap}}{color} > # Then I realized my mistake and imported correct class - > {color:#57d9a3}{{import java.util.HashMap}}{color} > But, now I get following error on running my code - > {color:#de350b}:34: error: reference to HashMap is ambiguous;it > is imported twice in the same scope byimport java.util.HashMapand import > scala.collection.immutable.HashMapval colMap = new HashMap[String, > HashMap[String, String]](){color} > if I have long running Spark Shell session i.e I do not want to close and > reopen my shell. So, is there a way I can clear previous imports and use > correct class? > I know that we can also specify full qualified name like - > {color:#57d9a3}{{val colMap = new java.util.HashMap[String, > java.util.HashMap[String, String]]()}}{color} > But, 'm looking if there is a way to clear an incorrect loaded class? > > I thought spark shell picks imports from history the same way REPL does. That > said, previous HashMap should be shadowed away with new import statement. > {{}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29690) Spark Shell - Clear imports
[ https://issues.apache.org/jira/browse/SPARK-29690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dinesh updated SPARK-29690: --- Description: I 'm facing below problem with Spark Shell. So, in a shell session - # I imported following - {{import scala.collection.immutable.HashMap}} # Then I realized my mistake and imported correct class - {{import java.util.HashMap}} But, now I get following error on running my code - {color:#de350b}:34: error: reference to HashMap is ambiguous;it is imported twice in the same scope byimport java.util.HashMapand import scala.collection.immutable.HashMapval colMap = new HashMap[String, HashMap[String, String]](){color} if I have long running Spark Shell session i.e I do not want to close and reopen my shell. So, is there a way I can clear previous imports and use correct class? I know that we can also specify full qualified name like - {color:#57d9a3}{{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}}{color} But, 'm looking if there is a way to clear an incorrect loaded class? I thought spark shell picks imports from history the same way REPL does. That said, previous HashMap should be shadowed away with new import statement. {{}} was: I 'm facing below problem with Spark Shell. So, in a shell session - # I imported following - {{import scala.collection.immutable.HashMap}} # Then I realized my mistake and imported correct class - {{import java.util.HashMap}} But, now I get following error on running my code - {{:34: error: reference to HashMap is ambiguous;it is imported twice in the same scope byimport java.util.HashMapand import scala.collection.immutable.HashMapval colMap = new HashMap[String, HashMap[String, String]]()}} {{}} if I have long running Spark Shell session i.e I do not want to close and reopen my shell. So, is there a way I can clear previous imports and use correct class? 
I know that we can also specify full qualified name like - {{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}} But, 'm looking if there is a way to clear an incorrect loaded class? I thought spark shell picks imports from history the same way REPL does. That said, previous HashMap should be shadowed away with new import statement. {{}} > Spark Shell - Clear imports > > > Key: SPARK-29690 > URL: https://issues.apache.org/jira/browse/SPARK-29690 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.2.0 >Reporter: dinesh >Priority: Major > > I 'm facing below problem with Spark Shell. So, in a shell session - > # I imported following - {{import scala.collection.immutable.HashMap}} > # Then I realized my mistake and imported correct class - {{import > java.util.HashMap}} > But, now I get following error on running my code - > {color:#de350b}:34: error: reference to HashMap is ambiguous;it > is imported twice in the same scope byimport java.util.HashMapand import > scala.collection.immutable.HashMapval colMap = new HashMap[String, > HashMap[String, String]](){color} > if I have long running Spark Shell session i.e I do not want to close and > reopen my shell. So, is there a way I can clear previous imports and use > correct class? > I know that we can also specify full qualified name like - > {color:#57d9a3}{{val colMap = new java.util.HashMap[String, > java.util.HashMap[String, String]]()}}{color} > But, 'm looking if there is a way to clear an incorrect loaded class? > > I thought spark shell picks imports from history the same way REPL does. That > said, previous HashMap should be shadowed away with new import statement. > {{}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29690) Spark Shell - Clear imports
dinesh created SPARK-29690: -- Summary: Spark Shell - Clear imports Key: SPARK-29690 URL: https://issues.apache.org/jira/browse/SPARK-29690 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.2.0 Reporter: dinesh I'm facing the below problem with Spark Shell. In a shell session - # I imported the following - {{import scala.collection.immutable.HashMap}} # Then I realized my mistake and imported the correct class - {{import java.util.HashMap}} But now I get the following error when running my code - {{<console>:34: error: reference to HashMap is ambiguous; it is imported twice in the same scope by import java.util.HashMap and import scala.collection.immutable.HashMap (offending line: val colMap = new HashMap[String, HashMap[String, String]]())}} I have a long-running Spark Shell session, i.e. I do not want to close and reopen my shell. So, is there a way I can clear previous imports and use the correct class? I know that we can also specify the fully qualified name, like - {{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}} But I'm looking for a way to clear an incorrectly loaded class. I thought spark-shell picks up imports from history the same way the REPL does; if so, the previous HashMap should be shadowed by the new import statement. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
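The fully qualified name workaround mentioned in the report can be sketched as follows. A related standard Scala technique, renaming on import, also avoids the clash without restarting the shell (a minimal sketch, not specific to Spark):

```scala
// Both imports can stay in scope; a rename on import removes the ambiguity.
import scala.collection.immutable.HashMap // the mistaken import
import java.util.{HashMap => JHashMap}    // java.util.HashMap under a new name

val colMap = new JHashMap[String, JHashMap[String, String]]()
colMap.put("table1", new JHashMap[String, String]())
```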
[jira] [Updated] (SPARK-29644) ShortType is wrongly set as Int in JDBCUtils.scala
[ https://issues.apache.org/jira/browse/SPARK-29644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29644: - Issue Type: Bug (was: New Feature) > ShortType is wrongly set as Int in JDBCUtils.scala > -- > > Key: SPARK-29644 > URL: https://issues.apache.org/jira/browse/SPARK-29644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Shiv Prashant Sood >Priority: Minor > > @maropu pointed out this issue during the [PR 25344|https://github.com/apache/spark/pull/25344] review discussion. > In [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala], line 547: > case ShortType => > (stmt: PreparedStatement, row: Row, pos: Int) => > stmt.setInt(pos + 1, row.getShort(pos)) > I don't see a reproducible issue, but this is clearly a problem that must be fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
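The snippet above compiles without error because Scala widens Short to Int implicitly, which is why the bug has no visible reproduction. A minimal sketch of the likely fix (assuming the setter structure in JdbcUtils.scala stays as quoted; the actual patch may differ):

```scala
// Scala widens Short to Int implicitly, so stmt.setInt(..., row.getShort(pos))
// type-checks even though it drops the SMALLINT type information at the driver.
val s: Short = 7
val widened: Int = s // no cast needed - this is the silent widening
assert(widened == 7)

// Likely fix in the ShortType setter (sketch, simplified):
// case ShortType =>
//   (stmt: PreparedStatement, row: Row, pos: Int) =>
//     stmt.setShort(pos + 1, row.getShort(pos))
```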
[jira] [Resolved] (SPARK-29675) Add exception when isolationLevel is Illegal
[ https://issues.apache.org/jira/browse/SPARK-29675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29675. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26334 [https://github.com/apache/spark/pull/26334] > Add exception when isolationLevel is Illegal > > > Key: SPARK-29675 > URL: https://issues.apache.org/jira/browse/SPARK-29675 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.0.0 > > > Currently, when we use the JDBC API and set an illegal isolationLevel option, Spark throws a `scala.MatchError`, which is not user-friendly. We should throw an IllegalArgumentException instead. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29675) Add exception when isolationLevel is Illegal
[ https://issues.apache.org/jira/browse/SPARK-29675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29675: - Assignee: ulysses you > Add exception when isolationLevel is Illegal > > > Key: SPARK-29675 > URL: https://issues.apache.org/jira/browse/SPARK-29675 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > > > Currently, when we use the JDBC API and set an illegal isolationLevel option, Spark throws a `scala.MatchError`, which is not user-friendly. We should throw an IllegalArgumentException instead. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
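The proposed improvement can be sketched as a pattern match with an explicit catch-all. This is a hedged sketch: the option names follow the standard JDBC isolation levels, and the actual Spark code may structure the mapping differently.

```scala
import java.sql.Connection

// Map the user-supplied `isolationLevel` option to a java.sql.Connection
// constant, failing with a descriptive error instead of a bare MatchError.
def toIsolationLevel(value: String): Int = value match {
  case "NONE"             => Connection.TRANSACTION_NONE
  case "READ_UNCOMMITTED" => Connection.TRANSACTION_READ_UNCOMMITTED
  case "READ_COMMITTED"   => Connection.TRANSACTION_READ_COMMITTED
  case "REPEATABLE_READ"  => Connection.TRANSACTION_REPEATABLE_READ
  case "SERIALIZABLE"     => Connection.TRANSACTION_SERIALIZABLE
  case other => throw new IllegalArgumentException(
    s"Invalid value `$other` for parameter `isolationLevel`. Allowed values: " +
    "NONE, READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ, SERIALIZABLE.")
}
```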
[jira] [Commented] (SPARK-29682) Failure when resolving conflicting references in Join:
[ https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964158#comment-16964158 ] Terry Kim commented on SPARK-29682: --- Sure, I will look into this. Thanks for pinging me. > Failure when resolving conflicting references in Join: > -- > > Key: SPARK-29682 > URL: https://issues.apache.org/jira/browse/SPARK-29682 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.4.3 >Reporter: sandeshyapuram >Priority: Major > > When I try to self-join a parentDf with multiple childDfs (say childDf1, ...), where the childDfs are derived from a cube or rollup and filtered on the grouping columns, I get the error > {{Failure when resolving conflicting references in Join: }} > followed by a long error message which is quite unreadable. On the other hand, if I replace cube or rollup with a plain groupBy, it works without issues. > > *Sample code:* > {code:java} > val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums") > val cubeDF = numsDF > .cube("nums") > .agg( > max(lit(0)).as("agcol"), > grouping_id().as("gid") > ) > > val group0 = cubeDF.filter(col("gid") <=> lit(0)) > val group1 = cubeDF.filter(col("gid") <=> lit(1)) > cubeDF.printSchema > group0.printSchema > group1.printSchema > //Recreating cubeDf > cubeDF.select("nums").distinct > .join(group0, Seq("nums"), "inner") > .join(group1, Seq("nums"), "inner") > .show > {code} > *Sample output:* > {code:java} > numsDF: org.apache.spark.sql.DataFrame = [nums: int] > cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more > field] > group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, > agcol: int ... 1 more field] > group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, > agcol: int ... 
1 more field] > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > org.apache.spark.sql.AnalysisException: > Failure when resolving conflicting references in Join: > 'Join Inner > :- Deduplicate [nums#220] > : +- Project [nums#220] > : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > : +- Project [nums#212, nums#212 AS nums#219] > : +- Project [value#210 AS nums#212] > : +- SerializeFromObject [input[0, int, false] AS value#210] > :+- ExternalRDD [obj#209] > +- Filter (gid#217 <=> 0) >+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > +- Project [nums#212, nums#212 AS nums#219] > +- Project [value#210 AS nums#212] >+- SerializeFromObject [input[0, int, false] AS value#210] > +- ExternalRDD [obj#209] > Conflicting attributes: nums#220 > ;; > 'Join Inner > :- Deduplicate [nums#220] > : +- Project [nums#220] > : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > : +- Project [nums#212, nums#212 AS nums#219] > : +- Project [value#210 AS nums#212] > : +- SerializeFromObject [input[0, int, false] AS value#210] > :+- ExternalRDD [obj#209] > +- Filter (gid#217 <=> 0) >+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, 
max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > +- Project [nums#212, nums#212 AS nums#219] > +- Project [value#210 AS nums#212] >+- SerializeFromObject [input[0, int, false] AS value#210] > +- ExternalRDD [obj#209] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkA
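One possible workaround while this is investigated (an assumption on my part, not from the report): force fresh attribute IDs on the filtered sides of the self-join by rebuilding each DataFrame from its RDD and schema, which breaks the shared lineage that triggers the conflicting-references check. Note this materializes the lineage, so it trades performance for analyzability.

```scala
// Hypothetical workaround: recreate group0/group1 with fresh expression IDs
// so the analyzer no longer sees the same attributes on both sides of the join.
val group0Fresh = spark.createDataFrame(group0.rdd, group0.schema)
val group1Fresh = spark.createDataFrame(group1.rdd, group1.schema)

cubeDF.select("nums").distinct
  .join(group0Fresh, Seq("nums"), "inner")
  .join(group1Fresh, Seq("nums"), "inner")
  .show()
```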
[jira] [Commented] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964157#comment-16964157 ] Terry Kim commented on SPARK-29630: --- Yes, I will take a look. > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29604) SessionState is initialized with isolated classloader for Hive if spark.sql.hive.metastore.jars is being set
[ https://issues.apache.org/jira/browse/SPARK-29604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964137#comment-16964137 ] Dongjoon Hyun commented on SPARK-29604: --- Thank you for keeping working on this. Yes. It passed locally. That's the reason why I didn't revert this patch until now. But, we are on 3.0.0-preview voting. In the worst case, we need to revert this. > SessionState is initialized with isolated classloader for Hive if > spark.sql.hive.metastore.jars is being set > > > Key: SPARK-29604 > URL: https://issues.apache.org/jira/browse/SPARK-29604 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > I've observed the issue that external listeners cannot be loaded properly > when we run spark-sql with "spark.sql.hive.metastore.jars" configuration > being used. > {noformat} > Exception in thread "main" java.lang.IllegalArgumentException: Error while > instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder': > at > org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1102) > at > org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:154) > at > org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:153) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:153) > at > org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:150) > at > org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$2.apply(SparkSession.scala:104) > at > org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$2.apply(SparkSession.scala:104) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:104) > at > 
org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:103) > at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:149) > at > org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$client(HiveClientImpl.scala:282) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:306) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:247) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:246) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:296) > at > org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:386) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214) > at > org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) > at > org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:53) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:315) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:847) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSub
[jira] [Assigned] (SPARK-29277) DataSourceV2: Add early filter and projection pushdown
[ https://issues.apache.org/jira/browse/SPARK-29277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29277: - Assignee: Ryan Blue > DataSourceV2: Add early filter and projection pushdown > -- > > Key: SPARK-29277 > URL: https://issues.apache.org/jira/browse/SPARK-29277 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > Spark uses optimizer rules that need stats before conversion to physical > plan. DataSourceV2 should support early pushdown for those rules. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29277) DataSourceV2: Add early filter and projection pushdown
[ https://issues.apache.org/jira/browse/SPARK-29277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29277. --- Resolution: Fixed Issue resolved by pull request 26341 [https://github.com/apache/spark/pull/26341] > DataSourceV2: Add early filter and projection pushdown > -- > > Key: SPARK-29277 > URL: https://issues.apache.org/jira/browse/SPARK-29277 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > Spark uses optimizer rules that need stats before conversion to physical > plan. DataSourceV2 should support early pushdown for those rules. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Description: As shown in the attachment, if a task fails while reading shuffle data or because of executor loss, its shuffle read size is shown as 0. But this size is important for users: it can help detect data skew. was: If a task fails while reading shuffle data or because of executor loss, its shuffle read size is shown as 0. But this size is important for users: it can help detect data skew. !screenshot-1.png! > [WEB-UI] When task failed during reading shuffle data or other failure, > enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > As shown in the attachment, if a task fails while reading shuffle data or because of executor loss, its shuffle read size is shown as 0. > But this size is important for users: it can help detect data skew. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Description: If a task fails while reading shuffle data or because of executor loss, its shuffle read size is shown as 0. But this size is important for users: it can help detect data skew. !screenshot-1.png! > [WEB-UI] When task failed during reading shuffle data or other failure, > enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > If a task fails while reading shuffle data or because of executor loss, its > shuffle read size is shown as 0. > But this size is important for users: it can help detect data skew. > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Attachment: screenshot-1.png > [WEB-UI] When task failed during reading shuffle data or other failure, > enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Summary: [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size (was: [UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size) > [WEB-UI] When task failed during reading shuffle data or other failure, > enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29689) [UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
feiwang created SPARK-29689: --- Summary: [UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size Key: SPARK-29689 URL: https://issues.apache.org/jira/browse/SPARK-29689 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.4 Reporter: feiwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29688) Support average with interval type values
Kent Yao created SPARK-29688: Summary: Support average with interval type values Key: SPARK-29688 URL: https://issues.apache.org/jira/browse/SPARK-29688 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao Add average aggregate support for interval values in Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
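A hypothetical usage example for the proposed feature. The syntax is assumed to mirror existing aggregates; this is not available behavior in any released Spark at the time of the issue.

```scala
// Assumed behavior once average over interval values is supported:
// sum the intervals, then divide by the row count.
spark.sql(
  "SELECT avg(v) FROM VALUES (interval 1 day), (interval 3 day) AS t(v)"
).show()
// expected result: an interval of 2 days (assumption, since the feature is proposed)
```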
[jira] [Commented] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964015#comment-16964015 ] ulysses you commented on SPARK-29687: - Sorry, my mistake. > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > The JDBC metrics counter variable is an Int type that may overflow. Change it to > Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964011#comment-16964011 ] ulysses you edited comment on SPARK-29687 at 10/31/19 1:33 PM: --- You can see the PR [26346|https://github.com/apache/spark/pull/26346] was (Author: ulysses): You can see the PR [26334|https://github.com/apache/spark/pull/26334] > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > The JDBC metrics counter variable is an Int type that may overflow. Change it to > Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964011#comment-16964011 ] ulysses you commented on SPARK-29687: - You can see the PR [26334|https://github.com/apache/spark/pull/26334] > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > The JDBC metrics counter variable is an Int type that may overflow. Change it to > Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964011#comment-16964011 ] ulysses you edited comment on SPARK-29687 at 10/31/19 1:31 PM: --- You can see the PR [26334|https://github.com/apache/spark/pull/26334] was (Author: ulysses): You can see the PR [26334|https://github.com/apache/spark/pull/26334] > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > The JDBC metrics counter variable is an Int type that may overflow. Change it to > Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963993#comment-16963993 ] jobit mathew commented on SPARK-29687: -- Hi, can you give some details about this variable and where it is used? > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > The JDBC metrics counter variable is an Int type that may overflow. Change it to > Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-29687: Affects Version/s: (was: 2.4.4) 3.0.0 > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > The JDBC metrics counter variable is an Int type that may overflow. Change it to > Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29687) Fix jdbc metrics counter type to long
ulysses you created SPARK-29687: --- Summary: Fix jdbc metrics counter type to long Key: SPARK-29687 URL: https://issues.apache.org/jira/browse/SPARK-29687 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: ulysses you The JDBC metrics counter variable is an Int type that may overflow. Change it to Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
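The motivation in plain Scala terms: a JVM Int wraps silently at 2^31 - 1, so a counter that crosses roughly 2.1 billion goes negative, while a Long keeps counting far beyond that:

```scala
// Int arithmetic wraps around silently on the JVM - no exception is thrown.
var rowCount: Int = Int.MaxValue // 2147483647
rowCount += 1
assert(rowCount == Int.MinValue) // now -2147483648

// The same counter as a Long keeps counting correctly.
var rowCountLong: Long = Int.MaxValue.toLong
rowCountLong += 1
assert(rowCountLong == 2147483648L)
```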
[jira] [Commented] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963858#comment-16963858 ] Wenchen Fan commented on SPARK-29630: - Yea, this should be disallowed. We store views as SQL text, so temp views are not allowed to appear in a permanent view's SQL text. I think it's a bug in the checking logic of `CREATE VIEW`: it doesn't traverse subqueries. [~imback82] do you have time to look into it? > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29682) Failure when resolving conflicting references in Join:
[ https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963850#comment-16963850 ] Wenchen Fan commented on SPARK-29682: - [~imback82] do you want to look into it? > Failure when resolving conflicting references in Join: > -- > > Key: SPARK-29682 > URL: https://issues.apache.org/jira/browse/SPARK-29682 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.4.3 >Reporter: sandeshyapuram >Priority: Major > > When I try to self-join a parentDf with multiple childDfs (say childDf1, ...), where the childDfs are derived from a cube or rollup and filtered on the grouping columns, I get the error > {{Failure when resolving conflicting references in Join: }} > followed by a long error message which is quite unreadable. On the other hand, if I replace cube or rollup with a plain groupBy, it works without issues. > > *Sample code:* > {code:java} > val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums") > val cubeDF = numsDF > .cube("nums") > .agg( > max(lit(0)).as("agcol"), > grouping_id().as("gid") > ) > > val group0 = cubeDF.filter(col("gid") <=> lit(0)) > val group1 = cubeDF.filter(col("gid") <=> lit(1)) > cubeDF.printSchema > group0.printSchema > group1.printSchema > //Recreating cubeDf > cubeDF.select("nums").distinct > .join(group0, Seq("nums"), "inner") > .join(group1, Seq("nums"), "inner") > .show > {code} > *Sample output:* > {code:java} > numsDF: org.apache.spark.sql.DataFrame = [nums: int] > cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more > field] > group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, > agcol: int ... 1 more field] > group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, > agcol: int ... 
1 more field] > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > org.apache.spark.sql.AnalysisException: > Failure when resolving conflicting references in Join: > 'Join Inner > :- Deduplicate [nums#220] > : +- Project [nums#220] > : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > : +- Project [nums#212, nums#212 AS nums#219] > : +- Project [value#210 AS nums#212] > : +- SerializeFromObject [input[0, int, false] AS value#210] > :+- ExternalRDD [obj#209] > +- Filter (gid#217 <=> 0) >+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > +- Project [nums#212, nums#212 AS nums#219] > +- Project [value#210 AS nums#212] >+- SerializeFromObject [input[0, int, false] AS value#210] > +- ExternalRDD [obj#209] > Conflicting attributes: nums#220 > ;; > 'Join Inner > :- Deduplicate [nums#220] > : +- Project [nums#220] > : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > : +- Project [nums#212, nums#212 AS nums#219] > : +- Project [value#210 AS nums#212] > : +- SerializeFromObject [input[0, int, false] AS value#210] > :+- ExternalRDD [obj#209] > +- Filter (gid#217 <=> 0) >+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, 
max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > +- Project [nums#212, nums#212 AS nums#219] > +- Project [value#210 AS nums#212] >+- SerializeFromObject [input[0, int, false] AS value#210] > +- ExternalRDD [obj#209] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis
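Until the resolver bug is fixed, a commonly suggested workaround for this class of self-join failure is to break the shared lineage before joining. The sketch below is hedged: it assumes a `SparkSession` named `spark` plus the `cubeDF`/`group0`/`group1` frames from the report, and round-trips each filtered branch through its RDD and schema so that each side of the join is re-analyzed with fresh attribute IDs instead of the conflicting `nums#220`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hedged workaround sketch (not from the ticket): rebuilding a DataFrame
// from its RDD and schema forces a fresh analysis, assigning new attribute
// IDs, so the cube-derived branches no longer share conflicting references.
def freshCopy(spark: SparkSession, df: DataFrame): DataFrame =
  spark.createDataFrame(df.rdd, df.schema)

cubeDF.select("nums").distinct
  .join(freshCopy(spark, group0), Seq("nums"), "inner")
  .join(freshCopy(spark, group1), Seq("nums"), "inner")
  .show()
```

The cost of this trick is that the rebuilt branches lose their optimizable lineage (Spark materializes them through the RDD), so it is a stopgap rather than a fix.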
[jira] [Updated] (SPARK-29685) Spark SQL also better to show the column details while doing SELECT * from table, like sparkshell and spark beeline
[ https://issues.apache.org/jira/browse/SPARK-29685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jobit mathew updated SPARK-29685: - Issue Type: Improvement (was: Bug) > Spark SQL also better to show the column details while doing SELECT * from > table, like sparkshell and spark beeline > --- > > Key: SPARK-29685 > URL: https://issues.apache.org/jira/browse/SPARK-29685 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: jobit mathew >Priority: Minor > > It would be better if Spark SQL also showed the column details at the top while doing > SELECT * from a table, like the Spark Scala shell and Spark Beeline show in table format. > *Test steps* > 1.create table table1(id int,name string,address string); > 2.insert into table1 values (5,name1,add1); > 3.insert into table1 values (5,name2,add2); > 4.insert into table1 values (5,name3,add3); > {code:java} > spark-sql> select * from table1; > 5 name3 add3 > 5 name1 add1 > 5 name2 add2 > But the Spark Scala shell & Spark Beeline show the column details in > table format > scala> sql("select * from table1").show() > +---+-----+-------+ > | id| name|address| > +---+-----+-------+ > |  5|name3|   add3| > |  5|name1|   add1| > |  5|name2|   add2| > +---+-----+-------+ > scala> > 0: jdbc:hive2://10.18.18.214:23040/default> select * from table1; > +-----+--------+----------+ > | id  | name   | address  | > +-----+--------+----------+ > | 5   | name3  | add3     | > | 5   | name1  | add1     | > | 5   | name2  | add2     | > +-----+--------+----------+ > 3 rows selected (0.679 seconds) > 0: jdbc:hive2://10.18.18.214:23040/default> > {code}
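Pending any change in default behavior, the spark-sql shell can already print a header row via the Hive CLI option it inherits; the snippet below assumes the `table1` from the report, and the exact behavior of `hive.cli.print.header` may vary by Spark/Hive version.

```sql
-- Possible workaround in the spark-sql shell (Hive CLI setting, hedged):
SET hive.cli.print.header=true;
SELECT * FROM table1;
```

The same option can be supplied at launch, e.g. `spark-sql --hiveconf hive.cli.print.header=true`, so headers appear for the whole session.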
[jira] [Updated] (SPARK-29685) Spark SQL also better to show the column details while doing SELECT * from table, like sparkshell and spark beeline
[ https://issues.apache.org/jira/browse/SPARK-29685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jobit mathew updated SPARK-29685: - Description: It would be better if Spark SQL also showed the column details at the top while doing SELECT * from a table, like the Spark Scala shell and Spark Beeline show in table format. *Test steps* 1.create table table1(id int,name string,address string); 2.insert into table1 values (5,name1,add1); 3.insert into table1 values (5,name2,add2); 4.insert into table1 values (5,name3,add3); {code:java} spark-sql> select * from table1; 5 name3 add3 5 name1 add1 5 name2 add2 But the Spark Scala shell & Spark Beeline show the column details in table format scala> sql("select * from table1").show() +---+-----+-------+ | id| name|address| +---+-----+-------+ |  5|name3|   add3| |  5|name1|   add1| |  5|name2|   add2| +---+-----+-------+ scala> 0: jdbc:hive2://10.18.18.214:23040/default> select * from table1; +-----+--------+----------+ | id  | name   | address  | +-----+--------+----------+ | 5   | name3  | add3     | | 5   | name1  | add1     | | 5   | name2  | add2     | +-----+--------+----------+ 3 rows selected (0.679 seconds) 0: jdbc:hive2://10.18.18.214:23040/default> {code} was: It would be better if Spark SQL also showed the column details at the top while doing SELECT * from a table, like the Spark Scala shell and Spark Beeline show in table format.
*Test steps* 1.create table table1(id int,name string,address string); 2.insert into table1 values (5,name1,add1); 3.insert into table1 values (5,name2,add2); 4.insert into table1 values (5,name3,add3); spark-sql> select * from table1; 5 name3 add3 5 name1 add1 5 name2 add2 But the Spark Scala shell & Spark Beeline show the column details in table format scala> sql("select * from table1").show() +---+-----+-------+ | id| name|address| +---+-----+-------+ |  5|name3|   add3| |  5|name1|   add1| |  5|name2|   add2| +---+-----+-------+ scala> 0: jdbc:hive2://10.18.18.214:23040/default> select * from table1; +-----+--------+----------+ | id  | name   | address  | +-----+--------+----------+ | 5   | name3  | add3     | | 5   | name1  | add1     | | 5   | name2  | add2     | +-----+--------+----------+ 3 rows selected (0.679 seconds) 0: jdbc:hive2://10.18.18.214:23040/default> > Spark SQL also better to show the column details while doing SELECT * from > table, like sparkshell and spark beeline > --- > > Key: SPARK-29685 > URL: https://issues.apache.org/jira/browse/SPARK-29685 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: jobit mathew >Priority: Minor > > It would be better if Spark SQL also showed the column details at the top while doing > SELECT * from a table, like the Spark Scala shell and Spark Beeline show in table format.
> *Test steps* > 1.create table table1(id int,name string,address string); > 2.insert into table1 values (5,name1,add1); > 3.insert into table1 values (5,name2,add2); > 4.insert into table1 values (5,name3,add3); > {code:java} > spark-sql> select * from table1; > 5 name3 add3 > 5 name1 add1 > 5 name2 add2 > But the Spark Scala shell & Spark Beeline show the column details in > table format > scala> sql("select * from table1").show() > +---+-----+-------+ > | id| name|address| > +---+-----+-------+ > |  5|name3|   add3| > |  5|name1|   add1| > |  5|name2|   add2| > +---+-----+-------+ > scala> > 0: jdbc:hive2://10.18.18.214:23040/default> select * from table1; > +-----+--------+----------+ > | id  | name   | address  | > +-----+--------+----------+ > | 5   | name3  | add3     | > | 5   | name1  | add1     | > | 5   | name2  | add2     | > +-----+--------+----------+ > 3 rows selected (0.679 seconds) > 0: jdbc:hive2://10.18.18.214:23040/default> > {code}