[jira] [Commented] (SPARK-29707) Make PartitionFilters and PushedFilters abbreviate configurable in metadata
[ https://issues.apache.org/jira/browse/SPARK-29707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964655#comment-16964655 ]

Hu Fuwang commented on SPARK-29707:
-----------------------------------

I am working on this.

> Make PartitionFilters and PushedFilters abbreviate configurable in metadata
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-29707
>                 URL: https://issues.apache.org/jira/browse/SPARK-29707
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>         Attachments: screenshot-1.png
>
> !screenshot-1.png!
> The abbreviated output loses some key information.
> Related code:
> https://github.com/apache/spark/blob/ec5d698d99634e5bb8fc7b0fa1c270dd67c129c8/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L58-L66

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
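The issue targets the fixed-length string cut applied to scan metadata in DataSourceScanExec. A minimal plain-Scala sketch of what a configuration-driven cut-off could look like; the config key name and helper names here are illustrative assumptions, not the actual patch:

{code:java}
object MetadataAbbreviation {
  // Assumed to be read from a SQLConf entry such as
  // "spark.sql.maxMetadataStringLength" (illustrative name),
  // replacing the hard-coded 100-character cap.
  def abbreviate(value: String, maxLength: Int): String =
    if (value.length <= maxLength) value
    else value.substring(0, maxLength - 3) + "..."

  // Renders metadata entries like "PushedFilters: [...]" with the
  // configurable limit applied per value.
  def formatMetadata(metadata: Map[String, String], maxLength: Int): Seq[String] =
    metadata.toSeq.sortBy(_._1).map { case (key, value) =>
      s"$key: ${abbreviate(value, maxLength)}"
    }
}
{code}

With a large enough limit, PartitionFilters and PushedFilters would no longer be truncated in the plan output shown in the screenshot.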
[jira] [Created] (SPARK-29710) Seeing offsets not resetting even when reset policy is configured explicitly
Shyam created SPARK-29710:
-----------------------------

             Summary: Seeing offsets not resetting even when reset policy is configured explicitly
                 Key: SPARK-29710
                 URL: https://issues.apache.org/jira/browse/SPARK-29710
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.4.1
         Environment: Windows 10, Eclipse Neon
            Reporter: Shyam

Even after setting *"auto.offset.reset" to "latest"*, I am getting the error below:

{code}
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {COMPANY_TRANSACTIONS_INBOUND-16=168}
	at org.apache.kafka.clients.consumer.internals.Fetcher.throwIfOffsetOutOfRange(Fetcher.java:348)
	at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:396)
	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:999)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937)
	at org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470)
	at org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361)
	at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251)
	at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234)
	at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
	at org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209)
	at org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234)
{code}

https://stackoverflow.com/questions/58653885/even-after-setting-auto-offset-reset-to-latest-getting-error-offsetoutofrang
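Worth noting for reports like this one: the Structured Streaming Kafka source manages offsets itself and does not honor the consumer's `auto.offset.reset`; offset behavior is controlled through source options instead. A hedged sketch, assuming a SparkSession named `spark` and the spark-sql-kafka-0-10 connector:

{code:java}
// Sketch: configure offsets via source options, not auto.offset.reset
// (which the Kafka source ignores). "spark" is an assumed SparkSession.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "COMPANY_TRANSACTIONS_INBOUND")
  .option("startingOffsets", "latest")  // takes the place of auto.offset.reset
  .option("failOnDataLoss", "false")    // tolerate out-of-range/expired offsets
  .load()
{code}

With `failOnDataLoss=false` the query logs a warning instead of failing when checkpointed offsets have aged out of the topic's retention.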
[jira] [Created] (SPARK-29709) structured streaming The offset in the checkpoint is suddenly reset to the earliest
test created SPARK-29709:
----------------------------

             Summary: structured streaming The offset in the checkpoint is suddenly reset to the earliest
                 Key: SPARK-29709
                 URL: https://issues.apache.org/jira/browse/SPARK-29709
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: test

In Structured Streaming, the offset stored in the checkpoint is suddenly reset to the earliest offset.
[jira] [Comment Edited] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964626#comment-16964626 ]

Terry Kim edited comment on SPARK-29630 at 11/1/19 5:50 AM:
------------------------------------------------------------

In the above example, the EXISTS clause becomes a `condition` of `Filter`. The current implementation is not exhaustive enough - e.g., it doesn't traverse Expression node for checking views. I will create a PR to address this.

was (Author: imback82): In the above example, the EXISTS clause becomes a `condition` of `Filter`. The current implementation is not exhaustive enough - e.g., it doesn't traverse Expression node, etc. I will create a PR to address this.

> Not allowed to create a permanent view by referencing a temporary view in EXISTS
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-29630
>                 URL: https://issues.apache.org/jira/browse/SPARK-29630
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Takeshi Yamamuro
>            Priority: Major
>
> {code}
> // In the master, the query below fails
> $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * FROM temp_table) t2;
> org.apache.spark.sql.AnalysisException
> Not allowed to create a permanent view `v7_temp` by referencing a temporary view `temp_table`;
> // In the master, the query below passed, but this should fail
> $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM temp_table);
> Passed
> {code}
[jira] [Commented] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964626#comment-16964626 ]

Terry Kim commented on SPARK-29630:
-----------------------------------

In the above example, the EXISTS clause becomes a `condition` of `Filter`. The current implementation is not exhaustive enough - e.g., it doesn't traverse Expression node, etc. I will create a PR to address this.
[jira] [Updated] (SPARK-29698) Support grouping function with multiple arguments
[ https://issues.apache.org/jira/browse/SPARK-29698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-29698:
-------------------------------------
    Description: 
In PgSQL, grouping() can have multiple arguments, but Spark grouping() must have a single argument (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala#L100);

{code:java}
postgres=# select a, b, grouping(a, b), sum(v), count(*), max(v)
postgres-# from gstest1 group by rollup (a,b);
 a | b | grouping | sum | count | max
---+---+----------+-----+-------+-----
   |   |        3 |     |     0 |
(1 row)
{code}

See a doc for the form:
https://www.postgresql.org/docs/12/functions-aggregate.html (Table 9.59. Grouping Operations)
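Until grouping() accepts multiple arguments, Spark's existing grouping_id() already returns the same combined bitmask as PgSQL's multi-argument grouping(). A hedged sketch against the gstest1 example above, assuming a SparkSession named `spark`:

{code:java}
// Sketch: grouping_id(a, b) gives the same bitmask PgSQL's grouping(a, b)
// would (bit per column, most significant first). "spark" is an assumed
// SparkSession and gstest1 an existing table.
spark.sql("""
  select a, b, grouping_id(a, b), sum(v), count(*), max(v)
  from gstest1 group by rollup (a, b)
""").show()
{code}

Supporting multi-argument grouping() would then mostly be a matter of mapping it onto the same bitmask computation.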
[jira] [Updated] (SPARK-29697) Support bit string types/literals
[ https://issues.apache.org/jira/browse/SPARK-29697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-29697:
-------------------------------------
    Description: 
In PgSQL, there are bit types and literals;

{code}
postgres=# create table b(b bit(4));
CREATE TABLE
postgres=# select b'0010';
 ?column?
----------
 0010
(1 row)
{code}

See a doc for the form:
https://www.postgresql.org/docs/current/datatype-bit.html
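As a stop-gap while Spark has no bit(n) type or b'...' literal, bit strings can be modeled as integers and converted at the edges. A plain-Scala sketch; the helper names are illustrative, not part of any Spark API:

{code:java}
object BitStrings {
  // Parse a PgSQL-style bit literal body such as "0010" into an Int.
  def bitLiteral(bits: String): Int = Integer.parseInt(bits, 2)

  // Render an Int back as a fixed-width bit string.
  def toBitString(v: Int, width: Int): String =
    v.toBinaryString.reverse.padTo(width, '0').reverse

  // bitLiteral("0010") == 2; toBitString(2, 4) == "0010"
}
{code}

A real implementation would instead add the literal form to the SQL parser and a corresponding data type, per the linked PgSQL documentation.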
[jira] [Updated] (SPARK-29700) Support nested grouping sets
[ https://issues.apache.org/jira/browse/SPARK-29700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-29700:
-------------------------------------
    Description: 
PgSQL can process nested grouping sets, but Spark cannot;

{code:java}
postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer);
postgres=# insert into gstest2 values
postgres-# (1, 1, 1, 1, 1, 1, 1, 1),
postgres-# (1, 1, 1, 1, 1, 1, 1, 2),
postgres-# (1, 1, 1, 1, 1, 1, 2, 2),
postgres-# (1, 1, 1, 1, 1, 2, 2, 2),
postgres-# (1, 1, 1, 1, 2, 2, 2, 2),
postgres-# (1, 1, 1, 2, 2, 2, 2, 2),
postgres-# (1, 1, 2, 2, 2, 2, 2, 2),
postgres-# (1, 2, 2, 2, 2, 2, 2, 2),
postgres-# (2, 2, 2, 2, 2, 2, 2, 2);
INSERT 0 9
postgres=# select sum(c) from gstest2
postgres-# group by grouping sets(grouping sets((a, b)))
postgres-# order by 1 desc;
 sum
-----
  16
   4
   4
(3 rows)
{code}

{code:java}
scala> sql("""
     | select sum(c) from gstest2
     | group by grouping sets(grouping sets((a, b)))
     | order by 1 desc
     | """).show
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'sets' expecting {')', ','}(line 3, pos 34)

== SQL ==

select sum(c) from gstest2
group by grouping sets(grouping sets((a, b)))
----------------------------------^^^
order by 1 desc

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
  ... 51 elided
{code}

See a doc for the form:
https://www.postgresql.org/docs/current/sql-select.html
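Per the SQL standard, nested grouping sets flatten into the union of their member sets, so the failing query above can be rewritten by hand into a form Spark already parses. A sketch, assuming a SparkSession named `spark`:

{code:java}
// Sketch: grouping sets(grouping sets((a, b))) flattens to
// grouping sets((a, b)), which Spark's parser accepts today.
// "spark" is an assumed SparkSession and gstest2 an existing table.
spark.sql("""
  select sum(c) from gstest2
  group by grouping sets((a, b))
  order by 1 desc
""").show()
{code}

A parser-level fix would perform the same flattening automatically while building the grouping-set list.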
[jira] [Updated] (SPARK-29705) Support more expressive forms in GroupingSets/Cube/Rollup
[ https://issues.apache.org/jira/browse/SPARK-29705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-29705:
-------------------------------------
    Description: 
See a doc for the form:
https://www.postgresql.org/docs/current/sql-select.html

{code:java}
postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer);
postgres=# insert into gstest2 values
postgres-# (1, 1, 1, 1, 1, 1, 1, 1),
postgres-# (1, 1, 1, 1, 1, 1, 1, 2),
postgres-# (1, 1, 1, 1, 1, 1, 2, 2),
postgres-# (1, 1, 1, 1, 1, 2, 2, 2),
postgres-# (1, 1, 1, 1, 2, 2, 2, 2),
postgres-# (1, 1, 1, 2, 2, 2, 2, 2),
postgres-# (1, 1, 2, 2, 2, 2, 2, 2),
postgres-# (1, 2, 2, 2, 2, 2, 2, 2),
postgres-# (2, 2, 2, 2, 2, 2, 2, 2);
INSERT 0 9
postgres=# select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d));
 a | b | grouping | sum | count | max
---+---+----------+-----+-------+-----
   |   |        3 |  24 |    18 |   2
 1 | 1 |        0 |   4 |     2 |   2
 1 | 2 |        0 |   4 |     2 |   2
 1 | 1 |        0 |   2 |     2 |   1
 2 | 2 |        0 |   4 |     2 |   2
 1 | 1 |        0 |  10 |    10 |   1
 1 | 2 |        0 |   4 |     2 |   2
 1 | 1 |        0 |  12 |    12 |   1
 1 | 1 |        0 |   4 |     2 |   2
 2 | 2 |        0 |   4 |     2 |   2
(10 rows)
{code}

{code:java}
scala> sql("""select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d))""").show
org.apache.spark.sql.AnalysisException: Invalid number of arguments for function grouping. Expected: 1; Found: 2; line 1 pos 13
  at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$8(FunctionRegistry.scala:614)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:598)
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1375)
  at org.apache.spark.sql.hive.HiveSessionCatalog.super$lookupFunction(HiveSessionCatalog.scala:132)
  at org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$lookupFunction0$2(HiveSessionCatalog.scala:132)
  at scala.util.Try$.apply(Try.scala:213)
{code}
[jira] [Updated] (SPARK-29704) Support the combinations of grouping operations
[ https://issues.apache.org/jira/browse/SPARK-29704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-29704:
-------------------------------------
    Description: 
PgSQL can accept a query below with the combinations of grouping operations, but Spark cannot;

{code:java}
postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer);
postgres=# insert into gstest2 values
postgres-# (1, 1, 1, 1, 1, 1, 1, 1),
postgres-# (1, 1, 1, 1, 1, 1, 1, 2),
postgres-# (1, 1, 1, 1, 1, 1, 2, 2),
postgres-# (1, 1, 1, 1, 1, 2, 2, 2),
postgres-# (1, 1, 1, 1, 2, 2, 2, 2),
postgres-# (1, 1, 1, 2, 2, 2, 2, 2),
postgres-# (1, 1, 2, 2, 2, 2, 2, 2),
postgres-# (1, 2, 2, 2, 2, 2, 2, 2),
postgres-# (2, 2, 2, 2, 2, 2, 2, 2);
INSERT 0 9
postgres=# select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d);
 a | b | c | d
---+---+---+---
 1 | 2 | 2 |
 1 | 1 | 2 |
 1 | 1 | 1 |
 2 | 2 | 2 |
 1 |   | 1 |
 2 |   | 2 |
 1 |   | 2 |
   |   | 2 |
   |   | 1 |
 1 | 2 |   | 2
 1 | 1 |   | 2
 1 | 1 |   | 1
 2 | 2 |   | 2
 1 |   |   | 1
 2 |   |   | 2
 1 |   |   | 2
   |   |   | 2
   |   |   | 1
(18 rows)
{code}

{code}
scala> sql("""select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)""").show
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'sets' expecting {<EOF>, ',', '.', '[', 'AND', 'BETWEEN', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUPING', 'HAVING', 'IN', 'INTERSECT', 'IS', 'LIKE', 'LIMIT', NOT, 'OR', 'ORDER', RLIKE, 'MINUS', 'SORT', 'UNION', 'WINDOW', 'WITH', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'} (line 1, pos 61)

== SQL ==

select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)
-------------------------------------------------------------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
  ... 47 elided
{code}

See a doc for the form:
https://www.postgresql.org/docs/current/sql-select.html
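As a workaround, rollup(a,b) is shorthand for grouping sets ((a,b),(a),()), and per the SQL standard, comma-separated grouping elements combine as a cross product, so the query above can be expanded by hand into a single grouping-sets list that Spark parses. A sketch, assuming a SparkSession named `spark`:

{code:java}
// Sketch: {(a,b),(a),()} x {(c),(d)} expands to the six sets below,
// which is the cross product PgSQL computes for
// "group by rollup(a,b),grouping sets(c,d)".
// "spark" is an assumed SparkSession and gstest2 an existing table.
spark.sql("""
  select a, b, c, d from gstest2
  group by grouping sets ((a,b,c),(a,b,d),(a,c),(a,d),(c),(d))
""").show()
{code}

Native support would perform this cross-product expansion in the parser or analyzer instead of requiring the manual rewrite.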
[jira] [Created] (SPARK-29708) Different answers in aggregates of multiple grouping sets
Takeshi Yamamuro created SPARK-29708:
----------------------------------------

             Summary: Different answers in aggregates of multiple grouping sets
                 Key: SPARK-29708
                 URL: https://issues.apache.org/jira/browse/SPARK-29708
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Takeshi Yamamuro

A query below with multiple grouping sets seems to have different answers between PgSQL and Spark;

{code:java}
postgres=# create table gstest4(id integer, v integer, unhashable_col bit(4), unsortable_col xid);
postgres=# insert into gstest4
postgres-# values (1,1,b'','1'), (2,2,b'0001','1'),
postgres-#        (3,4,b'0010','2'), (4,8,b'0011','2'),
postgres-#        (5,16,b'','2'), (6,32,b'0001','2'),
postgres-#        (7,64,b'0010','1'), (8,128,b'0011','1');
INSERT 0 8
postgres=# select unsortable_col, count(*)
postgres-# from gstest4 group by grouping sets ((unsortable_col),(unsortable_col))
postgres-# order by text(unsortable_col);
 unsortable_col | count
----------------+-------
 1              |     8
 1              |     8
 2              |     8
 2              |     8
(4 rows)
{code}

{code:java}
scala> sql("""create table gstest4(id integer, v integer, unhashable_col /* bit(4) */ byte, unsortable_col /* xid */ integer) using parquet""")

scala> sql("""
     | insert into gstest4
     | values (1,1,tinyint('0'),1), (2,2,tinyint('1'),1),
     |        (3,4,tinyint('2'),2), (4,8,tinyint('3'),2),
     |        (5,16,tinyint('0'),2), (6,32,tinyint('1'),2),
     |        (7,64,tinyint('2'),1), (8,128,tinyint('3'),1)
     | """)
res21: org.apache.spark.sql.DataFrame = []

scala> sql("""
     | select unsortable_col, count(*)
     | from gstest4 group by grouping sets ((unsortable_col),(unsortable_col))
     | order by string(unsortable_col)
     | """).show
+--------------+--------+
|unsortable_col|count(1)|
+--------------+--------+
|             1|       8|
|             2|       8|
+--------------+--------+
{code}
[jira] [Updated] (SPARK-29701) Different answers when empty input given in GROUPING SETS
[ https://issues.apache.org/jira/browse/SPARK-29701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-29701:
-------------------------------------
    Description: 
A query below with an empty input seems to have different answers between PgSQL and Spark;

{code:java}
postgres=# create table gstest_empty (a integer, b integer, v integer);
CREATE TABLE
postgres=# select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),());
 a | b | sum | count
---+---+-----+-------
   |   |     |     0
(1 row)
{code}

{code:java}
scala> sql("""select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),())""").show
+---+---+------+--------+
|  a|  b|sum(v)|count(1)|
+---+---+------+--------+
+---+---+------+--------+
{code}
[jira] [Created] (SPARK-29707) Make PartitionFilters and PushedFilters abbreviate configurable in metadata
Yuming Wang created SPARK-29707:
-----------------------------------

             Summary: Make PartitionFilters and PushedFilters abbreviate configurable in metadata
                 Key: SPARK-29707
                 URL: https://issues.apache.org/jira/browse/SPARK-29707
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Yuming Wang
         Attachments: screenshot-1.png

!image-2019-11-01-13-12-38-712.png!

The abbreviated output loses some key information.

Related code:
https://github.com/apache/spark/blob/ec5d698d99634e5bb8fc7b0fa1c270dd67c129c8/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L58-L66
[jira] [Updated] (SPARK-29707) Make PartitionFilters and PushedFilters abbreviate configurable in metadata
[ https://issues.apache.org/jira/browse/SPARK-29707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-29707:
--------------------------------
    Attachment: screenshot-1.png
[jira] [Updated] (SPARK-29707) Make PartitionFilters and PushedFilters abbreviate configurable in metadata
[ https://issues.apache.org/jira/browse/SPARK-29707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-29707:
--------------------------------
    Description: 
!screenshot-1.png!
The abbreviated output loses some key information.
Related code:
https://github.com/apache/spark/blob/ec5d698d99634e5bb8fc7b0fa1c270dd67c129c8/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L58-L66

was:
!image-2019-11-01-13-12-38-712.png!
It lost some key information.
Related code:
https://github.com/apache/spark/blob/ec5d698d99634e5bb8fc7b0fa1c270dd67c129c8/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L58-L66
[jira] [Created] (SPARK-29706) Support an empty grouping expression
Takeshi Yamamuro created SPARK-29706: Summary: Support an empty grouping expression Key: SPARK-29706 URL: https://issues.apache.org/jira/browse/SPARK-29706 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro PgSQL can accept a query below with an empty grouping expr, but Spark cannot; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select v.c, (select count(*) from gstest2 group by () having v.c) from (values (false),(true)) v(c) order by v.c; c | count ---+--- f | t |18 (2 rows) {code} {code:java} scala> sql("""select v.c, (select count(*) from gstest2 group by () having v.c) from (values (false),(true)) v(c) order by v.c""").show org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input '()'(line 1, pos 52) == SQL == select v.c, (select count(*) from gstest2 group by () having v.c) from (values (false),(true)) v(c) order by v.c ^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 
47 elided {code}
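Semantically, `group by ()` groups every row into a single global group, which is why the PgSQL subquery above behaves like a plain aggregate over the whole table. A small Python sketch of that semantics (illustration only, not Spark or PgSQL code):

```python
def group_rows(rows, key_columns):
    """Group rows (dicts) by the given key columns.

    An empty key list -- the `group by ()` case -- maps every row to
    the key (), so all rows land in one global group and aggregates
    run exactly once over the whole input.
    """
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)  # () when key_columns is empty
        groups.setdefault(key, []).append(row)
    return groups

rows = [{"c": 1}, {"c": 2}, {"c": 3}]
assert len(group_rows(rows, [])) == 1      # one global group
assert len(group_rows(rows, ["c"])) == 3   # one group per distinct c
```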
[jira] [Updated] (SPARK-29700) Support nested grouping sets
[ https://issues.apache.org/jira/browse/SPARK-29700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29700: - Description: PgSQL can process nested grouping sets, but Spark cannot; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select sum(c) from gstest2 postgres-# group by grouping sets(grouping sets((a, b))) postgres-# order by 1 desc; sum - 16 4 4 (3 rows) {code} {code:java} scala> sql(""" | select sum(c) from gstest2 | group by grouping sets(grouping sets((a, b))) | order by 1 desc | """).show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {')', ','}(line 3, pos 34) == SQL == select sum(c) from gstest2 group by grouping sets(grouping sets((a, b))) --^^^ order by 1 desc at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 
51 elided {code} was: PgSQL can process nested grouping sets, but Spark cannot; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); ERROR: relation "gstest2" already exists postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select sum(c) from gstest2 postgres-# group by grouping sets(grouping sets((a, b))) postgres-# order by 1 desc; sum - 16 4 4 (3 rows) {code} {code:java} scala> sql(""" | select sum(c) from gstest2 | group by grouping sets(grouping sets((a, b))) | order by 1 desc | """).show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {')', ','}(line 3, pos 34) == SQL == select sum(c) from gstest2 group by grouping sets(grouping sets((a, b))) --^^^ order by 1 desc at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 
51 elided {code} > Support nested grouping sets > > > Key: SPARK-29700 > URL: https://issues.apache.org/jira/browse/SPARK-29700 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > PgSQL can process nested grouping sets, but Spark cannot; > {code:java} > postgres=# create table gstest2 (a integer, b integer, c integer, d integer, > e integer, f integer, g integer, h integer); > postgres=# insert into gstest2 values > postgres-# (1, 1, 1, 1, 1, 1, 1, 1), > postgres-# (1, 1, 1, 1, 1, 1, 1, 2), > postgres-# (1, 1, 1, 1, 1, 1, 2, 2), > postgres-# (1, 1, 1, 1, 1, 2, 2, 2), > postgres-# (1, 1, 1, 1, 2, 2, 2, 2), > postgres-# (1, 1, 1, 2, 2, 2, 2, 2), > postgres-# (1, 1, 2, 2, 2, 2, 2, 2), > postgres-# (1, 2, 2, 2, 2, 2, 2, 2), > postgres-# (2, 2, 2, 2, 2, 2, 2, 2); > INSERT 0 9 > postgres=# select sum(c) from gstest2 > postgres-# group by grouping sets(grouping sets((a, b))) > postgres-# order
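The PgSQL semantics being requested flatten the nesting: `grouping sets(grouping sets((a, b)))` is equivalent to `grouping sets((a, b))`. A Python sketch of that flattening (the representation is made up for illustration; it is not Spark's parser or plan type):

```python
def flatten_grouping_sets(spec):
    """Flatten nested GROUPING SETS into a flat list of grouping sets.

    A spec is either a tuple of column names (one grouping set) or a
    list of specs (a GROUPING SETS(...) wrapper, possibly nested).
    """
    if isinstance(spec, tuple):
        return [spec]
    flat = []
    for inner in spec:
        flat.extend(flatten_grouping_sets(inner))
    return flat

# grouping sets(grouping sets((a, b))) reduces to the single set (a, b)
assert flatten_grouping_sets([[("a", "b")]]) == [("a", "b")]
# grouping sets((a), grouping sets((b), ())) reduces to (a), (b), ()
assert flatten_grouping_sets([("a",), [("b",), ()]]) == [("a",), ("b",), ()]
```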
[jira] [Updated] (SPARK-29704) Support the combinations of grouping operations
[ https://issues.apache.org/jira/browse/SPARK-29704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29704: - Description: PgSQL can accept a query below with the combinations of grouping operations, but Spark cannot; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d); a | b | c | d ---+---+---+--- 1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | | 1 | 2 | | 2 | 1 | | 2 | | | 2 | | | 1 | 1 | 2 | | 2 1 | 1 | | 2 1 | 1 | | 1 2 | 2 | | 2 1 | | | 1 2 | | | 2 1 | | | 2 | | | 2 | | | 1 (18 rows) {code} scala> sql("""select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)""").show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {, ',', '.', '[', 'AND', 'BETWEEN', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUPING', 'HAVING', 'IN', 'INTERSECT', 'IS', 'LIKE', 'LIMIT', NOT, 'OR', 'ORDER', RLIKE, 'MINUS', 'SORT', 'UNION', 'WINDOW', 'WITH', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'} (line 1, pos 61) == SQL == select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d) -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided {code:java} {code} was: PgSQL can accept a query below with the combinations of grouping operations, but Spark cannot; {code} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); ERROR: relation "gstest2" already exists postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d); a | b | c | d ---+---+---+--- 1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | | 1 | 2 | | 2 | 1 | | 2 | | | 2 | | | 1 | 1 | 2 | | 2 1 | 1 | | 2 1 | 1 | | 1 2 | 2 | | 2 1 | | | 1 2 | | | 2 1 | | | 2 | | | 2 | | | 1 (18 rows) {code} scala> sql("""select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)""").show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {, ',', '.', '[', 'AND', 'BETWEEN', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUPING', 'HAVING', 'IN', 'INTERSECT', 'IS', 'LIKE', 'LIMIT', NOT, 'OR', 'ORDER', RLIKE, 'MINUS', 'SORT', 'UNION', 'WINDOW', 'WITH', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'}(line 1, pos 61) == SQL == select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d) -^^^ at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided {code} {code} > Support the combinations of grouping operations > ---
[jira] [Updated] (SPARK-29704) Support the combinations of grouping operations
[ https://issues.apache.org/jira/browse/SPARK-29704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29704: - Description: PgSQL can accept a query below with the combinations of grouping operations, but Spark cannot; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d); a | b | c | d ---+---+---+--- 1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | | 1 | 2 | | 2 | 1 | | 2 | | | 2 | | | 1 | 1 | 2 | | 2 1 | 1 | | 2 1 | 1 | | 1 2 | 2 | | 2 1 | | | 1 2 | | | 2 1 | | | 2 | | | 2 | | | 1 (18 rows) {code} {code} scala> sql("""select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)""").show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {, ',', '.', '[', 'AND', 'BETWEEN', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUPING', 'HAVING', 'IN', 'INTERSECT', 'IS', 'LIKE', 'LIMIT', NOT, 'OR', 'ORDER', RLIKE, 'MINUS', 'SORT', 'UNION', 'WINDOW', 'WITH', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'} (line 1, pos 61) == SQL == select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d) -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided {code} was: PgSQL can accept a query below with the combinations of grouping operations, but Spark cannot; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d); a | b | c | d ---+---+---+--- 1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | | 1 | 2 | | 2 | 1 | | 2 | | | 2 | | | 1 | 1 | 2 | | 2 1 | 1 | | 2 1 | 1 | | 1 2 | 2 | | 2 1 | | | 1 2 | | | 2 1 | | | 2 | | | 2 | | | 1 (18 rows) {code} scala> sql("""select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)""").show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {, ',', '.', '[', 'AND', 'BETWEEN', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUPING', 'HAVING', 'IN', 'INTERSECT', 'IS', 'LIKE', 'LIMIT', NOT, 'OR', 'ORDER', RLIKE, 'MINUS', 'SORT', 'UNION', 'WINDOW', 'WITH', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'} (line 1, pos 61) == SQL == select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d) -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided {code:java} {code} > Support the combinations of grouping operations > --- > >
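In PgSQL, combining a ROLLUP with a GROUPING SETS item in one GROUP BY takes the cross product of the two lists of grouping sets, concatenating each pair. A Python sketch of that expansion (illustrative semantics only; the helper names are invented):

```python
from itertools import product

def rollup(*cols):
    """ROLLUP(a, b) expands to [(a, b), (a,), ()]."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def combine(*clauses):
    """Cross product of grouping clauses, concatenating each pick."""
    return [sum(pick, ()) for pick in product(*clauses)]

sets_cd = [("c",), ("d",)]  # grouping sets(c, d)
combined = combine(rollup("a", "b"), sets_cd)
# rollup yields 3 sets, grouping sets yields 2 -> 6 combined grouping sets
assert combined == [("a", "b", "c"), ("a", "b", "d"),
                    ("a", "c"), ("a", "d"), ("c",), ("d",)]
```

Those six grouping sets are what produce the 18-row PgSQL result above; Spark's parser rejects the syntax before any of this expansion can happen.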
[jira] [Updated] (SPARK-29705) Support more expressive forms in GroupingSets/Cube/Rollup
[ https://issues.apache.org/jira/browse/SPARK-29705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29705: - Description: {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d)); a | b | grouping | sum | count | max ---+---+--+-+---+- | |3 | 24 |18 | 2 1 | 1 |0 | 4 | 2 | 2 1 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 2 | 2 | 1 2 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 10 |10 | 1 1 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 12 |12 | 1 1 | 1 |0 | 4 | 2 | 2 2 | 2 |0 | 4 | 2 | 2 (10 rows) {code} {code:java} scala> sql("""select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d))""").show org.apache.spark.sql.AnalysisException: Invalid number of arguments for function grouping. 
Expected: 1; Found: 2; line 1 pos 13 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$8(FunctionRegistry.scala:614) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:598) at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1375) at org.apache.spark.sql.hive.HiveSessionCatalog.super$lookupFunction(HiveSessionCatalog.scala:132) at org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$lookupFunction0$2(HiveSessionCatalog.scala:132) at scala.util.Try$.apply(Try.scala:213) {code} was: {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); ERROR: relation "gstest2" already exists postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d)); a | b | grouping | sum | count | max ---+---+--+-+---+- | |3 | 24 |18 | 2 1 | 1 |0 | 4 | 2 | 2 1 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 2 | 2 | 1 2 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 10 |10 | 1 1 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 12 |12 | 1 1 | 1 |0 | 4 | 2 | 2 2 | 2 |0 | 4 | 2 | 2 (10 rows) {code} {code:java} scala> sql("""select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d))""").show org.apache.spark.sql.AnalysisException: Invalid number of arguments for function grouping. 
Expected: 1; Found: 2; line 1 pos 13 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$8(FunctionRegistry.scala:614) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:598) at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1375) at org.apache.spark.sql.hive.HiveSessionCatalog.super$lookupFunction(HiveSessionCatalog.scala:132) at org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$lookupFunction0$2(HiveSessionCatalog.scala:132) at scala.util.Try$.apply(Try.scala:213) {code} > Support more expressive forms in GroupingSets/Cube/Rollup > - > > Key: SPARK-29705 > URL: https://issues.apache.org/jira/browse/SPARK-29705 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code:java} > postgres=# create table gstest2 (
[jira] [Created] (SPARK-29705) Support more expressive forms in GroupingSets/Cube/Rollup
Takeshi Yamamuro created SPARK-29705: Summary: Support more expressive forms in GroupingSets/Cube/Rollup Key: SPARK-29705 URL: https://issues.apache.org/jira/browse/SPARK-29705 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); ERROR: relation "gstest2" already exists postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d)); a | b | grouping | sum | count | max ---+---+--+-+---+- | |3 | 24 |18 | 2 1 | 1 |0 | 4 | 2 | 2 1 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 2 | 2 | 1 2 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 10 |10 | 1 1 | 2 |0 | 4 | 2 | 2 1 | 1 |0 | 12 |12 | 1 1 | 1 |0 | 4 | 2 | 2 2 | 2 |0 | 4 | 2 | 2 (10 rows) {code} {code:java} scala> sql("""select a, b, grouping(a,b), sum(c), count(*), max(c) from gstest2 group by rollup ((a,b,c),(c,d))""").show org.apache.spark.sql.AnalysisException: Invalid number of arguments for function grouping. 
Expected: 1; Found: 2; line 1 pos 13 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$8(FunctionRegistry.scala:614) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:598) at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1375) at org.apache.spark.sql.hive.HiveSessionCatalog.super$lookupFunction(HiveSessionCatalog.scala:132) at org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$lookupFunction0$2(HiveSessionCatalog.scala:132) at scala.util.Try$.apply(Try.scala:213) {code}
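Part of what this ticket needs is PgSQL's multi-argument `grouping(a, b)`, which packs one bit per argument (1 when the column is aggregated away in the current grouping set) into an integer; that is why the all-NULL grand-total row above reports 3. A Python sketch of that bitmask (illustrative only):

```python
def grouping(grouped_columns, *args):
    """Return the GROUPING(...) bitmask for the current grouping set.

    Each argument contributes one bit, most significant first:
    1 if the column is *not* part of the current grouping set.
    """
    mask = 0
    for col in args:
        mask = (mask << 1) | (0 if col in grouped_columns else 1)
    return mask

# Grand-total row: neither a nor b is grouped, so grouping(a,b) == 0b11 == 3
assert grouping(set(), "a", "b") == 3
# Current set (a, b, c): both bits are 0
assert grouping({"a", "b", "c"}, "a", "b") == 0
# Current set (a,): only the b bit is set
assert grouping({"a"}, "a", "b") == 1
```

Spark's `grouping` currently accepts a single column, hence the "Expected: 1; Found: 2" analysis error above.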
[jira] [Assigned] (SPARK-24203) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-24203: --- Assignee: Nishchal Venkataramana > Make executor's bindAddress configurable > > > Key: SPARK-24203 > URL: https://issues.apache.org/jira/browse/SPARK-24203 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Lukas Majercak >Assignee: Nishchal Venkataramana >Priority: Major > Labels: bulk-closed >
[jira] [Created] (SPARK-29704) Support the combinations of grouping operations
Takeshi Yamamuro created SPARK-29704: Summary: Support the combinations of grouping operations Key: SPARK-29704 URL: https://issues.apache.org/jira/browse/SPARK-29704 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro PgSQL can accept a query below with the combinations of grouping operations, but Spark cannot; {code} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); ERROR: relation "gstest2" already exists postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d); a | b | c | d ---+---+---+--- 1 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | | 1 | 2 | | 2 | 1 | | 2 | | | 2 | | | 1 | 1 | 2 | | 2 1 | 1 | | 2 1 | 1 | | 1 2 | 2 | | 2 1 | | | 1 2 | | | 2 1 | | | 2 | | | 2 | | | 1 (18 rows) {code} scala> sql("""select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d)""").show org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'sets' expecting {, ',', '.', '[', 'AND', 'BETWEEN', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUPING', 'HAVING', 'IN', 'INTERSECT', 'IS', 'LIKE', 'LIMIT', NOT, 'OR', 'ORDER', RLIKE, 'MINUS', 'SORT', 'UNION', 'WINDOW', 'WITH', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'}(line 1, pos 61) == SQL == select a, b, c, d from gstest2 group by rollup(a,b),grouping sets(c,d) -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135) at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided {code} {code}
[jira] [Reopened] (SPARK-24203) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reopened SPARK-24203: - > Make executor's bindAddress configurable > > > Key: SPARK-24203 > URL: https://issues.apache.org/jira/browse/SPARK-24203 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Lukas Majercak >Priority: Major > Labels: bulk-closed >
[jira] [Commented] (SPARK-29670) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-29670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964599#comment-16964599 ] DB Tsai commented on SPARK-29670: - This is a duplication of SPARK-24203 > Make executor's bindAddress configurable > > > Key: SPARK-29670 > URL: https://issues.apache.org/jira/browse/SPARK-29670 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4 >Reporter: Nishchal Venkataramana >Priority: Major > Fix For: 3.0.0 > >
[jira] [Resolved] (SPARK-29670) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-29670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-29670. - Resolution: Duplicate > Make executor's bindAddress configurable > > > Key: SPARK-29670 > URL: https://issues.apache.org/jira/browse/SPARK-29670 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4 >Reporter: Nishchal Venkataramana >Priority: Major > Fix For: 3.0.0 > >
[jira] [Created] (SPARK-29703) Support grouping() in GROUP BY without GroupingSets/Cube/Rollup
Takeshi Yamamuro created SPARK-29703: Summary: Support grouping() in GROUP BY without GroupingSets/Cube/Rollup Key: SPARK-29703 URL: https://issues.apache.org/jira/browse/SPARK-29703 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro PgSQL can accept the query below that have grouping() in GROUP BY without GroupingSets/Cube/Rollup; {code:java} postgres=# CREATE TABLE onek (unique1 int, unique2 int, two int, four int, ten int, twenty int, hundred int, thousand int, twothousand int, fivethous int, tenthous int, odd int, even int, textu1 text, textu2 text, text4 text); CREATE TABLE postgres=# select ten, grouping(ten) from onek group by (ten) having grouping(ten) >= 0 order by 2,1; ten | grouping -+-- (0 rows) {code} {code:java} scala> sql("""select ten, grouping(ten) from onek group by (ten) having grouping(ten) >= 0 order by 2,1""").show() org.apache.spark.sql.AnalysisException: grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:47) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:46) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:122) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$findGroupingExprs$1.applyOrElse(Analyzer.scala:503) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$findGroupingExprs$1.applyOrElse(Analyzer.scala:497) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:228) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:224) at org.apache.spark.sql.catalyst.trees.TreeNode.collectFirst(TreeNode.scala:202) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$findGroupingExprs(Analyzer.scala:497) {code}
[jira] [Created] (SPARK-29702) Resolve group-by columns with functional dependencies
Takeshi Yamamuro created SPARK-29702: Summary: Resolve group-by columns with functional dependencies Key: SPARK-29702 URL: https://issues.apache.org/jira/browse/SPARK-29702 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro In PgSQL, functional dependencies affect grouping column resolution in an analyzer; {code:java} postgres=# \d gstest3 Table "public.gstest3" Column | Type | Collation | Nullable | Default +-+---+--+- a | integer | | | b | integer | | | c | integer | | | d | integer | | | postgres=# select a, d, grouping(a,b,c) from gstest3 group by grouping sets ((a,b), (a,c)); ERROR: column "gstest3.d" must appear in the GROUP BY clause or be used in an aggregate function LINE 1: select a, d, grouping(a,b,c) from gstest3 group by grouping ... ^ postgres=# alter table gstest3 add primary key (a); ALTER TABLE postgres=# select a, d, grouping(a,b,c) from gstest3 group by grouping sets ((a,b), (a,c)); a | d | grouping ---+---+-- 1 | 1 |1 2 | 2 |1 1 | 1 |2 2 | 2 |2 (4 rows) {code}
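The resolution rule PgSQL applies here: a non-grouped column is still legal in the SELECT list if it is functionally determined by grouped columns (once `a` is the primary key, grouping on `a` determines `d`). A Python sketch of that check, simplified to a single set of grouped columns (the real rule is evaluated per grouping set; all names are illustrative):

```python
def valid_select_columns(select_cols, grouped_cols, primary_keys):
    """Columns resolve if grouped, or functionally determined by a key.

    primary_keys: set of key columns. If every key column is grouped,
    all other columns of the table are functionally determined, so any
    selected column becomes legal.
    """
    keys_grouped = bool(primary_keys) and primary_keys <= set(grouped_cols)
    return [c for c in select_cols if c in grouped_cols or keys_grouped]

# Without a key, d cannot be resolved; with primary key (a) it can.
assert valid_select_columns(["a", "d"], {"a", "b"}, set()) == ["a"]
assert valid_select_columns(["a", "d"], {"a", "b"}, {"a"}) == ["a", "d"]
```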
[jira] [Created] (SPARK-29701) Different answers when empty input given in GROUPING SETS
Takeshi Yamamuro created SPARK-29701:

Summary: Different answers when empty input given in GROUPING SETS
Key: SPARK-29701
URL: https://issues.apache.org/jira/browse/SPARK-29701
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro

A query below with an empty input seems to have different answers between PgSQL and Spark;

{code:java}
postgres=# create table gstest_empty (a integer, b integer, v integer);
CREATE TABLE
postgres=# select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),());
 a | b | sum | count
---+---+-----+-------
   |   |     |     0
(1 row)
{code}

{code:java}
scala> sql("""select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),())""").show
+---+---+------+--------+
|  a|  b|sum(v)|count(1)|
+---+---+------+--------+
+---+---+------+--------+
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
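The behavior PgSQL implements here is that the empty grouping set `()` is a global aggregate, so it yields exactly one row even over zero input rows. A toy evaluator makes this concrete (illustrative code, not the Spark or PgSQL implementation):

```python
# Minimal GROUPING SETS evaluator: each grouping set partitions the rows by
# its key columns; the empty set () is the global aggregate and must produce
# one row (count 0) even when there are no input rows.
def grouping_sets(rows, sets):
    out = []
    for gs in sets:
        groups = {}
        for row in rows:
            groups.setdefault(tuple(row[c] for c in gs), []).append(row)
        if gs == ():
            groups.setdefault((), [])  # global aggregate always yields a row
        for key, grp in groups.items():
            out.append((dict(zip(gs, key)), len(grp)))
    return out

# Empty input, grouping sets ((a,b), ()): one all-NULL row with count(*) = 0,
# which is what PgSQL returns; Spark currently returns no rows at all.
assert grouping_sets([], [("a", "b"), ()]) == [({}, 0)]
```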
[jira] [Created] (SPARK-29700) Support nested grouping sets
Takeshi Yamamuro created SPARK-29700:

Summary: Support nested grouping sets
Key: SPARK-29700
URL: https://issues.apache.org/jira/browse/SPARK-29700
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro

PgSQL can process nested grouping sets, but Spark cannot;

{code:java}
postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer);
ERROR:  relation "gstest2" already exists
postgres=# insert into gstest2 values
postgres-# (1, 1, 1, 1, 1, 1, 1, 1),
postgres-# (1, 1, 1, 1, 1, 1, 1, 2),
postgres-# (1, 1, 1, 1, 1, 1, 2, 2),
postgres-# (1, 1, 1, 1, 1, 2, 2, 2),
postgres-# (1, 1, 1, 1, 2, 2, 2, 2),
postgres-# (1, 1, 1, 2, 2, 2, 2, 2),
postgres-# (1, 1, 2, 2, 2, 2, 2, 2),
postgres-# (1, 2, 2, 2, 2, 2, 2, 2),
postgres-# (2, 2, 2, 2, 2, 2, 2, 2);
INSERT 0 9
postgres=# select sum(c) from gstest2
postgres-# group by grouping sets(grouping sets((a, b)))
postgres-# order by 1 desc;
 sum
-----
  16
   4
   4
(3 rows)
{code}

{code:java}
scala> sql("""
     | select sum(c) from gstest2
     | group by grouping sets(grouping sets((a, b)))
     | order by 1 desc
     | """).show
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'sets' expecting {')', ','}(line 3, pos 34)

== SQL ==
select sum(c) from gstest2
group by grouping sets(grouping sets((a, b)))
----------------------------------^^^
order by 1 desc

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:268)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:135)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:85)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
  ... 51 elided
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
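PgSQL handles nesting by flattening: `grouping sets(grouping sets((a, b)))` is equivalent to `grouping sets((a, b))`, which is consistent with the 3-row result above (one group per distinct (a, b)). A sketch of that flattening step, with a made-up list encoding of the spec:

```python
# Encoding assumption: a spec is a list whose elements are either a tuple of
# columns (a plain grouping, e.g. ("a", "b")) or a list (a nested
# GROUPING SETS(...)). Flattening splices nested sets into the parent.
def flatten(spec):
    out = []
    for item in spec:
        if isinstance(item, list):      # nested GROUPING SETS(...)
            out.extend(flatten(item))
        else:                           # plain grouping
            out.append(item)
    return out

nested = [[("a", "b")]]   # grouping sets(grouping sets((a, b)))
assert flatten(nested) == [("a", "b")]  # same as grouping sets((a, b))
```

Supporting the syntax in Spark would therefore mostly be a parser change plus this flattening during analysis.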
[jira] [Created] (SPARK-29699) Different answers in nested aggregates with window functions
Takeshi Yamamuro created SPARK-29699:

Summary: Different answers in nested aggregates with window functions
Key: SPARK-29699
URL: https://issues.apache.org/jira/browse/SPARK-29699
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro

A nested aggregate below with a window function seems to have different answers in the `rsum` column between PgSQL and Spark;

{code:java}
postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer);
postgres=# insert into gstest2 values
postgres-# (1, 1, 1, 1, 1, 1, 1, 1),
postgres-# (1, 1, 1, 1, 1, 1, 1, 2),
postgres-# (1, 1, 1, 1, 1, 1, 2, 2),
postgres-# (1, 1, 1, 1, 1, 2, 2, 2),
postgres-# (1, 1, 1, 1, 2, 2, 2, 2),
postgres-# (1, 1, 1, 2, 2, 2, 2, 2),
postgres-# (1, 1, 2, 2, 2, 2, 2, 2),
postgres-# (1, 2, 2, 2, 2, 2, 2, 2),
postgres-# (2, 2, 2, 2, 2, 2, 2, 2);
INSERT 0 9
postgres=#
postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b;
 a | b | sum | rsum
---+---+-----+------
 1 | 1 |  16 |   16
 1 | 2 |   4 |   20
 1 |   |  20 |   40
 2 | 2 |   4 |   44
 2 |   |   4 |   48
   |   |  24 |   72
(6 rows)
{code}

{code:java}
scala> sql("""
     | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
     | from gstest2 group by rollup (a,b) order by rsum, a, b
     | """).show()
+----+----+------+----+
|   a|   b|sum(c)|rsum|
+----+----+------+----+
|null|null|    12|  12|
|   1|null|    10|  22|
|   1|   1|     8|  30|
|   1|   2|     2|  32|
|   2|null|     2|  34|
|   2|   2|     2|  36|
+----+----+------+----+
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
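Two things appear to differ here (this is an interpretation of the transcripts, not a confirmed diagnosis): the absolute sums differ because the PgSQL table already held rows from an earlier insert, and the rsum sequence differs because the window's `order by a,b` places the rollup's NULL subtotal rows differently. PgSQL sorts ascending with NULLS LAST by default, Spark with NULLS FIRST, so the running sum visits the grand-total row last in PgSQL but first in Spark (as in its output above, where the null,null row comes first). A sketch of the ordering effect, using PgSQL's doubled sums:

```python
# Rollup output rows (a, b, sum(c)) as PgSQL produced them.
rollup_rows = [
    (1, 1, 16), (1, 2, 4), (1, None, 20),
    (2, 2, 4), (2, None, 4), (None, None, 24),
]

def sort_key(row, nulls_first):
    # Order by (a, b); the boolean places NULLs first or last per column.
    key = []
    for v in row[:2]:
        is_null = v is None
        key.append((is_null != nulls_first, 0 if is_null else v))
    return key

def rsums(nulls_first):
    total, out = 0, []
    for _, _, s in sorted(rollup_rows, key=lambda r: sort_key(r, nulls_first)):
        total += s
        out.append(total)
    return out

assert rsums(nulls_first=False) == [16, 20, 40, 44, 48, 72]  # PgSQL's rsum
assert rsums(nulls_first=True) == [24, 44, 60, 64, 68, 72]   # NULLs-first order
```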
[jira] [Created] (SPARK-29698) Support grouping function with multiple arguments
Takeshi Yamamuro created SPARK-29698:

Summary: Support grouping function with multiple arguments
Key: SPARK-29698
URL: https://issues.apache.org/jira/browse/SPARK-29698
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro

In PgSQL, grouping() can have multiple arguments, but Spark's grouping() must have a single argument ([https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala#L100]);

{code:java}
postgres=# select a, b, grouping(a, b), sum(v), count(*), max(v)
postgres-# from gstest1 group by rollup (a,b);
 a | b | grouping | sum | count | max
---+---+----------+-----+-------+-----
   |   |        3 |     |     0 |
(1 row)
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
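Multi-argument grouping() is just a bit mask over the per-column markers, so it can be expressed in terms of the single-argument form Spark already has (Spark's grouping_id is a similar bit-mask function). A sketch of the semantics:

```python
# grouping(c1, c2, ...) sets bit i (most significant argument first) to 1
# when that column is rolled up in the current grouping.
def grouping_multi(grouped_cols, *cols):
    value = 0
    for col in cols:
        value = (value << 1) | (0 if col in grouped_cols else 1)
    return value

# The PgSQL row above is the rollup's grand total: neither a nor b is
# grouped, so grouping(a, b) = 0b11 = 3.
assert grouping_multi(set(), "a", "b") == 3
assert grouping_multi({"a"}, "a", "b") == 1      # only b rolled up
assert grouping_multi({"a", "b"}, "a", "b") == 0 # fully grouped row
```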
[jira] [Resolved] (SPARK-29686) LinearSVC should persist instances if needed
[ https://issues.apache.org/jira/browse/SPARK-29686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29686. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26344 [https://github.com/apache/spark/pull/26344] > LinearSVC should persist instances if needed > > > Key: SPARK-29686 > URL: https://issues.apache.org/jira/browse/SPARK-29686 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 3.0.0 > > > Current LinearSVC impl forgot to cache the input dataset. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
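The motivation for persisting the input is that an iterative optimizer re-reads the training instances on every pass; without caching, each pass recomputes the upstream pipeline. A toy analogy in Python (a counter stands in for the recomputation cost; this is not the Spark ML internals):

```python
# Each call to load_instances() models recomputing the upstream lineage.
compute_count = 0

def load_instances():
    global compute_count
    compute_count += 1
    return [(1.0, [0.5]), (0.0, [-0.5])]

# Uncached: 10 optimizer iterations recompute the input 10 times.
for _ in range(10):
    load_instances()
assert compute_count == 10

# "Persisted": materialize once, then iterate over the cached copy.
compute_count = 0
cached = load_instances()
for _ in range(10):
    _ = cached
assert compute_count == 1
```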
[jira] [Created] (SPARK-29697) Support bit string types/literals
Takeshi Yamamuro created SPARK-29697:

Summary: Support bit string types/literals
Key: SPARK-29697
URL: https://issues.apache.org/jira/browse/SPARK-29697
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro

In PgSQL, there are bit types and literals;

{code}
postgres=# create table b(b bit(4));
CREATE TABLE
postgres=# select b'0010';
 ?column?
----------
 0010
(1 row)
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
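A bit-string literal such as b'0010' carries both a value and a declared width, and prints back zero-padded to that width. A minimal model of parsing and printing it (illustrative helper, not Spark or PgSQL code):

```python
# Parse a literal of the form b'...' into (integer value, bit width).
def parse_bit_literal(text):
    assert text.startswith("b'") and text.endswith("'")
    bits = text[2:-1]
    assert set(bits) <= {"0", "1"}
    return int(bits, 2), len(bits)

value, width = parse_bit_literal("b'0010'")
assert (value, width) == (2, 4)
# Printing zero-pads to the declared width, matching the PgSQL output above.
assert format(value, "0{}b".format(width)) == "0010"
```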
[jira] [Comment Edited] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964583#comment-16964583 ] Takeshi Yamamuro edited comment on SPARK-27763 at 11/1/19 3:54 AM: --- Thanks for the check, Hyukjin! I've made PRs for limit.sql and groupingset.sql. Also, I'll check the left three tests within a few days. was (Author: maropu): Thanks for the check, Hyukjin! I've made PRs for limit.sql and groupingset.sql. I'll check the left three tests within a few days. > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964583#comment-16964583 ] Takeshi Yamamuro commented on SPARK-27763: -- Thanks for the check, Hyukjin! I've made PRs for limit.sql and groupingset.sql. I'll check the left three tests within a few days. > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29696) Add groupingsets.sql
Takeshi Yamamuro created SPARK-29696: Summary: Add groupingsets.sql Key: SPARK-29696 URL: https://issues.apache.org/jira/browse/SPARK-29696 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29676) ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-29676. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26350 [https://github.com/apache/spark/pull/26350] > ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands > > > Key: SPARK-29676 > URL: https://issues.apache.org/jira/browse/SPARK-29676 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29676) ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-29676: --- Assignee: Huaxin Gao > ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands > > > Key: SPARK-29676 > URL: https://issues.apache.org/jira/browse/SPARK-29676 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29664) Column.getItem behavior is not consistent with Scala version
[ https://issues.apache.org/jira/browse/SPARK-29664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29664. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26351 [https://github.com/apache/spark/pull/26351] > Column.getItem behavior is not consistent with Scala version > > > Key: SPARK-29664 > URL: https://issues.apache.org/jira/browse/SPARK-29664 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > In PySpark, Column.getItem's behavior is different from the Scala version. > For example, > In PySpark: > {code:python} > df = spark.range(2) > map_col = create_map(lit(0), lit(100), lit(1), lit(200)) > df.withColumn("mapped", map_col.getItem(col('id'))).show() > # +---+--+ > # | id|mapped| > # +---+--+ > # | 0| 100| > # | 1| 200| > # +---+--+ > {code} > In Scala: > {code:scala} > val df = spark.range(2) > val map_col = map(lit(0), lit(100), lit(1), lit(200)) > // The following getItem results in the following exception, which is the > right behavior: > // java.lang.RuntimeException: Unsupported literal type class > org.apache.spark.sql.Column id > // at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) > // at org.apache.spark.sql.Column.getItem(Column.scala:856) > // ... 49 elided > df.withColumn("mapped", map_col.getItem(col("id"))).show > // You have to use apply() to match with PySpark's behavior. > df.withColumn("mapped", map_col(col("id"))).show > // +---+--+ > // | id|mapped| > // +---+--+ > // | 0| 100| > // | 1| 200| > // +---+--+ > {code} > Looking at the code for Scala implementation, PySpark's behavior is incorrect > since the argument to getItem becomes `Literal`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
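The shape of the inconsistency can be reconstructed in a few lines (the classes below are stand-ins, not the real Spark API): Scala's Column.getItem wraps its argument in a Literal, which rejects a Column, whereas PySpark's getItem effectively behaves like apply() and accepts one.

```python
class Column:
    def __init__(self, name):
        self.name = name

def literal(value):
    # Models Literal.apply: only constants are accepted.
    if isinstance(value, Column):
        raise TypeError("Unsupported literal type: Column %s" % value.name)
    return value

def get_item_scala_style(key):
    return literal(key)  # Scala path: the key must be a constant

# Passing a Column raises, mirroring the Scala RuntimeException:
try:
    get_item_scala_style(Column("id"))
    raised = False
except TypeError:
    raised = True
assert raised
# Constants still work on both paths:
assert get_item_scala_style(0) == 0
```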
[jira] [Assigned] (SPARK-29664) Column.getItem behavior is not consistent with Scala version
[ https://issues.apache.org/jira/browse/SPARK-29664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29664: Assignee: Terry Kim > Column.getItem behavior is not consistent with Scala version > > > Key: SPARK-29664 > URL: https://issues.apache.org/jira/browse/SPARK-29664 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > In PySpark, Column.getItem's behavior is different from the Scala version. > For example, > In PySpark: > {code:python} > df = spark.range(2) > map_col = create_map(lit(0), lit(100), lit(1), lit(200)) > df.withColumn("mapped", map_col.getItem(col('id'))).show() > # +---+--+ > # | id|mapped| > # +---+--+ > # | 0| 100| > # | 1| 200| > # +---+--+ > {code} > In Scala: > {code:scala} > val df = spark.range(2) > val map_col = map(lit(0), lit(100), lit(1), lit(200)) > // The following getItem results in the following exception, which is the > right behavior: > // java.lang.RuntimeException: Unsupported literal type class > org.apache.spark.sql.Column id > // at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) > // at org.apache.spark.sql.Column.getItem(Column.scala:856) > // ... 49 elided > df.withColumn("mapped", map_col.getItem(col("id"))).show > // You have to use apply() to match with PySpark's behavior. > df.withColumn("mapped", map_col(col("id"))).show > // +---+--+ > // | id|mapped| > // +---+--+ > // | 0| 100| > // | 1| 200| > // +---+--+ > {code} > Looking at the code for Scala implementation, PySpark's behavior is incorrect > since the argument to getItem becomes `Literal`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29694) Execute UDF only once when there are multiple identical UDF usages
[ https://issues.apache.org/jira/browse/SPARK-29694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964571#comment-16964571 ]

Xuedong Luan commented on SPARK-29694:
--------------------------------------

Hi [~yumwang], I will work on this Jira.

> Execute UDF only once when there are multiple identical UDF usages
> ------------------------------------------------------------------
>
> Key: SPARK-29694
> URL: https://issues.apache.org/jira/browse/SPARK-29694
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Priority: Major
>
> Example:
> {code:sql}
> SELECT
>   CASE
>     WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
>     WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
>     WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
>     WHEN col2 IN (28,29) THEN 'Google'
>     WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
>     WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
>     WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL THEN 'Google'
>     WHEN udf1(col1, 'rd') LIKE '%google%' OR udf1(col1, 'rd') LIKE '%gmail%' THEN 'Google'
>     WHEN udf1(col1, 'rd') LIKE '%yahoo%' THEN 'Yahoo'
>     WHEN udf1(col1, 'rd') LIKE '%bing%' THEN 'Bing'
>     WHEN udf1(col1, 'rd') LIKE '%facebook%' THEN 'Facebook'
>     WHEN udf1(col1, 'rd') LIKE '%pinterest%' THEN 'Pinterest'
>     WHEN udf1(col1, 'rd') LIKE '%twitter%' OR udf1(col1, 'rd') LIKE '%t.co' THEN 'Twitter'
>     WHEN udf1(col1, 'rd') LIKE '%baidu%' THEN 'Baidu'
>     WHEN udf1(col1, 'rd') LIKE '%yandex%' THEN 'Yandex'
>     WHEN udf1(col1, 'rd') LIKE '%aol.%' THEN 'AOL'
>     WHEN udf1(col1, 'rd') LIKE '%ask.%' THEN 'Ask'
>     WHEN udf1(col1, 'rd') LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
>     WHEN udf1(col1, 'rd') LIKE '%t-online.de' THEN 'T-Online'
>     WHEN udf1(col1, 'rd') LIKE '%com-kleinanzeigen.%' OR udf1(col1, 'rd') LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
>     WHEN udf1(col1, 'rd') LIKE '%com.%' OR udf1(col1, 'rd') LIKE '%comdesc.%' THEN 'com'
>     WHEN udf1(col1, 'rd') LIKE '%paypal.%' THEN 'PayPal'
>     WHEN udf1(col1, 'rd') IS NULL THEN 'None'
>     ELSE 'Other'
>   END AS source_domain,
>   COUNT(*) AS cnt
> FROM
>   tbl s
> GROUP BY
>   1
> {code}
> We can rewrite it to:
> {code:sql}
> SELECT
>   CASE
>     WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
>     WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
>     WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
>     WHEN col2 IN (28,29) THEN 'Google'
>     WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
>     WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
>     WHEN col2 IN (10,16,18) AND col1 IS NULL THEN 'Google'
>     WHEN col1 LIKE '%google%' OR col1 LIKE '%gmail%' THEN 'Google'
>     WHEN col1 LIKE '%yahoo%' THEN 'Yahoo'
>     WHEN col1 LIKE '%bing%' THEN 'Bing'
>     WHEN col1 LIKE '%facebook%' THEN 'Facebook'
>     WHEN col1 LIKE '%pinterest%' THEN 'Pinterest'
>     WHEN col1 LIKE '%twitter%' OR col1 LIKE '%t.co' THEN 'Twitter'
>     WHEN col1 LIKE '%baidu%' THEN 'Baidu'
>     WHEN col1 LIKE '%yandex%' THEN 'Yandex'
>     WHEN col1 LIKE '%aol.%' THEN 'AOL'
>     WHEN col1 LIKE '%ask.%' THEN 'Ask'
>     WHEN col1 LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
>     WHEN col1 LIKE '%t-online.de' THEN 'T-Online'
>     WHEN col1 LIKE '%com-kleinanzeigen.%' OR col1 LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
>     WHEN col1 LIKE '%com.%' OR col1 LIKE '%comdesc.%' THEN 'com'
>     WHEN col1 LIKE '%paypal.%' THEN 'PayPal'
>     WHEN col1 IS NULL THEN 'None'
>     ELSE 'Other'
>   END AS source_domain,
>   COUNT(*) AS cnt
> FROM
>   (SELECT *, udf1(col1, 'rd') as col1 FROM tbl) s
> GROUP BY
>   1
> {code}
> It would be great if the optimizer framework could perform this rewrite automatically.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29657) Iterator spill supporting radix sort with null prefix
[ https://issues.apache.org/jira/browse/SPARK-29657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated SPARK-29657:
---------------------------
Issue Type: Bug (was: Improvement)

> Iterator spill supporting radix sort with null prefix
> -----------------------------------------------------
>
> Key: SPARK-29657
> URL: https://issues.apache.org/jira/browse/SPARK-29657
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: dzcxzl
> Priority: Trivial
>
> Under radix sort, when insertRecord is called with a null keyPrefix, the iterator returned by getSortedIterator is a ChainedIterator. ChainedIterator currently does not support spilling, so UnsafeExternalSorter ends up holding a large amount of execution memory; allocatePage then fails and throws SparkOutOfMemoryError: Unable to acquire xxx bytes of memory, got 0.
> The following is a log of an error we encountered in the production environment:
> [Executor task launch worker for task 66055] INFO TaskMemoryManager: Memory used in task 66055
> [Executor task launch worker for task 66055] INFO TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@39dd866e: 64.0 KB
> [Executor task launch worker for task 66055] INFO TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@74d17927: 4.6 GB
> [Executor task launch worker for task 66055] INFO TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@31478f9c: 61.0 MB
> [Executor task launch worker for task 66055] INFO TaskMemoryManager: 0 bytes of memory were used by task 66055 but are not associated with specific consumers
> [Executor task launch worker for task 66055] INFO TaskMemoryManager: 4962998749 bytes of memory are used for execution and 2218326 bytes of memory are used for storage
> [Executor task launch worker for task 66055] ERROR Executor: Exception in task 42.3 in stage 29.0 (TID 66055)
> SparkOutOfMemoryError: Unable to acquire 3436 bytes of memory, got 0

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29695) ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964559#comment-16964559 ] Huaxin Gao commented on SPARK-29695: I will work on this > ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands > > > Key: SPARK-29695 > URL: https://issues.apache.org/jira/browse/SPARK-29695 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29695) ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands
Huaxin Gao created SPARK-29695: -- Summary: ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands Key: SPARK-29695 URL: https://issues.apache.org/jira/browse/SPARK-29695 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Huaxin Gao ALTER TABLE (SerDe properties) should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29694) Execute UDF only once when there are multiple identical UDF usages
Yuming Wang created SPARK-29694:

Summary: Execute UDF only once when there are multiple identical UDF usages
Key: SPARK-29694
URL: https://issues.apache.org/jira/browse/SPARK-29694
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang

Example:

{code:sql}
SELECT
  CASE
    WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
    WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
    WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
    WHEN col2 IN (28,29) THEN 'Google'
    WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
    WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
    WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL THEN 'Google'
    WHEN udf1(col1, 'rd') LIKE '%google%' OR udf1(col1, 'rd') LIKE '%gmail%' THEN 'Google'
    WHEN udf1(col1, 'rd') LIKE '%yahoo%' THEN 'Yahoo'
    WHEN udf1(col1, 'rd') LIKE '%bing%' THEN 'Bing'
    WHEN udf1(col1, 'rd') LIKE '%facebook%' THEN 'Facebook'
    WHEN udf1(col1, 'rd') LIKE '%pinterest%' THEN 'Pinterest'
    WHEN udf1(col1, 'rd') LIKE '%twitter%' OR udf1(col1, 'rd') LIKE '%t.co' THEN 'Twitter'
    WHEN udf1(col1, 'rd') LIKE '%baidu%' THEN 'Baidu'
    WHEN udf1(col1, 'rd') LIKE '%yandex%' THEN 'Yandex'
    WHEN udf1(col1, 'rd') LIKE '%aol.%' THEN 'AOL'
    WHEN udf1(col1, 'rd') LIKE '%ask.%' THEN 'Ask'
    WHEN udf1(col1, 'rd') LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
    WHEN udf1(col1, 'rd') LIKE '%t-online.de' THEN 'T-Online'
    WHEN udf1(col1, 'rd') LIKE '%com-kleinanzeigen.%' OR udf1(col1, 'rd') LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
    WHEN udf1(col1, 'rd') LIKE '%com.%' OR udf1(col1, 'rd') LIKE '%comdesc.%' THEN 'com'
    WHEN udf1(col1, 'rd') LIKE '%paypal.%' THEN 'PayPal'
    WHEN udf1(col1, 'rd') IS NULL THEN 'None'
    ELSE 'Other'
  END AS source_domain,
  COUNT(*) AS cnt
FROM
  tbl s
GROUP BY
  1
{code}

We can rewrite it to:

{code:sql}
SELECT
  CASE
    WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
    WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
    WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
    WHEN col2 IN (28,29) THEN 'Google'
    WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
    WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
    WHEN col2 IN (10,16,18) AND col1 IS NULL THEN 'Google'
    WHEN col1 LIKE '%google%' OR col1 LIKE '%gmail%' THEN 'Google'
    WHEN col1 LIKE '%yahoo%' THEN 'Yahoo'
    WHEN col1 LIKE '%bing%' THEN 'Bing'
    WHEN col1 LIKE '%facebook%' THEN 'Facebook'
    WHEN col1 LIKE '%pinterest%' THEN 'Pinterest'
    WHEN col1 LIKE '%twitter%' OR col1 LIKE '%t.co' THEN 'Twitter'
    WHEN col1 LIKE '%baidu%' THEN 'Baidu'
    WHEN col1 LIKE '%yandex%' THEN 'Yandex'
    WHEN col1 LIKE '%aol.%' THEN 'AOL'
    WHEN col1 LIKE '%ask.%' THEN 'Ask'
    WHEN col1 LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
    WHEN col1 LIKE '%t-online.de' THEN 'T-Online'
    WHEN col1 LIKE '%com-kleinanzeigen.%' OR col1 LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
    WHEN col1 LIKE '%com.%' OR col1 LIKE '%comdesc.%' THEN 'com'
    WHEN col1 LIKE '%paypal.%' THEN 'PayPal'
    WHEN col1 IS NULL THEN 'None'
    ELSE 'Other'
  END AS source_domain,
  COUNT(*) AS cnt
FROM
  (SELECT *, udf1(col1, 'rd') as col1 FROM tbl) s
GROUP BY
  1
{code}

It would be great if the optimizer framework could perform this rewrite automatically.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
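The payoff of the proposed rewrite is common-subexpression elimination for the repeated udf1(col1, 'rd') call: it is evaluated once per row instead of once per CASE branch. A toy demonstration with a call counter (illustrative code, not the Catalyst rule):

```python
# `calls` stands in for the per-row cost of evaluating the UDF.
calls = 0

def udf1(value, mode):
    global calls
    calls += 1
    return (value or "") + ":" + mode

row_col1 = "ads.google.com"

# Naive plan: the identical call appears in many branches and runs each time.
naive = [udf1(row_col1, "rd") for _ in range(5)]
assert calls == 5

# Rewritten plan: compute once (the derived column), reuse it in every branch.
calls = 0
rd = udf1(row_col1, "rd")
rewritten = [rd for _ in range(5)]
assert calls == 1
assert naive == rewritten  # same results, one evaluation
```

Note this hoisting is only safe when the UDF is deterministic, which is presumably a precondition for any such optimizer rule.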
[jira] [Commented] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory
[ https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964529#comment-16964529 ] Dongjoon Hyun commented on SPARK-23643: --- This causes the UI test result difference between Apache Spark 3.0 and 2.4. > XORShiftRandom.hashSeed allocates unnecessary memory > > > Key: SPARK-23643 > URL: https://issues.apache.org/jira/browse/SPARK-23643 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > The hashSeed method allocates 64 bytes buffer and puts only 8 bytes of the > seed parameter into it. Other bytes are always zero and could be easily > excluded from hash calculation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
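The original report is about wasted work: hashSeed writes the 8-byte seed into a 64-byte buffer and hashes all 64 bytes, of which 56 are always zero. A Python stand-in for the byte layout (not the Scala code itself):

```python
import struct

seed = 0x5DEECE66D
packed = struct.pack(">q", seed)    # the 8 bytes that actually vary
padded = packed + b"\x00" * 56      # what the old 64-byte buffer held

assert len(packed) == 8 and len(padded) == 64
assert padded[8:] == b"\x00" * 56   # constant tail: contributes no entropy
# Hashing only `packed` covers every byte that can change with the seed,
# which is why the trailing zeros could be excluded from the calculation.
```

Because the bytes fed to the hash change, the derived seed changes too, which is why this fix altered expected results (and hence the release-notes label).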
[jira] [Updated] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory
[ https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23643: -- Priority: Major (was: Trivial) > XORShiftRandom.hashSeed allocates unnecessary memory > > > Key: SPARK-23643 > URL: https://issues.apache.org/jira/browse/SPARK-23643 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > The hashSeed method allocates 64 bytes buffer and puts only 8 bytes of the > seed parameter into it. Other bytes are always zero and could be easily > excluded from hash calculation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory
[ https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23643: -- Labels: release-notes (was: ) > XORShiftRandom.hashSeed allocates unnecessary memory > > > Key: SPARK-23643 > URL: https://issues.apache.org/jira/browse/SPARK-23643 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Trivial > Labels: release-notes > Fix For: 3.0.0 > > > The hashSeed method allocates 64 bytes buffer and puts only 8 bytes of the > seed parameter into it. Other bytes are always zero and could be easily > excluded from hash calculation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23643) XORShiftRandom.hashSeed allocates unnecessary memory
[ https://issues.apache.org/jira/browse/SPARK-23643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964528#comment-16964528 ] Dongjoon Hyun commented on SPARK-23643: --- I added `release-note` label because this changes the seed and expected result. cc [~jiangxb1987] and [~smilegator] > XORShiftRandom.hashSeed allocates unnecessary memory > > > Key: SPARK-23643 > URL: https://issues.apache.org/jira/browse/SPARK-23643 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Trivial > Labels: release-notes > Fix For: 3.0.0 > > > The hashSeed method allocates 64 bytes buffer and puts only 8 bytes of the > seed parameter into it. Other bytes are always zero and could be easily > excluded from hash calculation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
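The issue description says hashSeed fills a 64-byte buffer but only the first 8 bytes carry the seed; the remaining 56 bytes are constant zeros, so hashing them is wasted work. A minimal pure-Python sketch of that layout (using MD5 as a stand-in for Spark's actual MurmurHash3, purely for illustration) also shows why trimming the padding changes the produced hash values, which is why the fix carries a release-notes label:

```python
import hashlib
import struct

def hash_seed_padded(seed: int) -> str:
    # Mimics the wasteful layout: 8 seed bytes followed by 56 zero bytes.
    buf = struct.pack(">q", seed) + b"\x00" * 56
    return hashlib.md5(buf).hexdigest()

def hash_seed_compact(seed: int) -> str:
    # Hash only the 8 meaningful bytes of the seed.
    return hashlib.md5(struct.pack(">q", seed)).hexdigest()

# Padding bytes are identical for every seed, so they add no entropy,
# but removing them changes the hash output (hence the release note).
assert hash_seed_padded(42) != hash_seed_compact(42)
```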
[jira] [Commented] (SPARK-29693) Bucket map join if the one's bucket number is the multiple of the other
[ https://issues.apache.org/jira/browse/SPARK-29693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964491#comment-16964491 ] Yuming Wang commented on SPARK-29693: - cc [~gwang3] > Bucket map join if the one's bucket number is the multiple of the other > --- > > Key: SPARK-29693 > URL: https://issues.apache.org/jira/browse/SPARK-29693 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29693) Bucket map join if the one's bucket number is the multiple of the other
Yuming Wang created SPARK-29693: --- Summary: Bucket map join if the one's bucket number is the multiple of the other Key: SPARK-29693 URL: https://issues.apache.org/jira/browse/SPARK-29693 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29693) Bucket map join if the one's bucket number is the multiple of the other
[ https://issues.apache.org/jira/browse/SPARK-29693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964490#comment-16964490 ] Yuming Wang commented on SPARK-29693: - https://data-flair.training/blogs/bucket-map-join/ > Bucket map join if the one's bucket number is the multiple of the other > --- > > Key: SPARK-29693 > URL: https://issues.apache.org/jira/browse/SPARK-29693 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
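The proposed improvement hinges on a property of consistent hash bucketing: when one table's bucket count is a multiple of the other's, every bucket of the larger table maps to exactly one bucket of the smaller table, so buckets can still be joined pairwise without a shuffle. A small sketch (plain modulo hashing, not Spark's actual Murmur3 bucketing) illustrates the invariant:

```python
def bucket_id(key_hash: int, num_buckets: int) -> int:
    # Simplified bucketing: Spark hashes the bucket columns with Murmur3,
    # but any consistent hash taken mod num_buckets shows the same property.
    return key_hash % num_buckets

# With 8 buckets on one side and 4 on the other (8 is a multiple of 4),
# a row's 8-bucket id fully determines its 4-bucket id: b_small = b_big % 4.
for h in range(1000):
    assert bucket_id(h, 8) % 4 == bucket_id(h, 4)
```

So bucket b of the 8-bucket table only ever needs to be joined against bucket b % 4 of the 4-bucket table.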
[jira] [Commented] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964489#comment-16964489 ] Hyukjin Kwon commented on SPARK-29625: -- It needs investigation. Can you share the codes you ran? > Spark Structure Streaming Kafka Wrong Reset Offset twice > > > Key: SPARK-29625 > URL: https://issues.apache.org/jira/browse/SPARK-29625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Sandish Kumar HN >Priority: Major > > Spark Structure Streaming Kafka Reset Offset twice, once with right offsets > and second time with very old offsets > {code} > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-151 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-118 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-85 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 122677634. 
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-19 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 120504922.* > [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO ContextCleaner: Cleaned accumulator 810 > {code} > which is causing a Data loss issue. > {code} > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, > runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > java.lang.IllegalStateException: Partition topic-52's offset was changed from > 122677598 to 120504922, some data may have been missed. > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have > been lost because they are not available in Kafka any more; either the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out > by Kafka or the topic may have been deleted before all the data in the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was > processed. If you don't want your streaming query to fail on such cases, set > the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option > "failOnDataLoss" to "false". 
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:259) > [2019-10-28 19:27:40,351] \{bash_
[jira] [Commented] (SPARK-12806) Support SQL expressions extracting values from VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964481#comment-16964481 ] John Bauer commented on SPARK-12806: Also, when using PyArrow to convert a Spark DataFrame for use in a pandas_udf, as soon as a VectorUDT is encountered it reverts to a non-optimized conversion, losing much of the advantage of using PyArrow. > Support SQL expressions extracting values from VectorUDT > > > Key: SPARK-12806 > URL: https://issues.apache.org/jira/browse/SPARK-12806 > Project: Spark > Issue Type: Improvement > Components: MLlib, SQL >Affects Versions: 1.6.0 >Reporter: Feynman Liang >Priority: Major > Labels: bulk-closed > > Use cases exist where a specific index within a {{VectorUDT}} column of a > {{DataFrame}} is required. For example, we may be interested in extracting a > specific class probability from the {{probabilityCol}} of a > {{LogisticRegression}} to compute losses. However, if {{probability}} is a > column of {{df}} with type {{VectorUDT}}, the following code fails: > {code} > df.select("probability.0") > AnalysisException: u"Can't extract value from probability" > {code} > thrown from > {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala}}. > {{VectorUDT}} essentially wraps a {{StructType}}, hence one would expect it > to support value extraction Expressions in an analogous way. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12806) Support SQL expressions extracting values from VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964475#comment-16964475 ] John Bauer commented on SPARK-12806: This is still a problem. For example, classification models emit probability as a VectorUDT, which are unusable in PySpark. This makes constructing boosting/bagging algorithms or even just using them as additional features in a second model problematic. > Support SQL expressions extracting values from VectorUDT > > > Key: SPARK-12806 > URL: https://issues.apache.org/jira/browse/SPARK-12806 > Project: Spark > Issue Type: Improvement > Components: MLlib, SQL >Affects Versions: 1.6.0 >Reporter: Feynman Liang >Priority: Major > Labels: bulk-closed > > Use cases exist where a specific index within a {{VectorUDT}} column of a > {{DataFrame}} is required. For example, we may be interested in extracting a > specific class probability from the {{probabilityCol}} of a > {{LogisticRegression}} to compute losses. However, if {{probability}} is a > column of {{df}} with type {{VectorUDT}}, the following code fails: > {code} > df.select("probability.0") > AnalysisException: u"Can't extract value from probability" > {code} > thrown from > {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala}}. > {{VectorUDT}} essentially wraps a {{StructType}}, hence one would expect it > to support value extraction Expressions in an analogous way. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
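Until `probability.0`-style extraction is supported natively, a common workaround is to wrap a plain element extractor in a UDF. The extractor below is ordinary Python (it assumes only that the value is indexable, as pyspark.ml.linalg vectors are), so it is shown standalone; the commented lines sketch how one would register it in PySpark and are not exercised here:

```python
def element_at_vector(v, i):
    """Return element i of an ML-style vector as a float, or None if absent.

    Works on anything indexable the way pyspark.ml.linalg vectors are.
    """
    try:
        return float(v[i])
    except (IndexError, TypeError):
        return None

# Sketch of the PySpark registration (assumed usage, untested here):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import DoubleType
# prob_1 = udf(lambda v: element_at_vector(v, 1), DoubleType())
# df = df.withColumn("p1", prob_1("probability"))
```

Note the UDF route still pays the serialization cost the comments above describe; it is a workaround, not a fix.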
[jira] [Resolved] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-29687. -- Fix Version/s: 3.0.0 Assignee: ulysses you Resolution: Fixed Resolved by https://github.com/apache/spark/pull/26346 > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.0.0 > > > The JDBC metrics counter var is an Int type that may overflow. Change it to > Long type. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
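The overflow risk is easy to see by emulating JVM Int wraparound. Python's own ints are arbitrary precision, so ctypes is used here to model a 32-bit counter; this is an illustration of the failure mode, not Spark's code:

```python
import ctypes

def as_int32(n: int) -> int:
    # Emulate a JVM Int: wraps around at 2**31 instead of growing.
    return ctypes.c_int32(n).value

MAX_INT = 2**31 - 1
assert as_int32(MAX_INT) == MAX_INT
assert as_int32(MAX_INT + 1) == -(2**31)  # one more row and the counter goes negative

# A Long counter leaves ample headroom for realistic row counts:
assert MAX_INT + 1 < 2**63 - 1
```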
[jira] [Created] (SPARK-29692) SparkContext.defaultParallelism should reflect resource limits when resource limits are set
Bago Amirbekian created SPARK-29692: --- Summary: SparkContext.defaultParallelism should reflect resource limits when resource limits are set Key: SPARK-29692 URL: https://issues.apache.org/jira/browse/SPARK-29692 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Bago Amirbekian With the new GPU/FPGA support in Spark, defaultParallelism may not be computed correctly. Specifically, defaultParallelism may be much higher than the total possible concurrent tasks if, for example, workers have many more cores than GPUs. Steps to reproduce: Start a cluster with spark.executor.resource.gpu.amount < cores per executor. Set spark.task.resource.gpu.amount = 1. Keep cores per task as 1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
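An illustrative model of the mismatch (not Spark's actual scheduler code): the number of concurrently runnable tasks per executor is bounded by the scarcest resource, while a cores-only defaultParallelism overstates it.

```python
def tasks_per_executor(executor_cores: int, cores_per_task: int,
                       executor_gpus: int, gpus_per_task: float) -> int:
    # True concurrency is limited by the scarcest resource, not cores alone.
    by_cpu = executor_cores // cores_per_task
    by_gpu = int(executor_gpus // gpus_per_task) if gpus_per_task else by_cpu
    return min(by_cpu, by_gpu)

# Reproduction shape from the report: 8 cores, 1 core per task, but only
# 2 GPUs with spark.task.resource.gpu.amount = 1:
assert tasks_per_executor(8, 1, 2, 1) == 2  # actual concurrent tasks
assert 8 // 1 == 8                          # what cores alone would suggest
```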
[jira] [Updated] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Bauer updated SPARK-29691: --- Description: Estimator `fit` method is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. (The copy method that interacts with Java is actually implemented in Params.) For example, this prints Before: 0.8 After: 0.8 but After should be 0.75 {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam")) {code} was: Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. 
For example, this prints Before: 0.8 After: 0.8 but After should be 0.75 {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam")) {code} > Estimator fit method fails to copy params (in PySpark) > -- > > Key: SPARK-29691 > URL: https://issues.apache.org/jira/browse/SPARK-29691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: John Bauer >Priority: Minor > > Estimator `fit` method is supposed to copy a dictionary of params, > overwriting the estimator's previous values, before fitting the model. > However, the parameter values are not updated. This was observed in PySpark, > but may be present in the Java objects, as the PySpark code appears to be > functioning correctly. (The copy method that interacts with Java is > actually implemented in Params.) 
> For example, this prints > Before: 0.8 > After: 0.8 > but After should be 0.75 > {code:python} > from pyspark.ml.classification import LogisticRegression > # Load training data > training = spark \ > .read \ > .format("libsvm") \ > .load("data/mllib/sample_multiclass_classification_data.txt") > lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) > print("Before:", lr.getOrDefault("elasticNetParam")) > # Fit the model, but with an updated parameter setting: > lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) > print("After:", lr.getOrDefault("elasticNetParam")) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Bauer updated SPARK-29691: --- Description: Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. For example, this prints Before: 0.8 After: 0.8 but After should be 0.75 {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam")) {code} was: Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. 
For example, this prints {{Before: 0.8 After: 0.8}} but After should be 0.75 {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam")) {code} > Estimator fit method fails to copy params (in PySpark) > -- > > Key: SPARK-29691 > URL: https://issues.apache.org/jira/browse/SPARK-29691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: John Bauer >Priority: Minor > > Estimator `fit` method (implemented in Params) is supposed to copy a > dictionary of params, overwriting the estimator's previous values, before > fitting the model. However, the parameter values are not updated. This was > observed in PySpark, but may be present in the Java objects, as the PySpark > code appears to be functioning correctly. > For example, this prints > Before: 0.8 > After: 0.8 > but After should be 0.75 > {code:python} > from pyspark.ml.classification import LogisticRegression > # Load training data > training = spark \ > .read \ > .format("libsvm") \ > .load("data/mllib/sample_multiclass_classification_data.txt") > lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) > print("Before:", lr.getOrDefault("elasticNetParam")) > # Fit the model, but with an updated parameter setting: > lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) > print("After:", lr.getOrDefault("elasticNetParam")) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Bauer updated SPARK-29691: --- Description: Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. For example, this prints {{Before: 0.8 After: 0.8}} but After should be 0.75 {code:python} from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam")) {code} was: Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. 
For example, this prints {{Before: 0.8 After: 0.8}} but After should be 0.75 {{from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam"))}} > Estimator fit method fails to copy params (in PySpark) > -- > > Key: SPARK-29691 > URL: https://issues.apache.org/jira/browse/SPARK-29691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: John Bauer >Priority: Minor > > Estimator `fit` method (implemented in Params) is supposed to copy a > dictionary of params, overwriting the estimator's previous values, before > fitting the model. However, the parameter values are not updated. This was > observed in PySpark, but may be present in the Java objects, as the PySpark > code appears to be functioning correctly. > For example, this prints > {{Before: 0.8 > After: 0.8}} > but After should be 0.75 > {code:python} > from pyspark.ml.classification import LogisticRegression > # Load training data > training = spark \ > .read \ > .format("libsvm") \ > .load("data/mllib/sample_multiclass_classification_data.txt") > lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) > print("Before:", lr.getOrDefault("elasticNetParam")) > # Fit the model, but with an updated parameter setting: > lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) > print("After:", lr.getOrDefault("elasticNetParam")) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
John Bauer created SPARK-29691: -- Summary: Estimator fit method fails to copy params (in PySpark) Key: SPARK-29691 URL: https://issues.apache.org/jira/browse/SPARK-29691 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Reporter: John Bauer Estimator `fit` method (implemented in Params) is supposed to copy a dictionary of params, overwriting the estimator's previous values, before fitting the model. However, the parameter values are not updated. This was observed in PySpark, but may be present in the Java objects, as the PySpark code appears to be functioning correctly. For example, this prints {{Before: 0.8 After: 0.8}} but After should be 0.75 {{from pyspark.ml.classification import LogisticRegression # Load training data training = spark \ .read \ .format("libsvm") \ .load("data/mllib/sample_multiclass_classification_data.txt") lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) print("Before:", lr.getOrDefault("elasticNetParam")) # Fit the model, but with an updated parameter setting: lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) print("After:", lr.getOrDefault("elasticNetParam"))}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
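A possible explanation, hedged since the root cause is not confirmed in the report: PySpark expects the {{params}} dict to be keyed by Param objects (e.g. {{lr.elasticNetParam}}) rather than strings, so a string-keyed override may be silently dropped; and the documented contract is that {{fit}} applies overrides to a *copy*, so the estimator's own value printing 0.8 afterwards may even be expected. A minimal pure-Python model of that copy-then-fit contract (hypothetical, not pyspark's implementation):

```python
class Estimator:
    """Toy model of pyspark.ml's copy-then-fit contract (NOT pyspark code)."""

    def __init__(self, **values):
        self._paramMap = dict(values)

    def copy(self, extra=None):
        # Copy the estimator, overwriting params with `extra`.
        clone = Estimator(**self._paramMap)
        clone._paramMap.update(extra or {})
        return clone

    def fit(self, dataset, params=None):
        # The contract: overrides apply to a copy, leaving self untouched.
        est = self.copy(params) if params else self
        # Stand-in for training: return the params the "model" was fit with.
        return est._paramMap

lr = Estimator(elasticNetParam=0.8)
fitted = lr.fit(None, params={"elasticNetParam": 0.75})
assert fitted["elasticNetParam"] == 0.75       # override reaches the fit
assert lr._paramMap["elasticNetParam"] == 0.8  # estimator itself unchanged
```

Under this contract, the reported symptom to check is whether the *fitted model* saw 0.75, not whether the estimator's stored value changed.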
[jira] [Comment Edited] (SPARK-29415) Stage Level Sched: Add base ResourceProfile and Request classes
[ https://issues.apache.org/jira/browse/SPARK-29415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964360#comment-16964360 ] Thomas Graves edited comment on SPARK-29415 at 10/31/19 8:03 PM: - More details: TaskResourceRequest - this supports taking a resourceName and an amount (Double for fractional resources). It only supports cpus (spark.task.cpus) and accelerator resource types (spark.*.resource.[resourceName].*). So users can specify cpus and resources like GPUs and FPGAs. The accelerator-type resources match what we already have for configs in the accelerator-aware scheduling. [https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview] ExecutorResourceRequest - this supports specifying the requirements for the executors. It supports all the configs needed for accelerator-aware scheduling - {{spark.\{executor/driver}.resource.\{resourceName}.\{amount, discoveryScript, vendor} as well as heap memory, overhead memory, pyspark memory, and cores. In order to support memory types we added a "units" parameter into ExecutorResourceRequest. The other parameters resourceName, vendor, discoveryScript, amount all match the accelerator-aware scheduling parameters.}} ResourceProfile - this class takes in the executor and task requirements and holds them to be used by other components. For instance, we have to pass the executor resources into the cluster manager so it can ask for the proper containers. The requests also have to be passed into the executors when launched so they use the correct discovery script. The task requirements are used by the scheduler to assign tasks to proper containers. {{ We also have a ResourceProfile object that has an accessor to get the default ResourceProfile. This is the profile generated from the configs the user passes in when the Spark application is submitted.
So it will have the --executor-cores, memory, overhead memory, pyspark memory, and accelerator resources the user specified via --conf or a properties file on submit. The default profile will be used in a lot of places since the user may never specify another ResourceProfile and wants an easy way to access it.}} was (Author: tgraves): More details: TaskResourceRequest - this supports taking a resourceName and an amount (Double for fractional resources). It only supports cpus (spark.task.cpus) and accelerator resource types (spark.*.resource.[resourceName].*). So users can specify cpus and resources like GPUs and FPGAs. The accelerator-type resources match what we already have for configs in the accelerator-aware scheduling. [https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview] ExecutorResourceRequest - this supports specifying the requirements for the executors. It supports all the configs needed for accelerator-aware scheduling - {{spark.\{executor/driver}.resource.\{resourceName}.\{amount, discoveryScript, vendor} as well as heap memory, overhead memory, pyspark memory, and cores. In order to support memory types we added a "units" parameter into ExecutorResourceRequest. The other parameters resourceName, vendor, discoveryScript, amount all match the accelerator-aware scheduling parameters.}} ResourceProfile - this class takes in the executor and task requirements and holds them to be used by other components. For instance, we have to pass the executor resources into the cluster manager so it can ask for the proper containers. The requests also have to be passed into the executors when launched so they use the correct discovery script. The task requirements are used by the scheduler to assign tasks to proper containers. {{ We also have a ResourceProfile object that has an accessor to get the default ResourceProfile.
This is the profile generated from the configs the user passes in when the spark application is submitted. So it will have --executor-cores, memory, overhead memory, pyspark memory, accelerator resources the user all specified via --confs or properties file on submit.}} > Stage Level Sched: Add base ResourceProfile and Request classes > --- > > Key: SPARK-29415 > URL: https://issues.apache.org/jira/browse/SPARK-29415 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > this is just to add initial ResourceProfile, ExecutorResourceRequest and > taskResourceRequest classes that are used by the other parts of the code. > Initially we will have them private until we have other pieces in place. -- This message was sent by Atlassian Jir
[jira] [Comment Edited] (SPARK-29415) Stage Level Sched: Add base ResourceProfile and Request classes
[ https://issues.apache.org/jira/browse/SPARK-29415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964360#comment-16964360 ] Thomas Graves edited comment on SPARK-29415 at 10/31/19 7:59 PM: - More details: TaskResourceRequest - this supports taking a resourceName and an amount (Double for fractional resources). It only supports cpus (spark.task.cpus) and accelerator resource types (spark.*.resource.[resourceName].*). So users can specify cpus and resources like GPUs and FPGAs. The accelerator-type resources match what we already have for configs in the accelerator-aware scheduling. [https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview] ExecutorResourceRequest - this supports specifying the requirements for the executors. It supports all the configs needed for accelerator-aware scheduling - {{spark.\{executor/driver}.resource.\{resourceName}.\{amount, discoveryScript, vendor} as well as heap memory, overhead memory, pyspark memory, and cores. In order to support memory types we added a "units" parameter into ExecutorResourceRequest. The other parameters resourceName, vendor, discoveryScript, amount all match the accelerator-aware scheduling parameters.}} ResourceProfile - this class takes in the executor and task requirements and holds them to be used by other components. For instance, we have to pass the executor resources into the cluster manager so it can ask for the proper containers. The requests also have to be passed into the executors when launched so they use the correct discovery script. The task requirements are used by the scheduler to assign tasks to proper containers. {{ We also have a ResourceProfile object that has an accessor to get the default ResourceProfile. This is the profile generated from the configs the user passes in when the Spark application is submitted.
So it will have the --executor-cores, memory, overhead memory, pyspark memory, and accelerator resources the user specified via --conf or a properties file on submit.}} was (Author: tgraves): More details: TaskResourceRequest - this supports taking a resourceName and an amount (Double for fractional resources). It only supports cpus (spark.task.cpus) and accelerator resource types (spark.*.resource.[resourceName].*). So users can specify cpus and resources like GPUs and FPGAs. The accelerator-type resources match what we already have for configs in the accelerator-aware scheduling. [https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview] ExecutorResourceRequest - this supports specifying the requirements for the executors. It supports all the configs needed for accelerator-aware scheduling - {{spark.\{executor/driver}.resource.\{resourceName}.\{amount, discoveryScript, vendor} as well as heap memory, overhead memory, pyspark memory, and cores. In order to support memory types we added a "units" parameter into ExecutorResourceRequest. The other parameters resourceName, vendor, discoveryScript, amount all match the accelerator-aware scheduling parameters.}} ResourceProfile - this class takes in the executor and task requirements and holds them to be used by other components. For instance, we have to pass the executor resources into the cluster manager so it can ask for the proper containers. The requests also have to be passed into the executors when launched so they use the correct discovery script. The task requirements are used by the scheduler to assign tasks to proper containers. {{ We also have a ResourceProfile object that has an accessor to get the default ResourceProfile. This is the profile generated from the configs the user passes in when the Spark application is submitted.
So it will have --executor-cores, memory, overhead memory, pyspark memory, accelerator resources the user all specified via --confs or properties file on submit.}} > Stage Level Sched: Add base ResourceProfile and Request classes > --- > > Key: SPARK-29415 > URL: https://issues.apache.org/jira/browse/SPARK-29415 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > this is just to add initial ResourceProfile, ExecutorResourceRequest and > taskResourceRequest classes that are used by the other parts of the code. > Initially we will have them private until we have other pieces in place. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache
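The classes described in the comment above could look roughly like this. This is a hypothetical Python sketch of the data they carry, not Spark's actual Scala API; the field names beyond those mentioned in the comment (and the builder methods) are assumptions for illustration:

```python
# Hypothetical sketch of the request/profile classes described above --
# not Spark's actual API, just an illustration of the data they carry.
from dataclasses import dataclass, field

@dataclass
class TaskResourceRequest:
    resource_name: str   # e.g. "cpus" or an accelerator type like "gpu"
    amount: float        # a Double in the comment, so fractions are allowed

@dataclass
class ExecutorResourceRequest:
    resource_name: str          # e.g. "memory", "cores", "gpu"
    amount: int
    units: str = ""             # added to support memory sizes, per the comment
    discovery_script: str = ""  # matches accelerator-aware scheduling params
    vendor: str = ""

@dataclass
class ResourceProfile:
    """Holds executor and task requirements for other components to consume."""
    executor_resources: dict = field(default_factory=dict)
    task_resources: dict = field(default_factory=dict)

    def require_executor(self, req: ExecutorResourceRequest) -> "ResourceProfile":
        self.executor_resources[req.resource_name] = req
        return self

    def require_task(self, req: TaskResourceRequest) -> "ResourceProfile":
        self.task_resources[req.resource_name] = req
        return self

# A profile asking for executors with 2 GPUs and tasks needing 0.5 GPU each
# (the discovery-script path is a made-up placeholder).
profile = (ResourceProfile()
           .require_executor(ExecutorResourceRequest("gpu", 2,
                                                     discovery_script="/opt/find_gpus.sh"))
           .require_task(TaskResourceRequest("gpu", 0.5)))
```

The cluster manager would read `executor_resources` to size containers, while the scheduler would read `task_resources` to pack tasks onto them, mirroring the division of responsibilities described above.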
[jira] [Commented] (SPARK-29415) Stage Level Sched: Add base ResourceProfile and Request classes
[ https://issues.apache.org/jira/browse/SPARK-29415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964360#comment-16964360 ] Thomas Graves commented on SPARK-29415: --- More details: TaskResourceRequest - this supports taking a resourceName and an amount (Double for fractional resources). It only supports cpus (spark.task.cpus) and accelerator resource types (spark.*.resource.[resourceName].*), so a user can specify cpus and resources like GPUs and FPGAs. The accelerator-type resources match what we already have for configs in accelerator-aware scheduling. [https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview] ExecutorResourceRequest - this supports specifying the requirements for the executors. It supports all the configs needed for accelerator-aware scheduling - {{spark.\{executor/driver}.resource.\{resourceName}.\{amount, discoveryScript, vendor} as well as heap memory, overhead memory, pyspark memory, and cores. In order to support memory types we added a "units" parameter to ExecutorResourceRequest. The other parameters (resourceName, vendor, discoveryScript, amount) all match the accelerator-aware scheduling parameters.}} ResourceProfile - this class takes in the executor and task requirements and holds them to be used by other components. For instance, we have to pass the executor resources to the cluster manager so it can ask for the proper containers. The requests also have to be passed to the executors when launched so they use the correct discovery script. The task requirements are used by the scheduler to assign tasks to the proper containers. {{ We also have a ResourceProfile object that has an accessor to get the default ResourceProfile. This is the profile generated from the configs the user passes in when the spark application is submitted. 
So it will have --executor-cores, memory, overhead memory, pyspark memory, accelerator resources the user all specified via --confs or properties file on submit.}} > Stage Level Sched: Add base ResourceProfile and Request classes > --- > > Key: SPARK-29415 > URL: https://issues.apache.org/jira/browse/SPARK-29415 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > this is just to add initial ResourceProfile, ExecutorResourceRequest and > taskResourceRequest classes that are used by the other parts of the code. > Initially we will have them private until we have other pieces in place. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29673) upgrade jenkins pypy to PyPy3.6 v7.2.0
[ https://issues.apache.org/jira/browse/SPARK-29673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964359#comment-16964359 ] Shane Knapp commented on SPARK-29673: - pypy3.6-7.2.0-linux_x86_64-portable has been installed on the centos workers, and I'm testing with https://github.com/apache/spark/pull/26330. The Ubuntu workers will be updated later today. > upgrade jenkins pypy to PyPy3.6 v7.2.0 > -- > > Key: SPARK-29673 > URL: https://issues.apache.org/jira/browse/SPARK-29673 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22579) BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be implemented using streaming
[ https://issues.apache.org/jira/browse/SPARK-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964281#comment-16964281 ] Imran Rashid commented on SPARK-22579: -- Sorry I had not noticed this issue before. I agree that there is an inefficiency here: if you did this streaming you could pipeline fetching the data w/ computing on the data. The existing changes you point to solve the memory footprint, by fetching to disk, but don't actually pipeline the computation. That said, this isn't easy to fix. You need to touch a lot of core stuff in the network layers, and as you said it gets trickier with handling failures (you have to throw out all partial work in the current task). You'll probably still see a discrepancy between runtimes when running locally vs. remote. Best case, you'd get a 2x speedup with this change. In your use case, that would still be ~40 seconds vs. 4 minutes. > BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be > implemented using streaming > -- > > Key: SPARK-22579 > URL: https://issues.apache.org/jira/browse/SPARK-22579 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Affects Versions: 2.1.0 >Reporter: Eyal Farago >Priority: Major > > when an RDD partition is cached on an executor but the task requiring it is > running on another executor (process locality ANY), the cached partition is > fetched via BlockManager.getRemoteValues which delegates to > BlockManager.getRemoteBytes; both calls are blocking. > in my use case I had a 700GB RDD spread over 1000 partitions on a 6-node > cluster, cached to disk. Rough math shows that the average partition size is > 700MB. 
> Looking at the Spark UI it was obvious that tasks running with process locality > 'ANY' are much slower than local tasks (~40 seconds to 8-10 minutes ratio), I > was able to capture thread dumps of executors executing remote tasks and got > this stack trace: > {quote}Thread ID Thread Name Thread StateThread Locks > 1521 Executor task launch worker-1000WAITING > Lock(java.util.concurrent.ThreadPoolExecutor$Worker@196462978}) > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:190) > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190) > org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104) > org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:582) > org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:550) > org.apache.spark.storage.BlockManager.get(BlockManager.scala:638) > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:690) > org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > org.apache.spark.rdd.RDD.iterator(RDD.scala:287){quote} > Digging into the code showed that the block manager first fetches all bytes > (getRemoteBytes) and then wraps them with a deserialization stream; this has > several drawbacks: > 1. blocking: the requesting executor is blocked while the remote executor is > serving the block. > 2. potentially large memory footprint on the requesting executor; in my use case > 700MB of raw bytes stored in a ChunkedByteBuffer. > 3. inefficient: the requesting side usually doesn't need all values at once, as it > consumes the values via an iterator. > 4. potentially large memory footprint on the serving executor: in case the block > is cached in deserialized form the serving executor has to serialize it into > a ChunkedByteBuffer (BlockManager
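The fetch-then-deserialize pattern criticized in the drawbacks above, versus the streaming alternative the ticket asks for, can be illustrated with a toy sketch. This is plain Python over a fake chunked "network" (an iterator of byte chunks) and has nothing to do with Spark's actual network layer; it only shows why streaming lets consumption start before the whole block has arrived:

```python
# Toy illustration: buffering a whole remote block before deserializing
# vs. streaming records out as chunks arrive. The "network" is just an
# iterator of byte chunks; the record format is a made-up length-prefixed
# pickle encoding, not Spark's serialization.
import io
import pickle

def serialize_block(values):
    """Encode each value as a 4-byte big-endian length prefix + pickle payload."""
    buf = io.BytesIO()
    for v in values:
        payload = pickle.dumps(v)
        buf.write(len(payload).to_bytes(4, "big"))
        buf.write(payload)
    return buf.getvalue()

def stream_deserialize(chunks):
    """Pipelined behaviour: yield each record as soon as its bytes are complete,
    while later chunks are still in flight. Peak memory ~ one chunk + one record."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        # Decode every complete length-prefixed record currently in the buffer.
        while len(buf) >= 4 and len(buf) >= 4 + int.from_bytes(buf[:4], "big"):
            n = int.from_bytes(buf[:4], "big")
            yield pickle.loads(buf[4:4 + n])
            buf = buf[4 + n:]

def fetch_then_deserialize(chunks):
    """Current behaviour per the comment: materialize the full block first
    (peak memory ~ whole block), then wrap it in a deserialization stream."""
    data = b"".join(chunks)
    yield from stream_deserialize(iter([data]))

block = serialize_block(range(100))
# Split the block into small 16-byte "network" chunks.
chunks = [block[i:i + 16] for i in range(0, len(block), 16)]
assert list(stream_deserialize(iter(chunks))) == list(range(100))
```

Both paths produce the same records; the difference is that the streaming path could overlap fetching with the consumer's computation, which is exactly the pipelining discussed above.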
[jira] [Commented] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964249#comment-16964249 ] Sandish Kumar HN commented on SPARK-29625: -- [~hyukjin.kwon] It is happening randomly, so there is no way to reproduce the exact error again. The basic question is: why is Spark trying to reset the offset of the same partition twice? I hope that makes the problem clear. > Spark Structure Streaming Kafka Wrong Reset Offset twice > > > Key: SPARK-29625 > URL: https://issues.apache.org/jira/browse/SPARK-29625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Sandish Kumar HN >Priority: Major > > Spark Structure Streaming Kafka resets the offset twice, once with the right offsets > and a second time with very old offsets > {code} > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-151 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-118 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-85 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 122677634. 
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-19 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 120504922.* > [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO ContextCleaner: Cleaned accumulator 810 > {code} > which is causing a Data loss issue. > {code} > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, > runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > java.lang.IllegalStateException: Partition topic-52's offset was changed from > 122677598 to 120504922, some data may have been missed. > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have > been lost because they are not available in Kafka any more; either the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out > by Kafka or the topic may have been deleted before all the data in the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was > processed. If you don't want your streaming query to fail on such cases, set > the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option > "failOnDataLoss" to "false". 
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) > [2
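The error text in the log above points at the Kafka source's failOnDataLoss option as the escape hatch when offsets have been aged out. A minimal sketch of the reader options, collected as a plain dict so it is runnable without a Spark installation; the broker address and topic name are placeholders, and with a real session these would be passed via spark.readStream.format("kafka").options(**kafka_options):

```python
# Reader options for the Structured Streaming Kafka source. Option names are
# taken from the error message above and the Kafka integration guide; the
# broker address and topic are placeholder values for illustration only.
kafka_options = {
    "kafka.bootstrap.servers": "broker:9092",  # placeholder address
    "subscribe": "topic",                      # topic name from the log above
    "startingOffsets": "latest",
    # Don't fail the query when Kafka has already aged out the requested
    # offsets; the batch proceeds with whatever data is still available.
    # Note this masks genuine data loss, which may or may not be acceptable.
    "failOnDataLoss": "false",
}
```

Whether silently skipping the missing range is acceptable depends on the job; for this ticket the interesting question remains why the reset happened twice at all.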
[jira] [Commented] (SPARK-25923) SparkR UT Failure (checking CRAN incoming feasibility)
[ https://issues.apache.org/jira/browse/SPARK-25923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964247#comment-16964247 ] L. C. Hsieh commented on SPARK-25923: - Got reply back now. It should be fixed now. > SparkR UT Failure (checking CRAN incoming feasibility) > -- > > Key: SPARK-25923 > URL: https://issues.apache.org/jira/browse/SPARK-25923 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: L. C. Hsieh >Priority: Blocker > > Currently, the following SparkR error blocks PR builders. > {code:java} > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 26] do not match the length of object [0] > Execution halted > {code} > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98362/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98367/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98368/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4403/testReport/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25923) SparkR UT Failure (checking CRAN incoming feasibility)
[ https://issues.apache.org/jira/browse/SPARK-25923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964236#comment-16964236 ] L. C. Hsieh commented on SPARK-25923: - Noticed that and asked help from CRAN two hours ago. > SparkR UT Failure (checking CRAN incoming feasibility) > -- > > Key: SPARK-25923 > URL: https://issues.apache.org/jira/browse/SPARK-25923 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: L. C. Hsieh >Priority: Blocker > > Currently, the following SparkR error blocks PR builders. > {code:java} > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 26] do not match the length of object [0] > Execution halted > {code} > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98362/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98367/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98368/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4403/testReport/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25923) SparkR UT Failure (checking CRAN incoming feasibility)
[ https://issues.apache.org/jira/browse/SPARK-25923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964232#comment-16964232 ] Sean R. Owen commented on SPARK-25923: -- [~viirya][~hyukjin.kwon][~dongjoon] Looks like this is happening again -- I wonder if it has anything to do with the changes in master for 3.0 preview? https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4910/console {code} * checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : dims [product 24] do not match the length of object [0] {code} Is this something we can resolve on our side in any way or needs CRAN help? > SparkR UT Failure (checking CRAN incoming feasibility) > -- > > Key: SPARK-25923 > URL: https://issues.apache.org/jira/browse/SPARK-25923 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: L. C. Hsieh >Priority: Blocker > > Currently, the following SparkR error blocks PR builders. > {code:java} > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 26] do not match the length of object [0] > Execution halted > {code} > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98362/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98367/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98368/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4403/testReport/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29690) Spark Shell - Clear imports
[ https://issues.apache.org/jira/browse/SPARK-29690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dinesh updated SPARK-29690: --- Description: I'm facing the below problem with Spark Shell. So, in a shell session - # I imported the following - {color:#57d9a3}{{import scala.collection.immutable.HashMap}}{color} # Then I realized my mistake and imported the correct class - {color:#57d9a3}{{import java.util.HashMap}}{color} But now I get the following error on running my code - {color:#de350b}:34: error: reference to HashMap is ambiguous; it is imported twice in the same scope by import java.util.HashMap and import scala.collection.immutable.HashMap. val colMap = new HashMap[String, HashMap[String, String]](){color} I have a long-running Spark Shell session, i.e. I do not want to close and reopen my shell. So, is there a way I can clear previous imports and use the correct class? I know that we can also specify the fully qualified name like - {color:#57d9a3}{{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}}{color} But I'm looking for a way to clear an incorrectly loaded class. I thought the Spark shell picks up imports from history the same way the REPL does; that said, the previous HashMap should be shadowed by the new import statement. was: I'm facing the below problem with Spark Shell. So, in a shell session - # I imported the following - {color:#57d9a3}{{import scala.collection.immutable.HashMap}}{color} # Then I realized my mistake and imported the correct class - {color:#57d9a3}{{import java.util.HashMap}}{color} But now I get the following error on running my code - {color:#de350b}:34: error: reference to HashMap is ambiguous; it is imported twice in the same scope by import java.util.HashMap and import scala.collection.immutable.HashMap. val colMap = new HashMap[String, HashMap[String, String]](){color} I have a long-running Spark Shell session, i.e. I do not want to close and reopen my shell. 
So, is there a way I can clear previous imports and use the correct class? I know that we can also specify the fully qualified name like - {color:#57d9a3}{{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}}{color} But I'm looking for a way to clear an incorrectly loaded class. I thought the Spark shell picks up imports from history the same way the REPL does; that said, the previous HashMap should be shadowed by the new import statement. > Spark Shell - Clear imports > > > Key: SPARK-29690 > URL: https://issues.apache.org/jira/browse/SPARK-29690 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.2.0 >Reporter: dinesh >Priority: Major > > I'm facing the below problem with Spark Shell. So, in a shell session - > # I imported the following - {color:#57d9a3}{{import > scala.collection.immutable.HashMap}}{color} > # Then I realized my mistake and imported the correct class - > {color:#57d9a3}{{import java.util.HashMap}}{color} > But now I get the following error on running my code - > {color:#de350b}:34: error: reference to HashMap is ambiguous; it > is imported twice in the same scope by import java.util.HashMap and import > scala.collection.immutable.HashMap. val colMap = new HashMap[String, > HashMap[String, String]](){color} > I have a long-running Spark Shell session, i.e. I do not want to close and > reopen my shell. So, is there a way I can clear previous imports and use the > correct class? > I know that we can also specify the fully qualified name like - > {color:#57d9a3}{{val colMap = new java.util.HashMap[String, > java.util.HashMap[String, String]]()}}{color} > But I'm looking for a way to clear an incorrectly loaded class. > > I thought the Spark shell picks up imports from history the same way the REPL > does; that said, the previous HashMap should be shadowed by the new import statement. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29690) Spark Shell - Clear imports
[ https://issues.apache.org/jira/browse/SPARK-29690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dinesh updated SPARK-29690: --- Description: I 'm facing below problem with Spark Shell. So, in a shell session - # I imported following - {color:#57d9a3}{{import scala.collection.immutable.HashMap}}{color} # Then I realized my mistake and imported correct class - {color:#57d9a3}{{import java.util.HashMap}}{color} But, now I get following error on running my code - {color:#de350b}:34: error: reference to HashMap is ambiguous;it is imported twice in the same scope byimport java.util.HashMapand import scala.collection.immutable.HashMapval colMap = new HashMap[String, HashMap[String, String]](){color} if I have long running Spark Shell session i.e I do not want to close and reopen my shell. So, is there a way I can clear previous imports and use correct class? I know that we can also specify full qualified name like - {color:#57d9a3}{{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}}{color} But, 'm looking if there is a way to clear an incorrect loaded class? I thought spark shell picks imports from history the same way REPL does. That said, previous HashMap should be shadowed away with new import statement. {{}} was: I 'm facing below problem with Spark Shell. So, in a shell session - # I imported following - {{import scala.collection.immutable.HashMap}} # Then I realized my mistake and imported correct class - {{import java.util.HashMap}} But, now I get following error on running my code - {color:#de350b}:34: error: reference to HashMap is ambiguous;it is imported twice in the same scope byimport java.util.HashMapand import scala.collection.immutable.HashMapval colMap = new HashMap[String, HashMap[String, String]](){color} if I have long running Spark Shell session i.e I do not want to close and reopen my shell. So, is there a way I can clear previous imports and use correct class? 
I know that we can also specify full qualified name like - {color:#57d9a3}{{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}}{color} But, 'm looking if there is a way to clear an incorrect loaded class? I thought spark shell picks imports from history the same way REPL does. That said, previous HashMap should be shadowed away with new import statement. {{}} > Spark Shell - Clear imports > > > Key: SPARK-29690 > URL: https://issues.apache.org/jira/browse/SPARK-29690 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.2.0 >Reporter: dinesh >Priority: Major > > I 'm facing below problem with Spark Shell. So, in a shell session - > # I imported following - {color:#57d9a3}{{import > scala.collection.immutable.HashMap}}{color} > # Then I realized my mistake and imported correct class - > {color:#57d9a3}{{import java.util.HashMap}}{color} > But, now I get following error on running my code - > {color:#de350b}:34: error: reference to HashMap is ambiguous;it > is imported twice in the same scope byimport java.util.HashMapand import > scala.collection.immutable.HashMapval colMap = new HashMap[String, > HashMap[String, String]](){color} > if I have long running Spark Shell session i.e I do not want to close and > reopen my shell. So, is there a way I can clear previous imports and use > correct class? > I know that we can also specify full qualified name like - > {color:#57d9a3}{{val colMap = new java.util.HashMap[String, > java.util.HashMap[String, String]]()}}{color} > But, 'm looking if there is a way to clear an incorrect loaded class? > > I thought spark shell picks imports from history the same way REPL does. That > said, previous HashMap should be shadowed away with new import statement. > {{}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29690) Spark Shell - Clear imports
[ https://issues.apache.org/jira/browse/SPARK-29690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dinesh updated SPARK-29690: --- Description: I 'm facing below problem with Spark Shell. So, in a shell session - # I imported following - {{import scala.collection.immutable.HashMap}} # Then I realized my mistake and imported correct class - {{import java.util.HashMap}} But, now I get following error on running my code - {color:#de350b}:34: error: reference to HashMap is ambiguous;it is imported twice in the same scope byimport java.util.HashMapand import scala.collection.immutable.HashMapval colMap = new HashMap[String, HashMap[String, String]](){color} if I have long running Spark Shell session i.e I do not want to close and reopen my shell. So, is there a way I can clear previous imports and use correct class? I know that we can also specify full qualified name like - {color:#57d9a3}{{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}}{color} But, 'm looking if there is a way to clear an incorrect loaded class? I thought spark shell picks imports from history the same way REPL does. That said, previous HashMap should be shadowed away with new import statement. {{}} was: I 'm facing below problem with Spark Shell. So, in a shell session - # I imported following - {{import scala.collection.immutable.HashMap}} # Then I realized my mistake and imported correct class - {{import java.util.HashMap}} But, now I get following error on running my code - {{:34: error: reference to HashMap is ambiguous;it is imported twice in the same scope byimport java.util.HashMapand import scala.collection.immutable.HashMapval colMap = new HashMap[String, HashMap[String, String]]()}} {{}} if I have long running Spark Shell session i.e I do not want to close and reopen my shell. So, is there a way I can clear previous imports and use correct class? 
I know that we can also specify full qualified name like - {{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}} But, 'm looking if there is a way to clear an incorrect loaded class? I thought spark shell picks imports from history the same way REPL does. That said, previous HashMap should be shadowed away with new import statement. {{}} > Spark Shell - Clear imports > > > Key: SPARK-29690 > URL: https://issues.apache.org/jira/browse/SPARK-29690 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.2.0 >Reporter: dinesh >Priority: Major > > I 'm facing below problem with Spark Shell. So, in a shell session - > # I imported following - {{import scala.collection.immutable.HashMap}} > # Then I realized my mistake and imported correct class - {{import > java.util.HashMap}} > But, now I get following error on running my code - > {color:#de350b}:34: error: reference to HashMap is ambiguous;it > is imported twice in the same scope byimport java.util.HashMapand import > scala.collection.immutable.HashMapval colMap = new HashMap[String, > HashMap[String, String]](){color} > if I have long running Spark Shell session i.e I do not want to close and > reopen my shell. So, is there a way I can clear previous imports and use > correct class? > I know that we can also specify full qualified name like - > {color:#57d9a3}{{val colMap = new java.util.HashMap[String, > java.util.HashMap[String, String]]()}}{color} > But, 'm looking if there is a way to clear an incorrect loaded class? > > I thought spark shell picks imports from history the same way REPL does. That > said, previous HashMap should be shadowed away with new import statement. > {{}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29690) Spark Shell - Clear imports
dinesh created SPARK-29690: -- Summary: Spark Shell - Clear imports Key: SPARK-29690 URL: https://issues.apache.org/jira/browse/SPARK-29690 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.2.0 Reporter: dinesh I'm facing the below problem with Spark Shell. In a shell session - # I imported the following - {{import scala.collection.immutable.HashMap}} # Then I realized my mistake and imported the correct class - {{import java.util.HashMap}} But now I get the following error when running my code - {{<console>:34: error: reference to HashMap is ambiguous; it is imported twice in the same scope by import java.util.HashMap and import scala.collection.immutable.HashMap (offending line: val colMap = new HashMap[String, HashMap[String, String]]())}} I have a long-running Spark Shell session, i.e. I do not want to close and reopen my shell. So, is there a way I can clear previous imports and use the correct class? I know that we can also specify the fully qualified name, like - {{val colMap = new java.util.HashMap[String, java.util.HashMap[String, String]]()}} But I'm looking for a way to clear an incorrectly loaded class. I thought spark-shell picks up imports from history the same way the REPL does; if so, the previous HashMap should be shadowed by the new import statement. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
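The fully qualified name workaround mentioned in the report can be sketched as follows. A related standard Scala technique, renaming on import, also avoids the clash without restarting the shell (a minimal sketch, not specific to Spark):

```scala
// Both imports can stay in scope; a rename on import removes the ambiguity.
import scala.collection.immutable.HashMap // the mistaken import
import java.util.{HashMap => JHashMap}    // java.util.HashMap under a new name

val colMap = new JHashMap[String, JHashMap[String, String]]()
colMap.put("table1", new JHashMap[String, String]())
```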
[jira] [Updated] (SPARK-29644) ShortType is wrongly set as Int in JDBCUtils.scala
[ https://issues.apache.org/jira/browse/SPARK-29644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29644: - Issue Type: Bug (was: New Feature) > ShortType is wrongly set as Int in JDBCUtils.scala > -- > > Key: SPARK-29644 > URL: https://issues.apache.org/jira/browse/SPARK-29644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Shiv Prashant Sood >Priority: Minor > > @maropu pointed out this issue during the [PR 25344|https://github.com/apache/spark/pull/25344] review discussion. > In [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala], line 547: > case ShortType => > (stmt: PreparedStatement, row: Row, pos: Int) => > stmt.setInt(pos + 1, row.getShort(pos)) > I don't see a reproducible issue, but this is clearly a problem that must be fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
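The snippet above compiles without error because Scala widens Short to Int implicitly, which is why the bug has no visible reproduction. A minimal sketch of the likely fix (assuming the setter structure in JdbcUtils.scala stays as quoted; the actual patch may differ):

```scala
// Scala widens Short to Int implicitly, so stmt.setInt(..., row.getShort(pos))
// type-checks even though it drops the SMALLINT type information at the driver.
val s: Short = 7
val widened: Int = s // no cast needed - this is the silent widening
assert(widened == 7)

// Likely fix in the ShortType setter (sketch, simplified):
// case ShortType =>
//   (stmt: PreparedStatement, row: Row, pos: Int) =>
//     stmt.setShort(pos + 1, row.getShort(pos))
```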
[jira] [Resolved] (SPARK-29675) Add exception when isolationLevel is Illegal
[ https://issues.apache.org/jira/browse/SPARK-29675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29675. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26334 [https://github.com/apache/spark/pull/26334] > Add exception when isolationLevel is Illegal > > > Key: SPARK-29675 > URL: https://issues.apache.org/jira/browse/SPARK-29675 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.0.0 > > > Currently, when we use the JDBC API and set an illegal isolationLevel option, Spark throws a `scala.MatchError`, which is not user-friendly. We should throw an IllegalArgumentException instead. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29675) Add exception when isolationLevel is Illegal
[ https://issues.apache.org/jira/browse/SPARK-29675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29675: - Assignee: ulysses you > Add exception when isolationLevel is Illegal > > > Key: SPARK-29675 > URL: https://issues.apache.org/jira/browse/SPARK-29675 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > > > Currently, when we use the JDBC API and set an illegal isolationLevel option, Spark throws a `scala.MatchError`, which is not user-friendly. We should throw an IllegalArgumentException instead. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
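The proposed improvement can be sketched as a pattern match with an explicit catch-all. This is a hedged sketch: the option names follow the standard JDBC isolation levels, and the actual Spark code may structure the mapping differently.

```scala
import java.sql.Connection

// Map the user-supplied `isolationLevel` option to a java.sql.Connection
// constant, failing with a descriptive error instead of a bare MatchError.
def toIsolationLevel(value: String): Int = value match {
  case "NONE"             => Connection.TRANSACTION_NONE
  case "READ_UNCOMMITTED" => Connection.TRANSACTION_READ_UNCOMMITTED
  case "READ_COMMITTED"   => Connection.TRANSACTION_READ_COMMITTED
  case "REPEATABLE_READ"  => Connection.TRANSACTION_REPEATABLE_READ
  case "SERIALIZABLE"     => Connection.TRANSACTION_SERIALIZABLE
  case other => throw new IllegalArgumentException(
    s"Invalid value `$other` for parameter `isolationLevel`. Allowed values: " +
    "NONE, READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ, SERIALIZABLE.")
}
```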
[jira] [Commented] (SPARK-29682) Failure when resolving conflicting references in Join:
[ https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964158#comment-16964158 ] Terry Kim commented on SPARK-29682: --- Sure, I will look into this. Thanks for pinging me. > Failure when resolving conflicting references in Join: > -- > > Key: SPARK-29682 > URL: https://issues.apache.org/jira/browse/SPARK-29682 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.4.3 >Reporter: sandeshyapuram >Priority: Major > > When I try to self-join a parentDf with multiple childDfs (say childDf1, ...), where the childDfs are derived from a cube or rollup and filtered on the grouping columns, I get the error > {{Failure when resolving conflicting references in Join: }} > followed by a long error message which is quite unreadable. On the other hand, if I replace cube or rollup with a plain groupBy, it works without issues. > > *Sample code:* > {code:java} > val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums") > val cubeDF = numsDF > .cube("nums") > .agg( > max(lit(0)).as("agcol"), > grouping_id().as("gid") > ) > > val group0 = cubeDF.filter(col("gid") <=> lit(0)) > val group1 = cubeDF.filter(col("gid") <=> lit(1)) > cubeDF.printSchema > group0.printSchema > group1.printSchema > //Recreating cubeDf > cubeDF.select("nums").distinct > .join(group0, Seq("nums"), "inner") > .join(group1, Seq("nums"), "inner") > .show > {code} > *Sample output:* > {code:java} > numsDF: org.apache.spark.sql.DataFrame = [nums: int] > cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more > field] > group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, > agcol: int ... 1 more field] > group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, > agcol: int ... 
1 more field] > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > org.apache.spark.sql.AnalysisException: > Failure when resolving conflicting references in Join: > 'Join Inner > :- Deduplicate [nums#220] > : +- Project [nums#220] > : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > : +- Project [nums#212, nums#212 AS nums#219] > : +- Project [value#210 AS nums#212] > : +- SerializeFromObject [input[0, int, false] AS value#210] > :+- ExternalRDD [obj#209] > +- Filter (gid#217 <=> 0) >+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > +- Project [nums#212, nums#212 AS nums#219] > +- Project [value#210 AS nums#212] >+- SerializeFromObject [input[0, int, false] AS value#210] > +- ExternalRDD [obj#209] > Conflicting attributes: nums#220 > ;; > 'Join Inner > :- Deduplicate [nums#220] > : +- Project [nums#220] > : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > : +- Project [nums#212, nums#212 AS nums#219] > : +- Project [value#210 AS nums#212] > : +- SerializeFromObject [input[0, int, false] AS value#210] > :+- ExternalRDD [obj#209] > +- Filter (gid#217 <=> 0) >+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, 
max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > +- Project [nums#212, nums#212 AS nums#219] > +- Project [value#210 AS nums#212] >+- SerializeFromObject [input[0, int, false] AS value#210] > +- ExternalRDD [obj#209] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkA
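One possible workaround while this is investigated (an assumption on my part, not from the report): force fresh attribute IDs on the filtered sides of the self-join by rebuilding each DataFrame from its RDD and schema, which breaks the shared lineage that triggers the conflicting-references check. Note this materializes the lineage, so it trades performance for analyzability.

```scala
// Hypothetical workaround: recreate group0/group1 with fresh expression IDs
// so the analyzer no longer sees the same attributes on both sides of the join.
val group0Fresh = spark.createDataFrame(group0.rdd, group0.schema)
val group1Fresh = spark.createDataFrame(group1.rdd, group1.schema)

cubeDF.select("nums").distinct
  .join(group0Fresh, Seq("nums"), "inner")
  .join(group1Fresh, Seq("nums"), "inner")
  .show()
```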
[jira] [Commented] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964157#comment-16964157 ] Terry Kim commented on SPARK-29630: --- Yes, I will take a look. > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29604) SessionState is initialized with isolated classloader for Hive if spark.sql.hive.metastore.jars is being set
[ https://issues.apache.org/jira/browse/SPARK-29604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964137#comment-16964137 ] Dongjoon Hyun commented on SPARK-29604: --- Thank you for keeping working on this. Yes. It passed locally. That's the reason why I didn't revert this patch until now. But, we are on 3.0.0-preview voting. In the worst case, we need to revert this. > SessionState is initialized with isolated classloader for Hive if > spark.sql.hive.metastore.jars is being set > > > Key: SPARK-29604 > URL: https://issues.apache.org/jira/browse/SPARK-29604 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > I've observed the issue that external listeners cannot be loaded properly > when we run spark-sql with "spark.sql.hive.metastore.jars" configuration > being used. > {noformat} > Exception in thread "main" java.lang.IllegalArgumentException: Error while > instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder': > at > org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1102) > at > org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:154) > at > org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:153) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:153) > at > org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:150) > at > org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$2.apply(SparkSession.scala:104) > at > org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$2.apply(SparkSession.scala:104) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:104) > at > 
org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:103) > at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:149) > at > org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$client(HiveClientImpl.scala:282) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:306) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:247) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:246) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:296) > at > org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:386) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214) > at > org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) > at > org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:53) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:315) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:847) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSub
[jira] [Assigned] (SPARK-29277) DataSourceV2: Add early filter and projection pushdown
[ https://issues.apache.org/jira/browse/SPARK-29277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29277: - Assignee: Ryan Blue > DataSourceV2: Add early filter and projection pushdown > -- > > Key: SPARK-29277 > URL: https://issues.apache.org/jira/browse/SPARK-29277 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > Spark uses optimizer rules that need stats before conversion to physical > plan. DataSourceV2 should support early pushdown for those rules. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29277) DataSourceV2: Add early filter and projection pushdown
[ https://issues.apache.org/jira/browse/SPARK-29277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29277. --- Resolution: Fixed Issue resolved by pull request 26341 [https://github.com/apache/spark/pull/26341] > DataSourceV2: Add early filter and projection pushdown > -- > > Key: SPARK-29277 > URL: https://issues.apache.org/jira/browse/SPARK-29277 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > Spark uses optimizer rules that need stats before conversion to physical > plan. DataSourceV2 should support early pushdown for those rules. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Description: As shown in the attachment, if a task fails while reading shuffle data or because of executor loss, its shuffle read size is shown as 0. But this size is important for users: it can help detect data skew. was: If a task fails while reading shuffle data or because of executor loss, its shuffle read size is shown as 0. But this size is important for users: it can help detect data skew. !screenshot-1.png! > [WEB-UI] When task failed during reading shuffle data or other failure, > enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > As shown in the attachment, if a task fails while reading shuffle data or because of executor loss, its shuffle read size is shown as 0. > But this size is important for users: it can help detect data skew. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Description: If a task fails while reading shuffle data or because of executor loss, its shuffle read size is shown as 0. But this size is important for users: it can help detect data skew. !screenshot-1.png! > [WEB-UI] When task failed during reading shuffle data or other failure, > enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > If a task fails while reading shuffle data or because of executor loss, its > shuffle read size is shown as 0. > But this size is important for users: it can help detect data skew. > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Attachment: screenshot-1.png > [WEB-UI] When task failed during reading shuffle data or other failure, > enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Summary: [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size (was: [UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size) > [WEB-UI] When task failed during reading shuffle data or other failure, > enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29689) [UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
feiwang created SPARK-29689: --- Summary: [UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size Key: SPARK-29689 URL: https://issues.apache.org/jira/browse/SPARK-29689 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.4 Reporter: feiwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29688) Support average with interval type values
Kent Yao created SPARK-29688: Summary: Support average with interval type values Key: SPARK-29688 URL: https://issues.apache.org/jira/browse/SPARK-29688 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao Add average aggregate support for interval values in Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
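A hypothetical usage example for the proposed feature. The syntax is assumed to mirror existing aggregates; this is not available behavior in any released Spark at the time of the issue.

```scala
// Assumed behavior once average over interval values is supported:
// sum the intervals, then divide by the row count.
spark.sql(
  "SELECT avg(v) FROM VALUES (interval 1 day), (interval 3 day) AS t(v)"
).show()
// expected result: an interval of 2 days (assumption, since the feature is proposed)
```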
[jira] [Commented] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964015#comment-16964015 ] ulysses you commented on SPARK-29687: - Sorry, my mistake. > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > The JDBC metrics counter variable is an Int type that may overflow. Change it to > Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964011#comment-16964011 ] ulysses you edited comment on SPARK-29687 at 10/31/19 1:33 PM: --- You can see the PR [26346|https://github.com/apache/spark/pull/26346] was (Author: ulysses): You can see the PR [26334|https://github.com/apache/spark/pull/26334] > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > The JDBC metrics counter variable is an Int type that may overflow. Change it to > Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964011#comment-16964011 ] ulysses you commented on SPARK-29687: - You can see the PR [26334|https://github.com/apache/spark/pull/26334] > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > The JDBC metrics counter variable is an Int type that may overflow. Change it to > Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964011#comment-16964011 ] ulysses you edited comment on SPARK-29687 at 10/31/19 1:31 PM: --- You can see the PR [26334|https://github.com/apache/spark/pull/26334] was (Author: ulysses): You can see the PR [26334|https://github.com/apache/spark/pull/26334] > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > The JDBC metrics counter variable is an Int type that may overflow. Change it to > Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963993#comment-16963993 ] jobit mathew commented on SPARK-29687: -- Hi, can you give some details about this variable and where it is used? > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > The JDBC metrics counter variable is an Int type that may overflow. Change it to > Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29687) Fix jdbc metrics counter type to long
[ https://issues.apache.org/jira/browse/SPARK-29687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-29687: Affects Version/s: (was: 2.4.4) 3.0.0 > Fix jdbc metrics counter type to long > - > > Key: SPARK-29687 > URL: https://issues.apache.org/jira/browse/SPARK-29687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > The JDBC metrics counter variable is an Int type that may overflow. Change it to > Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29687) Fix jdbc metrics counter type to long
ulysses you created SPARK-29687: --- Summary: Fix jdbc metrics counter type to long Key: SPARK-29687 URL: https://issues.apache.org/jira/browse/SPARK-29687 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: ulysses you The JDBC metrics counter variable is an Int type that may overflow. Change it to Long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
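The motivation in plain Scala terms: a JVM Int wraps silently at 2^31 - 1, so a counter that crosses roughly 2.1 billion goes negative, while a Long keeps counting far beyond that:

```scala
// Int arithmetic wraps around silently on the JVM - no exception is thrown.
var rowCount: Int = Int.MaxValue // 2147483647
rowCount += 1
assert(rowCount == Int.MinValue) // now -2147483648

// The same counter as a Long keeps counting correctly.
var rowCountLong: Long = Int.MaxValue.toLong
rowCountLong += 1
assert(rowCountLong == 2147483648L)
```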
[jira] [Commented] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963858#comment-16963858 ] Wenchen Fan commented on SPARK-29630: - Yea, this should be disallowed. We store views as SQL text, so temp views are not allowed to appear in a permanent view's SQL text. I think it's a bug in the checking logic of `CREATE VIEW`: it doesn't traverse subqueries. [~imback82] do you have time to look into it? > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29682) Failure when resolving conflicting references in Join:
[ https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963850#comment-16963850 ] Wenchen Fan commented on SPARK-29682: - [~imback82] do you want to look into it? > Failure when resolving conflicting references in Join: > -- > > Key: SPARK-29682 > URL: https://issues.apache.org/jira/browse/SPARK-29682 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.4.3 >Reporter: sandeshyapuram >Priority: Major > > When I try to self-join a parentDf with multiple childDfs (say childDf1, ...), where the childDfs are derived from a cube or rollup and filtered on the grouping columns, I get the error > {{Failure when resolving conflicting references in Join: }} > followed by a long error message which is quite unreadable. On the other hand, if I replace cube or rollup with a plain groupBy, it works without issues. > > *Sample code:* > {code:java} > val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums") > val cubeDF = numsDF > .cube("nums") > .agg( > max(lit(0)).as("agcol"), > grouping_id().as("gid") > ) > > val group0 = cubeDF.filter(col("gid") <=> lit(0)) > val group1 = cubeDF.filter(col("gid") <=> lit(1)) > cubeDF.printSchema > group0.printSchema > group1.printSchema > //Recreating cubeDf > cubeDF.select("nums").distinct > .join(group0, Seq("nums"), "inner") > .join(group1, Seq("nums"), "inner") > .show > {code} > *Sample output:* > {code:java} > numsDF: org.apache.spark.sql.DataFrame = [nums: int] > cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more > field] > group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, > agcol: int ... 1 more field] > group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, > agcol: int ... 
1 more field] > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > org.apache.spark.sql.AnalysisException: > Failure when resolving conflicting references in Join: > 'Join Inner > :- Deduplicate [nums#220] > : +- Project [nums#220] > : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > : +- Project [nums#212, nums#212 AS nums#219] > : +- Project [value#210 AS nums#212] > : +- SerializeFromObject [input[0, int, false] AS value#210] > :+- ExternalRDD [obj#209] > +- Filter (gid#217 <=> 0) >+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > +- Project [nums#212, nums#212 AS nums#219] > +- Project [value#210 AS nums#212] >+- SerializeFromObject [input[0, int, false] AS value#210] > +- ExternalRDD [obj#209] > Conflicting attributes: nums#220 > ;; > 'Join Inner > :- Deduplicate [nums#220] > : +- Project [nums#220] > : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > : +- Project [nums#212, nums#212 AS nums#219] > : +- Project [value#210 AS nums#212] > : +- SerializeFromObject [input[0, int, false] AS value#210] > :+- ExternalRDD [obj#209] > +- Filter (gid#217 <=> 0) >+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, 
max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > +- Project [nums#212, nums#212 AS nums#219] > +- Project [value#210 AS nums#212] >+- SerializeFromObject [input[0, int, false] AS value#210] > +- ExternalRDD [obj#209] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis
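Until the resolver bug is fixed, a commonly suggested workaround for this class of self-join failure is to break the shared lineage before joining. The sketch below is hedged: it assumes a `SparkSession` named `spark` plus the `cubeDF`/`group0`/`group1` frames from the report, and round-trips each filtered branch through its RDD and schema so that each side of the join is re-analyzed with fresh attribute IDs instead of the conflicting `nums#220`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hedged workaround sketch (not from the ticket): rebuilding a DataFrame
// from its RDD and schema forces a fresh analysis, assigning new attribute
// IDs, so the cube-derived branches no longer share conflicting references.
def freshCopy(spark: SparkSession, df: DataFrame): DataFrame =
  spark.createDataFrame(df.rdd, df.schema)

cubeDF.select("nums").distinct
  .join(freshCopy(spark, group0), Seq("nums"), "inner")
  .join(freshCopy(spark, group1), Seq("nums"), "inner")
  .show()
```

The cost of this trick is that the rebuilt branches lose their optimizable lineage (Spark materializes them through the RDD), so it is a stopgap rather than a fix.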
[jira] [Updated] (SPARK-29685) Spark SQL also better to show the column details while doing SELECT * from table, like sparkshell and spark beeline
[ https://issues.apache.org/jira/browse/SPARK-29685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jobit mathew updated SPARK-29685: - Issue Type: Improvement (was: Bug) > Spark SQL also better to show the column details while doing SELECT * from > table, like sparkshell and spark beeline > --- > > Key: SPARK-29685 > URL: https://issues.apache.org/jira/browse/SPARK-29685 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: jobit mathew >Priority: Minor > > It would be better if Spark SQL also showed the column details at the top while doing > SELECT * from a table, like the Spark Scala shell and Spark Beeline show in table format. > *Test steps* > 1.create table table1(id int,name string,address string); > 2.insert into table1 values (5,name1,add1); > 3.insert into table1 values (5,name2,add2); > 4.insert into table1 values (5,name3,add3); > {code:java} > spark-sql> select * from table1; > 5 name3 add3 > 5 name1 add1 > 5 name2 add2 > But the Spark Scala shell & Spark Beeline show the column details in > table format > scala> sql("select * from table1").show() > +---+-----+-------+ > | id| name|address| > +---+-----+-------+ > |  5|name3|   add3| > |  5|name1|   add1| > |  5|name2|   add2| > +---+-----+-------+ > scala> > 0: jdbc:hive2://10.18.18.214:23040/default> select * from table1; > +-----+--------+----------+ > | id  | name   | address  | > +-----+--------+----------+ > | 5   | name3  | add3     | > | 5   | name1  | add1     | > | 5   | name2  | add2     | > +-----+--------+----------+ > 3 rows selected (0.679 seconds) > 0: jdbc:hive2://10.18.18.214:23040/default> > {code}
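Pending any change in default behavior, the spark-sql shell can already print a header row via the Hive CLI option it inherits; the snippet below assumes the `table1` from the report, and the exact behavior of `hive.cli.print.header` may vary by Spark/Hive version.

```sql
-- Possible workaround in the spark-sql shell (Hive CLI setting, hedged):
SET hive.cli.print.header=true;
SELECT * FROM table1;
```

The same option can be supplied at launch, e.g. `spark-sql --hiveconf hive.cli.print.header=true`, so headers appear for the whole session.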
[jira] [Updated] (SPARK-29685) Spark SQL also better to show the column details while doing SELECT * from table, like sparkshell and spark beeline
[ https://issues.apache.org/jira/browse/SPARK-29685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jobit mathew updated SPARK-29685: - Description: It would be better if Spark SQL also showed the column details at the top while doing SELECT * from a table, like the Spark Scala shell and Spark Beeline show in table format. *Test steps* 1.create table table1(id int,name string,address string); 2.insert into table1 values (5,name1,add1); 3.insert into table1 values (5,name2,add2); 4.insert into table1 values (5,name3,add3); {code:java} spark-sql> select * from table1; 5 name3 add3 5 name1 add1 5 name2 add2 But the Spark Scala shell & Spark Beeline show the column details in table format scala> sql("select * from table1").show() +---+-----+-------+ | id| name|address| +---+-----+-------+ |  5|name3|   add3| |  5|name1|   add1| |  5|name2|   add2| +---+-----+-------+ scala> 0: jdbc:hive2://10.18.18.214:23040/default> select * from table1; +-----+--------+----------+ | id  | name   | address  | +-----+--------+----------+ | 5   | name3  | add3     | | 5   | name1  | add1     | | 5   | name2  | add2     | +-----+--------+----------+ 3 rows selected (0.679 seconds) 0: jdbc:hive2://10.18.18.214:23040/default> {code} was: It would be better if Spark SQL also showed the column details at the top while doing SELECT * from a table, like the Spark Scala shell and Spark Beeline show in table format.
*Test steps* 1.create table table1(id int,name string,address string); 2.insert into table1 values (5,name1,add1); 3.insert into table1 values (5,name2,add2); 4.insert into table1 values (5,name3,add3); spark-sql> select * from table1; 5 name3 add3 5 name1 add1 5 name2 add2 But the Spark Scala shell & Spark Beeline show the column details in table format scala> sql("select * from table1").show() +---+-----+-------+ | id| name|address| +---+-----+-------+ |  5|name3|   add3| |  5|name1|   add1| |  5|name2|   add2| +---+-----+-------+ scala> 0: jdbc:hive2://10.18.18.214:23040/default> select * from table1; +-----+--------+----------+ | id  | name   | address  | +-----+--------+----------+ | 5   | name3  | add3     | | 5   | name1  | add1     | | 5   | name2  | add2     | +-----+--------+----------+ 3 rows selected (0.679 seconds) 0: jdbc:hive2://10.18.18.214:23040/default> > Spark SQL also better to show the column details while doing SELECT * from > table, like sparkshell and spark beeline > --- > > Key: SPARK-29685 > URL: https://issues.apache.org/jira/browse/SPARK-29685 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: jobit mathew >Priority: Minor > > It would be better if Spark SQL also showed the column details at the top while doing > SELECT * from a table, like the Spark Scala shell and Spark Beeline show in table format.
> *Test steps* > 1.create table table1(id int,name string,address string); > 2.insert into table1 values (5,name1,add1); > 3.insert into table1 values (5,name2,add2); > 4.insert into table1 values (5,name3,add3); > {code:java} > spark-sql> select * from table1; > 5 name3 add3 > 5 name1 add1 > 5 name2 add2 > But the Spark Scala shell & Spark Beeline show the column details in > table format > scala> sql("select * from table1").show() > +---+-----+-------+ > | id| name|address| > +---+-----+-------+ > |  5|name3|   add3| > |  5|name1|   add1| > |  5|name2|   add2| > +---+-----+-------+ > scala> > 0: jdbc:hive2://10.18.18.214:23040/default> select * from table1; > +-----+--------+----------+ > | id  | name   | address  | > +-----+--------+----------+ > | 5   | name3  | add3     | > | 5   | name1  | add1     | > | 5   | name2  | add2     | > +-----+--------+----------+ > 3 rows selected (0.679 seconds) > 0: jdbc:hive2://10.18.18.214:23040/default> > {code}