[jira] [Assigned] (SPARK-24886) Increase Jenkins build time

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24886:


Assignee: Apache Spark  (was: Hyukjin Kwon)

> Increase Jenkins build time
> ---
>
> Key: SPARK-24886
> URL: https://issues.apache.org/jira/browse/SPARK-24886
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently, it looks like we hit the time limit from time to time. It seems 
> better to increase the time limit a bit.
> For instance, please see https://github.com/apache/spark/pull/21822



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24886) Increase Jenkins build time

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24886:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Increase Jenkins build time
> ---
>
> Key: SPARK-24886
> URL: https://issues.apache.org/jira/browse/SPARK-24886
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, it looks like we hit the time limit from time to time. It seems 
> better to increase the time limit a bit.
> For instance, please see https://github.com/apache/spark/pull/21822



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24886) Increase Jenkins build time

2018-08-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-24886:
--
  Assignee: Hyukjin Kwon

Reopened since we hit the time limit issue.

> Increase Jenkins build time
> ---
>
> Key: SPARK-24886
> URL: https://issues.apache.org/jira/browse/SPARK-24886
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, it looks like we hit the time limit from time to time. It seems 
> better to increase the time limit a bit.
> For instance, please see https://github.com/apache/spark/pull/21822



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25040) Empty string for double and float types should be nulls in JSON

2018-08-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571126#comment-16571126
 ] 

Apache Spark commented on SPARK-25040:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22019

> Empty string for double and float types  should be nulls in JSON
> 
>
> Key: SPARK-25040
> URL: https://issues.apache.org/jira/browse/SPARK-25040
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> The issue itself seems to be a behaviour change between 1.6 and 2.x in 
> whether an empty string is treated as null for double and float columns.
> {code}
> {"a":"a1","int":1,"other":4.4}
> {"a":"a2","int":"","other":""}
> {code}
> code :
> {code}
> val config = new SparkConf().setMaster("local[5]").setAppName("test")
> val sc = SparkContext.getOrCreate(config)
> val sql = new SQLContext(sc)
> val file_path = 
> this.getClass.getClassLoader.getResource("Sanity4.json").getFile
> val df = sql.read.schema(null).json(file_path)
> df.show(30)
> {code}
> then in spark 1.6, result is
> {code}
> +---+----+-----+
> |  a| int|other|
> +---+----+-----+
> | a1|   1|  4.4|
> | a2|null| null|
> +---+----+-----+
> {code}
> {code}
> root
> |-- a: string (nullable = true)
> |-- int: long (nullable = true)
> |-- other: double (nullable = true)
> {code}
> but in spark 2.2, result is
> {code}
> +----+----+-----+
> |   a| int|other|
> +----+----+-----+
> |  a1|   1|  4.4|
> |null|null| null|
> +----+----+-----+
> {code}
> {code}
> root
> |-- a: string (nullable = true)
> |-- int: long (nullable = true)
> |-- other: double (nullable = true)
> {code}
> Another easy reproducer:
> {code}
> spark.read.schema("a DOUBLE, b FLOAT")
>   .option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", """{"a": 
> 1.1, "b": 1.1}""").toDS)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25040) Empty string for double and float types should be nulls in JSON

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25040:


Assignee: (was: Apache Spark)

> Empty string for double and float types  should be nulls in JSON
> 
>
> Key: SPARK-25040
> URL: https://issues.apache.org/jira/browse/SPARK-25040
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> The issue itself seems to be a behaviour change between 1.6 and 2.x in 
> whether an empty string is treated as null for double and float columns.
> {code}
> {"a":"a1","int":1,"other":4.4}
> {"a":"a2","int":"","other":""}
> {code}
> code :
> {code}
> val config = new SparkConf().setMaster("local[5]").setAppName("test")
> val sc = SparkContext.getOrCreate(config)
> val sql = new SQLContext(sc)
> val file_path = 
> this.getClass.getClassLoader.getResource("Sanity4.json").getFile
> val df = sql.read.schema(null).json(file_path)
> df.show(30)
> {code}
> then in spark 1.6, result is
> {code}
> +---+----+-----+
> |  a| int|other|
> +---+----+-----+
> | a1|   1|  4.4|
> | a2|null| null|
> +---+----+-----+
> {code}
> {code}
> root
> |-- a: string (nullable = true)
> |-- int: long (nullable = true)
> |-- other: double (nullable = true)
> {code}
> but in spark 2.2, result is
> {code}
> +----+----+-----+
> |   a| int|other|
> +----+----+-----+
> |  a1|   1|  4.4|
> |null|null| null|
> +----+----+-----+
> {code}
> {code}
> root
> |-- a: string (nullable = true)
> |-- int: long (nullable = true)
> |-- other: double (nullable = true)
> {code}
> Another easy reproducer:
> {code}
> spark.read.schema("a DOUBLE, b FLOAT")
>   .option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", """{"a": 
> 1.1, "b": 1.1}""").toDS)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25040) Empty string for double and float types should be nulls in JSON

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25040:


Assignee: Apache Spark

> Empty string for double and float types  should be nulls in JSON
> 
>
> Key: SPARK-25040
> URL: https://issues.apache.org/jira/browse/SPARK-25040
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> The issue itself seems to be a behaviour change between 1.6 and 2.x in 
> whether an empty string is treated as null for double and float columns.
> {code}
> {"a":"a1","int":1,"other":4.4}
> {"a":"a2","int":"","other":""}
> {code}
> code :
> {code}
> val config = new SparkConf().setMaster("local[5]").setAppName("test")
> val sc = SparkContext.getOrCreate(config)
> val sql = new SQLContext(sc)
> val file_path = 
> this.getClass.getClassLoader.getResource("Sanity4.json").getFile
> val df = sql.read.schema(null).json(file_path)
> df.show(30)
> {code}
> then in spark 1.6, result is
> {code}
> +---+----+-----+
> |  a| int|other|
> +---+----+-----+
> | a1|   1|  4.4|
> | a2|null| null|
> +---+----+-----+
> {code}
> {code}
> root
> |-- a: string (nullable = true)
> |-- int: long (nullable = true)
> |-- other: double (nullable = true)
> {code}
> but in spark 2.2, result is
> {code}
> +----+----+-----+
> |   a| int|other|
> +----+----+-----+
> |  a1|   1|  4.4|
> |null|null| null|
> +----+----+-----+
> {code}
> {code}
> root
> |-- a: string (nullable = true)
> |-- int: long (nullable = true)
> |-- other: double (nullable = true)
> {code}
> Another easy reproducer:
> {code}
> spark.read.schema("a DOUBLE, b FLOAT")
>   .option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", """{"a": 
> 1.1, "b": 1.1}""").toDS)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25040) Empty string for double and float types should be nulls in JSON

2018-08-06 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-25040:


 Summary: Empty string for double and float types  should be nulls 
in JSON
 Key: SPARK-25040
 URL: https://issues.apache.org/jira/browse/SPARK-25040
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.4.0
Reporter: Hyukjin Kwon


The issue itself seems to be a behaviour change between 1.6 and 2.x in 
whether an empty string is treated as null for double and float columns.

{code}
{"a":"a1","int":1,"other":4.4}
{"a":"a2","int":"","other":""}
{code}

code :

{code}
val config = new SparkConf().setMaster("local[5]").setAppName("test")
val sc = SparkContext.getOrCreate(config)
val sql = new SQLContext(sc)

val file_path = this.getClass.getClassLoader.getResource("Sanity4.json").getFile
val df = sql.read.schema(null).json(file_path)
df.show(30)
{code}

then in spark 1.6, result is
{code}
+---+----+-----+
|  a| int|other|
+---+----+-----+
| a1|   1|  4.4|
| a2|null| null|
+---+----+-----+
{code}

{code}
root
|-- a: string (nullable = true)
|-- int: long (nullable = true)
|-- other: double (nullable = true)
{code}

but in spark 2.2, result is

{code}
+----+----+-----+
|   a| int|other|
+----+----+-----+
|  a1|   1|  4.4|
|null|null| null|
+----+----+-----+
{code}

{code}
root
|-- a: string (nullable = true)
|-- int: long (nullable = true)
|-- other: double (nullable = true)
{code}

Another easy reproducer:

{code}
spark.read.schema("a DOUBLE, b FLOAT")
  .option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", """{"a": 
1.1, "b": 1.1}""").toDS)
{code}
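
For reference, a self-contained version of the second reproducer (a sketch only; it assumes a local Spark 2.x session, and the object/app name is illustrative):

{code:java}
import org.apache.spark.sql.SparkSession

object EmptyStringJsonRepro {
  def main(args: Array[String]): Unit = {
    // Local session just for the reproduction; master/appName are illustrative.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("empty-string-json-repro")
      .getOrCreate()
    import spark.implicits._

    // The reporter's expectation is that "" parses as null for DOUBLE/FLOAT
    // instead of failing; with FAILFAST any parse failure surfaces as an exception.
    val df = spark.read
      .schema("a DOUBLE, b FLOAT")
      .option("mode", "FAILFAST")
      .json(Seq("""{"a":"", "b": ""}""", """{"a": 1.1, "b": 1.1}""").toDS)

    df.show()
    spark.stop()
  }
}
{code}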



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25039) Binary comparison behavior should refer to Teradata

2018-08-06 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-25039:
---

 Summary: Binary comparison behavior should refer to Teradata
 Key: SPARK-25039
 URL: https://issues.apache.org/jira/browse/SPARK-25039
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Yuming Wang


The main differences are:

# When comparing a {{StringType}} value with a {{NumericType}} value, Spark 
converts the {{StringType}} data to a {{NumericType}} value. But Teradata 
converts the {{StringType}} data to a {{DoubleType}} value.
# When comparing a {{StringType}} value with a {{DateType}} value, Spark 
converts the {{DateType}} data to a {{StringType}} value. But Teradata converts 
the {{StringType}} data to a {{DateType}} value.
 

More details:
https://github.com/apache/spark/blob/65a4bc143ab5dc2ced589dc107bbafa8a7290931/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L120-L149
https://www.info.teradata.com/HTMLPubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1145-160K/lrn1472241011038.html
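
For illustration, a small sketch that exercises both comparisons from the list above (the table and column names are made up; the results depend on which coercion rule applies):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("binary-comparison").getOrCreate()

// Hypothetical data: a string column compared against a numeric and a date value.
spark.sql("SELECT '10' AS str_col").createOrReplaceTempView("t")

// StringType vs. NumericType: which side gets cast decides whether '10' matches 10.
spark.sql("SELECT * FROM t WHERE str_col = 10").show()

// StringType vs. DateType: casting the date to a string vs. the string to a date
// changes how values like '2018-8-6' compare against DATE '2018-08-06'.
spark.sql("SELECT '2018-8-6' = DATE '2018-08-06' AS same_day").show()
{code}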




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23940) High-order function: transform_values(map, function) → map

2018-08-06 Thread Neha Patil (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571095#comment-16571095
 ] 

Neha Patil commented on SPARK-23940:


Will publish a PR by tomorrow.

> High-order function: transform_values(map, function) → 
> map
> ---
>
> Key: SPARK-23940
> URL: https://issues.apache.org/jira/browse/SPARK-23940
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map that applies function to each entry of map and transforms the 
> values.
> {noformat}
> SELECT transform_values(MAP(ARRAY[], ARRAY[]), (k, v) -> v + 1); -- {}
> SELECT transform_values(MAP(ARRAY [1, 2, 3], ARRAY [10, 20, 30]), (k, v) -> v 
> + k); -- {1 -> 11, 2 -> 22, 3 -> 33}
> SELECT transform_values(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) 
> -> k * k); -- {1 -> 1, 2 -> 4, 3 -> 9}
> SELECT transform_values(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || 
> CAST(v as VARCHAR)); -- {a -> a1, b -> b2}
> SELECT transform_values(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {1 -> 
> one_1.0, 2 -> two_1.4}
> (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k] || 
> '_' || CAST(v AS VARCHAR));
> {noformat}
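
A sketch of what the equivalent usage could look like in Spark SQL once this sub-task is implemented (hypothetical; it simply mirrors the Presto semantics above and assumes an active SparkSession named spark):

{code:java}
// Hypothetical Spark SQL usage of transform_values, mirroring the Presto examples.
spark.sql("SELECT transform_values(map(1, 10, 2, 20, 3, 30), (k, v) -> v + k)").show(truncate = false)
// expected, per the Presto semantics above: {1 -> 11, 2 -> 22, 3 -> 33}
{code}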



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25038:


Assignee: (was: Apache Spark)

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Major
> Attachments: issue sql optimized.png, issue sql original.png, job 
> start optimized.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table which is partitioned by date and hour. There are more than 
> 13 TB of data each hour and 185 TB per day. When we issue a very simple 
> SQL query, it takes a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> Before optimization, it takes 2 minutes and 9 seconds to generate the Job
>  
> The SQL is issued at 2018-08-07 09:07:41
> !issue sql original.png!
> However, the job is submitted at 2018-08-07 09:09:53, which is 2 minutes and 9 
> seconds later than the SQL issue time.
> !job start original.png!
>  
> After the optimization, it takes only 4 seconds to generate the Job
> The SQL is issued at 2018-08-07 09:20:15
> !issue sql optimized.png!
>  
> And the job is submitted at 2018-08-07 09:20:19, which is 4 seconds later 
> than the SQL issue time
> !job start optimized.png!
>  
>  
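
Not the change proposed by this issue, but a related knob worth noting: for tables with very many partitions and files, the file-listing part of planning can sometimes be influenced by the parallel partition discovery settings. A sketch only; the values below are illustrative, not recommendations, and it assumes the test_tbl table from the description exists:

{code:java}
import org.apache.spark.sql.SparkSession

// Illustrative values; both keys exist in Spark 2.3, but tuning them is a
// hypothesis here, not the optimization this issue implements.
val spark = SparkSession.builder()
  .appName("plan-generation-tuning")
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")
  .getOrCreate()

spark.sql("select count(device_id) from test_tbl where date=20180731 and hour='21'").show()
{code}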



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571081#comment-16571081
 ] 

Apache Spark commented on SPARK-25038:
--

User 'habren' has created a pull request for this issue:
https://github.com/apache/spark/pull/22018

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Major
> Attachments: issue sql optimized.png, issue sql original.png, job 
> start optimized.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table which is partitioned by date and hour. There are more than 
> 13 TB of data each hour and 185 TB per day. When we issue a very simple 
> SQL query, it takes a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> Before optimization, it takes 2 minutes and 9 seconds to generate the Job
>  
> The SQL is issued at 2018-08-07 09:07:41
> !issue sql original.png!
> However, the job is submitted at 2018-08-07 09:09:53, which is 2 minutes and 9 
> seconds later than the SQL issue time.
> !job start original.png!
>  
> After the optimization, it takes only 4 seconds to generate the Job
> The SQL is issued at 2018-08-07 09:20:15
> !issue sql optimized.png!
>  
> And the job is submitted at 2018-08-07 09:20:19, which is 4 seconds later 
> than the SQL issue time
> !job start optimized.png!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25038:


Assignee: Apache Spark

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Assignee: Apache Spark
>Priority: Major
> Attachments: issue sql optimized.png, issue sql original.png, job 
> start optimized.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table which is partitioned by date and hour. There are more than 
> 13 TB of data each hour and 185 TB per day. When we issue a very simple 
> SQL query, it takes a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> Before optimization, it takes 2 minutes and 9 seconds to generate the Job
>  
> The SQL is issued at 2018-08-07 09:07:41
> !issue sql original.png!
> However, the job is submitted at 2018-08-07 09:09:53, which is 2 minutes and 9 
> seconds later than the SQL issue time.
> !job start original.png!
>  
> After the optimization, it takes only 4 seconds to generate the Job
> The SQL is issued at 2018-08-07 09:20:15
> !issue sql optimized.png!
>  
> And the job is submitted at 2018-08-07 09:20:19, which is 4 seconds later 
> than the SQL issue time
> !job start optimized.png!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25037) plan.transformAllExpressions() doesn't transform expressions in nested SubqueryExpression plans

2018-08-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571047#comment-16571047
 ] 

Hyukjin Kwon commented on SPARK-25037:
--

If it turns out to be an actual issue after the discussion on the mailing list, 
let's reopen this JIRA.

> plan.transformAllExpressions() doesn't transform expressions in nested 
> SubqueryExpression plans
> ---
>
> Key: SPARK-25037
> URL: https://issues.apache.org/jira/browse/SPARK-25037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chris O'Hara
>Priority: Minor
>
> Given the following LogicalPlan:
> {code:java}
> scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
> (SELECT 1 foo)").queryExecution.logical
> plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Project [1 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [1 AS foo#28]
>          +- OneRowRelation
> {code}
> the following transformation should replace all instances of lit(1) with 
> lit(2):
> {code:java}
> scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value 
> = 2) }
> res0: plan.type =
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
> Instead, the nested SubqueryExpression plan is not transformed.
> The expected output is: 
> {code:java}
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [2 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25037) plan.transformAllExpressions() doesn't transform expressions in nested SubqueryExpression plans

2018-08-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25037.
--
Resolution: Invalid

> plan.transformAllExpressions() doesn't transform expressions in nested 
> SubqueryExpression plans
> ---
>
> Key: SPARK-25037
> URL: https://issues.apache.org/jira/browse/SPARK-25037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chris O'Hara
>Priority: Minor
>
> Given the following LogicalPlan:
> {code:java}
> scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
> (SELECT 1 foo)").queryExecution.logical
> plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Project [1 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [1 AS foo#28]
>          +- OneRowRelation
> {code}
> the following transformation should replace all instances of lit(1) with 
> lit(2):
> {code:java}
> scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value 
> = 2) }
> res0: plan.type =
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
> Instead, the nested SubqueryExpression plan is not transformed.
> The expected output is: 
> {code:java}
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [2 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25037) plan.transformAllExpressions() doesn't transform expressions in nested SubqueryExpression plans

2018-08-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571046#comment-16571046
 ] 

Hyukjin Kwon commented on SPARK-25037:
--

In that case, let's ask this on the mailing list 
(https://spark.apache.org/community.html) first and see whether this is expected 
behaviour or not. It sounds more like a question about the current state of things for now. 
If no one answers, I will take a look and confirm whether it's expected or not.

> plan.transformAllExpressions() doesn't transform expressions in nested 
> SubqueryExpression plans
> ---
>
> Key: SPARK-25037
> URL: https://issues.apache.org/jira/browse/SPARK-25037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chris O'Hara
>Priority: Minor
>
> Given the following LogicalPlan:
> {code:java}
> scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
> (SELECT 1 foo)").queryExecution.logical
> plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Project [1 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [1 AS foo#28]
>          +- OneRowRelation
> {code}
> the following transformation should replace all instances of lit(1) with 
> lit(2):
> {code:java}
> scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value 
> = 2) }
> res0: plan.type =
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
> Instead, the nested SubqueryExpression plan is not transformed.
> The expected output is: 
> {code:java}
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [2 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`

2018-08-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25018:


Assignee: DB Tsai

> Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
> ---
>
> Key: SPARK-25018
> URL: https://issues.apache.org/jira/browse/SPARK-25018
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Many projects such as OpenStack are using `Co-Authored-By: name <email>` in 
> commit messages to indicate people who worked on a 
> particular patch. 
> It's a convention for recognizing multiple authors, and can encourage people 
> to collaborate.
> Co-authored commits are visible on GitHub and can be included in the profile 
> contributions graph and the repository's statistics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`

2018-08-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25018.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21991
[https://github.com/apache/spark/pull/21991]

> Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
> ---
>
> Key: SPARK-25018
> URL: https://issues.apache.org/jira/browse/SPARK-25018
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Many projects such as OpenStack are using `Co-Authored-By: name <email>` in 
> commit messages to indicate people who worked on a 
> particular patch. 
> It's a convention for recognizing multiple authors, and can encourage people 
> to collaborate.
> Co-authored commits are visible on GitHub and can be included in the profile 
> contributions graph and the repository's statistics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress

2018-08-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24748.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21721
[https://github.com/apache/spark/pull/21721]

> Support for reporting custom metrics via Streaming Query Progress
> -
>
> Key: SPARK-24748
> URL: https://issues.apache.org/jira/browse/SPARK-24748
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Arun Mahadevan
>Assignee: Arun Mahadevan
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the Structured Streaming sources and sinks do not have a way to 
> report custom metrics. Providing an option to report custom metrics and 
> making them available via the Streaming Query progress can enable sources and sinks 
> to report custom progress information (e.g. the lag metrics for the Kafka source).
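
For context, a minimal sketch of where such metrics would surface, assuming they are exposed through the existing StreamingQueryProgress JSON (the rate source and console sink below are only for illustration):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("progress-demo").getOrCreate()

// A throwaway streaming query; any custom source/sink metrics added by this issue
// would be expected to appear alongside the existing fields in the progress JSON.
val stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
val query = stream.writeStream.format("console").start()

Thread.sleep(5000)
Option(query.lastProgress).foreach(p => println(p.json))
query.stop()
{code}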



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress

2018-08-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24748:


Assignee: Arun Mahadevan

> Support for reporting custom metrics via Streaming Query Progress
> -
>
> Key: SPARK-24748
> URL: https://issues.apache.org/jira/browse/SPARK-24748
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Arun Mahadevan
>Assignee: Arun Mahadevan
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the Structured Streaming sources and sinks do not have a way to 
> report custom metrics. Providing an option to report custom metrics and 
> making them available via the Streaming Query progress can enable sources and sinks 
> to report custom progress information (e.g. the lag metrics for the Kafka source).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25037) plan.transformAllExpressions() doesn't transform expressions in nested SubqueryExpression plans

2018-08-06 Thread Chris O'Hara (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571034#comment-16571034
 ] 

Chris O'Hara commented on SPARK-25037:
--

If this is expected behavior then please disregard! We weren't sure.

> plan.transformAllExpressions() doesn't transform expressions in nested 
> SubqueryExpression plans
> ---
>
> Key: SPARK-25037
> URL: https://issues.apache.org/jira/browse/SPARK-25037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chris O'Hara
>Priority: Minor
>
> Given the following LogicalPlan:
> {code:java}
> scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
> (SELECT 1 foo)").queryExecution.logical
> plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Project [1 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [1 AS foo#28]
>          +- OneRowRelation
> {code}
> the following transformation should replace all instances of lit(1) with 
> lit(2):
> {code:java}
> scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value 
> = 2) }
> res0: plan.type =
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
> Instead, the nested SubqueryExpression plan is not transformed.
> The expected output is: 
> {code:java}
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [2 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25037) plan.transformAllExpressions() doesn't transform expressions in nested SubqueryExpression plans

2018-08-06 Thread Chris O'Hara (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571032#comment-16571032
 ] 

Chris O'Hara edited comment on SPARK-25037 at 8/7/18 2:17 AM:
--

You're right – we ran into the issue with a custom transformation rule. I see 
that catalyst uses [a dedicated 
rule|https://github.com/apache/spark/blob/136588e95f69923a04458abe4862d336e5244c84/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L164]
 to optimize subquery plans, so maybe transformAllExpressions() was never 
supposed to recurse into nested plans.


was (Author: chriso):
You're right – we ran into the issue with a custom transformation rule. I see 
that catalyst uses [a dedicated 
rule|https://github.com/apache/spark/blob/136588e95f69923a04458abe4862d336e5244c84/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L164]
 to optimize subquery plans, so maybe `transformAllExpressions()` was never 
supposed to recurse into nested plans.

> plan.transformAllExpressions() doesn't transform expressions in nested 
> SubqueryExpression plans
> ---
>
> Key: SPARK-25037
> URL: https://issues.apache.org/jira/browse/SPARK-25037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chris O'Hara
>Priority: Minor
>
> Given the following LogicalPlan:
> {code:java}
> scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
> (SELECT 1 foo)").queryExecution.logical
> plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Project [1 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [1 AS foo#28]
>          +- OneRowRelation
> {code}
> the following transformation should replace all instances of lit(1) with 
> lit(2):
> {code:java}
> scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value 
> = 2) }
> res0: plan.type =
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
> Instead, the nested SubqueryExpression plan is not transformed.
> The expected output is: 
> {code:java}
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [2 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25037) plan.transformAllExpressions() doesn't transform expressions in nested SubqueryExpression plans

2018-08-06 Thread Chris O'Hara (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571032#comment-16571032
 ] 

Chris O'Hara commented on SPARK-25037:
--

You're right – we ran into the issue with a custom transformation rule. I see 
that catalyst uses [a dedicated 
rule|https://github.com/apache/spark/blob/136588e95f69923a04458abe4862d336e5244c84/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L164]
 to optimize subquery plans, so maybe `transformAllExpressions()` was never 
supposed to recurse into nested plans.
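
If an explicit recursion is needed, one possible workaround is to handle the subquery expressions by hand. This is only a sketch against catalyst internals (which, as noted above, are private): it assumes Spark 2.3.x, where SubqueryExpression exposes plan and withNewPlan, and plan is the value from the reproduction above.

{code:java}
import org.apache.spark.sql.catalyst.expressions.{Literal, SubqueryExpression}

// Apply the literal rewrite to the outer plan and, explicitly, to every nested
// subquery plan as well.
val rewritten = plan transformAllExpressions {
  case s: SubqueryExpression =>
    s.withNewPlan(s.plan.transformAllExpressions {
      case l @ Literal(1, _) => l.copy(value = 2)
    })
  case l @ Literal(1, _) => l.copy(value = 2)
}
{code}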

> plan.transformAllExpressions() doesn't transform expressions in nested 
> SubqueryExpression plans
> ---
>
> Key: SPARK-25037
> URL: https://issues.apache.org/jira/browse/SPARK-25037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chris O'Hara
>Priority: Minor
>
> Given the following LogicalPlan:
> {code:java}
> scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
> (SELECT 1 foo)").queryExecution.logical
> plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Project [1 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [1 AS foo#28]
>          +- OneRowRelation
> {code}
> the following transformation should replace all instances of lit(1) with 
> lit(2):
> {code:java}
> scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value 
> = 2) }
> res0: plan.type =
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
> Instead, the nested SubqueryExpression plan is not transformed.
> The expected output is: 
> {code:java}
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [2 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24637) Add metrics regarding state and watermark to dropwizard metrics

2018-08-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24637:


Assignee: Jungtaek Lim

> Add metrics regarding state and watermark to dropwizard metrics
> ---
>
> Key: SPARK-24637
> URL: https://issues.apache.org/jira/browse/SPARK-24637
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.0
>
>
> Though Spark provides an option to enable streaming metrics in Dropwizard, it 
> only exposes three metrics: "inputRate-total", "processingRate-total", and 
> "latency", which is not enough information to operate on.
> Since Spark already exposes other metrics that are valuable from an operations 
> perspective, we could pick some and expose them in Dropwizard as 
> well.
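
As a reminder of the existing switch the description refers to (a sketch; it assumes an active SparkSession named spark and a configured Dropwizard metrics sink):

{code:java}
// Existing flag that publishes Structured Streaming metrics to the Dropwizard
// MetricsSystem; the state/watermark metrics added by this issue would be
// reported through the same channel.
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")
{code}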



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24637) Add metrics regarding state and watermark to dropwizard metrics

2018-08-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24637.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21622
[https://github.com/apache/spark/pull/21622]

> Add metrics regarding state and watermark to dropwizard metrics
> ---
>
> Key: SPARK-24637
> URL: https://issues.apache.org/jira/browse/SPARK-24637
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.0
>
>
> Though Spark provides an option to enable streaming metrics in Dropwizard, it 
> only exposes three metrics: "inputRate-total", "processingRate-total", and 
> "latency", which is not enough information to operate on.
> Since Spark already exposes other metrics that are valuable from an operations 
> perspective, we could pick some and expose them in Dropwizard as 
> well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571027#comment-16571027
 ] 

Jason Guo commented on SPARK-25038:
---

[~hyukjin.kwon] Gotcha

I will create a PR for this today

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Major
> Attachments: issue sql optimized.png, issue sql original.png, job 
> start optimized.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table which is partitioned by date and hour. There are more than 
> 13 TB of data each hour and 185 TB per day. When we issue a very simple 
> SQL query, it takes a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> Before optimization, it takes 2 minutes and 9 seconds to generate the Job
>  
> The SQL is issued at 2018-08-07 09:07:41
> !issue sql original.png!
> However, the job is submitted at 2018-08-07 09:09:53, which is 2 minutes and 9 
> seconds later than the SQL issue time.
> !job start original.png!
>  
> After the optimization, it takes only 4 seconds to generate the Job
> The SQL is issued at 2018-08-07 09:20:15
> !issue sql optimized.png!
>  
> And the job is submitted at 2018-08-07 09:20:19, which is 4 seconds later 
> than the SQL issue time
> !job start optimized.png!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571026#comment-16571026
 ] 

Hyukjin Kwon commented on SPARK-25038:
--

(please avoid setting Critical+, which is usually reserved for committers)

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Major
> Attachments: issue sql optimized.png, issue sql original.png, job 
> start optimized.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table which is partitioned by date and hour. There are more than 
> 13 TB of data each hour and 185 TB per day. When we issue a very simple 
> SQL query, it takes a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> Before optimization, it takes 2 minutes and 9 seconds to generate the Job
>  
> The SQL is issued at 2018-08-07 09:07:41
> !issue sql original.png!
> However, the job is submitted at 2018-08-07 09:09:53, which is 2 minutes and 9 
> seconds later than the SQL issue time.
> !job start original.png!
>  
> After the optimization, it takes only 4 seconds to generate the Job
> The SQL is issued at 2018-08-07 09:20:15
> !issue sql optimized.png!
>  
> And the job is submitted at 2018-08-07 09:20:19, which is 4 seconds later 
> than the SQL issue time
> !job start optimized.png!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25038:
-
Priority: Major  (was: Critical)

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Major
> Attachments: issue sql optimized.png, issue sql original.png, job 
> start optimized.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table which is partitioned by date and hour. There are more than 
> 13 TB of data each hour and 185 TB per day. When we issue a very simple 
> SQL query, it takes a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> Before optimization, it takes 2 minutes and 9 seconds to generate the Job
>  
> The SQL is issued at 2018-08-07 09:07:41
> !issue sql original.png!
> However, the job is submitted at 2018-08-07 09:09:53, which is 2 minutes and 9 
> seconds later than the SQL issue time.
> !job start original.png!
>  
> After the optimization, it takes only 4 seconds to generate the Job
> The SQL is issued at 2018-08-07 09:20:15
> !issue sql optimized.png!
>  
> And the job is submitted at 2018-08-07 09:20:19, which is 4 seconds later 
> than the SQL issue time
> !job start optimized.png!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25037) plan.transformAllExpressions() doesn't transform expressions in nested SubqueryExpression plans

2018-08-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571024#comment-16571024
 ] 

Hyukjin Kwon commented on SPARK-25037:
--

Does this cause an actual issue? The catalyst modules are meant to be private 
anyway.

> plan.transformAllExpressions() doesn't transform expressions in nested 
> SubqueryExpression plans
> ---
>
> Key: SPARK-25037
> URL: https://issues.apache.org/jira/browse/SPARK-25037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chris O'Hara
>Priority: Minor
>
> Given the following LogicalPlan:
> {code:java}
> scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
> (SELECT 1 foo)").queryExecution.logical
> plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Project [1 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [1 AS foo#28]
>          +- OneRowRelation
> {code}
> the following transformation should replace all instances of lit(1) with 
> lit(2):
> {code:java}
> scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value 
> = 2) }
> res0: plan.type =
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
> Instead, the nested SubqueryExpression plan is not transformed.
> The expected output is: 
> {code:java}
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [2 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null

2018-08-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25028.
--
Resolution: Cannot Reproduce

Please reopen if the reproduction steps are updated, or if anyone is able to 
explain / reproduce it.

> AnalyzePartitionCommand failed with NPE if value is null
> 
>
> Key: SPARK-25028
> URL: https://issues.apache.org/jira/browse/SPARK-25028
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
>
> On line 143: val partitionColumnValues = 
> partitionColumns.indices.map(r.get(_).toString)
> If the value is NULL, the code will fail with an NPE.
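
A null-safe sketch of the quoted expression (illustrative only: r and partitionColumns come from AnalyzePartitionCommand, and representing a null partition value as the string "NULL" is an assumption, not the actual fix):

{code:java}
// Guard each partition value with Option(...) so a null value cannot trigger
// an NPE when calling toString; the "NULL" placeholder is illustrative.
val partitionColumnValues = partitionColumns.indices.map { i =>
  Option(r.get(i)).map(_.toString).getOrElse("NULL")
}
{code}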



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25026) Binary releases should contain some copy of compiled external integration modules

2018-08-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571018#comment-16571018
 ] 

Hyukjin Kwon commented on SPARK-25026:
--

I was wondering about this too. +1 for somehow including this.

> Binary releases should contain some copy of compiled external integration 
> modules
> -
>
> Key: SPARK-25026
> URL: https://issues.apache.org/jira/browse/SPARK-25026
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25013) JDBC urls with jdbc:mariadb don't work as expected

2018-08-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571017#comment-16571017
 ] 

Hyukjin Kwon commented on SPARK-25013:
--

Isn't it possible to let someone implement this and leave this dialect as a 
third-party library for now?
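
For what it's worth, such a third-party dialect could be a small amount of code. A sketch using the public JdbcDialect API (the object name is made up; backtick quoting mirrors what the description says the MySQL dialect does):

{code:java}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Handle "jdbc:mariadb:" URLs and quote identifiers with backticks instead of
// the default double quotes, so generated queries like SELECT `i`,`ip` FROM tmp
// keep working against MariaDB.
object MariaDbDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:mariadb")

  override def quoteIdentifier(colName: String): String =
    s"`$colName`"
}

JdbcDialects.registerDialect(MariaDbDialect)
{code}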

> JDBC urls with jdbc:mariadb don't work as expected
> --
>
> Key: SPARK-25013
> URL: https://issues.apache.org/jira/browse/SPARK-25013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dieter Vekeman
>Priority: Minor
>
> When using the MariaDB JDBC driver, the JDBC connection url should be  
> {code:java}
> jdbc:mariadb://localhost:3306/DB?user=someuser&password=somepassword
> {code}
> https://mariadb.com/kb/en/library/about-mariadb-connector-j/
> However this does not work well in Spark (see below)
> *Workaround*
> The MariaDB driver also supports using mysql which does work.
> The problem seems to have been described and identified in:
> https://jira.mariadb.org/browse/CONJ-421
> All works well with Spark when using a connection string with {{"jdbc:mysql:..."}}, 
> but not with {{"jdbc:mariadb:..."}}, because the MySQL dialect is then not used.
> When it is not used, the default quote is {{"}}, not {{`}}.
> So an internal query generated by Spark like {{SELECT `i`,`ip` FROM tmp}} 
> will then be executed as {{SELECT "i","ip" FROM tmp}} with the dataType 
> previously retrieved, causing the exception.
> The author of the comment says
> {quote}I'll make a pull request to spark so "jdbc:mariadb:" connection string 
> can be handle{quote}
> Did the pull request get lost or should a new one be made?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25012) dataframe creation results in matcherror

2018-08-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25012.
--
Resolution: Duplicate

> dataframe creation results in matcherror
> 
>
> Key: SPARK-25012
> URL: https://issues.apache.org/jira/browse/SPARK-25012
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.1
> Environment: spark 2.3.1
> mac
> scala 2.11.12
>  
>Reporter: uwe
>Priority: Major
>
> hi,
>  
> running the attached code results in a 
>  
> {code:java}
> scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp)
> {code}
>  # I do think this is wrong (at least I do not see the issue in my code)
>  # the error is there in 90% of the cases (it sometimes passes). That makes me 
> think something weird is going on
>  
>  
> {code:java}
> package misc
>
> import java.sql.Timestamp
> import java.time.LocalDateTime
> import java.time.format.DateTimeFormatter
>
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.sources._
> import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
> import org.apache.spark.sql.{Row, SQLContext, SparkSession}
>
> case class LogRecord(application: String, dateTime: Timestamp, component: String, level: String, message: String)
>
> class LogRelation(val sqlContext: SQLContext, val path: String) extends BaseRelation with PrunedFilteredScan {
>   override def schema: StructType = StructType(Seq(
>     StructField("application", StringType, false),
>     StructField("dateTime", TimestampType, false),
>     StructField("component", StringType, false),
>     StructField("level", StringType, false),
>     StructField("message", StringType, false)))
>
>   override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
>     val str = "2017-02-09T00:09:27"
>     val ts = Timestamp.valueOf(LocalDateTime.parse(str, DateTimeFormatter.ISO_LOCAL_DATE_TIME))
>     val data = List(Row("app", ts, "comp", "level", "mess"), Row("app", ts, "comp", "level", "mess"))
>     sqlContext.sparkContext.parallelize(data)
>   }
> }
>
> class LogDataSource extends DataSourceRegister with RelationProvider {
>   override def shortName(): String = "log"
>   override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation =
>     new LogRelation(sqlContext, parameters("path"))
> }
>
> object f0 extends App {
>   lazy val spark: SparkSession = SparkSession.builder().master("local").appName("spark session").getOrCreate()
>   val df = spark.read.format("log").load("hdfs:///logs")
>   df.show()
> }
> {code}
>  
> results in the following stacktrace
>  
> {noformat}
> 11:20:06 [task-result-getter-0] ERROR o.a.spark.scheduler.TaskSetManager - 
> Task 0 in stage 0.0 failed 1 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): 
> scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp)
>  at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
>  at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
>  at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>  at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
>  at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60)
>  at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  
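One possible reading of the failure, not a confirmed diagnosis: with {{PrunedFilteredScan}}, {{buildScan}} receives the pruned {{requiredColumns}} and must return rows containing exactly those columns, in that order, while the code above always returns all five fields. When Spark prunes or reorders columns, the converters then see a {{Timestamp}} where a {{String}} is expected, which matches the {{MatchError}} in {{StringConverter}}. A minimal sketch of a {{buildScan}} that honors {{requiredColumns}} (names and imports reuse the snippet above):

{code:java}
override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
  val ts = Timestamp.valueOf(LocalDateTime.parse("2017-02-09T00:09:27", DateTimeFormatter.ISO_LOCAL_DATE_TIME))
  // One full record keyed by column name; project it down to the requested columns.
  val full = Map[String, Any](
    "application" -> "app", "dateTime" -> ts, "component" -> "comp",
    "level" -> "level", "message" -> "mess")
  val row = Row.fromSeq(requiredColumns.map(full))
  sqlContext.sparkContext.parallelize(List(row, row))
}
{code}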

[jira] [Commented] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null

2018-08-06 Thread Achuth Narayan Rajagopal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571005#comment-16571005
 ] 

Achuth Narayan Rajagopal commented on SPARK-25028:
--

[~igreenfi], Can you share a test case to reproduce this? 

> AnalyzePartitionCommand failed with NPE if value is null
> 
>
> Key: SPARK-25028
> URL: https://issues.apache.org/jira/browse/SPARK-25028
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
>
> On line 143: {{val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString)}}
> If the value is NULL, the code fails with an NPE.
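A minimal sketch of a null-tolerant variant (identifier names follow the line quoted above; whether {{"NULL"}} or some other placeholder is the right replacement is an open question here, and the actual fix in Spark may differ):

{code:java}
val partitionColumnValues = partitionColumns.indices.map { i =>
  // Guard against null partition values before calling toString.
  if (r.isNullAt(i)) "NULL" else r.get(i).toString
}
{code}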



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-25038:
--
Description: 
When Spark SQL reads a large amount of data, it takes a long time (more than 10 
minutes) to generate the physical plan and then the ActiveJob.

 

Example:

There is a table partitioned by date and hour, with more than 13 TB of data each 
hour and 185 TB per day. When we issue a very simple SQL query, it takes a long 
time to generate the ActiveJob.

 

The SQL statement is
{code:java}
select count(device_id) from test_tbl where date=20180731 and hour='21';
{code}
 

Before optimization, it takes 2 minutes and 9 seconds to generate the Job

 

The SQL is issued at 2018-08-07 09:07:41

!issue sql original.png!

However, the job is submitted at 2018-08-07 09:09:53, which is 2 minutes and 9 
seconds later than the SQL issue time.

!job start original.png!

 

After the optimization, it takes only 4 seconds to generate the Job

The SQL is issued at 2018-08-07 09:20:15

!issue sql optimized.png!

 

And the job is submitted at 2018-08-07 09:20:19, which is 4 seconds later than 
the SQL issue time

!job start optimized.png!

 

 

  was:
When Spark SQL reads a large amount of data, it takes a long time (more than 10 
minutes) to generate the physical plan and then the ActiveJob.

 

Example:

There is a table partitioned by date and hour, with more than 13 TB of data each 
hour and 185 TB per day. When we issue a very simple SQL query, it takes a long 
time to generate the ActiveJob.

 

The SQL statement is
{code:java}
select count(device_id) from test_tbl where date=20180731 and hour='21';
{code}
 

The SQL is issued at 2018-08-07 08:43:48

!issue sql original.png!

However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 17 
seconds later than the SQL issue time.

  !job start original.png!

 

 


> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: issue sql optimized.png, issue sql original.png, job 
> start optimized.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table partitioned by date and hour, with more than 13 TB of data 
> each hour and 185 TB per day. When we issue a very simple SQL query, it takes 
> a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> Before optimization, it takes 2 minutes and 9 seconds to generate the Job
>  
> The SQL is issued at 2018-08-07 09:07:41
> !issue sql original.png!
> However, the job is submitted at 2018-08-07 09:09:53, which is 2 minutes and 9 
> seconds later than the SQL issue time.
> !job start original.png!
>  
> After the optimization, it takes only 4 seconds to generate the Job
> The SQL is issued at 2018-08-07 09:20:15
> !issue sql optimized.png!
>  
> And the job is submitted at 2018-08-07 09:20:19, which is 4 seconds later 
> than the SQL issue time
> !job start optimized.png!
>  
>  
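The description does not say what the optimization is. As a hedged aside (an assumption about where plan-generation time usually goes with tables this large, not something stated in the ticket): when the delay is dominated by driver-side partition and file listing, these existing settings are worth checking before any code change. {{spark}} below is the usual {{SparkSession}}.

{code:java}
// Prune partitions in the Hive metastore so only date=20180731/hour='21' is listed.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")
// List leaf files on executors once the number of paths passes this threshold.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
{code}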



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-25038:
--
Attachment: (was: job start original.png)

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: issue sql optimized.png, issue sql original.png, job 
> start optimized.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table partitioned by date and hour, with more than 13 TB of data 
> each hour and 185 TB per day. When we issue a very simple SQL query, it takes 
> a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> The SQL is issued at 2018-08-07 08:43:48
> !issue sql original.png!
> However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 
> 17 seconds later than the SQL issue time.
>   !job start original.png!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-25038:
--
Attachment: (was: issue sql original.png)

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: issue sql optimized.png, issue sql original.png, job 
> start optimized.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table partitioned by date and hour, with more than 13 TB of data 
> each hour and 185 TB per day. When we issue a very simple SQL query, it takes 
> a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> The SQL is issued at 2018-08-07 08:43:48
> !issue sql original.png!
> However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 
> 17 seconds later than the SQL issue time.
>   !job start original.png!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-25038:
--
Attachment: job start original.png
job start optimized.png
issue sql original.png
issue sql optimized.png

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: issue sql optimized.png, issue sql original.png, job 
> start optimized.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table partitioned by date and hour, with more than 13 TB of data 
> each hour and 185 TB per day. When we issue a very simple SQL query, it takes 
> a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> The SQL is issued at 2018-08-07 08:43:48
> !issue sql original.png!
> However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 
> 17 seconds later than the SQL issue time.
>   !job start original.png!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25030) SparkSubmit.doSubmit will not return result if the mainClass submitted creates a Timer()

2018-08-06 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570995#comment-16570995
 ] 

Saisai Shao commented on SPARK-25030:
-

I would like to see more about this issue.

> SparkSubmit.doSubmit will not return result if the mainClass submitted 
> creates a Timer()
> 
>
> Key: SPARK-25030
> URL: https://issues.apache.org/jira/browse/SPARK-25030
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Jiang Xingbo
>Priority: Major
>
> Creating a Timer() in the mainClass submitted to SparkSubmit makes it unable to 
> fetch the result; it is very easy to reproduce the issue.
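For context, a hedged guess at the mechanism (not confirmed by the ticket): {{new java.util.Timer()}} starts a non-daemon worker thread, so the launched JVM stays alive after {{main}} returns, and a caller waiting on that exit never sees a result. A minimal main class of that shape:

{code:java}
object TimerMain {
  def main(args: Array[String]): Unit = {
    // java.util.Timer() spawns a non-daemon worker thread by default,
    // which keeps the JVM alive after main() returns.
    new java.util.Timer()
    println("main finished")
  }
}
{code}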



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-25038:
--
Description: 
When Spark SQL reads a large amount of data, it takes a long time (more than 10 
minutes) to generate the physical plan and then the ActiveJob.

 

Example:

There is a table partitioned by date and hour, with more than 13 TB of data each 
hour and 185 TB per day. When we issue a very simple SQL query, it takes a long 
time to generate the ActiveJob.

 

The SQL statement is
{code:java}
select count(device_id) from test_tbl where date=20180731 and hour='21';
{code}
 

The SQL is issued at 2018-08-07 08:43:48

!issue sql original.png!

However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 17 
seconds later than the SQL issue time.

  !job start original.png!

 

 

  was:
When Spark SQL reads a large amount of data, it takes a long time (more than 10 
minutes) to generate the physical plan and then the ActiveJob.

 

Example:

There is a table partitioned by date and hour, with more than 13 TB of data each 
hour and 185 TB per day. When we issue a very simple SQL query, it takes a long 
time to generate the ActiveJob.

 

The SQL statement is
{code:java}
select count(device_id) from test_tbl where date=20180731 and hour='21';
{code}
 

The SQL is issued at 2018-08-07 08:43:48

!image-2018-08-07-08-52-00-558.png!

However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 17 
seconds later than the SQL issue time.

  !image-2018-08-07-08-52-09-648.png!

 

 

 


> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: issue sql original.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table partitioned by date and hour, with more than 13 TB of data 
> each hour and 185 TB per day. When we issue a very simple SQL query, it takes 
> a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> The SQL is issued at 2018-08-07 08:43:48
> !issue sql original.png!
> However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 
> 17 seconds later than the SQL issue time.
>   !job start original.png!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-25038:
--
Attachment: issue sql original.png

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: issue sql original.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table partitioned by date and hour, with more than 13 TB of data 
> each hour and 185 TB per day. When we issue a very simple SQL query, it takes 
> a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> The SQL is issued at 2018-08-07 08:43:48
> !image-2018-08-07-08-52-00-558.png!
> However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 
> 17 seconds later than the SQL issue time.
>   !image-2018-08-07-08-52-09-648.png!
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-25038:
--
Attachment: job start original.png

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: issue sql original.png, job start original.png
>
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table partitioned by date and hour, with more than 13 TB of data 
> each hour and 185 TB per day. When we issue a very simple SQL query, it takes 
> a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> The SQL is issued at 2018-08-07 08:43:48
> !image-2018-08-07-08-52-00-558.png!
> However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 
> 17 seconds later than the SQL issue time.
>   !image-2018-08-07-08-52-09-648.png!
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-25038:
--
Attachment: start.png
issue.png

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table partitioned by date and hour, with more than 13 TB of data 
> each hour and 185 TB per day. When we issue a very simple SQL query, it takes 
> a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> The SQL is issued at 2018-08-07 08:43:48
> !image-2018-08-07-08-52-00-558.png!
> However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 
> 17 seconds later than the SQL issue time.
>   !image-2018-08-07-08-52-09-648.png!
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-25038:
--
Attachment: (was: start.png)

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table partitioned by date and hour, with more than 13 TB of data 
> each hour and 185 TB per day. When we issue a very simple SQL query, it takes 
> a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> The SQL is issued at 2018-08-07 08:43:48
> !image-2018-08-07-08-52-00-558.png!
> However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 
> 17 seconds later than the SQL issue time.
>   !image-2018-08-07-08-52-09-648.png!
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-25038:
--
Attachment: (was: issue.png)

> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table partitioned by date and hour, with more than 13 TB of data 
> each hour and 185 TB per day. When we issue a very simple SQL query, it takes 
> a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> The SQL is issued at 2018-08-07 08:43:48
> !image-2018-08-07-08-52-00-558.png!
> However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 
> 17 seconds later than the SQL issue time.
>   !image-2018-08-07-08-52-09-648.png!
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-25038:
--
Description: 
When Spark SQL reads a large amount of data, it takes a long time (more than 10 
minutes) to generate the physical plan and then the ActiveJob.

 

Example:

There is a table partitioned by date and hour, with more than 13 TB of data each 
hour and 185 TB per day. When we issue a very simple SQL query, it takes a long 
time to generate the ActiveJob.

 

The SQL statement is
{code:java}
select count(device_id) from test_tbl where date=20180731 and hour='21';
{code}
 

The SQL is issued at 2018-08-07 08:43:48

!image-2018-08-07-08-52-00-558.png!

However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 17 
seconds later than the SQL issue time.

  !image-2018-08-07-08-52-09-648.png!

 

 

 

  was:
When Spark SQL reads a large amount of data, it takes a long time (more than 10 
minutes) to generate the physical plan and then the ActiveJob.

 

Example:

There is a table partitioned by date and hour, with more than 13 TB of data each 
hour and 185 TB per day. When we issue a very simple SQL query, it takes a long 
time to generate the ActiveJob.

 

The SQL statement is
{code:java}
select count(device_id) from test_tbl where date=20180731 and hour='21';
{code}
 

The SQL is issued at 2018-08-07 08:43:48

!image-2018-08-07-08-48-28-753.png!

However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 17 
seconds later than the SQL issue time.

!image-2018-08-07-08-47-06-321.png!  

 

 

 

 


> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table partitioned by date and hour, with more than 13 TB of data 
> each hour and 185 TB per day. When we issue a very simple SQL query, it takes 
> a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> The SQL is issued at 2018-08-07 08:43:48
> !image-2018-08-07-08-52-00-558.png!
> However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 
> 17 seconds later than the SQL issue time.
>   !image-2018-08-07-08-52-09-648.png!
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-25038:
--
Description: 
When Spark SQL reads a large amount of data, it takes a long time (more than 10 
minutes) to generate the physical plan and then the ActiveJob.

 

Example:

There is a table partitioned by date and hour, with more than 13 TB of data each 
hour and 185 TB per day. When we issue a very simple SQL query, it takes a long 
time to generate the ActiveJob.

 

The SQL statement is
{code:java}
select count(device_id) from test_tbl where date=20180731 and hour='21';
{code}
 

The SQL is issued at 2018-08-07 08:43:48

!image-2018-08-07-08-48-28-753.png!

However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 17 
seconds later than the SQL issue time.

!image-2018-08-07-08-47-06-321.png!  

 

 

 

 

  was:
When Spark SQL reads a large amount of data, it takes a long time (more than 10 
minutes) to generate the physical plan and then the ActiveJob.

 

Example:

There is a table partitioned by date and hour, with more than 13 TB of data each 
hour and 185 TB per day. When we issue a very simple SQL query, it takes a long 
time to generate the ActiveJob.

 

The SQL statement is
{code:java}
select count(device_id) from test_tbl where date=20180731 and hour='21';
{code}
 

The SQL is issued at 2018-08-05 18:33:21

!image-2018-08-07-08-38-01-984.png!

However, the job is submitted at 2018-08-05 18:34:45, which is 1 minute and 24 
seconds later than the SQL issue time.

 

 

 


> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -
>
> Key: SPARK-25038
> URL: https://issues.apache.org/jira/browse/SPARK-25038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
>
> When Spark SQL reads a large amount of data, it takes a long time (more than 10 
> minutes) to generate the physical plan and then the ActiveJob.
>  
> Example:
> There is a table partitioned by date and hour, with more than 13 TB of data 
> each hour and 185 TB per day. When we issue a very simple SQL query, it takes 
> a long time to generate the ActiveJob.
>  
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>  
> The SQL is issued at 2018-08-07 08:43:48
> !image-2018-08-07-08-48-28-753.png!
> However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 
> 17 seconds later than the SQL issue time.
> !image-2018-08-07-08-47-06-321.png!  
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25038) Accelerate Spark Plan generation when Spark SQL read large amount of data

2018-08-06 Thread Jason Guo (JIRA)
Jason Guo created SPARK-25038:
-

 Summary: Accelerate Spark Plan generation when Spark SQL read 
large amount of data
 Key: SPARK-25038
 URL: https://issues.apache.org/jira/browse/SPARK-25038
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Jason Guo


When Spark SQL reads a large amount of data, it takes a long time (more than 10 
minutes) to generate the physical plan and then the ActiveJob.

 

Example:

There is a table partitioned by date and hour, with more than 13 TB of data each 
hour and 185 TB per day. When we issue a very simple SQL query, it takes a long 
time to generate the ActiveJob.

 

The SQL statement is
{code:java}
select count(device_id) from test_tbl where date=20180731 and hour='21';
{code}
 

The SQL is issued at 2018-08-05 18:33:21

!image-2018-08-07-08-38-01-984.png!

However, the job is submitted at 2018-08-05 18:34:45, which is 1 minute and 24 
seconds later than the SQL issue time.

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24906) Adaptively set split size for columnar file to ensure the task read data size fit expectation

2018-08-06 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-24906:
--
Summary: Adaptively set split size for columnar file to ensure the task 
read data size fit expectation  (was: Enlarge split size for columnar file to 
ensure the task read enough data)

> Adaptively set split size for columnar file to ensure the task read data size 
> fit expectation
> -
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, 
> image-2018-07-24-20-30-24-552.png
>
>
> For columnar files, when Spark SQL reads the table, each split will be 
> 128 MB by default since spark.sql.files.maxPartitionBytes defaults to 
> 128 MB. Even when the user sets it to a larger value, such as 512 MB, a task may 
> read only a few MB or even hundreds of KB, because the table (Parquet) may 
> consist of dozens of columns while the SQL query needs only a few of them, and 
> Spark prunes the unnecessary columns.
>  
> In this case, Spark's DataSourceScanExec can enlarge maxPartitionBytes 
> adaptively. 
> For example, suppose there are 40 columns: 20 integer and 20 long. 
> When a query reads one integer column and one long column, 
> maxPartitionBytes should be 20 times larger: (20*4 + 20*8) / (4+8) = 20. 
>  
> With this optimization, the number of tasks will be smaller and the job will 
> run faster. More importantly, for a very large cluster (more than 10 thousand 
> nodes), it will relieve the ResourceManager's scheduling pressure.
>  
> Here is the test
>  
> The table named test2 has more than 40 columns and there is more than 5 TB of 
> data each hour.
> When we issue a very simple query: 
>  
> {code:java}
> select count(device_id) from test2 where date=20180708 and hour='23'{code}
>  
> There are 72176 tasks and the duration of the job is 4.8 minutes
> !image-2018-07-24-20-26-32-441.png!
>  
> Most tasks last less than 1 second and read less than 1.5 MB of data
> !image-2018-07-24-20-28-06-269.png!
>  
> After the optimization, there are only 1615 tasks and the job lasts only 30 
> seconds. It is almost 10 times faster.
> !image-2018-07-24-20-29-24-797.png!
>  
> The median amount of data read per task is 44.2 MB. 
> !image-2018-07-24-20-30-24-552.png!
>  
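A rough sketch of the proposed scaling, assuming per-column width estimates from {{DataType.defaultSize}} (this is only an illustration of the idea, not the actual DataSourceScanExec change):

{code:java}
import org.apache.spark.sql.types.StructType

// Scale the configured maxPartitionBytes by (total row width) / (width of the
// columns the query actually reads), so each task still reads a sensible amount.
def adaptiveMaxPartitionBytes(schema: StructType, requiredColumns: Seq[String], base: Long): Long = {
  val totalWidth  = schema.map(_.dataType.defaultSize).sum.max(1)
  val neededWidth = schema.filter(f => requiredColumns.contains(f.name)).map(_.dataType.defaultSize).sum.max(1)
  base * totalWidth / neededWidth
}

// With the 40-column table above (20 ints + 20 longs) and a query touching one of each:
// (20*4 + 20*8) / (4 + 8) = 20, so a 128 MB base becomes 2560 MB per split.
{code}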



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25030) SparkSubmit.doSubmit will not return result if the mainClass submitted creates a Timer()

2018-08-06 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25030:
--
Summary: SparkSubmit.doSubmit will not return result if the mainClass 
submitted creates a Timer()  (was: SparkSubmit will not return result if the 
mainClass submitted creates a Timer())

> SparkSubmit.doSubmit will not return result if the mainClass submitted 
> creates a Timer()
> 
>
> Key: SPARK-25030
> URL: https://issues.apache.org/jira/browse/SPARK-25030
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Jiang Xingbo
>Priority: Major
>
> Creating a Timer() in the mainClass submitted to SparkSubmit makes it unable to 
> fetch the result; it is very easy to reproduce the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25030) SparkSubmit will not return result if the mainClass submitted creates a Timer()

2018-08-06 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570951#comment-16570951
 ] 

Xiangrui Meng commented on SPARK-25030:
---

[~jiangxb1987] Could you create a PR to demonstrate the test failures?

[~vanzin] [~jerryshao] Do you know who is the best person to investigate this 
issue?

> SparkSubmit will not return result if the mainClass submitted creates a 
> Timer()
> ---
>
> Key: SPARK-25030
> URL: https://issues.apache.org/jira/browse/SPARK-25030
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Jiang Xingbo
>Priority: Major
>
> Creating a Timer() in the mainClass submitted to SparkSubmit makes it unable to 
> fetch the result; it is very easy to reproduce the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25037) plan.transformAllExpressions() doesn't transform expressions in nested SubqueryExpression plans

2018-08-06 Thread Chris O'Hara (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris O'Hara updated SPARK-25037:
-
Description: 
Given the following LogicalPlan:
{code:java}
scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
(SELECT 1 foo)").queryExecution.logical
plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Project [1 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [1 AS foo#28]
         +- OneRowRelation
{code}
The following transformation should replace all instances of lit(1) with lit(2):
{code:java}
scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value = 
2) }
res0: plan.type =
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
Instead, the nested SubqueryExpression plan is not transformed.

The expected output is: 
{code:java}
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [2 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
 

 

  was:
Given the following LogicalPlan, containing a SubqueryAlias and 
SubqueryExpression:
{code:java}
scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
(SELECT 1 foo)").queryExecution.logical
plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Project [1 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [1 AS foo#28]
         +- OneRowRelation
{code}
The following transformation should replace all instances of lit(1) with lit(2):
{code:java}
scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value = 
2) }
res0: plan.type =
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
Instead, the nested SubqueryExpression plan is not transformed.

The expected output is: 
{code:java}
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [2 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
 

 


> plan.transformAllExpressions() doesn't transform expressions in nested 
> SubqueryExpression plans
> ---
>
> Key: SPARK-25037
> URL: https://issues.apache.org/jira/browse/SPARK-25037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chris O'Hara
>Priority: Minor
>
> Given the following LogicalPlan:
> {code:java}
> scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
> (SELECT 1 foo)").queryExecution.logical
> plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Project [1 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [1 AS foo#28]
>          +- OneRowRelation
> {code}
> The following transformation should replace all instances of lit(1) with 
> lit(2):
> {code:java}
> scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value 
> = 2) }
> res0: plan.type =
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
> Instead, the nested SubqueryExpression plan is not transformed.
> The expected output is: 
> {code:java}
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [2 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
>  
>  
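A hedged workaround sketch (it assumes catalyst's {{SubqueryExpression.plan}} / {{withNewPlan}} and the public {{transformAllExpressions}}; it is not necessarily how Spark itself would fix this): recurse into subquery plans explicitly.

{code:java}
import org.apache.spark.sql.catalyst.expressions.{Expression, SubqueryExpression}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Apply `rule` to the plan's own expressions, and additionally rewrite every
// SubqueryExpression so its nested plan is transformed the same way.
def deepTransformAllExpressions(plan: LogicalPlan)(rule: PartialFunction[Expression, Expression]): LogicalPlan = {
  val intoSubqueries: PartialFunction[Expression, Expression] = {
    case s: SubqueryExpression => s.withNewPlan(deepTransformAllExpressions(s.plan)(rule))
  }
  plan.transformAllExpressions(rule.orElse(intoSubqueries))
}

// Usage, mirroring the example above:
// deepTransformAllExpressions(plan) { case l @ Literal(1, _) => l.copy(value = 2) }
{code}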



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25037) plan.transformAllExpressions() doesn't transform expressions in nested SubqueryExpression plans

2018-08-06 Thread Chris O'Hara (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris O'Hara updated SPARK-25037:
-
Description: 
Given the following LogicalPlan:
{code:java}
scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
(SELECT 1 foo)").queryExecution.logical
plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Project [1 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [1 AS foo#28]
         +- OneRowRelation
{code}
the following transformation should replace all instances of lit(1) with lit(2):
{code:java}
scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value = 
2) }
res0: plan.type =
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
Instead, the nested SubqueryExpression plan is not transformed.

The expected output is: 
{code:java}
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [2 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
 

 

  was:
Given the following LogicalPlan:
{code:java}
scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
(SELECT 1 foo)").queryExecution.logical
plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Project [1 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [1 AS foo#28]
         +- OneRowRelation
{code}
The following transformation should replace all instances of lit(1) with lit(2):
{code:java}
scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value = 
2) }
res0: plan.type =
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
Instead, the nested SubqueryExpression plan is not transformed.

The expected output is: 
{code:java}
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [2 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
 

 


> plan.transformAllExpressions() doesn't transform expressions in nested 
> SubqueryExpression plans
> ---
>
> Key: SPARK-25037
> URL: https://issues.apache.org/jira/browse/SPARK-25037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chris O'Hara
>Priority: Minor
>
> Given the following LogicalPlan:
> {code:java}
> scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
> (SELECT 1 foo)").queryExecution.logical
> plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Project [1 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [1 AS foo#28]
>          +- OneRowRelation
> {code}
> the following transformation should replace all instances of lit(1) with 
> lit(2):
> {code:java}
> scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value 
> = 2) }
> res0: plan.type =
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
> Instead, the nested SubqueryExpression plan is not transformed.
> The expected output is: 
> {code:java}
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [2 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25031) The schema of MapType can not be printed correctly

2018-08-06 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25031:

Target Version/s: 2.4.0

> The schema of MapType can not be printed correctly
> --
>
> Key: SPARK-25031
> URL: https://issues.apache.org/jira/browse/SPARK-25031
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Hao Ren
>Priority: Minor
>  Labels: easyfix
>
> Something is wrong with the function `buildFormattedString` in `MapType`.
>  
> {code:java}
> import spark.implicits._
> case class Key(a: Int)
> case class Value(b: Int)
> Seq(
>   (1, Map(Key(1) -> Value(2))), 
>   (2, Map(Key(1) -> Value(2)))
> ).toDF("id", "dict").printSchema
> {code}
> The result is:
> {code:java}
> root
> |-- id: integer (nullable = false)
> |-- dict: map (nullable = true)
> | |-- key: struct
> | |-- value: struct (valueContainsNull = true)
> | | |-- a: integer (nullable = false)
> | | |-- b: integer (nullable = false)
> {code}
>  The expected is
> {code:java}
> root
> |-- id: integer (nullable = false)
> |-- dict: map (nullable = true)
> | |-- key: struct
> | | |-- a: integer (nullable = false)
> | |-- value: struct (valueContainsNull = true)
> | | |-- b: integer (nullable = false)
> {code}
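A standalone sketch of the recursion the expected output implies (a hypothetical helper written against the public {{org.apache.spark.sql.types}} API, not the internal {{buildFormattedString}}): the key type must be expanded one level deeper, just like the value type.

{code:java}
import org.apache.spark.sql.types._

// Hypothetical re-implementation of the schema tree printing, for illustration only
// (nullability annotations omitted to keep the sketch short).
def formatField(name: String, dt: DataType, indent: String, sb: StringBuilder): Unit = {
  sb.append(s"$indent|-- $name: ${dt.typeName}\n")
  dt match {
    case StructType(fields) => fields.foreach(f => formatField(f.name, f.dataType, indent + "|    ", sb))
    case MapType(k, v, _) =>
      formatField("key", k, indent + "|    ", sb)   // key expands at the same depth as value
      formatField("value", v, indent + "|    ", sb)
    case ArrayType(e, _) => formatField("element", e, indent + "|    ", sb)
    case _ => // leaf types print nothing further
  }
}
{code}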



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23938) High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, V3>) → map<K, V3>

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23938:


Assignee: Apache Spark

> High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, V3>) → map<K, V3>
> ---
>
> Key: SPARK-23938
> URL: https://issues.apache.org/jira/browse/SPARK-23938
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref:  https://prestodb.io/docs/current/functions/map.html
> Merges the two given maps into a single map by applying the function to the pair 
> of values with the same key. For keys present in only one map, NULL will be 
> passed as the value for the missing key.
> {noformat}
> SELECT map_zip_with(MAP(ARRAY[1, 2, 3], ARRAY['a', 'b', 'c']), -- {1 -> ad, 2 
> -> be, 3 -> cf}
> MAP(ARRAY[1, 2, 3], ARRAY['d', 'e', 'f']),
> (k, v1, v2) -> concat(v1, v2));
> SELECT map_zip_with(MAP(ARRAY['k1', 'k2'], ARRAY[1, 2]), -- {k1 -> ROW(1, 
> null), k2 -> ROW(2, 4), k3 -> ROW(null, 9)}
> MAP(ARRAY['k2', 'k3'], ARRAY[4, 9]),
> (k, v1, v2) -> (v1, v2));
> SELECT map_zip_with(MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 8, 27]), -- {a -> a1, 
> b -> b4, c -> c9}
> MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 3]),
> (k, v1, v2) -> k || CAST(v1/v2 AS VARCHAR));
> {noformat}
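A plain Scala sketch of the described semantics (an illustration of the contract only, not the Spark implementation, which operates on catalyst map data): missing keys surface as empty options so the merge function can decide what NULL means.

{code:java}
// Merge two maps by key; f sees None where a key is missing on one side.
def mapZipWith[K, V1, V2, V3](m1: Map[K, V1], m2: Map[K, V2])(f: (K, Option[V1], Option[V2]) => V3): Map[K, V3] =
  (m1.keySet ++ m2.keySet).map(k => k -> f(k, m1.get(k), m2.get(k))).toMap

// mapZipWith(Map(1 -> "a", 2 -> "b"), Map(2 -> "e", 3 -> "f")) {
//   (_, v1, v2) => v1.getOrElse("") + v2.getOrElse("")
// }  // Map(1 -> "a", 2 -> "be", 3 -> "f")
{code}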



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23938) High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, V3>) → map<K, V3>

2018-08-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570932#comment-16570932
 ] 

Apache Spark commented on SPARK-23938:
--

User 'mn-mikke' has created a pull request for this issue:
https://github.com/apache/spark/pull/22017

> High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, V3>) → map<K, V3>
> ---
>
> Key: SPARK-23938
> URL: https://issues.apache.org/jira/browse/SPARK-23938
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref:  https://prestodb.io/docs/current/functions/map.html
> Merges the two given maps into a single map by applying the function to the pair 
> of values with the same key. For keys present in only one map, NULL will be 
> passed as the value for the missing key.
> {noformat}
> SELECT map_zip_with(MAP(ARRAY[1, 2, 3], ARRAY['a', 'b', 'c']), -- {1 -> ad, 2 
> -> be, 3 -> cf}
> MAP(ARRAY[1, 2, 3], ARRAY['d', 'e', 'f']),
> (k, v1, v2) -> concat(v1, v2));
> SELECT map_zip_with(MAP(ARRAY['k1', 'k2'], ARRAY[1, 2]), -- {k1 -> ROW(1, 
> null), k2 -> ROW(2, 4), k3 -> ROW(null, 9)}
> MAP(ARRAY['k2', 'k3'], ARRAY[4, 9]),
> (k, v1, v2) -> (v1, v2));
> SELECT map_zip_with(MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 8, 27]), -- {a -> a1, 
> b -> b4, c -> c9}
> MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 3]),
> (k, v1, v2) -> k || CAST(v1/v2 AS VARCHAR));
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23938) High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, V3>) → map<K, V3>

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23938:


Assignee: (was: Apache Spark)

> High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, V3>) → map<K, V3>
> ---
>
> Key: SPARK-23938
> URL: https://issues.apache.org/jira/browse/SPARK-23938
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref:  https://prestodb.io/docs/current/functions/map.html
> Merges the two given maps into a single map by applying the function to the pair 
> of values with the same key. For keys present in only one map, NULL will be 
> passed as the value for the missing key.
> {noformat}
> SELECT map_zip_with(MAP(ARRAY[1, 2, 3], ARRAY['a', 'b', 'c']), -- {1 -> ad, 2 
> -> be, 3 -> cf}
> MAP(ARRAY[1, 2, 3], ARRAY['d', 'e', 'f']),
> (k, v1, v2) -> concat(v1, v2));
> SELECT map_zip_with(MAP(ARRAY['k1', 'k2'], ARRAY[1, 2]), -- {k1 -> ROW(1, 
> null), k2 -> ROW(2, 4), k3 -> ROW(null, 9)}
> MAP(ARRAY['k2', 'k3'], ARRAY[4, 9]),
> (k, v1, v2) -> (v1, v2));
> SELECT map_zip_with(MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 8, 27]), -- {a -> a1, 
> b -> b4, c -> c9}
> MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 3]),
> (k, v1, v2) -> k || CAST(v1/v2 AS VARCHAR));
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24996) Use DSL to simplify DeclarativeAggregate

2018-08-06 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24996.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Use DSL to simplify DeclarativeAggregate
> 
>
> Key: SPARK-24996
> URL: https://issues.apache.org/jira/browse/SPARK-24996
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
>  Labels: beginner
> Fix For: 2.4.0
>
>
> Simplify DeclarativeAggregate by DSL. See the example: 
> https://github.com/apache/spark/pull/21951



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24996) Use DSL to simplify DeclarativeAggregate

2018-08-06 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-24996:
---

Assignee: Marco Gaido

> Use DSL to simplify DeclarativeAggregate
> 
>
> Key: SPARK-24996
> URL: https://issues.apache.org/jira/browse/SPARK-24996
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
>  Labels: beginner
> Fix For: 2.4.0
>
>
> Simplify DeclarativeAggregate by DSL. See the example: 
> https://github.com/apache/spark/pull/21951



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25037) plan.transformAllExpressions() doesn't transform expressions in nested SubqueryExpression plans

2018-08-06 Thread Chris O'Hara (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris O'Hara updated SPARK-25037:
-
Summary: plan.transformAllExpressions() doesn't transform expressions in 
nested SubqueryExpression plans  (was: plan.transformAllExpressions doesn't 
transform expressions in nested SubqueryExpression plans)

> plan.transformAllExpressions() doesn't transform expressions in nested 
> SubqueryExpression plans
> ---
>
> Key: SPARK-25037
> URL: https://issues.apache.org/jira/browse/SPARK-25037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chris O'Hara
>Priority: Minor
>
> Given the following LogicalPlan, containing a SubqueryAlias and 
> SubqueryExpression:
> {code:java}
> scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
> (SELECT 1 foo)").queryExecution.logical
> plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Project [1 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [1 AS foo#28]
>          +- OneRowRelation
> {code}
> The following transformation should replace all instances of lit(1) with 
> lit(2):
> {code:java}
> scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value 
> = 2) }
> res0: plan.type =
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
> Instead, the nested SubqueryExpression plan is not transformed.
> The expected output is: 
> {code:java}
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [2 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25037) plan.transformAllExpressions doesn't transform expressions in nested SubqueryExpression plans

2018-08-06 Thread Chris O'Hara (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris O'Hara updated SPARK-25037:
-
Summary: plan.transformAllExpressions doesn't transform expressions in 
nested SubqueryExpression plans  (was: plan.transformAllExpressions doesn't 
transform expressions in subquery plans)

> plan.transformAllExpressions doesn't transform expressions in nested 
> SubqueryExpression plans
> -
>
> Key: SPARK-25037
> URL: https://issues.apache.org/jira/browse/SPARK-25037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chris O'Hara
>Priority: Minor
>
> Given the following LogicalPlan, containing a SubqueryAlias and 
> SubqueryExpression:
> {code:java}
> scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
> (SELECT 1 foo)").queryExecution.logical
> plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Project [1 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [1 AS foo#28]
>          +- OneRowRelation
> {code}
> The following transformation should replace all instances of lit(1) with 
> lit(2):
> {code:java}
> scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value 
> = 2) }
> res0: plan.type =
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
> Instead, the nested SubqueryExpression plan is not transformed.
> The expected output is: 
> {code:java}
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [2 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25037) plan.transformAllExpressions doesn't transform expressions in subquery plans

2018-08-06 Thread Chris O'Hara (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris O'Hara updated SPARK-25037:
-
Description: 
Given the following LogicalPlan, containing a SubqueryAlias and 
SubqueryExpression:
{code:java}
scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
(SELECT 1 foo)").queryExecution.logical
plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Project [1 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [1 AS foo#28]
         +- OneRowRelation
{code}
The following transformation should replace all instances of lit(1) with lit(2):
{code:java}
scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value = 
2) }
res0: plan.type =
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
Instead, the nested SubqueryExpression plan is not transformed.

The expected output is: 
{code:java}
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [2 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
 

 

  was:
Given the following LogicalPlan, containing a SubqueryAlias and 
SubqueryExpression:

 
{code:java}
scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
(SELECT 1 foo)").queryExecution.logical
plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Project [1 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [1 AS foo#28]
         +- OneRowRelation
{code}
The following transformation should replace all instances of lit(1) with lit(2):

 
{code:java}
scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value = 
2) }
res0: plan.type =
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
 

Instead, the nested SubqueryExpression plan is not transformed.

The expected output is:

 
{code:java}
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [2 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
 

 

 

 


> plan.transformAllExpressions doesn't transform expressions in subquery plans
> 
>
> Key: SPARK-25037
> URL: https://issues.apache.org/jira/browse/SPARK-25037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chris O'Hara
>Priority: Minor
>
> Given the following LogicalPlan, containing a SubqueryAlias and 
> SubqueryExpression:
> {code:java}
> scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
> (SELECT 1 foo)").queryExecution.logical
> plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Project [1 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [1 AS foo#28]
>          +- OneRowRelation
> {code}
> The following transformation should replace all instances of lit(1) with 
> lit(2):
> {code:java}
> scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value 
> = 2) }
> res0: plan.type =
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [1 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
> Instead, the nested SubqueryExpression plan is not transformed.
> The expected output is: 
> {code:java}
> 'Project [2 AS bar#29]
> +- 'Filter 'foo IN (list#31 [])
>    :  +- Project [2 AS foo#30]
>    :     +- OneRowRelation
>    +- SubqueryAlias __auto_generated_subquery_name
>       +- Project [2 AS foo#28]
>          +- OneRowRelation
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25037) plan.transformAllExpressions doesn't transform expressions in subquery plans

2018-08-06 Thread Chris O'Hara (JIRA)
Chris O'Hara created SPARK-25037:


 Summary: plan.transformAllExpressions doesn't transform 
expressions in subquery plans
 Key: SPARK-25037
 URL: https://issues.apache.org/jira/browse/SPARK-25037
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Chris O'Hara


Given the following LogicalPlan, containing a SubqueryAlias and 
SubqueryExpression:

 
{code:java}
scala> val plan = spark.sql("SELECT 1 bar FROM (SELECT 1 foo) WHERE foo IN 
(SELECT 1 foo)").queryExecution.logical
plan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Project [1 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [1 AS foo#28]
         +- OneRowRelation
{code}
The following transformation should replace all instances of lit(1) with lit(2):

 
{code:java}
scala> plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value = 
2) }
res0: plan.type =
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [1 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
 

Instead, the nested SubqueryExpression plan is not transformed.

The expected output is:

 
{code:java}
'Project [2 AS bar#29]
+- 'Filter 'foo IN (list#31 [])
   :  +- Project [2 AS foo#30]
   :     +- OneRowRelation
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [2 AS foo#28]
         +- OneRowRelation
{code}
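
As a rough workaround sketch (not part of this report), one can descend into `SubqueryExpression` plans explicitly while transforming; `plan` refers to the LogicalPlan built above:
{code:java}
import org.apache.spark.sql.catalyst.expressions.{Literal, SubqueryExpression}

// Rewrite literals in the main plan and, explicitly, inside the nested
// plan of every SubqueryExpression as well.
val rewritten = plan transformAllExpressions {
  case s: SubqueryExpression =>
    s.withNewPlan(s.plan.transformAllExpressions { case l @ Literal(1, _) => l.copy(value = 2) })
  case l @ Literal(1, _) => l.copy(value = 2)
}
{code}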
 

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570880#comment-16570880
 ] 

Dongjoon Hyun edited comment on SPARK-24924 at 8/6/18 10:50 PM:


Yep. It will work if those 3rd-party packages are rebuilt on Apache Spark 2.4. 
So, it will be the next releases, not the currently existing ones.

Spark hides Spark-generated metadata. You can see them via the `hive` CLI as 
follows.

1. Run the Apache Hive 1.2.2 CLI and check tables; this initializes the metastore, too.
{code:java}
hive> show tables;
OK
Time taken: 1.163 seconds
{code}
2. Apache Spark 2.3.1 Result (See `Provider` field)
{code:java}
scala> spark.version
res1: String = 2.3.1
scala> 
spark.range(10).write.format("com.databricks.spark.avro").saveAsTable("t")
scala> sql("desc formatted t").show(false)
+----------------------------+---------------------------------------------------------+-------+
|col_name                    |data_type                                                |comment|
+----------------------------+---------------------------------------------------------+-------+
|id                          |bigint                                                   |null   |
|                            |                                                         |       |
|# Detailed Table Information|                                                         |       |
|Database                    |default                                                  |       |
|Table                       |t                                                        |       |
|Owner                       |dongjoon                                                 |       |
|Created Time                |Mon Aug 06 15:41:40 PDT 2018                             |       |
|Last Access                 |Wed Dec 31 16:00:00 PST 1969                             |       |
|Created By                  |Spark 2.3.1                                              |       |
|Type                        |MANAGED                                                  |       |
|Provider                    |com.databricks.spark.avro                                |       |
|Table Properties            |[transient_lastDdlTime=1533595300]                       |       |
|Location                    |file:/user/hive/warehouse/t                              |       |
|Serde Library               |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       |       |
|InputFormat                 |org.apache.hadoop.mapred.SequenceFileInputFormat         |       |
|OutputFormat                |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat|       |
|Storage Properties          |[serialization.format=1]                                 |       |
+----------------------------+---------------------------------------------------------+-------+
{code}
3. Apache Hive 1.2.2 CLI Result (See `Table Parameters`)
{code:java}
hive> describe formatted t;
OK
# col_name  data_type   comment

col array   from deserializer

# Detailed Table Information
Database:   default
Owner:  dongjoon
CreateTime: Mon Aug 06 15:41:40 PDT 2018
LastAccessTime: UNKNOWN
Protect Mode:   None
Retention:  0
Location:   
file:/Users/dongjoon/spark-release/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t
Table Type: MANAGED_TABLE
Table Parameters:
spark.sql.create.version    2.3.1
spark.sql.sources.provider  com.databricks.spark.avro
spark.sql.sources.schema.numParts   1
spark.sql.sources.schema.part.0 
{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}
transient_lastDdlTime   1533595300

# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:    org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Compressed: No
Num Buckets:-1
Bucket Columns: []
Sort Columns:   []
Storage Desc Params:
path    file:/user/hive/warehouse/t
serialization.format    1
Time taken: 1.373 seconds, Fetched: 31 row(s)
{code}


was (Author: dongjoon):
Yep. It will work if those 3rd-party packages are rebuilt on Apache Spark 2.4. 
So, it will be the next releases, not the currently existing ones.

Spark hides Spark-generated metadata. You can see them via the `hive` CLI as 
follows.

1. Run the Apache Hive 1.2.2 CLI and check tables; this initializes the metastore, too.
{code}
hive> show tables;
OK
Time taken: 1.163 seconds
{code}

2. Apache Spark 2.3.1 Result
{code}
scala> spark.version
res1: String = 2.3.1
scala> 

[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570880#comment-16570880
 ] 

Dongjoon Hyun commented on SPARK-24924:
---

Yep. It will work if those 3rd-party packages are rebuilt on Apache Spark 2.4. 
So, it will be the next releases, not the currently existing ones.

Spark hides Spark-generated metadata. You can see them via the `hive` CLI as 
follows.

1. Run the Apache Hive 1.2.2 CLI and check tables; this initializes the metastore, too.
{code}
hive> show tables;
OK
Time taken: 1.163 seconds
{code}

2. Apache Spark 2.3.1 Result
{code}
scala> spark.version
res1: String = 2.3.1
scala> 
spark.range(10).write.format("com.databricks.spark.avro").saveAsTable("t")
scala> sql("desc formatted t").show(false)
+----------------------------+---------------------------------------------------------+-------+
|col_name                    |data_type                                                |comment|
+----------------------------+---------------------------------------------------------+-------+
|id                          |bigint                                                   |null   |
|                            |                                                         |       |
|# Detailed Table Information|                                                         |       |
|Database                    |default                                                  |       |
|Table                       |t                                                        |       |
|Owner                       |dongjoon                                                 |       |
|Created Time                |Mon Aug 06 15:41:40 PDT 2018                             |       |
|Last Access                 |Wed Dec 31 16:00:00 PST 1969                             |       |
|Created By                  |Spark 2.3.1                                              |       |
|Type                        |MANAGED                                                  |       |
|Provider                    |com.databricks.spark.avro                                |       |
|Table Properties            |[transient_lastDdlTime=1533595300]                       |       |
|Location                    |file:/user/hive/warehouse/t                              |       |
|Serde Library               |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       |       |
|InputFormat                 |org.apache.hadoop.mapred.SequenceFileInputFormat         |       |
|OutputFormat                |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat|       |
|Storage Properties          |[serialization.format=1]                                 |       |
+----------------------------+---------------------------------------------------------+-------+
{code}

3. Apache Hive 1.2.2 CLI Result
{code}
hive> describe formatted t;
OK
# col_name  data_type   comment

col array   from deserializer

# Detailed Table Information
Database:   default
Owner:  dongjoon
CreateTime: Mon Aug 06 15:41:40 PDT 2018
LastAccessTime: UNKNOWN
Protect Mode:   None
Retention:  0
Location:   
file:/Users/dongjoon/spark-release/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t
Table Type: MANAGED_TABLE
Table Parameters:
spark.sql.create.version    2.3.1
spark.sql.sources.provider  com.databricks.spark.avro
spark.sql.sources.schema.numParts   1
spark.sql.sources.schema.part.0 
{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}
transient_lastDdlTime   1533595300

# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:    org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Compressed: No
Num Buckets:-1
Bucket Columns: []
Sort Columns:   []
Storage Desc Params:
path    file:/user/hive/warehouse/t
serialization.format    1
Time taken: 1.373 seconds, Fetched: 31 row(s)
{code}

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to do the following.
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove incorrect error 

[jira] [Resolved] (SPARK-24161) Enable debug package feature on structured streaming

2018-08-06 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-24161.
--
   Resolution: Fixed
 Assignee: Jungtaek Lim
Fix Version/s: 2.4.0

> Enable debug package feature on structured streaming
> 
>
> Key: SPARK-24161
> URL: https://issues.apache.org/jira/browse/SPARK-24161
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, the debug package has an implicit class which wraps Dataset to 
> provide debug features on the Dataset class. It doesn't work with structured 
> streaming: it requires that the query is already started, and the information 
> can be retrieved from StreamingQuery, not Dataset. For the same reason, 
> "explain" had to be placed on StreamingQuery whereas it exists on Dataset.
> This issue tracks the effort to enable the debug package feature on structured 
> streaming. Unlike batch, it may have some restrictions.
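
For reference, a minimal sketch of the existing batch-side helpers the description refers to (usage only; the structured streaming counterpart is what this issue adds):
{code:java}
// Batch-side debug helpers provided by the implicit class mentioned above.
import org.apache.spark.sql.execution.debug._

val df = spark.range(10).toDF("id")
df.debug()         // runs the query and prints per-node tuple counts and column stats
df.debugCodegen()  // prints the generated code for each whole-stage-codegen subtree
{code}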



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570840#comment-16570840
 ] 

Thomas Graves commented on SPARK-24924:
---

So officially the Spark API compatibility guarantee is only at the compilation 
level: [http://spark.apache.org/versioning-policy.html]. We try to keep binary 
compatibility, but it's not guaranteed between releases. It might be worth 
bringing it up, though, to make sure they thought of that, as it should be a 
conscious decision.

I think if you rebuild databricks avro against Spark 2.4 it works, right?

I unfortunately don't have a Hive setup working with Spark 2.4 right now. When 
I wrote a table (saveAsTable) with the 2.3 databricks avro package, I didn't 
see a table property spark.sql.sources.provider; what am I missing?

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to do the following.
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-06 Thread Mridul Muralidharan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-24948.
-
Resolution: Fixed

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Blocker
> Fix For: 2.4.0
>
>
> The SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem 
> may have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.
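
Not necessarily the approach taken by the fix, but as a sketch of the direction the description suggests, the readability decision can be delegated to the filesystem itself (which knows about ACLs and its own policies) instead of being re-derived from the basic permission bits; `canRead` below is illustrative:
{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsAction
import org.apache.hadoop.security.AccessControlException

// Ask the filesystem whether the SHS user may read the event log,
// so ACLs and filesystem-specific policies are honored.
def canRead(fs: FileSystem, eventLog: Path): Boolean =
  try {
    fs.access(eventLog, FsAction.READ)
    true
  } catch {
    case _: AccessControlException => false
  }
{code}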



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-06 Thread Mridul Muralidharan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-24948:

Fix Version/s: 2.4.0

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Blocker
> Fix For: 2.4.0
>
>
> The SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem 
> may have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors

2018-08-06 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570770#comment-16570770
 ] 

Sean Owen commented on SPARK-25029:
---

Thanks [~shaneknapp] – looks good in that it runs and shows one of the test 
failures we're working on here.

> Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods 
> ..." errors
> ---
>
> Key: SPARK-25029
> URL: https://issues.apache.org/jira/browse/SPARK-25029
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Blocker
>
> We actually still have some test failures in the Scala 2.12 build. There seem 
> to be two types. The first is that some tests fail with "TaskNotSerializable" 
> because some code construct now captures a reference to scalatest's 
> AssertionsHelper. Example:
> {code:java}
> - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode 
> *** FAILED *** java.io.NotSerializableException: 
> org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not 
> serializable (class: org.scalatest.Assertions$AssertionsHelper, value: 
> org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code}
> These seem generally easy to fix by tweaking the test code. It's not clear if 
> something about closure cleaning in 2.12 could be improved to detect this 
> situation automatically; given that, so far, only a handful of tests fail for this 
> reason, it's unlikely to be a systemic problem.
>  
> The other error is curiouser. Janino fails to compile generated code in many 
> cases with errors like:
> {code:java}
> - encode/decode for seq of string: List(abc, xyz) *** FAILED ***
> java.lang.RuntimeException: Error while encoding: 
> org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Two non-abstract methods "public int scala.collection.TraversableOnce.size()" 
> have the same parameter types, declaring type and return type{code}
>  
> I include the full generated code that failed in one case below. There is no 
> {{size()}} in the generated code. It's got to be down to some difference in 
> Scala 2.12, potentially even a Janino problem.
>  
> {code:java}
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Two non-abstract methods "public int 
> scala.collection.TraversableOnce.size()" have the same parameter types, 
> declaring type and return type
> at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
> at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
> at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
> at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
> at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342)
> ... 30 more
> Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract 
> methods "public int scala.collection.TraversableOnce.size()" have the same 
> parameter types, declaring type and return type
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112)
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737)
> at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070)
> at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070)
> at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4391)
> at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:212)
> at 
> org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4094)
> at 
> org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4070)
> at org.codehaus.janino.Java$ConditionalExpression.accept(Java.java:4344)
> at 

[jira] [Commented] (SPARK-24786) Executors not being released after all cached data is unpersisted

2018-08-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570755#comment-16570755
 ] 

Apache Spark commented on SPARK-24786:
--

User 'dhruve' has created a pull request for this issue:
https://github.com/apache/spark/pull/22015

> Executors not being released after all cached data is unpersisted
> -
>
> Key: SPARK-24786
> URL: https://issues.apache.org/jira/browse/SPARK-24786
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
> Environment: Zeppelin in EMR
>Reporter: Jeffrey Charles
>Priority: Minor
>
> I'm persisting a dataframe in Zeppelin which has dynamic allocation enabled 
> to get a sense of how much memory the dataframe takes up. After I note the 
> size, I unpersist the dataframe. For some reason, Yarn is not releasing the 
> executors that were added to Zeppelin. If I don't run the persist and 
> unpersist steps, the executors that were added are removed about a minute 
> after the paragraphs complete. Looking at the storage tab in the Spark UI for 
> the Zeppelin job, I don't see anything cached. I do not want to set 
> spark.dynamicAllocation.cachedExecutorIdleTimeout to a lower value because I 
> do not want executors with cached data to be released, but I do want ones 
> that had cached data and no longer have cached data to be released.
>  
> Steps to reproduce:
>  # Enable dynamic allocation
>  # Set spark.dynamicAllocation.executorIdleTimeout to 60s
>  # Set spark.dynamicAllocation.cachedExecutorIdleTimeout to infinity
>  # Load a dataset, persist it, run a count on the persisted dataset, 
> unpersist the persisted dataset
>  # Wait a couple minutes
> Expected behaviour:
> All executors will be released as the executors are no longer caching any data
> Observed behaviour:
> No executors were released
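
A minimal reproduction sketch of the setup described above (the path and the builder-based configuration are illustrative assumptions; a cluster manager with the external shuffle service is required for dynamic allocation):
{code:java}
// Illustrative settings only: idle executors are released after 60s, while
// cachedExecutorIdleTimeout is left at its default (effectively infinite).
val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .getOrCreate()

val df = spark.read.parquet("/path/to/data")  // hypothetical dataset
df.persist()
df.count()      // executors are allocated and blocks are cached
df.unpersist()  // expectation: the now-idle executors are released after 60s
{code}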



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570756#comment-16570756
 ] 

Dongjoon Hyun commented on SPARK-24924:
---

1. Theoretically, Spark 2.4 should handle both kinds of Hive tables 
simultaneously if the jars co-exist.
2. `ALTER TABLE` is technically possible, but it doesn't seem like a good option 
for users because `spark.sql.sources.provider` is Spark-generated metadata.
3. For now, there is another issue with the `FileFormat` trait. In Spark 2.4, 
SPARK-24691 adds `FileFormat.supportDataType` and uses it to verify data types. 
Currently, this is a breaking change because the latest 3rd-party file formats, 
such as databricks avro 4.0.0, don't have that method. The current Spark 2.4 
master branch raises `java.lang.AbstractMethodError`. I think we had better fix 
this on the Spark side for compatibility.

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to do the following.
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24786) Executors not being released after all cached data is unpersisted

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24786:


Assignee: (was: Apache Spark)

> Executors not being released after all cached data is unpersisted
> -
>
> Key: SPARK-24786
> URL: https://issues.apache.org/jira/browse/SPARK-24786
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
> Environment: Zeppelin in EMR
>Reporter: Jeffrey Charles
>Priority: Minor
>
> I'm persisting a dataframe in Zeppelin which has dynamic allocation enabled 
> to get a sense of how much memory the dataframe takes up. After I note the 
> size, I unpersist the dataframe. For some reason, Yarn is not releasing the 
> executors that were added to Zeppelin. If I don't run the persist and 
> unpersist steps, the executors that were added are removed about a minute 
> after the paragraphs complete. Looking at the storage tab in the Spark UI for 
> the Zeppelin job, I don't see anything cached. I do not want to set 
> spark.dynamicAllocation.cachedExecutorIdleTimeout to a lower value because I 
> do not want executors with cached data to be released, but I do want ones 
> that had cached data and no longer have cached data to be released.
>  
> Steps to reproduce:
>  # Enable dynamic allocation
>  # Set spark.dynamicAllocation.executorIdleTimeout to 60s
>  # Set spark.dynamicAllocation.cachedExecutorIdleTimeout to infinity
>  # Load a dataset, persist it, run a count on the persisted dataset, 
> unpersist the persisted dataset
>  # Wait a couple minutes
> Expected behaviour:
> All executors will be released as the executors are no longer caching any data
> Observed behaviour:
> No executors were released



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24786) Executors not being released after all cached data is unpersisted

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24786:


Assignee: Apache Spark

> Executors not being released after all cached data is unpersisted
> -
>
> Key: SPARK-24786
> URL: https://issues.apache.org/jira/browse/SPARK-24786
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
> Environment: Zeppelin in EMR
>Reporter: Jeffrey Charles
>Assignee: Apache Spark
>Priority: Minor
>
> I'm persisting a dataframe in Zeppelin which has dynamic allocation enabled 
> to get a sense of how much memory the dataframe takes up. After I note the 
> size, I unpersist the dataframe. For some reason, Yarn is not releasing the 
> executors that were added to Zeppelin. If I don't run the persist and 
> unpersist steps, the executors that were added are removed about a minute 
> after the paragraphs complete. Looking at the storage tab in the Spark UI for 
> the Zeppelin job, I don't see anything cached. I do not want to set 
> spark.dynamicAllocation.cachedExecutorIdleTimeout to a lower value because I 
> do not want executors with cached data to be released, but I do want ones 
> that had cached data and no longer have cached data to be released.
>  
> Steps to reproduce:
>  # Enable dynamic allocation
>  # Set spark.dynamicAllocation.executorIdleTimeout to 60s
>  # Set spark.dynamicAllocation.cachedExecutorIdleTimeout to infinity
>  # Load a dataset, persist it, run a count on the persisted dataset, 
> unpersist the persisted dataset
>  # Wait a couple minutes
> Expected behaviour:
> All executors will be released as the executors are no longer caching any data
> Observed behaviour:
> No executors were released



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist

2018-08-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570754#comment-16570754
 ] 

Apache Spark commented on SPARK-20286:
--

User 'dhruve' has created a pull request for this issue:
https://github.com/apache/spark/pull/22015

> dynamicAllocation.executorIdleTimeout is ignored after unpersist
> 
>
> Key: SPARK-20286
> URL: https://issues.apache.org/jira/browse/SPARK-20286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Miguel Pérez
>Priority: Major
>
> With dynamic allocation enabled, it seems that executors whose cached data 
> has been unpersisted are still being released according to the 
> {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of 
> {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration 
> ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor 
> with unpersisted data won't be released until the job ends.
> *How to reproduce*
> - Set different values for {{dynamicAllocation.executorIdleTimeout}} and 
> {{dynamicAllocation.cachedExecutorIdleTimeout}}
> - Load a file into a RDD and persist it
> - Execute an action on the RDD (like a count) so some executors are activated.
> - When the action has finished, unpersist the RDD
> - The application UI correctly removes the persisted data from the *Storage* 
> tab, but if you look in the *Executors* tab, you will find that the executors 
> remain *active* until {{dynamicAllocation.cachedExecutorIdleTimeout}} is 
> reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20286:


Assignee: (was: Apache Spark)

> dynamicAllocation.executorIdleTimeout is ignored after unpersist
> 
>
> Key: SPARK-20286
> URL: https://issues.apache.org/jira/browse/SPARK-20286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Miguel Pérez
>Priority: Major
>
> With dynamic allocation enabled, it seems that executors whose cached data 
> has been unpersisted are still being released according to the 
> {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of 
> {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration 
> ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor 
> with unpersisted data won't be released until the job ends.
> *How to reproduce*
> - Set different values for {{dynamicAllocation.executorIdleTimeout}} and 
> {{dynamicAllocation.cachedExecutorIdleTimeout}}
> - Load a file into a RDD and persist it
> - Execute an action on the RDD (like a count) so some executors are activated.
> - When the action has finished, unpersist the RDD
> - The application UI correctly removes the persisted data from the *Storage* 
> tab, but if you look in the *Executors* tab, you will find that the executors 
> remain *active* until {{dynamicAllocation.cachedExecutorIdleTimeout}} is 
> reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20286:


Assignee: Apache Spark

> dynamicAllocation.executorIdleTimeout is ignored after unpersist
> 
>
> Key: SPARK-20286
> URL: https://issues.apache.org/jira/browse/SPARK-20286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Miguel Pérez
>Assignee: Apache Spark
>Priority: Major
>
> With dynamic allocation enabled, it seems that executors whose cached data 
> has been unpersisted are still being released according to the 
> {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of 
> {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration 
> ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor 
> with unpersisted data won't be released until the job ends.
> *How to reproduce*
> - Set different values for {{dynamicAllocation.executorIdleTimeout}} and 
> {{dynamicAllocation.cachedExecutorIdleTimeout}}
> - Load a file into a RDD and persist it
> - Execute an action on the RDD (like a count) so some executors are activated.
> - When the action has finished, unpersist the RDD
> - The application UI correctly removes the persisted data from the *Storage* 
> tab, but if you look in the *Executors* tab, you will find that the executors 
> remain *active* until {{dynamicAllocation.cachedExecutorIdleTimeout}} is 
> reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570736#comment-16570736
 ] 

Thomas Graves commented on SPARK-24924:
---

So if the user includes the databricks jar and specifies 
"com.databricks.spark.avro", can we support that, or is there some conflict 
that won't allow us to have both loaded?

Can the user simply change the sources.provider to be 'avro' and have it work 
with the new built-in version?

Sorry, just trying to make sure I don't miss anything with the compatibility 
story here.

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to do the following.
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors

2018-08-06 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570712#comment-16570712
 ] 

shane knapp commented on SPARK-25029:
-

set up a build to help test this out:

[https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/]

[~srowen] you should be able to log in to jenkins and manually kick the job (it 
will currently poll github every ~12 mins for changes and auto-trigger).  if 
you need your login creds refreshed, let me know and i can take care of that 
outside of this issue.

> Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods 
> ..." errors
> ---
>
> Key: SPARK-25029
> URL: https://issues.apache.org/jira/browse/SPARK-25029
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Blocker
>
> We actually still have some test failures in the Scala 2.12 build. There seem 
> to be two types. The first is that some tests fail with "TaskNotSerializable" 
> because some code construct now captures a reference to scalatest's 
> AssertionsHelper. Example:
> {code:java}
> - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode 
> *** FAILED *** java.io.NotSerializableException: 
> org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not 
> serializable (class: org.scalatest.Assertions$AssertionsHelper, value: 
> org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code}
> These seem generally easy to fix by tweaking the test code. It's not clear if 
> something about closure cleaning in 2.12 could be improved to detect this 
> situation automatically; given that, so far, only a handful of tests fail for this 
> reason, it's unlikely to be a systemic problem.
>  
> The other error is curiouser. Janino fails to compile generated code in many 
> cases with errors like:
> {code:java}
> - encode/decode for seq of string: List(abc, xyz) *** FAILED ***
> java.lang.RuntimeException: Error while encoding: 
> org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Two non-abstract methods "public int scala.collection.TraversableOnce.size()" 
> have the same parameter types, declaring type and return type{code}
>  
> I include the full generated code that failed in one case below. There is no 
> {{size()}} in the generated code. It's got to be down to some difference in 
> Scala 2.12, potentially even a Janino problem.
>  
> {code:java}
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Two non-abstract methods "public int 
> scala.collection.TraversableOnce.size()" have the same parameter types, 
> declaring type and return type
> at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
> at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
> at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
> at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
> at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342)
> ... 30 more
> Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract 
> methods "public int scala.collection.TraversableOnce.size()" have the same 
> parameter types, declaring type and return type
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112)
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737)
> at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070)
> at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070)
> at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4391)
> at 

[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570702#comment-16570702
 ] 

Dongjoon Hyun commented on SPARK-24924:
---

For Hive tables, the format name is stored as a table parameter, 
`spark.sql.sources.provider`. For example, 
`spark.sql.sources.provider=com.databricks.spark.avro`. So, without this 
mapping, the built-in Avro format will not be used for that table. IIUC, one of the 
purposes of the new policy is not to support that.

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to do the following.
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors

2018-08-06 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25029:
--
Priority: Blocker  (was: Major)

Temporarily making this a blocker for 2.4, as Scala 2.12 support is a good goal 
for 2.4 and this may be resolvable quickly.

> Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods 
> ..." errors
> ---
>
> Key: SPARK-25029
> URL: https://issues.apache.org/jira/browse/SPARK-25029
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Blocker
>
> We actually still have some test failures in the Scala 2.12 build. There seem 
> to be two types. The first is that some tests fail with "TaskNotSerializable" 
> because some code construct now captures a reference to scalatest's 
> AssertionsHelper. Example:
> {code:java}
> - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode 
> *** FAILED *** java.io.NotSerializableException: 
> org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not 
> serializable (class: org.scalatest.Assertions$AssertionsHelper, value: 
> org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code}
> These seem generally easy to fix by tweaking the test code. It's not clear if 
> something about closure cleaning in 2.12 could be improved to detect this 
> situation automatically; given that, so far, only a handful of tests fail for this 
> reason, it's unlikely to be a systemic problem.
>  
> The other error is curiouser. Janino fails to compile generated code in many 
> cases with errors like:
> {code:java}
> - encode/decode for seq of string: List(abc, xyz) *** FAILED ***
> java.lang.RuntimeException: Error while encoding: 
> org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Two non-abstract methods "public int scala.collection.TraversableOnce.size()" 
> have the same parameter types, declaring type and return type{code}
>  
> I include the full generated code that failed in one case below. There is no 
> {{size()}} in the generated code. It's got to be down to some difference in 
> Scala 2.12, potentially even a Janino problem.
>  
> {code:java}
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Two non-abstract methods "public int 
> scala.collection.TraversableOnce.size()" have the same parameter types, 
> declaring type and return type
> at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
> at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
> at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
> at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
> at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342)
> ... 30 more
> Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract 
> methods "public int scala.collection.TraversableOnce.size()" have the same 
> parameter types, declaring type and return type
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112)
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737)
> at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070)
> at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070)
> at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4391)
> at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:212)
> at 
> org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4094)
> at 
> org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4070)
> at org.codehaus.janino.Java$ConditionalExpression.accept(Java.java:4344)
> at 

[jira] [Commented] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570684#comment-16570684
 ] 

Apache Spark commented on SPARK-25036:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22014

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> When compiling with sbt, the following errors occur:
> There are two types:
> 1. {{ExprValue.isNull}} is compared with an unexpected type.
> 2. {{match may not be exhaustive}} is reported for a {{match}} expression.
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
> [error] [warn] 
> {code}
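
For illustration only (simplified, hypothetical names rather than the actual Spark classes), the two warning patterns look roughly like this:
{code:java}
// Hedged, simplified sketch of the two sbt warning patterns quoted above.

sealed trait Interval
case class NumericInterval(min: Double, max: Double) extends Interval
case object DefaultInterval extends Interval

// Pattern 2: "match may not be exhaustive" -- mixed combinations such as
// (NumericInterval(_, _), DefaultInterval) are not covered without the catch-all.
def isIntersected(r1: Interval, r2: Interval): Boolean = (r1, r2) match {
  case (DefaultInterval, DefaultInterval) => true
  case (NumericInterval(min1, max1), NumericInterval(min2, max2)) =>
    min1 <= max2 && min2 <= max1
  case _ => false  // covering the remaining cases silences the warning
}

// Pattern 1: comparing a structured value against a raw String.
// Scala 2.12 warns that the two sides "will most likely never compare equal".
case class ExprValueLike(code: String)  // stand-in for the real ExprValue
val isNull = ExprValueLike("true")
val alwaysFalse = isNull == "true"      // the comparison is meaningless
{code}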



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors

2018-08-06 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570666#comment-16570666
 ] 

Sean Owen commented on SPARK-25029:
---

[~skonto] I tried basically that, and it looks like it resolves the error in Spark. 
I'll see what the janino maintainers have to say about the change, though: 
https://github.com/janino-compiler/janino/pull/54
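
Editor's note: the quoted description below mentions closures that capture 
scalatest's AssertionsHelper. A minimal sketch of that failure mode and the usual 
fix follows; the suite and test names are invented for illustration, not taken 
from the Spark test suite.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class SerializationSketchSuite extends FunSuite {
  test("asserting inside a closure can drag in the suite") {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("sketch"))
    try {
      // Problematic pattern: assert(...) comes from org.scalatest.Assertions and
      // references the suite's AssertionsHelper, so the closure ends up capturing
      // a non-serializable object and the task fails to serialize under 2.12.
      // sc.parallelize(1 to 3).foreach(x => assert(x > 0))

      // Typical fix: do the distributed work first, assert only on the driver.
      val total = sc.parallelize(1 to 3).reduce(_ + _)
      assert(total == 6)
    } finally {
      sc.stop()
    }
  }
}
{code}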

> Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods 
> ..." errors
> ---
>
> Key: SPARK-25029
> URL: https://issues.apache.org/jira/browse/SPARK-25029
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> We actually still have some test failures in the Scala 2.12 build. There seem 
> to be two types. The first is that some tests fail with "TaskNotSerializable" 
> because some code construct now captures a reference to scalatest's 
> AssertionsHelper. Example:
> {code:java}
> - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode 
> *** FAILED *** java.io.NotSerializableException: 
> org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not 
> serializable (class: org.scalatest.Assertions$AssertionsHelper, value: 
> org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code}
> These seem generally easy to fix by tweaking the test code. It's not clear if 
> something about closure cleaning in 2.12 could be improved to detect this 
> situation automatically; given that only a handful of tests fail for this 
> reason so far, it's unlikely to be a systemic problem.
>  
> The other error is more curious. Janino fails to compile generated code in many 
> cases with errors like:
> {code:java}
> - encode/decode for seq of string: List(abc, xyz) *** FAILED ***
> java.lang.RuntimeException: Error while encoding: 
> org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Two non-abstract methods "public int scala.collection.TraversableOnce.size()" 
> have the same parameter types, declaring type and return type{code}
>  
> I include the full generated code that failed in one case below. There is no 
> {{size()}} in the generated code. It's got to be down to some difference in 
> Scala 2.12, potentially even a Janino problem.
>  
> {code:java}
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Two non-abstract methods "public int 
> scala.collection.TraversableOnce.size()" have the same parameter types, 
> declaring type and return type
> at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
> at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
> at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
> at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
> at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342)
> ... 30 more
> Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract 
> methods "public int scala.collection.TraversableOnce.size()" have the same 
> parameter types, declaring type and return type
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112)
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737)
> at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070)
> at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070)
> at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5253)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4391)
> at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:212)
> at 
> org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4094)
> at 
> org.codehaus.janino.UnitCompiler$12.visitConditionalExpression(UnitCompiler.java:4070)
> at 

[jira] [Assigned] (SPARK-23939) High-order function: transform_keys(map&lt;K1, V&gt;, function&lt;K1, V, K2&gt;) → map&lt;K2, V&gt;

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23939:


Assignee: (was: Apache Spark)

> High-order function: transform_keys(map&lt;K1, V&gt;, function&lt;K1, V, K2&gt;) → 
> map&lt;K2, V&gt;
> 
>
> Key: SPARK-23939
> URL: https://issues.apache.org/jira/browse/SPARK-23939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map that applies function to each entry of map and transforms the 
> keys.
> {noformat}
> SELECT transform_keys(MAP(ARRAY[], ARRAY[]), (k, v) -> k + 1); -- {}
> SELECT transform_keys(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) -> 
> k + 1); -- {2 -> a, 3 -> b, 4 -> c}
> SELECT transform_keys(MAP(ARRAY ['a', 'b', 'c'], ARRAY [1, 2, 3]), (k, v) -> 
> v * v); -- {1 -> 1, 4 -> 2, 9 -> 3}
> SELECT transform_keys(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || 
> CAST(v as VARCHAR)); -- {a1 -> 1, b2 -> 2}
> SELECT transform_keys(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {one -> 1.0, 
> two -> 1.4}
>   (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k]);
> {noformat}
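
Editor's illustration (not part of the ticket): once the function lands in Spark 
SQL as proposed, the first Presto example above would translate roughly to the 
Scala snippet below. The SparkSession setup is assumed, and the exact SQL surface 
is whatever the eventual PR implements.

{code:scala}
import org.apache.spark.sql.SparkSession

object TransformKeysSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("transform_keys-sketch")
      .getOrCreate()

    // Shift every key by one: map(1 -> 'a', 2 -> 'b') becomes map(2 -> 'a', 3 -> 'b').
    spark.sql("SELECT transform_keys(map(1, 'a', 2, 'b'), (k, v) -> k + 1) AS shifted")
      .show(truncate = false)

    spark.stop()
  }
}
{code}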



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23939) High-order function: transform_keys(map&lt;K1, V&gt;, function&lt;K1, V, K2&gt;) → map&lt;K2, V&gt;

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23939:


Assignee: Apache Spark

> High-order function: transform_keys(map&lt;K1, V&gt;, function&lt;K1, V, K2&gt;) → 
> map&lt;K2, V&gt;
> 
>
> Key: SPARK-23939
> URL: https://issues.apache.org/jira/browse/SPARK-23939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map that applies function to each entry of map and transforms the 
> keys.
> {noformat}
> SELECT transform_keys(MAP(ARRAY[], ARRAY[]), (k, v) -> k + 1); -- {}
> SELECT transform_keys(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) -> 
> k + 1); -- {2 -> a, 3 -> b, 4 -> c}
> SELECT transform_keys(MAP(ARRAY ['a', 'b', 'c'], ARRAY [1, 2, 3]), (k, v) -> 
> v * v); -- {1 -> 1, 4 -> 2, 9 -> 3}
> SELECT transform_keys(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || 
> CAST(v as VARCHAR)); -- {a1 -> 1, b2 -> 2}
> SELECT transform_keys(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {one -> 1.0, 
> two -> 1.4}
>   (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k]);
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23939) High-order function: transform_keys(map&lt;K1, V&gt;, function&lt;K1, V, K2&gt;) → map&lt;K2, V&gt;

2018-08-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570657#comment-16570657
 ] 

Apache Spark commented on SPARK-23939:
--

User 'codeatri' has created a pull request for this issue:
https://github.com/apache/spark/pull/22013

> High-order function: transform_keys(map&lt;K1, V&gt;, function&lt;K1, V, K2&gt;) → 
> map&lt;K2, V&gt;
> 
>
> Key: SPARK-23939
> URL: https://issues.apache.org/jira/browse/SPARK-23939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map that applies function to each entry of map and transforms the 
> keys.
> {noformat}
> SELECT transform_keys(MAP(ARRAY[], ARRAY[]), (k, v) -> k + 1); -- {}
> SELECT transform_keys(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) -> 
> k + 1); -- {2 -> a, 3 -> b, 4 -> c}
> SELECT transform_keys(MAP(ARRAY ['a', 'b', 'c'], ARRAY [1, 2, 3]), (k, v) -> 
> v * v); -- {1 -> 1, 4 -> 2, 9 -> 3}
> SELECT transform_keys(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || 
> CAST(v as VARCHAR)); -- {a1 -> 1, b2 -> 2}
> SELECT transform_keys(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {one -> 1.0, 
> two -> 1.4}
>   (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k]);
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25036:


Assignee: (was: Apache Spark)

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> When compiling with sbt, the following errors occur:
> There are two types:
> 1. {{ExprValue.isNull}} is compared with an unexpected type ({{String}}).
> 1. A {{match may not be exhaustive}} warning is reported at a {{match}} expression.
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
> [error] [warn] 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25036:


Assignee: Apache Spark

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Major
>
> When compiling with sbt, the following errors occur:
> There are two types:
> 1. {{ExprValue.isNull}} is compared with an unexpected type ({{String}}).
> 1. A {{match may not be exhaustive}} warning is reported at a {{match}} expression.
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
> [error] [warn] 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570653#comment-16570653
 ] 

Apache Spark commented on SPARK-25036:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22012

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> When compiling with sbt, the following errors occur:
> There are two types:
> 1. {{ExprValue.isNull}} is compared with an unexpected type ({{String}}).
> 1. A {{match may not be exhaustive}} warning is reported at a {{match}} expression.
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
> [error] [warn] 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core

2018-08-06 Thread Yin Huai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-25019.
--
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.4.0

[https://github.com/apache/spark/pull/22003] has been merged.

> The published spark sql pom does not exclude the normal version of orc-core 
> 
>
> Key: SPARK-25019
> URL: https://issues.apache.org/jira/browse/SPARK-25019
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.4.0
>Reporter: Yin Huai
>Assignee: Dongjoon Hyun
>Priority: Critical
> Fix For: 2.4.0
>
>
> I noticed that 
> [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.4.0-SNAPSHOT/spark-sql_2.11-2.4.0-20180803.100335-189.pom]
>  does not exclude the normal version of orc-core. Comparing with 
> [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/sql/core/pom.xml#L108]
>  and 
> [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/pom.xml#L1767],
>  we only exclude the normal version of orc-core in the parent pom. So, the 
> problem is that if a developer depends on spark-sql-core directly, orc-core 
> and orc-core-nohive will be in the dependency list. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24992) spark should randomize yarn local dir selection

2018-08-06 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-24992.
---
   Resolution: Fixed
 Assignee: Hieu Tri Huynh
Fix Version/s: 2.4.0

> spark should randomize yarn local dir selection
> ---
>
> Key: SPARK-24992
> URL: https://issues.apache.org/jira/browse/SPARK-24992
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Hieu Tri Huynh
>Assignee: Hieu Tri Huynh
>Priority: Minor
> Fix For: 2.4.0
>
>
> Utils.getLocalDir is used to get the path of a temporary directory. However, it 
> always returns the same directory, which is the first element in the 
> array _localRootDirs_. When running on YARN, this means we may always write to 
> one disk, keeping it busy while the other disks sit idle. 
> We should randomize the selection to spread out the load. 
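
A minimal, editor-added sketch of the randomized selection being described; this 
is illustrative only, not the actual change to Utils.getLocalDir:

{code:scala}
import java.util.concurrent.ThreadLocalRandom

object LocalDirSketch {
  // Pick a random configured local root dir instead of always the first one,
  // so writes spread across the available disks.
  def pickLocalDir(localRootDirs: Array[String]): String = {
    require(localRootDirs.nonEmpty, "no local directories configured")
    localRootDirs(ThreadLocalRandom.current().nextInt(localRootDirs.length))
  }
}
{code}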



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-06 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25036:


 Summary: Scala 2.12 issues: Compilation error with sbt
 Key: SPARK-25036
 URL: https://issues.apache.org/jira/browse/SPARK-25036
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.4.0
Reporter: Kazuaki Ishizaki


When compiling with sbt, the following errors occur:

There are two types:
1. {{ExprValue.isNull}} is compared with an unexpected type ({{String}}).
1. A {{match may not be exhaustive}} warning is reported at a {{match}} expression.

The first one is more serious since it may also generate incorrect code in 
Spark 2.3.

{code}
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
 match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), 
(_, NumericValueInterval(_, _)), (_, _)
[error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
Boolean = (r1, r2) match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
 match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), 
(_, NumericValueInterval(_, _)), (_, _)
[error] [warn] (r1, r2) match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
 match may not be exhaustive.
[error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
ArrayData()), (_, _)
[error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
 match may not be exhaustive.
[error] It would fail on the following inputs: NewFunctionSpec(_, None, 
Some(_)), NewFunctionSpec(_, Some(_), None)
[error] [warn] newFunction match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely always compare unequal
[error] [warn] if (eval.isNull != "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn]  if (eval.isNull == "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn] if (eval.isNull == "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
 match may not be exhaustive.
[error] It would fail on the following input: Schema((x: 
org.apache.spark.sql.types.DataType forSome x not in 
org.apache.spark.sql.types.StructType), _)
[error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
[error] [warn] 
{code}
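
Editor's note: a compact sketch of the two warning categories above, using 
simplified stand-in types rather than the actual Catalyst classes. The fixes 
shown (a catch-all case, and comparing the underlying string) are the generic 
remedies, not necessarily what the eventual PR does.

{code:scala}
object Scala212WarningSketch {

  sealed trait ValueInterval
  case class NumericValueInterval(min: Double, max: Double) extends ValueInterval
  case object EmptyInterval extends ValueInterval

  // 1. "match may not be exhaustive": matching on a pair and listing only some
  //    combinations triggers the warning under 2.12; the trailing catch-all case
  //    added here makes the match total.
  def isIntersected(r1: ValueInterval, r2: ValueInterval): Boolean = (r1, r2) match {
    case (NumericValueInterval(min1, max1), NumericValueInterval(min2, max2)) =>
      min1 <= max2 && min2 <= max1
    case _ => false
  }

  // 2. "ExprValue and String are unrelated": comparing a wrapper type against a
  //    raw string literal can never be true; compare the wrapped string instead.
  final case class ExprValue(code: String)
  def isNeverNull(isNull: ExprValue): Boolean = isNull.code == "false"
}
{code}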



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release

2018-08-06 Thread Sushant Pritmani (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushant Pritmani updated SPARK-24229:
-
Comment: was deleted

(was: Can we have the latest version of 'libfb303' also? The current version 
of `libfb303` uses an old version (0.9.3) of 'libthrift', which has the same 
vulnerability mentioned above. Thank you.)

> Upgrade to the latest Apache Thrift 0.10.0 release
> --
>
> Key: SPARK-24229
> URL: https://issues.apache.org/jira/browse/SPARK-24229
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ray Donnelly
>Priority: Critical
>
> According to [https://www.cvedetails.com/cve/CVE-2016-5397/]
>  
> .. there are critical vulnerabilities in libthrift 0.9.3 currently vendored 
> in Apache Spark (and then, for us, into PySpark).
>  
> Can anyone help to assess the seriousness of this and what should be done 
> about it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release

2018-08-06 Thread Sushant Pritmani (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570634#comment-16570634
 ] 

Sushant Pritmani commented on SPARK-24229:
--

Can we have the latest version of 'libfb303' also? The current version of 
`libfb303` uses an old version (0.9.3) of 'libthrift', which has the same 
vulnerability mentioned above. Thank you.

> Upgrade to the latest Apache Thrift 0.10.0 release
> --
>
> Key: SPARK-24229
> URL: https://issues.apache.org/jira/browse/SPARK-24229
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ray Donnelly
>Priority: Critical
>
> According to [https://www.cvedetails.com/cve/CVE-2016-5397/]
>  
> .. there are critical vulnerabilities in libthrift 0.9.3 currently vendored 
> in Apache Spark (and then, for us, into PySpark).
>  
> Can anyone help to assess the seriousness of this and what should be done 
> about it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570638#comment-16570638
 ] 

Thomas Graves commented on SPARK-24924:
---

So, something I just thought of that I want to clarify: is this format name 
explicitly stored and used anywhere, say in tables that were created with it? For 
instance, let's say I'm using the Databricks Avro format and I create a table with 
it and save it out. Can I read that table fine with the new built-in Avro support 
without this mapping?

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims at the following.
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.
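
Editor's illustration of what the mapping gives users, which bears on the question 
above (assuming the alias behaves as this ticket describes): both provider names 
resolve to the same built-in Avro source, so code and tables that reference the old 
Databricks package name keep working. The path below is hypothetical.

{code:scala}
import org.apache.spark.sql.SparkSession

object AvroAliasSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("avro-alias").getOrCreate()

    // Built-in short name.
    val a = spark.read.format("avro").load("/tmp/events.avro")
    // Legacy provider name, mapped to the same built-in implementation.
    val b = spark.read.format("com.databricks.spark.avro").load("/tmp/events.avro")

    assert(a.schema == b.schema)
    spark.stop()
  }
}
{code}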



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570617#comment-16570617
 ] 

Dongjoon Hyun edited comment on SPARK-24924 at 8/6/18 6:39 PM:
---

Thank you for confirming and giving the right direction for this, [~tgraves]. 
It sounds like a consistent and clear policy for Apache Spark. +1 for moving 
forward in that direction by reverting the commits of this JIRA.


was (Author: dongjoon):
Thank you for confirming and giving the right direction for this, [~tgraves]. 
It must be a consistent and clear policy for Apache Spark. +1 for moving 
forward in that direction by reverting the commits of this JIRA.

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims at the following.
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570617#comment-16570617
 ] 

Dongjoon Hyun commented on SPARK-24924:
---

Thank you for confirming and giving the right direction for this, [~tgraves]. 
It must be a consistent and clear policy for Apache Spark. +1 for moving 
forward in that direction by reverting the commits of this JIRA.

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims at the following.
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-06 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570586#comment-16570586
 ] 

Thomas Graves commented on SPARK-24924:
---

For compatibility we can't remove it except at a major version, so my vote would be 
to remove it in 3.0.

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims at the following.
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24822) Python support for barrier execution mode

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24822:


Assignee: Apache Spark

> Python support for barrier execution mode
> -
>
> Key: SPARK-24822
> URL: https://issues.apache.org/jira/browse/SPARK-24822
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Major
>
> Enable launching a job containing barrier stage(s) from PySpark.
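
Editor's note: the Scala-side barrier API that this ticket asks to expose from 
PySpark (RDD.barrier() plus BarrierTaskContext) is landing in 2.4. A rough sketch 
of that existing surface, for context; the eventual PySpark API shape is up to the 
linked work.

{code:scala}
import org.apache.spark.{BarrierTaskContext, SparkConf, SparkContext}

object BarrierSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("barrier-sketch"))

    val doubled = sc.parallelize(1 to 8, numSlices = 4)
      .barrier()                           // all tasks of this stage launch together
      .mapPartitions { it =>
        val ctx = BarrierTaskContext.get()
        ctx.barrier()                      // global synchronization point
        it.map(_ * 2)
      }
      .collect()

    println(doubled.mkString(","))
    sc.stop()
  }
}
{code}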



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24822) Python support for barrier execution mode

2018-08-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24822:


Assignee: (was: Apache Spark)

> Python support for barrier execution mode
> -
>
> Key: SPARK-24822
> URL: https://issues.apache.org/jira/browse/SPARK-24822
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Enable launching a job containing barrier stage(s) from PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24822) Python support for barrier execution mode

2018-08-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570576#comment-16570576
 ] 

Apache Spark commented on SPARK-24822:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/22011

> Python support for barrier execution mode
> -
>
> Key: SPARK-24822
> URL: https://issues.apache.org/jira/browse/SPARK-24822
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Enable launching a job containing barrier stage(s) from PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


