[jira] [Comment Edited] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly

2019-05-06 Thread peng bo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834322#comment-16834322
 ] 

peng bo edited comment on SPARK-27638 at 5/7/19 5:47 AM:
-

[~maxgekk] 

I'd love to propose a PR for this. However, I am in the middle of something; I
will try to do it by the end of this week, if that's convenient for you as well.

Besides, what's your suggestion for corner cases like `date_col >
'invalid_date_string'` mentioned by [~cloud_fan]? Switch back to string
comparison?

Thanks


was (Author: pengbo):
[~maxgekk] 

I'd love to propose a PR for this. However, I am in the middle of something; I
will try to do it by the end of this week, if that's convenient for you as well.

> date format yyyy-M-dd string comparison not handled properly
> -
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The example below works with both MySQL and Hive, but not with Spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> +------------+
> | date_col   |
> +------------+
> | 2000-01-01 |
> +------------+
> {code}
> The reason is that Spark casts both sides to String type during date and
> string comparison, for partial date support. Please find more details in
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and
> MySQL is:
> Hive: casts to Date; partial dates are not supported.
> MySQL: casts to Date; certain "partial dates" are supported by defining
> specific date string parse rules. Check out {{str_to_datetime}} in
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here are two proposals:
> a. Follow the MySQL parse rules, but some partial date string comparison cases
> still won't be supported.
> b. Cast the String value to Date; if the cast succeeds, use date.toString for
> the comparison, otherwise keep the original string.






[jira] [Created] (SPARK-27647) Metric Gauge not threadsafe

2019-05-06 Thread bettermouse (JIRA)
bettermouse created SPARK-27647:
---

 Summary: Metric Gauge not threadsafe
 Key: SPARK-27647
 URL: https://issues.apache.org/jira/browse/SPARK-27647
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.2
Reporter: bettermouse


When I read the DAGSchedulerSource class, I found that some Gauges may not be
thread-safe. For example:

  metricRegistry.register(MetricRegistry.name("stage", "failedStages"), new Gauge[Int] {
    override def getValue: Int = dagScheduler.failedStages.size
  })

getValue may be called from another thread (the metrics reporter), but the
failedStages field is not thread-safe. The runningStages and waitingStages
fields have the same problem.
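
A minimal sketch of one way to make such a gauge thread-safe, assuming a
dedicated atomic counter is kept in step with the scheduler's failedStages set
(illustrative only, not the actual DAGSchedulerSource code):

{code:scala}
import java.util.concurrent.atomic.AtomicInteger
import com.codahale.metrics.{Gauge, MetricRegistry}

// Illustrative sketch: the scheduler updates an atomic counter wherever it
// mutates failedStages, so the metrics reporter thread never reads the HashSet.
class SafeSchedulerSource(metricRegistry: MetricRegistry) {
  private val failedStageCount = new AtomicInteger(0)

  // Hypothetical hooks, invoked from the scheduler's event loop.
  def onStageFailed(): Unit = failedStageCount.incrementAndGet()
  def onFailedStageCleared(): Unit = failedStageCount.decrementAndGet()

  metricRegistry.register(MetricRegistry.name("stage", "failedStages"), new Gauge[Int] {
    // Reads an atomic value instead of sizing a non-thread-safe collection.
    override def getValue: Int = failedStageCount.get()
  })
}
{code}

Another option is to have getValue read the size while holding the same lock the
scheduler uses; either way, getValue must not observe the mutable collection
unsynchronized.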






[jira] [Commented] (SPARK-27643) Add supported Hive version list in doc

2019-05-06 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834333#comment-16834333
 ] 

Yuming Wang commented on SPARK-27643:
-

Do you mean {{spark.sql.hive.metastore.version}}: 
[http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore]?

> Add supported Hive version list in doc
> --
>
> Key: SPARK-27643
> URL: https://issues.apache.org/jira/browse/SPARK-27643
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.3.3, 2.4.2, 3.0.0
>Reporter: Zhichao  Zhang
>Priority: Minor
>
> Add a supported Hive version list for each Spark version to the docs.






[jira] [Updated] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards

2019-05-06 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24935:

Fix Version/s: 2.3.4

> Problem with Executing Hive UDF's from Spark 2.2 Onwards
> 
>
> Key: SPARK-24935
> URL: https://issues.apache.org/jira/browse/SPARK-24935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 2.3.4, 3.0.0, 2.4.3
>
>
> A user of the sketches library (https://github.com/DataSketches/sketches-hive)
> reported an issue with the HLL Sketch Hive UDAF that seems to be a bug in Spark
> or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For
> more details on the issue, you can refer to the discussion in the
> sketches-user list:
> [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ]
>  
> On further debugging, we figured out that from 2.2 onwards, Spark's Hive UDAF
> support provides partial aggregation and has removed the functionality that
> supported complete-mode aggregation (refer to
> https://issues.apache.org/jira/browse/SPARK-19060 and
> https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of the
> expected update method being called, the merge method is called here
> ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)]
> which throws the exception described in the forums above.






[jira] [Updated] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly

2019-05-06 Thread peng bo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peng bo updated SPARK-27638:

Description: 
The example below works with both MySQL and Hive, but not with Spark.

{code:java}
mysql> select * from date_test where date_col >= '2000-1-1';
+------------+
| date_col   |
+------------+
| 2000-01-01 |
+------------+
{code}

The reason is that Spark casts both sides to String type during date and string
comparison, for partial date support. Please find more details in
https://issues.apache.org/jira/browse/SPARK-8420.

Based on some tests, the behavior of Date and String comparison in Hive and
MySQL is:
Hive: casts to Date; partial dates are not supported.
MySQL: casts to Date; certain "partial dates" are supported by defining specific
date string parse rules. Check out {{str_to_datetime}} in
https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c

Here are two proposals:
a. Follow the MySQL parse rules, but some partial date string comparison cases
still won't be supported.
b. Cast the String value to Date; if the cast succeeds, use date.toString for
the comparison, otherwise keep the original string.
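
To make proposal (b) concrete, here is a minimal sketch of the intended
cast-then-fall-back behaviour (a hypothetical helper for illustration only, not
the actual analyzer change):

{code:scala}
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper for proposal (b): try to interpret the string side as a
// date; if parsing succeeds, compare against the normalized date.toString
// ("yyyy-MM-dd"), otherwise fall back to the original string.
val lenientDate = DateTimeFormatter.ofPattern("u-M-d")

def normalizeDateLiteral(s: String): String =
  Try(LocalDate.parse(s.trim, lenientDate))
    .map(_.toString)   // "2000-1-1"            -> "2000-01-01"
    .getOrElse(s)      // "invalid_date_string" -> unchanged
{code}

With this, {{date_col >= '2000-1-1'}} would effectively compare against
'2000-01-01', while an unparseable literal would keep today's string-comparison
behaviour.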


  was:
The below example works with both Mysql and Hive, however not with spark.

{code:java}
mysql> select * from date_test where date_col >= '2000-1-1';
++
| date_col   |
++
| 2000-01-01 |
++
{code}

The reason is that Spark casts both sides to String type during date and string 
comparison for partial date support. Please find more details in 
https://issues.apache.org/jira/browse/SPARK-8420.

Based on some tests, the behavior of Date and String comparison in Hive and 
Mysql:
Hive: Cast to Date, partial date is not supported
Spark: Cast to Date,  certain "partial date" is supported by defining certain 
date string parse rules. Check out {{str_to_datetime}} in 
https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c

Here's 2 proposals:
a. Follow Mysql parse rule, but some partial date string comparison cases won't 
be supported either. 
b. Cast String value to Date, if it passes use date.toString, original string 
otherwise.



> date format yyyy-M-dd string comparison not handled properly
> -
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Mysql: Cast to Date,  certain "partial date" is supported by defining certain 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here's 2 proposals:
> a. Follow Mysql parse rule, but some partial date string comparison cases 
> won't be supported either. 
> b. Cast String value to Date, if it passes use date.toString, original string 
> otherwise.






[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly

2019-05-06 Thread peng bo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834322#comment-16834322
 ] 

peng bo commented on SPARK-27638:
-

[~maxgekk] 

I'd love propose a PR for this. However i am in the middle of something, I will 
try to do it by the end of this week if that's convenient for you as well.

> date format yyyy-M-dd string comparison not handled properly
> -
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Spark: Cast to Date,  certain "partial date" is supported by defining certain 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here's 2 proposals:
> a. Follow Mysql parse rule, but some partial date string comparison cases 
> won't be supported either. 
> b. Cast String value to Date, if it passes use date.toString, original string 
> otherwise.






[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2019-05-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834276#comment-16834276
 ] 

Apache Spark commented on SPARK-18406:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/24542

> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7925
> 16/11/07 17:11:06 INFO Executor: Running task 0.1 in stage 2390.0 (TID 7925)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 41, boot = -536, init = 
> 576, finish = 1
> 16/11/07 17:11:06 INFO Executor: Finished task 1.0 in stage 2390.0 (TID 
> 7922). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Utils: Uncaught 

[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2019-05-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834277#comment-16834277
 ] 

Apache Spark commented on SPARK-18406:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/24542

> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7925
> 16/11/07 17:11:06 INFO Executor: Running task 0.1 in stage 2390.0 (TID 7925)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 41, boot = -536, init = 
> 576, finish = 1
> 16/11/07 17:11:06 INFO Executor: Finished task 1.0 in stage 2390.0 (TID 
> 7922). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Utils: Uncaught 

[jira] [Assigned] (SPARK-25139) PythonRunner#WriterThread released block after TaskRunner finally block which invoke BlockManager#releaseAllLocksForTask

2019-05-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25139:


Assignee: (was: Apache Spark)

> PythonRunner#WriterThread released block after TaskRunner finally block which 
>  invoke BlockManager#releaseAllLocksForTask
> -
>
> Key: SPARK-25139
> URL: https://issues.apache.org/jira/browse/SPARK-25139
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: DENG FEI
>Priority: Major
>
> We run PySpark Streaming on YARN; the executor dies with the following error:
> the task released its lock when it finished, but the Python writer thread had
> not actually released its lock yet.
> Normally the task just double-checks the lock, but here the ordering went wrong.
> The executor trace log is below:
>  18/08/17 13:52:20 Executor task launch worker for task 137 DEBUG 
> BlockManager: Getting local block input-0-1534485138800 18/08/17 13:52:20 
> Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 
> trying to acquire read lock for input-0-1534485138800 18/08/17 13:52:20 
> Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 
> acquired read lock for input-0-1534485138800 18/08/17 13:52:20 Executor task 
> launch worker for task 137 DEBUG BlockManager: Level for block 
> input-0-1534485138800 is StorageLevel(disk, memory, 1 replicas) 18/08/17 
> 13:52:20 Executor task launch worker for task 137 INFO BlockManager: Found 
> block input-0-1534485138800 locally 18/08/17 13:52:20 Executor task launch 
> worker for task 137 INFO PythonRunner: Times: total = 8, boot = 3, init = 5, 
> finish = 0 18/08/17 13:52:20 stdout writer for python TRACE BlockInfoManager: 
> Task 137 releasing lock for input-0-1534485138800 18/08/17 13:52:20 Executor 
> task launch worker for task 137 INFO Executor: 1 block locks were not 
> released by TID = 137: [input-0-1534485138800] 18/08/17 13:52:20 stdout 
> writer for python ERROR Utils: Uncaught exception in thread stdout writer for 
> python java.lang.AssertionError: assertion failed: Block 
> input-0-1534485138800 is not locked for reading at 
> scala.Predef$.assert(Predef.scala:170) at 
> org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299) 
> at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:769) 
> at 
> org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:540)
>  at 
> org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44)
>  at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:33) 
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:213)
>  at 
> org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407)
>  at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
>  18/08/17 13:52:20 stdout writer for python ERROR 
> SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout 
> writer for python,5,main]
>  
> I think we should wait for the WriterThread after Task#run.






[jira] [Assigned] (SPARK-25139) PythonRunner#WriterThread released block after TaskRunner finally block which invoke BlockManager#releaseAllLocksForTask

2019-05-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25139:


Assignee: Apache Spark

> PythonRunner#WriterThread released block after TaskRunner finally block which 
>  invoke BlockManager#releaseAllLocksForTask
> -
>
> Key: SPARK-25139
> URL: https://issues.apache.org/jira/browse/SPARK-25139
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: DENG FEI
>Assignee: Apache Spark
>Priority: Major
>
> We run PySpark Streaming on YARN; the executor dies with the following error:
> the task released its lock when it finished, but the Python writer thread had
> not actually released its lock yet.
> Normally the task just double-checks the lock, but here the ordering went wrong.
> The executor trace log is below:
>  18/08/17 13:52:20 Executor task launch worker for task 137 DEBUG 
> BlockManager: Getting local block input-0-1534485138800 18/08/17 13:52:20 
> Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 
> trying to acquire read lock for input-0-1534485138800 18/08/17 13:52:20 
> Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 
> acquired read lock for input-0-1534485138800 18/08/17 13:52:20 Executor task 
> launch worker for task 137 DEBUG BlockManager: Level for block 
> input-0-1534485138800 is StorageLevel(disk, memory, 1 replicas) 18/08/17 
> 13:52:20 Executor task launch worker for task 137 INFO BlockManager: Found 
> block input-0-1534485138800 locally 18/08/17 13:52:20 Executor task launch 
> worker for task 137 INFO PythonRunner: Times: total = 8, boot = 3, init = 5, 
> finish = 0 18/08/17 13:52:20 stdout writer for python TRACE BlockInfoManager: 
> Task 137 releasing lock for input-0-1534485138800 18/08/17 13:52:20 Executor 
> task launch worker for task 137 INFO Executor: 1 block locks were not 
> released by TID = 137: [input-0-1534485138800] 18/08/17 13:52:20 stdout 
> writer for python ERROR Utils: Uncaught exception in thread stdout writer for 
> python java.lang.AssertionError: assertion failed: Block 
> input-0-1534485138800 is not locked for reading at 
> scala.Predef$.assert(Predef.scala:170) at 
> org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299) 
> at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:769) 
> at 
> org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:540)
>  at 
> org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44)
>  at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:33) 
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:213)
>  at 
> org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407)
>  at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
>  18/08/17 13:52:20 stdout writer for python ERROR 
> SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout 
> writer for python,5,main]
>  
> I think we should wait for the WriterThread after Task#run.






[jira] [Commented] (SPARK-27466) LEAD function with 'ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING' causes exception in Spark

2019-05-06 Thread Bruce Robbins (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834258#comment-16834258
 ] 

Bruce Robbins commented on SPARK-27466:
---

This _seems_ to be intentional, according to SPARK-8641 ("Native Spark Window 
Functions"), which states:
{quote}LEAD and LAG are not aggregates. These expressions return the value of 
an expression a number of rows before (LAG) or ahead (LEAD) of the current row. 
These expression put a constraint on the Window frame in which they are 
executed: this can only be a Row frame with equal offsets.
{quote}
I guess it depends on what "equal offsets" means. Does it mean that the offsets 
specified in both PRECEDING and FOLLOWING need to match? Or that the offsets 
need to match the one associated with the LEAD or LAG function? Based on 
experience, it seems to be the latter (offsets need to match LEAD or LAG's 
offsets).

E.g.,
{noformat}
scala> sql("select c, b, a, lead(a, 1) over(partition by c order by a ROWS 
BETWEEN 1 following AND 1 following) as a_avg from windowtest").show(1000)
+---+---+---+-----+
|  c|  b|  a|a_avg|
+---+---+---+-----+
|  1|  2|  1|   11|
|  1| 12| 11|   21|
|  1| 22| 21|   31|
|  1| 32| 31|   41|
|  1| 42| 41| null|
|  6|  7|  6|   16|
|  6| 17| 16|   26|
|  6| 27| 26|   36|
...etc...
{noformat}
And also
{noformat}
scala> sql("select c, b, a, lead(a, 2) over(partition by c order by a ROWS 
BETWEEN 2 following AND 2 following) as a_avg from windowtest").show(1000)
+---+---+---+-----+
|  c|  b|  a|a_avg|
+---+---+---+-----+
|  1|  2|  1|   21|
|  1| 12| 11|   31|
|  1| 22| 21|   41|
|  1| 32| 31| null|
|  1| 42| 41| null|
|  6|  7|  6|   26|
|  6| 17| 16|   36|
|  6| 27| 26|   46|
...etc...
{noformat}
But not the following:
{noformat}
scala> sql("select c, b, a, lead(a, 1) over(partition by c order by a ROWS 
BETWEEN 2 following AND 2 following) as a_avg from windowtest").show(1000)
org.apache.spark.sql.AnalysisException: Window Frame 
specifiedwindowframe(RowFrame, 2, 2) must match the required frame 
specifiedwindowframe(RowFrame, 1, 1);
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:43)
{noformat}
I suppose [~hvanhovell] or [~yhuai] would understand better (having implemented 
the native versions of these functions).

> LEAD function with 'ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING' 
> causes exception in Spark
> ---
>
> Key: SPARK-27466
> URL: https://issues.apache.org/jira/browse/SPARK-27466
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
> Environment: Spark version 2.2.0.2.6.4.92-2
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
>Reporter: Zoltan
>Priority: Major
>
> *1. Create a table in Hive:*
>   
> {code:java}
>  CREATE TABLE tab1(
>    col1 varchar(1),
>    col2 varchar(1)
>   )
>  PARTITIONED BY (
>    col3 varchar(1)
>  )
>  LOCATION
>    'hdfs://server1/data/tab1'
> {code}
>  
>  *2. Query the Table in Spark:*
> *2.1: Simple query, no exception thrown:*
> {code:java}
> scala> spark.sql("SELECT * from schema1.tab1").show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> +----+----+----+
> {code}
> *2.2.: Query causing exception:*
> {code:java}
> scala> spark.sql("*SELECT (LEAD(col1) OVER ( PARTITION BY col3 ORDER BY col1 
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING*)) from 
> schema1.tab1")
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Window Frame ROWS BETWEEN UNBOUNDED 
> PRECEDING AND UNBOUNDED FOLLOWING must match the required frame ROWS BETWEEN 
> 1 FOLLOWING AND 1 FOLLOWING;
>    at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>    at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>    at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$30$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2219)
>    at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$30$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2215)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>    at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>    at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>    at 
> 

[jira] [Assigned] (SPARK-27646) Required refactoring for bytecode analysis

2019-05-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27646:


Assignee: (was: Apache Spark)

> Required refactoring for bytecode analysis
> --
>
> Key: SPARK-27646
> URL: https://issues.apache.org/jira/browse/SPARK-27646
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: DB Tsai
>Priority: Major
>







[jira] [Assigned] (SPARK-27646) Required refactoring for bytecode analysis

2019-05-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27646:


Assignee: Apache Spark

> Required refactoring for bytecode analysis
> --
>
> Key: SPARK-27646
> URL: https://issues.apache.org/jira/browse/SPARK-27646
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Updated] (SPARK-27646) Required refactoring for bytecode analysis

2019-05-06 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-27646:

Summary: Required refactoring for bytecode analysis  (was: Required 
refactoring for bytecode analysis work)

> Required refactoring for bytecode analysis
> --
>
> Key: SPARK-27646
> URL: https://issues.apache.org/jira/browse/SPARK-27646
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: DB Tsai
>Priority: Major
>







[jira] [Created] (SPARK-27646) Required refactoring for bytecode analysis work

2019-05-06 Thread DB Tsai (JIRA)
DB Tsai created SPARK-27646:
---

 Summary: Required refactoring for bytecode analysis work
 Key: SPARK-27646
 URL: https://issues.apache.org/jira/browse/SPARK-27646
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.2
Reporter: DB Tsai









[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly

2019-05-06 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834214#comment-16834214
 ] 

Maxim Gekk commented on SPARK-27638:


[~pengbo] Are you going to propose a PR for that? If not, I can fix the issue.

> date format yyyy-M-dd string comparison not handled properly
> -
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Spark: Cast to Date,  certain "partial date" is supported by defining certain 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here's 2 proposals:
> a. Follow Mysql parse rule, but some partial date string comparison cases 
> won't be supported either. 
> b. Cast String value to Date, if it passes use date.toString, original string 
> otherwise.






[jira] [Commented] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors

2019-05-06 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834203#comment-16834203
 ] 

Josh Rosen commented on SPARK-26555:


I won't be able to tackle a backport for at least a week, so this is up for 
grabs in case someone else wants to do it.

If I do end up working on this then I'll loop back here to claim it.

> Thread safety issue causes createDataset to fail with misleading errors
> ---
>
> Key: SPARK-26555
> URL: https://issues.apache.org/jira/browse/SPARK-26555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Major
> Fix For: 3.0.0
>
>
> This can be replicated (~2% of the time) with
> {code:scala}
> import java.sql.Timestamp
> import java.util.concurrent.{Executors, Future}
> import org.apache.spark.sql.SparkSession
> import scala.collection.mutable.ListBuffer
> import scala.concurrent.ExecutionContext
> import scala.util.Random
> object Main {
>   def main(args: Array[String]): Unit = {
> val sparkSession = SparkSession.builder
>   .getOrCreate()
> import sparkSession.implicits._
> val executor = Executors.newFixedThreadPool(1)
> try {
>   implicit val xc: ExecutionContext = 
> ExecutionContext.fromExecutorService(executor)
>   val futures = new ListBuffer[Future[_]]()
>   for (i <- 1 to 3) {
> futures += executor.submit(new Runnable {
>   override def run(): Unit = {
> val d = if (Random.nextInt(2) == 0) Some("d value") else None
> val e = if (Random.nextInt(2) == 0) Some(5.0) else None
> val f = if (Random.nextInt(2) == 0) Some(6.0) else None
> println("DEBUG", d, e, f)
> sparkSession.createDataset(Seq(
>   MyClass(new Timestamp(1L), "b", "c", d, e, f)
> ))
>   }
> })
>   }
>   futures.foreach(_.get())
> } finally {
>   println("SHUTDOWN")
>   executor.shutdown()
>   sparkSession.stop()
> }
>   }
>   case class MyClass(
> a: Timestamp,
> b: String,
> c: String,
> d: Option[String],
> e: Option[Double],
> f: Option[Double]
>   )
> }
> {code}
> So it will usually come up during
> {code:bash}
> for i in $(seq 1 200); do
>   echo $i
>   spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar
> done
> {code}
> causing a variety of possible errors, such as
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code}
> or
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code}






[jira] [Created] (SPARK-27645) Cache result of count function to that RDD

2019-05-06 Thread Seungmin Lee (JIRA)
Seungmin Lee created SPARK-27645:


 Summary: Cache result of count function to that RDD
 Key: SPARK-27645
 URL: https://issues.apache.org/jira/browse/SPARK-27645
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Seungmin Lee


I'm not sure whether there has been an update on this (as far as I know, no such
feature exists). Since an RDD is immutable, why don't we keep the result of the
count function for that RDD and reuse it in future calls?

Sometimes we only have the RDD variable and don't have a previously computed
count result.

In that case, not running the whole count action over the entire dataset again
would be very beneficial for performance.
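
A user-side memoization sketch along these lines (a hypothetical helper, not an
existing Spark API) could look like this:

{code:scala}
import org.apache.spark.rdd.RDD
import scala.collection.concurrent.TrieMap

// Hypothetical workaround: memoize count() per RDD id, relying on the fact that
// an RDD's contents do not change once defined. Not a Spark feature.
object CountCache {
  private val counts = TrieMap.empty[Int, Long]

  def countOnce[T](rdd: RDD[T]): Long =
    counts.getOrElseUpdate(rdd.id, rdd.count())
}

// val n1 = CountCache.countOnce(rdd)  // runs the count job
// val n2 = CountCache.countOnce(rdd)  // served from the cache, no new job
{code}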






[jira] [Updated] (SPARK-27644) Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default

2019-05-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27644:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-25556

> Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default
> -
>
> Key: SPARK-27644
> URL: https://issues.apache.org/jira/browse/SPARK-27644
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> We can enable this after resolving all on-going issues and finishing more 
> verifications.






[jira] [Updated] (SPARK-27622) Avoid network communication when block manager fetches disk persisted RDD blocks from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoid network communication when block manager fetches disk 
persisted RDD blocks from the same host  (was: Avoid network communication when 
block manger fetches from the same host)

> Avoid network communication when block manager fetches disk persisted RDD 
> blocks from the same host
> ---
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.






[jira] [Updated] (SPARK-27622) Avoid the network when block manager fetches disk persisted RDD blocks from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoid the network when block manager fetches disk persisted RDD 
blocks from the same host  (was: Avoid network communication when block manager 
fetches disk persisted RDD blocks from the same host)

> Avoid the network when block manager fetches disk persisted RDD blocks from 
> the same host
> -
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.






[jira] [Created] (SPARK-27644) Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default

2019-05-06 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-27644:
-

 Summary: Enable spark.sql.optimizer.nestedSchemaPruning.enabled by 
default
 Key: SPARK-27644
 URL: https://issues.apache.org/jira/browse/SPARK-27644
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun


We can enable this after resolving all on-going issues and finishing more 
verifications.
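
Until the default changes, the flag can be turned on explicitly (illustrative
snippet, assuming a {{spark}} session as in spark-shell):

{code:scala}
// Opt in manually while the default is still off.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// or at submit time:
//   spark-submit --conf spark.sql.optimizer.nestedSchemaPruning.enabled=true ...
{code}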






[jira] [Comment Edited] (SPARK-27622) Avoid network communication when block manger fetches from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831553#comment-16831553
 ] 

Attila Zsolt Piros edited comment on SPARK-27622 at 5/6/19 6:22 PM:


I am already working on this. There is already a working prototype for RDD 
blocks.


was (Author: attilapiros):
I am already working on this. A working prototype for RDD blocks are ready and 
working.

> Avoid network communication when block manger fetches from the same host
> 
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.






[jira] [Commented] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors

2019-05-06 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834055#comment-16834055
 ] 

Sean Owen commented on SPARK-26555:
---

I personally think it's OK to backport -- do you want to open a PR and go for 
it?

> Thread safety issue causes createDataset to fail with misleading errors
> ---
>
> Key: SPARK-26555
> URL: https://issues.apache.org/jira/browse/SPARK-26555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Major
> Fix For: 3.0.0
>
>
> This can be replicated (~2% of the time) with
> {code:scala}
> import java.sql.Timestamp
> import java.util.concurrent.{Executors, Future}
> import org.apache.spark.sql.SparkSession
> import scala.collection.mutable.ListBuffer
> import scala.concurrent.ExecutionContext
> import scala.util.Random
> object Main {
>   def main(args: Array[String]): Unit = {
> val sparkSession = SparkSession.builder
>   .getOrCreate()
> import sparkSession.implicits._
> val executor = Executors.newFixedThreadPool(1)
> try {
>   implicit val xc: ExecutionContext = 
> ExecutionContext.fromExecutorService(executor)
>   val futures = new ListBuffer[Future[_]]()
>   for (i <- 1 to 3) {
> futures += executor.submit(new Runnable {
>   override def run(): Unit = {
> val d = if (Random.nextInt(2) == 0) Some("d value") else None
> val e = if (Random.nextInt(2) == 0) Some(5.0) else None
> val f = if (Random.nextInt(2) == 0) Some(6.0) else None
> println("DEBUG", d, e, f)
> sparkSession.createDataset(Seq(
>   MyClass(new Timestamp(1L), "b", "c", d, e, f)
> ))
>   }
> })
>   }
>   futures.foreach(_.get())
> } finally {
>   println("SHUTDOWN")
>   executor.shutdown()
>   sparkSession.stop()
> }
>   }
>   case class MyClass(
> a: Timestamp,
> b: String,
> c: String,
> d: Option[String],
> e: Option[Double],
> f: Option[Double]
>   )
> }
> {code}
> So it will usually come up during
> {code:bash}
> for i in $(seq 1 200); do
>   echo $i
>   spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar
> done
> {code}
> causing a variety of possible errors, such as
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code}
> or
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code}






[jira] [Commented] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors

2019-05-06 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834025#comment-16834025
 ] 

Josh Rosen commented on SPARK-26555:


[~cloud_fan] [~srowen], could we backport this to the 2.4.x series? It'd be 
nice to have an LTS fix for users who can't immediately upgrade to 3.0.

> Thread safety issue causes createDataset to fail with misleading errors
> ---
>
> Key: SPARK-26555
> URL: https://issues.apache.org/jira/browse/SPARK-26555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Major
> Fix For: 3.0.0
>
>
> This can be replicated (~2% of the time) with
> {code:scala}
> import java.sql.Timestamp
> import java.util.concurrent.{Executors, Future}
> import org.apache.spark.sql.SparkSession
> import scala.collection.mutable.ListBuffer
> import scala.concurrent.ExecutionContext
> import scala.util.Random
> object Main {
>   def main(args: Array[String]): Unit = {
> val sparkSession = SparkSession.builder
>   .getOrCreate()
> import sparkSession.implicits._
> val executor = Executors.newFixedThreadPool(1)
> try {
>   implicit val xc: ExecutionContext = 
> ExecutionContext.fromExecutorService(executor)
>   val futures = new ListBuffer[Future[_]]()
>   for (i <- 1 to 3) {
> futures += executor.submit(new Runnable {
>   override def run(): Unit = {
> val d = if (Random.nextInt(2) == 0) Some("d value") else None
> val e = if (Random.nextInt(2) == 0) Some(5.0) else None
> val f = if (Random.nextInt(2) == 0) Some(6.0) else None
> println("DEBUG", d, e, f)
> sparkSession.createDataset(Seq(
>   MyClass(new Timestamp(1L), "b", "c", d, e, f)
> ))
>   }
> })
>   }
>   futures.foreach(_.get())
> } finally {
>   println("SHUTDOWN")
>   executor.shutdown()
>   sparkSession.stop()
> }
>   }
>   case class MyClass(
> a: Timestamp,
> b: String,
> c: String,
> d: Option[String],
> e: Option[Double],
> f: Option[Double]
>   )
> }
> {code}
> So it will usually come up during
> {code:bash}
> for i in $(seq 1 200); do
>   echo $i
>   spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar
> done
> {code}
> causing a variety of possible errors, such as
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code}
> or
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code}






[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC

2019-05-06 Thread Darshan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834024#comment-16834024
 ] 

Darshan commented on SPARK-19335:
-

Because of a row-level access issue, our organisation only allows access to Kudu
tables via Impala, so we are connecting to Kudu through the Impala JDBC driver.
However, that leaves me with a constraint: I cannot use a DataFrame to upsert
data into a Kudu table. This feature would really help. Any updates on this?

> Spark should support doing an efficient DataFrame Upsert via JDBC
> -
>
> Key: SPARK-19335
> URL: https://issues.apache.org/jira/browse/SPARK-19335
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ilya Ganelin
>Priority: Minor
>
> Doing a database update, as opposed to an insert is useful, particularly when 
> working with streaming applications which may require revisions to previously 
> stored data. 
> Spark DataFrames/DataSets do not currently support an Update feature via the 
> JDBC Writer allowing only Overwrite or Append.
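
Until something like this lands, a common workaround is to drive the upsert by 
hand from foreachPartition. The sketch below is illustrative only: the target 
table `target(id, value)`, the MySQL "ON DUPLICATE KEY UPDATE" syntax, and the 
connection details are all assumptions, not Spark API.

{code:scala}
import java.sql.DriverManager
import org.apache.spark.sql.DataFrame

// Sketch of a manual upsert per partition; adjust the SQL to the target database.
def upsertViaJdbc(df: DataFrame, url: String, user: String, password: String): Unit = {
  df.rdd.foreachPartition { rows =>
    val conn = DriverManager.getConnection(url, user, password)
    val stmt = conn.prepareStatement(
      "INSERT INTO target (id, value) VALUES (?, ?) " +
        "ON DUPLICATE KEY UPDATE value = VALUES(value)")
    try {
      rows.foreach { row =>
        stmt.setInt(1, row.getAs[Int]("id"))
        stmt.setString(2, row.getAs[String]("value"))
        stmt.addBatch()
      }
      stmt.executeBatch()
    } finally {
      stmt.close()
      conn.close()
    }
  }
}
{code}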



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23299) __repr__ broken for Rows instantiated with *args

2019-05-06 Thread Holden Karau (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-23299.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Merged a fix for this for 3.0; we can continue the discussion around backporting 
and update the fix version if we do backport.

> __repr__ broken for Rows instantiated with *args
> 
>
> Key: SPARK-23299
> URL: https://issues.apache.org/jira/browse/SPARK-23299
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0, 2.2.0
> Environment: Tested on OS X with Spark 1.5.0 as well as pip-installed 
> `pyspark` 2.2.0. Code in question appears to still be in error on the master 
> branch of the GitHub repository.
>Reporter: Oli Hall
>Priority: Minor
> Fix For: 3.0.0
>
>
> PySpark Rows throw an exception if instantiated without column names when 
> `__repr__` is called. The most minimal reproducible example I've found is 
> this:
> {code:java}
> > from pyspark.sql.types import Row
> > Row(123)
> 
> /lib/python2.7/site-packages/pyspark/sql/types.pyc in 
> __repr__(self)
> -> 1524             return "<Row(%s)>" % ", ".join(self)
> TypeError: sequence item 0: expected string, int found{code}
> This appears to be due to the implementation of `__repr__`, which works 
> excellently for Rows created with column names, but for those without, 
> assumes all values are strings ([link 
> here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1584]).
> This should be an easy fix, if the values are mapped to `str` first, all 
> should be well (last line is the only modification):
> {code:java}
> def __repr__(self):
> """Printable representation of Row used in Python REPL."""
> if hasattr(self, "__fields__"):
> return "Row(%s)" % ", ".join("%s=%r" % (k, v)
>  for k, v in zip(self.__fields__, 
> tuple(self)))
> else:
> return "<Row(%s)>" % ", ".join(map(str, self))
> {code}
> This will yield the following:
> {code:java}
> > from pyspark.sql.types import Row
> > Row('aaa', 123)
> <Row(aaa, 123)>
> {code}
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly

2019-05-06 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833999#comment-16833999
 ] 

Wenchen Fan commented on SPARK-27638:
-

I think it should be changed. When comparing string and int, we cast string to 
int. When comparing string and date, I think it's reasonable to cast string to 
date. We also need to think about some corner cases like `date_col > 
'invalid_date_string'`.
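
For illustration, a spark-shell sketch of the corner case (assumption: to_date 
is used here only to emulate the proposed cast-to-date behaviour; it is not how 
the analyzer currently rewrites the comparison):

{code:scala}
import org.apache.spark.sql.functions._

val ds = spark.range(1).selectExpr("date '2000-01-01' as d")

// Valid partial date: casting the literal to date keeps the comparison on dates.
ds.where(col("d") >= to_date(lit("2000-1-1"))).show()              // returns the row

// Invalid date string: the cast yields null, so the predicate drops every row.
// The open question is whether to fall back to string comparison here instead.
ds.where(col("d") >= to_date(lit("invalid_date_string"))).show()   // returns nothing
{code}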

> date format yyyy-M-dd string comparison not handled properly 
> -
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Mysql: Cast to Date, certain "partial date" is supported by defining certain 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here's 2 proposals:
> a. Follow Mysql parse rule, but some partial date string comparison cases 
> won't be supported either. 
> b. Cast String value to Date, if it passes use date.toString, original string 
> otherwise.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly

2019-05-06 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833992#comment-16833992
 ] 

Sean Owen commented on SPARK-27638:
---

Are you saying that's intended behavior or should be changed?

> date format yyyy-M-dd string comparison not handled properly 
> -
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Mysql: Cast to Date, certain "partial date" is supported by defining certain 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here's 2 proposals:
> a. Follow Mysql parse rule, but some partial date string comparison cases 
> won't be supported either. 
> b. Cast String value to Date, if it passes use date.toString, original string 
> otherwise.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards

2019-05-06 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833982#comment-16833982
 ] 

Wenchen Fan commented on SPARK-24935:
-

I have sent https://github.com/apache/spark/pull/24539 to backport it.

> Problem with Executing Hive UDF's from Spark 2.2 Onwards
> 
>
> Key: SPARK-24935
> URL: https://issues.apache.org/jira/browse/SPARK-24935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 3.0.0, 2.4.3
>
>
> A user of sketches library(https://github.com/DataSketches/sketches-hive) 
> reported an issue with HLL Sketch Hive UDAF that seems to be a bug in Spark 
> or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For 
> more details on the issue, you can refer to the discussion in the 
> sketches-user list:
> [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ]
>  
> On further debugging, we figured out that from 2.2 onwards, Spark hive UDAF 
> provides support for partial aggregation, and has removed the functionality 
> that supported complete mode aggregation(Refer 
> https://issues.apache.org/jira/browse/SPARK-19060 and 
> https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of 
> expecting update method to be called, merge method is called here 
> ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)]
>  which throws the exception as described in the forums above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards

2019-05-06 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24935:

Fix Version/s: (was: 2.4.1)
   2.4.3

> Problem with Executing Hive UDF's from Spark 2.2 Onwards
> 
>
> Key: SPARK-24935
> URL: https://issues.apache.org/jira/browse/SPARK-24935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 3.0.0, 2.4.3
>
>
> A user of sketches library(https://github.com/DataSketches/sketches-hive) 
> reported an issue with HLL Sketch Hive UDAF that seems to be a bug in Spark 
> or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For 
> more details on the issue, you can refer to the discussion in the 
> sketches-user list:
> [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ]
>  
> On further debugging, we figured out that from 2.2 onwards, Spark hive UDAF 
> provides support for partial aggregation, and has removed the functionality 
> that supported complete mode aggregation(Refer 
> https://issues.apache.org/jira/browse/SPARK-19060 and 
> https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of 
> expecting update method to be called, merge method is called here 
> ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)]
>  which throws the exception as described in the forums above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly

2019-05-06 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833957#comment-16833957
 ] 

Maxim Gekk edited comment on SPARK-27638 at 5/6/19 4:10 PM:


It works with explicit to_date:
{code:scala}
scala> val ds = spark.range(1).selectExpr("date '2000-01-01' as d")
ds: org.apache.spark.sql.DataFrame = [d: date]

scala> ds.where("d >= to_date('2000-1-1')").show
+--+
| d|
+--+
|2000-01-01|
+--+
{code}
but without to_date, it compares strings:
{code}
scala> ds.where("d >= '2000-1-1'").explain(true)
== Parsed Logical Plan ==
'Filter ('d >= 2000-1-1)
+- Project [10957 AS d#51]
   +- Range (0, 1, step=1, splits=Some(8))

== Analyzed Logical Plan ==
d: date
Filter (cast(d#51 as string) >= 2000-1-1)
+- Project [10957 AS d#51]
   +- Range (0, 1, step=1, splits=Some(8))

== Optimized Logical Plan ==
LocalRelation <empty>, [d#51]

== Physical Plan ==
LocalTableScan <empty>, [d#51]
{code}

The same happens for '2000-01-01': the date column is cast to string. 


was (Author: maxgekk):
It works with explicit to_date:
{code:scala}
scala> val ds = spark.range(1).selectExpr("date '2000-01-01' as d")
ds: org.apache.spark.sql.DataFrame = [d: date]

scala> ds.where("d >= to_date('2000-1-1')").show
+--+
| d|
+--+
|2000-01-01|
+--+
{code}
but without to_date, it compares strings:
{code}
scala> ds.where("d >= '2000-1-1'").explain(true)
== Parsed Logical Plan ==
'Filter ('d >= 2000-1-1)
+- Project [10957 AS d#51]
   +- Range (0, 1, step=1, splits=Some(8))

== Analyzed Logical Plan ==
d: date
Filter (cast(d#51 as string) >= 2000-1-1)
+- Project [10957 AS d#51]
   +- Range (0, 1, step=1, splits=Some(8))

== Optimized Logical Plan ==
LocalRelation <empty>, [d#51]

== Physical Plan ==
LocalTableScan <empty>, [d#51]
{code}

> date format yyyy-M-dd string comparison not handled properly 
> -
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Mysql: Cast to Date, certain "partial date" is supported by defining certain 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here's 2 proposals:
> a. Follow Mysql parse rule, but some partial date string comparison cases 
> won't be supported either. 
> b. Cast String value to Date, if it passes use date.toString, original string 
> otherwise.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly

2019-05-06 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833957#comment-16833957
 ] 

Maxim Gekk commented on SPARK-27638:


It works with explicit to_date:
{code:scala}
scala> val ds = spark.range(1).selectExpr("date '2000-01-01' as d")
ds: org.apache.spark.sql.DataFrame = [d: date]

scala> ds.where("d >= to_date('2000-1-1')").show
+--+
| d|
+--+
|2000-01-01|
+--+
{code}
but without to_date, it compares strings:
{code}
scala> ds.where("d >= '2000-1-1'").explain(true)
== Parsed Logical Plan ==
'Filter ('d >= 2000-1-1)
+- Project [10957 AS d#51]
   +- Range (0, 1, step=1, splits=Some(8))

== Analyzed Logical Plan ==
d: date
Filter (cast(d#51 as string) >= 2000-1-1)
+- Project [10957 AS d#51]
   +- Range (0, 1, step=1, splits=Some(8))

== Optimized Logical Plan ==
LocalRelation <empty>, [d#51]

== Physical Plan ==
LocalTableScan <empty>, [d#51]
{code}

> date format yyyy-M-dd string comparison not handled properly 
> -
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Mysql: Cast to Date, certain "partial date" is supported by defining certain 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here's 2 proposals:
> a. Follow Mysql parse rule, but some partial date string comparison cases 
> won't be supported either. 
> b. Cast String value to Date, if it passes use date.toString, original string 
> otherwise.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly

2019-05-06 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833949#comment-16833949
 ] 

Maxim Gekk edited comment on SPARK-27638 at 5/6/19 3:57 PM:


[~srowen] The date literal should be cast to the date type by 
[stringToDate|https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L376]
 which is able to parse the date by default, see the supported patterns:
{code}
`yyyy`
`yyyy-[m]m`
`yyyy-[m]m-[d]d`
`yyyy-[m]m-[d]d `
`yyyy-[m]m-[d]d *`
`yyyy-[m]m-[d]dT*`
{code}

 


was (Author: maxgekk):
[~srowen] The date literal should be casted to the date type by 
[stringToDate|[https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L376]]
 that is able to parse the date by default, see supported patterns:
{code}
`yyyy`
`yyyy-[m]m`
`yyyy-[m]m-[d]d`
`yyyy-[m]m-[d]d `
`yyyy-[m]m-[d]d *`
`yyyy-[m]m-[d]dT*`
{code}

 

> date format yyyy-M-dd string comparison not handled properly 
> -
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Mysql: Cast to Date, certain "partial date" is supported by defining certain 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here's 2 proposals:
> a. Follow Mysql parse rule, but some partial date string comparison cases 
> won't be supported either. 
> b. Cast String value to Date, if it passes use date.toString, original string 
> otherwise.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly

2019-05-06 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833949#comment-16833949
 ] 

Maxim Gekk commented on SPARK-27638:


[~srowen] The date literal should be casted to the date type by 
[stringToDate|[https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L376]]
 that is able to parse the date by default, see supported patterns:
{code}
`yyyy`
`yyyy-[m]m`
`yyyy-[m]m-[d]d`
`yyyy-[m]m-[d]d `
`yyyy-[m]m-[d]d *`
`yyyy-[m]m-[d]dT*`
{code}
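
A quick spark-shell check of those default patterns (illustrative only; it goes 
through the same stringToDate path via an explicit cast):

{code:scala}
// All three literals parse: a missing month or day defaults to 1.
spark.sql(
  "select cast('2000-1-1' as date), cast('2000-1' as date), cast('2000' as date)"
).show()
// Each column shows 2000-01-01.
{code}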

 

> date format yyyy-M-dd string comparison not handled properly 
> -
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Mysql: Cast to Date, certain "partial date" is supported by defining certain 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here's 2 proposals:
> a. Follow Mysql parse rule, but some partial date string comparison cases 
> won't be supported either. 
> b. Cast String value to Date, if it passes use date.toString, original string 
> otherwise.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27642) make v1 offset extends v2 offset

2019-05-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27642:


Assignee: Wenchen Fan  (was: Apache Spark)

> make v1 offset extends v2 offset
> 
>
> Key: SPARK-27642
> URL: https://issues.apache.org/jira/browse/SPARK-27642
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27642) make v1 offset extends v2 offset

2019-05-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27642:


Assignee: Apache Spark  (was: Wenchen Fan)

> make v1 offset extends v2 offset
> 
>
> Key: SPARK-27642
> URL: https://issues.apache.org/jira/browse/SPARK-27642
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes

2019-05-06 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833932#comment-16833932
 ] 

Sean Owen commented on SPARK-26839:
---

Right now it seems to be an issue with Datanucleus in Hive. If it gets updated, 
it causes other problems. I think the classloader part is OK at the moment, but 
that's not proven.

> on JDK11, IsolatedClientLoader must be able to load java.sql classes
> 
>
> Key: SPARK-26839
> URL: https://issues.apache.org/jira/browse/SPARK-26839
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> This might be very specific to my fork & a kind of weird system setup I'm 
> working on, I haven't completely confirmed yet, but I wanted to report it 
> anyway in case anybody else sees this.
> When I try to do anything which touches the metastore on java11, I 
> immediately get errors from IsolatedClientLoader that it can't load anything 
> in java.sql.  eg.
> {noformat}
> scala> spark.sql("show tables").show()
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> java/sql/SQLTransientException when creating Hive client using classpath: 
> file:/home/systest/jdk-11.0.2/, ...
> ...
> Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException
>   at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
> {noformat}
> After a bit of debugging, I also discovered that the {{rootClassLoader}} is 
> {{null}} in {{IsolatedClientLoader}}.  I think this would work if either 
> {{rootClassLoader}} could load those classes, or if {{isShared()}} was 
> changed to allow any class starting with "java."  (I'm not sure why it only 
> allows "java.lang" and "java.net" currently.)
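
For reference, the change suggested at the end of the description would look 
roughly like the sketch below. The method name follows IsolatedClientLoader's 
shared-class check, but the full list of shared prefixes in Spark is longer; 
only the "java." widening is the point here, the rest is an assumption.

{code:scala}
// Sketch only: treat every JDK "java." package (java.sql included) as shared,
// instead of just java.lang and java.net, so the isolated loader delegates
// them to the parent class loader.
def isSharedClass(name: String): Boolean =
  name.startsWith("java.") ||
  name.startsWith("scala.") ||
  name.startsWith("org.slf4j.") ||
  name.startsWith("org.apache.log4j.")
{code}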



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly

2019-05-06 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833931#comment-16833931
 ] 

Sean Owen commented on SPARK-27638:
---

CC [~maxgekk] but isn't the issue that your date isn't matching the default 
format yyyy-MM-dd? What about parsing the string with the format explicitly 
specified?

> date format yyyy-M-dd string comparison not handled properly 
> -
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Mysql: Cast to Date, certain "partial date" is supported by defining certain 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here's 2 proposals:
> a. Follow Mysql parse rule, but some partial date string comparison cases 
> won't be supported either. 
> b. Cast String value to Date, if it passes use date.toString, original string 
> otherwise.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27643) Add supported Hive version list in doc

2019-05-06 Thread Zhichao Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhichao  Zhang updated SPARK-27643:
---
Description: Add supported Hive version list for each spark version in doc. 
 (was: Add supported Hive version for each spark version in doc.)

> Add supported Hive version list in doc
> --
>
> Key: SPARK-27643
> URL: https://issues.apache.org/jira/browse/SPARK-27643
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.3.3, 2.4.2, 3.0.0
>Reporter: Zhichao  Zhang
>Priority: Minor
>
> Add supported Hive version list for each spark version in doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27643) Add supported Hive version in doc

2019-05-06 Thread Zhichao Zhang (JIRA)
Zhichao  Zhang created SPARK-27643:
--

 Summary: Add supported Hive version in doc
 Key: SPARK-27643
 URL: https://issues.apache.org/jira/browse/SPARK-27643
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 2.4.2, 2.3.3, 3.0.0
Reporter: Zhichao  Zhang


Add supported Hive version for each spark version in doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27643) Add supported Hive version list in doc

2019-05-06 Thread Zhichao Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhichao  Zhang updated SPARK-27643:
---
Summary: Add supported Hive version list in doc  (was: Add supported Hive 
version in doc)

> Add supported Hive version list in doc
> --
>
> Key: SPARK-27643
> URL: https://issues.apache.org/jira/browse/SPARK-27643
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.3.3, 2.4.2, 3.0.0
>Reporter: Zhichao  Zhang
>Priority: Minor
>
> Add supported Hive version for each spark version in doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27642) make v1 offset extends v2 offset

2019-05-06 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27642:
---

 Summary: make v1 offset extends v2 offset
 Key: SPARK-27642
 URL: https://issues.apache.org/jira/browse/SPARK-27642
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27641) Unregistering a single Metrics Source with no metrics leads to removing all the metrics from other sources with the same name

2019-05-06 Thread Sergey Zhemzhitsky (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-27641:
---
Summary: Unregistering a single Metrics Source with no metrics leads to 
removing all the metrics from other sources with the same name  (was: 
Unregistering a single Metrics Source with no metrics leads to removing all the 
from other sources with the same name)

> Unregistering a single Metrics Source with no metrics leads to removing all 
> the metrics from other sources with the same name
> -
>
> Key: SPARK-27641
> URL: https://issues.apache.org/jira/browse/SPARK-27641
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.3.3, 2.4.2
>Reporter: Sergey Zhemzhitsky
>Priority: Major
>
> Currently Spark allows registering multiple Metric Sources with the same 
> source name like the following
> {code:scala}
> val acc1 = sc.longAccumulator
> LongAccumulatorSource.register(sc, {"acc1" -> acc1})
> val acc2 = sc.longAccumulator
> LongAccumulatorSource.register(sc, {"acc2" -> acc2})
> {code}
> In that case there are two metric sources registered and both of these 
> sources have the same name - 
> [AccumulatorSource|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/source/AccumulatorSource.scala#L47]
> If you try to unregister the source with no accumulators and metrics 
> registered like the following
> {code:scala}
> SparkEnv.get.metricsSystem.removeSource(new LongAccumulatorSource)
> {code}
> ... then all the metrics for all the sources with the same name will be 
> unregistered because of the 
> [following|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L171]
>  snippet which removes all matching records which start with the 
> corresponding prefix which includes the source name, but does not include 
> metric name to be removed.
> {code:scala}
> def removeSource(source: Source) {
>   sources -= source
>   val regName = buildRegistryName(source)
>   registry.removeMatching((name: String, _: Metric) => 
> name.startsWith(regName))
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27641) Unregistering a single Metrics Source with no metrics leads to removing all the from other sources with the same name

2019-05-06 Thread Sergey Zhemzhitsky (JIRA)
Sergey Zhemzhitsky created SPARK-27641:
--

 Summary: Unregistering a single Metrics Source with no metrics 
leads to removing all the from other sources with the same name
 Key: SPARK-27641
 URL: https://issues.apache.org/jira/browse/SPARK-27641
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.2, 2.3.3, 2.2.3
Reporter: Sergey Zhemzhitsky


Currently Spark allows registering multiple Metric Sources with the same source 
name like the following

{code:scala}
val acc1 = sc.longAccumulator
LongAccumulatorSource.register(sc, {"acc1" -> acc1})

val acc2 = sc.longAccumulator
LongAccumulatorSource.register(sc, {"acc2" -> acc2})
{code}

In that case there are two metric sources registered and both of these sources 
have the same name - 
[AccumulatorSource|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/source/AccumulatorSource.scala#L47]

If you try to unregister the source with no accumulators and metrics registered 
like the following
{code:scala}
SparkEnv.get.metricsSystem.removeSource(new LongAccumulatorSource)
{code}
... then all the metrics for all the sources with the same name will be 
unregistered because of the 
[following|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L171]
 snippet which removes all matching records which start with the corresponding 
prefix which includes the source name, but does not include metric name to be 
removed.
{code:scala}
def removeSource(source: Source) {
  sources -= source
  val regName = buildRegistryName(source)
  registry.removeMatching((name: String, _: Metric) => name.startsWith(regName))
}
{code}
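
One possible direction for a fix, sketched against the snippet above. The 
assumption is that only the metric names registered by this particular Source 
instance should be dropped; this is not the actual patch.

{code:scala}
// Sketch: restrict removal to the metrics owned by this source instance,
// instead of every metric whose name merely starts with the shared prefix.
def removeSource(source: Source): Unit = {
  sources -= source
  val regName = buildRegistryName(source)
  val owned = source.metricRegistry.getNames   // names this instance registered
  registry.removeMatching((name: String, _: Metric) =>
    name.startsWith(regName) && owned.contains(name.stripPrefix(regName + ".")))
}
{code}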



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23887) update query progress

2019-05-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23887:


Assignee: (was: Apache Spark)

> update query progress
> -
>
> Key: SPARK-23887
> URL: https://issues.apache.org/jira/browse/SPARK-23887
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23887) update query progress

2019-05-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23887:


Assignee: Apache Spark

> update query progress
> -
>
> Key: SPARK-23887
> URL: https://issues.apache.org/jira/browse/SPARK-23887
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27579) remove BaseStreamingSource and BaseStreamingSink

2019-05-06 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27579.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24471
[https://github.com/apache/spark/pull/24471]

> remove BaseStreamingSource and BaseStreamingSink
> 
>
> Key: SPARK-27579
> URL: https://issues.apache.org/jira/browse/SPARK-27579
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes

2019-05-06 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833784#comment-16833784
 ] 

Mihaly Toth commented on SPARK-26839:
-

Hmm, sorry, I overlooked something. I only have NucleusException all over the 
test run. So I guess that needs to be resolved first. As I understand it, 
HIVE-17632 (especially the Datanucleus upgrade) is a dependency here.

> on JDK11, IsolatedClientLoader must be able to load java.sql classes
> 
>
> Key: SPARK-26839
> URL: https://issues.apache.org/jira/browse/SPARK-26839
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> This might be very specific to my fork & a kind of weird system setup I'm 
> working on, I haven't completely confirmed yet, but I wanted to report it 
> anyway in case anybody else sees this.
> When I try to do anything which touches the metastore on java11, I 
> immediately get errors from IsolatedClientLoader that it can't load anything 
> in java.sql.  eg.
> {noformat}
> scala> spark.sql("show tables").show()
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> java/sql/SQLTransientException when creating Hive client using classpath: 
> file:/home/systest/jdk-11.0.2/, ...
> ...
> Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException
>   at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
> {noformat}
> After a bit of debugging, I also discovered that the {{rootClassLoader}} is 
> {{null}} in {{IsolatedClientLoader}}.  I think this would work if either 
> {{rootClassLoader}} could load those classes, or if {{isShared()}} was 
> changed to allow any class starting with "java."  (I'm not sure why it only 
> allows "java.lang" and "java.net" currently.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes

2019-05-06 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833763#comment-16833763
 ] 

Mihaly Toth commented on SPARK-26839:
-

[~srowen], I was facing CNFE and I have a potential fix for it on my fork. When 
I reproduced it on master, the CNFE goes away with the change but the 
{{NucleusException: The java type java.lang.Long ... cant be mapped for this 
datastore.}} stays. The problem I saw is that in some cases {{HiveUtils}} 
assembles a jar list comprising only the application jar, and this same jar 
list is then used by {{IsolatedClientLoader}} as the source of the Hive classes.

Shall I submit my change as a PR directly here? I am not fully sure it matches 
the scope of this issue.

Regarding Datanucleus it may deserve a new subtask in SPARK-24417.

> on JDK11, IsolatedClientLoader must be able to load java.sql classes
> 
>
> Key: SPARK-26839
> URL: https://issues.apache.org/jira/browse/SPARK-26839
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> This might be very specific to my fork & a kind of weird system setup I'm 
> working on, I haven't completely confirmed yet, but I wanted to report it 
> anyway in case anybody else sees this.
> When I try to do anything which touches the metastore on java11, I 
> immediately get errors from IsolatedClientLoader that it can't load anything 
> in java.sql.  eg.
> {noformat}
> scala> spark.sql("show tables").show()
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> java/sql/SQLTransientException when creating Hive client using classpath: 
> file:/home/systest/jdk-11.0.2/, ...
> ...
> Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException
>   at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
> {noformat}
> After a bit of debugging, I also discovered that the {{rootClassLoader}} is 
> {{null}} in {{IsolatedClientLoader}}.  I think this would work if either 
> {{rootClassLoader}} could load those classes, or if {{isShared()}} was 
> changed to allow any class starting with "java."  (I'm not sure why it only 
> allows "java.lang" and "java.net" currently.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27634) deleteCheckpointOnStop should be configurable

2019-05-06 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833743#comment-16833743
 ] 

Gabor Somogyi edited comment on SPARK-27634 at 5/6/19 12:08 PM:


[~yuwang0...@gmail.com] I think in such a case one should use a temporary 
checkpoint location. In Spark 3.0 this can be force-deleted with 
"spark.sql.streaming.forceDeleteTempCheckpointLocation".


was (Author: gsomogyi):
[~yuwang0...@gmail.com] I think one should use temporary checkpoint location. 
In Spark 3.0 this can be force deleted with 
"spark.sql.streaming.forceDeleteTempCheckpointLocation".

> deleteCheckpointOnStop should be configurable
> -
>
> Key: SPARK-27634
> URL: https://issues.apache.org/jira/browse/SPARK-27634
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.2
>Reporter: Yu Wang
>Priority: Minor
> Attachments: SPARK-27634.patch
>
>
> we need to delete checkpoint file after running the stream application 
> multiple times, so deleteCheckpointOnStop should be configurable



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27634) deleteCheckpointOnStop should be configurable

2019-05-06 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833743#comment-16833743
 ] 

Gabor Somogyi commented on SPARK-27634:
---

[~yuwang0...@gmail.com] I think one should use a temporary checkpoint location. 
In Spark 3.0 this can be force-deleted with 
"spark.sql.streaming.forceDeleteTempCheckpointLocation".

> deleteCheckpointOnStop should be configurable
> -
>
> Key: SPARK-27634
> URL: https://issues.apache.org/jira/browse/SPARK-27634
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.2
>Reporter: Yu Wang
>Priority: Minor
> Attachments: SPARK-27634.patch
>
>
> we need to delete checkpoint file after running the stream application 
> multiple times, so deleteCheckpointOnStop should be configurable



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27640) Avoid duplicate lookups for datasource through provider

2019-05-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27640:


Assignee: (was: Apache Spark)

> Avoid duplicate lookups for datasource through provider
> ---
>
> Key: SPARK-27640
> URL: https://issues.apache.org/jira/browse/SPARK-27640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Priority: Minor
>
> Spark SQL uses code like the following to look up a datasource:
> {code:java}
> DataSource.lookupDataSource(source, sparkSession.sqlContext.conf){code}
> But there are some duplicate calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27640) Avoid duplicate lookups for datasource through provider

2019-05-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27640:


Assignee: Apache Spark

> Avoid duplicate lookups for datasource through provider
> ---
>
> Key: SPARK-27640
> URL: https://issues.apache.org/jira/browse/SPARK-27640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Minor
>
> Spark SQL uses code like the following to look up a datasource:
> {code:java}
> DataSource.lookupDataSource(source, sparkSession.sqlContext.conf){code}
> But there are some duplicate calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27640) Avoid duplicate lookups for datasource through provider

2019-05-06 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-27640:
---
Component/s: Structured Streaming

> Avoid duplicate lookups for datasource through provider
> ---
>
> Key: SPARK-27640
> URL: https://issues.apache.org/jira/browse/SPARK-27640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Priority: Minor
>
> Spark SQL uses code like the following to look up a datasource:
> {code:java}
> DataSource.lookupDataSource(source, sparkSession.sqlContext.conf){code}
> But there are some duplicate calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27640) Avoid duplicate lookups for datasource through provider

2019-05-06 Thread jiaan.geng (JIRA)
jiaan.geng created SPARK-27640:
--

 Summary: Avoid duplicate lookups for datasource through provider
 Key: SPARK-27640
 URL: https://issues.apache.org/jira/browse/SPARK-27640
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.0
Reporter: jiaan.geng


Spark SQL uses code like the following to look up a datasource:
{code:java}
DataSource.lookupDataSource(source, sparkSession.sqlContext.conf){code}
But there are some duplicate calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27622) Avoid network communication when block manger fetches from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoid network communication when block manger fetches from the 
same host  (was: Avoiding network communication when block manger fetching from 
the same host)

> Avoid network communication when block manger fetches from the same host
> 
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27622) Avoiding network communication when block manger fetching from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoiding network communication when block manger fetching from the 
same host  (was: Avoiding network communication when block mangers are running 
on the same host )

> Avoiding network communication when block manger fetching from the same host
> 
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27622) Avoiding network communication when block mangers are running on the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoiding network communication when block mangers are running on 
the same host   (was: Avoiding network communication when block mangers are 
running on the host )

> Avoiding network communication when block mangers are running on the same 
> host 
> ---
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

2019-05-06 Thread Jeffrey(Xilang) Yan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833632#comment-16833632
 ] 

Jeffrey(Xilang) Yan commented on SPARK-5594:


There is a bug in versions before 2.2.3/2.3.0.

If you hit "Failed to get broadcast" and the call stack goes through 
MapOutputTracker, try upgrading your Spark. The bug is that the driver removes 
the broadcast but still sends the broadcast id to the executor, in 
MapOutputTrackerMaster.getSerializedMapOutputStatuses. It has been fixed by 
https://issues.apache.org/jira/browse/SPARK-23243

 

 

> SparkException: Failed to get broadcast (TorrentBroadcast)
> --
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: John Sandiford
>Priority: Critical
>
> I am uncertain whether this is a bug, however I am getting the error below 
> when running on a cluster (works locally), and have no idea what is causing 
> it, or where to look for more information.
> Any help is appreciated.  Others appear to experience the same issue, but I 
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable, all 
> my other spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: 
> Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of 
> broadcast_6
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 
> of broadcast_6
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
> ... 11 more
> {noformat}
> Driver stacktrace:
> {noformat}
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> 

[jira] [Updated] (SPARK-27639) InMemoryTableScan should show the table name on UI

2019-05-06 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27639:

Description: 
The UI only shows "InMemoryTableScan" when scanning an InMemoryTable.
When there are many InMemoryTables, it is difficult to distinguish which one 
we are looking for. This PR shows the table name when scanning an 
InMemoryTable. 

!https://user-images.githubusercontent.com/5399861/57213799-7bccf100-701a-11e9-9872-d90b4a185dc6.png!

  was:

!image-2019-05-06-16-11-45-164.png!


> InMemoryTableScan should show the table name on UI
> --
>
> Key: SPARK-27639
> URL: https://issues.apache.org/jira/browse/SPARK-27639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> The UI only shows "InMemoryTableScan" when scanning an InMemoryTable.
> When there are many InMemoryTables, it is difficult to distinguish which one 
> we are looking for. This PR shows the table name when scanning an 
> InMemoryTable. 
> !https://user-images.githubusercontent.com/5399861/57213799-7bccf100-701a-11e9-9872-d90b4a185dc6.png!
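For illustration, here is a minimal sketch of how several identically labelled 
InMemoryTableScan nodes can appear on the SQL tab; the view names and the 
{{spark}} SparkSession are assumptions for this example only:

{code:scala}
// Minimal sketch (hypothetical view names), assuming a running SparkSession `spark`.
spark.range(10).toDF("id").createOrReplaceTempView("orders")
spark.range(10).toDF("id").createOrReplaceTempView("customers")

// Cache both views so that reading them produces InMemoryTableScan nodes.
spark.catalog.cacheTable("orders")
spark.catalog.cacheTable("customers")

// Today both scans in this plan are labelled only "InMemoryTableScan" on the UI;
// with the proposed change each node would also carry the cached table's name.
spark.table("orders").join(spark.table("customers"), "id").count()
{code}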



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly

2019-05-06 Thread peng bo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peng bo updated SPARK-27638:

Summary: date format yyyy-M-dd string comparison not handled properly   
(was: date format yyyy-M-dd comparison not handled properly )

> date format yyyy-M-dd string comparison not handled properly 
> -
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Mysql: Cast to Date; certain "partial date" forms are supported by defining specific 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here are 2 proposals:
> a. Follow the Mysql parse rules, but then some partial date string comparison cases 
> still won't be supported. 
> b. Cast the String value to Date; if the cast succeeds, compare using date.toString, 
> otherwise fall back to the original string.
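As a rough illustration of proposal b (outside of Spark's actual Cast machinery; 
the helper name and the lenient pattern are assumptions for this sketch), the 
string side of the comparison could be normalized like this:

{code:scala}
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeParseException}

// Hypothetical helper: try to interpret the literal as a date; if that works,
// use date.toString ("yyyy-MM-dd") for the comparison, otherwise keep the
// original string so existing string-comparison behavior is preserved.
def normalizeDateLiteral(s: String): String = {
  val lenient = DateTimeFormatter.ofPattern("yyyy-M-d") // accepts "2000-1-1" and "2000-01-01"
  try LocalDate.parse(s.trim, lenient).toString
  catch { case _: DateTimeParseException => s }
}

// date_col >= '2000-1-1' would effectively become date_col >= '2000-01-01'
assert(normalizeDateLiteral("2000-1-1") == "2000-01-01")
// a non-date literal falls back to plain string comparison
assert(normalizeDateLiteral("invalid_date_string") == "invalid_date_string")
{code}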



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27638) date format yyyy-M-dd comparison not handled properly

2019-05-06 Thread peng bo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peng bo updated SPARK-27638:

Summary: date format yyyy-M-dd comparison not handled properly   (was: date 
format yyyy-M-dd comparison isn't handled properly )

> date format yyyy-M-dd comparison not handled properly 
> --
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Mysql: Cast to Date; certain "partial date" forms are supported by defining specific 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here are 2 proposals:
> a. Follow the Mysql parse rules, but then some partial date string comparison cases 
> still won't be supported. 
> b. Cast the String value to Date; if the cast succeeds, compare using date.toString, 
> otherwise fall back to the original string.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27639) InMemoryTableScan should show the table name on UI

2019-05-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27639:


Assignee: (was: Apache Spark)

> InMemoryTableScan should show the table name on UI
> --
>
> Key: SPARK-27639
> URL: https://issues.apache.org/jira/browse/SPARK-27639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> !image-2019-05-06-16-11-45-164.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27639) InMemoryTableScan should show the table name on UI

2019-05-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27639:


Assignee: Apache Spark

> InMemoryTableScan should show the table name on UI
> --
>
> Key: SPARK-27639
> URL: https://issues.apache.org/jira/browse/SPARK-27639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> !image-2019-05-06-16-11-45-164.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27638) date format yyyy-M-dd comparison isn't handled properly

2019-05-06 Thread peng bo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833584#comment-16833584
 ] 

peng bo commented on SPARK-27638:
-

[~cloud_fan] [~srowen] Your opinion will be appreciated.

> date format yyyy-M-dd comparison isn't handled properly 
> 
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Mysql: Cast to Date; certain "partial date" forms are supported by defining specific 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here are 2 proposals:
> a. Follow the Mysql parse rules, but then some partial date string comparison cases 
> still won't be supported. 
> b. Cast the String value to Date; if the cast succeeds, compare using date.toString, 
> otherwise fall back to the original string.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27639) InMemoryTableScan should show the table name on UI

2019-05-06 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27639:
---

 Summary: InMemoryTableScan should show the table name on UI
 Key: SPARK-27639
 URL: https://issues.apache.org/jira/browse/SPARK-27639
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang



!image-2019-05-06-16-11-45-164.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27638) date format yyyy-M-dd comparison isn't handled properly

2019-05-06 Thread peng bo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peng bo updated SPARK-27638:

Description: 
The below example works with both Mysql and Hive, however not with spark.

{code:java}
mysql> select * from date_test where date_col >= '2000-1-1';
++
| date_col   |
++
| 2000-01-01 |
++
{code}

The reason is that Spark casts both sides to String type during date and string 
comparison for partial date support. Please find more details in 
https://issues.apache.org/jira/browse/SPARK-8420.

Based on some tests, the behavior of Date and String comparison in Hive and 
Mysql:
Hive: Cast to Date, partial date is not supported
Mysql: Cast to Date; certain "partial date" forms are supported by defining specific 
date string parse rules. Check out {{str_to_datetime}} in 
https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c

Here are 2 proposals:
a. Follow the Mysql parse rules, but then some partial date string comparison cases 
still won't be supported. 
b. Cast the String value to Date; if the cast succeeds, compare using date.toString, 
otherwise fall back to the original string.


  was:
The below example works with both Mysql and Hive, however not with spark.

{code:java}
mysql> select * from date_test where date_col >= '2000-1-1';
++
| date_col   |
++
| 2000-01-01 |
++
{code}

The reason is that Spark casts both sides to String type during date and string 
comparison for partial date support. Please find more details in 
https://issues.apache.org/jira/browse/SPARK-8420.

Based on some tests, the behavior of Date and String comparison in Hive and 
Mysql:
Hive: Cast to Date, partial date is not supported
Spark: Cast to Date, "partial date" is supported by defining certain date 
string parse rules. Check out {{str_to_datetime}} in 
https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c

Here's 2 proposals:
a. Follow Mysql parse rule, but some partial date string comparison cases won't 
be supported either. 
b. Cast String value to Date, if it passes use date.toString, original string 
otherwise.



> date format yyyy-M-dd comparison isn't handled properly 
> 
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Mysql: Cast to Date; certain "partial date" forms are supported by defining specific 
> date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here are 2 proposals:
> a. Follow the Mysql parse rules, but then some partial date string comparison cases 
> still won't be supported. 
> b. Cast the String value to Date; if the cast succeeds, compare using date.toString, 
> otherwise fall back to the original string.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27638) date format yyyy-M-dd comparison isn't handled properly

2019-05-06 Thread peng bo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peng bo updated SPARK-27638:

Description: 
The below example works with both Mysql and Hive, however not with spark.

{code:java}
mysql> select * from date_test where date_col >= '2000-1-1';
++
| date_col   |
++
| 2000-01-01 |
++
{code}

The reason is that Spark casts both sides to String type during date and string 
comparison for partial date support. Please find more details in 
https://issues.apache.org/jira/browse/SPARK-8420.

Based on some tests, the behavior of Date and String comparison in Hive and 
Mysql:
Hive: Cast to Date, partial date is not supported
Mysql: Cast to Date; "partial date" is supported by defining certain date 
string parse rules. Check out {{str_to_datetime}} in 
https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c

Here are 2 proposals:
a. Follow the Mysql parse rules, but then some partial date string comparison cases 
still won't be supported. 
b. Cast the String value to Date; if the cast succeeds, compare using date.toString, 
otherwise fall back to the original string.


  was:
The below example works with both Mysql and Hive, however not with spark.

{code:java}
mysql> select * from date_test where date_col >= '2000-1-1';
++
| date_col   |
++
| 2000-01-01 |
++
{code}

The reason is that Spark casts both sides to String type during date and string 
comparison for partial date support. Please find more details in 
https://issues.apache.org/jira/browse/SPARK-8420.

Based on some tests, the behavior of Date and String comparison in Hive and 
Mysql:
Hive: Cast to Date, partial date is not supported
Spark: Cast to Date, "partial date" is supported by defining certain date 
string parse rules. Check out {{str_to_datetime}} in 
https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c

Here's 2 proposals:
a. Follow Mysql parse rule, but some partial date string comparison cases 
wouldn't be supported as well 
b. Cast String value to date, if it passes use date.toString, original string 
otherwise.



> date format yyyy-M-dd comparison isn't handled properly 
> 
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The below example works with both Mysql and Hive, however not with spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> ++
> | date_col   |
> ++
> | 2000-01-01 |
> ++
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> Mysql:
> Hive: Cast to Date, partial date is not supported
> Mysql: Cast to Date; "partial date" is supported by defining certain date 
> string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here are 2 proposals:
> a. Follow the Mysql parse rules, but then some partial date string comparison cases 
> still won't be supported. 
> b. Cast the String value to Date; if the cast succeeds, compare using date.toString, 
> otherwise fall back to the original string.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27638) date format yyyy-M-dd comparison isn't handled properly

2019-05-06 Thread peng bo (JIRA)
peng bo created SPARK-27638:
---

 Summary: date format yyyy-M-dd comparison isn't handled properly 
 Key: SPARK-27638
 URL: https://issues.apache.org/jira/browse/SPARK-27638
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.2
Reporter: peng bo


The below example works with both Mysql and Hive, however not with spark.

{code:java}
mysql> select * from date_test where date_col >= '2000-1-1';
++
| date_col   |
++
| 2000-01-01 |
++
{code}

The reason is that Spark casts both sides to String type during date and string 
comparison for partial date support. Please find more details in 
https://issues.apache.org/jira/browse/SPARK-8420.

Based on some tests, the behavior of Date and String comparison in Hive and 
Mysql:
Hive: Cast to Date, partial date is not supported
Mysql: Cast to Date; "partial date" is supported by defining certain date 
string parse rules. Check out {{str_to_datetime}} in 
https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c

Here are 2 proposals:
a. Follow the Mysql parse rules, but then some partial date string comparison cases 
won't be supported either. 
b. Cast the String value to Date; if the cast succeeds, compare using date.toString, 
otherwise fall back to the original string.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27227) Spark Runtime Filter

2019-05-06 Thread Song Jun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833558#comment-16833558
 ] 

Song Jun edited comment on SPARK-27227 at 5/6/19 7:32 AM:
--

[~cloud_fan] [~smilegator] could you please help to review this SPIP? thanks 
very much!



was (Author: windpiger):
[~cloud_fan] [~LI,Xiao] could you please help to review this SPIP? thanks very 
much!


> Spark Runtime Filter
> 
>
> Key: SPARK-27227
> URL: https://issues.apache.org/jira/browse/SPARK-27227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Major
>
> When we equi-join one big table with a smaller table, we can collect some 
> statistics from the smaller table side and use them in the scan of the big 
> table to do partition pruning or data filtering before executing the join.
> This can significantly improve SQL performance.
> For a simple example:
> select * from A, B where A.a = B.b
> A is the big table, B is the small table.
> There are two scenarios:
> 1. A.a is a partition column of table A
>    we can collect all the values of B.b and send them to table A to do 
>    partition pruning on A.a.
> 2. A.a is not a partition column of table A
>   we can collect some statistics (such as min/max/bloomfilter) of B.b at 
> runtime by executing an extra query (select max(b),min(b),bbf(b) from B), and 
> send them to table A to do filtering on A.a.
>   Additionally, for a more complex query such as select * from A join (select 
> * from B where B.c = 1) X on A.a = B.b, we collect runtime statistics (such as 
> min/max/bloomfilter) of X by executing an extra query (select 
> max(b),min(b),bbf(b) from X).
> In both scenarios we can filter out lots of data by partition pruning or 
> data filtering, and thus improve performance.
> A 10TB TPC-DS run gained about a 35% improvement in our test.
> I will submit a SPIP later.
> SPIP: 
> https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt
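Purely as a hedged sketch (the table names {{A}}/{{B}} and the collected values 
below are illustrative, not the actual implementation), the intended rewrite for 
the two scenarios looks roughly like this:

{code:scala}
// Hedged sketch of the proposed runtime filter, assuming tables A (big) and B (small) exist.

// Original join: SELECT * FROM A, B WHERE A.a = B.b

// Scenario 1: A.a is a partition column of A.
// Collect the join keys from the small side first ...
val keys = spark.sql("SELECT DISTINCT b FROM B").collect().map(_.get(0)).mkString(", ")
// ... then prune A's partitions before executing the join, effectively:
val pruned = spark.sql(s"SELECT * FROM A WHERE a IN ($keys)")

// Scenario 2: A.a is not a partition column of A.
// Collect cheap statistics from B (min/max here; the proposal also mentions a bloom filter) ...
val row = spark.sql("SELECT MIN(b), MAX(b) FROM B").collect()(0)
// ... and apply them as a data filter on A's scan before the join, effectively:
val filtered = spark.sql(s"SELECT * FROM A WHERE a BETWEEN ${row.get(0)} AND ${row.get(1)}")
{code}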



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27227) Spark Runtime Filter

2019-05-06 Thread Song Jun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833558#comment-16833558
 ] 

Song Jun commented on SPARK-27227:
--

[~cloud_fan] [~LI,Xiao] could you please help to review this SPIP? thanks very 
much!


> Spark Runtime Filter
> 
>
> Key: SPARK-27227
> URL: https://issues.apache.org/jira/browse/SPARK-27227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Major
>
> When we equi-join one big table with a smaller table, we can collect some 
> statistics from the smaller table side and use them in the scan of the big 
> table to do partition pruning or data filtering before executing the join.
> This can significantly improve SQL performance.
> For a simple example:
> select * from A, B where A.a = B.b
> A is the big table, B is the small table.
> There are two scenarios:
> 1. A.a is a partition column of table A
>    we can collect all the values of B.b and send them to table A to do 
>    partition pruning on A.a.
> 2. A.a is not a partition column of table A
>   we can collect some statistics (such as min/max/bloomfilter) of B.b at 
> runtime by executing an extra query (select max(b),min(b),bbf(b) from B), and 
> send them to table A to do filtering on A.a.
>   Additionally, for a more complex query such as select * from A join (select 
> * from B where B.c = 1) X on A.a = B.b, we collect runtime statistics (such as 
> min/max/bloomfilter) of X by executing an extra query (select 
> max(b),min(b),bbf(b) from X).
> In both scenarios we can filter out lots of data by partition pruning or 
> data filtering, and thus improve performance.
> A 10TB TPC-DS run gained about a 35% improvement in our test.
> I will submit a SPIP later.
> SPIP: 
> https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27227) Spark Runtime Filter

2019-05-06 Thread Song Jun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-27227:
-
Description: 
When we equi-join one big table with a smaller table, we can collect some 
statistics from the smaller table side and use them in the scan of the big table 
to do partition pruning or data filtering before executing the join.
This can significantly improve SQL performance.

For a simple example:
select * from A, B where A.a = B.b
A is the big table, B is the small table.

There are two scenarios:
1. A.a is a partition column of table A
   we can collect all the values of B.b and send them to table A to do 
   partition pruning on A.a.
2. A.a is not a partition column of table A
  we can collect some statistics (such as min/max/bloomfilter) of B.b at runtime 
by executing an extra query (select max(b),min(b),bbf(b) from B), and send them 
to table A to do filtering on A.a.
  Additionally, for a more complex query such as select * from A join (select * 
from B where B.c = 1) X on A.a = B.b, we collect runtime statistics (such as 
min/max/bloomfilter) of X by executing an extra query (select max(b),min(b),bbf(b) 
from X).

In both scenarios we can filter out lots of data by partition pruning or data 
filtering, and thus improve performance.

A 10TB TPC-DS run gained about a 35% improvement in our test.

I will submit a SPIP later.

SPIP: 
https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt

  was:
When we equi-join one big table with a smaller table, we can collect some 
statistics from the smaller table side, and use it to the scan of big table to 
do partition prune or data filter before execute the join.
This can significantly improve SQL perfermance.

For a simple example:
select * from A, B where A.a = B.b
A is big table ,B is small table.

There are two scenarios:
1. A.a is a partition column of table A
   we can collect  all the values  of B.b, and send it to table A to do 
   partition prune on A.a.
2. A.a is not a partition column of table A
  we can collect real-time some statistics(such as min/max/bloomfilter) of B.b 
by execute extra sql(select max(b),min(b),bbf(b) from B), and send it to table 
A to do filter on A.a.
  Addititionaly, if a more complex query select * from A join (select * from B 
where B.c = 1) X on A.a = B.b, then we collect real-time statistics(such as 
min/max/bloomfilter) of X by execute extra sql(select max(b),min(b),bbf(b) from 
X)

Above two scenarios, we can filter out lots of data by partition prune or data 
filter, thus we can imporve perfermance.

10TB TPC-DS  gain about 35%  improvement in our test.

I will submit a SPIP later.


> Spark Runtime Filter
> 
>
> Key: SPARK-27227
> URL: https://issues.apache.org/jira/browse/SPARK-27227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Major
>
> When we equi-join one big table with a smaller table, we can collect some 
> statistics from the smaller table side and use them in the scan of the big 
> table to do partition pruning or data filtering before executing the join.
> This can significantly improve SQL performance.
> For a simple example:
> select * from A, B where A.a = B.b
> A is the big table, B is the small table.
> There are two scenarios:
> 1. A.a is a partition column of table A
>    we can collect all the values of B.b and send them to table A to do 
>    partition pruning on A.a.
> 2. A.a is not a partition column of table A
>   we can collect some statistics (such as min/max/bloomfilter) of B.b at 
> runtime by executing an extra query (select max(b),min(b),bbf(b) from B), and 
> send them to table A to do filtering on A.a.
>   Additionally, for a more complex query such as select * from A join (select 
> * from B where B.c = 1) X on A.a = B.b, we collect runtime statistics (such as 
> min/max/bloomfilter) of X by executing an extra query (select 
> max(b),min(b),bbf(b) from X).
> In both scenarios we can filter out lots of data by partition pruning or 
> data filtering, and thus improve performance.
> A 10TB TPC-DS run gained about a 35% improvement in our test.
> I will submit a SPIP later.
> SPIP: 
> https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24641) Spark-Mesos integration doesn't respect request to abort itself

2019-05-06 Thread Igor Berman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833549#comment-16833549
 ] 

Igor Berman commented on SPARK-24641:
-

I believe this issue is connected to 
https://issues.apache.org/jira/browse/SPARK-15359, i.e. what I mentioned as 
"zombie" mode, which happens after the dispatcher gets aborted. Since it runs in a 
separate thread and nobody really notices that this thread has finished, the 
framework can become inactive (if aborted) or stopped (not sure exactly when).

The embedding Java app with the Spark context may continue to run.

> Spark-Mesos integration doesn't respect request to abort itself
> ---
>
> Key: SPARK-24641
> URL: https://issues.apache.org/jira/browse/SPARK-24641
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Shuffle
>Affects Versions: 2.2.0
>Reporter: Igor Berman
>Priority: Major
>
> Hi,
> Lately we came across the following corner scenario:
> We are using dynamic allocation with an external shuffle service that is 
> managed by Marathon.
>  
> Due to some network/operations issue, the external shuffle service on one of 
> the machines (mesos-slaves) is not available for a few seconds (e.g. Marathon 
> hasn't yet provisioned the external shuffle service on a particular node, but 
> the framework itself has already accepted an offer on this node and tries to 
> start up an executor).
>  
> This makes the framework (Spark driver) fail and I see an error in the stderr 
> of the driver (it seems the mesos-agent asks the driver to abort itself), 
> however the Spark context continues to run (seemingly in a kind of zombie mode, 
> since it can't release resources to the cluster and can't get additional offers 
> because the framework is aborted from the Mesos perspective).
>  
> The framework in the Mesos UI moves to the "inactive" state.
> [~skonto] [~susanxhuynh], any input on this problem? Have you come across such 
> behavior?
> I'm ready to work on a patch, but currently I don't understand where to 
> start; it seems the driver is too fragile in this sense and something in the 
> mesos-spark integration is missing.
>  
>  
> {code:java}
> I0412 07:31:25.827283   274 sched.cpp:759] Framework registered with 
> 15d9838f-b266-413b-842d-f7c3567bd04a-0051 Exception in thread "Thread-295" 
> java.io.IOException: Failed to connect tomy-company.com/10.106.14.61:7337     
>     at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
>          at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
>          at 
> org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75)
>          at 
> org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537)
>  Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
> Connection refused: my-company.com/10.106.14.61:7337         at 
> sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)         at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)        
>  at 
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
>          at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
>          at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)   
>       at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>          at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)  
>        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)        
>  at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>          at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>          at java.lang.Thread.run(Thread.java:748) I0412 07:35:12.032925   277 
> sched.cpp:2055] Asked to abort the driver I0412 07:35:12.033035   277 
> sched.cpp:1233] Aborting framework 15d9838f-b266-413b-842d-f7c3567bd04a-0051  
> {code}
>  
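For context, a minimal sketch of the kind of configuration described above (the 
values are illustrative; 7337 is the default external shuffle service port that 
the driver fails to reach in the log):

{code:scala}
// Minimal sketch of the reported setup; values are illustrative.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true") // executors come and go at runtime
  .set("spark.shuffle.service.enabled", "true")   // shuffle blocks served by the external service
  .set("spark.shuffle.service.port", "7337")      // the port the driver fails to reach above
{code}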



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27439) Explaining Dataset should show correct resolved plans

2019-05-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27439:
--
Summary: Explaining Dataset should show correct resolved plans  (was: Use 
analyzed plan when explaining Dataset)

> Explaining Dataset should show correct resolved plans
> --
>
> Key: SPARK-27439
> URL: https://issues.apache.org/jira/browse/SPARK-27439
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: xjl
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 3.0.0
>
>
> {code}
> scala> spark.range(10).createOrReplaceTempView("test")
> scala> spark.range(5).createOrReplaceTempView("test2")
> scala> spark.sql("select * from test").createOrReplaceTempView("tmp001")
> scala> val df = spark.sql("select * from tmp001")
> scala> spark.sql("select * from test2").createOrReplaceTempView("tmp001")
> scala> df.show
> +---+
> | id|
> +---+
> |  0|
> |  1|
> |  2|
> |  3|
> |  4|
> |  5|
> |  6|
> |  7|
> |  8|
> |  9|
> +---+
> scala> df.explain
> {code}
> Before:
> {code}
> == Physical Plan ==
> *(1) Range (0, 5, step=1, splits=12)
> {code}
> After:
> {code}
> == Physical Plan ==
> *(1) Range (0, 10, step=1, splits=12)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27439) Use analyzed plan when explaining Dataset

2019-05-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27439.
---
Resolution: Fixed
  Assignee: Liang-Chi Hsieh

This is resolved via https://github.com/apache/spark/pull/24464

> Use analyzed plan when explaining Dataset
> -
>
> Key: SPARK-27439
> URL: https://issues.apache.org/jira/browse/SPARK-27439
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: xjl
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 3.0.0
>
>
> {code}
> scala> spark.range(10).createOrReplaceTempView("test")
> scala> spark.range(5).createOrReplaceTempView("test2")
> scala> spark.sql("select * from test").createOrReplaceTempView("tmp001")
> scala> val df = spark.sql("select * from tmp001")
> scala> spark.sql("select * from test2").createOrReplaceTempView("tmp001")
> scala> df.show
> +---+
> | id|
> +---+
> |  0|
> |  1|
> |  2|
> |  3|
> |  4|
> |  5|
> |  6|
> |  7|
> |  8|
> |  9|
> +---+
> scala> df.explain
> {code}
> Before:
> {code}
> == Physical Plan ==
> *(1) Range (0, 5, step=1, splits=12)
> {code}
> After:
> {code}
> == Physical Plan ==
> *(1) Range (0, 10, step=1, splits=12)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org