[jira] [Comment Edited] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834322#comment-16834322 ] peng bo edited comment on SPARK-27638 at 5/7/19 5:47 AM: - [~maxgekk] I'd love to propose a PR for this. However I am in the middle of something; I will try to do it by the end of this week if that's convenient for you as well. Besides, what's your suggestion about corner cases like `date_col > 'invalid_date_string'` mentioned by [~cloud_fan]? Switch back to string comparison? Thanks was (Author: pengbo): [~maxgekk] I'd love to propose a PR for this. However I am in the middle of something; I will try to do it by the end of this week if that's convenient for you as well. > date format yyyy-M-dd string comparison not handled properly > - > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: peng bo >Priority: Major > > The below example works with both Mysql and Hive, however not with spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > ++ > | date_col | > ++ > | 2000-01-01 | > ++ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here's 2 proposals: > a. Follow Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast String value to Date, if it passes use date.toString, original string > otherwise.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
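Proposal (b) above — cast the string side to a date, use the normalized date string if the cast succeeds, and fall back to the original string otherwise — can be sketched as follows. This is an illustrative Python sketch only, not Spark's implementation; the helper names are hypothetical.

```python
from datetime import date

def normalize_date_literal(s: str):
    """Try to parse a (possibly partial) date literal such as '2000-1-1'.
    Returns a normalized 'yyyy-MM-dd' string on success, None on failure."""
    try:
        y, m, d = (int(part) for part in s.split("-"))
        return date(y, m, d).isoformat()
    except (ValueError, TypeError):
        return None

def compare_date_col(date_col: str, literal: str) -> bool:
    """Evaluate date_col >= literal per proposal (b): if the literal parses
    as a date, compare both sides as normalized 'yyyy-MM-dd' strings;
    otherwise fall back to raw string comparison (the corner case
    `date_col > 'invalid_date_string'` raised in the comments)."""
    normalized = normalize_date_literal(literal)
    if normalized is not None:
        return date_col >= normalized   # both sides now 'yyyy-MM-dd'
    return date_col >= literal          # invalid date string: plain comparison

print(compare_date_col("2000-01-01", "2000-1-1"))    # True: '2000-1-1' normalizes to '2000-01-01'
print(compare_date_col("2000-01-01", "not-a-date"))  # falls back to string comparison
```

With this rule, the MySQL example in the description (`date_col >= '2000-1-1'` matching `2000-01-01`) would behave the same way in Spark, while unparseable literals keep the old string semantics.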
[jira] [Created] (SPARK-27647) Metric Gauge not threadsafe
bettermouse created SPARK-27647: --- Summary: Metric Gauge not threadsafe Key: SPARK-27647 URL: https://issues.apache.org/jira/browse/SPARK-27647 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.2 Reporter: bettermouse When I read the class DAGSchedulerSource, I found that some Gauges may not be thread-safe, e.g. metricRegistry.register(MetricRegistry.name("stage", "failedStages"), new Gauge[Int] { override def getValue: Int = dagScheduler.failedStages.size }). This getValue method may be called from another thread, but the failedStages field is not thread-safe; the runningStages and waitingStages fields have the same problem.
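The shape of the problem above is a metrics-reporting thread calling getValue on a gauge while the scheduler thread mutates the underlying collection. A minimal sketch of one possible fix — guarding both the mutation and the read with the same lock so the reporter sees a consistent size — in Python (the class and method names here are hypothetical, not Spark's):

```python
import threading

class SafeSetGauge:
    """Gauge whose get_value may be called from a metrics-reporting thread
    while another thread mutates the underlying set. Taking one lock on
    both sides avoids reading the collection mid-update."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stages = set()

    def add(self, stage_id):
        with self._lock:
            self._stages.add(stage_id)

    def remove(self, stage_id):
        with self._lock:
            self._stages.discard(stage_id)

    def get_value(self):
        with self._lock:  # consistent snapshot of the size
            return len(self._stages)

g = SafeSetGauge()
g.add(1); g.add(2); g.remove(1)
print(g.get_value())  # 1
```

In the actual DAGScheduler the collections are only touched from the event loop, so an alternative fix is to read a safely published snapshot rather than lock; the sketch just illustrates the hazard the report describes.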
[jira] [Commented] (SPARK-27643) Add supported Hive version list in doc
[ https://issues.apache.org/jira/browse/SPARK-27643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834333#comment-16834333 ] Yuming Wang commented on SPARK-27643: - Do you mean {{spark.sql.hive.metastore.version}}: [http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore]? > Add supported Hive version list in doc > -- > > Key: SPARK-27643 > URL: https://issues.apache.org/jira/browse/SPARK-27643 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.3.3, 2.4.2, 3.0.0 >Reporter: Zhichao Zhang >Priority: Minor > > Add supported Hive version list for each spark version in doc.
[jira] [Updated] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards
[ https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24935: Fix Version/s: 2.3.4 > Problem with Executing Hive UDF's from Spark 2.2 Onwards > > > Key: SPARK-24935 > URL: https://issues.apache.org/jira/browse/SPARK-24935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Major > Fix For: 2.3.4, 3.0.0, 2.4.3 > > > A user of the sketches library (https://github.com/DataSketches/sketches-hive) > reported an issue with the HLL Sketch Hive UDAF that seems to be a bug in Spark > or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For > more details on the issue, you can refer to the discussion in the > sketches-user list: > [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ] > > On further debugging, we figured out that from 2.2 onwards, Spark hive UDAF > provides support for partial aggregation, and has removed the functionality > that supported complete mode aggregation (refer to > https://issues.apache.org/jira/browse/SPARK-19060 and > https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of > expecting the update method to be called, the merge method is called here > ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)] > which throws the exception as described in the forums above.
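The complete-mode vs. partial-aggregation distinction in the report above can be sketched with a toy aggregator. This is an illustrative Python sketch, not Spark's or the sketches library's API: in complete mode the engine calls update for every raw row on one node, while with partial aggregation it calls update per partition and then merge on the partial buffers — a UDAF written only for the first call pattern breaks when merge is invoked.

```python
class SumAggregator:
    """Toy UDAF: update consumes raw input rows, merge combines
    partial buffers produced by update on other partitions."""

    def new_buffer(self):
        return 0

    def update(self, buf, row):
        return buf + row

    def merge(self, buf, other_buf):
        return buf + other_buf

agg = SumAggregator()
rows = [1, 2, 3, 4]

# Complete mode (what such UDAFs relied on before 2.2): update on every row.
complete = agg.new_buffer()
for r in rows:
    complete = agg.update(complete, r)

# Partial aggregation (2.2 onwards): update per partition, then merge partials.
partials = []
for partition in ([1, 2], [3, 4]):
    buf = agg.new_buffer()
    for r in partition:
        buf = agg.update(buf, r)
    partials.append(buf)
merged = agg.new_buffer()
for p in partials:
    merged = agg.merge(merged, p)

print(complete, merged)  # 10 10 — equal only because merge handles partial buffers
```

A UDAF whose merge assumes its input is a serialized sketch (rather than a raw value) will throw as soon as the engine switches it from the first call sequence to the second, which matches the failure described in the sketches-user thread.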
[jira] [Updated] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peng bo updated SPARK-27638: Description: The below example works with both Mysql and Hive, however not with spark. {code:java} mysql> select * from date_test where date_col >= '2000-1-1'; ++ | date_col | ++ | 2000-01-01 | ++ {code} The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420. Based on some tests, the behavior of Date and String comparison in Hive and Mysql: Hive: Cast to Date, partial date is not supported Mysql: Cast to Date, certain "partial date" is supported by defining certain date string parse rules. Check out {{str_to_datetime}} in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c Here's 2 proposals: a. Follow Mysql parse rule, but some partial date string comparison cases won't be supported either. b. Cast String value to Date, if it passes use date.toString, original string otherwise. was: The below example works with both Mysql and Hive, however not with spark. {code:java} mysql> select * from date_test where date_col >= '2000-1-1'; ++ | date_col | ++ | 2000-01-01 | ++ {code} The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420. Based on some tests, the behavior of Date and String comparison in Hive and Mysql: Hive: Cast to Date, partial date is not supported Spark: Cast to Date, certain "partial date" is supported by defining certain date string parse rules. Check out {{str_to_datetime}} in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c Here's 2 proposals: a. Follow Mysql parse rule, but some partial date string comparison cases won't be supported either. b. 
Cast String value to Date, if it passes use date.toString, original string otherwise. > date format yyyy-M-dd string comparison not handled properly > - > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: peng bo >Priority: Major > > The below example works with both Mysql and Hive, however not with spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > ++ > | date_col | > ++ > | 2000-01-01 | > ++ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here's 2 proposals: > a. Follow Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast String value to Date, if it passes use date.toString, original string > otherwise.
[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834322#comment-16834322 ] peng bo commented on SPARK-27638: - [~maxgekk] I'd love to propose a PR for this. However I am in the middle of something; I will try to do it by the end of this week if that's convenient for you as well. > date format yyyy-M-dd string comparison not handled properly > - > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: peng bo >Priority: Major > > The below example works with both Mysql and Hive, however not with spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > ++ > | date_col | > ++ > | 2000-01-01 | > ++ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Spark: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here's 2 proposals: > a. Follow Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast String value to Date, if it passes use date.toString, original string > otherwise.
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834276#comment-16834276 ] Apache Spark commented on SPARK-18406: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/24542 > Race between end-of-task and completion iterator read lock release > -- > > Key: SPARK-18406 > URL: https://issues.apache.org/jira/browse/SPARK-18406 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.0.0, 2.0.1 >Reporter: Josh Rosen >Assignee: Xingbo Jiang >Priority: Major > Fix For: 2.0.3, 2.1.2, 2.2.0 > > > The following log comes from a production streaming job where executors > periodically die due to uncaught exceptions during block release: > {code} > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921 > 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921) > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922 > 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922) > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923 > 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923) > 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable > 2721 > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924 > 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924) > 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as > bytes in memory (estimated size 5.0 KB, free 4.9 GB) > 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took > 3 ms > 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in > memory (estimated size 9.4 KB, free 4.9 GB) > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 
locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = > 567, finish = 1 > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = > 541, finish = 6 > 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID > 7923). 1429 bytes result sent to driver > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = > 533, finish = 7 > 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID > 7924). 1429 bytes result sent to driver > 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID > 7921) > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at > org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84) > at > org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356) > at > org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7925 > 16/11/07 17:11:06 INFO Executor: Running task 0.1 in stage 2390.0 (TID 7925) > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 41, boot = -536, init = > 576, finish = 1 > 16/11/07 17:11:06 INFO Executor: Finished task 1.0 in stage 2390.0 (TID > 7922). 1429 bytes result sent to driver > 16/11/07 17:11:06 ERROR Utils: Uncaught
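The assertion failure in the log above (readerCount going invalid inside releaseAllLocksForTask) is the classic shape of a double release: the completion iterator releases a block's read lock at about the same time end-of-task cleanup releases every lock recorded for the task. A minimal Python sketch of the bookkeeping, assuming simplified per-task lock tracking (the class and method names echo BlockInfoManager but this is not Spark's code): cleanup that only decrements for locks still registered to the task cannot drive the count negative.

```python
import threading
from collections import defaultdict

class BlockInfoManager:
    """Sketch of per-task read-lock bookkeeping. If end-of-task cleanup
    decremented readerCount blindly, a lock already released by the
    completion iterator would be decremented twice, tripping the
    'assertion failed' invariant seen in the executor log."""

    def __init__(self):
        self._lock = threading.Lock()
        self._reader_count = defaultdict(int)    # block id -> active readers
        self._locks_by_task = defaultdict(list)  # task id -> block ids held

    def lock_for_reading(self, task_id, block_id):
        with self._lock:
            self._reader_count[block_id] += 1
            self._locks_by_task[task_id].append(block_id)

    def unlock(self, task_id, block_id):
        with self._lock:
            if block_id in self._locks_by_task[task_id]:
                self._locks_by_task[task_id].remove(block_id)
                self._reader_count[block_id] -= 1  # only if still held

    def release_all_locks_for_task(self, task_id):
        with self._lock:
            # only release locks still registered to this task
            for block_id in self._locks_by_task.pop(task_id, []):
                self._reader_count[block_id] -= 1

mgr = BlockInfoManager()
mgr.lock_for_reading(7921, "rdd_2741_1")
mgr.unlock(7921, "rdd_2741_1")          # completion iterator releases first
mgr.release_all_locks_for_task(7921)    # cleanup finds nothing left to release
print(mgr._reader_count["rdd_2741_1"])  # 0, not -1
```

The real fix in the linked pull request is more involved (the race spans threads and the completion iterator), but the invariant it protects is the one sketched here.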
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834277#comment-16834277 ] Apache Spark commented on SPARK-18406: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/24542 > Race between end-of-task and completion iterator read lock release > -- > > Key: SPARK-18406 > URL: https://issues.apache.org/jira/browse/SPARK-18406 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.0.0, 2.0.1 >Reporter: Josh Rosen >Assignee: Xingbo Jiang >Priority: Major > Fix For: 2.0.3, 2.1.2, 2.2.0 > > > The following log comes from a production streaming job where executors > periodically die due to uncaught exceptions during block release: > {code} > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921 > 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921) > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922 > 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922) > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923 > 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923) > 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable > 2721 > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924 > 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924) > 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as > bytes in memory (estimated size 5.0 KB, free 4.9 GB) > 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took > 3 ms > 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in > memory (estimated size 9.4 KB, free 4.9 GB) > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 
locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = > 567, finish = 1 > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = > 541, finish = 6 > 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID > 7923). 1429 bytes result sent to driver > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = > 533, finish = 7 > 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID > 7924). 1429 bytes result sent to driver > 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID > 7921) > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at > org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84) > at > org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356) > at > org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7925 > 16/11/07 17:11:06 INFO Executor: Running task 0.1 in stage 2390.0 (TID 7925) > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 41, boot = -536, init = > 576, finish = 1 > 16/11/07 17:11:06 INFO Executor: Finished task 1.0 in stage 2390.0 (TID > 7922). 1429 bytes result sent to driver > 16/11/07 17:11:06 ERROR Utils: Uncaught
[jira] [Assigned] (SPARK-25139) PythonRunner#WriterThread released block after TaskRunner finally block which invoke BlockManager#releaseAllLocksForTask
[ https://issues.apache.org/jira/browse/SPARK-25139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25139: Assignee: (was: Apache Spark) > PythonRunner#WriterThread released block after TaskRunner finally block which > invoke BlockManager#releaseAllLocksForTask > - > > Key: SPARK-25139 > URL: https://issues.apache.org/jira/browse/SPARK-25139 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.1 >Reporter: DENG FEI >Priority: Major > > We run pyspark streaming on YARN, the executor will die caused by the error: > the task released lock while finished, but the python writer haven't do real > releasing lock. > Normally the task just double check the lock, but it ran wrong in front. > The executor trace log is below: > 18/08/17 13:52:20 Executor task launch worker for task 137 DEBUG > BlockManager: Getting local block input-0-1534485138800 18/08/17 13:52:20 > Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 > trying to acquire read lock for input-0-1534485138800 18/08/17 13:52:20 > Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 > acquired read lock for input-0-1534485138800 18/08/17 13:52:20 Executor task > launch worker for task 137 DEBUG BlockManager: Level for block > input-0-1534485138800 is StorageLevel(disk, memory, 1 replicas) 18/08/17 > 13:52:20 Executor task launch worker for task 137 INFO BlockManager: Found > block input-0-1534485138800 locally 18/08/17 13:52:20 Executor task launch > worker for task 137 INFO PythonRunner: Times: total = 8, boot = 3, init = 5, > finish = 0 18/08/17 13:52:20 stdout writer for python TRACE BlockInfoManager: > Task 137 releasing lock for input-0-1534485138800 18/08/17 13:52:20 Executor > task launch worker for task 137 INFO Executor: 1 block locks were not > released by TID = 137: [input-0-1534485138800] 18/08/17 13:52:20 stdout > writer for python ERROR Utils: Uncaught exception in thread 
stdout writer for > python java.lang.AssertionError: assertion failed: Block > input-0-1534485138800 is not locked for reading at > scala.Predef$.assert(Predef.scala:170) at > org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299) > at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:769) > at > org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:540) > at > org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:33) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > scala.collection.Iterator$class.foreach(Iterator.scala:893) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:213) > at > org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) at > org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170) > 18/08/17 13:52:20 stdout writer for python ERROR > SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout > writer for python,5,main] > > I think we should wait for the WriterThread after Task#run.
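The reporter's suggested direction — have the task wait for the Python writer thread before end-of-task lock cleanup runs — can be sketched as follows. This is a hedged Python illustration of the ordering fix, not Spark's actual TaskRunner/PythonRunner code; the function names are hypothetical.

```python
import threading

released_order = []

def writer_thread_body():
    # Stands in for PythonRunner's WriterThread, which releases a block's
    # read lock when its CompletionIterator is exhausted.
    released_order.append("writer released block lock")

def run_task():
    writer = threading.Thread(target=writer_thread_body)
    writer.start()
    # ... task body streams data to the Python worker ...
    writer.join()  # the suggested fix: wait for the writer BEFORE cleanup
    released_order.append("releaseAllLocksForTask")

run_task()
print(released_order)
# ['writer released block lock', 'releaseAllLocksForTask']
```

With the join in place, the writer's lock release always happens-before releaseAllLocksForTask, so cleanup can never see a lock the writer is still about to release.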
[jira] [Assigned] (SPARK-25139) PythonRunner#WriterThread released block after TaskRunner finally block which invoke BlockManager#releaseAllLocksForTask
[ https://issues.apache.org/jira/browse/SPARK-25139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25139: Assignee: Apache Spark > PythonRunner#WriterThread released block after TaskRunner finally block which > invoke BlockManager#releaseAllLocksForTask > - > > Key: SPARK-25139 > URL: https://issues.apache.org/jira/browse/SPARK-25139 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.1 >Reporter: DENG FEI >Assignee: Apache Spark >Priority: Major > > We run pyspark streaming on YARN, the executor will die caused by the error: > the task released lock while finished, but the python writer haven't do real > releasing lock. > Normally the task just double check the lock, but it ran wrong in front. > The executor trace log is below: > 18/08/17 13:52:20 Executor task launch worker for task 137 DEBUG > BlockManager: Getting local block input-0-1534485138800 18/08/17 13:52:20 > Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 > trying to acquire read lock for input-0-1534485138800 18/08/17 13:52:20 > Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 > acquired read lock for input-0-1534485138800 18/08/17 13:52:20 Executor task > launch worker for task 137 DEBUG BlockManager: Level for block > input-0-1534485138800 is StorageLevel(disk, memory, 1 replicas) 18/08/17 > 13:52:20 Executor task launch worker for task 137 INFO BlockManager: Found > block input-0-1534485138800 locally 18/08/17 13:52:20 Executor task launch > worker for task 137 INFO PythonRunner: Times: total = 8, boot = 3, init = 5, > finish = 0 18/08/17 13:52:20 stdout writer for python TRACE BlockInfoManager: > Task 137 releasing lock for input-0-1534485138800 18/08/17 13:52:20 Executor > task launch worker for task 137 INFO Executor: 1 block locks were not > released by TID = 137: [input-0-1534485138800] 18/08/17 13:52:20 stdout > writer for python ERROR Utils: Uncaught 
exception in thread stdout writer for > python java.lang.AssertionError: assertion failed: Block > input-0-1534485138800 is not locked for reading at > scala.Predef$.assert(Predef.scala:170) at > org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299) > at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:769) > at > org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:540) > at > org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:33) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > scala.collection.Iterator$class.foreach(Iterator.scala:893) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:213) > at > org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) at > org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170) > 18/08/17 13:52:20 stdout writer for python ERROR > SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout > writer for python,5,main] > > I think we should wait for the WriterThread after Task#run.
[jira] [Commented] (SPARK-27466) LEAD function with 'ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING' causes exception in Spark
[ https://issues.apache.org/jira/browse/SPARK-27466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834258#comment-16834258 ] Bruce Robbins commented on SPARK-27466: --- This _seems_ to be intentional, according to SPARK-8641 ("Native Spark Window Functions"), which states: {quote}LEAD and LAG are not aggregates. These expressions return the value of an expression a number of rows before (LAG) or ahead (LEAD) of the current row. These expression put a constraint on the Window frame in which they are executed: this can only be a Row frame with equal offsets. {quote} I guess it depends on what "equal offsets" means. Does it mean that the offsets specified in both PRECEDING and FOLLOWING need to match? Or that the offsets need to match the one associated with the LEAD or LAG function? Based on experience, it seems to be the latter (offsets need to match LEAD or LAG's offsets). E.g., {noformat} scala> sql("select c, b, a, lead(a, 1) over(partition by c order by a ROWS BETWEEN 1 following AND 1 following) as a_avg from windowtest").show(1000) +---+---+---+-+ | c| b| a|a_avg| +---+---+---+-+ | 1| 2| 1| 11| | 1| 12| 11| 21| | 1| 22| 21| 31| | 1| 32| 31| 41| | 1| 42| 41| null| | 6| 7| 6| 16| | 6| 17| 16| 26| | 6| 27| 26| 36| ...etc... {noformat} And also {noformat} scala> sql("select c, b, a, lead(a, 2) over(partition by c order by a ROWS BETWEEN 2 following AND 2 following) as a_avg from windowtest").show(1000) +---+---+---+-+ | c| b| a|a_avg| +---+---+---+-+ | 1| 2| 1| 21| | 1| 12| 11| 31| | 1| 22| 21| 41| | 1| 32| 31| null| | 1| 42| 41| null| | 6| 7| 6| 26| | 6| 17| 16| 36| | 6| 27| 26| 46| ...etc... 
{noformat} But not the following: {noformat} scala> sql("select c, b, a, lead(a, 1) over(partition by c order by a ROWS BETWEEN 2 following AND 2 following) as a_avg from windowtest").show(1000) org.apache.spark.sql.AnalysisException: Window Frame specifiedwindowframe(RowFrame, 2, 2) must match the required frame specifiedwindowframe(RowFrame, 1, 1); at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:43) {noformat} I suppose [~hvanhovell] or [~yhuai] would understand better (having implemented the native versions of these functions). > LEAD function with 'ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING' > causes exception in Spark > --- > > Key: SPARK-27466 > URL: https://issues.apache.org/jira/browse/SPARK-27466 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.2.0 > Environment: Spark version 2.2.0.2.6.4.92-2 > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112) >Reporter: Zoltan >Priority: Major > > *1. Create a table in Hive:* > > {code:java} > CREATE TABLE tab1( > col1 varchar(1), > col2 varchar(1) > ) > PARTITIONED BY ( > col3 varchar(1) > ) > LOCATION > 'hdfs://server1/data/tab1' > {code} > > *2. 
Query the Table in Spark:* > *2.1: Simple query, no exception thrown:* > {code:java} > scala> spark.sql("SELECT * from schema1.tab1").show() > +-+---++ > |col1|col2|col3| > +-+---++ > +-+---++ > {code} > *2.2.: Query causing exception:* > {code:java} > scala> spark.sql("*SELECT (LEAD(col1) OVER ( PARTITION BY col3 ORDER BY col1 > ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING*)) from > schema1.tab1") > {code} > {code:java} > org.apache.spark.sql.AnalysisException: Window Frame ROWS BETWEEN UNBOUNDED > PRECEDING AND UNBOUNDED FOLLOWING must match the required frame ROWS BETWEEN > 1 FOLLOWING AND 1 FOLLOWING; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$30$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2219) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$30$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2215) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at >
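The frame constraint discussed in the comment above — lead(expr, n) requires a row frame of exactly (n FOLLOWING, n FOLLOWING) — follows from what LEAD computes: each output row reads the single row n positions ahead within its partition. A Python sketch of that semantics (illustrative only; not Spark's window-execution code), using the same partition values as the scala> examples above:

```python
def lead(partition, offset, default=None):
    """LEAD(col, offset) over an ordered partition: the value 'offset'
    rows ahead of the current row, or 'default' past the end. Each output
    reads exactly one row at a fixed forward offset, which is why the
    required row frame is (offset FOLLOWING, offset FOLLOWING)."""
    n = len(partition)
    return [partition[i + offset] if i + offset < n else default
            for i in range(n)]

a = [1, 11, 21, 31, 41]  # partition c = 1, ordered by a
print(lead(a, 1))        # [11, 21, 31, 41, None]
print(lead(a, 2))        # [21, 31, 41, None, None]
```

This matches the experiment in the comment: a frame of 1 FOLLOWING works with lead(a, 1) and 2 FOLLOWING with lead(a, 2), while any other frame — including UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, as in the report — describes a read pattern LEAD cannot satisfy and is rejected at analysis time.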
[jira] [Assigned] (SPARK-27646) Required refactoring for bytecode analysis
[ https://issues.apache.org/jira/browse/SPARK-27646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27646: Assignee: (was: Apache Spark) > Required refactoring for bytecode analysis > -- > > Key: SPARK-27646 > URL: https://issues.apache.org/jira/browse/SPARK-27646 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.2 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27646) Required refactoring for bytecode analysis
[ https://issues.apache.org/jira/browse/SPARK-27646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27646: Assignee: Apache Spark > Required refactoring for bytecode analysis > -- > > Key: SPARK-27646 > URL: https://issues.apache.org/jira/browse/SPARK-27646 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.2 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27646) Required refactoring for bytecode analysis
[ https://issues.apache.org/jira/browse/SPARK-27646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-27646: Summary: Required refactoring for bytecode analysis (was: Required refactoring for bytecode analysis work) > Required refactoring for bytecode analysis > -- > > Key: SPARK-27646 > URL: https://issues.apache.org/jira/browse/SPARK-27646 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.2 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27646) Required refactoring for bytecode analysis work
DB Tsai created SPARK-27646: --- Summary: Required refactoring for bytecode analysis work Key: SPARK-27646 URL: https://issues.apache.org/jira/browse/SPARK-27646 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.2 Reporter: DB Tsai
[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834214#comment-16834214 ] Maxim Gekk commented on SPARK-27638: [~pengbo] Are you going to propose a PR for that? If not, I can fix the issue. > date format yyyy-M-dd string comparison not handled properly > - > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: peng bo >Priority: Major > > The below example works with both Mysql and Hive, however not with Spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > +------------+ > | date_col | > +------------+ > | 2000-01-01 | > +------------+ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here are 2 proposals: > a. Follow the Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast the String value to Date; if it parses, use date.toString, the original string > otherwise.
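Proposal (b) from the issue description can be sketched in isolation: try to parse the string literal as a date; if that succeeds, compare date-to-date, otherwise fall back to string comparison (this also covers the `date_col > 'invalid_date_string'` corner case raised in the comments). The single format pattern here is an assumption for illustration, not Spark's actual parser:

```python
from datetime import date, datetime

def coerce_for_comparison(date_col, literal):
    # Proposal (b): if the string literal parses as a date, compare as
    # dates; otherwise fall back to comparing both sides as strings.
    try:
        # strptime accepts partial forms such as '2000-1-1' for %Y-%m-%d
        parsed = datetime.strptime(literal, "%Y-%m-%d").date()
        return date_col, parsed            # date-to-date comparison
    except ValueError:
        return str(date_col), literal      # string fallback

lhs, rhs = coerce_for_comparison(date(2000, 1, 1), "2000-1-1")
print(lhs >= rhs)   # True: the partial date '2000-1-1' now matches
```

Under this sketch an invalid literal quietly degrades to the old string-comparison behaviour rather than failing the query, which is one possible answer to the corner case.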
[jira] [Commented] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors
[ https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834203#comment-16834203 ] Josh Rosen commented on SPARK-26555: I won't be able to tackle a backport for at least a week, so this is up for grabs in case someone else wants to do it. If I do end up working on this then I'll loop back here to claim it. > Thread safety issue causes createDataset to fail with misleading errors > --- > > Key: SPARK-26555 > URL: https://issues.apache.org/jira/browse/SPARK-26555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Major > Fix For: 3.0.0 > > > This can be replicated (~2% of the time) with > {code:scala} > import java.sql.Timestamp > import java.util.concurrent.{Executors, Future} > import org.apache.spark.sql.SparkSession > import scala.collection.mutable.ListBuffer > import scala.concurrent.ExecutionContext > import scala.util.Random > object Main { > def main(args: Array[String]): Unit = { > val sparkSession = SparkSession.builder > .getOrCreate() > import sparkSession.implicits._ > val executor = Executors.newFixedThreadPool(1) > try { > implicit val xc: ExecutionContext = > ExecutionContext.fromExecutorService(executor) > val futures = new ListBuffer[Future[_]]() > for (i <- 1 to 3) { > futures += executor.submit(new Runnable { > override def run(): Unit = { > val d = if (Random.nextInt(2) == 0) Some("d value") else None > val e = if (Random.nextInt(2) == 0) Some(5.0) else None > val f = if (Random.nextInt(2) == 0) Some(6.0) else None > println("DEBUG", d, e, f) > sparkSession.createDataset(Seq( > MyClass(new Timestamp(1L), "b", "c", d, e, f) > )) > } > }) > } > futures.foreach(_.get()) > } finally { > println("SHUTDOWN") > executor.shutdown() > sparkSession.stop() > } > } > case class MyClass( > a: Timestamp, > b: String, > c: String, > d: Option[String], > e: Option[Double], > f: Option[Double] > ) > } > 
{code} > So it will usually come up during > {code:bash} > for i in $(seq 1 200); do > echo $i > spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar > done > {code} > causing a variety of possible errors, such as > {code}Exception in thread "main" java.util.concurrent.ExecutionException: > scala.MatchError: scala.Option[String] (of class > scala.reflect.internal.Types$ClassArgsTypeRef) > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > Caused by: scala.MatchError: scala.Option[String] (of class > scala.reflect.internal.Types$ClassArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code} > or > {code}Exception in thread "main" java.util.concurrent.ExecutionException: > java.lang.UnsupportedOperationException: Schema for type > scala.Option[scala.Double] is not supported > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > Caused by: java.lang.UnsupportedOperationException: Schema for type > scala.Option[scala.Double] is not supported > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
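The pattern underlying this ticket, a non-thread-safe reflection API shared across threads, is commonly mitigated by serializing access. A toy sketch of that mitigation (the resolver class is hypothetical; the actual Spark fix changed ScalaReflection itself):

```python
import threading

class NaiveResolver:
    # Hypothetical stand-in for a non-thread-safe API: it parks state on
    # the instance mid-call, so unsynchronized concurrent callers can
    # clobber each other (compare the MatchError / UnsupportedOperation
    # failures quoted above).
    def __init__(self):
        self._scratch = None

    def resolve(self, type_name):
        self._scratch = type_name          # shared mutable state
        return ".".join(p.capitalize() for p in self._scratch.split("."))

_resolver = NaiveResolver()
_lock = threading.Lock()

def resolve_safely(type_name):
    # Serialize all access so only one thread uses the resolver at a time.
    with _lock:
        return _resolver.resolve(type_name)
```

The lock trades some parallelism for correctness; per-thread instances or a thread-safe rewrite of the underlying API are the usual alternatives.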
[jira] [Created] (SPARK-27645) Cache result of count function to that RDD
Seungmin Lee created SPARK-27645: Summary: Cache result of count function to that RDD Key: SPARK-27645 URL: https://issues.apache.org/jira/browse/SPARK-27645 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.3 Reporter: Seungmin Lee I'm not sure whether there has been an update for this (as far as I know, there isn't such a feature). Since an RDD is immutable, why don't we keep the result of its count function and reuse it in future calls? Sometimes we only have the RDD variable but not a previously computed count result. In this case, not running the whole count action over the entire dataset would be very beneficial in terms of performance.
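The idea in this ticket can be illustrated with a toy immutable collection that memoizes its count (a sketch only; a real RDD's count() is a distributed job, and caching it in Spark would also have to account for fault tolerance and recomputation):

```python
class ImmutableDatasetLike:
    # Toy stand-in for an immutable RDD (hypothetical class, not a Spark
    # API). Because the data never changes, the first count() result can
    # be remembered and reused on every later call.
    def __init__(self, data):
        self._data = tuple(data)
        self._count = None
        self.count_jobs_run = 0        # instrumentation for this sketch

    def count(self):
        if self._count is None:
            self.count_jobs_run += 1   # the "expensive job" runs once
            self._count = sum(1 for _ in self._data)
        return self._count
```

Repeated calls return the cached value without touching the data again, which is exactly the saving the reporter is asking for.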
[jira] [Updated] (SPARK-27644) Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default
[ https://issues.apache.org/jira/browse/SPARK-27644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27644: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-25556 > Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default > - > > Key: SPARK-27644 > URL: https://issues.apache.org/jira/browse/SPARK-27644 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > We can enable this after resolving all on-going issues and finishing more > verifications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27622) Avoid network communication when block manager fetches disk persisted RDD blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27622: --- Summary: Avoid network communication when block manager fetches disk persisted RDD blocks from the same host (was: Avoid network communication when block manger fetches from the same host) > Avoid network communication when block manager fetches disk persisted RDD > blocks from the same host > --- > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27622) Avoid the network when block manager fetches disk persisted RDD blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27622: --- Summary: Avoid the network when block manager fetches disk persisted RDD blocks from the same host (was: Avoid network communication when block manager fetches disk persisted RDD blocks from the same host) > Avoid the network when block manager fetches disk persisted RDD blocks from > the same host > - > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27644) Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default
Dongjoon Hyun created SPARK-27644: - Summary: Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default Key: SPARK-27644 URL: https://issues.apache.org/jira/browse/SPARK-27644 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Dongjoon Hyun We can enable this after resolving all on-going issues and finishing more verifications.
[jira] [Comment Edited] (SPARK-27622) Avoid network communication when block manger fetches from the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831553#comment-16831553 ] Attila Zsolt Piros edited comment on SPARK-27622 at 5/6/19 6:22 PM: I am already working on this. There is already a working prototype for RDD blocks. was (Author: attilapiros): I am already working on this. A working prototype for RDD blocks are ready and working. > Avoid network communication when block manger fetches from the same host > > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors
[ https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834055#comment-16834055 ] Sean Owen commented on SPARK-26555: --- I personally think it's OK to backport -- do you want to open a PR and go for it? > Thread safety issue causes createDataset to fail with misleading errors > --- > > Key: SPARK-26555 > URL: https://issues.apache.org/jira/browse/SPARK-26555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Major > Fix For: 3.0.0 > > > This can be replicated (~2% of the time) with > {code:scala} > import java.sql.Timestamp > import java.util.concurrent.{Executors, Future} > import org.apache.spark.sql.SparkSession > import scala.collection.mutable.ListBuffer > import scala.concurrent.ExecutionContext > import scala.util.Random > object Main { > def main(args: Array[String]): Unit = { > val sparkSession = SparkSession.builder > .getOrCreate() > import sparkSession.implicits._ > val executor = Executors.newFixedThreadPool(1) > try { > implicit val xc: ExecutionContext = > ExecutionContext.fromExecutorService(executor) > val futures = new ListBuffer[Future[_]]() > for (i <- 1 to 3) { > futures += executor.submit(new Runnable { > override def run(): Unit = { > val d = if (Random.nextInt(2) == 0) Some("d value") else None > val e = if (Random.nextInt(2) == 0) Some(5.0) else None > val f = if (Random.nextInt(2) == 0) Some(6.0) else None > println("DEBUG", d, e, f) > sparkSession.createDataset(Seq( > MyClass(new Timestamp(1L), "b", "c", d, e, f) > )) > } > }) > } > futures.foreach(_.get()) > } finally { > println("SHUTDOWN") > executor.shutdown() > sparkSession.stop() > } > } > case class MyClass( > a: Timestamp, > b: String, > c: String, > d: Option[String], > e: Option[Double], > f: Option[Double] > ) > } > {code} > So it will usually come up during > {code:bash} > for i in $(seq 1 200); do > echo $i > 
spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar > done > {code} > causing a variety of possible errors, such as > {code}Exception in thread "main" java.util.concurrent.ExecutionException: > scala.MatchError: scala.Option[String] (of class > scala.reflect.internal.Types$ClassArgsTypeRef) > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > Caused by: scala.MatchError: scala.Option[String] (of class > scala.reflect.internal.Types$ClassArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code} > or > {code}Exception in thread "main" java.util.concurrent.ExecutionException: > java.lang.UnsupportedOperationException: Schema for type > scala.Option[scala.Double] is not supported > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > Caused by: java.lang.UnsupportedOperationException: Schema for type > scala.Option[scala.Double] is not supported > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors
[ https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834025#comment-16834025 ] Josh Rosen commented on SPARK-26555: [~cloud_fan] [~srowen], could we backport this to the 2.4.x series? It'd be nice to have an LTS fix for users who can't immediately upgrade to 3.0. > Thread safety issue causes createDataset to fail with misleading errors > --- > > Key: SPARK-26555 > URL: https://issues.apache.org/jira/browse/SPARK-26555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Major > Fix For: 3.0.0 > > > This can be replicated (~2% of the time) with > {code:scala} > import java.sql.Timestamp > import java.util.concurrent.{Executors, Future} > import org.apache.spark.sql.SparkSession > import scala.collection.mutable.ListBuffer > import scala.concurrent.ExecutionContext > import scala.util.Random > object Main { > def main(args: Array[String]): Unit = { > val sparkSession = SparkSession.builder > .getOrCreate() > import sparkSession.implicits._ > val executor = Executors.newFixedThreadPool(1) > try { > implicit val xc: ExecutionContext = > ExecutionContext.fromExecutorService(executor) > val futures = new ListBuffer[Future[_]]() > for (i <- 1 to 3) { > futures += executor.submit(new Runnable { > override def run(): Unit = { > val d = if (Random.nextInt(2) == 0) Some("d value") else None > val e = if (Random.nextInt(2) == 0) Some(5.0) else None > val f = if (Random.nextInt(2) == 0) Some(6.0) else None > println("DEBUG", d, e, f) > sparkSession.createDataset(Seq( > MyClass(new Timestamp(1L), "b", "c", d, e, f) > )) > } > }) > } > futures.foreach(_.get()) > } finally { > println("SHUTDOWN") > executor.shutdown() > sparkSession.stop() > } > } > case class MyClass( > a: Timestamp, > b: String, > c: String, > d: Option[String], > e: Option[Double], > f: Option[Double] > ) > } > {code} > So it will usually come up during 
> {code:bash} > for i in $(seq 1 200); do > echo $i > spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar > done > {code} > causing a variety of possible errors, such as > {code}Exception in thread "main" java.util.concurrent.ExecutionException: > scala.MatchError: scala.Option[String] (of class > scala.reflect.internal.Types$ClassArgsTypeRef) > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > Caused by: scala.MatchError: scala.Option[String] (of class > scala.reflect.internal.Types$ClassArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code} > or > {code}Exception in thread "main" java.util.concurrent.ExecutionException: > java.lang.UnsupportedOperationException: Schema for type > scala.Option[scala.Double] is not supported > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > Caused by: java.lang.UnsupportedOperationException: Schema for type > scala.Option[scala.Double] is not supported > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC
[ https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834024#comment-16834024 ] Darshan commented on SPARK-19335: - Due to a row-level access issue, our organisation allows access to Kudu tables only via Impala, so we are connecting to Kudu through the Impala JDBC driver. However, I am constrained in using a DataFrame to upsert data into a Kudu table. This feature would really help. Any updates on this? > Spark should support doing an efficient DataFrame Upsert via JDBC > - > > Key: SPARK-19335 > URL: https://issues.apache.org/jira/browse/SPARK-19335 > Project: Spark > Issue Type: Improvement >Reporter: Ilya Ganelin >Priority: Minor > > Doing a database update, as opposed to an insert, is useful, particularly when > working with streaming applications which may require revisions to previously > stored data. > Spark DataFrames/DataSets do not currently support an Update feature via the > JDBC Writer, allowing only Overwrite or Append.
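The upsert semantics requested here can be sketched against SQLite, standing in for whatever JDBC target the writer would address (real databases would use MERGE, ON DUPLICATE KEY UPDATE, or ON CONFLICT ... DO UPDATE); the table and column names are made up for illustration:

```python
import sqlite3

def upsert_rows(conn, rows):
    # INSERT OR REPLACE keys on the primary key: existing ids are
    # updated in place, new ids are inserted -- the row-level upsert
    # behaviour the JDBC writer lacks (it offers only Overwrite/Append).
    conn.executemany("INSERT OR REPLACE INTO t(id, val) VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
upsert_rows(conn, [(1, "a"), (2, "b")])
upsert_rows(conn, [(2, "b2"), (3, "c")])   # updates id=2, inserts id=3
```

For a streaming job this means late revisions overwrite earlier rows instead of forcing a full-table Overwrite or producing duplicate Appends.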
[jira] [Resolved] (SPARK-23299) __repr__ broken for Rows instantiated with *args
[ https://issues.apache.org/jira/browse/SPARK-23299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-23299. -- Resolution: Fixed Fix Version/s: 3.0.0 Merged a fix for this for 3, we can continue the discussion around backporting and update the fix version if we do backport. > __repr__ broken for Rows instantiated with *args > > > Key: SPARK-23299 > URL: https://issues.apache.org/jira/browse/SPARK-23299 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0, 2.2.0 > Environment: Tested on OS X with Spark 1.5.0 as well as pip-installed > `pyspark` 2.2.0. Code in question appears to still be in error on the master > branch of the GitHub repository. >Reporter: Oli Hall >Priority: Minor > Fix For: 3.0.0 > > > PySpark Rows throw an exception if instantiated without column names when > `__repr__` is called. The most minimal reproducible example I've found is > this: > {code:java} > > from pyspark.sql.types import Row > > Row(123) > > /lib/python2.7/site-packages/pyspark/sql/types.pyc in > __repr__(self) > -> 1524 return "<Row(%s)>" % ", ".join(self) > TypeError: sequence item 0: expected string, int found{code} > This appears to be due to the implementation of `__repr__`, which works > excellently for Rows created with column names, but for those without, > assumes all values are strings ([link > here|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1584]).
> This should be an easy fix: if the values are mapped to `str` first, all > should be well (last line is the only modification): > {code:java} > def __repr__(self): > """Printable representation of Row used in Python REPL.""" > if hasattr(self, "__fields__"): > return "Row(%s)" % ", ".join("%s=%r" % (k, v) > for k, v in zip(self.__fields__, > tuple(self))) > else: > return "<Row(%s)>" % ", ".join(map(str, self)) > {code} > This will yield the following: > {code:java} > > from pyspark.sql.types import Row > > Row('aaa', 123) > <Row(aaa, 123)> > {code}
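As a standalone sanity check of the str-mapping idea in the proposed fix, here is a simplified sketch (a plain function, not pyspark's actual Row class):

```python
def row_repr(values, fields=None):
    # Mirrors the proposed fix: named rows use the field=value form,
    # while unnamed rows map every value through str() first, so
    # non-string entries (e.g. ints) no longer break the join.
    if fields is not None:
        return "Row(%s)" % ", ".join(
            "%s=%r" % (k, v) for k, v in zip(fields, values))
    return "<Row(%s)>" % ", ".join(map(str, values))

print(row_repr(("aaa", 123)))   # <Row(aaa, 123)>
```

The unnamed branch is exactly where the original TypeError arose: ", ".join(self) requires every element to already be a string.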
[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833999#comment-16833999 ] Wenchen Fan commented on SPARK-27638: - I think it should be changed. When comparing string and int, we cast string to int. When comparing string and date, I think it's reasonable to cast string to date. We also need to think about some corner cases like `date_col > 'invalid_date_string'`. > date format yyyy-M-dd string comparison not handled properly > - > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: peng bo >Priority: Major > > The below example works with both Mysql and Hive, however not with Spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > +------------+ > | date_col | > +------------+ > | 2000-01-01 | > +------------+ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here are 2 proposals: > a. Follow the Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast the String value to Date; if it parses, use date.toString, the original string > otherwise.
[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833992#comment-16833992 ] Sean Owen commented on SPARK-27638: --- Are you saying that's intended behavior or should be changed? > date format yyyy-M-dd string comparison not handled properly > - > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: peng bo >Priority: Major > > The below example works with both Mysql and Hive, however not with Spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > +------------+ > | date_col | > +------------+ > | 2000-01-01 | > +------------+ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here are 2 proposals: > a. Follow the Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast the String value to Date; if it parses, use date.toString, the original string > otherwise.
[jira] [Commented] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards
[ https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833982#comment-16833982 ] Wenchen Fan commented on SPARK-24935: - I have sent https://github.com/apache/spark/pull/24539 to backport it. > Problem with Executing Hive UDF's from Spark 2.2 Onwards > > > Key: SPARK-24935 > URL: https://issues.apache.org/jira/browse/SPARK-24935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Major > Fix For: 3.0.0, 2.4.3 > > > A user of sketches library(https://github.com/DataSketches/sketches-hive) > reported an issue with HLL Sketch Hive UDAF that seems to be a bug in Spark > or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For > more details on the issue, you can refer to the discussion in the > sketches-user list: > [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ] > > On further debugging, we figured out that from 2.2 onwards, Spark hive UDAF > provides support for partial aggregation, and has removed the functionality > that supported complete mode aggregation(Refer > https://issues.apache.org/jira/browse/SPARK-19060 and > https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of > expecting update method to be called, merge method is called here > ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)] > which throws the exception as described in the forums above. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
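The partial-aggregation behaviour behind this bug can be sketched with a toy UDAF lifecycle (illustrative only, not Hive's actual evaluator API): with partial aggregation, update() runs per input row on the map side, and the final stage only ever calls merge() on partial buffers, so a UDAF written for complete mode breaks there.

```python
class CountAgg:
    # Toy UDAF with a Hive-style init/update/merge lifecycle
    # (hypothetical API for illustration).
    def init(self):
        return 0

    def update(self, buf, _value):
        return buf + 1          # per-row, map side

    def merge(self, buf, other):
        return buf + other      # buffer-to-buffer, final stage

def aggregate(partitions):
    agg = CountAgg()
    partials = []
    for part in partitions:     # map side: update() for every row
        buf = agg.init()
        for v in part:
            buf = agg.update(buf, v)
        partials.append(buf)
    result = agg.init()         # reduce side: merge() only
    for p in partials:
        result = agg.merge(result, p)
    return result
```

A UDAF whose merge() throws (or is missing) because it expects complete-mode update() calls all the way through fails exactly at that reduce step, as in the sketches-hive report.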
[jira] [Updated] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards
[ https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24935: Fix Version/s: (was: 2.4.1) 2.4.3 > Problem with Executing Hive UDF's from Spark 2.2 Onwards > > > Key: SPARK-24935 > URL: https://issues.apache.org/jira/browse/SPARK-24935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Major > Fix For: 3.0.0, 2.4.3 > > > A user of sketches library(https://github.com/DataSketches/sketches-hive) > reported an issue with HLL Sketch Hive UDAF that seems to be a bug in Spark > or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For > more details on the issue, you can refer to the discussion in the > sketches-user list: > [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ] > > On further debugging, we figured out that from 2.2 onwards, Spark hive UDAF > provides support for partial aggregation, and has removed the functionality > that supported complete mode aggregation(Refer > https://issues.apache.org/jira/browse/SPARK-19060 and > https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of > expecting update method to be called, merge method is called here > ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)] > which throws the exception as described in the forums above. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833957#comment-16833957 ] Maxim Gekk edited comment on SPARK-27638 at 5/6/19 4:10 PM: It works with an explicit to_date: {code:scala} scala> val ds = spark.range(1).selectExpr("date '2000-01-01' as d") ds: org.apache.spark.sql.DataFrame = [d: date] scala> ds.where("d >= to_date('2000-1-1')").show +----------+ | d| +----------+ |2000-01-01| +----------+ {code} but without to_date, it compares strings: {code} scala> ds.where("d >= '2000-1-1'").explain(true) == Parsed Logical Plan == 'Filter ('d >= 2000-1-1) +- Project [10957 AS d#51] +- Range (0, 1, step=1, splits=Some(8)) == Analyzed Logical Plan == d: date Filter (cast(d#51 as string) >= 2000-1-1) +- Project [10957 AS d#51] +- Range (0, 1, step=1, splits=Some(8)) == Optimized Logical Plan == LocalRelation <empty>, [d#51] == Physical Plan == LocalTableScan <empty>, [d#51] {code} The same holds for '2000-01-01': the date column is cast to string. > date format yyyy-M-dd string comparison not handled properly > ------------------------------------------------------------ > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.2 > Reporter: peng bo > Priority: Major > > The below example works with both Mysql and Hive, however not with Spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > +------------+ > | date_col | > +------------+ > | 2000-01-01 | > +------------+ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here are 2 proposals: > a. Follow the Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast the String value to Date; if the cast succeeds, compare using > date.toString, otherwise use the original string. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
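Proposal (b) above can be sketched outside Spark. Here is a minimal Python sketch; the helper name `normalize_date_operand` and the list of accepted formats are illustrative assumptions, not Spark code:

```python
from datetime import datetime

# Candidate formats accepted for a "partial" date string; this list is an
# assumption for illustration, not the exact set Spark would support.
_FORMATS = ["%Y-%m-%d", "%Y-%m", "%Y"]

def normalize_date_operand(s: str) -> str:
    """Try to parse s as a date; on success return the canonical
    yyyy-MM-dd form (date.toString), otherwise return s unchanged."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(s, fmt).date().isoformat()
        except ValueError:
            continue
    return s  # not a valid date: fall back to the original string

print(normalize_date_operand("2000-1-1"))      # -> 2000-01-01
print(normalize_date_operand("invalid_date"))  # -> invalid_date (unchanged)
```

Under this rule, `date_col >= '2000-1-1'` would become a comparison against `2000-01-01`, while a corner case like `date_col > 'invalid_date_string'` would silently fall back to the current string comparison.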
[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833957#comment-16833957 ] Maxim Gekk commented on SPARK-27638: It works with an explicit to_date: {code:scala} scala> val ds = spark.range(1).selectExpr("date '2000-01-01' as d") ds: org.apache.spark.sql.DataFrame = [d: date] scala> ds.where("d >= to_date('2000-1-1')").show +----------+ | d| +----------+ |2000-01-01| +----------+ {code} but without to_date, it compares strings: {code} scala> ds.where("d >= '2000-1-1'").explain(true) == Parsed Logical Plan == 'Filter ('d >= 2000-1-1) +- Project [10957 AS d#51] +- Range (0, 1, step=1, splits=Some(8)) == Analyzed Logical Plan == d: date Filter (cast(d#51 as string) >= 2000-1-1) +- Project [10957 AS d#51] +- Range (0, 1, step=1, splits=Some(8)) == Optimized Logical Plan == LocalRelation <empty>, [d#51] == Physical Plan == LocalTableScan <empty>, [d#51] {code} > date format yyyy-M-dd string comparison not handled properly > ------------------------------------------------------------ > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.2 > Reporter: peng bo > Priority: Major > > The below example works with both Mysql and Hive, however not with Spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > +------------+ > | date_col | > +------------+ > | 2000-01-01 | > +------------+ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here are 2 proposals: > a. Follow the Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast the String value to Date; if the cast succeeds, compare using > date.toString, otherwise use the original string.
[jira] [Comment Edited] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833949#comment-16833949 ] Maxim Gekk edited comment on SPARK-27638 at 5/6/19 3:57 PM: [~srowen] The date literal should be cast to the date type by [stringToDate|https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L376], which is able to parse the date by default; see the supported patterns: {code} `yyyy` `yyyy-[m]m` `yyyy-[m]m-[d]d` `yyyy-[m]m-[d]d ` `yyyy-[m]m-[d]d *` `yyyy-[m]m-[d]dT*` {code} > date format yyyy-M-dd string comparison not handled properly > ------------------------------------------------------------ > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.2 > Reporter: peng bo > Priority: Major > > The below example works with both Mysql and Hive, however not with Spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > +------------+ > | date_col | > +------------+ > | 2000-01-01 | > +------------+ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here are 2 proposals: > a. Follow the Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast the String value to Date; if the cast succeeds, compare using > date.toString, otherwise use the original string.
[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833949#comment-16833949 ] Maxim Gekk commented on SPARK-27638: [~srowen] The date literal should be cast to the date type by [stringToDate|https://github.com/apache/spark/blob/ab8710b57916a129fcb89464209361120d224535/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L376], which is able to parse the date by default; see the supported patterns: {code} `yyyy` `yyyy-[m]m` `yyyy-[m]m-[d]d` `yyyy-[m]m-[d]d ` `yyyy-[m]m-[d]d *` `yyyy-[m]m-[d]dT*` {code} > date format yyyy-M-dd string comparison not handled properly > ------------------------------------------------------------ > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.2 > Reporter: peng bo > Priority: Major > > The below example works with both Mysql and Hive, however not with Spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > +------------+ > | date_col | > +------------+ > | 2000-01-01 | > +------------+ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here are 2 proposals: > a. Follow the Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast the String value to Date; if the cast succeeds, compare using > date.toString, otherwise use the original string.
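The supported patterns above (a year, then an optional month, then an optional day, then optional trailing " *" or "T*" content) can be mimicked with a regular expression. A hedged Python sketch of the idea behind stringToDate follows; the regex and function name are illustrative, not the actual DateTimeUtils implementation:

```python
import re
from datetime import date

# Rough analogue of the accepted patterns:
#   yyyy | yyyy-[m]m | yyyy-[m]m-[d]d | yyyy-[m]m-[d]d( |T)<anything>
_DATE_RE = re.compile(r"^(\d{4})(?:-(\d{1,2})(?:-(\d{1,2})(?:[ T].*)?)?)?$")

def string_to_date(s: str):
    """Return a date for a supported partial-date string, else None."""
    m = _DATE_RE.match(s)
    if not m:
        return None
    year = int(m.group(1))
    month = int(m.group(2) or 1)   # missing parts default to 1,
    day = int(m.group(3) or 1)     # as with `2000` -> 2000-01-01
    try:
        return date(year, month, day)
    except ValueError:             # e.g. month 13
        return None

print(string_to_date("2000-1-1"))        # date(2000, 1, 1)
print(string_to_date("2000-1-1T12:00"))  # trailing T* part ignored
print(string_to_date("not a date"))      # None
```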
[jira] [Assigned] (SPARK-27642) make v1 offset extends v2 offset
[ https://issues.apache.org/jira/browse/SPARK-27642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27642: Assignee: Wenchen Fan (was: Apache Spark) > make v1 offset extends v2 offset > > > Key: SPARK-27642 > URL: https://issues.apache.org/jira/browse/SPARK-27642 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Affects Versions: 3.0.0 > Reporter: Wenchen Fan > Assignee: Wenchen Fan > Priority: Major >
[jira] [Assigned] (SPARK-27642) make v1 offset extends v2 offset
[ https://issues.apache.org/jira/browse/SPARK-27642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27642: Assignee: Apache Spark (was: Wenchen Fan) > make v1 offset extends v2 offset > > > Key: SPARK-27642 > URL: https://issues.apache.org/jira/browse/SPARK-27642 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Affects Versions: 3.0.0 > Reporter: Wenchen Fan > Assignee: Apache Spark > Priority: Major >
[jira] [Commented] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes
[ https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833932#comment-16833932 ] Sean Owen commented on SPARK-26839: --- Right now it seems to be an issue with datanucleus in Hive. If it gets updated it causes other problems. I think the classloader part is OK at the moment, but, that's not proven. > on JDK11, IsolatedClientLoader must be able to load java.sql classes > > > Key: SPARK-26839 > URL: https://issues.apache.org/jira/browse/SPARK-26839 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > This might be very specific to my fork & a kind of weird system setup I'm > working on, I haven't completely confirmed yet, but I wanted to report it > anyway in case anybody else sees this. > When I try to do anything which touches the metastore on java11, I > immediately get errors from IsolatedClientLoader that it can't load anything > in java.sql. eg. > {noformat} > scala> spark.sql("show tables").show() > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > java/sql/SQLTransientException when creating Hive client using classpath: > file:/home/systest/jdk-11.0.2/, ... > ... > Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException > at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521) > {noformat} > After a bit of debugging, I also discovered that the {{rootClassLoader}} is > {{null}} in {{IsolatedClientLoader}}. 
I think this would work if either > {{rootClassLoader}} could load those classes, or if {{isShared()}} was > changed to allow any class starting with "java." (I'm not sure why it only > allows "java.lang" and "java.net" currently.)
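The isShared() behavior discussed above boils down to a package-prefix check. A Python sketch of the current rule versus the suggested relaxation (function names are illustrative; the real method in IsolatedClientLoader also shares Spark and Scala classes, elided here):

```python
def is_shared_current(name: str) -> bool:
    # Mirrors the reported behavior: only java.lang and java.net classes
    # are delegated to the parent loader.
    return name.startswith("java.lang.") or name.startswith("java.net.")

def is_shared_proposed(name: str) -> bool:
    # The suggested fix: share every JDK "java." class, so java.sql.*
    # resolves even when rootClassLoader is null on JDK 11.
    return name.startswith("java.")

print(is_shared_current("java.sql.SQLTransientException"))   # False -> CNFE
print(is_shared_proposed("java.sql.SQLTransientException"))  # True
```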
[jira] [Commented] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833931#comment-16833931 ] Sean Owen commented on SPARK-27638: --- CC [~maxgekk] but isn't the issue that your date isn't matching the default format yyyy-MM-dd? What about parsing the string with the format explicitly specified? > date format yyyy-M-dd string comparison not handled properly > ------------------------------------------------------------ > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.2 > Reporter: peng bo > Priority: Major > > The below example works with both Mysql and Hive, however not with Spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > +------------+ > | date_col | > +------------+ > | 2000-01-01 | > +------------+ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here are 2 proposals: > a. Follow the Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast the String value to Date; if the cast succeeds, compare using > date.toString, otherwise use the original string.
[jira] [Updated] (SPARK-27643) Add supported Hive version list in doc
[ https://issues.apache.org/jira/browse/SPARK-27643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhichao Zhang updated SPARK-27643: --- Description: Add supported Hive version list for each spark version in doc. (was: Add supported Hive version for each spark version in doc.) > Add supported Hive version list in doc > -- > > Key: SPARK-27643 > URL: https://issues.apache.org/jira/browse/SPARK-27643 > Project: Spark > Issue Type: Improvement > Components: Documentation > Affects Versions: 2.3.3, 2.4.2, 3.0.0 > Reporter: Zhichao Zhang > Priority: Minor > > Add supported Hive version list for each spark version in doc.
[jira] [Created] (SPARK-27643) Add supported Hive version in doc
Zhichao Zhang created SPARK-27643: -- Summary: Add supported Hive version in doc Key: SPARK-27643 URL: https://issues.apache.org/jira/browse/SPARK-27643 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 2.4.2, 2.3.3, 3.0.0 Reporter: Zhichao Zhang Add supported Hive version for each spark version in doc.
[jira] [Updated] (SPARK-27643) Add supported Hive version list in doc
[ https://issues.apache.org/jira/browse/SPARK-27643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhichao Zhang updated SPARK-27643: --- Summary: Add supported Hive version list in doc (was: Add supported Hive version in doc) > Add supported Hive version list in doc > -- > > Key: SPARK-27643 > URL: https://issues.apache.org/jira/browse/SPARK-27643 > Project: Spark > Issue Type: Improvement > Components: Documentation > Affects Versions: 2.3.3, 2.4.2, 3.0.0 > Reporter: Zhichao Zhang > Priority: Minor > > Add supported Hive version for each spark version in doc.
[jira] [Created] (SPARK-27642) make v1 offset extends v2 offset
Wenchen Fan created SPARK-27642: --- Summary: make v1 offset extends v2 offset Key: SPARK-27642 URL: https://issues.apache.org/jira/browse/SPARK-27642 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Updated] (SPARK-27641) Unregistering a single Metrics Source with no metrics leads to removing all the metrics from other sources with the same name
[ https://issues.apache.org/jira/browse/SPARK-27641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Zhemzhitsky updated SPARK-27641: --- Summary: Unregistering a single Metrics Source with no metrics leads to removing all the metrics from other sources with the same name (was: Unregistering a single Metrics Source with no metrics leads to removing all the from other sources with the same name) > Unregistering a single Metrics Source with no metrics leads to removing all > the metrics from other sources with the same name > - > > Key: SPARK-27641 > URL: https://issues.apache.org/jira/browse/SPARK-27641 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.3, 2.3.3, 2.4.2 >Reporter: Sergey Zhemzhitsky >Priority: Major > > Currently Spark allows registering multiple Metric Sources with the same > source name like the following > {code:scala} > val acc1 = sc.longAccumulator > LongAccumulatorSource.register(sc, {"acc1" -> acc1}) > val acc2 = sc.longAccumulator > LongAccumulatorSource.register(sc, {"acc2" -> acc2}) > {code} > In that case there are two metric sources registered and both of these > sources have the same name - > [AccumulatorSource|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/source/AccumulatorSource.scala#L47] > If you try to unregister the source with no accumulators and metrics > registered like the following > {code:scala} > SparkEnv.get.metricsSystem.removeSource(new LongAccumulatorSource) > {code} > ... 
then all the metrics for all the sources with the same name will be > unregistered because of the > [following|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L171] > snippet which removes all matching records which start with the > corresponding prefix which includes the source name, but does not include > metric name to be removed. > {code:scala} > def removeSource(source: Source) { > sources -= source > val regName = buildRegistryName(source) > registry.removeMatching((name: String, _: Metric) => > name.startsWith(regName)) > } > {code}
[jira] [Created] (SPARK-27641) Unregistering a single Metrics Source with no metrics leads to removing all the from other sources with the same name
Sergey Zhemzhitsky created SPARK-27641: -- Summary: Unregistering a single Metrics Source with no metrics leads to removing all the from other sources with the same name Key: SPARK-27641 URL: https://issues.apache.org/jira/browse/SPARK-27641 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.2, 2.3.3, 2.2.3 Reporter: Sergey Zhemzhitsky Currently Spark allows registering multiple Metric Sources with the same source name like the following {code:scala} val acc1 = sc.longAccumulator LongAccumulatorSource.register(sc, {"acc1" -> acc1}) val acc2 = sc.longAccumulator LongAccumulatorSource.register(sc, {"acc2" -> acc2}) {code} In that case there are two metric sources registered and both of these sources have the same name - [AccumulatorSource|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/source/AccumulatorSource.scala#L47] If you try to unregister the source with no accumulators and metrics registered like the following {code:scala} SparkEnv.get.metricsSystem.removeSource(new LongAccumulatorSource) {code} ... then all the metrics for all the sources with the same name will be unregistered because of the [following|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L171] snippet which removes all matching records which start with the corresponding prefix which includes the source name, but does not include metric name to be removed. {code:scala} def removeSource(source: Source) { sources -= source val regName = buildRegistryName(source) registry.removeMatching((name: String, _: Metric) => name.startsWith(regName)) } {code}
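The removeSource snippet quoted above removes by name prefix, which is exactly what wipes metrics belonging to other sources with the same name. A Python sketch of the effect (the registry keys and source names are illustrative; Spark's MetricRegistry keys metrics as "&lt;sourceName&gt;.&lt;metricName&gt;"):

```python
# Registry keyed like Spark's MetricRegistry: "<sourceName>.<metricName>"
registry = {
    "AccumulatorSource.acc1": 1,
    "AccumulatorSource.acc2": 2,
    "OtherSource.gauge": 3,
}

def remove_source_by_prefix(reg_name: str) -> None:
    # Mirrors registry.removeMatching(name.startsWith(regName)): every
    # metric whose full name starts with the source name is removed,
    # including metrics registered by *other* sources with that name.
    for key in [k for k in registry if k.startswith(reg_name)]:
        del registry[key]

# Unregistering an *empty* LongAccumulatorSource still wipes acc1 and acc2,
# because both registered sources share the name "AccumulatorSource".
remove_source_by_prefix("AccumulatorSource")
print(sorted(registry))  # ['OtherSource.gauge']
```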
[jira] [Assigned] (SPARK-23887) update query progress
[ https://issues.apache.org/jira/browse/SPARK-23887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23887: Assignee: (was: Apache Spark) > update query progress > - > > Key: SPARK-23887 > URL: https://issues.apache.org/jira/browse/SPARK-23887 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming > Affects Versions: 2.4.0 > Reporter: Jose Torres > Priority: Major >
[jira] [Assigned] (SPARK-23887) update query progress
[ https://issues.apache.org/jira/browse/SPARK-23887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23887: Assignee: Apache Spark > update query progress > - > > Key: SPARK-23887 > URL: https://issues.apache.org/jira/browse/SPARK-23887 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming > Affects Versions: 2.4.0 > Reporter: Jose Torres > Assignee: Apache Spark > Priority: Major >
[jira] [Resolved] (SPARK-27579) remove BaseStreamingSource and BaseStreamingSink
[ https://issues.apache.org/jira/browse/SPARK-27579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27579. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24471 [https://github.com/apache/spark/pull/24471] > remove BaseStreamingSource and BaseStreamingSink > > > Key: SPARK-27579 > URL: https://issues.apache.org/jira/browse/SPARK-27579 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Wenchen Fan > Assignee: Wenchen Fan > Priority: Major > Fix For: 3.0.0 >
[jira] [Commented] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes
[ https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833784#comment-16833784 ] Mihaly Toth commented on SPARK-26839: - Hmm, sorry, I overlooked something. I only have NucleusException all over the test run. So I guess that needs to be resolved first. As I understood, HIVE-17632 (especially the Datanucleus upgrade) is a dependency here. > on JDK11, IsolatedClientLoader must be able to load java.sql classes > > > Key: SPARK-26839 > URL: https://issues.apache.org/jira/browse/SPARK-26839 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.0.0 > Reporter: Imran Rashid > Priority: Major > > This might be very specific to my fork & a kind of weird system setup I'm > working on, I haven't completely confirmed yet, but I wanted to report it > anyway in case anybody else sees this. > When I try to do anything which touches the metastore on java11, I > immediately get errors from IsolatedClientLoader that it can't load anything > in java.sql. eg. > {noformat} > scala> spark.sql("show tables").show() > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > java/sql/SQLTransientException when creating Hive client using classpath: > file:/home/systest/jdk-11.0.2/, ... > ... > Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException > at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521) > {noformat} > After a bit of debugging, I also discovered that the {{rootClassLoader}} is > {{null}} in {{IsolatedClientLoader}}. I think this would work if either > {{rootClassLoader}} could load those classes, or if {{isShared()}} was > changed to allow any class starting with "java." (I'm not sure why it only > allows "java.lang" and "java.net" currently.)
[jira] [Commented] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes
[ https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833763#comment-16833763 ] Mihaly Toth commented on SPARK-26839: - [~srowen], I was facing CNFE and I have a potential fix for it on my fork. When I reproduced it on master, the CNFE goes away with the change but the {{NucleusException: The java type java.lang.Long ... cant be mapped for this datastore.}} stays. The problem I saw was that in some cases {{HiveUtils}} assembles a jar list only comprising the application jar, and this same jar list is considered by {{IsolatedClientLoader}} as the source of the hive classes. Shall I submit my change as a PR directly here? I am not fully sure it matches the scope of this issue. Regarding Datanucleus, it may deserve a new subtask in SPARK-24417. > on JDK11, IsolatedClientLoader must be able to load java.sql classes > > > Key: SPARK-26839 > URL: https://issues.apache.org/jira/browse/SPARK-26839 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.0.0 > Reporter: Imran Rashid > Priority: Major > > This might be very specific to my fork & a kind of weird system setup I'm > working on, I haven't completely confirmed yet, but I wanted to report it > anyway in case anybody else sees this. > When I try to do anything which touches the metastore on java11, I > immediately get errors from IsolatedClientLoader that it can't load anything > in java.sql. eg. > {noformat} > scala> spark.sql("show tables").show() > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > java/sql/SQLTransientException when creating Hive client using classpath: > file:/home/systest/jdk-11.0.2/, ... > ... > Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException > at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521) > {noformat} > After a bit of debugging, I also discovered that the {{rootClassLoader}} is > {{null}} in {{IsolatedClientLoader}}. I think this would work if either > {{rootClassLoader}} could load those classes, or if {{isShared()}} was > changed to allow any class starting with "java." (I'm not sure why it only > allows "java.lang" and "java.net" currently.)
[jira] [Comment Edited] (SPARK-27634) deleteCheckpointOnStop should be configurable
[ https://issues.apache.org/jira/browse/SPARK-27634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833743#comment-16833743 ] Gabor Somogyi edited comment on SPARK-27634 at 5/6/19 12:08 PM: [~yuwang0...@gmail.com] I think in such a case one should use a temporary checkpoint location. In Spark 3.0 this can be force deleted with "spark.sql.streaming.forceDeleteTempCheckpointLocation". was (Author: gsomogyi): [~yuwang0...@gmail.com] I think one should use temporary checkpoint location. In Spark 3.0 this can be force deleted with "spark.sql.streaming.forceDeleteTempCheckpointLocation". > deleteCheckpointOnStop should be configurable > - > > Key: SPARK-27634 > URL: https://issues.apache.org/jira/browse/SPARK-27634 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Affects Versions: 2.4.2 > Reporter: Yu Wang > Priority: Minor > Attachments: SPARK-27634.patch > > > we need to delete the checkpoint file after running the stream application > multiple times, so deleteCheckpointOnStop should be configurable
[jira] [Commented] (SPARK-27634) deleteCheckpointOnStop should be configurable
[ https://issues.apache.org/jira/browse/SPARK-27634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833743#comment-16833743 ] Gabor Somogyi commented on SPARK-27634: --- [~yuwang0...@gmail.com] I think one should use temporary checkpoint location. In Spark 3.0 this can be force deleted with "spark.sql.streaming.forceDeleteTempCheckpointLocation". > deleteCheckpointOnStop should be configurable > - > > Key: SPARK-27634 > URL: https://issues.apache.org/jira/browse/SPARK-27634 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Affects Versions: 2.4.2 > Reporter: Yu Wang > Priority: Minor > Attachments: SPARK-27634.patch > > > we need to delete checkpoint file after running the stream application > multiple times, so deleteCheckpointOnStop should be configurable
[jira] [Assigned] (SPARK-27640) Avoid duplicate lookups for datasource through provider
[ https://issues.apache.org/jira/browse/SPARK-27640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27640: Assignee: (was: Apache Spark) > Avoid duplicate lookups for datasource through provider > --- > > Key: SPARK-27640 > URL: https://issues.apache.org/jira/browse/SPARK-27640 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming > Affects Versions: 2.3.0, 2.4.0 > Reporter: jiaan.geng > Priority: Minor > > Spark SQL uses code like the following to look up a datasource. > {code:java} > DataSource.lookupDataSource(source, sparkSession.sqlContext.conf){code} > But there are some duplicate calls.
[jira] [Assigned] (SPARK-27640) Avoid duplicate lookups for datasource through provider
[ https://issues.apache.org/jira/browse/SPARK-27640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27640: Assignee: Apache Spark > Avoid duplicate lookups for datasource through provider > --- > > Key: SPARK-27640 > URL: https://issues.apache.org/jira/browse/SPARK-27640 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Minor > > Spark SQL uses the following code to look up a datasource. > {code:java} > DataSource.lookupDataSource(source, sparkSession.sqlContext.conf){code} > But there are some duplicate calls.
[jira] [Updated] (SPARK-27640) Avoid duplicate lookups for datasource through provider
[ https://issues.apache.org/jira/browse/SPARK-27640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27640: --- Component/s: Structured Streaming > Avoid duplicate lookups for datasource through provider > --- > > Key: SPARK-27640 > URL: https://issues.apache.org/jira/browse/SPARK-27640 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Priority: Minor > > Spark SQL uses the following code to look up a datasource. > {code:java} > DataSource.lookupDataSource(source, sparkSession.sqlContext.conf){code} > But there are some duplicate calls.
[jira] [Created] (SPARK-27640) Avoid duplicate lookups for datasource through provider
jiaan.geng created SPARK-27640: -- Summary: Avoid duplicate lookups for datasource through provider Key: SPARK-27640 URL: https://issues.apache.org/jira/browse/SPARK-27640 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: jiaan.geng Spark SQL uses the following code to look up a datasource. {code:java} DataSource.lookupDataSource(source, sparkSession.sqlContext.conf){code} But there are some duplicate calls.
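Avoiding the duplicate calls is a standard memoization problem: resolve the provider once and reuse the result on subsequent lookups. A minimal Python sketch; lookup_data_source and its tiny registry are stand-ins for illustration, not Spark's actual resolution logic:

```python
from functools import lru_cache

CALLS = {"count": 0}

@lru_cache(maxsize=None)
def lookup_data_source(provider: str) -> str:
    """Stand-in for DataSource.lookupDataSource: resolve a short provider
    name to a fully qualified class name, counting real resolutions."""
    CALLS["count"] += 1
    registry = {
        "parquet": "org.apache.spark.sql.parquet",
        "csv": "org.apache.spark.sql.csv",
    }
    return registry.get(provider, provider)

# Two lookups for the same provider hit the resolver only once.
lookup_data_source("parquet")
lookup_data_source("parquet")
print(CALLS["count"])  # -> 1
```

The same effect can be achieved in Scala by resolving once into a local val and passing the resolved class around, rather than calling the lookup at each use site.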
[jira] [Updated] (SPARK-27622) Avoid network communication when block manager fetches from the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27622: --- Summary: Avoid network communication when block manager fetches from the same host (was: Avoiding network communication when block manager fetching from the same host) > Avoid network communication when block manager fetches from the same host > > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host.
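The proposed optimization can be sketched as a host check before choosing a transport: if the remote block manager runs on the same host, read the block from the local file system instead of going through the network stack. A hedged Python sketch; all names here are illustrative, not Spark internals:

```python
def fetch_block(block_id, local_host, remote_host, read_local, read_network):
    """Choose a local disk read over a network fetch when both
    block managers share a host."""
    if remote_host == local_host:
        return read_local(block_id)   # same host: bypass the network stack
    return read_network(block_id)     # different host: normal network fetch

# Toy block store and two transports tagged by how they were reached.
blocks = {"rdd_0_1": b"payload"}
local = lambda b: ("local", blocks[b])
net = lambda b: ("network", blocks[b])

print(fetch_block("rdd_0_1", "host-a", "host-a", local, net))  # -> ('local', b'payload')
print(fetch_block("rdd_0_1", "host-a", "host-b", local, net))  # -> ('network', b'payload')
```

The payoff is that same-host fetches skip serialization through the transfer service entirely, which matters most for large shuffle and cached-RDD blocks.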
[jira] [Updated] (SPARK-27622) Avoiding network communication when block manager fetching from the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27622: --- Summary: Avoiding network communication when block manager fetching from the same host (was: Avoiding network communication when block managers are running on the same host ) > Avoiding network communication when block manager fetching from the same host > > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host.
[jira] [Updated] (SPARK-27622) Avoiding network communication when block managers are running on the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27622: --- Summary: Avoiding network communication when block managers are running on the same host (was: Avoiding network communication when block managers are running on the host ) > Avoiding network communication when block managers are running on the same > host > --- > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host.
[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)
[ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833632#comment-16833632 ] Jeffrey(Xilang) Yan commented on SPARK-5594: There is a bug in versions before 2.2.3/2.3.0. If you hit "Failed to get broadcast" and the method call stack is from MapOutputTracker, try upgrading your Spark. The bug is caused by the driver removing the broadcast but still sending the broadcast id to the executor, in the method MapOutputTrackerMaster.getSerializedMapOutputStatuses. It has been fixed by https://issues.apache.org/jira/browse/SPARK-23243 > SparkException: Failed to get broadcast (TorrentBroadcast) > -- > > Key: SPARK-5594 > URL: https://issues.apache.org/jira/browse/SPARK-5594 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: John Sandiford >Priority: Critical > > I am uncertain whether this is a bug, however I am getting the error below > when running on a cluster (works locally), and have no idea what is causing > it, or where to look for more information. > Any help is appreciated. Others appear to experience the same issue, but I > have not found any solutions online. > Please note that this only happens with certain code and is repeatable, all > my other spark jobs work fine.
> {noformat} > ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: > Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: > org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of > broadcast_6 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 > of broadcast_6 > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136) > at > 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008) > ... 11 more > {noformat} > Driver stacktrace: > {noformat} > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at >
[jira] [Updated] (SPARK-27639) InMemoryTableScan should show the table name on UI
[ https://issues.apache.org/jira/browse/SPARK-27639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27639: Description: It only shows InMemoryTableScan when scanning an InMemoryTable. When there are many InMemoryTables, it is difficult to distinguish which one we are looking for. This PR shows the table name when scanning an InMemoryTable. !https://user-images.githubusercontent.com/5399861/57213799-7bccf100-701a-11e9-9872-d90b4a185dc6.png! was: !image-2019-05-06-16-11-45-164.png! > InMemoryTableScan should show the table name on UI > -- > > Key: SPARK-27639 > URL: https://issues.apache.org/jira/browse/SPARK-27639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > It only shows InMemoryTableScan when scanning an InMemoryTable. > When there are many InMemoryTables, it is difficult to distinguish which one > we are looking for. This PR shows the table name when scanning an > InMemoryTable. > !https://user-images.githubusercontent.com/5399861/57213799-7bccf100-701a-11e9-9872-d90b4a185dc6.png!
[jira] [Updated] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peng bo updated SPARK-27638: Summary: date format yyyy-M-dd string comparison not handled properly (was: date format yyyy-M-dd comparison not handled properly ) > date format yyyy-M-dd string comparison not handled properly > - > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: peng bo >Priority: Major > > The below example works with both Mysql and Hive, however not with Spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > +------------+ > | date_col | > +------------+ > | 2000-01-01 | > +------------+ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here are 2 proposals: > a. Follow the Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast the String value to Date; if it passes, use date.toString, otherwise > the original string.
[jira] [Updated] (SPARK-27638) date format yyyy-M-dd comparison not handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peng bo updated SPARK-27638: Summary: date format yyyy-M-dd comparison not handled properly (was: date format yyyy-M-dd comparison isn't handled properly ) > date format yyyy-M-dd comparison not handled properly > -- > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: peng bo >Priority: Major > > The below example works with both Mysql and Hive, however not with Spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > +------------+ > | date_col | > +------------+ > | 2000-01-01 | > +------------+ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here are 2 proposals: > a. Follow the Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast the String value to Date; if it passes, use date.toString, otherwise > the original string.
[jira] [Assigned] (SPARK-27639) InMemoryTableScan should show the table name on UI
[ https://issues.apache.org/jira/browse/SPARK-27639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27639: Assignee: (was: Apache Spark) > InMemoryTableScan should show the table name on UI > -- > > Key: SPARK-27639 > URL: https://issues.apache.org/jira/browse/SPARK-27639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > !image-2019-05-06-16-11-45-164.png!
[jira] [Assigned] (SPARK-27639) InMemoryTableScan should show the table name on UI
[ https://issues.apache.org/jira/browse/SPARK-27639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27639: Assignee: Apache Spark > InMemoryTableScan should show the table name on UI > -- > > Key: SPARK-27639 > URL: https://issues.apache.org/jira/browse/SPARK-27639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > !image-2019-05-06-16-11-45-164.png!
[jira] [Commented] (SPARK-27638) date format yyyy-M-dd comparison isn't handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833584#comment-16833584 ] peng bo commented on SPARK-27638: - [~cloud_fan] [~srowen] Your opinion will be appreciated. > date format yyyy-M-dd comparison isn't handled properly > > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: peng bo >Priority: Major > > The below example works with both Mysql and Hive, however not with Spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > +------------+ > | date_col | > +------------+ > | 2000-01-01 | > +------------+ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here are 2 proposals: > a. Follow the Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast the String value to Date; if it passes, use date.toString, otherwise > the original string.
[jira] [Created] (SPARK-27639) InMemoryTableScan should show the table name on UI
Yuming Wang created SPARK-27639: --- Summary: InMemoryTableScan should show the table name on UI Key: SPARK-27639 URL: https://issues.apache.org/jira/browse/SPARK-27639 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang !image-2019-05-06-16-11-45-164.png!
[jira] [Updated] (SPARK-27638) date format yyyy-M-dd comparison isn't handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peng bo updated SPARK-27638: Description: The below example works with both Mysql and Hive, however not with Spark. {code:java} mysql> select * from date_test where date_col >= '2000-1-1'; +------------+ | date_col | +------------+ | 2000-01-01 | +------------+ {code} The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420. Based on some tests, the behavior of Date and String comparison in Hive and Mysql: Hive: Cast to Date, partial date is not supported Mysql: Cast to Date, certain "partial date" is supported by defining certain date string parse rules. Check out {{str_to_datetime}} in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c Here are 2 proposals: a. Follow the Mysql parse rule, but some partial date string comparison cases won't be supported either. b. Cast the String value to Date; if it passes, use date.toString, otherwise the original string. was: The below example works with both Mysql and Hive, however not with Spark. {code:java} mysql> select * from date_test where date_col >= '2000-1-1'; +------------+ | date_col | +------------+ | 2000-01-01 | +------------+ {code} The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420. Based on some tests, the behavior of Date and String comparison in Hive and Mysql: Hive: Cast to Date, partial date is not supported Mysql: Cast to Date, "partial date" is supported by defining certain date string parse rules. Check out {{str_to_datetime}} in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c Here are 2 proposals: a. Follow the Mysql parse rule, but some partial date string comparison cases won't be supported either. b. Cast the String value to Date; if it passes, use date.toString, otherwise the original string. > date format yyyy-M-dd comparison isn't handled properly > > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: peng bo >Priority: Major > > The below example works with both Mysql and Hive, however not with Spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > +------------+ > | date_col | > +------------+ > | 2000-01-01 | > +------------+ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, certain "partial date" is supported by defining certain > date string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here are 2 proposals: > a. Follow the Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast the String value to Date; if it passes, use date.toString, otherwise > the original string.
[jira] [Updated] (SPARK-27638) date format yyyy-M-dd comparison isn't handled properly
[ https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peng bo updated SPARK-27638: Description: The below example works with both Mysql and Hive, however not with Spark. {code:java} mysql> select * from date_test where date_col >= '2000-1-1'; +------------+ | date_col | +------------+ | 2000-01-01 | +------------+ {code} The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420. Based on some tests, the behavior of Date and String comparison in Hive and Mysql: Hive: Cast to Date, partial date is not supported Mysql: Cast to Date, "partial date" is supported by defining certain date string parse rules. Check out {{str_to_datetime}} in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c Here are 2 proposals: a. Follow the Mysql parse rule, but some partial date string comparison cases won't be supported either. b. Cast the String value to Date; if it passes, use date.toString, otherwise the original string. was: The below example works with both Mysql and Hive, however not with Spark. {code:java} mysql> select * from date_test where date_col >= '2000-1-1'; +------------+ | date_col | +------------+ | 2000-01-01 | +------------+ {code} The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420. Based on some tests, the behavior of Date and String comparison in Hive and Mysql: Hive: Cast to Date, partial date is not supported Mysql: Cast to Date, "partial date" is supported by defining certain date string parse rules. Check out {{str_to_datetime}} in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c Here are 2 proposals: a. Follow the Mysql parse rule, but some partial date string comparison cases wouldn't be supported either. b. Cast the String value to date; if it passes, use date.toString, otherwise the original string. > date format yyyy-M-dd comparison isn't handled properly > > > Key: SPARK-27638 > URL: https://issues.apache.org/jira/browse/SPARK-27638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: peng bo >Priority: Major > > The below example works with both Mysql and Hive, however not with Spark. > {code:java} > mysql> select * from date_test where date_col >= '2000-1-1'; > +------------+ > | date_col | > +------------+ > | 2000-01-01 | > +------------+ > {code} > The reason is that Spark casts both sides to String type during date and > string comparison for partial date support. Please find more details in > https://issues.apache.org/jira/browse/SPARK-8420. > Based on some tests, the behavior of Date and String comparison in Hive and > Mysql: > Hive: Cast to Date, partial date is not supported > Mysql: Cast to Date, "partial date" is supported by defining certain date > string parse rules. Check out {{str_to_datetime}} in > https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c > Here are 2 proposals: > a. Follow the Mysql parse rule, but some partial date string comparison cases > won't be supported either. > b. Cast the String value to Date; if it passes, use date.toString, otherwise > the original string.
[jira] [Created] (SPARK-27638) date format yyyy-M-dd comparison isn't handled properly
peng bo created SPARK-27638: --- Summary: date format yyyy-M-dd comparison isn't handled properly Key: SPARK-27638 URL: https://issues.apache.org/jira/browse/SPARK-27638 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.2 Reporter: peng bo The below example works with both Mysql and Hive, however not with Spark. {code:java} mysql> select * from date_test where date_col >= '2000-1-1'; +------------+ | date_col | +------------+ | 2000-01-01 | +------------+ {code} The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420. Based on some tests, the behavior of Date and String comparison in Hive and Mysql: Hive: Cast to Date, partial date is not supported Mysql: Cast to Date, "partial date" is supported by defining certain date string parse rules. Check out {{str_to_datetime}} in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c Here are 2 proposals: a. Follow the Mysql parse rule, but some partial date string comparison cases wouldn't be supported either. b. Cast the String value to date; if it passes, use date.toString, otherwise the original string.
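Proposal (b) above can be sketched in Python: try to parse the string literal as a date with lenient month/day widths; if parsing succeeds, compare as dates, otherwise fall back to plain string comparison (the corner case raised for literals like 'invalid_date_string'). The function names here are illustrative, not Spark internals:

```python
from datetime import date

def parse_partial_date(s: str):
    """Accept yyyy-M-dd style literals such as '2000-1-1'; return None if invalid."""
    parts = s.split("-")
    if len(parts) != 3:
        return None
    try:
        y, m, d = (int(p) for p in parts)
        return date(y, m, d)
    except ValueError:
        return None

def date_ge_literal(date_col: date, literal: str) -> bool:
    """date_col >= literal under proposal (b)."""
    parsed = parse_partial_date(literal)
    if parsed is not None:
        return date_col >= parsed           # compare as dates
    return date_col.isoformat() >= literal  # fall back to string comparison

# Date comparison gives the MySQL/Hive answer, while raw string
# comparison of '2000-01-01' >= '2000-1-1' would wrongly be False.
print(date_ge_literal(date(2000, 1, 1), "2000-1-1"))  # -> True
```

Note that the string fallback reproduces Spark's current behavior only for unparseable literals, which keeps `date_col > 'invalid_date_string'` well-defined instead of failing the cast.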
[jira] [Comment Edited] (SPARK-27227) Spark Runtime Filter
[ https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833558#comment-16833558 ] Song Jun edited comment on SPARK-27227 at 5/6/19 7:32 AM: -- [~cloud_fan] [~smilegator] could you please help to review this SPIP? thanks very much! was (Author: windpiger): [~cloud_fan] [~LI,Xiao] could you please help to review this SPIP? thanks very much! > Spark Runtime Filter > > > Key: SPARK-27227 > URL: https://issues.apache.org/jira/browse/SPARK-27227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Song Jun >Priority: Major > > When we equi-join one big table with a smaller table, we can collect some > statistics from the smaller table side, and use them in the scan of the big > table to do partition pruning or data filtering before executing the join. > This can significantly improve SQL performance. > For a simple example: > select * from A, B where A.a = B.b > A is the big table, B is the small table. > There are two scenarios: > 1. A.a is a partition column of table A > we can collect all the values of B.b, and send them to table A to do > partition pruning on A.a. > 2. A.a is not a partition column of table A > we can collect some real-time statistics (such as min/max/bloomfilter) of > B.b by executing an extra sql (select max(b),min(b),bbf(b) from B), and send > them to table A to do filtering on A.a. > Additionally, for a more complex query select * from A join (select * from > B where B.c = 1) X on A.a = B.b, we collect real-time statistics (such as > min/max/bloomfilter) of X by executing an extra sql (select max(b),min(b),bbf(b) > from X) > In the above two scenarios, we can filter out lots of data by partition > pruning or data filtering, thus improving performance. > 10TB TPC-DS gained about 35% improvement in our test. > I will submit a SPIP later. > SPIP: > https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt
[jira] [Commented] (SPARK-27227) Spark Runtime Filter
[ https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833558#comment-16833558 ] Song Jun commented on SPARK-27227: -- [~cloud_fan] [~LI,Xiao] could you please help to review this SPIP? thanks very much! > Spark Runtime Filter > > > Key: SPARK-27227 > URL: https://issues.apache.org/jira/browse/SPARK-27227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Song Jun >Priority: Major > > When we equi-join one big table with a smaller table, we can collect some > statistics from the smaller table side, and use them in the scan of the big > table to do partition pruning or data filtering before executing the join. > This can significantly improve SQL performance. > For a simple example: > select * from A, B where A.a = B.b > A is the big table, B is the small table. > There are two scenarios: > 1. A.a is a partition column of table A > we can collect all the values of B.b, and send them to table A to do > partition pruning on A.a. > 2. A.a is not a partition column of table A > we can collect some real-time statistics (such as min/max/bloomfilter) of > B.b by executing an extra sql (select max(b),min(b),bbf(b) from B), and send > them to table A to do filtering on A.a. > Additionally, for a more complex query select * from A join (select * from > B where B.c = 1) X on A.a = B.b, we collect real-time statistics (such as > min/max/bloomfilter) of X by executing an extra sql (select max(b),min(b),bbf(b) > from X) > In the above two scenarios, we can filter out lots of data by partition > pruning or data filtering, thus improving performance. > 10TB TPC-DS gained about 35% improvement in our test. > I will submit a SPIP later. > SPIP: > https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt
[jira] [Updated] (SPARK-27227) Spark Runtime Filter
[ https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-27227: - Description: When we equi-join one big table with a smaller table, we can collect some statistics from the smaller table side and use them in the scan of the big table to do partition pruning or data filtering before executing the join. This can significantly improve SQL performance. For a simple example: select * from A, B where A.a = B.b, where A is a big table and B is a small table. There are two scenarios: 1. A.a is a partition column of table A: we can collect all the values of B.b and send them to table A to do partition pruning on A.a. 2. A.a is not a partition column of table A: we can collect some real-time statistics (such as min/max/bloom filter) of B.b by executing an extra SQL query (select max(b), min(b), bbf(b) from B), and send them to table A to do filtering on A.a. Additionally, for a more complex query such as select * from A join (select * from B where B.c = 1) X on A.a = X.b, we collect real-time statistics (such as min/max/bloom filter) of X by executing an extra SQL query (select max(b), min(b), bbf(b) from X). In both scenarios we can filter out lots of data by partition pruning or data filtering, and thus improve performance. A 10TB TPC-DS run gained about a 35% improvement in our tests. I will submit a SPIP later. SPIP: https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt was: (same description as above, without the SPIP link) > Spark Runtime Filter > > > Key: SPARK-27227 > URL: https://issues.apache.org/jira/browse/SPARK-27227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Song Jun >Priority: Major > > When we equi-join one big table with a smaller table, we can collect some > statistics from the smaller table side and use them in the scan of the big table > to do partition pruning or data filtering before executing the join. > This can significantly improve SQL performance. > For a simple example: > select * from A, B where A.a = B.b > A is a big table, B is a small table. > There are two scenarios: > 1. A.a is a partition column of table A >we can collect all the values of B.b and send them to table A to do >partition pruning on A.a. > 2. A.a is not a partition column of table A > we can collect some real-time statistics (such as min/max/bloom filter) of > B.b by executing an extra SQL query (select max(b), min(b), bbf(b) from B), and send them to > table A to do filtering on A.a. > Additionally, for a more complex query such as select * from A join (select * from > B where B.c = 1) X on A.a = X.b, we collect real-time statistics (such as > min/max/bloom filter) of X by executing an extra SQL query (select max(b), min(b), bbf(b) > from X). > In both scenarios we can filter out lots of data by partition pruning or > data filtering, and thus improve performance. > A 10TB TPC-DS run gained about a 35% improvement in our tests. > I will submit a SPIP later. > SPIP: > https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
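The runtime-filter idea above can be sketched in plain Scala (no Spark dependency). This is a hypothetical illustration, not Spark's implementation: `buildStats` stands in for the extra `select max(b), min(b), bbf(b) from B` query, `SimpleBloom` is a toy stand-in for a real bloom filter, and `prune` stands in for the pushed-down filter on the big-table scan. All names are illustrative.

```scala
// Sketch of the runtime-filter idea: collect min/max and a Bloom-like filter
// from the small (build) side, then discard big-side rows before the join.
object RuntimeFilterSketch {
  // A tiny Bloom filter over Long keys (two hash functions, fixed bit count).
  final class SimpleBloom(bits: Int) {
    private val set = new java.util.BitSet(bits)
    private def h1(x: Long): Int = (math.abs(x * 31 + 7) % bits).toInt
    private def h2(x: Long): Int = (math.abs(x * 131 + 13) % bits).toInt
    def add(x: Long): Unit = { set.set(h1(x)); set.set(h2(x)) }
    // May return false positives, never false negatives.
    def mightContain(x: Long): Boolean = set.get(h1(x)) && set.get(h2(x))
  }

  // Step 1: one pass over the small side, collecting min, max and the filter
  // (corresponds to the extra "select max(b), min(b), bbf(b) from B" query).
  def buildStats(small: Seq[Long]): (Long, Long, SimpleBloom) = {
    val bloom = new SimpleBloom(1024)
    small.foreach(bloom.add)
    (small.min, small.max, bloom)
  }

  // Step 2: push the stats down to the big-side scan as a pre-join filter.
  def prune(big: Seq[Long], min: Long, max: Long, bloom: SimpleBloom): Seq[Long] =
    big.filter(a => a >= min && a <= max && bloom.mightContain(a))
}
```

Because a bloom filter has no false negatives, every big-side row that actually joins survives the prune; most non-matching rows are discarded before the join executes, which is where the claimed speedup comes from.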
[jira] [Commented] (SPARK-24641) Spark-Mesos integration doesn't respect request to abort itself
[ https://issues.apache.org/jira/browse/SPARK-24641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833549#comment-16833549 ] Igor Berman commented on SPARK-24641: - I believe this issue is connected to https://issues.apache.org/jira/browse/SPARK-15359, i.e. what I mentioned as "zombie" mode, which happens after the dispatcher gets aborted. Since it runs in a separate thread and nobody reacts when that thread finishes, the framework can become inactive (if aborted) or stopped (not sure exactly when), while the embedding Java app with the Spark context may continue to run. > Spark-Mesos integration doesn't respect request to abort itself > --- > > Key: SPARK-24641 > URL: https://issues.apache.org/jira/browse/SPARK-24641 > Project: Spark > Issue Type: Bug > Components: Mesos, Shuffle >Affects Versions: 2.2.0 >Reporter: Igor Berman >Priority: Major > > Hi, > lately we came across the following corner scenario: > We are using dynamic allocation with an external shuffle service that is managed > by Marathon. > > Due to some network/operation issue, the external shuffle service on one of > the machines (mesos-slaves) is not available for a few seconds (e.g. Marathon > hasn't yet provisioned the external shuffle service on a particular node, but > the framework itself has already accepted an offer on this node and tries to start up an > executor). > > This makes the framework (Spark driver) fail, and I see an error in the stderr of the > driver (it seems the mesos-agent asks the driver to abort itself); however, the Spark > context continues to run (seemingly in a kind of zombie mode, since it can't > release resources to the cluster and can't get additional offers, as the > framework is aborted from the Mesos perspective). > > The framework in the Mesos UI moves to the "inactive" state. > [~skonto] [~susanxhuynh] any input on this problem? Have you come across such > behavior? 
> I'm ready to work on some patch, but currently I don't understand where to > start; it seems the driver is too fragile in this sense and something in the > mesos-spark integration is missing > > > {code:java} > I0412 07:31:25.827283 274 sched.cpp:759] Framework registered with > 15d9838f-b266-413b-842d-f7c3567bd04a-0051 Exception in thread "Thread-295" > java.io.IOException: Failed to connect to my-company.com/10.106.14.61:7337 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182) > at > org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75) > at > org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537) > Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: my-company.com/10.106.14.61:7337 at > sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:748) I0412 07:35:12.032925 
277 > sched.cpp:2055] Asked to abort the driver I0412 07:35:12.033035 277 > sched.cpp:1233] Aborting framework 15d9838f-b266-413b-842d-f7c3567bd04a-0051 > {code}
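The missing propagation described in this report — the scheduler-driver thread finishes, but nothing tells the embedding app — could be addressed along these lines. This is a hypothetical sketch, not Spark's or Mesos's actual code: `runDriver` stands in for the blocking `MesosSchedulerDriver.run()` call (which returns a status when the driver stops or is aborted), and `onExit` would call something like `SparkContext.stop()` in the real integration.

```scala
// Run the blocking driver loop on its own thread; when it returns (e.g.
// because Mesos asked the framework to abort), invoke a shutdown callback
// instead of letting the rest of the app keep running as a "zombie".
object DriverWatchdog {
  def watch(runDriver: () => String)(onExit: String => Unit): Thread = {
    val t = new Thread(new Runnable {
      def run(): Unit = {
        val status = runDriver() // blocks until the driver stops or aborts
        onExit(status)           // propagate, instead of dying silently
      }
    }, "mesos-driver-watchdog")
    t.setDaemon(true)
    t.start()
    t
  }
}
```

The point of the sketch is that someone must observe the driver thread's exit and react; today, per the report, the thread finishes and the SparkContext lingers with no way to obtain offers.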
[jira] [Updated] (SPARK-27439) Explaining Dataset should show correct resolved plans
[ https://issues.apache.org/jira/browse/SPARK-27439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27439: -- Summary: Explaining Dataset should show correct resolved plans (was: Use analyzed plan when explaining Dataset) > Explaining Dataset should show correct resolved plans > -- > > Key: SPARK-27439 > URL: https://issues.apache.org/jira/browse/SPARK-27439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: xjl >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 3.0.0 > > > {code} > scala> spark.range(10).createOrReplaceTempView("test") > scala> spark.range(5).createOrReplaceTempView("test2") > scala> spark.sql("select * from test").createOrReplaceTempView("tmp001") > scala> val df = spark.sql("select * from tmp001") > scala> spark.sql("select * from test2").createOrReplaceTempView("tmp001") > scala> df.show > +---+ > | id| > +---+ > | 0| > | 1| > | 2| > | 3| > | 4| > | 5| > | 6| > | 7| > | 8| > | 9| > +---+ > scala> df.explain > {code} > Before: > {code} > == Physical Plan == > *(1) Range (0, 5, step=1, splits=12) > {code} > After: > {code} > == Physical Plan == > *(1) Range (0, 10, step=1, splits=12) > {code}
[jira] [Resolved] (SPARK-27439) Use analyzed plan when explaining Dataset
[ https://issues.apache.org/jira/browse/SPARK-27439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27439. --- Resolution: Fixed Assignee: Liang-Chi Hsieh This is resolved via https://github.com/apache/spark/pull/24464 > Use analyzed plan when explaining Dataset > - > > Key: SPARK-27439 > URL: https://issues.apache.org/jira/browse/SPARK-27439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: xjl >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 3.0.0 > > > {code} > scala> spark.range(10).createOrReplaceTempView("test") > scala> spark.range(5).createOrReplaceTempView("test2") > scala> spark.sql("select * from test").createOrReplaceTempView("tmp001") > scala> val df = spark.sql("select * from tmp001") > scala> spark.sql("select * from test2").createOrReplaceTempView("tmp001") > scala> df.show > +---+ > | id| > +---+ > | 0| > | 1| > | 2| > | 3| > | 4| > | 5| > | 6| > | 7| > | 8| > | 9| > +---+ > scala> df.explain > {code} > Before: > {code} > == Physical Plan == > *(1) Range (0, 5, step=1, splits=12) > {code} > After: > {code} > == Physical Plan == > *(1) Range (0, 10, step=1, splits=12) > {code}
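The mechanism behind this bug can be illustrated without Spark: a Dataset captures its analyzed plan when it is created, so `df.show` keeps printing the old view's rows, but if `explain` re-resolves the view name at call time it picks up the re-registered view instead. A plain-Scala sketch of the two behaviors, with illustrative names only:

```scala
// Sketch of eager vs. lazy plan capture behind SPARK-27439 (names are
// illustrative; this is not Spark's implementation).
object ExplainSketch {
  // Stand-in for the temp-view catalog: view name -> "plan".
  val views = scala.collection.mutable.Map[String, String]()

  final class Dataset(viewName: String) {
    private val analyzedPlan = views(viewName)   // resolved eagerly, at creation
    def show(): String = analyzedPlan            // execution uses the captured plan
    def explainLazy(): String = views(viewName)  // buggy: re-resolves at explain time
    def explainFixed(): String = analyzedPlan    // fix: reuse the analyzed plan
  }
}
```

With `views("tmp001")` first registered as one plan, a Dataset created against it, and the view then re-registered, `explainLazy` disagrees with `show` (the reported mismatch), while `explainFixed` matches what actually runs — which is what the PR above makes `Dataset.explain` do.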