[jira] [Commented] (SPARK-23711) Add fallback to interpreted execution logic
[ https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443588#comment-16443588 ] Takeshi Yamamuro commented on SPARK-23711: -- +1, I also think so. > Add fallback to interpreted execution logic > --- > > Key: SPARK-23711 > URL: https://issues.apache.org/jira/browse/SPARK-23711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23711) Add fallback to interpreted execution logic
[ https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443573#comment-16443573 ] Liang-Chi Hsieh commented on SPARK-23711: - About this: I suppose that in {{WholeStageCodegenExec}} we already have logic to fall back to interpreted mode? Are there other places where we need to add this kind of fallback? > Add fallback to interpreted execution logic > --- > > Key: SPARK-23711 > URL: https://issues.apache.org/jira/browse/SPARK-23711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
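The general shape of the fallback being discussed can be sketched in plain Scala: attempt the "compiled" fast path, and if it throws, fall back to an interpreted evaluator. This is an illustrative sketch only; the names (compiledDouble, interpretedDouble, projectionWithFallback) are hypothetical and not Spark's actual API.

```scala
// Simulate a codegen path that fails at "compile" time.
def compiledDouble(): Int => Int =
  throw new UnsupportedOperationException("codegen failed")

// The slower but always-available interpreted path.
def interpretedDouble(): Int => Int = x => x * 2

// Try codegen first; on any failure, fall back to interpreted execution.
def projectionWithFallback(): Int => Int =
  try compiledDouble()
  catch {
    case _: Exception => interpretedDouble()
  }

val f = projectionWithFallback()
println(f(21)) // 42, served by the interpreted fallback
```

The point of the JIRA is deciding where such try/catch seams should live beyond `WholeStageCodegenExec` itself.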
[jira] [Updated] (SPARK-24008) SQL/Hive Context fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated SPARK-24008: -- Description: SQL / Hive Context fails with NullPointerException while getting configuration from SQLConf. This happens when the MemoryStore fills up with many broadcasts and starts dropping blocks, and the SQL / Hive Context is then created and broadcast. Using this Context to access a table fails with the NullPointerException below. A repro is attached - a Spark example which fills the MemoryStore with broadcasts and then creates and accesses a SQL Context. {code} java.lang.NullPointerException at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) at org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558) at org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362) at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623) at SparkHiveExample$.main(SparkHiveExample.scala:76) at SparkHiveExample.main(SparkHiveExample.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: java.lang.NullPointerException java.lang.NullPointerException at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) at org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166) at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258) at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) at org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:475) at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475) at 
org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) {code} The MemoryStore filled up and started dropping blocks. {code} 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in memory (estimated size 78.1 MB, free 64.4 MB) 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes in memory (estimated size 1522.0 B, free 64.4 MB) 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 350.9 KB, free 64.1 MB) 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 29.9 KB, free 64.0 MB) 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in memory (estimated size 78.1 MB, free 64.7 MB) 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes in memory (estimated size 1522.0 B, free 64.7 MB) 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 136.0 B, free 64.7 MB) 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared {code} The fix is to stop broadcasting the SQL/Hive Context, or to increase the driver memory. was: SQL / Hive Context fails with NullPointerException while getting configuration from SQLConf. This happens when the MemoryStore is filled with lot of broadcast and started dropping and then SQL / Hive Context is created and broadcast. 
When using this Context to access a table fails with below NullPointerException. Repro is attached - the Spark Example which fills the MemoryStore with broadcasts and then creates and accesses a SQL Context. {code} java.lang.NullPointerException at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) at org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558) at org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362) at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623) at SparkHiveExample$.main(SparkHiveExample.scala:76) at SparkHiveExample.main(SparkHiveExample.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at
[jira] [Commented] (SPARK-24015) Does method SchedulerBackend.isReady is really well implemented?
[ https://issues.apache.org/jira/browse/SPARK-24015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443509#comment-16443509 ] Hyukjin Kwon commented on SPARK-24015: -- [~chenk], is this a question? We don't usually open a JIRA for a question; you would get a better answer on the mailing list. It's better to ask there first before filing this here as a bug. > Does method SchedulerBackend.isReady is really well implemented? > > > Key: SPARK-24015 > URL: https://issues.apache.org/jira/browse/SPARK-24015 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: chenk >Priority: Minor > > The method *isReady* in the interface *SchedulerBackend*, which is implemented by > *CoarseGrainedSchedulerBackend*, defaults to true. > The implementation in *CoarseGrainedSchedulerBackend* uses the > *sufficientResourcesRegistered* method to check it, which also defaults to > true. > So what is the meaning of the *isReady* method, given that it always > returns true? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24015) Does method SchedulerBackend.isReady is really well implemented?
[ https://issues.apache.org/jira/browse/SPARK-24015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24015. -- Resolution: Invalid Let me leave this resolved, but please let me know if I am mistaken. > Does method SchedulerBackend.isReady is really well implemented? > > > Key: SPARK-24015 > URL: https://issues.apache.org/jira/browse/SPARK-24015 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: chenk >Priority: Minor > > The method *isReady* in the interface *SchedulerBackend*, which is implemented by > *CoarseGrainedSchedulerBackend*, defaults to true. > The implementation in *CoarseGrainedSchedulerBackend* uses the > *sufficientResourcesRegistered* method to check it, which also defaults to > true. > So what is the meaning of the *isReady* method, given that it always > returns true? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
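The design question behind this issue is the template-method pattern: a default-true hook only becomes meaningful once a subclass overrides it. A minimal plain-Scala sketch (illustrative names only, not Spark's actual classes) shows why a default of true is not the same as "always true":

```scala
// isReady delegates to a hook that defaults to true, so the base
// behavior is "ready immediately" unless a backend overrides it.
trait Backend {
  def sufficientResourcesRegistered(): Boolean = true
  def isReady(): Boolean = sufficientResourcesRegistered()
}

// A backend that waits until a fraction of expected executors register.
class RatioBackend(total: Int, minRatio: Double) extends Backend {
  var registered = 0
  override def sufficientResourcesRegistered(): Boolean =
    registered >= total * minRatio
}

val b = new RatioBackend(10, 0.8)
println(b.isReady()) // false: 0 of 10 registered
b.registered = 8
println(b.isReady()) // true: 80% threshold reached
```

Under this reading, a backend that never overrides the hook does indeed always return true, which is what the reporter observed.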
[jira] [Resolved] (SPARK-23919) High-order function: array_position(x, element) → bigint
[ https://issues.apache.org/jira/browse/SPARK-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-23919. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21037 [https://github.com/apache/spark/pull/21037] > High-order function: array_position(x, element) → bigint > > > Key: SPARK-23919 > URL: https://issues.apache.org/jira/browse/SPARK-23919 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.4.0 > > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns the position of the first occurrence of the element in array x (or 0 > if not found). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
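The contract quoted from the Presto docs (1-based position of the first occurrence, 0 if absent) can be modeled in a few lines of plain Scala. This is a sketch of the semantics only, not Spark's implementation:

```scala
// 1-based position of the first occurrence of `element` in `xs`;
// Seq.indexOf returns -1 when absent, so adding 1 yields 0 for "not found".
def arrayPosition[T](xs: Seq[T], element: T): Long =
  xs.indexOf(element) + 1L

println(arrayPosition(Seq(3, 1, 4, 1), 1)) // 2: first occurrence wins
println(arrayPosition(Seq(3, 1, 4), 9))    // 0: not found
```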
[jira] [Assigned] (SPARK-23919) High-order function: array_position(x, element) → bigint
[ https://issues.apache.org/jira/browse/SPARK-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-23919: - Assignee: Kazuaki Ishizaki > High-order function: array_position(x, element) → bigint > > > Key: SPARK-23919 > URL: https://issues.apache.org/jira/browse/SPARK-23919 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Kazuaki Ishizaki >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns the position of the first occurrence of the element in array x (or 0 > if not found). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24021) Fix bug in BlacklistTracker's updateBlacklistForFetchFailure
[ https://issues.apache.org/jira/browse/SPARK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-24021: - Description: There's a miswrite in BlacklistTracker's updateBlacklistForFetchFailure: {code:java} val blacklistedExecsOnNode = nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]()) blacklistedExecsOnNode += exec{code} where the first *exec* should be *host*. was: There's a miswrite in BlacklistTracker's updateBlacklistForFetchFailure: ``` val blacklistedExecsOnNode = nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]()) blacklistedExecsOnNode += exec ``` where the first **exec** should be **host**. > Fix bug in BlacklistTracker's updateBlacklistForFetchFailure > > > Key: SPARK-24021 > URL: https://issues.apache.org/jira/browse/SPARK-24021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Major > Labels: easyfix > > There's a miswrite in BlacklistTracker's updateBlacklistForFetchFailure: > > {code:java} > val blacklistedExecsOnNode = > nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]()) > blacklistedExecsOnNode += exec{code} > > where the first *exec* should be *host*. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
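The keying bug can be reproduced with plain Scala collections, no Spark required. The variable names mirror the snippet in the report; the intent is that the map groups blacklisted executors per node, so it must be keyed by the host, not by the executor id:

```scala
import scala.collection.mutable

val nodeToBlacklistedExecs = mutable.HashMap.empty[String, mutable.HashSet[String]]
val host = "host1"
val exec = "exec-7"

// Buggy version (keyed by exec): every executor would get its own
// singleton set, and lookups by host would find nothing.
//   nodeToBlacklistedExecs.getOrElseUpdate(exec, mutable.HashSet[String]())

// Fixed version: keyed by host, accumulating all blacklisted
// executors that live on that node.
val blacklistedExecsOnNode =
  nodeToBlacklistedExecs.getOrElseUpdate(host, mutable.HashSet[String]())
blacklistedExecsOnNode += exec

println(nodeToBlacklistedExecs.get("host1")) // the set now contains exec-7
```

`getOrElseUpdate` returns the existing set for the key (creating and inserting an empty one if absent), so mutating the returned set updates the map in place.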
[jira] [Assigned] (SPARK-24021) Fix bug in BlacklistTracker's updateBlacklistForFetchFailure
[ https://issues.apache.org/jira/browse/SPARK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24021: Assignee: Apache Spark > Fix bug in BlacklistTracker's updateBlacklistForFetchFailure > > > Key: SPARK-24021 > URL: https://issues.apache.org/jira/browse/SPARK-24021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > Labels: easyfix > > There's a miswrite in BlacklistTracker's updateBlacklistForFetchFailure: > ``` > val blacklistedExecsOnNode = > nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]()) > blacklistedExecsOnNode += exec > ``` > where the first **exec** should be **host**. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24021) Fix bug in BlacklistTracker's updateBlacklistForFetchFailure
[ https://issues.apache.org/jira/browse/SPARK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24021: Assignee: (was: Apache Spark) > Fix bug in BlacklistTracker's updateBlacklistForFetchFailure > > > Key: SPARK-24021 > URL: https://issues.apache.org/jira/browse/SPARK-24021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Major > Labels: easyfix > > There's a miswrite in BlacklistTracker's updateBlacklistForFetchFailure: > ``` > val blacklistedExecsOnNode = > nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]()) > blacklistedExecsOnNode += exec > ``` > where the first **exec** should be **host**. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24021) Fix bug in BlacklistTracker's updateBlacklistForFetchFailure
[ https://issues.apache.org/jira/browse/SPARK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443496#comment-16443496 ] Apache Spark commented on SPARK-24021: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/21104 > Fix bug in BlacklistTracker's updateBlacklistForFetchFailure > > > Key: SPARK-24021 > URL: https://issues.apache.org/jira/browse/SPARK-24021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Major > Labels: easyfix > > There's a miswrite in BlacklistTracker's updateBlacklistForFetchFailure: > ``` > val blacklistedExecsOnNode = > nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]()) > blacklistedExecsOnNode += exec > ``` > where the first **exec** should be **host**. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24021) Fix bug in BlacklistTracker's updateBlacklistForFetchFailure
wuyi created SPARK-24021: Summary: Fix bug in BlacklistTracker's updateBlacklistForFetchFailure Key: SPARK-24021 URL: https://issues.apache.org/jira/browse/SPARK-24021 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: wuyi There's a miswrite in BlacklistTracker's updateBlacklistForFetchFailure: ``` val blacklistedExecsOnNode = nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]()) blacklistedExecsOnNode += exec ``` where the first **exec** should be **host**. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24008) SQL/Hive Context fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443495#comment-16443495 ] Saisai Shao commented on SPARK-24008: - The case you provided does not seem valid: you're trying to broadcast the SQL entry point and RDDs, which should not be broadcast. I'm not sure whether the issue here is simply due to this invalid usage pattern. If you hit this with valid usage, would you please provide a more meaningful reproduction? > SQL/Hive Context fails with NullPointerException > - > > Key: SPARK-24008 > URL: https://issues.apache.org/jira/browse/SPARK-24008 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > Attachments: Repro > > > SQL / Hive Context fails with NullPointerException while getting > configuration from SQLConf. This happens when the MemoryStore is filled with > lot of broadcast and started dropping and then SQL / Hive Context is created > and broadcast. When using this Context to access a table fails with below > NullPointerException. > Repro is attached - the Spark Example which fills the MemoryStore with > broadcasts and then creates and accesses a SQL Context. 
> {code} > java.lang.NullPointerException > at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) > at > org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558) > at > org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362) > at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623) > at SparkHiveExample$.main(SparkHiveExample.scala:76) > at SparkHiveExample.main(SparkHiveExample.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: > java.lang.NullPointerException > java.lang.NullPointerException > at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) > at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) > at > org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166) > > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258) > > at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) > at > org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:475) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475) > > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) > {code} > MemoryStore got filled and started dropping the blocks. 
> {code} > 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in > memory (estimated size 78.1 MB, free 64.4 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes > in memory (estimated size 1522.0 B, free 64.4 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in > memory (estimated size 350.9 KB, free 64.1 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes > in memory (estimated size 29.9 KB, free 64.0 MB) > 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in > memory (estimated size 78.1 MB, free 64.7 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes > in memory (estimated size 1522.0 B, free 64.7 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in > memory (estimated size 136.0 B, free 64.7 MB) > 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared > 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB > 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared > 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB > 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
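One likely mechanism for the NPE above (an assumption on my part, not confirmed in the thread) is that driver-side contexts hold `@transient` state that does not survive serialization, so a broadcast-and-deserialized copy dereferences null. That shape can be demonstrated without Spark, using a hypothetical `FakeContext` class:

```scala
import java.io._

// A stand-in for a driver-side context: the conf is @transient, so it is
// dropped during Java serialization and comes back as null.
class FakeContext extends Serializable {
  @transient val conf: java.util.HashMap[String, String] =
    new java.util.HashMap()
  def getConf(key: String): String = conf.get(key) // NPE when conf is null
}

// Serialize and deserialize, as a broadcast effectively does.
def roundTrip[T <: Serializable](obj: T): T = {
  val bytes = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bytes)
  oos.writeObject(obj)
  oos.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[T]
}

val copy = roundTrip(new FakeContext)
println(copy.conf == null) // true: the transient field did not survive
```

This is consistent with the advice in the thread: use the context from the driver directly rather than broadcasting it.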
[jira] [Commented] (SPARK-24008) SQL/Hive Context fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443484#comment-16443484 ] Prabhu Joseph commented on SPARK-24008: --- Yes, it's driver-specific, but it would be better to handle this case gracefully instead of failing. > SQL/Hive Context fails with NullPointerException > - > > Key: SPARK-24008 > URL: https://issues.apache.org/jira/browse/SPARK-24008 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > Attachments: Repro > > > SQL / Hive Context fails with NullPointerException while getting > configuration from SQLConf. This happens when the MemoryStore is filled with > lot of broadcast and started dropping and then SQL / Hive Context is created > and broadcast. When using this Context to access a table fails with below > NullPointerException. > Repro is attached - the Spark Example which fills the MemoryStore with > broadcasts and then creates and accesses a SQL Context. 
> {code} > java.lang.NullPointerException > at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) > at > org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558) > at > org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362) > at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623) > at SparkHiveExample$.main(SparkHiveExample.scala:76) > at SparkHiveExample.main(SparkHiveExample.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: > java.lang.NullPointerException > java.lang.NullPointerException > at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) > at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) > at > org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166) > > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258) > > at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) > at > org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:475) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475) > > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) > {code} > MemoryStore got filled and started dropping the blocks. 
> {code} > 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in > memory (estimated size 78.1 MB, free 64.4 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes > in memory (estimated size 1522.0 B, free 64.4 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in > memory (estimated size 350.9 KB, free 64.1 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes > in memory (estimated size 29.9 KB, free 64.0 MB) > 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in > memory (estimated size 78.1 MB, free 64.7 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes > in memory (estimated size 1522.0 B, free 64.7 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in > memory (estimated size 136.0 B, free 64.7 MB) > 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared > 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB > 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared > 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB > 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24008) SQL/Hive Context fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443475#comment-16443475 ] Saisai Shao commented on SPARK-24008: - Why do you need to broadcast {{SQLContext}} or {{HiveContext}}? > SQL/Hive Context fails with NullPointerException > - > > Key: SPARK-24008 > URL: https://issues.apache.org/jira/browse/SPARK-24008 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > Attachments: Repro > > > SQL / Hive Context fails with NullPointerException while getting > configuration from SQLConf. This happens when the MemoryStore is filled with > lot of broadcast and started dropping and then SQL / Hive Context is created > and broadcast. When using this Context to access a table fails with below > NullPointerException. > Repro is attached - the Spark Example which fills the MemoryStore with > broadcasts and then creates and accesses a SQL Context. > {code} > java.lang.NullPointerException > at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) > at > org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558) > at > org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362) > at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623) > at SparkHiveExample$.main(SparkHiveExample.scala:76) > at SparkHiveExample.main(SparkHiveExample.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: > java.lang.NullPointerException > java.lang.NullPointerException > at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) > at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) > at > 
org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166) > > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258) > > at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) > at > org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:475) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475) > > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) > {code} > MemoryStore got filled and started dropping the blocks. > {code} > 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in > memory (estimated size 78.1 MB, free 64.4 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes > in memory (estimated size 1522.0 B, free 64.4 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in > memory (estimated size 350.9 KB, free 64.1 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes > in memory (estimated size 29.9 KB, free 64.0 MB) > 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in > memory (estimated size 78.1 MB, free 64.7 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes > in memory (estimated size 1522.0 B, free 64.7 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in > memory (estimated size 136.0 B, free 64.7 MB) > 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared > 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB > 18/04/17 08:03:44 INFO MemoryStore: MemoryStore 
cleared > 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB > 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24014) Add onStreamingStarted method to StreamingListener
[ https://issues.apache.org/jira/browse/SPARK-24014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24014. - Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21098 [https://github.com/apache/spark/pull/21098] > Add onStreamingStarted method to StreamingListener > -- > > Key: SPARK-24014 > URL: https://issues.apache.org/jira/browse/SPARK-24014 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Trivial > Fix For: 2.4.0, 2.3.1 > > > The {{StreamingListener}} on the PySpark side seems to lack the > {{onStreamingStarted}} method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
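The change follows the usual listener-callback pattern: new events are exposed as no-op default methods, so existing listeners keep working and only interested ones override. A plain-Scala sketch of that shape (illustrative only; the issue itself concerns the PySpark side, and the class names here are hypothetical):

```scala
// A minimal event carrying the stream start time.
case class StreamingStarted(time: Long)

// New callbacks get a no-op default body, so adding one is
// source-compatible with every existing listener implementation.
trait StreamingListener {
  def onStreamingStarted(event: StreamingStarted): Unit = {}
}

// A listener that opts in by overriding the callback.
class MyListener extends StreamingListener {
  var startedAt: Long = -1L
  override def onStreamingStarted(event: StreamingStarted): Unit =
    startedAt = event.time
}

val l = new MyListener
l.onStreamingStarted(StreamingStarted(1234L))
println(l.startedAt) // 1234
```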
[jira] [Assigned] (SPARK-24014) Add onStreamingStarted method to StreamingListener
[ https://issues.apache.org/jira/browse/SPARK-24014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-24014: --- Assignee: Liang-Chi Hsieh > Add onStreamingStarted method to StreamingListener > -- > > Key: SPARK-24014 > URL: https://issues.apache.org/jira/browse/SPARK-24014 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Trivial > > The {{StreamingListener}} on the PySpark side seems to lack the > {{onStreamingStarted}} method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17901) NettyRpcEndpointRef: Error sending message and Caused by: java.util.ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-17901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443449#comment-16443449 ] zhangzhaolong commented on SPARK-17901: --- Hi [~srowen], was this reopened on master? We encountered the same problem. > NettyRpcEndpointRef: Error sending message and Caused by: > java.util.ConcurrentModificationException > --- > > Key: SPARK-17901 > URL: https://issues.apache.org/jira/browse/SPARK-17901 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.1 >Reporter: Harish >Priority: Major > > I have two data frames, one with 10K rows and 10,000 columns and another with 4M > rows and 50 columns. I joined these and am trying to find the mean of the merged data > set; > I calculated the mean with a lambda using Python's mean() function, since I can't write > it in PySpark due to the 64KB code limit issue. > After calculating the mean I did rdd.take(2) and it works, but creating the DF > from the RDD and DF.show has been in progress for more than 2 hours (I stopped the > process) with the message below (102 GB, 6 cores per node -- total 10 nodes + > 1 master) > 16/10/13 04:36:31 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(9,[Lscala.Tuple2;@68eedafb,BlockManagerId(9, 10.63.136.103, > 35729))] in 1 attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > 
java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441) > at > java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081) > at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at >
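The trace above bottoms out in {{java.util.ArrayList.writeObject}}: the heartbeat serializes an accumulator list while another thread is still mutating it, and ArrayList fails fast whenever its modCount changes mid-traversal. A minimal, deterministic sketch of the same failure class (triggered here via iteration rather than serialization, purely for illustration; all names are made up):

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {
    /** Returns true if mutating an ArrayList mid-traversal raises CME. */
    static boolean triggersCme() {
        List<Integer> list = new ArrayList<>(List.of(1, 2, 3));
        try {
            // Any structural modification bumps modCount; the iterator (and,
            // in the Spark trace, ArrayList.writeObject) detects it and throws.
            for (Integer i : list) {
                list.remove(i);
            }
        } catch (ConcurrentModificationException e) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(triggersCme()); // true
        // The usual remedy is to serialize a defensive snapshot instead:
        List<Integer> live = new ArrayList<>(List.of(1, 2, 3));
        List<Integer> snapshot = new ArrayList<>(live); // safe to walk or serialize
        System.out.println(snapshot.size()); // 3
    }
}
```

The same pattern applies to the Spark case: taking a synchronized copy of the shared list before handing it to the serializer avoids the race.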
[jira] [Commented] (SPARK-24020) Sort-merge join inner range optimization
[ https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443392#comment-16443392 ] Henry Robinson commented on SPARK-24020: This sounds like a 'band join' (e.g. http://pages.cs.wisc.edu/~dewitt/includes/paralleldb/vldb91.pdf, also see [Oracle's documentation|https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sqlrf/Joins.html#GUID-568EC26F-199A-4339-BFD9-C4A0B9588937]). Does your implementation also handle non-equi joins? e.g. {{WHERE t1.Y BETWEEN t2.Y -d AND t2.Y +d}}, with no equality clause in the join predicates. > Sort-merge join inner range optimization > > > Key: SPARK-24020 > URL: https://issues.apache.org/jira/browse/SPARK-24020 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Petar Zecevic >Priority: Major > > The problem we are solving is the case where you have two big tables > partitioned by X column, but also sorted by Y column (within partitions) and > you need to calculate an expensive function on the joined rows. During a > sort-merge join, Spark will do cross-joins of all rows that have the same X > values and calculate the function's value on all of them. If the two tables > have a large number of rows per X, this can result in a huge number of > calculations. > We hereby propose an optimization that would allow you to reduce the number > of matching rows per X using a range condition on Y columns of the two > tables. Something like: > ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d > The way SMJ is currently implemented, these extra conditions have no > influence on the number of rows (per X) being checked because these extra > conditions are put in the same block with the function being calculated. > Here we propose a change to the sort-merge join so that, when these extra > conditions are specified, a queue is used instead of the > ExternalAppendOnlyUnsafeRowArray class. 
This queue would then be used as a > moving window across the values from the right relation as the left row > changes. You could call this a combination of an equi-join and a theta join > (we call it "sort-merge inner range join"). > Potential use-cases for this are joins based on spatial or temporal distance > calculations. > The optimization should be triggered automatically when an equi-join > expression is present AND lower and upper range conditions on a secondary > column are specified. If the tables aren't sorted by both columns, > appropriate sorts should be added. > To limit the impact of this change we also propose adding a new parameter > (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which > could be used to switch off the optimization entirely. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
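The moving-window idea in the proposal can be sketched outside Spark as a plain merge over two relations sorted by (X, Y). This is an illustration of the algorithm only, not Spark's actual implementation; all names are made up. Instead of buffering every right row for the current X (as the existing per-key buffer would), a queue keeps only the right rows whose Y lies in [leftY - d, leftY + d]:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class InnerRangeJoinSketch {
    /** A row with an equi-join key x and a secondary (sorted) column y. */
    record Row(int x, double y) {}

    /**
     * Emits (leftY, rightY) pairs with left.x == right.x and
     * |left.y - right.y| <= d. Both inputs must be sorted by (x, y).
     * The deque is the "moving window": rows below y - d are evicted from
     * the front and never revisited, so each right row is enqueued once.
     */
    static List<double[]> join(List<Row> left, List<Row> right, double d) {
        List<double[]> out = new ArrayList<>();
        Deque<Row> window = new ArrayDeque<>();
        int r = 0;
        for (Row l : left) {
            // Evict rows left over from a previous x group, or below the window.
            while (!window.isEmpty()
                    && (window.peekFirst().x() != l.x()
                        || window.peekFirst().y() < l.y() - d)) {
                window.pollFirst();
            }
            // Skip right rows whose key is too small ...
            while (r < right.size() && right.get(r).x() < l.x()) {
                r++;
            }
            // ... then pull matching-key rows up to y + d into the window.
            while (r < right.size() && right.get(r).x() == l.x()
                    && right.get(r).y() <= l.y() + d) {
                window.addLast(right.get(r++));
            }
            for (Row w : window) {
                out.add(new double[] {l.y(), w.y()});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> left = List.of(new Row(1, 1.0), new Row(1, 3.0), new Row(2, 5.0));
        List<Row> right = List.of(new Row(1, 0.5), new Row(1, 2.0),
                                  new Row(1, 4.5), new Row(2, 5.2));
        System.out.println(join(left, right, 1.5).size()); // 5 matching pairs
    }
}
```

Because the window only moves forward, the work per key group is proportional to the number of matching pairs rather than the full cross-product per X, which is exactly the saving the proposal describes.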
[jira] [Created] (SPARK-24020) Sort-merge join inner range optimization
Petar Zecevic created SPARK-24020: - Summary: Sort-merge join inner range optimization Key: SPARK-24020 URL: https://issues.apache.org/jira/browse/SPARK-24020 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Petar Zecevic The problem we are solving is the case where you have two big tables partitioned by X column, but also sorted by Y column (within partitions) and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will do cross-joins of all rows that have the same X values and calculate the function's value on all of them. If the two tables have a large number of rows per X, this can result in a huge number of calculations. We hereby propose an optimization that would allow you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like: ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated. Here we propose a change to the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join"). Potential use-cases for this are joins based on spatial or temporal distance calculations. The optimization should be triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts should be added. 
To limit the impact of this change we also propose adding a new parameter (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could be used to switch off the optimization entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23775) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23775. Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 20888 [https://github.com/apache/spark/pull/20888] > Flaky test: DataFrameRangeSuite > --- > > Key: SPARK-23775 > URL: https://issues.apache.org/jira/browse/SPARK-23775 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 2.4.0, 2.3.1 > > Attachments: filtered.log, filtered_more_logs.log > > > DataFrameRangeSuite.test("Cancelling stage in a query with Range.") sometimes > gets stuck in an infinite loop and times out the build. > I presume the original intention of this test is to start a job with range > and just cancel it. > The submitted job has 2 stages, but I think the author tried to cancel the > first stage with ID 0, which is not the case here: > {code:java} > eventually(timeout(10.seconds), interval(1.millis)) { > assert(DataFrameRangeSuite.stageToKill > 0) > } > {code} > All in all, if the first stage is slower than 10 seconds, it throws > TestFailedDueToTimeoutException and cancelStage will never be called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
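The failure mode described above is that the polling assertion gives up before the condition ever becomes true, so the cancellation step after it never runs. A standalone analogue of ScalaTest's {{eventually}} (a hypothetical helper, returning a boolean where ScalaTest would throw TestFailedDueToTimeoutException) makes that concrete:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class Eventually {
    /**
     * Polls cond every intervalMs until it is true or timeoutMs elapses.
     * Returns false on timeout; ScalaTest's eventually throws instead,
     * aborting anything scheduled after it -- e.g. the cancelStage call
     * in the flaky test above.
     */
    static boolean eventually(long timeoutMs, long intervalMs, BooleanSupplier cond) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (cond.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger polls = new AtomicInteger();
        // Condition flips true on the third poll: succeeds well within 1s.
        System.out.println(eventually(1000, 1, () -> polls.incrementAndGet() >= 3));
        // Condition never flips: times out, and any follow-up step is skipped.
        System.out.println(eventually(50, 10, () -> false));
    }
}
```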
[jira] [Assigned] (SPARK-23775) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23775: -- Assignee: Gabor Somogyi > Flaky test: DataFrameRangeSuite > --- > > Key: SPARK-23775 > URL: https://issues.apache.org/jira/browse/SPARK-23775 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: filtered.log, filtered_more_logs.log > > > DataFrameRangeSuite.test("Cancelling stage in a query with Range.") sometimes > gets stuck in an infinite loop and times out the build. > I presume the original intention of this test is to start a job with range > and just cancel it. > The submitted job has 2 stages, but I think the author tried to cancel the > first stage with ID 0, which is not the case here: > {code:java} > eventually(timeout(10.seconds), interval(1.millis)) { > assert(DataFrameRangeSuite.stageToKill > 0) > } > {code} > All in all, if the first stage is slower than 10 seconds, it throws > TestFailedDueToTimeoutException and cancelStage will never be called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443239#comment-16443239 ] Jordan Moore commented on SPARK-18057: -- Thanks, [~c...@koeninger.org]. I will forward that information to the relevant parties. > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-19941) Spark should not schedule tasks on executors on decommissioning YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Palaniappan reopened SPARK-19941: - This should be marked as a duplicate of https://issues.apache.org/jira/browse/SPARK-20628. > Spark should not schedule tasks on executors on decommissioning YARN nodes > -- > > Key: SPARK-19941 > URL: https://issues.apache.org/jira/browse/SPARK-19941 > Project: Spark > Issue Type: Improvement > Components: Scheduler, YARN >Affects Versions: 2.1.0 > Environment: Hadoop 2.8.0-rc1 >Reporter: Karthik Palaniappan >Priority: Major > > Hadoop 2.8 added a mechanism to gracefully decommission Node Managers in > YARN: https://issues.apache.org/jira/browse/YARN-914 > Essentially you can mark nodes to be decommissioned, and let them a) finish > work in progress and b) finish serving shuffle data. But no new work will be > scheduled on the node. > Spark should respect when NMs are set to decommissioned, and similarly > decommission executors on those nodes by not scheduling any more tasks on > them. > It looks like in the future YARN may inform the app master when containers > will be killed: https://issues.apache.org/jira/browse/YARN-3784. However, I > don't think Spark should schedule based on a timeout. We should gracefully > decommission the executor as fast as possible (which is the spirit of > YARN-914). The app master can query the RM for NM statuses (if it doesn't > already have them) and stop scheduling on executors on NMs that are > decommissioning. > Stretch feature: The timeout may be useful in determining whether running > further tasks on the executor is even helpful. Spark may be able to tell that > shuffle data will not be consumed by the time the node is decommissioned, so > it is not worth computing. The executor can be killed immediately. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443154#comment-16443154 ] Franck Tago commented on SPARK-23519: - Thanks for the suggestion, [~shahid]. The issue with your suggestion is that I dynamically generate the create view statement; moreover, the select statement is opaque to me because it is provided by the customer. It would be nice if Spark could fix such a simple case. > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1. Create and populate a Hive table. I did this in a Hive CLI session [not > that this matters]: > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. Create a view from the table. > [These actions were performed from a spark shell.] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24019) AnalysisException for Window function expression to compute derivative
Barry Becker created SPARK-24019: Summary: AnalysisException for Window function expression to compute derivative Key: SPARK-24019 URL: https://issues.apache.org/jira/browse/SPARK-24019 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.1 Environment: Ubuntu, spark 2.1.1, standalone. Reporter: Barry Becker I am using spark 2.1.1 currently. I created an expression to compute the derivative of some series data using a window function. I have a simple reproducible case of the error. I'm only filing this bug because the error message says "Please file a bug report with this error message, stack trace, and the query." Here they are: {code:java} ((coalesce(lead(value#9, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), value#9) - coalesce(lag(value#9, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), value#9)) / cast((coalesce(lead(sequence_num#7, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - coalesce(lag(sequence_num#7, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), sequence_num#7)) as double)) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS derivative#14 has multiple Window Specifications (ArrayBuffer(windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))). 
Please file a bug report with this error message, stack trace, and the query.; org.apache.spark.sql.AnalysisException: ((coalesce(lead(value#9, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), value#9) - coalesce(lag(value#9, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), value#9)) / cast((coalesce(lead(sequence_num#7, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - coalesce(lag(sequence_num#7, 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), sequence_num#7)) as double)) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS derivative#14 has multiple Window Specifications (ArrayBuffer(windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))). 
Please file a bug report with this error message, stack trace, and the query.; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57) at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$78.apply(Analyzer.scala:1772){code} And here is a simple unit test that can be used to reproduce the problem: {code:java} import com.mineset.spark.testsupport.SparkTestCase.SPARK_SESSION import org.apache.spark.sql.Column import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions._ import org.scalatest.FunSuite import com.mineset.spark.testsupport.SparkTestCase._ /** * Test to see that window functions work as expected on spark. * @author Barry Becker */ class WindowFunctionSuite extends FunSuite { val simpleDf = createSimpleData() test("Window function for finding derivatives for 2 series") { val window = Window.partitionBy("category").orderBy("sequence_num")//.rangeBetween(-1, 1) // Consider the three sequential series points (Xlag, Ylag), (X, Y), (Xlead, Ylead). // This defines the derivative as (Ylead - Ylag) / (Xlead - Xlag) // If the lead or lag points are null, then we fall back on using the middle point. val yLead = coalesce(lead("value", 1).over(window), col("value")) val yLag = coalesce(lag("value", 1).over(window), col("value")) val xLead = coalesce(lead("sequence_num", 1).over(window), col("sequence_num")) val xLag = coalesce(lag("sequence_num", 1).over(window), col("sequence_num")) val derivative: Column = (yLead - yLag) / (xLead - xLag) val resultDf =
[jira] [Updated] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression
[ https://issues.apache.org/jira/browse/SPARK-24018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Francis Roy updated SPARK-24018: - Description: On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, Spark fails to read or write dataframes in parquet format with snappy compression. This is due to an incompatibility between the snappy-java version that is required by parquet (parquet is provided in Spark jars but snappy isn't) and the version that is available from hadoop-2.8.3. Steps to reproduce: * Download and extract hadoop-2.8.3 * Download and extract spark-2.3.0-without-hadoop * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH * Following instructions from [https://spark.apache.org/docs/latest/hadoop-provided.html], set SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh * Start a spark-shell, enter the following: {code:java} import spark.implicits._ val df = List(1, 2, 3, 4).toDF df.write .format("parquet") .option("compression", "snappy") .mode("overwrite") .save("test.parquet") {code} This fails with the following: {noformat} java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316) at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67) at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81) at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92) at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112) at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93) at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150) at org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238) at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121) at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167) at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109) at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at 
java.lang.Thread.run(Thread.java:748){noformat} Downloading snappy-java-1.1.2.6.jar and placing it in Spark's jars folder solves the issue. was: On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, Spark fails to read or write dataframes in parquet format with snappy compression. This is due to an incompatibility between the snappy-java version that is required by parquet (parquet is provided in Spark jars but snappy isn't) and the version that is available from hadoop-2.8.3. Steps to reproduce: * Download and extract hadoop-2.8.3 * Download and extract spark-2.3.0-without-hadoop * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH * Following instructions from https://spark.apache.org/docs/latest/hadoop-provided.html, set SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh * Start a spark-shell,
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443025#comment-16443025 ] Edwina Lu commented on SPARK-23206: --- We are planning a design discussion for early next week. Please let me know if you are interested in attending, and I will post once there is a set date/time. > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: ExecutorsTab.png, ExecutorsTab2.png, > MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. 
> To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used for caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
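The 591GB figure in the comment above comes straight from the quoted formula: number of executors * (spark.executor.memory - max JVM used memory). With hypothetical inputs (the actual per-cluster numbers are not given in the ticket), the arithmetic looks like:

```java
public class MemoryWaste {
    /**
     * Unused memory per application, per the formula in the comment above:
     * numExecutors * (executorMemoryGb - maxJvmUsedGb).
     * All concrete inputs below are hypothetical, for illustration only.
     */
    static double unusedGb(int numExecutors, double executorMemoryGb, double maxJvmUsedGb) {
        return numExecutors * (executorMemoryGb - maxJvmUsedGb);
    }

    public static void main(String[] args) {
        // e.g. 100 executors of 9 GB each, with only 35% of the heap actually used:
        double used = 9.0 * 0.35; // 3.15 GB peak JVM used memory
        System.out.printf("%.0f GB unused%n", unusedGb(100, 9.0, used));
    }
}
```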
[jira] [Created] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression
Jean-Francis Roy created SPARK-24018: Summary: Spark-without-hadoop package fails to create or read parquet files with snappy compression Key: SPARK-24018 URL: https://issues.apache.org/jira/browse/SPARK-24018 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.3.0 Reporter: Jean-Francis Roy On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, Spark fails to read or write dataframes in parquet format with snappy compression. This is due to an incompatibility between the snappy-java version that is required by parquet (parquet is provided in Spark jars but snappy isn't) and the version that is available from hadoop-2.8.3. Steps to reproduce: * Download and extract hadoop-2.8.3 * Download and extract spark-2.3.0-without-hadoop * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH * Following instructions from https://spark.apache.org/docs/latest/hadoop-provided.html, set SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh * Start a spark-shell, enter the following: {code:java} import spark.implicits._ val df = List(1, 2, 3, 4).toDF df.write .format("parquet") .option("compression", "snappy") .mode("overwrite") .save("test.parquet") {code} This fails with the following: {noformat} java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316) at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67) at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81) at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92) at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112) at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93) at 
org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150) at org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238) at org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121) at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167) at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109) at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} Downloading snappy-java-1.1.2.6.jar and placing it in Spark's jars folder solves the issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23915) High-order function: array_except(x, y) → array
[ https://issues.apache.org/jira/browse/SPARK-23915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442960#comment-16442960 ] Apache Spark commented on SPARK-23915: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/21103 > High-order function: array_except(x, y) → array > --- > > Key: SPARK-23915 > URL: https://issues.apache.org/jira/browse/SPARK-23915 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array of elements in x but not in y, without duplicates. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
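Since the Presto reference only states that the result contains "elements in x but not in y, without duplicates", a minimal Python sketch of the expected semantics may help; the first-occurrence ordering is an assumption for illustration, as neither the ticket nor the Presto one-liner pins it down:

```python
def array_except(x, y):
    """Elements of x that are not in y, de-duplicated.

    Ordering by first occurrence in x is an assumption for illustration.
    """
    excluded, out = set(y), []
    for e in x:
        if e not in excluded:
            out.append(e)
            excluded.add(e)  # also suppresses later duplicates from x
    return out

print(array_except([1, 2, 2, 3], [3, 4]))  # [1, 2]
```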
[jira] [Assigned] (SPARK-23915) High-order function: array_except(x, y) → array
[ https://issues.apache.org/jira/browse/SPARK-23915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23915: Assignee: Apache Spark > High-order function: array_except(x, y) → array > --- > > Key: SPARK-23915 > URL: https://issues.apache.org/jira/browse/SPARK-23915 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array of elements in x but not in y, without duplicates. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23915) High-order function: array_except(x, y) → array
[ https://issues.apache.org/jira/browse/SPARK-23915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23915: Assignee: (was: Apache Spark) > High-order function: array_except(x, y) → array > --- > > Key: SPARK-23915 > URL: https://issues.apache.org/jira/browse/SPARK-23915 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array of elements in x but not in y, without duplicates. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24017) Refactor ExternalCatalog to be an interface
Xiao Li created SPARK-24017: --- Summary: Refactor ExternalCatalog to be an interface Key: SPARK-24017 URL: https://issues.apache.org/jira/browse/SPARK-24017 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li This refactors the external catalog to be an interface, which will make future work on catalog federation easier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog
[ https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442948#comment-16442948 ] Xiao Li commented on SPARK-23443: - We need to clean up the ExternalCatalog interface first, before working on catalog federation. > Spark with Glue as external catalog > --- > > Key: SPARK-23443 > URL: https://issues.apache.org/jira/browse/SPARK-23443 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Ameen Tayyebi >Priority: Major > > AWS Glue Catalog is an external Hive metastore backed by a web service. It > allows permanent storage of catalog data for BigData use cases. > To find out more information about AWS Glue, please consult: > * AWS Glue - [https://aws.amazon.com/glue/] > * Using Glue as a Metastore catalog for Spark - > [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html] > Today, the integration of Glue and Spark is through the Hive layer. Glue > implements the IMetaStore interface of Hive, and for installations of Spark > that contain Hive, Glue can be used as the metastore. > The feature set that Glue supports does not align 1-1 with the set of > features that the latest version of Spark supports. For example, the Glue > interface supports more advanced partition pruning than the latest version of > Hive embedded in Spark. > To enable a more natural integration with Spark and to allow leveraging the > latest features of Glue, without being coupled to Hive, a direct integration > through Spark's own Catalog API is proposed. This Jira tracks this work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442928#comment-16442928 ] Edwina Lu commented on SPARK-23206: --- [~assia6], there is an initial PR: [[Github] Pull Request #20940 (edwinalu)|https://github.com/apache/spark/pull/20940], and I am working on making changes based on comments. > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: ExecutorsTab.png, ExecutorsTab2.png, > MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. 
> To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used for caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
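The wastage estimate in the description follows directly from the quoted formula. The figures below are hypothetical, chosen only to make the arithmetic concrete; they do not reproduce the 591 GB cluster average:

```python
# unused = num_executors * (spark.executor.memory - max JVM used memory)
# All figures below are made up for illustration.
executor_memory_gb = 8.0   # hypothetical spark.executor.memory
max_jvm_used_gb = 2.8      # hypothetical peak, i.e. 35% used as in the report
num_executors = 175        # hypothetical application size
unused_gb = num_executors * (executor_memory_gb - max_jvm_used_gb)
print(round(unused_gb))  # 910
```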
[jira] [Resolved] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-17508. -- Resolution: Won't Fix Resolving this for now unless there is more interest in the fix > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Priority: Minor > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase
[ https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23963: Fix Version/s: 2.3.1 2.2.2 > Queries on text-based Hive tables grow disproportionately slower as the > number of columns increase > -- > > Key: SPARK-23963 > URL: https://issues.apache.org/jira/browse/SPARK-23963 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Minor > Fix For: 2.2.2, 2.3.1, 2.4.0 > > > TableReader gets disproportionately slower as the number of columns in the > query increase. > For example, reading a table with 6000 columns is 4 times more expensive per > record than reading a table with 3000 columns, rather than twice as expensive. > The increase in processing time is due to several Lists (fieldRefs, > fieldOrdinals, and unwrappers), each of which the reader accesses by column > number for each column in a record. Because each List has O\(n\) time for > lookup by column number, these lookups grow increasingly expensive as the > column count increases. > When I patched the code to change those 3 Lists to Arrays, the query times > became proportional. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
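The disproportionate slowdown can be modeled with a toy calculation (this is not Spark code): if each of the n per-record column accesses walks an O(n) linked List, per-record work is quadratic, so doubling the column count roughly quadruples the per-record cost, matching the 4x observation above.

```python
def per_record_ops(n_columns):
    """Toy cost model: one O(i) List traversal for column i => ~n^2/2 ops per record."""
    return sum(range(n_columns))

ratio = per_record_ops(6000) / per_record_ops(3000)
print(round(ratio, 2))  # 4.0, i.e. 4x per record for 2x the columns
```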
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442838#comment-16442838 ] Cody Koeninger commented on SPARK-18057: [~cricket007] here's a branch with Spark 2.1.1 / Kafka 0.11.0.2 if you want to try it out and see if it fixes your issue: [https://github.com/koeninger/spark-1/tree/kafka-dependency-update-2.1.1] I assume you'll need to skip tests in order to build it; I didn't do anything more than update the dependency and fix compile errors. > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24010) Select from table needs read access on DB folder when storage based auth is enabled
[ https://issues.apache.org/jira/browse/SPARK-24010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442802#comment-16442802 ] Reynold Xin commented on SPARK-24010: - It's better to throw an error message telling the user that the database doesn't exist, rather than the table. Also, other catalog implementations in the future might not support this. > Select from table needs read access on DB folder when storage based auth is > enabled > --- > > Key: SPARK-24010 > URL: https://issues.apache.org/jira/browse/SPARK-24010 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Rui Li >Priority: Major > > When HMS enables storage based authorization, SparkSQL requires read access > on the DB folder in order to select from a table. Such a requirement doesn't seem > necessary and is not required in Hive. > The reason is that when the Analyzer tries to resolve a relation, it calls > [SessionCatalog::databaseExists|https://github.com/apache/spark/blob/v2.1.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L469]. > This will call the metastore get_database API, which will perform an > authorization check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation
[ https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442787#comment-16442787 ] Hyukjin Kwon commented on SPARK-24006: -- Yea, I wouldn't. I was just wondering how much it could affect things in practice. I would appreciate that, and it's even better if there's an actual case to check. > ExecutorAllocationManager.onExecutorAdded is an O(n) operation > -- > > Key: SPARK-24006 > URL: https://issues.apache.org/jira/browse/SPARK-24006 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Xianjin YE >Priority: Major > > ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I > believe it will be a problem when scaling out with a large number of executors, > as it effectively makes adding N executors an O(N^2) operation. > > I propose to invoke onExecutorIdle guarded by > {code:java} > if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) { > // Since we only need to re-mark idle executors near the low bound > executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle) > } else { > onExecutorIdle(executorId) > }{code} > cc [~zsxwing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
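A Python paraphrase of the proposed guard may make the intent easier to exercise in isolation. The function and callback names are stand-ins mirroring the Scala snippet in the ticket, not Spark's actual API:

```python
def on_executor_added(executor_ids, executors_pending_to_remove,
                      min_num_executors, is_executor_idle,
                      on_executor_idle, executor_id):
    """Toy paraphrase of the guard proposed in SPARK-24006 (not Spark code)."""
    if len(executor_ids) - len(executors_pending_to_remove) >= min_num_executors + 1:
        # above the lower bound, only re-mark executors that are actually idle
        for eid in (e for e in executor_ids if is_executor_idle(e)):
            on_executor_idle(eid)
    else:
        on_executor_idle(executor_id)
```

Called once per added executor, this avoids invoking the idle callback for every executor unconditionally, which appears to be the O(n)-per-add behavior the ticket describes.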
[jira] [Commented] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase
[ https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442783#comment-16442783 ] Ruslan Dautkhanov commented on SPARK-23963: --- I thought it was just a matter of a Spark committer committing the same PR [https://github.com/apache/spark/pull/21043] to a different branch, Spark 2.2 in this case. This PR gives a 24x improvement on 6000 columns, as you discovered, so I think this one-line change should be accepted into Spark 2.2 fairly easily. > Queries on text-based Hive tables grow disproportionately slower as the > number of columns increase > -- > > Key: SPARK-23963 > URL: https://issues.apache.org/jira/browse/SPARK-23963 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Minor > Fix For: 2.4.0 > > > TableReader gets disproportionately slower as the number of columns in the > query increase. > For example, reading a table with 6000 columns is 4 times more expensive per > record than reading a table with 3000 columns, rather than twice as expensive. > The increase in processing time is due to several Lists (fieldRefs, > fieldOrdinals, and unwrappers), each of which the reader accesses by column > number for each column in a record. Because each List has O\(n\) time for > lookup by column number, these lookups grow increasingly expensive as the > column count increases. > When I patched the code to change those 3 Lists to Arrays, the query times > became proportional. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation
[ https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442781#comment-16442781 ] Xianjin YE commented on SPARK-24006: Could you leave this issue open for a while? At my company, we will be scaling Spark applications with large data and a lot of executors in the near future. I will report back once I have first-hand experience of how this impacts dynamic allocation. > ExecutorAllocationManager.onExecutorAdded is an O(n) operation > -- > > Key: SPARK-24006 > URL: https://issues.apache.org/jira/browse/SPARK-24006 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Xianjin YE >Priority: Major > > ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I > believe it will be a problem when scaling out with a large number of executors, > as it effectively makes adding N executors an O(N^2) operation. > > I propose to invoke onExecutorIdle guarded by > {code:java} > if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) { > // Since we only need to re-mark idle executors near the low bound > executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle) > } else { > onExecutorIdle(executorId) > }{code} > cc [~zsxwing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23933) High-order function: map(array, array) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436976#comment-16436976 ] Kazuaki Ishizaki edited comment on SPARK-23933 at 4/18/18 4:22 PM: --- [~smilegator] [~ueshin] Could you do us a favor? SparkSQL already uses the {{map}} function syntax for a different purpose. Even if we limit the argument list to two arrays, we may have a conflict between this new feature and creating a map with one entry whose key and value are both arrays. Do you have any good ideas? {code} @ExpressionDescription( usage = "_FUNC_(key0, value0, key1, value1, ...) - Creates a map with the given key/value pairs.", examples = """ Examples: > SELECT _FUNC_(1.0, '2', 3.0, '4'); {1.0:"2",3.0:"4"} """) case class CreateMap(children: Seq[Expression]) extends Expression { ... {code} was (Author: kiszk): [~smilegator] [~ueshin] Could you do us a favor? SparkSQL already uses the {{map}} function syntax for a similar purpose. Even if we limit the argument list to two arrays, we may have a conflict between this new feature and creating a map with one entry whose key and value are both arrays. Do you have any good ideas? {code} @ExpressionDescription( usage = "_FUNC_(key0, value0, key1, value1, ...) - Creates a map with the given key/value pairs.", examples = """ Examples: > SELECT _FUNC_(1.0, '2', 3.0, '4'); {1.0:"2",3.0:"4"} """) case class CreateMap(children: Seq[Expression]) extends Expression { ... {code} > High-order function: map(array, array) → map<K,V> > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. 
> {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
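The Presto behavior quoted above can be sketched in a few lines of Python. The length check is an assumption about error handling, since the ticket does not specify what happens for mismatched array sizes:

```python
def map_from_arrays(keys, values):
    """Presto-style map(array, array): pairs keys[i] with values[i]."""
    if len(keys) != len(values):
        # assumed behavior; the ticket does not define the mismatch case
        raise ValueError("key and value arrays must have the same length")
    return dict(zip(keys, values))

print(map_from_arrays([1, 3], [2, 4]))  # {1: 2, 3: 4}
```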
[jira] [Assigned] (SPARK-23913) High-order function: array_intersect(x, y) → array
[ https://issues.apache.org/jira/browse/SPARK-23913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23913: Assignee: Apache Spark > High-order function: array_intersect(x, y) → array > -- > > Key: SPARK-23913 > URL: https://issues.apache.org/jira/browse/SPARK-23913 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array of the elements in the intersection of x and y, without > duplicates. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23913) High-order function: array_intersect(x, y) → array
[ https://issues.apache.org/jira/browse/SPARK-23913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23913: Assignee: (was: Apache Spark) > High-order function: array_intersect(x, y) → array > -- > > Key: SPARK-23913 > URL: https://issues.apache.org/jira/browse/SPARK-23913 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array of the elements in the intersection of x and y, without > duplicates. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23913) High-order function: array_intersect(x, y) → array
[ https://issues.apache.org/jira/browse/SPARK-23913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442766#comment-16442766 ] Apache Spark commented on SPARK-23913: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/21102 > High-order function: array_intersect(x, y) → array > -- > > Key: SPARK-23913 > URL: https://issues.apache.org/jira/browse/SPARK-23913 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array of the elements in the intersection of x and y, without > duplicates. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
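As with the other array functions in this family, a small Python sketch of the documented semantics ("elements in the intersection of x and y, without duplicates") may be useful; ordering by first occurrence in x is an assumption, since the reference does not specify it:

```python
def array_intersect(x, y):
    """Elements present in both x and y, de-duplicated.

    Ordering by first occurrence in x is an assumption for illustration.
    """
    y_set, emitted, out = set(y), set(), []
    for e in x:
        if e in y_set and e not in emitted:
            out.append(e)
            emitted.add(e)
    return out

print(array_intersect([1, 2, 2, 3], [2, 3, 4]))  # [2, 3]
```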
[jira] [Commented] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation
[ https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442735#comment-16442735 ] Hyukjin Kwon commented on SPARK-24006: -- It would be good to make a small test with the logic above and see how much it improves. If the difference is negligible in practice, I would leave this resolved. > ExecutorAllocationManager.onExecutorAdded is an O(n) operation > -- > > Key: SPARK-24006 > URL: https://issues.apache.org/jira/browse/SPARK-24006 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Xianjin YE >Priority: Major > > ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I > believe it will be a problem when scaling out with a large number of executors, > as it effectively makes adding N executors an O(N^2) operation. > > I propose to invoke onExecutorIdle guarded by > {code:java} > if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) { > // Since we only need to re-mark idle executors near the low bound > executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle) > } else { > onExecutorIdle(executorId) > }{code} > cc [~zsxwing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24015) Does method SchedulerBackend.isReady is really well implemented?
[ https://issues.apache.org/jira/browse/SPARK-24015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenk updated SPARK-24015: -- Docs Text: (was: The method isReady in the interface SchedulerBackend is implemented by CoarseGrainedSchedulerBackend and has a default value of true. The implementation in CoarseGrainedSchedulerBackend uses the sufficientResourcesRegistered method to check it, which also defaults to true. So, what is the meaning of the isReady method, given that it will always return true in practice?) Description: The method *isReady* in the interface *SchedulerBackend* is implemented by *CoarseGrainedSchedulerBackend* and has a default value of true. The implementation in *CoarseGrainedSchedulerBackend* uses the *sufficientResourcesRegistered* method to check it, which also defaults to true. So, what is the meaning of the *isReady* method? It will always return true in practice. > Does method SchedulerBackend.isReady is really well implemented? > > > Key: SPARK-24015 > URL: https://issues.apache.org/jira/browse/SPARK-24015 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: chenk >Priority: Minor > > The method *isReady* in the interface *SchedulerBackend* is implemented by > *CoarseGrainedSchedulerBackend* and has a default value of true. > The implementation in *CoarseGrainedSchedulerBackend* uses the > *sufficientResourcesRegistered* method to check it, which also defaults to true. > So, what is the meaning of the *isReady* method? It will always > return true in practice. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24016) Yarn does not update node blacklist in static allocation
Imran Rashid created SPARK-24016: Summary: Yarn does not update node blacklist in static allocation Key: SPARK-24016 URL: https://issues.apache.org/jira/browse/SPARK-24016 Project: Spark Issue Type: Improvement Components: Scheduler, YARN Affects Versions: 2.3.0 Reporter: Imran Rashid Task-based blacklisting keeps track of bad nodes and updates YARN with that set of nodes, so that Spark will not receive more containers on those nodes. However, that only happens with dynamic allocation. Though it's far more important with dynamic allocation, this matters even with static allocation: if executors die, or if the cluster was too busy at the time of the original resource request to grant all the containers, the Spark application will request new containers mid-run, and we want an updated node blacklist for that. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24015) Does method SchedulerBackend.isReady is really well implemented?
[ https://issues.apache.org/jira/browse/SPARK-24015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenk updated SPARK-24015: -- Summary: Does method SchedulerBackend.isReady is really well implemented? (was: Does method SchedulerBackend.isReady is really well implemented) > Does method SchedulerBackend.isReady is really well implemented? > > > Key: SPARK-24015 > URL: https://issues.apache.org/jira/browse/SPARK-24015 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: chenk >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24015) Does method SchedulerBackend.isReady is really well implemented
[ https://issues.apache.org/jira/browse/SPARK-24015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chenk updated SPARK-24015: -- Summary: Does method SchedulerBackend.isReady is really well implemented (was: Does method BackendReady.isReady is really well implemented) > Does method SchedulerBackend.isReady is really well implemented > --- > > Key: SPARK-24015 > URL: https://issues.apache.org/jira/browse/SPARK-24015 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: chenk >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24007) EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen.
[ https://issues.apache.org/jira/browse/SPARK-24007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24007. - Resolution: Fixed Fix Version/s: 2.4.0 2.3.1 2.2.2 > EqualNullSafe for FloatType and DoubleType might generate a wrong result by > codegen. > > > Key: SPARK-24007 > URL: https://issues.apache.org/jira/browse/SPARK-24007 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Labels: correctness > Fix For: 2.2.2, 2.3.1, 2.4.0 > > > {{EqualNullSafe}} for {{FloatType}} and {{DoubleType}} might generate a wrong > result by codegen. > {noformat} > scala> val df = Seq((Some(-1.0d), None), (None, Some(-1.0d))).toDF() > df: org.apache.spark.sql.DataFrame = [_1: double, _2: double] > scala> df.show() > +----+----+ > | _1| _2| > +----+----+ > |-1.0|null| > |null|-1.0| > +----+----+ > scala> df.filter("_1 <=> _2").show() > +----+----+ > | _1| _2| > +----+----+ > |-1.0|null| > |null|-1.0| > +----+----+ > {noformat} > The result should be empty, but two rows remain. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
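For reference, the intended semantics of `<=>` (null-safe equality) can be modeled in plain Scala with `Option`. This is an illustration of the expected behavior, not Spark's generated code:

```scala
// Expected semantics of SQL's null-safe equality `<=>`, modeled with Option:
// null <=> null is true, null <=> x is false, otherwise plain equality.
def eqNullSafe(a: Option[Double], b: Option[Double]): Boolean = (a, b) match {
  case (None, None)       => true
  case (Some(x), Some(y)) => x == y
  case _                  => false
}

// The two rows from the reproduction above: neither should survive the
// filter, so the correct result of `df.filter("_1 <=> _2")` is empty.
val rows = Seq((Some(-1.0), None), (None, Some(-1.0)))
val kept = rows.filter { case (a, b) => eqNullSafe(a, b) }
```

The codegen bug produced the opposite: both rows were kept.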
[jira] [Created] (SPARK-24015) Does method BackendReady.isReady is really well implemented
chenk created SPARK-24015: - Summary: Does method BackendReady.isReady is really well implemented Key: SPARK-24015 URL: https://issues.apache.org/jira/browse/SPARK-24015 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.3.0 Reporter: chenk -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442568#comment-16442568 ] Wenchen Fan edited comment on SPARK-23989 at 4/18/18 2:47 PM: -- OK now I see the problem, `ShuffleExchangeExec.needToCopyObjectsBeforeShuffle` doesn't catch all the cases, so we may produce wrong result. was (Author: cloud_fan): OK now I see the problem, `ShuffleExchangeExec.needToCopyObjectsBeforeShuffle` doesn't catch all the cases, so we may reproduce wrong result. > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into > 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
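The failure mode being discussed -- buffering references to a mutable, reused row without copying it -- can be reproduced without Spark at all. In this sketch, `MutableRow` is a stand-in for `UnsafeRow`, which many Spark operators reuse across records:

```scala
// Stand-in for UnsafeRow: one mutable object reused for every record.
final class MutableRow(var value: Int)

// An iterator that, like many Spark operators, reuses a single row instance.
def reusedRowIterator(values: Seq[Int]): Iterator[MutableRow] = {
  val row = new MutableRow(0)
  values.iterator.map { v => row.value = v; row }
}

// Buffering the references directly: every slot aliases the same object,
// so by the time the iterator is exhausted all slots show the last value.
val aliased = reusedRowIterator(Seq(1, 2, 3)).toArray.map(_.value).toSeq

// Copying each row before buffering preserves the data.
val copied = reusedRowIterator(Seq(1, 2, 3))
  .map(r => new MutableRow(r.value)).toArray.map(_.value).toSeq
```

`aliased` collapses to three copies of the final value, while `copied` keeps all three values -- which is why a buffering shuffle writer must copy rows (or know it can skip the copy, as `needToCopyObjectsBeforeShuffle` tries to determine).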
[jira] [Assigned] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23989: Assignee: (was: Apache Spark) > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into > 'PartitionedAppendOnlyMap' or > 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23989: Assignee: Apache Spark > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Assignee: Apache Spark >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into > 'PartitionedAppendOnlyMap' or > 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442620#comment-16442620 ] Apache Spark commented on SPARK-23989: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/21101 > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into > 'PartitionedAppendOnlyMap' or > 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24012) Union of map and other compatible column
[ https://issues.apache.org/jira/browse/SPARK-24012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24012: Assignee: Apache Spark > Union of map and other compatible column > > > Key: SPARK-24012 > URL: https://issues.apache.org/jira/browse/SPARK-24012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.1 >Reporter: Lijia Liu >Assignee: Apache Spark >Priority: Major > > Union of a map column and another compatible column results in an unresolved operator > 'Union; exception. > Reproduction: > spark-sql> select map(1,2), 'str' union all select map(1,2,3,null), 1 > Output: > Error in query: unresolved operator 'Union;; > 'Union > :- Project [map(1, 2) AS map(1, 2)#106, str AS str#107] > : +- OneRowRelation$ > +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS > INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108] > +- OneRowRelation$ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24012) Union of map and other compatible column
[ https://issues.apache.org/jira/browse/SPARK-24012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24012: Assignee: (was: Apache Spark) > Union of map and other compatible column > > > Key: SPARK-24012 > URL: https://issues.apache.org/jira/browse/SPARK-24012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.1 >Reporter: Lijia Liu >Priority: Major > > Union of a map column and another compatible column results in an unresolved operator > 'Union; exception. > Reproduction: > spark-sql> select map(1,2), 'str' union all select map(1,2,3,null), 1 > Output: > Error in query: unresolved operator 'Union;; > 'Union > :- Project [map(1, 2) AS map(1, 2)#106, str AS str#107] > : +- OneRowRelation$ > +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS > INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108] > +- OneRowRelation$ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24012) Union of map and other compatible column
[ https://issues.apache.org/jira/browse/SPARK-24012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442615#comment-16442615 ] Apache Spark commented on SPARK-24012: -- User 'liutang123' has created a pull request for this issue: https://github.com/apache/spark/pull/21100 > Union of map and other compatible column > > > Key: SPARK-24012 > URL: https://issues.apache.org/jira/browse/SPARK-24012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.1 >Reporter: Lijia Liu >Priority: Major > > Union of a map column and another compatible column results in an unresolved operator > 'Union; exception. > Reproduction: > spark-sql> select map(1,2), 'str' union all select map(1,2,3,null), 1 > Output: > Error in query: unresolved operator 'Union;; > 'Union > :- Project [map(1, 2) AS map(1, 2)#106, str AS str#107] > : +- OneRowRelation$ > +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS > INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108] > +- OneRowRelation$ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24012) Union of map and other compatible column
[ https://issues.apache.org/jira/browse/SPARK-24012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lijia Liu updated SPARK-24012: -- Description: Union of a map column and another compatible column results in an unresolved operator 'Union; exception. Reproduction: spark-sql> select map(1,2), 'str' union all select map(1,2,3,null), 1 Output: Error in query: unresolved operator 'Union;; 'Union :- Project [map(1, 2) AS map(1, 2)#106, str AS str#107] : +- OneRowRelation$ +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108] +- OneRowRelation$ was: Union of a map column and another compatible column results in an unresolved operator 'Union; exception. Reproduction: spark-sql> select map(1,2), 'str' union all select map(1,2,3,null), 1 Output: 'Union :- Project map(1, 2) AS map(1, 2)#6655, str AS str#6656 : +- OneRowRelation +- Project map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS INT), 3, CAST(NULL AS INT))#6658, 1 AS 1#6657 +- OneRowRelation > Union of map and other compatible column > > > Key: SPARK-24012 > URL: https://issues.apache.org/jira/browse/SPARK-24012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.1 >Reporter: Lijia Liu >Priority: Major > > Union of a map column and another compatible column results in an unresolved operator > 'Union; exception. > Reproduction: > spark-sql> select map(1,2), 'str' union all select map(1,2,3,null), 1 > Output: > Error in query: unresolved operator 'Union;; > 'Union > :- Project [map(1, 2) AS map(1, 2)#106, str AS str#107] > : +- OneRowRelation$ > +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS > INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108] > +- OneRowRelation$ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23875) Create IndexedSeq wrapper for ArrayData
[ https://issues.apache.org/jira/browse/SPARK-23875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442602#comment-16442602 ] Apache Spark commented on SPARK-23875: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/21099 > Create IndexedSeq wrapper for ArrayData > --- > > Key: SPARK-23875 > URL: https://issues.apache.org/jira/browse/SPARK-23875 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > We don't have a good way to sequentially access {{UnsafeArrayData}} with a > common interface such as Seq. An example is {{MapObject}} where we need to > access several sequence collection types together. But {{UnsafeArrayData}} > doesn't implement {{ArrayData.array}}. Calling {{toArray}} will copy the > entire array. We can provide an {{IndexedSeq}} wrapper for {{ArrayData}}, so > we can avoid copying the entire array. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
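The idea behind the wrapper can be sketched with a toy stand-in for `ArrayData` (the trait below is hypothetical, not the real interface). An `IndexedSeq` view delegates element access to the underlying data, so `Seq` operations work without a `toArray`-style copy of the whole array:

```scala
// Toy stand-in for ArrayData: indexed access without exposing a backing array.
trait ArrayLike {
  def numElements(): Int
  def getInt(i: Int): Int
}

// An IndexedSeq view over ArrayLike: Seq operations (sum, filter, zip, ...)
// fetch elements one at a time, and no full copy of the array is ever made.
final class ArrayLikeSeq(data: ArrayLike) extends IndexedSeq[Int] {
  def length: Int = data.numElements()
  def apply(i: Int): Int = data.getInt(i)
}
```

This mirrors the trade-off described in the ticket: `toArray` copies everything up front, while the `IndexedSeq` wrapper pays only for the elements actually read.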
[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442568#comment-16442568 ] Wenchen Fan commented on SPARK-23989: - OK now I see the problem, `ShuffleExchangeExec.needToCopyObjectsBeforeShuffle` doesn't catch all the cases, so we may produce a wrong result. > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into > 'PartitionedAppendOnlyMap' or > 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24014) Add onStreamingStarted method to StreamingListener
[ https://issues.apache.org/jira/browse/SPARK-24014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24014: Assignee: (was: Apache Spark) > Add onStreamingStarted method to StreamingListener > -- > > Key: SPARK-24014 > URL: https://issues.apache.org/jira/browse/SPARK-24014 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Trivial > > The {{StreamingListener}} in PySpark side seems to be lack of > {{onStreamingStarted}} method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24014) Add onStreamingStarted method to StreamingListener
[ https://issues.apache.org/jira/browse/SPARK-24014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442553#comment-16442553 ] Apache Spark commented on SPARK-24014: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/21098 > Add onStreamingStarted method to StreamingListener > -- > > Key: SPARK-24014 > URL: https://issues.apache.org/jira/browse/SPARK-24014 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Trivial > > The {{StreamingListener}} in PySpark side seems to be lack of > {{onStreamingStarted}} method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24014) Add onStreamingStarted method to StreamingListener
[ https://issues.apache.org/jira/browse/SPARK-24014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24014: Assignee: Apache Spark > Add onStreamingStarted method to StreamingListener > -- > > Key: SPARK-24014 > URL: https://issues.apache.org/jira/browse/SPARK-24014 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Trivial > > The {{StreamingListener}} in PySpark side seems to be lack of > {{onStreamingStarted}} method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24014) Add onStreamingStarted method to StreamingListener
Liang-Chi Hsieh created SPARK-24014: --- Summary: Add onStreamingStarted method to StreamingListener Key: SPARK-24014 URL: https://issues.apache.org/jira/browse/SPARK-24014 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.0 Reporter: Liang-Chi Hsieh The {{StreamingListener}} in PySpark side seems to be lack of {{onStreamingStarted}} method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
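In Scala terms, the missing hook amounts to one more callback with a no-op default on the listener interface; the PySpark change mirrors this shape in Python. The names below are illustrative stand-ins, not the actual Spark API:

```scala
// Sketch of the listener shape being discussed: a new callback with a no-op
// default, so existing listeners that ignore the event keep working unchanged.
case class StreamingStarted(time: Long)

trait StreamingListenerSketch {
  def onStreamingStarted(event: StreamingStarted): Unit = {}
}

// A listener that opts in to the new hook.
class RecordingListener extends StreamingListenerSketch {
  var startedAt: Option[Long] = None
  override def onStreamingStarted(event: StreamingStarted): Unit =
    startedAt = Some(event.time)
}
```

The no-op default is what makes the addition backward compatible.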
[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442514#comment-16442514 ] Wenchen Fan commented on SPARK-23989: - it goes to `SortShuffleWriter` and then? We get the correct result, don't we? > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into > 'PartitionedAppendOnlyMap' or > 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object
[ https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442488#comment-16442488 ] Jacek Laskowski commented on SPARK-23830: - It's about how easy it is to find out that the issue is `class` vs `object`. If that's just a single change that could be reported to end users to help them I think it's worth it. > Spark on YARN in cluster deploy mode fail with NullPointerException when a > Spark application is a Scala class not object > > > Key: SPARK-23830 > URL: https://issues.apache.org/jira/browse/SPARK-23830 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > As reported on StackOverflow in [Why does Spark on YARN fail with “Exception > in thread ”Driver“ > java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344] > the following Spark application fails with {{Exception in thread "Driver" > java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode: > {code} > class MyClass { > def main(args: Array[String]): Unit = { > val c = new MyClass() > c.process() > } > def process(): Unit = { > val sparkConf = new SparkConf().setAppName("my-test") > val sparkSession: SparkSession = > SparkSession.builder().config(sparkConf).getOrCreate() > import sparkSession.implicits._ > > } > ... > } > {code} > The exception is as follows: > {code} > 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a > separate Thread > 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context > initialization... 
> Exception in thread "Driver" java.lang.NullPointerException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) > {code} > I think the reason for the exception {{Exception in thread "Driver" > java.lang.NullPointerException}} is due to [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]: > {code} > val mainMethod = userClassLoader.loadClass(args.userClass) > .getMethod("main", classOf[Array[String]]) > {code} > So when {{mainMethod}} is used in [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706] > it simply gives NPE. > {code} > mainMethod.invoke(null, userArgs.toArray) > {code} > That could easily be avoided with an extra check that {{mainMethod}} is > invocable, and a message telling the user what the likely reason is. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
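The mechanism can be reproduced outside Spark with plain reflection (illustrative sketch, not Spark's ApplicationMaster code): for a Scala `class`, `main` is an instance method, so looking it up with `getMethod` succeeds, but invoking it with a `null` receiver throws `NullPointerException` -- whereas a Scala `object` typically gets a static `main` forwarder that can be invoked with a `null` receiver.

```scala
// A user application declared as a class (not an object), as in the ticket.
class MyClassApp {
  def main(args: Array[String]): Unit = println("running")
}

// Look up `main` and invoke it with a null receiver, as Spark's
// ApplicationMaster does; for a class this fails with NPE.
def invokeStaticMain(clazz: Class[_]): Option[Throwable] =
  try {
    val m = clazz.getMethod("main", classOf[Array[String]])
    m.invoke(null, Array.empty[String]) // null receiver
    None
  } catch {
    case t: Throwable => Some(t)
  }
```

Checking `java.lang.reflect.Modifier.isStatic(m.getModifiers)` before the invoke would be one way to turn this into the friendlier error message the ticket asks for.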
[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.
[ https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442477#comment-16442477 ] Juliusz Sompolski commented on SPARK-24013: --- This hits when trying to create histogram statistics (SPARK-21975) on columns like monotonically increasing id - histograms cannot be created in a reasonable time. > ApproximatePercentile grinds to a halt on sorted input. > --- > > Key: SPARK-24013 > URL: https://issues.apache.org/jira/browse/SPARK-24013 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Juliusz Sompolski >Priority: Major > > Running > {code} > sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid > from range(1000))").collect() > {code} > takes 7 seconds, while > {code} > sql("select approx_percentile(id, array(0.1)) from range(1000)").collect() > {code} > grinds to a halt - processes the first million rows quickly, and then slows > down to a few thousand rows / second (4m rows processed after 20 minutes). > Thread dumps show that it spends time in QuantileSummary.compress. > It seems to hit some edge-case inefficiency when dealing with sorted data? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.
Juliusz Sompolski created SPARK-24013: - Summary: ApproximatePercentile grinds to a halt on sorted input. Key: SPARK-24013 URL: https://issues.apache.org/jira/browse/SPARK-24013 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Juliusz Sompolski Running {code} sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid from range(1000))").collect() {code} takes 7 seconds, while {code} sql("select approx_percentile(id, array(0.1)) from range(1000)").collect() {code} grinds to a halt - processes the first million rows quickly, and then slows down to a few thousand rows / second (4m rows processed after 20 minutes). Thread dumps show that it spends time in QuantileSummary.compress. It seems to hit some edge-case inefficiency when dealing with sorted data? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442379#comment-16442379 ] assia ydroudj commented on SPARK-23206: --- [~elu], thank you for the shared docs, it works for me. Is there a final PR to get the executor metrics? > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: ExecutorsTab.png, ExecutorsTab2.png, > MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. 
> To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23990) Instruments logging improvements - ML regression package
[ https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442370#comment-16442370 ] Weichen Xu commented on SPARK-23990: [~josephkb] I agree that option 2b looks better. Updating PR... > Instruments logging improvements - ML regression package > > > Key: SPARK-23990 > URL: https://issues.apache.org/jira/browse/SPARK-23990 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 > Environment: Instruments logging improvements - ML regression package >Reporter: Weichen Xu >Priority: Major > Original Estimate: 120h > Remaining Estimate: 120h > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14682) Provide evaluateEachIteration method or equivalent for spark.ml GBTs
[ https://issues.apache.org/jira/browse/SPARK-14682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442368#comment-16442368 ] Apache Spark commented on SPARK-14682: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/21097 > Provide evaluateEachIteration method or equivalent for spark.ml GBTs > > > Key: SPARK-14682 > URL: https://issues.apache.org/jira/browse/SPARK-14682 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > spark.mllib GradientBoostedTrees provide an evaluateEachIteration method. We > should provide that or an equivalent for spark.ml. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14682) Provide evaluateEachIteration method or equivalent for spark.ml GBTs
[ https://issues.apache.org/jira/browse/SPARK-14682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14682: Assignee: (was: Apache Spark) > Provide evaluateEachIteration method or equivalent for spark.ml GBTs > > > Key: SPARK-14682 > URL: https://issues.apache.org/jira/browse/SPARK-14682 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > spark.mllib GradientBoostedTrees provide an evaluateEachIteration method. We > should provide that or an equivalent for spark.ml. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14682) Provide evaluateEachIteration method or equivalent for spark.ml GBTs
[ https://issues.apache.org/jira/browse/SPARK-14682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14682: Assignee: Apache Spark > Provide evaluateEachIteration method or equivalent for spark.ml GBTs > > > Key: SPARK-14682 > URL: https://issues.apache.org/jira/browse/SPARK-14682 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > spark.mllib GradientBoostedTrees provide an evaluateEachIteration method. We > should provide that or an equivalent for spark.ml. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24012) Union of map and other compatible column
[ https://issues.apache.org/jira/browse/SPARK-24012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lijia Liu updated SPARK-24012: -- Description: Union of map and other compatible column result in unresolved operator 'Union; exception Reproduction spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1 Output: 'Union :- Project map(1, 2) AS map(1, 2)#6655, str AS str#6656 : +- OneRowRelation +- Project map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS INT), 3, CAST(NULL AS INT))#6658, 1 AS 1#6657 +- OneRowRelation was: Union of map and other compatible column result in unresolved operator 'Union; exception Reproduction spark-sql>*select map(1,2), 'str' union all select map(1,2,3,null), 1* Output: 'Union :- Project [map(1, 2) AS map(1, 2)#6655, str AS str#6656] : +- OneRowRelation +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS INT), 3, CAST(NULL AS INT))#6658, 1 AS 1#6657] +- OneRowRelation > Union of map and other compatible column > > > Key: SPARK-24012 > URL: https://issues.apache.org/jira/browse/SPARK-24012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.1 >Reporter: Lijia Liu >Priority: Major > > Union of map and other compatible column result in unresolved operator > 'Union; exception > Reproduction > spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1 > Output: > 'Union > :- Project map(1, 2) AS map(1, 2)#6655, str AS str#6656 > : +- OneRowRelation > +- Project map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS > INT), 3, CAST(NULL AS INT))#6658, 1 AS 1#6657 > +- OneRowRelation > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24012) Union of map and other compatible column
Lijia Liu created SPARK-24012: - Summary: Union of map and other compatible column Key: SPARK-24012 URL: https://issues.apache.org/jira/browse/SPARK-24012 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1, 2.1.0 Reporter: Lijia Liu A union of a map column and another compatible column results in an unresolved operator 'Union; exception. Reproduction: spark-sql>*select map(1,2), 'str' union all select map(1,2,3,null), 1* Output: 'Union :- Project [map(1, 2) AS map(1, 2)#6655, str AS str#6656] : +- OneRowRelation +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS INT), 3, CAST(NULL AS INT))#6658, 1 AS 1#6657] +- OneRowRelation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
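To resolve the Union above, the analyzer must find a common type for each column pair across the branches, recursing into map key and value types. The sketch below is a simplified, hypothetical illustration of that per-column coercion; the type encoding and widening rules here are stand-ins, not Spark's actual TypeCoercion code.

```python
# Simplified sketch of per-column type coercion across UNION branches.
# Types are encoded as strings, maps as ("map", key_type, value_type).
# The widening rules are illustrative approximations of Spark's.
def tightest_common_type(a, b):
    if a == b:
        return a
    numeric = ["int", "long", "double"]
    if a in numeric and b in numeric:
        # Widen to the larger numeric type.
        return numeric[max(numeric.index(a), numeric.index(b))]
    if a == "string" or b == "string":
        # For UNION branches, a numeric vs string column widens to string.
        return "string"
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0] == "map":
        # Recurse into key and value types of the two map columns.
        k = tightest_common_type(a[1], b[1])
        v = tightest_common_type(a[2], b[2])
        return None if k is None or v is None else ("map", k, v)
    return None  # no common type -> the 'Union stays unresolved

# Column types from: select map(1,2), 'str' union all select map(1,2,3,null), 1
left  = [("map", "int", "int"), "string"]
right = [("map", "int", "int"), "int"]
print([tightest_common_type(l, r) for l, r in zip(left, right)])
```

In the reported query the map columns already share a type, so the failure points at the coercion rule not being applied (or not handling maps) for the other column pair rather than at genuinely incompatible types.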
[jira] [Commented] (SPARK-24011) Cache rdd's immediate parent ShuffleDependencies to accelerate getShuffleDependencies()
[ https://issues.apache.org/jira/browse/SPARK-24011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442259#comment-16442259 ] Apache Spark commented on SPARK-24011: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/21096 > Cache rdd's immediate parent ShuffleDependencies to accelerate > getShuffleDependencies() > --- > > Key: SPARK-24011 > URL: https://issues.apache.org/jira/browse/SPARK-24011 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Minor > > When creating stages for jobs, we need to find a rdd's (except the final rdd) > immediate parent ShuffleDependencies by method getShuffleDependencies() for > at least 2 times (first in > getMissingAncestorShuffleDependencies(), and second in > getOrCreateParentStages()). > So, we can cache the result at the fist time we call getShuffleDependencies(). > This is helpful for cutting time consuming when there's many > NarrowDependencies between the rdd and its immediate parent > ShuffleDependencies or if the rdd has a number of immediate parent > ShuffleDependencies . > > There's an exception for checkpointed rdd. If a rdd is checkpointed, it's > immediate parent ShuffleDependencies should adjust to empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
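The caching proposed above amounts to memoizing a traversal over narrow dependencies. The sketch below is a hypothetical Python stand-in for the Scala logic (the `RDD` class and call counter are invented for illustration), showing that repeated lookups are served from the cache and that a checkpointed RDD reports no parent shuffle dependencies.

```python
# Minimal sketch of memoizing an RDD's immediate parent shuffle
# dependencies, as proposed in SPARK-24011. Classes are hypothetical
# stand-ins for Spark's; the call counter just shows the cache working.
class RDD:
    def __init__(self, name, deps=()):
        self.name = name
        self.deps = list(deps)   # (kind, parent) pairs, kind in {"narrow", "shuffle"}
        self.checkpointed = False

calls = {"n": 0}   # how many times the traversal actually runs
_cache = {}        # rdd name -> set of parent shuffle-dependency names

def get_shuffle_dependencies(rdd):
    # A checkpointed RDD's immediate parent shuffle dependencies become empty.
    if rdd.checkpointed:
        return set()
    if rdd.name in _cache:
        return _cache[rdd.name]
    calls["n"] += 1
    found, stack = set(), [rdd]
    while stack:                 # walk only through narrow dependencies
        node = stack.pop()
        for kind, parent in node.deps:
            if kind == "shuffle":
                found.add(parent.name)
            else:
                stack.append(parent)
    _cache[rdd.name] = found
    return found

a = RDD("a")
b = RDD("b", [("shuffle", a)])
c = RDD("c", [("narrow", b)])
get_shuffle_dependencies(c)
get_shuffle_dependencies(c)      # second call (e.g. getOrCreateParentStages)
print(calls["n"])  # traversal ran once; second lookup hit the cache
```

This mirrors the ticket's point: both getMissingAncestorShuffleDependencies() and getOrCreateParentStages() need the same answer, so only the first call should pay for the narrow-dependency walk.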
[jira] [Assigned] (SPARK-24011) Cache rdd's immediate parent ShuffleDependencies to accelerate getShuffleDependencies()
[ https://issues.apache.org/jira/browse/SPARK-24011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24011: Assignee: Apache Spark > Cache rdd's immediate parent ShuffleDependencies to accelerate > getShuffleDependencies() > --- > > Key: SPARK-24011 > URL: https://issues.apache.org/jira/browse/SPARK-24011 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Minor > > When creating stages for jobs, we need to find a rdd's (except the final rdd) > immediate parent ShuffleDependencies by method getShuffleDependencies() for > at least 2 times (first in > getMissingAncestorShuffleDependencies(), and second in > getOrCreateParentStages()). > So, we can cache the result at the fist time we call getShuffleDependencies(). > This is helpful for cutting time consuming when there's many > NarrowDependencies between the rdd and its immediate parent > ShuffleDependencies or if the rdd has a number of immediate parent > ShuffleDependencies . > > There's an exception for checkpointed rdd. If a rdd is checkpointed, it's > immediate parent ShuffleDependencies should adjust to empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24011) Cache rdd's immediate parent ShuffleDependencies to accelerate getShuffleDependencies()
[ https://issues.apache.org/jira/browse/SPARK-24011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24011: Assignee: (was: Apache Spark) > Cache rdd's immediate parent ShuffleDependencies to accelerate > getShuffleDependencies() > --- > > Key: SPARK-24011 > URL: https://issues.apache.org/jira/browse/SPARK-24011 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Minor > > When creating stages for jobs, we need to find a rdd's (except the final rdd) > immediate parent ShuffleDependencies by method getShuffleDependencies() for > at least 2 times (first in > getMissingAncestorShuffleDependencies(), and second in > getOrCreateParentStages()). > So, we can cache the result at the fist time we call getShuffleDependencies(). > This is helpful for cutting time consuming when there's many > NarrowDependencies between the rdd and its immediate parent > ShuffleDependencies or if the rdd has a number of immediate parent > ShuffleDependencies . > > There's an exception for checkpointed rdd. If a rdd is checkpointed, it's > immediate parent ShuffleDependencies should adjust to empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24011) Cache rdd's immediate parent ShuffleDependencies to accelerate getShuffleDependencies()
[ https://issues.apache.org/jira/browse/SPARK-24011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-24011: - Summary: Cache rdd's immediate parent ShuffleDependencies to accelerate getShuffleDependencies() (was: Cache rdd's immediate parent ShuffleDependency to accelerate getShuffleDependencies()) > Cache rdd's immediate parent ShuffleDependencies to accelerate > getShuffleDependencies() > --- > > Key: SPARK-24011 > URL: https://issues.apache.org/jira/browse/SPARK-24011 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Minor > > When creating stages for jobs, we need to find a rdd's (except the final rdd) > immediate parent ShuffleDependencies by method getShuffleDependencies() for > at least 2 times (first in > getMissingAncestorShuffleDependencies(), and second in > getOrCreateParentStages()). > So, we can cache the result at the fist time we call getShuffleDependencies(). > This is helpful for cutting time consuming when there's many > NarrowDependencies between the rdd and its immediate parent > ShuffleDependencies or if the rdd has a number of immediate parent > ShuffleDependencies . > > There's an exception for checkpointed rdd. If a rdd is checkpointed, it's > immediate parent ShuffleDependencies should adjust to empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24011) Cache rdd's immediate parent ShuffleDependency to accelerate getShuffleDependencies()
wuyi created SPARK-24011: Summary: Cache rdd's immediate parent ShuffleDependency to accelerate getShuffleDependencies() Key: SPARK-24011 URL: https://issues.apache.org/jira/browse/SPARK-24011 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: wuyi When creating stages for jobs, we need to find an RDD's (except the final RDD's) immediate parent ShuffleDependencies via getShuffleDependencies() at least twice (first in getMissingAncestorShuffleDependencies(), and second in getOrCreateParentStages()). So, we can cache the result the first time we call getShuffleDependencies(). This helps cut time spent when there are many NarrowDependencies between the RDD and its immediate parent ShuffleDependencies, or when the RDD has a number of immediate parent ShuffleDependencies. There's an exception for a checkpointed RDD: if an RDD is checkpointed, its immediate parent ShuffleDependencies should be adjusted to empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation
[ https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442224#comment-16442224 ] Xianjin YE commented on SPARK-24006: Haven't launched a large enough job to confirm my assumption, but roughly I guess 5_000-10_000 executors would show some impact, and 100_000 and above executors would cause an actual problem. > ExecutorAllocationManager.onExecutorAdded is an O(n) operation > -- > > Key: SPARK-24006 > URL: https://issues.apache.org/jira/browse/SPARK-24006 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Xianjin YE >Priority: Major > > ExecutorAllocationManager.onExecutorAdded is an O(n) operation; I > believe it will be a problem when scaling out with a large number of executors, > as it effectively makes adding N executors an O(N^2) operation overall. > > I propose to invoke onExecutorIdle guarded by > {code:java} > if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors +1) { > // Since we only need to re-mark idle executors when at the low bound > executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle) > } else { > onExecutorIdle(executorId) > }{code} > cc [~zsxwing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
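The quadratic cost can be shown with a toy simulation. This is a simplified, hypothetical illustration of the complexity concern only (it is not the exact guard in the snippet above): the naive path re-marks every known executor on each add, while the guarded path touches only the newly added executor once the pool is above the low bound.

```python
# Toy simulation of total idle-check work when adding n executors,
# naive (re-mark everyone on each add, O(N^2) total) vs guarded
# (only the new executor once above the low bound). Hypothetical
# simplification of the SPARK-24006 discussion, not Spark code.
def add_executors(n, min_num_executors, guarded):
    executors, idle_checks = [], 0
    for i in range(n):
        executors.append(f"exec-{i}")
        if guarded and len(executors) >= min_num_executors + 1:
            idle_checks += 1               # only the executor just added
        else:
            idle_checks += len(executors)  # re-mark every known executor
    return idle_checks

print(add_executors(1000, min_num_executors=2, guarded=False))  # sum 1..1000
print(add_executors(1000, min_num_executors=2, guarded=True))
```

At 1000 executors the naive path does 500500 idle checks versus roughly 1000 for the guarded path, which supports the comment's guess that the effect only becomes visible at thousands of executors.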
[jira] [Resolved] (SPARK-23926) High-order function: reverse(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-23926. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21034 [https://github.com/apache/spark/pull/21034] > High-order function: reverse(x) → array > --- > > Key: SPARK-23926 > URL: https://issues.apache.org/jira/browse/SPARK-23926 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marek Novotny >Priority: Major > Fix For: 2.4.0 > > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array which has the reversed order of array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23926) High-order function: reverse(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-23926: - Assignee: Marek Novotny > High-order function: reverse(x) → array > --- > > Key: SPARK-23926 > URL: https://issues.apache.org/jira/browse/SPARK-23926 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marek Novotny >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array which has the reversed order of array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442176#comment-16442176 ] liuxian edited comment on SPARK-23989 at 4/18/18 9:21 AM: --
{code}
test("groupBy") {
  spark.conf.set("spark.sql.shuffle.partitions", 16777217)
  val df1 = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d"))
    .toDF("key", "value1", "value2", "rest")
  checkAnswer(
    df1.groupBy("key").min("value2"),
    Seq(Row("a", 0), Row("b", 4))
  )
}
{code}
Because the number of partitions is too large, it will run for a long time. The number of partitions is set this large so that the shuffle goes through `SortShuffleWriter`.
> When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > {code}override def write(records: Iterator[Product2[K, V]]){code} > the value of 'records' is `UnsafeRow`, so the value will be overwritten. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442176#comment-16442176 ] liuxian commented on SPARK-23989: -
{code}
test("groupBy") {
  spark.conf.set("spark.sql.shuffle.partitions", 16777217)
  val df1 = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d"))
    .toDF("key", "value1", "value2", "rest")
  checkAnswer(
    df1.groupBy("key").min("value2"),
    Seq(Row("a", 0), Row("b", 4))
  )
}
{code}
Because the number of partitions is too large, it will run for a long time. The number of partitions is set this large so that the shuffle goes through `SortShuffleWriter`.
> When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > {code}override def write(records: Iterator[Product2[K, V]]){code} > the value of 'records' is `UnsafeRow`, so the value will be overwritten. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
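The overwrite failure mode described in this issue — buffering a reference to a row object that the iterator reuses, instead of copying it — can be illustrated in plain Python. The `ReusedRow` class below is a hypothetical stand-in for a reused `UnsafeRow`, not Spark code.

```python
# Illustration of the row-reuse pitfall from SPARK-23989: buffering a
# reference to a reused mutable record leaves every buffered entry
# pointing at the last value, unless the record is copied first.
class ReusedRow:
    """Hypothetical stand-in for a mutable row reused by an iterator."""
    def __init__(self):
        self.values = None
    def set(self, values):
        self.values = values
        return self
    def copy(self):
        fresh = ReusedRow()
        fresh.values = list(self.values)
        return fresh

row = ReusedRow()
records = [("a", [1]), ("b", [2])]

buffered_refs = []
buffered_copies = []
for key, vals in records:
    row.set(vals)
    buffered_refs.append((key, row))          # bug: same object every time
    buffered_copies.append((key, row.copy())) # fix: defensive copy

print([r.values for _, r in buffered_refs])    # [[2], [2]] - overwritten
print([r.values for _, r in buffered_copies])  # [[1], [2]]
```

This is why the large partition count in the test matters: it forces the shuffle down the `SortShuffleWriter` path whose buffer exhibits the reference-holding behavior.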
[jira] [Commented] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'
[ https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442165#comment-16442165 ] ANDY GUAN commented on SPARK-24009: --- Try to config *hive.exec.scratchdir* and *hive.exec.stagingdir* in hive-site.xml. Make sure your current user has the right to write into the configured directory. > spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' > - > > Key: SPARK-24009 > URL: https://issues.apache.org/jira/browse/SPARK-24009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: chris_j >Priority: Major > > 1.spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row > format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from > default.dim_date" write local directory successful > 3.spark-sql --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab'row format > delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from > default.dim_date" write hdfs successful > 2.spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY > '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS > TEXTFILE select * from default.dim_date" on yarn writr local directory failed > > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: > java.io.IOException: Mkdirs failed to create > [file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0] > (exists=false, > cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02]) > at > 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123) > at > org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) > ... 
8 more > Caused by: java.io.IOException: Mkdirs failed to create > [file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0] > (exists=false, > cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02]) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801) > at > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80) > at > org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261) > at > org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-24008) SQL/Hive Context fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ANDY GUAN updated SPARK-24008: -- Comment: was deleted (was: Try to config *hive.exec.scratchdir* and *hive.exec.stagingdir* in hive-site.xml. Make sure your current user has the right to write into the configured directory.) > SQL/Hive Context fails with NullPointerException > - > > Key: SPARK-24008 > URL: https://issues.apache.org/jira/browse/SPARK-24008 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > Attachments: Repro > > > SQL / Hive Context fails with NullPointerException while getting > configuration from SQLConf. This happens when the MemoryStore is filled with > lot of broadcast and started dropping and then SQL / Hive Context is created > and broadcast. When using this Context to access a table fails with below > NullPointerException. > Repro is attached - the Spark Example which fills the MemoryStore with > broadcasts and then creates and accesses a SQL Context. 
> {code} > java.lang.NullPointerException > at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) > at > org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558) > at > org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362) > at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623) > at SparkHiveExample$.main(SparkHiveExample.scala:76) > at SparkHiveExample.main(SparkHiveExample.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: > java.lang.NullPointerException > java.lang.NullPointerException > at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) > at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) > at > org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166) > > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258) > > at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) > at > org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:475) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475) > > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) > {code} > MemoryStore got filled and started dropping the blocks. 
> {code} > 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in > memory (estimated size 78.1 MB, free 64.4 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes > in memory (estimated size 1522.0 B, free 64.4 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in > memory (estimated size 350.9 KB, free 64.1 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes > in memory (estimated size 29.9 KB, free 64.0 MB) > 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in > memory (estimated size 78.1 MB, free 64.7 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes > in memory (estimated size 1522.0 B, free 64.7 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in > memory (estimated size 136.0 B, free 64.7 MB) > 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared > 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB > 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared > 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB > 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24008) SQL/Hive Context fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442164#comment-16442164 ] ANDY GUAN commented on SPARK-24008: --- Try to config *hive.exec.scratchdir* and *hive.exec.stagingdir* in hive-site.xml. Make sure your current user has the right to write into the configured directory. > SQL/Hive Context fails with NullPointerException > - > > Key: SPARK-24008 > URL: https://issues.apache.org/jira/browse/SPARK-24008 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > Attachments: Repro > > > SQL / Hive Context fails with NullPointerException while getting > configuration from SQLConf. This happens when the MemoryStore is filled with > lot of broadcast and started dropping and then SQL / Hive Context is created > and broadcast. When using this Context to access a table fails with below > NullPointerException. > Repro is attached - the Spark Example which fills the MemoryStore with > broadcasts and then creates and accesses a SQL Context. 
> {code} > java.lang.NullPointerException > at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) > at > org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558) > at > org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362) > at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623) > at SparkHiveExample$.main(SparkHiveExample.scala:76) > at SparkHiveExample.main(SparkHiveExample.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: > java.lang.NullPointerException > java.lang.NullPointerException > at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) > at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) > at > org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166) > > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258) > > at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) > at > org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:475) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475) > > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) > {code} > MemoryStore got filled and started dropping the blocks. 
> {code} > 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in > memory (estimated size 78.1 MB, free 64.4 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes > in memory (estimated size 1522.0 B, free 64.4 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in > memory (estimated size 350.9 KB, free 64.1 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes > in memory (estimated size 29.9 KB, free 64.0 MB) > 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in > memory (estimated size 78.1 MB, free 64.7 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes > in memory (estimated size 1522.0 B, free 64.7 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in > memory (estimated size 136.0 B, free 64.7 MB) > 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared > 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB > 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared > 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB > 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
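The configuration suggestion in the comment above can be sketched as a hive-site.xml fragment. The locations shown here are hypothetical placeholders, not values from this thread; the only grounded part is the two property names, and the submitting user must be able to write to both directories:

```xml
<!-- hive-site.xml fragment: hypothetical paths, shown only to illustrate
     the two properties named in the comment above. -->
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive</value>
</property>
<property>
  <name>hive.exec.stagingdir</name>
  <value>/tmp/hive/.hive-staging</value>
</property>
```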
[jira] [Commented] (SPARK-24010) Select from table needs read access on DB folder when storage based auth is enabled
[ https://issues.apache.org/jira/browse/SPARK-24010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442146#comment-16442146 ] Rui Li commented on SPARK-24010: Hi [~rxin], I think the databaseExists check was added in SPARK-14869. Do you remember why we need to check whether the DB exists? In {{HiveClientImpl::tableExists}}, we're telling Hive not to throw exceptions, so there shouldn't be exceptions if the DB doesn't exist. > Select from table needs read access on DB folder when storage based auth is > enabled > --- > > Key: SPARK-24010 > URL: https://issues.apache.org/jira/browse/SPARK-24010 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Rui Li >Priority: Major > > When HMS enables storage based authorization, SparkSQL requires read access > on the DB folder in order to select from a table. Such a requirement doesn't > seem necessary and is not required in Hive. > The reason is that when the Analyzer tries to resolve a relation, it calls > [SessionCatalog::databaseExists|https://github.com/apache/spark/blob/v2.1.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L469]. > This calls the metastore get_database API, which performs an > authorization check.
[jira] [Commented] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation
[ https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442147#comment-16442147 ] Hyukjin Kwon commented on SPARK-24006: -- Roughly how many executors would you guess it takes to cause an actual problem? > ExecutorAllocationManager.onExecutorAdded is an O(n) operation > -- > > Key: SPARK-24006 > URL: https://issues.apache.org/jira/browse/SPARK-24006 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Xianjin YE >Priority: Major > > ExecutorAllocationManager.onExecutorAdded is an O(n) operation, and I > believe it will be a problem when scaling out to a large number of executors, > as it effectively makes adding N executors an O(N^2) operation. > > I propose to invoke onExecutorIdle guarded by > {code:java} > if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) { > // Since we only need to re-mark idle executors when near the low bound > executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle) > } else { > onExecutorIdle(executorId) > }{code} > cc [~zsxwing]
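As a toy illustration of the quadratic behaviour this issue describes: the sketch below is self-contained and is not Spark's real ExecutorAllocationManager, and the "guarded" variant is a deliberate simplification of the proposed patch (it only checks the newly added executor above the low bound), but it shows how the total number of idle checks grows.

```scala
// Toy model: if every onExecutorAdded re-marks all known executors idle,
// adding n executors costs O(n^2) checks in total; guarding the re-mark
// makes the total O(n).
object AllocationSketch {
  def naiveChecks(n: Int): Int = {
    val executorIds = scala.collection.mutable.LinkedHashSet.empty[String]
    var checks = 0
    for (i <- 1 to n) {
      executorIds += s"exec-$i"
      executorIds.foreach(_ => checks += 1) // re-mark every known executor idle
    }
    checks
  }

  // Simplified guard: above minNumExecutors + 1, only the new executor is checked.
  def guardedChecks(n: Int, minNumExecutors: Int): Int = {
    val executorIds = scala.collection.mutable.LinkedHashSet.empty[String]
    var checks = 0
    for (i <- 1 to n) {
      executorIds += s"exec-$i"
      if (executorIds.size >= minNumExecutors + 1) checks += 1
      else executorIds.foreach(_ => checks += 1)
    }
    checks
  }

  def main(args: Array[String]): Unit =
    println(s"naive=${naiveChecks(100)} guarded=${guardedChecks(100, 2)}")
}
```

With 100 executors the naive variant performs 5050 idle checks (1 + 2 + ... + 100), while the guarded variant performs 101.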
[jira] [Created] (SPARK-24010) Select from table needs read access on DB folder when storage based auth is enabled
Rui Li created SPARK-24010: -- Summary: Select from table needs read access on DB folder when storage based auth is enabled Key: SPARK-24010 URL: https://issues.apache.org/jira/browse/SPARK-24010 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Rui Li When HMS enables storage based authorization, SparkSQL requires read access on DB folder in order to select from a table. Such requirement doesn't seem necessary and is not required in Hive. The reason is when Analyzer tries to resolve a relation, it calls [SessionCatalog::databaseExists|https://github.com/apache/spark/blob/v2.1.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L469]. This will call the metastore get_database API which will perform authorization check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
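The call chain in the report can be modelled in miniature. The sketch below uses made-up classes and names (it is not Spark's SessionCatalog or Hive's metastore): a get_database-style call triggers the storage-based authorization check on the DB folder and can fail even when the table itself is accessible, while a table-level lookup does not touch the DB folder.

```scala
// Illustrative-only model: "metastore" maps a database name to whether its
// folder is readable by the caller, plus the set of table names it holds.
object AuthSketch {
  final case class Db(folderReadable: Boolean, tables: Set[String])
  private val metastore = Map("default" -> Db(folderReadable = false, tables = Set("t")))

  // Models get_database: storage-based authorization checks read access on
  // the DB folder, so this can fail even though the table is accessible.
  def databaseExists(db: String): Boolean = metastore.get(db) match {
    case None => false
    case Some(d) =>
      if (!d.folderReadable) throw new SecurityException(s"no read access on DB folder: $db")
      true
  }

  // Models a table-level lookup: no DB-folder check involved.
  def tableExists(db: String, table: String): Boolean =
    metastore.get(db).exists(_.tables.contains(table))
}
```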
[jira] [Updated] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'
[ https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chris_j updated SPARK-24009: Description: 1. spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date" writes to the local directory successfully 2. spark-sql --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date" writes to HDFS successfully 3. spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date" on yarn fails to write to the local directory: Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0 (exists=false, cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249) at org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123) at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378) at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) ... 8 more Caused by: java.io.IOException: Mkdirs failed to create file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0 (exists=false, cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801) at org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246) was: 1.spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select 
* from default.dim_date" successful 2.spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date" failed Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0 (exists=false, cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249) at
[jira] [Commented] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442050#comment-16442050 ] Apache Spark commented on SPARK-23529: -- User 'madanadit' has created a pull request for this issue: https://github.com/apache/spark/pull/21095 > Specify hostpath volume and mount the volume in Spark driver and executor > pods in Kubernetes > > > Key: SPARK-23529 > URL: https://issues.apache.org/jira/browse/SPARK-23529 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Suman Somasundar >Assignee: Anirudh Ramanathan >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'
[ https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chris_j updated SPARK-24009: Description: 1.spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date" successful 2.spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date" failed Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0 (exists=false, cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249) at org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123) at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) ... 8 more Caused by: java.io.IOException: Mkdirs failed to create file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0 (exists=false, cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801) at org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246) was:INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date > spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' > - > > Key: SPARK-24009 > URL: https://issues.apache.org/jira/browse/SPARK-24009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: chris_j >Priority: Major > > 1.spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row > format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from > default.dim_date" successful > 2.spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY > '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS > TEXTFILE select * from default.dim_date" failed > > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: > java.io.IOException: 
Mkdirs failed to create > file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0 > (exists=false, > cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02) > at > org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123) > at > org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) > at >
[jira] [Created] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'
chris_j created SPARK-24009: --- Summary: spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' Key: SPARK-24009 URL: https://issues.apache.org/jira/browse/SPARK-24009 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: chris_j INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
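The stack traces in this thread show the failure only under --master yarn, where the current working directory is the YARN container's local dir on an executor host, so the executor cannot create the /home/spark staging path. A hedged sketch of a commonly used workaround (not taken from this thread; the HDFS path is a hypothetical placeholder) is to write the export to HDFS first and then pull it to the local filesystem:

```shell
# Hypothetical paths; assumes a YARN cluster with HDFS and a local Hadoop client.
spark-sql --master yarn -e "INSERT OVERWRITE DIRECTORY '/tmp/dim_date_export' \
  row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE \
  select * from default.dim_date"

# Copy the HDFS output to the desired local directory afterwards.
hadoop fs -get /tmp/dim_date_export /home/spark/dim_date_export
```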
[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441997#comment-16441997 ] Wenchen Fan commented on SPARK-23989: - You have to provide an end-to-end use case to convince us this is a bug (e.g. a SQL/DataFrame query). If you fork Spark, change some code and break something, it is not a bug. > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into > 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the values in 'records' are `UnsafeRow`s, so each value will be overwritten > by the next one.
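The object-reuse hazard described in this issue can be reproduced in miniature. The sketch below is self-contained and does not use Spark's UnsafeRow or SortShuffleWriter; it shows the general pattern: when an iterator reuses one mutable record, buffering the references makes every buffered entry alias the same object, so earlier values appear overwritten unless each record is copied before buffering.

```scala
// A mutable "row" that an iterator reuses, like a row iterator reusing its
// backing buffer.
final class MutableRecord(var value: Int) {
  def copyRecord(): MutableRecord = new MutableRecord(value)
}

object ReuseSketch {
  // Produces n "rows" but reuses one mutable instance, mutating it in place.
  def reusingIterator(n: Int): Iterator[MutableRecord] = {
    val shared = new MutableRecord(0)
    Iterator.tabulate(n) { i => shared.value = i; shared }
  }

  // Buffering the references directly: every entry aliases `shared`, so all
  // buffered values end up equal to the last one produced.
  def bufferWithoutCopy(n: Int): Seq[Int] =
    reusingIterator(n).toArray.toSeq.map(_.value)

  // Copying before buffering preserves each value.
  def bufferWithCopy(n: Int): Seq[Int] =
    reusingIterator(n).map(_.copyRecord()).toArray.toSeq.map(_.value)
}
```

Here `bufferWithoutCopy(3)` yields Seq(2, 2, 2) while `bufferWithCopy(3)` yields Seq(0, 1, 2), which is the "data will be overwritten" symptom the reporter describes.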
[jira] [Updated] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23989: Attachment: (was: 无标题2.png) > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into > 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten