[jira] [Commented] (SPARK-20998) BroadcastHashJoin producing wrong results
[ https://issues.apache.org/jira/browse/SPARK-20998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040321#comment-16040321 ] Liang-Chi Hsieh commented on SPARK-20998:

Can you provide sample data to reproduce this issue?

> Key: SPARK-20998
> URL: https://issues.apache.org/jira/browse/SPARK-20998
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Mohit

I have a Hive table, _eagle_edw_batch.DistributionAttributes_, with *Schema*:
{code}
root
 |-- distributionstatus: string (nullable = true)
 |-- enabledforselectionflag: boolean (nullable = true)
 |-- sourcedistributionid: integer (nullable = true)
 |-- rowstartdate: date (nullable = true)
 |-- rowenddate: date (nullable = true)
 |-- rowiscurrent: string (nullable = true)
 |-- dwcreatedate: timestamp (nullable = true)
 |-- dwlastupdatedate: timestamp (nullable = true)
 |-- appid: integer (nullable = true)
 |-- siteid: integer (nullable = true)
 |-- brandid: integer (nullable = true)
{code}
*DataFrame*
{code}
val df = spark.sql("SELECT s.sourcedistributionid as sid, t.sourcedistributionid as tid, s.appid as sapp, t.appid as tapp, s.brandid as sbrand, t.brandid as tbrand FROM eagle_edw_batch.DistributionAttributes t INNER JOIN eagle_edw_batch.DistributionAttributes s ON t.sourcedistributionid=s.sourcedistributionid AND t.appid=s.appid AND t.brandid=s.brandid")
{code}
*Without BroadcastJoin* (spark-shell --conf "spark.sql.autoBroadcastJoinThreshold=-1"):
{code}
df.explain
== Physical Plan ==
*Project [sourcedistributionid#71 AS sid#0, sourcedistributionid#60 AS tid#1, appid#77 AS sapp#2, appid#66 AS tapp#3, brandid#79 AS sbrand#4, brandid#68 AS tbrand#5]
+- *SortMergeJoin [sourcedistributionid#60, appid#66, brandid#68], [sourcedistributionid#71, appid#77, brandid#79], Inner
   :- *Sort [sourcedistributionid#60 ASC, appid#66 ASC, brandid#68 ASC], false, 0
   :  +- Exchange hashpartitioning(sourcedistributionid#60, appid#66, brandid#68, 200)
   :     +- *Filter ((isnotnull(sourcedistributionid#60) && isnotnull(brandid#68)) && isnotnull(appid#66))
   :        +- HiveTableScan [sourcedistributionid#60, appid#66, brandid#68], MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- *Sort [sourcedistributionid#71 ASC, appid#77 ASC, brandid#79 ASC], false, 0
      +- Exchange hashpartitioning(sourcedistributionid#71, appid#77, brandid#79, 200)
         +- *Filter ((isnotnull(sourcedistributionid#71) && isnotnull(appid#77)) && isnotnull(brandid#79))
            +- HiveTableScan [sourcedistributionid#71, appid#77, brandid#79], MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
|sid|tid|sapp|tapp|sbrand|tbrand|
| 22| 22|  61|  61|   614|   614|
| 29| 29|  65|  65|     0|     0|
| 30| 30|  12|  12|   121|   121|
| 10| 10|  73|  73|   731|   731|
| 24| 24|  61|  61|   611|   611|
| 35| 35|  65|  65|     0|     0|
{code}
*With BroadcastJoin* (spark-shell):
{code}
df.explain
== Physical Plan ==
*Project [sourcedistributionid#136 AS sid#65, sourcedistributionid#125 AS tid#66, appid#142 AS sapp#67, appid#131 AS tapp#68, brandid#144 AS sbrand#69, brandid#133 AS tbrand#70]
+- *BroadcastHashJoin [sourcedistributionid#125, appid#131, brandid#133], [sourcedistributionid#136, appid#142, brandid#144], Inner, BuildRight
   :- *Filter ((isnotnull(brandid#133) && isnotnull(appid#131)) && isnotnull(sourcedistributionid#125))
   :  +- HiveTableScan [sourcedistributionid#125, appid#131, brandid#133], MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- BroadcastExchange HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, false] as bigint), 32) | (cast(input[1, int, false] as bigint) & 4294967295)), 32) | (cast(input[2, int, false] as bigint) & 4294967295))))
      +- *Filter ((isnotnull(brandid#144) && isnotnull(sourcedistributionid#136)) && isnotnull(appid#142))
         +- HiveTableScan [sourcedistributionid#136, appid#142, brandid#144], MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
|sid|tid|sapp|tapp|sbrand|tbrand|
| 15| 22|  61|  61|   614|   614|
| 13| 22|  61|  61|   614|   614|
| 10| 22|  61|  61|   614|   614|
|  7| 22|  61|  61|   614|   614|
|  9| 22|  61|  61|   614|   614|
| 16| 22|  61|  61|   614|   614|
{code}

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
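A note on the broadcast plan above: the {{HashedRelationBroadcastMode}} expression packs the three integer join keys into a single 64-bit long by shifting left 32 bits twice. On a 64-bit value the first key is shifted out entirely, so the packed key depends only on the last two columns, which would explain the wrong output shown (rows where {{sapp}}/{{tapp}} and {{sbrand}}/{{tbrand}} match but {{sid}} and {{tid}} differ). A sketch in Python (the helper name is mine, not Spark's; results are masked to 64 bits to emulate Java long overflow):

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # emulate 64-bit long wraparound

def pack_keys(k1: int, k2: int, k3: int) -> int:
    """Reproduce the packing expression visible in the physical plan:
    shiftleft(shiftleft(k1, 32) | (k2 & 4294967295), 32) | (k3 & 4294967295)
    """
    packed = ((k1 << 32) | (k2 & 0xFFFFFFFF)) & MASK64   # k1 now in the high 32 bits
    packed = ((packed << 32) | (k3 & 0xFFFFFFFF)) & MASK64  # k1 shifted past bit 63: gone
    return packed

# Two rows that differ only in the first join key collide:
a = pack_keys(15, 61, 614)  # sid = 15, appid = 61, brandid = 614
b = pack_keys(22, 61, 614)  # sid = 22, same appid and brandid
print(a == b)  # True: the first key no longer contributes to the packed value
```

Changing either of the last two keys does produce distinct packed values, so only the first join column is silently ignored in this sketch.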
[jira] [Commented] (SPARK-20691) Difference between Storage Memory as seen internally and in web UI
[ https://issues.apache.org/jira/browse/SPARK-20691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040294#comment-16040294 ] Maarten Kesselaers commented on SPARK-20691:

OK, some components use the function {{byteStringAsMb}} from JavaUtils, which returns mebibytes (it is also called by the function {{byteStringAsMb}} in Utils.scala). On the other hand, Utils.scala also has {{Utils.bytesToString}}, which returns megabytes. Does it make sense to have both and to mix them? This causes a lot of confusion, in my opinion.

> Key: SPARK-20691
> URL: https://issues.apache.org/jira/browse/SPARK-20691
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.3.0
> Reporter: Jacek Laskowski
> Labels: starter

I set Major priority as it's visible to a user. There's a difference between the size of Storage Memory as managed internally and as displayed to the user in the web UI. I found it while answering [How does web UI calculate Storage Memory (in Executors tab)?|http://stackoverflow.com/q/43801062/1305344] on StackOverflow.

In short (quoting the main parts), when you start a Spark app (say spark-shell) you see 912.3 MB RAM for Storage Memory:
{code}
$ ./bin/spark-shell --conf spark.driver.memory=2g
...
17/05/07 15:20:50 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.8:57177 with 912.3 MB RAM, BlockManagerId(driver, 192.168.1.8, 57177, None)
{code}
but in the web UI you'll see 956.6 MB due to the way the custom JavaScript function {{formatBytes}} in [utils.js|https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/ui/static/utils.js#L40-L48] calculates the value. That translates to the following Scala code:
{code}
def formatBytes(bytes: Double) = {
  val k = 1000
  val i = math.floor(math.log(bytes) / math.log(k))
  val maxMemoryWebUI = bytes / math.pow(k, i)
  f"$maxMemoryWebUI%1.1f"
}

scala> println(formatBytes(maxMemory))
956.6
{code}
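The discrepancy above is a base-1000 vs base-1024 mismatch: the BlockManager log line divides by 1024 (mebibytes) while the web UI's {{formatBytes}} uses k = 1000. Both figures can be reproduced from a single byte count; the value below is back-derived from 912.3 MiB and is an assumption, since the exact {{maxMemory}} is not given:

```python
import math

def format_bytes(num_bytes: float, k: int) -> str:
    """Port of the web UI's formatBytes, with the base made an explicit parameter."""
    i = math.floor(math.log(num_bytes) / math.log(k))
    return f"{num_bytes / k ** i:.1f}"

# ~912.3 MiB expressed in bytes (assumed value, derived from the log line)
max_memory = 956_615_884

print(format_bytes(max_memory, 1024))  # "912.3" -- what BlockManager logs (MiB)
print(format_bytes(max_memory, 1000))  # "956.6" -- what the web UI shows
```

The same number of bytes renders as 912.3 or 956.6 depending solely on the divisor, which is the whole bug.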
[jira] [Comment Edited] (SPARK-20691) Difference between Storage Memory as seen internally and in web UI
[ https://issues.apache.org/jira/browse/SPARK-20691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040263#comment-16040263 ] Maarten Kesselaers edited comment on SPARK-20691 at 6/7/17 6:24 AM:

[~srowen], I would like to work on this one. As far as I can see, the behaviour of Utils.bytesToString should not be changed, since it returns megabytes instead of mebibytes. Let's first clarify the reason for the difference between the shell and the web UI, and then we can discuss how to proceed.

was (Author: mkesselaers):
[~srowen], I would like to work on this one. As far as I've understood (I'm quite new), the behaviour of Utils.bytesToString should be changed to return MiB instead of MB, and we should look at all occurrences to see if the right value is used?
[jira] [Commented] (SPARK-20691) Difference between Storage Memory as seen internally and in web UI
[ https://issues.apache.org/jira/browse/SPARK-20691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040263#comment-16040263 ] Maarten Kesselaers commented on SPARK-20691:

[~srowen], I would like to work on this one. As far as I've understood (I'm quite new), the behaviour of Utils.bytesToString should be changed to return MiB instead of MB, and we should look at all occurrences to see if the right value is used?
[jira] [Resolved] (SPARK-20972) rename HintInfo.isBroadcastable to broadcast
[ https://issues.apache.org/jira/browse/SPARK-20972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20972.

Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18189: [https://github.com/apache/spark/pull/18189]

> Key: SPARK-20972
> URL: https://issues.apache.org/jira/browse/SPARK-20972
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Wenchen Fan
> Assignee: Wenchen Fan
> Priority: Minor
> Fix For: 2.3.0
[jira] [Commented] (SPARK-11966) Spark API for UDTFs
[ https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040235#comment-16040235 ] Dayou Zhou commented on SPARK-11966:

Hi [~hvanhovell], are there any examples of how this might be done using the Data Sources API? Thanks.

> Key: SPARK-11966
> URL: https://issues.apache.org/jira/browse/SPARK-11966
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Jaka Jancar
> Priority: Minor

Defining UDFs is easy using sqlContext.udf.register, but not table-generating functions. For those you still have to use these horrendous Hive interfaces: https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java
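For context on what a UDTF does: it is a function mapping one input row to zero or more output rows, the shape Spark users typically get today via {{Dataset.flatMap}} or a UDF returning an array followed by {{explode}}, rather than the Hive {{GenericUDTF}} interface. The mapping function itself is ordinary code; sketched here over plain Python lists (not running Spark) just to show the one-to-many shape:

```python
# A table-generating function: one row in, many rows out.
# In Spark this shape is usually expressed as Dataset.flatMap, or as a
# UDF returning an array that is then exploded; here it is emulated
# over plain lists so the shape is visible without a cluster.

def split_words(row):
    """Expand one (key, text) row into one (key, word) row per word."""
    key, text = row
    return [(key, word) for word in text.split()]

rows = [(1, "hello world"), (2, "spark sql")]
out = [generated for row in rows for generated in split_words(row)]
print(out)  # [(1, 'hello'), (1, 'world'), (2, 'spark'), (2, 'sql')]
```

The missing piece the issue asks for is a first-class way to register such a function under a name callable from SQL, the way {{sqlContext.udf.register}} works for scalar UDFs.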
[jira] [Updated] (SPARK-20972) rename HintInfo.isBroadcastable to broadcast
[ https://issues.apache.org/jira/browse/SPARK-20972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20972:

Summary: rename HintInfo.isBroadcastable to broadcast (was: rename HintInfo.isBroadcastable to forceBroadcast)
[jira] [Commented] (SPARK-21002) Syntax error regression when creating Hive storage handlers on Spark shell
[ https://issues.apache.org/jira/browse/SPARK-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040226#comment-16040226 ] Dayou Zhou commented on SPARK-21002:

Hi [~viirya], thanks for pointing to SPARK-19360 -- I missed it in my search. When will this be fixed?

> Key: SPARK-21002
> URL: https://issues.apache.org/jira/browse/SPARK-21002
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Dayou Zhou

I use the following syntax to create my Hive storage handlers:
{code}
CREATE TABLE t1
ROW FORMAT SERDE 'com.foo.MySerDe'
STORED BY 'com.foo.MyStorageHandler'
TBLPROPERTIES (
  ..
);
{code}
I'm trying to do the same from the Spark shell. It works fine on Spark 1.6, but on Spark 2.0+ it fails with the following:
{code}
org.apache.spark.sql.catalyst.parser.ParseException:
Operation not allowed: Unexpected combination of ROW FORMAT SERDE 'com.foo.MySerDe' and STORED BY 'com.foo.MyStorageHandler'(line 1, pos 0)
{code}
Could you please confirm whether this is a regression, and if so, could it be fixed in Spark 2.2? Thank you.
[jira] [Commented] (SPARK-19360) Spark 2.X does not support stored by clause
[ https://issues.apache.org/jira/browse/SPARK-19360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040223#comment-16040223 ] Dayou Zhou commented on SPARK-19360:

Same here... Since STORED BY is *essential* for using Hive storage handlers, is this a regression that will be fixed? Are there any workarounds? Thanks.

> Key: SPARK-19360
> URL: https://issues.apache.org/jira/browse/SPARK-19360
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Reporter: Ran Haim
> Priority: Minor

Spark 1.6 and earlier versions support HiveContext, which supports Hive storage handlers with the "stored by" clause. However, Spark 2.x does not support "stored by".
[jira] [Commented] (SPARK-20997) spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only"
[ https://issues.apache.org/jira/browse/SPARK-20997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040210#comment-16040210 ] guoxiaolongzte commented on SPARK-20997:

Can I fix this JIRA? I found a similar problem here, and I am addressing these issues together. [~srowen] [~jlaskowski]

> Key: SPARK-20997
> URL: https://issues.apache.org/jira/browse/SPARK-20997
> Project: Spark
> Issue Type: Bug
> Components: Documentation, Spark Submit
> Affects Versions: 2.3.0
> Reporter: Jacek Laskowski
> Priority: Trivial

Just noticed that {{spark-submit}} describes {{--driver-cores}} under both:
* Spark standalone with cluster deploy mode only
* YARN-only

While I can understand "only" in "Spark standalone with cluster deploy mode only" to refer to cluster deploy mode (not the default client mode), "YARN-only" baffles me, which I think deserves a fix.
[jira] [Assigned] (SPARK-20935) A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating StreamingContext.
[ https://issues.apache.org/jira/browse/SPARK-20935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20935:

Assignee: (was: Apache Spark)

> Key: SPARK-20935
> URL: https://issues.apache.org/jira/browse/SPARK-20935
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 1.6.3, 2.1.1
> Reporter: Terence Yim

With batched write ahead log on by default in the driver (SPARK-11731), if there is no receiver-based {{InputDStream}}, the "BatchedWriteAheadLog Writer" thread created by {{BatchedWriteAheadLog}} never gets shut down. The root cause is https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L168, which never calls {{ReceivedBlockTracker.stop()}} (which in turn calls {{BatchedWriteAheadLog.close()}}) if there is no receiver-based input.
[jira] [Assigned] (SPARK-20935) A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating StreamingContext.
[ https://issues.apache.org/jira/browse/SPARK-20935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20935:

Assignee: Apache Spark
[jira] [Commented] (SPARK-20935) A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating StreamingContext.
[ https://issues.apache.org/jira/browse/SPARK-20935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040176#comment-16040176 ] Apache Spark commented on SPARK-20935:

User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/18224
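The failure mode described in SPARK-20935, a writer thread whose shutdown is guarded by a condition that never holds, can be illustrated without Spark. The class names below are hypothetical stand-ins for {{BatchedWriteAheadLog}} and {{ReceiverTracker}}, not the actual Spark classes:

```python
import threading
import queue

class BatchedLogWriter:
    """Stand-in for BatchedWriteAheadLog: a daemon thread draining a queue."""
    def __init__(self):
        self._queue = queue.Queue()
        self._running = True
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def _drain(self):
        while self._running:
            try:
                self._queue.get(timeout=0.05)
            except queue.Empty:
                pass  # keep polling until close() is called

    def close(self):
        self._running = False
        self._thread.join()

class Tracker:
    """Stand-in for ReceiverTracker: only stops the writer if receivers existed."""
    def __init__(self, has_receivers: bool):
        self.has_receivers = has_receivers
        self.log = BatchedLogWriter()

    def stop(self):
        # The reported bug pattern: the guard skips the close() call entirely
        # when there are no receiver-based inputs, leaking the daemon thread.
        if self.has_receivers:
            self.log.close()

tracker = Tracker(has_receivers=False)
tracker.stop()
print(tracker.log._thread.is_alive())  # the writer thread is still running
```

With {{has_receivers=True}} the same {{stop()}} call joins the thread cleanly, which is why the leak only shows up in jobs with no receiver-based input.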
[jira] [Comment Edited] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040112#comment-16040112 ] Binzi Cao edited comment on SPARK-20760 at 6/7/17 3:47 AM:

Hi David, thanks very much for the message. I did a test with Spark 2.1.1 in local mode. The issue still seems to be happening, although it seems much better than Spark 2.1.0, as the RDD blocks grow much more slowly. After running the task for 2 hours, I had around 6000 RDD blocks in memory. I attached the screenshots for 2.1.1. Binzi

was (Author: caobinzi):
Hi David, thanks very much for the message. I did a test with Spark 2.1.1 in local mode. The issue still seems to be happening, although it seems much better than Spark 2.0, as the RDD blocks grow much more slowly. After running the task for 2 hours, I had around 6000 RDD blocks in memory. I attached the screenshots for 2.1.1. Binzi

> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
> Issue Type: Bug
> Components: Block Manager
> Affects Versions: 2.1.0
> Environment: Spark 2.1.0
> Reporter: Binzi Cao
> Attachments: RDD blocks in spark 2.1.1.png, RDD Blocks .png, Storage in spark 2.1.1.png

Memory leak of RDD blocks in a long-running RDD process. We have a long-running application which does computations on RDDs, and we found that the RDD blocks keep increasing in the Spark UI page. The RDD blocks and memory usage do not match the cached RDDs and memory. It looks like Spark keeps old RDDs in memory and never releases them, or never gets a chance to release them. The job will eventually die of out of memory. In addition, I'm not seeing this issue in Spark 1.6. We are seeing the same issue in YARN cluster mode in both Kafka streaming and batch applications. The issue in streaming is similar; however, the RDD blocks seem to grow a bit more slowly than in batch jobs.

The code below is a sample, and the issue is reproducible just by running it in local mode.

Scala file:
{code}
import scala.concurrent.duration.Duration
import scala.util.{Try, Failure, Success}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.concurrent._
import ExecutionContext.Implicits.global

case class Person(id: String, name: String)

object RDDApp {
  def run(sc: SparkContext) = {
    while (true) {
      val r = scala.util.Random
      val data = (1 to r.nextInt(100)).toList.map { a =>
        Person(a.toString, a.toString)
      }
      val rdd = sc.parallelize(data)
      rdd.cache
      println("running")
      val a = (1 to 100).toList.map { x =>
        Future(rdd.filter(_.id == x.toString).collect)
      }
      a.foreach { f =>
        println(Await.ready(f, Duration.Inf).value.get)
      }
      rdd.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("test")
    val sc = new SparkContext(conf)
    run(sc)
  }
}
{code}
build sbt file:
{code}
name := "RDDTest"
version := "0.1.1"
scalaVersion := "2.11.5"

libraryDependencies ++= Seq(
  "org.scalaz" %% "scalaz-core" % "7.2.0",
  "org.scalaz" %% "scalaz-concurrent" % "7.2.0",
  "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
  "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
)

addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")

mainClass in assembly := Some("RDDApp")
test in assembly := {}
{code}
To reproduce it, just run:
{code}
spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \
  --executor-memory 4G \
  --executor-cores 1 \
  --num-executors 1 \
  --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
{code}
[jira] [Updated] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Binzi Cao updated SPARK-20760:

Attachment: Storage in spark 2.1.1.png
            RDD blocks in spark 2.1.1.png

Spark 2.1.1 Test Result
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040112#comment-16040112 ] Binzi Cao commented on SPARK-20760:

Hi David, thanks very much for the message. I did a test with Spark 2.1.1 in local mode. The issue still seems to be happening, although it seems much better than Spark 2.0, as the RDD blocks grow much more slowly. After running the task for 2 hours, I had around 6000 RDD blocks in memory. I attached the screenshots for 2.1.1. Binzi
[jira] [Resolved] (SPARK-21002) Syntax error regression when creating Hive storage handlers on Spark shell
[ https://issues.apache.org/jira/browse/SPARK-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21002. -- Resolution: Duplicate > Syntax error regression when creating Hive storage handlers on Spark shell > -- > > Key: SPARK-21002 > URL: https://issues.apache.org/jira/browse/SPARK-21002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Dayou Zhou > > I use the following syntax to create my Hive storage handlers: > CREATE TABLE t1 > ROW FORMAT SERDE 'com.foo.MySerDe' > STORED BY 'com.foo.MyStorageHandler' > TBLPROPERTIES ( >.. > ); > I'm trying to do the same from the Spark shell. It works fine on Spark 1.6, > but on Spark 2.0+, it fails with the following: > org.apache.spark.sql.catalyst.parser.ParseException: > Operation not allowed: Unexpected combination of ROW FORMAT SERDE > 'com.foo.MySerDe' and STORED BY 'com.foo.MyStorageHandler' (line 1, pos 0) > Could you please confirm whether this is a regression, and if so, could it be > fixed in Spark 2.2? Thank you.
[jira] [Commented] (SPARK-21002) Syntax error regression when creating Hive storage handlers on Spark shell
[ https://issues.apache.org/jira/browse/SPARK-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040057#comment-16040057 ] Liang-Chi Hsieh commented on SPARK-21002: - This is a duplicate of SPARK-19360. > Syntax error regression when creating Hive storage handlers on Spark shell > -- > > Key: SPARK-21002 > URL: https://issues.apache.org/jira/browse/SPARK-21002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Dayou Zhou > > I use the following syntax to create my Hive storage handlers: > CREATE TABLE t1 > ROW FORMAT SERDE 'com.foo.MySerDe' > STORED BY 'com.foo.MyStorageHandler' > TBLPROPERTIES ( >.. > ); > I'm trying to do the same from the Spark shell. It works fine on Spark 1.6, > but on Spark 2.0+, it fails with the following: > org.apache.spark.sql.catalyst.parser.ParseException: > Operation not allowed: Unexpected combination of ROW FORMAT SERDE > 'com.foo.MySerDe' and STORED BY 'com.foo.MyStorageHandler' (line 1, pos 0) > Could you please confirm whether this is a regression, and if so, could it be > fixed in Spark 2.2? Thank you.
[jira] [Commented] (SPARK-21002) Syntax error regression when creating Hive storage handlers on Spark shell
[ https://issues.apache.org/jira/browse/SPARK-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040055#comment-16040055 ] Liang-Chi Hsieh commented on SPARK-21002: - It seems we no longer support {{STORED BY storage_handler}} since 2.0. > Syntax error regression when creating Hive storage handlers on Spark shell > -- > > Key: SPARK-21002 > URL: https://issues.apache.org/jira/browse/SPARK-21002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Dayou Zhou > > I use the following syntax to create my Hive storage handlers: > CREATE TABLE t1 > ROW FORMAT SERDE 'com.foo.MySerDe' > STORED BY 'com.foo.MyStorageHandler' > TBLPROPERTIES ( >.. > ); > I'm trying to do the same from the Spark shell. It works fine on Spark 1.6, > but on Spark 2.0+, it fails with the following: > org.apache.spark.sql.catalyst.parser.ParseException: > Operation not allowed: Unexpected combination of ROW FORMAT SERDE > 'com.foo.MySerDe' and STORED BY 'com.foo.MyStorageHandler' (line 1, pos 0) > Could you please confirm whether this is a regression, and if so, could it be > fixed in Spark 2.2? Thank you.
[jira] [Comment Edited] (SPARK-20977) NPE in CollectionAccumulator
[ https://issues.apache.org/jira/browse/SPARK-20977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039870#comment-16039870 ] Boris Capitanu edited comment on SPARK-20977 at 6/6/17 11:35 PM: - I am seeing this problem as well. I have a scala SBT project with JavaAppPackaging using Spark 2.1.1. I built my project with "sbt stage" and when I ran the resulting app, the log showed the same error reported here. {noformat} 17/06/06 19:25:09 ERROR [heartbeat-receiver-event-loop-thread] o.a.s.u.Utils: Uncaught exception in thread heartbeat-receiver-even t-loop-thread java.lang.NullPointerException: null at org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464) at org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252) at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186) at org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:407) at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:129) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283) at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:128) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {noformat} was (Author: borice): I am seeing this problem as well. I have a scala SBT project with JavaAppPackaging using Spark 2.1.1. I built my project with "sbt stage" and when I ran the resulting app, the log showed the same error reported here. 
{quote} 17/06/06 19:25:09 ERROR [heartbeat-receiver-event-loop-thread] o.a.s.u.Utils: Uncaught exception in thread heartbeat-receiver-even t-loop-thread java.lang.NullPointerException: null at org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464) at org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
[jira] [Commented] (SPARK-20977) NPE in CollectionAccumulator
[ https://issues.apache.org/jira/browse/SPARK-20977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039870#comment-16039870 ] Boris Capitanu commented on SPARK-20977: I am seeing this problem as well. I have a scala SBT project with JavaAppPackaging using Spark 2.1.1. I built my project with "sbt stage" and when I ran the resulting app, the log showed the same error reported here. {quote} 17/06/06 19:25:09 ERROR [heartbeat-receiver-event-loop-thread] o.a.s.u.Utils: Uncaught exception in thread heartbeat-receiver-even t-loop-thread java.lang.NullPointerException: null at org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464) at org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) 
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252) at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186) at org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:407) at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:129) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283) at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:128) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {quote} > NPE in CollectionAccumulator > > > Key: SPARK-20977 > URL: https://issues.apache.org/jira/browse/SPARK-20977 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 > Environment: JDK: > openjdk version "1.8.0-internal" > OpenJDK Runtime Environment (build 1.8.0-internal-horii_2016_12_20_18_43-b00) > OpenJDK 64-Bit Server VM (build 25.71-b00, mixed mode) > CPU: > POWER8 >Reporter: sharkd tu > > 17/06/03 13:39:31 ERROR Utils: Uncaught exception in thread > heartbeat-receiver-event-loop-thread > java.lang.NullPointerException > at > org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464) > at > org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439) > at > 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLi
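Both traces bottom out in {{CollectionAccumulator.value}}, i.e. the heartbeat thread dereferences the accumulator's backing list while it is null. As a generic illustration of the defensive pattern this suggests — a hypothetical Python stand-in, not Spark's actual Scala implementation — a lazily initialized list can be snapshotted null-safely:

```python
import threading

# Hypothetical stand-in for an accumulator-like holder: a monitoring thread
# may call value() before the owning thread has initialized the backing list,
# so value() must tolerate None instead of raising (the Python analogue of
# the NullPointerException seen in the report).
class SafeListHolder:
    def __init__(self):
        self._lock = threading.Lock()
        self._items = None  # may still be None when a reader arrives

    def add(self, elem):
        with self._lock:
            if self._items is None:
                self._items = []
            self._items.append(elem)

    def value(self):
        with self._lock:
            # An uninitialized holder yields an empty snapshot; otherwise
            # return a copy so callers never observe concurrent mutation.
            return [] if self._items is None else list(self._items)
```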
[jira] [Commented] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows
[ https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039862#comment-16039862 ] Liang-Chi Hsieh commented on SPARK-20969: - [~pletelli] I can't find an API doc for that. Maybe we can add one. > last() aggregate function fails returning the right answer with ordered > windows > --- > > Key: SPARK-20969 > URL: https://issues.apache.org/jira/browse/SPARK-20969 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Perrine Letellier > > The column passed to `orderBy` is effectively treated as another column on > which to partition.
> {code}
> scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "description")
> scala> import org.apache.spark.sql.expressions.Window
> scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
> scala> df.withColumn("last", last(col("description")).over(window)).show
> +---+---+-----------+-----+
> | id| ts|description| last|
> +---+---+-----------+-----+
> | i1|  1|      desc1|desc2|
> | i1|  1|      desc2|desc2|
> | i1|  2|      desc3|desc3|
> +---+---+-----------+-----+
> {code}
> However, the expected result is the same answer as asking for `first()` over a
> window with descending order.
> {code}
> scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
> scala> df.withColumn("hackedLast", first(col("description")).over(window)).show
> +---+---+-----------+----------+
> | id| ts|description|hackedLast|
> +---+---+-----------+----------+
> | i1|  2|      desc3|     desc3|
> | i1|  1|      desc1|     desc3|
> | i1|  1|      desc2|     desc3|
> +---+---+-----------+----------+
> {code}
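The reported behavior matches the SQL default frame rather than extra partitioning: when a window specification has an ORDER BY but no explicit frame, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so {{last()}} returns the last ORDER BY *peer* of the current row, not the last row of the partition. A plain-Python sketch (no Spark involved) contrasting the two frame semantics on the reporter's data:

```python
def last_over_default_frame(rows, order_key):
    """last() when the frame ends at CURRENT ROW with RANGE semantics:
    each row's frame extends through the last of its ORDER BY peers."""
    result = []
    for row in rows:
        # Index of the last row sharing this row's ORDER BY value (its peer).
        peer_end = max(i for i, r in enumerate(rows)
                       if order_key(r) == order_key(row))
        result.append(rows[peer_end])
    return result

def last_over_full_partition(rows):
    """last() when the frame is UNBOUNDED PRECEDING .. UNBOUNDED FOLLOWING."""
    return [rows[-1]] * len(rows)

# The partition from the report, already sorted by ts ascending.
rows = [("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3")]
ts = lambda r: r[1]

default_last = [r[2] for r in last_over_default_frame(rows, ts)]
full_last = [r[2] for r in last_over_full_partition(rows)]
```

In PySpark the whole-partition behavior can be requested explicitly, e.g. with {{window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)}} from the documented {{Window}} API.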
[jira] [Issue Comment Deleted] (SPARK-20973) insert table fail caused by unable to fetch data definition file from remote hdfs
[ https://issues.apache.org/jira/browse/SPARK-20973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yunjian Zhang updated SPARK-20973: -- Comment: was deleted (was: I did check the source code and add a patch to fix the insert issue as below, unable to attach file here, so just past the content as well. -- --- a/./workspace1/spark-2.1.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala +++ b/./workspace/git/gdr/spark/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala @@ -57,7 +57,7 @@ private[hive] class SparkHiveWriterContainer( extends Logging with HiveInspectors with Serializable { - + private val now = new Date() private val tableDesc: TableDesc = fileSinkConf.getTableInfo // Add table properties from storage handler to jobConf, so any custom storage @@ -154,6 +154,12 @@ private[hive] class SparkHiveWriterContainer( conf.value.setBoolean("mapred.task.is.map", true) conf.value.setInt("mapred.task.partition", splitID) } + + def newSerializer(tableDesc: TableDesc): Serializer = { +val serializer = tableDesc.getDeserializerClass.newInstance().asInstanceOf[Serializer] +serializer.initialize(null, tableDesc.getProperties) +serializer + } def newSerializer(jobConf: JobConf, tableDesc: TableDesc): Serializer = { val serializer = tableDesc.getDeserializerClass.newInstance().asInstanceOf[Serializer] @@ -162,10 +168,11 @@ private[hive] class SparkHiveWriterContainer( } protected def prepareForWrite() = { -val serializer = newSerializer(jobConf, fileSinkConf.getTableInfo) +val serializer = newSerializer(conf.value, fileSinkConf.getTableInfo) +logInfo("CHECK table deser:" + fileSinkConf.getTableInfo.getDeserializer(conf.value)) val standardOI = ObjectInspectorUtils .getStandardObjectInspector( -fileSinkConf.getTableInfo.getDeserializer.getObjectInspector, + fileSinkConf.getTableInfo.getDeserializer(conf.value).getObjectInspector, ObjectInspectorCopyOption.JAVA) .asInstanceOf[StructObjectInspector]) > insert 
table fail caused by unable to fetch data definition file from remote > hdfs > -- > > Key: SPARK-20973 > URL: https://issues.apache.org/jira/browse/SPARK-20973 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yunjian Zhang > Labels: patch > Attachments: spark-sql-insert.patch > > > I implemented my own Hive serde to handle special data files, which needs to > read a data definition file during processing. > The process includes: > 1. read the definition file location from TBLPROPERTIES > 2. read the file content as per step 1 > 3. initialize the serde based on step 2 > //DDL of the table as below: > - > CREATE EXTERNAL TABLE dw_user_stg_txt_out > ROW FORMAT SERDE 'com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe' > STORED AS > INPUTFORMAT 'com.ebay.dss.gdr.mapred.AbAsAvroInputFormat' > OUTPUTFORMAT 'com.ebay.dss.gdr.hive.ql.io.ab.AvroAsAbOutputFormat' > LOCATION 'hdfs://${remote_hdfs}/user/data' > TBLPROPERTIES ( > 'com.ebay.dss.dml.file' = 'hdfs://${remote_hdfs}/dml/user.dml' > ) > // insert statement > insert overwrite table dw_user_stg_txt_out select * from dw_user_stg_txt_avro; > //fail with ERROR > 17/06/02 15:46:34 ERROR SparkSQLDriver: Failed in [insert overwrite table > dw_user_stg_txt_out select * from dw_user_stg_txt_avro] > java.lang.RuntimeException: FAILED to get dml file from: > hdfs://${remote-hdfs}/dml/user.dml > at > com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe.initialize(AbvroSerDe.java:109) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:160) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:258) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347)
[jira] [Updated] (SPARK-20973) insert table fail caused by unable to fetch data definition file from remote hdfs
[ https://issues.apache.org/jira/browse/SPARK-20973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yunjian Zhang updated SPARK-20973: -- Attachment: spark-sql-insert.patch > insert table fail caused by unable to fetch data definition file from remote > hdfs > -- > > Key: SPARK-20973 > URL: https://issues.apache.org/jira/browse/SPARK-20973 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yunjian Zhang > Labels: patch > Attachments: spark-sql-insert.patch > > > I implemented my own Hive serde to handle special data files, which needs to > read a data definition file during processing. > The process includes: > 1. read the definition file location from TBLPROPERTIES > 2. read the file content as per step 1 > 3. initialize the serde based on step 2 > //DDL of the table as below: > - > CREATE EXTERNAL TABLE dw_user_stg_txt_out > ROW FORMAT SERDE 'com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe' > STORED AS > INPUTFORMAT 'com.ebay.dss.gdr.mapred.AbAsAvroInputFormat' > OUTPUTFORMAT 'com.ebay.dss.gdr.hive.ql.io.ab.AvroAsAbOutputFormat' > LOCATION 'hdfs://${remote_hdfs}/user/data' > TBLPROPERTIES ( > 'com.ebay.dss.dml.file' = 'hdfs://${remote_hdfs}/dml/user.dml' > ) > // insert statement > insert overwrite table dw_user_stg_txt_out select * from dw_user_stg_txt_avro; > //fail with ERROR > 17/06/02 15:46:34 ERROR SparkSQLDriver: Failed in [insert overwrite table > dw_user_stg_txt_out select * from dw_user_stg_txt_avro] > java.lang.RuntimeException: FAILED to get dml file from: > hdfs://${remote-hdfs}/dml/user.dml > at > com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe.initialize(AbvroSerDe.java:109) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:160) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:258) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347)
[jira] [Created] (SPARK-21002) Syntax error regression when creating Hive storage handlers on Spark shell
Dayou Zhou created SPARK-21002: -- Summary: Syntax error regression when creating Hive storage handlers on Spark shell Key: SPARK-21002 URL: https://issues.apache.org/jira/browse/SPARK-21002 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Dayou Zhou I use the following syntax to create my Hive storage handlers: CREATE TABLE t1 ROW FORMAT SERDE 'com.foo.MySerDe' STORED BY 'com.foo.MyStorageHandler' TBLPROPERTIES ( .. ); I'm trying to do the same from the Spark shell. It works fine on Spark 1.6, but on Spark 2.0+, it fails with the following: org.apache.spark.sql.catalyst.parser.ParseException: Operation not allowed: Unexpected combination of ROW FORMAT SERDE 'com.foo.MySerDe' and STORED BY 'com.foo.MyStorageHandler' (line 1, pos 0) Could you please confirm whether this is a regression, and if so, could it be fixed in Spark 2.2? Thank you.
[jira] [Commented] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
[ https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039762#comment-16039762 ] Dayou Zhou commented on SPARK-19878: Hello, just checking: will this be shipped in Spark 2.2? Thank you. > Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala > -- > > Key: SPARK-19878 > URL: https://issues.apache.org/jira/browse/SPARK-19878 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0 > Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0 >Reporter: kavn qin > Labels: patch > Attachments: SPARK-19878.patch > > > When the case class InsertIntoHiveTable initializes a serde, it explicitly passes > null for the Configuration in Spark 1.5.0: > [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58] > In Spark 2.0.0, when the HiveWriterContainer initializes a serde, it likewise > passes null for the Configuration: > [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161] > When we implement a hive serde, we want to use the hive configuration to get > some static and dynamic settings, but we cannot! > So this patch adds the configuration when initializing the hive serde.
[jira] [Commented] (SPARK-20960) make ColumnVector public
[ https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039648#comment-16039648 ] Wes McKinney commented on SPARK-20960: -- [~cloud_fan] this will be very exciting to have as a supported public API for more efficient UDF execution. We're ready to help with improvements to Arrow (like in-memory encodings / compression a la ARROW-300) to help with these use cases. cc [~jnadeau] [~julienledem] > make ColumnVector public > > > Key: SPARK-20960 > URL: https://issues.apache.org/jira/browse/SPARK-20960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan > > ColumnVector is an internal interface in Spark SQL, which is only used for > vectorized parquet reader to represent the in-memory columnar format. > In Spark 2.3 we want to make ColumnVector public, so that we can provide a > more efficient way for data exchanges between Spark and external systems. For > example, we can use ColumnVector to build the columnar read API in data > source framework, we can use ColumnVector to build a more efficient UDF API, > etc. > We also want to introduce a new ColumnVector implementation based on Apache > Arrow(basically just a wrapper over Arrow), so that external systems(like > Python Pandas DataFrame) can build ColumnVector very easily. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20655) In-memory key-value store implementation
[ https://issues.apache.org/jira/browse/SPARK-20655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20655: Assignee: Apache Spark > In-memory key-value store implementation > > > Key: SPARK-20655 > URL: https://issues.apache.org/jira/browse/SPARK-20655 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > See spec in parent issue (SPARK-18085) for more details. > This task tracks adding an in-memory implementation for the key-value store > abstraction added in SPARK-20641. This is desired because people might want > to avoid having to store this data on disk when running applications, and > also because the LevelDB native libraries are not available on all platforms. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20655) In-memory key-value store implementation
[ https://issues.apache.org/jira/browse/SPARK-20655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20655: Assignee: (was: Apache Spark) > In-memory key-value store implementation > > > Key: SPARK-20655 > URL: https://issues.apache.org/jira/browse/SPARK-20655 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks adding an in-memory implementation for the key-value store > abstraction added in SPARK-20641. This is desired because people might want > to avoid having to store this data on disk when running applications, and > also because the LevelDB native libraries are not available on all platforms. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20655) In-memory key-value store implementation
[ https://issues.apache.org/jira/browse/SPARK-20655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039619#comment-16039619 ] Apache Spark commented on SPARK-20655: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/18221 > In-memory key-value store implementation > > > Key: SPARK-20655 > URL: https://issues.apache.org/jira/browse/SPARK-20655 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks adding an in-memory implementation for the key-value store > abstraction added in SPARK-20641. This is desired because people might want > to avoid having to store this data on disk when running applications, and > also because the LevelDB native libraries are not available on all platforms. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20357) Expose Calendar.getWeekYear() as Spark SQL date function to be consistent with weekofyear()
[ https://issues.apache.org/jira/browse/SPARK-20357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039496#comment-16039496 ] Cam Mach commented on SPARK-20357: -- I can work on this issue. Can someone assign it to me? Thanks > Expose Calendar.getWeekYear() as Spark SQL date function to be consistent > with weekofyear() > --- > > Key: SPARK-20357 > URL: https://issues.apache.org/jira/browse/SPARK-20357 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Jeeyoung Kim >Priority: Minor > > Since weeks and years are extracted using different boundaries (weeks happen > every 7 days, years happen every 365-ish days, which is not divisible by 7), > there are weird inconsistencies around how end-of-the-year dates are handled > if you use the {{year}} and {{weekofyear}} Spark SQL functions. The example below > shows how "2017-01-01" and "2017-12-30" have the same {{(year, week)}} pair. > This happens because the week for "2017-01-01" is calculated as the "last week of > 2016", while the {{year}} function in Spark SQL ignores this and returns the year > component of yyyy-MM-DD. > The correct way to fix this is by exposing {{java.util.Calendar.getWeekYear()}}. > This function calculates week-based years, so "2017-01-01" will return 2016 > instead in this case. 
> {noformat}
> # Trying out the bug for date - using PySpark
> import pyspark.sql.functions as F
> df = spark.createDataFrame([("2016-12-31",), ("2016-12-30",), ("2017-01-01",), ("2017-01-02",), ("2017-12-30",)], ['id'])
> df_parsed = (
>     df
>     .withColumn("year", F.year(df['id'].cast("date")))
>     .withColumn("weekofyear", F.weekofyear(df['id'].cast("date")))
> )
> df_parsed.show()
> {noformat}
> Prints
> {noformat}
> +----------+----+----------+
> |        id|year|weekofyear|
> +----------+----+----------+
> |2016-12-31|2016|        52|
> |2016-12-30|2016|        52|
> |2017-01-01|2017|        52| <- same (year, weekofyear) output
> |2017-01-02|2017|         1|
> |2017-12-30|2017|        52| <-
> +----------+----+----------+
> {noformat}
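The week-based year requested here is what ISO 8601 calls the "week year"; Python's {{date.isocalendar()}} computes the same notion (and {{Calendar.getWeekYear()}} agrees with it under ISO-style week rules), which makes the inconsistency easy to check outside Spark:

```python
from datetime import date

def week_year_pair(d):
    # isocalendar() returns (ISO week-based year, ISO week number, weekday);
    # the first component is the "week year" the issue asks Spark to expose.
    iso = d.isocalendar()
    return (iso[0], iso[1])

# Pairing the week number with the *calendar* year collides: 2017-01-01 and
# 2017-12-30 both give (2017, 52) under year()/weekofyear(). Pairing it with
# the week-based year instead keeps the labels consistent.
pairs = {s: week_year_pair(date.fromisoformat(s))
         for s in ["2016-12-30", "2016-12-31", "2017-01-01",
                   "2017-01-02", "2017-12-30"]}
```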
[jira] [Commented] (SPARK-20998) BroadcastHashJoin producing wrong results
[ https://issues.apache.org/jira/browse/SPARK-20998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039499#comment-16039499 ] Takeshi Yamamuro commented on SPARK-20998: -- What about v2.1? Does that version have the same issue? > BroadcastHashJoin producing wrong results > - > > Key: SPARK-20998 > URL: https://issues.apache.org/jira/browse/SPARK-20998 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Mohit > > I have a Hive table, _eagle_edw_batch.DistributionAttributes_, with > *Schema*: > root > |-- distributionstatus: string (nullable = true) > |-- enabledforselectionflag: boolean (nullable = true) > |-- sourcedistributionid: integer (nullable = true) > |-- rowstartdate: date (nullable = true) > |-- rowenddate: date (nullable = true) > |-- rowiscurrent: string (nullable = true) > |-- dwcreatedate: timestamp (nullable = true) > |-- dwlastupdatedate: timestamp (nullable = true) > |-- appid: integer (nullable = true) > |-- siteid: integer (nullable = true) > |-- brandid: integer (nullable = true) > *DataFrame* > val df = spark.sql("SELECT s.sourcedistributionid as sid, > t.sourcedistributionid as tid, s.appid as sapp, t.appid as tapp, s.brandid > as sbrand, t.brandid as tbrand FROM eagle_edw_batch.DistributionAttributes t > INNER JOIN eagle_edw_batch.DistributionAttributes s ON > t.sourcedistributionid=s.sourcedistributionid AND t.appid=s.appid AND > t.brandid=s.brandid")
> *Without BroadCastJoin* ( spark-shell --conf > "spark.sql.autoBroadcastJoinThreshold=-1") : > df.explain > == Physical Plan == > *Project [sourcedistributionid#71 AS sid#0, sourcedistributionid#60 AS tid#1, > appid#77 AS sapp#2, appid#66 AS tapp#3, brandid#79 AS sbrand#4, brandid#68 AS > tbrand#5] > +- *SortMergeJoin [sourcedistributionid#60, appid#66, brandid#68], > [sourcedistributionid#71, appid#77, brandid#79], Inner >:- *Sort [sourcedistributionid#60 ASC, appid#66 ASC, brandid#68 ASC], > false, 0 >: +- Exchange hashpartitioning(sourcedistributionid#60, appid#66, > brandid#68, 200) >: +- *Filter ((isnotnull(sourcedistributionid#60) && > isnotnull(brandid#68)) && isnotnull(appid#66)) >:+- HiveTableScan [sourcedistributionid#60, appid#66, brandid#68], > MetastoreRelation eagle_edw_batch, distributionattributes, t >+- *Sort [sourcedistributionid#71 ASC, appid#77 ASC, brandid#79 ASC], > false, 0 > +- Exchange hashpartitioning(sourcedistributionid#71, appid#77, > brandid#79, 200) > +- *Filter ((isnotnull(sourcedistributionid#71) && > isnotnull(appid#77)) && isnotnull(brandid#79)) > +- HiveTableScan [sourcedistributionid#71, appid#77, brandid#79], > MetastoreRelation eagle_edw_batch, distributionattributes, s > df.show > |sid|tid|sapp|tapp|sbrand|tbrand| > | 22| 22| 61| 61| 614| 614| > | 29| 29| 65| 65| 0| 0| > | 30| 30| 12| 12| 121| 121| > | 10| 10| 73| 73| 731| 731| > | 24| 24| 61| 61| 611| 611| > | 35| 35| 65| 65| 0| 0| > *With BroadCastJoin* ( spark-shell ) > df.explain > == Physical Plan == > *Project [sourcedistributionid#136 AS sid#65, sourcedistributionid#125 AS > tid#66, appid#142 AS sapp#67, appid#131 AS tapp#68, brandid#144 AS sbrand#69, > brandid#133 AS tbrand#70] > +- *BroadcastHashJoin [sourcedistributionid#125, appid#131, brandid#133], > [sourcedistributionid#136, appid#142, brandid#144], Inner, BuildRight >:- *Filter ((isnotnull(brandid#133) && isnotnull(appid#131)) && > isnotnull(sourcedistributionid#125)) >: +- HiveTableScan 
[sourcedistributionid#125, appid#131, brandid#133], > MetastoreRelation eagle_edw_batch, distributionattributes, t >+- BroadcastExchange > HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, > false] as bigint), 32) | (cast(input[1, int, false] as bigint) & > 4294967295)), 32) | (cast(input[2, int, false] as bigint) & 4294967295 > +- *Filter ((isnotnull(brandid#144) && > isnotnull(sourcedistributionid#136)) && isnotnull(appid#142)) > +- HiveTableScan [sourcedistributionid#136, appid#142, brandid#144], > MetastoreRelation eagle_edw_batch, distributionattributes, s > df.show > |sid|tid|sapp|tapp|sbrand|tbrand| > | 15| 22| 61| 61| 614| 614| > | 13| 22| 61| 61| 614| 614| > | 10| 22| 61| 61| 614| 614| > | 7| 22| 61| 61| 614| 614| > | 9| 22| 61| 61| 614| 614| > | 16| 22| 61| 61| 614| 614| -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
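Reading the {{HashedRelationBroadcastMode}} expression in the plan above, one plausible explanation for the mismatched {{sid}}/{{tid}} rows (a sketch of the arithmetic, not a confirmed diagnosis of the bug) is that packing three 32-bit keys into one 64-bit long shifts the first key out of the word: after the second {{shiftleft(..., 32)}}, the bits holding {{sourcedistributionid}} overflow a 64-bit value, so any two rows that agree on {{appid}} and {{brandid}} get the same packed key:

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF  # emulate JVM 64-bit long wraparound in Python

def packed_key(source_id, app_id, brand_id):
    # shiftleft(shiftleft(cast(a as bigint), 32) | (b & MASK32), 32) | (c & MASK32)
    v = ((source_id << 32) | (app_id & MASK32)) & MASK64
    v = ((v << 32) | (brand_id & MASK32)) & MASK64  # source_id's bits fall off here
    return v

# Rows from the wrong output above: sids 22 and 15 share (appid=61, brandid=614)
assert packed_key(22, 61, 614) == packed_key(15, 61, 614)   # collision
assert packed_key(22, 61, 614) != packed_key(22, 65, 614)   # appid still matters
```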
[jira] [Resolved] (SPARK-20641) Key-value store abstraction and implementation for storing application data
[ https://issues.apache.org/jira/browse/SPARK-20641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-20641. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 17902 [https://github.com/apache/spark/pull/17902] > Key-value store abstraction and implementation for storing application data > --- > > Key: SPARK-20641 > URL: https://issues.apache.org/jira/browse/SPARK-20641 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > Fix For: 2.3.0 > > > See spec in parent issue (SPARK-18085) for more details. > This task tracks adding a key-value store abstraction and initial LevelDB > implementation to be used to store application data for building the UI and > REST API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039375#comment-16039375 ] David Lewis commented on SPARK-20760: - I believe this is fixed by this issue: https://issues.apache.org/jira/browse/SPARK-18991 > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD Blocks .png > > > Memory leak of RDD blocks in a long-running RDD process. > We have a long-running application that computes RDDs, and we found that the RDD blocks > keep increasing on the Spark UI page. The RDD blocks and memory usage do not match the > cached RDDs and memory. It looks like Spark keeps old RDDs in memory and never releases > them, or never gets a chance to release them. The job eventually dies of out-of-memory errors. > In addition, I'm not seeing this issue in Spark 1.6. We are seeing the same > issue in YARN cluster mode in both Kafka streaming and batch applications. > The issue in streaming is similar; however, the RDD blocks seem to grow a > bit slower than in batch jobs. > Below is sample code; the issue is reproducible by just running it in > local mode.
> Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039330#comment-16039330 ] Ajay Cherukuri commented on SPARK-18372: I have this issue in Spark 2.0.2 > .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: Mingjie Tang >Assignee: Mingjie Tang > Fix For: 1.6.4 > > Attachments: _thumb_37664.png > > > Steps to reproduce: > > 1. Launch spark-shell > 2. Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3. 
in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > the issue is that these don't get cleaned up and accumulate > = > with the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - it didn't make any > difference. > .hive-staging folders are created under the folder > hive/warehouse/hivesampletablecopypy/ > we have tried adding this property to hive-site.xml and restarting the > components: > {{hive.exec.stagingdir=${hive.exec.scratchdir}/${user.name}/.staging}} > a new .hive-staging folder was created in the hive/warehouse/ folder > moreover, if we run the same query in pure Hive via the > Hive CLI on the same Spark cluster, we don't see this behavior, > so it doesn't appear to be a Hive issue in this case - this is > Spark behavior > I checked in Ambari; spark.yarn.preserve.staging.files=false is already set in the Spark > configuration > The issue happens via spark-submit as well - the customer used the following > command to reproduce this: > spark-submit test-hive-staging-cleanup.py -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
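Until the leak is fixed upstream, one stopgap (this script is illustrative, not from the ticket; it assumes a locally mounted warehouse path — on HDFS or WASB the same idea would go through the Hadoop FileSystem API or `hdfs dfs -rm -r`) is to sweep abandoned {{.hive-staging_*}} directories on a schedule:

```python
import shutil
import time
from pathlib import Path

def sweep_staging(warehouse, max_age_hours=24):
    """Remove .hive-staging_* directories older than max_age_hours.

    Run only when no load job is writing, since an in-flight job's
    staging directory looks identical to an abandoned one.
    """
    cutoff = time.time() - max_age_hours * 3600
    removed = []
    # Materialize the listing first so deletions don't disturb the walk.
    for d in list(Path(warehouse).rglob(".hive-staging_*")):
        if d.is_dir() and d.stat().st_mtime < cutoff:
            shutil.rmtree(d, ignore_errors=True)
            removed.append(str(d))
    return removed
```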
[jira] [Created] (SPARK-21001) Staging folders from Hive table are not being cleared.
Ajay Cherukuri created SPARK-21001: -- Summary: Staging folders from Hive table are not being cleared. Key: SPARK-21001 URL: https://issues.apache.org/jira/browse/SPARK-21001 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Ajay Cherukuri Staging folders that were created as part of loading data into a Hive table with a Spark job are not cleared. The staging folders remain in the Hive external table folders even after the Spark job completes. This is the same issue mentioned in the ticket: https://issues.apache.org/jira/browse/SPARK-18372 That ticket says the issue was resolved in 1.6.4, but I found that it still exists in 2.0.2. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21000) Add labels support to the Spark Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-21000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21000: Assignee: (was: Apache Spark) > Add labels support to the Spark Dispatcher > -- > > Key: SPARK-21000 > URL: https://issues.apache.org/jira/browse/SPARK-21000 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.2.1 >Reporter: Michael Gummelt > > Labels can be used for tagging drivers with arbitrary data, which can then be > used by an organization's tooling. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21000) Add labels support to the Spark Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-21000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039308#comment-16039308 ] Apache Spark commented on SPARK-21000: -- User 'mgummelt' has created a pull request for this issue: https://github.com/apache/spark/pull/18220 > Add labels support to the Spark Dispatcher > -- > > Key: SPARK-21000 > URL: https://issues.apache.org/jira/browse/SPARK-21000 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.2.1 >Reporter: Michael Gummelt > > Labels can be used for tagging drivers with arbitrary data, which can then be > used by an organization's tooling. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21000) Add labels support to the Spark Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-21000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21000: Assignee: Apache Spark > Add labels support to the Spark Dispatcher > -- > > Key: SPARK-21000 > URL: https://issues.apache.org/jira/browse/SPARK-21000 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.2.1 >Reporter: Michael Gummelt >Assignee: Apache Spark > > Labels can be used for tagging drivers with arbitrary data, which can then be > used by an organization's tooling. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21000) Add labels support to the Spark Dispatcher
Michael Gummelt created SPARK-21000: --- Summary: Add labels support to the Spark Dispatcher Key: SPARK-21000 URL: https://issues.apache.org/jira/browse/SPARK-21000 Project: Spark Issue Type: New Feature Components: Mesos Affects Versions: 2.2.1 Reporter: Michael Gummelt Labels can be used for tagging drivers with arbitrary data, which can then be used by an organization's tooling. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
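A label spec for a dispatcher like this is typically a flat string of comma-separated key:value pairs; the parser below is a hypothetical illustration of that shape (the real configuration key and accepted format are defined by the linked pull request, not here):

```python
def parse_labels(spec):
    """Parse a hypothetical 'k1:v1,k2:v2' label spec into a dict."""
    labels = {}
    if not spec:
        return labels
    for pair in spec.split(","):
        key, sep, value = pair.partition(":")
        if not sep or not key:
            raise ValueError("expected key:value, got %r" % pair)
        labels[key] = value
    return labels

assert parse_labels("team:data-eng,env:prod") == {"team": "data-eng", "env": "prod"}
```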
[jira] [Updated] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures
[ https://issues.apache.org/jira/browse/SPARK-20926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-20926: --- Fix Version/s: 2.2.1 > Exposure to Guava libraries by directly accessing tableRelationCache in > SessionCatalog caused failures > -- > > Key: SPARK-20926 > URL: https://issues.apache.org/jira/browse/SPARK-20926 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Reza Safi >Assignee: Reza Safi > Fix For: 2.2.1, 2.3.0 > > > Because of shading that we did for guava libraries, we see test failures > whenever those components directly access tableRelationCache in > SessionCatalog. > This can happen in any component that shaded guava library. Failures looks > like this: > {noformat} > java.lang.NoSuchMethodError: > org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache; > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492) > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138) > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32) > 01:25:14 at > org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213) > 01:25:14 at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278) > 01:25:14 at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures
[ https://issues.apache.org/jira/browse/SPARK-20926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039249#comment-16039249 ] Marcelo Vanzin commented on SPARK-20926: Turns out I didn't notice the PR was against 2.2 so it's there. Oh well, it's an internal change only anyway. > Exposure to Guava libraries by directly accessing tableRelationCache in > SessionCatalog caused failures > -- > > Key: SPARK-20926 > URL: https://issues.apache.org/jira/browse/SPARK-20926 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Reza Safi >Assignee: Reza Safi > Fix For: 2.2.1, 2.3.0 > > > Because of shading that we did for guava libraries, we see test failures > whenever those components directly access tableRelationCache in > SessionCatalog. > This can happen in any component that shaded guava library. Failures looks > like this: > {noformat} > java.lang.NoSuchMethodError: > org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache; > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492) > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138) > 01:25:14 at > org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32) > 01:25:14 at > org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213) > 01:25:14 at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278) > 01:25:14 at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377) > 01:25:14 at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278) > {noformat} -- This 
message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039223#comment-16039223 ] Marcelo Vanzin commented on SPARK-18085: Yes, https://github.com/vanzin/spark/commits/shs-ng/HEAD should always contain the most up-to-date code. To actually try the leveldb backend you need to set a new configuration in the SHS (check config.scala in the o.a.s.history package). > Better History Server scalability for many / large applications > --- > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-18085: --- Comment: was deleted (was: Your contact with NYSE has changed, please visit https://www.nyse.com/contact or call +1 866 873 7422 or +65 6594 0160 or +852 3962 8100. ) > Better History Server scalability for many / large applications > --- > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20712) [SQL] Spark can't read Hive table when column type has length greater than 4000 bytes
[ https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-20712: --- Affects Version/s: 2.1.2 > [SQL] Spark can't read Hive table when column type has length greater than > 4000 bytes > - > > Key: SPARK-20712 > URL: https://issues.apache.org/jira/browse/SPARK-20712 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.1.2, 2.3.0 >Reporter: Maciej Bryński >Priority: Critical > > Hi, > I have the following issue. > I'm trying to read a table from Hive where one of the columns is nested, so its > schema is longer than 4000 bytes. > Everything worked on Spark 2.0.2. On 2.1.1 I'm getting an exception: > {code} > >> spark.read.table("SOME_TABLE") > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/opt/spark-2.1.1/python/pyspark/sql/readwriter.py", line 259, in table > return self._df(self._jreader.table(tableName)) > File > "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 1133, in __call__ > File "/opt/spark-2.1.1/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o71.table. 
> : org.apache.spark.SparkException: Cannot recognize hive type string: > SOME_VERY_LONG_FIELD_TYPE > at > org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:359) > at > org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:628) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:628) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClien
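The {{Cannot recognize hive type string}} failure is consistent with the metastore handing back a type string cut off at a storage limit: once a nested type is truncated mid-token, its angle brackets no longer balance and no parser can recover it. A small illustration (the 4000-byte figure comes from the issue title; the bracket check is a stand-in for Spark's real type parser, not its implementation):

```python
def brackets_balanced(type_str):
    # A serialized Hive struct type such as struct<a:int,b:array<int>>
    # must have matching angle brackets to be parseable at all.
    depth = 0
    for ch in type_str:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if depth < 0:
            return False
    return depth == 0

# A wide nested column whose serialized type exceeds 4000 characters
wide = "struct<" + ",".join("c%d:int" % i for i in range(700)) + ">"
assert len(wide) > 4000 and brackets_balanced(wide)
assert not brackets_balanced(wide[:4000])  # cut at the limit: unparseable
```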
[jira] [Resolved] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows
[ https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Perrine Letellier resolved SPARK-20969. --- Resolution: Not A Problem > last() aggregate function fails returning the right answer with ordered > windows > --- > > Key: SPARK-20969 > URL: https://issues.apache.org/jira/browse/SPARK-20969 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Perrine Letellier > > The column on which `orderBy` is performed is considered as another column on > which to partition. > {code} > scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), > ("i1", 2, "desc3"))).toDF("id", "ts", "description") > scala> import org.apache.spark.sql.expressions.Window > scala> val window = Window.partitionBy("id").orderBy(col("ts").asc) > scala> df.withColumn("last", last(col("description")).over(window)).show > +---+---+-----------+-----+ > | id| ts|description| last| > +---+---+-----------+-----+ > | i1|  1|      desc1|desc2| > | i1|  1|      desc2|desc2| > | i1|  2|      desc3|desc3| > +---+---+-----------+-----+ > {code} > However what is expected is the same answer as if asking for `first()` with a > window with descending order. > {code} > scala> val window = Window.partitionBy("id").orderBy(col("ts").desc) > scala> df.withColumn("hackedLast", > first(col("description")).over(window)).show > +---+---+-----------+----------+ > | id| ts|description|hackedLast| > +---+---+-----------+----------+ > | i1|  2|      desc3|     desc3| > | i1|  1|      desc1|     desc3| > | i1|  1|      desc2|     desc3| > +---+---+-----------+----------+ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
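The "Not A Problem" resolution follows from the default window frame: with an ORDER BY and no explicit frame, SQL uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so last() returns the last row of the current row's peer group (all rows tied on ts), not the last row of the partition. This sketch emulates that semantics in plain Python over the example's (ts, description) pairs and reproduces the output the reporter found surprising:

```python
def last_over_default_frame(rows):
    """Emulate last(value) OVER (ORDER BY key) with the default frame
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: a row's frame
    includes every peer row with the same key, so last() returns the
    final value among rows whose key is <= the current row's key."""
    ordered = sorted(rows, key=lambda r: r[0])
    return [[v for k, v in ordered if k <= key][-1] for key, _ in ordered]

# (ts, description) rows from the example: both ts=1 rows see desc2,
# matching the df.show() output in the report.
assert last_over_default_frame([(1, "desc1"), (1, "desc2"), (2, "desc3")]) == [
    "desc2", "desc2", "desc3"]
```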
[jira] [Comment Edited] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows
[ https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038825#comment-16038825 ] Perrine Letellier edited comment on SPARK-20969 at 6/6/17 1:00 PM: --- [~viirya] Thanks for your answer ! I could get the expected result by specifying {code}Window.partitionBy("id").orderBy(col("ts").asc).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing){code} instead of simply {code}Window.partitionBy("id").orderBy(col("ts").asc) {code}. Is it documented in any api doc that the default frame is {{RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW}} ? was (Author: pletelli): [~viirya] Thanks for your answer ! I could get the expected result by specifying {code}Window.partitionBy("id").orderBy(col("ts").asc).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing){code} instead of simply {code}Window.partitionBy("id").orderBy(col("ts").asc) {code}. Is it documented in any api doc that the default frame is {code}RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW{code} ? > last() aggregate function fails returning the right answer with ordered > windows > --- > > Key: SPARK-20969 > URL: https://issues.apache.org/jira/browse/SPARK-20969 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Perrine Letellier > > The column on which `orderBy` is performed is considered as another column on > which to partition. 
> {code} > scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), > ("i1", 2, "desc3"))).toDF("id", "ts", "description") > scala> import org.apache.spark.sql.expressions.Window > scala> val window = Window.partitionBy("id").orderBy(col("ts").asc) > scala> df.withColumn("last", last(col("description")).over(window)).show > +---+---+---+-+ > | id| ts|description| last| > +---+---+---+-+ > | i1| 1| desc1|desc2| > | i1| 1| desc2|desc2| > | i1| 2| desc3|desc3| > +---+---+---+-+ > {code} > However what is expected is the same answer as if asking for `first()` with a > window with descending order. > {code} > scala> val window = Window.partitionBy("id").orderBy(col("ts").desc) > scala> df.withColumn("hackedLast", > first(col("description")).over(window)).show > +---+---+---+--+ > | id| ts|description|hackedLast| > +---+---+---+--+ > | i1| 2| desc3| desc3| > | i1| 1| desc1| desc3| > | i1| 1| desc2| desc3| > +---+---+---+--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.
[ https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen closed SPARK-20999.
---
Not A Problem

> No failure Stages, no log 'DAGScheduler: failed: Set()' output.
> ---
>
> Key: SPARK-20999
> URL: https://issues.apache.org/jira/browse/SPARK-20999
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler
> Affects Versions: 2.2.1
> Reporter: caoxuewen
> Priority: Trivial
>
> In the output of the Spark log information:
> INFO DAGScheduler: looking for newly runnable stages
> INFO DAGScheduler: running: Set(ShuffleMapStage 14)
> INFO DAGScheduler: waiting: Set(ResultStage 15)
> INFO DAGScheduler: failed: Set()
> If there is no failed stage, there is no need to output "INFO DAGScheduler: failed: Set()" in the log.
[jira] [Resolved] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.
[ https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20999.
---
Resolution: Not A Problem
[jira] [Commented] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.
[ https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038793#comment-16038793 ] Sean Owen commented on SPARK-20999:
---
As I've already commented, there doesn't seem to be value in hiding this info, and this isn't something you make a JIRA for.
[jira] [Assigned] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.
[ https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20999:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.
[ https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038789#comment-16038789 ] Apache Spark commented on SPARK-20999:
---
User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/18218
[jira] [Assigned] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.
[ https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20999:
Assignee: Apache Spark
[jira] [Created] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.
caoxuewen created SPARK-20999:
---
Summary: No failure Stages, no log 'DAGScheduler: failed: Set()' output.
Key: SPARK-20999
URL: https://issues.apache.org/jira/browse/SPARK-20999
Project: Spark
Issue Type: Improvement
Components: Scheduler
Affects Versions: 2.2.1
Reporter: caoxuewen
Priority: Trivial
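The change proposed in SPARK-20999 amounts to emitting the "failed:" line only when the set is non-empty. A minimal Python sketch of that idea (the real code is in Spark's Scala DAGScheduler; the function name here is invented for illustration):

```python
def stage_status_lines(running, waiting, failed):
    """Render DAGScheduler-style status lines, skipping 'failed' when empty.

    Hypothetical helper: a plain-Python sketch of the proposed behavior,
    not Spark code."""
    lines = [
        "running: Set(%s)" % ", ".join(running),
        "waiting: Set(%s)" % ", ".join(waiting),
    ]
    if failed:  # the proposed change: omit the line when nothing has failed
        lines.append("failed: Set(%s)" % ", ".join(failed))
    return lines

print(stage_status_lines(["ShuffleMapStage 14"], ["ResultStage 15"], []))
```

With an empty failed set this yields only the running/waiting lines; the reviewers' position (Not A Problem) was that the always-present line is harmless and even informative.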
[jira] [Updated] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation
[ https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-13669:
---
Description:
Currently we are running into an issue with Yarn work preserving enabled + external shuffle service.

In the work-preserving-enabled scenario, the failure of the NM does not lead to the exit of executors, so executors can still accept and run tasks. The problem is that when the NM fails, the external shuffle service is actually inaccessible, so reduce tasks will always complain about "Fetch failure", and the failure of the reduce stage will make the parent stage (the map stage) rerun.

The tricky thing is that the Spark scheduler is not aware of the unavailability of the external shuffle service and will reschedule the map tasks on the executor where the NM failed; the reduce stage will again fail with "Fetch failure", and after 4 retries the job fails.

So the main problem is that we should avoid assigning tasks to those bad executors (where the shuffle service is unavailable). Spark's current blacklist mechanism can blacklist executors/nodes by failed tasks, but it doesn't handle this specific fetch-failure scenario. This issue proposes to improve the current application blacklist mechanism to handle fetch failures (especially when the external shuffle service is unavailable) by blacklisting the executors/nodes where shuffle fetch is unavailable.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
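The proposal above can be sketched as a small bookkeeping structure: record fetch failures per host and stop offering that host to the scheduler once a threshold is reached. All names below are invented for illustration; Spark's actual blacklist/exclude-on-failure logic lives in the Scala scheduler and is more involved.

```python
# Hypothetical sketch of host blacklisting on shuffle-fetch failure.
class FetchFailureBlacklist:
    def __init__(self, max_fetch_failures=1):
        # A single fetch failure is enough here, since an unreachable
        # external shuffle service fails every fetch from that host.
        self.max_fetch_failures = max_fetch_failures
        self.failures = {}          # host -> count of observed fetch failures
        self.blacklisted = set()

    def record_fetch_failure(self, host):
        self.failures[host] = self.failures.get(host, 0) + 1
        if self.failures[host] >= self.max_fetch_failures:
            self.blacklisted.add(host)

    def is_allowed(self, host):
        # The scheduler consults this before assigning a (re-run) map task.
        return host not in self.blacklisted

bl = FetchFailureBlacklist()
bl.record_fetch_failure("nm-host-1")   # reduce task's "Fetch failure" against this NM
print(bl.is_allowed("nm-host-1"))      # map tasks are no longer scheduled here
print(bl.is_allowed("nm-host-2"))      # healthy hosts remain schedulable
```

This avoids the retry loop in the description: without the blacklist, the rerun map tasks land back on the dead NM's executors and the job fails after 4 stage retries.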
[jira] [Updated] (SPARK-20998) BroadcastHashJoin producing wrong results
[ https://issues.apache.org/jira/browse/SPARK-20998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit updated SPARK-20998:
---
Description:
I have a Hive table, _eagle_edw_batch.DistributionAttributes_, with
*Schema*:
root
 |-- distributionstatus: string (nullable = true)
 |-- enabledforselectionflag: boolean (nullable = true)
 |-- sourcedistributionid: integer (nullable = true)
 |-- rowstartdate: date (nullable = true)
 |-- rowenddate: date (nullable = true)
 |-- rowiscurrent: string (nullable = true)
 |-- dwcreatedate: timestamp (nullable = true)
 |-- dwlastupdatedate: timestamp (nullable = true)
 |-- appid: integer (nullable = true)
 |-- siteid: integer (nullable = true)
 |-- brandid: integer (nullable = true)

*DataFrame*
val df = spark.sql("SELECT s.sourcedistributionid as sid, t.sourcedistributionid as tid, s.appid as sapp, t.appid as tapp, s.brandid as sbrand, t.brandid as tbrand FROM eagle_edw_batch.DistributionAttributes t INNER JOIN eagle_edw_batch.DistributionAttributes s ON t.sourcedistributionid=s.sourcedistributionid AND t.appid=s.appid AND t.brandid=s.brandid")

*Without BroadcastJoin* (spark-shell --conf "spark.sql.autoBroadcastJoinThreshold=-1"):
df.explain
== Physical Plan ==
*Project [sourcedistributionid#71 AS sid#0, sourcedistributionid#60 AS tid#1, appid#77 AS sapp#2, appid#66 AS tapp#3, brandid#79 AS sbrand#4, brandid#68 AS tbrand#5]
+- *SortMergeJoin [sourcedistributionid#60, appid#66, brandid#68], [sourcedistributionid#71, appid#77, brandid#79], Inner
   :- *Sort [sourcedistributionid#60 ASC, appid#66 ASC, brandid#68 ASC], false, 0
   :  +- Exchange hashpartitioning(sourcedistributionid#60, appid#66, brandid#68, 200)
   :     +- *Filter ((isnotnull(sourcedistributionid#60) && isnotnull(brandid#68)) && isnotnull(appid#66))
   :        +- HiveTableScan [sourcedistributionid#60, appid#66, brandid#68], MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- *Sort [sourcedistributionid#71 ASC, appid#77 ASC, brandid#79 ASC], false, 0
      +- Exchange hashpartitioning(sourcedistributionid#71, appid#77, brandid#79, 200)
         +- *Filter ((isnotnull(sourcedistributionid#71) && isnotnull(appid#77)) && isnotnull(brandid#79))
            +- HiveTableScan [sourcedistributionid#71, appid#77, brandid#79], MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
+---+---+----+----+------+------+
|sid|tid|sapp|tapp|sbrand|tbrand|
+---+---+----+----+------+------+
| 22| 22|  61|  61|   614|   614|
| 29| 29|  65|  65|     0|     0|
| 30| 30|  12|  12|   121|   121|
| 10| 10|  73|  73|   731|   731|
| 24| 24|  61|  61|   611|   611|
| 35| 35|  65|  65|     0|     0|
+---+---+----+----+------+------+

*With BroadcastJoin* (spark-shell)
df.explain
== Physical Plan ==
*Project [sourcedistributionid#136 AS sid#65, sourcedistributionid#125 AS tid#66, appid#142 AS sapp#67, appid#131 AS tapp#68, brandid#144 AS sbrand#69, brandid#133 AS tbrand#70]
+- *BroadcastHashJoin [sourcedistributionid#125, appid#131, brandid#133], [sourcedistributionid#136, appid#142, brandid#144], Inner, BuildRight
   :- *Filter ((isnotnull(brandid#133) && isnotnull(appid#131)) && isnotnull(sourcedistributionid#125))
   :  +- HiveTableScan [sourcedistributionid#125, appid#131, brandid#133], MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- BroadcastExchange HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, false] as bigint), 32) | (cast(input[1, int, false] as bigint) & 4294967295)), 32) | (cast(input[2, int, false] as bigint) & 4294967295
      +- *Filter ((isnotnull(brandid#144) && isnotnull(sourcedistributionid#136)) && isnotnull(appid#142))
         +- HiveTableScan [sourcedistributionid#136, appid#142, brandid#144], MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
+---+---+----+----+------+------+
|sid|tid|sapp|tapp|sbrand|tbrand|
+---+---+----+----+------+------+
| 15| 22|  61|  61|   614|   614|
| 13| 22|  61|  61|   614|   614|
| 10| 22|  61|  61|   614|   614|
|  7| 22|  61|  61|   614|   614|
|  9| 22|  61|  61|   614|   614|
| 16| 22|  61|  61|   614|   614|
+---+---+----+----+------+------+
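A possible reading of the plan above: the HashedRelationBroadcastMode expression packs the three 32-bit join keys into a single 64-bit long via nested shiftleft(..., 32) calls, and a long only has room for two of them, so the first key (sourcedistributionid) is shifted out entirely. That would explain the wrong output, where rows agree on appid and brandid but differ on sourcedistributionid. A plain-Python reconstruction of that expression (64-bit overflow simulated with a mask, since Python ints are unbounded):

```python
MASK32 = 0xFFFFFFFF          # 4294967295, as in the plan
MASK64 = 0xFFFFFFFFFFFFFFFF  # simulate 64-bit long arithmetic

def packed_key(a, b, c):
    # Mirrors shiftleft((shiftleft(a, 32) | (b & 4294967295)), 32)
    #           | (c & 4294967295) evaluated on a 64-bit long:
    # the second 32-bit shift pushes `a` past bit 63, losing it.
    inner = ((a << 32) | (b & MASK32)) & MASK64
    return ((inner << 32) | (c & MASK32)) & MASK64

# Two rows from the ticket: sid 15 vs 22, same appid (61) and brandid (614).
print(packed_key(15, 61, 614) == packed_key(22, 61, 614))  # True: they collide
```

Under this reading, any rows sharing (appid, brandid) hash to the same broadcast key regardless of sourcedistributionid, matching the bogus sid/tid pairs in the BroadcastHashJoin output while the SortMergeJoin (which compares full rows) stays correct.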
[jira] [Updated] (SPARK-20998) BroadcastHashJoin producing wrong results
[ https://issues.apache.org/jira/browse/SPARK-20998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit updated SPARK-20998: -- Description: I have a hive table : _eagle_edw_batch.DistributionAttributes_, with *Schema*: root |-- distributionstatus: string (nullable = true) |-- enabledforselectionflag: boolean (nullable = true) |-- sourcedistributionid: integer (nullable = true) |-- rowstartdate: date (nullable = true) |-- rowenddate: date (nullable = true) |-- rowiscurrent: string (nullable = true) |-- dwcreatedate: timestamp (nullable = true) |-- dwlastupdatedate: timestamp (nullable = true) |-- appid: integer (nullable = true) |-- siteid: integer (nullable = true) |-- brandid: integer (nullable = true) *DataFrame* val df = spark.sql("SELECT s.sourcedistributionid as sid, t.sourcedistributionid as tid, s.appid as sapp, t.appid as tapp, s.brandid as sbrand, t.brandid as tbrand FROM eagle_edw_batch.DistributionAttributes t INNER JOIN eagle_edw_batch.DistributionAttributes s ON t.sourcedistributionid=s.sourcedistributionid AND t.appid=s.appid AND t.brandid=s.brandid"). 
*Without BroadCastJoin* ( spark-shell --conf "spark.sql.autoBroadcastJoinThreshold=-1") : df.explain == Physical Plan == *Project [sourcedistributionid#71 AS sid#0, sourcedistributionid#60 AS tid#1, appid#77 AS sapp#2, appid#66 AS tapp#3, brandid#79 AS sbrand#4, brandid#68 AS tbrand#5] +- *SortMergeJoin [sourcedistributionid#60, appid#66, brandid#68], [sourcedistributionid#71, appid#77, brandid#79], Inner :- *Sort [sourcedistributionid#60 ASC, appid#66 ASC, brandid#68 ASC], false, 0 : +- Exchange hashpartitioning(sourcedistributionid#60, appid#66, brandid#68, 200) : +- *Filter ((isnotnull(sourcedistributionid#60) && isnotnull(brandid#68)) && isnotnull(appid#66)) :+- HiveTableScan [sourcedistributionid#60, appid#66, brandid#68], MetastoreRelation eagle_edw_batch, distributionattributes, t +- *Sort [sourcedistributionid#71 ASC, appid#77 ASC, brandid#79 ASC], false, 0 +- Exchange hashpartitioning(sourcedistributionid#71, appid#77, brandid#79, 200) +- *Filter ((isnotnull(sourcedistributionid#71) && isnotnull(appid#77)) && isnotnull(brandid#79)) +- HiveTableScan [sourcedistributionid#71, appid#77, brandid#79], MetastoreRelation eagle_edw_batch, distributionattributes, s df.show +---+---+++--+--+ |sid|tid|sapp|tapp|sbrand|tbrand| +---+---+++--+--+ | 22| 22| 61| 61| 614| 614| | 29| 29| 65| 65| 0| 0| | 30| 30| 12| 12| 121| 121| | 10| 10| 73| 73| 731| 731| | 24| 24| 61| 61| 611| 611| | 35| 35| 65| 65| 0| 0| *With BroadCastJoin* ( spark-shell ) df.explain == Physical Plan == *Project [sourcedistributionid#136 AS sid#65, sourcedistributionid#125 AS tid#66, appid#142 AS sapp#67, appid#131 AS tapp#68, brandid#144 AS sbrand#69, brandid#133 AS tbrand#70] +- *BroadcastHashJoin [sourcedistributionid#125, appid#131, brandid#133], [sourcedistributionid#136, appid#142, brandid#144], Inner, BuildRight :- *Filter ((isnotnull(brandid#133) && isnotnull(appid#131)) && isnotnull(sourcedistributionid#125)) : +- HiveTableScan [sourcedistributionid#125, appid#131, brandid#133], 
MetastoreRelation eagle_edw_batch, distributionattributes, t +- BroadcastExchange HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, false] as bigint), 32) | (cast(input[1, int, false] as bigint) & 4294967295)), 32) | (cast(input[2, int, false] as bigint) & 4294967295 +- *Filter ((isnotnull(brandid#144) && isnotnull(sourcedistributionid#136)) && isnotnull(appid#142)) +- HiveTableScan [sourcedistributionid#136, appid#142, brandid#144], MetastoreRelation eagle_edw_batch, distributionattributes, s df.show +---+---+++--+--+ |sid|tid|sapp|tapp|sbrand|tbrand| +---+---+++--+--+ | 15| 22| 61| 61| 614| 614| | 13| 22| 61| 61| 614| 614| | 10| 22| 61| 61| 614| 614| | 7| 22| 61| 61| 614| 614| | 9| 22| 61| 61| 614| 614| | 16| 22| 61| 61| 614| 614| was: I have a hive table : _eagle_edw_batch.DistributionAttributes_, with schema: root |-- distributionstatus: string (nullable = true) |-- enabledforselectionflag: boolean (nullable = true) |-- sourcedistributionid: integer (nullable = true) |-- rowstartdate: date (nullable = true) |-- rowenddate: date (nullable = true) |-- rowiscurrent: string (nullable = true) |-- dwcreatedate: timestamp (nullable = true) |-- dwlastupdatedate: timestamp (nullable = true) |-- appid: integer (nullable = true) |-- siteid: integer (nullable = true) |-- brandid: integer (nullable = true) DataFrame: val df = spark.sql("SELECT s.sourcedistributionid as sid, t.sourcedistributionid as tid, s.appid as sapp, t.appid as
[jira] [Updated] (SPARK-20998) BroadcastHashJoin producing wrong results
[ https://issues.apache.org/jira/browse/SPARK-20998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit updated SPARK-20998:
--
Description:
I have a Hive table, _eagle_edw_batch.DistributionAttributes_, with schema:
root
 |-- distributionstatus: string (nullable = true)
 |-- enabledforselectionflag: boolean (nullable = true)
 |-- sourcedistributionid: integer (nullable = true)
 |-- rowstartdate: date (nullable = true)
 |-- rowenddate: date (nullable = true)
 |-- rowiscurrent: string (nullable = true)
 |-- dwcreatedate: timestamp (nullable = true)
 |-- dwlastupdatedate: timestamp (nullable = true)
 |-- appid: integer (nullable = true)
 |-- siteid: integer (nullable = true)
 |-- brandid: integer (nullable = true)

DataFrame:
val df = spark.sql("SELECT s.sourcedistributionid as sid, t.sourcedistributionid as tid, s.appid as sapp, t.appid as tapp, s.brandid as sbrand, t.brandid as tbrand FROM eagle_edw_batch.DistributionAttributes t INNER JOIN eagle_edw_batch.DistributionAttributes s ON t.sourcedistributionid=s.sourcedistributionid AND t.appid=s.appid AND t.brandid=s.brandid")
*Without BroadCastJoin* (spark-shell --conf "spark.sql.autoBroadcastJoinThreshold=-1"):

df.explain
== Physical Plan ==
*Project [sourcedistributionid#71 AS sid#0, sourcedistributionid#60 AS tid#1, appid#77 AS sapp#2, appid#66 AS tapp#3, brandid#79 AS sbrand#4, brandid#68 AS tbrand#5]
+- *SortMergeJoin [sourcedistributionid#60, appid#66, brandid#68], [sourcedistributionid#71, appid#77, brandid#79], Inner
   :- *Sort [sourcedistributionid#60 ASC, appid#66 ASC, brandid#68 ASC], false, 0
   :  +- Exchange hashpartitioning(sourcedistributionid#60, appid#66, brandid#68, 200)
   :     +- *Filter ((isnotnull(sourcedistributionid#60) && isnotnull(brandid#68)) && isnotnull(appid#66))
   :        +- HiveTableScan [sourcedistributionid#60, appid#66, brandid#68], MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- *Sort [sourcedistributionid#71 ASC, appid#77 ASC, brandid#79 ASC], false, 0
      +- Exchange hashpartitioning(sourcedistributionid#71, appid#77, brandid#79, 200)
         +- *Filter ((isnotnull(sourcedistributionid#71) && isnotnull(appid#77)) && isnotnull(brandid#79))
            +- HiveTableScan [sourcedistributionid#71, appid#77, brandid#79], MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
+---+---+----+----+------+------+
|sid|tid|sapp|tapp|sbrand|tbrand|
+---+---+----+----+------+------+
| 22| 22|  61|  61|   614|   614|
| 29| 29|  65|  65|     0|     0|
| 30| 30|  12|  12|   121|   121|
| 10| 10|  73|  73|   731|   731|
| 24| 24|  61|  61|   611|   611|
| 35| 35|  65|  65|     0|     0|
+---+---+----+----+------+------+

*With BroadCastJoin* (spark-shell):

df.explain
== Physical Plan ==
*Project [sourcedistributionid#136 AS sid#65, sourcedistributionid#125 AS tid#66, appid#142 AS sapp#67, appid#131 AS tapp#68, brandid#144 AS sbrand#69, brandid#133 AS tbrand#70]
+- *BroadcastHashJoin [sourcedistributionid#125, appid#131, brandid#133], [sourcedistributionid#136, appid#142, brandid#144], Inner, BuildRight
   :- *Filter ((isnotnull(brandid#133) && isnotnull(appid#131)) && isnotnull(sourcedistributionid#125))
   :  +- HiveTableScan [sourcedistributionid#125, appid#131, brandid#133], MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- BroadcastExchange HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, false] as bigint), 32) | (cast(input[1, int, false] as bigint) & 4294967295)), 32) | (cast(input[2, int, false] as bigint) & 4294967295
      +- *Filter ((isnotnull(brandid#144) && isnotnull(sourcedistributionid#136)) && isnotnull(appid#142))
         +- HiveTableScan [sourcedistributionid#136, appid#142, brandid#144], MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
+---+---+----+----+------+------+
|sid|tid|sapp|tapp|sbrand|tbrand|
+---+---+----+----+------+------+
| 15| 22|  61|  61|   614|   614|
| 13| 22|  61|  61|   614|   614|
| 10| 22|  61|  61|   614|   614|
|  7| 22|  61|  61|   614|   614|
|  9| 22|  61|  61|   614|   614|
| 16| 22|  61|  61|   614|   614|
+---+---+----+----+------+------+

was:
I have a hive tables : eagle_edw_batch.DistributionAttributes, with schema: root |-- distributionstatus: string (nullable = true) |-- enabledforselectionflag: boolean (nullable = true) |-- sourcedistributionid: integer (nullable = true) |-- rowstartdate: date (nullable = true) |-- rowenddate: date (nullable = true) |-- rowiscurrent: string (nullable = true) |-- dwcreatedate: timestamp (nullable = true) |-- dwlastupdatedate: timestamp (nullable = true) |-- appid: integer (nullable = true) |-- siteid: integer (nullable = true) |-- brandid: integer (nullable = true) DataFrame: val df = spark.sql("SELECT s.sourcedistributionid as sid, t.sourcedistributionid as tid, s.appid as sapp, t.appid as tapp, s.b
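The symptom above (rows agree on appid and brandid but not on sourcedistributionid) is consistent with the key expression shown in the BroadcastExchange: it packs three int columns into a single value by shifting twice by 32 bits, and if that value is a 64-bit JVM long, the first column's bits are shifted out entirely. A pure-Python sketch of that arithmetic (not Spark code; the function name is invented for illustration):

```python
MASK = 0xFFFFFFFF  # 4294967295, the mask that appears in HashedRelationBroadcastMode

def packed_key(sourcedistributionid: int, appid: int, brandid: int) -> int:
    # Mirrors shiftleft((shiftleft(a, 32) | (b & 4294967295)), 32) | (c & 4294967295),
    # then truncates to 64 bits the way a JVM long would wrap.
    key = ((sourcedistributionid << 32) | (appid & MASK)) << 32 | (brandid & MASK)
    return key & 0xFFFFFFFFFFFFFFFF

# Two rows that differ only in sourcedistributionid collapse to the same key,
# which would make the broadcast hash join match them as if they were equal:
assert packed_key(15, 61, 614) == packed_key(22, 61, 614)
```

Under this reading, any pair of rows sharing (appid, brandid) would join regardless of sourcedistributionid, which is exactly what the broadcast-join output shows.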
[jira] [Comment Edited] (SPARK-18791) Stream-Stream Joins
[ https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038690#comment-16038690 ] xianyao jiang edited comment on SPARK-18791 at 6/6/17 12:15 PM:
We have a draft design for the stream-stream inner join and have completed a demo based on it. We hope to get more advice and help from the open source community and make the stream-join implementation more popular and common. If you have any questions or advice, please contact us and let us know.
Hi [~marmbrus], can you give some advice about it? Thanks.
Document link: https://docs.google.com/document/d/1i528WI7KFica0Dg1LTQfdQMsW8ai3WDvHmUvkH1BKg4/edit?usp=sharing

was (Author: xianyao.jiang):
We have the draft design for the stream-stream inner join , and complete a demo based on it. We hope we can get more advice or helps from open source community and make the stream join implementation more popular and common. If you have any question or advice, please contact us and let us know. Thanks
Document link: https://docs.google.com/document/d/1i528WI7KFica0Dg1LTQfdQMsW8ai3WDvHmUvkH1BKg4/edit?usp=sharing

> Stream-Stream Joins
> ---
>
> Key: SPARK-18791
> URL: https://issues.apache.org/jira/browse/SPARK-18791
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Michael Armbrust
>
> Just a placeholder for now. Please comment with your requirements.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20998) BroadcastHashJoin producing wrong results
[ https://issues.apache.org/jira/browse/SPARK-20998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit updated SPARK-20998:
--
Summary: BroadcastHashJoin producing wrong results (was: BroadcastHashJoin producing different results)
[jira] [Updated] (SPARK-20998) BroadcastHashJoin producing wrong results
[ https://issues.apache.org/jira/browse/SPARK-20998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit updated SPARK-20998:
--
Affects Version/s: (was: 2.0.1) 2.0.0
[jira] [Created] (SPARK-20998) BroadcastHashJoin producing different results
Mohit created SPARK-20998:
-
Summary: BroadcastHashJoin producing different results
Key: SPARK-20998
URL: https://issues.apache.org/jira/browse/SPARK-20998
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.1
Reporter: Mohit
[jira] [Updated] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows
[ https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Perrine Letellier updated SPARK-20969:
--
Description:
The column on which `orderBy` is performed is considered as another column on which to partition.
{code}
scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "description")
scala> import org.apache.spark.sql.expressions.Window
scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
scala> df.withColumn("last", last(col("description")).over(window)).show
+---+---+-----------+-----+
| id| ts|description| last|
+---+---+-----------+-----+
| i1|  1|      desc1|desc2|
| i1|  1|      desc2|desc2|
| i1|  2|      desc3|desc3|
+---+---+-----------+-----+
{code}
However, what is expected is the same answer as asking for `first()` over a window with descending order.
{code}
scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
scala> df.withColumn("hackedLast", first(col("description")).over(window)).show
+---+---+-----------+----------+
| id| ts|description|hackedLast|
+---+---+-----------+----------+
| i1|  2|      desc3|     desc3|
| i1|  1|      desc1|     desc3|
| i1|  1|      desc2|     desc3|
+---+---+-----------+----------+
{code}

was:
The column on which `orderBy` is performed is considered as another column on which to partition.
{code}
scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3"))).toDF("id", "ts", "description")
scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
scala> df.withColumn("last", last(col("description")).over(window)).show
+---+---+-----------+-----+
| id| ts|description| last|
+---+---+-----------+-----+
| i1|  1|      desc1|desc2|
| i1|  1|      desc2|desc2|
| i1|  2|      desc3|desc3|
+---+---+-----------+-----+
{code}
However what is expected is the same answer as if asking for `first()` with a window with descending order.
{code}
scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
scala> df.withColumn("last", first(col("description")).over(window)).show
+---+---+-----------+-----+
| id| ts|description| last|
+---+---+-----------+-----+
| i1|  2|      desc3|desc3|
| i1|  1|      desc1|desc3|
| i1|  1|      desc2|desc3|
+---+---+-----------+-----+
{code}
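The behavior reported for SPARK-20969 matches the default frame Spark gives an ordered window: with an `orderBy`, the frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so `last()` only sees rows up to and including the current row's order-by peers, never the whole partition. A pure-Python simulation of the two frames (not PySpark code; the function names are invented for illustration):

```python
# The reporter's data: (id, ts, description) rows in one partition.
rows = [("i1", 1, "desc1"), ("i1", 1, "desc2"), ("i1", 2, "desc3")]

def last_over_default_frame(rows):
    # Default frame of an ordered window: for each row, all rows whose ts is
    # <= the current ts (RANGE frames include the current row's peers), so
    # last() returns the last value among those rows.
    out = []
    for _, ts, _ in rows:
        frame = [d for (_, t, d) in rows if t <= ts]
        out.append(frame[-1])
    return out

def last_over_full_partition(rows):
    # An explicit unboundedPreceding..unboundedFollowing frame: last() sees
    # the whole partition, which is what the reporter expected.
    return [rows[-1][2]] * len(rows)

print(last_over_default_frame(rows))   # ['desc2', 'desc2', 'desc3'] — the reported output
print(last_over_full_partition(rows))  # ['desc3', 'desc3', 'desc3'] — the expected output
```

In Spark itself the workaround, if this reading is right, is to widen the frame explicitly, e.g. `Window.partitionBy("id").orderBy(col("ts")).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`, rather than reversing the sort and using `first()`.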
[jira] [Comment Edited] (SPARK-18791) Stream-Stream Joins
[ https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038690#comment-16038690 ] xianyao jiang edited comment on SPARK-18791 at 6/6/17 12:09 PM:
We have the draft design for the stream-stream inner join , and complete a demo based on it. We hope we can get more advice or helps from open source community and make the stream join implementation more popular and common. If you have any question or advice, please contact us and let us know. Thanks
Document link: https://docs.google.com/document/d/1i528WI7KFica0Dg1LTQfdQMsW8ai3WDvHmUvkH1BKg4/edit?usp=sharing

was (Author: xianyao.jiang):
We have the draft design for the stream-stream inner join , and complete a demo based on it, it seems it can work. We hope we can get more advice or helps form open source social and make the stream join implementation more popular and common. If you have any question or advice, please contact us and let us know. Thanks
Document link: https://docs.google.com/document/d/1i528WI7KFica0Dg1LTQfdQMsW8ai3WDvHmUvkH1BKg4/edit?usp=sharing

> Stream-Stream Joins
> ---
>
> Key: SPARK-18791
> URL: https://issues.apache.org/jira/browse/SPARK-18791
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Michael Armbrust
>
> Just a placeholder for now. Please comment with your requirements.
[jira] [Commented] (SPARK-18791) Stream-Stream Joins
[ https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038690#comment-16038690 ] xianyao jiang commented on SPARK-18791:
---
We have a draft design for the stream-stream inner join and have completed a demo based on it; it seems to work. We hope to get more advice and help from the open source community and make the stream-join implementation more popular and common. If you have any questions or advice, please contact us and let us know. Thanks.
Document link: https://docs.google.com/document/d/1i528WI7KFica0Dg1LTQfdQMsW8ai3WDvHmUvkH1BKg4/edit?usp=sharing

> Stream-Stream Joins
> ---
>
> Key: SPARK-18791
> URL: https://issues.apache.org/jira/browse/SPARK-18791
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Michael Armbrust
>
> Just a placeholder for now. Please comment with your requirements.
[jira] [Commented] (SPARK-20854) extend hint syntax to support any expression, not just identifiers or strings
[ https://issues.apache.org/jira/browse/SPARK-20854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038671#comment-16038671 ] Apache Spark commented on SPARK-20854:
--
User 'bogdanrdc' has created a pull request for this issue: https://github.com/apache/spark/pull/18217

> extend hint syntax to support any expression, not just identifiers or strings
> -
>
> Key: SPARK-20854
> URL: https://issues.apache.org/jira/browse/SPARK-20854
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Bogdan Raducanu
> Assignee: Bogdan Raducanu
> Priority: Blocker
> Fix For: 2.2.0
>
> Currently the SQL hint syntax accepts only identifiers as parameters, while the Dataset hint syntax accepts only strings.
> Both should accept any expression as a parameter, for example numbers. This is useful for implementing other hints in the future.
> Examples:
> {code}
> df.hint("hint1", Seq(1, 2, 3))
> df.hint("hint2", "A", 1)
> sql("select /*+ hint1((1,2,3)) */")
> sql("select /*+ hint2('A', 1) */")
> {code}
[jira] [Commented] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
[ https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038648#comment-16038648 ] lyc commented on SPARK-19878:
-
See [contributing|https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark] for how to contribute to Spark; it seems attaching a patch is not the recommended way to contribute.

> Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
> --
>
> Key: SPARK-19878
> URL: https://issues.apache.org/jira/browse/SPARK-19878
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.5.0, 1.6.0, 2.0.0
> Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0
> Reporter: kavn qin
> Labels: patch
> Attachments: SPARK-19878.patch
>
> When the case class InsertIntoHiveTable initializes a serde, it explicitly passes null for the Configuration in Spark 1.5.0:
> [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]
> Likewise in Spark 2.0.0, when the HiveWriterContainer initializes a serde it also just passes null for the Configuration:
> [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]
> When we implement a Hive serde, we want to use the Hive configuration to read some static and dynamic settings, but we cannot!
> So this patch adds the configuration when initializing the Hive serde.
[jira] [Updated] (SPARK-20997) spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only"
[ https://issues.apache.org/jira/browse/SPARK-20997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20997:
--
Priority: Trivial (was: Minor)
Component/s: Documentation

I suppose this should be removed from "YARN-only", and the previous section should be called "Spark standalone or YARN with cluster deploy mode only". The help message is getting hairy, but this seems like an improvement.

> spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only"
> -
>
> Key: SPARK-20997
> URL: https://issues.apache.org/jira/browse/SPARK-20997
> Project: Spark
> Issue Type: Bug
> Components: Documentation, Spark Submit
> Affects Versions: 2.3.0
> Reporter: Jacek Laskowski
> Priority: Trivial
>
> Just noticed that {{spark-submit}} describes {{--driver-cores}} under both:
> * Spark standalone with cluster deploy mode only
> * YARN-only
> I can understand the "only" in "Spark standalone with cluster deploy mode only" as referring to the cluster deploy mode (not the default client mode), but "YARN-only" baffles me, which I think deserves a fix.
[jira] [Created] (SPARK-20997) spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only"
Jacek Laskowski created SPARK-20997:
---
Summary: spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only"
Key: SPARK-20997
URL: https://issues.apache.org/jira/browse/SPARK-20997
Project: Spark
Issue Type: Bug
Components: Spark Submit
Affects Versions: 2.3.0
Reporter: Jacek Laskowski
Priority: Minor
[jira] [Commented] (SPARK-20920) ForkJoinPool pools are leaked when writing hive tables with many partitions
[ https://issues.apache.org/jira/browse/SPARK-20920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038583#comment-16038583 ] Apache Spark commented on SPARK-20920:
--
User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/18216

> ForkJoinPool pools are leaked when writing hive tables with many partitions
> ---
>
> Key: SPARK-20920
> URL: https://issues.apache.org/jira/browse/SPARK-20920
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: Rares Mirica
>
> This bug is loosely related to SPARK-17396.
> In this case it happens when writing to a Hive table with many, many partitions (my table is partitioned by hour and stores data it gets from Kafka in a Spark Streaming application):
> df.repartition()
>   .write
>   .format("orc")
>   .option("path", s"$tablesStoragePath/$tableName")
>   .mode(SaveMode.Append)
>   .partitionBy("dt", "hh")
>   .saveAsTable(tableName)
> As this table grows beyond a certain size, ForkJoinPool pools start leaking. Upon examination (with a debugger) I found that the caller is AlterTableRecoverPartitionsCommand, and the problem happens when `evalTaskSupport` is used (line 555). I have tried setting a very large threshold via `spark.rdd.parallelListingThreshold` and the problem went away.
> My assumption is that the problem happens in this case and not in the one in SPARK-17396 because AlterTableRecoverPartitionsCommand is a case class while UnionRDD is an object, so for UnionRDD multiple instances are not possible and therefore there is no leak.
> Regards,
> Rares
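The leak pattern described in the report (each command instance creating its own pool that is never shut down, so pools accumulate once many instances run) can be sketched in plain Python with an instrumented thread pool; this is an illustration of the pattern, not the Spark/Scala code, and the function names are invented:

```python
from concurrent.futures import ThreadPoolExecutor

created, shut_down = [], []

class TrackedPool(ThreadPoolExecutor):
    # A small instrumented pool so the leak is observable as a counter.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        created.append(self)

    def shutdown(self, *args, **kwargs):
        shut_down.append(self)
        super().shutdown(*args, **kwargs)

def recover_partitions_leaky(paths):
    # Anti-pattern from the report: a fresh pool per call, never shut down.
    pool = TrackedPool(max_workers=4)
    return list(pool.map(len, paths))  # stand-in for listing partition dirs

def recover_partitions_fixed(paths):
    # Fix: bound the pool's lifetime to the call (the context manager
    # calls shutdown on exit), or share one long-lived pool.
    with TrackedPool(max_workers=4) as pool:
        return list(pool.map(len, paths))

for _ in range(3):
    recover_partitions_leaky(["/dt=2017-06-06/hh=00", "/dt=2017-06-06/hh=01"])
leaked = len(created) - len(shut_down)  # 3 pools created, none shut down

for _ in range(3):
    recover_partitions_fixed(["/dt=2017-06-06/hh=00", "/dt=2017-06-06/hh=01"])
still_leaked = len(created) - len(shut_down)  # unchanged: fixed calls clean up
```

The reporter's workaround of raising `spark.rdd.parallelListingThreshold` sidesteps the pattern entirely by keeping the listing sequential, so no per-command pool is created at all.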
[jira] [Assigned] (SPARK-20920) ForkJoinPool pools are leaked when writing hive tables with many partitions
[ https://issues.apache.org/jira/browse/SPARK-20920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20920: Assignee: (was: Apache Spark) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20920) ForkJoinPool pools are leaked when writing hive tables with many partitions
[ https://issues.apache.org/jira/browse/SPARK-20920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20920: Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20935) A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating StreamingContext.
[ https://issues.apache.org/jira/browse/SPARK-20935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038549#comment-16038549 ] Sean Owen commented on SPARK-20935: --- Sounds OK to me [~hyukjin.kwon] if you're up for it > A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating > StreamingContext. > --- > > Key: SPARK-20935 > URL: https://issues.apache.org/jira/browse/SPARK-20935 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.3, 2.1.1 >Reporter: Terence Yim > > With the batched write ahead log on by default in the driver (SPARK-11731), if there > is no receiver-based {{InputDStream}}, the "BatchedWriteAheadLog Writer" > thread created by {{BatchedWriteAheadLog}} never gets shut down. > The root cause is at > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L168 > : {{ReceivedBlockTracker.stop()}} (which in turn calls > {{BatchedWriteAheadLog.close()}}) is never called if there is no receiver-based input. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
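The kind of fix the report implies can be sketched as follows. This assumes the described behavior (the block tracker is only stopped on the receiver path); the exact shape of {{ReceiverTracker.stop}} is an assumption, not the actual implementation:

```scala
// Hypothetical sketch of ReceiverTracker.stop(): even when there are no
// receiver-based InputDStreams, stop the ReceivedBlockTracker so that
// BatchedWriteAheadLog.close() terminates the "BatchedWriteAheadLog Writer"
// daemon thread.
def stop(graceful: Boolean): Unit = {
  if (receiverInputStreams.nonEmpty) {
    // ... existing shutdown of receivers and endpoints ...
  }
  // Previously only reached on the receiver path, leaking the writer thread:
  receivedBlockTracker.stop()
}
```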
[jira] [Assigned] (SPARK-20985) Improve KryoSerializerResizableOutputSuite
[ https://issues.apache.org/jira/browse/SPARK-20985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20985: - Assignee: jin xing > Improve KryoSerializerResizableOutputSuite > -- > > Key: SPARK-20985 > URL: https://issues.apache.org/jira/browse/SPARK-20985 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: jin xing >Assignee: jin xing >Priority: Trivial > Fix For: 2.3.0 > > > SparkContext should always be stopped after use, so other tests won't > complain that only one SparkContext can exist. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20985) Improve KryoSerializerResizableOutputSuite
[ https://issues.apache.org/jira/browse/SPARK-20985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20985. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18204 [https://github.com/apache/spark/pull/18204] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20914) Javadoc contains code that is invalid
[ https://issues.apache.org/jira/browse/SPARK-20914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038518#comment-16038518 ] Apache Spark commented on SPARK-20914: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/18215 > Javadoc contains code that is invalid > - > > Key: SPARK-20914 > URL: https://issues.apache.org/jira/browse/SPARK-20914 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Cristian Teodor >Priority: Trivial > > I was looking over the > [dataset|https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/sql/Dataset.html] > Javadoc and noticed that the code at the top does not compile in Java. > {code} > // To create Dataset using SparkSession >Dataset people = spark.read().parquet("..."); >Dataset department = spark.read().parquet("..."); >people.filter("age".gt(30)) > .join(department, people.col("deptId").equalTo(department("id"))) > .groupBy(department.col("name"), "gender") > .agg(avg(people.col("salary")), max(people.col("age"))); > {code} > invalid parts: > * "age".gt(30) > * department("id") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
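For comparison, a version of the quoted snippet that does compile; this sketch assumes the `Column`-based Java API (`people.col(...)` and the static `functions.col`), replacing the Scala-isms `"age".gt(30)` and `department("id")`:

```java
// Corrected sketch: call methods on Column objects (via col(...)), not on
// raw Strings, and use department.col("id") instead of department("id").
import static org.apache.spark.sql.functions.*;

Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");

people.filter(col("age").gt(30))
    .join(department, people.col("deptId").equalTo(department.col("id")))
    .groupBy(department.col("name"), col("gender"))
    .agg(avg(people.col("salary")), max(people.col("age")));
```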
[jira] [Assigned] (SPARK-20914) Javadoc contains code that is invalid
[ https://issues.apache.org/jira/browse/SPARK-20914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20914: Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20914) Javadoc contains code that is invalid
[ https://issues.apache.org/jira/browse/SPARK-20914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20914: Assignee: (was: Apache Spark) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20162) Reading data from MySQL - Cannot up cast from decimal(30,6) to decimal(38,18)
[ https://issues.apache.org/jira/browse/SPARK-20162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038515#comment-16038515 ] Yuming Wang commented on SPARK-20162: - [~mspehar] How to reproduce it? read a table like {{spark_20162}}? {code:sql} CREATE TABLE `spark_20162` ( `spark` decimal(30,6) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1; {code} > Reading data from MySQL - Cannot up cast from decimal(30,6) to decimal(38,18) > - > > Key: SPARK-20162 > URL: https://issues.apache.org/jira/browse/SPARK-20162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Miroslav Spehar > > While reading data from MySQL, type conversion doesn't work for Decimal type > when the decimal in database is of lower precision/scale than the one spark > expects. > Error: > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `DECIMAL_AMOUNT` from decimal(30,6) to decimal(38,18) as it may truncate > The type path of the target object is: > - field (class: "org.apache.spark.sql.types.Decimal", name: "DECIMAL_AMOUNT") > - root class: "com.misp.spark.Structure" > You can either add an explicit cast to the input data or choose a higher > precision type of the field in the target object; > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2119) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2141) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2136) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288) > at > 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:360) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:358) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:248) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:258) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:267) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:236) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2136) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2132) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.apply(Analyzer.scala:2132) > at > org.apache.spark.sql.catalyst.analysis.Analy
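The error message itself points at the workaround: add an explicit cast before converting to the typed object. A minimal sketch, assuming a DataFrame `df` read over JDBC with a `DECIMAL_AMOUNT` column as in the report:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Hypothetical sketch: widen the decimal(30,6) JDBC column to the
// decimal(38,18) that the target case class expects, so ResolveUpCast
// no longer sees a potentially truncating up-cast.
val widened = df.withColumn("DECIMAL_AMOUNT",
  col("DECIMAL_AMOUNT").cast(DecimalType(38, 18)))
```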
[jira] [Updated] (SPARK-20995) 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.
[ https://issues.apache.org/jira/browse/SPARK-20995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20995: -- Priority: Trivial (was: Minor) > 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions. > -- > > Key: SPARK-20995 > URL: https://issues.apache.org/jira/browse/SPARK-20995 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: guoxiaolongzte >Priority: Trivial > > Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which > contains the (client side) configuration files for the Hadoop cluster. > These configs are used to write to HDFS and connect to the YARN > ResourceManager. The > configuration contained in this directory will be distributed to the YARN > cluster so that all > containers used by the application use the same configuration. > Sometimes, HADOOP_CONF_DIR is set to the hdfs configuration file path. So, > YARN_CONF_DIR should be set to the yarn configuration file path. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
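The proposed template addition would look roughly like the lines below; the paths are placeholders, not recommended locations:

```shell
# Directory holding the (client-side) Hadoop configuration files;
# used to write to HDFS and to connect to the YARN ResourceManager.
export HADOOP_CONF_DIR=/etc/hadoop/conf
# If HADOOP_CONF_DIR points at HDFS-only configs, point YARN_CONF_DIR at
# the directory that contains yarn-site.xml.
export YARN_CONF_DIR=/etc/hadoop/conf
```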
[jira] [Resolved] (SPARK-20884) Spark's masters will both be standby due to a bug in Curator
[ https://issues.apache.org/jira/browse/SPARK-20884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20884. --- Resolution: Won't Fix We can't merge this change, because Curator 2.8 won't work with Curator 2.6 environments, which include Hadoop 2.6/2.7 envs, which are still supported. Much later we could update this, basically when Spark 3 moves on to Hadoop 3 (which uses Curator 2.12) > Spark' masters will be both standby due to the bug of curator > -- > > Key: SPARK-20884 > URL: https://issues.apache.org/jira/browse/SPARK-20884 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: liuzhaokun > > I built a cluster with two masters and three workers.When there is a switch > of master's state,two masters are both standby in a long period.There are > some ERROR in master's logfile " ERROR CuratorFrameworkImpl: Background > exception was not retry-able or retry gave up > java.lang.IllegalArgumentException: Path must start with / character > at org.apache.curator.utils.PathUtils.validatePath(PathUtils.java:53) > at org.apache.curator.utils.ZKPaths.getNodeFromPath(ZKPaths.java:56) > at > org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:421) > at > org.apache.curator.framework.recipes.leader.LeaderLatch.access$500(LeaderLatch.java:60) > at > org.apache.curator.framework.recipes.leader.LeaderLatch$6.processResult(LeaderLatch.java:478) > at > org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:686) > at > org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:485) > at > org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:166) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:587) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)" > it can be 
related to https://issues.apache.org/jira/browse/CURATOR-168. > So I think the version of curator in spark should be 2.8. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20851) Drop spark table failed if a column name is a numeric string
[ https://issues.apache.org/jira/browse/SPARK-20851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038445#comment-16038445 ] Ben P commented on SPARK-20851: --- This issue seems to be handled/prevented in v2.0.1, as 2.0.1 throws the exception below on spark.sql('create table...'): pyspark.sql.utils.ParseException: "\nmismatched input '123018231' expecting '>'(line 1, pos 7)\n\n== SQL ==\nstruct<123018231:string,123121:bigint>\n---^^^\n" > Drop spark table failed if a column name is a numeric string > > > Key: SPARK-20851 > URL: https://issues.apache.org/jira/browse/SPARK-20851 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: linux redhat >Reporter: Chen Gong > > I tried to read a json file into a spark dataframe: > {noformat} > df = spark.read.json('path.json') > df.write.parquet('dataframe', compression='snappy') > {noformat} > However, some of the columns' names are numeric strings, such as > "989238883". 
Then I created spark sql table by using this > {noformat} > create table if not exists `a` using org.apache.spark.sql.parquet options > (path 'dataframe'); // It works well > {noformat} > But after created table, any operations, like select, drop table on this > table will raise the same exceptions below > {noformat} > org.apache.spark.SparkException: Cannot recognize hive type string: > array,assignee_id:bigint,attachments:array>,url:string,width:bigint>>,audit_id:bigint,author_id:bigint,body:string,brand_id:string,created_at:string,custom_ticket_fields:struct<49244727:string,51588527:string,51591767:string,51950848:string,51950868:string,51950888:string,51950928:string,52359587:string,55276747:string,56958227:string,57080067:string,57080667:string,57107727:string,57112447:string,57113207:string,57411128:string,57424648:string,57442588:string,62382188:string,74862088:string,74871788:string>,event_type:string,group_id:bigint,html_body:string,id:bigint,is_public:string,locale_id:string,organization_id:string,plain_body:string,previous_value:string,priority:string,public:boolean,rel:string,removed_tags:array,requester_id:bigint,satisfaction_probability:string,satisfaction_score:string,sla_policy:string,status:string,tags:array,ticket_form_id:string,type:string,via:string,via_reference_id:bigint>> > at > org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:785) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10$$anonfun$7.apply(HiveClientImpl.scala:365) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10$$anonfun$7.apply(HiveClientImpl.scala:365) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10.apply(HiveClientImpl.scala:365) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10.apply(HiveClientImpl.scala:361) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:272) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:359) > at > org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:76) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatal
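Until the Hive type-string parsing handles purely numeric field names, one possible workaround is to rename the offending columns before writing the parquet files. A sketch; the `c_` prefix is an arbitrary choice:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical workaround: prefix purely numeric column names so the
// resulting Hive type string parses ("c_989238883" instead of "989238883").
val renamed = df.select(df.columns.map { c =>
  if (c.nonEmpty && c.forall(_.isDigit)) col(s"`$c`").as(s"c_$c")
  else col(c)
}: _*)
renamed.write.parquet("dataframe")
```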
[jira] [Resolved] (SPARK-20943) Correct BypassMergeSortShuffleWriter's comment
[ https://issues.apache.org/jira/browse/SPARK-20943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20943. --- Resolution: Not A Problem If there's no (other) support for a request to change the wording, and no proposed change, I think we should close this. > Correct BypassMergeSortShuffleWriter's comment > -- > > Key: SPARK-20943 > URL: https://issues.apache.org/jira/browse/SPARK-20943 > Project: Spark > Issue Type: Improvement > Components: Documentation, Shuffle >Affects Versions: 2.1.1 >Reporter: CanBin Zheng >Priority: Trivial > Labels: starter > > There are some comments written in BypassMergeSortShuffleWriter.java about > when to select this write path, the three required conditions are described > as follows: > 1. no Ordering is specified, and > 2. no Aggregator is specified, and > 3. the number of partitions is less than > spark.shuffle.sort.bypassMergeThreshold > Obviously, the conditions written are partially wrong and misleading, the > right conditions should be: > 1. map-side combine is false, and > 2. the number of partitions is less than > spark.shuffle.sort.bypassMergeThreshold -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
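The corrected conditions proposed in the description map onto a check of roughly this shape. This is a sketch loosely modeled on `SortShuffleWriter.shouldBypassMergeSort`; the signature and exact comparison operator are assumptions:

```scala
// Sketch of the bypass decision: what matters is that no map-side combine
// is requested and the partition count is small enough -- not, as the old
// comment claimed, whether an Ordering or Aggregator is specified at all.
def shouldBypassMergeSort(bypassMergeThreshold: Int,
                          mapSideCombine: Boolean,
                          numPartitions: Int): Boolean =
  !mapSideCombine && numPartitions <= bypassMergeThreshold
```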
[jira] [Assigned] (SPARK-20996) Better handling AM reattempt based on exit code in yarn mode
[ https://issues.apache.org/jira/browse/SPARK-20996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20996: Assignee: Apache Spark > Better handling AM reattempt based on exit code in yarn mode > > > Key: SPARK-20996 > URL: https://issues.apache.org/jira/browse/SPARK-20996 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Minor > > Yarn provides a max attempt configuration for applications running on it, so an > application has the chance to retry itself when it fails. In the current Spark > code, no matter what kind of failure the AM hit, if the failure count hasn't reached the > max attempt, RM will restart the AM. This is not reasonable when the issue comes from > the AM itself, like a user code failure, OOM, a Spark > issue, or executor failures; in all likelihood the reattempt of the AM will hit the same > issue again. Only when the AM fails due to an external issue like a crash, process > kill, or NM failure should the AM retry. > So here I propose to improve this reattempt mechanism to only retry when it > meets external issues. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038403#comment-16038403 ] Kedarnath Reddy commented on SPARK-20199: - Please look into this feature, as I need it for my implementation of GBT in my organization. > GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have featureSubsetStrategy. It uses > random forest internally, which has featureSubsetStrategy hardcoded to "all". > It should be settable by the user to allow randomness at the feature level. > This parameter is available in H2O and XGBoost. > Sample from H2O.ai: > gbmParams._col_sample_rate > Please provide the parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20996) Better handling AM reattempt based on exit code in yarn mode
[ https://issues.apache.org/jira/browse/SPARK-20996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20996: Assignee: (was: Apache Spark) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20996) Better handling AM reattempt based on exit code in yarn mode
[ https://issues.apache.org/jira/browse/SPARK-20996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038404#comment-16038404 ] Apache Spark commented on SPARK-20996: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/18213 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20499) Spark MLlib, GraphX 2.2 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-20499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20499. Resolution: Done > Spark MLlib, GraphX 2.2 QA umbrella > --- > > Key: SPARK-20499 > URL: https://issues.apache.org/jira/browse/SPARK-20499 > Project: Spark > Issue Type: Umbrella > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This JIRA lists tasks for the next Spark release's QA period for MLlib and > GraphX. *SparkR is separate: [SPARK-20508].* > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Check binary API compatibility for Scala/Java > * Audit new public APIs (from the generated html doc) > ** Scala > ** Java compatibility > ** Python coverage > * Check Experimental, DeveloperApi tags > h2. Algorithms and performance > * Performance tests > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20507) Update MLlib, GraphX websites for 2.2
[ https://issues.apache.org/jira/browse/SPARK-20507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20507. Resolution: Done Assignee: Nick Pentreath No updates to MLlib project website required for {{2.2}} release. > Update MLlib, GraphX websites for 2.2 > - > > Key: SPARK-20507 > URL: https://issues.apache.org/jira/browse/SPARK-20507 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Critical > > Update the sub-projects' websites to include new features in this release.
[jira] [Created] (SPARK-20996) Better handling AM reattempt based on exit code in yarn mode
Saisai Shao created SPARK-20996: --- Summary: Better handling AM reattempt based on exit code in yarn mode Key: SPARK-20996 URL: https://issues.apache.org/jira/browse/SPARK-20996 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.2.0 Reporter: Saisai Shao Priority: Minor YARN provides a max-attempt configuration for the applications running on it, so a failed application gets the chance to retry. In the current Spark code, the RM restarts the AM after any failure, whatever its kind, as long as the max attempt is not reached. That is unreasonable when the failure comes from the AM itself (user code failure, OOM, a Spark bug, executor failures), since the reattempt will most likely hit the same issue again. The AM should only be retried when it failed due to an external issue such as a crash, a process kill, or an NM failure. So this proposes improving the reattempt mechanism to retry only on external issues.
[jira] [Assigned] (SPARK-20995) 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.
[ https://issues.apache.org/jira/browse/SPARK-20995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20995: Assignee: Apache Spark > 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions. > -- > > Key: SPARK-20995 > URL: https://issues.apache.org/jira/browse/SPARK-20995 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: guoxiaolongzte >Assignee: Apache Spark >Priority: Minor > > Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which > contains the (client side) configuration files for the Hadoop cluster. > These configs are used to write to HDFS and connect to the YARN > ResourceManager. The > configuration contained in this directory will be distributed to the YARN > cluster so that all > containers used by the application use the same configuration. > Sometimes, HADOOP_CONF_DIR is set to the hdfs configuration file path. So, > YARN_CONF_DIR should be set to the yarn configuration file path.
[jira] [Assigned] (SPARK-20995) 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.
[ https://issues.apache.org/jira/browse/SPARK-20995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20995: Assignee: (was: Apache Spark) > 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions. > -- > > Key: SPARK-20995 > URL: https://issues.apache.org/jira/browse/SPARK-20995 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: guoxiaolongzte >Priority: Minor > > Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which > contains the (client side) configuration files for the Hadoop cluster. > These configs are used to write to HDFS and connect to the YARN > ResourceManager. The > configuration contained in this directory will be distributed to the YARN > cluster so that all > containers used by the application use the same configuration. > Sometimes, HADOOP_CONF_DIR is set to the hdfs configuration file path. So, > YARN_CONF_DIR should be set to the yarn configuration file path.
[jira] [Commented] (SPARK-20995) 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.
[ https://issues.apache.org/jira/browse/SPARK-20995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038361#comment-16038361 ] Apache Spark commented on SPARK-20995: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/18212 > 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions. > -- > > Key: SPARK-20995 > URL: https://issues.apache.org/jira/browse/SPARK-20995 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: guoxiaolongzte >Priority: Minor > > Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which > contains the (client side) configuration files for the Hadoop cluster. > These configs are used to write to HDFS and connect to the YARN > ResourceManager. The > configuration contained in this directory will be distributed to the YARN > cluster so that all > containers used by the application use the same configuration. > Sometimes, HADOOP_CONF_DIR is set to the hdfs configuration file path. So, > YARN_CONF_DIR should be set to the yarn configuration file path.
[jira] [Updated] (SPARK-20995) 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.
[ https://issues.apache.org/jira/browse/SPARK-20995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-20995: --- Description: Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration. Sometimes, HADOOP_CONF_DIR is set to the hdfs configuration file path. So, YARN_CONF_DIR should be set to the yarn configuration file path. > 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions. > -- > > Key: SPARK-20995 > URL: https://issues.apache.org/jira/browse/SPARK-20995 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: guoxiaolongzte >Priority: Minor > > Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which > contains the (client side) configuration files for the Hadoop cluster. > These configs are used to write to HDFS and connect to the YARN > ResourceManager. The > configuration contained in this directory will be distributed to the YARN > cluster so that all > containers used by the application use the same configuration. > Sometimes, HADOOP_CONF_DIR is set to the hdfs configuration file path. So, > YARN_CONF_DIR should be set to the yarn configuration file path.
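The SPARK-20995 description amounts to documenting two environment variables in `conf/spark-env.sh`. A sketch of what the template entry could look like, with example paths only (the directories shown are assumptions, not Spark defaults):

```shell
# Client-side Hadoop configuration directory, used for HDFS access (example path).
export HADOOP_CONF_DIR=/etc/hadoop/conf
# YARN configuration directory, for when the YARN configs live separately
# from the HDFS configs (example path).
export YARN_CONF_DIR=/etc/hadoop/conf.yarn
```

Setting both covers the case the issue describes: HADOOP_CONF_DIR pointing only at HDFS configs, with the YARN ResourceManager configs in a different directory.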
[jira] [Updated] (SPARK-20996) Better handling AM reattempt based on exit code in yarn mode
[ https://issues.apache.org/jira/browse/SPARK-20996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-20996: Description: Yarn provides max attempt configuration for applications running on it, application has the chance to retry itself when failed. In the current Spark code, no matter which failure AM occurred and if the failure doesn't reach to the max attempt, RM will restart AM, this is not reasonable for some cases if this issue is coming from AM itself, like user code failure, OOM, Spark issue, executor failures, in large chance the reattempt of AM will meet this issue again. Only when AM is failed due to external issue like crash, process kill, NM failure, then AM should retry again. So here propose to improve this reattempt mechanism to only retry when it meets external issues. was: Yarn provides max attempt configuration for applications running on it, application has the chance to retry itself when failed. In the current Spark code, no matter which failure AM occurred and if the failure doesn't reach to the max attempt, RM will restart AM, this is not reasonable for some cases if this issue is coming from AM itself, like user code failure, OOM, Spark issue, executor failures, in large case the reattempt of AM will meet this issue again. Only when AM is failed due to external issue like crash, process kill, NM failure, then AM should retry again. So here propose to improve this reattempt mechanism to only retry when it meets external issues. > Better handling AM reattempt based on exit code in yarn mode > > > Key: SPARK-20996 > URL: https://issues.apache.org/jira/browse/SPARK-20996 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Minor > > Yarn provides max attempt configuration for applications running on it, > application has the chance to retry itself when failed. 
In the current Spark > code, no matter which failure AM occurred and if the failure doesn't reach to > the max attempt, RM will restart AM, this is not reasonable for some cases if > this issue is coming from AM itself, like user code failure, OOM, Spark > issue, executor failures, in large chance the reattempt of AM will meet this > issue again. Only when AM is failed due to external issue like crash, process > kill, NM failure, then AM should retry again. > So here propose to improve this reattempt mechanism to only retry when it > meets external issues.
[jira] [Created] (SPARK-20995) 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.
guoxiaolongzte created SPARK-20995: -- Summary: 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions. Key: SPARK-20995 URL: https://issues.apache.org/jira/browse/SPARK-20995 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.1 Reporter: guoxiaolongzte Priority: Minor