[jira] [Commented] (SPARK-11714) Make Spark on Mesos honor port restrictions
[ https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136747#comment-15136747 ] Stavros Kontopoulos commented on SPARK-11714: - [~andrewor14] Would it be meaningful to move the code to the coarse-grained backend, since fine-grained mode is now deprecated? > Make Spark on Mesos honor port restrictions > --- > > Key: SPARK-11714 > URL: https://issues.apache.org/jira/browse/SPARK-11714 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Charles Allen > > Currently the MesosSchedulerBackend does not make any effort to honor the "ports" > resource in Mesos offers. The request is that the ports the > executor binds to stay within the limits of the offer's "ports" resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
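For context on what honoring the "ports" resource would involve: each Mesos offer carries a "ports" resource expressed as ranges, and the scheduler backend would need to read those ranges and keep executor ports inside them. A minimal Scala sketch against the Mesos protobuf Java API; the helper name offeredPortRanges is hypothetical and not existing Spark code.

{code:scala}
import scala.collection.JavaConverters._
import org.apache.mesos.Protos.Offer

// Hypothetical helper: collect the (begin, end) port ranges advertised in an
// offer's "ports" resource, so executor launch code could pick ports that
// stay inside the offered ranges.
def offeredPortRanges(offer: Offer): Seq[(Long, Long)] = {
  offer.getResourcesList.asScala
    .filter(_.getName == "ports")
    .flatMap(_.getRanges.getRangeList.asScala.map(r => (r.getBegin, r.getEnd)))
}
{code}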
[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136757#comment-15136757 ] Herman Schistad commented on SPARK-13198: - Hi [~srowen], thanks for your reply. I have indeed tried to look at the program using a profiler and I've attached two screenshots from jvisualvm connected to the driver JMX interface. You can see that the "Old Gen" space is completely full. You see that dip at 09:30:00? That's me triggering a manual GC. It might be unusual to do this, but in any case (given the existence of sc.stop()) it should work right? My use case is having X number of different parquet directories which need to be loaded and analysed linearly, as part of a generic platform where users are able to upload data and apply daily/hourly aggregations on them. I've also seen people starting and stopping contexts quite frequently when doing unit tests etc. Using G1 garbage collection doesn't seem to affect the end result either. I'm also attaching a GC log in it's raw format. You can see it's trying to do a full GC at multiple times during the execution of the program. Thanks again Sean. > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman Schistad updated SPARK-13198: Attachment: Screen Shot 2016-02-08 at 09.30.59.png Screen Shot 2016-02-08 at 09.31.10.png gc.log > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > gc.log > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136757#comment-15136757 ] Herman Schistad edited comment on SPARK-13198 at 2/8/16 9:55 AM: - Hi [~srowen], thanks for your reply. I have indeed tried to look at the program using a profiler and I've attached two screenshots ([one|^Screen Shot 2016-02-08 at 09.31.10.png] and [two|^Screen Shot 2016-02-08 at 09.30.59.png]) from jvisualvm connected to the driver JMX interface. You can see that the "Old Gen" space is completely full. You see that dip at 09:30:00? That's me triggering a manual GC. It might be unusual to do this, but in any case (given the existence of sc.stop()) it should work right? My use case is having X number of different parquet directories which need to be loaded and analysed linearly, as part of a generic platform where users are able to upload data and apply daily/hourly aggregations on them. I've also seen people starting and stopping contexts quite frequently when doing unit tests etc. Using G1 garbage collection doesn't seem to affect the end result either. I'm also attaching a [GC log|^gc.log] in it's raw format. You can see it's trying to do a full GC at multiple times during the execution of the program. Thanks again Sean. was (Author: hermansc): Hi [~srowen], thanks for your reply. I have indeed tried to look at the program using a profiler and I've attached two screenshots from jvisualvm connected to the driver JMX interface. You can see that the "Old Gen" space is completely full. You see that dip at 09:30:00? That's me triggering a manual GC. It might be unusual to do this, but in any case (given the existence of sc.stop()) it should work right? My use case is having X number of different parquet directories which need to be loaded and analysed linearly, as part of a generic platform where users are able to upload data and apply daily/hourly aggregations on them. I've also seen people starting and stopping contexts quite frequently when doing unit tests etc. Using G1 garbage collection doesn't seem to affect the end result either. I'm also attaching a GC log in it's raw format. You can see it's trying to do a full GC at multiple times during the execution of the program. Thanks again Sean. > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > gc.log > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. 
The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsub
[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136773#comment-15136773 ] Sean Owen commented on SPARK-13198: --- I don't think stop() is relevant here. There's not an active attempt to free up resources once the app is done. It's assumed the driver JVM is shutting down. Yes, the question was whether it had tried to do a full GC, and sounds like it has done, OK. Still if you're just finding there is a bunch of left over bookkeeping info for executors, probably from all the old contexts, I think that's "normal" or at least "not a problem as Spark is intended to be used" > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > Screen Shot 2016-02-08 at 10.03.04.png, gc.log > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
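Given the comment above that the leftover per-context bookkeeping is expected, one way to keep this workload within "Spark as intended" is to create a single long-lived context and loop over the Parquet directories with it. A sketch of the reproduction code restructured that way; "MASTER_URL" and the HDFS path are the placeholders from the original snippet.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// One SparkContext for the whole run, reused for every Parquet directory,
// instead of creating and stopping a context per iteration.
def main(args: Array[String]) {
  val conf = new SparkConf().setMaster("MASTER_URL").setAppName("")
  conf.set("spark.mesos.coarse", "true")
  conf.set("spark.cores.max", "10")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  for (i <- 1 until 100) {
    val events = sqlContext.read.parquet("hdfs://localhost/tmp/something")
    println(s"Iteration ($i), number of events: " + events.count)
  }
  sc.stop()
}
{code}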
[jira] [Updated] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman Schistad updated SPARK-13198: Attachment: Screen Shot 2016-02-08 at 10.03.04.png > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > Screen Shot 2016-02-08 at 10.03.04.png, gc.log > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136774#comment-15136774 ] Herman Schistad commented on SPARK-13198: - Digging more into the dumped heap and running a memory leak report (using Eclipse Memory Analyzer) I'm seeing the following result: !Screen Shot 2016-02-08 at 10.03.04.png|width=400! > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > Screen Shot 2016-02-08 at 10.03.04.png, gc.log > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13177) Update ActorWordCount example to not directly use low level linked list as it is deprecated.
[ https://issues.apache.org/jira/browse/SPARK-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13177: Assignee: (was: Apache Spark) > Update ActorWordCount example to not directly use low level linked list as it > is deprecated. > > > Key: SPARK-13177 > URL: https://issues.apache.org/jira/browse/SPARK-13177 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: holdenk >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13177) Update ActorWordCount example to not directly use low level linked list as it is deprecated.
[ https://issues.apache.org/jira/browse/SPARK-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136791#comment-15136791 ] Apache Spark commented on SPARK-13177: -- User 'agsachin' has created a pull request for this issue: https://github.com/apache/spark/pull/3 > Update ActorWordCount example to not directly use low level linked list as it > is deprecated. > > > Key: SPARK-13177 > URL: https://issues.apache.org/jira/browse/SPARK-13177 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: holdenk >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13177) Update ActorWordCount example to not directly use low level linked list as it is deprecated.
[ https://issues.apache.org/jira/browse/SPARK-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13177: Assignee: Apache Spark > Update ActorWordCount example to not directly use low level linked list as it > is deprecated. > > > Key: SPARK-13177 > URL: https://issues.apache.org/jira/browse/SPARK-13177 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13231) Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks and make it a user facing API.
Prashant Sharma created SPARK-13231: --- Summary: Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks and make it a user facing API. Key: SPARK-13231 URL: https://issues.apache.org/jira/browse/SPARK-13231 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.0 Reporter: Prashant Sharma Priority: Minor Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks (or includeFailedTasks; I prefer the longer version). Exposing it to users has no disadvantage I can think of, and it can be useful for them. One scenario is a user-defined metric. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
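To make the "user defined metric" scenario concrete, here is a sketch against the existing Spark 1.6 accumulator API; the includeValuesOfFailedTasks / includeFailedTasks flag is only a proposal at this point, so the comments describe what it would change rather than calling any new API.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("accumulator-metric"))

// A user-defined metric with the existing API. Today, accumulator updates made
// by tasks that ultimately fail are dropped for user accumulators; the proposed
// user-facing flag would let a metric like this also include values from
// failed task attempts.
val recordsSeen = sc.accumulator(0L, "recordsSeen")
sc.parallelize(1 to 1000).foreach { _ => recordsSeen += 1L }
println(s"records seen: ${recordsSeen.value}")
{code}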
[jira] [Commented] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one
[ https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136834#comment-15136834 ] Charles Drotar commented on SPARK-13156: Thanks Sean. The driver inhibiting the concurrent connections was the issue. Apparently the Teradata driver does not support concurrent connections and instead suggests creating different sessions for each query. I don't think this is truly an issue so I will close out the JIRA. > JDBC using multiple partitions creates additional tasks but only executes on > one > > > Key: SPARK-13156 > URL: https://issues.apache.org/jira/browse/SPARK-13156 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.5.0 > Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client >Reporter: Charles Drotar > > I can successfully kick off a query through JDBC to Teradata, and when it > runs it creates a task on each executor for every partition. The problem is > that all of the tasks except for one complete within a couple seconds and the > final task handles the entire dataset. > Example Code: > private val properties = new java.util.Properties() > properties.setProperty("driver","com.teradata.jdbc.TeraDriver") > properties.setProperty("username","foo") > properties.setProperty("password","bar") > val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10" > val numPartitions = 5 > val dbTableTemp = "( SELECT id MOD $numPartitions%d AS modulo, id FROM > db.table) AS TEMP_TABLE" > val partitionColumn = "modulo" > val lowerBound = 0.toLong > val upperBound = (numPartitions-1).toLong > val df = > sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties) > df.write.parquet("/output/path/for/df/") > When I look at the Spark UI I see the 5 tasks, but only 1 is actually > querying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
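As a side note on the original example: Spark's JDBC source turns partitionColumn/lowerBound/upperBound/numPartitions into one WHERE predicate per partition, so each of the five tasks runs its own "modulo"-ranged query. The same split can be spelled out with the predicates overload of read.jdbc, which makes the per-task queries explicit. This sketch reuses url, dbTableTemp, numPartitions, and properties from the example above (note that the original dbTableTemp string would need the f interpolator for $numPartitions%d to expand).

{code:scala}
// Equivalent partitioning with explicit per-partition predicates (assumes the
// url, dbTableTemp, numPartitions and properties values defined in the example
// above): each array element becomes the WHERE clause of one task's query.
val predicates = (0 until numPartitions).map(i => s"modulo = $i").toArray
val df = sqlContext.read.jdbc(url, dbTableTemp, predicates, properties)
df.write.parquet("/output/path/for/df/")
{code}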
[jira] [Closed] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one
[ https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Drotar closed SPARK-13156. -- Resolution: Not A Problem The driver class was inhibiting concurrent connections. This was unrelated to Spark's jdbc functionality. > JDBC using multiple partitions creates additional tasks but only executes on > one > > > Key: SPARK-13156 > URL: https://issues.apache.org/jira/browse/SPARK-13156 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.5.0 > Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client >Reporter: Charles Drotar > > I can successfully kick off a query through JDBC to Teradata, and when it > runs it creates a task on each executor for every partition. The problem is > that all of the tasks except for one complete within a couple seconds and the > final task handles the entire dataset. > Example Code: > private val properties = new java.util.Properties() > properties.setProperty("driver","com.teradata.jdbc.TeraDriver") > properties.setProperty("username","foo") > properties.setProperty("password","bar") > val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10" > val numPartitions = 5 > val dbTableTemp = "( SELECT id MOD $numPartitions%d AS modulo, id FROM > db.table) AS TEMP_TABLE" > val partitionColumn = "modulo" > val lowerBound = 0.toLong > val upperBound = (numPartitions-1).toLong > val df = > sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties) > df.write.parquet("/output/path/for/df/") > When I look at the Spark UI I see the 5 tasks, but only 1 is actually > querying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one
[ https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136835#comment-15136835 ] Charles Drotar edited comment on SPARK-13156 at 2/8/16 11:28 AM: - Thanks Sean. The driver inhibiting the concurrent connections was the issue. Apparently the Teradata driver does not support concurrent connections and instead suggests creating different sessions for each query. I don't think this is truly an issue so I will close out the JIRA. was (Author: charles.dro...@capitalone.com): The driver class was inhibiting concurrent connections. This was unrelated to Spark's jdbc functionality. > JDBC using multiple partitions creates additional tasks but only executes on > one > > > Key: SPARK-13156 > URL: https://issues.apache.org/jira/browse/SPARK-13156 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.5.0 > Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client >Reporter: Charles Drotar > > I can successfully kick off a query through JDBC to Teradata, and when it > runs it creates a task on each executor for every partition. The problem is > that all of the tasks except for one complete within a couple seconds and the > final task handles the entire dataset. > Example Code: > private val properties = new java.util.Properties() > properties.setProperty("driver","com.teradata.jdbc.TeraDriver") > properties.setProperty("username","foo") > properties.setProperty("password","bar") > val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10" > val numPartitions = 5 > val dbTableTemp = "( SELECT id MOD $numPartitions%d AS modulo, id FROM > db.table) AS TEMP_TABLE" > val partitionColumn = "modulo" > val lowerBound = 0.toLong > val upperBound = (numPartitions-1).toLong > val df = > sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties) > df.write.parquet("/output/path/for/df/") > When I look at the Spark UI I see the 5 tasks, but only 1 is actually > querying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one
[ https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Drotar updated SPARK-13156: --- Comment: was deleted (was: Thanks Sean. The driver inhibiting the concurrent connections was the issue. Apparently the Teradata driver does not support concurrent connections and instead suggests creating different sessions for each query. I don't think this is truly an issue so I will close out the JIRA.) > JDBC using multiple partitions creates additional tasks but only executes on > one > > > Key: SPARK-13156 > URL: https://issues.apache.org/jira/browse/SPARK-13156 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.5.0 > Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client >Reporter: Charles Drotar > > I can successfully kick off a query through JDBC to Teradata, and when it > runs it creates a task on each executor for every partition. The problem is > that all of the tasks except for one complete within a couple seconds and the > final task handles the entire dataset. > Example Code: > private val properties = new java.util.Properties() > properties.setProperty("driver","com.teradata.jdbc.TeraDriver") > properties.setProperty("username","foo") > properties.setProperty("password","bar") > val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10" > val numPartitions = 5 > val dbTableTemp = "( SELECT id MOD $numPartitions%d AS modulo, id FROM > db.table) AS TEMP_TABLE" > val partitionColumn = "modulo" > val lowerBound = 0.toLong > val upperBound = (numPartitions-1).toLong > val df = > sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties) > df.write.parquet("/output/path/for/df/") > When I look at the Spark UI I see the 5 tasks, but only 1 is actually > querying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7848) Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" information.
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7848: --- Assignee: Apache Spark > Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" > information. > > > Key: SPARK-7848 > URL: https://issues.apache.org/jira/browse/SPARK-7848 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: jay vyas >Assignee: Apache Spark > > A recent email on the mailing list detailed a bunch of great "knobs" to > remember for Spark Streaming. > Let's integrate this into the docs where appropriate. > I'll paste the raw text in a comment field below -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7848) Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" information.
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7848: --- Assignee: (was: Apache Spark) > Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" > information. > > > Key: SPARK-7848 > URL: https://issues.apache.org/jira/browse/SPARK-7848 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: jay vyas > > A recent email on the mailing list detailed a bunch of great "knobs" to > remember for Spark Streaming. > Let's integrate this into the docs where appropriate. > I'll paste the raw text in a comment field below -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7848) Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" information.
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136849#comment-15136849 ] Apache Spark commented on SPARK-7848: - User 'nirmannarang' has created a pull request for this issue: https://github.com/apache/spark/pull/4 > Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" > information. > > > Key: SPARK-7848 > URL: https://issues.apache.org/jira/browse/SPARK-7848 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: jay vyas > > A recent email on the mailing list detailed a bunch of great "knobs" to > remember for Spark Streaming. > Let's integrate this into the docs where appropriate. > I'll paste the raw text in a comment field below -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13231) Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks and make it a user facing API.
[ https://issues.apache.org/jira/browse/SPARK-13231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13231: Assignee: (was: Apache Spark) > Rename Accumulable.countFailedValues to > Accumulable.includeValuesOfFailedTasks and make it a user facing API. > - > > Key: SPARK-13231 > URL: https://issues.apache.org/jira/browse/SPARK-13231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Prashant Sharma >Priority: Minor > > Rename Accumulable.countFailedValues to > Accumulable.includeValuesOfFailedTasks (or includeFailedTasks) I liked the > longer version though. > Exposing it to user has no disadvantage I can think of, but it can be useful > for them. One scenario can be a user defined metric. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13231) Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks and make it a user facing API.
[ https://issues.apache.org/jira/browse/SPARK-13231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13231: Assignee: Apache Spark > Rename Accumulable.countFailedValues to > Accumulable.includeValuesOfFailedTasks and make it a user facing API. > - > > Key: SPARK-13231 > URL: https://issues.apache.org/jira/browse/SPARK-13231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Prashant Sharma >Assignee: Apache Spark >Priority: Minor > > Rename Accumulable.countFailedValues to > Accumulable.includeValuesOfFailedTasks (or includeFailedTasks) I liked the > longer version though. > Exposing it to user has no disadvantage I can think of, but it can be useful > for them. One scenario can be a user defined metric. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13231) Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks and make it a user facing API.
[ https://issues.apache.org/jira/browse/SPARK-13231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136944#comment-15136944 ] Apache Spark commented on SPARK-13231: -- User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/5 > Rename Accumulable.countFailedValues to > Accumulable.includeValuesOfFailedTasks and make it a user facing API. > - > > Key: SPARK-13231 > URL: https://issues.apache.org/jira/browse/SPARK-13231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Prashant Sharma >Priority: Minor > > Rename Accumulable.countFailedValues to > Accumulable.includeValuesOfFailedTasks (or includeFailedTasks) I liked the > longer version though. > Exposing it to user has no disadvantage I can think of, but it can be useful > for them. One scenario can be a user defined metric. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12316) Stack overflow with endless call of `Delegation token thread` when application end.
[ https://issues.apache.org/jira/browse/SPARK-12316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-12316: -- Assignee: SaintBacchus > Stack overflow with endless call of `Delegation token thread` when > application end. > --- > > Key: SPARK-12316 > URL: https://issues.apache.org/jira/browse/SPARK-12316 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: SaintBacchus >Assignee: SaintBacchus > Attachments: 20151210045149.jpg, 20151210045533.jpg > > > When application end, AM will clean the staging dir. > But if the driver trigger to update the delegation token, it will can't find > the right token file and then it will endless cycle call the method > 'updateCredentialsIfRequired'. > Then it lead to StackOverflowError. > !https://issues.apache.org/jira/secure/attachment/12779495/20151210045149.jpg! > !https://issues.apache.org/jira/secure/attachment/12779496/20151210045533.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13013) Replace example code in mllib-clustering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13013: Assignee: Apache Spark > Replace example code in mllib-clustering.md using include_example > - > > Key: SPARK-13013 > URL: https://issues.apache.org/jira/browse/SPARK-13013 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Apache Spark >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13013) Replace example code in mllib-clustering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13013: Assignee: (was: Apache Spark) > Replace example code in mllib-clustering.md using include_example > - > > Key: SPARK-13013 > URL: https://issues.apache.org/jira/browse/SPARK-13013 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13013) Replace example code in mllib-clustering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137004#comment-15137004 ] Apache Spark commented on SPARK-13013: -- User 'keypointt' has created a pull request for this issue: https://github.com/apache/spark/pull/6 > Replace example code in mllib-clustering.md using include_example > - > > Key: SPARK-13013 > URL: https://issues.apache.org/jira/browse/SPARK-13013 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137007#comment-15137007 ] Xin Ren commented on SPARK-13014: - I'm working on this one, thanks :) > Replace example code in mllib-collaborative-filtering.md using include_example > -- > > Key: SPARK-13014 > URL: https://issues.apache.org/jira/browse/SPARK-13014 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12316) Stack overflow with endless call of `Delegation token thread` when application end.
[ https://issues.apache.org/jira/browse/SPARK-12316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137000#comment-15137000 ] Thomas Graves commented on SPARK-12316: --- you say "endless cycle call" do you mean the application master hangs? It seems like it should throw and if the application is done it should just exit anyway since the AM is just calling stop on it.I just want to clarify what is happening because I assume even if you wait a minute you could still hit the same condition once when its tearing down. > Stack overflow with endless call of `Delegation token thread` when > application end. > --- > > Key: SPARK-12316 > URL: https://issues.apache.org/jira/browse/SPARK-12316 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: SaintBacchus >Assignee: SaintBacchus > Attachments: 20151210045149.jpg, 20151210045533.jpg > > > When application end, AM will clean the staging dir. > But if the driver trigger to update the delegation token, it will can't find > the right token file and then it will endless cycle call the method > 'updateCredentialsIfRequired'. > Then it lead to StackOverflowError. > !https://issues.apache.org/jira/secure/attachment/12779495/20151210045149.jpg! > !https://issues.apache.org/jira/secure/attachment/12779496/20151210045533.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137029#comment-15137029 ] Rama Mullapudi commented on SPARK-12177: Does the update include Kerberos support, since 0.9 producers and consumers now support Kerberos (SASL) and SSL? > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 is already released and it introduces a new consumer API that is not > compatible with the old one. So, I added the new consumer API in separate > classes in the package org.apache.spark.streaming.kafka.v09. I > didn't remove the old classes, for backward compatibility. Users will not need > to change their old Spark applications when they upgrade to the new Spark version. > Please review my changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
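For reference, the security settings in question are plain Kafka 0.9 consumer configs; whether and how the proposed org.apache.spark.streaming.kafka.v09 integration forwards them to the new consumer is exactly what is being asked above, so treat this parameter map as an illustration only (the broker host, paths, and password are placeholders).

{code:scala}
// Kafka 0.9 new-consumer security-related settings (defined by Kafka itself).
val kafkaParams = Map[String, String](
  "bootstrap.servers"          -> "broker1:9093",
  "group.id"                   -> "example-group",
  "security.protocol"          -> "SASL_SSL",   // or SASL_PLAINTEXT / SSL
  "sasl.kerberos.service.name" -> "kafka",
  "ssl.truststore.location"    -> "/path/to/truststore.jks",
  "ssl.truststore.password"    -> "changeit"
)
{code}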
[jira] [Created] (SPARK-13232) YARN executor node label expressions bug
Atkins created SPARK-13232: -- Summary: YARN executor node label expressions bug Key: SPARK-13232 URL: https://issues.apache.org/jira/browse/SPARK-13232 Project: Spark Issue Type: Bug Components: YARN Environment: Scala 2.11.7, Hadoop 2.7.2, Spark 1.6.0 Reporter: Atkins Using node label expression for executor failed to request container request and throws *InvalidContainerRequestException*. The code {code:title=AMRMClientImpl.java} /** * Valid if a node label expression specified on container request is valid or * not * * @param containerRequest */ private void checkNodeLabelExpression(T containerRequest) { String exp = containerRequest.getNodeLabelExpression(); if (null == exp || exp.isEmpty()) { return; } // Don't support specifying >= 2 node labels in a node label expression now if (exp.contains("&&") || exp.contains("||")) { throw new InvalidContainerRequestException( "Cannot specify more than two node labels" + " in a single node label expression"); } // Don't allow specify node label against ANY request if ((containerRequest.getRacks() != null && (!containerRequest.getRacks().isEmpty())) || (containerRequest.getNodes() != null && (!containerRequest.getNodes().isEmpty( { throw new InvalidContainerRequestException( "Cannot specify node label with rack and node"); } } {code} doesn't allow node label with rack and node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13232) YARN executor node label expressions
[ https://issues.apache.org/jira/browse/SPARK-13232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13232: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) Summary: YARN executor node label expressions (was: YARN executor node label expressions bug) What are you specifically referring to in this code -- what change are you proposing? As far as I can tell you're referring to something that's just not supported yet, which are conjunctions? > YARN executor node label expressions > > > Key: SPARK-13232 > URL: https://issues.apache.org/jira/browse/SPARK-13232 > Project: Spark > Issue Type: Improvement > Components: YARN > Environment: Scala 2.11.7, Hadoop 2.7.2, Spark 1.6.0 >Reporter: Atkins >Priority: Minor > > Using node label expression for executor failed to request container request > and throws *InvalidContainerRequestException*. > The code > {code:title=AMRMClientImpl.java} > /** >* Valid if a node label expression specified on container request is valid > or >* not >* >* @param containerRequest >*/ > private void checkNodeLabelExpression(T containerRequest) { > String exp = containerRequest.getNodeLabelExpression(); > > if (null == exp || exp.isEmpty()) { > return; > } > // Don't support specifying >= 2 node labels in a node label expression > now > if (exp.contains("&&") || exp.contains("||")) { > throw new InvalidContainerRequestException( > "Cannot specify more than two node labels" > + " in a single node label expression"); > } > > // Don't allow specify node label against ANY request > if ((containerRequest.getRacks() != null && > (!containerRequest.getRacks().isEmpty())) > || > (containerRequest.getNodes() != null && > (!containerRequest.getNodes().isEmpty( { > throw new InvalidContainerRequestException( > "Cannot specify node label with rack and node"); > } > } > {code} > doesn't allow node label with rack and node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
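For what Spark already exposes: the executor node label is requested through configuration, and the exception quoted above is thrown inside YARN's AMRMClient when such a labeled request also carries rack/node locality, so it appears to be a YARN-side restriction rather than something the Spark setting alone can avoid. A sketch of the existing configuration (the label name "gpu" is just an example):

{code:scala}
import org.apache.spark.SparkConf

// Existing way to ask YARN for labeled executors in Spark 1.6.
val conf = new SparkConf()
  .set("spark.yarn.executor.nodeLabelExpression", "gpu")  // example label
{code}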
[jira] [Commented] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated
[ https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137136#comment-15137136 ] sachin aggarwal commented on SPARK-13172: - Instead of getStackTraceString, should I use e.getStackTrace or e.printStackTrace? > Stop using RichException.getStackTrace it is deprecated > --- > > Key: SPARK-13172 > URL: https://issues.apache.org/jira/browse/SPARK-13172 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > Throwable.getStackTrace is the recommended alternative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
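Of the two options in the comment, Throwable.getStackTrace is the one the deprecation points at, but it returns Array[StackTraceElement] and so needs formatting; printStackTrace only writes to stderr and returns Unit, which is rarely what a log or error message needs. A minimal sketch:

{code:scala}
// Possible replacement for the deprecated Scala getStackTraceString:
// format the StackTraceElement array yourself.
def stackTraceString(e: Throwable): String =
  e.getStackTrace.mkString("\n")
{code}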
[jira] [Created] (SPARK-13233) Python Dataset
Wenchen Fan created SPARK-13233: --- Summary: Python Dataset Key: SPARK-13233 URL: https://issues.apache.org/jira/browse/SPARK-13233 Project: Spark Issue Type: New Feature Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13233) Python Dataset
[ https://issues.apache.org/jira/browse/SPARK-13233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-13233: Attachment: DesignDocPythonDataset.pdf > Python Dataset > -- > > Key: SPARK-13233 > URL: https://issues.apache.org/jira/browse/SPARK-13233 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan > Attachments: DesignDocPythonDataset.pdf > > > add Python Dataset w.r.t. the scala version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13233) Python Dataset
[ https://issues.apache.org/jira/browse/SPARK-13233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-13233: Description: add Python Dataset w.r.t. the scala version > Python Dataset > -- > > Key: SPARK-13233 > URL: https://issues.apache.org/jira/browse/SPARK-13233 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan > Attachments: DesignDocPythonDataset.pdf > > > add Python Dataset w.r.t. the scala version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
[ https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137168#comment-15137168 ] Sangeet Chourey commented on SPARK-10528: - RESOLVED : Downloaded the correct Winutils version and issue was resolved. Ideally, it should be locally compiled but if downloading compiled version make sure that it is 32/64 bit as applicable. I tried on Windows 7 64 bit, Spark 1.6 and downloaded winutils.exe from https://www.barik.net/archive/2015/01/19/172716/ and it worked..!! > spark-shell throws java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. > -- > > Key: SPARK-10528 > URL: https://issues.apache.org/jira/browse/SPARK-10528 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.0 > Environment: Windows 7 x64 >Reporter: Aliaksei Belablotski >Priority: Minor > > Starting spark-shell throws > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
[ https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137168#comment-15137168 ] Sangeet Chourey edited comment on SPARK-10528 at 2/8/16 4:26 PM: - RESOLVED : Downloaded the correct Winutils version and issue was resolved. Ideally, it should be locally compiled but if downloading compiled version make sure that it is 32/64 bit as applicable. I tried on Windows 7 64 bit, Spark 1.6 and downloaded winutils.exe from https://www.barik.net/archive/2015/01/19/172716/ and it worked..!! Complete Steps are at : http://letstalkspark.blogspot.com/2016/02/getting-started-with-spark-on-window-64.html was (Author: sybergeek): RESOLVED : Downloaded the correct Winutils version and issue was resolved. Ideally, it should be locally compiled but if downloading compiled version make sure that it is 32/64 bit as applicable. I tried on Windows 7 64 bit, Spark 1.6 and downloaded winutils.exe from https://www.barik.net/archive/2015/01/19/172716/ and it worked..!! > spark-shell throws java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. > -- > > Key: SPARK-10528 > URL: https://issues.apache.org/jira/browse/SPARK-10528 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.0 > Environment: Windows 7 x64 >Reporter: Aliaksei Belablotski >Priority: Minor > > Starting spark-shell throws > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13233) Python Dataset
[ https://issues.apache.org/jira/browse/SPARK-13233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13233: Assignee: Apache Spark > Python Dataset > -- > > Key: SPARK-13233 > URL: https://issues.apache.org/jira/browse/SPARK-13233 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > Attachments: DesignDocPythonDataset.pdf > > > add Python Dataset w.r.t. the scala version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13233) Python Dataset
[ https://issues.apache.org/jira/browse/SPARK-13233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137180#comment-15137180 ] Apache Spark commented on SPARK-13233: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7 > Python Dataset > -- > > Key: SPARK-13233 > URL: https://issues.apache.org/jira/browse/SPARK-13233 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan > Attachments: DesignDocPythonDataset.pdf > > > add Python Dataset w.r.t. the scala version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13233) Python Dataset
[ https://issues.apache.org/jira/browse/SPARK-13233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13233: Assignee: (was: Apache Spark) > Python Dataset > -- > > Key: SPARK-13233 > URL: https://issues.apache.org/jira/browse/SPARK-13233 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan > Attachments: DesignDocPythonDataset.pdf > > > add Python Dataset w.r.t. the scala version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10066) Can't create HiveContext with spark-shell or spark-sql on snapshot
[ https://issues.apache.org/jira/browse/SPARK-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137201#comment-15137201 ] Sangeet Chourey commented on SPARK-10066: - RESOLVED : Downloaded the correct Winutils version and issue was resolved. Ideally, it should be locally compiled but if downloading compiled version make sure that it is 32/64 bit as applicable. I tried on Windows 7 64 bit, Spark 1.6 and downloaded winutils.exe from https://www.barik.net/archive/2015/01/19/172716/ and it worked..!! Complete Steps are at : http://letstalkspark.blogspot.com/2016/02/getting-started-with-spark-on-window-64.html > Can't create HiveContext with spark-shell or spark-sql on snapshot > -- > > Key: SPARK-10066 > URL: https://issues.apache.org/jira/browse/SPARK-10066 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.0 > Environment: Centos 6.6 >Reporter: Robert Beauchemin >Priority: Minor > > Built the 1.5.0-preview-20150812 with the following: > ./make-distribution.sh -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive > -Phive-thriftserver -Psparkr -DskipTests > Starting spark-shell or spark-sql returns the following error: > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rwx-- > at > org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612) > [elided] > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508) > > It's trying to create a new HiveContext. Running pySpark or sparkR works and > creates a HiveContext successfully. SqlContext can be created successfully > with any shell. > I've tried changing permissions on that HDFS directory (even as far as making > it world-writable) without success. Tried changing SPARK_USER and also > running spark-shell as different users without success. > This works on same machine on 1.4.1 and on earlier pre-release versions of > Spark 1.5.0 (same make-distribution parms) sucessfully. Just trying the > snapshot... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13232) YARN executor node label expressions
[ https://issues.apache.org/jira/browse/SPARK-13232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137222#comment-15137222 ] Atkins commented on SPARK-13232: If spark config "spark.yarn.executor.nodeLabelExpression" present, *org.apache.spark.deploy.yarn.YarnAllocator#createContainerRequest* will create a ContainerRequest instance with locality specification of nodes, racks, and nodelabel which cause InvalidContainerRequestException be thrown. This can reproduce by adding test suite in *org.apache.spark.deploy.yarn.YarnAllocatorSuite* {code} test("request executors with locality") { val handler = createAllocator(1) handler.updateResourceRequests() handler.getNumExecutorsRunning should be (0) handler.getPendingAllocate.size should be (1) handler.requestTotalExecutorsWithPreferredLocalities(3, 20, Map(("host1", 10), ("host2", 20))) handler.updateResourceRequests() handler.getPendingAllocate.size should be (3) val container = createContainer("host1") handler.handleAllocatedContainers(Array(container)) handler.getNumExecutorsRunning should be (1) handler.allocatedContainerToHostMap.get(container.getId).get should be ("host1") handler.allocatedHostToContainersMap.get("host1").get should contain (container.getId) } {code} > YARN executor node label expressions > > > Key: SPARK-13232 > URL: https://issues.apache.org/jira/browse/SPARK-13232 > Project: Spark > Issue Type: Improvement > Components: YARN > Environment: Scala 2.11.7, Hadoop 2.7.2, Spark 1.6.0 >Reporter: Atkins >Priority: Minor > > Using node label expression for executor failed to request container request > and throws *InvalidContainerRequestException*. > The code > {code:title=AMRMClientImpl.java} > /** >* Valid if a node label expression specified on container request is valid > or >* not >* >* @param containerRequest >*/ > private void checkNodeLabelExpression(T containerRequest) { > String exp = containerRequest.getNodeLabelExpression(); > > if (null == exp || exp.isEmpty()) { > return; > } > // Don't support specifying >= 2 node labels in a node label expression > now > if (exp.contains("&&") || exp.contains("||")) { > throw new InvalidContainerRequestException( > "Cannot specify more than two node labels" > + " in a single node label expression"); > } > > // Don't allow specify node label against ANY request > if ((containerRequest.getRacks() != null && > (!containerRequest.getRacks().isEmpty())) > || > (containerRequest.getNodes() != null && > (!containerRequest.getNodes().isEmpty( { > throw new InvalidContainerRequestException( > "Cannot specify node label with rack and node"); > } > } > {code} > doesn't allow node label with rack and node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13104) Spark Metrics currently does not return executors hostname
[ https://issues.apache.org/jira/browse/SPARK-13104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik updated SPARK-13104: Description: We have been using Spark Metrics and porting the data to InfluxDB using the Graphite sink that is available in Spark. From what I can see, it only provides the executorId and not the executor hostname. With each Spark job, the executorID changes. Is there any way to find the hostname based on the executorID? (was: We been using Spark Metrics and porting the data to InfluxDB using the Graphite sink that is available in Spark. From what I can see, it only provides he executorId and not the executor hostname. With each spark job, the executorID changes. Is there any way to find the hostname based on the executorID?) > Spark Metrics currently does not return executors hostname > --- > > Key: SPARK-13104 > URL: https://issues.apache.org/jira/browse/SPARK-13104 > Project: Spark > Issue Type: Question >Reporter: Karthik >Priority: Critical > Labels: executor, executorId, graphite, hostname, metrics > > We have been using Spark Metrics and porting the data to InfluxDB using the > Graphite sink that is available in Spark. From what I can see, it only > provides the executorId and not the executor hostname. With each Spark job, > the executorID changes. Is there any way to find the hostname based on the > executorID? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
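Until the metrics carry the hostname themselves, one hedged workaround is to record the executorId-to-host mapping on the driver through the listener bus and join it with the Graphite/InfluxDB data downstream; the sketch below uses the public SparkListener API, and the class and field names introduced here are illustrative.
{code}
// Sketch only: maintains an executorId -> hostname map on the driver.
import scala.collection.concurrent.TrieMap
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

class ExecutorHostListener extends SparkListener {
  val executorHosts = TrieMap.empty[String, String]

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    // ExecutorInfo exposes the host the executor was launched on.
    executorHosts.put(event.executorId, event.executorInfo.executorHost)
  }

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit = {
    executorHosts.remove(event.executorId)
  }
}

// usage on the driver: sc.addSparkListener(new ExecutorHostListener)
{code}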
[jira] [Comment Edited] (SPARK-13232) YARN executor node label expressions
[ https://issues.apache.org/jira/browse/SPARK-13232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137222#comment-15137222 ] Atkins edited comment on SPARK-13232 at 2/8/16 4:59 PM: I am telling about yarn doesn't allow specify node label with racks or nodes, so the current version of Spark is not working with config of nodeLabel on Yarn. If spark config "spark.yarn.executor.nodeLabelExpression" present, *org.apache.spark.deploy.yarn.YarnAllocator#createContainerRequest* will create a ContainerRequest instance with locality specification of nodes, racks, and nodelabel which cause InvalidContainerRequestException be thrown. This can reproduce by adding test suite in *org.apache.spark.deploy.yarn.YarnAllocatorSuite* {code} test("request executors with locality") { val handler = createAllocator(1) handler.updateResourceRequests() handler.getNumExecutorsRunning should be (0) handler.getPendingAllocate.size should be (1) handler.requestTotalExecutorsWithPreferredLocalities(3, 20, Map(("host1", 10), ("host2", 20))) handler.updateResourceRequests() handler.getPendingAllocate.size should be (3) val container = createContainer("host1") handler.handleAllocatedContainers(Array(container)) handler.getNumExecutorsRunning should be (1) handler.allocatedContainerToHostMap.get(container.getId).get should be ("host1") handler.allocatedHostToContainersMap.get("host1").get should contain (container.getId) } {code} was (Author: atkins): If spark config "spark.yarn.executor.nodeLabelExpression" present, *org.apache.spark.deploy.yarn.YarnAllocator#createContainerRequest* will create a ContainerRequest instance with locality specification of nodes, racks, and nodelabel which cause InvalidContainerRequestException be thrown. This can reproduce by adding test suite in *org.apache.spark.deploy.yarn.YarnAllocatorSuite* {code} test("request executors with locality") { val handler = createAllocator(1) handler.updateResourceRequests() handler.getNumExecutorsRunning should be (0) handler.getPendingAllocate.size should be (1) handler.requestTotalExecutorsWithPreferredLocalities(3, 20, Map(("host1", 10), ("host2", 20))) handler.updateResourceRequests() handler.getPendingAllocate.size should be (3) val container = createContainer("host1") handler.handleAllocatedContainers(Array(container)) handler.getNumExecutorsRunning should be (1) handler.allocatedContainerToHostMap.get(container.getId).get should be ("host1") handler.allocatedHostToContainersMap.get("host1").get should contain (container.getId) } {code} > YARN executor node label expressions > > > Key: SPARK-13232 > URL: https://issues.apache.org/jira/browse/SPARK-13232 > Project: Spark > Issue Type: Improvement > Components: YARN > Environment: Scala 2.11.7, Hadoop 2.7.2, Spark 1.6.0 >Reporter: Atkins >Priority: Minor > > Using node label expression for executor failed to request container request > and throws *InvalidContainerRequestException*. 
> The code > {code:title=AMRMClientImpl.java} > /** >* Valid if a node label expression specified on container request is valid > or >* not >* >* @param containerRequest >*/ > private void checkNodeLabelExpression(T containerRequest) { > String exp = containerRequest.getNodeLabelExpression(); > > if (null == exp || exp.isEmpty()) { > return; > } > // Don't support specifying >= 2 node labels in a node label expression > now > if (exp.contains("&&") || exp.contains("||")) { > throw new InvalidContainerRequestException( > "Cannot specify more than two node labels" > + " in a single node label expression"); > } > > // Don't allow specify node label against ANY request > if ((containerRequest.getRacks() != null && > (!containerRequest.getRacks().isEmpty())) > || > (containerRequest.getNodes() != null && > (!containerRequest.getNodes().isEmpty( { > throw new InvalidContainerRequestException( > "Cannot specify node label with rack and node"); > } > } > {code} > doesn't allow node label with rack and node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
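A hedged sketch of the direction the comment points at (this is not the merged patch): attach the label expression only to a locality-relaxed request with no nodes or racks, which is the only shape checkNodeLabelExpression accepts, and leave the node/rack-constrained requests label-free. The sketch assumes the six-argument ContainerRequest constructor (with relaxLocality and nodeLabelsExpression) available in Hadoop 2.6+.
{code}
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

// Illustrative helper, not YarnAllocator's actual method.
def createContainerRequest(
    resource: Resource,
    nodes: Array[String],
    racks: Array[String],
    priority: Priority,
    labelExpression: Option[String]): ContainerRequest = labelExpression match {
  case Some(label) =>
    // Labelled request: no node/rack constraints, otherwise YARN throws
    // InvalidContainerRequestException.
    new ContainerRequest(resource, null, null, priority, true, label)
  case None =>
    new ContainerRequest(resource, nodes, racks, priority)
}
{code}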
[jira] [Updated] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13219: Component/s: (was: Spark Core) SQL > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
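The workaround mentioned in the description, spelled out against the same tables (shown for spark-shell, where sqlContext is predefined): repeating the constant predicate on both join keys lets each scan be filtered instead of one side doing a full table scan.
{code}
// Same query with the duplicated equality condition; table and column names
// are taken from the report above.
val plan = sqlContext.sql(
  """select t.name
    |from tenants t, assets a
    |where a.assetid = t.assetid
    |  and t.assetid = '1201'
    |  and a.assetid = '1201'
  """.stripMargin)
plan.explain(true) // both table scans should now sit under a Filter
{code}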
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137282#comment-15137282 ] Xiao Li commented on SPARK-13219: - See this PR: https://github.com/apache/spark/pull/10490. Let me know if you hit any bug. Thanks! > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13016) Replace example code in mllib-dimensionality-reduction.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137310#comment-15137310 ] Devaraj K commented on SPARK-13016: --- I am working on this and will provide a PR for it. Thanks. > Replace example code in mllib-dimensionality-reduction.md using > include_example > --- > > Key: SPARK-13016 > URL: https://issues.apache.org/jira/browse/SPARK-13016 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13117) WebUI should use the local ip not 0.0.0.0
[ https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137320#comment-15137320 ] Devaraj K commented on SPARK-13117: --- Thanks [~jjordan] for reporting this. I would like to provide a PR if you are not planning to work on it. Please let me know. Thanks. > WebUI should use the local ip not 0.0.0.0 > - > > Key: SPARK-13117 > URL: https://issues.apache.org/jira/browse/SPARK-13117 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jeremiah Jordan > > When SPARK_LOCAL_IP is set everything seems to correctly bind and use that IP > except the WebUI. The WebUI should use the SPARK_LOCAL_IP not always use > 0.0.0.0 > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
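A hedged sketch of the behaviour being requested, not the eventual patch: resolve the bind address from SPARK_LOCAL_IP and fall back to the wildcard address only when it is unset; the helper name is illustrative.
{code}
// Illustrative helper; the WebUI would hand the result to its HTTP server
// instead of hard-coding 0.0.0.0.
def uiBindHost(env: Map[String, String] = sys.env): String =
  env.get("SPARK_LOCAL_IP").map(_.trim).filter(_.nonEmpty).getOrElse("0.0.0.0")

// e.g. uiBindHost(Map("SPARK_LOCAL_IP" -> "10.0.0.5")) returns "10.0.0.5"
{code}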
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137330#comment-15137330 ] Abhinav Chawade commented on SPARK-13219: - Thanks Xiao. I will pull in the request and see how it performs. > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7889) Jobs progress of apps on complete page of HistoryServer shows uncompleted
[ https://issues.apache.org/jira/browse/SPARK-7889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137336#comment-15137336 ] Apache Spark commented on SPARK-7889: - User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/8 > Jobs progress of apps on complete page of HistoryServer shows uncompleted > - > > Key: SPARK-7889 > URL: https://issues.apache.org/jira/browse/SPARK-7889 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: meiyoula >Priority: Minor > > When running a SparkPi with 2000 tasks, clicking into the app on the incomplete > page shows the job progress as 400/2000. After the app is completed, the app > moves from the incomplete page to the complete page, but clicking into the app > still shows the job progress as 400/2000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137333#comment-15137333 ] Xiao Li commented on SPARK-13219: - Welcome > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13117) WebUI should use the local ip not 0.0.0.0
[ https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137353#comment-15137353 ] Jeremiah Jordan commented on SPARK-13117: - go for it. > WebUI should use the local ip not 0.0.0.0 > - > > Key: SPARK-13117 > URL: https://issues.apache.org/jira/browse/SPARK-13117 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jeremiah Jordan > > When SPARK_LOCAL_IP is set everything seems to correctly bind and use that IP > except the WebUI. The WebUI should use the SPARK_LOCAL_IP not always use > 0.0.0.0 > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12455) Add ExpressionDescription to window functions
[ https://issues.apache.org/jira/browse/SPARK-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-12455. --- Resolution: Resolved Fix Version/s: 2.0.0 > Add ExpressionDescription to window functions > - > > Key: SPARK-12455 > URL: https://issues.apache.org/jira/browse/SPARK-12455 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Herman van Hovell > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137369#comment-15137369 ] Sean Owen commented on SPARK-6305: -- I've started working on this, and it's as awful a dependency mess as you'd imagine. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12986) Fix pydoc warnings in mllib/regression.py
[ https://issues.apache.org/jira/browse/SPARK-12986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-12986. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11025 [https://github.com/apache/spark/pull/11025] > Fix pydoc warnings in mllib/regression.py > - > > Key: SPARK-12986 > URL: https://issues.apache.org/jira/browse/SPARK-12986 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Nam Pham >Priority: Minor > Fix For: 2.0.0 > > > Got those warnings by running "make html" under "python/docs/": > {code} > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.LinearRegressionWithSGD:3: ERROR: Unexpected > indentation. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.LinearRegressionWithSGD:4: WARNING: Block quote ends > without a blank line; unexpected unindent. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.RidgeRegressionWithSGD:3: ERROR: Unexpected > indentation. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.RidgeRegressionWithSGD:4: WARNING: Block quote ends > without a blank line; unexpected unindent. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.LassoWithSGD:3: ERROR: Unexpected indentation. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.LassoWithSGD:4: WARNING: Block quote ends without a > blank line; unexpected unindent. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.IsotonicRegression:7: ERROR: Unexpected indentation. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.IsotonicRegression:12: ERROR: Unexpected indentation. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13234) Remove duplicated SQL metrics
Davies Liu created SPARK-13234: -- Summary: Remove duplicated SQL metrics Key: SPARK-13234 URL: https://issues.apache.org/jira/browse/SPARK-13234 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu For many SQL operators we have metrics for both input and output, but the number of input rows is always exactly the number of output rows of the child, so we could keep only the metrics for output rows. Now that whole-stage codegen has improved performance, the overhead of the SQL metrics is no longer trivial, so we should avoid them where they are not necessary. Some operators do not have SQL metrics; we should add them. For operators that have the same number of input and output rows (for example, Projection), we may not need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8964) Use Exchange in limit operations (per partition limit -> exchange to one partition -> per partition limit)
[ https://issues.apache.org/jira/browse/SPARK-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8964. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 7334 [https://github.com/apache/spark/pull/7334] > Use Exchange in limit operations (per partition limit -> exchange to one > partition -> per partition limit) > -- > > Key: SPARK-8964 > URL: https://issues.apache.org/jira/browse/SPARK-8964 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen > Fix For: 2.0.0 > > > Spark SQL's physical Limit operator currently performs its own shuffle rather > than using Exchange to perform the shuffling. This is less efficient since > this non-exchange shuffle path won't be able to benefit from SQL-specific > shuffling optimizations, such as SQLSerializer2. It also involves additional > unnecessary row copying. > Instead, I think that we should rewrite Limit to expand into three physical > operators: > PerParititonLimit -> Exchange to one partition -> PerPartitionLimit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
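For intuition, an RDD-level analogue of the proposed plan shape (the ticket itself is about physical SQL operators, not the RDD API; the helper below is illustrative): per-partition limit, then an exchange to a single partition, then a final per-partition limit.
{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Mirrors PerPartitionLimit -> Exchange(1 partition) -> PerPartitionLimit.
def limitRows[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] = {
  val perPartition = rdd.mapPartitions(_.take(n)) // limit inside each partition
  val singlePartition = perPartition.repartition(1) // shuffle everything to one partition
  singlePartition.mapPartitions(_.take(n)) // final limit on the merged rows
}
{code}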
[jira] [Commented] (SPARK-13213) BroadcastNestedLoopJoin is very slow
[ https://issues.apache.org/jira/browse/SPARK-13213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137551#comment-15137551 ] Davies Liu commented on SPARK-13213: [~sowen] Thanks very much for updating these. I try to remember to add that, but I may still miss it sometimes. Can we mark that field as required (or remember the last action as the default value)? > BroadcastNestedLoopJoin is very slow > > > Key: SPARK-13213 > URL: https://issues.apache.org/jira/browse/SPARK-13213 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > Since we have improved the performance of CartesianProduct, which should be > faster and more robust than BroadcastNestedLoopJoin, we should use > CartesianProduct instead of BroadcastNestedLoopJoin, especially when the > broadcasted table is not that small. > Today we hit a query that ran for a very long time without finishing; once we > decreased the threshold for broadcast (disabling BroadcastNestedLoopJoin), it > finished in seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
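For anyone hitting the same behaviour, "decrease the threshold for broadcast" maps to the existing spark.sql.autoBroadcastJoinThreshold setting; a value of -1 disables automatic broadcasting altogether, which keeps the planner away from BroadcastNestedLoopJoin (shown for spark-shell, where sqlContext is predefined).
{code}
// Disable automatic broadcast joins so the planner cannot pick
// BroadcastNestedLoopJoin for a not-so-small table.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}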
[jira] [Updated] (SPARK-12585) The numFields of UnsafeRow should not changed by pointTo()
[ https://issues.apache.org/jira/browse/SPARK-12585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12585: --- Component/s: SQL > The numFields of UnsafeRow should not changed by pointTo() > -- > > Key: SPARK-12585 > URL: https://issues.apache.org/jira/browse/SPARK-12585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > Fix For: 2.0.0 > > > Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes > is calculated, making pointTo() a little bit heavy. > It should be part of constructor of UnsafeRow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12840) Support passing arbitrary objects (not just expressions) into code generated classes
[ https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12840: --- Component/s: SQL > Support passing arbitrary objects (not just expressions) into code generated > classes > > > Key: SPARK-12840 > URL: https://issues.apache.org/jira/browse/SPARK-12840 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > As of now, our code generator only allows passing Expression objects into the > generated class as arguments. In order to support whole-stage codegen (e.g. > for broadcast joins), the generated classes need to accept other types of > objects such as hash tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13215) Remove fallback in codegen
[ https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13215: --- Component/s: SQL > Remove fallback in codegen > -- > > Key: SPARK-13215 > URL: https://issues.apache.org/jira/browse/SPARK-13215 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > In newMutableProjection, we fall back to InterpretedMutableProjection if > compilation fails. > Since we removed the configuration for codegen, we rely heavily on codegen > (and TungstenAggregate requires the generated MutableProjection to update > UnsafeRow), so we should remove the fallback, which could confuse users; see > the discussion in SPARK-13116. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated
[ https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137575#comment-15137575 ] Jakob Odersky edited comment on SPARK-13172 at 2/8/16 8:03 PM: --- I would suggest taking similar approach to what the Scala library does: https://github.com/scala/scala/blob/v2.11.7/src/library/scala/runtime/RichException.scala#L16, that is just call mkString on the stack trace. Using e.printStackTrace is not as flexible, it doesn't give you a string and as far as I know it prints to stderr with no option to redirect. was (Author: jodersky): I would suggest taking similar approach to what the Scala library does: https://github.com/scala/scala/blob/v2.11.7/src/library/scala/runtime/RichException.scala#L1, that is just call mkString on the stack trace. Using e.printStackTrace is not as flexible, it doesn't give you a string and as far as I know it prints to stderr with no option to redirect. > Stop using RichException.getStackTrace it is deprecated > --- > > Key: SPARK-13172 > URL: https://issues.apache.org/jira/browse/SPARK-13172 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > Throwable getStackTrace is the recommended alternative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
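The suggestion above in code form, as a small sketch: build the string from Throwable.getStackTrace directly, which keeps the output redirectable, unlike printStackTrace; the helper name is illustrative.
{code}
// Format a stack trace as a single string without the deprecated
// RichException helper and without printing to stderr.
def stackTraceString(t: Throwable): String =
  t.getStackTrace.mkString("\n")

// usage
val trace = stackTraceString(new RuntimeException("boom"))
println(trace)
{code}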
[jira] [Commented] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated
[ https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137575#comment-15137575 ] Jakob Odersky commented on SPARK-13172: --- I would suggest taking similar approach to what the Scala library does: https://github.com/scala/scala/blob/v2.11.7/src/library/scala/runtime/RichException.scala#L1, that is just call mkString on the stack trace. Using e.printStackTrace is not as flexible, it doesn't give you a string and as far as I know it prints to stderr with no option to redirect. > Stop using RichException.getStackTrace it is deprecated > --- > > Key: SPARK-13172 > URL: https://issues.apache.org/jira/browse/SPARK-13172 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > Throwable getStackTrace is the recommended alternative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch
[ https://issues.apache.org/jira/browse/SPARK-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13101: - Fix Version/s: 1.6.1 > Dataset complex types mapping to DataFrame (element nullability) mismatch > -- > > Key: SPARK-13101 > URL: https://issues.apache.org/jira/browse/SPARK-13101 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Deenar Toraskar >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 1.6.1, 2.0.0 > > > There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By > default a scala {{Seq\[Double\]}} is mapped by Spark as an ArrayType with > nullable element > {noformat} > |-- valuations: array (nullable = true) > ||-- element: double (containsNull = true) > {noformat} > This could be read back to as a Dataset in Spark 1.6.0 > {code} > val df = sqlContext.table("valuations").as[Valuation] > {code} > But with Spark 1.6.1 the same fails with > {code} > val df = sqlContext.table("valuations").as[Valuation] > org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as > array)' due to data type mismatch: cannot cast > ArrayType(DoubleType,true) to ArrayType(DoubleType,false); > {code} > Here's the classes I am using > {code} > case class Valuation(tradeId : String, > counterparty: String, > nettingAgreement: String, > wrongWay: Boolean, > valuations : Seq[Double], /* one per scenario */ > timeInterval: Int, > jobId: String) /* used for hdfs partitioning */ > val vals : Seq[Valuation] = Seq() > val valsDF = sqlContext.sparkContext.parallelize(vals).toDF > valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations") > {code} > even the following gives the same result > {code} > val valsDF = vals.toDS.toDF > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch
[ https://issues.apache.org/jira/browse/SPARK-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13101. -- Resolution: Fixed Fix Version/s: (was: 1.6.1) 2.0.0 Issue resolved by pull request 11035 [https://github.com/apache/spark/pull/11035] > Dataset complex types mapping to DataFrame (element nullability) mismatch > -- > > Key: SPARK-13101 > URL: https://issues.apache.org/jira/browse/SPARK-13101 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Deenar Toraskar >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 2.0.0 > > > There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By > default a scala {{Seq\[Double\]}} is mapped by Spark as an ArrayType with > nullable element > {noformat} > |-- valuations: array (nullable = true) > ||-- element: double (containsNull = true) > {noformat} > This could be read back to as a Dataset in Spark 1.6.0 > {code} > val df = sqlContext.table("valuations").as[Valuation] > {code} > But with Spark 1.6.1 the same fails with > {code} > val df = sqlContext.table("valuations").as[Valuation] > org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as > array)' due to data type mismatch: cannot cast > ArrayType(DoubleType,true) to ArrayType(DoubleType,false); > {code} > Here's the classes I am using > {code} > case class Valuation(tradeId : String, > counterparty: String, > nettingAgreement: String, > wrongWay: Boolean, > valuations : Seq[Double], /* one per scenario */ > timeInterval: Int, > jobId: String) /* used for hdfs partitioning */ > val vals : Seq[Valuation] = Seq() > val valsDF = sqlContext.sparkContext.parallelize(vals).toDF > valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations") > {code} > even the following gives the same result > {code} > val valsDF = vals.toDS.toDF > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13210) NPE in Sort
[ https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13210. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11095 [https://github.com/apache/spark/pull/11095] > NPE in Sort > --- > > Key: SPARK-13210 > URL: https://issues.apache.org/jira/browse/SPARK-13210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 2.0.0 > > > When run TPCDS query Q78 with scale 10: > {code} > 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = > 268435456 bytes, TID = 143 > 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID > 143) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45) > at org.apache.spark.scheduler.Task.run(Task.scala:81) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13210) NPE in Sort
[ https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137598#comment-15137598 ] Josh Rosen commented on SPARK-13210: I'm also going to cherry-pick this for 1.6.1. > NPE in Sort > --- > > Key: SPARK-13210 > URL: https://issues.apache.org/jira/browse/SPARK-13210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 2.0.0 > > > When run TPCDS query Q78 with scale 10: > {code} > 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = > 268435456 bytes, TID = 143 > 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID > 143) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45) > at org.apache.spark.scheduler.Task.run(Task.scala:81) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10780) Set initialModel in KMeans in Pipelines API
[ https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137607#comment-15137607 ] Apache Spark commented on SPARK-10780: -- User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/9 > Set initialModel in KMeans in Pipelines API > --- > > Key: SPARK-10780 > URL: https://issues.apache.org/jira/browse/SPARK-10780 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This is for the Scala version. After this is merged, create a JIRA for > Python version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13235) Remove extra Distinct in Union Distinct
Xiao Li created SPARK-13235: --- Summary: Remove extra Distinct in Union Distinct Key: SPARK-13235 URL: https://issues.apache.org/jira/browse/SPARK-13235 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li Union Distinct has two Distinct that generates two Aggregation in the plan. {code} sql("select * from t0 union select * from t0").explain(true) {code} {code} == Parsed Logical Plan == 'Project [unresolvedalias(*,None)] +- 'Subquery u_2 +- 'Distinct +- 'Project [unresolvedalias(*,None)] +- 'Subquery u_1 +- 'Distinct +- 'Union :- 'Project [unresolvedalias(*,None)] : +- 'UnresolvedRelation `t0`, None +- 'Project [unresolvedalias(*,None)] +- 'UnresolvedRelation `t0`, None == Analyzed Logical Plan == id: bigint Project [id#16L] +- Subquery u_2 +- Distinct +- Project [id#16L] +- Subquery u_1 +- Distinct +- Union :- Project [id#16L] : +- Subquery t0 : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Subquery t0 +- Relation[id#16L] ParquetRelation == Optimized Logical Plan == Aggregate [id#16L], [id#16L] +- Aggregate [id#16L], [id#16L] +- Union :- Project [id#16L] : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Relation[id#16L] ParquetRelation {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
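One way to picture the fix, as a hedged sketch rather than the actual change: an optimizer rule that drops the redundant inner Distinct before Distinct is rewritten into Aggregate (the real fix may equally live in the parser or in the aggregate rewrite; the rule name is illustrative).
{code}
import org.apache.spark.sql.catalyst.plans.logical.{Distinct, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Collapses directly adjacent Distinct operators; the intervening
// Project/Subquery nodes seen in the plan above would need to be
// eliminated first for this pattern to match.
object CollapseAdjacentDistinct extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // DISTINCT of DISTINCT is the same relation, so keep only one.
    case Distinct(Distinct(child)) => Distinct(child)
  }
}
{code}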
[jira] [Updated] (SPARK-10561) Provide tooling for auto-generating Spark SQL reference manual
[ https://issues.apache.org/jira/browse/SPARK-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-10561: --- Description: Here is the discussion thread: http://search-hadoop.com/m/q3RTtcD20F1o62xE Richard Hillegas made the following suggestion: A machine-generated BNF, however, is easy to imagine. But perhaps not so easy to implement. Spark's SQL grammar is implemented in Scala, extending the DSL support provided by the Scala language. I am new to programming in Scala, so I don't know whether the Scala ecosystem provides any good tools for reverse-engineering a BNF from a class which extends scala.util.parsing.combinator.syntactical.StandardTokenParsers. was: Here is the discussion thread: http://search-hadoop.com/m/q3RTtcD20F1o62xE Richard Hillegas made the following suggestion: A machine-generated BNF, however, is easy to imagine. But perhaps not so easy to implement. Spark's SQL grammar is implemented in Scala, extending the DSL support provided by the Scala language. I am new to programming in Scala, so I don't know whether the Scala ecosystem provides any good tools for reverse-engineering a BNF from a class which extends scala.util.parsing.combinator.syntactical.StandardTokenParsers. > Provide tooling for auto-generating Spark SQL reference manual > -- > > Key: SPARK-10561 > URL: https://issues.apache.org/jira/browse/SPARK-10561 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Reporter: Ted Yu > > Here is the discussion thread: > http://search-hadoop.com/m/q3RTtcD20F1o62xE > Richard Hillegas made the following suggestion: > A machine-generated BNF, however, is easy to imagine. But perhaps not so easy > to implement. Spark's SQL grammar is implemented in Scala, extending the DSL > support provided by the Scala language. I am new to programming in Scala, so > I don't know whether the Scala ecosystem provides any good tools for > reverse-engineering a BNF from a class which extends > scala.util.parsing.combinator.syntactical.StandardTokenParsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13180) Protect against SessionState being null when accessing HiveClientImpl#conf
[ https://issues.apache.org/jira/browse/SPARK-13180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137674#comment-15137674 ] Ted Yu commented on SPARK-13180: I wonder if we should provide better error message when NPE happens - the cause may be mixed dependencies. See last response on the thread. > Protect against SessionState being null when accessing HiveClientImpl#conf > -- > > Key: SPARK-13180 > URL: https://issues.apache.org/jira/browse/SPARK-13180 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ted Yu >Priority: Minor > Attachments: spark-13180-util.patch > > > See this thread http://search-hadoop.com/m/q3RTtFoTDi2HVCrM1 > {code} > java.lang.NullPointerException > at > org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205) > at > org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552) > at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551) > at > org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538) > at > org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:457) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:457) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:456) > at org.apache.spark.sql.hive.HiveContext$$anon$3.(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:473) > at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:472) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:133) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at > org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
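A hedged sketch of the "better error message" idea (illustrative, not the actual ClientWrapper/HiveClientImpl code): guard the SessionState lookup and point the user at the likely cause instead of letting a bare NullPointerException escape; the message wording is an assumption.
{code}
// Uses only public Hive APIs; the helper and message are illustrative.
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

def sessionConf(): HiveConf = {
  val state = SessionState.get()
  require(state != null,
    "Hive SessionState is null when looking up the client conf; this often " +
      "indicates mixed or incompatible Hive client dependencies on the classpath.")
  state.getConf
}
{code}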
[jira] [Assigned] (SPARK-13235) Remove extra Distinct in Union Distinct
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13235: Assignee: (was: Apache Spark) > Remove extra Distinct in Union Distinct > --- > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Union Distinct has two Distinct that generates two Aggregation in the plan. > {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13235) Remove extra Distinct in Union Distinct
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137682#comment-15137682 ] Apache Spark commented on SPARK-13235: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/11120 > Remove extra Distinct in Union Distinct > --- > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Union Distinct has two Distinct that generates two Aggregation in the plan. > {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13235) Remove extra Distinct in Union Distinct
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13235: Assignee: Apache Spark > Remove extra Distinct in Union Distinct > --- > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Union Distinct has two Distinct that generates two Aggregation in the plan. > {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13235) Remove extra Distinct in Union
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13235: Summary: Remove extra Distinct in Union (was: Remove extra Distinct in Union Distinct) > Remove extra Distinct in Union > -- > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Union Distinct has two Distinct that generates two Aggregation in the plan. > {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13235) Remove an extra Distinct in Union
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13235: Summary: Remove an extra Distinct in Union (was: Remove extra Distinct in Union) > Remove an extra Distinct in Union > - > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Union Distinct has two Distinct that generates two Aggregation in the plan. > {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13235) Remove an Extra Distinct in Union
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13235: Summary: Remove an Extra Distinct in Union (was: Remove an extra Distinct in Union) > Remove an Extra Distinct in Union > - > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Union Distinct has two Distinct that generates two Aggregation in the plan. > {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13235) Remove an Extra Distinct in Union
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13235: Description: Union Distinct has two Distinct that generate two Aggregation in the plan. {code} sql("select * from t0 union select * from t0").explain(true) {code} {code} == Parsed Logical Plan == 'Project [unresolvedalias(*,None)] +- 'Subquery u_2 +- 'Distinct +- 'Project [unresolvedalias(*,None)] +- 'Subquery u_1 +- 'Distinct +- 'Union :- 'Project [unresolvedalias(*,None)] : +- 'UnresolvedRelation `t0`, None +- 'Project [unresolvedalias(*,None)] +- 'UnresolvedRelation `t0`, None == Analyzed Logical Plan == id: bigint Project [id#16L] +- Subquery u_2 +- Distinct +- Project [id#16L] +- Subquery u_1 +- Distinct +- Union :- Project [id#16L] : +- Subquery t0 : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Subquery t0 +- Relation[id#16L] ParquetRelation == Optimized Logical Plan == Aggregate [id#16L], [id#16L] +- Aggregate [id#16L], [id#16L] +- Union :- Project [id#16L] : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Relation[id#16L] ParquetRelation {code} was: Union Distinct has two Distinct that generates two Aggregation in the plan. {code} sql("select * from t0 union select * from t0").explain(true) {code} {code} == Parsed Logical Plan == 'Project [unresolvedalias(*,None)] +- 'Subquery u_2 +- 'Distinct +- 'Project [unresolvedalias(*,None)] +- 'Subquery u_1 +- 'Distinct +- 'Union :- 'Project [unresolvedalias(*,None)] : +- 'UnresolvedRelation `t0`, None +- 'Project [unresolvedalias(*,None)] +- 'UnresolvedRelation `t0`, None == Analyzed Logical Plan == id: bigint Project [id#16L] +- Subquery u_2 +- Distinct +- Project [id#16L] +- Subquery u_1 +- Distinct +- Union :- Project [id#16L] : +- Subquery t0 : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Subquery t0 +- Relation[id#16L] ParquetRelation == Optimized Logical Plan == Aggregate [id#16L], [id#16L] +- Aggregate [id#16L], [id#16L] +- Union :- Project [id#16L] : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Relation[id#16L] ParquetRelation {code} > Remove an Extra Distinct in Union > - > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Union Distinct has two Distinct that generate two Aggregation in the plan. 
> {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
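As an illustration of the general idea (not necessarily the approach taken in the actual patch), a Catalyst-style rewrite that drops the redundant inner Distinct could be sketched as:
{code}
// Hedged sketch, not the actual Spark change: a rule that removes a
// Distinct sitting directly on top of another Distinct, since the inner
// one already eliminates duplicates. With a single Distinct left, the
// optimizer would plan one Aggregate instead of two.
import org.apache.spark.sql.catalyst.plans.logical.{Distinct, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object CollapseAdjacentDistinct extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Distinct(Distinct(child)) => Distinct(child)
  }
}
{code}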
[jira] [Commented] (SPARK-13171) Update promise & future to Promise and Future as the old ones are deprecated
[ https://issues.apache.org/jira/browse/SPARK-13171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137702#comment-15137702 ] Shixiong Zhu commented on SPARK-13171: -- It looks like something went wrong between: [info] 2016-02-06 08:42:00.219 - stderr> found org.apache.hadoop#hadoop-mapreduce-client-app;2.3.0 in list [info] 2016-02-06 08:46:13.188 - stderr> found org.apache.hadoop#hadoop-mapreduce-client-common;2.3.0 in central It took 4 minutes. > Update promise & future to Promise and Future as the old ones are deprecated > > > Key: SPARK-13171 > URL: https://issues.apache.org/jira/browse/SPARK-13171 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Assignee: Jakob Odersky >Priority: Trivial > Fix For: 2.0.0 > > > We use the promise and future functions on the concurrent object, both of > which have been deprecated in 2.11 . The full traits are present in Scala > 2.10 as well so this should be a safe migration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137701#comment-15137701 ] Abhinav Chawade commented on SPARK-13219: - I created a build of Spark 1.4.1 which incorporates your patch but somehow predicates are still not being propagated. The set of steps I followed 1) Build Spark 1.4.1 with patch incorporated. 2) Replace spark-catalyst jar on all nodes. 3) Run explain on following command in spark-sql. Notice the query plan. {code} spark-sql> explain select t.assetid from tenants t inner join assets on t.assetid = assets.assetid where t.assetid=1201; == Physical Plan == Project [assetid#18] ShuffledHashJoin [assetid#18], [assetid#20], BuildRight Exchange (HashPartitioning 200) Filter (assetid#18 = 1201) HiveTableScan [assetid#18], (MetastoreRelation element22082, tenants, Some(t)), None Exchange (HashPartitioning 200) HiveTableScan [assetid#20], (MetastoreRelation element22082, assets, None), None Time taken: 2.741 seconds, Fetched 8 row(s) {code} > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
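For reference, the manual workaround mentioned in the description amounts to repeating the predicate for both tables, as in the sketch below; table and column names follow the ticket's example schema, and sqlContext is assumed to be the HiveContext available in spark-sql or spark-shell:
{code}
// Hedged sketch of the workaround described above: repeat the equality
// predicate for each joined table so both HiveTableScans get a pushed-down
// Filter instead of one side doing a full scan.
// sqlContext (a HiveContext) is assumed to be in scope here.
val plan = sqlContext.sql(
  """SELECT t.assetid
    |FROM tenants t
    |JOIN assets a ON t.assetid = a.assetid
    |WHERE t.assetid = 1201
    |  AND a.assetid = 1201
  """.stripMargin)
plan.explain(true)
{code}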
[jira] [Commented] (SPARK-12505) Pushdown a Limit on top of an Outer-Join
[ https://issues.apache.org/jira/browse/SPARK-12505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137710#comment-15137710 ] Apache Spark commented on SPARK-12505: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11121 > Pushdown a Limit on top of an Outer-Join > > > Key: SPARK-12505 > URL: https://issues.apache.org/jira/browse/SPARK-12505 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Xiao Li > > "Rule that applies to a Limit on top of an OUTER Join. The original Limit > won't go away after applying this rule, but additional Limit node(s) will be > created on top of the outer-side child (or children if it's a FULL OUTER > Join). " > – from https://issues.apache.org/jira/browse/CALCITE-832 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
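A rough sketch of the quoted rule, using Spark 1.x Catalyst node names purely for illustration (this is not the actual Spark change in the linked pull request):
{code}
// Hedged sketch of the idea quoted from CALCITE-832: the original Limit
// stays, and an extra Limit is placed on the outer (here: left) side of a
// LEFT OUTER join so that side produces fewer rows before the join.
import org.apache.spark.sql.catalyst.plans.LeftOuter
import org.apache.spark.sql.catalyst.plans.logical.{Join, Limit, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object PushLimitThroughLeftOuterJoin extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Guard so the rewrite is not re-applied on later optimizer passes.
    case Limit(limitExpr, join @ Join(left, _, LeftOuter, _))
        if !left.isInstanceOf[Limit] =>
      // Only the outer side can be limited safely; the inner side cannot.
      Limit(limitExpr, join.copy(left = Limit(limitExpr, left)))
  }
}
{code}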
[jira] [Commented] (SPARK-12503) Pushdown a Limit on top of a Union
[ https://issues.apache.org/jira/browse/SPARK-12503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137709#comment-15137709 ] Apache Spark commented on SPARK-12503: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11121 > Pushdown a Limit on top of a Union > -- > > Key: SPARK-12503 > URL: https://issues.apache.org/jira/browse/SPARK-12503 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Xiao Li > > "Rule that applies to a Limit on top of a Union. The original Limit won't go > away after applying this rule, but additional Limit nodes will be created on > top of each child of Union, so that these children produce less rows and > Limit can be further optimized for children Relations." > -- from https://issues.apache.org/jira/browse/CALCITE-832 > Also, the same topic in Hive: https://issues.apache.org/jira/browse/HIVE-11775 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13171) Update promise & future to Promise and Future as the old ones are deprecated
[ https://issues.apache.org/jira/browse/SPARK-13171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137714#comment-15137714 ] holdenk commented on SPARK-13171: - I've been seeing that intermittently for a while in my own PR builds. > Update promise & future to Promise and Future as the old ones are deprecated > > > Key: SPARK-13171 > URL: https://issues.apache.org/jira/browse/SPARK-13171 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Assignee: Jakob Odersky >Priority: Trivial > Fix For: 2.0.0 > > > We use the promise and future functions on the concurrent object, both of > which have been deprecated in 2.11 . The full traits are present in Scala > 2.10 as well so this should be a safe migration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137721#comment-15137721 ] Xiao Li commented on SPARK-13219: - Let me try your SQL query in Spark 1.6.1. > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13236) SQL generation support for union
Xiao Li created SPARK-13236: --- Summary: SQL generation support for union Key: SPARK-13236 URL: https://issues.apache.org/jira/browse/SPARK-13236 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li checkHiveQl("SELECT * FROM t0 UNION SELECT * FROM t0") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137771#comment-15137771 ] Mark Grover commented on SPARK-12177: - Hi Rama, This particular PR adds support for the new API. There is some small code for SSL support in it too but I haven't invested much time in testing that, apart from the simple unit test that was written for it. Kerberos (SASL) will have to be done incrementally in another patch because it can't be done until Kafka supports delegation tokens (which is still not there yet: https://issues.apache.org/jira/browse/KAFKA-1696) > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released and it introduces a new consumer API that is not > compatible with the old one. So, I added the new consumer API. I made separate > classes in package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for more backward compatibility. Users will not need > to change their old Spark applications when they upgrade to the new Spark version. > Please review my changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
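For context, the new (Kafka 0.9) consumer API targeted by the PR looks roughly like the following when used standalone, outside Spark; the broker address, group id, and topic below are placeholders, and the SSL/SASL configuration discussed above is omitted:
{code}
// Hedged sketch of the Kafka 0.9 "new" consumer API, shown standalone.
// Config values are placeholders.
import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "example-group")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Arrays.asList("example-topic"))
val records = consumer.poll(1000L) // ConsumerRecords[String, String]
consumer.close()
{code}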
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137770#comment-15137770 ] Abhinav Chawade commented on SPARK-13219: - Here is my branch on github if you'd like to take a look. https://github.com/drnushooz/spark/tree/v1.4.1-SPARK-13219 > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12194) Add Sink for reporting Spark Metrics to OpenTSDB
[ https://issues.apache.org/jira/browse/SPARK-12194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137774#comment-15137774 ] Matt Kapilevich commented on SPARK-12194: - I am also looking to capture Spark metrics into OpenTSDB. FWIW, I've reviewed the PR, and it looks good to me. Can one of the committers please see if this patch can be merged? > Add Sink for reporting Spark Metrics to OpenTSDB > > > Key: SPARK-12194 > URL: https://issues.apache.org/jira/browse/SPARK-12194 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Kapil Singh > > Add OpenTSDB Sink to the currently supported metric sinks. Since OpenTSDB is > a popular open-source Time Series Database (based on HBase), this will make > it convenient for those who want metrics data for time series analysis > purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13171) Update promise & future to Promise and Future as the old ones are deprecated
[ https://issues.apache.org/jira/browse/SPARK-13171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137795#comment-15137795 ] Jakob Odersky commented on SPARK-13171: --- This is very strange, are you sure it has something to do with the changes introduced by my PR? As mentioned previously, the only effective change between future() and Future.apply() is one less indirection. The only potentially visible changes would be for code that relies on reflection or does some macro magic. > Update promise & future to Promise and Future as the old ones are deprecated > > > Key: SPARK-13171 > URL: https://issues.apache.org/jira/browse/SPARK-13171 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Assignee: Jakob Odersky >Priority: Trivial > Fix For: 2.0.0 > > > We use the promise and future functions on the concurrent object, both of > which have been deprecated in 2.11 . The full traits are present in Scala > 2.10 as well so this should be a safe migration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
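For concreteness, the whole migration amounts to swapping the deprecated lower-case helpers for their companion-object equivalents, e.g. (minimal sketch):
{code}
// Minimal sketch of the migration this ticket covers: the deprecated
// lower-case helpers on scala.concurrent are replaced by the companion
// objects they delegate to, so runtime behavior is unchanged.
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global

// Before (deprecated in Scala 2.11):
//   val f = future { 1 + 1 }
//   val p = promise[Int]()

// After:
val f: Future[Int] = Future { 1 + 1 }
val p: Promise[Int] = Promise[Int]()
{code}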
[jira] [Commented] (SPARK-13216) Spark streaming application not honoring --num-executors in restarting of an application from a checkpoint
[ https://issues.apache.org/jira/browse/SPARK-13216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137797#comment-15137797 ] Hari Shreedharan commented on SPARK-13216: -- I disagree that checkpointing is only for failed applications. For any of the receiver-based streaming applications, checkpoints are important to recover as yet unprocessed data. If the application cannot be reloaded from a checkpoint - then the old data is pretty much gone. I know that checkpointing basically makes application and spark upgrades difficult or impossible, but there are configuration parameters that the users might want to change based on load requirements etc. I don't see a reason why we should not allow this, since it has nothing to do with starting the app from checkpoint or not - if we want the number of executors to change we should be able to. This is especially true when migrating from a non-dynamic allocation situation to a dynamic allocation situation. > Spark streaming application not honoring --num-executors in restarting of an > application from a checkpoint > -- > > Key: SPARK-13216 > URL: https://issues.apache.org/jira/browse/SPARK-13216 > Project: Spark > Issue Type: Bug > Components: Spark Submit, Streaming >Affects Versions: 1.5.0 >Reporter: Neelesh Srinivas Salian >Priority: Minor > Labels: Streaming > > Scenario to help understand: > 1) The Spark streaming job with 12 executors was initiated with checkpointing > enabled. > 2) In version 1.3, the user was able to append the number of executors to 20 > using --num-executors but was unable to do so in version 1.5. > In 1.5, the spark application still runs with 13 executors (1 for driver and > 12 executors). > There is a need to start from the checkpoint itself and not restart the > application to avoid the loss of information. > 3) Checked the code in 1.3 and 1.5, which shows the command > ''--num-executors" has been deprecated. > Any thoughts on this? Not sure if anyone hit this one specifically before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13095) improve performance of hash join with dimension table
[ https://issues.apache.org/jira/browse/SPARK-13095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13095. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11065 [https://github.com/apache/spark/pull/11065] > improve performance of hash join with dimension table > - > > Key: SPARK-13095 > URL: https://issues.apache.org/jira/browse/SPARK-13095 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > The join key is usually an integer or long (primary key, unique), we could > have special HashRelation for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
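The kind of specialization hinted at in the description can be sketched as follows; this is only an illustration of the idea of a primitive-long keyed table, not Spark's actual HashedRelation code:
{code}
// Hedged sketch of the idea only: when the join key is a unique primitive
// long (typical for a dimension table's primary key), an open-addressing
// table keyed by raw longs avoids boxing and generic hashCode/equals on
// every probe. No resizing; fixed power-of-two capacity for brevity.
final class LongKeyedTable[V <: AnyRef : scala.reflect.ClassTag](capacity: Int) {
  require((capacity & (capacity - 1)) == 0, "capacity must be a power of two")
  private val mask = capacity - 1
  private val keys = new Array[Long](capacity)
  private val values = new Array[V](capacity) // null slot means empty

  def put(key: Long, value: V): Unit = {
    var i = key.hashCode & mask
    while (values(i) != null && keys(i) != key) i = (i + 1) & mask
    keys(i) = key
    values(i) = value
  }

  /** Returns null if the key is absent. */
  def get(key: Long): V = {
    var i = key.hashCode & mask
    while (values(i) != null && keys(i) != key) i = (i + 1) & mask
    values(i)
  }
}
{code}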
[jira] [Commented] (SPARK-13027) Add API for updateStateByKey to provide batch time as input
[ https://issues.apache.org/jira/browse/SPARK-13027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137858#comment-15137858 ] Apache Spark commented on SPARK-13027: -- User 'aramesh117' has created a pull request for this issue: https://github.com/apache/spark/pull/11122 > Add API for updateStateByKey to provide batch time as input > --- > > Key: SPARK-13027 > URL: https://issues.apache.org/jira/browse/SPARK-13027 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Aaditya Ramesh > > The StateDStream currently does not provide the batch time as input to the > state update function. This is required in cases where the behavior depends > on the batch start time. > We (Conviva) have been patching it manually for the past several Spark > versions but we thought it might be useful for others as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
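To make the request concrete: the per-key update function currently receives only the new values and the previous state, and the ticket asks for a variant that also receives the batch time. The second function shape below is hypothetical, sketching the proposed API rather than an existing Spark signature:
{code}
// Sketch only. The first function shape is what updateStateByKey accepts
// today; the second, with the batch Time added, is the hypothetical shape
// this ticket asks for (not an existing Spark API).
import org.apache.spark.streaming.Time

// Today: (new values for the key, previous state) => new state
val updateWithoutTime: (Seq[Long], Option[Long]) => Option[Long] =
  (newValues, state) => Some(state.getOrElse(0L) + newValues.sum)

// Proposed: also pass the batch time, so the update can depend on it
val updateWithTime: (Time, Seq[Long], Option[Long]) => Option[Long] =
  (batchTime, newValues, state) =>
    Some(state.getOrElse(0L) + newValues.sum) // could branch on batchTime here
{code}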
[jira] [Assigned] (SPARK-13027) Add API for updateStateByKey to provide batch time as input
[ https://issues.apache.org/jira/browse/SPARK-13027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13027: Assignee: (was: Apache Spark) > Add API for updateStateByKey to provide batch time as input > --- > > Key: SPARK-13027 > URL: https://issues.apache.org/jira/browse/SPARK-13027 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Aaditya Ramesh > > The StateDStream currently does not provide the batch time as input to the > state update function. This is required in cases where the behavior depends > on the batch start time. > We (Conviva) have been patching it manually for the past several Spark > versions but we thought it might be useful for others as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13027) Add API for updateStateByKey to provide batch time as input
[ https://issues.apache.org/jira/browse/SPARK-13027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13027: Assignee: Apache Spark > Add API for updateStateByKey to provide batch time as input > --- > > Key: SPARK-13027 > URL: https://issues.apache.org/jira/browse/SPARK-13027 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Aaditya Ramesh >Assignee: Apache Spark > > The StateDStream currently does not provide the batch time as input to the > state update function. This is required in cases where the behavior depends > on the batch start time. > We (Conviva) have been patching it manually for the past several Spark > versions but we thought it might be useful for others as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
[ https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137924#comment-15137924 ] Amir Gur commented on SPARK-10528: -- Thanks, confirming it worked on win8.1 64 bit. > spark-shell throws java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. > -- > > Key: SPARK-10528 > URL: https://issues.apache.org/jira/browse/SPARK-10528 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.0 > Environment: Windows 7 x64 >Reporter: Aliaksei Belablotski >Priority: Minor > > Starting spark-shell throws > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13014: Assignee: (was: Apache Spark) > Replace example code in mllib-collaborative-filtering.md using include_example > -- > > Key: SPARK-13014 > URL: https://issues.apache.org/jira/browse/SPARK-13014 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137982#comment-15137982 ] Apache Spark commented on SPARK-13014: -- User 'keypointt' has created a pull request for this issue: https://github.com/apache/spark/pull/11123 > Replace example code in mllib-collaborative-filtering.md using include_example > -- > > Key: SPARK-13014 > URL: https://issues.apache.org/jira/browse/SPARK-13014 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13014: Assignee: Apache Spark > Replace example code in mllib-collaborative-filtering.md using include_example > -- > > Key: SPARK-13014 > URL: https://issues.apache.org/jira/browse/SPARK-13014 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Apache Spark >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13236) SQL generation support for union
[ https://issues.apache.org/jira/browse/SPARK-13236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137988#comment-15137988 ] Xiao Li commented on SPARK-13236: - After the merge of Spark-13235, I will upload a PR for this. Thanks! > SQL generation support for union > > > Key: SPARK-13236 > URL: https://issues.apache.org/jira/browse/SPARK-13236 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > checkHiveQl("SELECT * FROM t0 UNION SELECT * FROM t0") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13018) Replace example code in mllib-pmml-model-export.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137996#comment-15137996 ] Xin Ren commented on SPARK-13018: - I'm working on this one, thanks :) > Replace example code in mllib-pmml-model-export.md using include_example > > > Key: SPARK-13018 > URL: https://issues.apache.org/jira/browse/SPARK-13018 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138024#comment-15138024 ] Evan Chan commented on SPARK-13219: --- [~smilegator] does your PR take care of the case where no JOIN clause is invoked? does it also take care of multiple join conditions? (e.g., select from a a, b b, c c where a.col1 = b.col1 && b.col1 = c.col1 && ) > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org