[jira] [Commented] (SPARK-11714) Make Spark on Mesos honor port restrictions
[ https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136747#comment-15136747 ] Stavros Kontopoulos commented on SPARK-11714: - [~andrewor14] Would it be meaningful to move the code to the coarse-grained backend, since fine-grained mode is now deprecated? > Make Spark on Mesos honor port restrictions > --- > > Key: SPARK-11714 > URL: https://issues.apache.org/jira/browse/SPARK-11714 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Charles Allen > > Currently the MesosSchedulerBackend does not make any effort to honor the "ports" > resource in Mesos offers. The request is that the ports the > executor binds to stay within the limits of the offer's "ports" resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
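For context on what honoring the "ports" resource would involve: each Mesos offer carries a "ports" resource expressed as ranges, and the scheduler backend would need to read those ranges and keep executor ports inside them. A minimal Scala sketch against the Mesos protobuf Java API; the helper name offeredPortRanges is hypothetical and not existing Spark code.

{code:scala}
import scala.collection.JavaConverters._
import org.apache.mesos.Protos.Offer

// Hypothetical helper: collect the (begin, end) port ranges advertised in an
// offer's "ports" resource, so executor launch code could pick ports that
// stay inside the offered ranges.
def offeredPortRanges(offer: Offer): Seq[(Long, Long)] = {
  offer.getResourcesList.asScala
    .filter(_.getName == "ports")
    .flatMap(_.getRanges.getRangeList.asScala.map(r => (r.getBegin, r.getEnd)))
}
{code}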
[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136757#comment-15136757 ] Herman Schistad commented on SPARK-13198: - Hi [~srowen], thanks for your reply. I have indeed tried to look at the program using a profiler and I've attached two screenshots from jvisualvm connected to the driver JMX interface. You can see that the "Old Gen" space is completely full. You see that dip at 09:30:00? That's me triggering a manual GC. It might be unusual to do this, but in any case (given the existence of sc.stop()) it should work right? My use case is having X number of different parquet directories which need to be loaded and analysed linearly, as part of a generic platform where users are able to upload data and apply daily/hourly aggregations on them. I've also seen people starting and stopping contexts quite frequently when doing unit tests etc. Using G1 garbage collection doesn't seem to affect the end result either. I'm also attaching a GC log in it's raw format. You can see it's trying to do a full GC at multiple times during the execution of the program. Thanks again Sean. > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman Schistad updated SPARK-13198: Attachment: Screen Shot 2016-02-08 at 09.30.59.png Screen Shot 2016-02-08 at 09.31.10.png gc.log > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > gc.log > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136757#comment-15136757 ] Herman Schistad edited comment on SPARK-13198 at 2/8/16 9:55 AM: - Hi [~srowen], thanks for your reply. I have indeed tried to look at the program using a profiler and I've attached two screenshots ([one|^Screen Shot 2016-02-08 at 09.31.10.png] and [two|^Screen Shot 2016-02-08 at 09.30.59.png]) from jvisualvm connected to the driver JMX interface. You can see that the "Old Gen" space is completely full. You see that dip at 09:30:00? That's me triggering a manual GC. It might be unusual to do this, but in any case (given the existence of sc.stop()) it should work right? My use case is having X number of different parquet directories which need to be loaded and analysed linearly, as part of a generic platform where users are able to upload data and apply daily/hourly aggregations on them. I've also seen people starting and stopping contexts quite frequently when doing unit tests etc. Using G1 garbage collection doesn't seem to affect the end result either. I'm also attaching a [GC log|^gc.log] in it's raw format. You can see it's trying to do a full GC at multiple times during the execution of the program. Thanks again Sean. was (Author: hermansc): Hi [~srowen], thanks for your reply. I have indeed tried to look at the program using a profiler and I've attached two screenshots from jvisualvm connected to the driver JMX interface. You can see that the "Old Gen" space is completely full. You see that dip at 09:30:00? That's me triggering a manual GC. It might be unusual to do this, but in any case (given the existence of sc.stop()) it should work right? My use case is having X number of different parquet directories which need to be loaded and analysed linearly, as part of a generic platform where users are able to upload data and apply daily/hourly aggregations on them. I've also seen people starting and stopping contexts quite frequently when doing unit tests etc. Using G1 garbage collection doesn't seem to affect the end result either. I'm also attaching a GC log in it's raw format. You can see it's trying to do a full GC at multiple times during the execution of the program. Thanks again Sean. > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > gc.log > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. 
The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsub
[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136773#comment-15136773 ] Sean Owen commented on SPARK-13198: --- I don't think stop() is relevant here. There's not an active attempt to free up resources once the app is done. It's assumed the driver JVM is shutting down. Yes, the question was whether it had tried to do a full GC, and sounds like it has done, OK. Still if you're just finding there is a bunch of left over bookkeeping info for executors, probably from all the old contexts, I think that's "normal" or at least "not a problem as Spark is intended to be used" > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > Screen Shot 2016-02-08 at 10.03.04.png, gc.log > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
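Given the comment above that the leftover per-context bookkeeping is expected, one way to keep this workload within "Spark as intended" is to create a single long-lived context and loop over the Parquet directories with it. A sketch of the reproduction code restructured that way; "MASTER_URL" and the HDFS path are the placeholders from the original snippet.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// One SparkContext for the whole run, reused for every Parquet directory,
// instead of creating and stopping a context per iteration.
def main(args: Array[String]) {
  val conf = new SparkConf().setMaster("MASTER_URL").setAppName("")
  conf.set("spark.mesos.coarse", "true")
  conf.set("spark.cores.max", "10")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  for (i <- 1 until 100) {
    val events = sqlContext.read.parquet("hdfs://localhost/tmp/something")
    println(s"Iteration ($i), number of events: " + events.count)
  }
  sc.stop()
}
{code}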
[jira] [Updated] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman Schistad updated SPARK-13198: Attachment: Screen Shot 2016-02-08 at 10.03.04.png > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > Screen Shot 2016-02-08 at 10.03.04.png, gc.log > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13198) sc.stop() does not clean up on driver, causes Java heap OOM.
[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136774#comment-15136774 ] Herman Schistad commented on SPARK-13198: - Digging more into the dumped heap and running a memory leak report (using Eclipse Memory Analyzer) I'm seeing the following result: !Screen Shot 2016-02-08 at 10.03.04.png|width=400! > sc.stop() does not clean up on driver, causes Java heap OOM. > > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png, Screen > Shot 2016-02-08 at 09.30.59.png, Screen Shot 2016-02-08 at 09.31.10.png, > Screen Shot 2016-02-08 at 10.03.04.png, gc.log > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13177) Update ActorWordCount example to not directly use low level linked list as it is deprecated.
[ https://issues.apache.org/jira/browse/SPARK-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13177: Assignee: (was: Apache Spark) > Update ActorWordCount example to not directly use low level linked list as it > is deprecated. > > > Key: SPARK-13177 > URL: https://issues.apache.org/jira/browse/SPARK-13177 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: holdenk >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13177) Update ActorWordCount example to not directly use low level linked list as it is deprecated.
[ https://issues.apache.org/jira/browse/SPARK-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136791#comment-15136791 ] Apache Spark commented on SPARK-13177: -- User 'agsachin' has created a pull request for this issue: https://github.com/apache/spark/pull/3 > Update ActorWordCount example to not directly use low level linked list as it > is deprecated. > > > Key: SPARK-13177 > URL: https://issues.apache.org/jira/browse/SPARK-13177 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: holdenk >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13177) Update ActorWordCount example to not directly use low level linked list as it is deprecated.
[ https://issues.apache.org/jira/browse/SPARK-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13177: Assignee: Apache Spark > Update ActorWordCount example to not directly use low level linked list as it > is deprecated. > > > Key: SPARK-13177 > URL: https://issues.apache.org/jira/browse/SPARK-13177 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13231) Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks and make it a user facing API.
Prashant Sharma created SPARK-13231: --- Summary: Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks and make it a user facing API. Key: SPARK-13231 URL: https://issues.apache.org/jira/browse/SPARK-13231 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.0 Reporter: Prashant Sharma Priority: Minor Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks (or includeFailedTasks; I prefer the longer version). Exposing it to users has no disadvantage I can think of, and it can be useful for them. One scenario is a user-defined metric. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
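To make the "user defined metric" scenario concrete, here is a sketch against the existing Spark 1.6 accumulator API; the includeValuesOfFailedTasks / includeFailedTasks flag is only a proposal at this point, so the comments describe what it would change rather than calling any new API.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("accumulator-metric"))

// A user-defined metric with the existing API. Today, accumulator updates made
// by tasks that ultimately fail are dropped for user accumulators; the proposed
// user-facing flag would let a metric like this also include values from
// failed task attempts.
val recordsSeen = sc.accumulator(0L, "recordsSeen")
sc.parallelize(1 to 1000).foreach { _ => recordsSeen += 1L }
println(s"records seen: ${recordsSeen.value}")
{code}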
[jira] [Commented] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one
[ https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136834#comment-15136834 ] Charles Drotar commented on SPARK-13156: Thanks Sean. The driver inhibiting the concurrent connections was the issue. Apparently the Teradata driver does not support concurrent connections and instead suggests creating different sessions for each query. I don't think this is truly an issue so I will close out the JIRA. > JDBC using multiple partitions creates additional tasks but only executes on > one > > > Key: SPARK-13156 > URL: https://issues.apache.org/jira/browse/SPARK-13156 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.5.0 > Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client >Reporter: Charles Drotar > > I can successfully kick off a query through JDBC to Teradata, and when it > runs it creates a task on each executor for every partition. The problem is > that all of the tasks except for one complete within a couple seconds and the > final task handles the entire dataset. > Example Code: > private val properties = new java.util.Properties() > properties.setProperty("driver","com.teradata.jdbc.TeraDriver") > properties.setProperty("username","foo") > properties.setProperty("password","bar") > val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10" > val numPartitions = 5 > val dbTableTemp = "( SELECT id MOD $numPartitions%d AS modulo, id FROM > db.table) AS TEMP_TABLE" > val partitionColumn = "modulo" > val lowerBound = 0.toLong > val upperBound = (numPartitions-1).toLong > val df = > sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties) > df.write.parquet("/output/path/for/df/") > When I look at the Spark UI I see the 5 tasks, but only 1 is actually > querying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
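As a side note on the original example: Spark's JDBC source turns partitionColumn/lowerBound/upperBound/numPartitions into one WHERE predicate per partition, so each of the five tasks runs its own "modulo"-ranged query. The same split can be spelled out with the predicates overload of read.jdbc, which makes the per-task queries explicit. This sketch reuses url, dbTableTemp, numPartitions, and properties from the example above (note that the original dbTableTemp string would need the f interpolator for $numPartitions%d to expand).

{code:scala}
// Equivalent partitioning with explicit per-partition predicates (assumes the
// url, dbTableTemp, numPartitions and properties values defined in the example
// above): each array element becomes the WHERE clause of one task's query.
val predicates = (0 until numPartitions).map(i => s"modulo = $i").toArray
val df = sqlContext.read.jdbc(url, dbTableTemp, predicates, properties)
df.write.parquet("/output/path/for/df/")
{code}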
[jira] [Closed] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one
[ https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Drotar closed SPARK-13156. -- Resolution: Not A Problem The driver class was inhibiting concurrent connections. This was unrelated to Spark's jdbc functionality. > JDBC using multiple partitions creates additional tasks but only executes on > one > > > Key: SPARK-13156 > URL: https://issues.apache.org/jira/browse/SPARK-13156 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.5.0 > Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client >Reporter: Charles Drotar > > I can successfully kick off a query through JDBC to Teradata, and when it > runs it creates a task on each executor for every partition. The problem is > that all of the tasks except for one complete within a couple seconds and the > final task handles the entire dataset. > Example Code: > private val properties = new java.util.Properties() > properties.setProperty("driver","com.teradata.jdbc.TeraDriver") > properties.setProperty("username","foo") > properties.setProperty("password","bar") > val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10" > val numPartitions = 5 > val dbTableTemp = "( SELECT id MOD $numPartitions%d AS modulo, id FROM > db.table) AS TEMP_TABLE" > val partitionColumn = "modulo" > val lowerBound = 0.toLong > val upperBound = (numPartitions-1).toLong > val df = > sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties) > df.write.parquet("/output/path/for/df/") > When I look at the Spark UI I see the 5 tasks, but only 1 is actually > querying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one
[ https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136835#comment-15136835 ] Charles Drotar edited comment on SPARK-13156 at 2/8/16 11:28 AM: - Thanks Sean. The driver inhibiting the concurrent connections was the issue. Apparently the Teradata driver does not support concurrent connections and instead suggests creating different sessions for each query. I don't think this is truly an issue so I will close out the JIRA. was (Author: charles.dro...@capitalone.com): The driver class was inhibiting concurrent connections. This was unrelated to Spark's jdbc functionality. > JDBC using multiple partitions creates additional tasks but only executes on > one > > > Key: SPARK-13156 > URL: https://issues.apache.org/jira/browse/SPARK-13156 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.5.0 > Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client >Reporter: Charles Drotar > > I can successfully kick off a query through JDBC to Teradata, and when it > runs it creates a task on each executor for every partition. The problem is > that all of the tasks except for one complete within a couple seconds and the > final task handles the entire dataset. > Example Code: > private val properties = new java.util.Properties() > properties.setProperty("driver","com.teradata.jdbc.TeraDriver") > properties.setProperty("username","foo") > properties.setProperty("password","bar") > val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10" > val numPartitions = 5 > val dbTableTemp = "( SELECT id MOD $numPartitions%d AS modulo, id FROM > db.table) AS TEMP_TABLE" > val partitionColumn = "modulo" > val lowerBound = 0.toLong > val upperBound = (numPartitions-1).toLong > val df = > sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties) > df.write.parquet("/output/path/for/df/") > When I look at the Spark UI I see the 5 tasks, but only 1 is actually > querying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one
[ https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Drotar updated SPARK-13156: --- Comment: was deleted (was: Thanks Sean. The driver inhibiting the concurrent connections was the issue. Apparently the Teradata driver does not support concurrent connections and instead suggests creating different sessions for each query. I don't think this is truly an issue so I will close out the JIRA.) > JDBC using multiple partitions creates additional tasks but only executes on > one > > > Key: SPARK-13156 > URL: https://issues.apache.org/jira/browse/SPARK-13156 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.5.0 > Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client >Reporter: Charles Drotar > > I can successfully kick off a query through JDBC to Teradata, and when it > runs it creates a task on each executor for every partition. The problem is > that all of the tasks except for one complete within a couple seconds and the > final task handles the entire dataset. > Example Code: > private val properties = new java.util.Properties() > properties.setProperty("driver","com.teradata.jdbc.TeraDriver") > properties.setProperty("username","foo") > properties.setProperty("password","bar") > val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10" > val numPartitions = 5 > val dbTableTemp = "( SELECT id MOD $numPartitions%d AS modulo, id FROM > db.table) AS TEMP_TABLE" > val partitionColumn = "modulo" > val lowerBound = 0.toLong > val upperBound = (numPartitions-1).toLong > val df = > sqlContext.read.jdbc(url,dbTableTemp,partitionColumn,lowerBound,upperBound,numPartitions,properties) > df.write.parquet("/output/path/for/df/") > When I look at the Spark UI I see the 5 tasks, but only 1 is actually > querying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7848) Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" information.
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7848: --- Assignee: Apache Spark > Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" > information. > > > Key: SPARK-7848 > URL: https://issues.apache.org/jira/browse/SPARK-7848 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: jay vyas >Assignee: Apache Spark > > A recent email on the mailing list detailed a bunch of great "knobs" to > remember for Spark Streaming. > Let's integrate this into the docs where appropriate. > I'll paste the raw text in a comment field below -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7848) Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" information.
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7848: --- Assignee: (was: Apache Spark) > Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" > information. > > > Key: SPARK-7848 > URL: https://issues.apache.org/jira/browse/SPARK-7848 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: jay vyas > > A recent email on the mailing list detailed a bunch of great "knobs" to > remember for Spark Streaming. > Let's integrate this into the docs where appropriate. > I'll paste the raw text in a comment field below -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7848) Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" information.
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136849#comment-15136849 ] Apache Spark commented on SPARK-7848: - User 'nirmannarang' has created a pull request for this issue: https://github.com/apache/spark/pull/4 > Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" > information. > > > Key: SPARK-7848 > URL: https://issues.apache.org/jira/browse/SPARK-7848 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: jay vyas > > A recent email on the mailing list detailed a bunch of great "knobs" to > remember for Spark Streaming. > Let's integrate this into the docs where appropriate. > I'll paste the raw text in a comment field below -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13231) Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks and make it a user facing API.
[ https://issues.apache.org/jira/browse/SPARK-13231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13231: Assignee: (was: Apache Spark) > Rename Accumulable.countFailedValues to > Accumulable.includeValuesOfFailedTasks and make it a user facing API. > - > > Key: SPARK-13231 > URL: https://issues.apache.org/jira/browse/SPARK-13231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Prashant Sharma >Priority: Minor > > Rename Accumulable.countFailedValues to > Accumulable.includeValuesOfFailedTasks (or includeFailedTasks) I liked the > longer version though. > Exposing it to user has no disadvantage I can think of, but it can be useful > for them. One scenario can be a user defined metric. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13231) Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks and make it a user facing API.
[ https://issues.apache.org/jira/browse/SPARK-13231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13231: Assignee: Apache Spark > Rename Accumulable.countFailedValues to > Accumulable.includeValuesOfFailedTasks and make it a user facing API. > - > > Key: SPARK-13231 > URL: https://issues.apache.org/jira/browse/SPARK-13231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Prashant Sharma >Assignee: Apache Spark >Priority: Minor > > Rename Accumulable.countFailedValues to > Accumulable.includeValuesOfFailedTasks (or includeFailedTasks) I liked the > longer version though. > Exposing it to user has no disadvantage I can think of, but it can be useful > for them. One scenario can be a user defined metric. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13231) Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks and make it a user facing API.
[ https://issues.apache.org/jira/browse/SPARK-13231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136944#comment-15136944 ] Apache Spark commented on SPARK-13231: -- User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/5 > Rename Accumulable.countFailedValues to > Accumulable.includeValuesOfFailedTasks and make it a user facing API. > - > > Key: SPARK-13231 > URL: https://issues.apache.org/jira/browse/SPARK-13231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Prashant Sharma >Priority: Minor > > Rename Accumulable.countFailedValues to > Accumulable.includeValuesOfFailedTasks (or includeFailedTasks) I liked the > longer version though. > Exposing it to user has no disadvantage I can think of, but it can be useful > for them. One scenario can be a user defined metric. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12316) Stack overflow with endless call of `Delegation token thread` when application end.
[ https://issues.apache.org/jira/browse/SPARK-12316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-12316: -- Assignee: SaintBacchus > Stack overflow with endless call of `Delegation token thread` when > application end. > --- > > Key: SPARK-12316 > URL: https://issues.apache.org/jira/browse/SPARK-12316 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: SaintBacchus >Assignee: SaintBacchus > Attachments: 20151210045149.jpg, 20151210045533.jpg > > > When application end, AM will clean the staging dir. > But if the driver trigger to update the delegation token, it will can't find > the right token file and then it will endless cycle call the method > 'updateCredentialsIfRequired'. > Then it lead to StackOverflowError. > !https://issues.apache.org/jira/secure/attachment/12779495/20151210045149.jpg! > !https://issues.apache.org/jira/secure/attachment/12779496/20151210045533.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13013) Replace example code in mllib-clustering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13013: Assignee: Apache Spark > Replace example code in mllib-clustering.md using include_example > - > > Key: SPARK-13013 > URL: https://issues.apache.org/jira/browse/SPARK-13013 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Apache Spark >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13013) Replace example code in mllib-clustering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13013: Assignee: (was: Apache Spark) > Replace example code in mllib-clustering.md using include_example > - > > Key: SPARK-13013 > URL: https://issues.apache.org/jira/browse/SPARK-13013 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13013) Replace example code in mllib-clustering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137004#comment-15137004 ] Apache Spark commented on SPARK-13013: -- User 'keypointt' has created a pull request for this issue: https://github.com/apache/spark/pull/6 > Replace example code in mllib-clustering.md using include_example > - > > Key: SPARK-13013 > URL: https://issues.apache.org/jira/browse/SPARK-13013 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137007#comment-15137007 ] Xin Ren commented on SPARK-13014: - I'm working on this one, thanks :) > Replace example code in mllib-collaborative-filtering.md using include_example > -- > > Key: SPARK-13014 > URL: https://issues.apache.org/jira/browse/SPARK-13014 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12316) Stack overflow with endless call of `Delegation token thread` when application end.
[ https://issues.apache.org/jira/browse/SPARK-12316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137000#comment-15137000 ] Thomas Graves commented on SPARK-12316: --- you say "endless cycle call" do you mean the application master hangs? It seems like it should throw and if the application is done it should just exit anyway since the AM is just calling stop on it.I just want to clarify what is happening because I assume even if you wait a minute you could still hit the same condition once when its tearing down. > Stack overflow with endless call of `Delegation token thread` when > application end. > --- > > Key: SPARK-12316 > URL: https://issues.apache.org/jira/browse/SPARK-12316 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: SaintBacchus >Assignee: SaintBacchus > Attachments: 20151210045149.jpg, 20151210045533.jpg > > > When application end, AM will clean the staging dir. > But if the driver trigger to update the delegation token, it will can't find > the right token file and then it will endless cycle call the method > 'updateCredentialsIfRequired'. > Then it lead to StackOverflowError. > !https://issues.apache.org/jira/secure/attachment/12779495/20151210045149.jpg! > !https://issues.apache.org/jira/secure/attachment/12779496/20151210045533.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137029#comment-15137029 ] Rama Mullapudi commented on SPARK-12177: Does the update include Kerberos support, since 0.9 producers and consumers now support Kerberos (SASL) and SSL? > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 is already released and it introduces a new consumer API that is not > compatible with the old one. So, I added the new consumer API in separate > classes in the package org.apache.spark.streaming.kafka.v09. I > didn't remove the old classes, for backward compatibility. Users will not need > to change their old Spark applications when they upgrade to the new Spark version. > Please review my changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
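For reference, the security settings in question are plain Kafka 0.9 consumer configs; whether and how the proposed org.apache.spark.streaming.kafka.v09 integration forwards them to the new consumer is exactly what is being asked above, so treat this parameter map as an illustration only (the broker host, paths, and password are placeholders).

{code:scala}
// Kafka 0.9 new-consumer security-related settings (defined by Kafka itself).
val kafkaParams = Map[String, String](
  "bootstrap.servers"          -> "broker1:9093",
  "group.id"                   -> "example-group",
  "security.protocol"          -> "SASL_SSL",   // or SASL_PLAINTEXT / SSL
  "sasl.kerberos.service.name" -> "kafka",
  "ssl.truststore.location"    -> "/path/to/truststore.jks",
  "ssl.truststore.password"    -> "changeit"
)
{code}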
[jira] [Created] (SPARK-13232) YARN executor node label expressions bug
Atkins created SPARK-13232: -- Summary: YARN executor node label expressions bug Key: SPARK-13232 URL: https://issues.apache.org/jira/browse/SPARK-13232 Project: Spark Issue Type: Bug Components: YARN Environment: Scala 2.11.7, Hadoop 2.7.2, Spark 1.6.0 Reporter: Atkins Using node label expression for executor failed to request container request and throws *InvalidContainerRequestException*. The code {code:title=AMRMClientImpl.java} /** * Valid if a node label expression specified on container request is valid or * not * * @param containerRequest */ private void checkNodeLabelExpression(T containerRequest) { String exp = containerRequest.getNodeLabelExpression(); if (null == exp || exp.isEmpty()) { return; } // Don't support specifying >= 2 node labels in a node label expression now if (exp.contains("&&") || exp.contains("||")) { throw new InvalidContainerRequestException( "Cannot specify more than two node labels" + " in a single node label expression"); } // Don't allow specify node label against ANY request if ((containerRequest.getRacks() != null && (!containerRequest.getRacks().isEmpty())) || (containerRequest.getNodes() != null && (!containerRequest.getNodes().isEmpty( { throw new InvalidContainerRequestException( "Cannot specify node label with rack and node"); } } {code} doesn't allow node label with rack and node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13232) YARN executor node label expressions
[ https://issues.apache.org/jira/browse/SPARK-13232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13232: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) Summary: YARN executor node label expressions (was: YARN executor node label expressions bug) What are you specifically referring to in this code -- what change are you proposing? As far as I can tell you're referring to something that's just not supported yet, which are conjunctions? > YARN executor node label expressions > > > Key: SPARK-13232 > URL: https://issues.apache.org/jira/browse/SPARK-13232 > Project: Spark > Issue Type: Improvement > Components: YARN > Environment: Scala 2.11.7, Hadoop 2.7.2, Spark 1.6.0 >Reporter: Atkins >Priority: Minor > > Using node label expression for executor failed to request container request > and throws *InvalidContainerRequestException*. > The code > {code:title=AMRMClientImpl.java} > /** >* Valid if a node label expression specified on container request is valid > or >* not >* >* @param containerRequest >*/ > private void checkNodeLabelExpression(T containerRequest) { > String exp = containerRequest.getNodeLabelExpression(); > > if (null == exp || exp.isEmpty()) { > return; > } > // Don't support specifying >= 2 node labels in a node label expression > now > if (exp.contains("&&") || exp.contains("||")) { > throw new InvalidContainerRequestException( > "Cannot specify more than two node labels" > + " in a single node label expression"); > } > > // Don't allow specify node label against ANY request > if ((containerRequest.getRacks() != null && > (!containerRequest.getRacks().isEmpty())) > || > (containerRequest.getNodes() != null && > (!containerRequest.getNodes().isEmpty( { > throw new InvalidContainerRequestException( > "Cannot specify node label with rack and node"); > } > } > {code} > doesn't allow node label with rack and node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
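For what Spark already exposes: the executor node label is requested through configuration, and the exception quoted above is thrown inside YARN's AMRMClient when such a labeled request also carries rack/node locality, so it appears to be a YARN-side restriction rather than something the Spark setting alone can avoid. A sketch of the existing configuration (the label name "gpu" is just an example):

{code:scala}
import org.apache.spark.SparkConf

// Existing way to ask YARN for labeled executors in Spark 1.6.
val conf = new SparkConf()
  .set("spark.yarn.executor.nodeLabelExpression", "gpu")  // example label
{code}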
[jira] [Commented] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated
[ https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137136#comment-15137136 ] sachin aggarwal commented on SPARK-13172: - Instead of getStackTraceString, should I use e.getStackTrace or e.printStackTrace? > Stop using RichException.getStackTrace it is deprecated > --- > > Key: SPARK-13172 > URL: https://issues.apache.org/jira/browse/SPARK-13172 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > Throwable.getStackTrace is the recommended alternative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
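Of the two options in the comment, Throwable.getStackTrace is the one the deprecation points at, but it returns Array[StackTraceElement] and so needs formatting; printStackTrace only writes to stderr and returns Unit, which is rarely what a log or error message needs. A minimal sketch:

{code:scala}
// Possible replacement for the deprecated Scala getStackTraceString:
// format the StackTraceElement array yourself.
def stackTraceString(e: Throwable): String =
  e.getStackTrace.mkString("\n")
{code}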
[jira] [Created] (SPARK-13233) Python Dataset
Wenchen Fan created SPARK-13233: --- Summary: Python Dataset Key: SPARK-13233 URL: https://issues.apache.org/jira/browse/SPARK-13233 Project: Spark Issue Type: New Feature Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13233) Python Dataset
[ https://issues.apache.org/jira/browse/SPARK-13233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-13233: Attachment: DesignDocPythonDataset.pdf > Python Dataset > -- > > Key: SPARK-13233 > URL: https://issues.apache.org/jira/browse/SPARK-13233 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan > Attachments: DesignDocPythonDataset.pdf > > > add Python Dataset w.r.t. the scala version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13233) Python Dataset
[ https://issues.apache.org/jira/browse/SPARK-13233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-13233: Description: add Python Dataset w.r.t. the scala version > Python Dataset > -- > > Key: SPARK-13233 > URL: https://issues.apache.org/jira/browse/SPARK-13233 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan > Attachments: DesignDocPythonDataset.pdf > > > add Python Dataset w.r.t. the scala version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
[ https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137168#comment-15137168 ] Sangeet Chourey commented on SPARK-10528: - RESOLVED : Downloaded the correct Winutils version and issue was resolved. Ideally, it should be locally compiled but if downloading compiled version make sure that it is 32/64 bit as applicable. I tried on Windows 7 64 bit, Spark 1.6 and downloaded winutils.exe from https://www.barik.net/archive/2015/01/19/172716/ and it worked..!! > spark-shell throws java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. > -- > > Key: SPARK-10528 > URL: https://issues.apache.org/jira/browse/SPARK-10528 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.0 > Environment: Windows 7 x64 >Reporter: Aliaksei Belablotski >Priority: Minor > > Starting spark-shell throws > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
[ https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137168#comment-15137168 ] Sangeet Chourey edited comment on SPARK-10528 at 2/8/16 4:26 PM: - RESOLVED : Downloaded the correct Winutils version and issue was resolved. Ideally, it should be locally compiled but if downloading compiled version make sure that it is 32/64 bit as applicable. I tried on Windows 7 64 bit, Spark 1.6 and downloaded winutils.exe from https://www.barik.net/archive/2015/01/19/172716/ and it worked..!! Complete Steps are at : http://letstalkspark.blogspot.com/2016/02/getting-started-with-spark-on-window-64.html was (Author: sybergeek): RESOLVED : Downloaded the correct Winutils version and issue was resolved. Ideally, it should be locally compiled but if downloading compiled version make sure that it is 32/64 bit as applicable. I tried on Windows 7 64 bit, Spark 1.6 and downloaded winutils.exe from https://www.barik.net/archive/2015/01/19/172716/ and it worked..!! > spark-shell throws java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. > -- > > Key: SPARK-10528 > URL: https://issues.apache.org/jira/browse/SPARK-10528 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.0 > Environment: Windows 7 x64 >Reporter: Aliaksei Belablotski >Priority: Minor > > Starting spark-shell throws > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13233) Python Dataset
[ https://issues.apache.org/jira/browse/SPARK-13233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13233: Assignee: Apache Spark > Python Dataset > -- > > Key: SPARK-13233 > URL: https://issues.apache.org/jira/browse/SPARK-13233 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > Attachments: DesignDocPythonDataset.pdf > > > add Python Dataset w.r.t. the scala version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13233) Python Dataset
[ https://issues.apache.org/jira/browse/SPARK-13233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137180#comment-15137180 ] Apache Spark commented on SPARK-13233: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7 > Python Dataset > -- > > Key: SPARK-13233 > URL: https://issues.apache.org/jira/browse/SPARK-13233 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan > Attachments: DesignDocPythonDataset.pdf > > > add Python Dataset w.r.t. the scala version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13233) Python Dataset
[ https://issues.apache.org/jira/browse/SPARK-13233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13233: Assignee: (was: Apache Spark) > Python Dataset > -- > > Key: SPARK-13233 > URL: https://issues.apache.org/jira/browse/SPARK-13233 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan > Attachments: DesignDocPythonDataset.pdf > > > add Python Dataset w.r.t. the scala version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10066) Can't create HiveContext with spark-shell or spark-sql on snapshot
[ https://issues.apache.org/jira/browse/SPARK-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137201#comment-15137201 ] Sangeet Chourey commented on SPARK-10066: - RESOLVED : Downloaded the correct Winutils version and issue was resolved. Ideally, it should be locally compiled but if downloading compiled version make sure that it is 32/64 bit as applicable. I tried on Windows 7 64 bit, Spark 1.6 and downloaded winutils.exe from https://www.barik.net/archive/2015/01/19/172716/ and it worked..!! Complete Steps are at : http://letstalkspark.blogspot.com/2016/02/getting-started-with-spark-on-window-64.html > Can't create HiveContext with spark-shell or spark-sql on snapshot > -- > > Key: SPARK-10066 > URL: https://issues.apache.org/jira/browse/SPARK-10066 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.5.0 > Environment: Centos 6.6 >Reporter: Robert Beauchemin >Priority: Minor > > Built the 1.5.0-preview-20150812 with the following: > ./make-distribution.sh -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive > -Phive-thriftserver -Psparkr -DskipTests > Starting spark-shell or spark-sql returns the following error: > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rwx-- > at > org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612) > [elided] > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508) > > It's trying to create a new HiveContext. Running pySpark or sparkR works and > creates a HiveContext successfully. SqlContext can be created successfully > with any shell. > I've tried changing permissions on that HDFS directory (even as far as making > it world-writable) without success. Tried changing SPARK_USER and also > running spark-shell as different users without success. > This works on same machine on 1.4.1 and on earlier pre-release versions of > Spark 1.5.0 (same make-distribution parms) sucessfully. Just trying the > snapshot... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13232) YARN executor node label expressions
[ https://issues.apache.org/jira/browse/SPARK-13232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137222#comment-15137222 ] Atkins commented on SPARK-13232: If spark config "spark.yarn.executor.nodeLabelExpression" present, *org.apache.spark.deploy.yarn.YarnAllocator#createContainerRequest* will create a ContainerRequest instance with locality specification of nodes, racks, and nodelabel which cause InvalidContainerRequestException be thrown. This can reproduce by adding test suite in *org.apache.spark.deploy.yarn.YarnAllocatorSuite* {code} test("request executors with locality") { val handler = createAllocator(1) handler.updateResourceRequests() handler.getNumExecutorsRunning should be (0) handler.getPendingAllocate.size should be (1) handler.requestTotalExecutorsWithPreferredLocalities(3, 20, Map(("host1", 10), ("host2", 20))) handler.updateResourceRequests() handler.getPendingAllocate.size should be (3) val container = createContainer("host1") handler.handleAllocatedContainers(Array(container)) handler.getNumExecutorsRunning should be (1) handler.allocatedContainerToHostMap.get(container.getId).get should be ("host1") handler.allocatedHostToContainersMap.get("host1").get should contain (container.getId) } {code} > YARN executor node label expressions > > > Key: SPARK-13232 > URL: https://issues.apache.org/jira/browse/SPARK-13232 > Project: Spark > Issue Type: Improvement > Components: YARN > Environment: Scala 2.11.7, Hadoop 2.7.2, Spark 1.6.0 >Reporter: Atkins >Priority: Minor > > Using node label expression for executor failed to request container request > and throws *InvalidContainerRequestException*. > The code > {code:title=AMRMClientImpl.java} > /** >* Valid if a node label expression specified on container request is valid > or >* not >* >* @param containerRequest >*/ > private void checkNodeLabelExpression(T containerRequest) { > String exp = containerRequest.getNodeLabelExpression(); > > if (null == exp || exp.isEmpty()) { > return; > } > // Don't support specifying >= 2 node labels in a node label expression > now > if (exp.contains("&&") || exp.contains("||")) { > throw new InvalidContainerRequestException( > "Cannot specify more than two node labels" > + " in a single node label expression"); > } > > // Don't allow specify node label against ANY request > if ((containerRequest.getRacks() != null && > (!containerRequest.getRacks().isEmpty())) > || > (containerRequest.getNodes() != null && > (!containerRequest.getNodes().isEmpty( { > throw new InvalidContainerRequestException( > "Cannot specify node label with rack and node"); > } > } > {code} > doesn't allow node label with rack and node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13104) Spark Metrics currently does not return executors hostname
[ https://issues.apache.org/jira/browse/SPARK-13104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik updated SPARK-13104: Description: We have been using Spark Metrics and porting the data to InfluxDB using the Graphite sink that is available in Spark. From what I can see, it only provides the executorId and not the executor hostname. With each Spark job, the executorID changes. Is there any way to find the hostname based on the executorID? (was: We been using Spark Metrics and porting the data to InfluxDB using the Graphite sink that is available in Spark. From what I can see, it only provides he executorId and not the executor hostname. With each spark job, the executorID changes. Is there any way to find the hostname based on the executorID?) > Spark Metrics currently does not return executors hostname > --- > > Key: SPARK-13104 > URL: https://issues.apache.org/jira/browse/SPARK-13104 > Project: Spark > Issue Type: Question >Reporter: Karthik >Priority: Critical > Labels: executor, executorId, graphite, hostname, metrics > > We have been using Spark Metrics and porting the data to InfluxDB using the > Graphite sink that is available in Spark. From what I can see, it only > provides the executorId and not the executor hostname. With each Spark job, > the executorID changes. Is there any way to find the hostname based on the > executorID? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
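Until the metrics carry the hostname themselves, one hedged workaround is to record the executorId-to-host mapping on the driver through the listener bus and join it with the Graphite/InfluxDB data downstream; the sketch below uses the public SparkListener API, and the class and field names introduced here are illustrative.
{code}
// Sketch only: maintains an executorId -> hostname map on the driver.
import scala.collection.concurrent.TrieMap
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

class ExecutorHostListener extends SparkListener {
  val executorHosts = TrieMap.empty[String, String]

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    // ExecutorInfo exposes the host the executor was launched on.
    executorHosts.put(event.executorId, event.executorInfo.executorHost)
  }

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit = {
    executorHosts.remove(event.executorId)
  }
}

// usage on the driver: sc.addSparkListener(new ExecutorHostListener)
{code}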
[jira] [Comment Edited] (SPARK-13232) YARN executor node label expressions
[ https://issues.apache.org/jira/browse/SPARK-13232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137222#comment-15137222 ] Atkins edited comment on SPARK-13232 at 2/8/16 4:59 PM: I am telling about yarn doesn't allow specify node label with racks or nodes, so the current version of Spark is not working with config of nodeLabel on Yarn. If spark config "spark.yarn.executor.nodeLabelExpression" present, *org.apache.spark.deploy.yarn.YarnAllocator#createContainerRequest* will create a ContainerRequest instance with locality specification of nodes, racks, and nodelabel which cause InvalidContainerRequestException be thrown. This can reproduce by adding test suite in *org.apache.spark.deploy.yarn.YarnAllocatorSuite* {code} test("request executors with locality") { val handler = createAllocator(1) handler.updateResourceRequests() handler.getNumExecutorsRunning should be (0) handler.getPendingAllocate.size should be (1) handler.requestTotalExecutorsWithPreferredLocalities(3, 20, Map(("host1", 10), ("host2", 20))) handler.updateResourceRequests() handler.getPendingAllocate.size should be (3) val container = createContainer("host1") handler.handleAllocatedContainers(Array(container)) handler.getNumExecutorsRunning should be (1) handler.allocatedContainerToHostMap.get(container.getId).get should be ("host1") handler.allocatedHostToContainersMap.get("host1").get should contain (container.getId) } {code} was (Author: atkins): If spark config "spark.yarn.executor.nodeLabelExpression" present, *org.apache.spark.deploy.yarn.YarnAllocator#createContainerRequest* will create a ContainerRequest instance with locality specification of nodes, racks, and nodelabel which cause InvalidContainerRequestException be thrown. This can reproduce by adding test suite in *org.apache.spark.deploy.yarn.YarnAllocatorSuite* {code} test("request executors with locality") { val handler = createAllocator(1) handler.updateResourceRequests() handler.getNumExecutorsRunning should be (0) handler.getPendingAllocate.size should be (1) handler.requestTotalExecutorsWithPreferredLocalities(3, 20, Map(("host1", 10), ("host2", 20))) handler.updateResourceRequests() handler.getPendingAllocate.size should be (3) val container = createContainer("host1") handler.handleAllocatedContainers(Array(container)) handler.getNumExecutorsRunning should be (1) handler.allocatedContainerToHostMap.get(container.getId).get should be ("host1") handler.allocatedHostToContainersMap.get("host1").get should contain (container.getId) } {code} > YARN executor node label expressions > > > Key: SPARK-13232 > URL: https://issues.apache.org/jira/browse/SPARK-13232 > Project: Spark > Issue Type: Improvement > Components: YARN > Environment: Scala 2.11.7, Hadoop 2.7.2, Spark 1.6.0 >Reporter: Atkins >Priority: Minor > > Using node label expression for executor failed to request container request > and throws *InvalidContainerRequestException*. 
> The code > {code:title=AMRMClientImpl.java} > /** >* Valid if a node label expression specified on container request is valid > or >* not >* >* @param containerRequest >*/ > private void checkNodeLabelExpression(T containerRequest) { > String exp = containerRequest.getNodeLabelExpression(); > > if (null == exp || exp.isEmpty()) { > return; > } > // Don't support specifying >= 2 node labels in a node label expression > now > if (exp.contains("&&") || exp.contains("||")) { > throw new InvalidContainerRequestException( > "Cannot specify more than two node labels" > + " in a single node label expression"); > } > > // Don't allow specify node label against ANY request > if ((containerRequest.getRacks() != null && > (!containerRequest.getRacks().isEmpty())) > || > (containerRequest.getNodes() != null && > (!containerRequest.getNodes().isEmpty( { > throw new InvalidContainerRequestException( > "Cannot specify node label with rack and node"); > } > } > {code} > doesn't allow node label with rack and node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
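A hedged sketch of the direction the comment points at (this is not the merged patch): attach the label expression only to a locality-relaxed request with no nodes or racks, which is the only shape checkNodeLabelExpression accepts, and leave the node/rack-constrained requests label-free. The sketch assumes the six-argument ContainerRequest constructor (with relaxLocality and nodeLabelsExpression) available in Hadoop 2.6+.
{code}
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

// Illustrative helper, not YarnAllocator's actual method.
def createContainerRequest(
    resource: Resource,
    nodes: Array[String],
    racks: Array[String],
    priority: Priority,
    labelExpression: Option[String]): ContainerRequest = labelExpression match {
  case Some(label) =>
    // Labelled request: no node/rack constraints, otherwise YARN throws
    // InvalidContainerRequestException.
    new ContainerRequest(resource, null, null, priority, true, label)
  case None =>
    new ContainerRequest(resource, nodes, racks, priority)
}
{code}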
[jira] [Updated] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13219: Component/s: (was: Spark Core) SQL > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
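The workaround mentioned in the description, spelled out against the same tables (shown for spark-shell, where sqlContext is predefined): repeating the constant predicate on both join keys lets each scan be filtered instead of one side doing a full table scan.
{code}
// Same query with the duplicated equality condition; table and column names
// are taken from the report above.
val plan = sqlContext.sql(
  """select t.name
    |from tenants t, assets a
    |where a.assetid = t.assetid
    |  and t.assetid = '1201'
    |  and a.assetid = '1201'
  """.stripMargin)
plan.explain(true) // both table scans should now sit under a Filter
{code}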
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137282#comment-15137282 ] Xiao Li commented on SPARK-13219: - See this PR: https://github.com/apache/spark/pull/10490. Let me know if you hit any bug. Thanks! > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13016) Replace example code in mllib-dimensionality-reduction.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137310#comment-15137310 ] Devaraj K commented on SPARK-13016: --- I am working on this and will provide a PR for it. Thanks. > Replace example code in mllib-dimensionality-reduction.md using > include_example > --- > > Key: SPARK-13016 > URL: https://issues.apache.org/jira/browse/SPARK-13016 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13117) WebUI should use the local ip not 0.0.0.0
[ https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137320#comment-15137320 ] Devaraj K commented on SPARK-13117: --- Thanks [~jjordan] for reporting this. I would like to provide a PR if you are not planning to work on it. Please let me know. Thanks. > WebUI should use the local ip not 0.0.0.0 > - > > Key: SPARK-13117 > URL: https://issues.apache.org/jira/browse/SPARK-13117 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jeremiah Jordan > > When SPARK_LOCAL_IP is set everything seems to correctly bind and use that IP > except the WebUI. The WebUI should use the SPARK_LOCAL_IP not always use > 0.0.0.0 > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
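A hedged sketch of the behaviour being requested, not the eventual patch: resolve the bind address from SPARK_LOCAL_IP and fall back to the wildcard address only when it is unset; the helper name is illustrative.
{code}
// Illustrative helper; the WebUI would hand the result to its HTTP server
// instead of hard-coding 0.0.0.0.
def uiBindHost(env: Map[String, String] = sys.env): String =
  env.get("SPARK_LOCAL_IP").map(_.trim).filter(_.nonEmpty).getOrElse("0.0.0.0")

// e.g. uiBindHost(Map("SPARK_LOCAL_IP" -> "10.0.0.5")) returns "10.0.0.5"
{code}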
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137330#comment-15137330 ] Abhinav Chawade commented on SPARK-13219: - Thanks Xiao. I will pull in the request and see how it performs. > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7889) Jobs progress of apps on complete page of HistoryServer shows uncompleted
[ https://issues.apache.org/jira/browse/SPARK-7889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137336#comment-15137336 ] Apache Spark commented on SPARK-7889: - User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/8 > Jobs progress of apps on complete page of HistoryServer shows uncompleted > - > > Key: SPARK-7889 > URL: https://issues.apache.org/jira/browse/SPARK-7889 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: meiyoula >Priority: Minor > > When running a SparkPi with 2000 tasks, clicking into the app on the incomplete > page shows the job progress as 400/2000. After the app is completed, the app > moves from the incomplete page to the complete page, but clicking into the app > still shows the job progress as 400/2000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137333#comment-15137333 ] Xiao Li commented on SPARK-13219: - Welcome > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13117) WebUI should use the local ip not 0.0.0.0
[ https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137353#comment-15137353 ] Jeremiah Jordan commented on SPARK-13117: - go for it. > WebUI should use the local ip not 0.0.0.0 > - > > Key: SPARK-13117 > URL: https://issues.apache.org/jira/browse/SPARK-13117 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jeremiah Jordan > > When SPARK_LOCAL_IP is set everything seems to correctly bind and use that IP > except the WebUI. The WebUI should use the SPARK_LOCAL_IP not always use > 0.0.0.0 > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12455) Add ExpressionDescription to window functions
[ https://issues.apache.org/jira/browse/SPARK-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-12455. --- Resolution: Resolved Fix Version/s: 2.0.0 > Add ExpressionDescription to window functions > - > > Key: SPARK-12455 > URL: https://issues.apache.org/jira/browse/SPARK-12455 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Herman van Hovell > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137369#comment-15137369 ] Sean Owen commented on SPARK-6305: -- I've started working on this, and it's as awful a dependency mess as you'd imagine. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12986) Fix pydoc warnings in mllib/regression.py
[ https://issues.apache.org/jira/browse/SPARK-12986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-12986. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11025 [https://github.com/apache/spark/pull/11025] > Fix pydoc warnings in mllib/regression.py > - > > Key: SPARK-12986 > URL: https://issues.apache.org/jira/browse/SPARK-12986 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Nam Pham >Priority: Minor > Fix For: 2.0.0 > > > Got those warnings by running "make html" under "python/docs/": > {code} > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.LinearRegressionWithSGD:3: ERROR: Unexpected > indentation. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.LinearRegressionWithSGD:4: WARNING: Block quote ends > without a blank line; unexpected unindent. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.RidgeRegressionWithSGD:3: ERROR: Unexpected > indentation. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.RidgeRegressionWithSGD:4: WARNING: Block quote ends > without a blank line; unexpected unindent. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.LassoWithSGD:3: ERROR: Unexpected indentation. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.LassoWithSGD:4: WARNING: Block quote ends without a > blank line; unexpected unindent. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.IsotonicRegression:7: ERROR: Unexpected indentation. > /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of > pyspark.mllib.regression.IsotonicRegression:12: ERROR: Unexpected indentation. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13234) Remove duplicated SQL metrics
Davies Liu created SPARK-13234: -- Summary: Remove duplicated SQL metrics Key: SPARK-13234 URL: https://issues.apache.org/jira/browse/SPARK-13234 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu For many SQL operators we have metrics for both input and output, but the number of input rows is always exactly the number of output rows of the child, so we could keep only the metrics for output rows. Now that whole-stage codegen has improved performance, the overhead of the SQL metrics is no longer trivial, so we should avoid them where they are not necessary. Some operators do not have SQL metrics; we should add them. For operators that have the same number of input and output rows (for example, Projection), we may not need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8964) Use Exchange in limit operations (per partition limit -> exchange to one partition -> per partition limit)
[ https://issues.apache.org/jira/browse/SPARK-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8964. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 7334 [https://github.com/apache/spark/pull/7334] > Use Exchange in limit operations (per partition limit -> exchange to one > partition -> per partition limit) > -- > > Key: SPARK-8964 > URL: https://issues.apache.org/jira/browse/SPARK-8964 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen > Fix For: 2.0.0 > > > Spark SQL's physical Limit operator currently performs its own shuffle rather > than using Exchange to perform the shuffling. This is less efficient since > this non-exchange shuffle path won't be able to benefit from SQL-specific > shuffling optimizations, such as SQLSerializer2. It also involves additional > unnecessary row copying. > Instead, I think that we should rewrite Limit to expand into three physical > operators: > PerParititonLimit -> Exchange to one partition -> PerPartitionLimit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
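For intuition, an RDD-level analogue of the proposed plan shape (the ticket itself is about physical SQL operators, not the RDD API; the helper below is illustrative): per-partition limit, then an exchange to a single partition, then a final per-partition limit.
{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Mirrors PerPartitionLimit -> Exchange(1 partition) -> PerPartitionLimit.
def limitRows[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] = {
  val perPartition = rdd.mapPartitions(_.take(n)) // limit inside each partition
  val singlePartition = perPartition.repartition(1) // shuffle everything to one partition
  singlePartition.mapPartitions(_.take(n)) // final limit on the merged rows
}
{code}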
[jira] [Commented] (SPARK-13213) BroadcastNestedLoopJoin is very slow
[ https://issues.apache.org/jira/browse/SPARK-13213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137551#comment-15137551 ] Davies Liu commented on SPARK-13213: [~sowen] Thanks very much for updating these. I try to remember to add that, but I may still miss it sometimes. Can we mark that field as required (or remember the last action as the default value)? > BroadcastNestedLoopJoin is very slow > > > Key: SPARK-13213 > URL: https://issues.apache.org/jira/browse/SPARK-13213 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > Since we have improved the performance of CartesianProduct, which should be > faster and more robust than BroadcastNestedLoopJoin, we should use > CartesianProduct instead of BroadcastNestedLoopJoin, especially when the > broadcasted table is not that small. > Today we hit a query that ran for a very long time without finishing; once we > decreased the threshold for broadcast (disabling BroadcastNestedLoopJoin), it > finished in seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
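For anyone hitting the same behaviour, "decrease the threshold for broadcast" maps to the existing spark.sql.autoBroadcastJoinThreshold setting; a value of -1 disables automatic broadcasting altogether, which keeps the planner away from BroadcastNestedLoopJoin (shown for spark-shell, where sqlContext is predefined).
{code}
// Disable automatic broadcast joins so the planner cannot pick
// BroadcastNestedLoopJoin for a not-so-small table.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}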
[jira] [Updated] (SPARK-12585) The numFields of UnsafeRow should not changed by pointTo()
[ https://issues.apache.org/jira/browse/SPARK-12585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12585: --- Component/s: SQL > The numFields of UnsafeRow should not changed by pointTo() > -- > > Key: SPARK-12585 > URL: https://issues.apache.org/jira/browse/SPARK-12585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > Fix For: 2.0.0 > > > Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes > is calculated, making pointTo() a little bit heavy. > It should be part of constructor of UnsafeRow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12840) Support passing arbitrary objects (not just expressions) into code generated classes
[ https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12840: --- Component/s: SQL > Support passing arbitrary objects (not just expressions) into code generated > classes > > > Key: SPARK-12840 > URL: https://issues.apache.org/jira/browse/SPARK-12840 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > As of now, our code generator only allows passing Expression objects into the > generated class as arguments. In order to support whole-stage codegen (e.g. > for broadcast joins), the generated classes need to accept other types of > objects such as hash tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13215) Remove fallback in codegen
[ https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13215: --- Component/s: SQL > Remove fallback in codegen > -- > > Key: SPARK-13215 > URL: https://issues.apache.org/jira/browse/SPARK-13215 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > In newMutableProjection, we fall back to InterpretedMutableProjection if > compilation fails. > Since we removed the configuration for codegen, we rely heavily on codegen > (and TungstenAggregate requires the generated MutableProjection to update > UnsafeRow), so we should remove the fallback, which could confuse users; see > the discussion in SPARK-13116. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated
[ https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137575#comment-15137575 ] Jakob Odersky edited comment on SPARK-13172 at 2/8/16 8:03 PM: --- I would suggest taking similar approach to what the Scala library does: https://github.com/scala/scala/blob/v2.11.7/src/library/scala/runtime/RichException.scala#L16, that is just call mkString on the stack trace. Using e.printStackTrace is not as flexible, it doesn't give you a string and as far as I know it prints to stderr with no option to redirect. was (Author: jodersky): I would suggest taking similar approach to what the Scala library does: https://github.com/scala/scala/blob/v2.11.7/src/library/scala/runtime/RichException.scala#L1, that is just call mkString on the stack trace. Using e.printStackTrace is not as flexible, it doesn't give you a string and as far as I know it prints to stderr with no option to redirect. > Stop using RichException.getStackTrace it is deprecated > --- > > Key: SPARK-13172 > URL: https://issues.apache.org/jira/browse/SPARK-13172 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > Throwable getStackTrace is the recommended alternative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
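The suggestion above in code form, as a small sketch: build the string from Throwable.getStackTrace directly, which keeps the output redirectable, unlike printStackTrace; the helper name is illustrative.
{code}
// Format a stack trace as a single string without the deprecated
// RichException helper and without printing to stderr.
def stackTraceString(t: Throwable): String =
  t.getStackTrace.mkString("\n")

// usage
val trace = stackTraceString(new RuntimeException("boom"))
println(trace)
{code}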
[jira] [Commented] (SPARK-13172) Stop using RichException.getStackTrace it is deprecated
[ https://issues.apache.org/jira/browse/SPARK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137575#comment-15137575 ] Jakob Odersky commented on SPARK-13172: --- I would suggest taking similar approach to what the Scala library does: https://github.com/scala/scala/blob/v2.11.7/src/library/scala/runtime/RichException.scala#L1, that is just call mkString on the stack trace. Using e.printStackTrace is not as flexible, it doesn't give you a string and as far as I know it prints to stderr with no option to redirect. > Stop using RichException.getStackTrace it is deprecated > --- > > Key: SPARK-13172 > URL: https://issues.apache.org/jira/browse/SPARK-13172 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > Throwable getStackTrace is the recommended alternative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch
[ https://issues.apache.org/jira/browse/SPARK-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13101: - Fix Version/s: 1.6.1 > Dataset complex types mapping to DataFrame (element nullability) mismatch > -- > > Key: SPARK-13101 > URL: https://issues.apache.org/jira/browse/SPARK-13101 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Deenar Toraskar >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 1.6.1, 2.0.0 > > > There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By > default a scala {{Seq\[Double\]}} is mapped by Spark as an ArrayType with > nullable element > {noformat} > |-- valuations: array (nullable = true) > ||-- element: double (containsNull = true) > {noformat} > This could be read back to as a Dataset in Spark 1.6.0 > {code} > val df = sqlContext.table("valuations").as[Valuation] > {code} > But with Spark 1.6.1 the same fails with > {code} > val df = sqlContext.table("valuations").as[Valuation] > org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as > array)' due to data type mismatch: cannot cast > ArrayType(DoubleType,true) to ArrayType(DoubleType,false); > {code} > Here's the classes I am using > {code} > case class Valuation(tradeId : String, > counterparty: String, > nettingAgreement: String, > wrongWay: Boolean, > valuations : Seq[Double], /* one per scenario */ > timeInterval: Int, > jobId: String) /* used for hdfs partitioning */ > val vals : Seq[Valuation] = Seq() > val valsDF = sqlContext.sparkContext.parallelize(vals).toDF > valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations") > {code} > even the following gives the same result > {code} > val valsDF = vals.toDS.toDF > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch
[ https://issues.apache.org/jira/browse/SPARK-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13101. -- Resolution: Fixed Fix Version/s: (was: 1.6.1) 2.0.0 Issue resolved by pull request 11035 [https://github.com/apache/spark/pull/11035] > Dataset complex types mapping to DataFrame (element nullability) mismatch > -- > > Key: SPARK-13101 > URL: https://issues.apache.org/jira/browse/SPARK-13101 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Deenar Toraskar >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 2.0.0 > > > There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By > default a scala {{Seq\[Double\]}} is mapped by Spark as an ArrayType with > nullable element > {noformat} > |-- valuations: array (nullable = true) > ||-- element: double (containsNull = true) > {noformat} > This could be read back to as a Dataset in Spark 1.6.0 > {code} > val df = sqlContext.table("valuations").as[Valuation] > {code} > But with Spark 1.6.1 the same fails with > {code} > val df = sqlContext.table("valuations").as[Valuation] > org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as > array)' due to data type mismatch: cannot cast > ArrayType(DoubleType,true) to ArrayType(DoubleType,false); > {code} > Here's the classes I am using > {code} > case class Valuation(tradeId : String, > counterparty: String, > nettingAgreement: String, > wrongWay: Boolean, > valuations : Seq[Double], /* one per scenario */ > timeInterval: Int, > jobId: String) /* used for hdfs partitioning */ > val vals : Seq[Valuation] = Seq() > val valsDF = sqlContext.sparkContext.parallelize(vals).toDF > valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations") > {code} > even the following gives the same result > {code} > val valsDF = vals.toDS.toDF > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13210) NPE in Sort
[ https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13210. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11095 [https://github.com/apache/spark/pull/11095] > NPE in Sort > --- > > Key: SPARK-13210 > URL: https://issues.apache.org/jira/browse/SPARK-13210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 2.0.0 > > > When run TPCDS query Q78 with scale 10: > {code} > 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = > 268435456 bytes, TID = 143 > 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID > 143) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45) > at org.apache.spark.scheduler.Task.run(Task.scala:81) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13210) NPE in Sort
[ https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137598#comment-15137598 ] Josh Rosen commented on SPARK-13210: I'm also going to cherry-pick this for 1.6.1. > NPE in Sort > --- > > Key: SPARK-13210 > URL: https://issues.apache.org/jira/browse/SPARK-13210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 2.0.0 > > > When run TPCDS query Q78 with scale 10: > {code} > 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = > 268435456 bytes, TID = 143 > 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID > 143) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45) > at org.apache.spark.scheduler.Task.run(Task.scala:81) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10780) Set initialModel in KMeans in Pipelines API
[ https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137607#comment-15137607 ] Apache Spark commented on SPARK-10780: -- User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/9 > Set initialModel in KMeans in Pipelines API > --- > > Key: SPARK-10780 > URL: https://issues.apache.org/jira/browse/SPARK-10780 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This is for the Scala version. After this is merged, create a JIRA for > Python version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13235) Remove extra Distinct in Union Distinct
Xiao Li created SPARK-13235: --- Summary: Remove extra Distinct in Union Distinct Key: SPARK-13235 URL: https://issues.apache.org/jira/browse/SPARK-13235 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li Union Distinct has two Distinct that generates two Aggregation in the plan. {code} sql("select * from t0 union select * from t0").explain(true) {code} {code} == Parsed Logical Plan == 'Project [unresolvedalias(*,None)] +- 'Subquery u_2 +- 'Distinct +- 'Project [unresolvedalias(*,None)] +- 'Subquery u_1 +- 'Distinct +- 'Union :- 'Project [unresolvedalias(*,None)] : +- 'UnresolvedRelation `t0`, None +- 'Project [unresolvedalias(*,None)] +- 'UnresolvedRelation `t0`, None == Analyzed Logical Plan == id: bigint Project [id#16L] +- Subquery u_2 +- Distinct +- Project [id#16L] +- Subquery u_1 +- Distinct +- Union :- Project [id#16L] : +- Subquery t0 : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Subquery t0 +- Relation[id#16L] ParquetRelation == Optimized Logical Plan == Aggregate [id#16L], [id#16L] +- Aggregate [id#16L], [id#16L] +- Union :- Project [id#16L] : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Relation[id#16L] ParquetRelation {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
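One way to picture the fix, as a hedged sketch rather than the actual change: an optimizer rule that drops the redundant inner Distinct before Distinct is rewritten into Aggregate (the real fix may equally live in the parser or in the aggregate rewrite; the rule name is illustrative).
{code}
import org.apache.spark.sql.catalyst.plans.logical.{Distinct, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Collapses directly adjacent Distinct operators; the intervening
// Project/Subquery nodes seen in the plan above would need to be
// eliminated first for this pattern to match.
object CollapseAdjacentDistinct extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // DISTINCT of DISTINCT is the same relation, so keep only one.
    case Distinct(Distinct(child)) => Distinct(child)
  }
}
{code}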
[jira] [Updated] (SPARK-10561) Provide tooling for auto-generating Spark SQL reference manual
[ https://issues.apache.org/jira/browse/SPARK-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-10561: --- Description: Here is the discussion thread: http://search-hadoop.com/m/q3RTtcD20F1o62xE Richard Hillegas made the following suggestion: A machine-generated BNF, however, is easy to imagine. But perhaps not so easy to implement. Spark's SQL grammar is implemented in Scala, extending the DSL support provided by the Scala language. I am new to programming in Scala, so I don't know whether the Scala ecosystem provides any good tools for reverse-engineering a BNF from a class which extends scala.util.parsing.combinator.syntactical.StandardTokenParsers. was: Here is the discussion thread: http://search-hadoop.com/m/q3RTtcD20F1o62xE Richard Hillegas made the following suggestion: A machine-generated BNF, however, is easy to imagine. But perhaps not so easy to implement. Spark's SQL grammar is implemented in Scala, extending the DSL support provided by the Scala language. I am new to programming in Scala, so I don't know whether the Scala ecosystem provides any good tools for reverse-engineering a BNF from a class which extends scala.util.parsing.combinator.syntactical.StandardTokenParsers. > Provide tooling for auto-generating Spark SQL reference manual > -- > > Key: SPARK-10561 > URL: https://issues.apache.org/jira/browse/SPARK-10561 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Reporter: Ted Yu > > Here is the discussion thread: > http://search-hadoop.com/m/q3RTtcD20F1o62xE > Richard Hillegas made the following suggestion: > A machine-generated BNF, however, is easy to imagine. But perhaps not so easy > to implement. Spark's SQL grammar is implemented in Scala, extending the DSL > support provided by the Scala language. I am new to programming in Scala, so > I don't know whether the Scala ecosystem provides any good tools for > reverse-engineering a BNF from a class which extends > scala.util.parsing.combinator.syntactical.StandardTokenParsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13180) Protect against SessionState being null when accessing HiveClientImpl#conf
[ https://issues.apache.org/jira/browse/SPARK-13180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137674#comment-15137674 ] Ted Yu commented on SPARK-13180: I wonder if we should provide better error message when NPE happens - the cause may be mixed dependencies. See last response on the thread. > Protect against SessionState being null when accessing HiveClientImpl#conf > -- > > Key: SPARK-13180 > URL: https://issues.apache.org/jira/browse/SPARK-13180 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ted Yu >Priority: Minor > Attachments: spark-13180-util.patch > > > See this thread http://search-hadoop.com/m/q3RTtFoTDi2HVCrM1 > {code} > java.lang.NullPointerException > at > org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205) > at > org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:552) > at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:551) > at > org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:538) > at > org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:537) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:537) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:457) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:457) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:456) > at org.apache.spark.sql.hive.HiveContext$$anon$3.(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:473) > at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:472) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:133) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at > org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
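A hedged sketch of the "better error message" idea (illustrative, not the actual ClientWrapper/HiveClientImpl code): guard the SessionState lookup and point the user at the likely cause instead of letting a bare NullPointerException escape; the message wording is an assumption.
{code}
// Uses only public Hive APIs; the helper and message are illustrative.
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

def sessionConf(): HiveConf = {
  val state = SessionState.get()
  require(state != null,
    "Hive SessionState is null when looking up the client conf; this often " +
      "indicates mixed or incompatible Hive client dependencies on the classpath.")
  state.getConf
}
{code}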
[jira] [Assigned] (SPARK-13235) Remove extra Distinct in Union Distinct
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13235: Assignee: (was: Apache Spark) > Remove extra Distinct in Union Distinct > --- > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Union Distinct has two Distinct that generates two Aggregation in the plan. > {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13235) Remove extra Distinct in Union Distinct
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137682#comment-15137682 ] Apache Spark commented on SPARK-13235: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/11120 > Remove extra Distinct in Union Distinct > --- > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Union Distinct has two Distinct that generates two Aggregation in the plan. > {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13235) Remove extra Distinct in Union Distinct
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13235: Assignee: Apache Spark > Remove extra Distinct in Union Distinct > --- > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Union Distinct has two Distinct that generates two Aggregation in the plan. > {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13235) Remove extra Distinct in Union
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13235: Summary: Remove extra Distinct in Union (was: Remove extra Distinct in Union Distinct) > Remove extra Distinct in Union > -- > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Union Distinct has two Distinct that generates two Aggregation in the plan. > {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13235) Remove an extra Distinct in Union
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13235: Summary: Remove an extra Distinct in Union (was: Remove extra Distinct in Union) > Remove an extra Distinct in Union > - > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Union Distinct has two Distinct that generates two Aggregation in the plan. > {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13235) Remove an Extra Distinct in Union
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13235: Summary: Remove an Extra Distinct in Union (was: Remove an extra Distinct in Union) > Remove an Extra Distinct in Union > - > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Union Distinct has two Distinct that generates two Aggregation in the plan. > {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13235) Remove an Extra Distinct in Union
[ https://issues.apache.org/jira/browse/SPARK-13235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13235: Description: Union Distinct has two Distinct that generate two Aggregation in the plan. {code} sql("select * from t0 union select * from t0").explain(true) {code} {code} == Parsed Logical Plan == 'Project [unresolvedalias(*,None)] +- 'Subquery u_2 +- 'Distinct +- 'Project [unresolvedalias(*,None)] +- 'Subquery u_1 +- 'Distinct +- 'Union :- 'Project [unresolvedalias(*,None)] : +- 'UnresolvedRelation `t0`, None +- 'Project [unresolvedalias(*,None)] +- 'UnresolvedRelation `t0`, None == Analyzed Logical Plan == id: bigint Project [id#16L] +- Subquery u_2 +- Distinct +- Project [id#16L] +- Subquery u_1 +- Distinct +- Union :- Project [id#16L] : +- Subquery t0 : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Subquery t0 +- Relation[id#16L] ParquetRelation == Optimized Logical Plan == Aggregate [id#16L], [id#16L] +- Aggregate [id#16L], [id#16L] +- Union :- Project [id#16L] : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Relation[id#16L] ParquetRelation {code} was: Union Distinct has two Distinct that generates two Aggregation in the plan. {code} sql("select * from t0 union select * from t0").explain(true) {code} {code} == Parsed Logical Plan == 'Project [unresolvedalias(*,None)] +- 'Subquery u_2 +- 'Distinct +- 'Project [unresolvedalias(*,None)] +- 'Subquery u_1 +- 'Distinct +- 'Union :- 'Project [unresolvedalias(*,None)] : +- 'UnresolvedRelation `t0`, None +- 'Project [unresolvedalias(*,None)] +- 'UnresolvedRelation `t0`, None == Analyzed Logical Plan == id: bigint Project [id#16L] +- Subquery u_2 +- Distinct +- Project [id#16L] +- Subquery u_1 +- Distinct +- Union :- Project [id#16L] : +- Subquery t0 : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Subquery t0 +- Relation[id#16L] ParquetRelation == Optimized Logical Plan == Aggregate [id#16L], [id#16L] +- Aggregate [id#16L], [id#16L] +- Union :- Project [id#16L] : +- Relation[id#16L] ParquetRelation +- Project [id#16L] +- Relation[id#16L] ParquetRelation {code} > Remove an Extra Distinct in Union > - > > Key: SPARK-13235 > URL: https://issues.apache.org/jira/browse/SPARK-13235 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Union Distinct has two Distinct that generate two Aggregation in the plan. 
> {code} > sql("select * from t0 union select * from t0").explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Project [unresolvedalias(*,None)] > +- 'Subquery u_2 >+- 'Distinct > +- 'Project [unresolvedalias(*,None)] > +- 'Subquery u_1 > +- 'Distinct >+- 'Union > :- 'Project [unresolvedalias(*,None)] > : +- 'UnresolvedRelation `t0`, None > +- 'Project [unresolvedalias(*,None)] > +- 'UnresolvedRelation `t0`, None > == Analyzed Logical Plan == > id: bigint > Project [id#16L] > +- Subquery u_2 >+- Distinct > +- Project [id#16L] > +- Subquery u_1 > +- Distinct >+- Union > :- Project [id#16L] > : +- Subquery t0 > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Subquery t0 > +- Relation[id#16L] ParquetRelation > == Optimized Logical Plan == > Aggregate [id#16L], [id#16L] > +- Aggregate [id#16L], [id#16L] >+- Union > :- Project [id#16L] > : +- Relation[id#16L] ParquetRelation > +- Project [id#16L] > +- Relation[id#16L] ParquetRelation > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
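As an illustration of the general idea (not necessarily the approach taken in the actual patch), a Catalyst-style rewrite that drops the redundant inner Distinct could be sketched as:
{code}
// Hedged sketch, not the actual Spark change: a rule that removes a
// Distinct sitting directly on top of another Distinct, since the inner
// one already eliminates duplicates. With a single Distinct left, the
// optimizer would plan one Aggregate instead of two.
import org.apache.spark.sql.catalyst.plans.logical.{Distinct, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object CollapseAdjacentDistinct extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Distinct(Distinct(child)) => Distinct(child)
  }
}
{code}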
[jira] [Commented] (SPARK-13171) Update promise & future to Promise and Future as the old ones are deprecated
[ https://issues.apache.org/jira/browse/SPARK-13171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137702#comment-15137702 ] Shixiong Zhu commented on SPARK-13171: -- It looks like something went wrong between: [info] 2016-02-06 08:42:00.219 - stderr> found org.apache.hadoop#hadoop-mapreduce-client-app;2.3.0 in list [info] 2016-02-06 08:46:13.188 - stderr> found org.apache.hadoop#hadoop-mapreduce-client-common;2.3.0 in central It took 4 minutes. > Update promise & future to Promise and Future as the old ones are deprecated > > > Key: SPARK-13171 > URL: https://issues.apache.org/jira/browse/SPARK-13171 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Assignee: Jakob Odersky >Priority: Trivial > Fix For: 2.0.0 > > > We use the promise and future functions on the concurrent object, both of > which have been deprecated in 2.11 . The full traits are present in Scala > 2.10 as well so this should be a safe migration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137701#comment-15137701 ] Abhinav Chawade commented on SPARK-13219: - I created a build of Spark 1.4.1 which incorporates your patch but somehow predicates are still not being propagated. The set of steps I followed 1) Build Spark 1.4.1 with patch incorporated. 2) Replace spark-catalyst jar on all nodes. 3) Run explain on following command in spark-sql. Notice the query plan. {code} spark-sql> explain select t.assetid from tenants t inner join assets on t.assetid = assets.assetid where t.assetid=1201; == Physical Plan == Project [assetid#18] ShuffledHashJoin [assetid#18], [assetid#20], BuildRight Exchange (HashPartitioning 200) Filter (assetid#18 = 1201) HiveTableScan [assetid#18], (MetastoreRelation element22082, tenants, Some(t)), None Exchange (HashPartitioning 200) HiveTableScan [assetid#20], (MetastoreRelation element22082, assets, None), None Time taken: 2.741 seconds, Fetched 8 row(s) {code} > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
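For reference, the manual workaround mentioned in the description amounts to repeating the predicate for both tables, as in the sketch below; table and column names follow the ticket's example schema, and sqlContext is assumed to be the HiveContext available in spark-sql or spark-shell:
{code}
// Hedged sketch of the workaround described above: repeat the equality
// predicate for each joined table so both HiveTableScans get a pushed-down
// Filter instead of one side doing a full scan.
// sqlContext (a HiveContext) is assumed to be in scope here.
val plan = sqlContext.sql(
  """SELECT t.assetid
    |FROM tenants t
    |JOIN assets a ON t.assetid = a.assetid
    |WHERE t.assetid = 1201
    |  AND a.assetid = 1201
  """.stripMargin)
plan.explain(true)
{code}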
[jira] [Commented] (SPARK-12505) Pushdown a Limit on top of an Outer-Join
[ https://issues.apache.org/jira/browse/SPARK-12505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137710#comment-15137710 ] Apache Spark commented on SPARK-12505: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11121 > Pushdown a Limit on top of an Outer-Join > > > Key: SPARK-12505 > URL: https://issues.apache.org/jira/browse/SPARK-12505 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Xiao Li > > "Rule that applies to a Limit on top of an OUTER Join. The original Limit > won't go away after applying this rule, but additional Limit node(s) will be > created on top of the outer-side child (or children if it's a FULL OUTER > Join). " > – from https://issues.apache.org/jira/browse/CALCITE-832 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
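A rough sketch of the quoted rule, using Spark 1.x Catalyst node names purely for illustration (this is not the actual Spark change in the linked pull request):
{code}
// Hedged sketch of the idea quoted from CALCITE-832: the original Limit
// stays, and an extra Limit is placed on the outer (here: left) side of a
// LEFT OUTER join so that side produces fewer rows before the join.
import org.apache.spark.sql.catalyst.plans.LeftOuter
import org.apache.spark.sql.catalyst.plans.logical.{Join, Limit, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object PushLimitThroughLeftOuterJoin extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Guard so the rewrite is not re-applied on later optimizer passes.
    case Limit(limitExpr, join @ Join(left, _, LeftOuter, _))
        if !left.isInstanceOf[Limit] =>
      // Only the outer side can be limited safely; the inner side cannot.
      Limit(limitExpr, join.copy(left = Limit(limitExpr, left)))
  }
}
{code}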
[jira] [Commented] (SPARK-12503) Pushdown a Limit on top of a Union
[ https://issues.apache.org/jira/browse/SPARK-12503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137709#comment-15137709 ] Apache Spark commented on SPARK-12503: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11121 > Pushdown a Limit on top of a Union > -- > > Key: SPARK-12503 > URL: https://issues.apache.org/jira/browse/SPARK-12503 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Xiao Li > > "Rule that applies to a Limit on top of a Union. The original Limit won't go > away after applying this rule, but additional Limit nodes will be created on > top of each child of Union, so that these children produce less rows and > Limit can be further optimized for children Relations." > -- from https://issues.apache.org/jira/browse/CALCITE-832 > Also, the same topic in Hive: https://issues.apache.org/jira/browse/HIVE-11775 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13171) Update promise & future to Promise and Future as the old ones are deprecated
[ https://issues.apache.org/jira/browse/SPARK-13171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137714#comment-15137714 ] holdenk commented on SPARK-13171: - I've been seeing that intermittently for a while in my own PR builds. > Update promise & future to Promise and Future as the old ones are deprecated > > > Key: SPARK-13171 > URL: https://issues.apache.org/jira/browse/SPARK-13171 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Assignee: Jakob Odersky >Priority: Trivial > Fix For: 2.0.0 > > > We use the promise and future functions on the concurrent object, both of > which have been deprecated in 2.11 . The full traits are present in Scala > 2.10 as well so this should be a safe migration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137721#comment-15137721 ] Xiao Li commented on SPARK-13219: - Let me try your SQL query in Spark 1.6.1. > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13236) SQL generation support for union
Xiao Li created SPARK-13236: --- Summary: SQL generation support for union Key: SPARK-13236 URL: https://issues.apache.org/jira/browse/SPARK-13236 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li checkHiveQl("SELECT * FROM t0 UNION SELECT * FROM t0") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137771#comment-15137771 ] Mark Grover commented on SPARK-12177: - Hi Rama, This particular PR adds support for the new API. There is some small code for SSL support in it too but I haven't invested much time in testing that, apart from the simple unit test that was written for it. Kerberos (SASL) will have to be done incrementally in another patch because it can't be done until Kafka supports delegation tokens (which is still not there yet: https://issues.apache.org/jira/browse/KAFKA-1696) > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released and it introduces a new consumer API that is not > compatible with the old one. So, I added the new consumer API. I made separate > classes in package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for more backward compatibility. Users will not need > to change their old Spark applications when they upgrade to the new Spark version. > Please review my changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
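For context, the new (Kafka 0.9) consumer API targeted by the PR looks roughly like the following when used standalone, outside Spark; the broker address, group id, and topic below are placeholders, and the SSL/SASL configuration discussed above is omitted:
{code}
// Hedged sketch of the Kafka 0.9 "new" consumer API, shown standalone.
// Config values are placeholders.
import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "example-group")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Arrays.asList("example-topic"))
val records = consumer.poll(1000L) // ConsumerRecords[String, String]
consumer.close()
{code}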
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137770#comment-15137770 ] Abhinav Chawade commented on SPARK-13219: - Here is my branch on github if you'd like to take a look. https://github.com/drnushooz/spark/tree/v1.4.1-SPARK-13219 > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12194) Add Sink for reporting Spark Metrics to OpenTSDB
[ https://issues.apache.org/jira/browse/SPARK-12194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137774#comment-15137774 ] Matt Kapilevich commented on SPARK-12194: - I am also looking to capture Spark metrics into OpenTSDB. FWIW, I've reviewed the PR, and it looks good to me. Can one of the committers please see if this patch can be merged? > Add Sink for reporting Spark Metrics to OpenTSDB > > > Key: SPARK-12194 > URL: https://issues.apache.org/jira/browse/SPARK-12194 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Kapil Singh > > Add OpenTSDB Sink to the currently supported metric sinks. Since OpenTSDB is > a popular open-source Time Series Database (based on HBase), this will make > it convenient for those who want metrics data for time series analysis > purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13171) Update promise & future to Promise and Future as the old ones are deprecated
[ https://issues.apache.org/jira/browse/SPARK-13171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137795#comment-15137795 ] Jakob Odersky commented on SPARK-13171: --- This is very strange, are you sure it has something to do with the changes introduced by my PR? As mentioned previously, the only effective change between future() and Future.apply() is one less indirection. The only potentially visible changes would be for code that relies on reflection or does some macro magic. > Update promise & future to Promise and Future as the old ones are deprecated > > > Key: SPARK-13171 > URL: https://issues.apache.org/jira/browse/SPARK-13171 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Assignee: Jakob Odersky >Priority: Trivial > Fix For: 2.0.0 > > > We use the promise and future functions on the concurrent object, both of > which have been deprecated in 2.11 . The full traits are present in Scala > 2.10 as well so this should be a safe migration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
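For concreteness, the whole migration amounts to swapping the deprecated lower-case helpers for their companion-object equivalents, e.g. (minimal sketch):
{code}
// Minimal sketch of the migration this ticket covers: the deprecated
// lower-case helpers on scala.concurrent are replaced by the companion
// objects they delegate to, so runtime behavior is unchanged.
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global

// Before (deprecated in Scala 2.11):
//   val f = future { 1 + 1 }
//   val p = promise[Int]()

// After:
val f: Future[Int] = Future { 1 + 1 }
val p: Promise[Int] = Promise[Int]()
{code}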
[jira] [Commented] (SPARK-13216) Spark streaming application not honoring --num-executors in restarting of an application from a checkpoint
[ https://issues.apache.org/jira/browse/SPARK-13216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137797#comment-15137797 ] Hari Shreedharan commented on SPARK-13216: -- I disagree that checkpointing is only for failed applications. For any of the receiver-based streaming applications, checkpoints are important to recover as yet unprocessed data. If the application cannot be reloaded from a checkpoint - then the old data is pretty much gone. I know that checkpointing basically makes application and spark upgrades difficult or impossible, but there are configuration parameters that the users might want to change based on load requirements etc. I don't see a reason why we should not allow this, since it has nothing to do with starting the app from checkpoint or not - if we want the number of executors to change we should be able to. This is especially true when migrating from a non-dynamic allocation situation to a dynamic allocation situation. > Spark streaming application not honoring --num-executors in restarting of an > application from a checkpoint > -- > > Key: SPARK-13216 > URL: https://issues.apache.org/jira/browse/SPARK-13216 > Project: Spark > Issue Type: Bug > Components: Spark Submit, Streaming >Affects Versions: 1.5.0 >Reporter: Neelesh Srinivas Salian >Priority: Minor > Labels: Streaming > > Scenario to help understand: > 1) The Spark streaming job with 12 executors was initiated with checkpointing > enabled. > 2) In version 1.3, the user was able to append the number of executors to 20 > using --num-executors but was unable to do so in version 1.5. > In 1.5, the spark application still runs with 13 executors (1 for driver and > 12 executors). > There is a need to start from the checkpoint itself and not restart the > application to avoid the loss of information. > 3) Checked the code in 1.3 and 1.5, which shows the command > ''--num-executors" has been deprecated. > Any thoughts on this? Not sure if anyone hit this one specifically before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13095) improve performance of hash join with dimension table
[ https://issues.apache.org/jira/browse/SPARK-13095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13095. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11065 [https://github.com/apache/spark/pull/11065] > improve performance of hash join with dimension table > - > > Key: SPARK-13095 > URL: https://issues.apache.org/jira/browse/SPARK-13095 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > The join key is usually an integer or long (primary key, unique), we could > have special HashRelation for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
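The kind of specialization hinted at in the description can be sketched as follows; this is only an illustration of the idea of a primitive-long keyed table, not Spark's actual HashedRelation code:
{code}
// Hedged sketch of the idea only: when the join key is a unique primitive
// long (typical for a dimension table's primary key), an open-addressing
// table keyed by raw longs avoids boxing and generic hashCode/equals on
// every probe. No resizing; fixed power-of-two capacity for brevity.
final class LongKeyedTable[V <: AnyRef : scala.reflect.ClassTag](capacity: Int) {
  require((capacity & (capacity - 1)) == 0, "capacity must be a power of two")
  private val mask = capacity - 1
  private val keys = new Array[Long](capacity)
  private val values = new Array[V](capacity) // null slot means empty

  def put(key: Long, value: V): Unit = {
    var i = key.hashCode & mask
    while (values(i) != null && keys(i) != key) i = (i + 1) & mask
    keys(i) = key
    values(i) = value
  }

  /** Returns null if the key is absent. */
  def get(key: Long): V = {
    var i = key.hashCode & mask
    while (values(i) != null && keys(i) != key) i = (i + 1) & mask
    values(i)
  }
}
{code}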
[jira] [Commented] (SPARK-13027) Add API for updateStateByKey to provide batch time as input
[ https://issues.apache.org/jira/browse/SPARK-13027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137858#comment-15137858 ] Apache Spark commented on SPARK-13027: -- User 'aramesh117' has created a pull request for this issue: https://github.com/apache/spark/pull/11122 > Add API for updateStateByKey to provide batch time as input > --- > > Key: SPARK-13027 > URL: https://issues.apache.org/jira/browse/SPARK-13027 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Aaditya Ramesh > > The StateDStream currently does not provide the batch time as input to the > state update function. This is required in cases where the behavior depends > on the batch start time. > We (Conviva) have been patching it manually for the past several Spark > versions but we thought it might be useful for others as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
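To make the request concrete: the per-key update function currently receives only the new values and the previous state, and the ticket asks for a variant that also receives the batch time. The second function shape below is hypothetical, sketching the proposed API rather than an existing Spark signature:
{code}
// Sketch only. The first function shape is what updateStateByKey accepts
// today; the second, with the batch Time added, is the hypothetical shape
// this ticket asks for (not an existing Spark API).
import org.apache.spark.streaming.Time

// Today: (new values for the key, previous state) => new state
val updateWithoutTime: (Seq[Long], Option[Long]) => Option[Long] =
  (newValues, state) => Some(state.getOrElse(0L) + newValues.sum)

// Proposed: also pass the batch time, so the update can depend on it
val updateWithTime: (Time, Seq[Long], Option[Long]) => Option[Long] =
  (batchTime, newValues, state) =>
    Some(state.getOrElse(0L) + newValues.sum) // could branch on batchTime here
{code}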
[jira] [Assigned] (SPARK-13027) Add API for updateStateByKey to provide batch time as input
[ https://issues.apache.org/jira/browse/SPARK-13027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13027: Assignee: (was: Apache Spark) > Add API for updateStateByKey to provide batch time as input > --- > > Key: SPARK-13027 > URL: https://issues.apache.org/jira/browse/SPARK-13027 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Aaditya Ramesh > > The StateDStream currently does not provide the batch time as input to the > state update function. This is required in cases where the behavior depends > on the batch start time. > We (Conviva) have been patching it manually for the past several Spark > versions but we thought it might be useful for others as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13027) Add API for updateStateByKey to provide batch time as input
[ https://issues.apache.org/jira/browse/SPARK-13027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13027: Assignee: Apache Spark > Add API for updateStateByKey to provide batch time as input > --- > > Key: SPARK-13027 > URL: https://issues.apache.org/jira/browse/SPARK-13027 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Aaditya Ramesh >Assignee: Apache Spark > > The StateDStream currently does not provide the batch time as input to the > state update function. This is required in cases where the behavior depends > on the batch start time. > We (Conviva) have been patching it manually for the past several Spark > versions but we thought it might be useful for others as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
[ https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137924#comment-15137924 ] Amir Gur commented on SPARK-10528: -- Thanks, confirming it worked on win8.1 64 bit. > spark-shell throws java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. > -- > > Key: SPARK-10528 > URL: https://issues.apache.org/jira/browse/SPARK-10528 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.0 > Environment: Windows 7 x64 >Reporter: Aliaksei Belablotski >Priority: Minor > > Starting spark-shell throws > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13014: Assignee: (was: Apache Spark) > Replace example code in mllib-collaborative-filtering.md using include_example > -- > > Key: SPARK-13014 > URL: https://issues.apache.org/jira/browse/SPARK-13014 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137982#comment-15137982 ] Apache Spark commented on SPARK-13014: -- User 'keypointt' has created a pull request for this issue: https://github.com/apache/spark/pull/11123 > Replace example code in mllib-collaborative-filtering.md using include_example > -- > > Key: SPARK-13014 > URL: https://issues.apache.org/jira/browse/SPARK-13014 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13014) Replace example code in mllib-collaborative-filtering.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13014: Assignee: Apache Spark > Replace example code in mllib-collaborative-filtering.md using include_example > -- > > Key: SPARK-13014 > URL: https://issues.apache.org/jira/browse/SPARK-13014 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Apache Spark >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13236) SQL generation support for union
[ https://issues.apache.org/jira/browse/SPARK-13236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137988#comment-15137988 ] Xiao Li commented on SPARK-13236: - After the merge of Spark-13235, I will upload a PR for this. Thanks! > SQL generation support for union > > > Key: SPARK-13236 > URL: https://issues.apache.org/jira/browse/SPARK-13236 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > checkHiveQl("SELECT * FROM t0 UNION SELECT * FROM t0") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13018) Replace example code in mllib-pmml-model-export.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137996#comment-15137996 ] Xin Ren commented on SPARK-13018: - I'm working on this one, thanks :) > Replace example code in mllib-pmml-model-export.md using include_example > > > Key: SPARK-13018 > URL: https://issues.apache.org/jira/browse/SPARK-13018 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > See examples in other finished sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138024#comment-15138024 ] Evan Chan commented on SPARK-13219: --- [~smilegator] does your PR take care of the case where no JOIN clause is invoked? does it also take care of multiple join conditions? (e.g., select from a a, b b, c c where a.col1 = b.col1 && b.col1 = c.col1 && ) > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in query on attributes used to perform the join, it is useful to apply that > clause on scans for both table. If this is not done, one of the tables > results in full scan which can reduce the query dramatically. Consider > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org