[jira] [Resolved] (SPARK-20972) rename HintInfo.isBroadcastable to broadcast

2017-06-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20972.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18189
[https://github.com/apache/spark/pull/18189]

> rename HintInfo.isBroadcastable to broadcast
> 
>
> Key: SPARK-20972
> URL: https://issues.apache.org/jira/browse/SPARK-20972
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 2.3.0
>
>
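For context, {{HintInfo.isBroadcastable}} is the internal flag set when a relation is hinted for broadcast. A minimal sketch of the user-facing hint it backs, using Spark 2.x APIs (the table and column names below are made up for illustration):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").appName("broadcast-hint-sketch").getOrCreate()
import spark.implicits._

val large = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")
val small = Seq((1, "x"), (2, "y")).toDF("id", "w")

// DataFrame API: hint that the small side should be broadcast
large.join(broadcast(small), "id").explain()

// SQL hint syntax (available since Spark 2.2)
large.createOrReplaceTempView("large")
small.createOrReplaceTempView("small")
spark.sql("SELECT /*+ BROADCAST(small) */ * FROM large JOIN small ON large.id = small.id").explain()
{code}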







[jira] [Commented] (SPARK-11966) Spark API for UDTFs

2017-06-06 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040235#comment-16040235
 ] 

Dayou Zhou commented on SPARK-11966:


Hi [~hvanhovell], any examples of how this might be done using the Data Sources 
API?  Thanks.

> Spark API for UDTFs
> ---
>
> Key: SPARK-11966
> URL: https://issues.apache.org/jira/browse/SPARK-11966
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Jaka Jancar
>Priority: Minor
>
> Defining UDFs is easy using sqlContext.udf.register, but not table-generating 
> functions. For those you still have to use these horrendous Hive interfaces:
> https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java
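For comparison, a scalar UDF really is a one-liner to register, whereas a table-generating function still has to be written against Hive's {{GenericUDTF}} interface and registered through HiveQL. A minimal sketch, assuming a {{sqlContext}} with Hive support and a temp table {{t}} with an integer column {{value}}; {{com.example.MyGenericUDTF}} is a hypothetical class name:

{code}
// Scalar UDF: registered directly from Spark
sqlContext.udf.register("plusOne", (x: Int) => x + 1)
sqlContext.sql("SELECT plusOne(value) FROM t").show()

// Table-generating function: no Spark-native API. The function has to extend
// org.apache.hadoop.hive.ql.udf.generic.GenericUDTF (as in the linked example)
// and be registered via HiveQL:
sqlContext.sql("CREATE TEMPORARY FUNCTION my_udtf AS 'com.example.MyGenericUDTF'")
sqlContext.sql("SELECT my_udtf(value) FROM t").show()
{code}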






[jira] [Updated] (SPARK-20972) rename HintInfo.isBroadcastable to broadcast

2017-06-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20972:

Summary: rename HintInfo.isBroadcastable to broadcast  (was: rename 
HintInfo.isBroadcastable to forceBroadcast)

> rename HintInfo.isBroadcastable to broadcast
> 
>
> Key: SPARK-20972
> URL: https://issues.apache.org/jira/browse/SPARK-20972
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>







[jira] [Commented] (SPARK-21002) Syntax error regression when creating Hive storage handlers on Spark shell

2017-06-06 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040226#comment-16040226
 ] 

Dayou Zhou commented on SPARK-21002:


Hi [~viirya], thanks for pointing me to SPARK-19360 -- I missed it in my search.  
When will this be fixed?

> Syntax error regression when creating Hive storage handlers on Spark shell
> --
>
> Key: SPARK-21002
> URL: https://issues.apache.org/jira/browse/SPARK-21002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dayou Zhou
>
> I use the following syntax to create my Hive storage handlers:
> CREATE TABLE t1
> ROW FORMAT SERDE 'com.foo.MySerDe' 
> STORED BY 'com.foo.MyStorageHandler'
> TBLPROPERTIES (
>..
> );
> And I'm trying to do the same from Spark shell.  It works fine on Spark 1.6, 
> but on Spark 2.0+, it fails with the following:
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: Unexpected combination of ROW FORMAT SERDE 
> 'com.foo.MySerDe' and STORED BY 'com.foo.MyStorageHandler'(line 1, pos 0)
> Could you please confirm whether this is a regression and if so could it be 
> fixed in Spark 2.2?  Thank you.






[jira] [Commented] (SPARK-19360) Spark 2.X does not support stored by clause

2017-06-06 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040223#comment-16040223
 ] 

Dayou Zhou commented on SPARK-19360:


Same here...  Since STORED BY is **essential** for using Hive storage handlers, 
is this a regression which will be fixed?  Any workarounds?  Thanks.

> Spark 2.X does not support stored by clause
> ---
>
> Key: SPARK-19360
> URL: https://issues.apache.org/jira/browse/SPARK-19360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Ran Haim
>Priority: Minor
>
> Spark 1.6 and earlier support HiveContext, which supports Hive storage 
> handlers via the "stored by" clause. However, Spark 2.x does not support 
> "stored by". 






[jira] [Commented] (SPARK-20997) spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only"

2017-06-06 Thread guoxiaolongzte (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040210#comment-16040210
 ] 

guoxiaolongzte commented on SPARK-20997:


Can I fix this JIRA? I found a similar problem here, and I am addressing these 
issues together.
[~srowen]
[~jlaskowski]

> spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark 
> standalone with cluster deploy mode only"
> -
>
> Key: SPARK-20997
> URL: https://issues.apache.org/jira/browse/SPARK-20997
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Submit
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> Just noticed that {{spark-submit}} describes {{--driver-cores}} under:
> * Spark standalone with cluster deploy mode only
> * YARN-only
> While I can understand "only" in "Spark standalone with cluster deploy mode 
> only" to refer to the cluster deploy mode (not the default client mode), 
> "YARN-only" baffles me, and I think it deserves a fix.






[jira] [Assigned] (SPARK-20935) A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating StreamingContext.

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20935:


Assignee: (was: Apache Spark)

> A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating 
> StreamingContext.
> ---
>
> Key: SPARK-20935
> URL: https://issues.apache.org/jira/browse/SPARK-20935
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.3, 2.1.1
>Reporter: Terence Yim
>
> With the batched write-ahead log on by default in the driver (SPARK-11731), if 
> there is no receiver-based {{InputDStream}}, the "BatchedWriteAheadLog Writer" 
> thread created by {{BatchedWriteAheadLog}} never gets shut down. 
> The root cause is 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L168
> which never calls {{ReceivedBlockTracker.stop()}} (which in turn calls 
> {{BatchedWriteAheadLog.close()}}) if there is no receiver-based input.
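A minimal sketch of how one might observe the leftover thread, assuming the driver-side WAL is enabled via {{spark.streaming.receiver.writeAheadLog.enable}} and a checkpoint directory is set, with only a receiverless {{queueStream}} as input:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.JavaConverters._
import scala.collection.mutable

val conf = new SparkConf()
  .setAppName("wal-thread-check")
  .setMaster("local[2]")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/wal-thread-check")

// Receiverless input: no receiver is ever launched, so on shutdown the
// ReceivedBlockTracker (and its BatchedWriteAheadLog) is never stopped.
val queue = mutable.Queue.empty[RDD[Int]]
ssc.queueStream(queue).count().print()

ssc.start()
queue += ssc.sparkContext.parallelize(1 to 10)
Thread.sleep(3000)
ssc.stop(stopSparkContext = false)

// List any surviving writer threads.
Thread.getAllStackTraces.keySet.asScala
  .filter(_.getName.contains("BatchedWriteAheadLog"))
  .foreach(t => println(s"leftover thread: ${t.getName}"))
{code}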






[jira] [Assigned] (SPARK-20935) A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating StreamingContext.

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20935:


Assignee: Apache Spark

> A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating 
> StreamingContext.
> ---
>
> Key: SPARK-20935
> URL: https://issues.apache.org/jira/browse/SPARK-20935
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.3, 2.1.1
>Reporter: Terence Yim
>Assignee: Apache Spark
>
> With the batched write-ahead log on by default in the driver (SPARK-11731), if 
> there is no receiver-based {{InputDStream}}, the "BatchedWriteAheadLog Writer" 
> thread created by {{BatchedWriteAheadLog}} never gets shut down. 
> The root cause is 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L168
> which never calls {{ReceivedBlockTracker.stop()}} (which in turn calls 
> {{BatchedWriteAheadLog.close()}}) if there is no receiver-based input.






[jira] [Commented] (SPARK-20935) A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating StreamingContext.

2017-06-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040176#comment-16040176
 ] 

Apache Spark commented on SPARK-20935:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18224

> A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating 
> StreamingContext.
> ---
>
> Key: SPARK-20935
> URL: https://issues.apache.org/jira/browse/SPARK-20935
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.3, 2.1.1
>Reporter: Terence Yim
>
> With the batched write-ahead log on by default in the driver (SPARK-11731), if 
> there is no receiver-based {{InputDStream}}, the "BatchedWriteAheadLog Writer" 
> thread created by {{BatchedWriteAheadLog}} never gets shut down. 
> The root cause is 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L168
> which never calls {{ReceivedBlockTracker.stop()}} (which in turn calls 
> {{BatchedWriteAheadLog.close()}}) if there is no receiver-based input.






[jira] [Comment Edited] (SPARK-20760) Memory Leak of RDD blocks

2017-06-06 Thread Binzi Cao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040112#comment-16040112
 ] 

Binzi Cao edited comment on SPARK-20760 at 6/7/17 3:47 AM:
---

Hi David, 

Thanks very much for the message. I did a test with Spark 2.1.1 in local mode. 
The issue still seems to be happening, although it is much better than Spark 
2.1.0, as the RDD blocks grow much more slowly. After running the task for 2 
hours, I had around 6000 RDD blocks in memory. 

I attached the screenshots for 2.1.1.

Binzi


was (Author: caobinzi):
Hi David, 

Thanks very much for the message. I did a test with Spark 2.1.1 in local mode. 
The issue still seems to be happening, although it is much better than Spark 
2.0, as the RDD blocks grow much more slowly. After running the task for 2 
hours, I had around 6000 RDD blocks in memory. 

I attached the screenshots for 2.1.1.

Binzi

> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0
>Reporter: Binzi Cao
> Attachments: RDD blocks in spark 2.1.1.png, RDD Blocks .png, Storage 
> in spark 2.1.1.png
>
>
> Memory leak of RDD blocks in a long-running RDD process.
> We have a long-running application that performs computations over RDDs, and 
> we found that the RDD blocks keep increasing on the Spark UI page. The RDD 
> blocks and memory usage do not match the cached RDDs and memory. It looks like 
> Spark keeps old RDDs in memory and never releases them, or never gets a chance 
> to release them. The job eventually dies of out-of-memory errors. 
> In addition, I'm not seeing this issue in Spark 1.6. We are seeing the same 
> issue in YARN cluster mode in both Kafka streaming and batch applications. 
> The issue in streaming is similar; however, the RDD blocks seem to grow a bit 
> slower than in batch jobs. 
> Below is the sample code; the issue is reproducible by just running it in 
> local mode. 
> Scala file:
> {code}
> import scala.concurrent.duration.Duration
> import scala.util.{Try, Failure, Success}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import scala.concurrent._
> import ExecutionContext.Implicits.global
> case class Person(id: String, name: String)
> object RDDApp {
>   def run(sc: SparkContext) = {
> while (true) {
>   val r = scala.util.Random
>   val data = (1 to r.nextInt(100)).toList.map { a =>
> Person(a.toString, a.toString)
>   }
>   val rdd = sc.parallelize(data)
>   rdd.cache
>   println("running")
>   val a = (1 to 100).toList.map { x =>
> Future(rdd.filter(_.id == x.toString).collect)
>   }
>   a.foreach { f =>
> println(Await.ready(f, Duration.Inf).value.get)
>   }
>   rdd.unpersist()
> }
>   }
>   def main(args: Array[String]): Unit = {
>val conf = new SparkConf().setAppName("test")
> val sc   = new SparkContext(conf)
> run(sc)
>   }
> }
> {code}
> build sbt file:
> {code}
> name := "RDDTest"
> version := "0.1.1"
> scalaVersion := "2.11.5"
> libraryDependencies ++= Seq (
> "org.scalaz" %% "scalaz-core" % "7.2.0",
> "org.scalaz" %% "scalaz-concurrent" % "7.2.0",
> "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
> "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
>   )
> addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
> mainClass in assembly := Some("RDDApp")
> test in assembly := {}
> {code}
> To reproduce it: 
> Just 
> {code}
> spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
> --executor-memory 4G \
> --executor-cores 1 \
> --num-executors 1 \
> --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
> {code}






[jira] [Updated] (SPARK-20760) Memory Leak of RDD blocks

2017-06-06 Thread Binzi Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Binzi Cao updated SPARK-20760:
--
Attachment: Storage in spark 2.1.1.png
RDD blocks in spark 2.1.1.png

Spark 2.1.1 Test Result

> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0
>Reporter: Binzi Cao
> Attachments: RDD blocks in spark 2.1.1.png, RDD Blocks .png, Storage 
> in spark 2.1.1.png
>
>
> Memory leak of RDD blocks in a long-running RDD process.
> We have a long-running application that performs computations over RDDs, and 
> we found that the RDD blocks keep increasing on the Spark UI page. The RDD 
> blocks and memory usage do not match the cached RDDs and memory. It looks like 
> Spark keeps old RDDs in memory and never releases them, or never gets a chance 
> to release them. The job eventually dies of out-of-memory errors. 
> In addition, I'm not seeing this issue in Spark 1.6. We are seeing the same 
> issue in YARN cluster mode in both Kafka streaming and batch applications. 
> The issue in streaming is similar; however, the RDD blocks seem to grow a bit 
> slower than in batch jobs. 
> Below is the sample code; the issue is reproducible by just running it in 
> local mode. 
> Scala file:
> {code}
> import scala.concurrent.duration.Duration
> import scala.util.{Try, Failure, Success}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import scala.concurrent._
> import ExecutionContext.Implicits.global
> case class Person(id: String, name: String)
> object RDDApp {
>   def run(sc: SparkContext) = {
> while (true) {
>   val r = scala.util.Random
>   val data = (1 to r.nextInt(100)).toList.map { a =>
> Person(a.toString, a.toString)
>   }
>   val rdd = sc.parallelize(data)
>   rdd.cache
>   println("running")
>   val a = (1 to 100).toList.map { x =>
> Future(rdd.filter(_.id == x.toString).collect)
>   }
>   a.foreach { f =>
> println(Await.ready(f, Duration.Inf).value.get)
>   }
>   rdd.unpersist()
> }
>   }
>   def main(args: Array[String]): Unit = {
>val conf = new SparkConf().setAppName("test")
> val sc   = new SparkContext(conf)
> run(sc)
>   }
> }
> {code}
> build sbt file:
> {code}
> name := "RDDTest"
> version := "0.1.1"
> scalaVersion := "2.11.5"
> libraryDependencies ++= Seq (
> "org.scalaz" %% "scalaz-core" % "7.2.0",
> "org.scalaz" %% "scalaz-concurrent" % "7.2.0",
> "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
> "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
>   )
> addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
> mainClass in assembly := Some("RDDApp")
> test in assembly := {}
> {code}
> To reproduce it: 
> Just 
> {code}
> spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
> --executor-memory 4G \
> --executor-cores 1 \
> --num-executors 1 \
> --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
> {code}






[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks

2017-06-06 Thread Binzi Cao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040112#comment-16040112
 ] 

Binzi Cao commented on SPARK-20760:
---

Hi David, 

Thanks very much for the message. I did a test with Spark 2.1.1 in local mode. 
The issue still seems to be happening, although it is much better than Spark 
2.0, as the RDD blocks grow much more slowly. After running the task for 2 
hours, I had around 6000 RDD blocks in memory. 

I attached the screenshots for 2.1.1.

Binzi

> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0
>Reporter: Binzi Cao
> Attachments: RDD Blocks .png
>
>
> Memory leak of RDD blocks in a long-running RDD process.
> We have a long-running application that performs computations over RDDs, and 
> we found that the RDD blocks keep increasing on the Spark UI page. The RDD 
> blocks and memory usage do not match the cached RDDs and memory. It looks like 
> Spark keeps old RDDs in memory and never releases them, or never gets a chance 
> to release them. The job eventually dies of out-of-memory errors. 
> In addition, I'm not seeing this issue in Spark 1.6. We are seeing the same 
> issue in YARN cluster mode in both Kafka streaming and batch applications. 
> The issue in streaming is similar; however, the RDD blocks seem to grow a bit 
> slower than in batch jobs. 
> Below is the sample code; the issue is reproducible by just running it in 
> local mode. 
> Scala file:
> {code}
> import scala.concurrent.duration.Duration
> import scala.util.{Try, Failure, Success}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import scala.concurrent._
> import ExecutionContext.Implicits.global
> case class Person(id: String, name: String)
> object RDDApp {
>   def run(sc: SparkContext) = {
> while (true) {
>   val r = scala.util.Random
>   val data = (1 to r.nextInt(100)).toList.map { a =>
> Person(a.toString, a.toString)
>   }
>   val rdd = sc.parallelize(data)
>   rdd.cache
>   println("running")
>   val a = (1 to 100).toList.map { x =>
> Future(rdd.filter(_.id == x.toString).collect)
>   }
>   a.foreach { f =>
> println(Await.ready(f, Duration.Inf).value.get)
>   }
>   rdd.unpersist()
> }
>   }
>   def main(args: Array[String]): Unit = {
>val conf = new SparkConf().setAppName("test")
> val sc   = new SparkContext(conf)
> run(sc)
>   }
> }
> {code}
> build sbt file:
> {code}
> name := "RDDTest"
> version := "0.1.1"
> scalaVersion := "2.11.5"
> libraryDependencies ++= Seq (
> "org.scalaz" %% "scalaz-core" % "7.2.0",
> "org.scalaz" %% "scalaz-concurrent" % "7.2.0",
> "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
> "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
>   )
> addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
> mainClass in assembly := Some("RDDApp")
> test in assembly := {}
> {code}
> To reproduce it: 
> Just 
> {code}
> spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
> --executor-memory 4G \
> --executor-cores 1 \
> --num-executors 1 \
> --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
> {code}






[jira] [Resolved] (SPARK-21002) Syntax error regression when creating Hive storage handlers on Spark shell

2017-06-06 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21002.
--
Resolution: Duplicate

> Syntax error regression when creating Hive storage handlers on Spark shell
> --
>
> Key: SPARK-21002
> URL: https://issues.apache.org/jira/browse/SPARK-21002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dayou Zhou
>
> I use the following syntax to create my Hive storage handlers:
> CREATE TABLE t1
> ROW FORMAT SERDE 'com.foo.MySerDe' 
> STORED BY 'com.foo.MyStorageHandler'
> TBLPROPERTIES (
>..
> );
> And I'm trying to do the same from Spark shell.  It works fine on Spark 1.6, 
> but on Spark 2.0+, it fails with the following:
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: Unexpected combination of ROW FORMAT SERDE 
> 'com.foo.MySerDe' and STORED BY 'com.foo.MyStorageHandler'(line 1, pos 0)
> Could you please confirm whether this is a regression and if so could it be 
> fixed in Spark 2.2?  Thank you.






[jira] [Commented] (SPARK-21002) Syntax error regression when creating Hive storage handlers on Spark shell

2017-06-06 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040057#comment-16040057
 ] 

Liang-Chi Hsieh commented on SPARK-21002:
-

This is a duplicate of SPARK-19360.

> Syntax error regression when creating Hive storage handlers on Spark shell
> --
>
> Key: SPARK-21002
> URL: https://issues.apache.org/jira/browse/SPARK-21002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dayou Zhou
>
> I use the following syntax to create my Hive storage handlers:
> CREATE TABLE t1
> ROW FORMAT SERDE 'com.foo.MySerDe' 
> STORED BY 'com.foo.MyStorageHandler'
> TBLPROPERTIES (
>..
> );
> And I'm trying to do the same from Spark shell.  It works fine on Spark 1.6, 
> but on Spark 2.0+, it fails with the following:
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: Unexpected combination of ROW FORMAT SERDE 
> 'com.foo.MySerDe' and STORED BY 'com.foo.MyStorageHandler'(line 1, pos 0)
> Could you please confirm whether this is a regression and if so could it be 
> fixed in Spark 2.2?  Thank you.






[jira] [Commented] (SPARK-21002) Syntax error regression when creating Hive storage handlers on Spark shell

2017-06-06 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040055#comment-16040055
 ] 

Liang-Chi Hsieh commented on SPARK-21002:
-

It seems we no longer support {{STORED BY storage_handler}} since 2.0.

> Syntax error regression when creating Hive storage handlers on Spark shell
> --
>
> Key: SPARK-21002
> URL: https://issues.apache.org/jira/browse/SPARK-21002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dayou Zhou
>
> I use the following syntax to create my Hive storage handlers:
> CREATE TABLE t1
> ROW FORMAT SERDE 'com.foo.MySerDe' 
> STORED BY 'com.foo.MyStorageHandler'
> TBLPROPERTIES (
>..
> );
> And I'm trying to do the same from Spark shell.  It works fine on Spark 1.6, 
> but on Spark 2.0+, it fails with the following:
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: Unexpected combination of ROW FORMAT SERDE 
> 'com.foo.MySerDe' and STORED BY 'com.foo.MyStorageHandler'(line 1, pos 0)
> Could you please confirm whether this is a regression and if so could it be 
> fixed in Spark 2.2?  Thank you.






[jira] [Comment Edited] (SPARK-20977) NPE in CollectionAccumulator

2017-06-06 Thread Boris Capitanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039870#comment-16039870
 ] 

Boris Capitanu edited comment on SPARK-20977 at 6/6/17 11:35 PM:
-

I am seeing this problem as well.  

I have a scala SBT project with JavaAppPackaging using Spark 2.1.1.
I built my project with "sbt stage" and when I ran the resulting app, the log 
showed the same error reported here.

{noformat}
17/06/06 19:25:09 ERROR [heartbeat-receiver-event-loop-thread] o.a.s.u.Utils: 
Uncaught exception in thread heartbeat-receiver-even
t-loop-thread
java.lang.NullPointerException: null
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:407)
at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:129)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:128)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}


was (Author: borice):
I am seeing this problem as well.  

I have a scala SBT project with JavaAppPackaging using Spark 2.1.1.
I built my project with "sbt stage" and when I ran the resulting app, the log 
showed the same error reported here.

{quote}
17/06/06 19:25:09 ERROR [heartbeat-receiver-event-loop-thread] o.a.s.u.Utils: 
Uncaught exception in thread heartbeat-receiver-even
t-loop-thread
java.lang.NullPointerException: null
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at 

[jira] [Commented] (SPARK-20977) NPE in CollectionAccumulator

2017-06-06 Thread Boris Capitanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039870#comment-16039870
 ] 

Boris Capitanu commented on SPARK-20977:


I am seeing this problem as well.  

I have a scala SBT project with JavaAppPackaging using Spark 2.1.1.
I built my project with "sbt stage" and when I ran the resulting app, the log 
showed the same error reported here.

{quote}
17/06/06 19:25:09 ERROR [heartbeat-receiver-event-loop-thread] o.a.s.u.Utils: 
Uncaught exception in thread heartbeat-receiver-even
t-loop-thread
java.lang.NullPointerException: null
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:407)
at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:129)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:128)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{quote}

> NPE in CollectionAccumulator
> 
>
> Key: SPARK-20977
> URL: https://issues.apache.org/jira/browse/SPARK-20977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: JDK:
> openjdk version "1.8.0-internal"
> OpenJDK Runtime Environment (build 1.8.0-internal-horii_2016_12_20_18_43-b00)
> OpenJDK 64-Bit Server VM (build 25.71-b00, mixed mode)
> CPU:
> POWER8
>Reporter: sharkd tu
>
> 17/06/03 13:39:31 ERROR Utils: Uncaught exception in thread 
> heartbeat-receiver-event-loop-thread
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 

[jira] [Commented] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows

2017-06-06 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039862#comment-16039862
 ] 

Liang-Chi Hsieh commented on SPARK-20969:
-

[~pletelli] I can't find an API doc for that. Maybe we can add one.

> last() aggregate function fails returning the right answer with ordered 
> windows
> ---
>
> Key: SPARK-20969
> URL: https://issues.apache.org/jira/browse/SPARK-20969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Perrine Letellier
>
> The column on which `orderBy` is performed is effectively treated as another 
> partitioning column.
> {code}
> scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), 
> ("i1", 2, "desc3"))).toDF("id", "ts", "description")
> scala> import org.apache.spark.sql.expressions.Window
> scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
> scala> df.withColumn("last", last(col("description")).over(window)).show
> +---+---+---+-+
> | id| ts|description| last|
> +---+---+---+-+
> | i1|  1|  desc1|desc2|
> | i1|  1|  desc2|desc2|
> | i1|  2|  desc3|desc3|
> +---+---+---+-+
> {code}
> However, what is expected is the same answer as when asking for `first()` over 
> a window ordered in descending order.
> {code}
> scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
> scala> df.withColumn("hackedLast", 
> first(col("description")).over(window)).show
> +---+---+---+--+
> | id| ts|description|hackedLast|
> +---+---+---+--+
> | i1|  2|  desc3| desc3|
> | i1|  1|  desc1| desc3|
> | i1|  1|  desc2| desc3|
> +---+---+---+--+
> {code}
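A likely explanation (not confirmed in this thread): when a window has an ORDER BY but no explicit frame, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so {{last()}} returns the last value of that growing frame rather than of the whole partition. A sketch of making the frame explicit, which gives the expected answer, assuming Spark 2.1+:

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

val window = Window
  .partitionBy("id")
  .orderBy(col("ts").asc)
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// With the full-partition frame, "last" is desc3 for every row of id i1.
df.withColumn("last", last(col("description")).over(window)).show()
{code}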






[jira] [Issue Comment Deleted] (SPARK-20973) insert table fail caused by unable to fetch data definition file from remote hdfs

2017-06-06 Thread Yunjian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yunjian Zhang updated SPARK-20973:
--
Comment: was deleted

(was: I did check the source code and add a patch to fix the insert issue as 
below, unable to attach file here, so just past the content as well.
--
--- 
a/./workspace1/spark-2.1.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala
+++ 
b/./workspace/git/gdr/spark/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala
@@ -57,7 +57,7 @@ private[hive] class SparkHiveWriterContainer(
   extends Logging
   with HiveInspectors
   with Serializable {
-
+
   private val now = new Date()
   private val tableDesc: TableDesc = fileSinkConf.getTableInfo
   // Add table properties from storage handler to jobConf, so any custom 
storage
@@ -154,6 +154,12 @@ private[hive] class SparkHiveWriterContainer(
 conf.value.setBoolean("mapred.task.is.map", true)
 conf.value.setInt("mapred.task.partition", splitID)
   }
+
+  def newSerializer(tableDesc: TableDesc): Serializer = {
+val serializer = 
tableDesc.getDeserializerClass.newInstance().asInstanceOf[Serializer]
+serializer.initialize(null, tableDesc.getProperties)
+serializer
+  }
 
   def newSerializer(jobConf: JobConf, tableDesc: TableDesc): Serializer = {
 val serializer = 
tableDesc.getDeserializerClass.newInstance().asInstanceOf[Serializer]
@@ -162,10 +168,11 @@ private[hive] class SparkHiveWriterContainer(
   }
 
   protected def prepareForWrite() = {
-val serializer = newSerializer(jobConf, fileSinkConf.getTableInfo)
+val serializer = newSerializer(conf.value, fileSinkConf.getTableInfo)
+logInfo("CHECK table deser:" + 
fileSinkConf.getTableInfo.getDeserializer(conf.value))
 val standardOI = ObjectInspectorUtils
   .getStandardObjectInspector(
-fileSinkConf.getTableInfo.getDeserializer.getObjectInspector,
+
fileSinkConf.getTableInfo.getDeserializer(conf.value).getObjectInspector,
 ObjectInspectorCopyOption.JAVA)
   .asInstanceOf[StructObjectInspector])

> insert table fail caused by unable to fetch data definition file from remote 
> hdfs 
> --
>
> Key: SPARK-20973
> URL: https://issues.apache.org/jira/browse/SPARK-20973
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yunjian Zhang
>  Labels: patch
> Attachments: spark-sql-insert.patch
>
>
> I implemented my own Hive serde to handle special data files; it needs to 
> read a data definition during processing.
> The process includes:
> 1. read the definition file location from TBLPROPERTIES
> 2. read the file content as per step 1
> 3. initialize the serde based on step 2.
> //DDL of the table as below:
> -
> CREATE EXTERNAL TABLE dw_user_stg_txt_out
> ROW FORMAT SERDE 'com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe'
> STORED AS
>   INPUTFORMAT 'com.ebay.dss.gdr.mapred.AbAsAvroInputFormat'
>   OUTPUTFORMAT 'com.ebay.dss.gdr.hive.ql.io.ab.AvroAsAbOutputFormat'
> LOCATION 'hdfs://${remote_hdfs}/user/data'
> TBLPROPERTIES (
>   'com.ebay.dss.dml.file' = 'hdfs://${remote_hdfs}/dml/user.dml'
> )
> // insert statement
> insert overwrite table dw_user_stg_txt_out select * from dw_user_stg_txt_avro;
> //fail with ERROR
> 17/06/02 15:46:34 ERROR SparkSQLDriver: Failed in [insert overwrite table 
> dw_user_stg_txt_out select * from dw_user_stg_txt_avro]
> java.lang.RuntimeException: FAILED to get dml file from: 
> hdfs://${remote-hdfs}/dml/user.dml
>   at 
> com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe.initialize(AbvroSerDe.java:109)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:160)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:258)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347)






[jira] [Updated] (SPARK-20973) insert table fail caused by unable to fetch data definition file from remote hdfs

2017-06-06 Thread Yunjian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yunjian Zhang updated SPARK-20973:
--
Attachment: spark-sql-insert.patch

> insert table fail caused by unable to fetch data definition file from remote 
> hdfs 
> --
>
> Key: SPARK-20973
> URL: https://issues.apache.org/jira/browse/SPARK-20973
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yunjian Zhang
>  Labels: patch
> Attachments: spark-sql-insert.patch
>
>
> I implemented my own Hive serde to handle special data files; it needs to 
> read a data definition during processing.
> The process includes:
> 1. read the definition file location from TBLPROPERTIES
> 2. read the file content as per step 1
> 3. initialize the serde based on step 2.
> //DDL of the table as below:
> -
> CREATE EXTERNAL TABLE dw_user_stg_txt_out
> ROW FORMAT SERDE 'com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe'
> STORED AS
>   INPUTFORMAT 'com.ebay.dss.gdr.mapred.AbAsAvroInputFormat'
>   OUTPUTFORMAT 'com.ebay.dss.gdr.hive.ql.io.ab.AvroAsAbOutputFormat'
> LOCATION 'hdfs://${remote_hdfs}/user/data'
> TBLPROPERTIES (
>   'com.ebay.dss.dml.file' = 'hdfs://${remote_hdfs}/dml/user.dml'
> )
> // insert statement
> insert overwrite table dw_user_stg_txt_out select * from dw_user_stg_txt_avro;
> //fail with ERROR
> 17/06/02 15:46:34 ERROR SparkSQLDriver: Failed in [insert overwrite table 
> dw_user_stg_txt_out select * from dw_user_stg_txt_avro]
> java.lang.RuntimeException: FAILED to get dml file from: 
> hdfs://${remote-hdfs}/dml/user.dml
>   at 
> com.ebay.dss.gdr.hive.serde.abvro.AbvroSerDe.initialize(AbvroSerDe.java:109)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:160)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:258)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347)






[jira] [Created] (SPARK-21002) Syntax error regression when creating Hive storage handlers on Spark shell

2017-06-06 Thread Dayou Zhou (JIRA)
Dayou Zhou created SPARK-21002:
--

 Summary: Syntax error regression when creating Hive storage 
handlers on Spark shell
 Key: SPARK-21002
 URL: https://issues.apache.org/jira/browse/SPARK-21002
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Dayou Zhou


I use the following syntax to create my Hive storage handlers:

CREATE TABLE t1
ROW FORMAT SERDE 'com.foo.MySerDe' 
STORED BY 'com.foo.MyStorageHandler'
TBLPROPERTIES (
   ..
);

And I'm trying to do the same from Spark shell.  It works fine on Spark 1.6, 
but on Spark 2.0+, it fails with the following:

org.apache.spark.sql.catalyst.parser.ParseException:
Operation not allowed: Unexpected combination of ROW FORMAT SERDE 
'com.foo.MySerDe' and STORED BY 'com.foo.MyStorageHandler'(line 1, pos 0)

Could you please confirm whether this is a regression and if so could it be 
fixed in Spark 2.2?  Thank you.






[jira] [Commented] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala

2017-06-06 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039762#comment-16039762
 ] 

Dayou Zhou commented on SPARK-19878:


Hello, just checking: will this be shipped in Spark 2.2?  Thank you.

> Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
> --
>
> Key: SPARK-19878
> URL: https://issues.apache.org/jira/browse/SPARK-19878
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0, 2.0.0
> Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0
>Reporter: kavn qin
>  Labels: patch
> Attachments: SPARK-19878.patch
>
>
> When the case class InsertIntoHiveTable initializes a serde, it explicitly 
> passes null for the Configuration in Spark 1.5.0:
> [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]
> Likewise in Spark 2.0.0, when the HiveWriterContainer initializes a serde, it 
> also just passes null for the Configuration:
> [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]
> When we implement a Hive serde, we want to use the Hive configuration to get 
> some static and dynamic settings, but we cannot do that!
> So this patch adds the configuration when initializing the Hive serde.






[jira] [Commented] (SPARK-20960) make ColumnVector public

2017-06-06 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039648#comment-16039648
 ] 

Wes McKinney commented on SPARK-20960:
--

[~cloud_fan] this will be very exciting to have as a supported public API for 
more efficient UDF execution. We're ready to help with improvements to Arrow 
(like in-memory encodings / compression a la ARROW-300) to help with these use 
cases.

cc [~jnadeau] [~julienledem]

> make ColumnVector public
> 
>
> Key: SPARK-20960
> URL: https://issues.apache.org/jira/browse/SPARK-20960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> ColumnVector is an internal interface in Spark SQL, currently used only by the 
> vectorized Parquet reader to represent the in-memory columnar format.
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a more 
> efficient way to exchange data between Spark and external systems. For example, 
> we can use ColumnVector to build the columnar read API in the data source 
> framework, or to build a more efficient UDF API, etc.
> We also want to introduce a new ColumnVector implementation based on Apache 
> Arrow (basically just a wrapper over Arrow), so that external systems (like 
> Python pandas DataFrames) can build ColumnVectors very easily.
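To make the idea concrete, a purely illustrative sketch of what a read-only columnar vector can look like; this is hand-rolled and is not Spark's actual ColumnVector API:

{code}
// Illustrative only: a minimal read-only column of ints with a null mask,
// roughly the shape of an in-memory columnar batch column.
trait IntColumn {
  def numRows: Int
  def isNullAt(rowId: Int): Boolean
  def getInt(rowId: Int): Int
}

final class OnHeapIntColumn(values: Array[Int], nulls: Array[Boolean]) extends IntColumn {
  require(values.length == nulls.length, "values and null mask must have the same length")
  def numRows: Int = values.length
  def isNullAt(rowId: Int): Boolean = nulls(rowId)
  def getInt(rowId: Int): Int = values(rowId)
}

// A consumer (e.g. a vectorized operator or a UDF runtime) iterates by row id
// without allocating a row object per record.
val column = new OnHeapIntColumn(Array(1, 2, 3), Array(false, false, true))
var sum = 0L
var i = 0
while (i < column.numRows) {
  if (!column.isNullAt(i)) sum += column.getInt(i)
  i += 1
}
println(sum) // 3
{code}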






[jira] [Assigned] (SPARK-20655) In-memory key-value store implementation

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20655:


Assignee: Apache Spark

> In-memory key-value store implementation
> 
>
> Key: SPARK-20655
> URL: https://issues.apache.org/jira/browse/SPARK-20655
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks adding an in-memory implementation for the key-value store 
> abstraction added in SPARK-20641. This is desired because people might want 
> to avoid having to store this data on disk when running applications, and 
> also because the LevelDB native libraries are not available on all platforms.
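As an illustration of the kind of abstraction involved, a minimal in-memory key-value store might look like the sketch below; the names are made up and this is not the interface added in SPARK-20641:

{code}
import java.util.concurrent.ConcurrentHashMap

// Illustrative key-value store abstraction plus a heap-backed implementation,
// avoiding both disk writes and the LevelDB native library.
trait KeyValueStore[K, V] {
  def read(key: K): Option[V]
  def write(key: K, value: V): Unit
  def delete(key: K): Unit
  def count: Long
}

final class InMemoryKeyValueStore[K, V] extends KeyValueStore[K, V] {
  private val data = new ConcurrentHashMap[K, V]()
  override def read(key: K): Option[V] = Option(data.get(key))
  override def write(key: K, value: V): Unit = data.put(key, value)
  override def delete(key: K): Unit = data.remove(key)
  override def count: Long = data.size().toLong
}

val store = new InMemoryKeyValueStore[String, String]()
store.write("app-1", "RUNNING")
println(store.read("app-1")) // Some(RUNNING)
{code}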






[jira] [Assigned] (SPARK-20655) In-memory key-value store implementation

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20655:


Assignee: (was: Apache Spark)

> In-memory key-value store implementation
> 
>
> Key: SPARK-20655
> URL: https://issues.apache.org/jira/browse/SPARK-20655
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks adding an in-memory implementation for the key-value store 
> abstraction added in SPARK-20641. This is desired because people might want 
> to avoid having to store this data on disk when running applications, and 
> also because the LevelDB native libraries are not available on all platforms.






[jira] [Commented] (SPARK-20655) In-memory key-value store implementation

2017-06-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039619#comment-16039619
 ] 

Apache Spark commented on SPARK-20655:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/18221

> In-memory key-value store implementation
> 
>
> Key: SPARK-20655
> URL: https://issues.apache.org/jira/browse/SPARK-20655
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks adding an in-memory implementation for the key-value store 
> abstraction added in SPARK-20641. This is desired because people might want 
> to avoid having to store this data on disk when running applications, and 
> also because the LevelDB native libraries are not available on all platforms.






[jira] [Commented] (SPARK-20357) Expose Calendar.getWeekYear() as Spark SQL date function to be consistent with weekofyear()

2017-06-06 Thread Cam Mach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039496#comment-16039496
 ] 

Cam Mach commented on SPARK-20357:
--

I can work on this issue. Can someone assign it to me? Thanks

> Expose Calendar.getWeekYear() as Spark SQL date function to be consistent 
> with weekofyear()
> ---
>
> Key: SPARK-20357
> URL: https://issues.apache.org/jira/browse/SPARK-20357
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Jeeyoung Kim
>Priority: Minor
>
> Since weeks and years are extracted using different boundaries (weeks happen 
> every 7 days; years happen every 365-ish days, which is not divisible by 7), 
> there are weird inconsistencies in how end-of-year dates are handled when you 
> use the {{year}} and {{weekofyear}} Spark SQL functions. The example below 
> shows how "2017-01-01" and "2017-12-30" end up with the same {{(year, week)}} pair.
> This happens because the week for "2017-01-01" is calculated as "last week of 
> 2016", while the year function in Spark SQL ignores this and returns the year 
> component of the yyyy-MM-dd date.
> The correct way to fix this is by exposing {{java.util.Calendar.getWeekYear()}}. 
> This function calculates week-based years, so "2017-01-01" would return 2016 
> instead in this case.
> {noformat}
> # Trying out the bug for date - using PySpark
> import pyspark.sql.functions as F
> df = spark.createDataFrame([("2016-12-31",),("2016-12-30",), ("2017-01-01",), 
> ("2017-01-02",),("2017-12-30",)], ['id'])
> df_parsed = (
> df
> .withColumn("year", F.year(df['id'].cast("date")))
> .withColumn("weekofyear", F.weekofyear(df['id'].cast("date")))
> )
> df_parsed.show()
> {noformat}
> Prints 
> {noformat}
> +----------+----+----------+
> |        id|year|weekofyear|
> +----------+----+----------+
> |2016-12-31|2016|        52|
> |2016-12-30|2016|        52|
> |2017-01-01|2017|        52| <- same (year, weekofyear) pair as 2017-12-30
> |2017-01-02|2017|         1|
> |2017-12-30|2017|        52| <-
> +----------+----+----------+
> {noformat}
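For reference, a small sketch of the week-based year being proposed, computed here with {{java.time}}'s ISO week fields (which match the ISO-style weeks that {{weekofyear}} returns); pairing the week with this value removes the collision shown above:

{code}
import java.time.LocalDate
import java.time.temporal.WeekFields

def weekBasedYear(date: String): Int =
  LocalDate.parse(date).get(WeekFields.ISO.weekBasedYear())

def isoWeek(date: String): Int =
  LocalDate.parse(date).get(WeekFields.ISO.weekOfWeekBasedYear())

Seq("2016-12-31", "2016-12-30", "2017-01-01", "2017-01-02", "2017-12-30").foreach { d =>
  println(s"$d -> weekBasedYear=${weekBasedYear(d)} week=${isoWeek(d)}")
}
// 2017-01-01 -> weekBasedYear=2016 week=52   (no longer collides with 2017-12-30)
{code}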






[jira] [Commented] (SPARK-20998) BroadcastHashJoin producing wrong results

2017-06-06 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039499#comment-16039499
 ] 

Takeshi Yamamuro commented on SPARK-20998:
--

What about v2.1? Does that version have the same issue?

> BroadcastHashJoin producing wrong results
> -
>
> Key: SPARK-20998
> URL: https://issues.apache.org/jira/browse/SPARK-20998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Mohit
>
> I have a hive table : _eagle_edw_batch.DistributionAttributes_, with 
> *Schema*: 
> root
>  |-- distributionstatus: string (nullable = true)
>  |-- enabledforselectionflag: boolean (nullable = true)
>  |-- sourcedistributionid: integer (nullable = true)
>  |-- rowstartdate: date (nullable = true)
>  |-- rowenddate: date (nullable = true)
>  |-- rowiscurrent: string (nullable = true)
>  |-- dwcreatedate: timestamp (nullable = true)
>  |-- dwlastupdatedate: timestamp (nullable = true)
>  |-- appid: integer (nullable = true)
>  |-- siteid: integer (nullable = true)
>  |-- brandid: integer (nullable = true)
> *DataFrame*
> val df = spark.sql("SELECT  s.sourcedistributionid as sid, 
> t.sourcedistributionid as tid,  s.appid as sapp, t.appid as tapp,  s.brandid 
> as sbrand, t.brandid as tbrand FROM eagle_edw_batch.DistributionAttributes t 
> INNER JOIN eagle_edw_batch.DistributionAttributes s  ON 
> t.sourcedistributionid=s.sourcedistributionid AND t.appid=s.appid  AND 
> t.brandid=s.brandid").
> *Without BroadCastJoin* ( spark-shell --conf 
> "spark.sql.autoBroadcastJoinThreshold=-1") : 
> df.explain
> == Physical Plan ==
> *Project [sourcedistributionid#71 AS sid#0, sourcedistributionid#60 AS tid#1, 
> appid#77 AS sapp#2, appid#66 AS tapp#3, brandid#79 AS sbrand#4, brandid#68 AS 
> tbrand#5]
> +- *SortMergeJoin [sourcedistributionid#60, appid#66, brandid#68], 
> [sourcedistributionid#71, appid#77, brandid#79], Inner
>:- *Sort [sourcedistributionid#60 ASC, appid#66 ASC, brandid#68 ASC], 
> false, 0
>:  +- Exchange hashpartitioning(sourcedistributionid#60, appid#66, 
> brandid#68, 200)
>: +- *Filter ((isnotnull(sourcedistributionid#60) && 
> isnotnull(brandid#68)) && isnotnull(appid#66))
>:+- HiveTableScan [sourcedistributionid#60, appid#66, brandid#68], 
> MetastoreRelation eagle_edw_batch, distributionattributes, t
>+- *Sort [sourcedistributionid#71 ASC, appid#77 ASC, brandid#79 ASC], 
> false, 0
>   +- Exchange hashpartitioning(sourcedistributionid#71, appid#77, 
> brandid#79, 200)
>  +- *Filter ((isnotnull(sourcedistributionid#71) && 
> isnotnull(appid#77)) && isnotnull(brandid#79))
> +- HiveTableScan [sourcedistributionid#71, appid#77, brandid#79], 
> MetastoreRelation eagle_edw_batch, distributionattributes, s
> df.show
> |sid|tid|sapp|tapp|sbrand|tbrand|
> | 22| 22|  61|  61|   614|   614|
> | 29| 29|  65|  65| 0| 0|
> | 30| 30|  12|  12|   121|   121|
> | 10| 10|  73|  73|   731|   731|
> | 24| 24|  61|  61|   611|   611|
> | 35| 35|  65|  65| 0| 0|
> *With BroadCastJoin* ( spark-shell )
> df.explain
> == Physical Plan ==
> *Project [sourcedistributionid#136 AS sid#65, sourcedistributionid#125 AS 
> tid#66, appid#142 AS sapp#67, appid#131 AS tapp#68, brandid#144 AS sbrand#69, 
> brandid#133 AS tbrand#70]
> +- *BroadcastHashJoin [sourcedistributionid#125, appid#131, brandid#133], 
> [sourcedistributionid#136, appid#142, brandid#144], Inner, BuildRight
>:- *Filter ((isnotnull(brandid#133) && isnotnull(appid#131)) && 
> isnotnull(sourcedistributionid#125))
>:  +- HiveTableScan [sourcedistributionid#125, appid#131, brandid#133], 
> MetastoreRelation eagle_edw_batch, distributionattributes, t
>+- BroadcastExchange 
> HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
> false] as bigint), 32) | (cast(input[1, int, false] as bigint) & 
> 4294967295)), 32) | (cast(input[2, int, false] as bigint) & 4294967295
>   +- *Filter ((isnotnull(brandid#144) && 
> isnotnull(sourcedistributionid#136)) && isnotnull(appid#142))
>  +- HiveTableScan [sourcedistributionid#136, appid#142, brandid#144], 
> MetastoreRelation eagle_edw_batch, distributionattributes, s
> df.show
> |sid|tid|sapp|tapp|sbrand|tbrand|
> | 15| 22|  61|  61|   614|   614|
> | 13| 22|  61|  61|   614|   614|
> | 10| 22|  61|  61|   614|   614|
> |  7| 22|  61|  61|   614|   614|
> |  9| 22|  61|  61|   614|   614|
> | 16| 22|  61|  61|   614|   614|
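
As a possible session-level workaround sketch (not a fix for the wrong results, and 
equivalent to the --conf flag used above), the broadcast path can also be disabled 
from inside the shell:

{code}
// Fall back to SortMergeJoin (which returns the expected rows above) by
// turning off automatic broadcast joins for this session.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val df = spark.sql(
  """SELECT s.sourcedistributionid AS sid, t.sourcedistributionid AS tid
    |FROM eagle_edw_batch.DistributionAttributes t
    |INNER JOIN eagle_edw_batch.DistributionAttributes s
    |  ON t.sourcedistributionid = s.sourcedistributionid
    | AND t.appid = s.appid AND t.brandid = s.brandid""".stripMargin)
df.explain()  // should now show SortMergeJoin instead of BroadcastHashJoin
{code}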



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20641) Key-value store abstraction and implementation for storing application data

2017-06-06 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-20641.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17902
[https://github.com/apache/spark/pull/17902]

> Key-value store abstraction and implementation for storing application data
> ---
>
> Key: SPARK-20641
> URL: https://issues.apache.org/jira/browse/SPARK-20641
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks adding a key-value store abstraction and initial LevelDB 
> implementation to be used to store application data for building the UI and 
> REST API.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks

2017-06-06 Thread David Lewis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039375#comment-16039375
 ] 

David Lewis commented on SPARK-20760:
-

I believe this is fixed by this issue: 
https://issues.apache.org/jira/browse/SPARK-18991

> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0
>Reporter: Binzi Cao
> Attachments: RDD Blocks .png
>
>
> RDD blocks leak in a long-running RDD process.
> We have a long-running application that performs computations over RDDs, and 
> we found that the number of RDD blocks keeps increasing on the Spark UI page. 
> The RDD blocks and memory usage do not match the cached RDDs and memory. It 
> looks like Spark keeps old RDDs in memory and never releases them, or never gets a 
> chance to release them. The job eventually dies with an out-of-memory error. 
> In addition, I'm not seeing this issue in Spark 1.6. We see the same 
> issue in YARN cluster mode in both Kafka streaming and batch applications. 
> The issue in streaming is similar; however, the RDD blocks seem to grow a 
> bit more slowly than in batch jobs. 
> Below is sample code; the issue is reproducible by just running it in 
> local mode. 
> Scala file:
> {code}
> import scala.concurrent.duration.Duration
> import scala.util.{Try, Failure, Success}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import scala.concurrent._
> import ExecutionContext.Implicits.global
> case class Person(id: String, name: String)
> object RDDApp {
>   def run(sc: SparkContext) = {
>     while (true) {
>       val r = scala.util.Random
>       val data = (1 to r.nextInt(100)).toList.map { a =>
>         Person(a.toString, a.toString)
>       }
>       val rdd = sc.parallelize(data)
>       rdd.cache
>       println("running")
>       val a = (1 to 100).toList.map { x =>
>         Future(rdd.filter(_.id == x.toString).collect)
>       }
>       a.foreach { f =>
>         println(Await.ready(f, Duration.Inf).value.get)
>       }
>       rdd.unpersist()
>     }
>   }
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf().setAppName("test")
>     val sc   = new SparkContext(conf)
>     run(sc)
>   }
> }
> {code}
> build sbt file:
> {code}
> name := "RDDTest"
> version := "0.1.1"
> scalaVersion := "2.11.5"
> libraryDependencies ++= Seq (
> "org.scalaz" %% "scalaz-core" % "7.2.0",
> "org.scalaz" %% "scalaz-concurrent" % "7.2.0",
> "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
> "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
>   )
> addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
> mainClass in assembly := Some("RDDApp")
> test in assembly := {}
> {code}
> To reproduce it: 
> Just 
> {code}
> spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
> --executor-memory 4G \
> --executor-cores 1 \
> --num-executors 1 \
> --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
> {code}
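
A small diagnostic sketch (an assumption about how to narrow this down, not part of 
the original report): list what the driver still tracks as persisted and compare it 
with the RDD block count shown in the UI.

{code}
// Run on the driver (for example inside the while loop above) to print the
// RDDs that Spark still considers persisted and their storage levels.
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"RDD $id (${rdd.name}) -> ${rdd.getStorageLevel.description}")
}
{code}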



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2017-06-06 Thread Ajay Cherukuri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039330#comment-16039330
 ] 

Ajay Cherukuri commented on SPARK-18372:


I have this issue in Spark 2.0.2

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: Mingjie Tang
>Assignee: Mingjie Tang
> Fix For: 1.6.4
>
> Attachments: _thumb_37664.png
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> The issue is that these folders don't get cleaned up and keep accumulating.
> =
> With the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - it didn't make any 
> difference.
> .hive-staging folders are created under the table folder - 
> hive/warehouse/hivesampletablecopypy/
> We have tried adding this property to hive-site.xml and restarting the 
> components:
> hive.exec.stagingdir
> ${hive.exec.scratchdir}/${user.name}/.staging
> A new .hive-staging folder was created in the hive/warehouse/ folder.
> Moreover, if we run the same Hive query in pure Hive via the 
> Hive CLI on the same Spark cluster, we don't see this behavior,
> so it doesn't appear to be a Hive issue in this case - this is 
> Spark behavior.
> I checked in Ambari: spark.yarn.preserve.staging.files=false is already set in 
> the Spark configuration.
> The issue happens via spark-submit as well - the customer used the following 
> command to reproduce it -
> spark-submit test-hive-staging-cleanup.py
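
As a diagnostic sketch (the path is an assumption based on the listing above), the 
leftover staging directories can be enumerated from a Spark shell with the Hadoop 
FileSystem API:

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// List the .hive-staging_* directories left behind under the table location.
val fs = FileSystem.get(sc.hadoopConfiguration)
val tableDir = new Path("/hive/warehouse/hivesampletablecopypy")
fs.listStatus(tableDir)
  .filter(_.getPath.getName.startsWith(".hive-staging"))
  .foreach(status => println(status.getPath))
{code}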



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21001) Staging folders from Hive table are not being cleared.

2017-06-06 Thread Ajay Cherukuri (JIRA)
Ajay Cherukuri created SPARK-21001:
--

 Summary: Staging folders from Hive table are not being cleared.
 Key: SPARK-21001
 URL: https://issues.apache.org/jira/browse/SPARK-21001
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Ajay Cherukuri


Staging folders created as part of loading data into a Hive table with a Spark 
job are not cleared.

The staging folders remain in the Hive external table directories even after the 
Spark job has completed.

This is the same issue mentioned in the ticket 
https://issues.apache.org/jira/browse/SPARK-18372

That ticket says the issue was resolved in 1.6.4, but I have now found that it 
still exists in 2.0.2.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21000) Add labels support to the Spark Dispatcher

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21000:


Assignee: (was: Apache Spark)

> Add labels support to the Spark Dispatcher
> --
>
> Key: SPARK-21000
> URL: https://issues.apache.org/jira/browse/SPARK-21000
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.2.1
>Reporter: Michael Gummelt
>
> Labels can be used for tagging drivers with arbitrary data, which can then be 
> used by an organization's tooling.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21000) Add labels support to the Spark Dispatcher

2017-06-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039308#comment-16039308
 ] 

Apache Spark commented on SPARK-21000:
--

User 'mgummelt' has created a pull request for this issue:
https://github.com/apache/spark/pull/18220

> Add labels support to the Spark Dispatcher
> --
>
> Key: SPARK-21000
> URL: https://issues.apache.org/jira/browse/SPARK-21000
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.2.1
>Reporter: Michael Gummelt
>
> Labels can be used for tagging drivers with arbitrary data, which can then be 
> used by an organization's tooling.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21000) Add labels support to the Spark Dispatcher

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21000:


Assignee: Apache Spark

> Add labels support to the Spark Dispatcher
> --
>
> Key: SPARK-21000
> URL: https://issues.apache.org/jira/browse/SPARK-21000
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.2.1
>Reporter: Michael Gummelt
>Assignee: Apache Spark
>
> Labels can be used for tagging drivers with arbitrary data, which can then be 
> used by an organization's tooling.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21000) Add labels support to the Spark Dispatcher

2017-06-06 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-21000:
---

 Summary: Add labels support to the Spark Dispatcher
 Key: SPARK-21000
 URL: https://issues.apache.org/jira/browse/SPARK-21000
 Project: Spark
  Issue Type: New Feature
  Components: Mesos
Affects Versions: 2.2.1
Reporter: Michael Gummelt


Labels can be used for tagging drivers with arbitrary data, which can then be 
used by an organization's tooling.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures

2017-06-06 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-20926:
---
Fix Version/s: 2.2.1

> Exposure to Guava libraries by directly accessing tableRelationCache in 
> SessionCatalog caused failures
> --
>
> Key: SPARK-20926
> URL: https://issues.apache.org/jira/browse/SPARK-20926
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Reza Safi
>Assignee: Reza Safi
> Fix For: 2.2.1, 2.3.0
>
>
> Because of the shading we did for the Guava libraries, we see test failures 
> whenever components directly access tableRelationCache in 
> SessionCatalog.
> This can happen in any component that shades the Guava library. Failures look 
> like this:
> {noformat}
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache;
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492)
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138)
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32)
> 01:25:14   at 
> org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
> 01:25:14   at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278)
> 01:25:14   at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20926) Exposure to Guava libraries by directly accessing tableRelationCache in SessionCatalog caused failures

2017-06-06 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039249#comment-16039249
 ] 

Marcelo Vanzin commented on SPARK-20926:


Turns out I didn't notice the PR was against 2.2 so it's there. Oh well, it's 
an internal change only anyway.

> Exposure to Guava libraries by directly accessing tableRelationCache in 
> SessionCatalog caused failures
> --
>
> Key: SPARK-20926
> URL: https://issues.apache.org/jira/browse/SPARK-20926
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Reza Safi
>Assignee: Reza Safi
> Fix For: 2.2.1, 2.3.0
>
>
> Because of the shading we did for the Guava libraries, we see test failures 
> whenever components directly access tableRelationCache in 
> SessionCatalog.
> This can happen in any component that shades the Guava library. Failures look 
> like this:
> {noformat}
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableRelationCache()Lcom/google/common/cache/Cache;
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveSparkSession.reset(TestHive.scala:492)
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:138)
> 01:25:14   at 
> org.apache.spark.sql.hive.test.TestHiveSingleton$class.afterAll(TestHiveSingleton.scala:32)
> 01:25:14   at 
> org.apache.spark.sql.hive.StatisticsSuite.afterAll(StatisticsSuite.scala:34)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
> 01:25:14   at org.apache.spark.SparkFunSuite.afterAll(SparkFunSuite.scala:31)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:280)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$$anonfun$run$1.apply(BeforeAndAfterAll.scala:278)
> 01:25:14   at org.scalatest.CompositeStatus.whenCompleted(Status.scala:377)
> 01:25:14   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:278)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2017-06-06 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039223#comment-16039223
 ] 

Marcelo Vanzin commented on SPARK-18085:


Yes, https://github.com/vanzin/spark/commits/shs-ng/HEAD should always contain 
the most up-to-date code. To actually try the leveldb backend you need to set a 
new configuration in the SHS (check config.scala in the o.a.s.history package).

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18085) Better History Server scalability for many / large applications

2017-06-06 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-18085:
---
Comment: was deleted

(was: Your contact with NYSE has changed, please visit 
https://www.nyse.com/contact or call +1 866 873 7422 or +65 6594 0160 or +852 
3962 8100.
)

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20712) [SQL] Spark can't read Hive table when column type has length greater than 4000 bytes

2017-06-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-20712:
---
Affects Version/s: 2.1.2

> [SQL] Spark can't read Hive table when column type has length greater than 
> 4000 bytes
> -
>
> Key: SPARK-20712
> URL: https://issues.apache.org/jira/browse/SPARK-20712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.3.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I have the following issue.
> I'm trying to read a table from Hive where one of the columns is nested, so its 
> schema is longer than 4000 bytes.
> Everything worked on Spark 2.0.2; on 2.1.1 I'm getting this exception:
> {code}
> >> spark.read.table("SOME_TABLE")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/opt/spark-2.1.1/python/pyspark/sql/readwriter.py", line 259, in table
> return self._df(self._jreader.table(tableName))
>   File 
> "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
>   File "/opt/spark-2.1.1/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", 
> line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o71.table.
> : org.apache.spark.SparkException: Cannot recognize hive type string: 
> SOME_VERY_LONG_FIELD_TYPE
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:628)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:628)
> at 
> 

[jira] [Resolved] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows

2017-06-06 Thread Perrine Letellier (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Perrine Letellier resolved SPARK-20969.
---
Resolution: Not A Problem

> last() aggregate function fails returning the right answer with ordered 
> windows
> ---
>
> Key: SPARK-20969
> URL: https://issues.apache.org/jira/browse/SPARK-20969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Perrine Letellier
>
> The column on which `orderBy` is performed is considered as another column on 
> which to partition.
> {code}
> scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), 
> ("i1", 2, "desc3"))).toDF("id", "ts", "description")
> scala> import org.apache.spark.sql.expressions.Window
> scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
> scala> df.withColumn("last", last(col("description")).over(window)).show
> +---+---+---+-+
> | id| ts|description| last|
> +---+---+---+-+
> | i1|  1|  desc1|desc2|
> | i1|  1|  desc2|desc2|
> | i1|  2|  desc3|desc3|
> +---+---+---+-+
> {code}
> However what is expected is the same answer as if asking for `first()` with a 
> window with descending order.
> {code}
> scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
> scala> df.withColumn("hackedLast", 
> first(col("description")).over(window)).show
> +---+---+---+--+
> | id| ts|description|hackedLast|
> +---+---+---+--+
> | i1|  2|  desc3| desc3|
> | i1|  1|  desc1| desc3|
> | i1|  1|  desc2| desc3|
> +---+---+---+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows

2017-06-06 Thread Perrine Letellier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038825#comment-16038825
 ] 

Perrine Letellier edited comment on SPARK-20969 at 6/6/17 1:00 PM:
---

[~viirya] Thanks for your answer !
I could get the expected result by specifying 
{code}Window.partitionBy("id").orderBy(col("ts").asc).rowsBetween(Window.unboundedPreceding,
 Window.unboundedFollowing){code} instead of simply 
{code}Window.partitionBy("id").orderBy(col("ts").asc) {code}.

Is it documented in any api doc that the default frame is {{RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW}} ?
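
For reference, a minimal sketch of the working variant (using the {{df}} from the 
description; the default frame when only an ORDER BY is specified is RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW, which is why {{last()}} returns the last row of 
the current row's peer group):

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

// Explicit frame spanning the whole partition, so last() sees every row.
val fullWindow = Window
  .partitionBy("id")
  .orderBy(col("ts").asc)
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn("last", last(col("description")).over(fullWindow)).show()
// every row of partition i1 now gets desc3
{code}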


was (Author: pletelli):
[~viirya] Thanks for your answer !
I could get the expected result by specifying 
{code}Window.partitionBy("id").orderBy(col("ts").asc).rowsBetween(Window.unboundedPreceding,
 Window.unboundedFollowing){code} instead of simply 
{code}Window.partitionBy("id").orderBy(col("ts").asc) {code}.

Is it documented in any api doc that the default frame is {code}RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW{code} ?

> last() aggregate function fails returning the right answer with ordered 
> windows
> ---
>
> Key: SPARK-20969
> URL: https://issues.apache.org/jira/browse/SPARK-20969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Perrine Letellier
>
> The column on which `orderBy` is performed is considered as another column on 
> which to partition.
> {code}
> scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), 
> ("i1", 2, "desc3"))).toDF("id", "ts", "description")
> scala> import org.apache.spark.sql.expressions.Window
> scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
> scala> df.withColumn("last", last(col("description")).over(window)).show
> +---+---+---+-+
> | id| ts|description| last|
> +---+---+---+-+
> | i1|  1|  desc1|desc2|
> | i1|  1|  desc2|desc2|
> | i1|  2|  desc3|desc3|
> +---+---+---+-+
> {code}
> However what is expected is the same answer as if asking for `first()` with a 
> window with descending order.
> {code}
> scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
> scala> df.withColumn("hackedLast", 
> first(col("description")).over(window)).show
> +---+---+---+--+
> | id| ts|description|hackedLast|
> +---+---+---+--+
> | i1|  2|  desc3| desc3|
> | i1|  1|  desc1| desc3|
> | i1|  1|  desc2| desc3|
> +---+---+---+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows

2017-06-06 Thread Perrine Letellier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038825#comment-16038825
 ] 

Perrine Letellier edited comment on SPARK-20969 at 6/6/17 1:00 PM:
---

[~viirya] Thanks for your answer !
I could get the expected result by specifying 
{{Window.partitionBy("id").orderBy(col("ts").asc).rowsBetween(Window.unboundedPreceding,
 Window.unboundedFollowing)}} instead of simply 
{{Window.partitionBy("id").orderBy(col("ts").asc)}}.

Is it documented in any api doc that the default frame is {{RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW}} ?


was (Author: pletelli):
[~viirya] Thanks for your answer !
I could get the expected result by specifying 
{code}Window.partitionBy("id").orderBy(col("ts").asc).rowsBetween(Window.unboundedPreceding,
 Window.unboundedFollowing){code} instead of simply 
{code}Window.partitionBy("id").orderBy(col("ts").asc) {code}.

Is it documented in any api doc that the default frame is {{RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW}} ?

> last() aggregate function fails returning the right answer with ordered 
> windows
> ---
>
> Key: SPARK-20969
> URL: https://issues.apache.org/jira/browse/SPARK-20969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Perrine Letellier
>
> The column on which `orderBy` is performed is considered as another column on 
> which to partition.
> {code}
> scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), 
> ("i1", 2, "desc3"))).toDF("id", "ts", "description")
> scala> import org.apache.spark.sql.expressions.Window
> scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
> scala> df.withColumn("last", last(col("description")).over(window)).show
> +---+---+---+-+
> | id| ts|description| last|
> +---+---+---+-+
> | i1|  1|  desc1|desc2|
> | i1|  1|  desc2|desc2|
> | i1|  2|  desc3|desc3|
> +---+---+---+-+
> {code}
> However what is expected is the same answer as if asking for `first()` with a 
> window with descending order.
> {code}
> scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
> scala> df.withColumn("hackedLast", 
> first(col("description")).over(window)).show
> +---+---+---+--+
> | id| ts|description|hackedLast|
> +---+---+---+--+
> | i1|  2|  desc3| desc3|
> | i1|  1|  desc1| desc3|
> | i1|  1|  desc2| desc3|
> +---+---+---+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows

2017-06-06 Thread Perrine Letellier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038825#comment-16038825
 ] 

Perrine Letellier edited comment on SPARK-20969 at 6/6/17 12:59 PM:


[~viirya] Thanks for your answer !
I could get the expected result by specifying 
{code}Window.partitionBy("id").orderBy(col("ts").asc).rowsBetween(Window.unboundedPreceding,
 Window.unboundedFollowing){code} instead of simply 
{code}Window.partitionBy("id").orderBy(col("ts").asc) {code}.

Is it documented in any api doc that the default frame is {code}RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW{code} ?


was (Author: pletelli):
[~viirya] Thanks for your answer !
I could get the expected result by specifying 
{{Window.partitionBy("id").orderBy(col("ts").asc).rowsBetween(Window.unboundedPreceding,
 Window.unboundedFollowing) }} instead of simply {{ 
Window.partitionBy("id").orderBy(col("ts").asc) }}.

Is it documented in any api doc that the default frame is {{ RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW }} ?

> last() aggregate function fails returning the right answer with ordered 
> windows
> ---
>
> Key: SPARK-20969
> URL: https://issues.apache.org/jira/browse/SPARK-20969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Perrine Letellier
>
> The column on which `orderBy` is performed is considered as another column on 
> which to partition.
> {code}
> scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), 
> ("i1", 2, "desc3"))).toDF("id", "ts", "description")
> scala> import org.apache.spark.sql.expressions.Window
> scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
> scala> df.withColumn("last", last(col("description")).over(window)).show
> +---+---+---+-+
> | id| ts|description| last|
> +---+---+---+-+
> | i1|  1|  desc1|desc2|
> | i1|  1|  desc2|desc2|
> | i1|  2|  desc3|desc3|
> +---+---+---+-+
> {code}
> However what is expected is the same answer as if asking for `first()` with a 
> window with descending order.
> {code}
> scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
> scala> df.withColumn("hackedLast", 
> first(col("description")).over(window)).show
> +---+---+---+--+
> | id| ts|description|hackedLast|
> +---+---+---+--+
> | i1|  2|  desc3| desc3|
> | i1|  1|  desc1| desc3|
> | i1|  1|  desc2| desc3|
> +---+---+---+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows

2017-06-06 Thread Perrine Letellier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038825#comment-16038825
 ] 

Perrine Letellier edited comment on SPARK-20969 at 6/6/17 12:59 PM:


[~viirya] Thanks for your answer !
I could get the expected result by specifying 
{{Window.partitionBy("id").orderBy(col("ts").asc).rowsBetween(Window.unboundedPreceding,
 Window.unboundedFollowing) }} instead of simply {{ 
Window.partitionBy("id").orderBy(col("ts").asc) }}.

Is it documented in any api doc that the default frame is {{ RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW }} ?


was (Author: pletelli):
[~viirya] Thanks for your answer !
I could get the expected result by specifying {code} 
Window.partitionBy("id").orderBy(col("ts").asc).rowsBetween(Window.unboundedPreceding,
 Window.unboundedFollowing) {code} instead of simply {code} 
Window.partitionBy("id").orderBy(col("ts").asc) {code}.

Is it documented in any api doc that the default frame is {code} RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW {code} ?

> last() aggregate function fails returning the right answer with ordered 
> windows
> ---
>
> Key: SPARK-20969
> URL: https://issues.apache.org/jira/browse/SPARK-20969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Perrine Letellier
>
> The column on which `orderBy` is performed is considered as another column on 
> which to partition.
> {code}
> scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), 
> ("i1", 2, "desc3"))).toDF("id", "ts", "description")
> scala> import org.apache.spark.sql.expressions.Window
> scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
> scala> df.withColumn("last", last(col("description")).over(window)).show
> +---+---+---+-+
> | id| ts|description| last|
> +---+---+---+-+
> | i1|  1|  desc1|desc2|
> | i1|  1|  desc2|desc2|
> | i1|  2|  desc3|desc3|
> +---+---+---+-+
> {code}
> However what is expected is the same answer as if asking for `first()` with a 
> window with descending order.
> {code}
> scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
> scala> df.withColumn("hackedLast", 
> first(col("description")).over(window)).show
> +---+---+---+--+
> | id| ts|description|hackedLast|
> +---+---+---+--+
> | i1|  2|  desc3| desc3|
> | i1|  1|  desc1| desc3|
> | i1|  1|  desc2| desc3|
> +---+---+---+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows

2017-06-06 Thread Perrine Letellier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038825#comment-16038825
 ] 

Perrine Letellier commented on SPARK-20969:
---

[~viirya] Thanks for your answer !
I could get the expected result by specifying {code} 
Window.partitionBy("id").orderBy(col("ts").asc).rowsBetween(Window.unboundedPreceding,
 Window.unboundedFollowing) {code} instead of simply {code} 
Window.partitionBy("id").orderBy(col("ts").asc) {code}.

Is it documented in any api doc that the default frame is {code} RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW {code} ?

> last() aggregate function fails returning the right answer with ordered 
> windows
> ---
>
> Key: SPARK-20969
> URL: https://issues.apache.org/jira/browse/SPARK-20969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Perrine Letellier
>
> The column on which `orderBy` is performed is considered as another column on 
> which to partition.
> {code}
> scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), 
> ("i1", 2, "desc3"))).toDF("id", "ts", "description")
> scala> import org.apache.spark.sql.expressions.Window
> scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
> scala> df.withColumn("last", last(col("description")).over(window)).show
> +---+---+---+-+
> | id| ts|description| last|
> +---+---+---+-+
> | i1|  1|  desc1|desc2|
> | i1|  1|  desc2|desc2|
> | i1|  2|  desc3|desc3|
> +---+---+---+-+
> {code}
> However what is expected is the same answer as if asking for `first()` with a 
> window with descending order.
> {code}
> scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
> scala> df.withColumn("hackedLast", 
> first(col("description")).over(window)).show
> +---+---+---+--+
> | id| ts|description|hackedLast|
> +---+---+---+--+
> | i1|  2|  desc3| desc3|
> | i1|  1|  desc1| desc3|
> | i1|  1|  desc2| desc3|
> +---+---+---+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.

2017-06-06 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen closed SPARK-20999.
-

Not A Problem

> No failure Stages, no log 'DAGScheduler: failed: Set()' output.
> ---
>
> Key: SPARK-20999
> URL: https://issues.apache.org/jira/browse/SPARK-20999
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.2.1
>Reporter: caoxuewen
>Priority: Trivial
>
> In the output of the spark log information:
> INFO DAGScheduler: looking for newly runnable stages
> INFO DAGScheduler: running: Set(ShuffleMapStage 14)
> INFO DAGScheduler: waiting: Set(ResultStage 15)
> INFO DAGScheduler: failed: Set()
> If there is no failed stage, there is no need to output "INFO DAGScheduler: 
> failed: Set()" in the log.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.

2017-06-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20999.
---
Resolution: Not A Problem

> No failure Stages, no log 'DAGScheduler: failed: Set()' output.
> ---
>
> Key: SPARK-20999
> URL: https://issues.apache.org/jira/browse/SPARK-20999
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.2.1
>Reporter: caoxuewen
>Priority: Trivial
>
> In the output of the spark log information:
> INFO DAGScheduler: looking for newly runnable stages
> INFO DAGScheduler: running: Set(ShuffleMapStage 14)
> INFO DAGScheduler: waiting: Set(ResultStage 15)
> INFO DAGScheduler: failed: Set()
> If there is no failed stage, there is no need to output "INFO DAGScheduler: 
> failed: Set()" in the log.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.

2017-06-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038793#comment-16038793
 ] 

Sean Owen commented on SPARK-20999:
---

As I've already commented, there doesn't seem to be value in hiding this info, 
and this isn't something you make a JIRA for

> No failure Stages, no log 'DAGScheduler: failed: Set()' output.
> ---
>
> Key: SPARK-20999
> URL: https://issues.apache.org/jira/browse/SPARK-20999
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.2.1
>Reporter: caoxuewen
>Priority: Trivial
>
> In the output of the spark log information:
> INFO DAGScheduler: looking for newly runnable stages
> INFO DAGScheduler: running: Set(ShuffleMapStage 14)
> INFO DAGScheduler: waiting: Set(ResultStage 15)
> INFO DAGScheduler: failed: Set()
> If there is no failed stage, there is no need to output "INFO DAGScheduler: 
> failed: Set()" in the log.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20999:


Assignee: (was: Apache Spark)

> No failure Stages, no log 'DAGScheduler: failed: Set()' output.
> ---
>
> Key: SPARK-20999
> URL: https://issues.apache.org/jira/browse/SPARK-20999
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.2.1
>Reporter: caoxuewen
>Priority: Trivial
>
> In the output of the spark log information:
> INFO DAGScheduler: looking for newly runnable stages
> INFO DAGScheduler: running: Set(ShuffleMapStage 14)
> INFO DAGScheduler: waiting: Set(ResultStage 15)
> INFO DAGScheduler: failed: Set()
> If there is no failed stage, there is no need to output "INFO DAGScheduler: 
> failed: Set()" in the log.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.

2017-06-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038789#comment-16038789
 ] 

Apache Spark commented on SPARK-20999:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/18218

> No failure Stages, no log 'DAGScheduler: failed: Set()' output.
> ---
>
> Key: SPARK-20999
> URL: https://issues.apache.org/jira/browse/SPARK-20999
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.2.1
>Reporter: caoxuewen
>Priority: Trivial
>
> In the output of the spark log information:
> INFO DAGScheduler: looking for newly runnable stages
> INFO DAGScheduler: running: Set(ShuffleMapStage 14)
> INFO DAGScheduler: waiting: Set(ResultStage 15)
> INFO DAGScheduler: failed: Set()
> If there is no failed stage, there is no need to output "INFO DAGScheduler: 
> failed: Set()" in the log.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20999:


Assignee: Apache Spark

> No failure Stages, no log 'DAGScheduler: failed: Set()' output.
> ---
>
> Key: SPARK-20999
> URL: https://issues.apache.org/jira/browse/SPARK-20999
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.2.1
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Trivial
>
> In the output of the spark log information:
> INFO DAGScheduler: looking for newly runnable stages
> INFO DAGScheduler: running: Set(ShuffleMapStage 14)
> INFO DAGScheduler: waiting: Set(ResultStage 15)
> INFO DAGScheduler: failed: Set()
> If there is no failed stage, there is no need to output "INFO DAGScheduler: 
> failed: Set()" in the log.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.

2017-06-06 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20999:
-

 Summary: No failure Stages, no log 'DAGScheduler: failed: Set()' 
output.
 Key: SPARK-20999
 URL: https://issues.apache.org/jira/browse/SPARK-20999
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.2.1
Reporter: caoxuewen
Priority: Trivial


In the output of the spark log information:

INFO DAGScheduler: looking for newly runnable stages
INFO DAGScheduler: running: Set(ShuffleMapStage 14)
INFO DAGScheduler: waiting: Set(ResultStage 15)
INFO DAGScheduler: failed: Set()

If there is no failed stage, there is no need to output "INFO DAGScheduler: 
failed: Set()" in the log.
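
If the goal is just a quieter shell, one hedged workaround sketch (it also hides the 
other DAGScheduler INFO lines shown above, and does not require changing Spark):

{code}
import org.apache.log4j.{Level, Logger}

// Raise the DAGScheduler log level for this JVM so its INFO lines,
// including "failed: Set()", are no longer printed.
Logger.getLogger("org.apache.spark.scheduler.DAGScheduler").setLevel(Level.WARN)
{code}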



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation

2017-06-06 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-13669:

Description: 
Currently we are running into an issue with Yarn work preserving enabled + 
external shuffle service. 

In the work-preserving-enabled scenario, the failure of the NM does not lead to the 
exit of executors, so executors can still accept and run tasks. The problem 
is that when the NM fails, the external shuffle service is actually inaccessible, 
so reduce tasks will always fail with “Fetch failure”, and the failure 
of the reduce stage will make the parent stage (the map stage) rerun. The tricky 
thing is that the Spark scheduler is not aware of the unavailability of the external 
shuffle service and will reschedule the map tasks on the executor whose NM failed; 
the reduce stage will again fail with “Fetch failure”, and after 4 
retries the job fails.

So the main problem is that we should avoid assigning tasks to those bad 
executors (where the shuffle service is unavailable). Spark's current blacklist 
mechanism can blacklist executors/nodes based on failed tasks, but it doesn't 
handle this specific fetch-failure scenario. This issue proposes improving the 
current application blacklist mechanism to handle the fetch-failure issue 
(especially when the external shuffle service is unavailable) by blacklisting the 
executors/nodes where shuffle fetches fail.

  was:
Currently we are running into an issue with Yarn work preserving enabled + 
external shuffle service. 

In the work preserving enabled scenario, the failure of NM will not lead to the 
exit of executors, so executors can still accept and run the tasks. The problem 
here is when NM is failed, external shuffle service is actually inaccessible, 
so reduce tasks will always complain about the “Fetch failure”, and the failure 
of reduce stage will make the parent stage (map stage) rerun. The tricky thing 
here is Spark scheduler is not aware of the unavailability of external shuffle 
service, and will reschedule the map tasks on the executor where NM is failed, 
and again reduce stage will be failed with “Fetch failure”, and after 4 
retries, the job is failed.

So here the actual problem is Spark’s scheduler is not aware of the 
unavailability of external shuffle service, and will still assign the tasks on 
to that nodes. The fix is to avoid assigning tasks on to that nodes.

Currently in the Spark, one related configuration is 
“spark.scheduler.executorTaskBlacklistTime”, but I don’t think it will be 
worked in this scenario. This configuration is used to avoid same reattempt 
task to run on the same executor. Also ways like MapReduce’s blacklist 
mechanism may not handle this scenario, since all the reduce tasks will be 
failed, so counting the failure tasks will equally mark all the executors as 
“bad” one.


> Job will always fail in the external shuffle service unavailable situation
> --
>
> Key: SPARK-13669
> URL: https://issues.apache.org/jira/browse/SPARK-13669
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>
> Currently we are running into an issue with Yarn work preserving enabled + 
> external shuffle service. 
> In the work-preserving-enabled scenario, the failure of the NM does not lead to 
> the exit of executors, so executors can still accept and run tasks. The 
> problem is that when the NM fails, the external shuffle service is actually 
> inaccessible, so reduce tasks will always fail with “Fetch failure”, 
> and the failure of the reduce stage will make the parent stage (the map stage) rerun. 
> The tricky thing is that the Spark scheduler is not aware of the unavailability 
> of the external shuffle service and will reschedule the map tasks on the 
> executor whose NM failed; the reduce stage will again fail with 
> “Fetch failure”, and after 4 retries the job fails.
> So the main problem is that we should avoid assigning tasks to those bad 
> executors (where the shuffle service is unavailable). Spark's current blacklist 
> mechanism can blacklist executors/nodes based on failed tasks, but it doesn't 
> handle this specific fetch-failure scenario. This issue proposes improving the 
> current application blacklist mechanism to handle the fetch-failure issue 
> (especially when the external shuffle service is unavailable) by blacklisting 
> the executors/nodes where shuffle fetches fail.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20998) BroadcastHashJoin producing wrong results

2017-06-06 Thread Mohit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit updated SPARK-20998:
--
Description: 
I have a hive table : _eagle_edw_batch.DistributionAttributes_, with 

*Schema*: 

root
 |-- distributionstatus: string (nullable = true)
 |-- enabledforselectionflag: boolean (nullable = true)
 |-- sourcedistributionid: integer (nullable = true)
 |-- rowstartdate: date (nullable = true)
 |-- rowenddate: date (nullable = true)
 |-- rowiscurrent: string (nullable = true)
 |-- dwcreatedate: timestamp (nullable = true)
 |-- dwlastupdatedate: timestamp (nullable = true)
 |-- appid: integer (nullable = true)
 |-- siteid: integer (nullable = true)
 |-- brandid: integer (nullable = true)


*DataFrame*

val df = spark.sql("SELECT  s.sourcedistributionid as sid, 
t.sourcedistributionid as tid,  s.appid as sapp, t.appid as tapp,  s.brandid as 
sbrand, t.brandid as tbrand FROM eagle_edw_batch.DistributionAttributes t INNER 
JOIN eagle_edw_batch.DistributionAttributes s  ON 
t.sourcedistributionid=s.sourcedistributionid AND t.appid=s.appid  AND 
t.brandid=s.brandid").


*Without BroadCastJoin* ( spark-shell --conf 
"spark.sql.autoBroadcastJoinThreshold=-1") : 

df.explain
== Physical Plan ==
*Project [sourcedistributionid#71 AS sid#0, sourcedistributionid#60 AS tid#1, 
appid#77 AS sapp#2, appid#66 AS tapp#3, brandid#79 AS sbrand#4, brandid#68 AS 
tbrand#5]
+- *SortMergeJoin [sourcedistributionid#60, appid#66, brandid#68], 
[sourcedistributionid#71, appid#77, brandid#79], Inner
   :- *Sort [sourcedistributionid#60 ASC, appid#66 ASC, brandid#68 ASC], false, 0
   :  +- Exchange hashpartitioning(sourcedistributionid#60, appid#66, 
brandid#68, 200)
   : +- *Filter ((isnotnull(sourcedistributionid#60) && 
isnotnull(brandid#68)) && isnotnull(appid#66))
   :+- HiveTableScan [sourcedistributionid#60, appid#66, brandid#68], 
MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- *Sort [sourcedistributionid#71 ASC, appid#77 ASC, brandid#79 ASC], false, 0
  +- Exchange hashpartitioning(sourcedistributionid#71, appid#77, 
brandid#79, 200)
 +- *Filter ((isnotnull(sourcedistributionid#71) && 
isnotnull(appid#77)) && isnotnull(brandid#79))
+- HiveTableScan [sourcedistributionid#71, appid#77, brandid#79], 
MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
+---+---+----+----+------+------+
|sid|tid|sapp|tapp|sbrand|tbrand|
+---+---+----+----+------+------+
| 22| 22|  61|  61|   614|   614|
| 29| 29|  65|  65|     0|     0|
| 30| 30|  12|  12|   121|   121|
| 10| 10|  73|  73|   731|   731|
| 24| 24|  61|  61|   611|   611|
| 35| 35|  65|  65|     0|     0|
+---+---+----+----+------+------+


*With BroadCastJoin* ( spark-shell )

df.explain

== Physical Plan ==
*Project [sourcedistributionid#136 AS sid#65, sourcedistributionid#125 AS 
tid#66, appid#142 AS sapp#67, appid#131 AS tapp#68, brandid#144 AS sbrand#69, 
brandid#133 AS tbrand#70]
+- *BroadcastHashJoin [sourcedistributionid#125, appid#131, brandid#133], 
[sourcedistributionid#136, appid#142, brandid#144], Inner, BuildRight
   :- *Filter ((isnotnull(brandid#133) && isnotnull(appid#131)) && 
isnotnull(sourcedistributionid#125))
   :  +- HiveTableScan [sourcedistributionid#125, appid#131, brandid#133], 
MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- BroadcastExchange 
HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
false] as bigint), 32) | (cast(input[1, int, false] as bigint) & 4294967295)), 
32) | (cast(input[2, int, false] as bigint) & 4294967295
  +- *Filter ((isnotnull(brandid#144) && 
isnotnull(sourcedistributionid#136)) && isnotnull(appid#142))
 +- HiveTableScan [sourcedistributionid#136, appid#142, brandid#144], 
MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
+---+---+----+----+------+------+
|sid|tid|sapp|tapp|sbrand|tbrand|
+---+---+----+----+------+------+
| 15| 22|  61|  61|   614|   614|
| 13| 22|  61|  61|   614|   614|
| 10| 22|  61|  61|   614|   614|
|  7| 22|  61|  61|   614|   614|
|  9| 22|  61|  61|   614|   614|
| 16| 22|  61|  61|   614|   614|
+---+---+----+----+------+------+

  was:
I have a hive table : _eagle_edw_batch.DistributionAttributes_, with 

*Schema*: 

root
 |-- distributionstatus: string (nullable = true)
 |-- enabledforselectionflag: boolean (nullable = true)
 |-- sourcedistributionid: integer (nullable = true)
 |-- rowstartdate: date (nullable = true)
 |-- rowenddate: date (nullable = true)
 |-- rowiscurrent: string (nullable = true)
 |-- dwcreatedate: timestamp (nullable = true)
 |-- dwlastupdatedate: timestamp (nullable = true)
 |-- appid: integer (nullable = true)
 |-- siteid: integer (nullable = true)
 |-- brandid: integer (nullable = true)


*DataFrame*

val df = spark.sql("SELECT  s.sourcedistributionid as sid, 
t.sourcedistributionid as tid,  s.appid as sapp, t.appid as tapp,  s.brandid as 
sbrand, t.brandid as tbrand FROM eagle_edw_batch.DistributionAttributes t INNER 
JOIN eagle_edw_batch.DistributionAttributes s  ON 
t.sourcedistributionid=s.sourcedistributionid AND t.appid=s.appid  

[jira] [Updated] (SPARK-20998) BroadcastHashJoin producing wrong results

2017-06-06 Thread Mohit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit updated SPARK-20998:
--
Description: 
I have a hive table : _eagle_edw_batch.DistributionAttributes_, with 

*Schema*: 

root
 |-- distributionstatus: string (nullable = true)
 |-- enabledforselectionflag: boolean (nullable = true)
 |-- sourcedistributionid: integer (nullable = true)
 |-- rowstartdate: date (nullable = true)
 |-- rowenddate: date (nullable = true)
 |-- rowiscurrent: string (nullable = true)
 |-- dwcreatedate: timestamp (nullable = true)
 |-- dwlastupdatedate: timestamp (nullable = true)
 |-- appid: integer (nullable = true)
 |-- siteid: integer (nullable = true)
 |-- brandid: integer (nullable = true)


*DataFrame*

val df = spark.sql("SELECT  s.sourcedistributionid as sid, 
t.sourcedistributionid as tid,  s.appid as sapp, t.appid as tapp,  s.brandid as 
sbrand, t.brandid as tbrand FROM eagle_edw_batch.DistributionAttributes t INNER 
JOIN eagle_edw_batch.DistributionAttributes s  ON 
t.sourcedistributionid=s.sourcedistributionid AND t.appid=s.appid  AND 
t.brandid=s.brandid").


*Without BroadCastJoin* ( spark-shell --conf 
"spark.sql.autoBroadcastJoinThreshold=-1") : 

df.explain
== Physical Plan ==
*Project [sourcedistributionid#71 AS sid#0, sourcedistributionid#60 AS tid#1, 
appid#77 AS sapp#2, appid#66 AS tapp#3, brandid#79 AS sbrand#4, brandid#68 AS 
tbrand#5]
+- *SortMergeJoin [sourcedistributionid#60, appid#66, brandid#68], 
[sourcedistributionid#71, appid#77, brandid#79], Inner
   :- *Sort [sourcedistributionid#60 ASC, appid#66 ASC, brandid#68 ASC], false, 0
   :  +- Exchange hashpartitioning(sourcedistributionid#60, appid#66, 
brandid#68, 200)
   : +- *Filter ((isnotnull(sourcedistributionid#60) && 
isnotnull(brandid#68)) && isnotnull(appid#66))
   :+- HiveTableScan [sourcedistributionid#60, appid#66, brandid#68], 
MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- *Sort [sourcedistributionid#71 ASC, appid#77 ASC, brandid#79 ASC], false, 0
  +- Exchange hashpartitioning(sourcedistributionid#71, appid#77, 
brandid#79, 200)
 +- *Filter ((isnotnull(sourcedistributionid#71) && 
isnotnull(appid#77)) && isnotnull(brandid#79))
+- HiveTableScan [sourcedistributionid#71, appid#77, brandid#79], 
MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
+---+---+++--+--+   
|sid|tid|sapp|tapp|sbrand|tbrand|
+---+---+++--+--+
| 22| 22|  61|  61|   614|   614|
| 29| 29|  65|  65| 0| 0|
| 30| 30|  12|  12|   121|   121|
| 10| 10|  73|  73|   731|   731|
| 24| 24|  61|  61|   611|   611|
| 35| 35|  65|  65| 0| 0|


*With BroadCastJoin* ( spark-shell )

df.explain

== Physical Plan ==
*Project [sourcedistributionid#136 AS sid#65, sourcedistributionid#125 AS 
tid#66, appid#142 AS sapp#67, appid#131 AS tapp#68, brandid#144 AS sbrand#69, 
brandid#133 AS tbrand#70]
+- *BroadcastHashJoin [sourcedistributionid#125, appid#131, brandid#133], 
[sourcedistributionid#136, appid#142, brandid#144], Inner, BuildRight
   :- *Filter ((isnotnull(brandid#133) && isnotnull(appid#131)) && 
isnotnull(sourcedistributionid#125))
   :  +- HiveTableScan [sourcedistributionid#125, appid#131, brandid#133], 
MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- BroadcastExchange 
HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
false] as bigint), 32) | (cast(input[1, int, false] as bigint) & 4294967295)), 
32) | (cast(input[2, int, false] as bigint) & 4294967295
  +- *Filter ((isnotnull(brandid#144) && 
isnotnull(sourcedistributionid#136)) && isnotnull(appid#142))
 +- HiveTableScan [sourcedistributionid#136, appid#142, brandid#144], 
MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
+---+---+++--+--+   
|sid|tid|sapp|tapp|sbrand|tbrand|
+---+---+++--+--+
| 15| 22|  61|  61|   614|   614|
| 13| 22|  61|  61|   614|   614|
| 10| 22|  61|  61|   614|   614|
|  7| 22|  61|  61|   614|   614|
|  9| 22|  61|  61|   614|   614|
| 16| 22|  61|  61|   614|   614|

  was:
I have a hive table : _eagle_edw_batch.DistributionAttributes_, with schema: 
root
 |-- distributionstatus: string (nullable = true)
 |-- enabledforselectionflag: boolean (nullable = true)
 |-- sourcedistributionid: integer (nullable = true)
 |-- rowstartdate: date (nullable = true)
 |-- rowenddate: date (nullable = true)
 |-- rowiscurrent: string (nullable = true)
 |-- dwcreatedate: timestamp (nullable = true)
 |-- dwlastupdatedate: timestamp (nullable = true)
 |-- appid: integer (nullable = true)
 |-- siteid: integer (nullable = true)
 |-- brandid: integer (nullable = true)

DataFrame:
val df = spark.sql("SELECT  s.sourcedistributionid as sid, 
t.sourcedistributionid as tid,  s.appid as sapp, t.appid 

[jira] [Updated] (SPARK-20998) BroadcastHashJoin producing wrong results

2017-06-06 Thread Mohit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit updated SPARK-20998:
--
Description: 
I have a hive table : _eagle_edw_batch.DistributionAttributes_, with schema: 
root
 |-- distributionstatus: string (nullable = true)
 |-- enabledforselectionflag: boolean (nullable = true)
 |-- sourcedistributionid: integer (nullable = true)
 |-- rowstartdate: date (nullable = true)
 |-- rowenddate: date (nullable = true)
 |-- rowiscurrent: string (nullable = true)
 |-- dwcreatedate: timestamp (nullable = true)
 |-- dwlastupdatedate: timestamp (nullable = true)
 |-- appid: integer (nullable = true)
 |-- siteid: integer (nullable = true)
 |-- brandid: integer (nullable = true)

DataFrame:
val df = spark.sql("SELECT  s.sourcedistributionid as sid, 
t.sourcedistributionid as tid,  s.appid as sapp, t.appid as tapp,  s.brandid as 
sbrand, t.brandid as tbrand FROM eagle_edw_batch.DistributionAttributes t INNER 
JOIN eagle_edw_batch.DistributionAttributes s  ON 
t.sourcedistributionid=s.sourcedistributionid AND t.appid=s.appid  AND 
t.brandid=s.brandid").

*Without BroadCastJoin* ( spark-shell --conf 
"spark.sql.autoBroadcastJoinThreshold=-1") : 

df.explain
== Physical Plan ==
*Project [sourcedistributionid#71 AS sid#0, sourcedistributionid#60 AS tid#1, 
appid#77 AS sapp#2, appid#66 AS tapp#3, brandid#79 AS sbrand#4, brandid#68 AS 
tbrand#5]
+- *SortMergeJoin [sourcedistributionid#60, appid#66, brandid#68], 
[sourcedistributionid#71, appid#77, brandid#79], Inner
   :- *Sort [sourcedistributionid#60 ASC, appid#66 ASC, brandid#68 ASC], false, 0
   :  +- Exchange hashpartitioning(sourcedistributionid#60, appid#66, 
brandid#68, 200)
   : +- *Filter ((isnotnull(sourcedistributionid#60) && 
isnotnull(brandid#68)) && isnotnull(appid#66))
   :+- HiveTableScan [sourcedistributionid#60, appid#66, brandid#68], 
MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- *Sort [sourcedistributionid#71 ASC, appid#77 ASC, brandid#79 ASC], false, 0
  +- Exchange hashpartitioning(sourcedistributionid#71, appid#77, 
brandid#79, 200)
 +- *Filter ((isnotnull(sourcedistributionid#71) && 
isnotnull(appid#77)) && isnotnull(brandid#79))
+- HiveTableScan [sourcedistributionid#71, appid#77, brandid#79], 
MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
+---+---+++--+--+   
|sid|tid|sapp|tapp|sbrand|tbrand|
+---+---+++--+--+
| 22| 22|  61|  61|   614|   614|
| 29| 29|  65|  65| 0| 0|
| 30| 30|  12|  12|   121|   121|
| 10| 10|  73|  73|   731|   731|
| 24| 24|  61|  61|   611|   611|
| 35| 35|  65|  65| 0| 0|

*With BroadCastJoin* ( spark-shell )

df.explain

== Physical Plan ==
*Project [sourcedistributionid#136 AS sid#65, sourcedistributionid#125 AS 
tid#66, appid#142 AS sapp#67, appid#131 AS tapp#68, brandid#144 AS sbrand#69, 
brandid#133 AS tbrand#70]
+- *BroadcastHashJoin [sourcedistributionid#125, appid#131, brandid#133], 
[sourcedistributionid#136, appid#142, brandid#144], Inner, BuildRight
   :- *Filter ((isnotnull(brandid#133) && isnotnull(appid#131)) && 
isnotnull(sourcedistributionid#125))
   :  +- HiveTableScan [sourcedistributionid#125, appid#131, brandid#133], 
MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- BroadcastExchange 
HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
false] as bigint), 32) | (cast(input[1, int, false] as bigint) & 4294967295)), 
32) | (cast(input[2, int, false] as bigint) & 4294967295
  +- *Filter ((isnotnull(brandid#144) && 
isnotnull(sourcedistributionid#136)) && isnotnull(appid#142))
 +- HiveTableScan [sourcedistributionid#136, appid#142, brandid#144], 
MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
+---+---+++--+--+   
|sid|tid|sapp|tapp|sbrand|tbrand|
+---+---+++--+--+
| 15| 22|  61|  61|   614|   614|
| 13| 22|  61|  61|   614|   614|
| 10| 22|  61|  61|   614|   614|
|  7| 22|  61|  61|   614|   614|
|  9| 22|  61|  61|   614|   614|
| 16| 22|  61|  61|   614|   614|

  was:
I have a hive tables : eagle_edw_batch.DistributionAttributes, with schema: 
root
 |-- distributionstatus: string (nullable = true)
 |-- enabledforselectionflag: boolean (nullable = true)
 |-- sourcedistributionid: integer (nullable = true)
 |-- rowstartdate: date (nullable = true)
 |-- rowenddate: date (nullable = true)
 |-- rowiscurrent: string (nullable = true)
 |-- dwcreatedate: timestamp (nullable = true)
 |-- dwlastupdatedate: timestamp (nullable = true)
 |-- appid: integer (nullable = true)
 |-- siteid: integer (nullable = true)
 |-- brandid: integer (nullable = true)

DataFrame:
val df = spark.sql("SELECT  s.sourcedistributionid as sid, 
t.sourcedistributionid as tid,  s.appid as sapp, t.appid as tapp,  

[jira] [Comment Edited] (SPARK-18791) Stream-Stream Joins

2017-06-06 Thread xianyao jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038690#comment-16038690
 ] 

xianyao jiang edited comment on SPARK-18791 at 6/6/17 12:15 PM:


  We have a draft design for the stream-stream inner join and have completed a 
demo based on it. We hope to get more advice and help from the open source 
community and to make the stream join implementation more popular and common. If 
you have any questions or advice, please contact us and let us know.
  Hi [~marmbrus], can you give some advice on it? Thanks
  Document link:   
https://docs.google.com/document/d/1i528WI7KFica0Dg1LTQfdQMsW8ai3WDvHmUvkH1BKg4/edit?usp=sharing


was (Author: xianyao.jiang):
  We have the draft design for the stream-stream inner join , and  complete a 
demo based on it.  We hope we can get more advice or helps from open source 
community and make the stream join implementation more popular and common. If 
you have any question or advice, please contact us and let us know.   Thanks
  Document link:   
https://docs.google.com/document/d/1i528WI7KFica0Dg1LTQfdQMsW8ai3WDvHmUvkH1BKg4/edit?usp=sharing

> Stream-Stream Joins
> ---
>
> Key: SPARK-18791
> URL: https://issues.apache.org/jira/browse/SPARK-18791
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>
> Just a placeholder for now.  Please comment with your requirements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20998) BroadcastHashJoin producing wrong results

2017-06-06 Thread Mohit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit updated SPARK-20998:
--
Summary: BroadcastHashJoin producing wrong results  (was: BroadcastHashJoin 
producing different results)

> BroadcastHashJoin producing wrong results
> -
>
> Key: SPARK-20998
> URL: https://issues.apache.org/jira/browse/SPARK-20998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Mohit
>
> I have a hive table: eagle_edw_batch.DistributionAttributes, with schema: 
> root
>  |-- distributionstatus: string (nullable = true)
>  |-- enabledforselectionflag: boolean (nullable = true)
>  |-- sourcedistributionid: integer (nullable = true)
>  |-- rowstartdate: date (nullable = true)
>  |-- rowenddate: date (nullable = true)
>  |-- rowiscurrent: string (nullable = true)
>  |-- dwcreatedate: timestamp (nullable = true)
>  |-- dwlastupdatedate: timestamp (nullable = true)
>  |-- appid: integer (nullable = true)
>  |-- siteid: integer (nullable = true)
>  |-- brandid: integer (nullable = true)
> DataFrame:
> val df = spark.sql("SELECT  s.sourcedistributionid as sid, 
> t.sourcedistributionid as tid,  s.appid as sapp, t.appid as tapp,  s.brandid 
> as sbrand, t.brandid as tbrand FROM eagle_edw_batch.DistributionAttributes t 
> INNER JOIN eagle_edw_batch.DistributionAttributes s  ON 
> t.sourcedistributionid=s.sourcedistributionid AND t.appid=s.appid  AND 
> t.brandid=s.brandid").
> *Without BroadCastJoin* ( spark-shell --conf 
> "spark.sql.autoBroadcastJoinThreshold=-1") : 
> df.explain
> == Physical Plan ==
> *Project [sourcedistributionid#71 AS sid#0, sourcedistributionid#60 AS tid#1, 
> appid#77 AS sapp#2, appid#66 AS tapp#3, brandid#79 AS sbrand#4, brandid#68 AS 
> tbrand#5]
> +- *SortMergeJoin [sourcedistributionid#60, appid#66, brandid#68], 
> [sourcedistributionid#71, appid#77, brandid#79], Inner
>:- *Sort [sourcedistributionid#60 ASC, appid#66 ASC, brandid#68 ASC], 
> false, 0
>:  +- Exchange hashpartitioning(sourcedistributionid#60, appid#66, 
> brandid#68, 200)
>: +- *Filter ((isnotnull(sourcedistributionid#60) && 
> isnotnull(brandid#68)) && isnotnull(appid#66))
>:+- HiveTableScan [sourcedistributionid#60, appid#66, brandid#68], 
> MetastoreRelation eagle_edw_batch, distributionattributes, t
>+- *Sort [sourcedistributionid#71 ASC, appid#77 ASC, brandid#79 ASC], 
> false, 0
>   +- Exchange hashpartitioning(sourcedistributionid#71, appid#77, 
> brandid#79, 200)
>  +- *Filter ((isnotnull(sourcedistributionid#71) && 
> isnotnull(appid#77)) && isnotnull(brandid#79))
> +- HiveTableScan [sourcedistributionid#71, appid#77, brandid#79], 
> MetastoreRelation eagle_edw_batch, distributionattributes, s
> df.show
> +---+---+++--+--+ 
>   
> |sid|tid|sapp|tapp|sbrand|tbrand|
> +---+---+++--+--+
> | 22| 22|  61|  61|   614|   614|
> | 29| 29|  65|  65| 0| 0|
> | 30| 30|  12|  12|   121|   121|
> | 10| 10|  73|  73|   731|   731|
> | 24| 24|  61|  61|   611|   611|
> | 35| 35|  65|  65| 0| 0|
> *With BroadCastJoin* ( spark-shell )
> df.explain
> == Physical Plan ==
> *Project [sourcedistributionid#136 AS sid#65, sourcedistributionid#125 AS 
> tid#66, appid#142 AS sapp#67, appid#131 AS tapp#68, brandid#144 AS sbrand#69, 
> brandid#133 AS tbrand#70]
> +- *BroadcastHashJoin [sourcedistributionid#125, appid#131, brandid#133], 
> [sourcedistributionid#136, appid#142, brandid#144], Inner, BuildRight
>:- *Filter ((isnotnull(brandid#133) && isnotnull(appid#131)) && 
> isnotnull(sourcedistributionid#125))
>:  +- HiveTableScan [sourcedistributionid#125, appid#131, brandid#133], 
> MetastoreRelation eagle_edw_batch, distributionattributes, t
>+- BroadcastExchange 
> HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
> false] as bigint), 32) | (cast(input[1, int, false] as bigint) & 
> 4294967295)), 32) | (cast(input[2, int, false] as bigint) & 4294967295
>   +- *Filter ((isnotnull(brandid#144) && 
> isnotnull(sourcedistributionid#136)) && isnotnull(appid#142))
>  +- HiveTableScan [sourcedistributionid#136, appid#142, brandid#144], 
> MetastoreRelation eagle_edw_batch, distributionattributes, s
> df.show
> +---+---+++--+--+ 
>   
> |sid|tid|sapp|tapp|sbrand|tbrand|
> +---+---+++--+--+
> | 15| 22|  61|  61|   614|   614|
> | 13| 22|  61|  61|   614|   614|
> | 10| 22|  61|  61|   614|   614|
> |  7| 22|  61|  61|   614|   614|
> |  9| 22|  61|  61|   614|   614|
> | 16| 22|  61|  61|   614|   614|



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (SPARK-20998) BroadcastHashJoin producing wrong results

2017-06-06 Thread Mohit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit updated SPARK-20998:
--
Affects Version/s: (was: 2.0.1)
   2.0.0

> BroadcastHashJoin producing wrong results
> -
>
> Key: SPARK-20998
> URL: https://issues.apache.org/jira/browse/SPARK-20998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Mohit
>
> I have a hive table: eagle_edw_batch.DistributionAttributes, with schema: 
> root
>  |-- distributionstatus: string (nullable = true)
>  |-- enabledforselectionflag: boolean (nullable = true)
>  |-- sourcedistributionid: integer (nullable = true)
>  |-- rowstartdate: date (nullable = true)
>  |-- rowenddate: date (nullable = true)
>  |-- rowiscurrent: string (nullable = true)
>  |-- dwcreatedate: timestamp (nullable = true)
>  |-- dwlastupdatedate: timestamp (nullable = true)
>  |-- appid: integer (nullable = true)
>  |-- siteid: integer (nullable = true)
>  |-- brandid: integer (nullable = true)
> DataFrame:
> val df = spark.sql("SELECT  s.sourcedistributionid as sid, 
> t.sourcedistributionid as tid,  s.appid as sapp, t.appid as tapp,  s.brandid 
> as sbrand, t.brandid as tbrand FROM eagle_edw_batch.DistributionAttributes t 
> INNER JOIN eagle_edw_batch.DistributionAttributes s  ON 
> t.sourcedistributionid=s.sourcedistributionid AND t.appid=s.appid  AND 
> t.brandid=s.brandid").
> *Without BroadCastJoin* ( spark-shell --conf 
> "spark.sql.autoBroadcastJoinThreshold=-1") : 
> df.explain
> == Physical Plan ==
> *Project [sourcedistributionid#71 AS sid#0, sourcedistributionid#60 AS tid#1, 
> appid#77 AS sapp#2, appid#66 AS tapp#3, brandid#79 AS sbrand#4, brandid#68 AS 
> tbrand#5]
> +- *SortMergeJoin [sourcedistributionid#60, appid#66, brandid#68], 
> [sourcedistributionid#71, appid#77, brandid#79], Inner
>:- *Sort [sourcedistributionid#60 ASC, appid#66 ASC, brandid#68 ASC], 
> false, 0
>:  +- Exchange hashpartitioning(sourcedistributionid#60, appid#66, 
> brandid#68, 200)
>: +- *Filter ((isnotnull(sourcedistributionid#60) && 
> isnotnull(brandid#68)) && isnotnull(appid#66))
>:+- HiveTableScan [sourcedistributionid#60, appid#66, brandid#68], 
> MetastoreRelation eagle_edw_batch, distributionattributes, t
>+- *Sort [sourcedistributionid#71 ASC, appid#77 ASC, brandid#79 ASC], 
> false, 0
>   +- Exchange hashpartitioning(sourcedistributionid#71, appid#77, 
> brandid#79, 200)
>  +- *Filter ((isnotnull(sourcedistributionid#71) && 
> isnotnull(appid#77)) && isnotnull(brandid#79))
> +- HiveTableScan [sourcedistributionid#71, appid#77, brandid#79], 
> MetastoreRelation eagle_edw_batch, distributionattributes, s
> df.show
> +---+---+++--+--+ 
>   
> |sid|tid|sapp|tapp|sbrand|tbrand|
> +---+---+++--+--+
> | 22| 22|  61|  61|   614|   614|
> | 29| 29|  65|  65| 0| 0|
> | 30| 30|  12|  12|   121|   121|
> | 10| 10|  73|  73|   731|   731|
> | 24| 24|  61|  61|   611|   611|
> | 35| 35|  65|  65| 0| 0|
> *With BroadCastJoin* ( spark-shell )
> df.explain
> == Physical Plan ==
> *Project [sourcedistributionid#136 AS sid#65, sourcedistributionid#125 AS 
> tid#66, appid#142 AS sapp#67, appid#131 AS tapp#68, brandid#144 AS sbrand#69, 
> brandid#133 AS tbrand#70]
> +- *BroadcastHashJoin [sourcedistributionid#125, appid#131, brandid#133], 
> [sourcedistributionid#136, appid#142, brandid#144], Inner, BuildRight
>:- *Filter ((isnotnull(brandid#133) && isnotnull(appid#131)) && 
> isnotnull(sourcedistributionid#125))
>:  +- HiveTableScan [sourcedistributionid#125, appid#131, brandid#133], 
> MetastoreRelation eagle_edw_batch, distributionattributes, t
>+- BroadcastExchange 
> HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
> false] as bigint), 32) | (cast(input[1, int, false] as bigint) & 
> 4294967295)), 32) | (cast(input[2, int, false] as bigint) & 4294967295
>   +- *Filter ((isnotnull(brandid#144) && 
> isnotnull(sourcedistributionid#136)) && isnotnull(appid#142))
>  +- HiveTableScan [sourcedistributionid#136, appid#142, brandid#144], 
> MetastoreRelation eagle_edw_batch, distributionattributes, s
> df.show
> +---+---+++--+--+ 
>   
> |sid|tid|sapp|tapp|sbrand|tbrand|
> +---+---+++--+--+
> | 15| 22|  61|  61|   614|   614|
> | 13| 22|  61|  61|   614|   614|
> | 10| 22|  61|  61|   614|   614|
> |  7| 22|  61|  61|   614|   614|
> |  9| 22|  61|  61|   614|   614|
> | 16| 22|  61|  61|   614|   614|



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: 

[jira] [Created] (SPARK-20998) BroadcastHashJoin producing different results

2017-06-06 Thread Mohit (JIRA)
Mohit created SPARK-20998:
-

 Summary: BroadcastHashJoin producing different results
 Key: SPARK-20998
 URL: https://issues.apache.org/jira/browse/SPARK-20998
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Mohit


I have a hive table: eagle_edw_batch.DistributionAttributes, with schema: 
root
 |-- distributionstatus: string (nullable = true)
 |-- enabledforselectionflag: boolean (nullable = true)
 |-- sourcedistributionid: integer (nullable = true)
 |-- rowstartdate: date (nullable = true)
 |-- rowenddate: date (nullable = true)
 |-- rowiscurrent: string (nullable = true)
 |-- dwcreatedate: timestamp (nullable = true)
 |-- dwlastupdatedate: timestamp (nullable = true)
 |-- appid: integer (nullable = true)
 |-- siteid: integer (nullable = true)
 |-- brandid: integer (nullable = true)

DataFrame:
val df = spark.sql("SELECT  s.sourcedistributionid as sid, 
t.sourcedistributionid as tid,  s.appid as sapp, t.appid as tapp,  s.brandid as 
sbrand, t.brandid as tbrand FROM eagle_edw_batch.DistributionAttributes t INNER 
JOIN eagle_edw_batch.DistributionAttributes s  ON 
t.sourcedistributionid=s.sourcedistributionid AND t.appid=s.appid  AND 
t.brandid=s.brandid").

*Without BroadCastJoin* ( spark-shell --conf 
"spark.sql.autoBroadcastJoinThreshold=-1") : 

df.explain
== Physical Plan ==
*Project [sourcedistributionid#71 AS sid#0, sourcedistributionid#60 AS tid#1, 
appid#77 AS sapp#2, appid#66 AS tapp#3, brandid#79 AS sbrand#4, brandid#68 AS 
tbrand#5]
+- *SortMergeJoin [sourcedistributionid#60, appid#66, brandid#68], 
[sourcedistributionid#71, appid#77, brandid#79], Inner
   :- *Sort [sourcedistributionid#60 ASC, appid#66 ASC, brandid#68 ASC], false, 0
   :  +- Exchange hashpartitioning(sourcedistributionid#60, appid#66, 
brandid#68, 200)
   : +- *Filter ((isnotnull(sourcedistributionid#60) && 
isnotnull(brandid#68)) && isnotnull(appid#66))
   :+- HiveTableScan [sourcedistributionid#60, appid#66, brandid#68], 
MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- *Sort [sourcedistributionid#71 ASC, appid#77 ASC, brandid#79 ASC], false, 0
  +- Exchange hashpartitioning(sourcedistributionid#71, appid#77, 
brandid#79, 200)
 +- *Filter ((isnotnull(sourcedistributionid#71) && 
isnotnull(appid#77)) && isnotnull(brandid#79))
+- HiveTableScan [sourcedistributionid#71, appid#77, brandid#79], 
MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
+---+---+++--+--+   
|sid|tid|sapp|tapp|sbrand|tbrand|
+---+---+++--+--+
| 22| 22|  61|  61|   614|   614|
| 29| 29|  65|  65| 0| 0|
| 30| 30|  12|  12|   121|   121|
| 10| 10|  73|  73|   731|   731|
| 24| 24|  61|  61|   611|   611|
| 35| 35|  65|  65| 0| 0|

*With BroadCastJoin* ( spark-shell )

df.explain

== Physical Plan ==
*Project [sourcedistributionid#136 AS sid#65, sourcedistributionid#125 AS 
tid#66, appid#142 AS sapp#67, appid#131 AS tapp#68, brandid#144 AS sbrand#69, 
brandid#133 AS tbrand#70]
+- *BroadcastHashJoin [sourcedistributionid#125, appid#131, brandid#133], 
[sourcedistributionid#136, appid#142, brandid#144], Inner, BuildRight
   :- *Filter ((isnotnull(brandid#133) && isnotnull(appid#131)) && 
isnotnull(sourcedistributionid#125))
   :  +- HiveTableScan [sourcedistributionid#125, appid#131, brandid#133], 
MetastoreRelation eagle_edw_batch, distributionattributes, t
   +- BroadcastExchange 
HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
false] as bigint), 32) | (cast(input[1, int, false] as bigint) & 4294967295)), 
32) | (cast(input[2, int, false] as bigint) & 4294967295
  +- *Filter ((isnotnull(brandid#144) && 
isnotnull(sourcedistributionid#136)) && isnotnull(appid#142))
 +- HiveTableScan [sourcedistributionid#136, appid#142, brandid#144], 
MetastoreRelation eagle_edw_batch, distributionattributes, s

df.show
+---+---+++--+--+   
|sid|tid|sapp|tapp|sbrand|tbrand|
+---+---+++--+--+
| 15| 22|  61|  61|   614|   614|
| 13| 22|  61|  61|   614|   614|
| 10| 22|  61|  61|   614|   614|
|  7| 22|  61|  61|   614|   614|
|  9| 22|  61|  61|   614|   614|
| 16| 22|  61|  61|   614|   614|
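An editorial note on the report above: per the comparison, disabling the automatic 
broadcast forces the sort-merge join, which returns the expected rows. A minimal 
session-level sketch of that workaround (table and column names taken from the report):

{code}
// workaround sketch: turn off auto broadcast joins for this session and re-check the plan
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val fixed = spark.sql(
  """SELECT s.sourcedistributionid AS sid, t.sourcedistributionid AS tid
    |FROM eagle_edw_batch.DistributionAttributes t
    |JOIN eagle_edw_batch.DistributionAttributes s
    |  ON t.sourcedistributionid = s.sourcedistributionid
    | AND t.appid = s.appid AND t.brandid = s.brandid""".stripMargin)
fixed.explain()   // should now show SortMergeJoin rather than BroadcastHashJoin
{code}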



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20969) last() aggregate function fails returning the right answer with ordered windows

2017-06-06 Thread Perrine Letellier (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Perrine Letellier updated SPARK-20969:
--
Description: 
The column on which `orderBy` is performed is considered as another column on 
which to partition.

{code}
scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), 
("i1", 2, "desc3"))).toDF("id", "ts", "description")
scala> import org.apache.spark.sql.expressions.Window
scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
scala> df.withColumn("last", last(col("description")).over(window)).show
+---+---+---+-+
| id| ts|description| last|
+---+---+---+-+
| i1|  1|  desc1|desc2|
| i1|  1|  desc2|desc2|
| i1|  2|  desc3|desc3|
+---+---+---+-+
{code}

However what is expected is the same answer as if asking for `first()` with a 
window with descending order.

{code}
scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
scala> df.withColumn("hackedLast", first(col("description")).over(window)).show
+---+---+---+--+
| id| ts|description|hackedLast|
+---+---+---+--+
| i1|  2|  desc3| desc3|
| i1|  1|  desc1| desc3|
| i1|  1|  desc2| desc3|
+---+---+---+--+
{code}

  was:
The column on which `orderBy` is performed is considered as another column on 
which to partition.

{code}
scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), 
("i1", 2, "desc3"))).toDF("id", "ts", "description")
scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
scala> df.withColumn("last", last(col("description")).over(window)).show
+---+---+-+-+
| id| ts| description| last|
+---+---+-+-+
| i1|  1|desc1|desc2|
| i1|  1|desc2|desc2|
| i1|  2|desc3|desc3|
+---+---+-+-+
{code}

However what is expected is the same answer as if asking for `first()` with a 
window with descending order.

{code}
scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
scala> df.withColumn("last", first(col("description")).over(window)).show
+---+---+-+-+
| id| ts| description| last|
+---+---+-+-+
| i1|  2|desc3|desc3|
| i1|  1|desc1|desc3|
| i1|  1|desc2|desc3|
+---+---+-+-+
{code}


> last() aggregate function fails returning the right answer with ordered 
> windows
> ---
>
> Key: SPARK-20969
> URL: https://issues.apache.org/jira/browse/SPARK-20969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Perrine Letellier
>
> The column on which `orderBy` is performed is considered as another column on 
> which to partition.
> {code}
> scala> val df = sc.parallelize(List(("i1", 1, "desc1"), ("i1", 1, "desc2"), 
> ("i1", 2, "desc3"))).toDF("id", "ts", "description")
> scala> import org.apache.spark.sql.expressions.Window
> scala> val window = Window.partitionBy("id").orderBy(col("ts").asc)
> scala> df.withColumn("last", last(col("description")).over(window)).show
> +---+---+---+-+
> | id| ts|description| last|
> +---+---+---+-+
> | i1|  1|  desc1|desc2|
> | i1|  1|  desc2|desc2|
> | i1|  2|  desc3|desc3|
> +---+---+---+-+
> {code}
> However what is expected is the same answer as if asking for `first()` with a 
> window with descending order.
> {code}
> scala> val window = Window.partitionBy("id").orderBy(col("ts").desc)
> scala> df.withColumn("hackedLast", 
> first(col("description")).over(window)).show
> +---+---+---+--+
> | id| ts|description|hackedLast|
> +---+---+---+--+
> | i1|  2|  desc3| desc3|
> | i1|  1|  desc1| desc3|
> | i1|  1|  desc2| desc3|
> +---+---+---+--+
> {code}
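An editorial note: the output above is consistent with the default window frame Spark 
applies when an ORDER BY is present (rows from unbounded preceding up to the current 
row), so {{last()}} returns the current row's value rather than the partition's last 
value. A minimal sketch, assuming the {{df}} from the report, that widens the frame 
explicitly:

{code}
scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions.{col, last}
scala> val fullWindow = Window.partitionBy("id").orderBy(col("ts").asc).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
scala> df.withColumn("last", last(col("description")).over(fullWindow)).show
{code}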



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18791) Stream-Stream Joins

2017-06-06 Thread xianyao jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038690#comment-16038690
 ] 

xianyao jiang edited comment on SPARK-18791 at 6/6/17 12:09 PM:


  We have a draft design for the stream-stream inner join and have completed a 
demo based on it. We hope to get more advice and help from the open source 
community and to make the stream join implementation more popular and common. If 
you have any questions or advice, please contact us and let us know. Thanks
  Document link:   
https://docs.google.com/document/d/1i528WI7KFica0Dg1LTQfdQMsW8ai3WDvHmUvkH1BKg4/edit?usp=sharing


was (Author: xianyao.jiang):
  We have the draft design for the stream-stream inner join , and  complete a 
demo based on it, it seems it can work.  We hope we can get more advice or 
helps form open source social and make the stream join implementation more 
popular and common. If you have any question or advice, please contact us and 
let us know.   Thanks
  Document link:   
https://docs.google.com/document/d/1i528WI7KFica0Dg1LTQfdQMsW8ai3WDvHmUvkH1BKg4/edit?usp=sharing

> Stream-Stream Joins
> ---
>
> Key: SPARK-18791
> URL: https://issues.apache.org/jira/browse/SPARK-18791
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>
> Just a placeholder for now.  Please comment with your requirements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18791) Stream-Stream Joins

2017-06-06 Thread xianyao jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038690#comment-16038690
 ] 

xianyao jiang commented on SPARK-18791:
---

  We have a draft design for the stream-stream inner join and have completed a 
demo based on it; it seems to work. We hope to get more advice and help from the 
open source community and to make the stream join implementation more popular 
and common. If you have any questions or advice, please contact us and let us 
know. Thanks
  Document link:   
https://docs.google.com/document/d/1i528WI7KFica0Dg1LTQfdQMsW8ai3WDvHmUvkH1BKg4/edit?usp=sharing

> Stream-Stream Joins
> ---
>
> Key: SPARK-18791
> URL: https://issues.apache.org/jira/browse/SPARK-18791
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>
> Just a placeholder for now.  Please comment with your requirements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20854) extend hint syntax to support any expression, not just identifiers or strings

2017-06-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038671#comment-16038671
 ] 

Apache Spark commented on SPARK-20854:
--

User 'bogdanrdc' has created a pull request for this issue:
https://github.com/apache/spark/pull/18217

> extend hint syntax to support any expression, not just identifiers or strings
> -
>
> Key: SPARK-20854
> URL: https://issues.apache.org/jira/browse/SPARK-20854
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Currently the SQL hint syntax supports as parameters only identifiers while 
> the Dataset hint syntax supports only strings.
> They should support any expression as parameters, for example numbers. This 
> is useful for implementing other hints in the future.
> Examples:
> {code}
> df.hint("hint1", Seq(1, 2, 3))
> df.hint("hint2", "A", 1)
> sql("select /*+ hint1((1,2,3)) */")
> sql("select /*+ hint2('A', 1) */")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala

2017-06-06 Thread lyc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038648#comment-16038648
 ] 

lyc commented on SPARK-19878:
-

See the 
[Contributing to Spark|https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark] 
guide for how to contribute; attaching a patch file here is not the recommended way 
to contribute (pull requests on GitHub are preferred).

> Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
> --
>
> Key: SPARK-19878
> URL: https://issues.apache.org/jira/browse/SPARK-19878
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0, 2.0.0
> Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0
>Reporter: kavn qin
>  Labels: patch
> Attachments: SPARK-19878.patch
>
>
> When the case class InsertIntoHiveTable initializes a serde, it explicitly passes 
> null for the Configuration in Spark 1.5.0:
> [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]
> Likewise in Spark 2.0.0, when the HiveWriterContainer initializes a serde it also 
> passes null for the Configuration:
> [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]
> When we implement a hive serde, we want to use the hive configuration to get 
> some static and dynamic settings, but we cannot do it!
> So this patch adds the configuration when initializing the hive serde.
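For illustration only (this is not the attached patch): {{Serializer.initialize()}} in the 
Hive SerDe API takes a {{Configuration}} as its first argument, so the idea is roughly to 
thread the active Hadoop conf through instead of {{null}}. A sketch, with hypothetical 
surrounding code:

{code}
// sketch only: names other than the Hive SerDe API itself are illustrative
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.ql.plan.TableDesc
import org.apache.hadoop.hive.serde2.Serializer

def newSerializer(tableDesc: TableDesc, hadoopConf: Configuration): Serializer = {
  val serializer = tableDesc.getDeserializerClass.newInstance().asInstanceOf[Serializer]
  // pass the real conf so the serde can read static/dynamic settings (was: initialize(null, ...))
  serializer.initialize(hadoopConf, tableDesc.getProperties)
  serializer
}
{code}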



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20997) spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only"

2017-06-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20997:
--
   Priority: Trivial  (was: Minor)
Component/s: Documentation

I suppose this should be removed from "YARN-only", and the previous section 
should be called "Spark standalone or YARN with cluster deploy mode only". The 
help message is getting hairy, but this seems like an improvement.
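For context, a quick sketch of the two modes the flag actually applies to (the jar name 
and values are placeholders):

{code}
# Spark standalone, cluster deploy mode
spark-submit --master spark://host:7077 --deploy-mode cluster --driver-cores 2 app.jar
# YARN, cluster deploy mode
spark-submit --master yarn --deploy-mode cluster --driver-cores 2 app.jar
{code}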

> spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark 
> standalone with cluster deploy mode only"
> -
>
> Key: SPARK-20997
> URL: https://issues.apache.org/jira/browse/SPARK-20997
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Submit
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> Just noticed that {{spark-submit}} describes {{--driver-cores}} under:
> * Spark standalone with cluster deploy mode only
> * YARN-only
> While I can understand "only" in "Spark standalone with cluster deploy mode 
> only" to refer to cluster deploy mode (not the default client mode), 
> "YARN-only" baffles me, which I think deserves a fix.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20997) spark-submit's --driver-cores marked as "YARN-only" but listed under "Spark standalone with cluster deploy mode only"

2017-06-06 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-20997:
---

 Summary: spark-submit's --driver-cores marked as "YARN-only" but 
listed under "Spark standalone with cluster deploy mode only"
 Key: SPARK-20997
 URL: https://issues.apache.org/jira/browse/SPARK-20997
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.3.0
Reporter: Jacek Laskowski
Priority: Minor


Just noticed that {{spark-submit}} describes {{--driver-cores}} under:

* Spark standalone with cluster deploy mode only
* YARN-only

While I can understand "only" in "Spark standalone with cluster deploy mode 
only" to refer to cluster deploy mode (not the default client mode), 
"YARN-only" baffles me, which I think deserves a fix.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20920) ForkJoinPool pools are leaked when writing hive tables with many partitions

2017-06-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038583#comment-16038583
 ] 

Apache Spark commented on SPARK-20920:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/18216

> ForkJoinPool pools are leaked when writing hive tables with many partitions
> ---
>
> Key: SPARK-20920
> URL: https://issues.apache.org/jira/browse/SPARK-20920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Rares Mirica
>
> This bug is loosely related to SPARK-17396
> In this case it happens when writing to a hive table with many, many, 
> partitions (my table is partitioned by hour and stores data it gets from 
> kafka in a spark streaming application):
> df.repartition()
>   .write
>   .format("orc")
>   .option("path", s"$tablesStoragePath/$tableName")
>   .mode(SaveMode.Append)
>   .partitionBy("dt", "hh")
>   .saveAsTable(tableName)
> As this table grows beyond a certain size, ForkJoinPool pools start leaking. 
> Upon examination (with a debugger) I found that the caller is 
> AlterTableRecoverPartitionsCommand and the problem happens when 
> `evalTaskSupport` is used (line 555). I have tried setting a very large 
> threshold via `spark.rdd.parallelListingThreshold` and the problem went away.
> My assumption is that the problem happens in this case and not in the one in 
> SPARK-17396 due to the fact that AlterTableRecoverPartitionsCommand is a case 
> class while UnionRDD is an object so multiple instances are not possible, 
> therefore no leak.
> Regards,
> Rares
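An editorial note: the workaround mentioned above can be applied without code changes; a 
sketch (the threshold value is arbitrary, it just needs to exceed the table's partition 
count so the listing stays on the driver):

{code}
spark-shell --conf spark.rdd.parallelListingThreshold=100000
{code}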



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20920) ForkJoinPool pools are leaked when writing hive tables with many partitions

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20920:


Assignee: (was: Apache Spark)

> ForkJoinPool pools are leaked when writing hive tables with many partitions
> ---
>
> Key: SPARK-20920
> URL: https://issues.apache.org/jira/browse/SPARK-20920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Rares Mirica
>
> This bug is loosely related to SPARK-17396
> In this case it happens when writing to a hive table with many, many, 
> partitions (my table is partitioned by hour and stores data it gets from 
> kafka in a spark streaming application):
> df.repartition()
>   .write
>   .format("orc")
>   .option("path", s"$tablesStoragePath/$tableName")
>   .mode(SaveMode.Append)
>   .partitionBy("dt", "hh")
>   .saveAsTable(tableName)
> As this table grows beyond a certain size, ForkJoinPool pools start leaking. 
> Upon examination (with a debugger) I found that the caller is 
> AlterTableRecoverPartitionsCommand and the problem happens when 
> `evalTaskSupport` is used (line 555). I have tried setting a very large 
> threshold via `spark.rdd.parallelListingThreshold` and the problem went away.
> My assumption is that the problem happens in this case and not in the one in 
> SPARK-17396 due to the fact that AlterTableRecoverPartitionsCommand is a case 
> class while UnionRDD is an object so multiple instances are not possible, 
> therefore no leak.
> Regards,
> Rares



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20920) ForkJoinPool pools are leaked when writing hive tables with many partitions

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20920:


Assignee: Apache Spark

> ForkJoinPool pools are leaked when writing hive tables with many partitions
> ---
>
> Key: SPARK-20920
> URL: https://issues.apache.org/jira/browse/SPARK-20920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Rares Mirica
>Assignee: Apache Spark
>
> This bug is loosely related to SPARK-17396
> In this case it happens when writing to a hive table with many, many, 
> partitions (my table is partitioned by hour and stores data it gets from 
> kafka in a spark streaming application):
> df.repartition()
>   .write
>   .format("orc")
>   .option("path", s"$tablesStoragePath/$tableName")
>   .mode(SaveMode.Append)
>   .partitionBy("dt", "hh")
>   .saveAsTable(tableName)
> As this table grows beyond a certain size, ForkJoinPool pools start leaking. 
> Upon examination (with a debugger) I found that the caller is 
> AlterTableRecoverPartitionsCommand and the problem happens when 
> `evalTaskSupport` is used (line 555). I have tried setting a very large 
> threshold via `spark.rdd.parallelListingThreshold` and the problem went away.
> My assumption is that the problem happens in this case and not in the one in 
> SPARK-17396 due to the fact that AlterTableRecoverPartitionsCommand is a case 
> class while UnionRDD is an object so multiple instances are not possible, 
> therefore no leak.
> Regards,
> Rares



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20935) A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating StreamingContext.

2017-06-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038549#comment-16038549
 ] 

Sean Owen commented on SPARK-20935:
---

Sounds OK to me [~hyukjin.kwon] if you're up for it

> A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating 
> StreamingContext.
> ---
>
> Key: SPARK-20935
> URL: https://issues.apache.org/jira/browse/SPARK-20935
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.3, 2.1.1
>Reporter: Terence Yim
>
> With the batched write ahead log on by default in the driver (SPARK-11731), if there 
> is no receiver-based {{InputDStream}}, the "BatchedWriteAheadLog Writer" 
> thread created by {{BatchedWriteAheadLog}} never gets shut down.
> The root cause is 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L168
> which never calls {{ReceivedBlockTracker.stop()}} (which in turn calls 
> {{BatchedWriteAheadLog.close()}}) if there is no receiver-based input.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20985) Improve KryoSerializerResizableOutputSuite

2017-06-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20985:
-

Assignee: jin xing

> Improve KryoSerializerResizableOutputSuite
> --
>
> Key: SPARK-20985
> URL: https://issues.apache.org/jira/browse/SPARK-20985
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
>Assignee: jin xing
>Priority: Trivial
> Fix For: 2.3.0
>
>
> SparkContext should always be stopped after use, so that other tests won't 
> complain that only one SparkContext can exist at a time.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20985) Improve KryoSerializerResizableOutputSuite

2017-06-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20985.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18204
[https://github.com/apache/spark/pull/18204]

> Improve KryoSerializerResizableOutputSuite
> --
>
> Key: SPARK-20985
> URL: https://issues.apache.org/jira/browse/SPARK-20985
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
>Priority: Trivial
> Fix For: 2.3.0
>
>
> SparkContext should always be stopped after use, so that other tests won't 
> complain that only one SparkContext can exist at a time.
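The general pattern behind the fix, as a sketch (not the actual test code):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("kryo-resizable-output"))
try {
  // ... exercise KryoSerializer with a resizable output buffer here ...
} finally {
  sc.stop()   // always stop, so later suites can create their own SparkContext
}
{code}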



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20914) Javadoc contains code that is invalid

2017-06-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038518#comment-16038518
 ] 

Apache Spark commented on SPARK-20914:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/18215

> Javadoc contains code that is invalid
> -
>
> Key: SPARK-20914
> URL: https://issues.apache.org/jira/browse/SPARK-20914
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: Cristian Teodor
>Priority: Trivial
>
> I was looking over the 
> [Dataset javadoc|https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/sql/Dataset.html]
>  and noticed that the example code at the top is not valid Java.
> {code}
>  // To create Dataset using SparkSession
>Dataset people = spark.read().parquet("...");
>Dataset department = spark.read().parquet("...");
>people.filter("age".gt(30))
>  .join(department, people.col("deptId").equalTo(department("id")))
>  .groupBy(department.col("name"), "gender")
>  .agg(avg(people.col("salary")), max(people.col("age")));
> {code}
> invalid parts:
> * "age".gt(30)
> * department("id")
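A corrected sketch of the same example in valid Java (editorial; the eventual doc wording 
may differ), assuming {{Dataset}}/{{Row}} imports and 
{{import static org.apache.spark.sql.functions.*;}}:

{code}
Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");

people.filter(people.col("age").gt(30))
  .join(department, people.col("deptId").equalTo(department.col("id")))
  .groupBy(department.col("name"), people.col("gender"))
  .agg(avg(people.col("salary")), max(people.col("age")));
{code}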



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20914) Javadoc contains code that is invalid

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20914:


Assignee: Apache Spark

> Javadoc contains code that is invalid
> -
>
> Key: SPARK-20914
> URL: https://issues.apache.org/jira/browse/SPARK-20914
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: Cristian Teodor
>Assignee: Apache Spark
>Priority: Trivial
>
> I was looking over the 
> [Dataset javadoc|https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/sql/Dataset.html]
>  and noticed that the example code at the top is not valid Java.
> {code}
>  // To create Dataset using SparkSession
>Dataset people = spark.read().parquet("...");
>Dataset department = spark.read().parquet("...");
>people.filter("age".gt(30))
>  .join(department, people.col("deptId").equalTo(department("id")))
>  .groupBy(department.col("name"), "gender")
>  .agg(avg(people.col("salary")), max(people.col("age")));
> {code}
> invalid parts:
> * "age".gt(30)
> * department("id")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20914) Javadoc contains code that is invalid

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20914:


Assignee: (was: Apache Spark)

> Javadoc contains code that is invalid
> -
>
> Key: SPARK-20914
> URL: https://issues.apache.org/jira/browse/SPARK-20914
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: Cristian Teodor
>Priority: Trivial
>
> I was looking over the 
> [Dataset javadoc|https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/sql/Dataset.html]
>  and noticed that the example code at the top is not valid Java.
> {code}
>  // To create Dataset using SparkSession
>Dataset people = spark.read().parquet("...");
>Dataset department = spark.read().parquet("...");
>people.filter("age".gt(30))
>  .join(department, people.col("deptId").equalTo(department("id")))
>  .groupBy(department.col("name"), "gender")
>  .agg(avg(people.col("salary")), max(people.col("age")));
> {code}
> invalid parts:
> * "age".gt(30)
> * department("id")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20162) Reading data from MySQL - Cannot up cast from decimal(30,6) to decimal(38,18)

2017-06-06 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038515#comment-16038515
 ] 

Yuming Wang commented on SPARK-20162:
-

[~mspehar] How can this be reproduced? By reading a table like {{spark_20162}}?
{code:sql}
CREATE TABLE `spark_20162` (
  `spark` decimal(30,6) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
{code}
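As an editorial aside, the failure should be reproducible by reading such a table through 
the JDBC source into a typed Dataset; the error message's own suggestion (add an explicit 
cast) can be sketched like this (connection options are placeholders, and the MySQL JDBC 
driver must be on the classpath):

{code}
import org.apache.spark.sql.functions.col

val raw = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://host:3306/db")
  .option("dbtable", "spark_20162")
  .option("user", "user")
  .option("password", "pass")
  .load()

// widen the column before converting to a typed Dataset, so the up-cast to decimal(38,18) is explicit
val widened = raw.withColumn("spark", col("spark").cast("decimal(38,18)"))
{code}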


> Reading data from MySQL - Cannot up cast from decimal(30,6) to decimal(38,18)
> -
>
> Key: SPARK-20162
> URL: https://issues.apache.org/jira/browse/SPARK-20162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Miroslav Spehar
>
> While reading data from MySQL, type conversion doesn't work for Decimal type 
> when the decimal in database is of lower precision/scale than the one spark 
> expects.
> Error:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
> cast `DECIMAL_AMOUNT` from decimal(30,6) to decimal(38,18) as it may truncate
> The type path of the target object is:
> - field (class: "org.apache.spark.sql.types.Decimal", name: "DECIMAL_AMOUNT")
> - root class: "com.misp.spark.Structure"
> You can either add an explicit cast to the input data or choose a higher 
> precision type of the field in the target object;
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2119)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2141)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2136)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:360)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:248)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:258)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:236)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2136)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2132)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.apply(Analyzer.scala:2132)
>   at 
> 

[jira] [Updated] (SPARK-20995) 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.

2017-06-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20995:
--
Priority: Trivial  (was: Minor)

> 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.
> --
>
> Key: SPARK-20995
> URL: https://issues.apache.org/jira/browse/SPARK-20995
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: guoxiaolongzte
>Priority: Trivial
>
> Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which 
> contains the (client side) configuration files for the Hadoop cluster.
> These configs are used to write to HDFS and connect to the YARN 
> ResourceManager. The
> configuration contained in this directory will be distributed to the YARN 
> cluster so that all
> containers used by the application use the same configuration.
> Sometimes, HADOOP_CONF_DIR is set to the hdfs configuration file path. So, 
> YARN_CONF_DIR should be set to the yarn configuration file path.
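> A template entry along these lines would make the distinction explicit (the paths 
> are illustrative assumptions, not taken from this issue):
> {code}
> # conf/spark-env.sh
> # Client-side Hadoop configuration (core-site.xml, hdfs-site.xml, ...)
> export HADOOP_CONF_DIR=/etc/hadoop/conf
> # Client-side YARN configuration (yarn-site.xml), if it lives in a separate directory
> export YARN_CONF_DIR=/etc/hadoop/conf.yarn
> {code}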



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20884) Spark' masters will be both standby due to the bug of curator

2017-06-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20884.
---
Resolution: Won't Fix

We can't merge this change, because Curator 2.8 won't work in Curator 2.6 
environments, which include the still-supported Hadoop 2.6/2.7 environments. 
We could update this much later, basically when Spark 3 moves on to Hadoop 3 
(which uses Curator 2.12).

> Spark' masters will be both standby due to the bug of curator 
> --
>
> Key: SPARK-20884
> URL: https://issues.apache.org/jira/browse/SPARK-20884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>
> I built a cluster with two masters and three workers. When the master's state 
> switches, both masters stay in standby for a long period. The master's log file 
> contains errors such as " ERROR CuratorFrameworkImpl: Background 
> exception was not retry-able or retry gave up
> java.lang.IllegalArgumentException: Path must start with / character
>   at org.apache.curator.utils.PathUtils.validatePath(PathUtils.java:53)
>   at org.apache.curator.utils.ZKPaths.getNodeFromPath(ZKPaths.java:56)
>   at 
> org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:421)
>   at 
> org.apache.curator.framework.recipes.leader.LeaderLatch.access$500(LeaderLatch.java:60)
>   at 
> org.apache.curator.framework.recipes.leader.LeaderLatch$6.processResult(LeaderLatch.java:478)
>   at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:686)
>   at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:485)
>   at 
> org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:166)
>   at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:587)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)"
> It may be related to https://issues.apache.org/jira/browse/CURATOR-168.
> So I think the Curator version in Spark should be 2.8.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20851) Drop spark table failed if a column name is a numeric string

2017-06-06 Thread Ben P (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038445#comment-16038445
 ] 

Ben P commented on SPARK-20851:
---

This issue seems to be handled/prevented in v2.0.1, as 2.0.1 throws the exception 
below from spark.sql('create table ...'):

pyspark.sql.utils.ParseException: "\nmismatched input '123018231' expecting 
'>'(line 1, pos 7)\n\n== SQL 
==\nstruct<123018231:string,123121:bigint>\n---^^^\n"

> Drop spark table failed if a column name is a numeric string
> 
>
> Key: SPARK-20851
> URL: https://issues.apache.org/jira/browse/SPARK-20851
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: linux redhat
>Reporter: Chen Gong
>
> I tried to read a JSON file into a Spark DataFrame:
> {noformat}
> df = spark.read.json('path.json')
> df.write.parquet('dataframe', compression='snappy')
> {noformat}
> However, some of the column names are numeric strings, such as 
> "989238883". Then I created a Spark SQL table with:
> {noformat}
> create table if not exists `a` using org.apache.spark.sql.parquet options 
> (path 'dataframe');  // It works well
> {noformat}
> But after the table is created, any operation on it, such as SELECT or DROP 
> TABLE, raises the same exception below:
> {noformat}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array>,url:string,width:bigint>>,audit_id:bigint,author_id:bigint,body:string,brand_id:string,created_at:string,custom_ticket_fields:struct<49244727:string,51588527:string,51591767:string,51950848:string,51950868:string,51950888:string,51950928:string,52359587:string,55276747:string,56958227:string,57080067:string,57080667:string,57107727:string,57112447:string,57113207:string,57411128:string,57424648:string,57442588:string,62382188:string,74862088:string,74871788:string>,event_type:string,group_id:bigint,html_body:string,id:bigint,is_public:string,locale_id:string,organization_id:string,plain_body:string,previous_value:string,priority:string,public:boolean,rel:string,removed_tags:array,requester_id:bigint,satisfaction_probability:string,satisfaction_score:string,sla_policy:string,status:string,tags:array,ticket_form_id:string,type:string,via:string,via_reference_id:bigint>>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:785)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10$$anonfun$7.apply(HiveClientImpl.scala:365)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10$$anonfun$7.apply(HiveClientImpl.scala:365)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10.apply(HiveClientImpl.scala:365)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10.apply(HiveClientImpl.scala:361)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:272)
>   at 
> 
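For what it's worth, a minimal workaround sketch for the reported scenario is to 
rename the top-level columns before writing the parquet files, so the generated Hive 
type string contains no purely numeric field names (the "c_" prefix and the paths are 
illustrative assumptions; nested struct fields with numeric names would still need 
separate handling):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read().json("path.json");
// Prefix every top-level column so none of the names is a purely numeric string.
for (String c : df.columns()) {
  df = df.withColumnRenamed(c, "c_" + c);
}
df.write().parquet("dataframe");
{code}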

[jira] [Resolved] (SPARK-20943) Correct BypassMergeSortShuffleWriter's comment

2017-06-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20943.
---
Resolution: Not A Problem

If there's no (other) support for a request to change the wording, and no 
proposed change, I think we should close this.

> Correct BypassMergeSortShuffleWriter's comment
> --
>
> Key: SPARK-20943
> URL: https://issues.apache.org/jira/browse/SPARK-20943
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Shuffle
>Affects Versions: 2.1.1
>Reporter: CanBin Zheng
>Priority: Trivial
>  Labels: starter
>
> BypassMergeSortShuffleWriter.java contains comments about when to select this 
> write path; the three required conditions are described 
> as follows:  
> 1. no Ordering is specified, and
> 2. no Aggregator is specified, and
> 3. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold
> The conditions as written are partially wrong and misleading; the 
> right conditions should be:
> 1. map-side combine is false, and
> 2. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter

2017-06-06 Thread Kedarnath Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038403#comment-16038403
 ] 

Kedarnath Reddy commented on SPARK-20199:
-

Please look into this feature, as I need it for my organization's GBT 
implementation.

> GradientBoostedTreesModel doesn't have  featureSubsetStrategy parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark's GradientBoostedTreesModel doesn't have featureSubsetStrategy. It uses 
> random forest internally, which has featureSubsetStrategy hardcoded to "all". 
> It should be user-configurable, to allow randomness at the feature level.
> This parameter is available in H2O and XGBoost; a sample from H2O.ai is 
> gbmParams._col_sample_rate.
> Please provide the parameter.
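> For illustration, in the RDD-based {{org.apache.spark.mllib.tree}} API the contrast 
> looks roughly like this ({{trainingData}} and the parameter values are placeholder 
> assumptions):
> {code:java}
> // RandomForest already exposes featureSubsetStrategy ("auto", "all", "sqrt", ...):
> RandomForestModel rf = RandomForest.trainClassifier(
>     trainingData, 2, new java.util.HashMap<Integer, Integer>(),
>     100, "sqrt", "gini", 5, 32, 12345);
> 
> // GradientBoostedTrees only takes a BoostingStrategy, which has no such setting,
> // so the trees it builds internally always consider all features:
> BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Classification");
> GradientBoostedTreesModel gbt = GradientBoostedTrees.train(trainingData, boostingStrategy);
> {code}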



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20996) Better handling AM reattempt based on exit code in yarn mode

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20996:


Assignee: Apache Spark

> Better handling AM reattempt based on exit code in yarn mode
> 
>
> Key: SPARK-20996
> URL: https://issues.apache.org/jira/browse/SPARK-20996
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> Yarn provides max attempt configuration for applications running on it, 
> application has the chance to retry itself when failed. In the current Spark 
> code, no matter which failure AM occurred and if the failure doesn't reach to 
> the max attempt, RM will restart AM, this is not reasonable for some cases if 
> this issue is coming from AM itself, like user code failure, OOM, Spark 
> issue, executor failures, in large chance the reattempt of AM will meet this 
> issue again. Only when AM is failed due to external issue like crash, process 
> kill, NM failure, then AM should retry again.
> So here propose to improve this reattempt mechanism to only retry when it 
> meets external issues.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20996) Better handling AM reattempt based on exit code in yarn mode

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20996:


Assignee: (was: Apache Spark)

> Better handling AM reattempt based on exit code in yarn mode
> 
>
> Key: SPARK-20996
> URL: https://issues.apache.org/jira/browse/SPARK-20996
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> Yarn provides max attempt configuration for applications running on it, 
> application has the chance to retry itself when failed. In the current Spark 
> code, no matter which failure AM occurred and if the failure doesn't reach to 
> the max attempt, RM will restart AM, this is not reasonable for some cases if 
> this issue is coming from AM itself, like user code failure, OOM, Spark 
> issue, executor failures, in large chance the reattempt of AM will meet this 
> issue again. Only when AM is failed due to external issue like crash, process 
> kill, NM failure, then AM should retry again.
> So here propose to improve this reattempt mechanism to only retry when it 
> meets external issues.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20996) Better handling AM reattempt based on exit code in yarn mode

2017-06-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038404#comment-16038404
 ] 

Apache Spark commented on SPARK-20996:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/18213

> Better handling AM reattempt based on exit code in yarn mode
> 
>
> Key: SPARK-20996
> URL: https://issues.apache.org/jira/browse/SPARK-20996
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> Yarn provides max attempt configuration for applications running on it, 
> application has the chance to retry itself when failed. In the current Spark 
> code, no matter which failure AM occurred and if the failure doesn't reach to 
> the max attempt, RM will restart AM, this is not reasonable for some cases if 
> this issue is coming from AM itself, like user code failure, OOM, Spark 
> issue, executor failures, in large chance the reattempt of AM will meet this 
> issue again. Only when AM is failed due to external issue like crash, process 
> kill, NM failure, then AM should retry again.
> So here propose to improve this reattempt mechanism to only retry when it 
> meets external issues.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20499) Spark MLlib, GraphX 2.2 QA umbrella

2017-06-06 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20499.

Resolution: Done

> Spark MLlib, GraphX 2.2 QA umbrella
> ---
>
> Key: SPARK-20499
> URL: https://issues.apache.org/jira/browse/SPARK-20499
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX.   *SparkR is separate: [SPARK-20508].*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Check binary API compatibility for Scala/Java
> * Audit new public APIs (from the generated html doc)
> ** Scala
> ** Java compatibility
> ** Python coverage
> * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
> * Performance tests
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20507) Update MLlib, GraphX websites for 2.2

2017-06-06 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20507.

Resolution: Done
  Assignee: Nick Pentreath

No updates to MLlib project website required for {{2.2}} release.

> Update MLlib, GraphX websites for 2.2
> -
>
> Key: SPARK-20507
> URL: https://issues.apache.org/jira/browse/SPARK-20507
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20996) Better handling AM reattempt based on exit code in yarn mode

2017-06-06 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-20996:
---

 Summary: Better handling AM reattempt based on exit code in yarn 
mode
 Key: SPARK-20996
 URL: https://issues.apache.org/jira/browse/SPARK-20996
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.2.0
Reporter: Saisai Shao
Priority: Minor


YARN provides a max-attempts configuration for applications running on it, so an 
application gets a chance to retry when it fails. In the current Spark code, no 
matter what kind of failure the AM hit, the RM will restart the AM as long as the 
max attempts are not exhausted. This is unreasonable when the failure comes from 
the AM itself (user code failure, OOM, a Spark issue, executor failures), because 
the reattempt will most likely hit the same issue again. The AM should only be 
retried when it failed due to an external issue such as a crash, a process kill, 
or an NM failure.

So here we propose to improve the reattempt mechanism to retry only when the AM 
fails due to external issues.
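
Purely as an illustration of the idea (this is not Spark's or YARN's actual API, and 
the exit codes are assumptions), the classification could look like:
{code:java}
// Decide whether a failed AM attempt is worth counting towards another retry.
// Exit codes caused by external signals suggest the environment killed the process,
// so a retry makes sense; anything else will most likely reproduce on the next attempt.
static boolean retryWorthwhile(int amExitCode) {
  switch (amExitCode) {
    case 137: // 128 + SIGKILL, e.g. the container was killed externally
    case 143: // 128 + SIGTERM, e.g. NM shutdown or preemption
      return true;
    default:  // user-code exceptions, OOM, Spark-internal failures, ...
      return false;
  }
}
{code}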



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20995) 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20995:


Assignee: Apache Spark

> 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.
> --
>
> Key: SPARK-20995
> URL: https://issues.apache.org/jira/browse/SPARK-20995
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: guoxiaolongzte
>Assignee: Apache Spark
>Priority: Minor
>
> Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which 
> contains the (client side) configuration files for the Hadoop cluster.
> These configs are used to write to HDFS and connect to the YARN 
> ResourceManager. The
> configuration contained in this directory will be distributed to the YARN 
> cluster so that all
> containers used by the application use the same configuration.
> Sometimes, HADOOP_CONF_DIR is set to the hdfs configuration file path. So, 
> YARN_CONF_DIR should be set to the yarn configuration file path.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20995) 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20995:


Assignee: (was: Apache Spark)

> 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.
> --
>
> Key: SPARK-20995
> URL: https://issues.apache.org/jira/browse/SPARK-20995
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: guoxiaolongzte
>Priority: Minor
>
> Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which 
> contains the (client side) configuration files for the Hadoop cluster.
> These configs are used to write to HDFS and connect to the YARN 
> ResourceManager. The
> configuration contained in this directory will be distributed to the YARN 
> cluster so that all
> containers used by the application use the same configuration.
> Sometimes, HADOOP_CONF_DIR is set to the hdfs configuration file path. So, 
> YARN_CONF_DIR should be set to the yarn configuration file path.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20995) 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.

2017-06-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038361#comment-16038361
 ] 

Apache Spark commented on SPARK-20995:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/18212

> 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.
> --
>
> Key: SPARK-20995
> URL: https://issues.apache.org/jira/browse/SPARK-20995
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: guoxiaolongzte
>Priority: Minor
>
> Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which 
> contains the (client side) configuration files for the Hadoop cluster.
> These configs are used to write to HDFS and connect to the YARN 
> ResourceManager. The
> configuration contained in this directory will be distributed to the YARN 
> cluster so that all
> containers used by the application use the same configuration.
> Sometimes, HADOOP_CONF_DIR is set to the hdfs configuration file path. So, 
> YARN_CONF_DIR should be set to the yarn configuration file path.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20995) 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.

2017-06-06 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-20995:
---
Description: 
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which 
contains the (client side) configuration files for the Hadoop cluster.
These configs are used to write to HDFS and connect to the YARN 
ResourceManager. The
configuration contained in this directory will be distributed to the YARN 
cluster so that all
containers used by the application use the same configuration.

Sometimes, HADOOP_CONF_DIR is set to the hdfs configuration file path. So, 
YARN_CONF_DIR should be set to the yarn configuration file path.

> 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.
> --
>
> Key: SPARK-20995
> URL: https://issues.apache.org/jira/browse/SPARK-20995
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: guoxiaolongzte
>Priority: Minor
>
> Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which 
> contains the (client side) configuration files for the Hadoop cluster.
> These configs are used to write to HDFS and connect to the YARN 
> ResourceManager. The
> configuration contained in this directory will be distributed to the YARN 
> cluster so that all
> containers used by the application use the same configuration.
> Sometimes, HADOOP_CONF_DIR is set to the hdfs configuration file path. So, 
> YARN_CONF_DIR should be set to the yarn configuration file path.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20996) Better handling AM reattempt based on exit code in yarn mode

2017-06-06 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-20996:

Description: 
Yarn provides max attempt configuration for applications running on it, 
application has the chance to retry itself when failed. In the current Spark 
code, no matter which failure AM occurred and if the failure doesn't reach to 
the max attempt, RM will restart AM, this is not reasonable for some cases if 
this issue is coming from AM itself, like user code failure, OOM, Spark issue, 
executor failures, in large chance the reattempt of AM will meet this issue 
again. Only when AM is failed due to external issue like crash, process kill, 
NM failure, then AM should retry again.

So here propose to improve this reattempt mechanism to only retry when it meets 
external issues.

  was:
Yarn provides max attempt configuration for applications running on it, 
application has the chance to retry itself when failed. In the current Spark 
code, no matter which failure AM occurred and if the failure doesn't reach to 
the max attempt, RM will restart AM, this is not reasonable for some cases if 
this issue is coming from AM itself, like user code failure, OOM, Spark issue, 
executor failures, in large case the reattempt of AM will meet this issue 
again. Only when AM is failed due to external issue like crash, process kill, 
NM failure, then AM should retry again.

So here propose to improve this reattempt mechanism to only retry when it meets 
external issues.


> Better handling AM reattempt based on exit code in yarn mode
> 
>
> Key: SPARK-20996
> URL: https://issues.apache.org/jira/browse/SPARK-20996
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> Yarn provides max attempt configuration for applications running on it, 
> application has the chance to retry itself when failed. In the current Spark 
> code, no matter which failure AM occurred and if the failure doesn't reach to 
> the max attempt, RM will restart AM, this is not reasonable for some cases if 
> this issue is coming from AM itself, like user code failure, OOM, Spark 
> issue, executor failures, in large chance the reattempt of AM will meet this 
> issue again. Only when AM is failed due to external issue like crash, process 
> kill, NM failure, then AM should retry again.
> So here propose to improve this reattempt mechanism to only retry when it 
> meets external issues.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20995) 'Spark-env.sh.template' should add 'YARN_CONF_DIR' configuration instructions.

2017-06-06 Thread guoxiaolongzte (JIRA)
guoxiaolongzte created SPARK-20995:
--

 Summary: 'Spark-env.sh.template' should add 'YARN_CONF_DIR' 
configuration instructions.
 Key: SPARK-20995
 URL: https://issues.apache.org/jira/browse/SPARK-20995
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: guoxiaolongzte
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20994) Alleviate memory pressure in StreamManager

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20994:


Assignee: (was: Apache Spark)

> Alleviate memory pressure in StreamManager
> --
>
> Key: SPARK-20994
> URL: https://issues.apache.org/jira/browse/SPARK-20994
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
>
> In my cluster, we are suffering from OOMs of the shuffle service.
> We found that a lot of executors are fetching blocks from a single 
> shuffle service. Analyzing the memory, we found that the 
> blockIds ({{shuffle_shuffleId_mapId_reduceId}}) take about 1.5 GB.
> In the current code, chunks are fetched from the shuffle service in two steps:
> Step-1. Send {{OpenBlocks}}, which contains the list of blocks to fetch;
> Step-2. Fetch the consecutive chunks from the shuffle service by {{streamId}} and 
> {{chunkIndex}}.
> Conceptually, there is no need to send the block list in Step-1. The client can 
> send the blockId in Step-2. On receiving {{ChunkFetchRequest}}, the server can check 
> whether the chunkId is in the local block manager and send back the response. 
> Thus the memory cost can be reduced.
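> A back-of-the-envelope estimate of where 1.5 GB can come from (all numbers below are 
> illustrative assumptions, not measurements from this cluster):
> {code:java}
> // ~20M block IDs (e.g. 10,000 map tasks x 2,000 reduce partitions), each held as a
> // Java String of ~19 characters plus roughly 40 bytes of String/char[] overhead.
> long numBlockIds = 20_000_000L;
> long bytesPerId  = 40 + 2 * "shuffle_1_12345_678".length();
> System.out.println((numBlockIds * bytesPerId) / (1024 * 1024) + " MB"); // ~1.5 GB
> {code}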



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20994) Alleviate memory pressure in StreamManager

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20994:


Assignee: Apache Spark

> Alleviate memory pressure in StreamManager
> --
>
> Key: SPARK-20994
> URL: https://issues.apache.org/jira/browse/SPARK-20994
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
>Assignee: Apache Spark
>
> In my cluster, we are suffering from OOMs of the shuffle service.
> We found that a lot of executors are fetching blocks from a single 
> shuffle service. Analyzing the memory, we found that the 
> blockIds ({{shuffle_shuffleId_mapId_reduceId}}) take about 1.5 GB.
> In the current code, chunks are fetched from the shuffle service in two steps:
> Step-1. Send {{OpenBlocks}}, which contains the list of blocks to fetch;
> Step-2. Fetch the consecutive chunks from the shuffle service by {{streamId}} and 
> {{chunkIndex}}.
> Conceptually, there is no need to send the block list in Step-1. The client can 
> send the blockId in Step-2. On receiving {{ChunkFetchRequest}}, the server can check 
> whether the chunkId is in the local block manager and send back the response. 
> Thus the memory cost can be reduced.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20993) The configuration item about 'Spark.blacklist.enabled' need to set the default value 'false'

2017-06-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20993:


Assignee: (was: Apache Spark)

> The configuration item about 'Spark.blacklist.enabled'  need to set the 
> default value 'false'
> -
>
> Key: SPARK-20993
> URL: https://issues.apache.org/jira/browse/SPARK-20993
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: guoxiaolongzte
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20994) Alleviate memory pressure in StreamManager

2017-06-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038280#comment-16038280
 ] 

Apache Spark commented on SPARK-20994:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/18211

> Alleviate memory pressure in StreamManager
> --
>
> Key: SPARK-20994
> URL: https://issues.apache.org/jira/browse/SPARK-20994
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
>
> In my cluster, we are suffering from OOMs of the shuffle service.
> We found that a lot of executors are fetching blocks from a single 
> shuffle service. Analyzing the memory, we found that the 
> blockIds ({{shuffle_shuffleId_mapId_reduceId}}) take about 1.5 GB.
> In the current code, chunks are fetched from the shuffle service in two steps:
> Step-1. Send {{OpenBlocks}}, which contains the list of blocks to fetch;
> Step-2. Fetch the consecutive chunks from the shuffle service by {{streamId}} and 
> {{chunkIndex}}.
> Conceptually, there is no need to send the block list in Step-1. The client can 
> send the blockId in Step-2. On receiving {{ChunkFetchRequest}}, the server can check 
> whether the chunkId is in the local block manager and send back the response. 
> Thus the memory cost can be reduced.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


