[jira] [Commented] (SPARK-11560) Optimize KMeans implementation

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105379#comment-15105379
 ] 

Apache Spark commented on SPARK-11560:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10806

> Optimize KMeans implementation
> --
>
> Key: SPARK-11560
> URL: https://issues.apache.org/jira/browse/SPARK-11560
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>
> After we dropped `runs`, we can simplify and optimize the k-means 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8519) Blockify distance computation in k-means

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105378#comment-15105378
 ] 

Apache Spark commented on SPARK-8519:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10806

> Blockify distance computation in k-means
> 
>
> Key: SPARK-8519
> URL: https://issues.apache.org/jira/browse/SPARK-8519
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>  Labels: advanced
>
> The performance of pairwise distance computation in k-means can benefit from 
> BLAS Level 3 matrix-matrix multiplications. It requires we update the 
> implementation to use blocks. Even for sparse data, we might be able to see 
> some performance gain.
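
For illustration, a minimal self-contained sketch of the blocking idea (plain Scala, not MLlib code; the object name is made up): pairwise squared distances decompose into ||x||^2 + ||c||^2 - 2*x.c, and the cross-product term for a whole block of points against all centers is exactly the matrix-matrix product that a BLAS Level 3 gemm would compute.

{code}
// Illustrative sketch only (plain Scala, no BLAS): squared distances between a
// block of points and the centers via ||x||^2 + ||c||^2 - 2 * x.c. In a real
// blocked implementation the `cross` step would be a single Level 3 gemm call.
object BlockedDistanceSketch {
  def squaredDistances(points: Array[Array[Double]],
                       centers: Array[Array[Double]]): Array[Array[Double]] = {
    val pNorms = points.map(p => p.map(v => v * v).sum)
    val cNorms = centers.map(c => c.map(v => v * v).sum)
    // Cross products for the whole block at once (the gemm-shaped work).
    val cross = points.map(p => centers.map(c => p.zip(c).map { case (a, b) => a * b }.sum))
    Array.tabulate(points.length, centers.length) { (i, j) =>
      math.max(pNorms(i) + cNorms(j) - 2.0 * cross(i)(j), 0.0)
    }
  }

  def main(args: Array[String]): Unit = {
    val pts = Array(Array(0.0, 0.0), Array(1.0, 1.0))
    val ctr = Array(Array(1.0, 0.0))
    squaredDistances(pts, ctr).foreach(row => println(row.mkString(", ")))
  }
}
{code}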



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11559) Make `runs` no effect in k-means

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105380#comment-15105380
 ] 

Apache Spark commented on SPARK-11559:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10806

> Make `runs` no effect in k-means
> 
>
> Key: SPARK-11559
> URL: https://issues.apache.org/jira/browse/SPARK-11559
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> We deprecated `runs` in Spark 1.6 (SPARK-11358). In 2.0, we can either remove 
> `runs` or make it no effect (with warning messages). So we can simplify the 
> implementation. I prefer the latter for better binary compatibility.
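
For concreteness, a tiny sketch (not the actual MLlib code; the class name is made up) of the option preferred above: keep the setter for binary compatibility, ignore the value, and warn.

{code}
// Illustrative only: a binary-compatible no-op `setRuns` with a warning.
class KMeansLike {
  @deprecated("`runs` has no effect; k-means always performs a single run.", "2.0.0")
  def setRuns(runs: Int): this.type = {
    if (runs != 1) {
      println(s"WARNING: setRuns($runs) is ignored; the value is no longer used.")
    }
    this // nothing is stored, so the rest of the implementation can assume one run
  }
}
{code}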



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12878) Dataframe fails with nested User Defined Types

2016-01-18 Thread Joao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joao updated SPARK-12878:
-
Description: 
Spark 1.6.0 crashes when using nested User Defined Types in a Dataframe. 
In version 1.5.2 the code below worked just fine:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[AUDT])
case class A(list:Seq[B])

class AUDT extends UserDefinedType[A] {
  override def sqlType: DataType = StructType(Seq(StructField("list", 
ArrayType(BUDT, containsNull = false), nullable = true)))
  override def userClass: Class[A] = classOf[A]
  override def serialize(obj: Any): Any = obj match {
case A(list) =>
  val row = new GenericMutableRow(1)
  row.update(0, new GenericArrayData(list.map(_.asInstanceOf[Any]).toArray))
  row
  }

  override def deserialize(datum: Any): A = {
datum match {
  case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq)
}
  }
}

object AUDT extends AUDT

@SQLUserDefinedType(udt = classOf[BUDT])
case class B(text:Int)

class BUDT extends UserDefinedType[B] {
  override def sqlType: DataType = StructType(Seq(StructField("num", 
IntegerType, nullable = false)))
  override def userClass: Class[B] = classOf[B]
  override def serialize(obj: Any): Any = obj match {
case B(text) =>
  val row = new GenericMutableRow(1)
  row.setInt(0, text)
  row
  }

  override def deserialize(datum: Any): B = {
datum match {  case row: InternalRow => new B(row.getInt(0))  }
  }
}

object BUDT extends BUDT

object Test {
  def main(args:Array[String]) = {

val col = Seq(new A(Seq(new B(1), new B(2))),
  new A(Seq(new B(3), new B(4))))

val sc = new SparkContext(new 
SparkConf().setMaster("local[1]").setAppName("TestSpark"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(1 to 2 zip col).toDF("id","b")
df.select("b").show()
df.collect().foreach(println)
  }
}

In the new version (1.6.0) I needed to include the following import:

import org.apache.spark.sql.catalyst.expressions.GenericMutableRow

However, Spark crashes at runtime:

16/01/18 14:36:22 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to 
org.apache.spark.sql.catalyst.InternalRow
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:51)
at 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 

[jira] [Created] (SPARK-12878) Dataframe fails with nested User Defined Types

2016-01-18 Thread Joao (JIRA)
Joao created SPARK-12878:


 Summary: Dataframe fails with nested User Defined Types
 Key: SPARK-12878
 URL: https://issues.apache.org/jira/browse/SPARK-12878
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Joao
Priority: Blocker


Spark 1.6.0 crashes when using nested User Defined Types in a Dataframe. 
In version 1.5.2 the code below worked just fine:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[AUDT])
case class A(list:Seq[B])

class AUDT extends UserDefinedType[A] {
  override def sqlType: DataType = StructType(Seq(StructField("list", 
ArrayType(BUDT, containsNull = false), nullable = true)))
  override def userClass: Class[A] = classOf[A]
  override def serialize(obj: Any): Any = obj match {
case A(list) =>
  val row = new GenericMutableRow(1).update(0, new 
GenericArrayData(list.map(_.asInstanceOf[Any]).toArray))
  row
  }

  override def deserialize(datum: Any): A = {
datum match {
  case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq)
}
  }
}

object AUDT extends AUDT

@SQLUserDefinedType(udt = classOf[BUDT])
case class B(text:Int)

class BUDT extends UserDefinedType[B] {
  override def sqlType: DataType = StructType(Seq(StructField("num", 
IntegerType, nullable = false)))
  override def userClass: Class[B] = classOf[B]
  override def serialize(obj: Any): Any = obj match {
case B(text) =>
  val row = new GenericMutableRow(1).setInt(0, text)
  row
  }

  override def deserialize(datum: Any): B = {
datum match {  case row: InternalRow => new B(row.getInt(0))  }
  }
}

object BUDT extends BUDT

object Test {
  def main(args:Array[String]) = {

val col = Seq(new A(Seq(new B(1), new B(2))),
  new A(Seq(new B(3), new B(4))))

val sc = new SparkContext(new 
SparkConf().setMaster("local[1]").setAppName("TestSpark"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(1 to 2 zip col).toDF("id","b")
df.select("b").show()
df.collect().foreach(println)
  }
}

In the new version (1.6.0) I needed to include the following import:

import org.apache.spark.sql.catalyst.expressions.GenericMutableRow

However, Spark crashes at runtime:

16/01/18 14:36:22 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to 
org.apache.spark.sql.catalyst.InternalRow
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:51)
at 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at 
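
One detail that lines up with the trace above: in this original listing, serialize builds the row as `val row = new GenericMutableRow(1).update(0, ...)` (and `.setInt(0, text)` for B). Both `update` and `setInt` return Unit, so `row` ends up being a BoxedUnit, which is what the ClassCastException complains about. The updated description earlier in this thread separates construction from mutation, roughly as below; per the reporter that shape still crashes on 1.6.0, so this is only a note on the listing, not a claimed fix.

{code}
// Shape used in the updated report: build the row, mutate it, then return it.
override def serialize(obj: Any): Any = obj match {
  case B(text) =>
    val row = new GenericMutableRow(1) // construct first
    row.setInt(0, text)                // setInt returns Unit, so don't chain it
    row                                // return the row itself
}
{code}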

[jira] [Resolved] (SPARK-1529) Support DFS based shuffle in addition to Netty shuffle

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1529.
--
Resolution: Won't Fix

> Support DFS based shuffle in addition to Netty shuffle
> --
>
> Key: SPARK-1529
> URL: https://issues.apache.org/jira/browse/SPARK-1529
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Kannan Rajah
> Attachments: Spark Shuffle using HDFS.pdf
>
>
> In some environments, like with MapR, local volumes are accessed through the 
> Hadoop filesystem interface. Shuffle is implemented by writing intermediate 
> data to local disk and serving it to remote nodes using Netty as a transport 
> mechanism. We want to provide an HDFS-based shuffle such that data can be 
> written to HDFS (instead of local disk) and served using the HDFS API on the 
> remote nodes. This could involve exposing a file system abstraction to Spark 
> shuffle with two modes of running it: in the default mode it writes to local 
> disk, and in the DFS mode it writes to HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2153) CassandraTest fails for newer Cassandra due to case insensitive key space

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2153:
-
Labels:   (was: examples)
Issue Type: Improvement  (was: Bug)

[~vishnusubramanian] would this be resolved if the table and col names were 
simply all lower case in the example?

> CassandraTest fails for newer Cassandra due to case insensitive key space
> -
>
> Key: SPARK-2153
> URL: https://issues.apache.org/jira/browse/SPARK-2153
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.0.0
>Reporter: vishnu
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> The Spark example CassandraTest.scala cannot be built against newer versions 
> of Cassandra; I tried it on Cassandra 2.0.8. 
> This is because Cassandra treats keyspace names as case-insensitive and stores 
> all keyspaces in lowercase, while the example uses the keyspace "casDemo", 
> so the program fails with an error stating that the keyspace was not found.
> The new Cassandra jars no longer contain org.apache.cassandra.db.IColumn, so 
> we have to use org.apache.cassandra.db.Column instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2703) Make Tachyon related unit tests execute without deploying a Tachyon system locally.

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2703.
--
Resolution: Won't Fix
  Assignee: (was: Rong Gu)

> Make Tachyon related unit tests execute without deploying a Tachyon system 
> locally.
> ---
>
> Key: SPARK-2703
> URL: https://issues.apache.org/jira/browse/SPARK-2703
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Haoyuan Li
>
> Use the LocalTachyonCluster class in tachyon-test.jar in 0.5.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10264) Add @Since annotation to ml.recommendation

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10264.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10756
[https://github.com/apache/spark/pull/10756]

> Add @Since annotation to ml.recommendation
> --
>
> Key: SPARK-10264
> URL: https://issues.apache.org/jira/browse/SPARK-10264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Tijo Thomas
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1169) Add countApproxDistinctByKey to PySpark

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1169.
--
Resolution: Won't Fix

> Add countApproxDistinctByKey to PySpark
> ---
>
> Key: SPARK-1169
> URL: https://issues.apache.org/jira/browse/SPARK-1169
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Matei Zaharia
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2653.
--
Resolution: Not A Problem

Although I don't feel so strongly about this either way, it would be a behavior 
change and I've seen no activity on this. There is really no executor in local 
mode; execution happens in the driver. I think it's sensible to use the driver 
memory as the heap size.

> Heap size should be the sum of driver.memory and executor.memory in local mode
> --
>
> Key: SPARK-2653
> URL: https://issues.apache.org/jira/browse/SPARK-2653
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Davies Liu
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In local mode, the driver and executor run in the same JVM, so the heap size 
> of JVM should be the sum of spark.driver.memory and spark.executor.memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3437) Adapt maven build to work without the need of hardcoding scala binary version in artifact id.

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3437.
--
Resolution: Won't Fix

> Adapt maven build to work without the need of hardcoding scala binary version 
> in artifact id.
> -
>
> Key: SPARK-3437
> URL: https://issues.apache.org/jira/browse/SPARK-3437
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3107) Don't pass null jar to executor in yarn-client mode

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3107.
--
Resolution: Duplicate

I think the result was that this was resolved, narrowly, by SPARK-2933

> Don't pass null jar to executor in yarn-client mode
> ---
>
> Key: SPARK-3107
> URL: https://issues.apache.org/jira/browse/SPARK-3107
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> In the following line, ExecutorLauncher's `--jar` takes in null.
> {code}
> 14/08/18 20:52:43 INFO yarn.Client:   command: $JAVA_HOME/bin/java -server 
> -Xmx512m ... org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' 
> --jar null  --arg  'ip-172-31-0-12.us-west-2.compute.internal:56838' 
> --executor-memory 1024 --executor-cores 1 --num-executors  2
> {code}
> Also it appears that we set a bunch of system properties to empty strings 
> (not shown). We should avoid setting these if they don't actually contain 
> values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11518) The script spark-submit.cmd can not handle spark directory with space.

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11518:
--
Comment: was deleted

(was: No. I just use work around to not put Spark into a such folder.)

> The script spark-submit.cmd can not handle spark directory with space.
> --
>
> Key: SPARK-11518
> URL: https://issues.apache.org/jira/browse/SPARK-11518
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Windows
>Affects Versions: 1.4.1
>Reporter: Cele Liu
>Priority: Minor
>
> After unzipping Spark into D:\Program Files\Spark, submitting an app fails with 
> the error:
> 'D:\Program' is not recognized as an internal or external command,
> operable program or batch file.
> In spark-submit.cmd, the script does not handle spaces in the path:
> cmd /V /E /C %~dp0spark-submit2.cmd %*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12863) missing api for renaming and mapping result of operations on GroupedDataset to case classes

2016-01-18 Thread Milad Khajavi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105262#comment-15105262
 ] 

Milad Khajavi commented on SPARK-12863:
---

I'm doing it with Dataframe api, but Dataset api doesn't have this
functionality.





-- 
Milād Khājavi
http://blog.khajavi.ir
Having the source means you can do it yourself.
I tried to change the world, but I couldn’t find the source code.


> missing api for renaming and mapping result of operations on GroupedDataset 
> to case classes
> ---
>
> Key: SPARK-12863
> URL: https://issues.apache.org/jira/browse/SPARK-12863
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
>Reporter: Milad Khajavi
>
> Here I struggle with the Spark API to convert the result of count() into the 
> KeyValue case class. I think there is no API for renaming the columns of a 
> Dataset and mapping them to a new case class.
> case class LogRow(id: String, location: String, time: Long)
> case class KeyValue(key: (String, String), value: Long)
> val log = LogRow("1", "a", 1) :: and so on
> log.toDS().groupBy(l => {(l.id, l.location)}).count().toDF().toDF("key", 
> "value").as[KeyValue].printSchema()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12863) missing api for renaming and mapping result of operations on GroupedDataset to case classes

2016-01-18 Thread Milad Khajavi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105262#comment-15105262
 ] 

Milad Khajavi edited comment on SPARK-12863 at 1/18/16 1:18 PM:


I'm doing it with Dataframe api, but Dataset api doesn't have this
functionality.


was (Author: khajavi):
I'm doing it with Dataframe api, but Dataset api doesn't have this
functionality.





-- 
Milād Khājavi
http://blog.khajavi.ir
Having the source means you can do it yourself.
I tried to change the world, but I couldn’t find the source code.


> missing api for renaming and mapping result of operations on GroupedDataset 
> to case classes
> ---
>
> Key: SPARK-12863
> URL: https://issues.apache.org/jira/browse/SPARK-12863
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
>Reporter: Milad Khajavi
>
> Here I struggle with the Spark API to convert the result of count() into the 
> KeyValue case class. I think there is no API for renaming the columns of a 
> Dataset and mapping them to a new case class.
> case class LogRow(id: String, location: String, time: Long)
> case class KeyValue(key: (String, String), value: Long)
> val log = LogRow("1", "a", 1) :: and so on
> log.toDS().groupBy(l => {(l.id, l.location)}).count().toDF().toDF("key", 
> "value").as[KeyValue].printSchema()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3872) Rewrite the test for ActorInputStream.

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3872.
--
Resolution: Won't Fix

> Rewrite the test for ActorInputStream. 
> ---
>
> Key: SPARK-3872
> URL: https://issues.apache.org/jira/browse/SPARK-3872
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8519) Blockify distance computation in k-means

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105426#comment-15105426
 ] 

Apache Spark commented on SPARK-8519:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10306

> Blockify distance computation in k-means
> 
>
> Key: SPARK-8519
> URL: https://issues.apache.org/jira/browse/SPARK-8519
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>  Labels: advanced
>
> The performance of pairwise distance computation in k-means can benefit from 
> BLAS Level 3 matrix-matrix multiplications. It requires we update the 
> implementation to use blocks. Even for sparse data, we might be able to see 
> some performance gain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11559) Make `runs` no effect in k-means

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105428#comment-15105428
 ] 

Apache Spark commented on SPARK-11559:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10306

> Make `runs` no effect in k-means
> 
>
> Key: SPARK-11559
> URL: https://issues.apache.org/jira/browse/SPARK-11559
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> We deprecated `runs` in Spark 1.6 (SPARK-11358). In 2.0, we can either remove 
> `runs` or make it no effect (with warning messages). So we can simplify the 
> implementation. I prefer the latter for better binary compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11560) Optimize KMeans implementation

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105427#comment-15105427
 ] 

Apache Spark commented on SPARK-11560:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10306

> Optimize KMeans implementation
> --
>
> Key: SPARK-11560
> URL: https://issues.apache.org/jira/browse/SPARK-11560
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>
> After we dropped `runs`, we can simplify and optimize the k-means 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2016-01-18 Thread Ma Xiaoyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15104911#comment-15104911
 ] 

Ma Xiaoyu commented on SPARK-5159:
--

Sorry and I realised that I messed up my PR with SPARK-6910.
My code is shadowed inside.
If needed, I might resubmit it with only the change of doAs part. That one is 
just trying to make doAs work.

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the spark sql thrift server on a kerberos secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2016-01-18 Thread Ma Xiaoyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15104911#comment-15104911
 ] 

Ma Xiaoyu edited comment on SPARK-5159 at 1/18/16 8:16 AM:
---

Sorry and I realised that I messed up my PR with SPARK-6910.
My code is shadowed inside and not getting merged.
If needed, I might resubmit it with only the change of doAs part. That one is 
just trying to make doAs work.


was (Author: ilovesoup):
Sorry and I realised that I messed up my PR with SPARK-6910.
My code is shadowed inside.
If needed, I might resubmit it with only the change of doAs part. That one is 
just trying to make doAs work.

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the spark sql thrift server on a kerberos secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2016-01-18 Thread Ma Xiaoyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15104911#comment-15104911
 ] 

Ma Xiaoyu edited comment on SPARK-5159 at 1/18/16 8:17 AM:
---

Sorry and I realised that I messed up my PR with SPARK-6910.
My change is shadowed inside and not getting merged.
If needed, I might resubmit it with only the change of doAs part. That one is 
just trying to make doAs work.


was (Author: ilovesoup):
Sorry and I realised that I messed up my PR with SPARK-6910.
My code is shadowed inside and not getting merged.
If needed, I might resubmit it with only the change of doAs part. That one is 
just trying to make doAs work.

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the spark sql thrift server on a kerberos secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3159) Check for reducible DecisionTree

2016-01-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105061#comment-15105061
 ] 

Sean Owen commented on SPARK-3159:
--

[~josephkb] I noticed a number of fairly old issues (~18 months) from around 
this time regarding small improvements to decision trees that didn't have a 
follow up. Are these still valid?

> Check for reducible DecisionTree
> 
>
> Key: SPARK-3159
> URL: https://issues.apache.org/jira/browse/SPARK-3159
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Improvement: test-time computation
> Currently, pairs of leaf nodes with the same parent can both output the same 
> prediction.  This happens since the splitting criterion (e.g., Gini) is not 
> the same as prediction accuracy/MSE; the splitting criterion can sometimes be 
> improved even when both children would still output the same prediction 
> (e.g., based on the majority label for classification).
> We could check the tree and reduce it if possible after training.
> Note: This happens with scikit-learn as well.
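
A toy sketch of the post-training reduction being proposed (the Node type here is hypothetical, not MLlib's internal tree representation): collapse any internal node whose two children reduce to leaves with identical predictions.

{code}
// Hypothetical tree type for illustration only; not Spark's DecisionTree model.
sealed trait Node
case class Leaf(prediction: Double) extends Node
case class Internal(left: Node, right: Node) extends Node

object TreeReducer {
  def reduce(node: Node): Node = node match {
    case Internal(l, r) =>
      (reduce(l), reduce(r)) match {
        case (Leaf(p1), Leaf(p2)) if p1 == p2 => Leaf(p1) // siblings agree: merge them
        case (rl, rr)                         => Internal(rl, rr)
      }
    case leaf => leaf
  }
}
{code}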



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2780) Create a StreamingContext.setLocalProperty for setting local property of jobs launched by streaming

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2780.
--
Resolution: Won't Fix

> Create a StreamingContext.setLocalProperty for setting local property of jobs 
> launched by streaming
> ---
>
> Key: SPARK-2780
> URL: https://issues.apache.org/jira/browse/SPARK-2780
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Tathagata Das
>Priority: Minor
>
> SparkContext.setLocalProperty makes all Spark jobs submitted using
> the current thread belong to the set pool. However, in Spark
> Streaming, all the jobs are actually launched in the background from a
> different thread. So this setting does not work. 
> Currently, there is a workaround: if you are doing any kind of output operation 
> on DStreams, like DStream.foreachRDD(), you can set the property inside that 
> operation:
> dstream.foreachRDD(rdd =>
>rdd.sparkContext.setLocalProperty(...)
> )
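
Spelled out a bit more, the workaround looks like the sketch below. The pool name and `dstream` are illustrative; this assumes a streaming job with the fair scheduler configured.

{code}
// Sketch of the workaround above; "myPool" and `dstream` are illustrative.
dstream.foreachRDD { rdd =>
  // Runs on the thread that actually submits the jobs for this batch.
  rdd.sparkContext.setLocalProperty("spark.scheduler.pool", "myPool")
  rdd.count()
}
{code}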



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2255) create a self destructing iterator that releases records from hash maps

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2255.
--
Resolution: Won't Fix

I'm guessing this isn't going anywhere or is obsolete

> create a self destructing iterator that releases records from hash maps
> ---
>
> Key: SPARK-2255
> URL: https://issues.apache.org/jira/browse/SPARK-2255
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is a small thing to do that can help with GC pressure. For 
> aggregations (and potentially joins), we don't really need to hold onto the 
> key-value pairs once we have iterated over them. We can create a 
> self-destructing iterator for AppendOnlyMap / ExternalAppendOnlyMap that removes 
> references to each key-value pair as the iterator goes through the records, so 
> that memory can be freed quickly.
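
A generic illustration of the idea in plain Scala (not AppendOnlyMap itself; the class name is made up): an iterator that drops its reference to each record as soon as it has been handed out, so consumed entries become eligible for GC while iteration continues.

{code}
// Generic sketch, not Spark's AppendOnlyMap: a "self-destructing" iterator.
class DestructiveIterator[T <: AnyRef](buffer: Array[T]) extends Iterator[T] {
  private var i = 0
  def hasNext: Boolean = i < buffer.length
  def next(): T = {
    val elem = buffer(i)
    buffer(i) = null.asInstanceOf[T] // release the reference so it can be collected
    i += 1
    elem
  }
}
{code}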



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-748) Add documentation page describing interoperability with other software (e.g. HBase, JDBC, Kafka, etc.)

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-748.
-
Resolution: Won't Fix

Some of this exists in pieces in the docs. It would also commit us to updating and 
maintaining docs for integration with these and future external systems. Given that 
the integration example code isn't well maintained, I suspect this won't happen, 
and this hasn't moved in 15 months.

> Add documentation page describing interoperability with other software (e.g. 
> HBase, JDBC, Kafka, etc.)
> --
>
> Key: SPARK-748
> URL: https://issues.apache.org/jira/browse/SPARK-748
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Josh Rosen
>
> Spark seems to be gaining a lot of data input / output features for 
> integrating with systems like HBase, Kafka, JDBC, Hadoop, etc.
> It might be a good idea to create a single documentation page that provides a 
> list of all of the data sources that Spark supports and links to the relevant 
> documentation / examples / {{spark-users}} threads.  This would help 
> prospective users to evaluate how easy it will be to integrate Spark with 
> their existing systems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12875) Add Weight of Evidence and Information value to Spark.ml as a feature transformer

2016-01-18 Thread yuhao yang (JIRA)
yuhao yang created SPARK-12875:
--

 Summary: Add Weight of Evidence and Information value to Spark.ml 
as a feature transformer
 Key: SPARK-12875
 URL: https://issues.apache.org/jira/browse/SPARK-12875
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang
Priority: Minor


As a feature transformer, WOE and IV enable one to:

Consider each variable’s independent contribution to the outcome.
Detect linear and non-linear relationships.
Rank variables in terms of "univariate" predictive strength.
Visualize the correlations between the predictive variables and the binary 
outcome.

http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/ gives a 
good introduction to WoE and IV.

 The Weight of Evidence (WoE) value provides a measure of how well a grouping of a 
feature is able to distinguish between the levels of a binary response (e.g. "good" 
versus "bad"). It is widely used for grouping continuous features and for mapping 
categorical features to continuous values. It is computed from the basic odds 
ratio:
(Distribution of positive outcomes) / (Distribution of negative outcomes)
where each distribution refers to the proportion of positive or negative outcomes 
in the respective group, relative to the column totals.

The WoE recoding of features is particularly well suited for subsequent 
modeling using Logistic Regression or MLP.

In addition, the information value or IV can be computed based on WoE, which is 
a popular technique to select variables in a predictive model.

TODO: Currently we support only the calculation for categorical features. Add an 
estimator to estimate the proper grouping for continuous features. 
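
A plain-Scala sketch of the two quantities, assuming the conventional definitions (WoE as the natural log of the odds ratio above, and IV as the WoE-weighted difference of the two distributions); these formulas are standard but not spelled out in the ticket, and zero-count smoothing is left out.

{code}
// Plain-Scala sketch of WoE/IV per category from (positive, negative) counts.
// Assumes the conventional definitions: WoE = ln(distrPos / distrNeg),
// IV = sum((distrPos - distrNeg) * WoE). Zero-count smoothing is omitted.
def woeAndIv(counts: Map[String, (Long, Long)]): (Map[String, Double], Double) = {
  val posTotal = counts.values.map(_._1).sum.toDouble
  val negTotal = counts.values.map(_._2).sum.toDouble
  val woe = counts.map { case (category, (pos, neg)) =>
    category -> math.log((pos / posTotal) / (neg / negTotal))
  }
  val iv = counts.map { case (category, (pos, neg)) =>
    (pos / posTotal - neg / negTotal) * woe(category)
  }.sum
  (woe, iv)
}

// Example: two categories with (positive, negative) counts.
val (woe, iv) = woeAndIv(Map("A" -> (80L, 20L), "B" -> (20L, 80L)))
{code}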



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12860) speed up safe projection for primitive types

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12860:
--
Assignee: Wenchen Fan

> speed up safe projection for primitive types
> 
>
> Key: SPARK-12860
> URL: https://issues.apache.org/jira/browse/SPARK-12860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12868:
--
Component/s: SQL

> ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
> -
>
> Key: SPARK-12868
> URL: https://issues.apache.org/jira/browse/SPARK-12868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Trystan Leftwich
>
> When trying to add a jar with an HDFS URI, i.e.
> ```
> ADD JAR hdfs:///tmp/foo.jar
> ```
> via the Spark SQL JDBC interface, it will fail with:
> ```
> java.net.MalformedURLException: unknown protocol: hdfs
> at java.net.URL.(URL.java:593)
> at java.net.URL.(URL.java:483)
> at java.net.URL.(URL.java:432)
> at java.net.URI.toURL(URI.java:1089)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578)
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652)
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> ```
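
The root of the trace is the URI.toURL call: java.net.URL only accepts schemes with a registered URLStreamHandler, and hdfs is not one of them, so the failure can be reproduced outside Spark entirely, as in the sketch below (the object name is made up). Hadoop's FsUrlStreamHandlerFactory can register such a handler, but whether that is the right fix here is a separate question.

{code}
// Standalone reproduction of the failure mode in the trace above:
// the JDK has no built-in protocol handler for the hdfs scheme.
import java.net.{MalformedURLException, URI}

object HdfsUrlRepro {
  def main(args: Array[String]): Unit = {
    val uri = new URI("hdfs:///tmp/foo.jar")
    try {
      uri.toURL // throws unless an hdfs URLStreamHandler has been registered
    } catch {
      case e: MalformedURLException => println(s"as expected: ${e.getMessage}")
    }
  }
}
{code}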



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-578) Fix interpreter code generation to only capture needed dependencies

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-578.
-
Resolution: Not A Problem

This is so old that I think it's obsolete

> Fix interpreter code generation to only capture needed dependencies
> ---
>
> Key: SPARK-578
> URL: https://issues.apache.org/jira/browse/SPARK-578
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-634) Track and display a read count for each block replica in BlockManager

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-634.
-
Resolution: Won't Fix

Closing just because of extreme age

> Track and display a read count for each block replica in BlockManager
> -
>
> Key: SPARK-634
> URL: https://issues.apache.org/jira/browse/SPARK-634
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager
>Reporter: Reynold Xin
>Assignee: Patrick Cogan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11714) Make Spark on Mesos honor port restrictions

2016-01-18 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105017#comment-15105017
 ] 

Stavros Kontopoulos commented on SPARK-11714:
-

Soon I will have a pull request...

> Make Spark on Mesos honor port restrictions
> ---
>
> Key: SPARK-11714
> URL: https://issues.apache.org/jira/browse/SPARK-11714
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
>
> Currently the MesosSchedulerBackend does not make any effort to honor "ports" 
> as a resource offer in Mesos. This ask is to have the ports which the 
> executor binds to honor the limits of the "ports" resource of an offer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12803) Consider adding ability to profile specific instances of executors in spark

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12803.
---
Resolution: Won't Fix

> Consider adding ability to profile specific instances of executors in spark
> ---
>
> Key: SPARK-12803
> URL: https://issues.apache.org/jira/browse/SPARK-12803
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Reporter: Rajesh Balamohan
>
> It would be useful to profile specific instances of executors as opposed to 
> adding profiler details to all executors via 
> "spark.executor.extraJavaOptions".  
> Setting the number of executors to just 1 and profiling that would not be very 
> useful (in some cases, most of the time in single-executor mode would be spent 
> reading data from remote nodes). At the same time, enabling profiling on all 
> executors could create too many snapshots, making them harder to analyze.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2962) Suboptimal scheduling in spark

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2962.
--
Resolution: Not A Problem

Given the final comment, sounds like there is a belief this was in fact 
addressed.

> Suboptimal scheduling in spark
> --
>
> Key: SPARK-2962
> URL: https://issues.apache.org/jira/browse/SPARK-2962
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: All
>Reporter: Mridul Muralidharan
>
> In findTask, irrespective of 'locality' specified, pendingTasksWithNoPrefs 
> are always scheduled with PROCESS_LOCAL
> pendingTasksWithNoPrefs contains tasks which currently do not have any alive 
> locations - but which could come in 'later' : particularly relevant when 
> spark app is just coming up and containers are still being added.
> This causes a large number of non node local tasks to be scheduled incurring 
> significant network transfers in the cluster when running with non trivial 
> datasets.
> The comment "// Look for no-pref tasks after rack-local tasks since they can 
> run anywhere." is misleading in the method code : locality levels start from 
> process_local down to any, and so no prefs get scheduled much before rack.
> Also note that, currentLocalityIndex is reset to the taskLocality returned by 
> this method - so returning PROCESS_LOCAL as the level will trigger wait times 
> again. (Was relevant before recent change to scheduler, and might be again 
> based on resolution of this issue).
> Found as part of writing test for SPARK-2931
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3927) Extends SPARK-2577 to fix secondary resources

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3927.
--
Resolution: Won't Fix

> Extends SPARK-2577 to fix secondary resources
> -
>
> Key: SPARK-3927
> URL: https://issues.apache.org/jira/browse/SPARK-3927
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Ian O Connell
>
> SPARK-2577 was a partial fix,  handling the case of the assembly + app jar. 
> The additional resources however would run into the same issue.
> I have the super simple PR ready. Though should this code be moved inside the 
> addResource method instead to address it more globally? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-729) Closures not always serialized at capture time

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-729.
-
Resolution: Won't Fix

I'm tentatively closing this for lack of activity; it is problematic to implement 
and does change behavior. Although it's a real problem, it also surfaces at a 
reasonable time, when the closure is executed, and the error is clear.

> Closures not always serialized at capture time
> --
>
> Key: SPARK-729
> URL: https://issues.apache.org/jira/browse/SPARK-729
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.7.0, 0.7.1
>Reporter: Matei Zaharia
>Assignee: William Benton
>
> As seen in 
> https://groups.google.com/forum/?fromgroups=#!topic/spark-users/8pTchwuP2Kk 
> and its corresponding fix on 
> https://github.com/mesos/spark/commit/adba773fab6294b5764d101d248815a7d3cb3558,
>  it is possible for a closure referencing a var to see the latest version of 
> that var, instead of the version that was there when the closure was passed 
> to Spark. This is not good when failures or recomputations happen. We need to 
> serialize the closures on capture if possible, perhaps as part of 
> ClosureCleaner.clean.
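
A minimal local illustration (no Spark involved) of the underlying behavior: a closure over a var sees whatever value the var holds when the closure runs, not when it was created, which is why serializing at capture time would pin the value.

{code}
// Plain Scala: the closure observes the latest value of the captured var.
var factor = 2
val multiply = (x: Int) => x * factor
factor = 10
println(multiply(3)) // prints 30, not 6
{code}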



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12871) Support to specify the option for compression codec.

2016-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12871:


Assignee: Apache Spark

> Support to specify the option for compression codec.
> 
>
> Key: SPARK-12871
> URL: https://issues.apache.org/jira/browse/SPARK-12871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> Currently the CSV datasource does not support a {{codec}} option for specifying 
> the compression codec from other language APIs such as Python.
> It would be great if this were supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12871) Support to specify the option for compression codec.

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105086#comment-15105086
 ] 

Apache Spark commented on SPARK-12871:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/10805

> Support to specify the option for compression codec.
> 
>
> Key: SPARK-12871
> URL: https://issues.apache.org/jira/browse/SPARK-12871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently the CSV datasource does not support a {{codec}} option to specify 
> the compression codec from different languages such as Python.
> It would be great if this were supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12871) Support to specify the option for compression codec.

2016-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12871:


Assignee: (was: Apache Spark)

> Support to specify the option for compression codec.
> 
>
> Key: SPARK-12871
> URL: https://issues.apache.org/jira/browse/SPARK-12871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently the CSV datasource does not support a {{codec}} option to specify 
> the compression codec from different languages such as Python.
> It would be great if this were supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12875) Add Weight of Evidence and Information value to Spark.ml as a feature transformer

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105000#comment-15105000
 ] 

Apache Spark commented on SPARK-12875:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/10803

> Add Weight of Evidence and Information value to Spark.ml as a feature 
> transformer
> -
>
> Key: SPARK-12875
> URL: https://issues.apache.org/jira/browse/SPARK-12875
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> As a feature transformer, WOE and IV enable one to:
> Consider each variable’s independent contribution to the outcome.
> Detect linear and non-linear relationships.
> Rank variables in terms of "univariate" predictive strength.
> Visualize the correlations between the predictive variables and the binary 
> outcome.
> http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/ gives 
> a good introduction to WoE and IV.
>  The Weight of Evidence or WoE value provides a measure of how well a 
> grouping of a feature is able to distinguish between the two levels of a 
> binary response (e.g. "good" versus "bad"). It is widely used for grouping 
> continuous features or for mapping categorical features to continuous values, 
> and it is computed from the basic odds ratio:
> (Distribution of positive Outcomes) / (Distribution of negative Outcomes)
> where each distribution refers to the proportion of positive or negative 
> outcomes in the respective group, relative to the column totals.
> The WoE recoding of features is particularly well suited for subsequent 
> modeling using Logistic Regression or an MLP.
> In addition, the Information Value or IV can be computed based on WoE; it is 
> a popular technique for selecting variables in a predictive model.
> TODO: Currently we only support the calculation for categorical features. Add 
> an estimator to estimate the proper grouping for continuous features.
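
For concreteness, a small self-contained Scala sketch of the quoted formulas for one categorical feature. The usual definition takes the natural log of the quoted ratio, and the counts below are made-up illustration data, not anything from this ticket:

{code}
object WoeIvSketch {
  def main(args: Array[String]): Unit = {
    // (category, positive count, negative count) -- toy numbers for illustration.
    val counts = Seq(("A", 90L, 10L), ("B", 60L, 40L), ("C", 30L, 70L))
    val totalPos = counts.map(_._2).sum.toDouble
    val totalNeg = counts.map(_._3).sum.toDouble

    // WoE(category) = ln(distribution of positives / distribution of negatives)
    val woe = counts.map { case (cat, pos, neg) =>
      cat -> math.log((pos / totalPos) / (neg / totalNeg))
    }.toMap

    // IV sums (distPos - distNeg) * WoE over all categories.
    val iv = counts.map { case (cat, pos, neg) =>
      (pos / totalPos - neg / totalNeg) * woe(cat)
    }.sum

    woe.foreach { case (cat, w) => println(f"WoE($cat) = $w%.4f") }
    println(f"IV = $iv%.4f")
  }
}
{code}

A category whose outcome distribution matches the overall distribution gets a WoE near zero and contributes little to the IV.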



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12875) Add Weight of Evidence and Information value to Spark.ml as a feature transformer

2016-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12875:


Assignee: (was: Apache Spark)

> Add Weight of Evidence and Information value to Spark.ml as a feature 
> transformer
> -
>
> Key: SPARK-12875
> URL: https://issues.apache.org/jira/browse/SPARK-12875
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> As a feature transformer, WOE and IV enable one to:
> Consider each variable’s independent contribution to the outcome.
> Detect linear and non-linear relationships.
> Rank variables in terms of "univariate" predictive strength.
> Visualize the correlations between the predictive variables and the binary 
> outcome.
> http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/ gives 
> a good introduction to WoE and IV.
>  The Weight of Evidence or WoE value provides a measure of how well a 
> grouping of a feature is able to distinguish between the two levels of a 
> binary response (e.g. "good" versus "bad"). It is widely used for grouping 
> continuous features or for mapping categorical features to continuous values, 
> and it is computed from the basic odds ratio:
> (Distribution of positive Outcomes) / (Distribution of negative Outcomes)
> where each distribution refers to the proportion of positive or negative 
> outcomes in the respective group, relative to the column totals.
> The WoE recoding of features is particularly well suited for subsequent 
> modeling using Logistic Regression or an MLP.
> In addition, the Information Value or IV can be computed based on WoE; it is 
> a popular technique for selecting variables in a predictive model.
> TODO: Currently we only support the calculation for categorical features. Add 
> an estimator to estimate the proper grouping for continuous features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12875) Add Weight of Evidence and Information value to Spark.ml as a feature transformer

2016-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12875:


Assignee: Apache Spark

> Add Weight of Evidence and Information value to Spark.ml as a feature 
> transformer
> -
>
> Key: SPARK-12875
> URL: https://issues.apache.org/jira/browse/SPARK-12875
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Assignee: Apache Spark
>Priority: Minor
>
> As a feature transformer, WOE and IV enable one to:
> Consider each variable’s independent contribution to the outcome.
> Detect linear and non-linear relationships.
> Rank variables in terms of "univariate" predictive strength.
> Visualize the correlations between the predictive variables and the binary 
> outcome.
> http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/ gives 
> a good introduction to WoE and IV.
>  The Weight of Evidence or WoE value provides a measure of how well a 
> grouping of a feature is able to distinguish between the two levels of a 
> binary response (e.g. "good" versus "bad"). It is widely used for grouping 
> continuous features or for mapping categorical features to continuous values, 
> and it is computed from the basic odds ratio:
> (Distribution of positive Outcomes) / (Distribution of negative Outcomes)
> where each distribution refers to the proportion of positive or negative 
> outcomes in the respective group, relative to the column totals.
> The WoE recoding of features is particularly well suited for subsequent 
> modeling using Logistic Regression or an MLP.
> In addition, the Information Value or IV can be computed based on WoE; it is 
> a popular technique for selecting variables in a predictive model.
> TODO: Currently we only support the calculation for categorical features. Add 
> an estimator to estimate the proper grouping for continuous features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-540) Add API to customize in-memory representation of RDDs

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-540.
-
Resolution: Not A Problem

I assume this is subsumed by things like DataFrames and the Dataset API.

> Add API to customize in-memory representation of RDDs
> -
>
> Key: SPARK-540
> URL: https://issues.apache.org/jira/browse/SPARK-540
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Right now the choice between serialized caching and just Java objects in dev 
> is fine, but it might be cool to also support structures such as 
> column-oriented storage through arrays of primitives without forcing it 
> through the serialization interface.
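
As a toy illustration of the "arrays of primitives" idea mentioned above (plain Scala to paste into a REPL, nothing Spark-specific; the case class is made up):

{code}
// Row-oriented: one boxed object per record.
case class Point(x: Double, y: Double)
val points = Seq(Point(1.0, 2.0), Point(3.0, 4.0), Point(5.0, 6.0))

// Column-oriented: one primitive array per field, no per-record objects.
val xs: Array[Double] = points.map(_.x).toArray
val ys: Array[Double] = points.map(_.y).toArray
{code}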



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3981) Consider a better approach to initialize SerDe on executors

2016-01-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105064#comment-15105064
 ] 

Sean Owen commented on SPARK-3981:
--

Is this still an issue?

> Consider a better approach to initialize SerDe on executors
> ---
>
> Key: SPARK-3981
> URL: https://issues.apache.org/jira/browse/SPARK-3981
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.2.0
>Reporter: Xiangrui Meng
>
> In SPARK-3971, we copied SerDe code from Core to MLlib in order to recognize 
> MLlib types on executors as a hotfix. This is not ideal. We should find a way 
> to add hooks to the SerDe in Core to support MLlib types in a pluggable way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11714) Make Spark on Mesos honor port restrictions

2016-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11714:


Assignee: Apache Spark

> Make Spark on Mesos honor port restrictions
> ---
>
> Key: SPARK-11714
> URL: https://issues.apache.org/jira/browse/SPARK-11714
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
>Assignee: Apache Spark
>
> Currently the MesosSchedulerBackend does not make any effort to honor the 
> "ports" resource in a Mesos offer. This ask is for the executor to bind only 
> to ports that fall within the limits of the "ports" resource of an offer.
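
A minimal sketch of the check being asked for, with illustrative names only; Mesos advertises "ports" as a set of ranges, so honoring it mostly means binding only to ports inside those ranges:

{code}
object PortOfferSketch {
  // Return true if the requested port lies in any of the offered "ports" ranges.
  def portFitsOffer(offeredRanges: Seq[(Long, Long)], port: Long): Boolean =
    offeredRanges.exists { case (begin, end) => port >= begin && port <= end }

  def main(args: Array[String]): Unit = {
    val offered = Seq((31000L, 32000L), (4040L, 4040L)) // made-up offer
    println(portFitsOffer(offered, 31500L)) // true
    println(portFitsOffer(offered, 8080L))  // false
  }
}
{code}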



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-18 Thread Mario Briggs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105560#comment-15105560
 ] 

Mario Briggs commented on SPARK-12177:
--

Hi Nikita,
 great.

 1 - We should also have a python/pyspark/streaming/kafka-v09.py that matches 
our external/kafka-v09 module.
 2 - Why do you have the Broker.scala class? Unless I am missing something, it 
should be removed.
 3 - I think the package should be 'org.apache.spark.streaming.kafka' only in 
external/kafka-v09, not 'org.apache.spark.streaming.kafka.v09'. This is 
because we produce a jar with a different name (the user picks which one, and 
even if they mismatch, it fails with a clear error since the KafkaUtils method 
signatures are different).
  
 

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one, so I added support for the new consumer 
> API. I made separate classes in package org.apache.spark.streaming.kafka.v09 
> with the changed API. I didn't remove the old classes, for backward 
> compatibility: users will not need to change their old Spark applications 
> when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9976) create function do not work

2016-01-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105570#comment-15105570
 ] 

Yin Huai commented on SPARK-9976:
-

What version of Spark are you using? Also, what is the error message?

> create function do not work
> ---
>
> Key: SPARK-9976
> URL: https://issues.apache.org/jira/browse/SPARK-9976
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
> Environment: spark 1.4.1 yarn 2.2.0
>Reporter: cen yuhai
>
> I use beeline to connect to the ThriftServer, but ADD JAR does not work, so I 
> use CREATE FUNCTION; see the link below.
> http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_hive_udf.html
> I do as below:
> {code}
> create function gdecodeorder as 'com.hive.udf.GOrderDecode' USING JAR 
> 'hdfs://mycluster/user/spark/lib/gorderdecode.jar'; 
> {code}
> It returns Ok, and I connect to the metastore, I see records in table FUNCS.
> {code}
> select gdecodeorder(t1)  from tableX  limit 1;
> {code}
> It returns error 'Couldn't find function default.gdecodeorder'
> This is the Exception
> {code}
> 15/08/14 14:53:51 ERROR UserGroupInformation: PriviledgedActionException 
> as:xiaoju (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: 
> java.lang.RuntimeException: Couldn't find function default.gdecodeorder
> 15/08/14 15:04:47 ERROR RetryingHMSHandler: 
> MetaException(message:NoSuchObjectException(message:Function 
> default.t_gdecodeorder does not exist))
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:4613)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_function(HiveMetaStore.java:4740)
> at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
> at com.sun.proxy.$Proxy21.get_function(Unknown Source)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunction(HiveMetaStoreClient.java:1721)
> at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
> at com.sun.proxy.$Proxy22.getFunction(Unknown Source)
> at org.apache.hadoop.hive.ql.metadata.Hive.getFunction(Hive.java:2662)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfoFromMetastore(FunctionRegistry.java:546)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getQualifiedFunctionInfo(FunctionRegistry.java:579)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:645)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:376)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
>   

[jira] [Commented] (SPARK-11714) Make Spark on Mesos honor port restrictions

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105554#comment-15105554
 ] 

Apache Spark commented on SPARK-11714:
--

User 'skonto' has created a pull request for this issue:
https://github.com/apache/spark/pull/10808

> Make Spark on Mesos honor port restrictions
> ---
>
> Key: SPARK-11714
> URL: https://issues.apache.org/jira/browse/SPARK-11714
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
>
> Currently the MesosSchedulerBackend does not make any effort to honor the 
> "ports" resource in a Mesos offer. This ask is for the executor to bind only 
> to ports that fall within the limits of the "ports" resource of an offer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11714) Make Spark on Mesos honor port restrictions

2016-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11714:


Assignee: (was: Apache Spark)

> Make Spark on Mesos honor port restrictions
> ---
>
> Key: SPARK-11714
> URL: https://issues.apache.org/jira/browse/SPARK-11714
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
>
> Currently the MesosSchedulerBackend does not make any effort to honor the 
> "ports" resource in a Mesos offer. This ask is for the executor to bind only 
> to ports that fall within the limits of the "ports" resource of an offer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11139) Make SparkContext.stop() exception-safe

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11139.
---
Resolution: Duplicate

OK then this one is the duplicate

> Make SparkContext.stop() exception-safe
> ---
>
> Key: SPARK-11139
> URL: https://issues.apache.org/jira/browse/SPARK-11139
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
>
> In SparkContext.stop(), when an exception is thrown, the rest of the 
> stop/cleanup actions are aborted.
> Work has been done in SPARK-4194 to allow cleanup after partial 
> initialization.
> There is a similar issue in StreamingContext: SPARK-11137.
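
A self-contained sketch of the exception-safe pattern being requested; the step names are illustrative, not the real SparkContext internals:

{code}
object ExceptionSafeStopSketch {
  // Run one shutdown step, log failures, and keep going.
  private def tryStop(name: String)(step: => Unit): Unit =
    try step catch {
      case e: Exception =>
        println(s"WARN: exception while stopping $name, continuing: ${e.getMessage}")
    }

  def main(args: Array[String]): Unit = {
    tryStop("metrics system") { throw new IllegalStateException("boom") }
    tryStop("event logger")   { println("event logger stopped") } // still runs
    tryStop("scheduler")      { println("scheduler stopped") }    // still runs
  }
}
{code}

The point is the idea, not the names: each cleanup step runs even if an earlier one throws.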



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12879) improve unsafe row writing framework

2016-01-18 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-12879:
---

 Summary: improve unsafe row writing framework
 Key: SPARK-12879
 URL: https://issues.apache.org/jira/browse/SPARK-12879
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12879) improve unsafe row writing framework

2016-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12879:


Assignee: (was: Apache Spark)

> improve unsafe row writing framework
> 
>
> Key: SPARK-12879
> URL: https://issues.apache.org/jira/browse/SPARK-12879
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12879) improve unsafe row writing framework

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105608#comment-15105608
 ] 

Apache Spark commented on SPARK-12879:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10809

> improve unsafe row writing framework
> 
>
> Key: SPARK-12879
> URL: https://issues.apache.org/jira/browse/SPARK-12879
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12879) improve unsafe row writing framework

2016-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12879:


Assignee: Apache Spark

> improve unsafe row writing framework
> 
>
> Key: SPARK-12879
> URL: https://issues.apache.org/jira/browse/SPARK-12879
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12867) Nullability of Intersect can be stricter

2016-01-18 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105616#comment-15105616
 ] 

Cheng Lian commented on SPARK-12867:


Thanks for helping! Go ahead, please. I'm assigning this to you.

> Nullability of Intersect can be stricter
> 
>
> Key: SPARK-12867
> URL: https://issues.apache.org/jira/browse/SPARK-12867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Priority: Minor
>
> {{Intersect}} doesn't override {{SetOperation.output}}, which is defined as:
> {code}
>   override def output: Seq[Attribute] =
> left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
>   leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable)
> }
> {code}
> However, we can replace the {{||}} with {{&&}} for intersection.
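
A sketch of what that could look like, assuming {{Intersect}} extends {{SetOperation}} as the quoted snippet suggests; the class layout and imports are assumptions about the Catalyst code, not a verified patch:

{code}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, SetOperation}

case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right) {
  // An intersected attribute can be null only if it is nullable on BOTH sides.
  override def output: Seq[Attribute] =
    left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
      leftAttr.withNullability(leftAttr.nullable && rightAttr.nullable)
    }
}
{code}

Since a row survives the intersection only if it appears on both sides, a null can only show up when both inputs allow nulls, which is what the stricter {{&&}} captures.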



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12867) Nullability of Intersect can be stricter

2016-01-18 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12867:
---
Assignee: Xiao Li

> Nullability of Intersect can be stricter
> 
>
> Key: SPARK-12867
> URL: https://issues.apache.org/jira/browse/SPARK-12867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Xiao Li
>Priority: Minor
>
> {{Intersect}} doesn't override {{SetOperation.output}}, which is defined as:
> {code}
>   override def output: Seq[Attribute] =
> left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
>   leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable)
> }
> {code}
> However, we can replace the {{||}} with {{&&}} for intersection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105619#comment-15105619
 ] 

Apache Spark commented on SPARK-10620:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/10810

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

2016-01-18 Thread Gabor Feher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105621#comment-15105621
 ] 

Gabor Feher commented on SPARK-3630:


Hi, I've run into this problem as well.

{code}
 com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
uncompress the chunk: PARSING_ERROR(2)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
at com.esotericsoftware.kryo.io.Input.require(Input.java:155)
at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
at 
org.apache.spark.serializer.DeserializationStream.readKey(Serializer.scala:169)
at 
org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
at 
org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:103)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
...
Caused by: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2)
at 
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:361)
at 
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:158)
at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:140)
... 45 more

{code}

The Spark version was 1.6.0, the official "hadoop free" download, running in 
YARN client mode on Amazon EMR.
I used these instructions to link in Hadoop: 
https://spark.apache.org/docs/1.6.0/hadoop-provided.html
The Hadoop version was 2.6.0.

As far as I could see, the problem only showed up after a few hours of 
running, "in the middle" of a job.

> Identify cause of Kryo+Snappy PARSING_ERROR
> ---
>
> Key: SPARK-3630
> URL: https://issues.apache.org/jira/browse/SPARK-3630
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>
> A recent GraphX commit caused non-deterministic exceptions in unit tests so 
> it was reverted (see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an 
> application-specific Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
> uncompress the chunk: PARSING_ERROR(2)
> com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
> com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
> com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
> com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
>  
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
>  
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so 
> the faulty commit can be fixed and merged back into master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12867) Nullability of Intersect can be stricter

2016-01-18 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105622#comment-15105622
 ] 

Xiao Li commented on SPARK-12867:
-

Thank you! 

> Nullability of Intersect can be stricter
> 
>
> Key: SPARK-12867
> URL: https://issues.apache.org/jira/browse/SPARK-12867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Xiao Li
>Priority: Minor
>
> {{Intersect}} doesn't override {{SetOperation.output}}, which is defined as:
> {code}
>   override def output: Seq[Attribute] =
> left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
>   leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable)
> }
> {code}
> However, we can replace the {{||}} with {{&&}} for intersection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105643#comment-15105643
 ] 

Apache Spark commented on SPARK-10620:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/10811

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12880) Different results on groupBy after window function

2016-01-18 Thread Saif Addin Ellafi (JIRA)
Saif Addin Ellafi created SPARK-12880:
-

 Summary: Different results on groupBy after window function
 Key: SPARK-12880
 URL: https://issues.apache.org/jira/browse/SPARK-12880
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Saif Addin Ellafi
Priority: Critical


scala> val overVint = Window.partitionBy("product", "bnd", 
"age").orderBy(asc("mm"))

scala> val df_data2 = df_data.withColumn("result", lag("baleom", 
1).over(overVint))

scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
200509").groupBy("mm", "closed", 
"ever_closed").agg(sum("result").as("result")).show
+------+------+-----------+--------------------+
|    mm|closed|ever_closed|              result|
+------+------+-----------+--------------------+
|200509|     1|          1|1.2672666129980398E7|
|200509|     0|          0|2.7104834668856387E9|
|200509|     0|          1| 1.151339011298214E8|
+------+------+-----------+--------------------+


scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
200509").groupBy("mm", "closed", 
"ever_closed").agg(sum("result").as("result")).show
+------+------+-----------+--------------------+
|    mm|closed|ever_closed|              result|
+------+------+-----------+--------------------+
|200509|     1|          1|1.2357681589980595E7|
|200509|     0|          0| 2.709930867575646E9|
|200509|     0|          1|1.1595048973981345E8|
+------+------+-----------+--------------------+

This does NOT happen with columns that are not produced by the window function.
It happens both in cluster mode and in local mode.
Before the groupBy operation, the data looks good and is consistent.
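
One hedged way to test whether a non-total window ordering explains the difference: add a tie-breaking column to the orderBy so that lag() sees a deterministic row order. Here "row_id" is an assumed unique key, not a column from this report:

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{asc, lag}

// Make the ordering within each partition total by adding a unique tie-breaker.
val overVintDet = Window.partitionBy("product", "bnd", "age").orderBy(asc("mm"), asc("row_id"))
val df_data3 = df_data.withColumn("result", lag("baleom", 1).over(overVintDet))
{code}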




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12876) Race condition when driver rapidly shutdown after started.

2016-01-18 Thread jeffonia Tung (JIRA)
jeffonia Tung created SPARK-12876:
-

 Summary: Race condition when driver rapidly shutdown after started.
 Key: SPARK-12876
 URL: https://issues.apache.org/jira/browse/SPARK-12876
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: jeffonia Tung


[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver 
driver-20160118171237-0009
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar 
file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hado
op2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/dri
ver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" .
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor 
app-20160118171240-0256/15 for DirectKafkaStreamingV2
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Launch command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" 
"/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/postgresql-9.2-1004-jdbc41.jar:/data/dbcenter/cdh
5/spark-1.4.0-bin-hadoop2.4/lib/hive-contrib-0.13.1-cdh5.2.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/sbin/../conf/:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop
2.4.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/data/dbcenter/cdh5/spark-1.4.0-bi
n-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" 
"-Dspark.akka.frameSize=100" "-Dspark.driver.port=35133" "-XX:MaxPermSize=128m" 
"org.apache.spark.executor.CoarseGrainedExecutorBacke
nd" "--driver-url" 
"akka.tcp://sparkDriver@10.12.201.205:35133/user/CoarseGrainedScheduler" 
"--executor-id" "15" "--hostname" "10.12.201.205" "--cores" "1" "--app-id" 
"app-20160118171240-0256" "--worker
-url" "akka.tcp://sparkWorker@10.12.201.205:5/user/Worker"
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill driver 
driver-20160118164724-0008
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Redirection to 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/driver-20160118164724-0008/stdout
 closed: Stream closed
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill executor 
app-20160118164728-0250/15
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Runner thread for executor 
app-20160118164728-0250/15 interrupted
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Killing process!
[ERROR 2016-01-18 17:12:49 (Logging.scala:96)] Error writing stream to file 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/app-20160118164728-0250/15/stdout
java.io.IOException: Stream closed
at 
java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at 
org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at 
org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Executor 
app-20160118164728-0250/15 finished with state KILLED exitStatus 143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9976) create function do not work

2016-01-18 Thread ocean (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105168#comment-15105168
 ] 

ocean commented on SPARK-9976:
--

Does Spark SQL currently support a temporary UDF function that extends UDF?
I found that a temporary function that extends GenericUDF runs OK,
but a temporary function that extends UDF does not work.

> create function do not work
> ---
>
> Key: SPARK-9976
> URL: https://issues.apache.org/jira/browse/SPARK-9976
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
> Environment: spark 1.4.1 yarn 2.2.0
>Reporter: cen yuhai
>
> I use beeline to connect to the ThriftServer, but ADD JAR does not work, so I 
> use CREATE FUNCTION; see the link below.
> http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_hive_udf.html
> I do as below:
> {code}
> create function gdecodeorder as 'com.hive.udf.GOrderDecode' USING JAR 
> 'hdfs://mycluster/user/spark/lib/gorderdecode.jar'; 
> {code}
> It returns Ok, and I connect to the metastore, I see records in table FUNCS.
> {code}
> select gdecodeorder(t1)  from tableX  limit 1;
> {code}
> It returns error 'Couldn't find function default.gdecodeorder'
> This is the Exception
> {code}
> 15/08/14 14:53:51 ERROR UserGroupInformation: PriviledgedActionException 
> as:xiaoju (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: 
> java.lang.RuntimeException: Couldn't find function default.gdecodeorder
> 15/08/14 15:04:47 ERROR RetryingHMSHandler: 
> MetaException(message:NoSuchObjectException(message:Function 
> default.t_gdecodeorder does not exist))
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:4613)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_function(HiveMetaStore.java:4740)
> at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
> at com.sun.proxy.$Proxy21.get_function(Unknown Source)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunction(HiveMetaStoreClient.java:1721)
> at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
> at com.sun.proxy.$Proxy22.getFunction(Unknown Source)
> at org.apache.hadoop.hive.ql.metadata.Hive.getFunction(Hive.java:2662)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfoFromMetastore(FunctionRegistry.java:546)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getQualifiedFunctionInfo(FunctionRegistry.java:579)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:645)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:376)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> 

[jira] [Updated] (SPARK-12823) Cannot create UDF with StructType input

2016-01-18 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-12823:
-
Description: 
h5. Problem

It is not possible to apply a UDF to a column that has a struct data type. Two 
previous requests to the mailing list remained unanswered.

h5. How-To-Reproduce

{code}
val sql = new org.apache.spark.sql.SQLContext(sc)
import sql.implicits._

case class KV(key: Long, value: String)
case class Row(kv: KV)
val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF

val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
df.select(udf1(df("kv"))).show
// java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to $line78.$read$$iwC$$iwC$KV

val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
df.select(udf2(df("kv"))).show
// org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to data 
type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, however, 
'kv' is of struct type.;
{code}

h5. Mailing List Entries

- 
https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E

- https://www.mail-archive.com/user@spark.apache.org/msg43092.html

h5. Possible Workaround

If you create a {{UserDefinedFunction}} manually, not using the {{udf}} helper 
functions, it works. See https://github.com/FRosner/struct-udf, which exposes 
the {{UserDefinedFunction}} constructor (public from package private). However, 
then you have to work with Row, because it does not automatically convert the 
row to a case class / tuple.
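
A hypothetical sketch of that Row-based workaround; {{structUdf}} below is an assumed helper (like the constructor exposed by the linked struct-udf project), not a real Spark API:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StringType

// Assumes some helper `structUdf(f, returnType)` that builds a UserDefinedFunction
// from an untyped function, as the struct-udf project does.
val extractValue = structUdf((r: Row) => r.getAs[String]("value"), StringType)
df.select(extractValue(df("kv"))).show()
{code}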

  was:
h5. Problem

It is not possible to apply a UDF to a column that has a struct data type. Two 
previous requests to the mailing list remained unanswered.

h5. How-To-Reproduce

{code}
val sql = new org.apache.spark.sql.SQLContext(sc)
import sql.implicits._

case class KV(key: Long, value: String)
case class Row(kv: KV)
val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF

val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
df.select(udf1(df("kv"))).show
// java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to $line78.$read$$iwC$$iwC$KV

val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
df.select(udf2(df("kv"))).show
// org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to data 
type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, however, 
'kv' is of struct type.;
{code}

h5. Mailing List Entries

- 
https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E

- https://www.mail-archive.com/user@spark.apache.org/msg43092.html

h5. Possible Workaround

If you create a {{UserDefinedFunction}} manually, not using the {{udf}} helper 
functions, it works. See https://github.com/FRosner/struct-udf, which exposes 
the {{UserDefinedFunction}} constructor (public from package private).


> Cannot create UDF with StructType input
> ---
>
> Key: SPARK-12823
> URL: https://issues.apache.org/jira/browse/SPARK-12823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Frank Rosner
>
> h5. Problem
> It is not possible to apply a UDF to a column that has a struct data type. 
> Two previous requests to the mailing list remained unanswered.
> h5. How-To-Reproduce
> {code}
> val sql = new org.apache.spark.sql.SQLContext(sc)
> import sql.implicits._
> case class KV(key: Long, value: String)
> case class Row(kv: KV)
> val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF
> val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
> df.select(udf1(df("kv"))).show
> // java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to $line78.$read$$iwC$$iwC$KV
> val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
> df.select(udf2(df("kv"))).show
> // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to 
> data type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, 
> however, 'kv' is of struct type.;
> {code}
> h5. Mailing List Entries
> - 
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E
> - https://www.mail-archive.com/user@spark.apache.org/msg43092.html
> h5. Possible Workaround
> If you create a {{UserDefinedFunction}} manually, not using the {{udf}} 
> helper functions, it works. See https://github.com/FRosner/struct-udf, 

[jira] [Updated] (SPARK-12823) Cannot create UDF with StructType input

2016-01-18 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-12823:
-
Description: 
h5. Problem

It is not possible to apply a UDF to a column that has a struct data type. Two 
previous requests to the mailing list remained unanswered.

h5. How-To-Reproduce

{code}
val sql = new org.apache.spark.sql.SQLContext(sc)
import sql.implicits._

case class KV(key: Long, value: String)
case class Row(kv: KV)
val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF

val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
df.select(udf1(df("kv"))).show
// java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to $line78.$read$$iwC$$iwC$KV

val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
df.select(udf2(df("kv"))).show
// org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to data 
type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, however, 
'kv' is of struct type.;
{code}

h5. Mailing List Entries

- 
https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E

- https://www.mail-archive.com/user@spark.apache.org/msg43092.html

h5. Possible Workaround

If you create a {{UserDefinedFunction}} manually, not using the {{udf}} helper 
functions, it works. See https://github.com/FRosner/struct-udf, which exposes 
the {{UserDefinedFunction}} constructor (public from package private). However, 
then you have to work with a {{Row}}, because it does not automatically convert 
the row to a case class / tuple.

  was:
h5. Problem

It is not possible to apply a UDF to a column that has a struct data type. Two 
previous requests to the mailing list remained unanswered.

h5. How-To-Reproduce

{code}
val sql = new org.apache.spark.sql.SQLContext(sc)
import sql.implicits._

case class KV(key: Long, value: String)
case class Row(kv: KV)
val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF

val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
df.select(udf1(df("kv"))).show
// java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to $line78.$read$$iwC$$iwC$KV

val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
df.select(udf2(df("kv"))).show
// org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to data 
type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, however, 
'kv' is of struct type.;
{code}

h5. Mailing List Entries

- 
https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E

- https://www.mail-archive.com/user@spark.apache.org/msg43092.html

h5. Possible Workaround

If you create a {{UserDefinedFunction}} manually, not using the {{udf}} helper 
functions, it works. See https://github.com/FRosner/struct-udf, which exposes 
the {{UserDefinedFunction}} constructor (public from package private). However, 
then you have to work with Row, because it does not automatically convert the 
row to a case class / tuple.


> Cannot create UDF with StructType input
> ---
>
> Key: SPARK-12823
> URL: https://issues.apache.org/jira/browse/SPARK-12823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Frank Rosner
>
> h5. Problem
> It is not possible to apply a UDF to a column that has a struct data type. 
> Two previous requests to the mailing list remained unanswered.
> h5. How-To-Reproduce
> {code}
> val sql = new org.apache.spark.sql.SQLContext(sc)
> import sql.implicits._
> case class KV(key: Long, value: String)
> case class Row(kv: KV)
> val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF
> val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
> df.select(udf1(df("kv"))).show
> // java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to $line78.$read$$iwC$$iwC$KV
> val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
> df.select(udf2(df("kv"))).show
> // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to 
> data type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, 
> however, 'kv' is of struct type.;
> {code}
> h5. Mailing List Entries
> - 
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E
> - https://www.mail-archive.com/user@spark.apache.org/msg43092.html
> h5. Possible Workaround
> If you create a 

[jira] [Updated] (SPARK-12876) Race condition when driver rapidly shutdown after started.

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12876:
--
Priority: Minor  (was: Major)

How do you reproduce it?

> Race condition when driver rapidly shutdown after started.
> --
>
> Key: SPARK-12876
> URL: https://issues.apache.org/jira/browse/SPARK-12876
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: jeffonia Tung
>Priority: Minor
>
> It's a little like the issue SPARK-4300, but this time it happens on 
> the driver occasionally.
> [INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver 
> driver-20160118171237-0009
> [INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar 
> file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
>  to /data/dbcenter/cdh5/spark-1.4.0-bin-hado
> op2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
> [INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying 
> /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
>  to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/dri
> ver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
> [INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: 
> "/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" 
> ."org.apache.spark.deploy.worker.DriverWrapper"..
> [INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor 
> app-20160118171240-0256/15 for DirectKafkaStreamingV2
> [INFO 2016-01-18 17:12:39 (Logging.scala:59)] Launch command: 
> "/data/dbcenter/jdk1.7.0_79/bin/java" "-cp"  
> ."org.apache.spark.executor.CoarseGrainedExecutorBackend"..
> [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill driver 
> driver-20160118164724-0008
> [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Redirection to 
> /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/driver-20160118164724-0008/stdout
>  closed: Stream closed
> [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill executor 
> app-20160118164728-0250/15
> [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Runner thread for executor 
> app-20160118164728-0250/15 interrupted
> [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Killing process!
> [ERROR 2016-01-18 17:12:49 (Logging.scala:96)] Error writing stream to file 
> /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/app-20160118164728-0250/15/stdout
> java.io.IOException: Stream closed
> at 
> java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> at java.io.FilterInputStream.read(FilterInputStream.java:107)
> at 
> org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
> at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
> at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
> at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
> at 
> org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
> [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Executor 
> app-20160118164728-0250/15 finished with state KILLED exitStatus 143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12877) TrainValidationSplit is missing in pyspark.ml.tuning

2016-01-18 Thread Wojciech Jurczyk (JIRA)
Wojciech Jurczyk created SPARK-12877:


 Summary: TrainValidationSplit is missing in pyspark.ml.tuning
 Key: SPARK-12877
 URL: https://issues.apache.org/jira/browse/SPARK-12877
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.0
Reporter: Wojciech Jurczyk


I was investigating progress on SPARK-10759 and noticed that there is no 
TrainValidationSplit class in the pyspark.ml.tuning module.
The Java/Scala examples for SPARK-10759 use 
org.apache.spark.ml.tuning.TrainValidationSplit, which is not available from 
Python, and this blocks SPARK-10759.

Does the class have a different name in PySpark, maybe? Also, I couldn't find 
any JIRA task saying that it needs to be implemented. Is it by design that the 
TrainValidationSplit estimator is not ported to PySpark? If not, that is, if 
the estimator needs porting, then I would like to contribute.
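
For reference, a rough sketch of the Scala estimator in question (Spark 1.6, 
org.apache.spark.ml.tuning.TrainValidationSplit); a PySpark counterpart of 
roughly this is what appears to be missing:

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val lr = new LinearRegression()

// Small grid over the regularization parameter, just for illustration.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)

// trainingData is assumed to be an existing DataFrame with "features" and "label" columns.
// val model = tvs.fit(trainingData)
{code}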



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12876) Race condition when driver rapidly shutdown after started.

2016-01-18 Thread jeffonia Tung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeffonia Tung updated SPARK-12876:
--
Description: 
It's a little similar to the issue SPARK-4300.

[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver 
driver-20160118171237-0009
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar 
file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hado
op2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/dri
ver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" .
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor 
app-20160118171240-0256/15 for DirectKafkaStreamingV2
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Launch command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" 
"/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/postgresql-9.2-1004-jdbc41.jar:/data/dbcenter/cdh
5/spark-1.4.0-bin-hadoop2.4/lib/hive-contrib-0.13.1-cdh5.2.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/sbin/../conf/:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop
2.4.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/data/dbcenter/cdh5/spark-1.4.0-bi
n-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" 
"-Dspark.akka.frameSize=100" "-Dspark.driver.port=35133" "-XX:MaxPermSize=128m" 
"org.apache.spark.executor.CoarseGrainedExecutorBacke
nd" "--driver-url" 
"akka.tcp://sparkDriver@10.12.201.205:35133/user/CoarseGrainedScheduler" 
"--executor-id" "15" "--hostname" "10.12.201.205" "--cores" "1" "--app-id" 
"app-20160118171240-0256" "--worker
-url" "akka.tcp://sparkWorker@10.12.201.205:5/user/Worker"
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill driver 
driver-20160118164724-0008
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Redirection to 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/driver-20160118164724-0008/stdout
 closed: Stream closed
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill executor 
app-20160118164728-0250/15
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Runner thread for executor 
app-20160118164728-0250/15 interrupted
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Killing process!
[ERROR 2016-01-18 17:12:49 (Logging.scala:96)] Error writing stream to file 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/app-20160118164728-0250/15/stdout
java.io.IOException: Stream closed
at 
java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at 
org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at 
org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Executor 
app-20160118164728-0250/15 finished with state KILLED exitStatus 143

  was:
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver 
driver-20160118171237-0009
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar 
file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hado
op2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/dri
ver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" .
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor 
app-20160118171240-0256/15 for DirectKafkaStreamingV2
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Launch command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" 

[jira] [Updated] (SPARK-12876) Race condition when driver rapidly shutdown after started.

2016-01-18 Thread jeffonia Tung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeffonia Tung updated SPARK-12876:
--
Description: 
It's a little similar to the issue SPARK-4300. Well, this time, it happens on 
the driver occasionally.


[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver 
driver-20160118171237-0009
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar 
file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hado
op2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/dri
ver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" .
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor 
app-20160118171240-0256/15 for DirectKafkaStreamingV2
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Launch command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" 
"/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/postgresql-9.2-1004-jdbc41.jar:/data/dbcenter/cdh
5/spark-1.4.0-bin-hadoop2.4/lib/hive-contrib-0.13.1-cdh5.2.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/sbin/../conf/:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop
2.4.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/data/dbcenter/cdh5/spark-1.4.0-bi
n-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" 
"-Dspark.akka.frameSize=100" "-Dspark.driver.port=35133" "-XX:MaxPermSize=128m" 
"org.apache.spark.executor.CoarseGrainedExecutorBacke
nd" "--driver-url" 
"akka.tcp://sparkDriver@10.12.201.205:35133/user/CoarseGrainedScheduler" 
"--executor-id" "15" "--hostname" "10.12.201.205" "--cores" "1" "--app-id" 
"app-20160118171240-0256" "--worker
-url" "akka.tcp://sparkWorker@10.12.201.205:5/user/Worker"
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill driver 
driver-20160118164724-0008
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Redirection to 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/driver-20160118164724-0008/stdout
 closed: Stream closed
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill executor 
app-20160118164728-0250/15
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Runner thread for executor 
app-20160118164728-0250/15 interrupted
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Killing process!
[ERROR 2016-01-18 17:12:49 (Logging.scala:96)] Error writing stream to file 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/app-20160118164728-0250/15/stdout
java.io.IOException: Stream closed
at 
java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at 
org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at 
org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Executor 
app-20160118164728-0250/15 finished with state KILLED exitStatus 143

  was:
It's a little similar to the issue SPARK-4300. Well, this time, it happens on 
the driver occasionally.

[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver 
driver-20160118171237-0009
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar 
file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hado
op2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/dri
ver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" .
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor 

[jira] [Updated] (SPARK-12876) Race condition when driver rapidly shutdown after started.

2016-01-18 Thread jeffonia Tung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeffonia Tung updated SPARK-12876:
--
Description: 
It's a little similar to the issue SPARK-4300. Well, this time, it happens on 
the driver.

[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver 
driver-20160118171237-0009
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar 
file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hado
op2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/dri
ver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" .
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor 
app-20160118171240-0256/15 for DirectKafkaStreamingV2
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Launch command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" 
"/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/postgresql-9.2-1004-jdbc41.jar:/data/dbcenter/cdh
5/spark-1.4.0-bin-hadoop2.4/lib/hive-contrib-0.13.1-cdh5.2.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/sbin/../conf/:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop
2.4.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/data/dbcenter/cdh5/spark-1.4.0-bi
n-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" 
"-Dspark.akka.frameSize=100" "-Dspark.driver.port=35133" "-XX:MaxPermSize=128m" 
"org.apache.spark.executor.CoarseGrainedExecutorBacke
nd" "--driver-url" 
"akka.tcp://sparkDriver@10.12.201.205:35133/user/CoarseGrainedScheduler" 
"--executor-id" "15" "--hostname" "10.12.201.205" "--cores" "1" "--app-id" 
"app-20160118171240-0256" "--worker
-url" "akka.tcp://sparkWorker@10.12.201.205:5/user/Worker"
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill driver 
driver-20160118164724-0008
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Redirection to 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/driver-20160118164724-0008/stdout
 closed: Stream closed
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill executor 
app-20160118164728-0250/15
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Runner thread for executor 
app-20160118164728-0250/15 interrupted
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Killing process!
[ERROR 2016-01-18 17:12:49 (Logging.scala:96)] Error writing stream to file 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/app-20160118164728-0250/15/stdout
java.io.IOException: Stream closed
at 
java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at 
org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at 
org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Executor 
app-20160118164728-0250/15 finished with state KILLED exitStatus 143

  was:
It's a little similar to the issue SPARK-4300.

[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver 
driver-20160118171237-0009
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar 
file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hado
op2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/dri
ver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" .
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor 
app-20160118171240-0256/15 for DirectKafkaStreamingV2
[INFO 2016-01-18 

[jira] [Resolved] (SPARK-12863) missing api for renaming and mapping result of operations on GroupedDataset to case classes

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12863.
---
Resolution: Not A Problem

You haven't proposed an API; you're asking how to do something, which can 
already be done. At best you're asking for some different API. Please start on 
user@ and do not reopen this.

> missing api for renaming and mapping result of operations on GroupedDataset 
> to case classes
> ---
>
> Key: SPARK-12863
> URL: https://issues.apache.org/jira/browse/SPARK-12863
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
>Reporter: Milad Khajavi
>
> Here I struggle with the Spark API to convert the result of count into the 
> KeyValue case class. I think there is no API for renaming the columns of a 
> Dataset and mapping them to a new case class.
> case class LogRow(id: String, location: String, time: Long)
> case class KeyValue(key: (String, String), value: Long)
> val log = LogRow("1", "a", 1) :: and so on
> log.toDS().groupBy(l => {(l.id, l.location)}).count().toDF().toDF("key", 
> "value").as[KeyValue].printSchema()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12863) missing api for renaming and mapping result of operations on GroupedDataset to case classes

2016-01-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-12863.
-

> missing api for renaming and mapping result of operations on GroupedDataset 
> to case classes
> ---
>
> Key: SPARK-12863
> URL: https://issues.apache.org/jira/browse/SPARK-12863
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
>Reporter: Milad Khajavi
>
> Here I struggle with the Spark API to convert the result of count into the 
> KeyValue case class. I think there is no API for renaming the columns of a 
> Dataset and mapping them to a new case class.
> case class LogRow(id: String, location: String, time: Long)
> case class KeyValue(key: (String, String), value: Long)
> val log = LogRow("1", "a", 1) :: and so on
> log.toDS().groupBy(l => {(l.id, l.location)}).count().toDF().toDF("key", 
> "value").as[KeyValue].printSchema()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12876) Race condition when driver rapidly shutdown after started.

2016-01-18 Thread jeffonia Tung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeffonia Tung updated SPARK-12876:
--
Description: 
It's a little similar to the issue SPARK-4300. Well, this time, it happens on 
the driver occasionally.

[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver 
driver-20160118171237-0009
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar 
file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hado
op2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/dri
ver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" .
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor 
app-20160118171240-0256/15 for DirectKafkaStreamingV2
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Launch command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" 
"/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/postgresql-9.2-1004-jdbc41.jar:/data/dbcenter/cdh
5/spark-1.4.0-bin-hadoop2.4/lib/hive-contrib-0.13.1-cdh5.2.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/sbin/../conf/:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop
2.4.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/data/dbcenter/cdh5/spark-1.4.0-bi
n-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" 
"-Dspark.akka.frameSize=100" "-Dspark.driver.port=35133" "-XX:MaxPermSize=128m" 
"org.apache.spark.executor.CoarseGrainedExecutorBacke
nd" "--driver-url" 
"akka.tcp://sparkDriver@10.12.201.205:35133/user/CoarseGrainedScheduler" 
"--executor-id" "15" "--hostname" "10.12.201.205" "--cores" "1" "--app-id" 
"app-20160118171240-0256" "--worker
-url" "akka.tcp://sparkWorker@10.12.201.205:5/user/Worker"
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill driver 
driver-20160118164724-0008
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Redirection to 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/driver-20160118164724-0008/stdout
 closed: Stream closed
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill executor 
app-20160118164728-0250/15
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Runner thread for executor 
app-20160118164728-0250/15 interrupted
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Killing process!
[ERROR 2016-01-18 17:12:49 (Logging.scala:96)] Error writing stream to file 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/app-20160118164728-0250/15/stdout
java.io.IOException: Stream closed
at 
java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at 
org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at 
org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Executor 
app-20160118164728-0250/15 finished with state KILLED exitStatus 143

  was:
It's a little similar to the issue SPARK-4300. Well, this time, it happens on 
the driver.

[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver 
driver-20160118171237-0009
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar 
file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hado
op2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/dri
ver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" .
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor 

[jira] [Updated] (SPARK-12876) Race condition when driver rapidly shutdown after started.

2016-01-18 Thread jeffonia Tung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeffonia Tung updated SPARK-12876:
--
Description: 
It's a little similar to the issue SPARK-4300. Well, this time, it happens on 
the driver occasionally.


[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver 
driver-20160118171237-0009
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar 
file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hado
op2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/dri
ver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" 
."org.apache.spark.deploy.worker.DriverWrapper"..
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor 
app-20160118171240-0256/15 for DirectKafkaStreamingV2
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Launch command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp"  
."org.apache.spark.executor.CoarseGrainedExecutorBackend"..
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill driver 
driver-20160118164724-0008
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Redirection to 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/driver-20160118164724-0008/stdout
 closed: Stream closed
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill executor 
app-20160118164728-0250/15
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Runner thread for executor 
app-20160118164728-0250/15 interrupted
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Killing process!
[ERROR 2016-01-18 17:12:49 (Logging.scala:96)] Error writing stream to file 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/app-20160118164728-0250/15/stdout
java.io.IOException: Stream closed
at 
java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at 
org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at 
org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
[INFO 2016-01-18 17:12:49 (Logging.scala:59)] Executor 
app-20160118164728-0250/15 finished with state KILLED exitStatus 143

  was:
It's a little similar to the issue SPARK-4300. Well, this time, it happens on 
the driver occasionally.


[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver 
driver-20160118171237-0009
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar 
file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hado
op2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
 to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/dri
ver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
[INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" .
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor 
app-20160118171240-0256/15 for DirectKafkaStreamingV2
[INFO 2016-01-18 17:12:39 (Logging.scala:59)] Launch command: 
"/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" 
"/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/postgresql-9.2-1004-jdbc41.jar:/data/dbcenter/cdh
5/spark-1.4.0-bin-hadoop2.4/lib/hive-contrib-0.13.1-cdh5.2.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/sbin/../conf/:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop
2.4.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/data/dbcenter/cdh5/spark-1.4.0-bi
n-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" 
"-Dspark.akka.frameSize=100" "-Dspark.driver.port=35133" "-XX:MaxPermSize=128m" 

[jira] [Reopened] (SPARK-12863) missing api for renaming and mapping result of operations on GroupedDataset to case classes

2016-01-18 Thread Milad Khajavi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milad Khajavi reopened SPARK-12863:
---

I have no question. Spark 1.6.0 is missing an API for this problem.

> missing api for renaming and mapping result of operations on GroupedDataset 
> to case classes
> ---
>
> Key: SPARK-12863
> URL: https://issues.apache.org/jira/browse/SPARK-12863
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
>Reporter: Milad Khajavi
>
> Here I struggle with the Spark API to convert the result of count into the 
> KeyValue case class. I think there is no API for renaming the columns of a 
> Dataset and mapping them to a new case class.
> case class LogRow(id: String, location: String, time: Long)
> case class KeyValue(key: (String, String), value: Long)
> val log = LogRow("1", "a", 1) :: and so on
> log.toDS().groupBy(l => {(l.id, l.location)}).count().toDF().toDF("key", 
> "value").as[KeyValue].printSchema()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11948) Permanent UDF not work

2016-01-18 Thread ocean (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105170#comment-15105170
 ] 

ocean commented on SPARK-11948:
---

Does Spark SQL currently support temporary UDF functions that extend UDF?
I found that a temporary function that extends GenericUDF runs OK,
but a temporary function that extends UDF does not work.

My Spark version is 1.5.2.
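
For concreteness, a hedged, illustrative sketch of the kind of Hive UDF being 
described (the class name is made up; only the base class 
org.apache.hadoop.hive.ql.exec.UDF is the point here):

{code}
import org.apache.hadoop.hive.ql.exec.UDF

// Illustrative only: a simple Hive UDF of the kind reported as failing when
// registered as a temporary function in Spark SQL 1.5.2 (GenericUDF-based
// functions are reported to work).
class MyUpperUDF extends UDF {
  def evaluate(s: String): String = if (s == null) null else s.toUpperCase
}
{code}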

> Permanent UDF not work
> --
>
> Key: SPARK-11948
> URL: https://issues.apache.org/jira/browse/SPARK-11948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Weizhong
>Priority: Minor
>
> We create a function,
> {noformat}
> add jar /home/test/smartcare-udf-0.0.1-SNAPSHOT.jar;
> create function arr_greater_equal as 
> 'smartcare.dac.hive.udf.UDFArrayGreaterEqual';
> {noformat}
>  but "show functions" don't display, and when we create the same function 
> again, it throw exception as below:
> {noformat}
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.FunctionTask. 
> AlreadyExistsException(message:Function arr_greater_equal already exists) 
> (state=,code=0)
> {noformat}
> But if we use this function, it throws an exception as below:
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: undefined function 
> arr_greater_equal; line 1 pos 119 (state=,code=0)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12823) Cannot create UDF with StructType input

2016-01-18 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-12823:
-
Description: 
h5. Problem

It is not possible to apply a UDF to a column that has a struct data type. Two 
previous requests to the mailing list remained unanswered.

h5. How-To-Reproduce

{code}
val sql = new org.apache.spark.sql.SQLContext(sc)
import sql.implicits._

case class KV(key: Long, value: String)
case class Row(kv: KV)
val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF

val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
df.select(udf1(df("kv"))).show
// java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to $line78.$read$$iwC$$iwC$KV

val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
df.select(udf2(df("kv"))).show
// org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to data 
type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, however, 
'kv' is of struct type.;
{code}

h5. Mailing List Entries

- 
https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E

- https://www.mail-archive.com/user@spark.apache.org/msg43092.html

h5. Possible Workaround

If you create a {{UserDefinedFunction}} manually, not using the {{udf}} helper 
functions, it works. See https://github.com/FRosner/struct-udf, which exposes 
the UserDefinedFunction constructor (public from package private).

  was:
h5. Problem

It is not possible to apply a UDF to a column that has a struct data type. Two 
previous requests to the mailing list remained unanswered.

h5. How-To-Reproduce

{code}
val sql = new org.apache.spark.sql.SQLContext(sc)
import sql.implicits._

case class KV(key: Long, value: String)
case class Row(kv: KV)
val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF

val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
df.select(udf1(df("kv"))).show
// java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to $line78.$read$$iwC$$iwC$KV

val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
df.select(udf2(df("kv"))).show
// org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to data 
type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, however, 
'kv' is of struct type.;
{code}

h5. Mailing List Entries

- 
https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E

- https://www.mail-archive.com/user@spark.apache.org/msg43092.html


> Cannot create UDF with StructType input
> ---
>
> Key: SPARK-12823
> URL: https://issues.apache.org/jira/browse/SPARK-12823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Frank Rosner
>
> h5. Problem
> It is not possible to apply a UDF to a column that has a struct data type. 
> Two previous requests to the mailing list remained unanswered.
> h5. How-To-Reproduce
> {code}
> val sql = new org.apache.spark.sql.SQLContext(sc)
> import sql.implicits._
> case class KV(key: Long, value: String)
> case class Row(kv: KV)
> val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF
> val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
> df.select(udf1(df("kv"))).show
> // java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to $line78.$read$$iwC$$iwC$KV
> val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
> df.select(udf2(df("kv"))).show
> // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to 
> data type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, 
> however, 'kv' is of struct type.;
> {code}
> h5. Mailing List Entries
> - 
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E
> - https://www.mail-archive.com/user@spark.apache.org/msg43092.html
> h5. Possible Workaround
> If you create a {{UserDefinedFunction}} manually, not using the {{udf}} 
> helper functions, it works. See https://github.com/FRosner/struct-udf, which 
> exposes the UserDefinedFunction constructor (public from package private).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12823) Cannot create UDF with StructType input

2016-01-18 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-12823:
-
Description: 
h5. Problem

It is not possible to apply a UDF to a column that has a struct data type. Two 
previous requests to the mailing list remained unanswered.

h5. How-To-Reproduce

{code}
val sql = new org.apache.spark.sql.SQLContext(sc)
import sql.implicits._

case class KV(key: Long, value: String)
case class Row(kv: KV)
val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF

val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
df.select(udf1(df("kv"))).show
// java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to $line78.$read$$iwC$$iwC$KV

val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
df.select(udf2(df("kv"))).show
// org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to data 
type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, however, 
'kv' is of struct type.;
{code}

h5. Mailing List Entries

- 
https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E

- https://www.mail-archive.com/user@spark.apache.org/msg43092.html

h5. Possible Workaround

If you create a {{UserDefinedFunction}} manually, not using the {{udf}} helper 
functions, it works. See https://github.com/FRosner/struct-udf, which exposes 
the {{UserDefinedFunction}} constructor (public from package private).

  was:
h5. Problem

It is not possible to apply a UDF to a column that has a struct data type. Two 
previous requests to the mailing list remained unanswered.

h5. How-To-Reproduce

{code}
val sql = new org.apache.spark.sql.SQLContext(sc)
import sql.implicits._

case class KV(key: Long, value: String)
case class Row(kv: KV)
val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF

val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
df.select(udf1(df("kv"))).show
// java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to $line78.$read$$iwC$$iwC$KV

val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
df.select(udf2(df("kv"))).show
// org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to data 
type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, however, 
'kv' is of struct type.;
{code}

h5. Mailing List Entries

- 
https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E

- https://www.mail-archive.com/user@spark.apache.org/msg43092.html

h5. Possible Workaround

If you create a {{UserDefinedFunction}} manually, not using the {{udf}} helper 
functions, it works. See https://github.com/FRosner/struct-udf, which exposes 
the UserDefinedFunction constructor (public from package private).


> Cannot create UDF with StructType input
> ---
>
> Key: SPARK-12823
> URL: https://issues.apache.org/jira/browse/SPARK-12823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Frank Rosner
>
> h5. Problem
> It is not possible to apply a UDF to a column that has a struct data type. 
> Two previous requests to the mailing list remained unanswered.
> h5. How-To-Reproduce
> {code}
> val sql = new org.apache.spark.sql.SQLContext(sc)
> import sql.implicits._
> case class KV(key: Long, value: String)
> case class Row(kv: KV)
> val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF
> val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
> df.select(udf1(df("kv"))).show
> // java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to $line78.$read$$iwC$$iwC$KV
> val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
> df.select(udf2(df("kv"))).show
> // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to 
> data type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, 
> however, 'kv' is of struct type.;
> {code}
> h5. Mailing List Entries
> - 
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E
> - https://www.mail-archive.com/user@spark.apache.org/msg43092.html
> h5. Possible Workaround
> If you create a {{UserDefinedFunction}} manually, not using the {{udf}} 
> helper functions, it works. See https://github.com/FRosner/struct-udf, which 
> exposes the {{UserDefinedFunction}} constructor (public from package private).



--
This message was sent by 

[jira] [Commented] (SPARK-11139) Make SparkContext.stop() exception-safe

2016-01-18 Thread Thomas Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105519#comment-15105519
 ] 

Thomas Sebastian commented on SPARK-11139:
--


I have re-opened SPARK-11137 so that this fix can be implemented in the 
StreamingContext.
I would like to work on the fix in the StreamingContext.

> Make SparkContext.stop() exception-safe
> ---
>
> Key: SPARK-11139
> URL: https://issues.apache.org/jira/browse/SPARK-11139
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
>
> In SparkContext.stop(), when an exception is thrown the rest of the 
> stop/cleanup action is aborted.
> Work has been done in SPARK-4194 to allow for cleanup to partial 
> initialization.
> Similar issue in StreamingContext: SPARK-11137



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11137) Make StreamingContext.stop() exception-safe

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105524#comment-15105524
 ] 

Apache Spark commented on SPARK-11137:
--

User 'jayadevanmurali' has created a pull request for this issue:
https://github.com/apache/spark/pull/10807

> Make StreamingContext.stop() exception-safe
> ---
>
> Key: SPARK-11137
> URL: https://issues.apache.org/jira/browse/SPARK-11137
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
>
> In StreamingContext.stop(), when an exception is thrown the rest of the 
> stop/cleanup action is aborted.
> Discussed in https://github.com/apache/spark/pull/9116,
> srowen commented
> Hm, this is getting unwieldy. There are several nested try blocks here. The 
> same argument goes for many of these methods -- if one fails should they not 
> continue trying? A more tidy solution would be to execute a series of () -> 
> Unit code blocks that perform some cleanup and make sure that they each fire 
> in succession, regardless of the others. The final one to remove the shutdown 
> hook could occur outside synchronization.
> I realize we're expanding the scope of the change here, but is it maybe 
> worthwhile to go all the way here?
> Really, something similar could be done for SparkContext and there's an 
> existing JIRA for it somewhere.
> At least, I'd prefer to either narrowly fix the deadlock here, or fix all of 
> the finally-related issue separately and all at once.
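
A minimal sketch of the pattern described above (names are illustrative 
assumptions, not the actual StreamingContext code): each cleanup step runs in 
its own guarded block so that one failure does not abort the rest, and removing 
the shutdown hook happens last, outside any synchronization.

{code}
import scala.util.control.NonFatal

// Run a cleanup step and report, but do not rethrow, non-fatal failures.
def tryLogNonFatalError(step: String)(block: => Unit): Unit =
  try block catch {
    case NonFatal(e) => println(s"Error while $step, continuing with shutdown: $e")
  }

def stop(): Unit = {
  tryLogNonFatalError("stopping the receiver tracker") { /* ... */ }
  tryLogNonFatalError("stopping the job generator") { /* ... */ }
  tryLogNonFatalError("stopping the listener bus") { /* ... */ }
  // Removing the shutdown hook can happen outside synchronization.
  tryLogNonFatalError("removing the shutdown hook") { /* ... */ }
}
{code}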



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11137) Make StreamingContext.stop() exception-safe

2016-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11137:


Assignee: (was: Apache Spark)

> Make StreamingContext.stop() exception-safe
> ---
>
> Key: SPARK-11137
> URL: https://issues.apache.org/jira/browse/SPARK-11137
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
>
> In StreamingContext.stop(), when an exception is thrown the rest of the 
> stop/cleanup action is aborted.
> Discussed in https://github.com/apache/spark/pull/9116,
> srowen commented
> Hm, this is getting unwieldy. There are several nested try blocks here. The 
> same argument goes for many of these methods -- if one fails should they not 
> continue trying? A more tidy solution would be to execute a series of () -> 
> Unit code blocks that perform some cleanup and make sure that they each fire 
> in succession, regardless of the others. The final one to remove the shutdown 
> hook could occur outside synchronization.
> I realize we're expanding the scope of the change here, but is it maybe 
> worthwhile to go all the way here?
> Really, something similar could be done for SparkContext and there's an 
> existing JIRA for it somewhere.
> At least, I'd prefer to either narrowly fix the deadlock here, or fix all of 
> the finally-related issue separately and all at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11137) Make StreamingContext.stop() exception-safe

2016-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11137:


Assignee: Apache Spark

> Make StreamingContext.stop() exception-safe
> ---
>
> Key: SPARK-11137
> URL: https://issues.apache.org/jira/browse/SPARK-11137
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
>
> In StreamingContext.stop(), when an exception is thrown the rest of the 
> stop/cleanup action is aborted.
> Discussed in https://github.com/apache/spark/pull/9116,
> srowen commented
> Hm, this is getting unwieldy. There are several nested try blocks here. The 
> same argument goes for many of these methods -- if one fails should they not 
> continue trying? A more tidy solution would be to execute a series of () -> 
> Unit code blocks that perform some cleanup and make sure that they each fire 
> in succession, regardless of the others. The final one to remove the shutdown 
> hook could occur outside synchronization.
> I realize we're expanding the scope of the change here, but is it maybe 
> worthwhile to go all the way here?
> Really, something similar could be done for SparkContext and there's an 
> existing JIRA for it somewhere.
> At least, I'd prefer to either narrowly fix the deadlock here, or fix all of 
> the finally-related issue separately and all at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11137) Make StreamingContext.stop() exception-safe

2016-01-18 Thread Thomas Sebastian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Sebastian reopened SPARK-11137:
--

Re-opening this issue based on the discussion at 
https://issues.apache.org/jira/browse/SPARK-11139.
As SPARK-11139, of which this was marked a duplicate, does not cover the fix in 
the StreamingContext, this issue is re-opened.

> Make StreamingContext.stop() exception-safe
> ---
>
> Key: SPARK-11137
> URL: https://issues.apache.org/jira/browse/SPARK-11137
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
>
> In StreamingContext.stop(), when an exception is thrown the rest of the 
> stop/cleanup action is aborted.
> Discussed in https://github.com/apache/spark/pull/9116,
> srowen commented
> Hm, this is getting unwieldy. There are several nested try blocks here. The 
> same argument goes for many of these methods -- if one fails should they not 
> continue trying? A more tidy solution would be to execute a series of () -> 
> Unit code blocks that perform some cleanup and make sure that they each fire 
> in succession, regardless of the others. The final one to remove the shutdown 
> hook could occur outside synchronization.
> I realize we're expanding the scope of the change here, but is it maybe 
> worthwhile to go all the way here?
> Really, something similar could be done for SparkContext and there's an 
> existing JIRA for it somewhere.
> At least, I'd prefer to either narrowly fix the deadlock here, or fix all of 
> the finally-related issue separately and all at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11137) Make StreamingContext.stop() exception-safe

2016-01-18 Thread Jayadevan M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105523#comment-15105523
 ] 

Jayadevan M commented on SPARK-11137:
-

[~thomastechs]
I am working on the fix.

> Make StreamingContext.stop() exception-safe
> ---
>
> Key: SPARK-11137
> URL: https://issues.apache.org/jira/browse/SPARK-11137
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
>
> In StreamingContext.stop(), when an exception is thrown the rest of the 
> stop/cleanup action is aborted.
> Discussed in https://github.com/apache/spark/pull/9116,
> srowen commented
> Hm, this is getting unwieldy. There are several nested try blocks here. The 
> same argument goes for many of these methods -- if one fails should they not 
> continue trying? A more tidy solution would be to execute a series of () -> 
> Unit code blocks that perform some cleanup and make sure that they each fire 
> in succession, regardless of the others. The final one to remove the shutdown 
> hook could occur outside synchronization.
> I realize we're expanding the scope of the change here, but is it maybe 
> worthwhile to go all the way here?
> Really, something similar could be done for SparkContext and there's an 
> existing JIRA for it somewhere.
> At least, I'd prefer to either narrowly fix the deadlock here, or fix all of 
> the finally-related issue separately and all at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12864) initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached

2016-01-18 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105671#comment-15105671
 ] 

Saisai Shao commented on SPARK-12864:
-

Which Spark version are you using? I remember fixing a similar AM re-attempt 
issue before. As I recall, when the AM exits, all the related 
containers/executors exit as well, and when another attempt starts, it 
initializes the AM from scratch, so there should be no problem.

Did you mean that when the AM fails, all the containers are still running in 
your cluster? If so, that's kind of weird; would you please elaborate on what 
you saw? Thanks a lot.

>  initialize executorIdCounter after ApplicationMaster killed for max number 
> of executor failures reached
> 
>
> Key: SPARK-12864
> URL: https://issues.apache.org/jira/browse/SPARK-12864
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.1, 1.4.1, 1.5.2
>Reporter: iward
>
> Currently, when the number of executor failures reaches 
> *maxNumExecutorFailures*, the *ApplicationMaster* is killed and another one is 
> re-registered. At that point a new instance of *YarnAllocator* is created, but 
> the value of the *executorIdCounter* property in *YarnAllocator* resets to *0*, 
> so the *Id* of each new executor starts from 1 again. These IDs collide with 
> executors created earlier, which causes FetchFailedException.
> For example, the following is the task log:
> {noformat}
> 2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN 
> YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has 
> disassociated: 172.22.92.14:45125
> 2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO 
> YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as 
> AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
> {noformat}
> {noformat}
> 2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: 
> Registered executor: 
> AkkaRpcEndpointRef(Actor[akka.tcp://sparkexecu...@bjhc-hera-16217.hadoop.jd.local:46538/user/Executor#-790726793])
>  with ID 1
> {noformat}
> {noformat}
> Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): 
> FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337
> ), shuffleId=5, mapId=2, reduceId=3, message=
> 2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: 
> /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffl
> e_5_2_0.index (No such file or directory)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> {noformat}
> As the task log shows, an executor on *BJHC-HERA-16217.hadoop.jd.local* 
> ends up with the same ID as one on *BJHC-HERA-17030.hadoop.jd.local*. This 
> confusion causes the FetchFailedException.
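
A self-contained illustration of the ID clash described above (toy code, not 
Spark's YarnAllocator): an allocator whose counter restarts at 0 after the AM 
failure hands out IDs that collide with executors that are still registered, 
while seeding the counter from the largest known ID avoids the clash.

{code}
class ToyAllocator(startFrom: Int = 0) {
  private var executorIdCounter = startFrom
  def nextExecutorId(): String = {
    executorIdCounter += 1
    executorIdCounter.toString
  }
}

object ToyAllocatorDemo extends App {
  // Executors that survived the AM failure and are still registered with the driver.
  val survivingExecutorIds = Set("1", "2", "3")

  val naive = new ToyAllocator()             // counter reset to 0 after the AM restart
  println(naive.nextExecutorId())            // "1" -- collides with a surviving executor

  // Seeding the counter from the largest ID already in use avoids the collision.
  val seeded = new ToyAllocator(startFrom = survivingExecutorIds.map(_.toInt).max)
  println(seeded.nextExecutorId())           // "4" -- unique
}
{code}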



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12880) Different results on groupBy after window function

2016-01-18 Thread Saif Addin Ellafi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saif Addin Ellafi updated SPARK-12880:
--
Description: 
scala> val overVint = Window.partitionBy("product", "bnd", 
"age").orderBy(asc("mm"))

scala> val df_data2 = df_data.withColumn("result", lag("baleom", 
1).over(overVint))

scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
200509").groupBy("mm", "closed", 
"ever_closed").agg(sum("result").as("result")).show
+--+--+---++
|mm|closed|ever_closed|  result|
+--+--+---++
|200509| 1|  1|1.2672666129980398E7|
|200509| 0|  0|2.7104834668856387E9|
|200509| 0|  1| 1.151339011298214E8|
+--+--+---++


scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
200509").groupBy("mm", "closed", 
"ever_closed").agg(sum("result").as("result")).show
+--+--+---++
|mm|closed|ever_closed|  result|
+--+--+---++
|200509| 1|  1|1.2357681589980595E7|
|200509| 0|  0| 2.709930867575646E9|
|200509| 0|  1|1.1595048973981345E8|
+--+--+---++

Does NOT happen with columns not of the window function
Happens both in cluster mode and local mode
Before group by operation, data looks good and is consistent
Happens when data is large (This case is 1.4 billion rows. Does not happen if I 
use limit 10)


  was:
scala> val overVint = Window.partitionBy("product", "bnd", 
"age").orderBy(asc("mm"))

scala> val df_data2 = df_data.withColumn("result", lag("baleom", 
1).over(overVint))

scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
200509").groupBy("mm", "closed", 
"ever_closed").agg(sum("result").as("result")).show
+--+--+---++
|mm|closed|ever_closed|  result|
+--+--+---++
|200509| 1|  1|1.2672666129980398E7|
|200509| 0|  0|2.7104834668856387E9|
|200509| 0|  1| 1.151339011298214E8|
+--+--+---++


scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
200509").groupBy("mm", "closed", 
"ever_closed").agg(sum("result").as("result")).show
+--+--+---++
|mm|closed|ever_closed|  result|
+--+--+---++
|200509| 1|  1|1.2357681589980595E7|
|200509| 0|  0| 2.709930867575646E9|
|200509| 0|  1|1.1595048973981345E8|
+--+--+---++

Does NOT happen with columns not of the window function
Happens both in cluster mode and local mode
Before group by operation, data looks good and is consistent



> Different results on groupBy after window function
> --
>
> Key: SPARK-12880
> URL: https://issues.apache.org/jira/browse/SPARK-12880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Saif Addin Ellafi
>Priority: Critical
>
> scala> val overVint = Window.partitionBy("product", "bnd", 
> "age").orderBy(asc("mm"))
> scala> val df_data2 = df_data.withColumn("result", lag("baleom", 
> 1).over(overVint))
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2672666129980398E7|
> |200509| 0|  0|2.7104834668856387E9|
> |200509| 0|  1| 1.151339011298214E8|
> +--+--+---++
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2357681589980595E7|
> |200509| 0|  0| 2.709930867575646E9|
> |200509| 0|  1|1.1595048973981345E8|
> +--+--+---++
> Does NOT happen with columns not of the window function
> Happens both in cluster mode and local mode
> Before group by operation, data looks good and is consistent
> Happens when data is large (This case is 1.4 billion rows. Does not happen if 
> I use limit 10)



--
This message was sent by Atlassian JIRA

[jira] [Comment Edited] (SPARK-12864) initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached

2016-01-18 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105671#comment-15105671
 ] 

Saisai Shao edited comment on SPARK-12864 at 1/18/16 7:03 PM:
--

What Spark version are you using? I remember I fixed a similar AM re-attempt 
issue before. As I recall, when the AM exits, all the related 
containers/executors exit as well, and when another attempt is 
started, it initializes the AM from scratch, so there should be no problem.

Did you mean that when the AM fails, all the containers are still running in 
your cluster? If so, that's kind of weird; could you please elaborate on what 
you saw? Thanks a lot.


was (Author: jerryshao):
What's Spark version are you using? I remember I fixed a similar AM re-attempt 
issue before. As I remembered, when AM is exited, all the related 
containers/executors will be exited as well, and when another attempt is 
started, it will initialize the AM from scratch, so there should be no problem.

Did you mean that when AM is failed, all the containers are still running in 
your cluster? If so, that's a kind of weird, would you please elaborate what 
you saw, thanks a lot.

>  initialize executorIdCounter after ApplicationMaster killed for max number 
> of executor failures reached
> 
>
> Key: SPARK-12864
> URL: https://issues.apache.org/jira/browse/SPARK-12864
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.1, 1.4.1, 1.5.2
>Reporter: iward
>
> Currently, when the number of executor failures reaches 
> *maxNumExecutorFailures*, the *ApplicationMaster* is killed and a new one is 
> registered. A new *YarnAllocator* instance is then created, but its 
> *executorIdCounter* is reset to *0*, so the IDs of new executors start from 
> 1 again. These IDs collide with executors created before the failure, which 
> causes FetchFailedException.
> For example, the following is the task log:
> {noformat}
> 2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN 
> YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has 
> disassociated: 172.22.92.14:45125
> 2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO 
> YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as 
> AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
> {noformat}
> {noformat}
> 2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: 
> Registered executor: 
> AkkaRpcEndpointRef(Actor[akka.tcp://sparkexecu...@bjhc-hera-16217.hadoop.jd.local:46538/user/Executor#-790726793])
>  with ID 1
> {noformat}
> {noformat}
> Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): 
> FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337
> ), shuffleId=5, mapId=2, reduceId=3, message=
> 2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: 
> /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffl
> e_5_2_0.index (No such file or directory)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-22 02:43:20 INFO at 
> 

[jira] [Updated] (SPARK-12880) Different results on groupBy after window function

2016-01-18 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-12880:
--
Description: 
{noformat}
scala> val overVint = Window.partitionBy("product", "bnd", 
"age").orderBy(asc("mm"))

scala> val df_data2 = df_data.withColumn("result", lag("baleom", 
1).over(overVint))

scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
200509").groupBy("mm", "closed", 
"ever_closed").agg(sum("result").as("result")).show
+--+--+---++
|mm|closed|ever_closed|  result|
+--+--+---++
|200509| 1|  1|1.2672666129980398E7|
|200509| 0|  0|2.7104834668856387E9|
|200509| 0|  1| 1.151339011298214E8|
+--+--+---++


scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
200509").groupBy("mm", "closed", 
"ever_closed").agg(sum("result").as("result")).show
+--+--+---++
|mm|closed|ever_closed|  result|
+--+--+---++
|200509| 1|  1|1.2357681589980595E7|
|200509| 0|  0| 2.709930867575646E9|
|200509| 0|  1|1.1595048973981345E8|
+--+--+---++
{noformat}

Does NOT happen with columns not of the window function
Happens both in cluster mode and local mode
Before group by operation, data looks good and is consistent


  was:
scala> val overVint = Window.partitionBy("product", "bnd", 
"age").orderBy(asc("mm"))

scala> val df_data2 = df_data.withColumn("result", lag("baleom", 
1).over(overVint))

scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
200509").groupBy("mm", "closed", 
"ever_closed").agg(sum("result").as("result")).show
+--+--+---++
|mm|closed|ever_closed|  result|
+--+--+---++
|200509| 1|  1|1.2672666129980398E7|
|200509| 0|  0|2.7104834668856387E9|
|200509| 0|  1| 1.151339011298214E8|
+--+--+---++


scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
200509").groupBy("mm", "closed", 
"ever_closed").agg(sum("result").as("result")).show
+--+--+---++
|mm|closed|ever_closed|  result|
+--+--+---++
|200509| 1|  1|1.2357681589980595E7|
|200509| 0|  0| 2.709930867575646E9|
|200509| 0|  1|1.1595048973981345E8|
+--+--+---++

Does NOT happen with columns not of the window function
Happens both in cluster mode and local mode
Before group by operation, data looks good and is consistent
Happens when data is large (This case is 1.4 billion rows. Does not happen if I 
use limit 10)



> Different results on groupBy after window function
> --
>
> Key: SPARK-12880
> URL: https://issues.apache.org/jira/browse/SPARK-12880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Saif Addin Ellafi
>Priority: Critical
>
> {noformat}
> scala> val overVint = Window.partitionBy("product", "bnd", 
> "age").orderBy(asc("mm"))
> scala> val df_data2 = df_data.withColumn("result", lag("baleom", 
> 1).over(overVint))
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2672666129980398E7|
> |200509| 0|  0|2.7104834668856387E9|
> |200509| 0|  1| 1.151339011298214E8|
> +--+--+---++
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2357681589980595E7|
> |200509| 0|  0| 2.709930867575646E9|
> |200509| 0|  1|1.1595048973981345E8|
> +--+--+---++
> {noformat}
> Does NOT happen with columns not of the window function
> Happens both in cluster mode and local mode
> Before group by operation, data looks good and is consistent



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-12880) Different results on groupBy after window function

2016-01-18 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105678#comment-15105678
 ] 

Herman van Hovell commented on SPARK-12880:
---

You are ordering by {{mm}}, and it looks like that ordering is not unique (for 
instance 200509). I don't think it is a given that sorting will always yield 
the same order in such a case, so calling the {{lag()}} function will not 
return the same results.

Please check whether the {{mm}} values are unique for each {{product, bnd & 
age}} tuple.
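
For example, a hypothetical way to make the ordering deterministic is to add a 
unique tie-breaker column to the window's {{orderBy}} ({{account_id}} below is 
an assumed unique key; {{df_data}} and the other column names come from the 
report above):

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{asc, lag}

// The extra sort key makes the ordering within each partition total, so lag() is stable.
val overVint = Window
  .partitionBy("product", "bnd", "age")
  .orderBy(asc("mm"), asc("account_id"))

val df_data2 = df_data.withColumn("result", lag("baleom", 1).over(overVint))
{code}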

> Different results on groupBy after window function
> --
>
> Key: SPARK-12880
> URL: https://issues.apache.org/jira/browse/SPARK-12880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Saif Addin Ellafi
>Priority: Critical
>
> {noformat}
> scala> val overVint = Window.partitionBy("product", "bnd", 
> "age").orderBy(asc("mm"))
> scala> val df_data2 = df_data.withColumn("result", lag("baleom", 
> 1).over(overVint))
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2672666129980398E7|
> |200509| 0|  0|2.7104834668856387E9|
> |200509| 0|  1| 1.151339011298214E8|
> +--+--+---++
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2357681589980595E7|
> |200509| 0|  0| 2.709930867575646E9|
> |200509| 0|  1|1.1595048973981345E8|
> +--+--+---++
> {noformat}
> Does NOT happen with columns not of the window function
> Happens both in cluster mode and local mode
> Before group by operation, data looks good and is consistent



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12880) Different results on groupBy after window function

2016-01-18 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-12880:
--
Priority: Major  (was: Critical)

> Different results on groupBy after window function
> --
>
> Key: SPARK-12880
> URL: https://issues.apache.org/jira/browse/SPARK-12880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Saif Addin Ellafi
>
> {noformat}
> scala> val overVint = Window.partitionBy("product", "bnd", 
> "age").orderBy(asc("mm"))
> scala> val df_data2 = df_data.withColumn("result", lag("baleom", 
> 1).over(overVint))
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2672666129980398E7|
> |200509| 0|  0|2.7104834668856387E9|
> |200509| 0|  1| 1.151339011298214E8|
> +--+--+---++
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2357681589980595E7|
> |200509| 0|  0| 2.709930867575646E9|
> |200509| 0|  1|1.1595048973981345E8|
> +--+--+---++
> {noformat}
> Does NOT happen with columns not of the window function
> Happens both in cluster mode and local mode
> Before group by operation, data looks good and is consistent



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12873) Add more comment in HiveTypeCoercion for type widening

2016-01-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12873.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Add more comment in HiveTypeCoercion for type widening
> --
>
> Key: SPARK-12873
> URL: https://issues.apache.org/jira/browse/SPARK-12873
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12867) Nullability of Intersect can be stricter

2016-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12867:


Assignee: Xiao Li  (was: Apache Spark)

> Nullability of Intersect can be stricter
> 
>
> Key: SPARK-12867
> URL: https://issues.apache.org/jira/browse/SPARK-12867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Xiao Li
>Priority: Minor
>
> {{Intersect}} doesn't override {{SetOperation.output}}, which is defined as:
> {code}
>   override def output: Seq[Attribute] =
> left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
>   leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable)
> }
> {code}
> However, we can replace the {{||}} with {{&&}} for intersection.
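
A minimal sketch of what the stricter override could look like (assuming the 
same {{SetOperation}} shape as the snippet above; not necessarily the actual 
patch):

{code}
case class Intersect(left: LogicalPlan, right: LogicalPlan)
  extends SetOperation(left, right) {

  override def output: Seq[Attribute] =
    left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
      // A row survives the intersection only if it appears on both sides, so a
      // result column can be null only when it is nullable on *both* inputs.
      leftAttr.withNullability(leftAttr.nullable && rightAttr.nullable)
    }
}
{code}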



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12867) Nullability of Intersect can be stricter

2016-01-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12867:


Assignee: Apache Spark  (was: Xiao Li)

> Nullability of Intersect can be stricter
> 
>
> Key: SPARK-12867
> URL: https://issues.apache.org/jira/browse/SPARK-12867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Minor
>
> {{Intersect}} doesn't override {{SetOperation.output}}, which is defined as:
> {code}
>   override def output: Seq[Attribute] =
> left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
>   leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable)
> }
> {code}
> However, we can replace the {{||}} with {{&&}} for intersection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12867) Nullability of Intersect can be stricter

2016-01-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105725#comment-15105725
 ] 

Apache Spark commented on SPARK-12867:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10812

> Nullability of Intersect can be stricter
> 
>
> Key: SPARK-12867
> URL: https://issues.apache.org/jira/browse/SPARK-12867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Xiao Li
>Priority: Minor
>
> {{Intersect}} doesn't override {{SetOperation.output}}, which is defined as:
> {code}
>   override def output: Seq[Attribute] =
> left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
>   leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable)
> }
> {code}
> However, we can replace the {{||}} with {{&&}} for intersection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12880) Different results on groupBy after window function

2016-01-18 Thread Saif Addin Ellafi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saif Addin Ellafi closed SPARK-12880.
-
Resolution: Not A Problem

> Different results on groupBy after window function
> --
>
> Key: SPARK-12880
> URL: https://issues.apache.org/jira/browse/SPARK-12880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Saif Addin Ellafi
>
> {noformat}
> scala> val overVint = Window.partitionBy("product", "bnd", 
> "age").orderBy(asc("mm"))
> scala> val df_data2 = df_data.withColumn("result", lag("baleom", 
> 1).over(overVint))
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2672666129980398E7|
> |200509| 0|  0|2.7104834668856387E9|
> |200509| 0|  1| 1.151339011298214E8|
> +--+--+---++
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2357681589980595E7|
> |200509| 0|  0| 2.709930867575646E9|
> |200509| 0|  1|1.1595048973981345E8|
> +--+--+---++
> {noformat}
> Does NOT happen with columns not of the window function
> Happens both in cluster mode and local mode
> Before group by operation, data looks good and is consistent



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12880) Different results on groupBy after window function

2016-01-18 Thread Saif Addin Ellafi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105748#comment-15105748
 ] 

Saif Addin Ellafi commented on SPARK-12880:
---

Thank you, I think what you mention is the case. I will look into it further, 
sorry. Shortly I will be deleting this issue to avoid misunderstandings.

> Different results on groupBy after window function
> --
>
> Key: SPARK-12880
> URL: https://issues.apache.org/jira/browse/SPARK-12880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Saif Addin Ellafi
>
> {noformat}
> scala> val overVint = Window.partitionBy("product", "bnd", 
> "age").orderBy(asc("mm"))
> scala> val df_data2 = df_data.withColumn("result", lag("baleom", 
> 1).over(overVint))
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2672666129980398E7|
> |200509| 0|  0|2.7104834668856387E9|
> |200509| 0|  1| 1.151339011298214E8|
> +--+--+---++
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2357681589980595E7|
> |200509| 0|  0| 2.709930867575646E9|
> |200509| 0|  1|1.1595048973981345E8|
> +--+--+---++
> {noformat}
> Does NOT happen with columns not of the window function
> Happens both in cluster mode and local mode
> Before group by operation, data looks good and is consistent



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12880) Different results on groupBy after window function

2016-01-18 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105751#comment-15105751
 ] 

Herman van Hovell commented on SPARK-12880:
---

No problem. You don't have to delete this, since it has been marked as 'Not a 
Problem'. It is kind of useful to have.

> Different results on groupBy after window function
> --
>
> Key: SPARK-12880
> URL: https://issues.apache.org/jira/browse/SPARK-12880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Saif Addin Ellafi
>
> {noformat}
> scala> val overVint = Window.partitionBy("product", "bnd", 
> "age").orderBy(asc("mm"))
> scala> val df_data2 = df_data.withColumn("result", lag("baleom", 
> 1).over(overVint))
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2672666129980398E7|
> |200509| 0|  0|2.7104834668856387E9|
> |200509| 0|  1| 1.151339011298214E8|
> +--+--+---++
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and mm = 
> 200509").groupBy("mm", "closed", 
> "ever_closed").agg(sum("result").as("result")).show
> +--+--+---++
> |mm|closed|ever_closed|  result|
> +--+--+---++
> |200509| 1|  1|1.2357681589980595E7|
> |200509| 0|  0| 2.709930867575646E9|
> |200509| 0|  1|1.1595048973981345E8|
> +--+--+---++
> {noformat}
> Does NOT happen with columns not of the window function
> Happens both in cluster mode and local mode
> Before group by operation, data looks good and is consistent



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12893) RM redirects to incorrect URL in Spark HistoryServer for yarn-cluster mode

2016-01-18 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-12893:

Attachment: Screen Shot 2016-01-18 at 3.47.24 PM.png

> RM redirects to incorrect URL in Spark HistoryServer for yarn-cluster mode
> --
>
> Key: SPARK-12893
> URL: https://issues.apache.org/jira/browse/SPARK-12893
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
> Attachments: Screen Shot 2016-01-18 at 3.47.24 PM.png
>
>
> This will cause application not found error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


