[jira] [Commented] (SPARK-16100) Aggregator fails with Tungsten error when complex types are used for results and partial sum

2016-06-21 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341847#comment-15341847
 ] 

Deenar Toraskar commented on SPARK-16100:
-

similar issue

> Aggregator fails with Tungsten error when complex types are used for results 
> and partial sum
> 
>
> Key: SPARK-16100
> URL: https://issues.apache.org/jira/browse/SPARK-16100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Deenar Toraskar
>
> I get a similar error when using complex types in Aggregator. Not sure if 
> this is the same issue or something else.
> {code:title=Agg.scala|borderStyle=solid}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.TypedColumn
> import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
> import org.apache.spark.sql.expressions.Aggregator
> import org.apache.spark.sql.{Encoder, Row}
> import sqlContext.implicits._
>
> object CustomSummer extends Aggregator[Valuation, Map[Int, Seq[Double]], Seq[Seq[Double]]] with Serializable {
>
>   def zero: Map[Int, Seq[Double]] = Map()
>
>   def reduce(b: Map[Int, Seq[Double]], a: Valuation): Map[Int, Seq[Double]] = {
>     val timeInterval: Int = a.timeInterval
>     val currentSum: Seq[Double] = b.get(timeInterval).getOrElse(Nil)
>     val currentRow: Seq[Double] = a.pvs
>     b.updated(timeInterval, sumArray(currentSum, currentRow))
>   }
>
>   def sumArray(a: Seq[Double], b: Seq[Double]): Seq[Double] = Nil
>
>   def merge(b1: Map[Int, Seq[Double]], b2: Map[Int, Seq[Double]]): Map[Int, Seq[Double]] = {
>     /* Merges the two maps: ++ replaces any (k, v) from the left-hand map (b1) with
>        the (k, v) from the right-hand map (b2) if the key already exists on the left,
>        e.g. Map(1 -> 1) ++ Map(1 -> 2) results in Map(1 -> 2). */
>     b1 ++ b2.map { case (timeInterval, exposures) =>
>       timeInterval -> sumArray(exposures, b1.getOrElse(timeInterval, Nil))
>     }
>   }
>
>   def finish(exposures: Map[Int, Seq[Double]]): Seq[Seq[Double]] = {
>     exposures.size match {
>       case 0 => null
>       case _ =>
>         val range = exposures.keySet.max
>         // convert the map into a two-dimensional sequence: timeInterval x Seq(expScn1, expScn2, ...)
>         (0 to range).map(x => exposures.getOrElse(x, Nil))
>     }
>   }
>
>   override def bufferEncoder: Encoder[Map[Int, Seq[Double]]] = ExpressionEncoder()
>
>   override def outputEncoder: Encoder[Seq[Seq[Double]]] = ExpressionEncoder()
> }
>
> case class Valuation(timeInterval: Int, pvs: Seq[Double])
>
> val valns = sc.parallelize(Seq(
>   Valuation(0, Seq(1.0, 2.0, 3.0)),
>   Valuation(2, Seq(1.0, 2.0, 3.0)),
>   Valuation(1, Seq(1.0, 2.0, 3.0)),
>   Valuation(2, Seq(1.0, 2.0, 3.0)),
>   Valuation(0, Seq(1.0, 2.0, 3.0)))).toDS
>
> val g_c1 = valns.groupByKey(_.timeInterval).agg(CustomSummer.toColumn).show(false)
> {code}
> I get the following error
> {quote}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
> (TID 19, localhost): java.lang.IndexOutOfBoundsException: 0
> at 
> scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
> at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
> at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:167)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
> at 
> org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:156)
> at 
> org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:154)
> at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155)
> at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:155)
> at 
> org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:154)
> at 
> 

[jira] [Commented] (SPARK-15704) TungstenAggregate crashes

2016-06-21 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341844#comment-15341844
 ] 

Deenar Toraskar commented on SPARK-15704:
-

Done, see https://issues.apache.org/jira/browse/SPARK-16100


> TungstenAggregate crashes 
> --
>
> Key: SPARK-15704
> URL: https://issues.apache.org/jira/browse/SPARK-15704
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hiroshi Inoue
>Assignee: Hiroshi Inoue
> Fix For: 2.0.0
>
>
> When I run DatasetBenchmark, the JVM crashes while executing "Dataset complex 
> Aggregator" test case due to IndexOutOfBoundsException.
> The error happens in TungstenAggregate; the mappings between bufferSerializer 
> and bufferDeserializer are broken due to unresolved attribute.
> {quote}
> 16/06/02 01:41:05 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID 
> 232)
> java.lang.IndexOutOfBoundsException: -1
>   at 
> scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.List.apply(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate$RichAttribute.right(interfaces.scala:389)
>   at 
> org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:110)
>   at 
> org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:109)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:307)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:356)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:336)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:334)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> 

[jira] [Created] (SPARK-16100) Aggregator fails with Tungsten error when complex types are used for results and partial sum

2016-06-21 Thread Deenar Toraskar (JIRA)
Deenar Toraskar created SPARK-16100:
---

 Summary: Aggregator fails with Tungsten error when complex types 
are used for results and partial sum
 Key: SPARK-16100
 URL: https://issues.apache.org/jira/browse/SPARK-16100
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Deenar Toraskar


I get a similar error when using complex types in Aggregator. Not sure if this 
is the same issue or something else.
{code:title=Agg.scala|borderStyle=solid}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Row}
import sqlContext.implicits._

object CustomSummer extends Aggregator[Valuation, Map[Int, Seq[Double]], Seq[Seq[Double]]] with Serializable {

  def zero: Map[Int, Seq[Double]] = Map()

  def reduce(b: Map[Int, Seq[Double]], a: Valuation): Map[Int, Seq[Double]] = {
    val timeInterval: Int = a.timeInterval
    val currentSum: Seq[Double] = b.get(timeInterval).getOrElse(Nil)
    val currentRow: Seq[Double] = a.pvs
    b.updated(timeInterval, sumArray(currentSum, currentRow))
  }

  def sumArray(a: Seq[Double], b: Seq[Double]): Seq[Double] = Nil

  def merge(b1: Map[Int, Seq[Double]], b2: Map[Int, Seq[Double]]): Map[Int, Seq[Double]] = {
    /* Merges the two maps: ++ replaces any (k, v) from the left-hand map (b1) with
       the (k, v) from the right-hand map (b2) if the key already exists on the left,
       e.g. Map(1 -> 1) ++ Map(1 -> 2) results in Map(1 -> 2). */
    b1 ++ b2.map { case (timeInterval, exposures) =>
      timeInterval -> sumArray(exposures, b1.getOrElse(timeInterval, Nil))
    }
  }

  def finish(exposures: Map[Int, Seq[Double]]): Seq[Seq[Double]] = {
    exposures.size match {
      case 0 => null
      case _ =>
        val range = exposures.keySet.max
        // convert the map into a two-dimensional sequence: timeInterval x Seq(expScn1, expScn2, ...)
        (0 to range).map(x => exposures.getOrElse(x, Nil))
    }
  }

  override def bufferEncoder: Encoder[Map[Int, Seq[Double]]] = ExpressionEncoder()

  override def outputEncoder: Encoder[Seq[Seq[Double]]] = ExpressionEncoder()
}

case class Valuation(timeInterval: Int, pvs: Seq[Double])

val valns = sc.parallelize(Seq(
  Valuation(0, Seq(1.0, 2.0, 3.0)),
  Valuation(2, Seq(1.0, 2.0, 3.0)),
  Valuation(1, Seq(1.0, 2.0, 3.0)),
  Valuation(2, Seq(1.0, 2.0, 3.0)),
  Valuation(0, Seq(1.0, 2.0, 3.0)))).toDS

val g_c1 = valns.groupByKey(_.timeInterval).agg(CustomSummer.toColumn).show(false)
{code}

I get the following error
{quote}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
(TID 19, localhost): java.lang.IndexOutOfBoundsException: 0
at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:167)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:156)
at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:154)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:155)
at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:154)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at 

[jira] [Commented] (SPARK-15704) TungstenAggregate crashes

2016-06-21 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341598#comment-15341598
 ] 

Deenar Toraskar commented on SPARK-15704:
-

[~inouehrs] thanks for checking this out. Do you want me to raise another JIRA?

> TungstenAggregate crashes 
> --
>
> Key: SPARK-15704
> URL: https://issues.apache.org/jira/browse/SPARK-15704
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hiroshi Inoue
>Assignee: Hiroshi Inoue
> Fix For: 2.0.0
>
>
> When I run DatasetBenchmark, the JVM crashes while executing "Dataset complex 
> Aggregator" test case due to IndexOutOfBoundsException.
> The error happens in TungstenAggregate; the mappings between bufferSerializer 
> and bufferDeserializer are broken due to unresolved attribute.
> {quote}
> 16/06/02 01:41:05 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID 
> 232)
> java.lang.IndexOutOfBoundsException: -1
>   at 
> scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.List.apply(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate$RichAttribute.right(interfaces.scala:389)
>   at 
> org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:110)
>   at 
> org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:109)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:307)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:356)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:336)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:334)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> 

[jira] [Commented] (SPARK-15704) TungstenAggregate crashes

2016-06-21 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341478#comment-15341478
 ] 

Deenar Toraskar commented on SPARK-15704:
-

Hi guys

I get a similar error when using complex types in Aggregator. Not sure if this 
is the same issue or something else.

{code:title=Agg.scala|borderStyle=solid}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Row}
import sqlContext.implicits._

object CustomSummer extends Aggregator[Valuation, Map[Int, Seq[Double]], Seq[Seq[Double]]] with Serializable {

  def zero: Map[Int, Seq[Double]] = Map()

  def reduce(b: Map[Int, Seq[Double]], a: Valuation): Map[Int, Seq[Double]] = {
    val timeInterval: Int = a.timeInterval
    val currentSum: Seq[Double] = b.get(timeInterval).getOrElse(Nil)
    val currentRow: Seq[Double] = a.pvs
    b.updated(timeInterval, sumArray(currentSum, currentRow))
  }

  def sumArray(a: Seq[Double], b: Seq[Double]): Seq[Double] = Nil

  def merge(b1: Map[Int, Seq[Double]], b2: Map[Int, Seq[Double]]): Map[Int, Seq[Double]] = {
    /* Merges the two maps: ++ replaces any (k, v) from the left-hand map (b1) with
       the (k, v) from the right-hand map (b2) if the key already exists on the left,
       e.g. Map(1 -> 1) ++ Map(1 -> 2) results in Map(1 -> 2). */
    b1 ++ b2.map { case (timeInterval, exposures) =>
      timeInterval -> sumArray(exposures, b1.getOrElse(timeInterval, Nil))
    }
  }

  def finish(exposures: Map[Int, Seq[Double]]): Seq[Seq[Double]] = {
    exposures.size match {
      case 0 => null
      case _ =>
        val range = exposures.keySet.max
        // convert the map into a two-dimensional sequence: timeInterval x Seq(expScn1, expScn2, ...)
        (0 to range).map(x => exposures.getOrElse(x, Nil))
    }
  }

  override def bufferEncoder: Encoder[Map[Int, Seq[Double]]] = ExpressionEncoder()

  override def outputEncoder: Encoder[Seq[Seq[Double]]] = ExpressionEncoder()
}

case class Valuation(timeInterval: Int, pvs: Seq[Double])

val valns = sc.parallelize(Seq(
  Valuation(0, Seq(1.0, 2.0, 3.0)),
  Valuation(2, Seq(1.0, 2.0, 3.0)),
  Valuation(1, Seq(1.0, 2.0, 3.0)),
  Valuation(2, Seq(1.0, 2.0, 3.0)),
  Valuation(0, Seq(1.0, 2.0, 3.0)))).toDS

val g_c1 = valns.groupByKey(_.timeInterval).agg(CustomSummer.toColumn).show(false)
{code}
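
Note that sumArray is deliberately stubbed out above to keep the repro minimal; the failure below occurs with the stub exactly as posted. For reference, a sketch of the element-wise sum the stub stands in for (my assumption of the intended logic, not part of the repro) would be:

{code}
// Sketch only: element-wise sum of two scenario vectors, padding the shorter
// side with zeros. Not needed to reproduce the error.
def sumArray(a: Seq[Double], b: Seq[Double]): Seq[Double] =
  a.zipAll(b, 0.0, 0.0).map { case (x, y) => x + y }
{code}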

I get the following error

{quote}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
(TID 19, localhost): java.lang.IndexOutOfBoundsException: 0
at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:167)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:156)
at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:154)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:155)
at 
org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:154)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at 

[jira] [Commented] (SPARK-13451) Spark shell prints error when :4040 port already in use

2016-02-23 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158783#comment-15158783
 ] 

Deenar Toraskar commented on SPARK-13451:
-

[~sowen] The issue was fixed previously: with the settings in the log4j template in the 
conf directory, this message is suppressed. But when jetty moved from the eclipse 
packages to the spark-project module, the template was left behind. I see that this is 
fixed in trunk too. I was using a vendor distribution, which seems to still ship an 
older log4j.xml for Spark.

> Spark shell prints error when :4040 port already in use
> ---
>
> Key: SPARK-13451
> URL: https://issues.apache.org/jira/browse/SPARK-13451
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 1.6.0
>Reporter: Deenar Toraskar
>Priority: Minor
>
> When running multiple Spark processes on the same machine, an exception is 
> thrown when the port for the jetty server is already in use. This issue was 
> fixed previously, but has regressed because the jetty server used by Spark was 
> moved from the eclipse packages to the spark-project module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13451) Spark shell prints error when :4040 port already in use

2016-02-23 Thread Deenar Toraskar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deenar Toraskar closed SPARK-13451.
---
Resolution: Fixed

> Spark shell prints error when :4040 port already in use
> ---
>
> Key: SPARK-13451
> URL: https://issues.apache.org/jira/browse/SPARK-13451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.0
>Reporter: Deenar Toraskar
>Priority: Minor
>
> When running multiple Spark processes on the same machine, an exception is 
> thrown when the port for the jetty server is already in use. This issue was 
> fixed previously, but has regressed because the jetty server used by Spark was 
> moved from the eclipse packages to the spark-project module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13451) Spark shell prints error when :4040 port already in use

2016-02-23 Thread Deenar Toraskar (JIRA)
Deenar Toraskar created SPARK-13451:
---

 Summary: Spark shell prints error when :4040 port already in use
 Key: SPARK-13451
 URL: https://issues.apache.org/jira/browse/SPARK-13451
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.6.0
Reporter: Deenar Toraskar
Priority: Minor


When running multiple Spark processes on the same machine, an exception is thrown 
when the port for the jetty server is already in use. This issue was fixed 
previously, but has regressed because the jetty server used by Spark was moved from 
the eclipse packages to the spark-project module.
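
A possible workaround until this is fixed (sketch only; it relies on nothing beyond the standard spark.ui.port setting) is to give each additional process its own UI port up front, e.g. via --conf spark.ui.port=4041 on spark-shell, so the bind never collides and the message is never logged:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Workaround sketch: pin the second process to a free UI port (4041 and the app
// name are only example values) instead of letting it collide on 4040.
val conf = new SparkConf()
  .setAppName("second-app")
  .set("spark.ui.port", "4041")
val sc = new SparkContext(conf)
{code}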



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch

2016-02-01 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126976#comment-15126976
 ] 

Deenar Toraskar commented on SPARK-13101:
-

[~lian cheng] Thanks for the explanation, it makes sense now. But I would like to 
warn you that this is going to cause a lot of issues for people who are migrating 
from DataFrames to Datasets, given that Parquet is the most widely used format with 
Spark SQL. A better option would be to change the behaviour of the Parquet writer 
too. I would hate to use java primitives every time I want a non-nullable field in 
my model classes.

I guess the root cause is the decision in the Parquet writer to convert all 
non-nullable fields to nullable fields. I know there have been discussions 
about this before, but many times the nullability of the field has functional 
impact. 
>> Another tricky thing here is about Parquet. When writing Parquet files, all 
>> non-nullable fields are converted to nullable fields intentionally. This 
>> behavior is for better interoperability with Hive.

I think you should do what is correct.
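
One possible interim workaround (sketch only, using the Valuation class from this issue) is to decode the rows by hand instead of calling .as[Valuation] directly, which side-steps the ArrayType(DoubleType, false) vs ArrayType(DoubleType, true) cast check:

{code}
// Workaround sketch: build the Dataset from Rows rather than via df.as[Valuation].
import sqlContext.implicits._

val valuationsDS = sqlContext.table("valuations").rdd.map { r =>
  Valuation(
    tradeId          = r.getAs[String]("tradeId"),
    counterparty     = r.getAs[String]("counterparty"),
    nettingAgreement = r.getAs[String]("nettingAgreement"),
    wrongWay         = r.getAs[Boolean]("wrongWay"),
    valuations       = r.getAs[Seq[Double]]("valuations"),
    timeInterval     = r.getAs[Int]("timeInterval"),
    jobId            = r.getAs[String]("jobId"))
}.toDS
{code}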

> Dataset complex types mapping to DataFrame  (element nullability) mismatch
> --
>
> Key: SPARK-13101
> URL: https://issues.apache.org/jira/browse/SPARK-13101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Deenar Toraskar
>Priority: Blocker
>
> There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By 
> default a scala {{Seq\[Double\]}} is mapped by Spark as an ArrayType with 
> nullable element
> {noformat}
>  |-- valuations: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {noformat}
> This could be read back as a Dataset in Spark 1.6.0
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> {code}
> But with Spark 1.6.1 the same fails with
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as 
> array)' due to data type mismatch: cannot cast 
> ArrayType(DoubleType,true) to ArrayType(DoubleType,false);
> {code}
> Here are the classes I am using
> {code}
> case class Valuation(tradeId : String,
>  counterparty: String,
>  nettingAgreement: String,
>  wrongWay: Boolean,
>  valuations : Seq[Double], /* one per scenario */
>  timeInterval: Int,
>  jobId: String)  /* used for hdfs partitioning */
> val vals : Seq[Valuation] = Seq()
> val valsDF = sqlContext.sparkContext.parallelize(vals).toDF
> valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations")
> {code}
> even the following gives the same result
> {code}
> val valsDF = vals.toDS.toDF
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch

2016-02-01 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126982#comment-15126982
 ] 

Deenar Toraskar commented on SPARK-13101:
-

[~lian cheng] [~joshrosen] [~marmbrus]
Alternatively, could we have an option on the Parquet writer that turns the 
non-nullable-to-nullable conversion off? This would satisfy my requirements and 
would be cleaner too.
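
Something along these lines; the option name below is purely hypothetical and does not exist today:

{code}
// Purely hypothetical option name, sketching the requested behaviour; there is
// no such option in the Parquet writer at the moment.
valsDF.write
  .option("parquet.preserveNonNullable", "true")
  .partitionBy("jobId")
  .mode(SaveMode.Overwrite)
  .saveAsTable("valuations")
{code}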

>> Another tricky thing here is about Parquet. When writing Parquet files, all 
>> non-nullable fields are converted to nullable fields intentionally. This 
>> behavior is for better interoperability with Hive

> Dataset complex types mapping to DataFrame  (element nullability) mismatch
> --
>
> Key: SPARK-13101
> URL: https://issues.apache.org/jira/browse/SPARK-13101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Deenar Toraskar
>Priority: Blocker
>
> There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By 
> default a scala {{Seq\[Double\]}} is mapped by Spark as an ArrayType with 
> nullable element
> {noformat}
>  |-- valuations: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {noformat}
> This could be read back as a Dataset in Spark 1.6.0
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> {code}
> But with Spark 1.6.1 the same fails with
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as 
> array)' due to data type mismatch: cannot cast 
> ArrayType(DoubleType,true) to ArrayType(DoubleType,false);
> {code}
> Here are the classes I am using
> {code}
> case class Valuation(tradeId : String,
>  counterparty: String,
>  nettingAgreement: String,
>  wrongWay: Boolean,
>  valuations : Seq[Double], /* one per scenario */
>  timeInterval: Int,
>  jobId: String)  /* used for hdfs partitioning */
> val vals : Seq[Valuation] = Seq()
> val valsDF = sqlContext.sparkContext.parallelize(vals).toDF
> valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations")
> {code}
> even the following gives the same result
> {code}
> val valsDF = vals.toDS.toDF
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13094) No encoder implicits for Seq[Primitive]

2016-02-01 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127081#comment-15127081
 ] 

Deenar Toraskar commented on SPARK-13094:
-

[~marmbrus] thanks for looking into this. I look forward to the encoder. Currently 
my model classes are referenced everywhere, so wrapping the sequences isn't really 
an option for now. I also have UDAFs already in place; generics will save on code 
duplication, and the Aggregator interface is so much better than the UDAFs.
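
For reference, the shape I am hoping for is roughly the following, i.e. being able to bring a Seq[Float] encoder into scope where the implicits currently fall short (sketch only; whether ExpressionEncoder() copes with a top-level Seq[Float] is exactly what this ticket is about):

{code}
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Sketch only: an explicit encoder for the buffer/output type of arraySum,
// declared before calling .toColumn, instead of relying on sqlContext.implicits._
// which has no Seq[Float] encoder today.
implicit val seqFloatEncoder: Encoder[Seq[Float]] = ExpressionEncoder()
{code}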

> No encoder implicits for Seq[Primitive]
> ---
>
> Key: SPARK-13094
> URL: https://issues.apache.org/jira/browse/SPARK-13094
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Deenar Toraskar
>
> Dataset aggregators with complex types fail with "Unable to find encoder for type 
> stored in a Dataset", even though Datasets of these complex types are supported.
> {code}
> val arraySum = new Aggregator[Seq[Float], Seq[Float], Seq[Float]] with Serializable {
>   def zero: Seq[Float] = Nil
>   // The initial value.
>   def reduce(currentSum: Seq[Float], currentRow: Seq[Float]) =
> sumArray(currentSum, currentRow)
>   def merge(sum: Seq[Float], row: Seq[Float]) = sumArray(sum, row)
>   def finish(b: Seq[Float]) = b // Return the final result.
>   def sumArray(a: Seq[Float], b: Seq[Float]): Seq[Float] = {
> (a, b) match {
>   case (Nil, Nil) => Nil
>   case (Nil, row) => row
>   case (sum, Nil) => sum
>   case (sum, row) => (a, b).zipped.map { case (a, b) => a + b }
> }
>   }
> }.toColumn
> {code}
> {code}
> <console>:47: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing sqlContext.implicits._  Support for serializing other 
> types will be added in future releases.
>}.toColumn
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch

2016-01-30 Thread Deenar Toraskar (JIRA)
Deenar Toraskar created SPARK-13101:
---

 Summary: Dataset complex types mapping to DataFrame  (element 
nullability) mismatch
 Key: SPARK-13101
 URL: https://issues.apache.org/jira/browse/SPARK-13101
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
Reporter: Deenar Toraskar
 Fix For: 1.6.1


There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By 
default a scala Seq[Double] is mapped by Spark as an ArrayType with nullable 
element

 |-- valuations: array (nullable = true)
 ||-- element: double (containsNull = true)

This could be read back as a Dataset in Spark 1.6.0

val df = sqlContext.table("valuations").as[Valuation]

But with Spark 1.6.1 the same fails with
val df = sqlContext.table("valuations").as[Valuation]

org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as 
array)' due to data type mismatch: cannot cast 
ArrayType(DoubleType,true) to ArrayType(DoubleType,false);

Here are the classes I am using

case class Valuation(tradeId : String,
 counterparty: String,
 nettingAgreement: String,
 wrongWay: Boolean,
 valuations : Seq[Double], /* one per scenario */
 timeInterval: Int,
 jobId: String)  /* used for hdfs partitioning */

val vals : Seq[Valuation] = Seq()
val valsDF = sqlContext.sparkContext.parallelize(vals).toDF
valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations")

even the following gives the same result
val valsDF = vals.toDS.toDF




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13094) Dataset Aggregators do not work with complex types

2016-01-29 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124278#comment-15124278
 ] 

Deenar Toraskar commented on SPARK-13094:
-

[~marmbrus]
Still the same error on nightly snapshot build of 1.6.0 
http://people.apache.org/~pwendell/spark-nightly/spark-branch-1.6-bin/latest/spark-1.6.0-SNAPSHOT-bin-hadoop2.6.tgz


> Dataset Aggregators do not work with complex types
> --
>
> Key: SPARK-13094
> URL: https://issues.apache.org/jira/browse/SPARK-13094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Deenar Toraskar
>
> Dataset aggregators with complex types fail with "Unable to find encoder for type 
> stored in a Dataset", even though Datasets of these complex types are supported.
> val arraySum = new Aggregator[Seq[Float], Seq[Float], Seq[Float]] with Serializable {
>   def zero: Seq[Float] = Nil
>   // The initial value.
>   def reduce(currentSum: Seq[Float], currentRow: Seq[Float]) =
> sumArray(currentSum, currentRow)
>   def merge(sum: Seq[Float], row: Seq[Float]) = sumArray(sum, row)
>   def finish(b: Seq[Float]) = b // Return the final result.
>   def sumArray(a: Seq[Float], b: Seq[Float]): Seq[Float] = {
> (a, b) match {
>   case (Nil, Nil) => Nil
>   case (Nil, row) => row
>   case (sum, Nil) => sum
>   case (sum, row) => (a, b).zipped.map { case (a, b) => a + b }
> }
>   }
> }.toColumn
> <console>:47: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing sqlContext.implicits._  Support for serializing other 
> types will be added in future releases.
>}.toColumn



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13094) Dataset Aggregators do not work with complex types

2016-01-29 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124206#comment-15124206
 ] 

Deenar Toraskar commented on SPARK-13094:
-

Downloading it now, will update the JIRA after rerunning my code

> Dataset Aggregators do not work with complex types
> --
>
> Key: SPARK-13094
> URL: https://issues.apache.org/jira/browse/SPARK-13094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Deenar Toraskar
>
> Dataset aggregators with complex types fail with "Unable to find encoder for type 
> stored in a Dataset", even though Datasets of these complex types are supported.
> val arraySum = new Aggregator[Seq[Float], Seq[Float], Seq[Float]] with Serializable {
>   def zero: Seq[Float] = Nil
>   // The initial value.
>   def reduce(currentSum: Seq[Float], currentRow: Seq[Float]) =
> sumArray(currentSum, currentRow)
>   def merge(sum: Seq[Float], row: Seq[Float]) = sumArray(sum, row)
>   def finish(b: Seq[Float]) = b // Return the final result.
>   def sumArray(a: Seq[Float], b: Seq[Float]): Seq[Float] = {
> (a, b) match {
>   case (Nil, Nil) => Nil
>   case (Nil, row) => row
>   case (sum, Nil) => sum
>   case (sum, row) => (a, b).zipped.map { case (a, b) => a + b }
> }
>   }
> }.toColumn
> <console>:47: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing sqlContext.implicits._  Support for serializing other 
> types will be added in future releases.
>}.toColumn



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13094) Dataset Aggregators do not work with complex types

2016-01-29 Thread Deenar Toraskar (JIRA)
Deenar Toraskar created SPARK-13094:
---

 Summary: Dataset Aggregators do not work with complex types
 Key: SPARK-13094
 URL: https://issues.apache.org/jira/browse/SPARK-13094
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Deenar Toraskar


Dataset aggregators with complex types fail with "Unable to find encoder for type 
stored in a Dataset", even though Datasets of these complex types are supported.

val arraySum = new Aggregator[Seq[Float], Seq[Float], Seq[Float]] with Serializable {
  def zero: Seq[Float] = Nil
  // The initial value.
  def reduce(currentSum: Seq[Float], currentRow: Seq[Float]) =
sumArray(currentSum, currentRow)
  def merge(sum: Seq[Float], row: Seq[Float]) = sumArray(sum, row)
  def finish(b: Seq[Float]) = b // Return the final result.
  def sumArray(a: Seq[Float], b: Seq[Float]): Seq[Float] = {
(a, b) match {
  case (Nil, Nil) => Nil
  case (Nil, row) => row
  case (sum, Nil) => sum
  case (sum, row) => (a, b).zipped.map { case (a, b) => a + b }
}
  }
}.toColumn

<console>:47: error: Unable to find encoder for type stored in a Dataset.  
Primitive types (Int, String, etc) and Product types (case classes) are 
supported by importing sqlContext.implicits._  Support for serializing other 
types will be added in future releases.
   }.toColumn



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12809) Spark SQL UDF does not work with struct input parameters

2016-01-13 Thread Deenar Toraskar (JIRA)
Deenar Toraskar created SPARK-12809:
---

 Summary: Spark SQL UDF does not work with struct input parameters
 Key: SPARK-12809
 URL: https://issues.apache.org/jira/browse/SPARK-12809
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Deenar Toraskar


Spark SQL UDFs don't work with struct input parameters.

def testUDF(expectedExposures: (Float, Float)) = {
  expectedExposures._1 * expectedExposures._2 / expectedExposures._1
}

sqlContext.udf.register("testUDF", testUDF _)

sqlContext.sql("select testUDF(struct(noofmonths,ee)) from netExposureCpty")

The full stacktrace is given below

com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: 
org.apache.spark.sql.AnalysisException: cannot resolve 
'UDF(struct(noofmonths,ee))' due to data type mismatch: argument 1 requires 
struct<_1:float,_2:float> type, however, 'struct(noofmonths,ee)' is of 
struct type.; line 1 pos 33
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
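
A possible workaround (sketch only) is to pass the two fields as separate arguments rather than wrapping them in a struct:

{code}
// Workaround sketch: avoid the struct argument entirely and pass the fields directly.
def testUDF2(noofmonths: Float, ee: Float): Float =
  noofmonths * ee / noofmonths

sqlContext.udf.register("testUDF2", testUDF2 _)
sqlContext.sql("select testUDF2(noofmonths, ee) from netExposureCpty")
{code}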




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12358) Spark SQL query with lots of small tables under broadcast threshold leading to java.lang.OutOfMemoryError

2015-12-15 Thread Deenar Toraskar (JIRA)
Deenar Toraskar created SPARK-12358:
---

 Summary: Spark SQL query with lots of small tables under broadcast 
threshold leading to java.lang.OutOfMemoryError
 Key: SPARK-12358
 URL: https://issues.apache.org/jira/browse/SPARK-12358
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
Reporter: Deenar Toraskar


Hi

I have a Spark SQL query with a lot of small tables (five plus), all below the 
broadcast threshold. Looking at the query plan, Spark is broadcasting all these 
tables together without checking whether there is sufficient memory available. This 
leads to 

Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java heap space 

errors, which cause the executors to die and the query to fail.

I got around this issue by reducing spark.sql.autoBroadcastJoinThreshold so that the 
bigger tables in the query are no longer broadcast.
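
Roughly (sketch only; the 1 MB value is just an example):

{code}
// Workaround sketch: lower the per-table broadcast threshold so the larger of the
// small tables fall back to a shuffle join, or set it to -1 to disable broadcast
// joins for the session entirely.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (1024 * 1024).toString)
// sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}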

A fix would be to ensure that, in addition to the per-table threshold 
(spark.sql.autoBroadcastJoinThreshold), there is a cumulative broadcast threshold per 
query (say spark.sql.autoBroadcastJoinThresholdCumulative), so that only data up to 
that limit is broadcast, preventing executors from running out of memory.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391335#comment-14391335
 ] 

Deenar Toraskar commented on SPARK-6646:


maybe Spark 2.0 should be branded i-Spark

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6333) saveAsObjectFile support for compression codec

2015-03-14 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361797#comment-14361797
 ] 

Deenar Toraskar commented on SPARK-6333:


Hey Andrey

I already have a fix, will submit a pull request shortly.

 saveAsObjectFile support for compression codec
 --

 Key: SPARK-6333
 URL: https://issues.apache.org/jira/browse/SPARK-6333
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.1
Reporter: Deenar Toraskar
Priority: Minor

 saveAsObjectFile currently does not support a compression codec. This story is 
 about adding saveAsObjectFile(path, codec) support to Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6333) saveAsObjectFile support for compression codec

2015-03-14 Thread Deenar Toraskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361939#comment-14361939
 ] 

Deenar Toraskar commented on SPARK-6333:


That's what I have done, thanks Sean.

 saveAsObjectFile support for compression codec
 --

 Key: SPARK-6333
 URL: https://issues.apache.org/jira/browse/SPARK-6333
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.1
Reporter: Deenar Toraskar
Priority: Minor

 saveAsObjectFile currently does not support a compression codec. This story is 
 about adding saveAsObjectFile(path, codec) support to Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6333) saveAsObjectFile support for compression codec

2015-03-13 Thread Deenar Toraskar (JIRA)
Deenar Toraskar created SPARK-6333:
--

 Summary: saveAsObjectFile support for compression codec
 Key: SPARK-6333
 URL: https://issues.apache.org/jira/browse/SPARK-6333
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.1
Reporter: Deenar Toraskar
Priority: Minor


saveAsObjectFile currently does not support a compression codec. This story is 
about adding saveAsObjectFile(path, codec) support to Spark.
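
In the meantime, a sketch of the direction this could take (it mirrors what saveAsObjectFile does internally, but passes a codec through to saveAsSequenceFile; the helper name, batch size and codec are illustrative only):

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.reflect.ClassTag
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Sketch only: Java-serialize batches of records and save them as a SequenceFile
// with a compression codec, roughly what saveAsObjectFile(path) does today minus
// the codec parameter.
def saveAsCompressedObjectFile[T: ClassTag](rdd: RDD[T], path: String): Unit = {
  rdd.mapPartitions(_.grouped(10).map(_.toArray))
    .map { batch =>
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(batch)
      out.close()
      (NullWritable.get(), new BytesWritable(bytes.toByteArray))
    }
    .saveAsSequenceFile(path, Some(classOf[GzipCodec]))
}
{code}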





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org