[jira] [Commented] (SPARK-18939) Timezone support in partition values.

2016-12-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763378#comment-15763378
 ] 

Reynold Xin commented on SPARK-18939:
-

FWIW, it's probably unlikely that a timestamp is used as a partition column, but let's definitely support this.

I'm thinking we should also pass in a data source option "timezone" (defaulting to the session's timezone) and use that for CSV/JSON parsing and for file partition values.
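
A hypothetical sketch of how such an option might look from the user side; the option name and the reader-level API shown here are assumptions, not a settled design:

{code}
// Hypothetical usage of the proposed per-source option (names are assumptions).
val df = spark.read
  .option("timezone", "America/Los_Angeles") // would override the session default
  .json("/path/to/events")
{code}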


> Timezone support in partition values.
> -
>
> Key: SPARK-18939
> URL: https://issues.apache.org/jira/browse/SPARK-18939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Takuya Ueshin
>
> We should also use session local timezone to interpret partition values.






[jira] [Commented] (SPARK-18939) Timezone support in partition values.

2016-12-19 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763375#comment-15763375
 ] 

Takuya Ueshin commented on SPARK-18939:
---

Yes, I think so.

> Timezone support in partition values.
> -
>
> Key: SPARK-18939
> URL: https://issues.apache.org/jira/browse/SPARK-18939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Takuya Ueshin
>
> We should also use session local timezone to interpret partition values.






[jira] [Commented] (SPARK-18939) Timezone support in partition values.

2016-12-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763372#comment-15763372
 ] 

Reynold Xin commented on SPARK-18939:
-

This only impacts the timestamp data type, right?


> Timezone support in partition values.
> -
>
> Key: SPARK-18939
> URL: https://issues.apache.org/jira/browse/SPARK-18939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Takuya Ueshin
>
> We should also use session local timezone to interpret partition values.






[jira] [Created] (SPARK-18939) Timezone support in partition values.

2016-12-19 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-18939:
-

 Summary: Timezone support in partition values.
 Key: SPARK-18939
 URL: https://issues.apache.org/jira/browse/SPARK-18939
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Takuya Ueshin


We should also use session local timezone to interpret partition values.
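
As a rough illustration of what interpreting a partition value with a session-local timezone means, here is a minimal standalone sketch (plain Scala using java.time, not Spark code; the value format is an assumption). The same partition-directory string maps to different instants depending on the zone used:

{code}
import java.sql.Timestamp
import java.time.{LocalDateTime, ZoneId}
import java.time.format.DateTimeFormatter

// Hypothetical helper: parse a partition value like "2016-12-19 00:00:00"
// relative to a given session timezone.
def parsePartitionTimestamp(value: String, sessionTz: String): Timestamp = {
  val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  val local = LocalDateTime.parse(value, fmt)
  Timestamp.from(local.atZone(ZoneId.of(sessionTz)).toInstant)
}

parsePartitionTimestamp("2016-12-19 00:00:00", "UTC")
parsePartitionTimestamp("2016-12-19 00:00:00", "Asia/Tokyo") // a different instant, 9 hours earlier
{code}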






[jira] [Created] (SPARK-18938) Addition of peak memory usage metric for an executor

2016-12-19 Thread Suresh Bahuguna (JIRA)
Suresh Bahuguna created SPARK-18938:
---

 Summary: Addition of peak memory usage metric for an executor
 Key: SPARK-18938
 URL: https://issues.apache.org/jira/browse/SPARK-18938
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Suresh Bahuguna
Priority: Minor
 Fix For: 1.5.2


Addition of peak memory usage metric for an executor.






[jira] [Assigned] (SPARK-18936) Infrastructure for session local timezone support

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18936:


Assignee: Apache Spark  (was: Takuya Ueshin)

> Infrastructure for session local timezone support
> -
>
> Key: SPARK-18936
> URL: https://issues.apache.org/jira/browse/SPARK-18936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-18936) Infrastructure for session local timezone support

2016-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763343#comment-15763343
 ] 

Apache Spark commented on SPARK-18936:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16308

> Infrastructure for session local timezone support
> -
>
> Key: SPARK-18936
> URL: https://issues.apache.org/jira/browse/SPARK-18936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
>







[jira] [Assigned] (SPARK-18936) Infrastructure for session local timezone support

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18936:


Assignee: Takuya Ueshin  (was: Apache Spark)

> Infrastructure for session local timezone support
> -
>
> Key: SPARK-18936
> URL: https://issues.apache.org/jira/browse/SPARK-18936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
>







[jira] [Created] (SPARK-18937) Timezone support in CSV/JSON parsing

2016-12-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18937:
---

 Summary: Timezone support in CSV/JSON parsing
 Key: SPARK-18937
 URL: https://issues.apache.org/jira/browse/SPARK-18937
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin









[jira] [Created] (SPARK-18936) Infrastructure for session local timezone support

2016-12-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18936:
---

 Summary: Infrastructure for session local timezone support
 Key: SPARK-18936
 URL: https://issues.apache.org/jira/browse/SPARK-18936
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Takuya Ueshin









[jira] [Comment Edited] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found

2016-12-19 Thread Alok Bhandari (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763324#comment-15763324
 ] 

Alok Bhandari edited comment on SPARK-16473 at 12/20/16 5:38 AM:
-

[~imatiach], thanks for showing interest in this issue. I will try to share the dataset with you. Can you suggest where I should share it? Would sharing it through GitHub be fine?

Also, I have tried to diagnose this issue on my own. From my analysis it looks like it fails when it tries to bisect a node that does not have any children. I have also added a code fix, but I am not sure if this is the correct solution:

*Suggested solution*
{code:title=BisectingKMeans.scala}
  private def updateAssignments(
      assignments: RDD[(Long, VectorWithNorm)],
      divisibleIndices: Set[Long],
      newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = {
    assignments.map { case (index, v) =>
      if (divisibleIndices.contains(index)) {
        val children = Seq(leftChildIndex(index), rightChildIndex(index))
        if (children.length > 0) {
          val selected = children.minBy { child =>
            KMeans.fastSquaredDistance(newClusterCenters(child), v)
          }
          (selected, v)
        } else {
          (index, v)
        }
      } else {
        (index, v)
      }
    }
  }
{code}

*Original code*
{code:title=BisectingKMeans.scala}
  private def updateAssignments(
      assignments: RDD[(Long, VectorWithNorm)],
      divisibleIndices: Set[Long],
      newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = {
    assignments.map { case (index, v) =>
      if (divisibleIndices.contains(index)) {
        val children = Seq(leftChildIndex(index), rightChildIndex(index))
        val selected = children.minBy { child =>
          KMeans.fastSquaredDistance(newClusterCenters(child), v)
        }
        (selected, v)
      } else {
        (index, v)
      }
    }
  }
{code}


was (Author: alokob...@gmail.com):
[~imatiach], thanks for showing interest in this issue. I will try to share the dataset with you. Can you suggest where I should share it? Would sharing it through GitHub be fine?

Also, I have tried to diagnose this issue on my own. From my analysis it looks like it fails when it tries to bisect a node that does not have any children. I have also added a code fix, but I am not sure if this is the correct solution:

*Suggested solution*
{code:title=BisectingKMeans.scala}
  private def updateAssignments(
      assignments: RDD[(Long, VectorWithNorm)],
      divisibleIndices: Set[Long],
      newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = {
    assignments.map { case (index, v) =>
      if (divisibleIndices.contains(index)) {
        val children = Seq(leftChildIndex(index), rightChildIndex(index))
        if (children.length > 0) {
          val selected = children.minBy { child =>
            KMeans.fastSquaredDistance(newClusterCenters(child), v)
          }
          (selected, v)
        } else {
          (index, v)
        }
      } else {
        (index, v)
      }
    }
  }
{code}

*Original code*
{code:title=BisectingKMeans.scala}
  private def updateAssignments(
      assignments: RDD[(Long, VectorWithNorm)],
      divisibleIndices: Set[Long],
      newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = {
    assignments.map { case (index, v) =>
      if (divisibleIndices.contains(index)) {
        val children = Seq(leftChildIndex(index), rightChildIndex(index))
        val selected = children.minBy { child =>
          KMeans.fastSquaredDistance(newClusterCenters(child), v)
        }
        (selected, v)
      } else {
        (index, v)
      }
    }
  }
{code}

> BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key 
> not found
> --
>
> Key: SPARK-16473
> URL: https://issues.apache.org/jira/browse/SPARK-16473
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 2.0.0
> Environment: AWS EC2 linux instance. 
>Reporter: Alok Bhandari
>
> Hello,
> I am using Apache Spark 1.6.1.
> I am executing the bisecting k-means algorithm on a specific dataset.
> Dataset details:
> K=100,
> input vector = 100K*100k,
> memory assigned: 16 GB per node,
> number of nodes = 2.
> Up to K=75 it works fine, but when I set K=100 it fails with
> java.util.NoSuchElementException: key not found.
> *I suspect it is failing because of a lack of some resources, but the
> exception does not convey why this Spark job failed.*
> Can someone please point me to the root cause of this exception and why it is
> failing?
> This is the exception stack trace:
> {code}
> 

[jira] [Commented] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found

2016-12-19 Thread Alok Bhandari (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763324#comment-15763324
 ] 

Alok Bhandari commented on SPARK-16473:
---

[~imatiach], thanks for showing interest in this issue. I will try to share the dataset with you. Can you suggest where I should share it? Would sharing it through GitHub be fine?

Also, I have tried to diagnose this issue on my own. From my analysis it looks like it fails when it tries to bisect a node that does not have any children. I have also added a code fix, but I am not sure if this is the correct solution:

*Suggested solution*
{code:title=BisectingKMeans.scala}
  private def updateAssignments(
      assignments: RDD[(Long, VectorWithNorm)],
      divisibleIndices: Set[Long],
      newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = {
    assignments.map { case (index, v) =>
      if (divisibleIndices.contains(index)) {
        val children = Seq(leftChildIndex(index), rightChildIndex(index))
        if (children.length > 0) {
          val selected = children.minBy { child =>
            KMeans.fastSquaredDistance(newClusterCenters(child), v)
          }
          (selected, v)
        } else {
          (index, v)
        }
      } else {
        (index, v)
      }
    }
  }
{code}

*Original code*
{code:title=BisectingKMeans.scala}
  private def updateAssignments(
      assignments: RDD[(Long, VectorWithNorm)],
      divisibleIndices: Set[Long],
      newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = {
    assignments.map { case (index, v) =>
      if (divisibleIndices.contains(index)) {
        val children = Seq(leftChildIndex(index), rightChildIndex(index))
        val selected = children.minBy { child =>
          KMeans.fastSquaredDistance(newClusterCenters(child), v)
        }
        (selected, v)
      } else {
        (index, v)
      }
    }
  }
{code}
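
Note that {{Seq(leftChildIndex(index), rightChildIndex(index))}} always has exactly two elements, so the {{children.length > 0}} check above never filters anything; the lookup that actually throws (see the stack trace quoted below) is {{newClusterCenters(child)}}. A minimal sketch of a membership-based guard, for illustration only and not necessarily the fix that was eventually adopted:

{code}
// Illustrative sketch only: consider just the children whose centers exist.
val children = Seq(leftChildIndex(index), rightChildIndex(index))
  .filter(newClusterCenters.contains)
if (children.nonEmpty) {
  val selected = children.minBy { child =>
    KMeans.fastSquaredDistance(newClusterCenters(child), v)
  }
  (selected, v)
} else {
  (index, v)
}
{code}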

> BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key 
> not found
> --
>
> Key: SPARK-16473
> URL: https://issues.apache.org/jira/browse/SPARK-16473
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 2.0.0
> Environment: AWS EC2 linux instance. 
>Reporter: Alok Bhandari
>
> Hello,
> I am using Apache Spark 1.6.1.
> I am executing the bisecting k-means algorithm on a specific dataset.
> Dataset details:
> K=100,
> input vector = 100K*100k,
> memory assigned: 16 GB per node,
> number of nodes = 2.
> Up to K=75 it works fine, but when I set K=100 it fails with
> java.util.NoSuchElementException: key not found.
> *I suspect it is failing because of a lack of some resources, but the
> exception does not convey why this Spark job failed.*
> Can someone please point me to the root cause of this exception and why it is
> failing?
> This is the exception stack trace:
> {code}
> java.util.NoSuchElementException: key not found: 166 
> at scala.collection.MapLike$class.default(MapLike.scala:228) 
> at scala.collection.AbstractMap.default(Map.scala:58) 
> at scala.collection.MapLike$class.apply(MapLike.scala:141) 
> at scala.collection.AbstractMap.apply(Map.scala:58) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
>  
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>  
> at scala.collection.immutable.List.foldLeft(List.scala:84) 
> at 
> scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
>  
> at scala.collection.immutable.List.reduceLeft(List.scala:84) 
> at 
> scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231) 
> at scala.collection.AbstractTraversable.minBy(Traversable.scala:105) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
>  
> at 
> 

[jira] [Closed] (SPARK-17632) make console sink and other sinks work with 'recoverFromCheckpointLocation' option enabled

2016-12-19 Thread Chuanlei Ni (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chuanlei Ni closed SPARK-17632.
---
Resolution: Not A Problem

> make console sink and other sinks work with 'recoverFromCheckpointLocation' 
> option enabled
> --
>
> Key: SPARK-17632
> URL: https://issues.apache.org/jira/browse/SPARK-17632
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: Chuanlei Ni
>Priority: Trivial
>
> val (useTempCheckpointLocation, recoverFromCheckpointLocation) =
> if (source == "console") {
>   (true, false)
> } else {
>   (false, true)
> }
> I think the console sink should work with the 'recoverFromCheckpointLocation' option
> enabled. If the user specified a checkpointLocation, we may lose the processed
> data; however, that is acceptable since this sink is mainly used for testing.
> And the other sinks could work with 'useTempCheckpointLocation' enabled, which
> means replaying all the data in Kafka. If the user means to do that, it is OK.






[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2016-12-19 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763165#comment-15763165
 ] 

Felix Cheung commented on SPARK-18924:
--

Thank you for bringing this up. JVM<->R performance has been reported a few times and is definitely something I have been tracking, but I haven't gotten around to it.

I don't think rJava would work since it is GPLv2 licensed. Rcpp is also GPLv2/v3.

Strategically placed calls to C might be a way to go (cross-platform complications aside)? That seems to be the approach for a lot of R packages. I recall we have a JIRA on performance tests; do we have more of a breakdown of the time spent?


> Improve collect/createDataFrame performance in SparkR
> -
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR has its own SerDe for data serialization between JVM and R.
> The SerDe on the JVM side is implemented in:
> * 
> [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * 
> [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * 
> [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * 
> [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> The serialization between JVM and R suffers from huge storage and computation 
> overhead. For example, a short round trip of 1 million doubles surprisingly 
> took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(100)
>user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R 
> workflow is a use case we should pay attention to. SparkR will never be able 
> to cover all existing features from CRAN packages. It is also unnecessary for 
> Spark to do so because not all features need scalability. 
> Several factors contribute to the serialization overhead:
> 1. The SerDe in R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive type columns 
> in particular.
> 3. Some overhead in the serialization protocol/impl.
> 1) might have been discussed before because R packages like rJava existed before
> SparkR. I'm not sure whether we have a license issue in depending on those
> libraries. Another option is to switch to the low-level R C interface or Rcpp,
> which again might have license issues. I'm not an expert here. If we have to
> implement our own, there is still much room for improvement, discussed
> below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`,
> which collects rows to local and then constructs columns. However,
> * it ignores column types and results in boxing/unboxing overhead
> * it collects all objects to the driver and results in high GC pressure
> A relatively simple change is to implement specialized column builder based 
> on column types, primitive types in particular. We need to handle null/NA 
> values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read the entire vector 
> in a single method call. The speed seems reasonable (at the order of GB/s):
> {code}
> > x <- runif(1000) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>user  system elapsed
>   0.036   0.021   0.059
> > > system.time(y <- readBin(r, double(), 1000))
>user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But in 
> general, it should be feasible to obtain 20x or more performance gain.
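
To make the specialized column-builder idea above concrete, here is a minimal, hypothetical Scala sketch (not the SparkR SerDe; the names are assumptions). It splits a double column into null positions and a dense array of non-null values, so the dense part can be written in one pass and could be read back on the R side with {{readBin}} (endianness permitting):

{code}
import java.io.DataOutputStream
import scala.collection.mutable.ArrayBuffer

// Hypothetical specialized column for doubles with explicit null tracking.
case class DoubleColumn(size: Int, nullIndexes: Array[Int], notNullValues: Array[Double])

def buildDoubleColumn(values: Seq[java.lang.Double]): DoubleColumn = {
  val nulls = ArrayBuffer[Int]()
  val data = ArrayBuffer[Double]()
  values.zipWithIndex.foreach { case (v, i) =>
    if (v == null) nulls += i else data += v.doubleValue()
  }
  DoubleColumn(values.length, nulls.toArray, data.toArray)
}

// Write the column; the dense values go out as raw doubles without boxing.
def writeColumn(col: DoubleColumn, out: DataOutputStream): Unit = {
  out.writeInt(col.size)
  out.writeInt(col.nullIndexes.length)
  col.nullIndexes.foreach(out.writeInt)
  col.notNullValues.foreach(out.writeDouble)
}
{code}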






[jira] [Updated] (SPARK-18935) Use Mesos "Dynamic Reservation" resource for Spark

2016-12-19 Thread jackyoh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jackyoh updated SPARK-18935:

Affects Version/s: 2.0.1

> Use Mesos "Dynamic Reservation" resource for Spark
> --
>
> Key: SPARK-18935
> URL: https://issues.apache.org/jira/browse/SPARK-18935
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: jackyoh
>
> I'm running Spark on Apache Mesos.
> Please follow these steps to reproduce the issue:
> 1. First, run Mesos resource reserve:
> curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d 
> resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]'
>  -X POST http://192.168.1.118:5050/master/reserve
> 2. Then run spark-submit command:
> ./spark-submit --class org.apache.spark.examples.SparkPi --master 
> mesos://192.168.1.118:5050 --conf spark.mesos.role=spark  
> ../examples/jars/spark-examples_2.11-2.0.2.jar 1
> And the console will keep logging the same warning message as shown below:
> 16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources






[jira] [Resolved] (SPARK-18913) append to a table with special column names should work

2016-12-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18913.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> append to a table with special column names should work
> ---
>
> Key: SPARK-18913
> URL: https://issues.apache.org/jira/browse/SPARK-18913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.1, 2.2.0
>
>







[jira] [Created] (SPARK-18935) Use Mesos "Dynamic Reservation" resource for Spark

2016-12-19 Thread jackyoh (JIRA)
jackyoh created SPARK-18935:
---

 Summary: Use Mesos "Dynamic Reservation" resource for Spark
 Key: SPARK-18935
 URL: https://issues.apache.org/jira/browse/SPARK-18935
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.2, 2.0.0
Reporter: jackyoh


I'm running Spark on Apache Mesos.

Please follow these steps to reproduce the issue:

1. First, run Mesos resource reserve:

curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d 
resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]'
 -X POST http://192.168.1.118:5050/master/reserve

2. Then run spark-submit command:

./spark-submit --class org.apache.spark.examples.SparkPi --master 
mesos://192.168.1.118:5050 --conf spark.mesos.role=spark  
../examples/jars/spark-examples_2.11-2.0.2.jar 1


And the console will keep logging the same warning message as shown below:

16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources






[jira] [Resolved] (SPARK-18912) append to a non-file-based data source table should detect columns number mismatch

2016-12-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18912.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> append to a non-file-based data source table should detect columns number 
> mismatch
> --
>
> Key: SPARK-18912
> URL: https://issues.apache.org/jira/browse/SPARK-18912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.1, 2.2.0
>
>







[jira] [Updated] (SPARK-18899) append data to a bucketed table with mismatched bucketing should fail

2016-12-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18899:

Fix Version/s: 2.2.0
   2.1.1

> append data to a bucketed table with mismatched bucketing should fail
> -
>
> Key: SPARK-18899
> URL: https://issues.apache.org/jira/browse/SPARK-18899
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.1, 2.2.0
>
>







[jira] [Updated] (SPARK-18899) append data to a bucketed table with mismatched bucketing should fail

2016-12-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18899:

Target Version/s: 2.1.1  (was: 2.1.1, 2.2.0)

> append data to a bucketed table with mismatched bucketing should fail
> -
>
> Key: SPARK-18899
> URL: https://issues.apache.org/jira/browse/SPARK-18899
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.1, 2.2.0
>
>







[jira] [Resolved] (SPARK-18899) append data to a bucketed table with mismatched bucketing should fail

2016-12-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18899.
-
  Resolution: Fixed
Target Version/s: 2.1.1, 2.2.0  (was: 2.1.1)

> append data to a bucketed table with mismatched bucketing should fail
> -
>
> Key: SPARK-18899
> URL: https://issues.apache.org/jira/browse/SPARK-18899
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-17755) Master may ask a worker to launch an executor before the worker actually got the response of registration

2016-12-19 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763089#comment-15763089
 ] 

Shuai Lin commented on SPARK-17755:
---

A (sort-of) similar problem for coarse grained scheduler backends is reported 
in https://issues.apache.org/jira/browse/SPARK-18820 .

> Master may ask a worker to launch an executor before the worker actually got 
> the response of registration
> -
>
> Key: SPARK-17755
> URL: https://issues.apache.org/jira/browse/SPARK-17755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yin Huai
>Assignee: Shixiong Zhu
>
> I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in 
> memory, serialized, replicated}}. Its log shows that Spark master asked the 
> worker to launch an executor before the worker actually got the response of 
> registration. So, the master knew that the worker had been registered. But, 
> the worker did not know whether it itself had been registered.
> {code}
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker 
> localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor 
> app-20160930145353-/1 on worker worker-20160930145353-localhost-38262
> 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO 
> StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 
> on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores
> 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO 
> StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on 
> hostPort localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master 
> (spark://localhost:46460) attempted to launch executor.
> 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: 
> Successfully registered with master spark://localhost:46460
> {code}
> Then, it seems the worker did not launch any executor.






[jira] [Resolved] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources

2016-12-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-18761.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16189
[https://github.com/apache/spark/pull/16189]

> Uncancellable / unkillable tasks may starve jobs of resources
> 
>
> Key: SPARK-18761
> URL: https://issues.apache.org/jira/browse/SPARK-18761
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.2.0
>
>
> Spark's current task cancellation / task killing mechanism is "best effort" 
> in the sense that some tasks may not be interruptible and may not respond to 
> their "killed" flags being set. If a significant fraction of a cluster's task 
> slots are occupied by tasks that have been marked as killed but remain 
> running then this can lead to a situation where new jobs and tasks are 
> starved of resources because zombie tasks are holding resources.
> I propose to address this problem by introducing a "task reaper" mechanism in 
> executors to monitor tasks after they are marked for killing in order to 
> periodically re-attempt the task kill, capture and log stacktraces / warnings 
> if tasks do not exit in a timely manner, and, optionally, kill the entire 
> executor JVM if cancelled tasks cannot be killed within some timeout.
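
A minimal sketch of the reaper loop described above, assuming a handle on the task's thread; the class name, polling interval, and escalation policy are assumptions, not Spark's actual implementation:

{code}
// Editorial sketch of a "task reaper": re-attempt the kill, log stack traces,
// and optionally kill the executor JVM if the task never exits.
class TaskReaperSketch(taskThread: Thread, killTimeoutMs: Long, pollMs: Long = 1000L) extends Runnable {
  override def run(): Unit = {
    val deadline = System.currentTimeMillis() + killTimeoutMs
    while (taskThread.isAlive && System.currentTimeMillis() < deadline) {
      taskThread.interrupt()                           // periodically re-attempt the kill
      println(taskThread.getStackTrace.mkString("\n")) // capture where the task is stuck
      Thread.sleep(pollMs)
    }
    if (taskThread.isAlive) System.exit(1)             // optional escalation to a JVM kill
  }
}
{code}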






[jira] [Assigned] (SPARK-18934) Writing to dynamic partitions does not preserve sort order if spill occurs

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18934:


Assignee: (was: Apache Spark)

> Writing to dynamic partitions does not preserve sort order if spill occurs
> --
>
> Key: SPARK-18934
> URL: https://issues.apache.org/jira/browse/SPARK-18934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Junegunn Choi
>
> When writing to dynamic partitions, the task sorts the input data by the 
> partition key (also with bucket key if used), so that it can write to one 
> partition at a time using a single writer. And if spill occurs during the 
> process, {{UnsafeSorterSpillMerger}} is used to merge partial sequences of 
> data.
> However, the merge process only considers the partition key, so that the sort 
> order within a partition specified via {{sortWithinPartitions}} or {{SORT 
> BY}} is not preserved.
> We can reproduce the problem on Spark shell. Make sure to start shell in 
> local mode with small driver memory (e.g. 1G) so that spills occur.
> {code}
> // FileFormatWriter
> sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
>   .repartition(1, 'part).sortWithinPartitions("value")
>   .write.mode("overwrite").format("orc").partitionBy("part")
>   .saveAsTable("test_sort_within")
> spark.read.table("test_sort_within").show
> {code}
> {noformat}
> +---++
> |  value|part|
> +---++
> |  2|   0|
> |8388610|   0|
> |  4|   0|
> |8388612|   0|
> |  6|   0|
> |8388614|   0|
> |  8|   0|
> |8388616|   0|
> | 10|   0|
> |8388618|   0|
> | 12|   0|
> |8388620|   0|
> | 14|   0|
> |8388622|   0|
> | 16|   0|
> |8388624|   0|
> | 18|   0|
> |8388626|   0|
> | 20|   0|
> |8388628|   0|
> +---++
> {noformat}
> We can confirm the issue using an ORC dump.
> {noformat}
> > java -jar orc-tools-1.3.0-SNAPSHOT-uber.jar meta -d 
> > part=0/part-r-0-96c022f0-a173-40cc-b2e5-9d02fed4213e.snappy.orc | head 
> > -20
> {"value":2}
> {"value":8388610}
> {"value":4}
> {"value":8388612}
> {"value":6}
> {"value":8388614}
> {"value":8}
> {"value":8388616}
> {"value":10}
> {"value":8388618}
> {"value":12}
> {"value":8388620}
> {"value":14}
> {"value":8388622}
> {"value":16}
> {"value":8388624}
> {"value":18}
> {"value":8388626}
> {"value":20}
> {"value":8388628}
> {noformat}
> {{SparkHiveDynamicPartitionWriterContainer}} has the same problem.
> {code}
> // Insert into an existing Hive table with dynamic partitions
> //   CREATE TABLE TEST_SORT_WITHIN (VALUE INT) PARTITIONED BY (PART INT) 
> STORED AS ORC
> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
> sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
>   .repartition(1, 'part).sortWithinPartitions("value")
>   .write.mode("overwrite").insertInto("test_sort_within")
> spark.read.table("test_sort_within").show
> {code}
> I was able to fix the problem by appending a numeric index column to the 
> sorting key which effectively makes the sort stable. I'll create a pull 
> request on GitHub but since I'm not really familiar with the internals of 
> Spark, I'm not sure if my approach is valid or idiomatic. So please let me 
> know if there are better ways to handle this, or if you want to address the 
> issue differently.






[jira] [Commented] (SPARK-18934) Writing to dynamic partitions does not preserve sort order if spill occurs

2016-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763010#comment-15763010
 ] 

Apache Spark commented on SPARK-18934:
--

User 'junegunn' has created a pull request for this issue:
https://github.com/apache/spark/pull/16347

> Writing to dynamic partitions does not preserve sort order if spill occurs
> --
>
> Key: SPARK-18934
> URL: https://issues.apache.org/jira/browse/SPARK-18934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Junegunn Choi
>
> When writing to dynamic partitions, the task sorts the input data by the 
> partition key (also with bucket key if used), so that it can write to one 
> partition at a time using a single writer. And if spill occurs during the 
> process, {{UnsafeSorterSpillMerger}} is used to merge partial sequences of 
> data.
> However, the merge process only considers the partition key, so that the sort 
> order within a partition specified via {{sortWithinPartitions}} or {{SORT 
> BY}} is not preserved.
> We can reproduce the problem on Spark shell. Make sure to start shell in 
> local mode with small driver memory (e.g. 1G) so that spills occur.
> {code}
> // FileFormatWriter
> sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
>   .repartition(1, 'part).sortWithinPartitions("value")
>   .write.mode("overwrite").format("orc").partitionBy("part")
>   .saveAsTable("test_sort_within")
> spark.read.table("test_sort_within").show
> {code}
> {noformat}
> +---++
> |  value|part|
> +---++
> |  2|   0|
> |8388610|   0|
> |  4|   0|
> |8388612|   0|
> |  6|   0|
> |8388614|   0|
> |  8|   0|
> |8388616|   0|
> | 10|   0|
> |8388618|   0|
> | 12|   0|
> |8388620|   0|
> | 14|   0|
> |8388622|   0|
> | 16|   0|
> |8388624|   0|
> | 18|   0|
> |8388626|   0|
> | 20|   0|
> |8388628|   0|
> +---++
> {noformat}
> We can confirm the issue using an ORC dump.
> {noformat}
> > java -jar orc-tools-1.3.0-SNAPSHOT-uber.jar meta -d 
> > part=0/part-r-0-96c022f0-a173-40cc-b2e5-9d02fed4213e.snappy.orc | head 
> > -20
> {"value":2}
> {"value":8388610}
> {"value":4}
> {"value":8388612}
> {"value":6}
> {"value":8388614}
> {"value":8}
> {"value":8388616}
> {"value":10}
> {"value":8388618}
> {"value":12}
> {"value":8388620}
> {"value":14}
> {"value":8388622}
> {"value":16}
> {"value":8388624}
> {"value":18}
> {"value":8388626}
> {"value":20}
> {"value":8388628}
> {noformat}
> {{SparkHiveDynamicPartitionWriterContainer}} has the same problem.
> {code}
> // Insert into an existing Hive table with dynamic partitions
> //   CREATE TABLE TEST_SORT_WITHIN (VALUE INT) PARTITIONED BY (PART INT) 
> STORED AS ORC
> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
> sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
>   .repartition(1, 'part).sortWithinPartitions("value")
>   .write.mode("overwrite").insertInto("test_sort_within")
> spark.read.table("test_sort_within").show
> {code}
> I was able to fix the problem by appending a numeric index column to the 
> sorting key which effectively makes the sort stable. I'll create a pull 
> request on GitHub but since I'm not really familiar with the internals of 
> Spark, I'm not sure if my approach is valid or idiomatic. So please let me 
> know if there are better ways to handle this, or if you want to address the 
> issue differently.






[jira] [Assigned] (SPARK-18934) Writing to dynamic partitions does not preserve sort order if spill occurs

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18934:


Assignee: Apache Spark

> Writing to dynamic partitions does not preserve sort order if spill occurs
> --
>
> Key: SPARK-18934
> URL: https://issues.apache.org/jira/browse/SPARK-18934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Junegunn Choi
>Assignee: Apache Spark
>
> When writing to dynamic partitions, the task sorts the input data by the 
> partition key (also with bucket key if used), so that it can write to one 
> partition at a time using a single writer. And if spill occurs during the 
> process, {{UnsafeSorterSpillMerger}} is used to merge partial sequences of 
> data.
> However, the merge process only considers the partition key, so that the sort 
> order within a partition specified via {{sortWithinPartitions}} or {{SORT 
> BY}} is not preserved.
> We can reproduce the problem on Spark shell. Make sure to start shell in 
> local mode with small driver memory (e.g. 1G) so that spills occur.
> {code}
> // FileFormatWriter
> sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
>   .repartition(1, 'part).sortWithinPartitions("value")
>   .write.mode("overwrite").format("orc").partitionBy("part")
>   .saveAsTable("test_sort_within")
> spark.read.table("test_sort_within").show
> {code}
> {noformat}
> +---++
> |  value|part|
> +---++
> |  2|   0|
> |8388610|   0|
> |  4|   0|
> |8388612|   0|
> |  6|   0|
> |8388614|   0|
> |  8|   0|
> |8388616|   0|
> | 10|   0|
> |8388618|   0|
> | 12|   0|
> |8388620|   0|
> | 14|   0|
> |8388622|   0|
> | 16|   0|
> |8388624|   0|
> | 18|   0|
> |8388626|   0|
> | 20|   0|
> |8388628|   0|
> +---++
> {noformat}
> We can confirm the issue using an ORC dump.
> {noformat}
> > java -jar orc-tools-1.3.0-SNAPSHOT-uber.jar meta -d 
> > part=0/part-r-0-96c022f0-a173-40cc-b2e5-9d02fed4213e.snappy.orc | head 
> > -20
> {"value":2}
> {"value":8388610}
> {"value":4}
> {"value":8388612}
> {"value":6}
> {"value":8388614}
> {"value":8}
> {"value":8388616}
> {"value":10}
> {"value":8388618}
> {"value":12}
> {"value":8388620}
> {"value":14}
> {"value":8388622}
> {"value":16}
> {"value":8388624}
> {"value":18}
> {"value":8388626}
> {"value":20}
> {"value":8388628}
> {noformat}
> {{SparkHiveDynamicPartitionWriterContainer}} has the same problem.
> {code}
> // Insert into an existing Hive table with dynamic partitions
> //   CREATE TABLE TEST_SORT_WITHIN (VALUE INT) PARTITIONED BY (PART INT) 
> STORED AS ORC
> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
> sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
>   .repartition(1, 'part).sortWithinPartitions("value")
>   .write.mode("overwrite").insertInto("test_sort_within")
> spark.read.table("test_sort_within").show
> {code}
> I was able to fix the problem by appending a numeric index column to the 
> sorting key which effectively makes the sort stable. I'll create a pull 
> request on GitHub but since I'm not really familiar with the internals of 
> Spark, I'm not sure if my approach is valid or idiomatic. So please let me 
> know if there are better ways to handle this, or if you want to address the 
> issue differently.






[jira] [Created] (SPARK-18934) Writing to dynamic partitions does not preserve sort order if spill occurs

2016-12-19 Thread Junegunn Choi (JIRA)
Junegunn Choi created SPARK-18934:
-

 Summary: Writing to dynamic partitions does not preserve sort 
order if spill occurs
 Key: SPARK-18934
 URL: https://issues.apache.org/jira/browse/SPARK-18934
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Junegunn Choi


When writing to dynamic partitions, the task sorts the input data by the 
partition key (also with bucket key if used), so that it can write to one 
partition at a time using a single writer. And if spill occurs during the 
process, {{UnsafeSorterSpillMerger}} is used to merge partial sequences of data.

However, the merge process only considers the partition key, so that the sort 
order within a partition specified via {{sortWithinPartitions}} or {{SORT BY}} 
is not preserved.

We can reproduce the problem on Spark shell. Make sure to start shell in local 
mode with small driver memory (e.g. 1G) so that spills occur.

{code}
// FileFormatWriter
sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
  .repartition(1, 'part).sortWithinPartitions("value")
  .write.mode("overwrite").format("orc").partitionBy("part")
  .saveAsTable("test_sort_within")

spark.read.table("test_sort_within").show
{code}

{noformat}
+---++
|  value|part|
+---++
|  2|   0|
|8388610|   0|
|  4|   0|
|8388612|   0|
|  6|   0|
|8388614|   0|
|  8|   0|
|8388616|   0|
| 10|   0|
|8388618|   0|
| 12|   0|
|8388620|   0|
| 14|   0|
|8388622|   0|
| 16|   0|
|8388624|   0|
| 18|   0|
|8388626|   0|
| 20|   0|
|8388628|   0|
+---++
{noformat}

We can confirm the issue using an ORC dump.

{noformat}
> java -jar orc-tools-1.3.0-SNAPSHOT-uber.jar meta -d 
> part=0/part-r-0-96c022f0-a173-40cc-b2e5-9d02fed4213e.snappy.orc | head -20

{"value":2}
{"value":8388610}
{"value":4}
{"value":8388612}
{"value":6}
{"value":8388614}
{"value":8}
{"value":8388616}
{"value":10}
{"value":8388618}
{"value":12}
{"value":8388620}
{"value":14}
{"value":8388622}
{"value":16}
{"value":8388624}
{"value":18}
{"value":8388626}
{"value":20}
{"value":8388628}
{noformat}

{{SparkHiveDynamicPartitionWriterContainer}} has the same problem.

{code}
// Insert into an existing Hive table with dynamic partitions
//   CREATE TABLE TEST_SORT_WITHIN (VALUE INT) PARTITIONED BY (PART INT) STORED 
AS ORC
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
  .repartition(1, 'part).sortWithinPartitions("value")
  .write.mode("overwrite").insertInto("test_sort_within")

spark.read.table("test_sort_within").show
{code}

I was able to fix the problem by appending a numeric index column to the 
sorting key which effectively makes the sort stable. I'll create a pull request 
on GitHub but since I'm not really familiar with the internals of Spark, I'm 
not sure if my approach is valid or idiomatic. So please let me know if there 
are better ways to handle this, or if you want to address the issue differently.
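
To illustrate the index-tiebreaker idea in the last paragraph, here is a plain-Scala sketch (editorial illustration only, not the actual patch): appending the original row position to the sort key means rows that share a partition key can never be reordered relative to each other, which is exactly the stability the spill merge needs.

{code}
// (part, value) rows, already value-ordered within each part.
val rows = Vector((0, 2), (1, 1), (0, 4), (1, 3), (0, 6))

// Append the original index to the key; ties on `part` are broken by position.
val merged = rows.zipWithIndex
  .sortBy { case ((part, _), idx) => (part, idx) }
  .map { case (row, _) => row }
// merged == Vector((0,2), (0,4), (0,6), (1,1), (1,3))
{code}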







[jira] [Created] (SPARK-18933) Different log output between Terminal screen and stderr file

2016-12-19 Thread Sean Wong (JIRA)
Sean Wong created SPARK-18933:
-

 Summary: Different log output between Terminal screen and stderr 
file
 Key: SPARK-18933
 URL: https://issues.apache.org/jira/browse/SPARK-18933
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Documentation, Web UI
Affects Versions: 1.6.3
 Environment: Yarn mode and standalone mode
Reporter: Sean Wong


First of all, I use the default log4j.properties in the Spark conf/ directory.

But I found that the log output (e.g., INFO) differs between the terminal screen and the stderr file. Some INFO logs exist in both of them, and some INFO logs exist in only one of them. Why does this happen? Are the output logs supposed to be the same between the terminal screen and the stderr file?

Then I did a test. I modified the source code in SparkContext.scala and added one line of logging, logInfo("This is textFile"), in the textFile function. However, after running an application, I found the log "This is textFile" shown on the terminal screen but no such log in the stderr file. I am not sure whether this is a bug, so I hope you can answer this question. Thanks.






[jira] [Assigned] (SPARK-16654) UI Should show blacklisted executors & nodes

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16654:


Assignee: Apache Spark  (was: Jose Soltren)

> UI Should show blacklisted executors & nodes
> 
>
> Key: SPARK-16654
> URL: https://issues.apache.org/jira/browse/SPARK-16654
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>
> SPARK-8425 will add the ability to blacklist entire executors and nodes to 
> deal w/ faulty hardware.  However, without displaying it on the UI, it can be 
> hard to realize which executor is bad, and why tasks aren't getting scheduled 
> on certain executors.
> As a first step, we should just show nodes and executors that are blacklisted 
> for the entire application (no need to show blacklisting for tasks & stages).
> This should also ensure that blacklisting events get into the event logs for 
> the history server.






[jira] [Assigned] (SPARK-16654) UI Should show blacklisted executors & nodes

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16654:


Assignee: Jose Soltren  (was: Apache Spark)

> UI Should show blacklisted executors & nodes
> 
>
> Key: SPARK-16654
> URL: https://issues.apache.org/jira/browse/SPARK-16654
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>Assignee: Jose Soltren
>
> SPARK-8425 will add the ability to blacklist entire executors and nodes to 
> deal w/ faulty hardware.  However, without displaying it on the UI, it can be 
> hard to realize which executor is bad, and why tasks aren't getting scheduled 
> on certain executors.
> As a first step, we should just show nodes and executors that are blacklisted 
> for the entire application (no need to show blacklisting for tasks & stages).
> This should also ensure that blacklisting events get into the event logs for 
> the history server.






[jira] [Commented] (SPARK-16654) UI Should show blacklisted executors & nodes

2016-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762908#comment-15762908
 ] 

Apache Spark commented on SPARK-16654:
--

User 'jsoltren' has created a pull request for this issue:
https://github.com/apache/spark/pull/16346

> UI Should show blacklisted executors & nodes
> 
>
> Key: SPARK-16654
> URL: https://issues.apache.org/jira/browse/SPARK-16654
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>Assignee: Jose Soltren
>
> SPARK-8425 will add the ability to blacklist entire executors and nodes to 
> deal w/ faulty hardware.  However, without displaying it on the UI, it can be 
> hard to realize which executor is bad, and why tasks aren't getting scheduled 
> on certain executors.
> As a first step, we should just show nodes and executors that are blacklisted 
> for the entire application (no need to show blacklisting for tasks & stages).
> This should also ensure that blacklisting events get into the event logs for 
> the history server.






[jira] [Closed] (SPARK-18915) Return Nothing when Querying a Partitioned Data Source Table without Repairing it

2016-12-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-18915.
---
Resolution: Won't Fix

> Return Nothing when Querying a Partitioned Data Source Table without 
> Repairing it
> -
>
> Key: SPARK-18915
> URL: https://issues.apache.org/jira/browse/SPARK-18915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Priority: Critical
>
> In Spark 2.1, if we create a partitioned data source table given a specified
> path, it returns nothing when we try to query it. To get the data, we have to 
> manually issue a DDL to repair the table. 
> In Spark 2.0, it can return the data stored in the specified path, without 
> repairing the table.
> Below is the output of Spark 2.1. 
> {noformat}
> scala> spark.range(5).selectExpr("id as fieldOne", "id as 
> partCol").write.partitionBy("partCol").mode("overwrite").saveAsTable("test")
> [Stage 0:==>(3 + 5) / 
> 8]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
>   
>   
> scala> spark.sql("select * from test").show()
> ++---+
> |fieldOne|partCol|
> ++---+
> |   0|  0|
> |   1|  1|
> |   2|  2|
> |   3|  3|
> |   4|  4|
> ++---+
> scala> spark.sql("desc formatted test").show(50, false)
> ++--+---+
> |col_name|data_type   
>   |comment|
> ++--+---+
> |fieldOne|bigint  
>   |null   |
> |partCol |bigint  
>   |null   |
> |# Partition Information |
>   |   |
> |# col_name  |data_type   
>   |comment|
> |partCol |bigint  
>   |null   |
> ||
>   |   |
> |# Detailed Table Information|
>   |   |
> |Database:   |default 
>   |   |
> |Owner:  |xiaoli  
>   |   |
> |Create Time:|Sat Dec 17 17:46:24 PST 2016
>   |   |
> |Last Access Time:   |Wed Dec 31 16:00:00 PST 1969
>   |   |
> |Location:   
> |file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse/test|  
>  |
> |Table Type: |MANAGED 
>   |   |
> |Table Parameters:   |
>   |   |
> |  transient_lastDdlTime |1482025584  
>   |   |
> ||
>   |   |
> |# Storage Information   |
>   |   |
> |SerDe Library:  
> |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |  
>  |
> |InputFormat:
> |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |  
>  |
> |OutputFormat:   
> |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|  
>  |
> |Compressed: |No  
>   |   |
> |Storage Desc Parameters:|
>   |   |
> |  serialization.format  |1   
>   |   |
> |Partition Provider: |Catalog 
>   |   |
> 

[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-12-19 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762805#comment-15762805
 ] 

Barry Becker commented on SPARK-16845:
--

I found a workaround that allows me to avoid the 64 KB error, but it still runs much slower than I expected. I switched to using a single batch select statement instead of calls to withColumn in a loop.
Here is an example of what I did.
Old way:
{code}
stringCols.foreach(column => {  
  val qCol = col(column)
  datasetDf = datasetDf
.withColumn(column + CLEAN_SUFFIX, when(qCol.isNull, 
lit(MISSING)).otherwise(qCol))
})
{code}
New way:
{code}
val replaceStringNull = udf((s: String) => if (s == null) MISSING else s)
var newCols = datasetDf.columns.map(column =>
  if (stringCols.contains(column))
replaceStringNull(col(column)).as(column + CLEAN_SUFFIX)
  else col(column))
datasetDf = datasetDf.select(newCols:_*)
{code}
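
For what it's worth, the same batched select can probably be written without a 
Scala UDF at all; a minimal sketch building on the snippet above ({{MISSING}}, 
{{CLEAN_SUFFIX}}, {{stringCols}} and {{datasetDf}} as defined there), using the 
built-in {{coalesce}} in place of the UDF:
{code}
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Replace nulls in string columns with MISSING in a single select,
// keeping everything in built-in expressions instead of a ScalaUDF.
val cleanedCols = datasetDf.columns.map { column =>
  if (stringCols.contains(column))
    coalesce(col(column), lit(MISSING)).as(column + CLEAN_SUFFIX)
  else col(column)
}
datasetDf = datasetDf.select(cleanedCols: _*)
{code}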

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
> Attachments: error.txt.zip
>
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2016-12-19 Thread Roberto Mirizzi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762785#comment-15762785
 ] 

Roberto Mirizzi commented on SPARK-18492:
-

I would also like to understand if this error causes the query to be 
non-optimized, hence slower, or not.

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The reason that the 64KB JVM limit for code for a method is exceeded is 
> because the code generator is using an incredibly naive strategy.  It emits a 
> sequence like the one shown below for each of the 1,454 groups of variables 
> shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg));
> /* 12263 */   } catch (Exception e) {
> /* 12264 */ throw new 
> org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e);
> /* 12265 */   }
> /* 12266 */
> /* 12267 */   boolean project_isNull252 = project_result249 == null;
> /* 12268 */   ArrayData project_value252 = null;
> /* 12269 */   if (!project_isNull252) {
> /* 12270 */ project_value252 = project_result249;
> /* 

[jira] [Commented] (SPARK-18710) Add offset to GeneralizedLinearRegression models

2016-12-19 Thread Wayne Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762752#comment-15762752
 ] 

Wayne Zhang commented on SPARK-18710:
-

[~yanboliang] It seems that I would need to change the case class {{Instance}} to 
include offset... That could be disruptive if many other models also depend on 
this case class. Any suggestions regarding this?
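
One option might be to leave {{Instance}} alone and add a separate case class for 
the offset-aware path. A rough sketch only (class and field names here are 
placeholders, not an agreed design; the local {{Instance}} stands in for the 
{{private[ml]}} one):
{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// stand-in for org.apache.spark.ml.feature.Instance, which is private[ml]
case class Instance(label: Double, weight: Double, features: Vector)

// hypothetical offset-aware instance; existing models keep using Instance
case class OffsetInstance(label: Double, weight: Double, offset: Double, features: Vector) {
  def toInstance: Instance = Instance(label, weight, features)
}

val oi = OffsetInstance(label = 1.0, weight = 1.0, offset = math.log(2.0),
  features = Vectors.dense(0.5, 1.5))
{code}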

> Add offset to GeneralizedLinearRegression models
> 
>
> Key: SPARK-18710
> URL: https://issues.apache.org/jira/browse/SPARK-18710
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: features
>   Original Estimate: 10h
>  Remaining Estimate: 10h
>
> The current GeneralizedLinearRegression model does not support offset. The 
> offset can be useful to take into account exposure, or for testing the 
> incremental effect of new variables. While it is possible to use weights in the 
> current implementation to achieve the same effect as specifying an offset for 
> certain models, e.g., Poisson and Binomial with a log offset, it is desirable to 
> have the offset option work with more general cases, e.g., a negative offset or 
> an offset that is hard to specify using weights (e.g., an offset to the 
> probability rather than the odds in logistic regression).
> Effort would involve:
> * update regression class to support offsetCol
> * update IWLS to take into account of offset
> * add test case for offset
> I can start working on this if the community approves this feature. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-12-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762756#comment-15762756
 ] 

Michael Armbrust commented on SPARK-17344:
--

[KAFKA-4462] aims to give us backwards compatibility for clients, which will be 
great.  The fact that there is a long-term plan here makes me less allergic to 
the idea of copy/pasting the 0.10.x {{Source}} and porting it to 0.8.0 as an 
interim solution for those who can't upgrade yet.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation

2016-12-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18928.
---
   Resolution: Fixed
Fix Version/s: 2.1.1

> FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
> ---
>
> Key: SPARK-18928
> URL: https://issues.apache.org/jira/browse/SPARK-18928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.1.1
>
>
> Spark tasks respond to cancellation by checking 
> {{TaskContext.isInterrupted()}}, but this check is missing on a few critical 
> paths used in Spark SQL, including FileScanRDD, JDBCRDD, and 
> UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to 
> continue running and become zombies.
> Here's an example: first, create a giant text file. In my case, I just 
> concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig 
> file. Then, run a really slow query over that file and try to cancel it:
> {code}
> spark.read.text("/tmp/words").selectExpr("value + value + value").collect()
> {code}
> This will sit and churn at 100% CPU for a minute or two because the task 
> isn't checking the interrupted flag.
> The solution here is to add InterruptedIterator-style checks to a few 
> locations where they're currently missing in Spark SQL.
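
For illustration, a minimal sketch (not the actual patch) of the 
InterruptedIterator-style check the description refers to, built around the 
{{TaskContext.isInterrupted()}} flag mentioned above:
{code}
import org.apache.spark.{TaskContext, TaskKilledException}

// Wrap an iterator so cancellation is observed between rows instead of
// letting the task churn to completion after it has been killed.
class CancellableIterator[T](context: TaskContext, delegate: Iterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = {
    if (context != null && context.isInterrupted()) {
      throw new TaskKilledException
    }
    delegate.hasNext
  }
  override def next(): T = delegate.next()
}
{code}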



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation

2016-12-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18928:
--
Fix Version/s: 2.2.0

> FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
> ---
>
> Key: SPARK-18928
> URL: https://issues.apache.org/jira/browse/SPARK-18928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.1.1, 2.2.0
>
>
> Spark tasks respond to cancellation by checking 
> {{TaskContext.isInterrupted()}}, but this check is missing on a few critical 
> paths used in Spark SQL, including FileScanRDD, JDBCRDD, and 
> UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to 
> continue running and become zombies.
> Here's an example: first, create a giant text file. In my case, I just 
> concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig 
> file. Then, run a really slow query over that file and try to cancel it:
> {code}
> spark.read.text("/tmp/words").selectExpr("value + value + value").collect()
> {code}
> This will sit and churn at 100% CPU for a minute or two because the task 
> isn't checking the interrupted flag.
> The solution here is to add InterruptedIterator-style checks to a few 
> locations where they're currently missing in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17344:
-
Target Version/s: 2.1.1

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan

2016-12-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18908:
-
Target Version/s: 2.1.1

> It's hard for the user to see the failure if StreamExecution fails to create 
> the logical plan
> -
>
> Key: SPARK-18908
> URL: https://issues.apache.org/jira/browse/SPARK-18908
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
>
> If the logical plan fails to be created, e.g., because some Source options are 
> invalid, the user has no way to detect the failure in code. The only place 
> receiving this error is the thread's UncaughtExceptionHandler.
> This bug exists because logicalPlan is lazy: when we try to create a 
> StreamingQueryException to wrap the exception thrown while creating logicalPlan, 
> it evaluates logicalPlan again.
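
A toy illustration of that failure mode in plain Scala (no Spark classes; names 
are made up):
{code}
// A lazy val whose initializer throws is re-evaluated on the next access, so
// touching it again while building the wrapper exception just re-throws the
// original error and the wrapper is never constructed.
class QueryWithLazyPlan(build: () => String) {
  lazy val plan: String = build()

  def start(): Unit =
    try {
      plan // first evaluation throws
      ()
    } catch {
      case e: Exception =>
        // referencing `plan` here re-runs build(), which throws again before
        // the RuntimeException below can ever be created
        throw new RuntimeException("could not start query: " + plan, e)
    }
}

new QueryWithLazyPlan(() => sys.error("bad source option")).start()
{code}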



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18932) Partial aggregation for collect_set / collect_list

2016-12-19 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18932:


 Summary: Partial aggregation for collect_set / collect_list
 Key: SPARK-18932
 URL: https://issues.apache.org/jira/browse/SPARK-18932
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Structured Streaming
Reporter: Michael Armbrust


The lack of partial aggregation here is blocking us from using these in 
streaming.  It still won't be fast, but it would be nice to at least be able to 
use them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-12-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762719#comment-15762719
 ] 

Dongjoon Hyun commented on SPARK-16845:
---

Hi, all.
I removed the target version since 2.1.0 has passed the vote.

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
> Attachments: error.txt.zip
>
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-12-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16845:
--
Target Version/s:   (was: 2.1.0)

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
> Attachments: error.txt.zip
>
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18899) append data to a bucketed table with mismatched bucketing should fail

2016-12-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18899:
--
Target Version/s: 2.1.1  (was: 2.1.0)

> append data to a bucketed table with mismatched bucketing should fail
> -
>
> Key: SPARK-18899
> URL: https://issues.apache.org/jira/browse/SPARK-18899
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results

2016-12-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18894:
--
Target Version/s: 2.1.1  (was: 2.1.0)

> Event time watermark delay threshold specified in months or years gives 
> incorrect results
> -
>
> Key: SPARK-18894
> URL: https://issues.apache.org/jira/browse/SPARK-18894
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Internally we use CalendarInterval to parse the delay. Non-deterministic 
> intervals like "month" and "year" are handled in such a way that the generated 
> delay in milliseconds is 0 when delayThreshold is in months or years.
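
An assumed repro sketch (not taken from the ticket) of how such a delay would be 
specified; per the description, it ends up behaving like a zero delay:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

val spark = SparkSession.builder().master("local[*]").appName("watermark-demo").getOrCreate()

// Any streaming source works for the illustration; socket is just the simplest.
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()

// Behaves like withWatermark("eventTime", "0 seconds"), because the month
// component of the parsed CalendarInterval contributes 0 milliseconds.
val withDelay = lines
  .withColumn("eventTime", current_timestamp())
  .withWatermark("eventTime", "1 month")
{code}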



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18913) append to a table with special column names should work

2016-12-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18913:
--
Target Version/s: 2.1.1  (was: 2.1.0)

> append to a table with special column names should work
> ---
>
> Key: SPARK-18913
> URL: https://issues.apache.org/jira/browse/SPARK-18913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18912) append to a non-file-based data source table should detect columns number mismatch

2016-12-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18912:
--
Target Version/s: 2.1.1  (was: 2.1.0)

> append to a non-file-based data source table should detect columns number 
> mismatch
> --
>
> Key: SPARK-18912
> URL: https://issues.apache.org/jira/browse/SPARK-18912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18909) The error message in `ExpressionEncoder.toRow` and `fromRow` is too verbose

2016-12-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18909:
--
Target Version/s: 2.1.1  (was: 2.1.0)

> The error message in `ExpressionEncoder.toRow` and `fromRow` is too verbose
> ---
>
> Key: SPARK-18909
> URL: https://issues.apache.org/jira/browse/SPARK-18909
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Minor
>
> In `ExpressionEncoder.toRow` and `fromRow`, we will catch the exception and 
> put the treeString of serializer/deserializer expressions in the error 
> message. However, encoder can be very complex and the serializer/deserializer 
> expressions can be very large trees and blow up the log files(e.g. generate 
> over 500mb logs for this single error message.)
> We should simplify it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18931) Create empty staging directory in partitioned table on insert

2016-12-19 Thread Egor Pahomov (JIRA)
Egor Pahomov created SPARK-18931:


 Summary: Create empty staging directory in partitioned table on 
insert
 Key: SPARK-18931
 URL: https://issues.apache.org/jira/browse/SPARK-18931
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Egor Pahomov


CREATE TABLE temp.test_partitioning_4 (
  num string
 ) 
PARTITIONED BY (
  day string)
  stored as parquet

On every 

INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
select day, count(*) as num from 
hss.session where year=2016 and month=4 
group by day


a new directory 
".hive-staging_hive_2016-12-19_15-55-11_298_3412488541559534475-4" is created on 
HDFS. It's a big issue, because I insert every day, and a bunch of empty 
directories on HDFS is very bad for HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2016-12-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762706#comment-15762706
 ] 

Nicholas Chammas commented on SPARK-18492:
--

Yup, I'm seeing the same high-level behavior as you, [~roberto.mirizzi]. I get 
the 64 KB error and the dump of generated code; I also get the warning about 
codegen being disabled; but then everything appears to proceed normally. So I'm 
not sure what to make of the error.

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The reason that the 64KB JVM limit for code for a method is exceeded is 
> because the code generator is using an incredibly naive strategy.  It emits a 
> sequence like the one shown below for each of the 1,454 groups of variables 
> shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg));
> /* 12263 */   } catch (Exception e) {
> /* 12264 */ throw new 
> org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e);
> /* 12265 */   }
> /* 12266 */
> /* 12267 */   boolean project_isNull252 = project_result249 == 

[jira] [Assigned] (SPARK-17755) Master may ask a worker to launch an executor before the worker actually got the response of registration

2016-12-19 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-17755:


Assignee: Shixiong Zhu

> Master may ask a worker to launch an executor before the worker actually got 
> the response of registration
> -
>
> Key: SPARK-17755
> URL: https://issues.apache.org/jira/browse/SPARK-17755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yin Huai
>Assignee: Shixiong Zhu
>
> I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in 
> memory, serialized, replicated}}. Its log shows that Spark master asked the 
> worker to launch an executor before the worker actually got the response of 
> registration. So, the master knew that the worker had been registered. But, 
> the worker did not know if it itself had been registered. 
> {code}
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker 
> localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor 
> app-20160930145353-/1 on worker worker-20160930145353-localhost-38262
> 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO 
> StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 
> on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores
> 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO 
> StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on 
> hostPort localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master 
> (spark://localhost:46460) attempted to launch executor.
> 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: 
> Successfully registered with master spark://localhost:46460
> {code}
> Then, it seems the worker did not launch any executor. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2016-12-19 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762687#comment-15762687
 ] 

Hossein Falaki commented on SPARK-18924:


Would be good to think about this along with the efforts to have zero-copy data 
sharing between JVM and R. I think if we do that, a lot of the Ser/De problems 
in the data plane will go away.

> Improve collect/createDataFrame performance in SparkR
> -
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR has its own SerDe for data serialization between JVM and R.
> The SerDe on the JVM side is implemented in:
> * 
> [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * 
> [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * 
> [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * 
> [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> The serialization between JVM and R suffers from huge storage and computation 
> overhead. For example, a short round trip of 1 million doubles surprisingly 
> took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(100)
>user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R 
> workflow is a use case we should pay attention to. SparkR will never be able 
> to cover all existing features from CRAN packages. It is also unnecessary for 
> Spark to do so because not all features need scalability. 
> Several factors contribute to the serialization overhead:
> 1. The SerDe in R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive type columns 
> in particular.
> 3. Some overhead in the serialization protocol/impl.
> 1) might have been discussed before, because R packages like rJava existed 
> before SparkR. I'm not sure whether we would have a license issue in depending 
> on those libraries. Another option is to switch to R's low-level C interface or 
> Rcpp, which again might have license issues. I'm not an expert here. If we have 
> to implement our own, there is still much room for improvement, as discussed 
> below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
> which collects rows to local and then constructs columns. However,
> * it ignores column types, which results in boxing/unboxing overhead
> * it collects all objects to the driver, which results in high GC pressure
> A relatively simple change is to implement specialized column builder based 
> on column types, primitive types in particular. We need to handle null/NA 
> values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read the entire vector 
> in a single method call. The speed seems reasonable (at the order of GB/s):
> {code}
> > x <- runif(1000) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>user  system elapsed
>   0.036   0.021   0.059
> > > system.time(y <- readBin(r, double(), 1000))
>user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But in 
> general, it should be feasible to obtain 20x or more performance gain.
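
A rough sketch (not Spark code; names are illustrative) of the specialized column 
container proposed above, here for a nullable double column: nulls are tracked by 
index and only non-null values are kept in a primitive array.
{code}
import scala.collection.mutable.ArrayBuffer

class DoubleColumnBuilder {
  private val notNullValues = ArrayBuffer.empty[Double] // specialized storage
  private val nullIndexes = ArrayBuffer.empty[Int]      // positions of NA/null
  private var size = 0

  def append(v: java.lang.Double): Unit = {
    if (v == null) nullIndexes += size else notNullValues += v.doubleValue()
    size += 1
  }

  def result(): (Int, Array[Int], Array[Double]) =
    (size, nullIndexes.toArray, notNullValues.toArray)
}

val b = new DoubleColumnBuilder
Seq[java.lang.Double](1.5, null, 2.5).foreach(b.append)
val (n, nulls, values) = b.result()   // (3, Array(1), Array(1.5, 2.5))
{code}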



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18929) Add Tweedie distribution in GLM

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18929:


Assignee: (was: Apache Spark)

> Add Tweedie distribution in GLM
> ---
>
> Key: SPARK-18929
> URL: https://issues.apache.org/jira/browse/SPARK-18929
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>  Labels: features
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I propose to add the full Tweedie family into the GeneralizedLinearRegression 
> model. The Tweedie family is characterized by a power variance function. 
> Currently supported distributions such as Gaussian,  Poisson and Gamma 
> families are special cases of the 
> [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution]. 
> I propose to add support for the other distributions:
> * compound Poisson: 1 < variancePower < 2. This one is widely used to model 
> zero-inflated continuous distributions. 
> * positive stable: variancePower > 2 and variancePower != 3. Used to model 
> extreme values.
> * inverse Gaussian: variancePower = 3.
>  The Tweedie family is supported in most statistical packages such as R 
> (statmod), SAS, h2o etc. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18929) Add Tweedie distribution in GLM

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18929:


Assignee: Apache Spark

> Add Tweedie distribution in GLM
> ---
>
> Key: SPARK-18929
> URL: https://issues.apache.org/jira/browse/SPARK-18929
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Assignee: Apache Spark
>  Labels: features
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I propose to add the full Tweedie family into the GeneralizedLinearRegression 
> model. The Tweedie family is characterized by a power variance function. 
> Currently supported distributions such as Gaussian,  Poisson and Gamma 
> families are special cases of the 
> [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution]. 
> I propose to add support for the other distributions:
> * compound Poisson: 1 < variancePower < 2. This one is widely used to model 
> zero-inflated continuous distributions. 
> * positive stable: variancePower > 2 and variancePower != 3. Used to model 
> extreme values.
> * inverse Gaussian: variancePower = 3.
>  The Tweedie family is supported in most statistical packages such as R 
> (statmod), SAS, h2o etc. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17755) Master may ask a worker to launch an executor before the worker actually got the response of registration

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17755:


Assignee: Apache Spark

> Master may ask a worker to launch an executor before the worker actually got 
> the response of registration
> -
>
> Key: SPARK-17755
> URL: https://issues.apache.org/jira/browse/SPARK-17755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in 
> memory, serialized, replicated}}. Its log shows that Spark master asked the 
> worker to launch an executor before the worker actually got the response of 
> registration. So, the master knew that the worker had been registered. But, 
> the worker did not know if it itself had been registered. 
> {code}
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker 
> localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor 
> app-20160930145353-/1 on worker worker-20160930145353-localhost-38262
> 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO 
> StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 
> on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores
> 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO 
> StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on 
> hostPort localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master 
> (spark://localhost:46460) attempted to launch executor.
> 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: 
> Successfully registered with master spark://localhost:46460
> {code}
> Then, it seems the worker did not launch any executor. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18929) Add Tweedie distribution in GLM

2016-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762684#comment-15762684
 ] 

Apache Spark commented on SPARK-18929:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16344

> Add Tweedie distribution in GLM
> ---
>
> Key: SPARK-18929
> URL: https://issues.apache.org/jira/browse/SPARK-18929
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>  Labels: features
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I propose to add the full Tweedie family into the GeneralizedLinearRegression 
> model. The Tweedie family is characterized by a power variance function. 
> Currently supported distributions such as Gaussian,  Poisson and Gamma 
> families are special cases of the 
> [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution]. 
> I propose to add support for the other distributions:
> * compound Poisson: 1 < variancePower < 2. This one is widely used to model 
> zero-inflated continuous distributions. 
> * positive stable: variancePower > 2 and variancePower != 3. Used to model 
> extreme values.
> * inverse Gaussian: variancePower = 3.
>  The Tweedie family is supported in most statistical packages such as R 
> (statmod), SAS, h2o etc. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17755) Master may ask a worker to launch an executor before the worker actually got the response of registration

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17755:


Assignee: (was: Apache Spark)

> Master may ask a worker to launch an executor before the worker actually got 
> the response of registration
> -
>
> Key: SPARK-17755
> URL: https://issues.apache.org/jira/browse/SPARK-17755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yin Huai
>
> I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in 
> memory, serialized, replicated}}. Its log shows that Spark master asked the 
> worker to launch an executor before the worker actually got the response of 
> registration. So, the master knew that the worker had been registered. But, 
> the worker did not know if it itself had been registered. 
> {code}
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker 
> localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor 
> app-20160930145353-/1 on worker worker-20160930145353-localhost-38262
> 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO 
> StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 
> on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores
> 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO 
> StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on 
> hostPort localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master 
> (spark://localhost:46460) attempted to launch executor.
> 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: 
> Successfully registered with master spark://localhost:46460
> {code}
> Then, it seems the worker did not launch any executor. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17755) Master may ask a worker to launch an executor before the worker actually got the response of registration

2016-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762683#comment-15762683
 ] 

Apache Spark commented on SPARK-17755:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16345

> Master may ask a worker to launch an executor before the worker actually got 
> the response of registration
> -
>
> Key: SPARK-17755
> URL: https://issues.apache.org/jira/browse/SPARK-17755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yin Huai
>
> I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in 
> memory, serialized, replicated}}. Its log shows that Spark master asked the 
> worker to launch an executor before the worker actually got the response of 
> registration. So, the master knew that the worker had been registered. But, 
> the worker did not know if it itself had been registered. 
> {code}
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker 
> localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor 
> app-20160930145353-/1 on worker worker-20160930145353-localhost-38262
> 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO 
> StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 
> on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores
> 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO 
> StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on 
> hostPort localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master 
> (spark://localhost:46460) attempted to launch executor.
> 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: 
> Successfully registered with master spark://localhost:46460
> {code}
> Then, it seems the worker did not launch any executor. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.

2016-12-19 Thread Egor Pahomov (JIRA)
Egor Pahomov created SPARK-18930:


 Summary: Inserting in partitioned table - partitioned field should 
be last in select statement. 
 Key: SPARK-18930
 URL: https://issues.apache.org/jira/browse/SPARK-18930
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Egor Pahomov


CREATE TABLE temp.test_partitioning_4 (
  num string
 ) 
PARTITIONED BY (
  day string)
  stored as parquet


INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
select day, count(*) as num from 
hss.session where year=2016 and month=4 
group by day

Resulting directories on HDFS: /temp.db/test_partitioning_3/day=62456298, 
emp.db/test_partitioning_3/day=69094345

As you can imagine, these numbers are the counts of records rather than days. But 
when I do select * from temp.test_partitioning_4, the data is correct.
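
Assuming the positional mapping described above, a sketch of the reordered insert 
that lines the columns up (partition column last in the SELECT):
{code}
// num lines up with num, day lines up with the dynamic partition column day
spark.sql("""
  INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
  SELECT count(*) AS num, day
  FROM hss.session
  WHERE year = 2016 AND month = 4
  GROUP BY day
""")
{code}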



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10413) ML models should support prediction on single instances

2016-12-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10413:
--
Summary: ML models should support prediction on single instances  (was: 
Model should support prediction on single instance)

> ML models should support prediction on single instances
> ---
>
> Key: SPARK-10413
> URL: https://issues.apache.org/jira/browse/SPARK-10413
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> Currently models in the pipeline API only implement transform(DataFrame). It 
> would be quite useful to support prediction on a single instance.
> UPDATE: This issue is for making predictions with single models.  We can make 
> methods like {{def predict(features: Vector): Double}} public.
> * This issue is *not* for single-instance prediction for full Pipelines, 
> which would require making predictions on {{Row}}s.
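
For context, a sketch of the kind of single-instance call this would enable 
(hypothetical usage only; {{predict}} is currently protected on prediction models):
{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("single-instance").getOrCreate()
import spark.implicits._

// Tiny training set just to obtain a fitted model for the illustration.
val train = Seq((1.0, Vectors.dense(1.0, 0.5)), (0.0, Vectors.dense(-1.0, -0.5)))
  .toDF("label", "features")
val model = new LogisticRegression().fit(train)

// The call this umbrella proposes to make public (not available today):
// val score: Double = model.predict(Vectors.dense(0.2, 0.1))
{code}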



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15572) ML persistence in R format: compatibility with other languages

2016-12-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15572:
--
Summary: ML persistence in R format: compatibility with other languages  
(was: MLlib in R format: compatibility with other languages)

> ML persistence in R format: compatibility with other languages
> --
>
> Key: SPARK-15572
> URL: https://issues.apache.org/jira/browse/SPARK-15572
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Joseph K. Bradley
>
> Currently, models saved in R cannot be loaded easily into other languages.  
> This is because R saves extra metadata (feature names) alongside the model.  
> We should fix this issue so that models can be transferred seamlessly between 
> languages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18929) Add Tweedie distribution in GLM

2016-12-19 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-18929:
---

 Summary: Add Tweedie distribution in GLM
 Key: SPARK-18929
 URL: https://issues.apache.org/jira/browse/SPARK-18929
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.0.2
Reporter: Wayne Zhang


I propose to add the full Tweedie family into the GeneralizedLinearRegression 
model. The Tweedie family is characterized by a power variance function. 
Currently supported distributions such as Gaussian,  Poisson and Gamma families 
are special cases of the 
[Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution]. 

I propose to add support for the other distributions:
* compound Poisson: 1 < variancePower < 2. This one is widely used to model 
zero-inflated continuous distributions. 
* positive stable: variancePower > 2 and variancePower != 3. Used to model 
extreme values.
* inverse Gaussian: variancePower = 3.

 The Tweedie family is supported in most statistical packages such as R 
(statmod), SAS, h2o etc. 
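
Concretely, the family is characterized by the power variance function 
{{Var(Y) = phi * mu^variancePower}}; variancePower = 0, 1, 2 and 3 recover the 
Gaussian, Poisson, Gamma and inverse Gaussian cases mentioned above.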





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2016-12-19 Thread Roberto Mirizzi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762628#comment-15762628
 ] 

Roberto Mirizzi edited comment on SPARK-18492 at 12/19/16 11:32 PM:


I'm having exactly the same issue on Spark 2.0.2. I had it also on Spark 2.0.0 
and Spark 2.0.1.
My exception is:

{{JaninoRuntimeException: Code of method "()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB}}

It happens when I try to do many transformations on a given dataset, like 
multiple {{.select(...).select(...)}} with multiple operations inside the 
select (like {{when(...).otherwise(...)}}, or {{regexp_extract}}).

The exception outputs about 15k lines of Java code, like:

{code:java}
16/12/19 07:16:43 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends 
org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private boolean agg_initAgg;
/* 008 */   private org.apache.spark.sql.execution.aggregate.HashAggregateExec 
agg_plan;
/* 009 */   private 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap;
/* 010 */   private org.apache.spark.sql.execution.UnsafeKVExternalSorter 
agg_sorter;
/* 011 */   private org.apache.spark.unsafe.KVIterator agg_mapIter;
/* 012 */   private org.apache.spark.sql.execution.metric.SQLMetric 
agg_peakMemory;
/* 013 */   private org.apache.spark.sql.execution.metric.SQLMetric 
agg_spillSize;
/* 014 */   private scala.collection.Iterator inputadapter_input;
/* 015 */   private UnsafeRow agg_result;
/* 016 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
/* 017 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter;
/* 018 */   private UTF8String agg_lastRegex;
{code}

However, it looks like the execution continues successfully.

This is part of the stack trace:
{code:java}
org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB
at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
at org.codehaus.janino.CodeContext.writeShort(CodeContext.java:959)
at 
org.codehaus.janino.UnitCompiler.writeConstantClassInfo(UnitCompiler.java:10274)
at 
org.codehaus.janino.UnitCompiler.tryNarrowingReferenceConversion(UnitCompiler.java:9725)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3833)
at org.codehaus.janino.UnitCompiler.access$6400(UnitCompiler.java:185)
at org.codehaus.janino.UnitCompiler$10.visitCast(UnitCompiler.java:3258)
at org.codehaus.janino.Java$Cast.accept(Java.java:3802)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3868)
at org.codehaus.janino.UnitCompiler.access$8600(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$10.visitParenthesizedExpression(UnitCompiler.java:3286)
{code}

And after that I get a warning:
{code:java}
16/12/19 07:16:45 WARN WholeStageCodegenExec: Whole-stage codegen disabled for 
this plan:
 *HashAggregate(keys=...
{code}


was (Author: roberto.mirizzi):
I'm having exactly the same issue on Spark 2.0.2. I had it also on Spark 2.0.0 
and Spark 2.0.1.
My exception is:

{{JaninoRuntimeException: Code of method "()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB}}

It happens when I try to do many transformations on a given dataset, like 
multiple {{.select(...).select(...)}} with multiple operations inside the 
select (like {{when(...).otherwise(...)}}, or {{regexp_extract}}).

The exception outputs about 15k lines of Java code, like:

{code:java}
16/12/19 07:16:43 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends 
org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private boolean agg_initAgg;
/* 008 */   private 

[jira] [Comment Edited] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2016-12-19 Thread Roberto Mirizzi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762628#comment-15762628
 ] 

Roberto Mirizzi edited comment on SPARK-18492 at 12/19/16 11:29 PM:


I'm having exactly the same issue on Spark 2.0.2. I had it also on Spark 2.0.0 
and Spark 2.0.1.
My exception is:

{{JaninoRuntimeException: Code of method "()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB}}

It happens when I try to do many transformations on a given dataset, like 
multiple {{.select(...).select(...)}} with multiple operations inside the 
select (like {{when(...).otherwise(...)}}, or {{regexp_extract}}).

The exception outputs about 15k lines of Java code, like:

{code:java}
16/12/19 07:16:43 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends 
org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private boolean agg_initAgg;
/* 008 */   private org.apache.spark.sql.execution.aggregate.HashAggregateExec 
agg_plan;
/* 009 */   private 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap;
/* 010 */   private org.apache.spark.sql.execution.UnsafeKVExternalSorter 
agg_sorter;
/* 011 */   private org.apache.spark.unsafe.KVIterator agg_mapIter;
/* 012 */   private org.apache.spark.sql.execution.metric.SQLMetric 
agg_peakMemory;
/* 013 */   private org.apache.spark.sql.execution.metric.SQLMetric 
agg_spillSize;
/* 014 */   private scala.collection.Iterator inputadapter_input;
/* 015 */   private UnsafeRow agg_result;
/* 016 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
/* 017 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter;
/* 018 */   private UTF8String agg_lastRegex;
{code}

However, it looks like the execution continues successfully.



was (Author: roberto.mirizzi):
I'm having exactly the same issue on Spark 2.0.2. I had it also on Spark 2.0.0 
and Spark 2.0.1.
My exception is:

bq. JaninoRuntimeException: Code of method "()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB

It happens when I try to do many transformations on a given dataset, like 
multiple {{.select(...).select(...)}} with multiple operations inside the 
select (like {{when(...).otherwise(...)}}, or {{regexp_extract}}).

The exception outputs about 15k lines of Java code, like:

{code:java}
16/12/19 07:16:43 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends 
org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private boolean agg_initAgg;
/* 008 */   private org.apache.spark.sql.execution.aggregate.HashAggregateExec 
agg_plan;
/* 009 */   private 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap;
/* 010 */   private org.apache.spark.sql.execution.UnsafeKVExternalSorter 
agg_sorter;
/* 011 */   private org.apache.spark.unsafe.KVIterator agg_mapIter;
/* 012 */   private org.apache.spark.sql.execution.metric.SQLMetric 
agg_peakMemory;
/* 013 */   private org.apache.spark.sql.execution.metric.SQLMetric 
agg_spillSize;
/* 014 */   private scala.collection.Iterator inputadapter_input;
/* 015 */   private UnsafeRow agg_result;
/* 016 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
/* 017 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter;
/* 018 */   private UTF8String agg_lastRegex;
{code}

However, it looks like the execution continues successfully.


> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> 

[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2016-12-19 Thread Roberto Mirizzi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762628#comment-15762628
 ] 

Roberto Mirizzi commented on SPARK-18492:
-

I'm having exactly the same issue on Spark 2.0.2. I had it also on Spark 2.0.0 
and Spark 2.0.1.
My exception is:

bq. JaninoRuntimeException: Code of method "()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB

It happens when I try to do many transformations on a given dataset, like 
multiple {{.select(...).select(...)}} with multiple operations inside the 
select (like {{when(...).otherwise(...)}}, or {{regexp_extract}}).

The exception outputs about 15k lines of Java code, like:

{code:java}
16/12/19 07:16:43 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends 
org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private boolean agg_initAgg;
/* 008 */   private org.apache.spark.sql.execution.aggregate.HashAggregateExec 
agg_plan;
/* 009 */   private 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap;
/* 010 */   private org.apache.spark.sql.execution.UnsafeKVExternalSorter 
agg_sorter;
/* 011 */   private org.apache.spark.unsafe.KVIterator agg_mapIter;
/* 012 */   private org.apache.spark.sql.execution.metric.SQLMetric 
agg_peakMemory;
/* 013 */   private org.apache.spark.sql.execution.metric.SQLMetric 
agg_spillSize;
/* 014 */   private scala.collection.Iterator inputadapter_input;
/* 015 */   private UnsafeRow agg_result;
/* 016 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
/* 017 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter;
/* 018 */   private UTF8String agg_lastRegex;
{code}

However, it looks like the execution continues successfully.


> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The reason that the 64KB JVM limit for code for a method is exceeded is 
> because the code generator is using an incredibly naive strategy.  It emits a 
> sequence like the one shown below for each of the 1,454 groups of variables 
> shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> 

[jira] [Issue Comment Deleted] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2016-12-19 Thread Roberto Mirizzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roberto Mirizzi updated SPARK-18492:

Comment: was deleted

(was: I'm having exactly the same issue. My exception is:

[JaninoRuntimeException: Code of method "()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB]



)

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The reason that the 64KB JVM limit for code for a method is exceeded is 
> because the code generator is using an incredibly naive strategy.  It emits a 
> sequence like the one shown below for each of the 1,454 groups of variables 
> shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg));
> /* 12263 */   } catch (Exception e) {
> /* 12264 */ throw new 
> org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e);
> /* 12265 */   }
> /* 12266 */
> /* 12267 */   boolean project_isNull252 = project_result249 == null;
> /* 12268 */   ArrayData project_value252 = null;
> /* 12269 */

[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2016-12-19 Thread Roberto Mirizzi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762609#comment-15762609
 ] 

Roberto Mirizzi commented on SPARK-18492:
-

I'm having exactly the same issue. My exception is:

[JaninoRuntimeException: Code of method "()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB]





> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The reason that the 64KB JVM limit for code for a method is exceeded is 
> because the code generator is using an incredibly naive strategy.  It emits a 
> sequence like the one shown below for each of the 1,454 groups of variables 
> shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg));
> /* 12263 */   } catch (Exception e) {
> /* 12264 */ throw new 
> org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e);
> /* 12265 */   }
> /* 12266 */
> /* 12267 */   boolean project_isNull252 = project_result249 == null;
> /* 12268 */   ArrayData project_value252 = null;
> /* 

[jira] [Commented] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky

2016-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762559#comment-15762559
 ] 

Apache Spark commented on SPARK-18588:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16282

> KafkaSourceStressForDontFailOnDataLossSuite is flaky
> 
>
> Key: SPARK-18588
> URL: https://issues.apache.org/jira/browse/SPARK-18588
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Herman van Hovell
>Assignee: Shixiong Zhu
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite_name=stress+test+for+failOnDataLoss%3Dfalse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18588:


Assignee: Shixiong Zhu  (was: Apache Spark)

> KafkaSourceStressForDontFailOnDataLossSuite is flaky
> 
>
> Key: SPARK-18588
> URL: https://issues.apache.org/jira/browse/SPARK-18588
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Herman van Hovell
>Assignee: Shixiong Zhu
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite_name=stress+test+for+failOnDataLoss%3Dfalse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18588:


Assignee: Apache Spark  (was: Shixiong Zhu)

> KafkaSourceStressForDontFailOnDataLossSuite is flaky
> 
>
> Key: SPARK-18588
> URL: https://issues.apache.org/jira/browse/SPARK-18588
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite_name=stress+test+for+failOnDataLoss%3Dfalse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit

2016-12-19 Thread Ed Tyrrill (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762555#comment-15762555
 ] 

Ed Tyrrill commented on SPARK-15544:


I'm going to add that this is very easy to reproduce.  It will happen reliably 
if you shut down the zookeeper node that is currently the leader.  I configured 
systemd to automatically restart the spark master, and while the spark master 
process starts, the spark master on all three nodes doesn't really work and 
continually tries to reconnect to zookeeper until I bring the shut-down 
zookeeper node back up.  Spark should be able to work with two of the three 
zookeeper nodes, but instead it logs messages like this repeatedly every couple 
of seconds on all three spark master nodes until I bring back up the one 
zookeeper node that I shut down, zk02:

2016-12-19 14:31:10.175 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - 
Opening socket connection to server zk01/10.0.xx.xx:. Will not attempt to 
authenticate using SASL (unknown error)
2016-12-19 14:31:10.176 INFO org.apache.zookeeper.ClientCnxn.primeConnection - 
Socket connection established to zk01/10.0.xx.xx:, initiating session
2016-12-19 14:31:10.177 INFO org.apache.zookeeper.ClientCnxn.run - Unable to 
read additional data from server sessionid 0x0, likely server has closed 
socket, closing socket connection and attempting reconnect
2016-12-19 14:31:10.724 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - 
Opening socket connection to server zk02/10.0.xx.xx:. Will not attempt to 
authenticate using SASL (unknown error)
2016-12-19 14:31:10.725 WARN org.apache.zookeeper.ClientCnxn.run - Session 0x0 
for server null, unexpected error, closing socket connection and attempting 
reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-12-19 14:31:10.828 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - 
Opening socket connection to server zk03/10.0.xx.xx:. Will not attempt to 
authenticate using SASL (unknown error)
2016-12-19 14:31:10.830 INFO org.apache.zookeeper.ClientCnxn.primeConnection - 
Socket connection established to zk03/10.0.xx.xx:, initiating session

Zookeeper itself has selected a new leader, and Kafka, which also uses 
zookeeper, doesn't have any trouble during this time.  It is also important to 
note that if you shut down a non-leader zookeeper node, spark doesn't have any 
trouble either.

> Bouncing Zookeeper node causes Active spark master to exit
> --
>
> Key: SPARK-15544
> URL: https://issues.apache.org/jira/browse/SPARK-15544
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: Ubuntu 14.04.  Zookeeper 3.4.6 with 3-node quorum
>Reporter: Steven Lowenthal
>
> Shutting down a single zookeeper node caused the spark master to exit.  The 
> master should have connected to a second zookeeper node. 
> {code:title=log output}
> 16/05/25 18:21:28 INFO master.Master: Launching executor 
> app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138
> 16/05/25 18:21:28 INFO master.Master: Launching executor 
> app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data 
> from server sessionid 0x154dfc0426b0054, likely server has closed socket, 
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data 
> from server sessionid 0x254c701f28d0053, likely server has closed socket, 
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost 
> leadership
> 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master 
> shutting down. }}
> {code}
> spark-env.sh: 
> {code:title=spark-env.sh}
> export SPARK_LOCAL_DIRS=/ephemeral/spark/local
> export SPARK_WORKER_DIR=/ephemeral/spark/work
> export SPARK_LOG_DIR=/var/log/spark
> export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER 
> -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
> export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
> {code}



--
This message was sent by Atlassian 

[jira] [Resolved] (SPARK-18836) Serialize Task Metrics once per stage

2016-12-19 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-18836.

   Resolution: Fixed
 Assignee: Shivaram Venkataraman
Fix Version/s: 1.3.0

> Serialize Task Metrics once per stage
> -
>
> Key: SPARK-18836
> URL: https://issues.apache.org/jira/browse/SPARK-18836
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 1.3.0
>
>
> Right now we serialize the empty task metrics once per task. Since this is 
> shared across all tasks, we could use the same serialized task metrics across 
> all tasks of a stage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2016-12-19 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762548#comment-15762548
 ] 

Shivaram Venkataraman commented on SPARK-18924:
---

This is a good thing to investigate. Just to provide some historical context, 
the functions in serialize.R and deserialize.R were primarily designed to 
enable function calls between the JVM and R, and serialization performance is 
less critical there as it's mostly just function names, arguments, etc.

For the data path we were originally using R's own serializer and deserializer, 
but that doesn't work if we want to parse the data in the JVM. So the whole 
dfToCols was a retrofit to make things work.

In terms of design options:
- I think removing the boxing / unboxing overheads and making `readIntArray` or 
`readStringArray` in R more efficient would be a good starting point
- In terms of using other packages - there are licensing questions and also 
usability questions. So far users mostly don't require any extra R package to 
use SparkR, and hence we are compatible across a bunch of R versions, etc. So I 
think we should first look at how we can make our existing 
architecture better
- If the bottleneck after the above changes is due to R function call overhead, 
we can explore writing a C module (similar to our old hashCode implementation). 
While this has fewer complications in terms of licensing, version matches, 
etc., there is still some complexity in how we build and distribute this in a 
binary package.
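
As a concrete starting point, here is a minimal sketch of the specialized column builder proposed in the issue description below (size / nullIndexes / notNullValues); the class and method names are hypothetical, not existing SparkR internals:

{code}
import scala.collection.mutable.ArrayBuffer

// Hypothetical double-column builder: primitive values go into a specialized
// array and null/NA positions are tracked by index, so no per-value boxing.
class DoubleColumnBuilder {
  private val notNullValues = ArrayBuffer.empty[Double]
  private val nullIndexes = ArrayBuffer.empty[Int]
  private var size = 0

  def append(v: java.lang.Double): Unit = {
    if (v == null) nullIndexes += size else notNullValues += v.doubleValue()
    size += 1
  }

  // The three fields from the proposed data structure.
  def result(): (Int, Array[Int], Array[Double]) =
    (size, nullIndexes.toArray, notNullValues.toArray)
}
{code}

The notNullValues array could then be written out in one shot and read back on the R side with a single readBin call, as the description suggests.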

> Improve collect/createDataFrame performance in SparkR
> -
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR has its own SerDe for data serialization between JVM and R.
> The SerDe on the JVM side is implemented in:
> * 
> [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * 
> [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * 
> [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * 
> [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> The serialization between JVM and R suffers from huge storage and computation 
> overhead. For example, a short round trip of 1 million doubles surprisingly 
> took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(1e6)))))
>user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R 
> workflow is a use case we should pay attention to. SparkR will never be able 
> to cover all existing features from CRAN packages. It is also unnecessary for 
> Spark to do so because not all features need scalability. 
> Several factors contribute to the serialization overhead:
> 1. The SerDe on the R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive type columns 
> in particular.
> 3. Some overhead in the serialization protocol/impl.
> 1) might have been discussed before because R packages like rJava existed 
> before SparkR. I'm not sure whether we have a license issue in depending on 
> those libraries. Another option is to switch to R's low-level C interface or 
> Rcpp, which again might have license issues. I'm not an expert here. If we 
> have to implement our own, there is still much room for improvement, discussed 
> below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
> which collects rows to local and then constructs columns. However,
> * it ignores column types and results in boxing/unboxing overhead
> * it collects all objects to the driver and results in high GC pressure
> A relatively simple change is to implement a specialized column builder based 
> on column types, primitive types in particular. We need to handle null/NA 
> values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read the entire vector 
> in a single method call. The speed seems reasonable (at the order of GB/s):
> {code}
> > x <- runif(1e7) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>user  system elapsed
>   0.036   0.021   0.059
> > system.time(y <- readBin(r, double(), 1e7))
>user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But 

[jira] [Commented] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found

2016-12-19 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762532#comment-15762532
 ] 

Ilya Matiach commented on SPARK-16473:
--

I'm interested in looking into this issue.  Would it be possible to get a 
dataset (either the original one or some mock dataset) which can be used to 
reproduce this error?
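
In case it helps, a minimal sketch of a mock reproduction based on the numbers in the description (synthetic random sparse vectors, not the original data, so it is unclear whether it triggers the same failure):

{code}
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

import scala.util.Random

// ~100K sparse vectors of dimension 100K, a handful of non-zeros each.
val dim = 100000
val data = sc.parallelize(0 until 100000, 200).map { i =>
  val rnd = new Random(i)
  val indices = Array.fill(10)(rnd.nextInt(dim)).distinct.sorted
  Vectors.sparse(dim, indices, indices.map(_ => rnd.nextDouble()))
}.cache()

// K=75 reportedly works; K=100 fails with "key not found" in the report.
val model = new BisectingKMeans().setK(100).run(data)
{code}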

> BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key 
> not found
> --
>
> Key: SPARK-16473
> URL: https://issues.apache.org/jira/browse/SPARK-16473
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 2.0.0
> Environment: AWS EC2 linux instance. 
>Reporter: Alok Bhandari
>
> Hello,
> I am using Apache Spark 1.6.1. 
> I am executing the bisecting k-means algorithm on a specific dataset.
> Dataset details:
> K=100,
> input vectors = 100K * 100K,
> memory assigned: 16GB per node,
> number of nodes = 2.
> Till K=75 it is working fine, but when I set K=100, it fails with 
> java.util.NoSuchElementException: key not found. 
> *I suspect it is failing because of a lack of resources, but the exception 
> does not convey why this Spark job failed.* 
> Please can someone point me to the root cause of this exception and why it is 
> failing. 
> This is the exception stack trace:
> {code}
> java.util.NoSuchElementException: key not found: 166 
> at scala.collection.MapLike$class.default(MapLike.scala:228) 
> at scala.collection.AbstractMap.default(Map.scala:58) 
> at scala.collection.MapLike$class.apply(MapLike.scala:141) 
> at scala.collection.AbstractMap.apply(Map.scala:58) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
>  
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>  
> at scala.collection.immutable.List.foldLeft(List.scala:84) 
> at 
> scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
>  
> at scala.collection.immutable.List.reduceLeft(List.scala:84) 
> at 
> scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231) 
> at scala.collection.AbstractTraversable.minBy(Traversable.scala:105) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
>  
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
>  
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) 
> {code}
> The issue is that it is failing but not giving any explicit message as to 
> why it failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18927) MemorySink for StructuredStreaming can't recover from checkpoint if location is provided in conf

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18927:


Assignee: (was: Apache Spark)

> MemorySink for StructuredStreaming can't recover from checkpoint if location 
> is provided in conf
> 
>
> Key: SPARK-18927
> URL: https://issues.apache.org/jira/browse/SPARK-18927
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18927) MemorySink for StructuredStreaming can't recover from checkpoint if location is provided in conf

2016-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762497#comment-15762497
 ] 

Apache Spark commented on SPARK-18927:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/16342

> MemorySink for StructuredStreaming can't recover from checkpoint if location 
> is provided in conf
> 
>
> Key: SPARK-18927
> URL: https://issues.apache.org/jira/browse/SPARK-18927
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18927) MemorySink for StructuredStreaming can't recover from checkpoint if location is provided in conf

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18927:


Assignee: Apache Spark

> MemorySink for StructuredStreaming can't recover from checkpoint if location 
> is provided in conf
> 
>
> Key: SPARK-18927
> URL: https://issues.apache.org/jira/browse/SPARK-18927
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18832) Spark SQL: Incorrect error message on calling registered UDF.

2016-12-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762427#comment-15762427
 ] 

Dongjoon Hyun commented on SPARK-18832:
---

Ah, I think I reproduced the situation you met. Let me dig into this.

> Spark SQL: Incorrect error message on calling registered UDF.
> -
>
> Key: SPARK-18832
> URL: https://issues.apache.org/jira/browse/SPARK-18832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Lokesh Yadav
>
> On calling a registered UDF in the metastore from the spark-sql CLI, it 
> gives a generic error:
> Error in query: Undefined function: 'Sample_UDF'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'default'.
> The function is registered and it shows up in the list output by 'show 
> functions'.
> I am using a Hive UDTF, registering it using the statement: create function 
> Sample_UDF as 'com.udf.Sample_UDF' using JAR 
> '/local/jar/path/containing/the/class';
> and I am calling the function from the spark-sql CLI as: SELECT 
> Sample_UDF("input_1", "input_2")
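
For reference, a minimal sketch of the reported sequence driven through the SQL API instead of the CLI (the jar path and class name are the placeholders from the description, not real artifacts):

{code}
// Register the permanent function, confirm it is listed, then call it.
spark.sql("""
  CREATE FUNCTION Sample_UDF AS 'com.udf.Sample_UDF'
  USING JAR '/local/jar/path/containing/the/class'
""")
spark.sql("SHOW FUNCTIONS LIKE 'Sample_UDF'").show()
spark.sql("""SELECT Sample_UDF("input_1", "input_2")""").show()
{code}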



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation

2016-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762384#comment-15762384
 ] 

Apache Spark commented on SPARK-18928:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16340

> FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
> ---
>
> Key: SPARK-18928
> URL: https://issues.apache.org/jira/browse/SPARK-18928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark tasks respond to cancellation by checking 
> {{TaskContext.isInterrupted()}}, but this check is missing on a few critical 
> paths used in Spark SQL, including FileScanRDD, JDBCRDD, and 
> UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to 
> continue running and become zombies.
> Here's an example: first, create a giant text file. In my case, I just 
> concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig 
> file. Then, run a really slow query over that file and try to cancel it:
> {code}
> spark.read.text("/tmp/words").selectExpr("value + value + value").collect()
> {code}
> This will sit and churn at 100% CPU for a minute or two because the task 
> isn't checking the interrupted flag.
> The solution here is to add InterruptedIterator-style checks to a few 
> locations where they're currently missing in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18928:


Assignee: Josh Rosen  (was: Apache Spark)

> FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
> ---
>
> Key: SPARK-18928
> URL: https://issues.apache.org/jira/browse/SPARK-18928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark tasks respond to cancellation by checking 
> {{TaskContext.isInterrupted()}}, but this check is missing on a few critical 
> paths used in Spark SQL, including FileScanRDD, JDBCRDD, and 
> UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to 
> continue running and become zombies.
> Here's an example: first, create a giant text file. In my case, I just 
> concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig 
> file. Then, run a really slow query over that file and try to cancel it:
> {code}
> spark.read.text("/tmp/words").selectExpr("value + value + value").collect()
> {code}
> This will sit and churn at 100% CPU for a minute or two because the task 
> isn't checking the interrupted flag.
> The solution here is to add InterruptedIterator-style checks to a few 
> locations where they're currently missing in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18928:


Assignee: Apache Spark  (was: Josh Rosen)

> FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
> ---
>
> Key: SPARK-18928
> URL: https://issues.apache.org/jira/browse/SPARK-18928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Spark tasks respond to cancellation by checking 
> {{TaskContext.isInterrupted()}}, but this check is missing on a few critical 
> paths used in Spark SQL, including FileScanRDD, JDBCRDD, and 
> UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to 
> continue running and become zombies.
> Here's an example: first, create a giant text file. In my case, I just 
> concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig 
> file. Then, run a really slow query over that file and try to cancel it:
> {code}
> spark.read.text("/tmp/words").selectExpr("value + value + value").collect()
> {code}
> This will sit and churn at 100% CPU for a minute or two because the task 
> isn't checking the interrupted flag.
> The solution here is to add InterruptedIterator-style checks to a few 
> locations where they're currently missing in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation

2016-12-19 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-18928:
--

 Summary: FileScanRDD, JDBCRDD, and UnsafeSorter should support 
task cancellation
 Key: SPARK-18928
 URL: https://issues.apache.org/jira/browse/SPARK-18928
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


Spark tasks respond to cancellation by checking 
{{TaskContext.isInterrupted()}}, but this check is missing on a few critical 
paths used in Spark SQL, including FileScanRDD, JDBCRDD, and UnsafeSorter-based 
sorts. This can cause interrupted / cancelled tasks to continue running and 
become zombies.

Here's an example: first, create a giant text file. In my case, I just 
concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig file. 
Then, run a really slow query over that file and try to cancel it:

{code}
spark.read.text("/tmp/words").selectExpr("value + value + value").collect()
{code}

This will sit and churn at 100% CPU for a minute or two because the task isn't 
checking the interrupted flag.

The solution here is to add InterruptedIterator-style checks to a few locations 
where they're currently missing in Spark SQL.
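
A minimal sketch of the kind of check involved, assuming a simple wrapper iterator; the class name here is illustrative, not the actual patch (Spark's existing org.apache.spark.InterruptibleIterator follows the same pattern):

{code}
import org.apache.spark.{TaskContext, TaskKilledException}

// Illustrative wrapper: delegate to the underlying iterator, but consult the
// task's interrupted flag on every hasNext() so a cancelled task stops
// promptly instead of churning on as a zombie.
class CancellationCheckingIterator[T](context: TaskContext, delegate: Iterator[T])
  extends Iterator[T] {

  override def hasNext: Boolean = {
    if (context.isInterrupted()) {
      throw new TaskKilledException
    }
    delegate.hasNext
  }

  override def next(): T = delegate.next()
}
{code}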



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit

2016-12-19 Thread Ed Tyrrill (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762360#comment-15762360
 ] 

Ed Tyrrill commented on SPARK-15544:


I am experiencing the same problem with Spark 1.6.2 and ZK 3.4.8 on RHEL 7.  
Any plans on fixing this?

> Bouncing Zookeeper node causes Active spark master to exit
> --
>
> Key: SPARK-15544
> URL: https://issues.apache.org/jira/browse/SPARK-15544
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: Ubuntu 14.04.  Zookeeper 3.4.6 with 3-node quorum
>Reporter: Steven Lowenthal
>
> Shutting down a single zookeeper node caused the spark master to exit.  The 
> master should have connected to a second zookeeper node. 
> {code:title=log output}
> 16/05/25 18:21:28 INFO master.Master: Launching executor 
> app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138
> 16/05/25 18:21:28 INFO master.Master: Launching executor 
> app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data 
> from server sessionid 0x154dfc0426b0054, likely server has closed socket, 
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data 
> from server sessionid 0x254c701f28d0053, likely server has closed socket, 
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost 
> leadership
> 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master 
> shutting down. }}
> {code}
> spark-env.sh: 
> {code:title=spark-env.sh}
> export SPARK_LOCAL_DIRS=/ephemeral/spark/local
> export SPARK_WORKER_DIR=/ephemeral/spark/work
> export SPARK_LOG_DIR=/var/log/spark
> export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER 
> -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
> export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2016-12-19 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762296#comment-15762296
 ] 

Imran Rashid commented on SPARK-18886:
--

Thanks [~mridul], that helps -- in particular I was only thinking about bulk 
scheduling, I had forgotten to take that into account.

After a closer look through the code, I think my earlier proposal makes sense 
-- rather than resetting the timeout as each task is scheduled, change it to 
start the timer as soon as there is an offer which goes unused due to the 
delay.  Once that timer is started, it is never reset (for that TSM).

I can think of one scenario where this would result in worse scheduling than 
what we currently have.  Suppose that initially, a TSM is offered one resource 
which only matches on rack_local.  But immediately after that, many 
process_local offers are made, which are all used up.  Some time later, more 
offers that are only rack_local come in.  They'll immediately get used, even 
though there may be plenty more offers that are process_local that are just 
about to come in (perhaps enough for all of the remaining tasks).

That wouldn't be great, but it's also not nearly as bad as letting most of your 
cluster sit idle.

Other alternatives I can think of:

a) Turn off delay scheduling by default, and change 
[{{TaskSchedulerImpl.resourceOffer}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L357-L360]
 to go through all task sets, then advance locality levels, rather than the 
other way around.  Perhaps we should invert those loops anyway, just for when 
users turn off delay scheduling.

b) Have the TSM use some knowledge about all available executors to decide 
whether or not it is even possible for enough resources at the right locality 
level to appear.  E.g., in the original case, the TSM would realize there is only one 
executor which is process_local, so it doesn't make sense to wait to schedule 
all tasks on that executor.  However, I'm pretty skeptical about doing anything 
like this, as it may be a somewhat complicated thing inside the scheduler, and 
it could just turn into a mess of heuristics which has lots of corner cases.

I think implementing my proposed solution should be relatively easy, so I'll 
take a stab at it, but I'd still appreciate more input on the right approach 
here.  Perhaps seeing an implementation will make it easier to discuss.
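
To make the proposal above concrete, an illustrative-only sketch of the timer behaviour (names made up, not the actual code): per TaskSetManager, the timer starts the first time an offer goes unused because of delay scheduling and is never reset afterwards.

{code}
// Hypothetical helper, one instance per TaskSetManager.
class LocalityDelayTimer(timeoutMs: Long) {
  private var startedAt: Option[Long] = None

  // Called when an offer is declined purely for locality reasons.
  def offerUnusedDueToDelay(nowMs: Long): Unit = {
    if (startedAt.isEmpty) startedAt = Some(nowMs)   // start once, never reset
  }

  // Once the timer has run out, the TSM may relax its locality level.
  def shouldRelaxLocality(nowMs: Long): Boolean =
    startedAt.exists(nowMs - _ >= timeoutMs)
}
{code}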

> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling.
> Instead of having *one* delay to wait for resources with better locality, 
> spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using a 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".
> Note that this workaround isn't perfect --with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality 
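
For reference, a minimal sketch of workaround (1) above expressed as configuration (the "1s" figure is just the example from the description and should be tuned to your task durations):

{code}
import org.apache.spark.SparkConf

// Keep the process-local wait shorter than the typical task time and collapse
// the node/rack waits, so the total delay before relaxing locality stays bounded.
val conf = new SparkConf()
  .set("spark.locality.wait.process", "1s")
  .set("spark.locality.wait.node", "0")
  .set("spark.locality.wait.rack", "0")
{code}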

[jira] [Commented] (SPARK-18926) run-example SparkPi terminates with error message

2016-12-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762210#comment-15762210
 ] 

Dongjoon Hyun commented on SPARK-18926:
---

Hi, [~alex.decastro].
It looks like a timing issue. Does this always happen in your environment?

> run-example SparkPi terminates with error message 
> --
>
> Key: SPARK-18926
> URL: https://issues.apache.org/jira/browse/SPARK-18926
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Spark Core
>Affects Versions: 2.0.2
>Reporter: Alex DeCastro
>
> Spark examples and the shell terminate with an error message. 
> E.g.: 
> $ run-example SparkPi 
> 16/12/19 15:55:03 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/12/19 15:55:04 WARN Utils: Your hostname, adecosta-mbp-osx resolves to a 
> loopback address: 127.0.0.1; using 172.16.170.144 instead (on interface en3)
> 16/12/19 15:55:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 16/12/19 15:55:05 WARN SparkContext: Use an existing SparkContext, some 
> configuration may not take effect.
> Pi is roughly 3.1385756928784643  
>   
> 16/12/19 15:55:11 ERROR TransportRequestHandler: Error while invoking 
> RpcHandler#receive() for one-way message.
> org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:152)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
>   at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:571)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18624) Implict cast between ArrayTypes

2016-12-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18624.
---
   Resolution: Fixed
 Assignee: Jiang Xingbo
Fix Version/s: 2.2.0

> Implict cast between ArrayTypes
> ---
>
> Key: SPARK-18624
> URL: https://issues.apache.org/jira/browse/SPARK-18624
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently `ImplicitTypeCasts` doesn't handle casts between ArrayTypes, this 
> is not convenient, we should add a rule to enable casting from 
> ArrayType(InternalType) to ArrayType(newInternalType).
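As an illustration (not taken from the ticket) of the kind of expression this 
rule is meant to resolve, consider comparing arrays whose element types differ 
only by a widening cast:

{code}
// Illustrative only: with an implicit cast rule for ArrayTypes, the
// int-element array can be widened to array<bigint> so the equality resolves.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("array-implicit-cast")
  .getOrCreate()

spark.sql("SELECT array(1, 2) = array(1L, 2L) AS equal").show()
{code}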



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18877) Unable to read given csv data. Excepion: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

2016-12-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762186#comment-15762186
 ] 

Dongjoon Hyun commented on SPARK-18877:
---

+1

> Unable to read given csv data. Excepion: java.lang.IllegalArgumentException: 
> requirement failed: Decimal precision 28 exceeds max precision 20
> --
>
> Key: SPARK-18877
> URL: https://issues.apache.org/jira/browse/SPARK-18877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When reading the below-mentioned CSV data, even though the maximum decimal 
> precision is 38, the following exception is thrown: 
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 
> exceeds max precision 20
> Decimal
> 2323366225312000
> 2433573971400
> 23233662253000
> 23233662253
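For what it's worth, a hedged sketch of a possible workaround is to supply an 
explicit schema wide enough for the values instead of relying on inference; 
the path, header option, and column name below are assumptions:

{code}
// Sketch only: declare the column as DecimalType(38, 0) up front so schema
// inference does not pick a narrower precision. Path and header are assumed.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-decimal")
  .getOrCreate()

val schema = StructType(Seq(StructField("Decimal", DecimalType(38, 0))))
val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("/path/to/decimals.csv")
df.show(truncate = false)
{code}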



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18832) Spark SQL: Incorrect error message on calling registered UDF.

2016-12-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762182#comment-15762182
 ] 

Dongjoon Hyun commented on SPARK-18832:
---

Thank you for confirming.
Actually, Spark already has the UDTF test cases 
[here|https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala#L102-L140].
So far, I don't understand what is different in your case. Could you give us a 
reproducible example?


> Spark SQL: Incorrect error message on calling registered UDF.
> -
>
> Key: SPARK-18832
> URL: https://issues.apache.org/jira/browse/SPARK-18832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Lokesh Yadav
>
> On calling a registered UDF in metastore from spark-sql CLI, it gives a 
> generic error:
> Error in query: Undefined function: 'Sample_UDF'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'default'.
> The function is registered and it shows up in the list output by 'show 
> functions'.
> I am using a Hive UDTF, registering it with the statement: create function 
> Sample_UDF as 'com.udf.Sample_UDF' using JAR 
> '/local/jar/path/containing/the/class';
> and I am calling the function from the spark-sql CLI as: SELECT 
> Sample_UDF("input_1", "input_2" )
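A hedged sketch of the same steps driven from Scala; the jar path and class 
name are the placeholders from the report, not a verified example:

{code}
// Sketch only: register the permanent Hive function and invoke it through the
// SQL interface. Requires Hive support; jar path and class name are the
// placeholders given in the report.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("udtf-repro")
  .enableHiveSupport()
  .getOrCreate()

spark.sql(
  "CREATE FUNCTION Sample_UDF AS 'com.udf.Sample_UDF' " +
  "USING JAR '/local/jar/path/containing/the/class'")
spark.sql("SELECT Sample_UDF('input_1', 'input_2')").show()
{code}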



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan

2016-12-19 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18908:
-
Priority: Critical  (was: Blocker)

> It's hard for the user to see the failure if StreamExecution fails to create 
> the logical plan
> -
>
> Key: SPARK-18908
> URL: https://issues.apache.org/jira/browse/SPARK-18908
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
>
> If the logical plan fails to be created, e.g., because some Source options are 
> invalid, the user cannot detect the failure from code. The only place receiving 
> this error is the thread's UncaughtExceptionHandler.
> This bug exists because logicalPlan is lazy: when we try to create a 
> StreamingQueryException to wrap the exception thrown while creating 
> logicalPlan, it calls logicalPlan again.
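A minimal sketch of the failure mode, using a stand-in class rather than 
Spark's actual StreamExecution:

{code}
// Not Spark's code: a lazy val whose initializer throws is re-evaluated on the
// next access, so referencing it again while building the wrapper exception
// just re-throws the original error and the wrapper is never constructed.
class FakeStreamExecution {
  lazy val logicalPlan: String =
    throw new IllegalArgumentException("invalid source option")

  def run(): Unit = {
    try {
      logicalPlan
    } catch {
      case e: Exception =>
        // This interpolation touches logicalPlan again and re-throws.
        throw new RuntimeException(s"query failed for plan $logicalPlan", e)
    }
  }
}
{code}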



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18921) check database existence with Hive.databaseExists instead of getDatabase

2016-12-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-18921.
--
   Resolution: Fixed
Fix Version/s: 2.1.1

Issue resolved by pull request 16332
[https://github.com/apache/spark/pull/16332]

> check database existence with Hive.databaseExists instead of getDatabase
> 
>
> Key: SPARK-18921
> URL: https://issues.apache.org/jira/browse/SPARK-18921
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 2.1.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM

2016-12-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18700.
---
   Resolution: Fixed
 Assignee: Li Yuanjian
Fix Version/s: 2.2.0
   2.1.1

> getCached in HiveMetastoreCatalog not thread safe cause driver OOM
> --
>
> Key: SPARK-18700
> URL: https://issues.apache.org/jira/browse/SPARK-18700
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0, 2.1.1
>Reporter: Li Yuanjian
>Assignee: Li Yuanjian
> Fix For: 2.1.1, 2.2.0
>
>
> In our Spark SQL platform, each query uses the same HiveContext on an 
> independent thread, and new data is appended to tables as new partitions every 
> 30 minutes. After a new partition is added to table T, we call refreshTable to 
> clear T's cache in cachedDataSourceTables so that the new partition becomes 
> searchable. 
> For a table with many partitions and files (much more than 
> spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table 
> T will start a job to fetch all FileStatus objects in the listLeafFiles 
> function. Because of the huge number of files, the job runs for several 
> seconds; during that time, new queries of table T will also start new jobs to 
> fetch FileStatus, because getCached is not thread safe. This finally causes a 
> driver OOM.
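As a generic illustration (not HiveMetastoreCatalog's actual code) of the race 
and one way to close it:

{code}
// Sketch only: if the cache lookup and the expensive load are separate steps,
// N concurrent readers can each start their own listing job. computeIfAbsent
// performs the lookup-and-load atomically, so the load runs at most once per
// key and later readers reuse the cached value.
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

class RelationCache[K, V] {
  private val cache = new ConcurrentHashMap[K, V]()

  def getOrLoad(key: K)(load: K => V): V =
    cache.computeIfAbsent(key, new JFunction[K, V] {
      override def apply(k: K): V = load(k)
    })

  // Called on refreshTable so the next query rebuilds the relation.
  def invalidate(key: K): Unit = cache.remove(key)
}
{code}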



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16654) UI Should show blacklisted executors & nodes

2016-12-19 Thread Jose Soltren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762067#comment-15762067
 ] 

Jose Soltren commented on SPARK-16654:
--

SPARK-8425 is resolved so I'll be working to get this checked in fairly soon 
now.

> UI Should show blacklisted executors & nodes
> 
>
> Key: SPARK-16654
> URL: https://issues.apache.org/jira/browse/SPARK-16654
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>Assignee: Jose Soltren
>
> SPARK-8425 will add the ability to blacklist entire executors and nodes to 
> deal w/ faulty hardware.  However, without displaying it on the UI, it can be 
> hard to realize which executor is bad, and why tasks aren't getting scheduled 
> on certain executors.
> As a first step, we should just show nodes and executors that are blacklisted 
> for the entire application (no need to show blacklisting for tasks & stages).
> This should also ensure that blacklisting events get into the event logs for 
> the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2016-12-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-18924:
--
Description: 
SparkR has its own SerDe for data serialization between JVM and R.

The SerDe on the JVM side is implemented in:
* 
[SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
* 
[SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]

The SerDe on the R side is implemented in:
* 
[deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
* [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]

The serialization between JVM and R suffers from huge storage and computation 
overhead. For example, a short round trip of 1 million doubles surprisingly 
took 3 minutes on my laptop:

{code}
> system.time(collect(createDataFrame(data.frame(x=runif(1e6)))))
   user  system elapsed
 14.224   0.582 189.135
{code}

Collecting a medium-sized DataFrame to local and continuing with a local R 
workflow is a use case we should pay attention to. SparkR will never be able to 
cover all existing features from CRAN packages. It is also unnecessary for 
Spark to do so because not all features need scalability. 


Several factors contribute to the serialization overhead:
1. The SerDe on the R side is implemented using high-level R methods.
2. DataFrame columns are not efficiently serialized, primitive type columns in 
particular.
3. There is some overhead in the serialization protocol/implementation.

1) might have been discussed before, because R packages like rJava existed 
before SparkR. I'm not sure whether we would have a license issue in depending 
on those libraries. Another option is to switch to R's low-level C interface or 
Rcpp, which again might have license issues. I'm not an expert here. If we have 
to implement our own, there is still much room for improvement, as discussed 
below.

2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
which collects rows locally and then constructs columns. However,
* it ignores column types, which results in boxing/unboxing overhead
* it collects all objects to the driver, which results in high GC pressure

A relatively simple change is to implement a specialized column builder per 
column type, for primitive types in particular. We need to handle null/NA 
values properly. A simple data structure we can use is

{code}
val size: Int
val nullIndexes: Array[Int]
val notNullValues: Array[T] // specialized for primitive types
{code}
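A slightly fuller sketch of such a builder, specialized to Double and purely 
illustrative (the names are not from any Spark API):

{code}
// Illustrative only: nulls are tracked by index so the value buffer stays
// primitive and can later be shipped to R with a single binary write.
import scala.collection.mutable.ArrayBuffer

class DoubleColumnBuilder {
  private val nullIndexes = ArrayBuffer.empty[Int]
  private val notNullValues = ArrayBuffer.empty[Double]
  private var size = 0

  def append(value: java.lang.Double): Unit = {
    if (value == null) nullIndexes += size else notNullValues += value
    size += 1
  }

  def result(): (Int, Array[Int], Array[Double]) =
    (size, nullIndexes.toArray, notNullValues.toArray)
}
{code}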

On the R side, we can use `readBin` and `writeBin` to read the entire vector in 
a single method call. The speed seems reasonable (on the order of GB/s):

{code}
> x <- runif(1e7) # 1e7, not 1e6
> system.time(r <- writeBin(x, raw(0)))
   user  system elapsed
  0.036   0.021   0.059
> system.time(y <- readBin(r, double(), 1e7))
   user  system elapsed
  0.015   0.007   0.024
{code}

This is just a proposal that needs to be discussed and formalized. But in 
general, it should be feasible to obtain a 20x or greater performance gain.

  was:
SparkR has its own SerDe for data serialization between JVM and R.

The SerDe on the JVM side is implemented in:
* 
[SeDe|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
* 
[SQLUtils|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]

The SerDe on the R side is implemented in:
* 
[deserialize|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
* [serialize|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]

The serialization between JVM and R suffers from huge storage and computation 
overhead. For example, a short round-trip of 1 million doubles surprisingly 
took 3 minutes on my laptop:

{code}
> system.time(collect(createDataFrame(data.frame(x=runif(100)
   user  system elapsed
 14.224   0.582 189.135
{code}

Collecting a medium-sized DataFrame to local and continuing with a local R 
workflow is a use case we should pay attention to. SparkR will never be able to 
cover all existing features from CRAN packages. It is also unnecessary for 
Spark to do so because not all features need scalability. 


Several factors contribute to the serialization overhead:
1. The SerDe in R side is implemented using high-level R methods.
2. DataFrame columns are not efficiently serialized, primitive type columns in 
particular.
3. Some overhead in the serialization protocol/impl.

1) might be discussed before because R packages like rJava exist before SparkR. 
I'm not sure whether we have a license issue in depending on those libraries. 
Another option is to switch to low-level R'C interface or Rcpp, which again 
might have license issue. I'm not an expert here. If we have to implement our 
own, there still exist much space for 

[jira] [Created] (SPARK-18927) MemorySink for StructuredStreaming can't recover from checkpoint if location is provided in conf

2016-12-19 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18927:
---

 Summary: MemorySink for StructuredStreaming can't recover from 
checkpoint if location is provided in conf
 Key: SPARK-18927
 URL: https://issues.apache.org/jira/browse/SPARK-18927
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18716) Restrict the disk usage of spark event log.

2016-12-19 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761941#comment-15761941
 ] 

Marcelo Vanzin commented on SPARK-18716:


For posterity, another problem with this feature that I didn't mention in the 
PR is that it allows users to use the SHS to delete content created by other 
users. A malicious user can just write a big event log file to the SHS 
directory, and that would eventually make the SHS delete log files from other 
users. 

> Restrict the disk usage of spark event log. 
> 
>
> Key: SPARK-18716
> URL: https://issues.apache.org/jira/browse/SPARK-18716
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>
> We've had reports of Spark event log files filling up the disk. The current 
> implementation has the following drawbacks:
> 1. If the Spark HistoryServer was never started, or it just failed, there is 
> no chance to do the cleanup work.
> 2. The Spark HistoryServer cleans event log files based on file age only. If 
> applications keep arriving constantly, the disk usage accumulated within each 
> {{spark.history.fs.cleaner.maxAge}} window can still be very large.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18917:


Assignee: (was: Apache Spark)

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> --
>
> Key: SPARK-18917
> URL: https://issues.apache.org/jira/browse/SPARK-18917
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL, YARN
>Affects Versions: 2.0.2
>Reporter: Anbu Cheeralan
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When using Dataframe write in append mode on object stores (S3 / Google 
> Storage), the writes take a long time or hit read timeouts. This is because 
> dataframe.write lists all leaf folders in the target directory. If there are 
> a lot of subfolders due to partitions, this takes forever.
> In org.apache.spark.sql.execution.datasources.DataSource.write(), the 
> following code causes a huge number of RPC calls when the file system is an 
> object store (S3, GS):
> if (mode == SaveMode.Append) {
>   val existingPartitionColumns = Try {
>     resolveRelation()
>       .asInstanceOf[HadoopFsRelation]
>       .location
>       .partitionSpec()
>       .partitionColumns
>       .fieldNames
>       .toSeq
>   }.getOrElse(Seq.empty[String])
> There should be a flag to skip the partition-match check in append mode. I 
> can work on the patch.
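For illustration only, a hypothetical shape for such a flag; neither the 
parameter name nor the wiring exists in Spark today:

{code}
// Hypothetical sketch: in append mode, skip resolving the existing partition
// spec (and the recursive object-store listing it triggers) when the caller
// explicitly opts out. The flag name and wiring are invented for illustration.
import scala.util.Try

def existingPartitionColumns(
    skipPartitionMatchCheck: Boolean)(
    resolvePartitionColumns: => Seq[String]): Seq[String] = {
  if (skipPartitionMatchCheck) Seq.empty[String]
  else Try(resolvePartitionColumns).getOrElse(Seq.empty[String])
}
{code}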



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

2016-12-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18917:


Assignee: Apache Spark

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> --
>
> Key: SPARK-18917
> URL: https://issues.apache.org/jira/browse/SPARK-18917
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL, YARN
>Affects Versions: 2.0.2
>Reporter: Anbu Cheeralan
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When using Dataframe write in append mode on object stores (S3 / Google 
> Storage), the writes take a long time or hit read timeouts. This is because 
> dataframe.write lists all leaf folders in the target directory. If there are 
> a lot of subfolders due to partitions, this takes forever.
> In org.apache.spark.sql.execution.datasources.DataSource.write(), the 
> following code causes a huge number of RPC calls when the file system is an 
> object store (S3, GS):
> if (mode == SaveMode.Append) {
>   val existingPartitionColumns = Try {
>     resolveRelation()
>       .asInstanceOf[HadoopFsRelation]
>       .location
>       .partitionSpec()
>       .partitionColumns
>       .fieldNames
>       .toSeq
>   }.getOrElse(Seq.empty[String])
> There should be a flag to skip the partition-match check in append mode. I 
> can work on the patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

2016-12-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761926#comment-15761926
 ] 

Apache Spark commented on SPARK-18917:
--

User 'alunarbeach' has created a pull request for this issue:
https://github.com/apache/spark/pull/16339

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> --
>
> Key: SPARK-18917
> URL: https://issues.apache.org/jira/browse/SPARK-18917
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL, YARN
>Affects Versions: 2.0.2
>Reporter: Anbu Cheeralan
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When using Dataframe write in append mode on object stores (S3 / Google 
> Storage), the writes take a long time or hit read timeouts. This is because 
> dataframe.write lists all leaf folders in the target directory. If there are 
> a lot of subfolders due to partitions, this takes forever.
> In org.apache.spark.sql.execution.datasources.DataSource.write(), the 
> following code causes a huge number of RPC calls when the file system is an 
> object store (S3, GS):
> if (mode == SaveMode.Append) {
>   val existingPartitionColumns = Try {
>     resolveRelation()
>       .asInstanceOf[HadoopFsRelation]
>       .location
>       .partitionSpec()
>       .partitionColumns
>       .fieldNames
>       .toSeq
>   }.getOrElse(Seq.empty[String])
> There should be a flag to skip the partition-match check in append mode. I 
> can work on the patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2016-12-19 Thread Franklyn Dsouza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761741#comment-15761741
 ] 

Franklyn Dsouza edited comment on SPARK-18589 at 12/19/16 5:37 PM:
---

The sequence of steps that causes this is:

{code}
join two dataframes A and B > make a udf that uses one column from A and 
another from B > filter on column produced by udf > java.lang.RuntimeException: 
Invalid PythonUDF (b#1L, c#6L), requires attributes from more than one 
child.
{code}

Here are some minimal steps to reproduce this issue in pyspark:

{code}
from pyspark.sql import types
from pyspark.sql import functions as F
df1 = sqlCtx.createDataFrame([types.Row(a=1, b=2), types.Row(a=1, b=4)])
df2 = sqlCtx.createDataFrame([types.Row(a=1, c=12)])
joined = df1.join(df2, df1['a'] == df2['a'])
extra = joined.withColumn('sum', F.udf(lambda a,b : a+b, 
types.IntegerType())(joined['b'], joined['c']))
filtered = extra.where(extra['sum'] < F.lit(10)).collect()
{code}

*doing extra.cache() before the filtering will fix the issue* but obviously 
isn't a solution.



was (Author: franklyndsouza):
The sequence of steps that causes this are:

{code}
join two dataframes A and B > make a udf that uses one column from A and 
another from B > filter on column produced by udf > java.lang.RuntimeException: 
Invalid PythonUDF (b#1L, c#6L), requires attributes from more than one 
child.
{code}

Here are some minimum steps to reproduce this issue in pyspark

{code}
from pyspark.sql import types
from pyspark.sql import functions as F
df1 = sqlCtx.createDataFrame([types.Row(a=1, b=2), types.Row(a=1, b=4)])
df2 = sqlCtx.createDataFrame([types.Row(a=1, c=12)])
joined = df1.join(df2, df1['a'] == df2['a'])
extra = joined.withColumn('sum', F.udf(lambda a,b : a+b, 
types.IntegerType())(joined['b'], joined['c']))
filtered = extra.where(extra['sum'] < F.lit(10)).collect()
{code}

*doing extra.cache() before the filtering will fix the issue.*


> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> 

[jira] [Comment Edited] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2016-12-19 Thread Franklyn Dsouza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761741#comment-15761741
 ] 

Franklyn Dsouza edited comment on SPARK-18589 at 12/19/16 5:29 PM:
---

The sequence of steps that causes this is:

{code}
join two dataframes A and B > make a udf that uses one column from A and 
another from B > filter on column produced by udf > java.lang.RuntimeException: 
Invalid PythonUDF (b#1L, c#6L), requires attributes from more than one 
child.
{code}

Here are some minimal steps to reproduce this issue in pyspark:

{code}
from pyspark.sql import types
from pyspark.sql import functions as F
df1 = sqlCtx.createDataFrame([types.Row(a=1, b=2), types.Row(a=1, b=4)])
df2 = sqlCtx.createDataFrame([types.Row(a=1, c=12)])
joined = df1.join(df2, df1['a'] == df2['a'])
extra = joined.withColumn('sum', F.udf(lambda a,b : a+b, 
types.IntegerType())(joined['b'], joined['c']))
filtered = extra.where(extra['sum'] < F.lit(10)).collect()
{code}

*doing extra.cache() before the filtering will fix the issue.*



was (Author: franklyndsouza):
The sequence of steps that causes this are:

{code}
join two dataframes A and B > make a udf that uses one column from A and 
another from B > filter on column produced by udf > java.lang.RuntimeException: 
Invalid PythonUDF (b#1L, c#6L), requires attributes from more than one 
child.
{code}

Here are some minimum steps to reproduce this issue in pyspark

{code}
from pyspark.sql import types
from pyspark.sql import functions as F
df1 = sqlCtx.createDataFrame([types.Row(a=1, b=2), types.Row(a=1, b=4)])
df2 = sqlCtx.createDataFrame([types.Row(a=1, c=12)])
joined = df1.join(df2, df1['a'] == df2['a'])
extra = joined.withColumn('sum', F.udf(lambda a,b : a+b, 
types.IntegerType())(joined['b'], joined['c']))
filtered = extra.where(extra['sum'] < F.lit(10)).collect()
{code}

doing extra.cache() before the filtering will fix the issue.


> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> 
