[jira] [Commented] (SPARK-18939) Timezone support in partition values.
[ https://issues.apache.org/jira/browse/SPARK-18939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763378#comment-15763378 ] Reynold Xin commented on SPARK-18939: - FWIW it's probably unlikely timestamp is used as a partition column, but let's definitely support this. I'm thinking we should also pass in a data source option "timezone" (default to session's timezone) and use that for CSV/JSON, and file partition. > Timezone support in partition values. > - > > Key: SPARK-18939 > URL: https://issues.apache.org/jira/browse/SPARK-18939 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Takuya Ueshin > > We should also use session local timezone to interpret partition values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
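A minimal sketch of what the proposed per-source option could look like from the user side; the option name "timezone" simply follows the comment above and is not a final API, and the path is a placeholder.

{code:title=TimezoneOptionSketch.scala}
import org.apache.spark.sql.SparkSession

object TimezoneOptionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("timezone-option-sketch").getOrCreate()

    // Hypothetical per-source timezone that would override the session-local default
    // when parsing timestamp values in CSV/JSON data and in partition directory names.
    val df = spark.read
      .option("timezone", "America/Los_Angeles")
      .csv("/path/to/partitioned/table")

    df.printSchema()
  }
}
{code}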
[jira] [Commented] (SPARK-18939) Timezone support in partition values.
[ https://issues.apache.org/jira/browse/SPARK-18939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763375#comment-15763375 ] Takuya Ueshin commented on SPARK-18939: --- Yes, I think so. > Timezone support in partition values. > - > > Key: SPARK-18939 > URL: https://issues.apache.org/jira/browse/SPARK-18939 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Takuya Ueshin > > We should also use session local timezone to interpret partition values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18939) Timezone support in partition values.
[ https://issues.apache.org/jira/browse/SPARK-18939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763372#comment-15763372 ] Reynold Xin commented on SPARK-18939: - This only impacts timestamp data type right? > Timezone support in partition values. > - > > Key: SPARK-18939 > URL: https://issues.apache.org/jira/browse/SPARK-18939 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Takuya Ueshin > > We should also use session local timezone to interpret partition values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18939) Timezone support in partition values.
Takuya Ueshin created SPARK-18939: - Summary: Timezone support in partition values. Key: SPARK-18939 URL: https://issues.apache.org/jira/browse/SPARK-18939 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Takuya Ueshin We should also use session local timezone to interpret partition values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18938) Addition of peak memory usage metric for an executor
Suresh Bahuguna created SPARK-18938: --- Summary: Addition of peak memory usage metric for an executor Key: SPARK-18938 URL: https://issues.apache.org/jira/browse/SPARK-18938 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Suresh Bahuguna Priority: Minor Fix For: 1.5.2 Addition of peak memory usage metric for an executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18936) Infrastructure for session local timezone support
[ https://issues.apache.org/jira/browse/SPARK-18936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763343#comment-15763343 ] Apache Spark commented on SPARK-18936: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/16308 > Infrastructure for session local timezone support > - > > Key: SPARK-18936 > URL: https://issues.apache.org/jira/browse/SPARK-18936 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Takuya Ueshin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18936) Infrastructure for session local timezone support
[ https://issues.apache.org/jira/browse/SPARK-18936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18936: Assignee: Apache Spark (was: Takuya Ueshin) > Infrastructure for session local timezone support > - > > Key: SPARK-18936 > URL: https://issues.apache.org/jira/browse/SPARK-18936 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18936) Infrastructure for session local timezone support
[ https://issues.apache.org/jira/browse/SPARK-18936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18936: Assignee: Takuya Ueshin (was: Apache Spark) > Infrastructure for session local timezone support > - > > Key: SPARK-18936 > URL: https://issues.apache.org/jira/browse/SPARK-18936 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Takuya Ueshin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18937) Timezone support in CSV/JSON parsing
Reynold Xin created SPARK-18937: --- Summary: Timezone support in CSV/JSON parsing Key: SPARK-18937 URL: https://issues.apache.org/jira/browse/SPARK-18937 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18936) Infrastructure for session local timezone support
Reynold Xin created SPARK-18936: --- Summary: Infrastructure for session local timezone support Key: SPARK-18936 URL: https://issues.apache.org/jira/browse/SPARK-18936 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Takuya Ueshin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found
[ https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763324#comment-15763324 ] Alok Bhandari edited comment on SPARK-16473 at 12/20/16 5:38 AM: - [~imatiach] , thanks for showing interest in this issue. I will try to share the dataset with you , please can you suggest where should I share it ? should I share it through github? is it fine? Also , I have tried to diagnose this issue on my own , from my analysis it looks like , it is failing if it tries to bisect a node which does not have any children. I also have added a code fix , but not sure if this is the correct solution :- *Suggested solution* {code:title=BisectingkMeans.scala} private def updateAssignments( assignments: RDD[(Long, VectorWithNorm)], divisibleIndices: Set[Long], newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = { assignments.map { case (index, v) => if (divisibleIndices.contains(index)) { val children = Seq(leftChildIndex(index), rightChildIndex(index)) if ( children.length>0 ) { val selected = children.minBy { child => KMeans.fastSquaredDistance(newClusterCenters(child), v) } (selected, v) }else { (index, v) } } else { (index, v) } } } {code} *Original code* {code:title=BisectingkMeans.scala} private def updateAssignments( assignments: RDD[(Long, VectorWithNorm)], divisibleIndices: Set[Long], newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = { assignments.map { case (index, v) => if (divisibleIndices.contains(index)) { val children = Seq(leftChildIndex(index), rightChildIndex(index)) val selected = children.minBy { child => KMeans.fastSquaredDistance(newClusterCenters(child), v) } (selected, v) } else { (index, v) } } } {code} was (Author: alokob...@gmail.com): [~imatiach] , thanks for showing interest in this issue. I will try to share the dataset with you , please can you suggest where should I share it ? should I share it through github? is it fine? Also , I have tried to diagnose this issue on my own , from my analysis it looks like , it is failing if it tries to bisect a node which does not have any children. 
I also have added a code fix , but not sure if this is the correct solution :- *Suggested solution* {code:BisectingKMeans} private def updateAssignments( assignments: RDD[(Long, VectorWithNorm)], divisibleIndices: Set[Long], newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = { assignments.map { case (index, v) => if (divisibleIndices.contains(index)) { val children = Seq(leftChildIndex(index), rightChildIndex(index)) if ( children.length>0 ) { val selected = children.minBy { child => KMeans.fastSquaredDistance(newClusterCenters(child), v) } (selected, v) }else { (index, v) } } else { (index, v) } } } {code} *Original code* {code:BsiectingKMeans} private def updateAssignments( assignments: RDD[(Long, VectorWithNorm)], divisibleIndices: Set[Long], newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = { assignments.map { case (index, v) => if (divisibleIndices.contains(index)) { val children = Seq(leftChildIndex(index), rightChildIndex(index)) val selected = children.minBy { child => KMeans.fastSquaredDistance(newClusterCenters(child), v) } (selected, v) } else { (index, v) } } } {code} > BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key > not found > -- > > Key: SPARK-16473 > URL: https://issues.apache.org/jira/browse/SPARK-16473 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.1, 2.0.0 > Environment: AWS EC2 linux instance. >Reporter: Alok Bhandari > > Hello , > I am using apache spark 1.6.1. > I am executing bisecting k means algorithm on a specific dataset . > Dataset details :- > K=100, > input vector =100K*100k > Memory assigned 16GB per node , > number of nodes =2. > Till K=75 it os working fine , but when I set k=100 , it fails with > java.util.NoSuchElementException: key not found. > *I suspect it is failing because of lack of some resources , but somehow > exception does not convey anything as why this spark job failed.* > Please can someone point me to root cause of this exception , why it is > failing. > This is the exception stack-trace:- >
[jira] [Commented] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found
[ https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763324#comment-15763324 ] Alok Bhandari commented on SPARK-16473: --- [~imatiach] , thanks for showing interest in this issue. I will try to share the dataset with you , please can you suggest where should I share it ? should I share it through github? is it fine? Also , I have tried to diagnose this issue on my own , from my analysis it looks like , it is failing if it tries to bisect a node which does not have any children. I also have added a code fix , but not sure if this is the correct solution :- *Suggested solution* {code:BisectingKMeans} private def updateAssignments( assignments: RDD[(Long, VectorWithNorm)], divisibleIndices: Set[Long], newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = { assignments.map { case (index, v) => if (divisibleIndices.contains(index)) { val children = Seq(leftChildIndex(index), rightChildIndex(index)) if ( children.length>0 ) { val selected = children.minBy { child => KMeans.fastSquaredDistance(newClusterCenters(child), v) } (selected, v) }else { (index, v) } } else { (index, v) } } } {code} *Original code* {code:BsiectingKMeans} private def updateAssignments( assignments: RDD[(Long, VectorWithNorm)], divisibleIndices: Set[Long], newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = { assignments.map { case (index, v) => if (divisibleIndices.contains(index)) { val children = Seq(leftChildIndex(index), rightChildIndex(index)) val selected = children.minBy { child => KMeans.fastSquaredDistance(newClusterCenters(child), v) } (selected, v) } else { (index, v) } } } {code} > BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key > not found > -- > > Key: SPARK-16473 > URL: https://issues.apache.org/jira/browse/SPARK-16473 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.1, 2.0.0 > Environment: AWS EC2 linux instance. >Reporter: Alok Bhandari > > Hello , > I am using apache spark 1.6.1. > I am executing bisecting k means algorithm on a specific dataset . > Dataset details :- > K=100, > input vector =100K*100k > Memory assigned 16GB per node , > number of nodes =2. > Till K=75 it os working fine , but when I set k=100 , it fails with > java.util.NoSuchElementException: key not found. > *I suspect it is failing because of lack of some resources , but somehow > exception does not convey anything as why this spark job failed.* > Please can someone point me to root cause of this exception , why it is > failing. 
> This is the exception stack-trace:- > {code} > java.util.NoSuchElementException: key not found: 166 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:58) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:58) > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338) > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337) > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337) > at > scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231) > > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125) > > at scala.collection.immutable.List.reduceLeft(List.scala:84) > at > scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231) > at scala.collection.AbstractTraversable.minBy(Traversable.scala:105) > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337) > > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply
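A side note on the suggested fix quoted above: Seq(leftChildIndex(index), rightChildIndex(index)) always has length 2, so a children.length > 0 check never filters anything out. Below is a hedged sketch of an alternative guard that only considers children for which a new cluster center actually exists; it assumes the same surrounding BisectingKMeans internals (leftChildIndex, rightChildIndex, KMeans.fastSquaredDistance, VectorWithNorm) as the quoted code and is an illustration of the idea, not a reviewed fix.

{code:title=BisectingKMeans.scala (sketch)}
private def updateAssignments(
    assignments: RDD[(Long, VectorWithNorm)],
    divisibleIndices: Set[Long],
    newClusterCenters: Map[Long, VectorWithNorm]): RDD[(Long, VectorWithNorm)] = {
  assignments.map { case (index, v) =>
    if (divisibleIndices.contains(index)) {
      // Only keep children that have a new cluster center, which avoids the
      // "key not found" lookup on newClusterCenters.
      val children = Seq(leftChildIndex(index), rightChildIndex(index))
        .filter(newClusterCenters.contains)
      if (children.nonEmpty) {
        val selected = children.minBy { child =>
          KMeans.fastSquaredDistance(newClusterCenters(child), v)
        }
        (selected, v)
      } else {
        (index, v) // no usable child center: leave the point in its current cluster
      }
    } else {
      (index, v)
    }
  }
}
{code}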
[jira] [Closed] (SPARK-17632) make console sink and other sinks work with 'recoverFromCheckpointLocation' option enabled
[ https://issues.apache.org/jira/browse/SPARK-17632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuanlei Ni closed SPARK-17632. --- Resolution: Not A Problem > make console sink and other sinks work with 'recoverFromCheckpointLocation' > option enabled > -- > > Key: SPARK-17632 > URL: https://issues.apache.org/jira/browse/SPARK-17632 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: Chuanlei Ni >Priority: Trivial > > val (useTempCheckpointLocation, recoverFromCheckpointLocation) = > if (source == "console") { > (true, false) > } else { > (false, true) > } > I think the 'console' sink should work with the 'recoverFromCheckpointLocation' option > enabled. If the user specified a checkpointLocation, we may lose the processed > data; however, that is acceptable since this sink is mainly used for testing. > And the other sinks could work with 'useTempCheckpointLocation' enabled, which > means replaying all the data in Kafka. If the user intends that, it is OK. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763165#comment-15763165 ] Felix Cheung commented on SPARK-18924: -- Thank you for bring this up. JVM<->Java performance has been reported a few times and definitely something I have been tracking but I didn't get around to. I don't think rJava would work since it is GPLv2 licensed. Rcpp is also GPLv2/v3. Strategically placed calls to C might be a way to go (cross-platform complications aside)? That seems to be the approach for a lot of R packages. I recall we have a JIRA on performance tests, do we have more break down of the time spent? > Improve collect/createDataFrame performance in SparkR > - > > Key: SPARK-18924 > URL: https://issues.apache.org/jira/browse/SPARK-18924 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Xiangrui Meng >Priority: Critical > > SparkR has its own SerDe for data serialization between JVM and R. > The SerDe on the JVM side is implemented in: > * > [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala] > * > [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala] > The SerDe on the R side is implemented in: > * > [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R] > * > [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R] > The serialization between JVM and R suffers from huge storage and computation > overhead. For example, a short round trip of 1 million doubles surprisingly > took 3 minutes on my laptop: > {code} > > system.time(collect(createDataFrame(data.frame(x=runif(100) >user system elapsed > 14.224 0.582 189.135 > {code} > Collecting a medium-sized DataFrame to local and continuing with a local R > workflow is a use case we should pay attention to. SparkR will never be able > to cover all existing features from CRAN packages. It is also unnecessary for > Spark to do so because not all features need scalability. > Several factors contribute to the serialization overhead: > 1. The SerDe in R side is implemented using high-level R methods. > 2. DataFrame columns are not efficiently serialized, primitive type columns > in particular. > 3. Some overhead in the serialization protocol/impl. > 1) might be discussed before because R packages like rJava exist before > SparkR. I'm not sure whether we have a license issue in depending on those > libraries. Another option is to switch to low-level R'C interface or Rcpp, > which again might have license issue. I'm not an expert here. If we have to > implement our own, there still exist much space for improvement, discussed > below. > 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, > which collects rows to local and then constructs columns. However, > * it ignores column types and results boxing/unboxing overhead > * it collects all objects to driver and results high GC pressure > A relatively simple change is to implement specialized column builder based > on column types, primitive types in particular. We need to handle null/NA > values properly. A simple data structure we can use is > {code} > val size: Int > val nullIndexes: Array[Int] > val notNullValues: Array[T] // specialized for primitive types > {code} > On the R side, we can use `readBin` and `writeBin` to read the entire vector > in a single method call. 
The speed seems reasonable (at the order of GB/s): > {code} > > x <- runif(1000) # 1e7, not 1e6 > > system.time(r <- writeBin(x, raw(0))) >user system elapsed > 0.036 0.021 0.059 > > > system.time(y <- readBin(r, double(), 1000)) >user system elapsed > 0.015 0.007 0.024 > {code} > This is just a proposal that needs to be discussed and formalized. But in > general, it should be feasible to obtain 20x or more performance gain. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
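The specialized column representation proposed in the description can be sketched as follows (names here are assumptions, not Spark's actual SerDe): primitive values live in a typed array and null positions are tracked separately, so a whole column is written with a handful of bulk writes instead of one high-level call per value, and the raw doubles could be read back in R with a single readBin call.

{code:title=DoubleColumnSketch.scala}
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Specialized representation of a nullable double column.
case class DoubleColumn(size: Int, nullIndexes: Array[Int], notNullValues: Array[Double])

object DoubleColumnSketch {
  def writeDoubleColumn(col: DoubleColumn, out: DataOutputStream): Unit = {
    out.writeInt(col.size)                        // total number of rows
    out.writeInt(col.nullIndexes.length)          // how many of them are NA
    col.nullIndexes.foreach(out.writeInt)         // positions of the NA values
    col.notNullValues.foreach(out.writeDouble)    // raw doubles, bulk-readable on the R side
  }

  def main(args: Array[String]): Unit = {
    val column = DoubleColumn(size = 4, nullIndexes = Array(2), notNullValues = Array(1.0, 2.0, 4.0))
    val buffer = new ByteArrayOutputStream()
    writeDoubleColumn(column, new DataOutputStream(buffer))
    println(s"serialized ${buffer.size()} bytes for ${column.size} rows")
  }
}
{code}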
[jira] [Updated] (SPARK-18935) Use Mesos "Dynamic Reservation" resource for Spark
[ https://issues.apache.org/jira/browse/SPARK-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jackyoh updated SPARK-18935: Affects Version/s: 2.0.1 > Use Mesos "Dynamic Reservation" resource for Spark > -- > > Key: SPARK-18935 > URL: https://issues.apache.org/jira/browse/SPARK-18935 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: jackyoh > > I'm running spark on Apache Mesos > Please follow these steps to reproduce the issue: > 1. First, run Mesos resource reserve: > curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d > resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]' > -X POST http://192.168.1.118:5050/master/reserve > 2. Then run spark-submit command: > ./spark-submit --class org.apache.spark.examples.SparkPi --master > mesos://192.168.1.118:5050 --conf spark.mesos.role=spark > ../examples/jars/spark-examples_2.11-2.0.2.jar 1 > And the console will keep loging same warning message as shown below: > 16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any > resources; check your cluster UI to ensure that workers are registered and > have sufficient resources -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18913) append to a table with special column names should work
[ https://issues.apache.org/jira/browse/SPARK-18913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-18913. - Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > append to a table with special column names should work > --- > > Key: SPARK-18913 > URL: https://issues.apache.org/jira/browse/SPARK-18913 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.1, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18935) Use Mesos "Dynamic Reservation" resource for Spark
jackyoh created SPARK-18935: --- Summary: Use Mesos "Dynamic Reservation" resource for Spark Key: SPARK-18935 URL: https://issues.apache.org/jira/browse/SPARK-18935 Project: Spark Issue Type: Bug Affects Versions: 2.0.2, 2.0.0 Reporter: jackyoh I'm running spark on Apache Mesos Please follow these steps to reproduce the issue: 1. First, run Mesos resource reserve: curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]' -X POST http://192.168.1.118:5050/master/reserve 2. Then run spark-submit command: ./spark-submit --class org.apache.spark.examples.SparkPi --master mesos://192.168.1.118:5050 --conf spark.mesos.role=spark ../examples/jars/spark-examples_2.11-2.0.2.jar 1 And the console will keep loging same warning message as shown below: 16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18912) append to a non-file-based data source table should detect columns number mismatch
[ https://issues.apache.org/jira/browse/SPARK-18912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-18912. - Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > append to a non-file-based data source table should detect columns number > mismatch > -- > > Key: SPARK-18912 > URL: https://issues.apache.org/jira/browse/SPARK-18912 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.1, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18899) append data to a bucketed table with mismatched bucketing should fail
[ https://issues.apache.org/jira/browse/SPARK-18899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-18899: Fix Version/s: 2.2.0 2.1.1 > append data to a bucketed table with mismatched bucketing should fail > - > > Key: SPARK-18899 > URL: https://issues.apache.org/jira/browse/SPARK-18899 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.1, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18899) append data to a bucketed table with mismatched bucketing should fail
[ https://issues.apache.org/jira/browse/SPARK-18899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-18899: Target Version/s: 2.1.1 (was: 2.1.1, 2.2.0) > append data to a bucketed table with mismatched bucketing should fail > - > > Key: SPARK-18899 > URL: https://issues.apache.org/jira/browse/SPARK-18899 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.1, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18899) append data to a bucketed table with mismatched bucketing should fail
[ https://issues.apache.org/jira/browse/SPARK-18899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-18899. - Resolution: Fixed Target Version/s: 2.1.1, 2.2.0 (was: 2.1.1) > append data to a bucketed table with mismatched bucketing should fail > - > > Key: SPARK-18899 > URL: https://issues.apache.org/jira/browse/SPARK-18899 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17755) Master may ask a worker to launch an executor before the worker actually got the response of registration
[ https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763089#comment-15763089 ] Shuai Lin commented on SPARK-17755: --- A (sort-of) similar problem for coarse grained scheduler backends is reported in https://issues.apache.org/jira/browse/SPARK-18820 . > Master may ask a worker to launch an executor before the worker actually got > the response of registration > - > > Key: SPARK-17755 > URL: https://issues.apache.org/jira/browse/SPARK-17755 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yin Huai >Assignee: Shixiong Zhu > > I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in > memory, serialized, replicated}}. Its log shows that Spark master asked the > worker to launch an executor before the worker actually got the response of > registration. So, the master knew that the worker had been registered. But, > the worker did not know if it self had been registered. > {code} > 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker > localhost:38262 with 1 cores, 1024.0 MB RAM > 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor > app-20160930145353-/1 on worker worker-20160930145353-localhost-38262 > 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO > StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 > on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores > 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO > StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on > hostPort localhost:38262 with 1 cores, 1024.0 MB RAM > 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master > (spark://localhost:46460) attempted to launch executor. > 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: > Successfully registered with master spark://localhost:46460 > {code} > Then, seems the worker did not launch any executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources
[ https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18761. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16189 [https://github.com/apache/spark/pull/16189] > Uncancellable / unkillable tasks may starve jobs of resources > > > Key: SPARK-18761 > URL: https://issues.apache.org/jira/browse/SPARK-18761 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.2.0 > > > Spark's current task cancellation / task killing mechanism is "best effort" > in the sense that some tasks may not be interruptible and may not respond to > their "killed" flags being set. If a significant fraction of a cluster's task > slots are occupied by tasks that have been marked as killed but remain > running then this can lead to a situation where new jobs and tasks are > starved of resources because zombie tasks are holding resources. > I propose to address this problem by introducing a "task reaper" mechanism in > executors to monitor tasks after they are marked for killing in order to > periodically re-attempt the task kill, capture and log stacktraces / warnings > if tasks do not exit in a timely manner, and, optionally, kill the entire > executor JVM if cancelled tasks cannot be killed within some timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
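For reference, a sketch of how the task-reaper behaviour described above is switched on. The configuration key names are taken from the linked pull request and may differ in the released version; the timeout values are arbitrary examples.

{code:title=TaskReaperConfSketch.scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.reaper.enabled", "true")        // monitor tasks after they are marked as killed
  .set("spark.task.reaper.pollingInterval", "10s") // how often to re-attempt the kill and log warnings
  .set("spark.task.reaper.killTimeout", "120s")    // optionally kill the executor JVM after this timeout
{code}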
[jira] [Assigned] (SPARK-18934) Writing to dynamic partitions does not preserve sort order if spill occurs
[ https://issues.apache.org/jira/browse/SPARK-18934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18934: Assignee: (was: Apache Spark) > Writing to dynamic partitions does not preserve sort order if spill occurs > -- > > Key: SPARK-18934 > URL: https://issues.apache.org/jira/browse/SPARK-18934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Junegunn Choi > > When writing to dynamic partitions, the task sorts the input data by the > partition key (also with bucket key if used), so that it can write to one > partition at a time using a single writer. And if spill occurs during the > process, {{UnsafeSorterSpillMerger}} is used to merge partial sequences of > data. > However, the merge process only considers the partition key, so that the sort > order within a partition specified via {{sortWithinPartitions}} or {{SORT > BY}} is not preserved. > We can reproduce the problem on Spark shell. Make sure to start shell in > local mode with small driver memory (e.g. 1G) so that spills occur. > {code} > // FileFormatWriter > sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2)) > .repartition(1, 'part).sortWithinPartitions("value") > .write.mode("overwrite").format("orc").partitionBy("part") > .saveAsTable("test_sort_within") > spark.read.table("test_sort_within").show > {code} > {noformat} > +---++ > | value|part| > +---++ > | 2| 0| > |8388610| 0| > | 4| 0| > |8388612| 0| > | 6| 0| > |8388614| 0| > | 8| 0| > |8388616| 0| > | 10| 0| > |8388618| 0| > | 12| 0| > |8388620| 0| > | 14| 0| > |8388622| 0| > | 16| 0| > |8388624| 0| > | 18| 0| > |8388626| 0| > | 20| 0| > |8388628| 0| > +---++ > {noformat} > We can confirm that the issue using orc dump. > {noformat} > > java -jar orc-tools-1.3.0-SNAPSHOT-uber.jar meta -d > > part=0/part-r-0-96c022f0-a173-40cc-b2e5-9d02fed4213e.snappy.orc | head > > -20 > {"value":2} > {"value":8388610} > {"value":4} > {"value":8388612} > {"value":6} > {"value":8388614} > {"value":8} > {"value":8388616} > {"value":10} > {"value":8388618} > {"value":12} > {"value":8388620} > {"value":14} > {"value":8388622} > {"value":16} > {"value":8388624} > {"value":18} > {"value":8388626} > {"value":20} > {"value":8388628} > {noformat} > {{SparkHiveDynamicPartitionWriterContainer}} has the same problem. > {code} > // Insert into an existing Hive table with dynamic partitions > // CREATE TABLE TEST_SORT_WITHIN (VALUE INT) PARTITIONED BY (PART INT) > STORED AS ORC > spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") > sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2)) > .repartition(1, 'part).sortWithinPartitions("value") > .write.mode("overwrite").insertInto("test_sort_within") > spark.read.table("test_sort_within").show > {code} > I was able to fix the problem by appending a numeric index column to the > sorting key which effectively makes the sort stable. I'll create a pull > request on GitHub but since I'm not really familiar with the internals of > Spark, I'm not sure if my approach is valid or idiomatic. So please let me > know if there are better ways to handle this, or if you want to address the > issue differently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18934) Writing to dynamic partitions does not preserve sort order if spill occurs
[ https://issues.apache.org/jira/browse/SPARK-18934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763010#comment-15763010 ] Apache Spark commented on SPARK-18934: -- User 'junegunn' has created a pull request for this issue: https://github.com/apache/spark/pull/16347 > Writing to dynamic partitions does not preserve sort order if spill occurs > -- > > Key: SPARK-18934 > URL: https://issues.apache.org/jira/browse/SPARK-18934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Junegunn Choi > > When writing to dynamic partitions, the task sorts the input data by the > partition key (also with bucket key if used), so that it can write to one > partition at a time using a single writer. And if spill occurs during the > process, {{UnsafeSorterSpillMerger}} is used to merge partial sequences of > data. > However, the merge process only considers the partition key, so that the sort > order within a partition specified via {{sortWithinPartitions}} or {{SORT > BY}} is not preserved. > We can reproduce the problem on Spark shell. Make sure to start shell in > local mode with small driver memory (e.g. 1G) so that spills occur. > {code} > // FileFormatWriter > sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2)) > .repartition(1, 'part).sortWithinPartitions("value") > .write.mode("overwrite").format("orc").partitionBy("part") > .saveAsTable("test_sort_within") > spark.read.table("test_sort_within").show > {code} > {noformat} > +---++ > | value|part| > +---++ > | 2| 0| > |8388610| 0| > | 4| 0| > |8388612| 0| > | 6| 0| > |8388614| 0| > | 8| 0| > |8388616| 0| > | 10| 0| > |8388618| 0| > | 12| 0| > |8388620| 0| > | 14| 0| > |8388622| 0| > | 16| 0| > |8388624| 0| > | 18| 0| > |8388626| 0| > | 20| 0| > |8388628| 0| > +---++ > {noformat} > We can confirm that the issue using orc dump. > {noformat} > > java -jar orc-tools-1.3.0-SNAPSHOT-uber.jar meta -d > > part=0/part-r-0-96c022f0-a173-40cc-b2e5-9d02fed4213e.snappy.orc | head > > -20 > {"value":2} > {"value":8388610} > {"value":4} > {"value":8388612} > {"value":6} > {"value":8388614} > {"value":8} > {"value":8388616} > {"value":10} > {"value":8388618} > {"value":12} > {"value":8388620} > {"value":14} > {"value":8388622} > {"value":16} > {"value":8388624} > {"value":18} > {"value":8388626} > {"value":20} > {"value":8388628} > {noformat} > {{SparkHiveDynamicPartitionWriterContainer}} has the same problem. > {code} > // Insert into an existing Hive table with dynamic partitions > // CREATE TABLE TEST_SORT_WITHIN (VALUE INT) PARTITIONED BY (PART INT) > STORED AS ORC > spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") > sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2)) > .repartition(1, 'part).sortWithinPartitions("value") > .write.mode("overwrite").insertInto("test_sort_within") > spark.read.table("test_sort_within").show > {code} > I was able to fix the problem by appending a numeric index column to the > sorting key which effectively makes the sort stable. I'll create a pull > request on GitHub but since I'm not really familiar with the internals of > Spark, I'm not sure if my approach is valid or idiomatic. So please let me > know if there are better ways to handle this, or if you want to address the > issue differently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18934) Writing to dynamic partitions does not preserve sort order if spill occurs
[ https://issues.apache.org/jira/browse/SPARK-18934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18934: Assignee: Apache Spark > Writing to dynamic partitions does not preserve sort order if spill occurs > -- > > Key: SPARK-18934 > URL: https://issues.apache.org/jira/browse/SPARK-18934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Junegunn Choi >Assignee: Apache Spark > > When writing to dynamic partitions, the task sorts the input data by the > partition key (also with bucket key if used), so that it can write to one > partition at a time using a single writer. And if spill occurs during the > process, {{UnsafeSorterSpillMerger}} is used to merge partial sequences of > data. > However, the merge process only considers the partition key, so that the sort > order within a partition specified via {{sortWithinPartitions}} or {{SORT > BY}} is not preserved. > We can reproduce the problem on Spark shell. Make sure to start shell in > local mode with small driver memory (e.g. 1G) so that spills occur. > {code} > // FileFormatWriter > sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2)) > .repartition(1, 'part).sortWithinPartitions("value") > .write.mode("overwrite").format("orc").partitionBy("part") > .saveAsTable("test_sort_within") > spark.read.table("test_sort_within").show > {code} > {noformat} > +---++ > | value|part| > +---++ > | 2| 0| > |8388610| 0| > | 4| 0| > |8388612| 0| > | 6| 0| > |8388614| 0| > | 8| 0| > |8388616| 0| > | 10| 0| > |8388618| 0| > | 12| 0| > |8388620| 0| > | 14| 0| > |8388622| 0| > | 16| 0| > |8388624| 0| > | 18| 0| > |8388626| 0| > | 20| 0| > |8388628| 0| > +---++ > {noformat} > We can confirm that the issue using orc dump. > {noformat} > > java -jar orc-tools-1.3.0-SNAPSHOT-uber.jar meta -d > > part=0/part-r-0-96c022f0-a173-40cc-b2e5-9d02fed4213e.snappy.orc | head > > -20 > {"value":2} > {"value":8388610} > {"value":4} > {"value":8388612} > {"value":6} > {"value":8388614} > {"value":8} > {"value":8388616} > {"value":10} > {"value":8388618} > {"value":12} > {"value":8388620} > {"value":14} > {"value":8388622} > {"value":16} > {"value":8388624} > {"value":18} > {"value":8388626} > {"value":20} > {"value":8388628} > {noformat} > {{SparkHiveDynamicPartitionWriterContainer}} has the same problem. > {code} > // Insert into an existing Hive table with dynamic partitions > // CREATE TABLE TEST_SORT_WITHIN (VALUE INT) PARTITIONED BY (PART INT) > STORED AS ORC > spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") > sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2)) > .repartition(1, 'part).sortWithinPartitions("value") > .write.mode("overwrite").insertInto("test_sort_within") > spark.read.table("test_sort_within").show > {code} > I was able to fix the problem by appending a numeric index column to the > sorting key which effectively makes the sort stable. I'll create a pull > request on GitHub but since I'm not really familiar with the internals of > Spark, I'm not sure if my approach is valid or idiomatic. So please let me > know if there are better ways to handle this, or if you want to address the > issue differently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18934) Writing to dynamic partitions does not preserve sort order if spill occurs
Junegunn Choi created SPARK-18934: - Summary: Writing to dynamic partitions does not preserve sort order if spill occurs Key: SPARK-18934 URL: https://issues.apache.org/jira/browse/SPARK-18934 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Junegunn Choi When writing to dynamic partitions, the task sorts the input data by the partition key (also with bucket key if used), so that it can write to one partition at a time using a single writer. And if spill occurs during the process, {{UnsafeSorterSpillMerger}} is used to merge partial sequences of data. However, the merge process only considers the partition key, so that the sort order within a partition specified via {{sortWithinPartitions}} or {{SORT BY}} is not preserved. We can reproduce the problem on Spark shell. Make sure to start shell in local mode with small driver memory (e.g. 1G) so that spills occur. {code} // FileFormatWriter sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2)) .repartition(1, 'part).sortWithinPartitions("value") .write.mode("overwrite").format("orc").partitionBy("part") .saveAsTable("test_sort_within") spark.read.table("test_sort_within").show {code} {noformat} +---++ | value|part| +---++ | 2| 0| |8388610| 0| | 4| 0| |8388612| 0| | 6| 0| |8388614| 0| | 8| 0| |8388616| 0| | 10| 0| |8388618| 0| | 12| 0| |8388620| 0| | 14| 0| |8388622| 0| | 16| 0| |8388624| 0| | 18| 0| |8388626| 0| | 20| 0| |8388628| 0| +---++ {noformat} We can confirm that the issue using orc dump. {noformat} > java -jar orc-tools-1.3.0-SNAPSHOT-uber.jar meta -d > part=0/part-r-0-96c022f0-a173-40cc-b2e5-9d02fed4213e.snappy.orc | head -20 {"value":2} {"value":8388610} {"value":4} {"value":8388612} {"value":6} {"value":8388614} {"value":8} {"value":8388616} {"value":10} {"value":8388618} {"value":12} {"value":8388620} {"value":14} {"value":8388622} {"value":16} {"value":8388624} {"value":18} {"value":8388626} {"value":20} {"value":8388628} {noformat} {{SparkHiveDynamicPartitionWriterContainer}} has the same problem. {code} // Insert into an existing Hive table with dynamic partitions // CREATE TABLE TEST_SORT_WITHIN (VALUE INT) PARTITIONED BY (PART INT) STORED AS ORC spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2)) .repartition(1, 'part).sortWithinPartitions("value") .write.mode("overwrite").insertInto("test_sort_within") spark.read.table("test_sort_within").show {code} I was able to fix the problem by appending a numeric index column to the sorting key which effectively makes the sort stable. I'll create a pull request on GitHub but since I'm not really familiar with the internals of Spark, I'm not sure if my approach is valid or idiomatic. So please let me know if there are better ways to handle this, or if you want to address the issue differently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
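As a plain-Scala illustration of the fix idea in the description (conceptual only, not the internal change the reporter proposes): appending a row index as a secondary sort key makes a sort on the partition key stable, so an order established earlier among rows with equal keys survives a merge.

{code:title=StableSortSketch.scala}
case class Record(partKey: Int, value: Int)

// Records already ordered by "value" within each partition key.
val records = Seq(Record(0, 2), Record(0, 4), Record(0, 6), Record(1, 1), Record(1, 3))

// Comparing only partKey may reorder equal-key records during a merge; adding the
// original index as a tie-breaker preserves the existing within-key order.
val stable = records.zipWithIndex
  .sortBy { case (rec, idx) => (rec.partKey, idx) }
  .map { case (rec, _) => rec }

assert(stable.filter(_.partKey == 0).map(_.value) == Seq(2, 4, 6))
{code}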
[jira] [Created] (SPARK-18933) Different log output between Terminal screen and stderr file
Sean Wong created SPARK-18933: - Summary: Different log output between Terminal screen and stderr file Key: SPARK-18933 URL: https://issues.apache.org/jira/browse/SPARK-18933 Project: Spark Issue Type: Bug Components: Deploy, Documentation, Web UI Affects Versions: 1.6.3 Environment: Yarn mode and standalone mode Reporter: Sean Wong First of all, I use the default log4j.properties in the Spark conf/ directory. But I found that the log output (e.g., INFO) differs between the terminal screen and the stderr file: some INFO logs appear in both of them, while others appear in only one of them. Why does this happen? Are the output logs supposed to be the same between the terminal screen and the stderr file? Then I did a test: I modified the source code in SparkContext.scala and added one line of log code, logInfo("This is textFile"), in the textFile function. However, after running an application, I found the log "This is textFile" shown on the terminal screen, but there was no such log in the stderr file. I am not sure whether this is a bug, so I hope you can resolve this question. Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16654) UI Should show blacklisted executors & nodes
[ https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16654: Assignee: Apache Spark (was: Jose Soltren) > UI Should show blacklisted executors & nodes > > > Key: SPARK-16654 > URL: https://issues.apache.org/jira/browse/SPARK-16654 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Assignee: Apache Spark > > SPARK-8425 will add the ability to blacklist entire executors and nodes to > deal w/ faulty hardware. However, without displaying it on the UI, it can be > hard to realize which executor is bad, and why tasks aren't getting scheduled > on certain executors. > As a first step, we should just show nodes and executors that are blacklisted > for the entire application (no need to show blacklisting for tasks & stages). > This should also ensure that blacklisting events get into the event logs for > the history server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16654) UI Should show blacklisted executors & nodes
[ https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16654: Assignee: Jose Soltren (was: Apache Spark) > UI Should show blacklisted executors & nodes > > > Key: SPARK-16654 > URL: https://issues.apache.org/jira/browse/SPARK-16654 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Assignee: Jose Soltren > > SPARK-8425 will add the ability to blacklist entire executors and nodes to > deal w/ faulty hardware. However, without displaying it on the UI, it can be > hard to realize which executor is bad, and why tasks aren't getting scheduled > on certain executors. > As a first step, we should just show nodes and executors that are blacklisted > for the entire application (no need to show blacklisting for tasks & stages). > This should also ensure that blacklisting events get into the event logs for > the history server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16654) UI Should show blacklisted executors & nodes
[ https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762908#comment-15762908 ] Apache Spark commented on SPARK-16654: -- User 'jsoltren' has created a pull request for this issue: https://github.com/apache/spark/pull/16346 > UI Should show blacklisted executors & nodes > > > Key: SPARK-16654 > URL: https://issues.apache.org/jira/browse/SPARK-16654 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Assignee: Jose Soltren > > SPARK-8425 will add the ability to blacklist entire executors and nodes to > deal w/ faulty hardware. However, without displaying it on the UI, it can be > hard to realize which executor is bad, and why tasks aren't getting scheduled > on certain executors. > As a first step, we should just show nodes and executors that are blacklisted > for the entire application (no need to show blacklisting for tasks & stages). > This should also ensure that blacklisting events get into the event logs for > the history server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18915) Return Nothing when Querying a Partitioned Data Source Table without Repairing it
[ https://issues.apache.org/jira/browse/SPARK-18915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li closed SPARK-18915. --- Resolution: Won't Fix > Return Nothing when Querying a Partitioned Data Source Table without > Repairing it > - > > Key: SPARK-18915 > URL: https://issues.apache.org/jira/browse/SPARK-18915 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Priority: Critical > > In Spark 2.1, if we create a parititoned data source table given a specified > path, it returns nothing when we try to query it. To get the data, we have to > manually issue a DDL to repair the table. > In Spark 2.0, it can return the data stored in the specified path, without > repairing the table. > Below is the output of Spark 2.1. > {noformat} > scala> spark.range(5).selectExpr("id as fieldOne", "id as > partCol").write.partitionBy("partCol").mode("overwrite").saveAsTable("test") > [Stage 0:==>(3 + 5) / > 8]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > > > scala> spark.sql("select * from test").show() > ++---+ > |fieldOne|partCol| > ++---+ > | 0| 0| > | 1| 1| > | 2| 2| > | 3| 3| > | 4| 4| > ++---+ > scala> spark.sql("desc formatted test").show(50, false) > ++--+---+ > |col_name|data_type > |comment| > ++--+---+ > |fieldOne|bigint > |null | > |partCol |bigint > |null | > |# Partition Information | > | | > |# col_name |data_type > |comment| > |partCol |bigint > |null | > || > | | > |# Detailed Table Information| > | | > |Database: |default > | | > |Owner: |xiaoli > | | > |Create Time:|Sat Dec 17 17:46:24 PST 2016 > | | > |Last Access Time: |Wed Dec 31 16:00:00 PST 1969 > | | > |Location: > |file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse/test| > | > |Table Type: |MANAGED > | | > |Table Parameters: | > | | > | transient_lastDdlTime |1482025584 > | | > || > | | > |# Storage Information | > | | > |SerDe Library: > |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | > | > |InputFormat: > |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | > | > |OutputFormat: > |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| > | > |Compressed: |No > | | > |Storage Desc Parameters:| > | | > | serialization.format |1 > | | > |Partition Provider: |Catalog > | | > ++-
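The manual repair step the description refers to would look like the sketch below, assuming the standard MSCK REPAIR TABLE syntax for recovering partition metadata; it is shown only to make the described workaround concrete.

{code}
// Recover the partition metadata for the table created above before querying it.
spark.sql("MSCK REPAIR TABLE test")
{code}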
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762805#comment-15762805 ] Barry Becker commented on SPARK-16845: -- I found a workaround that allows me to avoid the 64 KB error, but it still runs much slower than I expected. I switched to using a batch select statement instead of calls to withColumn in a loop. Here is an example of what I did. Old way: {code} stringCols.foreach(column => { val qCol = col(column) datasetDf = datasetDf .withColumn(column + CLEAN_SUFFIX, when(qCol.isNull, lit(MISSING)).otherwise(qCol)) }) {code} New way: {code} val replaceStringNull = udf((s: String) => if (s == null) MISSING else s) val newCols = datasetDf.columns.map(column => if (stringCols.contains(column)) replaceStringNull(col(column)).as(column + CLEAN_SUFFIX) else col(column)) datasetDf = datasetDf.select(newCols:_*) {code} > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: hejie > Attachments: error.txt.zip > > > I have a wide table (400 columns); when I try fitting the training data on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762785#comment-15762785 ] Roberto Mirizzi commented on SPARK-18492: - I would also like to understand if this error causes the query to be non-optimized, hence slower, or not. > GeneratedIterator grows beyond 64 KB > > > Key: SPARK-18492 > URL: https://issues.apache.org/jira/browse/SPARK-18492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: CentOS release 6.7 (Final) >Reporter: Norris Merritt > > spark-submit fails with ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(I[Lscala/collection/Iterator;)V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB > Error message is followed by a huge dump of generated source code. > The generated code declares 1,454 field sequences like the following: > /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1; > /* 037 */ private scala.Function1 project_catalystConverter1; > /* 038 */ private scala.Function1 project_converter1; > /* 039 */ private scala.Function1 project_converter2; > /* 040 */ private scala.Function2 project_udf1; > (many omitted lines) ... > /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1454; > /* 6090 */ private scala.Function1 project_catalystConverter1454; > /* 6091 */ private scala.Function1 project_converter1695; > /* 6092 */ private scala.Function1 project_udf1454; > It then proceeds to emit code for several methods (init, processNext) each of > which has totally repetitive sequences of statements pertaining to each of > the sequences of variables declared in the class. For example: > /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { > The reason that the 64KB JVM limit for code for a method is exceeded is > because the code generator is using an incredibly naive strategy. 
It emits a > sequence like the one shown below for each of the 1,454 groups of variables > shown above, in > /* 6132 */ this.project_udf = > (scala.Function1)project_scalaUDF.userDefinedFunc(); > /* 6133 */ this.project_scalaUDF1 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; > /* 6134 */ this.project_catalystConverter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); > /* 6135 */ this.project_converter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType()); > /* 6136 */ this.project_converter2 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType()); > It blows up after emitting 230 such sequences, while trying to emit the 231st: > /* 7282 */ this.project_udf230 = > (scala.Function2)project_scalaUDF230.userDefinedFunc(); > /* 7283 */ this.project_scalaUDF231 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240]; > /* 7284 */ this.project_catalystConverter231 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType()); > many omitted lines ... > Example of repetitive code sequences emitted for processNext method: > /* 12253 */ boolean project_isNull247 = project_result244 == null; > /* 12254 */ MapData project_value247 = null; > /* 12255 */ if (!project_isNull247) { > /* 12256 */ project_value247 = project_result244; > /* 12257 */ } > /* 12258 */ Object project_arg = sort_isNull5 ? null : > project_converter489.apply(sort_value5); > /* 12259 */ > /* 12260 */ ArrayData project_result249 = null; > /* 12261 */ try { > /* 12262 */ project_result249 = > (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg)); > /* 12263 */ } catch (Exception e) { > /* 12264 */ throw new > org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e); > /* 12265 */ } > /* 12266 */ > /* 12267 */ boolean project_isNull252 = project_result249 == null; > /* 12268 */ ArrayData project_value252 = null; > /* 12269 */ if (!project_isNull252) { > /* 12270 */ project_value252 = project_r
[jira] [Commented] (SPARK-18710) Add offset to GeneralizedLinearRegression models
[ https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762752#comment-15762752 ] Wayne Zhang commented on SPARK-18710: - [~yanboliang] It seems that I would need to change the case class 'Instance' to include offset... That could be disruptive if many other models also depend on this case class. Any suggestions regarding this? > Add offset to GeneralizedLinearRegression models > > > Key: SPARK-18710 > URL: https://issues.apache.org/jira/browse/SPARK-18710 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: features > Original Estimate: 10h > Remaining Estimate: 10h > > The current GeneralizedLinearRegression model does not support offset. The > offset can be useful to take into account exposure, or for testing the > incremental effect of new variables. It is possible to use weights in the current > environment to achieve the same effect of specifying offset for certain > models, e.g., Poisson & Binomial with log offset, but it is desirable to have the > offset option work with more general cases, e.g., negative offset or > offset that is hard to specify using weights (e.g., offset to the probability > rather than odds in logistic regression). > Effort would involve: > * update regression class to support offsetCol > * update IWLS to take the offset into account > * add test case for offset > I can start working on this if the community approves this feature. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
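One possible shape for this, sketched purely for illustration (the {{OffsetInstance}} name and its fields are assumptions, not an agreed design): add a GLM-specific instance type next to {{Instance}} and convert at the boundary, so other models that depend on {{Instance}} stay untouched.
{code}
package org.apache.spark.ml.feature

import org.apache.spark.ml.linalg.Vector

// Hypothetical case class carrying an offset alongside the usual fields.
private[ml] case class OffsetInstance(
    label: Double,
    weight: Double,
    offset: Double,
    features: Vector) {

  // Drop the offset where existing code paths expect a plain Instance.
  def toInstance: Instance = Instance(label, weight, features)
}
{code}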
[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762756#comment-15762756 ] Michael Armbrust commented on SPARK-17344: -- [KAFKA-4462] aims to give us backwards compatibility for clients, which will be great. The fact that there is a long-term plan here makes me less allergic to the idea of copy/pasting the 0.10.x {{Source}} and porting it to 0.8.0 as an interim solution for those who cannot upgrade yet. > Kafka 0.8 support for Structured Streaming > -- > > Key: SPARK-17344 > URL: https://issues.apache.org/jira/browse/SPARK-17344 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Frederick Reiss > > Design and implement Kafka 0.8-based sources and sinks for Structured > Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
[ https://issues.apache.org/jira/browse/SPARK-18928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18928. --- Resolution: Fixed Fix Version/s: 2.1.1 > FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation > --- > > Key: SPARK-18928 > URL: https://issues.apache.org/jira/browse/SPARK-18928 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.1.1 > > > Spark tasks respond to cancellation by checking > {{TaskContext.isInterrupted()}}, but this check is missing on a few critical > paths used in Spark SQL, including FileScanRDD, JDBCRDD, and > UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to > continue running and become zombies. > Here's an example: first, create a giant text file. In my case, I just > concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig > file. Then, run a really slow query over that file and try to cancel it: > {code} > spark.read.text("/tmp/words").selectExpr("value + value + value").collect() > {code} > This will sit and churn at 100% CPU for a minute or two because the task > isn't checking the interrupted flag. > The solution here is to add InterruptedIterator-style checks to a few > locations where they're currently missing in Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
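A minimal sketch of the kind of check this ticket asks for (the class name is made up and this is not the actual patch): wrap the underlying iterator so every {{hasNext}} call observes {{TaskContext.isInterrupted()}}.
{code}
import org.apache.spark.{TaskContext, TaskKilledException}

// Illustrative wrapper: fail fast once the task has been cancelled,
// instead of letting the delegate keep producing rows as a zombie task.
class CancellationCheckingIterator[T](context: TaskContext, delegate: Iterator[T])
  extends Iterator[T] {

  override def hasNext: Boolean = {
    if (context.isInterrupted()) {
      throw new TaskKilledException
    }
    delegate.hasNext
  }

  override def next(): T = delegate.next()
}
{code}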
[jira] [Updated] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
[ https://issues.apache.org/jira/browse/SPARK-18928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-18928: -- Fix Version/s: 2.2.0 > FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation > --- > > Key: SPARK-18928 > URL: https://issues.apache.org/jira/browse/SPARK-18928 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.1.1, 2.2.0 > > > Spark tasks respond to cancellation by checking > {{TaskContext.isInterrupted()}}, but this check is missing on a few critical > paths used in Spark SQL, including FileScanRDD, JDBCRDD, and > UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to > continue running and become zombies. > Here's an example: first, create a giant text file. In my case, I just > concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig > file. Then, run a really slow query over that file and try to cancel it: > {code} > spark.read.text("/tmp/words").selectExpr("value + value + value").collect() > {code} > This will sit and churn at 100% CPU for a minute or two because the task > isn't checking the interrupted flag. > The solution here is to add InterruptedIterator-style checks to a few > locations where they're currently missing in Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17344: - Target Version/s: 2.1.1 > Kafka 0.8 support for Structured Streaming > -- > > Key: SPARK-17344 > URL: https://issues.apache.org/jira/browse/SPARK-17344 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Frederick Reiss > > Design and implement Kafka 0.8-based sources and sinks for Structured > Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan
[ https://issues.apache.org/jira/browse/SPARK-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18908: - Target Version/s: 2.1.1 > It's hard for the user to see the failure if StreamExecution fails to create > the logical plan > - > > Key: SPARK-18908 > URL: https://issues.apache.org/jira/browse/SPARK-18908 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Critical > > If the logical plan fails to create, e.g., some Source options are invalid, > the user cannot use the code to detect the failure. The only place receiving > this error is Thread's UncaughtExceptionHandler. > This bug is because logicalPlan is lazy, and when we try to create > StreamingQueryException to wrap the exception thrown by creating logicalPlan, > it calls logicalPlan again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
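The lazy-val pitfall described above can be reproduced in plain Scala, independent of Spark. The sketch below only illustrates the failure mode: a lazy initializer that throws is not memoized, so the wrapping path re-triggers the original exception.
{code}
object LazyPlanPitfall {
  // Stands in for StreamExecution.logicalPlan failing because of a bad option.
  lazy val logicalPlan: String =
    throw new IllegalArgumentException("bad source option")

  def main(args: Array[String]): Unit = {
    try {
      println(logicalPlan)
    } catch {
      case e: Throwable =>
        // Intended to wrap `e`, but referencing `logicalPlan` here runs the
        // initializer again, so the IllegalArgumentException escapes unwrapped;
        // in StreamExecution this is why only the thread's
        // UncaughtExceptionHandler ever sees the failure.
        throw new RuntimeException("query failed: " + logicalPlan, e)
    }
  }
}
{code}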
[jira] [Created] (SPARK-18932) Partial aggregation for collect_set / collect_list
Michael Armbrust created SPARK-18932: Summary: Partial aggregation for collect_set / collect_list Key: SPARK-18932 URL: https://issues.apache.org/jira/browse/SPARK-18932 Project: Spark Issue Type: Improvement Components: SQL, Structured Streaming Reporter: Michael Armbrust The lack of partial aggregation here is blocking us from using these in streaming. It still won't be fast, but it would be nice to at least be able to use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
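For concreteness, the kind of streaming query this would unblock looks roughly like the sketch below (the socket source, host/port, and grouping key are made up for illustration); per this ticket, such an aggregation is not usable in streaming today.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, length}

val spark = SparkSession.builder().appName("collect-set-streaming").getOrCreate()
import spark.implicits._

// A streaming DataFrame with a single `value` string column.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Aggregation that depends on collect_set supporting partial aggregation.
val distinctPerKey = lines
  .groupBy(length($"value").as("len"))
  .agg(collect_set($"value").as("values"))
{code}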
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762719#comment-15762719 ] Dongjoon Hyun commented on SPARK-16845: --- Hi, All. I removed the target version since 2.1.0 is passed the vote. > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: hejie > Attachments: error.txt.zip > > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16845: -- Target Version/s: (was: 2.1.0) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: hejie > Attachments: error.txt.zip > > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18899) append data to a bucketed table with mismatched bucketing should fail
[ https://issues.apache.org/jira/browse/SPARK-18899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18899: -- Target Version/s: 2.1.1 (was: 2.1.0) > append data to a bucketed table with mismatched bucketing should fail > - > > Key: SPARK-18899 > URL: https://issues.apache.org/jira/browse/SPARK-18899 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18894: -- Target Version/s: 2.1.1 (was: 2.1.0) > Event time watermark delay threshold specified in months or years gives > incorrect results > - > > Key: SPARK-18894 > URL: https://issues.apache.org/jira/browse/SPARK-18894 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Internally we use CalendarInterval to parse the delay. Non-deterministic > intervals like "month" and "year" are handled in such a way that the generated > delay in milliseconds is 0 when delayThreshold is in months or years. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
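To make the report concrete, a sketch of the affected call (column names are assumptions): until this is fixed, a calendar-based delay such as "1 month" effectively becomes a zero delay, so spelling the delay in days or hours is the safer workaround.
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

def windowedCounts(events: DataFrame): DataFrame = {
  events
    // .withWatermark("eventTime", "1 month")  // affected: month/year delays collapse to 0 ms
    .withWatermark("eventTime", "30 days")     // workaround: use a fixed-duration delay
    .groupBy(window(col("eventTime"), "1 hour"))
    .count()
}
{code}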
[jira] [Updated] (SPARK-18913) append to a table with special column names should work
[ https://issues.apache.org/jira/browse/SPARK-18913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18913: -- Target Version/s: 2.1.1 (was: 2.1.0) > append to a table with special column names should work > --- > > Key: SPARK-18913 > URL: https://issues.apache.org/jira/browse/SPARK-18913 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18912) append to a non-file-based data source table should detect columns number mismatch
[ https://issues.apache.org/jira/browse/SPARK-18912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18912: -- Target Version/s: 2.1.1 (was: 2.1.0) > append to a non-file-based data source table should detect columns number > mismatch > -- > > Key: SPARK-18912 > URL: https://issues.apache.org/jira/browse/SPARK-18912 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18909) The error message in `ExpressionEncoder.toRow` and `fromRow` is too verbose
[ https://issues.apache.org/jira/browse/SPARK-18909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18909: -- Target Version/s: 2.1.1 (was: 2.1.0) > The error message in `ExpressionEncoder.toRow` and `fromRow` is too verbose > --- > > Key: SPARK-18909 > URL: https://issues.apache.org/jira/browse/SPARK-18909 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > > In `ExpressionEncoder.toRow` and `fromRow`, we will catch the exception and > put the treeString of serializer/deserializer expressions in the error > message. However, encoder can be very complex and the serializer/deserializer > expressions can be very large trees and blow up the log files(e.g. generate > over 500mb logs for this single error message.) > We should simplify it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18931) Create empty staging directory in partitioned table on insert
Egor Pahomov created SPARK-18931: Summary: Create empty staging directory in partitioned table on insert Key: SPARK-18931 URL: https://issues.apache.org/jira/browse/SPARK-18931 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Egor Pahomov CREATE TABLE temp.test_partitioning_4 ( num string ) PARTITIONED BY ( day string) stored as parquet On every INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) select day, count(*) as num from hss.session where year=2016 and month=4 group by day a new directory ".hive-staging_hive_2016-12-19_15-55-11_298_3412488541559534475-4" is created on HDFS. It's a big issue, because I insert every day and a bunch of empty directories on HDFS is very bad for HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762706#comment-15762706 ] Nicholas Chammas commented on SPARK-18492: -- Yup, I'm seeing the same high-level behavior as you, [~roberto.mirizzi]. I get the 64KB error and the dump of generated code; I also get the warning about codegen being disabled; but then everything appears to proceed normally. So I'm not sure what to make of the error. 🤔 > GeneratedIterator grows beyond 64 KB > > > Key: SPARK-18492 > URL: https://issues.apache.org/jira/browse/SPARK-18492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: CentOS release 6.7 (Final) >Reporter: Norris Merritt > > spark-submit fails with ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(I[Lscala/collection/Iterator;)V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB > Error message is followed by a huge dump of generated source code. > The generated code declares 1,454 field sequences like the following: > /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1; > /* 037 */ private scala.Function1 project_catalystConverter1; > /* 038 */ private scala.Function1 project_converter1; > /* 039 */ private scala.Function1 project_converter2; > /* 040 */ private scala.Function2 project_udf1; > (many omitted lines) ... > /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1454; > /* 6090 */ private scala.Function1 project_catalystConverter1454; > /* 6091 */ private scala.Function1 project_converter1695; > /* 6092 */ private scala.Function1 project_udf1454; > It then proceeds to emit code for several methods (init, processNext) each of > which has totally repetitive sequences of statements pertaining to each of > the sequences of variables declared in the class. For example: > /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { > The reason that the 64KB JVM limit for code for a method is exceeded is > because the code generator is using an incredibly naive strategy. 
It emits a > sequence like the one shown below for each of the 1,454 groups of variables > shown above, in > /* 6132 */ this.project_udf = > (scala.Function1)project_scalaUDF.userDefinedFunc(); > /* 6133 */ this.project_scalaUDF1 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; > /* 6134 */ this.project_catalystConverter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); > /* 6135 */ this.project_converter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType()); > /* 6136 */ this.project_converter2 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType()); > It blows up after emitting 230 such sequences, while trying to emit the 231st: > /* 7282 */ this.project_udf230 = > (scala.Function2)project_scalaUDF230.userDefinedFunc(); > /* 7283 */ this.project_scalaUDF231 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240]; > /* 7284 */ this.project_catalystConverter231 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType()); > many omitted lines ... > Example of repetitive code sequences emitted for processNext method: > /* 12253 */ boolean project_isNull247 = project_result244 == null; > /* 12254 */ MapData project_value247 = null; > /* 12255 */ if (!project_isNull247) { > /* 12256 */ project_value247 = project_result244; > /* 12257 */ } > /* 12258 */ Object project_arg = sort_isNull5 ? null : > project_converter489.apply(sort_value5); > /* 12259 */ > /* 12260 */ ArrayData project_result249 = null; > /* 12261 */ try { > /* 12262 */ project_result249 = > (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg)); > /* 12263 */ } catch (Exception e) { > /* 12264 */ throw new > org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e); > /* 12265 */ } > /* 12266 */ > /* 12267 */ boolean project_isNull252 = projec
[jira] [Assigned] (SPARK-17755) Master may ask a worker to launch an executor before the worker actually got the response of registration
[ https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-17755: Assignee: Shixiong Zhu > Master may ask a worker to launch an executor before the worker actually got > the response of registration > - > > Key: SPARK-17755 > URL: https://issues.apache.org/jira/browse/SPARK-17755 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yin Huai >Assignee: Shixiong Zhu > > I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in > memory, serialized, replicated}}. Its log shows that Spark master asked the > worker to launch an executor before the worker actually got the response of > registration. So, the master knew that the worker had been registered. But, > the worker did not know if it self had been registered. > {code} > 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker > localhost:38262 with 1 cores, 1024.0 MB RAM > 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor > app-20160930145353-/1 on worker worker-20160930145353-localhost-38262 > 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO > StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 > on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores > 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO > StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on > hostPort localhost:38262 with 1 cores, 1024.0 MB RAM > 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master > (spark://localhost:46460) attempted to launch executor. > 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: > Successfully registered with master spark://localhost:46460 > {code} > Then, seems the worker did not launch any executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762687#comment-15762687 ] Hossein Falaki commented on SPARK-18924: Would be good to think about this along with the efforts to have zero-copy data sharing between JVM and R. I think if we do that, a lot of the Ser/De problems in the data plane will go away. > Improve collect/createDataFrame performance in SparkR > - > > Key: SPARK-18924 > URL: https://issues.apache.org/jira/browse/SPARK-18924 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Xiangrui Meng >Priority: Critical > > SparkR has its own SerDe for data serialization between JVM and R. > The SerDe on the JVM side is implemented in: > * > [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala] > * > [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala] > The SerDe on the R side is implemented in: > * > [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R] > * > [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R] > The serialization between JVM and R suffers from huge storage and computation > overhead. For example, a short round trip of 1 million doubles surprisingly > took 3 minutes on my laptop: > {code} > > system.time(collect(createDataFrame(data.frame(x=runif(100) >user system elapsed > 14.224 0.582 189.135 > {code} > Collecting a medium-sized DataFrame to local and continuing with a local R > workflow is a use case we should pay attention to. SparkR will never be able > to cover all existing features from CRAN packages. It is also unnecessary for > Spark to do so because not all features need scalability. > Several factors contribute to the serialization overhead: > 1. The SerDe in R side is implemented using high-level R methods. > 2. DataFrame columns are not efficiently serialized, primitive type columns > in particular. > 3. Some overhead in the serialization protocol/impl. > 1) might be discussed before because R packages like rJava exist before > SparkR. I'm not sure whether we have a license issue in depending on those > libraries. Another option is to switch to low-level R'C interface or Rcpp, > which again might have license issue. I'm not an expert here. If we have to > implement our own, there still exist much space for improvement, discussed > below. > 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, > which collects rows to local and then constructs columns. However, > * it ignores column types and results boxing/unboxing overhead > * it collects all objects to driver and results high GC pressure > A relatively simple change is to implement specialized column builder based > on column types, primitive types in particular. We need to handle null/NA > values properly. A simple data structure we can use is > {code} > val size: Int > val nullIndexes: Array[Int] > val notNullValues: Array[T] // specialized for primitive types > {code} > On the R side, we can use `readBin` and `writeBin` to read the entire vector > in a single method call. 
The speed seems reasonable (at the order of GB/s): > {code} > > x <- runif(1000) # 1e7, not 1e6 > > system.time(r <- writeBin(x, raw(0))) >user system elapsed > 0.036 0.021 0.059 > > > system.time(y <- readBin(r, double(), 1000)) >user system elapsed > 0.015 0.007 0.024 > {code} > This is just a proposal that needs to be discussed and formalized. But in > general, it should be feasible to obtain 20x or more performance gain. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
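A sketch of the specialized column layout proposed above ({{size}}, {{nullIndexes}}, {{notNullValues}}), on the JVM side only; the class name and wire format below are assumptions for illustration, not the eventual implementation.
{code}
import java.io.DataOutputStream
import scala.collection.mutable.ArrayBuffer

// Collects one double column, tracking nulls by position so the non-null values
// can be stored and written as packed primitives that the R side could consume
// with a single readBin call.
class DoubleColumnBuilder {
  private val nullIndexes = ArrayBuffer.empty[Int]
  private val values = ArrayBuffer.empty[Double]
  private var size = 0

  def append(v: java.lang.Double): Unit = {
    if (v == null) nullIndexes += size else values += v.doubleValue()
    size += 1
  }

  // Layout: size, number of nulls, null positions, packed non-null doubles.
  def writeTo(out: DataOutputStream): Unit = {
    out.writeInt(size)
    out.writeInt(nullIndexes.length)
    nullIndexes.foreach(out.writeInt)
    values.foreach(out.writeDouble)
  }
}
{code}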
[jira] [Assigned] (SPARK-18929) Add Tweedie distribution in GLM
[ https://issues.apache.org/jira/browse/SPARK-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18929: Assignee: (was: Apache Spark) > Add Tweedie distribution in GLM > --- > > Key: SPARK-18929 > URL: https://issues.apache.org/jira/browse/SPARK-18929 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang > Labels: features > Original Estimate: 72h > Remaining Estimate: 72h > > I propose to add the full Tweedie family into the GeneralizedLinearRegression > model. The Tweedie family is characterized by a power variance function. > Currently supported distributions such as Gaussian, Poisson and Gamma > families are a special case of the > [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution]. > I propose to add support for the other distributions: > * compound Poisson: 1 < variancePower < 2. This one is widely used to model > zero-inflated continuous distributions. > * positive stable: variancePower > 2 and variancePower != 3. Used to model > extreme values. > * inverse Gaussian: variancePower = 3. > The Tweedie family is supported in most statistical packages such as R > (statmod), SAS, h2o etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18929) Add Tweedie distribution in GLM
[ https://issues.apache.org/jira/browse/SPARK-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18929: Assignee: Apache Spark > Add Tweedie distribution in GLM > --- > > Key: SPARK-18929 > URL: https://issues.apache.org/jira/browse/SPARK-18929 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Assignee: Apache Spark > Labels: features > Original Estimate: 72h > Remaining Estimate: 72h > > I propose to add the full Tweedie family into the GeneralizedLinearRegression > model. The Tweedie family is characterized by a power variance function. > Currently supported distributions such as Gaussian, Poisson and Gamma > families are a special case of the > [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution]. > I propose to add support for the other distributions: > * compound Poisson: 1 < variancePower < 2. This one is widely used to model > zero-inflated continuous distributions. > * positive stable: variancePower > 2 and variancePower != 3. Used to model > extreme values. > * inverse Gaussian: variancePower = 3. > The Tweedie family is supported in most statistical packages such as R > (statmod), SAS, h2o etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17755) Master may ask a worker to launch an executor before the worker actually got the response of registration
[ https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17755: Assignee: Apache Spark > Master may ask a worker to launch an executor before the worker actually got > the response of registration > - > > Key: SPARK-17755 > URL: https://issues.apache.org/jira/browse/SPARK-17755 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yin Huai >Assignee: Apache Spark > > I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in > memory, serialized, replicated}}. Its log shows that Spark master asked the > worker to launch an executor before the worker actually got the response of > registration. So, the master knew that the worker had been registered. But, > the worker did not know if it self had been registered. > {code} > 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker > localhost:38262 with 1 cores, 1024.0 MB RAM > 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor > app-20160930145353-/1 on worker worker-20160930145353-localhost-38262 > 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO > StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 > on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores > 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO > StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on > hostPort localhost:38262 with 1 cores, 1024.0 MB RAM > 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master > (spark://localhost:46460) attempted to launch executor. > 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: > Successfully registered with master spark://localhost:46460 > {code} > Then, seems the worker did not launch any executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18929) Add Tweedie distribution in GLM
[ https://issues.apache.org/jira/browse/SPARK-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762684#comment-15762684 ] Apache Spark commented on SPARK-18929: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/16344 > Add Tweedie distribution in GLM > --- > > Key: SPARK-18929 > URL: https://issues.apache.org/jira/browse/SPARK-18929 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang > Labels: features > Original Estimate: 72h > Remaining Estimate: 72h > > I propose to add the full Tweedie family into the GeneralizedLinearRegression > model. The Tweedie family is characterized by a power variance function. > Currently supported distributions such as Gaussian, Poisson and Gamma > families are a special case of the > [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution]. > I propose to add support for the other distributions: > * compound Poisson: 1 < variancePower < 2. This one is widely used to model > zero-inflated continuous distributions. > * positive stable: variancePower > 2 and variancePower != 3. Used to model > extreme values. > * inverse Gaussian: variancePower = 3. > The Tweedie family is supported in most statistical packages such as R > (statmod), SAS, h2o etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17755) Master may ask a worker to launch an executor before the worker actually got the response of registration
[ https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17755: Assignee: (was: Apache Spark) > Master may ask a worker to launch an executor before the worker actually got > the response of registration > - > > Key: SPARK-17755 > URL: https://issues.apache.org/jira/browse/SPARK-17755 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yin Huai > > I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in > memory, serialized, replicated}}. Its log shows that Spark master asked the > worker to launch an executor before the worker actually got the response of > registration. So, the master knew that the worker had been registered. But, > the worker did not know if it self had been registered. > {code} > 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker > localhost:38262 with 1 cores, 1024.0 MB RAM > 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor > app-20160930145353-/1 on worker worker-20160930145353-localhost-38262 > 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO > StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 > on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores > 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO > StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on > hostPort localhost:38262 with 1 cores, 1024.0 MB RAM > 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master > (spark://localhost:46460) attempted to launch executor. > 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: > Successfully registered with master spark://localhost:46460 > {code} > Then, seems the worker did not launch any executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17755) Master may ask a worker to launch an executor before the worker actually got the response of registration
[ https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762683#comment-15762683 ] Apache Spark commented on SPARK-17755: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16345 > Master may ask a worker to launch an executor before the worker actually got > the response of registration > - > > Key: SPARK-17755 > URL: https://issues.apache.org/jira/browse/SPARK-17755 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yin Huai > > I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in > memory, serialized, replicated}}. Its log shows that Spark master asked the > worker to launch an executor before the worker actually got the response of > registration. So, the master knew that the worker had been registered. But, > the worker did not know if it self had been registered. > {code} > 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker > localhost:38262 with 1 cores, 1024.0 MB RAM > 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor > app-20160930145353-/1 on worker worker-20160930145353-localhost-38262 > 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO > StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 > on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores > 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO > StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on > hostPort localhost:38262 with 1 cores, 1024.0 MB RAM > 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master > (spark://localhost:46460) attempted to launch executor. > 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: > Successfully registered with master spark://localhost:46460 > {code} > Then, seems the worker did not launch any executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.
Egor Pahomov created SPARK-18930: Summary: Inserting in partitioned table - partitioned field should be last in select statement. Key: SPARK-18930 URL: https://issues.apache.org/jira/browse/SPARK-18930 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Egor Pahomov CREATE TABLE temp.test_partitioning_4 ( num string ) PARTITIONED BY ( day string) stored as parquet INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) select day, count(*) as num from hss.session where year=2016 and month=4 group by day Resulting layout on HDFS: /temp.db/test_partitioning_3/day=62456298, emp.db/test_partitioning_3/day=69094345 As you can imagine, these numbers are the record counts (num), not days. But! When I do select * from temp.test_partitioning_4, the data is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10413) ML models should support prediction on single instances
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10413: -- Summary: ML models should support prediction on single instances (was: Model should support prediction on single instance) > ML models should support prediction on single instances > --- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15572) ML persistence in R format: compatibility with other languages
[ https://issues.apache.org/jira/browse/SPARK-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15572: -- Summary: ML persistence in R format: compatibility with other languages (was: MLlib in R format: compatibility with other languages) > ML persistence in R format: compatibility with other languages > -- > > Key: SPARK-15572 > URL: https://issues.apache.org/jira/browse/SPARK-15572 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Joseph K. Bradley > > Currently, models saved in R cannot be loaded easily into other languages. > This is because R saves extra metadata (feature names) alongside the model. > We should fix this issue so that models can be transferred seamlessly between > languages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18929) Add Tweedie distribution in GLM
Wayne Zhang created SPARK-18929: --- Summary: Add Tweedie distribution in GLM Key: SPARK-18929 URL: https://issues.apache.org/jira/browse/SPARK-18929 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.0.2 Reporter: Wayne Zhang I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as Gaussian, Poisson and Gamma families are a special case of the [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution]. I propose to add support for the other distributions: * compound Poisson: 1 < variancePower < 2. This one is widely used to model zero-inflated continuous distributions. * positive stable: variancePower > 2 and variancePower != 3. Used to model extreme values. * inverse Gaussian: variancePower = 3. The Tweedie family is supported in most statistical packages such as R (statmod), SAS, h2o etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
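For readers less familiar with the family: Tweedie distributions are characterized by the power variance function Var(Y) = phi * mu^p, so the special cases in the proposal correspond to particular values of the variance power, as in the plain-Scala sketch below (illustrative only, not a proposed API).
{code}
// Unit variance function of the Tweedie family: Var(Y) = phi * mu^p
// (phi is the dispersion parameter, factored out here).
def tweedieVariance(mu: Double, variancePower: Double): Double =
  math.pow(mu, variancePower)

// variancePower = 0      -> Gaussian (constant variance)
// variancePower = 1      -> Poisson
// 1 < variancePower < 2  -> compound Poisson (zero-inflated continuous data)
// variancePower = 2      -> Gamma
// variancePower = 3      -> inverse Gaussian
{code}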
[jira] [Comment Edited] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762628#comment-15762628 ] Roberto Mirizzi edited comment on SPARK-18492 at 12/19/16 11:32 PM: I'm having exactly the same issue on Spark 2.0.2. I had it also on Spark 2.0.0 and Spark 2.0.1. My exception is: {{JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB}} It happens when I try to do many transformations on a given dataset, like multiple {{.select(...).select(...)}} with multiple operations inside the select (like {{when(...).otherwise(...)}}, or {{regexp_extract}}). The exception outputs about 15k lines of Java code, like: {code:java} 16/12/19 07:16:43 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIterator(references); /* 003 */ } /* 004 */ /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private boolean agg_initAgg; /* 008 */ private org.apache.spark.sql.execution.aggregate.HashAggregateExec agg_plan; /* 009 */ private org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap; /* 010 */ private org.apache.spark.sql.execution.UnsafeKVExternalSorter agg_sorter; /* 011 */ private org.apache.spark.unsafe.KVIterator agg_mapIter; /* 012 */ private org.apache.spark.sql.execution.metric.SQLMetric agg_peakMemory; /* 013 */ private org.apache.spark.sql.execution.metric.SQLMetric agg_spillSize; /* 014 */ private scala.collection.Iterator inputadapter_input; /* 015 */ private UnsafeRow agg_result; /* 016 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder; /* 017 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter; /* 018 */ private UTF8String agg_lastRegex; {code} However, it looks like the execution continues successfully. 
This is part of the stack trace: {code:java} org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) at org.codehaus.janino.CodeContext.write(CodeContext.java:854) at org.codehaus.janino.CodeContext.writeShort(CodeContext.java:959) at org.codehaus.janino.UnitCompiler.writeConstantClassInfo(UnitCompiler.java:10274) at org.codehaus.janino.UnitCompiler.tryNarrowingReferenceConversion(UnitCompiler.java:9725) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3833) at org.codehaus.janino.UnitCompiler.access$6400(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$10.visitCast(UnitCompiler.java:3258) at org.codehaus.janino.Java$Cast.accept(Java.java:3802) at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3868) at org.codehaus.janino.UnitCompiler.access$8600(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$10.visitParenthesizedExpression(UnitCompiler.java:3286) {code} And after that I get a warning: {code:java} 16/12/19 07:16:45 WARN WholeStageCodegenExec: Whole-stage codegen disabled for this plan: *HashAggregate(keys=... {code} was (Author: roberto.mirizzi): I'm having exactly the same issue on Spark 2.0.2. I had it also on Spark 2.0.0 and Spark 2.0.1. My exception is: {{JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB}} It happens when I try to do many transformations on a given dataset, like multiple {{.select(...).select(...)}} with multiple operations inside the select (like {{when(...).otherwise(...)}}, or {{regexp_extract}}). The exception outputs about 15k lines of Java code, like: {code:java} 16/12/19 07:16:43 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIterator(references); /* 003 */ } /* 004 */ /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private boolean agg_initAgg; /* 008 */ private org.apache.spark.sql.execution.ag
[jira] [Comment Edited] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762628#comment-15762628 ] Roberto Mirizzi edited comment on SPARK-18492 at 12/19/16 11:29 PM: I'm having exactly the same issue on Spark 2.0.2. I had it also on Spark 2.0.0 and Spark 2.0.1. My exception is: {{JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB}} It happens when I try to do many transformations on a given dataset, like multiple {{.select(...).select(...)}} with multiple operations inside the select (like {{when(...).otherwise(...)}}, or {{regexp_extract}}). The exception outputs about 15k lines of Java code, like: {code:java} 16/12/19 07:16:43 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIterator(references); /* 003 */ } /* 004 */ /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private boolean agg_initAgg; /* 008 */ private org.apache.spark.sql.execution.aggregate.HashAggregateExec agg_plan; /* 009 */ private org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap; /* 010 */ private org.apache.spark.sql.execution.UnsafeKVExternalSorter agg_sorter; /* 011 */ private org.apache.spark.unsafe.KVIterator agg_mapIter; /* 012 */ private org.apache.spark.sql.execution.metric.SQLMetric agg_peakMemory; /* 013 */ private org.apache.spark.sql.execution.metric.SQLMetric agg_spillSize; /* 014 */ private scala.collection.Iterator inputadapter_input; /* 015 */ private UnsafeRow agg_result; /* 016 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder; /* 017 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter; /* 018 */ private UTF8String agg_lastRegex; {code} However, it looks like the execution continues successfully. was (Author: roberto.mirizzi): I'm having exactly the same issue on Spark 2.0.2. I had it also on Spark 2.0.0 and Spark 2.0.1. My exception is: bq. JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB It happens when I try to do many transformations on a given dataset, like multiple {{.select(...).select(...)}} with multiple operations inside the select (like {{when(...).otherwise(...)}}, or {{regexp_extract}}). 
The exception outputs about 15k lines of Java code, like: {code:java} 16/12/19 07:16:43 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIterator(references); /* 003 */ } /* 004 */ /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private boolean agg_initAgg; /* 008 */ private org.apache.spark.sql.execution.aggregate.HashAggregateExec agg_plan; /* 009 */ private org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap; /* 010 */ private org.apache.spark.sql.execution.UnsafeKVExternalSorter agg_sorter; /* 011 */ private org.apache.spark.unsafe.KVIterator agg_mapIter; /* 012 */ private org.apache.spark.sql.execution.metric.SQLMetric agg_peakMemory; /* 013 */ private org.apache.spark.sql.execution.metric.SQLMetric agg_spillSize; /* 014 */ private scala.collection.Iterator inputadapter_input; /* 015 */ private UnsafeRow agg_result; /* 016 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder; /* 017 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter; /* 018 */ private UTF8String agg_lastRegex; {code} However, it looks like the execution continues successfully. > GeneratedIterator grows beyond 64 KB > > > Key: SPARK-18492 > URL: https://issues.apache.org/jira/browse/SPARK-18492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: CentOS release 6.7 (Final) >Reporter: Norris Merritt > > spark-submit fails with ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(I[Lscala/collection/Iterator;)V" of class > "or
[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762628#comment-15762628 ] Roberto Mirizzi commented on SPARK-18492: - I'm having exactly the same issue on Spark 2.0.2. I had it also on Spark 2.0.0 and Spark 2.0.1. My exception is: bq. JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB It happens when I try to do many transformations on a given dataset, like multiple {{.select(...).select(...)}} with multiple operations inside the select (like {{when(...).otherwise(...)}}, or {{regexp_extract}}). The exception outputs about 15k lines of Java code, like: {code:java} 16/12/19 07:16:43 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIterator(references); /* 003 */ } /* 004 */ /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private boolean agg_initAgg; /* 008 */ private org.apache.spark.sql.execution.aggregate.HashAggregateExec agg_plan; /* 009 */ private org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap; /* 010 */ private org.apache.spark.sql.execution.UnsafeKVExternalSorter agg_sorter; /* 011 */ private org.apache.spark.unsafe.KVIterator agg_mapIter; /* 012 */ private org.apache.spark.sql.execution.metric.SQLMetric agg_peakMemory; /* 013 */ private org.apache.spark.sql.execution.metric.SQLMetric agg_spillSize; /* 014 */ private scala.collection.Iterator inputadapter_input; /* 015 */ private UnsafeRow agg_result; /* 016 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder; /* 017 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter; /* 018 */ private UTF8String agg_lastRegex; {code} However, it looks like the execution continues successfully. > GeneratedIterator grows beyond 64 KB > > > Key: SPARK-18492 > URL: https://issues.apache.org/jira/browse/SPARK-18492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: CentOS release 6.7 (Final) >Reporter: Norris Merritt > > spark-submit fails with ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(I[Lscala/collection/Iterator;)V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB > Error message is followed by a huge dump of generated source code. > The generated code declares 1,454 field sequences like the following: > /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1; > /* 037 */ private scala.Function1 project_catalystConverter1; > /* 038 */ private scala.Function1 project_converter1; > /* 039 */ private scala.Function1 project_converter2; > /* 040 */ private scala.Function2 project_udf1; > (many omitted lines) ... 
> /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1454; > /* 6090 */ private scala.Function1 project_catalystConverter1454; > /* 6091 */ private scala.Function1 project_converter1695; > /* 6092 */ private scala.Function1 project_udf1454; > It then proceeds to emit code for several methods (init, processNext) each of > which has totally repetitive sequences of statements pertaining to each of > the sequences of variables declared in the class. For example: > /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { > The reason that the 64KB JVM limit for code for a method is exceeded is > because the code generator is using an incredibly naive strategy. It emits a > sequence like the one shown below for each of the 1,454 groups of variables > shown above, in > /* 6132 */ this.project_udf = > (scala.Function1)project_scalaUDF.userDefinedFunc(); > /* 6133 */ this.project_scalaUDF1 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; > /* 6134 */ this.project_catalystConverter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); > /* 6135 */ this.project_converter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren
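As a rough illustration of the query shape described in the comment above, the hypothetical sketch below chains many {{select(...)}} calls with {{when(...).otherwise(...)}} and {{regexp_extract}}. The column names, regex, and loop count are invented for illustration (this is not the reporter's actual job); with enough chained projections the generated methods of {{GeneratedIterator}} can grow past the 64 KB JVM limit.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, regexp_extract, when}

// Hypothetical sketch of many chained selects with when/otherwise and
// regexp_extract; names and iteration count are made up for illustration.
object WideCodegenSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("codegen-64kb-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    var df = Seq(("a-1", "x"), ("b-2", "y")).toDF("raw", "tag")

    // Each pass adds another projection to the same whole-stage-codegen
    // region; with enough passes the generated init()/processNext() methods
    // can exceed the 64 KB per-method limit.
    for (i <- 1 to 50) {
      df = df.select(
        col("*"),
        when(col("tag") === "x", regexp_extract(col("raw"), "([a-z])-(\\d)", 2))
          .otherwise(lit(null)).as(s"extract_$i")
      )
    }

    df.explain()  // inspect the single wide projection
    df.show()
    spark.stop()
  }
}
{code}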
[jira] [Issue Comment Deleted] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roberto Mirizzi updated SPARK-18492: Comment: was deleted (was: I'm having exactly the same issue. My exception is: [JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB] ) > GeneratedIterator grows beyond 64 KB > > > Key: SPARK-18492 > URL: https://issues.apache.org/jira/browse/SPARK-18492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: CentOS release 6.7 (Final) >Reporter: Norris Merritt > > spark-submit fails with ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(I[Lscala/collection/Iterator;)V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB > Error message is followed by a huge dump of generated source code. > The generated code declares 1,454 field sequences like the following: > /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1; > /* 037 */ private scala.Function1 project_catalystConverter1; > /* 038 */ private scala.Function1 project_converter1; > /* 039 */ private scala.Function1 project_converter2; > /* 040 */ private scala.Function2 project_udf1; > (many omitted lines) ... > /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1454; > /* 6090 */ private scala.Function1 project_catalystConverter1454; > /* 6091 */ private scala.Function1 project_converter1695; > /* 6092 */ private scala.Function1 project_udf1454; > It then proceeds to emit code for several methods (init, processNext) each of > which has totally repetitive sequences of statements pertaining to each of > the sequences of variables declared in the class. For example: > /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { > The reason that the 64KB JVM limit for code for a method is exceeded is > because the code generator is using an incredibly naive strategy. 
It emits a > sequence like the one shown below for each of the 1,454 groups of variables > shown above, in > /* 6132 */ this.project_udf = > (scala.Function1)project_scalaUDF.userDefinedFunc(); > /* 6133 */ this.project_scalaUDF1 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; > /* 6134 */ this.project_catalystConverter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); > /* 6135 */ this.project_converter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType()); > /* 6136 */ this.project_converter2 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType()); > It blows up after emitting 230 such sequences, while trying to emit the 231st: > /* 7282 */ this.project_udf230 = > (scala.Function2)project_scalaUDF230.userDefinedFunc(); > /* 7283 */ this.project_scalaUDF231 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240]; > /* 7284 */ this.project_catalystConverter231 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType()); > many omitted lines ... > Example of repetitive code sequences emitted for processNext method: > /* 12253 */ boolean project_isNull247 = project_result244 == null; > /* 12254 */ MapData project_value247 = null; > /* 12255 */ if (!project_isNull247) { > /* 12256 */ project_value247 = project_result244; > /* 12257 */ } > /* 12258 */ Object project_arg = sort_isNull5 ? null : > project_converter489.apply(sort_value5); > /* 12259 */ > /* 12260 */ ArrayData project_result249 = null; > /* 12261 */ try { > /* 12262 */ project_result249 = > (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg)); > /* 12263 */ } catch (Exception e) { > /* 12264 */ throw new > org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e); > /* 12265 */ } > /* 12266 */ > /* 12267 */ boolean project_isNull252 = project_result249 == null; > /* 12268 */ ArrayData project_value252 = null; > /* 12269 */
[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762609#comment-15762609 ] Roberto Mirizzi commented on SPARK-18492: - I'm having exactly the same issue. My exception is: [JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB] > GeneratedIterator grows beyond 64 KB > > > Key: SPARK-18492 > URL: https://issues.apache.org/jira/browse/SPARK-18492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: CentOS release 6.7 (Final) >Reporter: Norris Merritt > > spark-submit fails with ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(I[Lscala/collection/Iterator;)V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB > Error message is followed by a huge dump of generated source code. > The generated code declares 1,454 field sequences like the following: > /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1; > /* 037 */ private scala.Function1 project_catalystConverter1; > /* 038 */ private scala.Function1 project_converter1; > /* 039 */ private scala.Function1 project_converter2; > /* 040 */ private scala.Function2 project_udf1; > (many omitted lines) ... > /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1454; > /* 6090 */ private scala.Function1 project_catalystConverter1454; > /* 6091 */ private scala.Function1 project_converter1695; > /* 6092 */ private scala.Function1 project_udf1454; > It then proceeds to emit code for several methods (init, processNext) each of > which has totally repetitive sequences of statements pertaining to each of > the sequences of variables declared in the class. For example: > /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { > The reason that the 64KB JVM limit for code for a method is exceeded is > because the code generator is using an incredibly naive strategy. 
It emits a > sequence like the one shown below for each of the 1,454 groups of variables > shown above, in > /* 6132 */ this.project_udf = > (scala.Function1)project_scalaUDF.userDefinedFunc(); > /* 6133 */ this.project_scalaUDF1 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; > /* 6134 */ this.project_catalystConverter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); > /* 6135 */ this.project_converter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType()); > /* 6136 */ this.project_converter2 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType()); > It blows up after emitting 230 such sequences, while trying to emit the 231st: > /* 7282 */ this.project_udf230 = > (scala.Function2)project_scalaUDF230.userDefinedFunc(); > /* 7283 */ this.project_scalaUDF231 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240]; > /* 7284 */ this.project_catalystConverter231 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType()); > many omitted lines ... > Example of repetitive code sequences emitted for processNext method: > /* 12253 */ boolean project_isNull247 = project_result244 == null; > /* 12254 */ MapData project_value247 = null; > /* 12255 */ if (!project_isNull247) { > /* 12256 */ project_value247 = project_result244; > /* 12257 */ } > /* 12258 */ Object project_arg = sort_isNull5 ? null : > project_converter489.apply(sort_value5); > /* 12259 */ > /* 12260 */ ArrayData project_result249 = null; > /* 12261 */ try { > /* 12262 */ project_result249 = > (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg)); > /* 12263 */ } catch (Exception e) { > /* 12264 */ throw new > org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e); > /* 12265 */ } > /* 12266 */ > /* 12267 */ boolean project_isNull252 = project_result249 == null; > /* 12268 */ ArrayData project_value252
[jira] [Commented] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762559#comment-15762559 ] Apache Spark commented on SPARK-18588: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16282 > KafkaSourceStressForDontFailOnDataLossSuite is flaky > > > Key: SPARK-18588 > URL: https://issues.apache.org/jira/browse/SPARK-18588 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Herman van Hovell >Assignee: Shixiong Zhu > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite&test_name=stress+test+for+failOnDataLoss%3Dfalse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18588: Assignee: Shixiong Zhu (was: Apache Spark) > KafkaSourceStressForDontFailOnDataLossSuite is flaky > > > Key: SPARK-18588 > URL: https://issues.apache.org/jira/browse/SPARK-18588 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Herman van Hovell >Assignee: Shixiong Zhu > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite&test_name=stress+test+for+failOnDataLoss%3Dfalse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18588: Assignee: Apache Spark (was: Shixiong Zhu) > KafkaSourceStressForDontFailOnDataLossSuite is flaky > > > Key: SPARK-18588 > URL: https://issues.apache.org/jira/browse/SPARK-18588 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Herman van Hovell >Assignee: Apache Spark > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite&test_name=stress+test+for+failOnDataLoss%3Dfalse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762555#comment-15762555 ] Ed Tyrrill commented on SPARK-15544: I'm going to add that this is very easy to reproduce. It will happen reliably if you shut down the zookeeper node that is currently the leader. I configured systemd to automatically restart the spark master, and while the spark master process starts, the spark master on all three nodes doesn't really work, and continually tries to reconnect to zookeeper until I bring up the shutdown zookeeper node. Spark should be able to work with two of the three zookeeper nodes, but instead it log message like this repeatedly every couple seconds on all three spark master nodes until I bring back up the one zookeeper node that I shut down, zk02: 2016-12-19 14:31:10.175 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - Opening socket connection to server zk01/10.0.xx.xx:. Will not attempt to authenticate using SASL (unknown error) 2016-12-19 14:31:10.176 INFO org.apache.zookeeper.ClientCnxn.primeConnection - Socket connection established to zk01/10.0.xx.xx:, initiating session 2016-12-19 14:31:10.177 INFO org.apache.zookeeper.ClientCnxn.run - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect 2016-12-19 14:31:10.724 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - Opening socket connection to server zk02/10.0.xx.xx:. Will not attempt to authenticate using SASL (unknown error) 2016-12-19 14:31:10.725 WARN org.apache.zookeeper.ClientCnxn.run - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) 2016-12-19 14:31:10.828 INFO org.apache.zookeeper.ClientCnxn.logStartConnect - Opening socket connection to server zk03/10.0.xx.xx:. Will not attempt to authenticate using SASL (unknown error) 2016-12-19 14:31:10.830 INFO org.apache.zookeeper.ClientCnxn.primeConnection - Socket connection established to zk03/10.0.xx.xx:, initiating session Zookeeper itself has selected a new leader, and Kafka, which also uses zookeeper, doesn't have any trouble during this time. Also important to note, if you shut down a non-leader zookeeper node then spark doesn't have any trouble either. > Bouncing Zookeeper node causes Active spark master to exit > -- > > Key: SPARK-15544 > URL: https://issues.apache.org/jira/browse/SPARK-15544 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04. Zookeeper 3.4.6 with 3-node quorum >Reporter: Steven Lowenthal > > Shutting Down a single zookeeper node caused spark master to exit. The > master should have connected to a second zookeeper node. 
> {code:title=log output} > 16/05/25 18:21:28 INFO master.Master: Launching executor > app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138 > 16/05/25 18:21:28 INFO master.Master: Launching executor > app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129 > 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data > from server sessionid 0x154dfc0426b0054, likely server has closed socket, > closing socket connection and attempting reconnect > 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data > from server sessionid 0x254c701f28d0053, likely server has closed socket, > closing socket connection and attempting reconnect > 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED > 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED > 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost > leadership > 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master > shutting down. }} > {code} > spark-env.sh: > {code:title=spark-env.sh} > export SPARK_LOCAL_DIRS=/ephemeral/spark/local > export SPARK_WORKER_DIR=/ephemeral/spark/work > export SPARK_LOG_DIR=/var/log/spark > export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop > export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER > -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181" > export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true" > {code} -- This message was s
[jira] [Resolved] (SPARK-18836) Serialize Task Metrics once per stage
[ https://issues.apache.org/jira/browse/SPARK-18836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-18836. Resolution: Fixed Assignee: Shivaram Venkataraman Fix Version/s: 1.3.0 > Serialize Task Metrics once per stage > - > > Key: SPARK-18836 > URL: https://issues.apache.org/jira/browse/SPARK-18836 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman > Fix For: 1.3.0 > > > Right now we serialize the empty task metrics once per task -- Since this is > shared across all tasks we could use the same serialized task metrics across > all tasks of a stage -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
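To make the proposed optimization concrete, here is a generic sketch (plain JVM serialization with a stand-in {{EmptyMetrics}} type, not Spark's internal metrics or closure-serializer code) of serializing a value that is identical for every task once per stage and reusing the same bytes for each task.
{code:scala}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Generic illustration: serialize once, reuse the byte array per task.
case class EmptyMetrics(values: Map[String, Long])

object SerializeOncePerStage {
  def serialize(m: EmptyMetrics): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(m)
    oos.close()
    bos.toByteArray
  }

  def deserialize(bytes: Array[Byte]): EmptyMetrics = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    ois.readObject().asInstanceOf[EmptyMetrics]
  }

  def main(args: Array[String]): Unit = {
    val metrics = EmptyMetrics(Map("bytesRead" -> 0L))
    val sharedBytes = serialize(metrics)          // done once per stage
    // Every task reuses sharedBytes; only deserialization happens per task.
    val perTask = (1 to 1000).map(_ => deserialize(sharedBytes))
    println(s"deserialized ${perTask.size} copies from one serialized blob")
  }
}
{code}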
[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762548#comment-15762548 ] Shivaram Venkataraman commented on SPARK-18924: --- This is a good thing to investigate - Just to provide some historical context, the functions in serialize.R and deserialize.R were primarily designed to enable functions between the JVM and R and the serialization performance is less critical there as its mostly just function names, arguments etc. For data path we were originally using R's own serializer, deserializer but that doesn't work if we want to parse the data in JVM. So the whole dfToCols was a retro-fit to make things work. In terms of design options: - I think removing the boxing / unboxing overheads and making `readIntArray` or `readStringArray` in R more efficient would be a good starting point - In terms of using other packages - there are licensing questions and also usability questions. So far users mostly don't require any extra R package to use SparkR and hence we are compatible across a bunch of R versions etc. So I think we should first look at the points about how we can make our existing architecture better - If the bottleneck is due to R function call overheads after the above changes we can explore writing a C module (similar to our old hashCode implementation). While this has lesser complications in terms of licensing, versions matches etc. - there is still some complexity on how we build and distribute this in a binary package. > Improve collect/createDataFrame performance in SparkR > - > > Key: SPARK-18924 > URL: https://issues.apache.org/jira/browse/SPARK-18924 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Xiangrui Meng >Priority: Critical > > SparkR has its own SerDe for data serialization between JVM and R. > The SerDe on the JVM side is implemented in: > * > [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala] > * > [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala] > The SerDe on the R side is implemented in: > * > [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R] > * > [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R] > The serialization between JVM and R suffers from huge storage and computation > overhead. For example, a short round trip of 1 million doubles surprisingly > took 3 minutes on my laptop: > {code} > > system.time(collect(createDataFrame(data.frame(x=runif(100) >user system elapsed > 14.224 0.582 189.135 > {code} > Collecting a medium-sized DataFrame to local and continuing with a local R > workflow is a use case we should pay attention to. SparkR will never be able > to cover all existing features from CRAN packages. It is also unnecessary for > Spark to do so because not all features need scalability. > Several factors contribute to the serialization overhead: > 1. The SerDe in R side is implemented using high-level R methods. > 2. DataFrame columns are not efficiently serialized, primitive type columns > in particular. > 3. Some overhead in the serialization protocol/impl. > 1) might be discussed before because R packages like rJava exist before > SparkR. I'm not sure whether we have a license issue in depending on those > libraries. Another option is to switch to low-level R'C interface or Rcpp, > which again might have license issue. 
I'm not an expert here. If we have to > implement our own, there still exist much space for improvement, discussed > below. > 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, > which collects rows to local and then constructs columns. However, > * it ignores column types and results boxing/unboxing overhead > * it collects all objects to driver and results high GC pressure > A relatively simple change is to implement specialized column builder based > on column types, primitive types in particular. We need to handle null/NA > values properly. A simple data structure we can use is > {code} > val size: Int > val nullIndexes: Array[Int] > val notNullValues: Array[T] // specialized for primitive types > {code} > On the R side, we can use `readBin` and `writeBin` to read the entire vector > in a single method call. The speed seems reasonable (at the order of GB/s): > {code} > > x <- runif(1000) # 1e7, not 1e6 > > system.time(r <- writeBin(x, raw(0))) >user system elapsed > 0.036 0.021 0.059 > > > system.time(y <- readBin(r, double(), 1000)) >user system elapsed > 0.015 0.007 0.024 > {code} > This is just a proposal that needs to be discussed and
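The specialized column builder proposed in the description could look roughly like the sketch below on the JVM side. Class and method names are hypothetical; in a real change this would live next to the SerDe/SQLUtils code and write the packed arrays to the socket so the R side can read them with {{readBin}}.
{code:scala}
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of the proposed (size, nullIndexes, notNullValues)
// structure, specialized for a primitive double column.
class DoubleColumnBuilder(initialCapacity: Int = 64) {
  private val nullIndexes = ArrayBuffer.empty[Int]
  private val notNullValues = new ArrayBuffer[Double](initialCapacity)
  private var size = 0

  def append(value: java.lang.Double): Unit = {
    if (value == null) nullIndexes += size else notNullValues += value.doubleValue()
    size += 1
  }

  /** (total row count, positions of NA values, packed non-null doubles) */
  def result(): (Int, Array[Int], Array[Double]) =
    (size, nullIndexes.toArray, notNullValues.toArray)
}

object DoubleColumnBuilderExample {
  def main(args: Array[String]): Unit = {
    val builder = new DoubleColumnBuilder()
    Seq[java.lang.Double](1.5, null, 3.0).foreach(builder.append)
    val (n, nulls, values) = builder.result()
    println(s"n=$n nulls=${nulls.mkString(",")} values=${values.mkString(",")}")
  }
}
{code}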
[jira] [Commented] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found
[ https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762532#comment-15762532 ] Ilya Matiach commented on SPARK-16473: -- I'm interested in looking into this issue. Would it be possible to get a dataset (either the original one or some mock dataset) which can be used to reproduce this error? > BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key > not found > -- > > Key: SPARK-16473 > URL: https://issues.apache.org/jira/browse/SPARK-16473 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.1, 2.0.0 > Environment: AWS EC2 linux instance. >Reporter: Alok Bhandari > > Hello , > I am using apache spark 1.6.1. > I am executing bisecting k means algorithm on a specific dataset . > Dataset details :- > K=100, > input vector =100K*100k > Memory assigned 16GB per node , > number of nodes =2. > Till K=75 it os working fine , but when I set k=100 , it fails with > java.util.NoSuchElementException: key not found. > *I suspect it is failing because of lack of some resources , but somehow > exception does not convey anything as why this spark job failed.* > Please can someone point me to root cause of this exception , why it is > failing. > This is the exception stack-trace:- > {code} > java.util.NoSuchElementException: key not found: 166 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:58) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:58) > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338) > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337) > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337) > at > scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231) > > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125) > > at scala.collection.immutable.List.reduceLeft(List.scala:84) > at > scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231) > at scala.collection.AbstractTraversable.minBy(Traversable.scala:105) > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337) > > at > org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334) > > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) > {code} > Issue is that , it is failing but not giving any explicit message as to why > it failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
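In the absence of the original data, a mock harness along these lines is one way to try to reproduce the failure with random sparse vectors and k=100. It is purely hypothetical: the dimensions are scaled down from the reported 100K x 100K input, and there is no guarantee this triggers the same code path.
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

import scala.util.Random

// Hypothetical repro harness: random sparse vectors fed to BisectingKMeans.
object BisectingKMeansRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bkm-repro").setMaster("local[*]"))
    val dim = 10000
    val rng = new Random(42)

    val data = sc.parallelize(0 until 50000, numSlices = 8).map { _ =>
      // roughly 20 non-zeros per row at random (sorted, distinct) positions
      val indices = Array.fill(20)(rng.nextInt(dim)).distinct.sorted
      val values = indices.map(_ => rng.nextDouble())
      Vectors.sparse(dim, indices, values)
    }.cache()

    val model = new BisectingKMeans().setK(100).setSeed(1L).run(data)
    println(s"trained ${model.k} clusters, cost = ${model.computeCost(data)}")
    sc.stop()
  }
}
{code}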
[jira] [Assigned] (SPARK-18927) MemorySink for StructuredStreaming can't recover from checkpoint if location is provided in conf
[ https://issues.apache.org/jira/browse/SPARK-18927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18927: Assignee: (was: Apache Spark) > MemorySink for StructuredStreaming can't recover from checkpoint if location > is provided in conf > > > Key: SPARK-18927 > URL: https://issues.apache.org/jira/browse/SPARK-18927 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Burak Yavuz > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18927) MemorySink for StructuredStreaming can't recover from checkpoint if location is provided in conf
[ https://issues.apache.org/jira/browse/SPARK-18927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762497#comment-15762497 ] Apache Spark commented on SPARK-18927: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/16342 > MemorySink for StructuredStreaming can't recover from checkpoint if location > is provided in conf > > > Key: SPARK-18927 > URL: https://issues.apache.org/jira/browse/SPARK-18927 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Burak Yavuz > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18927) MemorySink for StructuredStreaming can't recover from checkpoint if location is provided in conf
[ https://issues.apache.org/jira/browse/SPARK-18927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18927: Assignee: Apache Spark > MemorySink for StructuredStreaming can't recover from checkpoint if location > is provided in conf > > > Key: SPARK-18927 > URL: https://issues.apache.org/jira/browse/SPARK-18927 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Burak Yavuz >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18832) Spark SQL: Incorrect error message on calling registered UDF.
[ https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762427#comment-15762427 ] Dongjoon Hyun commented on SPARK-18832: --- Ah, I think I reproduced the situation you're seeing. Let me dig into this. > Spark SQL: Incorrect error message on calling registered UDF. > - > > Key: SPARK-18832 > URL: https://issues.apache.org/jira/browse/SPARK-18832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Lokesh Yadav > > On calling a registered UDF in the metastore from the spark-sql CLI, it gives a > generic error: > Error in query: Undefined function: 'Sample_UDF'. This function is neither a > registered temporary function nor a permanent function registered in the > database 'default'. > The function is registered and it shows up in the list output by 'show > functions'. > I am using a Hive UDTF, registering it using the statement: create function > Sample_UDF as 'com.udf.Sample_UDF' using JAR > '/local/jar/path/containing/the/class'; > and I am calling the function from the spark-sql CLI as: SELECT > Sample_UDF("input_1", "input_2" ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
[ https://issues.apache.org/jira/browse/SPARK-18928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762384#comment-15762384 ] Apache Spark commented on SPARK-18928: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/16340 > FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation > --- > > Key: SPARK-18928 > URL: https://issues.apache.org/jira/browse/SPARK-18928 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > Spark tasks respond to cancellation by checking > {{TaskContext.isInterrupted()}}, but this check is missing on a few critical > paths used in Spark SQL, including FileScanRDD, JDBCRDD, and > UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to > continue running and become zombies. > Here's an example: first, create a giant text file. In my case, I just > concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig > file. Then, run a really slow query over that file and try to cancel it: > {code} > spark.read.text("/tmp/words").selectExpr("value + value + value").collect() > {code} > This will sit and churn at 100% CPU for a minute or two because the task > isn't checking the interrupted flag. > The solution here is to add InterruptedIterator-style checks to a few > locations where they're currently missing in Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
[ https://issues.apache.org/jira/browse/SPARK-18928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18928: Assignee: Josh Rosen (was: Apache Spark) > FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation > --- > > Key: SPARK-18928 > URL: https://issues.apache.org/jira/browse/SPARK-18928 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > Spark tasks respond to cancellation by checking > {{TaskContext.isInterrupted()}}, but this check is missing on a few critical > paths used in Spark SQL, including FileScanRDD, JDBCRDD, and > UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to > continue running and become zombies. > Here's an example: first, create a giant text file. In my case, I just > concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig > file. Then, run a really slow query over that file and try to cancel it: > {code} > spark.read.text("/tmp/words").selectExpr("value + value + value").collect() > {code} > This will sit and churn at 100% CPU for a minute or two because the task > isn't checking the interrupted flag. > The solution here is to add InterruptedIterator-style checks to a few > locations where they're currently missing in Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
[ https://issues.apache.org/jira/browse/SPARK-18928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18928: Assignee: Apache Spark (was: Josh Rosen) > FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation > --- > > Key: SPARK-18928 > URL: https://issues.apache.org/jira/browse/SPARK-18928 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Reporter: Josh Rosen >Assignee: Apache Spark > > Spark tasks respond to cancellation by checking > {{TaskContext.isInterrupted()}}, but this check is missing on a few critical > paths used in Spark SQL, including FileScanRDD, JDBCRDD, and > UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to > continue running and become zombies. > Here's an example: first, create a giant text file. In my case, I just > concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig > file. Then, run a really slow query over that file and try to cancel it: > {code} > spark.read.text("/tmp/words").selectExpr("value + value + value").collect() > {code} > This will sit and churn at 100% CPU for a minute or two because the task > isn't checking the interrupted flag. > The solution here is to add InterruptedIterator-style checks to a few > locations where they're currently missing in Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
Josh Rosen created SPARK-18928: -- Summary: FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation Key: SPARK-18928 URL: https://issues.apache.org/jira/browse/SPARK-18928 Project: Spark Issue Type: Bug Components: Spark Core, SQL Reporter: Josh Rosen Assignee: Josh Rosen Spark tasks respond to cancellation by checking {{TaskContext.isInterrupted()}}, but this check is missing on a few critical paths used in Spark SQL, including FileScanRDD, JDBCRDD, and UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to continue running and become zombies. Here's an example: first, create a giant text file. In my case, I just concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig file. Then, run a really slow query over that file and try to cancel it: {code} spark.read.text("/tmp/words").selectExpr("value + value + value").collect() {code} This will sit and churn at 100% CPU for a minute or two because the task isn't checking the interrupted flag. The solution here is to add InterruptedIterator-style checks to a few locations where they're currently missing in Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
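A sketch of the {{TaskContext.isInterrupted()}} check the description calls for, written as a standalone wrapper iterator. This is an illustration of the pattern only, not the actual fix, which touches FileScanRDD, JDBCRDD, and the UnsafeSorter paths directly.
{code:scala}
import org.apache.spark.{TaskContext, TaskKilledException}

// InterruptibleIterator-style wrapper: consult TaskContext.isInterrupted()
// as rows are pulled, so a cancelled task stops instead of becoming a zombie.
class CancellationCheckingIterator[T](context: TaskContext, underlying: Iterator[T])
  extends Iterator[T] {

  private def checkInterrupted(): Unit = {
    if (context != null && context.isInterrupted()) {
      throw new TaskKilledException
    }
  }

  override def hasNext: Boolean = {
    checkInterrupted()
    underlying.hasNext
  }

  override def next(): T = {
    checkInterrupted()
    underlying.next()
  }
}

// Hypothetical usage inside an RDD's compute() body:
//   new CancellationCheckingIterator(TaskContext.get(), rawRowIterator)
{code}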
[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit
[ https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762360#comment-15762360 ] Ed Tyrrill commented on SPARK-15544: I am experiencing the same problem with Spark 1.6.2 and ZK 3.4.8 on RHEL 7. Any plans on fixing this? > Bouncing Zookeeper node causes Active spark master to exit > -- > > Key: SPARK-15544 > URL: https://issues.apache.org/jira/browse/SPARK-15544 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04. Zookeeper 3.4.6 with 3-node quorum >Reporter: Steven Lowenthal > > Shutting Down a single zookeeper node caused spark master to exit. The > master should have connected to a second zookeeper node. > {code:title=log output} > 16/05/25 18:21:28 INFO master.Master: Launching executor > app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138 > 16/05/25 18:21:28 INFO master.Master: Launching executor > app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129 > 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data > from server sessionid 0x154dfc0426b0054, likely server has closed socket, > closing socket connection and attempting reconnect > 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data > from server sessionid 0x254c701f28d0053, likely server has closed socket, > closing socket connection and attempting reconnect > 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED > 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED > 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost > leadership > 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master > shutting down. }} > {code} > spark-env.sh: > {code:title=spark-env.sh} > export SPARK_LOCAL_DIRS=/ephemeral/spark/local > export SPARK_WORKER_DIR=/ephemeral/spark/work > export SPARK_LOG_DIR=/var/log/spark > export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop > export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER > -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181" > export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout
[ https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762296#comment-15762296 ] Imran Rashid commented on SPARK-18886: -- Thanks [~mridul], that helps -- in particular I was only thinking about bulk scheduling, I had forgotten to take that into account. After a closer look through the code, I think my earlier proposal makes sense -- rather than resetting the timeout as each task is scheduled, change it to start the timer as soon as there is an offer which goes unused due to the delay. Once that timer is started, it is never reset (for that TSM). I can think of one scenario where this would result in worse scheduling than what we currently have. Suppose that initially, a TSM is offered one resource which only matches on rack_local. But immediately after that, many process_local offers are made, which are all used up. Some time later, more offers that are only rack_local come in. They'll immediately get used, even though there may be plenty more offers that are process_local that are just about to come in (perhaps enough for all of the remaining tasks). That wouldn't be great, but its also not nearly as bad as letting most of your cluster sit idle. Other alternatives I can think of: a) Turn off delay scheduling by default, and change [{{TaskSchedulerImpl.resourceOffer}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L357-L360] to go through all task sets, then advance locality levels, rather than the other way around. Perhaps we should invert those loops anyway, just for when users turn off delay scheduling. b) Have TSM use some knowledge about all available executors to decide whether or not it is even possible for enough resources at the right locality level to appear. Eg., in the original case, the TSM would realize there is only one executor which is process_local, so it doesn't make sense to wait to schedule all tasks on that executor. However, I'm pretty skeptical about doing anything like this, as it may be a somewhat complicated thing inside the scheduler, and it could just turn into a mess of heuristics which has lots of corner cases. I think implementing my proposed solution should be relatively easy, so I'll take a stab at it, but I'd still appreciate more input on the right approach here. Perhaps seeing an implementation will make it easier to discuss. > Delay scheduling should not delay some executors indefinitely if one task is > scheduled before delay timeout > --- > > Key: SPARK-18886 > URL: https://issues.apache.org/jira/browse/SPARK-18886 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid > > Delay scheduling can introduce an unbounded delay and underutilization of > cluster resources under the following circumstances: > 1. Tasks have locality preferences for a subset of available resources > 2. Tasks finish in less time than the delay scheduling. > Instead of having *one* delay to wait for resources with better locality, > spark waits indefinitely. > As an example, consider a cluster with 100 executors, and a taskset with 500 > tasks. Say all tasks have a preference for one executor, which is by itself > on one host. Given the default locality wait of 3s per level, we end up with > a 6s delay till we schedule on other hosts (process wait + host wait). 
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks > get scheduled on _only one_ executor. This means you're only using a 1% of > your cluster, and you get a ~100x slowdown. You'd actually be better off if > tasks took 7 seconds. > *WORKAROUNDS*: > (1) You can change the locality wait times so that it is shorter than the > task execution time. You need to take into account the sum of all wait times > to use all the resources on your cluster. For example, if you have resources > on different racks, this will include the sum of > "spark.locality.wait.process" + "spark.locality.wait.node" + > "spark.locality.wait.rack". Those each default to "3s". The simplest way to > be to set "spark.locality.wait.process" to your desired wait interval, and > set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0". > For example, if your tasks take ~3 seconds on average, you might set > "spark.locality.wait.process" to "1s". > Note that this workaround isn't perfect --with less delay scheduling, you may > not get as good resource locality. After this issue is fixed, you'd most > likely want to undo these configuration changes. > (2) The worst case here will only happen if your tasks have extreme skew in > their
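Workaround (1) from the description, expressed as a {{SparkConf}} sketch. The "1s" figure is the example value from the text, not a general recommendation; the right value depends on typical task duration for the workload.
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Keep a short process-local wait and zero out the node/rack waits,
// as described in the workaround above.
object LocalityWaitWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("locality-wait-workaround")
      .set("spark.locality.wait.process", "1s")
      .set("spark.locality.wait.node", "0")
      .set("spark.locality.wait.rack", "0")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    // ... run the job whose tasks finish faster than the default 3s waits ...
    spark.stop()
  }
}
{code}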
[jira] [Commented] (SPARK-18926) run-example SparkPi terminates with error message
[ https://issues.apache.org/jira/browse/SPARK-18926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762210#comment-15762210 ] Dongjoon Hyun commented on SPARK-18926: --- Hi, [~alex.decastro]. It looks like some timing issue. Does this happen always in your environment? > run-example SparkPi terminates with error message > -- > > Key: SPARK-18926 > URL: https://issues.apache.org/jira/browse/SPARK-18926 > Project: Spark > Issue Type: Bug > Components: Examples, Spark Core >Affects Versions: 2.0.2 >Reporter: Alex DeCastro > > Spark examples and shell terminates with error message. > E.g.: > $ run-example SparkPi > 16/12/19 15:55:03 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/12/19 15:55:04 WARN Utils: Your hostname, adecosta-mbp-osx resolves to a > loopback address: 127.0.0.1; using 172.16.170.144 instead (on interface en3) > 16/12/19 15:55:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 16/12/19 15:55:05 WARN SparkContext: Use an existing SparkContext, some > configuration may not take effect. > Pi is roughly 3.1385756928784643 > > 16/12/19 15:55:11 ERROR TransportRequestHandler: Error while invoking > RpcHandler#receive() for one-way message. > org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:152) > at > org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134) > at > org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:571) > at > org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) 
> at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18624) Implicit cast between ArrayTypes
[ https://issues.apache.org/jira/browse/SPARK-18624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18624. --- Resolution: Fixed Assignee: Jiang Xingbo Fix Version/s: 2.2.0 > Implicit cast between ArrayTypes > --- > > Key: SPARK-18624 > URL: https://issues.apache.org/jira/browse/SPARK-18624 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Minor > Fix For: 2.2.0 > > > Currently `ImplicitTypeCasts` doesn't handle casts between ArrayTypes. This > is not convenient; we should add a rule to enable casting from > ArrayType(InternalType) to ArrayType(newInternalType). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18877) Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20
[ https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762186#comment-15762186 ] Dongjoon Hyun commented on SPARK-18877: --- +1 > Unable to read given csv data. Exception: java.lang.IllegalArgumentException: > requirement failed: Decimal precision 28 exceeds max precision 20 > -- > > Key: SPARK-18877 > URL: https://issues.apache.org/jira/browse/SPARK-18877 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Navya Krishnappa > > When reading the below-mentioned CSV data, even though the maximum decimal > precision is 38, the following exception is thrown > java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 > exceeds max precision 20 > Decimal > 2323366225312000 > 2433573971400 > 23233662253000 > 23233662253 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
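One hedged way to probe this (an assumption, not a confirmed fix): bypass schema inference and supply an explicit {{DecimalType(38, 0)}} schema for the column. The column name mirrors the sample data header above; the file path is a placeholder.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

// Sketch: read the CSV with an explicit wide decimal schema instead of
// relying on inference. Whether this sidesteps the reported error is an
// assumption for the reader to verify.
object ReadDecimalCsv {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("decimal-csv").master("local[*]").getOrCreate()

    val schema = StructType(Seq(StructField("Decimal", DecimalType(38, 0), nullable = true)))

    val df = spark.read
      .option("header", "true")
      .schema(schema)                // skip inference entirely
      .csv("/path/to/decimal.csv")   // placeholder path

    df.printSchema()
    df.show(truncate = false)
    spark.stop()
  }
}
{code}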
[jira] [Commented] (SPARK-18832) Spark SQL: Incorrect error message on calling registered UDF.
[ https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762182#comment-15762182 ] Dongjoon Hyun commented on SPARK-18832: --- Thank you for confirming. Actually, Spark already has the UDTF test cases [here|https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala#L102-L140]. So far, I don't understand what is different in your case. Could you give us a reproducible example? > Spark SQL: Incorrect error message on calling registered UDF. > - > > Key: SPARK-18832 > URL: https://issues.apache.org/jira/browse/SPARK-18832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Lokesh Yadav > > On calling a registered UDF in the metastore from the spark-sql CLI, it gives a > generic error: > Error in query: Undefined function: 'Sample_UDF'. This function is neither a > registered temporary function nor a permanent function registered in the > database 'default'. > The function is registered and it shows up in the list output by 'show > functions'. > I am using a Hive UDTF, registering it using the statement: create function > Sample_UDF as 'com.udf.Sample_UDF' using JAR > '/local/jar/path/containing/the/class'; > and I am calling the function from the spark-sql CLI as: SELECT > Sample_UDF("input_1", "input_2" ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18908) It's hard for the user to see the failure if StreamExecution fails to create the logical plan
[ https://issues.apache.org/jira/browse/SPARK-18908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18908: - Priority: Critical (was: Blocker) > It's hard for the user to see the failure if StreamExecution fails to create > the logical plan > - > > Key: SPARK-18908 > URL: https://issues.apache.org/jira/browse/SPARK-18908 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Critical > > If the logical plan fails to be created, e.g., because some Source options are invalid, > user code cannot detect the failure. The only place receiving > this error is the Thread's UncaughtExceptionHandler. > This bug exists because logicalPlan is lazy, and when we try to create a > StreamingQueryException to wrap the exception thrown by creating logicalPlan, > it calls logicalPlan again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
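The lazy-val pitfall described above can be illustrated with a small standalone sketch (generic Scala, not the actual StreamExecution code): the failing lazy val is touched again while building the wrapping exception, so the wrapper is never constructed and the raw error surfaces in an unexpected place.
{code:scala}
// Generic illustration of re-evaluating a throwing lazy val while wrapping
// its own failure. Names are hypothetical.
object LazyPlanPitfall {

  class Query(badOption: Boolean) {
    lazy val logicalPlan: String = {
      if (badOption) throw new IllegalArgumentException("invalid source option")
      "Project -> Scan"
    }

    def start(): Unit = {
      try {
        println(s"starting with plan: $logicalPlan")
      } catch {
        case cause: Throwable =>
          // Bug-like pattern: interpolating logicalPlan here evaluates the
          // lazy val again and throws a fresh exception before the wrapper
          // carrying `cause` is ever constructed.
          throw new RuntimeException(s"query failed, plan = $logicalPlan", cause)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    try new Query(badOption = true).start()
    catch {
      // Prints the raw IllegalArgumentException, not the intended wrapper.
      case e: Throwable => println(s"caught: $e")
    }
  }
}
{code}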
[jira] [Resolved] (SPARK-18921) check database existence with Hive.databaseExists instead of getDatabase
[ https://issues.apache.org/jira/browse/SPARK-18921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18921. -- Resolution: Fixed Fix Version/s: 2.1.1 Issue resolved by pull request 16332 [https://github.com/apache/spark/pull/16332] > check database existence with Hive.databaseExists instead of getDatabase > > > Key: SPARK-18921 > URL: https://issues.apache.org/jira/browse/SPARK-18921 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 2.1.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18700) getCached in HiveMetastoreCatalog is not thread-safe and causes driver OOM
[ https://issues.apache.org/jira/browse/SPARK-18700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18700. --- Resolution: Fixed Assignee: Li Yuanjian Fix Version/s: 2.2.0 2.1.1 > getCached in HiveMetastoreCatalog is not thread-safe and causes driver OOM > -- > > Key: SPARK-18700 > URL: https://issues.apache.org/jira/browse/SPARK-18700 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0, 2.1.1 >Reporter: Li Yuanjian >Assignee: Li Yuanjian > Fix For: 2.1.1, 2.2.0 > > > In our Spark SQL platform, each query uses the same HiveContext and an > independent thread, and new data is appended to tables as new partitions every > 30min. After a new partition is added to table T, we should call refreshTable to > clear T’s cache in cachedDataSourceTables to make the new partition > searchable. > For a table with many partitions and files (much more than > spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table > T will start a job to fetch all FileStatus objects in the listLeafFiles function. Because > of the huge number of files, the job runs for several seconds; during that > time, new queries of table T will also start new jobs to fetch FileStatus > because the getCached function is not thread-safe. This finally causes a driver > OOM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
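The race described here boils down to a non-atomic check-then-load on a shared cache. Below is a generic sketch of the thread-safe pattern (not the actual HiveMetastoreCatalog patch), where concurrent lookups of the same table trigger only one expensive load; the class and value types are placeholders.
{code:scala}
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

// Generic sketch: atomic cache lookup so N concurrent queries for the same
// table cause one expensive load instead of N.
object ThreadSafeTableCache {

  private val cache = new ConcurrentHashMap[String, String]()

  // Placeholder for the expensive work (e.g. listing all partition files).
  private def expensiveLoad(tableName: String): String = {
    Thread.sleep(100)
    s"LogicalRelation($tableName)"
  }

  def getCached(tableName: String): String =
    cache.computeIfAbsent(tableName, new JFunction[String, String] {
      // computeIfAbsent runs the loader at most once per key; concurrent
      // callers for the same key wait for that single load to finish.
      override def apply(key: String): String = expensiveLoad(key)
    })

  def main(args: Array[String]): Unit = {
    val threads = (1 to 8).map { i =>
      new Thread(new Runnable {
        override def run(): Unit = println(s"thread $i -> ${getCached("T")}")
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}
{code}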
[jira] [Commented] (SPARK-16654) UI Should show blacklisted executors & nodes
[ https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762067#comment-15762067 ] Jose Soltren commented on SPARK-16654: -- SPARK-8425 is resolved so I'll be working to get this checked in fairly soon now. > UI Should show blacklisted executors & nodes > > > Key: SPARK-16654 > URL: https://issues.apache.org/jira/browse/SPARK-16654 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Assignee: Jose Soltren > > SPARK-8425 will add the ability to blacklist entire executors and nodes to > deal w/ faulty hardware. However, without displaying it on the UI, it can be > hard to realize which executor is bad, and why tasks aren't getting scheduled > on certain executors. > As a first step, we should just show nodes and executors that are blacklisted > for the entire application (no need to show blacklisting for tasks & stages). > This should also ensure that blacklisting events get into the event logs for > the history server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18924: -- Description: SparkR has its own SerDe for data serialization between JVM and R. The SerDe on the JVM side is implemented in: * [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala] * [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala] The SerDe on the R side is implemented in: * [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R] * [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R] The serialization between JVM and R suffers from huge storage and computation overhead. For example, a short round trip of 1 million doubles surprisingly took 3 minutes on my laptop: {code} > system.time(collect(createDataFrame(data.frame(x=runif(1000000))))) user system elapsed 14.224 0.582 189.135 {code} Collecting a medium-sized DataFrame to local and continuing with a local R workflow is a use case we should pay attention to. SparkR will never be able to cover all existing features from CRAN packages. It is also unnecessary for Spark to do so because not all features need scalability. Several factors contribute to the serialization overhead: 1. The SerDe on the R side is implemented using high-level R methods. 2. DataFrame columns are not efficiently serialized, primitive type columns in particular. 3. Some overhead in the serialization protocol/implementation. 1) might have been discussed before because R packages like rJava existed before SparkR. I'm not sure whether we have a license issue in depending on those libraries. Another option is to switch to the low-level R C interface or Rcpp, which again might have license issues. I'm not an expert here. If we have to implement our own, there is still much room for improvement, discussed below. 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, which collects rows to the driver and then constructs columns. However, * it ignores column types, which results in boxing/unboxing overhead * it collects all objects to the driver, which results in high GC pressure A relatively simple change is to implement specialized column builders based on column types, primitive types in particular. We need to handle null/NA values properly. A simple data structure we can use is {code} val size: Int val nullIndexes: Array[Int] val notNullValues: Array[T] // specialized for primitive types {code} On the R side, we can use `readBin` and `writeBin` to read the entire vector in a single method call. The speed seems reasonable (on the order of GB/s): {code} > x <- runif(10000000) # 1e7, not 1e6 > system.time(r <- writeBin(x, raw(0))) user system elapsed 0.036 0.021 0.059 > > system.time(y <- readBin(r, double(), 10000000)) user system elapsed 0.015 0.007 0.024 {code} This is just a proposal that needs to be discussed and formalized, but in general it should be feasible to obtain a 20x or more performance gain.
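The column layout proposed above (size, null indexes, packed non-null primitives) is easy to sketch on the JVM side. The Scala snippet below only illustrates that layout under the names used in the proposal; it is not the SerDe the ticket will actually implement, and R would read the value block back with a single readBin call.

{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Illustration of the proposed column layout only; not SparkR's actual SerDe.
object DoubleColumnSketch {

  // A nullable double column: total row count, indexes of the NA rows, and
  // the non-null values packed as raw primitives.
  final case class DoubleColumn(size: Int, nullIndexes: Array[Int], notNullValues: Array[Double])

  def write(col: DoubleColumn): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(col.size)
    out.writeInt(col.nullIndexes.length)
    col.nullIndexes.foreach(out.writeInt)
    // The whole primitive block is written without boxing a single
    // java.lang.Double, which is the main saving over the row-by-row
    // dfToCols path described above.
    col.notNullValues.foreach(out.writeDouble)
    out.flush()
    bytes.toByteArray
  }

  def main(args: Array[String]): Unit = {
    // Example: a 5-row column where rows 1 and 3 are NA.
    val encoded = write(DoubleColumn(5, Array(1, 3), Array(0.1, 0.2, 0.3)))
    println(s"encoded ${encoded.length} bytes")
  }
}
{code}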
[jira] [Created] (SPARK-18927) MemorySink for StructuredStreaming can't recover from checkpoint if location is provided in conf
Burak Yavuz created SPARK-18927: --- Summary: MemorySink for StructuredStreaming can't recover from checkpoint if location is provided in conf Key: SPARK-18927 URL: https://issues.apache.org/jira/browse/SPARK-18927 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18716) Restrict the disk usage of spark event log.
[ https://issues.apache.org/jira/browse/SPARK-18716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15761941#comment-15761941 ] Marcelo Vanzin commented on SPARK-18716: For posterity, another problem with this feature that I didn't mention in the PR is that it allows users to use the SHS to delete content created by other users. A malicious user can just write a big event log file into the SHS directory, and that would eventually make the SHS delete log files belonging to other users. > Restrict the disk usage of spark event log. > > > Key: SPARK-18716 > URL: https://issues.apache.org/jira/browse/SPARK-18716 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: Genmao Yu > > We've had reports of overfull disk usage from Spark event log files. The current > implementation has the following drawbacks: > 1. If the Spark HistoryServer was never started, or has just failed, there is no > chance to do the cleanup work. > 2. The Spark HistoryServer cleans event log files based on file age only. If > applications arrive constantly, the disk usage within every > {{spark.history.fs.cleaner.maxAge}} period can still be very large. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
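For context on the improvement being discussed (point 2 in particular), a size-based retention pass is the usual complement to age-based cleaning. The Scala sketch below is generic: maxTotalBytes is a made-up cap, not an existing Spark History Server setting.

{code}
// Generic sketch of size-based retention; "maxTotalBytes" is a made-up cap,
// not an existing Spark History Server configuration.
object EventLogRetentionSketch {

  final case class LogFile(path: String, sizeBytes: Long, modifiedMs: Long)

  // Keep the newest logs whose cumulative size fits under the cap; everything
  // past the cut-off is returned for deletion.
  def selectForDeletion(logs: Seq[LogFile], maxTotalBytes: Long): Seq[LogFile] = {
    val newestFirst = logs.sortBy(_.modifiedMs)(Ordering[Long].reverse)
    val runningTotals = newestFirst.scanLeft(0L)((total, log) => total + log.sizeBytes).tail
    val kept = newestFirst.zip(runningTotals).takeWhile(_._2 <= maxTotalBytes).map(_._1)
    newestFirst.drop(kept.length)
  }
}
{code}

Such a pass also makes the concern above concrete: a single very large new log can push other users' logs past the cap and get them deleted.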
[jira] [Assigned] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores
[ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18917: Assignee: (was: Apache Spark) > Dataframe - Time Out Issues / Taking long time in append mode on object stores > -- > > Key: SPARK-18917 > URL: https://issues.apache.org/jira/browse/SPARK-18917 > Project: Spark > Issue Type: Improvement > Components: EC2, SQL, YARN >Affects Versions: 2.0.2 >Reporter: Anbu Cheeralan >Priority: Minor > Original Estimate: 72h > Remaining Estimate: 72h > > When using DataFrame write in append mode on object stores (S3 / Google > Storage), the writes take a long time or hit read timeouts. > This is because dataframe.write lists all leaf folders in the target > directory. If there are a lot of subfolders due to partitions, this takes > forever. > In org.apache.spark.sql.execution.datasources.DataSource.write(), the > following code causes a huge number of RPC calls when the file system is an > object store (S3, GS): > if (mode == SaveMode.Append) { > val existingPartitionColumns = Try { > resolveRelation() > .asInstanceOf[HadoopFsRelation] > .location > .partitionSpec() > .partitionColumns > .fieldNames > .toSeq > }.getOrElse(Seq.empty[String]) > There should be a flag to skip the partition match check in append mode. I can > work on the patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores
[ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18917: Assignee: Apache Spark > Dataframe - Time Out Issues / Taking long time in append mode on object stores > -- > > Key: SPARK-18917 > URL: https://issues.apache.org/jira/browse/SPARK-18917 > Project: Spark > Issue Type: Improvement > Components: EC2, SQL, YARN >Affects Versions: 2.0.2 >Reporter: Anbu Cheeralan >Assignee: Apache Spark >Priority: Minor > Original Estimate: 72h > Remaining Estimate: 72h > > When using DataFrame write in append mode on object stores (S3 / Google > Storage), the writes take a long time or hit read timeouts. > This is because dataframe.write lists all leaf folders in the target > directory. If there are a lot of subfolders due to partitions, this takes > forever. > In org.apache.spark.sql.execution.datasources.DataSource.write(), the > following code causes a huge number of RPC calls when the file system is an > object store (S3, GS): > if (mode == SaveMode.Append) { > val existingPartitionColumns = Try { > resolveRelation() > .asInstanceOf[HadoopFsRelation] > .location > .partitionSpec() > .partitionColumns > .fieldNames > .toSeq > }.getOrElse(Seq.empty[String]) > There should be a flag to skip the partition match check in append mode. I can > work on the patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores
[ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15761926#comment-15761926 ] Apache Spark commented on SPARK-18917: -- User 'alunarbeach' has created a pull request for this issue: https://github.com/apache/spark/pull/16339 > Dataframe - Time Out Issues / Taking long time in append mode on object stores > -- > > Key: SPARK-18917 > URL: https://issues.apache.org/jira/browse/SPARK-18917 > Project: Spark > Issue Type: Improvement > Components: EC2, SQL, YARN >Affects Versions: 2.0.2 >Reporter: Anbu Cheeralan >Priority: Minor > Original Estimate: 72h > Remaining Estimate: 72h > > When using DataFrame write in append mode on object stores (S3 / Google > Storage), the writes take a long time or hit read timeouts. > This is because dataframe.write lists all leaf folders in the target > directory. If there are a lot of subfolders due to partitions, this takes > forever. > In org.apache.spark.sql.execution.datasources.DataSource.write(), the > following code causes a huge number of RPC calls when the file system is an > object store (S3, GS): > if (mode == SaveMode.Append) { > val existingPartitionColumns = Try { > resolveRelation() > .asInstanceOf[HadoopFsRelation] > .location > .partitionSpec() > .partitionColumns > .fieldNames > .toSeq > }.getOrElse(Seq.empty[String]) > There should be a flag to skip the partition match check in append mode. I can > work on the patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
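The flag the reporter proposes would simply gate the expensive resolveRelation-based listing. A Scala sketch of that guard follows, with a made-up option name ("skipPartitionCheck") and simplified types standing in for the real DataSource internals.

{code}
import scala.util.Try

// Sketch of the proposed flag only; "skipPartitionCheck" and the types below
// are made up for illustration and are not part of Spark's API.
object AppendPartitionCheckSketch {

  final case class WriteContext(isAppend: Boolean, options: Map[String, String])

  def existingPartitionColumns(
      ctx: WriteContext,
      resolvePartitionColumns: () => Seq[String]): Seq[String] = {
    val skipCheck = ctx.options.getOrElse("skipPartitionCheck", "false").toBoolean
    if (ctx.isAppend && !skipCheck) {
      // Today this branch triggers a full leaf-file listing on object stores,
      // which is where the RPC storm described above comes from.
      Try(resolvePartitionColumns()).getOrElse(Seq.empty[String])
    } else {
      // With the flag set, append mode trusts the caller's partition columns
      // and skips the listing entirely.
      Seq.empty[String]
    }
  }
}
{code}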
[jira] [Comment Edited] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"
[ https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15761741#comment-15761741 ] Franklyn Dsouza edited comment on SPARK-18589 at 12/19/16 5:37 PM: --- The sequence of steps that causes this are: {code} join two dataframes A and B > make a udf that uses one column from A and another from B > filter on column produced by udf > java.lang.RuntimeException: Invalid PythonUDF (b#1L, c#6L), requires attributes from more than one child. {code} Here are some minimum steps to reproduce this issue in pyspark {code} from pyspark.sql import types from pyspark.sql import functions as F df1 = sqlCtx.createDataFrame([types.Row(a=1, b=2), types.Row(a=1, b=4)]) df2 = sqlCtx.createDataFrame([types.Row(a=1, c=12)]) joined = df1.join(df2, df1['a'] == df2['a']) extra = joined.withColumn('sum', F.udf(lambda a,b : a+b, types.IntegerType())(joined['b'], joined['c'])) filtered = extra.where(extra['sum'] < F.lit(10)).collect() {code} *doing extra.cache() before the filtering will fix the issue* but obviously isn't a solution. was (Author: franklyndsouza): The sequence of steps that causes this are: {code} join two dataframes A and B > make a udf that uses one column from A and another from B > filter on column produced by udf > java.lang.RuntimeException: Invalid PythonUDF (b#1L, c#6L), requires attributes from more than one child. {code} Here are some minimum steps to reproduce this issue in pyspark {code} from pyspark.sql import types from pyspark.sql import functions as F df1 = sqlCtx.createDataFrame([types.Row(a=1, b=2), types.Row(a=1, b=4)]) df2 = sqlCtx.createDataFrame([types.Row(a=1, c=12)]) joined = df1.join(df2, df1['a'] == df2['a']) extra = joined.withColumn('sum', F.udf(lambda a,b : a+b, types.IntegerType())(joined['b'], joined['c'])) filtered = extra.where(extra['sum'] < F.lit(10)).collect() {code} *doing extra.cache() before the filtering will fix the issue.* > persist() resolves "java.lang.RuntimeException: Invalid PythonUDF > (...), requires attributes from more than one child" > -- > > Key: SPARK-18589 > URL: https://issues.apache.org/jira/browse/SPARK-18589 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.1.0 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Priority: Minor > > Smells like another optimizer bug that's similar to SPARK-17100 and > SPARK-18254. I'm seeing this on 2.0.2 and on master at commit > {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}. > I don't have a minimal repro for this yet, but the error I'm seeing is: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling o247.count. > : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires > attributes from more than one child. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sq
[jira] [Comment Edited] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"
[ https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15761741#comment-15761741 ] Franklyn Dsouza edited comment on SPARK-18589 at 12/19/16 5:29 PM: --- The sequence of steps that causes this are: {code} join two dataframes A and B > make a udf that uses one column from A and another from B > filter on column produced by udf > java.lang.RuntimeException: Invalid PythonUDF (b#1L, c#6L), requires attributes from more than one child. {code} Here are some minimum steps to reproduce this issue in pyspark {code} from pyspark.sql import types from pyspark.sql import functions as F df1 = sqlCtx.createDataFrame([types.Row(a=1, b=2), types.Row(a=1, b=4)]) df2 = sqlCtx.createDataFrame([types.Row(a=1, c=12)]) joined = df1.join(df2, df1['a'] == df2['a']) extra = joined.withColumn('sum', F.udf(lambda a,b : a+b, types.IntegerType())(joined['b'], joined['c'])) filtered = extra.where(extra['sum'] < F.lit(10)).collect() {code} *doing extra.cache() before the filtering will fix the issue.* was (Author: franklyndsouza): The sequence of steps that causes this are: {code} join two dataframes A and B > make a udf that uses one column from A and another from B > filter on column produced by udf > java.lang.RuntimeException: Invalid PythonUDF (b#1L, c#6L), requires attributes from more than one child. {code} Here are some minimum steps to reproduce this issue in pyspark {code} from pyspark.sql import types from pyspark.sql import functions as F df1 = sqlCtx.createDataFrame([types.Row(a=1, b=2), types.Row(a=1, b=4)]) df2 = sqlCtx.createDataFrame([types.Row(a=1, c=12)]) joined = df1.join(df2, df1['a'] == df2['a']) extra = joined.withColumn('sum', F.udf(lambda a,b : a+b, types.IntegerType())(joined['b'], joined['c'])) filtered = extra.where(extra['sum'] < F.lit(10)).collect() {code} doing extra.cache() before the filtering will fix the issue. > persist() resolves "java.lang.RuntimeException: Invalid PythonUDF > (...), requires attributes from more than one child" > -- > > Key: SPARK-18589 > URL: https://issues.apache.org/jira/browse/SPARK-18589 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.1.0 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Priority: Minor > > Smells like another optimizer bug that's similar to SPARK-17100 and > SPARK-18254. I'm seeing this on 2.0.2 and on master at commit > {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}. > I don't have a minimal repro for this yet, but the error I'm seeing is: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling o247.count. > : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires > attributes from more than one child. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun